Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded GPUs
Girish Venkataramani, Avinash Nehemiah
May 2017
Copyright © 2017 MathWorks, Inc
Talk Outline
Design Deep Learning & Vision Algorithms
Highlights
• Manage large image sets
• Automate image labeling
• Easy access to models
• Pre-built training frameworks
Accelerate and Scale Training
Highlights
• Acceleration with GPUs
• Scale to clusters
High Performance Embedded Implementation
Highlights
• Automate compilation of MATLAB to CUDA
• 14x faster than pyCaffe, 60% faster than C++ Caffe, 3x faster than TensorFlow
Let’s Use Object Detection as an Example
[Figure: example image with detected objects labeled TRUCK, SUV, and CAR]
In our example we’ll use deep learning for object detection.
Transfer Learning Workflow
Training Data: Images and Labels
Labels: Car, Truck, Large Truck, SUV, Van
Transfer Learning: Load Reference Network -> Modify Network Structure -> Learn New Weights -> New Classifier
Reference networks: Alexnet, VGG-16, VGG-19, GoogLeNet
Manage Large Sets of Images
Handle Large Sets of Images
Easily manage large sets of images
- Single line of code to access images
- Operates on disk, database, big-data file system
imageData = imageDatastore('vehicles')
Organize Images in Folders
(~10,000 images, 5 folders)
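A minimal sketch of the folder-per-label layout described above. The temporary folder, the two label names, and the random placeholder images are assumptions made only so the example is self-contained; the slide's real data set is a 'vehicles' folder with 5 subfolders.

```matlab
% Build a tiny stand-in for the 'vehicles' folder: one subfolder per label.
root = fullfile(tempdir, 'vehicles_demo');   % hypothetical location
labels = {'car', 'truck'};                   % hypothetical subset of labels
for k = 1:numel(labels)
    folder = fullfile(root, labels{k});
    if ~exist(folder, 'dir')
        mkdir(folder);
    end
    for n = 1:3
        img = uint8(255 * rand(32, 32, 3));  % random placeholder image
        imwrite(img, fullfile(folder, sprintf('img%d.png', n)));
    end
end

% One call indexes every image on disk; labels come from the folder names.
imageData = imageDatastore(root, ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

tbl = countEachLabel(imageData);     % table with one row per label
disp(tbl);
img1 = readimage(imageData, 1);      % images are read from disk on demand
```

The datastore only stores file paths and labels, so the same call pattern applies whether the folder holds six images or roughly 10,000.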
Automate Ground Truth Labeling
Access Reference Models in MATLAB
Easily Load Reference Networks
Access Models with 1 Line of MATLAB Code
Net1 = alexnet
Net2 = vgg16
Net3 = vgg19
Access Reference Models in MATLAB
1. Reference Models
2. Model Importer
3. Tutorials
Modify Network Structure
Simple MATLAB API to modify layers:
layers(23) = fullyConnectedLayer(5, 'Name','fc8');
layers(25) = classificationLayer('Name','VehicleClassifier')
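The two assignments above presuppose a `layers` array taken from a reference network. A minimal sketch of that context follows, assuming Deep Learning Toolbox and the AlexNet support package are installed; the variable name `numClasses` is introduced here for illustration.

```matlab
% Assumes Deep Learning Toolbox plus the AlexNet support package.
net = alexnet;                 % pretrained reference network
layers = net.Layers;           % 25-layer array; in AlexNet, layer 23 is the
                               % last fully connected layer and layer 25 is
                               % the classification output

numClasses = 5;                % Car, Truck, Large Truck, SUV, Van
layers(23) = fullyConnectedLayer(numClasses, 'Name', 'fc8');
layers(25) = classificationLayer('Name', 'VehicleClassifier');

disp(layers(23:25));           % new fc8, original softmax, new output layer
```

All earlier layers keep their pretrained weights; only the replaced layers start from fresh initial values when training begins.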
Training Object Detectors
Train Any Network
trainNetwork(datastore, layers, options)
Pre-built Frameworks for Computer Vision
• Deep Learning: R-CNN, Fast R-CNN, Faster R-CNN
• Machine Learning: ACF, Cascade Object Detectors
Visualizing and Debugging Intermediate Results
[Figure panels: Filters, Layer Activations, Feature Visualization, Deep Dream, Training Accuracy Visualization]
• Many options for visualizations and debugging
• Examples to get started
Real World Systems Use More Than Deep Learning
Deep learning vehicle detector performance degrades under environmental effects (fog, etc.)
[Figure: Fog Removal]
Challenge: Deep learning frameworks do not include “classical” computer vision
Solution: Convert MATLAB code with deep learning and computer vision to an embedded implementation
Talk Outline
Design Deep Learning & Vision Algorithms
Accelerate and Scale Training
High Performance Embedded Implementation
Can you solve “real” problems for production systems with MATLAB?
Accelerate and Scale Computing
Single code change:
trainingOptions('sgdm', ...
    'ExecutionEnvironment','cpu')
Value of 'ExecutionEnvironment' by target:
- 'cpu': Multi-core CPU
- 'gpu': GPU
- 'multi-gpu': Multiple GPU
- 'parallel': Cluster/Cloud
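A minimal sketch of the single code change in context, assuming Deep Learning Toolbox is installed. The synthetic data, the tiny network, and the variable `env` are illustrative assumptions and are not from the talk; the GPU and cluster settings additionally need Parallel Computing Toolbox and suitable hardware.

```matlab
% Synthetic stand-in data so the sketch is self-contained.
X = rand(28, 28, 1, 40, 'single');          % 40 grayscale 28x28 images
Y = categorical(repmat([1; 2], 20, 1));     % two balanced classes

% Tiny illustrative network (not one of the reference models).
layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(3, 4)
    reluLayer
    fullyConnectedLayer(2)
    softmaxLayer
    classificationLayer];

% Only this value changes between targets:
% 'cpu', 'gpu', 'multi-gpu', or 'parallel'.
env = 'cpu';
options = trainingOptions('sgdm', ...
    'ExecutionEnvironment', env, ...
    'MaxEpochs', 2, ...
    'MiniBatchSize', 10, ...
    'Verbose', false);

net = trainNetwork(X, Y, layers, options);
```

The data, layers, and the `trainNetwork` call stay the same across targets, which is the point the slide makes.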
After Many Iterations to Find The Best Model
Talk Outline
Design Deep Learning & Vision Algorithms
Accelerate and Scale Training
High Performance Embedded Implementation
Can you create a high performance implementation from MATLAB code?
Presenting the MATLAB to CUDA parallelizing compiler
Why?
• Alexnet inference using the MATLAB solution is
  - ~14x faster than pyCaffe and 50% faster than C++-Caffe
  - ~4x faster than TensorFlow, with ~3x less memory use
Sample Generated CUDA Code
[Figure: MATLAB source code side by side with auto-generated CUDA code]
MATLAB to CUDA compiler flow
• Front-end
• Control-flow graph intermediate representation (CFG-IR)
• Traditional compiler optimizations
• Library function mapping
• Parallel loop creation
• CUDA kernel creation
• CUDA kernel optimizations: cudaMemcpy minimization, shared memory synthesis
• CUDA code emission
Library function mapping
- (×): cublas-gemm
- (\): cuSolver calls
- fft: cuFFT calls
- nnet: cuDNN calls
Parallel loop creation
- Identify loop-nests that will become CUDA kernels
CUDA kernel creation
- Convert loop to CUDA kernel
- Thread/blocks inferred from loop dims
cudaMemcpy minimization
- Perform use-def analysis
- cudaMalloc GPU vars, insert memcpy
Shared memory synthesis
- Infer data locality
- Map to shared memory
- Synthesize shared memory access
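To make "thread/blocks inferred from loop dims" concrete, here is a minimal sketch of one way a launch size could be derived from a loop nest. The image size, the 512-thread block, and the one-thread-per-iteration mapping are assumptions for illustration and are not taken from the compiler itself.

```matlab
% Hypothetical loop nest: for c = 1:cols, for r = 1:rows, body; end, end
rows = 1000;
cols = 750;
totalIterations = rows * cols;        % one GPU thread per loop iteration

threadsPerBlock = 512;                % assumed block size
numBlocks = ceil(totalIterations / threadsPerBlock);

% Inside the kernel, each thread would recover its loop indices from its
% zero-based global id and skip ids past the end of the loop range (the
% last block is only partially used when the division is not exact).
threadId = 2500;                      % example global thread id
if threadId < totalIterations
    r = mod(threadId, rows) + 1;      % column-major order, as in MATLAB
    c = floor(threadId / rows) + 1;
end
fprintf('blocks = %d, thread %d handles (r=%d, c=%d)\n', ...
    numBlocks, threadId, r, c);
```

Rounding the block count up and guarding with a bounds check is the usual way to cover a trip count that is not a multiple of the block size.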
MATLAB to CUDA compiler: Creating large parallel loops!
Loop optimizations
• Scalarization
• Loop perfectization
• Loop interchange
• Loop fusion
• Scalar replacement
Example (scalarization, loop fusion, scalar replacement):
- Before: 2 kernels (size N), 20*N bytes
- After: 1 kernel (size N), 16*N bytes
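The slide does not say how its byte counts are derived. The sketch below is one hypothetical reading, assuming 4-byte (single) elements and counting the arrays that must be allocated: fusing two loops and replacing the temporary array with a scalar drops storage from five arrays (20*N bytes) to four (16*N bytes). The computation `c = (a + b) .* d` is invented for illustration.

```matlab
% Hypothetical computation: c = (a + b) .* d on single-precision vectors.
% Scalarization is what turns the vectorized expression into explicit loops.
N = 8;
a = single(1:N);
b = single(N:-1:1);
d = 2 * ones(1, N, 'single');

% Unfused form: two loops (two kernels) and a temporary array t.
t = zeros(1, N, 'single');
for i = 1:N
    t(i) = a(i) + b(i);
end
cUnfused = zeros(1, N, 'single');
for i = 1:N
    cUnfused(i) = t(i) * d(i);
end
bytesUnfused = 4 * N * 5;     % arrays a, b, d, t, c

% Fused form: one loop (one kernel); scalar s replaces the array t.
cFused = zeros(1, N, 'single');
for i = 1:N
    s = a(i) + b(i);          % lives in a register, never stored to memory
    cFused(i) = s * d(i);
end
bytesFused = 4 * N * 4;       % arrays a, b, d, c

assert(isequal(cUnfused, cFused));   % same result either way
fprintf('unfused: 2 loops, %d*N bytes\n', bytesUnfused / N);
fprintf('fused:   1 loop,  %d*N bytes\n', bytesFused / N);
```

Besides saving storage, the fused loop launches once instead of twice and avoids writing and re-reading the temporary.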
cudaMemcpy minimization
Assume gA, gB and gC are mapped to GPU memory

    A(:) = ...
    C(:) = ...
    for i = 1:N
        ...
        gB = kernel1(gA);
        gA = kernel2(gB);
        if (some_condition)
            gC = kernel3(gA, gB);
        end
        ...
    end
    ... = C;

Annotations on the code above:
- cudaMemcpy *definitely* needed
- cudaMemcpy *not* needed
- cudaMemcpy *may be* needed
Observations
• Equivalent to partial redundancy elimination (PRE)
• Dynamic strategy: track memory location with a status flag per variable
• Use-def analysis determines where to insert memcpy
Generated (pseudo) code

    A(:) = ...
    A_isDirtyOnCpu = true;
    ...
    for i = 1:N
        if (A_isDirtyOnCpu)
            cudaMemcpy(gA, A);
            A_isDirtyOnCpu = false;
        end
        gB = kernel1(gA);
        gA = kernel2(gB);
        if (some_condition)
            gC = kernel3(gA, gB);
            C_isDirtyOnGpu = true;
        end
        ...
    end
    ...
    if (C_isDirtyOnGpu)
        cudaMemcpy(C, gC);
        C_isDirtyOnGpu = false;
    end
    ... = C;
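The dirty-flag strategy can be simulated on the CPU to count how many transfers it performs. In this sketch the "kernels" are ordinary arithmetic, each modeled cudaMemcpy just increments a counter, and the branch pattern `takeBranch` and the naive baseline (copy in and out on every iteration) are assumptions for illustration.

```matlab
N = 5;                                           % loop trip count (hypothetical)
A = rand(1, 4);                                  % host data written before the loop
C = zeros(1, 4);
takeBranch = [false true false true false];      % stand-in for some_condition

% Assumed naive baseline: copy A in and C out on every iteration.
naiveCopies = 2 * N;

% Dirty-flag strategy, following the generated pseudo code.
copies = 0;
A_isDirtyOnCpu = true;        % A was just written on the CPU
C_isDirtyOnGpu = false;
gA = [];
gC = [];
for i = 1:N
    if A_isDirtyOnCpu
        gA = A;               % models cudaMemcpy(gA, A)
        copies = copies + 1;
        A_isDirtyOnCpu = false;
    end
    gB = gA + 1;              % stand-in for kernel1
    gA = gB * 2;              % stand-in for kernel2
    if takeBranch(i)
        gC = gA + gB;         % stand-in for kernel3
        C_isDirtyOnGpu = true;
    end
end
if C_isDirtyOnGpu
    C = gC;                   % models cudaMemcpy(C, gC)
    copies = copies + 1;
    C_isDirtyOnGpu = false;
end
fprintf('naive copies: %d, dirty-flag copies: %d\n', naiveCopies, copies);
```

Because the flags are checked at run time, the copy of C back to the host happens only if kernel3 actually ran, which is the "may be needed" case.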
Example: Compiling fog-rectification algorithm
MATLAB to CUDA compilation of computer vision applications
• Distance transform
• Fog removal
• SURF feature extraction
• Ray tracing
• Stereo disparity
Deep learning prediction performance: Alexnet
Test platform: CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50 GHz, GPU Tesla K40c
[Figure: frame rate (fps) versus batch size (1, 16, 32, 64); legend entries recovered: Py-Caffe, TensorFlow]
[Figure: memory usage (GB), CPU resident memory and GPU peak memory (nvidia-smi), versus batch size (1, 16, 32, 64); legend: Py-Caffe, C++-Caffe, TensorFlow, MATLAB on CPU+GPU, MATLAB to CUDA compiler]
Deep learning prediction performance: Alexnet on Jetson (Tegra) TX1
[Figure: frame rate (fps) versus batch size (1, 16, 32, 64, 128); legend: C++-Caffe, MATLAB to CUDA compiler]
Create CNNs with MATLAB, Deploy with MATLAB to CUDA compiler
Applications shown: Alexnet, YOLO, People detection, Lane detection
Frame rates shown: ~20 Fps (K40c), ~30 Fps (Tegra X1), ~66 Fps (Tegra X1), (K40c)
Conclusions
Design Deep Learning & Vision Algorithm
• Deep learning design is easy in MATLAB
Accelerate and Scale Training
• Managing datasets and scaling up training is easy in MATLAB
High Performance Embedded Implementation
• MATLAB to CUDA compiler
  - 10x – 14x faster than pyCaffe
  - 1.3x – 4x faster than TensorFlow
  - 1.07x – 1.6x faster than C++ Caffe
What next?
• MATLAB to CUDA compiler: Sign up for our beta program at www.mathworks.com/matlab-cuda-beta
• Try deep learning in MATLAB
• Visit our booth and see our demos (Booth #: 808)

Recommended

"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
Edge AI and Vision Alliance
 
"How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M...
Edge AI and Vision Alliance
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
Edge AI and Vision Alliance
 
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Present...
Edge AI and Vision Alliance
 
Backend Cloud Storage Access in Video Streaming
Backend Cloud Storage Access in Video Streaming
Rufael Mekuria
 
Presentation NBMP and PCC
Presentation NBMP and PCC
Rufael Mekuria
 
"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr...
"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation fr...
Edge AI and Vision Alliance
 
Aran Khanna, Software Engineer, Amazon Web Services at MLconf ATL 2017
Aran Khanna, Software Engineer, Amazon Web Services at MLconf ATL 2017
MLconf
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Lablup Inc.
 
JMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AI
Lablup Inc.
 
ML6 talk at Nexxworks Bootcamp
ML6 talk at Nexxworks Bootcamp
Karel Dumon
 
NVIDIA深度學習教育機構 (DLI): Object detection with jetson
NVIDIA深度學習教育機構 (DLI): Object detection with jetson
NVIDIA Taiwan
 
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
Edge AI and Vision Alliance
 
Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)
Karel Dumon
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud Platform
Matthias Feys
 
SigOpt at GTC - Tuning the Untunable
SigOpt at GTC - Tuning the Untunable
SigOpt
 
Hadoop + GPU
Hadoop + GPU
Vladimir Starostenkov
 
"Blending Cloud and Edge Machine Learning to Deliver Real-time Video Monitori...
"Blending Cloud and Edge Machine Learning to Deliver Real-time Video Monitori...
Edge AI and Vision Alliance
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
Jen Aman
 
"Deploying Deep Learning Models on Embedded Processors for Autonomous Systems...
"Deploying Deep Learning Models on Embedded Processors for Autonomous Systems...
Edge AI and Vision Alliance
 
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo Summit
 
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Ilham Amezzane
 
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
Edge AI and Vision Alliance
 
"Implementing the TensorFlow Deep Learning Framework on Qualcomm’s Low-power ...
"Implementing the TensorFlow Deep Learning Framework on Qualcomm’s Low-power ...
Edge AI and Vision Alliance
 
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
PAPIs.io
 
TinyML as-a-Service
TinyML as-a-Service
Hiroshi Doyu
 
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
Eric Haibin Lin
 
GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用
NVIDIA Taiwan
 
Introduction to Digital Image Processing Using MATLAB
Introduction to Digital Image Processing Using MATLAB
Ray Phan
 
Image Processing Basics
Image Processing Basics
Nam Le
 

More Related Content

What's hot (20)

Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Lablup Inc.
 
JMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AI
Lablup Inc.
 
ML6 talk at Nexxworks Bootcamp
ML6 talk at Nexxworks Bootcamp
Karel Dumon
 
NVIDIA深度學習教育機構 (DLI): Object detection with jetson
NVIDIA深度學習教育機構 (DLI): Object detection with jetson
NVIDIA Taiwan
 
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
Edge AI and Vision Alliance
 
Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)
Karel Dumon
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud Platform
Matthias Feys
 
SigOpt at GTC - Tuning the Untunable
SigOpt at GTC - Tuning the Untunable
SigOpt
 
Hadoop + GPU
Hadoop + GPU
Vladimir Starostenkov
 
"Blending Cloud and Edge Machine Learning to Deliver Real-time Video Monitori...
"Blending Cloud and Edge Machine Learning to Deliver Real-time Video Monitori...
Edge AI and Vision Alliance
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
Jen Aman
 
"Deploying Deep Learning Models on Embedded Processors for Autonomous Systems...
"Deploying Deep Learning Models on Embedded Processors for Autonomous Systems...
Edge AI and Vision Alliance
 
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo Summit
 
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Ilham Amezzane
 
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
Edge AI and Vision Alliance
 
"Implementing the TensorFlow Deep Learning Framework on Qualcomm’s Low-power ...
"Implementing the TensorFlow Deep Learning Framework on Qualcomm’s Low-power ...
Edge AI and Vision Alliance
 
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
PAPIs.io
 
TinyML as-a-Service
TinyML as-a-Service
Hiroshi Doyu
 
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
Eric Haibin Lin
 
GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用
NVIDIA Taiwan
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Lablup Inc.
 
JMI Techtalk: 한재근 - How to use GPU for developing AI
JMI Techtalk: 한재근 - How to use GPU for developing AI
Lablup Inc.
 
ML6 talk at Nexxworks Bootcamp
ML6 talk at Nexxworks Bootcamp
Karel Dumon
 
NVIDIA深度學習教育機構 (DLI): Object detection with jetson
NVIDIA深度學習教育機構 (DLI): Object detection with jetson
NVIDIA Taiwan
 
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
Edge AI and Vision Alliance
 
Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)
Karel Dumon
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud Platform
Matthias Feys
 
SigOpt at GTC - Tuning the Untunable
SigOpt at GTC - Tuning the Untunable
SigOpt
 
"Blending Cloud and Edge Machine Learning to Deliver Real-time Video Monitori...
"Blending Cloud and Edge Machine Learning to Deliver Real-time Video Monitori...
Edge AI and Vision Alliance
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
Jen Aman
 
"Deploying Deep Learning Models on Embedded Processors for Autonomous Systems...
"Deploying Deep Learning Models on Embedded Processors for Autonomous Systems...
Edge AI and Vision Alliance
 
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo and the Convergence of Machine Learning, Big Data, and Supercomputing
Accumulo Summit
 
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Hardware Acceleration of SVM Training for Real-time Embedded Systems: An Over...
Ilham Amezzane
 
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
"Approaches for Energy Efficient Implementation of Deep Neural Networks," a P...
Edge AI and Vision Alliance
 
"Implementing the TensorFlow Deep Learning Framework on Qualcomm’s Low-power ...
"Implementing the TensorFlow Deep Learning Framework on Qualcomm’s Low-power ...
Edge AI and Vision Alliance
 
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
Introduction to multi gpu deep learning with DIGITS 2 - Mike Wang
PAPIs.io
 
TinyML as-a-Service
TinyML as-a-Service
Hiroshi Doyu
 
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
Eric Haibin Lin
 
GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用
NVIDIA Taiwan
 

Viewers also liked (9)

Introduction to Digital Image Processing Using MATLAB
Introduction to Digital Image Processing Using MATLAB
Ray Phan
 
Image Processing Basics
Image Processing Basics
Nam Le
 
8085 Paper Presentation slides,ppt,microprocessor 8085 ,guide, instruction set
8085 Paper Presentation slides,ppt,microprocessor 8085 ,guide, instruction set
Saumitra Rukmangad
 
8085 microprocessor architecture ppt
8085 microprocessor architecture ppt
Parvesh Gautam
 
Digital Image Processing
Digital Image Processing
Sahil Biswas
 
Introduction to Machine Learning
Introduction to Machine Learning
Rahul Jain
 
Introduction to Machine Learning
Introduction to Machine Learning
Lior Rokach
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
Lars Marius Garshol
 
IoT architecture
IoT architecture
Sumit Sharma
 
Introduction to Digital Image Processing Using MATLAB
Introduction to Digital Image Processing Using MATLAB
Ray Phan
 
Image Processing Basics
Image Processing Basics
Nam Le
 
8085 Paper Presentation slides,ppt,microprocessor 8085 ,guide, instruction set
8085 Paper Presentation slides,ppt,microprocessor 8085 ,guide, instruction set
Saumitra Rukmangad
 
8085 microprocessor architecture ppt
8085 microprocessor architecture ppt
Parvesh Gautam
 
Digital Image Processing
Digital Image Processing
Sahil Biswas
 
Introduction to Machine Learning
Introduction to Machine Learning
Rahul Jain
 
Introduction to Machine Learning
Introduction to Machine Learning
Lior Rokach
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
Lars Marius Garshol
 

Similar to "Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded GPUs," a Presentation from MathWorks (20)

A Gentle Introduction to GPU Computing by Armen Donigian
A Gentle Introduction to GPU Computing by Armen Donigian
Data Con LA
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Databricks
 
AI On the Edge: Model Compression
AI On the Edge: Model Compression
Apache MXNet
 
ICGIS 2018 - Cloud-powered Machine Learnings on Geospactial Services (Channy ...
ICGIS 2018 - Cloud-powered Machine Learnings on Geospactial Services (Channy ...
Channy Yun
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020
OpenACC
 
Heterogeneous Data Mining with Spark
Heterogeneous Data Mining with Spark
KNIMESlides
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
Connected Data World
 
Introduction to GPUs for Machine Learning
Introduction to GPUs for Machine Learning
Sri Ambati
 
Big Data Analytics With MATLAB
Big Data Analytics With MATLAB
CodeOps Technologies LLP
 
Time series modeling workd AMLD 2018 Lausanne
Time series modeling workd AMLD 2018 Lausanne
Sunil Mallya
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
Seldon
 
Debugging and Performance tricks for MXNet Gluon
Debugging and Performance tricks for MXNet Gluon
Apache MXNet
 
Boston hug-2012-07
Boston hug-2012-07
Ted Dunning
 
Distributed deep learning optimizations
Distributed deep learning optimizations
geetachauhan
 
Distributed deep learning optimizations for Finance
Distributed deep learning optimizations for Finance
geetachauhan
 
Developing and Deploying Deep Learning Based Computer Vision Systems - Alka N...
Developing and Deploying Deep Learning Based Computer Vision Systems - Alka N...
CodeOps Technologies LLP
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
GoDataDriven
 
Update on the Mont-Blanc Project for ARM-based HPC
Update on the Mont-Blanc Project for ARM-based HPC
inside-BigData.com
 
"Performing Multiple Perceptual Tasks With a Single Deep Neural Network," a P...
"Performing Multiple Perceptual Tasks With a Single Deep Neural Network," a P...
Edge AI and Vision Alliance
 
System mldl meetup
System mldl meetup
Ganesan Narayanasamy
 
A Gentle Introduction to GPU Computing by Armen Donigian
A Gentle Introduction to GPU Computing by Armen Donigian
Data Con LA
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Databricks
 
AI On the Edge: Model Compression
AI On the Edge: Model Compression
Apache MXNet
 
ICGIS 2018 - Cloud-powered Machine Learnings on Geospactial Services (Channy ...
ICGIS 2018 - Cloud-powered Machine Learnings on Geospactial Services (Channy ...
Channy Yun
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020
OpenACC
 
Heterogeneous Data Mining with Spark
Heterogeneous Data Mining with Spark
KNIMESlides
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
Connected Data World
 
Introduction to GPUs for Machine Learning
Introduction to GPUs for Machine Learning
Sri Ambati
 
Time series modeling workd AMLD 2018 Lausanne
Time series modeling workd AMLD 2018 Lausanne
Sunil Mallya
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
Seldon
 
Debugging and Performance tricks for MXNet Gluon
Debugging and Performance tricks for MXNet Gluon
Apache MXNet
 
Boston hug-2012-07
Boston hug-2012-07
Ted Dunning
 
Distributed deep learning optimizations
Distributed deep learning optimizations
geetachauhan
 
Distributed deep learning optimizations for Finance
Distributed deep learning optimizations for Finance
geetachauhan
 
Developing and Deploying Deep Learning Based Computer Vision Systems - Alka N...
Developing and Deploying Deep Learning Based Computer Vision Systems - Alka N...
CodeOps Technologies LLP
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
GoDataDriven
 
Update on the Mont-Blanc Project for ARM-based HPC
Update on the Mont-Blanc Project for ARM-based HPC
inside-BigData.com
 
"Performing Multiple Perceptual Tasks With a Single Deep Neural Network," a P...
"Performing Multiple Perceptual Tasks With a Single Deep Neural Network," a P...
Edge AI and Vision Alliance
 

More from Edge AI and Vision Alliance (20)

“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
Edge AI and Vision Alliance
 
“A Re-imagination of Embedded Vision System Design,” a Presentation from Imag...
“A Re-imagination of Embedded Vision System Design,” a Presentation from Imag...
Edge AI and Vision Alliance
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
“Evolving Inference Processor Software Stacks to Support LLMs,” a Presentatio...
“Evolving Inference Processor Software Stacks to Support LLMs,” a Presentatio...
Edge AI and Vision Alliance
 
“Efficiently Registering Depth and RGB Images,” a Presentation from eInfochips
“Efficiently Registering Depth and RGB Images,” a Presentation from eInfochips
Edge AI and Vision Alliance
 
“How to Right-size and Future-proof a Container-first Edge AI Infrastructure,...
“How to Right-size and Future-proof a Container-first Edge AI Infrastructure,...
Edge AI and Vision Alliance
 
“Image Tokenization for Distributed Neural Cascades,” a Presentation from Goo...
“Image Tokenization for Distributed Neural Cascades,” a Presentation from Goo...
Edge AI and Vision Alliance
 
“Key Requirements to Successfully Implement Generative AI in Edge Devices—Opt...
“Key Requirements to Successfully Implement Generative AI in Edge Devices—Opt...
Edge AI and Vision Alliance
 
“Bridging the Gap: Streamlining the Process of Deploying AI onto Processors,”...
“Bridging the Gap: Streamlining the Process of Deploying AI onto Processors,”...
Edge AI and Vision Alliance
 
“From Enterprise to Makers: Driving Vision AI Innovation at the Extreme Edge,...
“From Enterprise to Makers: Driving Vision AI Innovation at the Extreme Edge,...
Edge AI and Vision Alliance
 
“Addressing Evolving AI Model Challenges Through Memory and Storage,” a Prese...
“Addressing Evolving AI Model Challenges Through Memory and Storage,” a Prese...
Edge AI and Vision Alliance
 
“Why It’s Critical to Have an Integrated Development Methodology for Edge AI,...
“Why It’s Critical to Have an Integrated Development Methodology for Edge AI,...
Edge AI and Vision Alliance
 
“Solving Tomorrow’s AI Problems Today with Cadence’s Newest Processor,” a Pre...
“Solving Tomorrow’s AI Problems Today with Cadence’s Newest Processor,” a Pre...
Edge AI and Vision Alliance
 
“State-space Models vs. Transformers for Ultra-low-power Edge AI,” a Presenta...
“State-space Models vs. Transformers for Ultra-low-power Edge AI,” a Presenta...
Edge AI and Vision Alliance
 
“How Qualcomm Is Powering AI-driven Multimedia at the Edge,” a Presentation f...
“How Qualcomm Is Powering AI-driven Multimedia at the Edge,” a Presentation f...
Edge AI and Vision Alliance
 
“OAAX: One Standard for AI Vision on Any Compute Platform,” a Presentation fr...
“OAAX: One Standard for AI Vision on Any Compute Platform,” a Presentation fr...
Edge AI and Vision Alliance
 
“Improved Data Sampling Techniques for Training Neural Networks,” a Presentat...
“Improved Data Sampling Techniques for Training Neural Networks,” a Presentat...
Edge AI and Vision Alliance
 
“Cost-efficient, High-quality AI for Consumer-grade Smart Home Cameras,” a Pr...
“Cost-efficient, High-quality AI for Consumer-grade Smart Home Cameras,” a Pr...
Edge AI and Vision Alliance
 
“Edge AI Optimization on Rails—Literally,” a Presentation from Wabtec
“Edge AI Optimization on Rails—Literally,” a Presentation from Wabtec
Edge AI and Vision Alliance
 
“How Large Language Models Are Impacting Computer Vision,” a Presentation fro...
“How Large Language Models Are Impacting Computer Vision,” a Presentation fro...
Edge AI and Vision Alliance
 
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
“Scaling i.MX Applications Processors’ Native Edge AI with Discrete AI Accele...
Edge AI and Vision Alliance
 
“A Re-imagination of Embedded Vision System Design,” a Presentation from Imag...
“A Re-imagination of Embedded Vision System Design,” a Presentation from Imag...
Edge AI and Vision Alliance
 
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
“MPU+: A Transformative Solution for Next-Gen AI at the Edge,” a Presentation...
Edge AI and Vision Alliance
 
“Evolving Inference Processor Software Stacks to Support LLMs,” a Presentatio...
“Evolving Inference Processor Software Stacks to Support LLMs,” a Presentatio...
Edge AI and Vision Alliance
 
“Efficiently Registering Depth and RGB Images,” a Presentation from eInfochips
“Efficiently Registering Depth and RGB Images,” a Presentation from eInfochips
Edge AI and Vision Alliance
 
“How to Right-size and Future-proof a Container-first Edge AI Infrastructure,...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded GPUs," a Presentation from MathWorks

  • 1. Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded GPUs. Girish Venkataramani, Avinash Nehemiah. May 2017. Copyright © 2017 MathWorks, Inc.
  • 2. Talk Outline. Design Deep Learning & Vision Algorithms: manage large image sets; automate image labeling; easy access to models; pre-built training frameworks. Accelerate and Scale Training: acceleration with GPUs; scale to clusters. High Performance Embedded Implementation: automate compilation of MATLAB to CUDA; 14x faster than pyCaffe, 60% faster than C++ Caffe, 3x faster than TensorFlow.
  • 3. Let’s Use Object Detection as an Example. Labels shown: TRUCK, SUV, CAR. In our example we’ll use deep learning for object detection.
  • 4. Transfer Learning Workflow. Start from training data (images and labels), load a reference network, modify the network structure, learn new weights, and obtain a new classifier. Labels: Car, Truck, Large Truck, SUV, Van. Reference networks: AlexNet, VGG-16, VGG-19, GoogLeNet.
  • 5. Manage Large Sets of Images. Organize images in folders (~10,000 images, 5 folders). Easily manage large sets of images: a single line of code to access images; operates on disk, database, or big-data file system. imageData = imageDatastore('vehicles')
  • 6. Automate Ground Truth Labeling.
  • 7. Automate Ground Truth Labeling.
  • 8. Access Reference Models in MATLAB. Easily load reference networks: access models with one line of MATLAB code. Net1 = alexnet; Net2 = vgg16; Net3 = vgg19
  • 9. Access Reference Models in MATLAB: 1. Reference models; 2. Model importer; 3. Tutorials.
  • 10. Modify Network Structure. Simple MATLAB API to modify layers: layers(23) = fullyConnectedLayer(5, 'Name', 'fc8'); layers(25) = classificationLayer('Name', 'VehicleClassifier');
  • 11. Training Object Detectors. Train any network: trainNetwork(datastore, layers, options). Pre-built frameworks for computer vision: deep learning (R-CNN, Fast R-CNN, Faster R-CNN) and machine learning (ACF, cascade object detectors).
  • 12. Visualizing and Debugging Intermediate Results. Visualizations shown: filters, layer activations, feature visualization, Deep Dream, training accuracy. Many options for visualization and debugging; examples to get started.
  • 13. Real-World Systems Use More Than Deep Learning. Deep learning vehicle detector performance degraded with environmental effects (fog, etc.). Example: fog removal. Challenge: deep learning frameworks do not include “classical” computer vision. Solution: convert MATLAB code with deep learning and computer vision to an embedded implementation.
  • 14. Talk Outline: Design Deep Learning & Vision Algorithms; Accelerate and Scale Training; High Performance Embedded Implementation. Can you solve “real” problems for production systems with MATLAB?
  • 15. Accelerate and Scale Computing with a single code change to trainingOptions('sgdm', …): 'ExecutionEnvironment','CPU' for multi-core CPU; 'ExecutionEnvironment','GPU' for a GPU; 'ExecutionEnvironment','multi-GPU' for multiple GPUs; 'ExecutionEnvironment','parallel' for cluster/cloud.
  • 16. After Many Iterations to Find the Best Model
  • 17. Talk Outline: Design Deep Learning & Vision Algorithms; Accelerate and Scale Training; High Performance Embedded Implementation. Can you create a high-performance implementation from MATLAB code?
  • 18. Presenting the MATLAB to CUDA parallelizing compiler. Why? AlexNet inference using the MATLAB solution is ~14x faster than pyCaffe and 50% faster than C++-Caffe, and ~4x faster with ~3x less memory use than TensorFlow.
  • 19. Sample Generated CUDA Code. [Figure: MATLAB source code and auto-generated CUDA code.]
  • 20. MATLAB to CUDA compiler flow. Stages shown: front-end; control-flow graph intermediate representation (CFG-IR); traditional compiler optimizations; library function mapping; parallel loop creation; CUDA kernel creation; CUDA kernel optimizations (cudaMemcpy minimization, shared memory synthesis); CUDA code emission.
    - Library function mapping: (×) to cuBLAS gemm; () to cuSolver calls; fft to cuFFT calls; nnet to cuDNN calls.
    - Parallel loop creation: identify loop nests that will become CUDA kernels.
    - CUDA kernel creation: convert each loop to a CUDA kernel; threads/blocks are inferred from loop dimensions.
    - cudaMemcpy minimization: perform use-def analysis; cudaMalloc GPU variables and insert memcpy.
    - Shared memory synthesis: infer data locality; map to shared memory; synthesize shared memory access.
  • 21. MATLAB to CUDA compiler: creating large parallel loops. Stages shown: front-end; control-flow graph intermediate representation (CFG-IR); traditional compiler optimizations; loop optimizations (scalarization, loop perfectization, loop interchange, loop fusion, scalar replacement); library function mapping; parallel loop creation; CUDA kernel creation; CUDA kernel optimizations (cudaMemcpy minimization, shared memory synthesis); CUDA code emission. Passes illustrated: scalarization, loop fusion, scalar replacement.
  • 22. MATLAB to CUDA compiler: creating large parallel loops (continued). Same compiler flow and loop optimizations (scalarization, loop perfectization, loop interchange, loop fusion, scalar replacement), with the example: 2 kernels (size N), 20*N bytes, versus 1 kernel (size N), 16*N bytes.
  • 23. cudaMemcpy minimization. Assume gA, gB and gC are mapped to GPU memory.
    Example code:
        A(:) = ….
        C(:) = ….
        for i = 1:N
            ….
            gB = kernel1(gA);
            gA = kernel2(gB);
            if (some_condition)
                gC = kernel3(gA, gB);
            end
            ….
        end
        …. = C;
    Annotations on the example: cudaMemcpy *definitely* needed; cudaMemcpy *not* needed; cudaMemcpy *may be* needed.
    Observations:
    - Equivalent to partial redundancy elimination (PRE).
    - Dynamic strategy: track memory location with a status flag per variable.
    - Use-def analysis determines where to insert memcpy.
    Generated (pseudo) code:
        A(:) = …
        A_isDirtyOnCpu = true;
        …
        for i = 1:N
            if (A_isDirtyOnCpu)
                cudaMemcpy(gA, A);
                A_isDirtyOnCpu = false;
            end
            gB = kernel1(gA);
            gA = kernel2(gB);
            if (some_condition)
                gC = kernel3(gA, gB);
                C_isDirtyOnGpu = true;
            end
            …
        end
        …
        if (C_isDirtyOnGpu)
            cudaMemcpy(C, gC);
            C_isDirtyOnGpu = false;
        end
        … = C;
  • 24. Example: Compiling a fog-rectification algorithm
  • 25. MATLAB to CUDA compilation of computer vision applications: distance transform, fog removal, SURF feature extraction, ray tracing, stereo disparity.
  • 26. Deep learning prediction performance: AlexNet. [Chart: frame rate (fps) versus batch size (1, 16, 32, 64); series labels include Py-Caffe and TensorFlow.] Test system: CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50 GHz; GPU Tesla K40c.
  • 27. Deep learning prediction performance: AlexNet. [Chart: memory usage (GB) versus batch size (1, 16, 32, 64), showing CPU resident memory and GPU peak memory (nvidia-smi); series: Py-Caffe, C++-Caffe, TensorFlow, MATLAB on CPU+GPU, MATLAB to CUDA compiler.] Test system: CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50 GHz; GPU Tesla K40c.
  • 28. Deep learning prediction performance: AlexNet on Jetson (Tegra) TX1. [Chart: frame rate (fps) versus batch size (1, 16, 32, 64, 128); series: C++-Caffe, MATLAB to CUDA compiler.]
  • 29. Create CNNs with MATLAB, deploy with the MATLAB to CUDA compiler. Examples: AlexNet, YOLO, people detection, lane detection. Frame rates shown: ~20 fps (K40c), ~30 fps (Tegra X1), ~66 fps (Tegra X1), and a K40c result whose value is missing.
  • 30. Conclusions. Design Deep Learning & Vision Algorithms: deep learning design is easy in MATLAB. Accelerate and Scale Training: managing datasets and scaling up training is easy in MATLAB. High Performance Embedded Implementation: the MATLAB to CUDA compiler is 10x – 14x faster than pyCaffe, 1.3x – 4x faster than TensorFlow, and 1.07x – 1.6x faster than C++ Caffe.
  • 31. What next? MATLAB to CUDA compiler: sign up for our beta program at www.mathworks.com/matlab-cuda-beta. Try deep learning in MATLAB. Visit our booth (#808) and see our demos.
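
The loop fusion and scalar replacement passes named on slides 21 and 22 can be illustrated in plain MATLAB. The sketch below is an assumption about what the "2 kernels, 20*N bytes" versus "1 kernel, 16*N bytes" comparison might refer to: it uses single-precision (4-byte) arrays, and the computation and variable names (A, B, C, D, k) are hypothetical rather than taken from the talk. Each `for` loop stands in for one CUDA kernel; this is not output of the MATLAB to CUDA compiler.

```matlab
% Hypothetical example: C = A + B followed by D = C .* k, written as
% scalarized loops. Each loop stands in for one CUDA kernel of size N.
N = 8;
A = single(1:N);
B = single(N:-1:1);
k = single(2);
bytesPerElem = 4;                  % single precision

% Unfused: two loops, i.e. two kernels.
C1 = zeros(1, N, 'single');
D1 = zeros(1, N, 'single');
for i = 1:N                        % kernel 1: reads A and B, writes C
    C1(i) = A(i) + B(i);
end
for i = 1:N                        % kernel 2: reads C, writes D
    D1(i) = C1(i) * k;
end
unfusedBytes = (3 + 2) * N * bytesPerElem;   % 20*N bytes of array traffic

% Fused with scalar replacement: one loop, C(i) kept in a scalar temporary.
C2 = zeros(1, N, 'single');
D2 = zeros(1, N, 'single');
for i = 1:N                        % single kernel: reads A and B, writes C and D
    c = A(i) + B(i);               % scalar temporary, so C is never re-read
    C2(i) = c;
    D2(i) = c * k;
end
fusedBytes = (2 + 2) * N * bytesPerElem;     % 16*N bytes of array traffic

% Both versions must produce identical results.
assert(isequal(C1, C2) && isequal(D1, D2));
fprintf('unfused: %d bytes, fused: %d bytes\n', unfusedBytes, fusedBytes);
```

Under these assumptions the fused loop saves one array read per element and one kernel launch, which is the kind of saving the slide's byte counts suggest.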
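
The dirty-flag strategy from slide 23 can be tried out as a CPU-only simulation in plain MATLAB. In the sketch below, ordinary arrays stand in for the GPU buffers gA, gB and gC, a counter replaces the real cudaMemcpy calls, and the three kernels and the loop condition are hypothetical placeholders. It only mirrors the control flow of the generated pseudo code and is not what the compiler emits.

```matlab
% Simulation only: gA, gB, gC stand in for GPU buffers, and copies are
% counted instead of calling cudaMemcpy. The kernels are placeholders.
N = 5;
A = ones(1, 4);                    % written on the CPU
C = zeros(1, 4);                   % written on the CPU
gA = [];  gB = [];  gC = [];
numCopies = 0;

A_isDirtyOnCpu = true;             % the GPU copy of A is stale
C_isDirtyOnGpu = false;            % the CPU copy of C is current

for i = 1:N
    if A_isDirtyOnCpu              % host-to-device copy only when stale
        gA = A;
        numCopies = numCopies + 1;
        A_isDirtyOnCpu = false;
    end
    gB = gA + 1;                   % stands in for kernel1
    gA = gB * 2;                   % stands in for kernel2
    if mod(i, 2) == 0              % stands in for some_condition
        gC = gA + gB;              % stands in for kernel3
        C_isDirtyOnGpu = true;     % the CPU copy of C is now stale
    end
end

if C_isDirtyOnGpu                  % device-to-host copy only if kernel3 ran
    C = gC;
    numCopies = numCopies + 1;
    C_isDirtyOnGpu = false;
end
result = C;                        % the final CPU-side use of C

fprintf('copies performed: %d\n', numCopies);
```

With these placeholder kernels the simulation performs two copies in total: one transfer of A before the first kernel and one transfer of C after the loop, regardless of the iteration count. If the condition never fires, the second copy is skipped as well.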