Team 6:
Sourabh Ketkale : 010470785
Sahil Kaw : 010725104
Siddhi Pai : 010702458
Goutham Nekkalapu : 010815233
Prince Jacob Chandy : 010807225
Deep Learning Algorithm Acceleration on Hardware Platforms (V2.0)
 Comparison to an optimized BLAS package: For higher-order matrices, the BLAS package achieved a larger speedup over the baseline CPU implementation.
 Comparison to an optimized GPU implementation: Without batching, the GPU attained a 2.8x speedup over the baseline CPU.
 Linear quantization: We use 8-bit quantization to convert activations to unsigned 8-bit integers and weights to signed 8-bit integers, while biases are encoded as 32-bit integers.
 Intel SSSE3: We achieve roughly a 3x speedup because the instruction set provides pmaddubsw, which multiplies unsigned by signed bytes and accumulates pairs into 16-bit results (see the sketch below).
 Intel SSE4: This instruction set adds a 16-bit to 32-bit conversion instruction, giving about a 9% relative speed improvement over the SSSE3 benchmark.
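Below is a minimal sketch of the fixed-point inner product described above, assuming activations stored as unsigned 8-bit integers and weights as signed 8-bit integers; it relies on the SSSE3 pmaddubsw intrinsic (_mm_maddubs_epi16) and widens the 16-bit partial sums to 32 bits with pmaddwd. Function and variable names are illustrative, not taken from the paper.

/* Sketch of the 8-bit fixed-point dot product (compile with -mssse3). */
#include <tmmintrin.h>   /* SSSE3: pmaddubsw */
#include <stdint.h>

int32_t dot_u8s8(const uint8_t *act, const int8_t *wgt, int n /* multiple of 16 */)
{
    __m128i acc = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);
    for (int i = 0; i < n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(act + i));
        __m128i w = _mm_loadu_si128((const __m128i *)(wgt + i));
        /* pmaddubsw: u8*s8 products summed pairwise into 16-bit lanes
           (may saturate; in practice the weights are scaled to avoid it) */
        __m128i p16 = _mm_maddubs_epi16(a, w);
        /* pmaddwd against a vector of ones widens the sums into 32-bit lanes */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(p16, ones));
    }
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));   /* horizontal sum */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
}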
BATCHING: With batching, we can further overtake the GPU's performance by propagating inputs through the network in batches, so that the CPU caches both the weights and the activations (see the sketch below).
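A plain-C sketch of the batching idea, with illustrative names: by looping over the whole batch of input frames inside a single pass over the weight matrix, each row of W is loaded from memory once and reused for every frame while it is still in cache.

/* Batched forward pass for one layer: Y[batch x out_dim] from X[batch x in_dim]. */
#include <stddef.h>

void forward_batched(const float *W, const float *X, float *Y,
                     int out_dim, int in_dim, int batch)
{
    for (int o = 0; o < out_dim; ++o) {                /* one pass over the weights */
        const float *w_row = W + (size_t)o * in_dim;
        for (int b = 0; b < batch; ++b) {              /* row reused for every frame */
            float s = 0.0f;
            for (int j = 0; j < in_dim; ++j)
                s += w_row[j] * X[(size_t)b * in_dim + j];
            Y[(size_t)b * out_dim + o] = s;
        }
    }
}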
LAZY EVALUATION: A neural network only needs to compute a fraction of its state at any time, so with the Gaussian selection technique we can reduce the number of parameters visited at each point and thereby reduce the number of arithmetic and memory operations (see the sketch below).
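A sketch of the lazy-evaluation idea, with illustrative names: only the output units selected for the current frame (for example, by Gaussian selection) are scored, so only the corresponding rows of the weight matrix are visited.

/* Score only the output units listed in `needed`. */
#include <stddef.h>

void lazy_output(const float *W, const float *bias, const float *act,
                 int in_dim, const int *needed, int n_needed, float *out)
{
    for (int i = 0; i < n_needed; ++i) {
        int o = needed[i];                        /* only the selected rows of W */
        float s = bias[o];
        for (int j = 0; j < in_dim; ++j)
            s += W[(size_t)o * in_dim + j] * act[j];
        out[i] = s;
    }
}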
BATCHED LAZY EVALUATION: Applying lazy evaluation to small batches in speech evaluation readily improves the CPU's performance relative to the GPU.
 An autoencoder is an artificial neural network used for learning efficient codings.
 A stacked autoencoder is a deep learning model consisting of multiple autoencoders.
 The Xeon Phi is a small cluster of 60 cores, each with 4 hardware threads. It has 8 GB of memory, a file system, a Linux operating system, and a 1 GHz clock speed. Each core has a 32 KB L1 data cache and a 512 KB L2 cache.
 Thread oversubscription means that the number of threads running in parallel exceeds the number of hardware threads the Xeon Phi supports.
 It greatly decreases the performance of the Xeon Phi because it leads to context switching, which is very expensive on a many-core processor.
Solution:
 The MapReduce method can effectively determine the number of threads required by the MKL (Math Kernel Library) functions (see the sketch below).
 The MKL library can also determine by itself the number of threads a process requires, but this is not suited for model parallelism and asynchronous training.
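A sketch of the thread-budgeting idea, assuming the standard omp_set_num_threads and mkl_set_num_threads calls; the 240-thread budget (60 cores x 4 threads) and the per-replica split are illustrative, not the paper's exact policy.

#include <mkl.h>
#include <omp.h>

void configure_threads(int model_replicas)
{
    int hw_threads  = 240;                     /* 60 cores x 4 hardware threads     */
    int per_replica = hw_threads / model_replicas;
    omp_set_num_threads(model_replicas);       /* one worker thread per replica     */
    mkl_set_num_threads(per_replica);          /* threads MKL may use in each GEMM  */
}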
Basic Design of Xeon Phi:
Training datasets for neural networks are very large, so a lot of I/O takes place between host memory and the coprocessor, and this transfer time also needs consideration.
To address this, we keep all parameters and temporary variables resident in the Xeon Phi's global memory and transfer only the training dataset in chunks (see the sketch below).
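A sketch of the keep-weights-resident pattern using the Intel compiler's classic offload (LEO) pragmas for the Xeon Phi coprocessor; treat the clause usage and names as illustrative rather than as the paper's exact code.

/* Upload the parameters once and keep them allocated on the coprocessor. */
void upload_weights(float *w, long n)
{
    #pragma offload_transfer target(mic:0) in(w : length(n) alloc_if(1) free_if(0))
}

/* Stream one training chunk in; reuse the already-resident weights. */
void train_on_chunk(float *w, long n, float *chunk, long m)
{
    #pragma offload target(mic:0) in(chunk : length(m)) \
            nocopy(w : length(n) alloc_if(0) free_if(0))
    {
        /* forward/backward pass over this chunk using w */
    }
}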
Parallel Design:
 Data parallelism: achieved by the Vector Processing Unit, which performs the element-wise operations inside each model replica.
 Task parallelism: achieved by running multiple threads on the Xeon Phi.
 Affinity mode: affinity sets up the mapping between threads and cores (see the sketch below).
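A sketch combining the points above, with illustrative names: OpenMP threads provide the task parallelism, a #pragma omp simd loop exposes the data parallelism to the 512-bit Vector Processing Unit, and thread-to-core affinity is chosen at launch time through the KMP_AFFINITY environment variable.

#include <omp.h>
#include <stddef.h>

void layer_forward(const float *W, const float *x, float *y,
                   int out_dim, int in_dim)
{
    #pragma omp parallel for                 /* task parallelism: rows across threads */
    for (int o = 0; o < out_dim; ++o) {
        float s = 0.0f;
        #pragma omp simd reduction(+:s)      /* data parallelism: VPU over the row */
        for (int j = 0; j < in_dim; ++j)
            s += W[(size_t)o * in_dim + j] * x[j];
        y[o] = s;
    }
}
/* Affinity mode is picked at launch, e.g.:  KMP_AFFINITY=balanced ./train */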
What is really holding us back with 'deep learning'?
To achieve this kind of computing, one can't depend on a single system; you need large-scale distributed systems.
You have multiple model replicas, each consisting of multiple machines, that train on different subsets of the data and publish their updates to the global model parameter server.
Model Parallelism
Data Parallelism
Whole-system co-design
 Model partitioning – the working set of the model is stored in the L3 cache
 Local weight computation at the parameter server
Exploiting asynchrony (weight updates are commutative and associative)
 Multi-threaded weight updates without locks
 Asynchronous batch updates – aggregate the weight updates and push them to the parameter server only when the aggregation is large enough (see the sketch below)
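A sketch of the lock-free, asynchronous-batch idea above, with illustrative sizes, threshold, and push call (the parameter-server API here is hypothetical): worker threads add their updates into the shared weights without locks, relying on commutativity, and each thread pushes its aggregated delta upstream only once it is large enough.

#include <stddef.h>

#define N_PARAMS       (1 << 16)
#define PUSH_THRESHOLD 64     /* mini-batches aggregated before a push (illustrative) */

static float shared_w[N_PARAMS];                    /* local replica, updated without locks */
static _Thread_local float local_delta[N_PARAMS];   /* per-thread aggregate for the server  */
static _Thread_local int   local_count;

void apply_gradient(const float *grad, float lr)
{
    for (size_t i = 0; i < N_PARAMS; ++i) {
        float d = -lr * grad[i];
        shared_w[i]    += d;    /* racy add: commutative/associative updates tolerate it */
        local_delta[i] += d;
    }
    if (++local_count >= PUSH_THRESHOLD) {
        /* push_to_server(local_delta, N_PARAMS);   -- hypothetical call */
        for (size_t i = 0; i < N_PARAMS; ++i) local_delta[i] = 0.0f;
        local_count = 0;
    }
}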
 To achieve this, GeePS needs to overcome the challenges of limited GPU memory, inter-machine communication (data movement overheads), and GPU stalls.
 A parameter server works by separating the problem of processing data from the problem of communicating and synchronizing it between different machines.
 GeePS is a parameter server supporting data-parallel model training.
The authors tried using an existing state-of-the-art parameter server system (IterStore) with GPU-based ML…
To enable a parameter server to support parallel ML applications running on distributed GPUs, the authors make three important changes:
 Explicit use of GPU memory for the parameter cache
 Batch-based parameter access methods
 Parameter server management of GPU memory on behalf of the application
[Figure: GPUs with a CPU-based parameter server vs. a GPU-based parameter server]
Two ways to achieve parallelism:
• By distributing the deep computation across a Hadoop cluster or a cloud of computing nodes
• By using field-programmable gate array (FPGA) hardware acceleration to speed up computationally intensive deep learning kernels
 Performance bottlenecks in deep learning of CNNs
 Design of distributed Hadoop clusters with a separation of kernels processed on standard or accelerated FPGA-based nodes
 Design and synthesis of the reconfigurable architecture to support kernel acceleration on FPGA-based nodes
 Design of an interface library to achieve compatibility between FPGA nodes and general-purpose nodes
 Kernel Identification
 Approach to the Distributed Algorithm with FPGA-Based Nodes
 Design and Implementation of the Reconfigurable Architecture for Deep Learning Kernels
 Seamless Integration of the Distributed Algorithm with the Accelerated Kernels
 To cash in on the ability to achieve fine-grained parallelism with reconfigurable hardware, which cannot be done with GPUs
 The performance-per-watt ratio is better with FPGAs, which can deliver computational power with lower energy consumption in power-sensitive environments such as mobile devices and data centers
 Support for all the open-source deep learning frameworks
 A set of programming languages, models, and tools supporting the Intel x86 architecture can also be used on the Intel Xeon Phi coprocessor with little change.
 As a result, instead of redesigning algorithms or models for the GPU in CUDA or OpenCL, vector-intensive algorithms can take advantage of the architecture mentioned above.
 OpenMP and the Intel MKL (Math Kernel Library) packages are used for parallelization.
 The many matrix multiplications are handled by the Intel MKL packages (see the sketch below).
 This achieves a 302-fold speedup compared with the un-optimized sequential algorithm.
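A sketch of the kind of dense-layer matrix product that gets handed to MKL, here through the standard CBLAS interface (cblas_sgemm); the dimensions and names are illustrative.

#include <mkl.h>

/* Y[batch x out_dim] = X[batch x in_dim] * W[in_dim x out_dim] */
void dense_forward(const float *X, const float *W, float *Y,
                   int batch, int in_dim, int out_dim)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                batch, out_dim, in_dim,
                1.0f, X, in_dim,
                      W, out_dim,
                0.0f, Y, out_dim);
}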
 Thread parallelism
 Controlled Hogwild
 Arbitrary Order of Synchronization
 Vectorization
 Speedup of the algorithm compared to a single thread on the Xeon Phi and to the sequential version executed on the Xeon E5
 Execution times for all thread counts and CNN architecture sizes on the Xeon Phi, and for the sequential version on the Xeon E5
Implements deep learning on low-cost platforms.
The low-cost platform device adopts a task-flexible architecture and multiple forms of parallelism to cover the functions of a CDBN (Convolutional Deep Belief Network).
 Complex functions
 An additional stage
 Random number generation
Additional trade-offs
 Arithmetic precision
 Hardware parallelism
 Memory input/output bandwidth
 Random number generator
 By implementing 3 key features:
 A deep-network learning engine with a dual-threaded, 4-stage task-level pipeline
 A deep-network inference engine with a dynamically reconfigurable systolic PE array
 A true random number generator
 High computational throughput and memory bandwidth are required
 Implementing and optimizing the 1D, 2D, and multi-channel 2D convolution operations on the GPU and Intel MIC
 Hence, we go for a many-core architecture
 For 1D and 2D convolution: register tiling (see the sketch below).
 For multi-channel 2D convolution: local-memory tiling.
On the Intel MIC, our solution reaches up to 25% of the theoretical peak performance.
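A plain-C sketch of register tiling for 1D convolution, with an illustrative tile size: a small tile of output accumulators stays in registers so each loaded input element and kernel weight is reused across the tile before leaving the register file.

#define TILE 4   /* illustrative register-tile width */

/* `in` must hold n_out + k - 1 elements. */
void conv1d_reg_tiled(const float *in, const float *kern, float *out,
                      int n_out, int k)
{
    int i = 0;
    for (; i + TILE <= n_out; i += TILE) {
        float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;   /* output tile in registers */
        for (int j = 0; j < k; ++j) {
            float w = kern[j];
            a0 += w * in[i + j];
            a1 += w * in[i + 1 + j];
            a2 += w * in[i + 2 + j];
            a3 += w * in[i + 3 + j];
        }
        out[i] = a0; out[i + 1] = a1; out[i + 2] = a2; out[i + 3] = a3;
    }
    for (; i < n_out; ++i) {                            /* remainder outputs */
        float acc = 0.f;
        for (int j = 0; j < k; ++j)
            acc += kern[j] * in[i + j];
        out[i] = acc;
    }
}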
 Deep learning algorithms are compute-intensive, so the choice of framework and hardware depends on the use-case scenario.
 GPU:
 Pro: provides huge computational power
 Can be used as a cluster of GPUs
 But: huge power consumption, and algorithms have to be redesigned and reimplemented in CUDA/OpenCL
 FPGAs:
 Pro: low power consumption compared to GPUs
 But: designing algorithms for FPGAs can be time-consuming
 A potential speedup of 12.6x and an energy reduction of 87.5% on a 6-node FPGA-accelerated Hadoop cluster
 Xeon Phi coprocessor:
 Pro: offers a considerable amount of computational power, and it is very easy to migrate to this platform from a normal CPU. Performance can be improved further by combining it with the Hadoop MapReduce method
 But: to handle huge datasets, a higher-end processor is needed
 x86 CPU: performance can be improved by fixed-point implementation, batching, and lazy evaluation.