Team 6:
Sourabh Ketkale : 010470785
Sahil Kaw : 010725104
Siddhi Pai : 010702458
Goutham Nekkalapu : 010815233
Prince Jacob Chandy : 010807225
Deep Learning Algorithm Acceleration on Hardware Platforms (V2.0)
 Comparison to an optimized BLAS package: For higher-order matrices, the BLAS package achieved a larger speedup over the baseline CPU implementation.
 Comparison to an optimized GPU implementation: Without batching, the GPU attained a 2.8x speedup over the baseline CPU.
 Linear quantization: We use 8-bit quantization to convert activations to unsigned 8-bit integers and weights to signed 8-bit integers, while biases are encoded as 32-bit integers.
 Intel SSSE3: We achieve roughly a 3x speedup because the instruction set provides pmaddubsw, which multiplies unsigned by signed bytes and accumulates pairs into 16-bit results (see the sketch below).
 Intel SSE4: This instruction set adds a 16-bit to 32-bit conversion instruction, giving about a 9% relative speed improvement over the SSSE3 benchmark.
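Below is a minimal sketch of the fixed-point inner product described above, assuming activations stored as unsigned 8-bit integers and weights as signed 8-bit integers; it relies on the SSSE3 pmaddubsw intrinsic (_mm_maddubs_epi16) and widens the 16-bit partial sums to 32 bits with pmaddwd. Function and variable names are illustrative, not taken from the paper.

/* Sketch of the 8-bit fixed-point dot product (compile with -mssse3). */
#include <tmmintrin.h>   /* SSSE3: pmaddubsw */
#include <stdint.h>

int32_t dot_u8s8(const uint8_t *act, const int8_t *wgt, int n /* multiple of 16 */)
{
    __m128i acc = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);
    for (int i = 0; i < n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(act + i));
        __m128i w = _mm_loadu_si128((const __m128i *)(wgt + i));
        /* pmaddubsw: u8*s8 products summed pairwise into 16-bit lanes
           (may saturate; in practice the weights are scaled to avoid it) */
        __m128i p16 = _mm_maddubs_epi16(a, w);
        /* pmaddwd against a vector of ones widens the sums into 32-bit lanes */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(p16, ones));
    }
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));   /* horizontal sum */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
}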
BATCHING: With batching, we can further overtake the GPU's performance by propagating inputs through the network in batches, so that the CPU caches both the weights and the activations (see the sketch below).
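A plain-C sketch of the batching idea, with illustrative names: by looping over the whole batch of input frames inside a single pass over the weight matrix, each row of W is loaded from memory once and reused for every frame while it is still in cache.

/* Batched forward pass for one layer: Y[batch x out_dim] from X[batch x in_dim]. */
#include <stddef.h>

void forward_batched(const float *W, const float *X, float *Y,
                     int out_dim, int in_dim, int batch)
{
    for (int o = 0; o < out_dim; ++o) {                /* one pass over the weights */
        const float *w_row = W + (size_t)o * in_dim;
        for (int b = 0; b < batch; ++b) {              /* row reused for every frame */
            float s = 0.0f;
            for (int j = 0; j < in_dim; ++j)
                s += w_row[j] * X[(size_t)b * in_dim + j];
            Y[(size_t)b * out_dim + o] = s;
        }
    }
}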
LAZY EVALUATION: A neural network only needs to compute a fraction of its state at any time, so with the Gaussian selection technique we can reduce the number of parameters visited at each point and thereby reduce the number of arithmetic and memory operations (see the sketch below).
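A sketch of the lazy-evaluation idea, with illustrative names: only the output units selected for the current frame (for example, by Gaussian selection) are scored, so only the corresponding rows of the weight matrix are visited.

/* Score only the output units listed in `needed`. */
#include <stddef.h>

void lazy_output(const float *W, const float *bias, const float *act,
                 int in_dim, const int *needed, int n_needed, float *out)
{
    for (int i = 0; i < n_needed; ++i) {
        int o = needed[i];                        /* only the selected rows of W */
        float s = bias[o];
        for (int j = 0; j < in_dim; ++j)
            s += W[(size_t)o * in_dim + j] * act[j];
        out[i] = s;
    }
}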
BATCHED LAZY EVALUATION: Applying lazy evaluation to small batches in speech evaluation readily improves the CPU's performance relative to the GPU.
 An autoencoder is an artificial neural network used for learning efficient codings.
 A stacked autoencoder is a deep learning model consisting of multiple autoencoders.
 The Xeon Phi is a small cluster of 60 cores, each with 4 hardware threads. It has 8 GB of memory, a file system, a Linux operating system, and a 1 GHz clock speed. Each core has a 32 KB L1 data cache and a 512 KB L2 cache.
 Thread oversubscription means that the number of threads running in parallel exceeds the number of hardware threads the Xeon Phi supports.
 It greatly decreases the performance of the Xeon Phi because it leads to context switching, which is very expensive on a many-core processor.
Solution:
 The MapReduce method can effectively determine the number of threads required by the MKL (Math Kernel Library) functions (see the sketch below).
 The MKL library can also determine by itself the number of threads a process requires, but this is not suited for model parallelism and asynchronous training.
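A sketch of the thread-budgeting idea, assuming the standard omp_set_num_threads and mkl_set_num_threads calls; the 240-thread budget (60 cores x 4 threads) and the per-replica split are illustrative, not the paper's exact policy.

#include <mkl.h>
#include <omp.h>

void configure_threads(int model_replicas)
{
    int hw_threads  = 240;                     /* 60 cores x 4 hardware threads     */
    int per_replica = hw_threads / model_replicas;
    omp_set_num_threads(model_replicas);       /* one worker thread per replica     */
    mkl_set_num_threads(per_replica);          /* threads MKL may use in each GEMM  */
}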
Basic Design of Xeon Phi:
Training datasets for neural networks are very large, so a lot of I/O takes place between host memory and the coprocessor, and this transfer time also needs consideration.
To address this, we keep all parameters and temporary variables resident in the Xeon Phi's global memory and transfer only the training dataset in chunks (see the sketch below).
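A sketch of the keep-weights-resident pattern using the Intel compiler's classic offload (LEO) pragmas for the Xeon Phi coprocessor; treat the clause usage and names as illustrative rather than as the paper's exact code.

/* Upload the parameters once and keep them allocated on the coprocessor. */
void upload_weights(float *w, long n)
{
    #pragma offload_transfer target(mic:0) in(w : length(n) alloc_if(1) free_if(0))
}

/* Stream one training chunk in; reuse the already-resident weights. */
void train_on_chunk(float *w, long n, float *chunk, long m)
{
    #pragma offload target(mic:0) in(chunk : length(m)) \
            nocopy(w : length(n) alloc_if(0) free_if(0))
    {
        /* forward/backward pass over this chunk using w */
    }
}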
Parallel Design:
 Data parallelism: achieved by the Vector Processing Unit, which performs the element-wise operations inside each model replica.
 Task parallelism: achieved by running multiple threads on the Xeon Phi.
 Affinity mode: affinity sets up the mapping between threads and cores (see the sketch below).
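A sketch combining the points above, with illustrative names: OpenMP threads provide the task parallelism, a #pragma omp simd loop exposes the data parallelism to the 512-bit Vector Processing Unit, and thread-to-core affinity is chosen at launch time through the KMP_AFFINITY environment variable.

#include <omp.h>
#include <stddef.h>

void layer_forward(const float *W, const float *x, float *y,
                   int out_dim, int in_dim)
{
    #pragma omp parallel for                 /* task parallelism: rows across threads */
    for (int o = 0; o < out_dim; ++o) {
        float s = 0.0f;
        #pragma omp simd reduction(+:s)      /* data parallelism: VPU over the row */
        for (int j = 0; j < in_dim; ++j)
            s += W[(size_t)o * in_dim + j] * x[j];
        y[o] = s;
    }
}
/* Affinity mode is picked at launch, e.g.:  KMP_AFFINITY=balanced ./train */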
What is really holding us back with 'deep learning'?
To achieve this kind of computing, one can't depend on a single system; you need large-scale distributed systems.
You have multiple model replicas, each consisting of multiple machines, that train on different subsets of the data and publish their updates to the global model parameter server.
Model Parallelism
Data Parallelism
Whole-system co-design
 Model partitioning – the working set of the model is stored in the L3 cache
 Local weight computation at the parameter server
Exploiting asynchrony (weight updates are commutative and associative)
 Multi-threaded weight updates without locks
 Asynchronous batch updates – aggregate the weight updates and push them to the parameter server only when the aggregation is large enough (see the sketch below)
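A sketch of the lock-free, asynchronous-batch idea above, with illustrative sizes, threshold, and push call (the parameter-server API here is hypothetical): worker threads add their updates into the shared weights without locks, relying on commutativity, and each thread pushes its aggregated delta upstream only once it is large enough.

#include <stddef.h>

#define N_PARAMS       (1 << 16)
#define PUSH_THRESHOLD 64     /* mini-batches aggregated before a push (illustrative) */

static float shared_w[N_PARAMS];                    /* local replica, updated without locks */
static _Thread_local float local_delta[N_PARAMS];   /* per-thread aggregate for the server  */
static _Thread_local int   local_count;

void apply_gradient(const float *grad, float lr)
{
    for (size_t i = 0; i < N_PARAMS; ++i) {
        float d = -lr * grad[i];
        shared_w[i]    += d;    /* racy add: commutative/associative updates tolerate it */
        local_delta[i] += d;
    }
    if (++local_count >= PUSH_THRESHOLD) {
        /* push_to_server(local_delta, N_PARAMS);   -- hypothetical call */
        for (size_t i = 0; i < N_PARAMS; ++i) local_delta[i] = 0.0f;
        local_count = 0;
    }
}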
 To achieve this, GeePS needs to overcome the challenges of limited GPU memory, inter-machine communication (data movement overheads), and GPU stalls.
 A parameter server works by separating the problem of processing data from the problem of communicating and synchronizing it between different machines.
 GeePS is a parameter server supporting data-parallel model training.
The authors tried using an existing state-of-the-art parameter server system (IterStore) with GPU-based ML…
To enable a parameter server to support parallel ML applications running on distributed GPUs, the authors make three important changes:
 Explicit use of GPU memory for the parameter cache
 Batch-based parameter access methods
 Parameter server management of GPU memory on behalf of the application
[Figure: GPUs with a CPU-based parameter server vs. a GPU-based parameter server]
Two ways to achieve parallelism:
• By distributing the deep computation across a Hadoop cluster or a cloud of computing nodes
• By using field-programmable gate array (FPGA) hardware acceleration to speed up computationally intensive deep learning kernels
 Performance bottlenecks in deep learning of CNNs
 Design of distributed Hadoop clusters with a separation of kernels processed on standard or accelerated FPGA-based nodes
 Design and synthesis of the reconfigurable architecture to support kernel acceleration on FPGA-based nodes
 Design of an interface library to achieve compatibility between FPGA nodes and general-purpose nodes
 Kernel Identification
 Approach to the Distributed Algorithm with FPGA-Based Nodes
 Design and Implementation of the Reconfigurable Architecture for Deep Learning Kernels
 Seamless Integration of the Distributed Algorithm with the Accelerated Kernels
 To cash in on the ability to achieve fine-grained parallelism with reconfigurable hardware, which cannot be done with GPUs
 The performance-per-watt ratio is better with FPGAs, which can deliver computational power with lower energy consumption in power-sensitive environments such as mobile devices and data centers
 Support for all the open-source deep learning frameworks
 A set of programming languages, models, and tools supporting the Intel x86 architecture can also be used on the Intel Xeon Phi coprocessor with little change.
 As a result, instead of redesigning algorithms or models for the GPU in CUDA or OpenCL, vector-intensive algorithms can take advantage of the architecture mentioned above.
 OpenMP and the Intel MKL (Math Kernel Library) packages are used for parallelization.
 The many matrix multiplications are handled by the Intel MKL packages (see the sketch below).
 This achieves a 302-fold speedup compared with the un-optimized sequential algorithm.
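A sketch of the kind of dense-layer matrix product that gets handed to MKL, here through the standard CBLAS interface (cblas_sgemm); the dimensions and names are illustrative.

#include <mkl.h>

/* Y[batch x out_dim] = X[batch x in_dim] * W[in_dim x out_dim] */
void dense_forward(const float *X, const float *W, float *Y,
                   int batch, int in_dim, int out_dim)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                batch, out_dim, in_dim,
                1.0f, X, in_dim,
                      W, out_dim,
                0.0f, Y, out_dim);
}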
 Thread parallelism
 Controlled Hogwild
 Arbitrary Order of Synchronization
 Vectorization
 Speedup of the algorithm compared to a single thread on the Xeon Phi and to the sequential version executed on the Xeon E5
 Execution times for all thread counts and CNN architecture sizes on the Xeon Phi, and for the sequential version on the Xeon E5
Implements deep learning on low-cost platforms.
The low-cost platform device adopts a task-flexible architecture and multiple forms of parallelism to cover the functions of a CDBN (Convolutional Deep Belief Network).
 Complex functions
 An additional stage
 Random number generation
Additional trade-offs
 Arithmetic precision
 Hardware parallelism
 Memory input/output bandwidth
 Random number generator
 By implementing 3 key features:
 A deep-network learning engine with a dual-threaded, 4-stage task-level pipeline
 A deep-network inference engine with a dynamically reconfigurable systolic PE array
 A true random number generator
 High computational throughput and memory bandwidth are required
 Implementing and optimizing the 1D, 2D, and multi-channel 2D convolution operations on the GPU and Intel MIC
 Hence, we go for a many-core architecture
 For 1D and 2D convolution: register tiling (see the sketch below).
 For multi-channel 2D convolution: local-memory tiling.
On the Intel MIC, our solution reaches up to 25% of the theoretical peak performance.
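A plain-C sketch of register tiling for 1D convolution, with an illustrative tile size: a small tile of output accumulators stays in registers so each loaded input element and kernel weight is reused across the tile before leaving the register file.

#define TILE 4   /* illustrative register-tile width */

/* `in` must hold n_out + k - 1 elements. */
void conv1d_reg_tiled(const float *in, const float *kern, float *out,
                      int n_out, int k)
{
    int i = 0;
    for (; i + TILE <= n_out; i += TILE) {
        float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;   /* output tile in registers */
        for (int j = 0; j < k; ++j) {
            float w = kern[j];
            a0 += w * in[i + j];
            a1 += w * in[i + 1 + j];
            a2 += w * in[i + 2 + j];
            a3 += w * in[i + 3 + j];
        }
        out[i] = a0; out[i + 1] = a1; out[i + 2] = a2; out[i + 3] = a3;
    }
    for (; i < n_out; ++i) {                            /* remainder outputs */
        float acc = 0.f;
        for (int j = 0; j < k; ++j)
            acc += kern[j] * in[i + j];
        out[i] = acc;
    }
}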
 Deep learning algorithms are compute-intensive, so the choice of framework and hardware depends on the use-case scenario.
 GPU:
 Pro: provides huge computational power
 Can be used as a cluster of GPUs
 But: huge power consumption, and algorithms have to be redesigned and reimplemented in CUDA/OpenCL
 FPGAs:
 Pro: low power consumption compared to GPUs
 But: designing algorithms for FPGAs can be time-consuming
 A potential speedup of 12.6x and an energy reduction of 87.5% on a 6-node FPGA-accelerated Hadoop cluster
 Xeon Phi coprocessor:
 Pro: offers a considerable amount of computational power, and it is very easy to migrate to this platform from a normal CPU. Performance can be improved further by combining it with the Hadoop MapReduce method
 But: to handle huge datasets, a higher-end processor is needed
 x86 CPU: performance can be improved by fixed-point implementation, batching, and lazy evaluation.