Scaling Deep Learning Algorithms on Extreme Scale Architectures

ABHINAV VISHNU
Team Lead, Scalable Machine Learning, Pacific Northwest National Laboratory
MVAPICH User Group (MUG) 2017
The rise of Deep Learning!

Feed-forward and back-propagation.

Several scientific applications have shown remarkable improvements in modeling/classification tasks, in some cases reaching human accuracy!
Challenges in Using Deep Learning

• How to design DNN topology?
• Which samples are important?
• How to handle unlabeled data?
• Supercomputers are typically used for simulation: are they effective for DL implementations?
• How much effort is required for using DL algorithms?
• Will it only reduce time-to-solution, or also improve the baseline performance of the model?
Vision for Machine/Deep Learning R&D

• Novel Machine Learning/Deep Learning Algorithms
• Extreme Scale ML/DL Algorithms
• MaTEx: Machine Learning Toolkit for Extreme Scale
• DL Applications: HEP, SLAC, Power Grid, HPC, Chemistry
Novel ML/DL Algorithms: Pruning Neurons

Which neurons are important?

Adaptive Neuron Apoptosis for Accelerating DL Algorithms

[Figure: (a) proposed adaptive pruning during the training phase, where the error decays; (b) state-of-the-art pruning after training with the error fixed, requiring a separate re-training phase.]

Area Under the Curve (ROC):
1) Improved from 0.88 to 0.94
2) 2.5x speedup in learning time
3) 3x simpler model
Speedup and Parameter Reduction vs. 20 cores without Apoptosis:

                        Default   Conser.   Normal   Aggressive
    20 Cores (speedup)    1.0       1.5       3         5
    40 Cores (speedup)    1.7       2.3       5         8
    80 Cores (speedup)    2.8       4.1       9        15
    Parameter Reduction   1         4        11        21
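To make the apoptosis idea concrete, here is a minimal NumPy sketch of magnitude-based neuron pruning during training. The outgoing-weight-norm criterion, the threshold value, and the prune_neurons helper are illustrative assumptions, not the exact rule from the paper:

    import numpy as np

    def prune_neurons(W_in, W_out, threshold=1e-2):
        # Illustrative criterion: a hidden unit's importance is the
        # L2 norm of its outgoing weights.
        importance = np.linalg.norm(W_out, axis=1)
        keep = np.where(importance > threshold)[0]
        # Drop the pruned units from both weight matrices.
        return W_in[:, keep], W_out[keep, :], keep

    # Example: a 100-unit hidden layer where half the units have decayed.
    rng = np.random.default_rng(0)
    W_in = rng.normal(size=(784, 100))   # weights into the hidden layer
    W_out = rng.normal(size=(100, 10))   # weights out of the hidden layer
    W_out[::2] *= 1e-4                   # simulate "dead" neurons
    W_in, W_out, kept = prune_neurons(W_in, W_out)
    print(len(kept), "of 100 neurons survive")

Applied periodically during training, rather than once after it, this is what lets the error keep decaying without a separate re-training phase.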
Novel ML/DL Algorithms: Neuro-genesis

Can you create neural network topologies semi-automatically?

Generating Neural Networks from BluePrints

[Figure: families of network topologies generated during training from a single blueprint, with layer widths (e.g., 2000 and 1500 wide early layers down to 10-50 units) scaled across variants.]
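A toy sketch of what blueprint expansion could look like: one list of base layer widths is scaled into a family of candidate topologies. The blueprint format and the expand_blueprint helper are hypothetical; the slides do not specify the actual mechanism:

    def expand_blueprint(base_widths, scales):
        # Hypothetical blueprint: scale every layer width by each
        # factor, keeping widths >= 1.
        return [[max(1, int(w * s)) for w in base_widths] for s in scales]

    # Base widths loosely echo the figure (2000- and 1500-wide layers).
    families = expand_blueprint([2000, 1500, 80, 50], [0.25, 0.5, 1.0])
    for widths in families:
        print(widths)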
Novel ML/DL Algorithms: Sample Pruning

Which samples are important?

YinYang Deep Learning for Large Scale Systems

[Figure: training divided into eons; early eons process every batch (Batch0 .. Batchn) per epoch, while in the pruned scheme some epochs process only Batch0 .. Batchp, with p < n.]
Scaling DL Algorithms Using Asynchronous Primitives

All-to-all reduction (MPI_Allreduce, NCCL allreduce) over the interconnect (NVLink, PCIe, InfiniBand).

The master thread enqueues gradient buffers and continues computing; an asynchronous thread dequeues them and performs MPI_Allreduce, so reductions may be not started, in progress, or completed while training proceeds.
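A minimal mpi4py sketch of the enqueue/dequeue pattern above: the master thread hands gradient buffers to a queue, and a background thread drains the queue and averages each buffer across ranks. This assumes the MPI library supports full thread concurrency (mpi4py requests MPI_THREAD_MULTIPLE by default); in a real trainer the master would also wait for a buffer's reduction to complete before applying it:

    import queue, threading
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    grad_queue = queue.Queue()

    def reducer():
        # Async thread: dequeue gradients and allreduce them in place.
        while True:
            grad = grad_queue.get()
            if grad is None:               # sentinel: shut down
                break
            comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)
            grad /= comm.Get_size()        # average across ranks

    worker = threading.Thread(target=reducer, daemon=True)
    worker.start()

    # Master thread: compute gradients, enqueue them, keep computing.
    for step in range(100):
        grad = np.random.rand(1024)        # stand-in for backprop output
        grad_queue.put(grad)
    grad_queue.put(None)
    worker.join()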
Sample Results

[Charts: batches per second vs. number of GPUs (4 to 128; AGD vs. BaG) and vs. number of compute nodes (8 to 64; SGD vs. AGD), covering strong scaling on PIC with MVAPICH and weak scaling on SummitDev with IBM Spectrum MPI.]
What does Fault Tolerant Deep Learning Need from MPI?

MPI has been criticized heavily for lack of fault-tolerance support. Candidate approaches:

1) Existing MPI implementations
2) User-Level Fault Mitigation (ULFM)
3) Reinit proposal

Which proposal is necessary and sufficient?
Code snippet of the original callback:

    ...
    // Original on_gradients_ready
    void on_gradients_ready(float *buf) {
        // conduct in-place allreduce of gradients
        rc = MPI_Allreduce(..., ...);
        // average the gradients by communicator size
        ...
    }

Code snippet for fault-tolerant DL:

    ...
    // Fault-tolerant on_gradients_ready
    void on_gradients_ready(float *buf) {
        // conduct in-place allreduce of gradients
        rc = MPI_Allreduce(..., ...);
        while (rc != MPI_SUCCESS) {
            // shrink the communicator to a new communicator
            MPIX_Comm_shrink(origcomm, &newcomm);
            rc = MPI_Allreduce(..., ...);
        }
        // average the gradients by communicator size
        ...
    }
Impact of DL on Other Application Domains

• Computational Chemistry: can molecular structure predict molecular properties?
• Buildings, Power Grid: what DL techniques are useful for energy modeling of buildings?
• HPC: when do multi-bit faults result in application error?
MaTEx: Machine Learning Toolkit for Extreme Scale

1) Open-source software with users in academia, laboratories, and industry
2) Supports graphics processing unit (GPU) and central processing unit (CPU) clusters/LCFs with high-end systems/interconnects
3) Machine Learning Toolkit for Extreme Scale (MaTEx): github.com/matex-org/matex
Architectures Supported by MaTEx

    GPU Arch.:     K20 (Gemini), K40, K80, P100
    Interconnect:  InfiniBand, Ethernet, Omni-Path
    CPU Arch.:     Xeon (Sandy Bridge, Haswell), Intel Knights Landing, Power 8

"Comparing the Performance of NVIDIA DGX-1 and Intel KNL on Deep Learning Workloads", ParLearning'17, IPDPS'17
Demystifying Extreme Scale DL

Google TensorFlow: TF scripts on top of the TF runtime (gRPC), data readers, and architectures. Not attractive for scientists!

MaTEx-TensorFlow: the same TF scripts on top of a TF runtime with MPI changes, data readers, and architectures. Requires no TF-specific changes for users.

Supports automatic distribution of HDF5, CSV, and PNetCDF formats; a conceptual sketch of per-rank sharding follows.
Example Code Changes

MaTEx-TensorFlow code:

    import tensorflow as tf
    import numpy as np
    ...
    from datasets import DataSet
    ...
    image_net = DataSet(...)
    data = image_net.training_data
    labels = image_net.training_labels
    ...
    # Setting up the network
    ...
    # Setting up optimizer
    ...
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)
    ...
    # Run training regime

Original TF code:

    import tensorflow as tf
    import numpy as np
    ...
    data = ...  # Load training data
    labels = ...  # Load labels
    ...
    # Setting up the network
    ...
    # Setting up optimizer
    ...
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)
    ...
    # Run training regime

Fig. 3: (Left) A sample MaTEx-TensorFlow script; (Right) the original TensorFlow script. Notice that MaTEx-TensorFlow requires no TensorFlow-specific changes.

    Name  CPU (#cores)    GPU  Network  MPI            cuDNN  CUDA  Nodes  #cores
    K40   Haswell (20)    K40  IB       OpenMPI 1.8.3  4      7.5   8      160
    SP    Ivybridge (20)  N/A  IB       OpenMPI 1.8.4  N/A    N/A   20     400

TABLE I: Hardware and Software Description. IB (InfiniBand).

User-transparent Distributed TensorFlow, A. Vishnu et al., arXiv'17

Supports automatic distribution of HDF5, CSV, and PNetCDF formats.
User-Transparent Distributed Keras

1) Distributed Keras with MPI available on github.com/matex-org/matex
2) Currently the only Keras implementation that does not require any MPI-specific changes to code
3) Tested on NERSC architectures

MaTEx-Keras code:

    import tensorflow as tf
    import numpy as np
    # Keras Imports
    ...
    dataset = tf.DataSet(...)
    data = dataset.training_data
    labels = dataset.training_labels
    ...
    # Defining Keras Model
    ...
    # Call to Keras training method
    ...

Original Keras code:

    import tensorflow as tf
    import numpy as np
    # Keras Imports
    ...
    data = ...  # Load training data
    labels = ...  # Load labels
    ...
    # Defining Keras Model
    ...
    # Call to Keras training method
    ...
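For concreteness, a runnable stand-alone version of the original-style script above, with placeholder data and a tiny model (both are illustrative; under MaTEx-Keras the same script is expected to run data-parallel across MPI ranks with no MPI calls in user code):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # Placeholder data: in practice, load real features and labels here.
    data = np.random.rand(1000, 32)
    labels = np.random.randint(0, 2, 1000)

    # Defining Keras Model
    model = Sequential([
        Dense(64, activation="relu", input_shape=(32,)),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy")

    # Call to Keras training method (unchanged under MaTEx-Keras)
    model.fit(data, labels, epochs=5, batch_size=32)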
Use-Case: SLAC Water/Ice Classification

Reducing the time to new science: from experiment to publication.

Typical experiment:
1) ~100 images/sec
2) ~100 TB of data
3) The problem is further exacerbated for the upcoming LCLS-2 (up to 1M images/sec)
4) Several domains exhibit these characteristics

Typical problems:
1) Too many images: can we find the important ones?
2) Unknown whether the experiment is on the "right track": results are not known till post-hoc data analysis
3) If the experiment succeeds: exorbitant time (several man-days) is spent on data cleaning/labeling, and several more man-days on manual data analysis (such as generating probability distribution functions)

Can we do better?
Sample Proof: Distinguishing Water from Ice

Dataset specification:
1) ~68 GB of data consisting of images with water and ice crystals
2) Scientists spent 17 man-days labeling each image as representing water or ice
3) Objective: can we reduce the labeling time while achieving very high accuracy?
   • We take 4000 samples and consider the following data splits: label 1200 to 2800 samples, train deep learning models (convolutional + deep neural architectures), and measure accuracy on the remaining samples
   • Observation: with 2800 labeled samples, we can accurately classify ~97% of the remaining samples
4) Conclusion: major reduction in labeling time, with results matching human labeling
   • Potential for significant reduction in time to scientific discovery
   • Labeling only "boundary" samples would further reduce the human effort
[Chart: testing accuracy vs. time (in minutes) on the Water/Ice dataset, with accuracy curves for models trained on 1203, 2005, 2807, and 3609 labeled samples; the accuracy axis spans 0.45 to 0.95 over 0 to 140 minutes.]
  
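As a reference point, a minimal Keras sketch of a binary water/ice classifier. The slides do not give the actual architecture or input resolution, so the 128x128 grayscale input and the layer sizes below are assumptions:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    # Assumed input resolution and layer sizes; the SLAC model may differ.
    model = Sequential([
        Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 1)),
        MaxPooling2D(),
        Conv2D(64, (3, 3), activation="relu"),
        MaxPooling2D(),
        Flatten(),
        Dense(128, activation="relu"),
        Dense(1, activation="sigmoid"),    # water vs. ice
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    # model.fit(images, labels, epochs=10)  # with 1200-2800 labeled samples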
Prototype for Semi-Supervised Learning

[Figure: as the model is re-trained, its recommendations change.]
Collaborators

Jeff Daily, Charles Siegel, Vinay Amatya, Leon Song, Ang Li, Garrett Goh, Malachi Schram, Joseph Manzano, Vikas Chandan, Thomas J. Lane (SLAC)
Thanks!

Contact: abhinav.vishnu@pnnl.gov
MaTEx webpage: https://github.com/matex-org/matex/
Publications: https://github.com/matex-org/matex/wiki/publications