Scaling Deep Learning Algorithms on Extreme Scale Architectures

ABHINAV VISHNU
Team Lead, Scalable Machine Learning, Pacific Northwest National Laboratory
MVAPICH User Group (MUG) 2017
The rise of Deep Learning!

Feed-forward and back-propagation.

Several scientific applications have shown remarkable improvements in modeling/classification tasks, in some cases reaching human accuracy!
Challenges in Using Deep Learning

• How to design DNN topology?
• Which samples are important?
• How to handle unlabeled data?
• Supercomputers are typically used for simulation: are they effective for DL implementations?
• How much effort is required for using DL algorithms?
• Will it only reduce time-to-solution, or also improve the baseline performance of the model?
Vision for Machine/Deep Learning R&D

• Novel Machine Learning/Deep Learning Algorithms
• Extreme Scale ML/DL Algorithms
• MaTEx: Machine Learning Toolkit for Extreme Scale
• DL Applications: HEP, SLAC, Power Grid, HPC, Chemistry
Novel ML/DL Algorithms: Pruning Neurons

Which neurons are important?

Adaptive Neuron Apoptosis for Accelerating DL Algorithms

[Figure: (a) proposed adaptive pruning during the training phase, where the error decays; (b) state-of-the-art pruning after training with the error fixed, requiring a separate re-training phase.]

Area Under the Curve (ROC):
1) Improved from 0.88 to 0.94
2) 2.5x speedup in learning time
3) 3x simpler model
Speedup and Parameter Reduction vs. 20 cores without Apoptosis:

                        Default   Conser.   Normal   Aggressive
    20 Cores (speedup)    1.0       1.5       3         5
    40 Cores (speedup)    1.7       2.3       5         8
    80 Cores (speedup)    2.8       4.1       9        15
    Parameter Reduction   1         4        11        21
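To make the apoptosis idea concrete, here is a minimal NumPy sketch of magnitude-based neuron pruning during training. The outgoing-weight-norm criterion, the threshold value, and the prune_neurons helper are illustrative assumptions, not the exact rule from the paper:

    import numpy as np

    def prune_neurons(W_in, W_out, threshold=1e-2):
        # Illustrative criterion: a hidden unit's importance is the
        # L2 norm of its outgoing weights.
        importance = np.linalg.norm(W_out, axis=1)
        keep = np.where(importance > threshold)[0]
        # Drop the pruned units from both weight matrices.
        return W_in[:, keep], W_out[keep, :], keep

    # Example: a 100-unit hidden layer where half the units have decayed.
    rng = np.random.default_rng(0)
    W_in = rng.normal(size=(784, 100))   # weights into the hidden layer
    W_out = rng.normal(size=(100, 10))   # weights out of the hidden layer
    W_out[::2] *= 1e-4                   # simulate "dead" neurons
    W_in, W_out, kept = prune_neurons(W_in, W_out)
    print(len(kept), "of 100 neurons survive")

Applied periodically during training, rather than once after it, this is what lets the error keep decaying without a separate re-training phase.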
Novel ML/DL Algorithms: Neuro-genesis

Can you create neural network topologies semi-automatically?

Generating Neural Networks from BluePrints

[Figure: families of network topologies generated during training from a single blueprint, with layer widths (e.g., 2000 and 1500 wide early layers down to 10-50 units) scaled across variants.]
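A toy sketch of what blueprint expansion could look like: one list of base layer widths is scaled into a family of candidate topologies. The blueprint format and the expand_blueprint helper are hypothetical; the slides do not specify the actual mechanism:

    def expand_blueprint(base_widths, scales):
        # Hypothetical blueprint: scale every layer width by each
        # factor, keeping widths >= 1.
        return [[max(1, int(w * s)) for w in base_widths] for s in scales]

    # Base widths loosely echo the figure (2000- and 1500-wide layers).
    families = expand_blueprint([2000, 1500, 80, 50], [0.25, 0.5, 1.0])
    for widths in families:
        print(widths)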
Novel ML/DL Algorithms: Sample Pruning

Which samples are important?

YinYang Deep Learning for Large Scale Systems

[Figure: training divided into eons; early eons process every batch (Batch0 .. Batchn) per epoch, while in the pruned scheme some epochs process only Batch0 .. Batchp, with p < n.]
Scaling DL Algorithms Using Asynchronous Primitives

All-to-all reduction (MPI_Allreduce, NCCL allreduce) over the interconnect (NVLink, PCIe, InfiniBand).

The master thread enqueues gradient buffers and continues computing; an asynchronous thread dequeues them and performs MPI_Allreduce, so reductions may be not started, in progress, or completed while training proceeds.
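A minimal mpi4py sketch of the enqueue/dequeue pattern above: the master thread hands gradient buffers to a queue, and a background thread drains the queue and averages each buffer across ranks. This assumes the MPI library supports full thread concurrency (mpi4py requests MPI_THREAD_MULTIPLE by default); in a real trainer the master would also wait for a buffer's reduction to complete before applying it:

    import queue, threading
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    grad_queue = queue.Queue()

    def reducer():
        # Async thread: dequeue gradients and allreduce them in place.
        while True:
            grad = grad_queue.get()
            if grad is None:               # sentinel: shut down
                break
            comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)
            grad /= comm.Get_size()        # average across ranks

    worker = threading.Thread(target=reducer, daemon=True)
    worker.start()

    # Master thread: compute gradients, enqueue them, keep computing.
    for step in range(100):
        grad = np.random.rand(1024)        # stand-in for backprop output
        grad_queue.put(grad)
    grad_queue.put(None)
    worker.join()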
Sample Results

[Charts: batches per second vs. number of GPUs (4 to 128; AGD vs. BaG) and vs. number of compute nodes (8 to 64; SGD vs. AGD), covering strong scaling on PIC with MVAPICH and weak scaling on SummitDev with IBM Spectrum MPI.]
What does Fault Tolerant Deep Learning Need from MPI?

MPI has been criticized heavily for lack of fault-tolerance support. Candidate approaches:

1) Existing MPI implementations
2) User-Level Fault Mitigation (ULFM)
3) Reinit proposal

Which proposal is necessary and sufficient?
Code snippet of the original callback:

    ...
    // Original on_gradients_ready
    void on_gradients_ready(float *buf) {
        // conduct in-place allreduce of gradients
        rc = MPI_Allreduce(..., ...);
        // average the gradients by communicator size
        ...
    }

Code snippet for fault-tolerant DL:

    ...
    // Fault-tolerant on_gradients_ready
    void on_gradients_ready(float *buf) {
        // conduct in-place allreduce of gradients
        rc = MPI_Allreduce(..., ...);
        while (rc != MPI_SUCCESS) {
            // shrink the communicator to a new communicator
            MPIX_Comm_shrink(origcomm, &newcomm);
            rc = MPI_Allreduce(..., ...);
        }
        // average the gradients by communicator size
        ...
    }
Impact of DL on Other Application Domains

• Computational Chemistry: can molecular structure predict molecular properties?
• Buildings, Power Grid: what DL techniques are useful for energy modeling of buildings?
• HPC: when do multi-bit faults result in application error?
MaTEx: Machine Learning Toolkit for Extreme Scale

1) Open-source software with users in academia, laboratories, and industry
2) Supports graphics processing unit (GPU) and central processing unit (CPU) clusters/LCFs with high-end systems/interconnects
3) Machine Learning Toolkit for Extreme Scale (MaTEx): github.com/matex-org/matex
Architectures Supported by MaTEx

    GPU Arch.:     K20 (Gemini), K40, K80, P100
    Interconnect:  InfiniBand, Ethernet, Omni-Path
    CPU Arch.:     Xeon (Sandy Bridge, Haswell), Intel Knights Landing, Power 8

"Comparing the Performance of NVIDIA DGX-1 and Intel KNL on Deep Learning Workloads", ParLearning'17, IPDPS'17
Demystifying Extreme Scale DL

Google TensorFlow: TF scripts on top of the TF runtime (gRPC), data readers, and architectures. Not attractive for scientists!

MaTEx-TensorFlow: the same TF scripts on top of a TF runtime with MPI changes, data readers, and architectures. Requires no TF-specific changes for users.

Supports automatic distribution of HDF5, CSV, and PNetCDF formats; a conceptual sketch of per-rank sharding follows.
Example Code Changes

MaTEx-TensorFlow code:

    import tensorflow as tf
    import numpy as np
    ...
    from datasets import DataSet
    ...
    image_net = DataSet(...)
    data = image_net.training_data
    labels = image_net.training_labels
    ...
    # Setting up the network
    ...
    # Setting up optimizer
    ...
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)
    ...
    # Run training regime

Original TF code:

    import tensorflow as tf
    import numpy as np
    ...
    data = ...  # Load training data
    labels = ...  # Load labels
    ...
    # Setting up the network
    ...
    # Setting up optimizer
    ...
    init = tf.global_variables_initializer()
    sess = tf.Session()
    sess.run(init)
    ...
    # Run training regime

Fig. 3: (Left) A sample MaTEx-TensorFlow script; (Right) the original TensorFlow script. Notice that MaTEx-TensorFlow requires no TensorFlow-specific changes.

    Name  CPU (#cores)    GPU  Network  MPI            cuDNN  CUDA  Nodes  #cores
    K40   Haswell (20)    K40  IB       OpenMPI 1.8.3  4      7.5   8      160
    SP    Ivybridge (20)  N/A  IB       OpenMPI 1.8.4  N/A    N/A   20     400

TABLE I: Hardware and Software Description. IB (InfiniBand).

User-transparent Distributed TensorFlow, A. Vishnu et al., arXiv'17

Supports automatic distribution of HDF5, CSV, and PNetCDF formats.
User-Transparent Distributed Keras

1) Distributed Keras with MPI available on github.com/matex-org/matex
2) Currently the only Keras implementation that does not require any MPI-specific changes to code
3) Tested on NERSC architectures

MaTEx-Keras code:

    import tensorflow as tf
    import numpy as np
    # Keras Imports
    ...
    dataset = tf.DataSet(...)
    data = dataset.training_data
    labels = dataset.training_labels
    ...
    # Defining Keras Model
    ...
    # Call to Keras training method
    ...

Original Keras code:

    import tensorflow as tf
    import numpy as np
    # Keras Imports
    ...
    data = ...  # Load training data
    labels = ...  # Load labels
    ...
    # Defining Keras Model
    ...
    # Call to Keras training method
    ...
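For concreteness, a runnable stand-alone version of the original-style script above, with placeholder data and a tiny model (both are illustrative; under MaTEx-Keras the same script is expected to run data-parallel across MPI ranks with no MPI calls in user code):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # Placeholder data: in practice, load real features and labels here.
    data = np.random.rand(1000, 32)
    labels = np.random.randint(0, 2, 1000)

    # Defining Keras Model
    model = Sequential([
        Dense(64, activation="relu", input_shape=(32,)),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy")

    # Call to Keras training method (unchanged under MaTEx-Keras)
    model.fit(data, labels, epochs=5, batch_size=32)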
Use-Case: SLAC Water/Ice Classification

Reducing the time to new science: from experiment to publication.

Typical experiment:
1) ~100 images/sec
2) ~100 TB of data
3) The problem is further exacerbated for the upcoming LCLS-2 (up to 1M images/sec)
4) Several domains exhibit these characteristics

Typical problems:
1) Too many images: can we find the important ones?
2) Unknown whether the experiment is on the "right track": results are not known till post-hoc data analysis
3) If the experiment succeeds: exorbitant time (several man-days) is spent on data cleaning/labeling, and several more man-days on manual data analysis (such as generating probability distribution functions)

Can we do better?
Sample Proof: Distinguishing Water from Ice

Dataset specification:
1) ~68 GB of data consisting of images with water and ice crystals
2) Scientists spent 17 man-days labeling each image as representing water or ice
3) Objective: can we reduce the labeling time while achieving very high accuracy?
   • We take 4000 samples and consider the following data splits: label 1200 to 2800 samples, train deep learning models (convolutional + deep neural architectures), and measure accuracy on the remaining samples
   • Observation: with 2800 labeled samples, we can accurately classify ~97% of the remaining samples
4) Conclusion: major reduction in labeling time, with results matching human labeling
   • Potential for significant reduction in time to scientific discovery
   • Labeling only "boundary" samples would further reduce the human effort
[Chart: testing accuracy vs. time (in minutes) on the Water/Ice dataset, with accuracy curves for models trained on 1203, 2005, 2807, and 3609 labeled samples; the accuracy axis spans 0.45 to 0.95 over 0 to 140 minutes.]
  
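As a reference point, a minimal Keras sketch of a binary water/ice classifier. The slides do not give the actual architecture or input resolution, so the 128x128 grayscale input and the layer sizes below are assumptions:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    # Assumed input resolution and layer sizes; the SLAC model may differ.
    model = Sequential([
        Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 1)),
        MaxPooling2D(),
        Conv2D(64, (3, 3), activation="relu"),
        MaxPooling2D(),
        Flatten(),
        Dense(128, activation="relu"),
        Dense(1, activation="sigmoid"),    # water vs. ice
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    # model.fit(images, labels, epochs=10)  # with 1200-2800 labeled samples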
Prototype for Semi-Supervised Learning

[Figure: as the model is re-trained, its recommendations change.]
Collaborators

Jeff Daily, Charles Siegel, Vinay Amatya, Leon Song, Ang Li, Garrett Goh, Malachi Schram, Joseph Manzano, Vikas Chandan, Thomas J. Lane (SLAC)
Thanks!

Contact: abhinav.vishnu@pnnl.gov
MaTEx webpage: https://github.com/matex-org/matex/
Publications: https://github.com/matex-org/matex/wiki/publications