Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* for Xeon Phi Cluster

Performance Optimization of Deep Learning Frameworks on Modern Intel Architectures
ElMoustapha Ould-Ahmed-Vall, AG Ramesh, Vamsi Sripathi and Karthik Raman
Representing the work of many at Intel

Agenda

•  Optimization matters on modern architectures
•  Intel’s recent Xeon and Xeon Phi products
•  Introduction to Deep Learning
•  Optimizing DL frameworks on IA
    •  Key challenges
    •  Optimization techniques
    •  Performance data
•  DL scaling

Moore’s Law Goes on!

Increasing clock speeds -> more cores + wider SIMD (Hierarchical parallelism)

Combined Amdahl’s Law for Vector Multicores*

$$
\mathrm{Speedup} =
\left(\frac{1}{\mathrm{Serial}_{frac} + \dfrac{1 - \mathrm{Serial}_{frac}}{\mathrm{NumCores}}}\right)
\times
\left(\frac{1}{\mathrm{Scalar}_{frac} + \dfrac{1 - \mathrm{Scalar}_{frac}}{\mathrm{VectorLength}}}\right)
$$

Goal: Reduce Serial Fraction and Reduce Scalar Fraction of Code
Ideal Speedup: NumCores * VectorLength (requires zero scalar, zero serial work)
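The formula above can be evaluated directly. The following is a minimal sketch; the function name and the example fractions (5% serial, 10% scalar) are illustrative assumptions, while the 68 cores match the Xeon Phi 7250 quoted later in the deck and 16 is the number of single-precision lanes in a 512-bit vector.

```python
def combined_speedup(serial_frac, scalar_frac, num_cores, vector_length):
    """Combined Amdahl's Law: multicore term times SIMD term."""
    core_term = 1.0 / (serial_frac + (1.0 - serial_frac) / num_cores)
    simd_term = 1.0 / (scalar_frac + (1.0 - scalar_frac) / vector_length)
    return core_term * simd_term

# Ideal case: no serial and no scalar work gives NumCores * VectorLength.
ideal = combined_speedup(0.0, 0.0, num_cores=68, vector_length=16)

# Hypothetical code with 5% serial and 10% scalar work.
realistic = combined_speedup(0.05, 0.10, num_cores=68, vector_length=16)

print(f"ideal: {ideal:.1f}x, with 5% serial / 10% scalar: {realistic:.1f}x")
```

Under these assumed fractions the speedup drops from the ideal 1088x to roughly 100x, which is why both fractions have to be reduced.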
Compute Bound Performance
Most kernels of ML codes are compute bound, i.e. raw FLOPS matter
Roofline Model
Gflops/s = min(Peak Gflops/s, Stream BW * flops/byte)

[Figure: roofline plot of attainable Gflops/s versus compute intensity (flops/byte), with ceilings for peak "compute" Gflops/s and for peak "compute" Gflops/s without SIMD]
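The roofline bound can be sketched in a few lines. The machine numbers below (3000 Gflops/s peak, 400 GB/s stream bandwidth) are hypothetical and are not taken from the slides; the function names are also illustrative.

```python
def attainable_gflops(peak_gflops, stream_bw_gbs, flops_per_byte):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, stream_bw_gbs * flops_per_byte)

def ridge_point(peak_gflops, stream_bw_gbs):
    """Compute intensity (flops/byte) above which a kernel is compute bound."""
    return peak_gflops / stream_bw_gbs

PEAK = 3000.0   # hypothetical peak Gflops/s
BW = 400.0      # hypothetical stream bandwidth in GB/s

for intensity in (0.5, 2.0, 7.5, 10.0):
    bound = attainable_gflops(PEAK, BW, intensity)
    kind = "compute" if intensity >= ridge_point(PEAK, BW) else "memory"
    print(f"{intensity:5.1f} flops/byte -> {bound:7.1f} Gflops/s ({kind} bound)")
```

Kernels to the right of the ridge point benefit from more FLOPS (cores, SIMD); kernels to the left benefit from blocking and data reuse that raise flops/byte.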

Overview of Current Generation of
Intel Xeon and Xeon Phi Products

Current Intel® Xeon Platforms

Processor       Microarchitecture        Process   Cadence
Nehalem         Nehalem (NEW)            45nm      TOCK
Westmere        Nehalem                  32nm      TICK
Sandy Bridge    Sandy Bridge (NEW)       32nm      TOCK
Ivy Bridge      Sandy Bridge             22nm      TICK
Haswell         Haswell (NEW)            22nm      TOCK
Broadwell       Haswell                  14nm      TICK

Latest released: Broadwell (14nm process)
•  Intel’s foundation of HPC and ML performance
•  Suited for full scope of workloads
•  Industry leading performance/watt for serial & highly parallel workloads
•  Up to 22 cores / socket (Broadwell-EP) (w/ Hyper-Threading technology)
Software optimization helps maximize benefit and adoption of new features

2nd Generation Intel® Xeon Phi™ Platform

Intel® AVX Technology

Platform     Vector ISA      Flops/Cycle
SNB/IVB      256b AVX1       16 SP / 8 DP
HSW/BDW      256b AVX2       32 SP / 16 DP (FMA)
SKX & KNL    512b AVX512     64 SP / 32 DP (FMA)

AVX
•  256-bit basic FP
•  16 registers
•  NDS (and AVX128)
•  Improved blend
•  MASKMOV
•  Implicit unaligned

AVX2
•  Float16 (IVB 2012)
•  256-bit FP FMA
•  256-bit integer
•  PERMD
•  Gather

AVX512
•  512-bit FP/Integer
•  32 registers
•  8 mask registers
•  Embedded rounding
•  Embedded broadcast
•  Scalar/SSE/AVX “promotions”
•  Native media additions
•  HPC additions
•  Transcendental support
•  Gather/Scatter
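The flops/cycle figures translate into a theoretical peak once core count and clock are known. The sketch below uses the core counts and nominal clocks from the configuration details later in the deck; treating the nominal clock as the sustained clock is an assumption (clocks under heavy vector load are typically lower), so these are upper bounds rather than measured numbers.

```python
# Single-precision flops per cycle per core, as listed on the slide.
FLOPS_PER_CYCLE_SP = {"AVX": 16, "AVX2": 32, "AVX512": 64}

def peak_sp_gflops(cores, ghz, isa):
    """Theoretical peak = cores * clock (GHz) * flops per cycle."""
    return cores * ghz * FLOPS_PER_CYCLE_SP[isa]

# Xeon E5-2699v4 (Broadwell): 22 cores, 2.2 GHz, AVX2.
bdw = peak_sp_gflops(22, 2.2, "AVX2")
# Xeon Phi 7250 (Knights Landing): 68 cores, 1.4 GHz, AVX512.
knl = peak_sp_gflops(68, 1.4, "AVX512")

print(f"BDW peak: {bdw:.1f} SP Gflops/s per socket")
print(f"KNL peak: {knl:.1f} SP Gflops/s per socket")
```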

Overview of Deep Learning and
DL Frameworks

Deep Learning: Convolutional Neural Network

Convolution Parameters:
•  Number of outputs/feature-maps: 4
•  Filter size: 3 x 3
•  Stride: 2
•  Pad_size (for corner case): 1

[Figure: a 3 x 3 filter moving over the input with stride 2 and pad size 1, producing the feature maps]
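To make the parameters concrete, here is a naive reference convolution in plain Python. The 5 x 5 input and the all-ones filter are assumptions for illustration (the slide does not give the input size), and the loop nest is deliberately unoptimized; it computes the sliding dot product used by DL frameworks for a single feature map.

```python
def conv_output_size(in_size, filter_size, stride, pad):
    """Output width/height of a convolution with zero padding."""
    return (in_size + 2 * pad - filter_size) // stride + 1

def conv2d(image, kernel, stride, pad):
    """Naive single-channel convolution on a square image (lists of lists)."""
    n, k = len(image), len(kernel)
    out = conv_output_size(n, k, stride, pad)
    result = [[0.0] * out for _ in range(out)]
    for oi in range(out):
        for oj in range(out):
            acc = 0.0
            for ki in range(k):
                for kj in range(k):
                    r = oi * stride + ki - pad
                    c = oj * stride + kj - pad
                    if 0 <= r < n and 0 <= c < n:  # outside is zero padding
                        acc += image[r][c] * kernel[ki][kj]
            result[oi][oj] = acc
    return result

# Hypothetical 5 x 5 input holding the values 0..24, and a 3 x 3 filter of ones.
image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
kernel = [[1.0] * 3 for _ in range(3)]

# Slide parameters: filter 3 x 3, stride 2, pad 1.
feature_map = conv2d(image, kernel, stride=2, pad=1)
print(len(feature_map), "x", len(feature_map[0]))
```

With four output feature maps, the same loop nest would run once per filter; the optimizations discussed later target exactly these loops.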

Deep Learning: Train Once Use Many Times

•  Step 1: Training (Over Hours/Days/Weeks)
    Input data (Person) -> Create Deep network -> Output Classification (90% person, 8% traffic light) -> Trained Model

•  Step 2: Inference (Real Time)
    New input from camera and sensors -> Trained neural network model -> Output Classification (97% person)

Deep Learning: Why Now?

•  Bigger Data
    •  Image: 1000 KB / picture
    •  Audio: 5000 KB / song
    •  Video: 5,000,000 KB / movie
•  Better Hardware
    •  Transistor density doubles every 18 months
    •  Cost / GB in 1995: $1000.00
    •  Cost / GB in 2015: $0.03
•  Smarter Algorithms
    •  Advances in algorithm innovation, including neural networks, leading to better accuracy in training models

Intel Caffe: ML Framework Optimized for Xeon and Xeon Phi Products

•  Fork of BVLC Caffe by Intel to optimize for IA
•  Leverages Intel MKL Deep Neural Network (DNN) APIs
•  Optimized for BDW (AVX2) and KNL (MIC_AVX512)
•  https://github.com/intel/caffe

[Diagram: Intel Caffe with a Data Layer (LMDB, LevelDB, HDF5) and a Compute Layer (Convolution, ReLU, Pooling) on top of Intel MKL DNN, targeting BDW and KNL]

TensorFlow™: Open Source ML Framework (Google)

•  Computation is a Dataflow Graph with Tensors
•  General computing mathematical framework, widely used for
    •  Deep Neural Networks
    •  Other machine learning algorithms
    •  HPC applications
•  Key computational kernels, extendable user operations
•  Core in C++, front end wrapper in Python
•  Multi node support using gRPC
    •  Google Remote Procedure Calls

[Figure: example from Jeff Dean’s presentation]

Optimizing Deep Learning Frameworks

Performance Optimization on Modern Platforms

Hierarchical Parallelism

Fine-Grained Parallelism / within node
Sub-domain: 1) Multi-level domain decomposition (ex. across layers) 2) Data decomposition (layer parallelism)

•  Utilize all the cores
    •  OpenMP, MPI, TBB…
    •  Reduce synchronization events, serial code
    •  Improve load balancing
•  Vectorize/SIMD
    •  Unit strided access per SIMD lane
    •  High vector efficiency
    •  Data alignment
•  Efficient memory/cache use
    •  Blocking
    •  Data reuse
    •  Prefetching
    •  Memory allocation

Coarse-Grained / multi-node
Domain decomposition

•  Scaling
    •  Improve load balancing
    •  Reduce synchronization events, all-to-all comms

Intel Strategy: Optimized Deep Learning Environment

•  Fuel the development of vertical solutions
•  Deliver best single node and multi-node performance
•  Accelerate design, training, and deployment
•  Drive optimizations across open source deep learning frameworks
•  Maximum performance on Intel architecture

[Diagram: Intel® Deep Learning SDK (Training, Inference); Intel® Math Kernel Library (Intel® MKL) + Intel® MKL-DNN; Intel® Omni-Path Architecture (Intel® OPA)]

Example Challenge 1: Data Layout Has Big Impact on Performance

•  Data layouts impact performance
    •  Sequential access to avoid gather/scatter
    •  Have iterations in innermost loop to ensure high vector utilization
    •  Maximize data reuse; e.g. weights in a convolution layer
•  Converting to/from optimized layout is sometimes less expensive than operating on unoptimized layout

[Figure: two 5 x 5 feature maps of sample values flattened into memory in two orders, interleaved across the maps (21 8 18 92 .. 1 11 ..) vs one map after the other (21 18 … 1 .. 8 92 ..), with the caption "Better optimized for some operations"]
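The effect of layout on access patterns can be illustrated with index arithmetic alone. The sketch below compares three ways of flattening an (N, C, H, W) tensor; the channel-blocked variant with a block of 8 is an assumption chosen for illustration and is not claimed to be the exact layout used inside Intel MKL.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Planar layout: each channel's H x W plane is contiguous."""
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    """Interleaved layout: all channels of one pixel are contiguous."""
    return ((n * H + h) * W + w) * C + c

def offset_blocked(n, c, h, w, C, H, W, block=8):
    """Channel-blocked layout [N][C/block][H][W][block]; C must divide by block."""
    cb, ci = divmod(c, block)
    return (((n * (C // block) + cb) * H + h) * W + w) * block + ci

C, H, W = 16, 4, 4  # hypothetical tensor shape

# Distance in memory between channel 0 and channel 1 of the same pixel.
for name, fn in (("NCHW", offset_nchw), ("NHWC", offset_nhwc),
                 ("blocked", offset_blocked)):
    stride = fn(0, 1, 0, 0, C, H, W) - fn(0, 0, 0, 0, C, H, W)
    print(f"{name:8s} channel stride = {stride}")
```

A channel stride of 1 means a SIMD register can load adjacent channels with one unit-stride load; a stride of H * W would need a gather.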

Example Challenge 2: Minimize Conversions Overhead

•  End to end optimization can reduce conversions
•  Staying in optimized layout as long as possible becomes one of the tuning goals
    •  Minimize the number of back and forth conversions
    •  Use of graph optimization techniques

[Figure: Convolution -> Convolution -> Max Pool pipeline with "Native to MKL layout" and "MKL layout to Native" conversions around the operators]
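One simple graph optimization of this kind cancels a conversion out of the optimized layout that is immediately followed by a conversion back into it. The sketch below works on a flat list of operation names; the names and the single-chain representation are hypothetical, and it assumes every operator in the chain can consume the optimized layout directly.

```python
TO_MKL = "native_to_mkl"
TO_NATIVE = "mkl_to_native"

def remove_redundant_conversions(ops):
    """Drop each mkl_to_native that is directly followed by native_to_mkl."""
    optimized = []
    for op in ops:
        if optimized and optimized[-1] == TO_NATIVE and op == TO_MKL:
            optimized.pop()  # the round trip cancels out
        else:
            optimized.append(op)
    return optimized

# Unoptimized chain: every operator converts in and out on its own.
pipeline = [
    TO_MKL, "convolution", TO_NATIVE,
    TO_MKL, "convolution", TO_NATIVE,
    TO_MKL, "max_pool", TO_NATIVE,
]

print(remove_redundant_conversions(pipeline))
```

Only the conversion at the entry and the one at the exit remain, so the data stays in the optimized layout across all three operators.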

Example Challenge 3: Ensuring Enough Parallelism to Leverage all Cores

•  Maximize parallelism to use all cores efficiently
•  Intra operation/layer parallelism within operators (OpenMP)
•  Inter operation parallelism across operators

[Figure: convolution of tiles in parallel within one feature map; parallel execution of 1x1 Conv, 3x3 Conv and 5x5 Conv branches feeding a concat]
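Inter-operation parallelism can be sketched as independent branches submitted to a thread pool and joined by a concat. The branch functions below are stand-ins that only scale their input, and in CPython the global interpreter lock limits real speedup for pure-Python work, so this shows the dependency structure rather than the performance; the frameworks in the deck use native threads (OpenMP) for the actual compute.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for independent convolution branches (hypothetical).
def conv_1x1(x):
    return [v * 1 for v in x]

def conv_3x3(x):
    return [v * 3 for v in x]

def conv_5x5(x):
    return [v * 5 for v in x]

def run_branches_and_concat(x, branches, max_workers=3):
    """Run branches that share an input concurrently, then concatenate."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in branch order, whatever the finish order.
        outputs = list(pool.map(lambda branch: branch(x), branches))
    return [v for out in outputs for v in out]

print(run_branches_and_concat([1, 2], [conv_1x1, conv_3x3, conv_5x5]))
```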

Example Challenge 4: Optimizing the Data Layer

•  Data Layer comprises 3 major ops
    o  Read data
    o  Decode data: e.g. JPEG decode, decompression
    o  Transform data
•  Result of read, decode & transform is input to DNN layers
•  Reduce number of cores dedicated to feed DNN
    o  IO optimization: consider compression
    o  Decode: consider LMDB instead of JPEG
    o  Resizing/data processing: consider pre-processing
    o  Then vectorize, parallelize

[Figure: cores C0 and C1 run Boost threads; cores C2 through Cn-1 run OpenMP threads]
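The split between a few data-layer threads and many compute threads is a producer/consumer arrangement. The following is a minimal sketch with one producer thread and a bounded queue; the decode and transform steps are string stand-ins, and the queue depth and batch size are arbitrary assumptions.

```python
import queue
import threading

def data_layer(samples, out_queue, batch_size):
    """Producer: read, decode and transform samples into batches."""
    batch = []
    for raw in samples:                 # read
        decoded = raw.lower()           # stand-in for decode (e.g. JPEG)
        batch.append(decoded.strip())   # stand-in for transform (e.g. resize)
        if len(batch) == batch_size:
            out_queue.put(batch)
            batch = []
    if batch:
        out_queue.put(batch)
    out_queue.put(None)                 # sentinel: no more data

def train(samples, batch_size=2, prefetch_depth=4):
    """Consumer: pulls prefetched batches while the producer keeps working."""
    batches = queue.Queue(maxsize=prefetch_depth)
    producer = threading.Thread(
        target=data_layer, args=(samples, batches, batch_size), daemon=True)
    producer.start()
    consumed = []
    while True:
        batch = batches.get()
        if batch is None:
            break
        consumed.append(batch)          # stand-in for forward/backward pass
    producer.join()
    return consumed

print(train([" A", "B ", "C"]))
```

If the consumer regularly finds the queue empty, the data layer is the bottleneck, which is the case the slide's suggestions (cheaper decode, pre-processing) address.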

Optimizing Deep Learning Frameworks for Intel® Architecture

•  Leverage high performant compute libraries and tools
    •  e.g. Intel® Math Kernel Library, Intel® Python, Intel® Compiler etc.
•  Data Format/Shape:
    •  Right format/shape for max performance: blocking, gather/scatter
•  Data Layout:
    •  Minimize cost of data layout conversions
•  Parallelism:
    •  Use all cores, eliminate serial sections, load imbalance
•  Other Functions/Primitives (un-optimized in libraries):
    •  Optimize via compiler knobs, improve existing implementations
•  Memory allocation
    •  unique characteristics and ability to reuse buffers
•  Data layer optimizations:
    •  parallelization, vectorization, IO
•  Optimize hyper parameters:
    •  e.g. batch size for more parallelism
    •  learning rate and optimizer to ensure accuracy/convergence

AlexNet Optimization Progression

[Figure: bar chart of cumulative speedup (axis 0 to 60) for Broadwell and Knights Landing. Data labels: 1.00x, 2.20x, 4.18x, 6.96x, 7.72x, 9.27x, 13.36x, 13.72x, 2.16x, 12.48x, 18.28x, 25.49x, 27.17x, 40.71x, 49.07x]

VGG Optimization Progression

Cumulative Speedup (chart axis 0 to 300)

Optimization step                        Broadwell   Knights Landing
Baseline                                 1.00x       1.00x
MKL Integration                          3.15x       15.80x
Thread Optimization                      5.40x       23.60x
Compiler Knobs Tuning                    10.18x      122.50x
Matrix Transpose/Data Transformations    13.29x      164.95x
Memory Allocations                       14.65x      171.20x
Conversions Optimization                 19.27x      273.50x
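Because the chart reports cumulative speedups, the gain contributed by each individual step is the ratio of consecutive values. The sketch below applies this to the VGG data labels, assuming the seven labels per platform line up in order with the seven step names on the axis.

```python
STEPS = [
    "Baseline", "MKL Integration", "Thread Optimization",
    "Compiler Knobs Tuning", "Matrix Transpose/Data Transformations",
    "Memory Allocations", "Conversions Optimization",
]
# Cumulative speedups read from the VGG chart (step order is an assumption).
VGG_BROADWELL = [1.00, 3.15, 5.40, 10.18, 13.29, 14.65, 19.27]
VGG_KNIGHTS_LANDING = [1.00, 15.80, 23.60, 122.50, 164.95, 171.20, 273.50]

def incremental_speedups(cumulative):
    """Per-step gain: each cumulative value divided by the previous one."""
    gains = [cumulative[0]]
    for prev, cur in zip(cumulative, cumulative[1:]):
        gains.append(cur / prev)
    return gains

bdw = incremental_speedups(VGG_BROADWELL)
knl = incremental_speedups(VGG_KNIGHTS_LANDING)
for step, b, k in zip(STEPS, bdw, knl):
    print(f"{step:40s} BDW {b:5.2f}x   KNL {k:5.2f}x")
```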


Configuration details

Intel® Xeon™ processor E5-2699v4 (22 Cores, 2.2 GHz), 128GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2

Intel® Xeon Phi™ processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM: Flat mode), 96GB DDR memory, CentOS 7.2 based on Red Hat* Enterprise Linux 7.2

AlexNet and VGG benchmarks: https://github.com/soumith/convnet-benchmarks

Multi-Node Distributed Training

•  Model Parallelism
    •  Break the model into N nodes
    •  The same data is in all the nodes
•  Data Parallelism
    •  Break the dataset into N nodes
    •  The same model is in all the nodes
    •  Good for networks with few weights, e.g. GoogLeNet
•  You can use either model or data parallelism or a hybrid of both
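Data parallelism can be simulated in a single process: each node computes a gradient on its shard of the batch with an identical copy of the model, and the gradients are averaged before the weight update. The one-parameter linear model, the data, and the assumption of equal-sized shards are illustrative; in a real cluster the averaging step would be a collective communication.

```python
def gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_gradient(w, xs, ys, num_nodes):
    """Split the batch into equal shards, one per node, and average."""
    shard = len(xs) // num_nodes  # assumes len(xs) divides evenly
    node_grads = [
        gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_nodes)  # each iteration stands for one node
    ]
    return sum(node_grads) / num_nodes  # stand-in for the gradient exchange

xs = [float(i) for i in range(1, 9)]
ys = [2.0 * x for x in xs]      # data generated with the true weight 2.0
w = 0.5                         # same model copy on every node

full = gradient(w, xs, ys)
sharded = data_parallel_gradient(w, xs, ys, num_nodes=4)
print(full, sharded)
```

With equal shards the averaged gradient matches the single-node gradient, so the only extra cost is exchanging one gradient per weight, which is why networks with few weights suit this scheme.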

Scaling Efficiency: Intel® Xeon Phi™ Processor

Deep Learning Image Classification Training Performance: MULTI-NODE Scaling

[Figure: scaling efficiency (%, axis 0 to 100) versus number of Intel® Xeon Phi™ Processor 7250 (68 cores, 1.4 GHz, 16 GB) nodes (1, 2, 4, 8, 16, 32, 64, 128) for OverFeat, AlexNet, VGG-A and GoogLeNet; data labels 62 and 87]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. *Other names and brands may be property of others

Configurations:
•  Intel® Xeon Phi™ Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 128 GB memory, Red Hat* Enterprise Linux 6.7, Intel® Optimized Framework
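The slide does not define scaling efficiency. A common definition, assumed here, is the achieved speedup over one node divided by the number of nodes; the timings in the example are hypothetical and are not the measurements behind the chart.

```python
def scaling_efficiency(time_one_node, time_n_nodes, num_nodes):
    """Percent of ideal linear speedup achieved on num_nodes nodes."""
    speedup = time_one_node / time_n_nodes
    return 100.0 * speedup / num_nodes

# Hypothetical timings: 1000 s on one node, 9 s on 128 nodes.
print(f"{scaling_efficiency(1000.0, 9.0, 128):.1f}% efficiency")
```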

Multi-node Challenges

•  Need to optimize both compute (iteration) and communication (weight updates)
•  More nodes mean higher batch per iteration
    •  Enough work for each node
•  Optimized hyper parameters (e.g. Batch Size)
    •  Time to Train: increases with batch size
    •  Accuracy: batch size impacts convergence and accuracy
•  Communication overheads if small per node batch
    •  e.g. Total batch size = 1024
        •  1024 nodes: Batch size = 1 per node (communication dominates)
        •  64 nodes: Batch size = 16 per node (computation dominates)

[Figure: Time To Train (TTT) versus batch size, with the sweet spot marked]
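The batch-size arithmetic in this example can be combined with a toy cost model to show why small per-node batches are communication bound. The model (compute time proportional to per-node batch, a fixed communication time per iteration) and both timing constants are assumptions for illustration only.

```python
def per_node_batch(total_batch, num_nodes):
    """Samples each node processes per iteration under data parallelism."""
    if total_batch % num_nodes:
        raise ValueError("total batch must divide evenly across nodes")
    return total_batch // num_nodes

def comm_fraction(total_batch, num_nodes, ms_per_sample, comm_ms):
    """Share of one iteration spent communicating in the toy cost model."""
    compute_ms = per_node_batch(total_batch, num_nodes) * ms_per_sample
    return comm_ms / (compute_ms + comm_ms)

MS_PER_SAMPLE = 10.0   # hypothetical compute time per sample
COMM_MS = 40.0         # hypothetical weight-update exchange per iteration

for nodes in (64, 1024):
    batch = per_node_batch(1024, nodes)
    frac = comm_fraction(1024, nodes, MS_PER_SAMPLE, COMM_MS)
    print(f"{nodes:5d} nodes: batch {batch:2d} per node, "
          f"{100 * frac:.0f}% of iteration in communication")
```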

Summary

•  Don’t be fooled by performance of DL workloads when using unoptimized frameworks
•  Significant performance headroom from optimization on Xeon and Xeon Phi
    •  Close to 300x speedup in certain topologies
•  Traditional vectorization and parallelization strategies apply
•  Other unique performance challenges: hyper parameters, data layer, inter/intra layer parallelization, etc.
•  Call to action:
    •  Try Intel optimized frameworks available today, more to come soon

Legal Disclaimers
•  Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across
different processor families: Go to: Learn About Intel® Processor Numbers http://www.intel.com/products/processor_number
•  Some results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system
hardware or software design or configuration may affect actual performance.
•  Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such
as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of
those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products.
•  Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all
of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced
benchmarks are accurate and reflect performance of systems available for purchase.
•  Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the
baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that
correlates with the performance improvements reported.
•  SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjbb, SPECompG, SPEC MPI, and SPECjEnterprise* are trademarks of the Standard Performance
Evaluation Corporation. See http://www.spec.org for more information.
•  TPC Benchmark, TPC-C, TPC-H, and TPC-E are trademarks of the Transaction Processing Council. See http://www.tpc.org for more information.
•  No computer system can provide absolute reliability, availability or serviceability. Requires an Intel® Xeon® processor E7-8800/4800/2800 v2 product
families or Intel® Itanium® 9500 series-based system (or follow-on generations of either.) Built-in reliability features available on select Intel®
processors may require additional software, hardware, services and/or an internet connection. Results may vary depending upon configuration.
Consult your system manufacturer for more details.
For systems also featuring Resilient System Technologies: No computer system can provide absolute reliability, availability or serviceability. Requires
an Intel® Run Sure Technology-enabled system, including an enabled Intel processor and enabled technology(ies). Built-in reliability features available
on select Intel® processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending upon
configuration. Consult your system manufacturer for more details.
For systems also featuring Resilient Memory Technologies: No computer system can provide absolute reliability, availability or serviceability. Requires
an Intel® Run Sure Technology-enabled system, including an enabled Intel® processor and enabled technology(ies). Built-in reliability features
available on select Intel® processors may require additional software, hardware, services and/or an Internet connection. Results may vary depending
upon configuration. Consult your system manufacturer for more details.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These optimizations include SSE2,
SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not
manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with Intel
microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for
more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
