Quantizing Deep Networks for
Efficient Inference at the Edge
Raghu Krishnamoorthi, Facebook
Questions/Feedback: raghuraman@fb.com
Acknowledgements
• Results presented here are from work done at Google as part of the TensorFlow Lite team and work at Facebook as part of the PyTorch team.
• Acknowledge contributions from several colleagues at Google, including Benoit Jacob, Skirmantas Kligys, Dmitry Kalenichenko, Suharsh Sivakumar and Pete Warden.
• Also acknowledge work from colleagues at Facebook: Jongsoo Park, Maxim Naumov, Summer Deng, Marat Dukhan, Bichen Wu, Peizhao Zhang, Jerry Zhang, Dmytro Dzhulgakov, Daya Khudia, Jianyu Huang, James Reed, Mikhail Z, Haixin Liu and Peter Vajda.
Outline
• Motivation
• Quantization: Overview
• Quantizing deep networks
• Post Training quantization
• Quantization aware training
• Lower precision inference
• Hardware accelerator recommendations
• Model system co-design
• Looking ahead
Motivation(1)
• Data-center power consumption is doubling every year
Source: Deep Learning Inference in Facebook Data-Centers [1]
Motivation(2)
• The number of edge devices is growing rapidly, and many of these devices are resource constrained.
Source: https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/
Motivation(3)
• While models are becoming more efficient, high accuracy still implies high complexity.
From: Benchmark Analysis of Representative Deep Neural Network Architectures, Simone Bianco et al. [2]
Quantization
• Many approaches to solving the problems outlined here:
• Better hardware accelerators: TPUs => requires new custom hardware
• Optimized kernels: cuDNN, Intel MKL-DNN
• Efficient deep network architectures: NASNet, MobileNet, FBNet => requires new architectures
• A simpler approach that does not require re-design of models or new hardware is quantization.
• Quantization refers to techniques that perform computation and storage at reduced precision.
• Works in combination with the approaches above
• Requires optimized kernels to efficiently use existing hardware.
Background: Quantization(1)
• Quantization refers to mapping values from fp32 to a lower precision format.
• Specified by:
• Format (e.g. fp32, fp16, bfloat16, int8, int4, binary)
• Mapping type
• Granularity
From: https://commons.wikimedia.org/w/index.php?curid=69415943
Background: Quantization(2)
• We also consider different granularities of quantization:
• Per-layer quantization:
• Same mapping for all elements in a layer.
• Per-row / per-channel quantization (see the sketch below):
• Choose quantizer parameters independently for each row (fc layers) or for each conv kernel (conv layers).
• Outlier-aware quantization:
• Separate out outliers so that lower precision arithmetic can be used for the bulk of the weights.
• Dense computation for the inliers, combined with sparse computation for the outliers that have a large magnitude.
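To make the per-layer vs. per-channel distinction concrete, here is a minimal NumPy sketch of symmetric quantization at both granularities. It is illustrative only; the function name and the toy weight matrix are assumptions for this example, not production kernels.

```python
import numpy as np

def quantize_symmetric(w, num_bits=8, axis=None):
    """Symmetric quantization. axis=None -> one scale for the whole tensor
    (per-layer); axis=0 -> one scale per row / output channel (per-channel)."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    if axis is None:
        max_abs = np.max(np.abs(w))
    else:
        # reduce over every axis except `axis`, keeping dims for broadcasting
        reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
        max_abs = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / qmax
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return w_q, scale

# Rows with very different magnitudes: per-channel scales track each row,
# which is what recovers accuracy for networks such as MobileNet.
w = (np.random.randn(64, 128) * np.logspace(-2, 0, 64)[:, None]).astype(np.float32)
wq_l, s_l = quantize_symmetric(w)              # per-layer
wq_c, s_c = quantize_symmetric(w, axis=0)      # per-channel (per-row)
print("per-layer error:  ", np.abs(w - wq_l * s_l).mean())
print("per-channel error:", np.abs(w - wq_c * s_c).mean())
```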
Modeling quantization during training
• Emulate quantization by quantizing and de-quantizing in succession
• Values are still in floating point, but with reduced precision
• x_out = FakeQuant(x) = s · (Clamp(round(x/s) − z) + z) = DeQuant(Quant(x))
• Can also model quantization as a stochastic rounding operation
Fake Quantizer (top), showing the quantization of output values. Approximation for purposes of derivative calculation (bottom).
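A minimal PyTorch sketch of this fake quantizer, assuming the common convention q = clamp(round(x/s) + z), dequant = s·(q − z), which may differ from the slide's notation in the sign of z. The detach trick gives the identity-like derivative approximation shown in the bottom figure.

```python
import torch

def fake_quant(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize-dequantize in the forward pass; the detach trick makes the
    backward pass behave like the identity (a straight-through estimator)."""
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_dq = (q - zero_point) * scale                 # de-quantize back to float
    # (x_dq - x).detach() contributes no gradient, so d(out)/d(x) == 1.
    # Production implementations typically also zero the gradient where x
    # falls outside the representable range.
    return x + (x_dq - x).detach()

w = torch.randn(8, requires_grad=True)
out = fake_quant(w, scale=0.05, zero_point=128)
out.sum().backward()
print(w.grad)   # all ones: the quantizer is transparent to gradients
```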
Quantization: Benefits
• Applicability: broad applicability across models and use cases
• Hardware support: supported by x86, Nvidia Volta, ARM, Mali, QDSP
• Software support: kernel libraries widely available
• Memory size: 4x reduction
• Memory bandwidth/cache: 4x reduction
• Compute: 2x to 4x speedup, depending on ISA
• Power: typically 4x reduction (dominated by memory access)
(Comparing float32 implementations with 8-bit inference)
Quantization: Challenges
• Accuracy drop: the loss in accuracy can be too high for certain applications. Mitigation: quantization aware training.
• Kernel support: wide variety of operators + multiple hardware platforms. Mitigation: improving the software tool-chain (TVM) to handle varied backends.
• “Complexity”: non-trivial, requires calibration/training in some cases. Mitigation: support in software packages: TensorRT, TensorFlow and PyTorch.
Quantizing deep networks
Model Quantization: Overview
(Diagram: the Train → Convert for inference → Graph Optimization → Kernel Implementation pipeline, shown three times: unquantized; with quantization added when converting for inference (post training quantization); and with fake quantization added during training followed by quantization at conversion (quantization aware training).)
What to quantize?
• Only quantize parts of network that contribute significantly to performance
• Roofline analysis to identify compute vs memory bandwidth bound operations
• May need to further reduce based on accuracy impact.
• Multiple ways to quantize a network with different impact:
Quantization scheme | Memory bandwidth reduction (weights) | Memory bandwidth reduction (activations) | Compute speedup | Notes
Weight only quantization to int8 | 4x | 1x | 1x | Suitable for embedding lookups
Dynamic quantization | 4x | 1x | 2x | Suitable for fc layers with small batches
Static quantization (int32 accumulators) | 4x | 4x | 2x | Suited for all layers, important for convolutions
Static quantization (int16 accumulators) | 4x | 4x | 4x | Requires lower precision weights/activations
Post training quantization: Weight compression
• Simplest quantization scheme is to compress the
weights to lower precision
• Requires no input data and can be done statically as part of
preparing a model for inference
• Hardware accelerators can benefit if de-compression is
done after memory access
• Trivial for case of fp16/int8 quantization of weights.
• K-means compression is also supported in select platforms
and is amenable to simple de-compression
• Scatter-Gather operation in select processors
• Supported in CoreML
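As an illustration of k-means weight compression and its gather-based de-compression, here is a small sketch. It uses scikit-learn's KMeans purely for brevity and is not how any particular runtime implements it.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_compress(w, num_clusters=16):
    """Replace each weight by the index of its nearest centroid (4 bits for
    16 clusters) plus a small fp32 codebook; de-compression is a table
    lookup (gather), which is why select processors can do it cheaply."""
    km = KMeans(n_clusters=num_clusters, n_init=4).fit(w.reshape(-1, 1))
    codebook = km.cluster_centers_.astype(np.float32).ravel()   # 16 fp32 values
    indices = km.labels_.astype(np.uint8).reshape(w.shape)      # 4-bit indices
    return codebook, indices

def kmeans_decompress(codebook, indices):
    return codebook[indices]                                    # gather

w = np.random.randn(256, 256).astype(np.float32)
codebook, idx = kmeans_compress(w)
w_hat = kmeans_decompress(codebook, idx)
print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```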
Dynamic quantization
• Dynamic quantization refers to schemes where the activations are
read/written in fp32 and are dynamically quantized to lower precisions for
compute.
• Requires no calibration data
• Data exchanged between operations is in floating point, so no need to
worry about format conversion.
• Provides performance improvements close to static quantization when
memory access is dominated by weights
• Suitable for inference in RNNs
• Smaller gains for conv layers
• Supported by:
• PyTorch (see the example below)
• TensorFlow Lite
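For reference, dynamic quantization of fc layers is essentially a one-liner in PyTorch. This is a minimal sketch; the API names are from the PyTorch 1.3-era torch.quantization namespace and may differ in later releases, and the toy model is an assumption for the example.

```python
import torch

# Toy float model whose memory traffic is dominated by Linear (fc) weights.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).eval()

# Weights are quantized to int8 ahead of time; activations are quantized
# on the fly per batch, so no calibration data is needed.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

print(quantized(torch.randn(1, 1024)).shape)   # torch.Size([1, 10])
```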
Quantizing weights and activations
• Post training quantization refers to quantizing both weights and
activations to reduced precision, typically int8.
• Requires estimation of statistics of activations for determining
quantizer parameters.
• Quantizer parameters are determined by minimizing an error metric (a simple example follows below):
• KL divergence: TensorRT
• Saturation error: TensorFlow Lite
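A simplified sketch of the min/max ("saturation error") style of calibration, computing an affine scale and zero-point from observed activation statistics. The function name and details are illustrative, not any framework's exact implementation.

```python
import numpy as np

def choose_quant_params(x_min, x_max, qmin=0, qmax=255):
    """Pick an affine (asymmetric) scale and zero-point so that the observed
    range [x_min, x_max] maps onto [qmin, qmax]. TensorRT instead searches
    for the clipping range that minimizes KL divergence."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # range must contain 0
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))     # keep z representable
    return scale, zero_point

# Calibration: run a few batches and track min/max of each activation tensor.
activations = np.random.randn(1000).astype(np.float32) * 3.0
scale, zp = choose_quant_params(activations.min(), activations.max())
print("scale:", scale, "zero_point:", zp)
```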
Results
Setup
• Standard classification model architectures
• Evaluate classification accuracy on the ImageNet validation dataset.
• Results obtained using TensorFlow; more details at: https://arxiv.org/abs/1806.08342
• More results and PyTorch support for quantization to be announced at PyTorch DevCon on October 10th.
Post training quantization: Results
Network | Asymmetric, Per Layer | Symmetric, Per Channel | Asymmetric, Per Channel | Activation only quantized | Weight only, Symmetric, Per Channel | Floating point
Mobilenet-v2-1-224 | 0.001 | 0.698 | 0.697 | 0.7 | 0.698 | 0.719
Mobilenet-v1-1-224 | 0.001 | 0.591 | 0.703 | 0.708 | 0.591 | 0.709
Nasnet-Mobile | 0.72 | 0.72 | 0.74 | 0.74 | 0.72 | 0.74
Inception-v3 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78
Resnet-v1-50 | 0.75 | 0.75 | 0.75 | 0.751 | 0.75 | 0.752
• 8 bits for weights and activations is sufficient for common CV classification tasks
• Smaller networks are “harder” to quantize
• At 8 bits, accuracy drop is dominated by weight quantization
Quantization aware training
Results
Network | Asymmetric, Per Layer, post training quant | Symmetric, Per Channel, post training quant | Asymmetric, Per Layer, QAT | Symmetric, Per Channel, QAT | Floating point
Mobilenet-v2-1-224 | 0.001 | 0.698 | 0.709 | 0.711 | 0.719
Mobilenet-v1-1-224 | 0.001 | 0.591 | 0.70 | 0.707 | 0.709
Nasnet-Mobile | 0.72 | 0.72 | 0.73 | 0.73 | 0.74
Inception-v3 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78
Resnet-v1-50 | 0.75 | 0.75 | 0.75 | 0.751 | 0.752
• Quantization aware training provides the best accuracy and allows for simpler quantization
schemes.
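For quantization aware training, the PyTorch flow looks roughly like the sketch below. API names are from the torch.quantization namespace being announced around this time and may differ across versions; the toy model is an assumption for the example.

```python
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU())

# 1. Attach fake-quant modules/observers and fine-tune with them in place.
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)

# ... run the usual fine-tuning loop here: weights and activations are
# fake-quantized in the forward pass, gradients flow straight through ...

# 2. Convert to a true int8 model for inference.
model.eval()
quantized_model = torch.quantization.convert(model)

# A complete example would also wrap the model with QuantStub/DeQuantStub
# (and fuse conv+bn+relu) before prepare_qat; omitted here for brevity.
```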
Performance: Operator level benchmarks
Server: FBGEMM (quantized) vs MKL-DNN (fp32)
Performance: Model level benchmarks
(Mobile: Tensorflow Lite)
Mobile: Inference time: float vs quantized, TFLite, Pixel2
QNNPACK kernels provide an additional 2x speedup
Lower precision inference(1)
• Four-bit precision for weights provides good accuracy, but needs to be applied selectively.
• Larger networks are more robust to lower precision
• Quantization aware training is critical
• Selectively quantizing layers of a network to different precisions can reduce the accuracy drop
4-bit weights, 8-bit activations: Top-1 accuracy results
Lower precision inference(2)
• Different layers of a neural network have different sensitivity to quantization errors
• Exciting work on differentiable architecture search [7] for determining precision allocations across layers, showing excellent performance
Architecture trade-offs(1)
• Clear tradeoff between number of parameters and robustness to
quantization
• One can also trade off the number of feature maps vs precision
• Having 2x the number of feature maps at 4 bits is better than 8-bit quantization of the base network.
Architecture tradeoffs (2)
• Restricting the ranges of activations a priori can hurt accuracy
• Preferable to learn activation ranges instead of fixing them beforehand.
Co-design: Training quantized models
• Designing models that provide good quantized performance requires co-design of model architecture, training algorithms and hardware.
• Specific training enhancements include:
• Fine-tune from floating point models when building quantized models.
• Freeze batch normalization statistics updates to exactly model inference for further benefits.
• Model the exact rounding arithmetic done in hardware during training.
• Stochastic quantization provides models robust to random perturbations of weights, but underperforms techniques that model quantization as done at inference (a small sketch of stochastic rounding follows below).
• Other enhancements to improve accuracy:
• Use distillation to train a quantized student from a floating point teacher network [3]
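A minimal sketch of the stochastic rounding mentioned above, included only to make the idea concrete; it is unbiased in expectation, unlike round-to-nearest.

```python
import torch

def stochastic_round(x):
    """Round down or up with probability given by the fractional part,
    so that E[stochastic_round(x)] == x."""
    floor = torch.floor(x)
    frac = x - floor
    return floor + (torch.rand_like(x) < frac).float()

x = torch.full((100000,), 2.3)
print(stochastic_round(x).mean())   # ~2.3, whereas torch.round(x).mean() == 2.0
```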
Conclusions
Hardware accelerator recommendations:
Basics
• Optimize memory bandwidth
• First order predictor of power consumption
• Don’t ignore activations: most literature focuses on weights, but activations can be very significant for large resolution inputs.
• Fuse multiple operations
• Have floating point support as a backup
• Avoid switching compute to different hardware
• Optimize for GEMM
• Still the workhorse for most DNN applications
• Support low precision inference
• 8-bit support is required, but supporting lower precision can provide really high throughput.
Hardware accelerator recommendations:
Software
• Don’t forget the software toolchain!
• Need to make it easy for customers to use hardware
• Integration with Tensorflow/Pytorch is important
• Writing optimized kernels for new hardware is hard
• Most implementations optimize for a specific set of models, with poor
performance for kernels needed for other models.
Hardware accelerator recommendations:
Differentiation
• Build a strategy for operator support
• Take a close look at TVM/MLIR efforts.
• Code generation along with hand-written kernels
• To get the best out of hardware with quantization
• Provide exact models of HW kernels for integration with training frameworks
• Consider other techniques beyond quantization
• Sparsity
• K-means compression
• Dynamic/Adaptive execution
• Don’t forget privacy
• Secure aggregation/homomorphic encryption becoming increasingly important
• Training at the edge:
• Depending on applications, this can be very important for privacy/personalization
References
1. J. Park, M. Naumov et al., “Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications”
2. S. Bianco et al., “Benchmark Analysis of Representative Deep Neural Network Architectures”
3. A. Polino et al., “Model compression via distillation and quantization”
4. B. Jacob, S. Kligys et al., “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”
5. M. Courbariaux, Y. Bengio et al., “BinaryConnect: Training deep neural networks with binary weights during propagations”
6. R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper”
7. B. Wu et al., “Mixed precision quantization of convnets via differentiable neural architecture search”