Quantizing Deep Networks for
Efficient Inference at the Edge
Raghu Krishnamoorthi, Facebook
Questions/Feedback: raghuraman@fb.com
Acknowledgements
• Results presented here are from work done at Google as part of the TensorFlow Lite team and work at Facebook as part of the PyTorch team.
• Acknowledge contributions from several colleagues at Google, including Benoit Jacob, Skirmantas Kligys, Dmitry Kalenichenko, Suharsh Sivakumar and Pete Warden.
• Also acknowledge work from colleagues at Facebook: Jongsoo Park, Maxim Naumov, Summer Deng, Marat Dukhan, Bichen Wu, Peizhao Zhang, Jerry Zhang, Dmytro Dzhulgakov, Daya Khudia, Jianyu Huang, James Reed, Mikhail Z, Haixin Liu and Peter Vajda.
Outline
• Motivation
• Quantization: Overview
• Quantizing deep networks
• Post Training quantization
• Quantization aware training
• Lower precision inference
• Hardware accelerator recommendations
• Model system co-design
• Looking ahead
Motivation(1)
• Data-center power consumption is doubling every year
Source: Deep Learning Inference in Facebook Data-Centers [1]
Motivation(2)
• The number of edge devices is growing rapidly, and many of these devices are resource constrained.
Source: https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/
Motivation(3)
• While models are becoming more efficient, high accuracy still implies high complexity.
From: Benchmark Analysis of Representative Deep Neural Network Architectures, Simone Bianco et al. [2]
Quantization
• Many approaches to solving the problems outlined here:
• Better hardware accelerators: TPUs => requires new custom hardware
• Optimized kernels: cuDNN, Intel MKL-DNN
• Efficient deep network architectures: NASNet, MobileNet, FBNet => requires new architectures
• A simpler approach that does not require re-design of models or new hardware is quantization.
• Quantization refers to techniques that perform computation and storage at reduced precision.
• Works in combination with the approaches above
• Requires optimized kernels to efficiently use existing hardware.
Background: Quantization(1)
• Quantization refers to mapping values from fp32 to a lower precision format.
• Specified by:
• Format (e.g. fp32, fp16, bfloat16, int8, int4, binary)
• Mapping type
• Granularity
From: https://commons.wikimedia.org/w/index.php?curid=69415943
Background: Quantization(2)
• We also consider different granularities of quantization:
• Per-layer quantization:
• Same mapping for all elements in a layer.
• Per-row / per-channel quantization (see the sketch below):
• Choose quantizer parameters independently for each row (fc layers) or for each conv kernel (conv layers).
• Outlier-aware quantization:
• Separate out outliers so that lower precision arithmetic can be used for the bulk of the weights.
• Dense computation for the inliers, combined with sparse computation for the outliers that have a large magnitude.
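To make the per-layer vs. per-channel distinction concrete, here is a minimal NumPy sketch of symmetric quantization at both granularities. It is illustrative only; the function name and the toy weight matrix are assumptions for this example, not production kernels.

```python
import numpy as np

def quantize_symmetric(w, num_bits=8, axis=None):
    """Symmetric quantization. axis=None -> one scale for the whole tensor
    (per-layer); axis=0 -> one scale per row / output channel (per-channel)."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    if axis is None:
        max_abs = np.max(np.abs(w))
    else:
        # reduce over every axis except `axis`, keeping dims for broadcasting
        reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
        max_abs = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / qmax
    w_q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return w_q, scale

# Rows with very different magnitudes: per-channel scales track each row,
# which is what recovers accuracy for networks such as MobileNet.
w = (np.random.randn(64, 128) * np.logspace(-2, 0, 64)[:, None]).astype(np.float32)
wq_l, s_l = quantize_symmetric(w)              # per-layer
wq_c, s_c = quantize_symmetric(w, axis=0)      # per-channel (per-row)
print("per-layer error:  ", np.abs(w - wq_l * s_l).mean())
print("per-channel error:", np.abs(w - wq_c * s_c).mean())
```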
Modeling quantization during training
• Emulate quantization by quantizing and de-quantizing in succession
• Values are still in floating point, but with reduced precision
• x_out = FakeQuant(x) = s · (Clamp(round(x/s) − z) + z) = DeQuant(Quant(x))
• Can also model quantization as a stochastic rounding operation
Fake Quantizer (top), showing the quantization of output values. Approximation for purposes of derivative calculation (bottom).
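A minimal PyTorch sketch of this fake quantizer, assuming the common convention q = clamp(round(x/s) + z), dequant = s·(q − z), which may differ from the slide's notation in the sign of z. The detach trick gives the identity-like derivative approximation shown in the bottom figure.

```python
import torch

def fake_quant(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize-dequantize in the forward pass; the detach trick makes the
    backward pass behave like the identity (a straight-through estimator)."""
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_dq = (q - zero_point) * scale                 # de-quantize back to float
    # (x_dq - x).detach() contributes no gradient, so d(out)/d(x) == 1.
    # Production implementations typically also zero the gradient where x
    # falls outside the representable range.
    return x + (x_dq - x).detach()

w = torch.randn(8, requires_grad=True)
out = fake_quant(w, scale=0.05, zero_point=128)
out.sum().backward()
print(w.grad)   # all ones: the quantizer is transparent to gradients
```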
Quantization: Benefits
• Applicability: broad applicability across models and use cases
• Hardware support: supported by x86, Nvidia Volta, ARM, Mali, QDSP
• Software support: kernel libraries widely available
• Memory size: 4x reduction
• Memory bandwidth/cache: 4x reduction
• Compute: 2x to 4x speedup, depending on ISA
• Power: typically 4x reduction (dominated by memory access)
(Comparing float32 implementations with 8-bit inference)
Quantization: Challenges
• Accuracy drop: the loss in accuracy can be too high for certain applications. Mitigation: quantization aware training.
• Kernel support: wide variety of operators + multiple hardware platforms. Mitigation: improving the software tool-chain (TVM) to handle varied backends.
• “Complexity”: non-trivial, requires calibration/training in some cases. Mitigation: support in software packages: TensorRT, TensorFlow and PyTorch.
Quantizing deep networks
Model Quantization: Overview
(Diagram: the Train → Convert for inference → Graph Optimization → Kernel Implementation pipeline, shown three times: unquantized; with quantization added when converting for inference (post training quantization); and with fake quantization added during training followed by quantization at conversion (quantization aware training).)
What to quantize?
• Only quantize parts of network that contribute significantly to performance
• Roofline analysis to identify compute vs memory bandwidth bound operations
• May need to further reduce based on accuracy impact.
• Multiple ways to quantize a network with different impact:
Quantization scheme | Memory bandwidth reduction (weights) | Memory bandwidth reduction (activations) | Compute speedup | Notes
Weight only quantization to int8 | 4x | 1x | 1x | Suitable for embedding lookups
Dynamic quantization | 4x | 1x | 2x | Suitable for fc layers with small batches
Static quantization (int32 accumulators) | 4x | 4x | 2x | Suited for all layers, important for convolutions
Static quantization (int16 accumulators) | 4x | 4x | 4x | Requires lower precision weights/activations
Post training quantization: Weight compression
• Simplest quantization scheme is to compress the
weights to lower precision
• Requires no input data and can be done statically as part of
preparing a model for inference
• Hardware accelerators can benefit if de-compression is
done after memory access
• Trivial for case of fp16/int8 quantization of weights.
• K-means compression is also supported in select platforms
and is amenable to simple de-compression
• Scatter-Gather operation in select processors
• Supported in CoreML
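As an illustration of k-means weight compression and its gather-based de-compression, here is a small sketch. It uses scikit-learn's KMeans purely for brevity and is not how any particular runtime implements it.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_compress(w, num_clusters=16):
    """Replace each weight by the index of its nearest centroid (4 bits for
    16 clusters) plus a small fp32 codebook; de-compression is a table
    lookup (gather), which is why select processors can do it cheaply."""
    km = KMeans(n_clusters=num_clusters, n_init=4).fit(w.reshape(-1, 1))
    codebook = km.cluster_centers_.astype(np.float32).ravel()   # 16 fp32 values
    indices = km.labels_.astype(np.uint8).reshape(w.shape)      # 4-bit indices
    return codebook, indices

def kmeans_decompress(codebook, indices):
    return codebook[indices]                                    # gather

w = np.random.randn(256, 256).astype(np.float32)
codebook, idx = kmeans_compress(w)
w_hat = kmeans_decompress(codebook, idx)
print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```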
Dynamic quantization
• Dynamic quantization refers to schemes where the activations are
read/written in fp32 and are dynamically quantized to lower precisions for
compute.
• Requires no calibration data
• Data exchanged between operations is in floating point, so no need to
worry about format conversion.
• Provides performance improvements close to static quantization when
memory access is dominated by weights
• Suitable for inference in RNNs
• Smaller gains for conv layers
• Supported by:
• PyTorch (see the example below)
• TensorFlow Lite
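For reference, dynamic quantization of fc layers is essentially a one-liner in PyTorch. This is a minimal sketch; the API names are from the PyTorch 1.3-era torch.quantization namespace and may differ in later releases, and the toy model is an assumption for the example.

```python
import torch

# Toy float model whose memory traffic is dominated by Linear (fc) weights.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).eval()

# Weights are quantized to int8 ahead of time; activations are quantized
# on the fly per batch, so no calibration data is needed.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

print(quantized(torch.randn(1, 1024)).shape)   # torch.Size([1, 10])
```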
Quantizing weights and activations
• Post training quantization refers to quantizing both weights and
activations to reduced precision, typically int8.
• Requires estimation of statistics of activations for determining
quantizer parameters.
• Quantizer parameters are determined by minimizing an error metric (a simple example follows below):
• KL divergence: TensorRT
• Saturation error: TensorFlow Lite
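A simplified sketch of the min/max ("saturation error") style of calibration, computing an affine scale and zero-point from observed activation statistics. The function name and details are illustrative, not any framework's exact implementation.

```python
import numpy as np

def choose_quant_params(x_min, x_max, qmin=0, qmax=255):
    """Pick an affine (asymmetric) scale and zero-point so that the observed
    range [x_min, x_max] maps onto [qmin, qmax]. TensorRT instead searches
    for the clipping range that minimizes KL divergence."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # range must contain 0
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))     # keep z representable
    return scale, zero_point

# Calibration: run a few batches and track min/max of each activation tensor.
activations = np.random.randn(1000).astype(np.float32) * 3.0
scale, zp = choose_quant_params(activations.min(), activations.max())
print("scale:", scale, "zero_point:", zp)
```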
Results
Setup
• Standard classification model architectures
• Evaluate classification accuracy on the ImageNet validation dataset.
• Results obtained using TensorFlow; more details at: https://arxiv.org/abs/1806.08342
• More results and PyTorch support for quantization to be announced at PyTorch DevCon on October 10th.
Post training quantization: Results
Network | Asymmetric, Per Layer | Symmetric, Per Channel | Asymmetric, Per Channel | Activation only quantized | Weight only, Symmetric, Per Channel | Floating point
Mobilenet-v2-1-224 | 0.001 | 0.698 | 0.697 | 0.7 | 0.698 | 0.719
Mobilenet-v1-1-224 | 0.001 | 0.591 | 0.703 | 0.708 | 0.591 | 0.709
Nasnet-Mobile | 0.72 | 0.72 | 0.74 | 0.74 | 0.72 | 0.74
Inception-v3 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78
Resnet-v1-50 | 0.75 | 0.75 | 0.75 | 0.751 | 0.75 | 0.752
• 8 bits for weights and activations is sufficient for common CV classification tasks
• Smaller networks are “harder” to quantize
• At 8 bits, accuracy drop is dominated by weight quantization
Quantization aware training
Results
Network | Asymmetric, Per Layer, post training quant | Symmetric, Per Channel, post training quant | Asymmetric, Per Layer, QAT | Symmetric, Per Channel, QAT | Floating point
Mobilenet-v2-1-224 | 0.001 | 0.698 | 0.709 | 0.711 | 0.719
Mobilenet-v1-1-224 | 0.001 | 0.591 | 0.70 | 0.707 | 0.709
Nasnet-Mobile | 0.72 | 0.72 | 0.73 | 0.73 | 0.74
Inception-v3 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78
Resnet-v1-50 | 0.75 | 0.75 | 0.75 | 0.751 | 0.752
• Quantization aware training provides the best accuracy and allows for simpler quantization
schemes.
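For quantization aware training, the PyTorch flow looks roughly like the sketch below. API names are from the torch.quantization namespace being announced around this time and may differ across versions; the toy model is an assumption for the example.

```python
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU())

# 1. Attach fake-quant modules/observers and fine-tune with them in place.
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)

# ... run the usual fine-tuning loop here: weights and activations are
# fake-quantized in the forward pass, gradients flow straight through ...

# 2. Convert to a true int8 model for inference.
model.eval()
quantized_model = torch.quantization.convert(model)

# A complete example would also wrap the model with QuantStub/DeQuantStub
# (and fuse conv+bn+relu) before prepare_qat; omitted here for brevity.
```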
Performance: Operator level benchmarks
Server: FBGEMM (quantized) vs MKL-DNN (fp32)
Performance: Model level benchmarks
(Mobile: Tensorflow Lite)
Mobile: Inference time: float vs quantized, TFLite, Pixel2
QNNPACK kernels provide an additional 2x speedup
Lower precision inference(1)
• Four-bit precision for weights provides good accuracy, but needs to be applied selectively.
• Larger networks are more robust to lower precision
• Quantization aware training is critical
• Selectively quantizing layers of a network to different precisions can reduce the accuracy drop
4-bit weights, 8-bit activations: Top-1 accuracy results
Lower precision inference(2)
• Different layers of a neural network have different sensitivity to quantization errors
• Exciting work on differentiable architecture search [7] for determining precision allocations across layers, showing excellent performance
Architecture trade-offs(1)
• Clear tradeoff between number of parameters and robustness to
quantization
• One can also trade off the number of feature maps vs precision
• Having 2x the number of feature maps at 4 bits is better than 8-bit quantization of the base network.
Architecture tradeoffs (2)
• Restricting the ranges of activations a priori can hurt accuracy
• Preferable to learn activation ranges instead of fixing them beforehand.
Co-design: Training quantized models
• Designing models that provide good quantized performance requires co-design of model architecture, training algorithms and hardware.
• Specific training enhancements include:
• Fine-tune from floating point models when building quantized models.
• Freeze batch normalization statistics updates to exactly model inference for further benefits.
• Model the exact rounding arithmetic done in hardware during training.
• Stochastic quantization provides models robust to random perturbations of weights, but underperforms techniques that model quantization as done at inference (a small sketch of stochastic rounding follows below).
• Other enhancements to improve accuracy:
• Use distillation to train a quantized student from a floating point teacher network [3]
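A minimal sketch of the stochastic rounding mentioned above, included only to make the idea concrete; it is unbiased in expectation, unlike round-to-nearest.

```python
import torch

def stochastic_round(x):
    """Round down or up with probability given by the fractional part,
    so that E[stochastic_round(x)] == x."""
    floor = torch.floor(x)
    frac = x - floor
    return floor + (torch.rand_like(x) < frac).float()

x = torch.full((100000,), 2.3)
print(stochastic_round(x).mean())   # ~2.3, whereas torch.round(x).mean() == 2.0
```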
Conclusions
Hardware accelerator recommendations:
Basics
• Optimize memory bandwidth
• First order predictor of power consumption
• Don’t ignore activations: most literature focuses on weights, but activations can be very significant for large resolution inputs.
• Fuse multiple operations
• Have floating point support as a backup
• Avoid switching compute to different hardware
• Optimize for GEMM
• Still the workhorse for most DNN applications
• Support low precision inference
• 8-bit support is required, but supporting lower precision can provide really high throughput.
Hardware accelerator recommendations:
Software
• Don’t forget the software toolchain!
• Need to make it easy for customers to use hardware
• Integration with Tensorflow/Pytorch is important
• Writing optimized kernels for new hardware is hard
• Most implementations optimize for a specific set of models, with poor
performance for kernels needed for other models.
Hardware accelerator recommendations:
Differentiation
• Build a strategy for operator support
• Take a close look at TVM/MLIR efforts.
• Code generation along with hand-written kernels
• To get the best out of hardware with quantization
• Provide exact models of HW kernels for integration with training frameworks
• Consider other techniques beyond quantization
• Sparsity
• K-means compression
• Dynamic/Adaptive execution
• Don’t forget privacy
• Secure aggregation/homomorphic encryption becoming increasingly important
• Training at the edge:
• Depending on applications, this can be very important for privacy/personalization
References
1. J. Park, M. Naumov et al., “Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications”
2. S. Bianco et al., “Benchmark Analysis of Representative Deep Neural Network Architectures”
3. A. Polino et al., “Model compression via distillation and quantization”
4. B. Jacob, S. Kligys et al., “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”
5. M. Courbariaux, Y. Bengio et al., “BinaryConnect: Training deep neural networks with binary weights during propagations”
6. R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper”
7. B. Wu et al., “Mixed precision quantization of convnets via differentiable neural architecture search”