[course site]
Memory usage and
computational
considerations
Day 2 Lecture 1
Kevin McGuinness
kevin.mcguinness@dcu.ie
Research Fellow
Insight Centre for Data Analytics
Dublin City University
Introduction
Useful when designing deep neural network architectures to be able to estimate
memory and computational requirements on the “back of an envelope”
This lecture will cover:
● Estimating neural network memory consumption
● Mini-batch sizes and gradient splitting trick
● Estimating neural network computation (FLOPs)
● Calculating effective aperture sizes
2
Improving convnet accuracy
A common strategy for improving convnet accuracy is
to make it bigger
● Add more layers
● Make layers wider (increase channel depth)
● Increase kernel sizes*
Works if you have sufficient data and strong
regularization (dropout, maxout, etc.)
Especially true in light of recent advances:
● ResNets: 50-1000 layers
● Batch normalization: reduce covariate shift
network     year  layers  top-5 error (%)
AlexNet     2012  7       17.0
VGG-19      2014  19      9.35
GoogLeNet   2014  22      9.15
ResNet-50   2015  50      6.71
ResNet-152  2015  152     5.71
Without ensembles
3
Increasing network size
Increasing network size means using more
memory
Train time:
● Memory to store outputs of intermediate
layers (forward pass)
● Memory to store parameters
● Memory to store error signal at each
neuron
● Memory to store gradient of parameters
● Any extra memory needed by optimizer
(e.g. for momentum)
Test time:
● Memory to store outputs of intermediate
layers (forward pass)
● Memory to store parameters
Modern GPUs are still relatively memory
constrained:
● GTX Titan X: 12GB
● GTX 980: 4GB
● Tesla K40: 12GB
● Tesla K20: 5GB
4
Calculating memory requirements
Often the size of the network will be practically bound by available memory
Useful to be able to estimate memory requirements of network
True memory usage depends on the implementation
5
Calculating the model size
Conv layers:
The number of weights in a conv layer does not depend on the input size (weight sharing)
It depends only on the layer's depth, kernel size, and the depth of the previous layer
6
Calculating the model size
parameters
weights: depth_n x (kernel_w x kernel_h) x depth_(n-1)
biases: depth_n
7
Calculating the model size
parameters
weights: 32 x (3 x 3) x 1 = 288
biases: 32
8
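A couple of lines of Python make this concrete (a minimal sketch; the function name and arguments are illustrative, not from the lecture):

def conv_params(depth_n, kernel_w, kernel_h, depth_prev):
    """Parameter count of a conv layer: weights + biases.
    Weight sharing means the spatial size of the input does not matter."""
    weights = depth_n * (kernel_w * kernel_h) * depth_prev
    biases = depth_n
    return weights + biases

# First conv layer of the running example: 32 filters of 3x3 on a 1-channel input
print(conv_params(32, 3, 3, 1))   # 288 weights + 32 biases = 320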
Calculating the model size
parameters
weights: 32 x (3 x 3) x 32 = 9216
biases: 32
Pooling layers are parameter-free
9
Calculating the model size
Fully connected layers
● #weights = #outputs x #inputs
● #biases = #outputs
If previous layer has spatial extent (e.g. pooling
or convolutional), then #inputs is size of
flattened layer.
10
Calculating the model size
parameters
weights: #outputs x #inputs
biases: #outputs
11
Calculating the model size
parameters
weights: 128 x (14 x 14 x 32) = 802816
biases: 128
12
Calculating the model size
parameters
weights: 10 x 128 = 1280
biases: 10
13
Total model size
parameters
weights: 10 x 128 = 1280
biases: 10
parameters
weights: 128 x (14 x 14 x 32) = 802816
biases: 128
parameters
weights: 32 x (3 x 3) x 32 = 9216
biases: 32
parameters
weights: 32 x (3 x 3) x 1 = 288
biases: 32
Total: 813,802
~ 3.1 MB (32-bit floats)
14
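The full parameter count can be reproduced with a short script (a sketch; the layer shapes are the ones used in the slides above):

# conv layers: depth_n * (kw * kh) * depth_prev weights, plus depth_n biases
# fc layers:   #outputs * #inputs weights, plus #outputs biases
layers = [
    ("conv1", 32 * (3 * 3) * 1,       32),   # 288 weights, 32 biases
    ("conv2", 32 * (3 * 3) * 32,      32),   # 9216 weights, 32 biases
    ("fc1",   128 * (14 * 14 * 32),  128),   # 802816 weights, 128 biases
    ("fc2",   10 * 128,               10),   # 1280 weights, 10 biases
]
total = sum(w + b for _, w, b in layers)
print(total)                    # 813802 parameters
print(total * 4 / 1024 ** 2)    # ~3.1 MB with 32-bit floats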
Layer blob sizes
Easy…
Conv layers: width x height x depth
FC layers: #outputs
32 x (14 x 14) = 6,272
32 x (28 x 28) = 25,088
15
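The same arithmetic for activation ("blob") sizes, and the memory they take per example (a sketch; assumes 32-bit floats and counts only the blobs shown above):

conv_28 = 32 * (28 * 28)   # 25,088 values for a 32-deep conv output at 28x28
conv_14 = 32 * (14 * 14)   #  6,272 values at 14x14 (after 2x2 pooling)
fc1     = 128              # FC layers: #outputs
fc2     = 10
bytes_per_value = 4        # 32-bit floats
print((conv_28 + conv_14 + fc1 + fc2) * bytes_per_value)  # ~126 KB per example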
Total memory requirements (train time)
Memory for parameters
Memory for layer outputs (forward pass)
Memory for the error signal at each layer (backward pass)
Memory for parameter gradients
Memory for optimizer state, e.g. momentum (depends on the optimizer)
Implementation overhead (memory for convolutions, etc.; depends on the implementation)
16
Total memory requirements (test time)
Memory for parameters
Memory for layer outputs (forward pass only; no error signals, parameter gradients, or momentum needed)
Implementation overhead (memory for convolutions, etc.; depends on the implementation)
17
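A very rough back-of-the-envelope combination of the two slides above (a sketch only; the multipliers are assumptions, and real frameworks add implementation overhead on top):

def training_memory_bytes(n_params, n_activations, bytes_per_value=4, momentum=True):
    """Rough per-example estimate: parameters, their gradients (+ momentum buffer),
    plus layer outputs and the error signals of the backward pass."""
    param_copies = 3 if momentum else 2   # weights, gradients, momentum
    activation_copies = 2                 # forward outputs + backward error signals
    return (n_params * param_copies + n_activations * activation_copies) * bytes_per_value

def test_memory_bytes(n_params, n_activations, bytes_per_value=4):
    """Test time: just the parameters and the layer outputs."""
    return (n_params + n_activations) * bytes_per_value

print(training_memory_bytes(813802, 31498))   # ~10 MB for the running example
print(test_memory_bytes(813802, 31498))       # ~3.4 MB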
Memory for convolutions
Several libraries implement convolutions as matrix multiplications (e.g. caffe). Approach known as
convolution lowering
Fast (use optimized BLAS implementations) but can use a lot of memory, esp. for larger kernel sizes
and deep conv layers
[Figure: lowering a 5x5 convolution over a 224x224 input; each 5x5 patch becomes a row of a [50176 x 25] matrix, which is multiplied by the [25 x 1] kernel vector.]
cuDNN uses a more memory-efficient method! https://arxiv.org/pdf/1410.0759.pdf
18
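To see why lowering is memory-hungry, here is a sketch of the size of the lowered (im2col) patch matrix, assuming stride 1 and "same" padding:

def im2col_bytes(h, w, kh, kw, in_depth, bytes_per_value=4):
    """Memory for the lowered patch matrix: one row per output position,
    one column per value in the kernel window."""
    rows = h * w               # output positions (stride 1, same padding)
    cols = kh * kw * in_depth  # values per patch
    return rows * cols * bytes_per_value

print(im2col_bytes(224, 224, 5, 5, 1))   # [50176 x 25] floats: ~5 MB for a 0.2 MB input
print(im2col_bytes(28, 28, 3, 3, 512))   # a deep 3x3 layer at 28x28: ~14 MB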
Mini-batch sizes
Total memory in previous slides is for a single example.
In practice, we want to do mini-batch SGD:
● More stable gradient estimates
● Faster training on modern hardware
Size of batch is limited by model architecture, model size, and hardware memory.
May need to reduce batch size for training larger models.
This may affect convergence if gradients are too noisy.
19
Gradient splitting trick
[Diagram: the network processes mini-batch 1, mini-batch 2, ..., mini-batch n in turn; each forward/backward pass produces a loss and a parameter gradient ΔW; the ΔWs are accumulated and applied as a single update, giving the effect of a larger batch without holding it all in memory.]
20
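In a modern framework the trick looks roughly like the following PyTorch sketch (PyTorch is not used in the lecture; the toy model, data, and hyperparameters are purely illustrative):

import torch
import torch.nn as nn

# Hypothetical toy setup; only the accumulation pattern matters
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
data = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]

accum_steps = 4                     # effective batch = 4 small mini-batches
optimizer.zero_grad()
for i, (x, y) in enumerate(data):
    loss = criterion(model(x), y) / accum_steps   # scale so the accumulated gradient
    loss.backward()                               # matches the large-batch average
    if (i + 1) % accum_steps == 0:
        optimizer.step()                          # one weight update per accumulated batch
        optimizer.zero_grad()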
Estimating computational complexity
Useful to be able to estimate computational
complexity of an architecture when designing it
Computation in deep NN is dominated by
multiply-adds in FC and conv layers.
Typically we estimate the number of FLOPs
(multiply-adds) in the forward pass
Ignore non-linearities, dropout, and normalization
layers (negligible cost).
21
Estimating computational complexity
Fully connected layer FLOPs
Easy: equal to the number of weights (ignoring
biases)
= #num_inputs x #num_outputs
Convolution layer FLOPs
Product of:
● Spatial width of the map
● Spatial height of the map
● Previous layer depth
● Current layer depth
● Kernel width
● Kernel height
22
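These two rules translate directly into code (a sketch counting forward-pass multiply-adds, ignoring biases and non-linearities):

def fc_flops(n_inputs, n_outputs):
    """Multiply-adds in a fully connected layer."""
    return n_inputs * n_outputs

def conv_flops(out_h, out_w, kernel_h, kernel_w, in_depth, out_depth):
    """Multiply-adds in a conv layer: one kernel-sized dot product
    per output position per output channel."""
    return out_h * out_w * kernel_h * kernel_w * in_depth * out_depth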
Example: VGG-16
Layer    H    W    kernel H  kernel W  depth   repeats  FLOPs
input    224  224  1         1         3       1        0.00E+00
conv1    224  224  3         3         64      2        1.94E+09
conv2    112  112  3         3         128     2        2.77E+09
conv3    56   56   3         3         256     3        4.62E+09
conv4    28   28   3         3         512     3        4.62E+09
conv5    14   14   3         3         512     3        1.39E+09
flatten  1    1    0         0         100352  1        0.00E+00
fc6      1    1    1         1         4096    1        4.11E+08
fc7      1    1    1         1         4096    1        1.68E+07
fc8      1    1    1         1         100     1        4.10E+05
Total                                                   1.58E+10
The bulk of the computation is in the conv layers.
23
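Using the helper functions sketched above, the largest rows of the table can be reproduced (layer shapes taken from the table itself):

conv1 = conv_flops(224, 224, 3, 3, 3, 64) + conv_flops(224, 224, 3, 3, 64, 64)
fc6   = fc_flops(100352, 4096)
print("%.2E" % conv1)   # 1.94E+09
print("%.2E" % fc6)     # 4.11E+08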
Effective aperture size
Useful to be able to compute how far a
convolutional node in a convnet sees:
● Size of the input pixel patch that affects a
node’s output
● Known as the effective aperture size,
coverage, or receptive field size
Depends on kernel size and strides from
previous layers
● 7x7 kernel can see a 7x7 patch of the
layer below
● Stride of 2 doubles what all layers after
can see
Calculate recursively
24
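The recursion can be written down directly (a sketch; layers are given bottom-up as (kernel size, stride) pairs):

def receptive_field(layers):
    """layers: list of (kernel_size, stride) from the input upward.
    Returns the effective aperture of one node in the top layer."""
    rf, jump = 1, 1   # jump = spacing, in input pixels, between adjacent nodes
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# e.g. 3x3 conv, 2x2 pool (stride 2), 3x3 conv, 2x2 pool (stride 2), 3x3 conv
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))  # 18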
Summary
Shown how to estimate memory and computational requirements of a deep neural
network model
Very useful to be able to quickly estimate these when designing a deep NN
Effective aperture size tells us how much a conv node can see. Easy to calculate
recursively
25
