ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
University of Toronto
kriz@cs.utoronto.ca, ilya@cs.utoronto.ca, hinton@cs.utoronto.ca
Presenter: Aydin Ayanzadeh
Email: Ayanzadeh17@itu.edu.tr
Computer vision, Dr.-Ing. Hazım Kemal EKENEL, Spring 2018
Outline
● Introduction
● Dataset
● Architecture of the Network
● Reducing over-fitting
● Results
ImageNet
● About 15M labeled high-resolution images
● Roughly 22K categories
● Collected from the web and labeled via Amazon Mechanical Turk
ILSVRC
ImageNet Large Scale Visual Recognition Challenge
Task: 1.2M training images, 50K validation images, 150K test images, 1,000 categories
Goal: minimize top-5 error
Winners by year:
● 2010: NEC-UIUC (Lin), top-5 error = 28%
● 2011: XRCE (Perronnin), top-5 error ≈ 26%
● 2012: SuperVision (Krizhevsky), top-5 error = 16%
● 2013: ZFNet, top-5 error = 12%
● 2014: GoogLeNet (Szegedy), top-5 error = 7%
Task in ImageNet: given an image, make five guesses about its label (top-5 evaluation).
Rectified Linear Units (ReLUs)
● Much faster to train than classical saturating activation functions such as tanh
● Very computationally efficient
● Converges quickly (about six times faster than tanh on CIFAR-10)
Fig 2. A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.
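The speed gap is easy to see in the gradients. A small illustrative sketch (NumPy, not from the paper): tanh saturates, so its derivative vanishes for large |x|, while the ReLU derivative stays 1 for every positive input.

```python
import numpy as np

x = np.array([-4.0, -1.0, 0.5, 4.0])

relu_grad = (x > 0).astype(float)   # d/dx max(0, x): exactly 0 or 1
tanh_grad = 1.0 - np.tanh(x) ** 2   # d/dx tanh(x): nearly 0 once |x| is ~4

print(relu_grad)  # [0. 0. 1. 1.]
print(tanh_grad)  # approx [0.0013 0.42 0.79 0.0013]
```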
AlexNet General Features
● 650K neurons
● 60M parameters
● 630M connections
● 7 hidden weight layers (5 convolutional + 2 fully connected; the 1000-way output layer is the 8th weight layer)
● Rectified Linear Units (ReLU)
● Dropout trick
● Randomly extracted 224×224 patches as training inputs
Architecture
The input image size cannot be 224×224:
((224 − 11 + 2(0)) / 4) + 1 = 54.25 (not an integer!)
((227 − 11 + 2(0)) / 4) + 1 = 55, so the actual input size is 227×227
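The check above is just the standard convolution output-size formula, out = (W − F + 2P) / S + 1, applied to CONV1. A small helper (illustrative, not from the paper) makes the same check reusable for any layer:

```python
def conv_out(w: int, f: int, p: int, s: int) -> float:
    """Output width of a conv layer: input width w, filter f, padding p, stride s."""
    return (w - f + 2 * p) / s + 1

print(conv_out(224, 11, 0, 4))  # 54.25 -> not an integer, so 224 is invalid
print(conv_out(227, 11, 0, 4))  # 55.0  -> matches CONV1's 55x55 output
```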
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
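For concreteness, here is a minimal single-GPU PyTorch sketch of the layer stack above. This is an illustration of the simplified architecture, not the authors' original implementation, which split the feature maps across two GPUs; layer order follows the slide (pool before norm).

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),    # CONV1: 227 -> 55
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),         # POOL1: 55 -> 27
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),  # NORM1
            nn.Conv2d(96, 256, kernel_size=5, padding=2),  # CONV2: 27 -> 27
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),         # POOL2: 27 -> 13
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),  # NORM2
            nn.Conv2d(256, 384, kernel_size=3, padding=1), # CONV3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), # CONV4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), # CONV5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),         # POOL3: 13 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),                 # dropout before FC6
            nn.Linear(256 * 6 * 6, 4096),      # FC6
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),                 # dropout before FC7
            nn.Linear(4096, 4096),             # FC7
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),      # FC8: class scores
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# Sanity check on shapes: a 227x227 input yields 1000 class scores.
if __name__ == "__main__":
    out = AlexNet()(torch.randn(1, 3, 227, 227))
    print(out.shape)  # torch.Size([1, 1000])
```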
Local Response Normalization
● Reduces the top-1 and top-5 error rates by 1.4% and 1.2%, respectively
● Hyper-parameters: k = 2, n = 5, α = 10^-4, β = 0.75
● Applied after the ReLU nonlinearity in certain layers
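The normalization itself, from the paper: the response-normalized activity b of kernel i at position (x, y) is computed from the ReLU activity a and its n neighboring kernel maps at the same spatial position, where N is the total number of kernels in the layer:

```latex
b^{i}_{x,y} = a^{i}_{x,y} \Bigg/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta}
```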
Data Augmentation
● Reduces over-fitting by artificially enlarging the dataset
● Two forms of data augmentation:
○ Extract 224×224 patches (the four corner patches and the center patch) plus their horizontal reflections; at test time, predictions are averaged over these ten patches
○ Alter the intensities of the RGB channels in training images (perform PCA on the set of RGB pixel values)
○ The color augmentation reduces the top-1 error rate by over 1%
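A hedged sketch of the PCA color augmentation ("fancy PCA"): perturb each image along the principal components of its RGB pixel distribution, scaled by the eigenvalues and a random alpha_i drawn from N(0, 0.1). The function name and the [0, 1] image convention are illustrative assumptions, not from the paper.

```python
import numpy as np

def pca_color_augment(img: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """img: HxWx3 float array in [0, 1]. Returns a color-jittered copy."""
    pixels = img.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)        # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)    # lambda_i and eigenvectors p_i
    alphas = np.random.normal(0.0, sigma, 3)  # alpha_i ~ N(0, 0.1), one per PC
    shift = eigvecs @ (alphas * eigvals)      # sum_i p_i * alpha_i * lambda_i
    return np.clip(img + shift, 0.0, 1.0)     # add the same shift to every pixel
```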
Dropout
● Reduces over-fitting
● Zeroes the output of each hidden neuron with probability 0.5
● Roughly doubles the number of iterations required to converge
● Forces the network to learn more robust features
● Applied in the first two fully connected layers

1. Srivastava, Nitish, et al. "Dropout: A simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research 15.1 (2014): 1929-1958.
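A minimal sketch of the idea (NumPy, illustrative). Note this uses the modern "inverted" formulation that rescales at training time; the original paper instead multiplied the outputs of the affected layers by 0.5 at test time, which is equivalent in expectation.

```python
import numpy as np

def dropout(x: np.ndarray, p: float = 0.5, train: bool = True) -> np.ndarray:
    """Zero each activation with probability p during training."""
    if not train:
        return x                                      # test time: use all neurons
    mask = (np.random.rand(*x.shape) >= p) / (1 - p)  # keep and rescale by 1/(1-p)
    return x * mask
```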
Stochastic Gradient Descent
● SGD with a batch size of 128, momentum 0.9, and weight decay 0.0005
● Learning rate initialized to 0.01 (equal for all layers) and divided by 10 whenever the validation error stopped improving
● Neuron biases initialized to 1 in convolutional layers 2, 4, and 5 and in the FC layers; 0 elsewhere
● Weights initialized from a zero-mean Gaussian, N(0, 0.01)
● Trained on two NVIDIA GTX 580 GPUs (3GB each)
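The paper's exact update rule for a weight w at iteration i, with momentum variable v, learning rate ε, and the gradient of the loss averaged over batch D_i:

```latex
v_{i+1} := 0.9\, v_i \;-\; 0.0005\,\varepsilon\, w_i \;-\; \varepsilon \left\langle \frac{\partial L}{\partial w}\bigg|_{w_i} \right\rangle_{D_i},
\qquad
w_{i+1} := w_i + v_{i+1}
```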
Results
Model      Top-1 (val)   Top-5 (val)   Top-5 (test)
SIFT+FVs   -             -             26.2%
1 CNN      40.7%         18.2%         -
5 CNNs     38.1%         16.4%         16.4%
1 CNN*     39.0%         16.6%         -
7 CNNs*    36.7%         15.4%         15.3%

Table 2: Comparison of error rates on the ILSVRC-2012 validation and test sets. The SIFT+FVs entry is the best result achieved by others. Models with an asterisk were "pre-trained" to classify the entire ImageNet 2011 Fall release. See Section 6 of the paper for details.
● Averaging the predictions of two CNNs that were pre-trained on the entire ImageNet Fall 2011 release with the five original CNNs gives the 15.3% top-5 test error.
Conclusion
AlexNet
● Rectified Linear Units (ReLU)
● Dropout trick
● Data augmentation
● Trained with mini-batch stochastic gradient descent
● Top-5 error rate: 15.4% (validation), 15.3% (test)
Qualitative Evaluations
(Figure 4 of the paper: top-5 predictions on eight test images, and test images alongside their nearest training images in the 4096-dimensional feature space.)
Visualizing the First Layer
Fig 5. 96 convolutional kernels of size 11×11×3 learned by the first convolutional layer on the 224×224×3 input images. The top 48 kernels were learned on GPU 1 while the bottom 48 kernels were learned on GPU 2. See Section 6.1 for details.
● Top 48 kernels (GPU 1): color-agnostic
● Bottom 48 kernels (GPU 2): color-specific
References
[1] R.M. Bell and Y. Koren. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 9(2):75–79, 2007.
[2] A. Berg, J. Deng, and L. Fei-Fei. Large scale visual recognition challenge 2010. www.image-net.org/challenges. 2010.
[3] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[4] D. Cireşan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745, 2012.
[5] D.C. Cireşan, U. Meier, J. Masci, L.M. Gambardella, and J. Schmidhuber. High-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183, 2011.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei. ILSVRC-2012, 2012. URL http://www.image-net.org/challenges/LSVRC/2012/.
[8] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, 2007.
[9] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007. URL http://authors.library.caltech.edu/7694.
[10] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[11] K. Jarrett, K. Kavukcuoglu, M.A. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In International Conference on Computer Vision, pages 2146–2153. IEEE, 2009.
[12] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[13] A. Krizhevsky. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 2010.
[14] A. Krizhevsky and G.E. Hinton. Using very deep autoencoders for content-based image retrieval. In ESANN, 2011.
[15] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, et al. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, 1990.
[16] Y. LeCun, F.J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR 2004, volume 2, pages II–97. IEEE, 2004.
[17] Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. In Circuits and Systems (ISCAS), Proceedings of the 2010 IEEE International Symposium on, pages 253–256. IEEE, 2010.
[18] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 609–616. ACM, 2009.
[19] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV, Florence, Italy, October 2012.
Editor's Notes

  • #5: Make five guesses about the image label (top-5 evaluation).
  • #8: The advantage of ReLU over sigmoid is that it trains much faster, because the derivative of sigmoid becomes very small in the saturating region, so the weight updates almost vanish (Figure 4). This is called the vanishing gradient problem. In the network, a ReLU layer is placed after every convolutional and fully connected (FC) layer.
  • #9: (Translated from Persian.) A further note: today the normalization layer is no longer used the way it is here; batch normalization is used instead. Also, there is no fixed rule for designing an architecture; it is largely an art of testing and experience. What has emerged over the years is that deeper networks tend to be more successful, but the deeper the network, the much harder it is to train. As for why convolutional layers are used instead of fully connected layers throughout: beyond their computational overhead, FC layers cause severe over-fitting due to their huge parameter counts, while convolutional layers exploit the 2D structure of images well. So several convolutional layers are stacked to extract richer nonlinear features, and at least one fully connected layer is used at the end for classification. Pooling is inserted to provide translation invariance and to reduce the dimensions of the volumes in the network; since each pooling step shrinks the feature maps, it is used sparingly on relatively small images, where a convolutional layer with a larger filter size or a different stride can reduce dimensions instead. The general recipe today: stack convolutional layers, insert pooling between them depending on the image size (max pooling has often proved better, though other functions may give better results for a particular task), and finish with fully connected layers. Newer techniques can improve results further (ELU or PReLU instead of ReLU, dropout and DropConnect against over-fitting, spatial pyramid pooling, stochastic pooling, and so on), but each must be tested; a technique may make your results worse, in which case either more experimentation is needed or your case simply does not need it. For example, the author of this note reached 99% accuracy on MNIST without stochastic pooling, but only 43% with it enabled. The number of feature maps in a convolutional network is likewise a parameter with no fixed rule; start small and increase gradually. Another very important point is optimization: choosing the learning rate, momentum, and related parameters. A very good architecture can give bad results purely because of poorly chosen optimization parameters; with the same architecture that reached 99%, changed optimizer settings could not exceed 86%. A practical workflow: start from an initial configuration, tune the solver parameters until you are sure you have the best result, then start changing the layers and their output sizes, and iterate. That applies when not starting from an existing model; in about 90% of cases, researchers instead pick a model such as AlexNet, GoogLeNet, or VGGNet and adapt it to their own task. When choosing a model, also pay attention to its hardware requirements.
  • #10: INPUT => [CONV => RELU => POOL] * 2 => [CONV => RELU] * 3 => POOL => [FC => RELU => DO] * 2 => SOFTMAX. There are two methods to reduce the size of an input volume: CONV layers with a stride > 1 (which we have already seen) and POOL layers. It is common to insert POOL layers between consecutive CONV layers.
  • #11: It contains 5 convolutional layers and 3 fully connected layers. ReLU is applied after every convolutional and fully connected layer. Dropout is applied before the first and the second fully connected layer. The image size in the architecture chart should be 227×227 instead of 224×224, as pointed out by Andrej Karpathy in his famous CS231n course. More interestingly, the input size is 224×224 with padding 2 in the PyTorch torchvision implementation, where the output width and height would be (224 − 11 + 4)/4 + 1 = 55.25; PyTorch's Conv2d applies the floor operator to this result, so the last padding column is effectively ignored. It is worth noting that the only difference between FC and CONV layers is that the neurons in a CONV layer are connected only to a local region of the input, and many of the neurons in a CONV volume share parameters.
  • #12: Activity of a neuron is computed by applying kernel i at position (x, y) and then applying the ReLU nonlinearity; the normalization is applied after each such activation. What exactly is Local Response Normalization? The LRN layer implements the lateral inhibition mentioned earlier. It is useful when dealing with ReLU neurons, because ReLU activations are unbounded and LRN normalizes them. We want to detect high-frequency features with a large response: normalizing around the local neighborhood of an excited neuron makes it even more sensitive relative to its neighbors, while damping responses that are uniformly large in a given neighborhood (if all values are large, normalizing diminishes all of them). So we encourage a form of inhibition, boosting neurons with relatively larger activations; this is discussed in Section 3.3 of the original paper by Krizhevsky et al. In practice, Caffe offers two types of normalization: within the same channel (a 2D N×N neighborhood, where N is the normalization window size) and across channels (an N×1×1 neighborhood along the third dimension at a single location). Both amplify the excited neuron while damping its surroundings; AlexNet's normalization layer is channel-wise. From the paper: ReLUs do not require input normalization to prevent saturation; if at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, the following local normalization scheme still aids generalization. Denoting by a(i, x, y) the activity of kernel i at position (x, y) after the ReLU, the response-normalized activity b(i, x, y) divides by a sum that runs over n "adjacent" kernel maps at the same spatial position, where N is the total number of kernels in the layer. The ordering of the kernel maps is arbitrary and determined before training begins. This response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities among neuron outputs computed using different kernels. The constants k, n, α, and β are hyper-parameters chosen on a validation set: k = 2, n = 5, α = 10^-4, β = 0.75. The normalization is applied after the ReLU nonlinearity in certain layers (see Section 3.5). This scheme bears some resemblance to the local contrast normalization of Jarrett et al. [11], but is more correctly termed "brightness normalization", since the mean activity is not subtracted. Response normalization reduces the top-1 and top-5 error rates by 1.4% and 1.2%, respectively. Its effectiveness was also verified on CIFAR-10: a four-layer CNN achieved a 13% test error rate without normalization and 11% with it.
  • #13: At test time, the predictions made by the network's softmax layer are averaged over the ten patches. For the color augmentation, multiples of the principal components are added with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1, where p_i and λ_i are the i-th eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values.
  • #14: Why does dropout work? The idea is similar to model ensembles. Because of the dropout layer, each different set of switched-off neurons represents a different architecture, and all of these architectures are trained in parallel, with a weight given to each subset and the weights summing to one. For n neurons attached to dropout, the number of subset architectures formed is 2^n, so the prediction is effectively averaged over this ensemble of models. This provides a structured model regularization that helps avoid over-fitting. Another view of why dropout helps: since the neurons are chosen randomly, they tend to avoid developing co-adaptations among themselves, which lets them develop meaningful features independently of one another. Dropout is applied before the first and the second fully connected layer.
  • #15: We trained our models using stochastic gradient descent on the gradient of the loss. Initializing the biases to 1 gives the ReLUs positive inputs, accelerating the early stages of learning.
  • #16: Models marked with an asterisk were "pre-trained" to classify the entire ImageNet 2011 Fall release; the table reports the resulting top-1 and top-5 error rates.
  • #18: In the left panel of Figure 4 we qualitatively assess what the network has learned by computing its top-5 predictions on eight test images. Notice that even off-center objects, such as the mite in the top-left, can be recognized by the net. Most of the top-5 labels appear reasonable; for example, only other types of cat are considered plausible labels for the leopard. In some cases (grille, cherry) there is genuine ambiguity about the intended focus of the photograph. Another way to probe the network's visual knowledge is to consider the feature activations induced by an image at the last, 4096-dimensional hidden layer. If two images produce feature activation vectors with a small Euclidean separation, we can say that the higher levels of the neural network consider them to be similar. Figure 4 shows five images from the test set and the six images from the training set that are most similar to each of them according to this measure. Notice that at the pixel level, the retrieved training images are generally not close in L2 to the query images in the first column; for example, the retrieved dogs and elephants appear in a variety of poses. Computing similarity with Euclidean distance between two 4096-dimensional, real-valued vectors is inefficient, but it could be made efficient by training an auto-encoder to compress these vectors to short binary codes. This should produce a much better image retrieval method than applying auto-encoders to the raw pixels [14], which does not make use of image labels and hence tends to retrieve images with similar patterns of edges, whether or not they are semantically similar.
  • #19: Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11x11x3], and each one is shared by the 55×55 neurons in one depth slice. The parameter-sharing assumption is reasonable: if detecting a horizontal edge is important at some location in the image, it should intuitively be useful at other locations as well, due to the translationally invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the 55×55 distinct locations in the conv layer's output volume.