Object detection with
Deep Learning
Matthew Opala
AGENDA
Region proposals based models
Regression models
Localization & detection
Localization & detection
Computer Vision tasks
◦ Single object: Classification, Classification & Localization
◦ Multiple objects: Object detection, Instance segmentation
Credit: http://vision.stanford.edu/teaching/cs231n/slides/2016/winter1516_lecture8.pdf
Classification & Localization
Classification:
◦ Input: image
◦ Output: class label
◦ Evaluation: accuracy
Localization:
◦ Input: image
◦ Output: Box(x, y, w, h)
◦ Evaluation: IoU (Intersection over Union)
CAT (x, y, w, h)
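IoU for the Box(x, y, w, h) output above can be sketched as follows. This is a minimal illustration, assuming (x, y) is the top-left corner of the box; the slides do not say which corner or centre convention is meant, and the function name and sample boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, w, h).

    Assumption: (x, y) is the top-left corner and w, h are positive.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


predicted = (10, 10, 40, 40)      # hypothetical prediction
ground_truth = (20, 20, 40, 40)   # hypothetical annotation
print(f"IoU = {iou(predicted, ground_truth):.3f}")
```

A localization is commonly counted as correct when the IoU with the ground-truth box exceeds a chosen threshold such as 0.5.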
Object detection
◦ Many objects of different classes in an image
◦ Needs variable-size output
Detection as regression vs. Detection as classification
◦ Detection as regression: ConvNet → final conv feature maps → classification head + regression head
◦ Detection as classification: region proposals → crop & warp → ConvNet → final conv feature maps → classifier
Object detection models
R-CNN
Region proposals - selective search
Credit: Uijlings et al., “Selective Search for Object Recognition”, IJCV 2013
RCNN - model
◦ Input image → selective search → regions of interest (RoI)
◦ RoI → warped image regions → ConvNet
◦ ConvNet features → SVM + bbox regressors
RCNN - training
◦ Train a classification model
◦ Fine-tune it for detection
◦ Extract features
◦ Train a binary SVM for each class
◦ Train a linear regression model for each class
RCNN - disadvantages
◦ Complex training pipeline
◦ Slow at test time - 50 s per image
Fast R-CNN
◦ Input image → ConvNet
◦ Selective search → RoI projection onto the feature map
◦ RoI pooling → FC layers → softmax + bbox regressors
RoI Pooling
0.81 0.4  0.7  0.32 0.5  0.2  0.4  0.5
0.19 0.2  0.16 0.31 0.3  0.9  0.1  0.32
0.37 0.31 0.29 0.53 0.78 0.81 0.22 0.24
0.8  0.38 0.36 0.89 0.24 0.23 0.95 0.88
0.66 0.71 0.42 0.11 0.11 0.08 0.23 0.5
0.65 0.51 0.32 0.14 0.74 0.7  0.05 0.32
0.9  0.7  0.57 0.53 0.64 0.23 0.02 0.19
0.22 0.45 0.64 0.58 0.52 0.58 0.32 0.89
RoI Pooling, output size 2 x 2, region of interest 7 x 5
0.8 0.95
0.9 0.74
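The 2 x 2 result above can be reproduced with a short max-pooling sketch. The transcript does not say where the 7 x 5 region sits on the 8 x 8 feature map; the code assumes it covers the bottom five rows and the left seven columns, and that bin edges are found by integer (floor) division. Under those assumptions the output matches the values on the slide. The function name and argument order are hypothetical.

```python
FEATURE_MAP = [
    [0.81, 0.40, 0.70, 0.32, 0.50, 0.20, 0.40, 0.50],
    [0.19, 0.20, 0.16, 0.31, 0.30, 0.90, 0.10, 0.32],
    [0.37, 0.31, 0.29, 0.53, 0.78, 0.81, 0.22, 0.24],
    [0.80, 0.38, 0.36, 0.89, 0.24, 0.23, 0.95, 0.88],
    [0.66, 0.71, 0.42, 0.11, 0.11, 0.08, 0.23, 0.50],
    [0.65, 0.51, 0.32, 0.14, 0.74, 0.70, 0.05, 0.32],
    [0.90, 0.70, 0.57, 0.53, 0.64, 0.23, 0.02, 0.19],
    [0.22, 0.45, 0.64, 0.58, 0.52, 0.58, 0.32, 0.89],
]


def roi_max_pool(feature_map, top, left, height, width, out_h, out_w):
    """Max-pool one region of interest into a fixed out_h x out_w grid.

    Bin edges use floor division, so bins may differ in size
    (here: 2 and 3 rows, 3 and 4 columns).
    """
    pooled = []
    for i in range(out_h):
        r0 = top + (i * height) // out_h
        r1 = top + ((i + 1) * height) // out_h
        row = []
        for j in range(out_w):
            c0 = left + (j * width) // out_w
            c1 = left + ((j + 1) * width) // out_w
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Assumed RoI position: rows 3..7, columns 0..6 (7 wide, 5 high).
result = roi_max_pool(FEATURE_MAP, top=3, left=0, height=5, width=7, out_h=2, out_w=2)
print(result)
```

Whatever the size of the region, the output grid has a fixed size, which is what lets the fully connected layers consume it.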
Fast R-CNN advantages
◦ Much simpler training
◦ Faster - 2 s per image
Faster R-CNN
◦ Input image → ConvNet → feature map
◦ Feature map → region proposal network → region proposals
◦ Region proposals → RoI pooling → FC layers → softmax + bbox regressors
Faster R-CNN
◦ Fast enough for many applications: 140 ms per image
YOLO
Even Faster-RCNN is too slow for real-time
Model         Time/img     FPS    Pascal 2007 mAP
RCNN          20 s/img     0.05   0.66
Fast-RCNN     2 s/img      0.5    0.7
Faster-RCNN   140 ms/img   7      0.732
YOLO v1       22 ms/img    45     0.63
Fast YOLO v1  6.45 ms/img  155    0.53
Credit: https://pjreddie.com/darknet/yolo
50 km/h
278 m - RCNN
28 m - Fast-RCNN
1.95 m - Faster-RCNN
0.3 m - YOLO
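These figures read as the distance covered at 50 km/h while a detector processes a single image. That reading is an interpretation, since the slide carries no caption. A small sketch of the arithmetic, using the per-image times from the comparison table; the variable and function names are hypothetical.

```python
SPEED_KMH = 50
speed_ms = SPEED_KMH * 1000 / 3600  # metres per second, about 13.9

# Seconds per image, taken from the timing table in the slides.
seconds_per_image = {
    "RCNN": 20.0,
    "Fast-RCNN": 2.0,
    "Faster-RCNN": 0.140,
    "YOLO": 0.022,
}


def distance_per_image(seconds):
    """Metres travelled at SPEED_KMH during one image's processing time."""
    return speed_ms * seconds


def fps(seconds):
    """Frames per second implied by a per-image processing time."""
    return 1.0 / seconds


for model, t in seconds_per_image.items():
    print(f"{model}: {fps(t):.2f} FPS, {distance_per_image(t):.2f} m per image")
```

The computed distances agree with the slide to within rounding (Faster-RCNN comes out at about 1.94 m).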
Codetecon #KRK 3 - Object detection with Deep Learning
Split image into S x S grid
Each cell predicts boxes (x, y, w, h) and confidences P(object)
Each cell predicts class probabilities conditioned on an object being present, e.g. P(Car | object)
Classes shown: Car, Bicycle, Dog, Dining table
At test time we combine the box and class predictions
After NMS and thresholding
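The test-time steps above (combine box confidence with class probability, threshold, then apply non-maximum suppression) might look like the sketch below. It handles a single class, uses corner-format boxes (x1, y1, x2, y2) for convenience, and the thresholds 0.2 and 0.5 are illustrative assumptions rather than values from the slides. All names and sample numbers are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes in corner format (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, score_threshold=0.2, iou_threshold=0.5):
    """Drop low scores, then greedily keep the best box and suppress overlaps.

    detections: list of (box, score) pairs for one class.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


p_dog_given_object = 0.8  # class probability predicted by the cell
boxes_with_confidence = [
    ((10, 10, 50, 50), 0.9),
    ((12, 12, 52, 52), 0.6),      # heavy overlap with the first box
    ((100, 100, 140, 140), 0.5),
    ((200, 200, 220, 220), 0.1),  # falls below the score threshold
]
# Class-specific score = P(class | object) * box confidence.
dog_detections = [(box, conf * p_dog_given_object) for box, conf in boxes_with_confidence]
kept = nms(dog_detections)
print([box for box, _ in kept])
```

In a multi-class setting the same procedure would be run once per class.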
Model
◦ Image divided into S x S grid
◦ Within each grid cell predict:
▫ B boxes (4 coordinates + confidence)
▫ C class scores
◦ Regression from image to S x S x (5 * B + C) tensor
Credit: Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, arXiv 2015
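As a concrete instance of the S x S x (5 * B + C) tensor: the YOLO paper uses S = 7, B = 2 and C = 20 on Pascal VOC, which gives a 7 x 7 x 30 output. The sketch below computes the shape and splits one cell's vector into boxes and class scores. The ordering of values inside the vector (boxes first, then classes) is an assumption made for illustration, and the function names are hypothetical.

```python
def yolo_output_shape(S, B, C):
    """Shape of the prediction tensor: S x S cells, each with B boxes and C class scores."""
    return (S, S, 5 * B + C)


def split_cell_prediction(cell, B, C):
    """Split one cell's vector into B boxes of (x, y, w, h, confidence) and C class scores.

    Assumed layout: [box_0 (5 values), ..., box_{B-1} (5 values), class scores (C values)].
    """
    assert len(cell) == 5 * B + C
    boxes = [cell[5 * b: 5 * b + 5] for b in range(B)]
    class_scores = cell[5 * B:]
    return boxes, class_scores


shape = yolo_output_shape(S=7, B=2, C=20)
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)

boxes, class_scores = split_cell_prediction(list(range(30)), B=2, C=20)
print(len(boxes), len(class_scores))
```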
During training we match each example to the correct cell
Dog = 1
Cat = 0
Bicycle = 0
...
Adjust cell’s class probabilities
Find predicted bounding box with highest IoU
Increase its confidence
Decrease confidence of other boxes
Decrease confidence of boxes in cells without ground truth detection
Training details
● Pretrain Extraction Net on ImageNet (24 conv layers)
● SGD with decreasing learning rate
● Extensive data augmentation
● Leaky ReLUs
● Increase the loss weight for bounding box coordinate predictions and decrease it for boxes that don’t contain objects
● Predict the square root of width and height instead of the raw values
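The last two points can be illustrated numerically. The weights 5 and 0.5 are the values reported in the YOLO paper; the slides do not state them. The square root makes the same absolute size error cost more on a small box than on a large one. This is a partial sketch (the class-probability term of the loss is left out), and all names and sample sizes are hypothetical.

```python
import math

LAMBDA_COORD = 5.0  # weight on box coordinate errors (value from the YOLO paper)
LAMBDA_NOOBJ = 0.5  # weight on confidence errors of boxes without an object (same source)


def size_error(pred, true, use_sqrt=True):
    """Squared error on a width or height, optionally in square-root space."""
    if use_sqrt:
        return (math.sqrt(pred) - math.sqrt(true)) ** 2
    return (pred - true) ** 2


def weighted_loss(coord_error, obj_conf_error, noobj_conf_error):
    """Partial loss: coordinate and confidence terms only, class term omitted."""
    return LAMBDA_COORD * coord_error + obj_conf_error + LAMBDA_NOOBJ * noobj_conf_error


# The same absolute error of 0.05 (in normalised image units) on a small and a large box.
small_sqrt = size_error(0.15, 0.10)
large_sqrt = size_error(0.85, 0.80)
small_raw = size_error(0.15, 0.10, use_sqrt=False)
large_raw = size_error(0.85, 0.80, use_sqrt=False)

print(f"raw:  small={small_raw:.5f} large={large_raw:.5f}")
print(f"sqrt: small={small_sqrt:.5f} large={large_sqrt:.5f}")
```

Without the square root both errors are identical; with it, the small box is penalised several times more.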
YOLO v2
YOLO drawbacks
◦ YOLO makes a significant number of localization errors in comparison to Faster-RCNN
◦ Low recall in comparison to region-proposal-based methods
YOLO v2
◦ Batch normalization
◦ High resolution classifier
◦ Convolutional anchor boxes
◦ K-Means for choosing box priors
◦ Fine-grained features
◦ Multi-scale training
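The K-Means step for box priors can be sketched as below. The YOLO9000 paper clusters ground-truth box sizes with the distance d = 1 - IoU instead of Euclidean distance, so that large boxes do not dominate. The code is a minimal illustration: it initialises centroids from the first k boxes to stay deterministic (random initialisation is more usual), and the sample box sizes are made up.

```python
def wh_iou(a, b):
    """IoU of two boxes given only as (w, h), assuming they share the same centre."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_priors(boxes, k, iterations=20):
    """Cluster (w, h) pairs; nearest centroid = highest IoU (i.e. lowest 1 - IoU)."""
    centroids = [(float(w), float(h)) for w, h in boxes[:k]]  # deterministic init for this sketch
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


# Hypothetical ground-truth box sizes in pixels: three small, three wide.
sizes = [(10, 12), (100, 40), (12, 10), (110, 50), (11, 11), (90, 45)]
priors = kmeans_priors(sizes, k=2)
print(priors)
```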
mAP and speed on VOC 2007
Credit: Redmon, Farhadi: “YOLO9000: Better, Faster, Stronger”, arXiv 2017
YOLO 9000 - WordTree
YOLO 9000: Hierarchical Classification
◦ Train Darknet-19 on WordTree
◦ Propagate ground truth labels up the tree
◦ Perform multiple softmaxes, one over each set of co-hyponyms
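A toy version of this scheme: one softmax per group of siblings (co-hyponyms) gives conditional probabilities, and the absolute probability of a node is the product of the conditionals along its path to the root. The tiny tree and the scores below are hypothetical and only illustrate the mechanics; they are not taken from WordTree.

```python
import math

# Hypothetical miniature tree: node -> parent (None marks a child of the root).
PARENT = {
    "animal": None, "vehicle": None,
    "dog": "animal", "cat": "animal",
    "terrier": "dog", "husky": "dog",
}


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_probs(scores):
    """Apply one softmax per sibling group, giving P(node | parent) for every node."""
    groups = {}
    for node, parent in PARENT.items():
        groups.setdefault(parent, []).append(node)
    probs = {}
    for children in groups.values():
        for child, p in zip(children, softmax([scores[c] for c in children])):
            probs[child] = p
    return probs


def absolute_prob(node, cond):
    """Multiply conditional probabilities from the node up to the root."""
    p = 1.0
    while node is not None:
        p *= cond[node]
        node = PARENT[node]
    return p


raw_scores = {"animal": 2.0, "vehicle": 0.0, "dog": 1.0, "cat": 0.0, "terrier": 0.5, "husky": 1.5}
cond = conditional_probs(raw_scores)
for leaf in ("vehicle", "cat", "terrier", "husky"):
    print(f"P({leaf}) = {absolute_prob(leaf, cond):.3f}")
```

Because each sibling group sums to one, the absolute probabilities of the leaves also sum to one.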
YOLO 9000 - Joint Classification and Detection training
◦ COCO detection + top 9000 classes from ImageNet
◦ On a detection image, backpropagate loss as normal
◦ On a classification image, only backpropagate loss at or above the corresponding level of the label
◦ ImageNet shares 44 categories with COCO
◦ Generalizes quite well to new animals (tiger 0.61 AP, fox 0.52 AP)
◦ Fails on clothing, e.g. “sunglasses”
YOLO 9000: Visualizations
Single Shot Multibox Detector
SSD - YOLO architecture comparison
Credit: Liu et al., “SSD: Single Shot MultiBox Detector”, arXiv 2016.
SSD detection
◦ Described by four parameters (cx, cy, w, h) and a class category
◦ Each detector outputs a single value, so a single detection needs #classes + 4 detectors
Different “classes” of detection
Aspect ratios: 2:1, 1:2, 1:1
Default boxes and aspect ratios
For each conv layer with an m x n feature map that is input to detection there are:
(classes + 4) x #default boxes x m x n outputs
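A quick worked instance of this count. The numbers are illustrative: 21 classes (20 Pascal VOC classes plus background), 4 default boxes per location and a 38 x 38 feature map. The function name is hypothetical.

```python
def ssd_layer_outputs(num_classes, num_default_boxes, m, n):
    """Outputs produced for one m x n feature map:
    (classes + 4 box offsets) per default box, per location."""
    return (num_classes + 4) * num_default_boxes * m * n


outputs_38 = ssd_layer_outputs(num_classes=21, num_default_boxes=4, m=38, n=38)
outputs_19 = ssd_layer_outputs(num_classes=21, num_default_boxes=6, m=19, n=19)
print(outputs_38, outputs_19)
```

Summing this over every feature map used for detection gives the size of the fixed set of detector outputs.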
SSD Training
◦ Ground truth data needs to be assigned to specific outputs in the fixed set of detector outputs
◦ For each ground truth (GT) box we choose the default box with the best Jaccard overlap (IoU)
◦ Hard negative mining
◦ Data augmentation
◦ Loss: cross-entropy (classification) + Smooth L1 (localization)
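The Smooth L1 term can be sketched as follows: it is quadratic for small errors and linear for large ones, which makes it less sensitive to outliers than a plain squared error. The helper that sums over the four box offsets is a simplification for illustration, and the sample offsets are made up.

```python
def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 when |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def localization_loss(predicted_offsets, target_offsets):
    """Sum of Smooth L1 over the four box offsets (cx, cy, w, h) of one matched box."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted_offsets, target_offsets))


loss = localization_loss((0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))
print(loss)
```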
Deconvolutional Single Shot Detector
Credit: Fu et al., “DSSD: Deconvolutional Single Shot Detector”, arXiv 2017.
Models comparison - according to DSSD paper
Model           Network      Pascal 2007 mAP
Faster-RCNN     ResNet-101   0.764
R-FCN           ResNet-101   0.805
SSD-300         VGG-16       0.775
SSD-513         ResNet-101   0.806
YOLO v2 - 544   Darknet-19   0.786
DSSD-513        ResNet-101   0.815
Recap
◦ Detection as regression or detection as classification
◦ Static-image detectors are already fast enough to run even on video
◦ Fast YOLO is the fastest detector in this comparison (155 FPS)
◦ State-of-the-art:
▫ ResNet-101 + SSD + deconvolutions (DSSD)
Thanks!
Q&A
You can contact us at:
matthew.opala@craftinity.com

