Codetecon #KRK 3 - Object detection with Deep Learning

Object detection with Deep Learning
Matthew Opala

AGENDA
Region proposals based models
Regression models
Localization & detection

Computer Vision tasks
Single object: Classification, Classification & Localization
Multiple objects: Object detection, Instance segmentation
Credit: http://vision.stanford.edu/teaching/cs231n/slides/2016/winter1516_lecture8.pdf

Classification & Localization
Classification:
◦ Input: image
◦ Output: class label
◦ Evaluation: accuracy
Localization:
◦ Input: image
◦ Output: Box(x, y, w, h)
◦ Evaluation: IoU
CAT (x, y, w, h)
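A minimal sketch of the IoU (intersection over union) metric used to evaluate localization. It assumes that (x, y) is the top-left corner of the box, which the slide does not state; the function name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent along each axis; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes overlapping in a 1 x 1 square: IoU = 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 2, 2)))
```

A prediction is typically counted as correct when its IoU with the ground-truth box exceeds a chosen threshold.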

Object detection
◦ Many objects of different classes in an image
◦ Needs variable-size output

Detection as regression vs. Detection as classification
Detection as regression: ConvNet → Final conv feature maps → Classification head + Regression head
Detection as classification: Region proposals → Crop & warp → ConvNet → Final conv feature maps → Classifier

Region proposals - selective search
Credit: Uijlings et al, “Selective search for Object Recognition”, IJCV 2013

RCNN - model
Input Image → Selective search → Regions of Interest (RoI) → Warped image regions → ConvNet → SVM + Bbox regressors

RCNN - training
◦ Train a classification model
◦ Fine-tune it for detection
◦ Extract features
◦ Train a binary SVM for each class
◦ Train a linear regression model for each class

RCNN - disadvantages
◦ Complex training pipeline
◦ Slow at test time - 50s per image

Input Image → ConvNet → feature map
Selective search → RoI projection onto the feature map → RoI pooling → FC layers → Softmax + Bbox regressors

RoI Pooling
0.81 0.4 0.7 0.32 0.5 0.2 0.4 0.5
0.19 0.2 0.16 0.31 0.3 0.9 0.1 0.32
0.37 0.31 0.29 0.53 0.78 0.81 0.22 0.24
0.8 0.38 0.36 0.89 0.24 0.23 0.95 0.88
0.66 0.71 0.42 0.11 0.11 0.08 0.23 0.5
0.65 0.51 0.32 0.14 0.74 0.7 0.05 0.32
0.9 0.7 0.57 0.53 0.64 0.23 0.02 0.19
0.22 0.45 0.64 0.58 0.52 0.58 0.32 0.89

RoI Pooling, output size 2 x 2, region of interest 7 x 5
0.8 0.95
0.9 0.74
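The pooled values above can be reproduced with a short sketch. The slide does not say where the 7 x 5 region sits; the code assumes it covers rows 3-7 and columns 0-6 of the 8 x 8 feature map, which is the placement consistent with the 2 x 2 output shown. Bin boundaries are computed with integer division, so bins can differ in size by one cell. The function name is illustrative.

```python
def roi_max_pool(feature_map, top, left, height, width, out_h, out_w):
    """Max-pool a rectangular region of a 2-D feature map into out_h x out_w bins."""
    # Integer division places the bin edges, so earlier bins may be smaller.
    row_edges = [top + (i * height) // out_h for i in range(out_h + 1)]
    col_edges = [left + (j * width) // out_w for j in range(out_w + 1)]
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            cells = [feature_map[r][c]
                     for r in range(row_edges[i], row_edges[i + 1])
                     for c in range(col_edges[j], col_edges[j + 1])]
            row.append(max(cells))
        pooled.append(row)
    return pooled


FEATURE_MAP = [
    [0.81, 0.40, 0.70, 0.32, 0.50, 0.20, 0.40, 0.50],
    [0.19, 0.20, 0.16, 0.31, 0.30, 0.90, 0.10, 0.32],
    [0.37, 0.31, 0.29, 0.53, 0.78, 0.81, 0.22, 0.24],
    [0.80, 0.38, 0.36, 0.89, 0.24, 0.23, 0.95, 0.88],
    [0.66, 0.71, 0.42, 0.11, 0.11, 0.08, 0.23, 0.50],
    [0.65, 0.51, 0.32, 0.14, 0.74, 0.70, 0.05, 0.32],
    [0.90, 0.70, 0.57, 0.53, 0.64, 0.23, 0.02, 0.19],
    [0.22, 0.45, 0.64, 0.58, 0.52, 0.58, 0.32, 0.89],
]

# Assumed region placement: top row 3, left column 0, 5 rows high, 7 columns wide.
pooled = roi_max_pool(FEATURE_MAP, top=3, left=0, height=5, width=7, out_h=2, out_w=2)
print(pooled)  # [[0.8, 0.95], [0.9, 0.74]]
```

Whatever the size of the region, the output always has the same fixed size, which is what lets the FC layers follow.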

Fast R-CNN advantages
◦ Much simpler training
◦ Faster - 2s per image

Input Image → ConvNet → Feature map
Feature map → Region proposal network → Region proposals
Feature map + Region proposals → RoI pooling → FC layers → Softmax + Bbox regressors

Faster R-CNN
◦ Fast enough for many applications: 140 ms per image

Even Faster-RCNN is too slow for real-time
Model          Time/img   FPS    Pascal 2007 mAP
RCNN           20 s       0.05   0.66
Fast-RCNN      2 s        0.5    0.7
Faster-RCNN    140 ms     7      0.732
YOLO v1        22 ms      45     0.63
Fast YOLO v1   6.45 ms    155    0.53
Credit: https://pjreddie.com/darknet/yolo

RCNN: 278 m
Fast-RCNN: 28 m
Faster-RCNN: 1.95 m
YOLO: 0.3 m
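The slide does not say what these distances measure. One reading that fits the numbers is the distance a vehicle moving at about 50 km/h covers while a single image is processed, using the per-image times from the comparison table. The sketch below works under that assumption; the speed is inferred from the figures, not taken from the talk.

```python
SPEED_KMH = 50.0  # assumed speed, inferred from the figures, not stated on the slide


def distance_per_image(seconds_per_image, speed_kmh=SPEED_KMH):
    """Metres travelled at speed_kmh while one image is being processed."""
    return speed_kmh / 3.6 * seconds_per_image


# Per-image times taken from the comparison table.
timings = [("RCNN", 20.0), ("Fast-RCNN", 2.0), ("Faster-RCNN", 0.14), ("YOLO", 0.022)]
for name, seconds in timings:
    print(f"{name}: {distance_per_image(seconds):.2f} m")
```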

Each cell predicts boxes (x, y, w, h) and confidences P(object)

Each cell predicts class probabilities conditioned on an object, e.g. P(Car | object)
Example classes: Car, Bicycle, Dog, Dining table

At test time we combine the box and class predictions
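A minimal sketch of how the two predictions can be combined: each box confidence P(object) is multiplied by the cell's conditional class probabilities P(class | object) to give a class-specific score for the box. The numbers and the function name are illustrative only.

```python
def class_scores(box_confidence, class_probs):
    """Class-specific score for one box: P(class | object) * P(object)."""
    return {name: p * box_confidence for name, p in class_probs.items()}


cell_probs = {"Car": 0.1, "Bicycle": 0.2, "Dog": 0.7}  # illustrative values
scores = class_scores(0.8, cell_probs)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Boxes whose best score falls under a threshold are typically discarded before further filtering.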

Model
◦ Image divided into S x S grid
◦ Within each grid cell predict:
▫ B boxes (4 coordinates + confidence)
▫ C class scores
◦ Regression from image to S x S x (5 * B + C) tensor
Credit: Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, arXiv 2015
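A small sketch of the output tensor shape and of how one cell's vector could be split. The values S = 7, B = 2, C = 20 are the ones the YOLO paper reports for Pascal VOC; the ordering of values inside a cell (boxes first, then class scores) is an assumption, and implementations may lay it out differently.

```python
def yolo_output_shape(S, B, C):
    """Shape of the regression target: S x S cells, each with 5 * B + C values."""
    return (S, S, 5 * B + C)


def split_cell(cell_vector, B, C):
    """Split one cell's vector into B boxes (x, y, w, h, confidence) and C class scores.

    Assumed layout: all boxes first, then the class scores.
    """
    boxes = [tuple(cell_vector[5 * b: 5 * b + 5]) for b in range(B)]
    classes = list(cell_vector[5 * B: 5 * B + C])
    return boxes, classes


shape = yolo_output_shape(S=7, B=2, C=20)
print(shape)  # (7, 7, 30)

boxes, classes = split_cell(list(range(30)), B=2, C=20)
```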

During training we match examples to correct cell
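The slide does not spell out the matching rule. In the YOLO paper, the cell that contains the centre of a ground-truth box is responsible for that object. A sketch under that assumption, with the box centre given in normalised image coordinates (0 to 1); the function name is illustrative.

```python
def responsible_cell(cx, cy, S):
    """Grid cell (row, col) containing a box centre given in normalised coordinates."""
    # Clamp so that a centre exactly on the right or bottom edge stays inside the grid.
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    return row, col


print(responsible_cell(0.5, 0.5, S=7))  # (3, 3)
```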

Dog = 1
Cat = 0
Bicycle = 0
...
Adjust cell’s class probabilities

Find predicted bounding box with highest IoU

Decrease confidence of other boxes

Decrease confidence of boxes in cells without a ground-truth detection

Training details
● Pretrain Extraction Net on ImageNet (24 conv layers)
● SGD with decreasing learning rate
● Extensive data augmentation
● Leaky ReLUs
● Increase the loss from bounding box coordinate predictions and decrease it for boxes that don’t contain objects
● Predict the square root of width and height instead of predicting them directly
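A sketch of the last two points. The weights 5 and 0.5 are the values reported in the YOLO paper and do not appear on the slide. The confidence term is simplified: the target for a box with an object is set to 1, whereas the paper uses the IoU with the ground truth. The example shows why square roots help: the same absolute size error costs more on a small box than on a large one.

```python
import math

LAMBDA_COORD = 5.0  # coordinate weight from the YOLO paper, not stated on the slide
LAMBDA_NOOBJ = 0.5  # no-object weight from the YOLO paper, not stated on the slide


def size_error(pred_w, pred_h, true_w, true_h):
    """Squared error on the square roots of width and height."""
    return ((math.sqrt(pred_w) - math.sqrt(true_w)) ** 2
            + (math.sqrt(pred_h) - math.sqrt(true_h)) ** 2)


def confidence_error(pred_conf, has_object):
    """Squared confidence error, down-weighted for boxes without an object."""
    target = 1.0 if has_object else 0.0  # simplification, see the note above
    weight = 1.0 if has_object else LAMBDA_NOOBJ
    return weight * (pred_conf - target) ** 2


# The same absolute deviation (0.05) on a small box and on a large box.
small = LAMBDA_COORD * size_error(0.15, 0.15, 0.10, 0.10)
large = LAMBDA_COORD * size_error(0.85, 0.85, 0.80, 0.80)
print(small > large)  # True
```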

YOLO drawbacks
◦ YOLO makes a significant number of localization errors in comparison to Faster-RCNN
◦ Low recall in comparison to region proposal based methods

YOLO v2
◦ Batch normalization
◦ High resolution classifier
◦ Convolutional anchor boxes
◦ K-means for choosing box priors
◦ Fine-grained features
◦ Multi-scale training
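A rough sketch of the k-means step for choosing box priors. The YOLO9000 paper clusters ground-truth box shapes with the distance d = 1 - IoU instead of Euclidean distance. The initialisation (first k boxes), the mean-based centroid update, and the sample boxes are assumptions made for illustration.

```python
def wh_iou(a, b):
    """IoU of two boxes given as (w, h), aligned at a shared corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_priors(boxes, k, iterations=20):
    """Cluster (w, h) pairs using d = 1 - IoU; returns k prior shapes."""
    centroids = list(boxes[:k])  # naive initialisation: the first k boxes
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            nearest = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[nearest].append(box)
        # Move each centroid to the mean shape; keep it if the cluster is empty.
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


# Made-up box shapes: three roughly square, three wide.
boxes = [(1.0, 1.0), (4.0, 1.0), (1.1, 0.9), (4.2, 1.2), (0.9, 1.1), (3.8, 0.8)]
priors = kmeans_priors(boxes, k=2)
print(priors)
```

With the IoU-based distance, large boxes do not dominate the clustering the way they would under Euclidean distance.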

mAP and speed on VOC 2007
Credit: Redmon, Farhadi: “YOLO9000: Better, Faster, Stronger”, arXiv 2017

YOLO 9000: Hierarchical Classification
◦ Train Darknet-19 on WordTree
◦ Propagate ground truth labels up the tree
◦ Perform multiple softmax operations over co-hyponyms
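A toy sketch of the idea: one softmax per group of siblings (co-hyponyms) gives conditional probabilities P(node | parent), and the absolute probability of a node is the product of the conditionals on the path up to the root. The miniature tree and the scores are made up for illustration and are not taken from WordTree.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical miniature tree: each parent maps to its group of siblings.
CHILDREN = {"root": ["animal", "vehicle"], "animal": ["dog", "cat"]}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}


def conditional_probs(logits):
    """One softmax per sibling group, giving P(node | parent)."""
    probs = {}
    for kids in CHILDREN.values():
        for kid, p in zip(kids, softmax([logits[k] for k in kids])):
            probs[kid] = p
    return probs


def absolute_prob(node, cond):
    """P(node) as the product of conditionals along the path to the root."""
    p = 1.0
    while node != "root":
        p *= cond[node]
        node = PARENT[node]
    return p


logits = {"animal": 2.0, "vehicle": 0.0, "dog": 1.0, "cat": 1.0}  # made-up scores
cond = conditional_probs(logits)
print(absolute_prob("dog", cond))
```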

YOLO 9000 - Joint Classification and Detection training
◦ COCO detection + top 9000 classes from ImageNet
◦ On a detection image, backpropagate loss as normal
◦ On a classification image, only backpropagate loss at or above the corresponding level of the label
◦ ImageNet shares 44 categories with COCO
◦ Generalizes quite well to new animals (tiger 0.61 AP, fox 0.52 AP)
◦ Fails on clothing, e.g. “sunglasses”

SSD - YOLO architecture comparison
Credit: Liu et al: “SSD: Single Shot MultiBox Detector”, arXiv 2016.

SSD detection
◦ Described by four parameters (cx, cy, w, h) and a class category
◦ Each detector outputs a single value, so we need #classes + 4 detectors for a single detection

Different “classes” of detection
Aspect ratios: 2:1, 1:2, 1:1

Default boxes and aspect ratios

For each m x n conv feature map that is input to detection there are:
(#classes + 4) x #default boxes x m x n outputs
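The count above as a one-line helper. The example values (21 classes, 4 default boxes, a 38 x 38 feature map) are illustrative and not taken from the slide.

```python
def ssd_layer_outputs(num_classes, num_default_boxes, m, n):
    """Number of outputs for one m x n feature map used for detection."""
    return (num_classes + 4) * num_default_boxes * m * n


outputs = ssd_layer_outputs(num_classes=21, num_default_boxes=4, m=38, n=38)
print(outputs)  # 144400
```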

SSD Training
◦ Ground truth data needs to be assigned to specific outputs in the fixed set of detector outputs
◦ For each GT box we choose the default box with the best Jaccard overlap
◦ Hard negative mining
◦ Data augmentation
◦ Loss: cross-entropy + Smooth L1
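A minimal sketch of two of the pieces above: matching a ground-truth box to the default box with the best Jaccard overlap, and the Smooth L1 function used for the localization part of the loss. Boxes are (cx, cy, w, h); the default boxes and the ground-truth box are made-up values, and the function names are illustrative.

```python
def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (cx, cy, w, h)."""
    inter_w = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    inter_h = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def best_default_box(gt_box, default_boxes):
    """Index of the default box with the highest Jaccard overlap with gt_box."""
    return max(range(len(default_boxes)), key=lambda i: jaccard(gt_box, default_boxes[i]))


# Illustrative default boxes: one off-centre, one wide, one tall.
defaults = [(0.25, 0.25, 0.2, 0.2), (0.5, 0.5, 0.4, 0.2), (0.5, 0.5, 0.2, 0.4)]
gt = (0.5, 0.5, 0.35, 0.2)
match = best_default_box(gt, defaults)
print(match)  # 1, the wide default box
```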

Deconvolutional Single Shot Detector
Credit: Fu et al: “DSSD: Deconvolutional Single Shot Detector”, arXiv 2017.

Models comparison - according to DSSD paper
Model           Network      Pascal 2007 mAP
Faster-RCNN     ResNet-101   0.764
R-FCN           ResNet-101   0.805
SSD-300         VGG-16       0.775
SSD-513         ResNet-101   0.806
YOLO v2 - 544   Darknet-19   0.786
DSSD-513        ResNet-101   0.815

Recap
◦ Detection as regression or detection as classification
◦ Static-image detectors are already fast enough to work even on video
◦ Fast YOLO is the fastest detector compared here (155 FPS)
◦ State-of-the-art:
▫ ResNet-101 + SSD + deconvolutions

Thanks!
Q&A
You can contact us at:
matthew.opala@craftinity.com
