DenseBox: Unifying Landmark
Localization with End to End
Object Detection
Submitted on 16 Sep 2015 (v1), last revised 19 Sep 2015 (v3)
Lichao Huang, Yi Yang, Yafeng Deng, Yinan Yu
arXiv preprint arXiv:1509.04874, 2015
CHEN KUAN-YU
stu9458@gmail.com
2016/3/21 NCKU CSIE NEAT CHEN KUAN-YU 1
Agenda
• Introduction
• Algorithm
• Ground-Truth Generation
• Model Design
• Multi-Task Training
• Refine with Landmark Localization
• Experiments
• Conclusion
Introduction
• How well can a single fully convolutional neural network (FCN) perform on object
detection?
• In this work, we focus on one question: to what extent can a one-stage FCN
perform on object detection?
Introduction
• We introduce DenseBox, a unified end-to-end FCN framework that directly
predicts bounding boxes and object class confidences across all locations and
scales of an image.
• Although similar to many existing sliding-window FCN detection frameworks,
DenseBox is more carefully designed to detect objects at small scales and under
heavy occlusion.
Introduction
• Checking for nearby cars while driving, finding a person, and localizing a
familiar face are all examples of object detection.
• Experiments indicate that DenseBox is a state-of-the-art system for detecting
challenging objects such as faces and cars.
Introduction
1. First, we demonstrate that a single fully convolutional neural network, if
designed and optimized carefully, can detect objects at different scales and under
heavy occlusion extremely accurately and efficiently.
2. Second, we show that incorporating landmark localization through multi-task
learning [1] further improves detection accuracy.
Introduction
• The DenseBox detection pipeline
1. An image pyramid is fed to the network.
2. After several layers of convolution and pooling, the feature map is upsampled and further
convolution layers produce the final output.
3. The output feature map is converted to bounding boxes, and non-maximum suppression is
applied to all bounding boxes whose scores exceed the threshold.
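Step 3 of the pipeline can be sketched as below. This is a minimal NumPy illustration rather than the authors' implementation: the function names, the (x1, y1, x2, y2) box layout, and both threshold values are assumptions, and the greedy IoU-based suppression shown is the common variant, which the slides do not spell out.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Drop low-scoring boxes, then greedily keep the best non-overlapping ones."""
    above = scores >= score_thresh          # hypothetical score threshold
    boxes, scores = boxes[above], scores[above]
    order = np.argsort(-scores)             # highest score first
    kept = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        kept.append(best)
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]  # suppress heavy overlaps
    kept = np.array(kept, dtype=int)
    return boxes[kept], scores[kept]

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60], [0, 0, 10, 10]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.3])
kept_boxes, kept_scores = nms(boxes, scores)
```

In this toy input the second box overlaps the first heavily and is suppressed, and the fourth falls below the score threshold, so two boxes survive.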
Algorithm
• Ground-Truth Generation
• Model Design
• Multi-Task Training
• Refine with Landmark Localization
Algorithm
• Bounding box
• Top-left corner $p_t = (x_t, y_t)$
• Bottom-right corner $p_b = (x_b, y_b)$
• Each pixel $(x_i, y_i)$ of the output feature map holds a 5-dimensional vector
• $t_i = (s,\ dx^t = x_i - x_t,\ dy^t = y_i - y_t,\ dx^b = x_i - x_b,\ dy^b = y_i - y_b)$
• $s$ is the confidence score of being an object
• $dx^t, dy^t, dx^b, dy^b$ denote the distances between the output pixel location and the
boundaries of the target bounding box.
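The 5-dimensional vector can be turned back into a box by inverting the offset definitions, for example $x_t = x_i - dx^t$. The sketch below is an illustration under stated assumptions: the (5, H, W) array layout, the score threshold, and the absence of any offset normalization are choices made here, and `decode_boxes` is a hypothetical name. The stride of 4 matches the downsampling factor given for the 60x60 output map.

```python
import numpy as np

def decode_boxes(output, score_thresh=0.5, stride=4):
    """Convert a (5, H, W) output map into boxes in input-image pixels.

    Channel 0 is the score s; channels 1-4 are dx_t, dy_t, dx_b, dy_b,
    assumed here to be expressed in output-map pixels.
    """
    score = output[0]
    ys, xs = np.nonzero(score >= score_thresh)
    dxt, dyt, dxb, dyb = (output[c, ys, xs] for c in range(1, 5))
    # Invert dx_t = x_i - x_t (and so on) to recover the corners.
    boxes = np.stack([xs - dxt, ys - dyt, xs - dxb, ys - dyb], axis=1)
    return boxes * stride, score[ys, xs]

output = np.zeros((5, 60, 60))
# One confident pixel at (x=30, y=30) describing the box (24, 24)-(36, 36).
output[:, 30, 30] = [1.0, 6.0, 6.0, -6.0, -6.0]
boxes, scores = decode_boxes(output)
```

With these values the single decoded box is (96, 96, 144, 144) in input coordinates, i.e. a 48-pixel square centered in the 240x240 patch.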
Algorithm
Ground-Truth Generation
• In this paper, we train our network on a single scale and apply it to multiple scales
for evaluation.
• In training, the patches are cropped and resized to 240x240, with a face in the
center that is roughly 50 pixels high. The output ground truth in training is a
5-channel map of size 60x60, corresponding to a downsampling factor of four.
Algorithm
Ground-Truth Generation
• The positive labeled region in the first channel of the ground truth map is a filled
circle with radius $r_c$ (the radius is set to 0.3 times the box size).
• The remaining 4 channels are filled with the distances from each pixel location of the
output map to the top-left and bottom-right corners of the nearest bounding box.
Algorithm
Ground-Truth Generation
• Note that if multiple faces occur in one patch, we keep those faces as positive if
they fall in a scale range (e.g. 0.8 to 1.25 in our setting) relative to the face in the
patch center.
• The pixels of the first channel, which denote the class confidence score, are
initialized to 0 in the ground truth map and set to 1 if they lie within the positive
label region.
• Each pixel can be treated as one sample, since every 5-channel pixel describes a
bounding box.
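The ground-truth rules above can be sketched for a patch containing a single face. This is a simplified illustration, not the authors' code: `make_ground_truth` is a hypothetical name, "box size" is taken to mean the box height, the box is given directly in output-map coordinates, and the multi-face case (nearest box per pixel, scale-range filtering) is left out.

```python
import numpy as np

def make_ground_truth(box, size=60, radius_scale=0.3):
    """Build a (5, size, size) ground-truth map for one box (x_t, y_t, x_b, y_b)."""
    xt, yt, xb, yb = box
    ys, xs = np.mgrid[0:size, 0:size]
    gt = np.zeros((5, size, size), dtype=np.float32)

    # Channel 0: filled circle of radius r_c around the box center.
    cx, cy = (xt + xb) / 2.0, (yt + yb) / 2.0
    r_c = radius_scale * (yb - yt)          # assumption: box size = height
    gt[0] = (xs - cx) ** 2 + (ys - cy) ** 2 <= r_c ** 2

    # Channels 1-4: offsets from each pixel to the two box corners.
    gt[1], gt[2] = xs - xt, ys - yt
    gt[3], gt[4] = xs - xb, ys - yb
    return gt

# A 50-pixel face in a 240x240 patch is 12.5 pixels high at stride 4.
gt = make_ground_truth((23.75, 23.75, 36.25, 36.25))
```

Here the positive circle has radius 0.3 x 12.5 = 3.75 output pixels, so only a small disk around the face center is labeled 1.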
Algorithm
• Ground-Truth Generation
• Model Design
• Multi-Task Training
• Refine with Landmark Localization
Algorithm
Model Design
• Network architecture of DenseBox. The rectangles with red names contain
learnable parameters.
• Derived from the VGG (Visual Geometry Group) 19 model used for image
classification [35].
Algorithm
Model Design
• Multi-Level Feature Fusion
• Recent works indicate that using features from different convolution layers can enhance
performance in tasks such as edge detection and segmentation.
• Part-level features focus on local details of an object to find discriminative appearance
parts, while object-level or high-level features usually have a larger receptive field in order
to recognize the object.
• We concatenate the feature maps from conv3_4 and conv4_4. The receptive field (or
sliding window size) of conv3_4 is 48x48, almost the same as the face size in
training, and conv4_4 has a much larger receptive field, around 118x118.
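The receptive field sizes quoted above can be checked with the usual recurrence: each layer with kernel k grows the receptive field by (k - 1) times the current stride product. The sketch assumes the standard VGG-19 layout (3x3 convolutions with stride 1, 2x2 max pooling with stride 2); the layer list and function name are written for this illustration. Under those assumptions it gives 48 for conv3_4, matching the slide, and 116 for conv4_4, close to the "around 118" quoted.

```python
def receptive_fields(layers):
    """Return {layer name: receptive field size} for a chain of layers.

    Each layer is (name, kernel_size, stride).
    """
    rf, jump = 1, 1          # jump = product of strides so far
    sizes = {}
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes[name] = rf
    return sizes

conv = lambda name: (name, 3, 1)   # 3x3 convolution, stride 1
pool = lambda name: (name, 2, 2)   # 2x2 max pooling, stride 2

vgg19_prefix = [
    conv("conv1_1"), conv("conv1_2"), pool("pool1"),
    conv("conv2_1"), conv("conv2_2"), pool("pool2"),
    conv("conv3_1"), conv("conv3_2"), conv("conv3_3"), conv("conv3_4"), pool("pool3"),
    conv("conv4_1"), conv("conv4_2"), conv("conv4_3"), conv("conv4_4"),
]
rf = receptive_fields(vgg19_prefix)
```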
Algorithm
• Ground-Truth Generation
• Model Design
• Multi-Task Training
• Refine with Landmark Localization
Algorithm
Multi-Task Training
• Like Fast R-CNN, our network has two sibling output branches
1. The first branch outputs the confidence score $y$ (per pixel in the output map) of being a
target object. Given the ground truth label $y^* \in \{0, 1\}$, the classification loss is defined as
$$L_{cls}(y, y^*) = \lVert y - y^* \rVert^2 \quad (1)$$
2. The second branch outputs the bounding-box regression, with loss denoted $L_{loc}$. It
minimizes the L2 loss between the predicted location offsets $d = (d_{tx1}, d_{ty1}, d_{tx2}, d_{ty2})$
and the targets $d^* = (d_{tx1}^*, d_{ty1}^*, d_{tx2}^*, d_{ty2}^*)$:
$$L_{loc}(d, d^*) = \sum_{i \in \{tx1,\, ty1,\, tx2,\, ty2\}} \lVert d_i - d_i^* \rVert^2 \quad (2)$$
Algorithm
Multi-Task Training - Balance Sampling
• Selecting negative samples is one of the crucial parts of learning.
• The detector also degrades if we penalize the loss on samples lying on
the margin between the positive and negative regions.
• Here we use a binary mask for each output pixel to indicate whether it is selected
in training.
Algorithm
Multi-Task Training - Balance Sampling
• Ignoring Gray Zone
• Hard Negative Mining
• Loss with Mask
Algorithm
Multi-Task Training - Balance Sampling
• Ignoring Gray Zone
• The gray zone is defined on the margin between the positive and negative regions. It should
not be considered positive or negative, and its loss weight should be set to 0.
• A non-positive pixel lies in the gray zone if a positively labeled pixel is within
$r_{near} = 2$ pixels of it.
• The flag $f_{ign}$ marks these pixels: $f_{ign} = 1$ means the pixel is ignored in training.
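The gray-zone rule can be sketched as a small neighborhood test on the positive label map. This is an illustrative NumPy version, assuming Euclidean distance measured in output-map pixels; `ignore_flags` is a hypothetical name.

```python
import numpy as np

def ignore_flags(positive, r_near=2):
    """Return f_ign: True for non-positive pixels within r_near of a positive pixel."""
    positive = positive.astype(bool)
    h, w = positive.shape
    padded = np.pad(positive, r_near)       # pads with False
    near = np.zeros((h, w), dtype=bool)
    for dy in range(-r_near, r_near + 1):
        for dx in range(-r_near, r_near + 1):
            if dx * dx + dy * dy <= r_near * r_near:
                # Shifted view: is the pixel at offset (dy, dx) positive?
                near |= padded[r_near + dy:r_near + dy + h,
                               r_near + dx:r_near + dx + w]
    return near & ~positive

positive = np.zeros((11, 11), dtype=bool)
positive[5, 5] = True
f_ign = ignore_flags(positive)
```

For a single positive pixel, the flagged gray zone is the 12-pixel ring of neighbors within distance 2; the positive pixel itself is never flagged.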
Algorithm
Multi-Task Training - Balance Sampling
• Hard Negative Mining
• We make learning more efficient by searching for badly predicted samples rather than using
random samples. After negative mining, the badly predicted samples are very likely to be
selected, so gradient descent on those samples leads to more robust prediction with less noise.
• Sort the losses of the output pixels in descending order and assign the top 1% to be hard
negatives. In all experiments, we keep all positive labeled pixels (samples) and keep the ratio
of positives to negatives at 1:1.
• The flag $f_{sel}$ is set to 1 for the pixels (samples) selected in a mini-batch.
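One way to realize this selection is sketched below. The top-1% rule and the 1:1 ratio come from the slide; drawing half of the negatives from the hard pool and the other half at random from the remaining negatives is an assumption of this sketch, as are the function name and the array layout.

```python
import numpy as np

def select_samples(loss, positive, ignore, rng, hard_frac=0.01):
    """Return f_sel: all positives plus an equal number of negatives."""
    shape = positive.shape
    loss = loss.ravel()
    positive = positive.ravel().astype(bool)
    ignore = ignore.ravel().astype(bool)

    neg_idx = np.flatnonzero(~positive & ~ignore)
    # Hard negatives: negatives among the top 1% highest-loss pixels.
    n_top = max(1, int(round(hard_frac * loss.size)))
    top_idx = np.argsort(-loss)[:n_top]
    hard = np.intersect1d(top_idx, neg_idx)
    easy = np.setdiff1d(neg_idx, hard)

    n_neg = min(int(positive.sum()), neg_idx.size)   # 1:1 with positives
    n_hard = min(n_neg // 2, hard.size)              # assumed half/half split
    n_easy = min(n_neg - n_hard, easy.size)
    chosen = np.concatenate([
        rng.choice(hard, size=n_hard, replace=False),
        rng.choice(easy, size=n_easy, replace=False),
    ]).astype(int)

    f_sel = positive.copy()
    f_sel[chosen] = True
    return f_sel.reshape(shape)

rng = np.random.default_rng(0)
loss = rng.random((60, 60))
positive = np.zeros((60, 60), dtype=bool)
positive[28:32, 28:33] = True        # 20 positive pixels
loss[positive] = 0.0                 # keep positives out of the top 1%
ignore = np.zeros((60, 60), dtype=bool)
f_sel = select_samples(loss, positive, ignore, rng)
```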
Algorithm
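Multi-Task Training - Balance Sampling
• Loss with Mask
• Now we can define the mask $M(t_i)$ for each sample $t_i = (y_i, d_i)$ as a function of the
flags mentioned above:
$$M(t_i) = \begin{cases} 0 & f_{ign}^i = 1 \text{ or } f_{sel}^i = 0 \\ 1 & \text{otherwise} \end{cases} \quad (3)$$
• Then, combining the classification loss (1) and the bounding box regression loss (2) with the
masks, our full multi-task loss can be written as
$$L_{det}(\theta) = \sum_i \left( M(t_i)\, L_{cls}(y_i, y_i^*) + \lambda_{loc}\, [y_i^* > 0]\, M(t_i)\, L_{loc}(d_i, d_i^*) \right) \quad (4)$$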
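The masked loss can be sketched numerically as follows. This is an illustration under stated assumptions: squared error is used for both branches, the mask is zero for pixels that are ignored or not selected, the localization term is counted only where the ground truth is positive, and `lambda_loc` is a weight whose value is left to the caller. The function name and array shapes are hypothetical.

```python
import numpy as np

def detection_loss(y_pred, y_true, d_pred, d_true, f_ign, f_sel, lambda_loc):
    """Masked multi-task loss over an output map.

    y_*: (H, W) scores; d_*: (4, H, W) offsets; f_ign, f_sel: (H, W) flags.
    """
    mask = ((f_ign == 0) & (f_sel == 1)).astype(np.float64)
    l_cls = (y_pred - y_true) ** 2                  # per-pixel squared error
    l_loc = ((d_pred - d_true) ** 2).sum(axis=0)    # summed over the 4 offsets
    positive = (y_true > 0).astype(np.float64)      # localization only on positives
    return float((mask * l_cls + lambda_loc * positive * mask * l_loc).sum())

y_true = np.array([[1.0, 0.0], [0.0, 0.0]])
y_pred = np.array([[0.5, 0.5], [1.0, 1.0]])
d_true = np.zeros((4, 2, 2))
d_pred = np.ones((4, 2, 2))
f_sel = np.array([[1, 1], [0, 1]])
f_ign = np.array([[0, 0], [0, 1]])
loss = detection_loss(y_pred, y_true, d_pred, d_true, f_ign, f_sel, lambda_loc=3.0)
```

Only the top row survives the mask in this toy input: the classification term contributes 0.25 + 0.25 and the localization term contributes 3 x 4 from the single positive pixel, for a total of 12.5.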
Algorithm
• Ground-Truth Generation
• Model Design
• Multi-Task Training
• Refine with Landmark Localization
Algorithm
Refine with Landmark Localization
• Landmark localization can be achieved in DenseBox just by stacking a few layers,
owing to the fully convolutional architecture.
Algorithm
Refine with Landmark Localization
Yann LeCun, “Learning Hierarchies of Invariant Features”, Center for Data Science & Courant
Institute, NYU, http://www.slideshare.net/yandex/yann-le-cun
Algorithm
Refine with Landmark Localization
• Landmark localization loss $L_{lm}$
• $\lambda_{det}$ and $\lambda_{lm}$ control the balance of the three tasks
• Refined detection loss $L_{rf}$
Experiments
1. Landmarks for faces (MALF Face Detection Task)
2. 8 landmarks for cars (KITTI Car Detection Task)
Experiments
• MALF Face Detection Task
• Each image is resized so that its longest side does not exceed 800 pixels.
• We test our model on each image at several scales. The test scale ranges from $2^{-3}$ to
$2^{1.2}$ with a step of $2^{0.3}$. This setting enables our models to detect faces from 20 pixels
to 400 pixels in height.
• Results of three versions of DenseBox on the MALF dataset:
• DenseBoxNoLandmark denotes DenseBox trained without landmarks.
• DenseBoxLandmark is the model incorporating landmark localization.
• DenseBoxEnsemble is the result of ensembling 10 DenseBox models with landmarks from different batch iterations.
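The test scales and the face heights they cover can be worked out directly. The sketch assumes that the scale is the factor applied to the image and that the network responds to faces about 50 pixels high, the height used in training; `pyramid_scales` is a hypothetical helper.

```python
import numpy as np

def pyramid_scales(lo=-3.0, hi=1.2, step=0.3):
    """Scales 2**lo, 2**(lo + step), ..., 2**hi."""
    n = int(round((hi - lo) / step)) + 1
    return 2.0 ** (lo + step * np.arange(n))

scales = pyramid_scales()
# A face of height h in the original image appears 50 pixels high at scale 50 / h.
face_heights = 50.0 / scales
```

This yields 15 scales. The smallest scale (0.125) maps 400-pixel faces to 50 pixels, and the largest (about 2.3) maps faces of roughly 22 pixels to 50 pixels, in line with the 20 to 400 pixel range stated above.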
Experiments
• KITTI Car Detection Task
• The key difficulty of the KITTI car detection task is that a large number of cars are small
and occluded. We selectively annotate 8 landmarks for large cars.
• The evaluation metric of the KITTI car detection task differs from general object detection:
KITTI requires an overlap of 70% for a true positive bounding box, while other tasks such as
face detection only require 50% overlap.
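The effect of the two overlap thresholds can be illustrated with a small check. The sketch assumes that "overlap" means intersection over union of axis-aligned boxes given as (x1, y1, x2, y2); the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def is_true_positive(pred, gt, min_overlap):
    return iou(pred, gt) >= min_overlap

pred = (0.0, 0.0, 10.0, 10.0)
gt = (0.0, 0.0, 10.0, 6.0)            # IoU with pred is 0.6
face_ok = is_true_positive(pred, gt, 0.5)   # 50% criterion
car_ok = is_true_positive(pred, gt, 0.7)    # 70% criterion
```

The same prediction counts as a true positive under the 50% criterion but as a miss under the 70% criterion, which is why the stricter threshold rewards precise localization.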
Experiments
Different versions of DenseBox and recall curve
Experiments
Method                        Moderate (%)   Easy (%)   Hard (%)
Regionlets [23]               76.45          84.75      59.70
AOG [19]                      74.26          84.24      60.51
3DVP [40]                     75.77          87.46      65.38
spCov_LBP                     77.40          87.19      60.60
DeepInsight                   84.40          84.59      76.09
NIPS ID 331                   87.14          88.33      76.11
DJML                          88.79          91.31      77.73
DenseBox (without landmark)   85.07          82.33      76.27
DenseBox (with landmark)      85.74          83.63      76.71
Conclusion
• Performance can be boosted easily by incorporating landmark information.
• DenseBox achieves impressive performance on both the face detection and car
detection tasks, demonstrating that it is highly suitable for these situations.
• The original DenseBox presented in this paper needs several seconds to process
one image, but this has been addressed in our later version.

Editor's Notes

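  • #2: A localization approach that unifies landmarks with end-to-end detection, which the authors call DenseBox. arXiv (the X is pronounced like the Greek letter χ, so the name sounds like the English word "archive") is a website that collects preprints of papers in physics, mathematics, computer science, and biology. By October 2008, arXiv.org had collected more than 500,000 preprints; by the end of 2014 it held one million. In 2014 it was growing at roughly 8,000 papers per month.
  • #3: Of roughly three representative papers I read in the field of feature recognition, all propose only training methods and use those methods to reduce the resources needed for training. The talk covers, in order: how to generate the ground truth (GT) for the dataset so that it is convenient for later use, how to set up the model's extraction conditions, training under multiple tasks (of different kinds), and finally how to mark the regions of interest.
  • #4: The opening poses the question of how well a single FCN can perform on object detection, and focuses on how far a one-stage FCN can go on this task.
  • #5: DenseBox is a unified end-to-end fully convolutional network framework that directly predicts bounding boxes and object class confidences across all locations and scales of the image. Although it resembles existing sliding-window FCN detection frameworks, DenseBox is more carefully designed to detect objects at smaller scales and under heavier occlusion.
  • #6: Finally, tasks such as checking for nearby vehicles and finding faces are the main focus, and the method is state of the art on both challenges, which target faces and cars.
  • #7: There are two contributions. The first is detecting occluded and small objects; the second is incorporating landmark localization through multi-task learning.
  • #8: 1. Feed an image pyramid (the image at different resolutions) into the network. 2. Use several convolution layers to extract features, upsample the features back, and apply convolution layers to obtain the final output. 3. Convert the output to bounding boxes and apply non-maximum suppression (keeping the highest-scoring box among heavily overlapping ones) to the boxes whose scores exceed the threshold.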
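  • #10: Each pixel of the output feature map holds a score and the distances from that pixel to the corners of the bounding box. The final output is converted to bounding boxes, and non-maximum suppression is applied to the boxes whose scores exceed the threshold.
  • #11: The first step is training with the ground truth. The paper trains the network on single-scale images: during training, cropped patches are resized to 240x240, and the face indicated by the ground truth is about 50 pixels high, so the 5-channel map is 60x60, a downsampling factor of 4. The later experiments mention that, to capture larger objects, the height is not restricted to 50 pixels.
  • #12: The map for an image has five channels. The first channel marks the location of the ground-truth object, and the remaining four hold the distances to the corner coordinates of the nearest bounding box. The first channel also serves as the score: pixels inside the positive region are given a positive score. These are the rules for constructing the ground truth.
  • #13: Even when several faces appear, each one counts as positive as long as it falls within the given scale range. Every pixel has five channels: one for the score and four for the distances to the bounding box, which simplifies later computation. The score is 0 outside the positive region and 1 inside it, and this is used to compute overlap.
  • #14: [35] discusses the use of many layers. Convolution is related to system output in engineering, e.g. output = input * transfer function (a third function produced from f and g, representing the overlap area between f and a flipped and shifted g). Here a deep convolutional network is used to search over a large range of sizes, but because the network is fairly deep,
  • #15: The convolutions above produce many candidate features. A pooling layer takes a high-dimensional feature map, divides it into regions, and takes the maximum or average of each region to obtain new, lower-dimensional features. These layers are described as part of learning because repeated feature analysis and convolution yield deeper features, which are then resampled, trained on, and checked against the loss. The model proposed by the VGG team serves as the backbone, and the network architecture is built on it.
  • #16: On multi-level feature fusion: recent work uses features from different convolution layers for tasks such as edge detection. Part-level features capture finer, more discriminative appearance details. The paper also states the receptive field sizes of the layers, to show that features at different sizes mean different things. The conv4_4 feature map is half the size of conv3_4, so it is bilinearly upsampled back to the same resolution before the two are combined.
  • #17: [35] discusses the use of many layers. Convolution is related to system output in engineering, e.g. output = input * transfer function (a third function produced from f and g, representing the overlap area between f and a flipped and shifted g). Here a deep convolutional network is used to search over a large range of sizes.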
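  • #18: The approach resembles R-CNN (a region-based neural network method used for classification). The first branch compares the output at every pixel with the ground truth to obtain the difference, as in (1). The difference from the bounding-box coordinates is handled by (2), computed at every pixel. L stands for loss, i.e. the error. An L2 loss takes the squared difference, which amplifies large errors.
  • #19: Multi-task training needs balanced sampling of positive and negative samples, and selecting negatives is a crucial part. Penalizing samples on the border between the positive and negative regions degrades the detector. A mask therefore indicates which pixels are selected and which are not. Each mini-batch contains only one training sample. In training, batch gradient descent uses each pass over the whole set to find the gradient direction on the full training set and moves along it to the next iterate; mini-batch gradient descent instead works on a single patch rather than on all samples in the set.
  • #20: Multi-task training needs balanced sampling of positive and negative samples.
  • #21: Pixels in the gray zone are discarded. The threshold is set to 2 pixels (i.e. on the border). Since negatives can be regarded as background, the gray zone is effectively the border between the outside and the inside of the positive region.
  • #22: Selecting badly predicted samples is simple and direct, and it helps improve prediction accuracy, much like removing erroneous data. The ratio of positive to negative labels is kept at 1:1, and f_sel decides whether a sample is selected (only badly predicted samples are selected).
  • #23: For the mask: if a sample is ignored or not selected, its mask is zero, which affects (4). In (4), the term in the red box first checks whether the ground truth is positive (because the score is compared); if it is not, that sample needs no training. λ_loc is a balancing weight between the score difference and the location difference. Finally d* is normalized by dividing by 50/4 (a detected object is 50 pixels in size).
  • #24: [35] discusses the use of many layers. Convolution is related to system output in engineering, e.g. output = input * transfer function (a third function produced from f and g, representing the overlap area between f and a flipped and shifted g). Here a deep convolutional network is used to search over a large range of sizes, but because the network is fairly deep,
  • #25: This step merges the earlier region annotations; repeated refinement produces errors that are fed back (though it is unclear whether these errors are used later). The detection and landmark outputs are combined as in the figure and then expanded to the inspection size, giving the detected circles. In this step, convolutions repeatedly reduce the candidate results and combine them to produce the likely parts. The 1x1x512 in red denotes the kernel size and the number of channels: 512 candidate feature maps.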
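  • #26: The top left shows the detection path; the middle stage, like the red box earlier, concatenates the separate results. The four outputs mentioned earlier represent the score and the bounding-box coordinates. The image at the bottom right is an example that roughly explains the NxNxN notation: here NxN is the spatial size, and the last dimension may be the number of image channels (grayscale or RGB).
  • #27: L_lm is the landmark loss, L_rf is the loss of the refined detection, and L_det is the detection loss; the weighted combination of these gives the total loss.
  • #28: Two datasets are used, MALF and KITTI; the two challenges are shown in the left and right images. The left one finds facial landmarks, with distinctive points on the eyebrows, nose, and mouth; as with cars, these landmarks are annotated and learned by the CNN.
  • #29: MALF contains more than five thousand images, though the poses tend to be frontal. Face sizes range from small to large. Three versions of DenseBox are evaluated at test time. Faces are roughly 50 pixels high and are accepted within a scale factor of [0.8, 1.25].
  • #30: In the KITTI car detection dataset, the overlap must exceed 70%, whereas face detection requires 50%. "The PASCAL Visual Object Classes (VOC) Challenge" uses 50% as the threshold, and AP is judged against various overlap ratios.
  • #31: The left plot compares the different DenseBox versions. The best curve, Ensemble, combines 10 DenseBox models with landmarks taken from different batch iterations. The right plot shows the recall rate: DenseBox is able to find nearly all of the objects (in an image).
  • #32: At the time, DenseBox held first through fourth place for four months; DJML now leads (on KITTI car detection). [19] integrates context and handles occluded objects; [23] uses region-based relocalization; [40] uses a 3D spatial representation.
  • #33: Most detections succeed; the false detections shown in the bottom-right image occur because those "features" look too much like a face.
  • #35: The paper applies the method to cars and faces, but the original version is not fast enough; later versions change this somewhat.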