Densebox

DenseBox: Unifying Landmark
Localization with End to End
Object Detection
Submitted on 16 Sep 2015 (v1), last revised 19 Sep 2015 (v3)
Lichao Huang, Yi Yang, Yafeng Deng, Yinan Yu
arXiv preprint arXiv:1509.04874, 2015
CHEN KUAN-YU
stu9458@gmail.com
2016/3/21 NCKU CSIE NEAT CHEN KUAN-YU 1

Agenda
• Introduction
• Algorithm
• Ground-Truth Generation
• Model Design
• Multi-Task Training
• Refine with Landmark Localization
• Experiments
• Conclusion

Introduction
• How well can a single fully convolutional neural network (FCN) perform on object
detection?
• In this work, we focus on one question: to what extent can a one-stage FCN
perform on object detection?

Introduction
• We introduce DenseBox, a unified end-to-end FCN framework that directly
predicts bounding boxes and object class confidences through all locations and
scales of an image.
• Although similar to many existing sliding-window-style FCN detection
frameworks, DenseBox is more carefully designed to detect objects under small
scales and heavy occlusion.

Introduction
• Checking for nearby cars while driving, finding a person, and localizing a
familiar face are all examples of object detection.
• Experiments on MALF and KITTI indicate that DenseBox is a state-of-the-art system
for detecting challenging objects such as faces and cars.

Introduction
1. First, we demonstrate that a single fully convolutional neural network, if
designed and optimized carefully, can detect objects at different scales and under
heavy occlusion extremely accurately and efficiently.
2. Second, we show that when landmark localization is incorporated through
multi-task learning[1], DenseBox further improves object detection accuracy.

Introduction
• The DenseBox detection pipeline
1. An image pyramid is fed to the network.
2. After several layers of convolution and pooling, the feature map is upsampled and passed
through further convolution layers to produce the final output.
3. The output feature map is converted to bounding boxes, and non-maximum suppression is
applied to all bounding boxes whose scores exceed the threshold.
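Step 3 of the pipeline can be sketched as a score filter followed by greedy non-maximum suppression. This is a minimal sketch, not the authors' implementation: the function names, the (score, box) layout, and both threshold values are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, score_thresh=0.5, iou_thresh=0.5):
    """Greedy NMS over (score, box) pairs; both thresholds are illustrative."""
    # Keep only boxes whose score exceeds the threshold.
    candidates = [d for d in detections if d[0] > score_thresh]
    # Visit boxes from highest to lowest score.
    candidates.sort(key=lambda d: d[0], reverse=True)
    kept = []
    for score, box in candidates:
        # Drop a box if it overlaps too much with an already kept one.
        if all(iou(box, k[1]) <= iou_thresh for k in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.9, (10, 10, 60, 60)),
    (0.8, (12, 12, 62, 62)),      # heavy overlap with the first box
    (0.7, (100, 100, 150, 150)),
    (0.2, (200, 200, 250, 250)),  # below the score threshold
]
kept = nms(detections)
```

Here the second box is suppressed by the first and the fourth is filtered out by its score, so two detections remain.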

Algorithm
• Model Design

Algorithm
• Bounding box
• Top-left corner: $p_t = (x_t, y_t)$
• Bottom-right corner: $p_b = (x_b, y_b)$
• Each pixel $i$ of the output feature map, located at $(x_i, y_i)$, holds a 5-dimensional vector
• $t_i = (s,\ dx^t = x_i - x_t,\ dy^t = y_i - y_t,\ dx^b = x_i - x_b,\ dy^b = y_i - y_b)$
• $s$ is the confidence score of being an object
• $dx^t, dy^t, dx^b, dy^b$ denote the distances between the output pixel location and the
boundaries of the target bounding box.
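A small worked example of the 5-dimensional vector, as a sketch: `encode_pixel` and `decode_pixel` are hypothetical helper names, and the only thing they rely on is the definition above (each offset is the pixel coordinate minus the corner coordinate).

```python
def encode_pixel(xi, yi, box, s=1.0):
    """Build t_i = (s, dxt, dyt, dxb, dyb) for pixel (xi, yi) and a target box."""
    xt, yt, xb, yb = box
    return (s, xi - xt, yi - yt, xi - xb, yi - yb)


def decode_pixel(xi, yi, t):
    """Recover (score, box) from t_i by inverting the offset definitions."""
    s, dxt, dyt, dxb, dyb = t
    # dxt = xi - xt, so xt = xi - dxt (and likewise for the other three).
    return s, (xi - dxt, yi - dyt, xi - dxb, yi - dyb)


t = encode_pixel(30, 30, (20, 18, 42, 44))
score, box = decode_pixel(30, 30, t)
```

For a pixel inside the box, the top-left offsets are positive and the bottom-right offsets are negative.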

Algorithm
• In this paper, we train our network on a single scale and apply it to multiple scales
for evaluation.
• In training, patches are cropped and resized to 240x240 so that the face at the
center is roughly 50 pixels high. The output ground truth in training is a
5-channel map of size 60x60, corresponding to a downsampling factor of four.
Ground-Truth Generation

Algorithm
• The positive labeled region in the first channel of the ground-truth map is a filled
circle with radius $r_c$, centered on the face bounding box (its scaling factor is set to
0.3 of the box size).
• The remaining 4 channels are filled with the distances between the pixel location in the
output map and the top-left and bottom-right corners of the nearest bounding box.
Ground-Truth Generation

Algorithm
• Note that if multiple faces occur in one patch, we keep those faces as positive if
they fall within a scale range (e.g. 0.8 to 1.25 in our setting) relative to the face at
the patch center.
• The pixels of the first channel, which denote the class confidence score, are
initialized to 0 in the ground-truth map and set to 1 if they lie within the positive
label region.
• Each pixel can be treated as one sample, since every 5-channel pixel describes a
bounding box.
Ground-Truth Generation
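The ground-truth map described on these slides can be sketched in plain Python. Several details are assumptions of this sketch rather than statements from the slides: boxes are already given in output-map coordinates, the "box size" is taken to be the box height, the circle is centered on the box, and "nearest" means nearest box center.

```python
import math


def make_ground_truth(boxes, size=60, scale=0.3):
    """Build a 5-channel size x size target map; each box is (xt, yt, xb, yb)."""
    gt = [[[0.0] * size for _ in range(size)] for _ in range(5)]
    if not boxes:
        return gt
    centers = [((b[0] + b[2]) / 2, (b[1] + b[3]) / 2) for b in boxes]
    for y in range(size):
        for x in range(size):
            # Pick the box whose center is nearest to this pixel (assumption).
            k = min(range(len(boxes)),
                    key=lambda j: math.hypot(x - centers[j][0], y - centers[j][1]))
            xt, yt, xb, yb = boxes[k]
            r_c = scale * (yb - yt)  # radius from box height (assumption)
            # Channel 0: 1 inside the filled circle, 0 elsewhere.
            if math.hypot(x - centers[k][0], y - centers[k][1]) <= r_c:
                gt[0][y][x] = 1.0
            # Channels 1-4: offsets to the top-left and bottom-right corners.
            gt[1][y][x] = x - xt
            gt[2][y][x] = y - yt
            gt[3][y][x] = x - xb
            gt[4][y][x] = y - yb
    return gt


# A 50-pixel face in a 240x240 patch is 12.5 pixels tall at 1/4 resolution.
box = (23.75, 23.75, 36.25, 36.25)
gt = make_ground_truth([box])
```

With these numbers the positive circle has radius 0.3 x 12.5 = 3.75 output pixels around (30, 30).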

Algorithm
• Model Design

Algorithm
• Network architecture of DenseBox. The rectangles with red names contain
learnable parameters.
• Derived from the VGG (Visual Geometry Group) 19 model used for image
classification[35].
Model-Design

Algorithm
• Multi-Level Feature Fusion
• Recent works indicate that using features from different convolution layers can enhance
performance in tasks such as edge detection and segmentation.
• Part-level features focus on local details of an object to find discriminative appearance
parts, while object-level or high-level features usually have a larger receptive field in
order to recognize the object.
• We concatenate the feature maps from conv3_4 and conv4_4. The receptive field (or
sliding window size) of conv3_4 is 48x48, almost the same as the face size in
training, while conv4_4 has a much larger receptive field, around 118x118.
Model-Design
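The receptive-field figures can be checked with the usual recurrence (each layer adds (kernel - 1) times the accumulated stride). The sketch below assumes the standard VGG-19 layout of 3x3 stride-1 convolutions and 2x2 stride-2 pooling; DenseBox may differ in details. Under that assumption it gives 48 for conv3_4, matching the slide, and 116 for conv4_4, close to the "around 118" quoted above.

```python
# (name, kernel size, stride) for the first four stages of standard VGG-19.
VGG19_LAYERS = [
    ("conv1_1", 3, 1), ("conv1_2", 3, 1), ("pool1", 2, 2),
    ("conv2_1", 3, 1), ("conv2_2", 3, 1), ("pool2", 2, 2),
    ("conv3_1", 3, 1), ("conv3_2", 3, 1), ("conv3_3", 3, 1), ("conv3_4", 3, 1),
    ("pool3", 2, 2),
    ("conv4_1", 3, 1), ("conv4_2", 3, 1), ("conv4_3", 3, 1), ("conv4_4", 3, 1),
]


def receptive_fields(layers):
    """Return the receptive field (in input pixels) after each layer."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    out = {}
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        out[name] = rf
    return out


rf = receptive_fields(VGG19_LAYERS)
print(rf["conv3_4"], rf["conv4_4"])
```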

Algorithm
• Model Design

Algorithm
• Like Fast R-CNN, our network has two sibling output branches
1. The first branch outputs the confidence score $y$ (per pixel in the output map) of being a
target object. Given the ground-truth label $y^* \in \{0, 1\}$, the classification loss is
defined as in (1).
2. The second branch outputs the bounding-box regression, whose loss is denoted $L_{loc}$ and
defined as in (2). It minimizes the L2 loss between the predicted location offsets
$d = (d_{tx1}, d_{ty1}, d_{tx2}, d_{ty2})$ and the targets
$d^* = (d^*_{tx1}, d^*_{ty1}, d^*_{tx2}, d^*_{ty2})$.
Multi-Task Training
(1) (2)
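A minimal sketch of the two branch losses, assuming both are plain squared-error (L2) losses. The slide states this only for the regression branch, so the form of the classification loss here is an assumption, and the function names are hypothetical.

```python
def cls_loss(y, y_star):
    """Squared error between predicted score y and label y_star in {0, 1}."""
    return (y - y_star) ** 2


def loc_loss(d, d_star):
    """Sum of squared errors over the four offsets (tx1, ty1, tx2, ty2)."""
    return sum((p - q) ** 2 for p, q in zip(d, d_star))


l_cls = cls_loss(0.8, 1)
l_loc = loc_loss((10, 12, -12, -14), (9, 12, -12, -16))
```

In this example the score is off by 0.2 and two of the four offsets are off by 1 and 2 pixels.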

Algorithm
• Selecting negative samples is one of the crucial parts of learning.
• In addition, the detector degrades if we penalize the loss on samples lying on the
margin between the positive and negative regions.
• Here we use a binary mask for each output pixel to indicate whether it is selected
in training.
Multi-Task Training - Balance Sampling

Algorithm
• Ignoring Gray Zone
• Hard Negative Mining
• Loss with Mask

Algorithm
• Ignoring Gray Zone
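• The gray zone is defined on the margin between the positive and negative regions. It should
be considered neither positive nor negative, and its loss weight should be set to 0.
• A non-positive pixel lies in the gray zone if its distance to a positive pixel satisfies
$Dis_{pixel} < r_{near} = 2$ pixels
• The ignore flag $f_{ign}$ is set to 1 for such pixels, so that they are not selected in training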
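The gray-zone rule can be sketched as follows. The function name is hypothetical, and two choices are assumptions of this sketch: distance is Euclidean on the output grid, and the comparison is strict, following the inequality on the slide.

```python
def ignore_flags(labels, r_near=2):
    """Return f_ign: 1 for non-positive pixels closer than r_near to a positive."""
    h, w = len(labels), len(labels[0])
    # Neighbor offsets whose Euclidean length is strictly below r_near.
    offsets = [(dy, dx)
               for dy in range(-r_near, r_near + 1)
               for dx in range(-r_near, r_near + 1)
               if 0 < dy * dy + dx * dx < r_near * r_near]
    f_ign = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if labels[y][x] == 1:
                continue  # positives are never ignored
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] == 1:
                    f_ign[y][x] = 1
                    break
    return f_ign


labels = [[0] * 5 for _ in range(5)]
labels[2][2] = 1  # a single positive pixel in the middle
f_ign = ignore_flags(labels)
```

With a single positive pixel, the eight surrounding pixels are flagged and everything farther away stays a regular negative.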

Algorithm
• Hard Negative Mining
• We make learning more efficient by searching for badly predicted samples rather than random
samples. After negative mining, the badly predicted samples are very likely to be selected, so
gradient descent on those samples leads to more robust prediction with less noise.
• Sort the losses of the output pixels in descending order and assign the top 1% to be hard
negatives. In all experiments, we keep all positive labeled pixels (samples) and keep the
ratio of positives to negatives at 1:1.
• The flag $f_{sel}$ is set to 1 for those pixels (samples) selected in a mini-batch.
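One way to turn the mining rule into a selection flag is sketched below. The function name and the flat-list layout are hypothetical, and splitting the negatives half hard and half random is an assumption of this sketch; the slide only fixes the top 1% rule, the kept positives, and the 1:1 ratio.

```python
import random


def select_samples(losses, labels, f_ign, hard_frac=0.01, seed=0):
    """Return f_sel (0/1 per pixel) for flat lists of losses, labels and f_ign."""
    rng = random.Random(seed)
    n = len(losses)
    positives = [i for i in range(n) if labels[i] == 1]
    # Rank all pixels by loss, highest first, and take the top hard_frac.
    ranked = sorted(range(n), key=lambda i: losses[i], reverse=True)
    top = set(ranked[:max(1, int(hard_frac * n))])
    # Candidate negatives exclude gray-zone pixels.
    negatives = [i for i in range(n) if labels[i] == 0 and f_ign[i] == 0]
    hard = [i for i in negatives if i in top]
    easy = [i for i in negatives if i not in top]
    # 1:1 ratio: as many negatives as positives, half of them hard (assumption).
    n_neg = min(len(positives), len(negatives))
    n_hard = min(len(hard), n_neg // 2)
    n_easy = min(len(easy), n_neg - n_hard)
    chosen = rng.sample(hard, n_hard) + rng.sample(easy, n_easy)
    f_sel = [0] * n
    for i in positives + chosen:
        f_sel[i] = 1
    return f_sel


labels = [1] * 10 + [0] * 990
losses = [0.1] * 10 + [i / 990 for i in range(990)]
f_ign = [0] * 1000
f_sel = select_samples(losses, labels, f_ign)
```

In this toy batch the ten highest-loss pixels are the last ten negatives, so five of them are selected alongside five random negatives and all ten positives.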

Algorithm
• Loss with Mask
• Now we can define the mask $M(t_i)$ for each sample $t_i = (y_i, d_i)$ as a function of the
flags mentioned above.
• Then, combining the classification (1) and bounding-box regression (2) losses with the masks,
our full multi-task loss can be represented as
(3)
(4)
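A sketch of how the mask and the two losses could be combined, under stated assumptions: the mask is 0 when a pixel is ignored or not selected and 1 otherwise, both losses are squared errors, the regression term counts only on positive pixels, and it is weighted by a balancing factor `lambda_loc` whose default of 3 is illustrative. The dict layout of a sample is hypothetical.

```python
def mask(f_ign, f_sel):
    """M(t_i): 0 when the pixel is ignored or not selected, 1 otherwise."""
    return 0 if (f_ign == 1 or f_sel == 0) else 1


def detection_loss(samples, lambda_loc=3.0):
    """Sum masked classification and regression losses over all samples."""
    total = 0.0
    for s in samples:
        m = mask(s["f_ign"], s["f_sel"])
        l_cls = (s["y"] - s["y_star"]) ** 2
        l_loc = sum((p - q) ** 2 for p, q in zip(s["d"], s["d_star"]))
        # Box regression only contributes on positive pixels.
        total += m * l_cls + lambda_loc * (s["y_star"] > 0) * m * l_loc
    return total


zero = (0, 0, 0, 0)
samples = [
    # Selected positive: both terms count.
    {"y": 0.5, "y_star": 1, "d": (1, 0, 0, 0), "d_star": zero, "f_ign": 0, "f_sel": 1},
    # Gray-zone pixel: masked out entirely.
    {"y": 0.9, "y_star": 0, "d": (5, 5, 5, 5), "d_star": zero, "f_ign": 1, "f_sel": 0},
    # Selected negative: classification term only.
    {"y": 0.5, "y_star": 0, "d": (2, 2, 2, 2), "d_star": zero, "f_ign": 0, "f_sel": 1},
]
loss = detection_loss(samples)
```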

Algorithm
• Model Design

Algorithm
• Landmark localization can be achieved in DenseBox just by stacking a few layers,
owing to the fully convolutional architecture.
Refine with Landmark Localization

Algorithm
Yann LeCun, “Learning Hierarchies of Invariant Features”, Center for Data Science &
Courant Institute, NYU, http://www.slideshare.net/yandex/yann-le-cun

Algorithm
• Landmark localization loss $L_{lm}$
• $\lambda_{det}$ and $\lambda_{lm}$ control the balance of the three tasks
• Refined detection loss $L_{rf}$
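Read together, the three bullets suggest a weighted sum of the detection, landmark, and refined detection losses. The form below is a sketch under that assumption; in particular, leaving $L_{rf}$ unweighted is inferred from the fact that only two balancing weights are named.

```latex
L_{full} = \lambda_{det}\, L_{det} + \lambda_{lm}\, L_{lm} + L_{rf}
```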

Experiments
1. Landmarks for faces (MALF Face Detection Task)
2. 8 landmarks for cars (KITTI Car Detection Task)

Experiments
• MALF Face Detection Task
• Each image is resized so that its longest side does not exceed 800 pixels.
• We test our model on each image at several scales. The test scale runs from $2^{-3}$ to
$2^{1.2}$ with a step of $2^{0.3}$. This setting enables our models to detect faces from
20 pixels to 400 pixels in height.
• Results of three versions of DenseBox on the MALF dataset:
• DenseBoxNoLandmark denotes DenseBox without landmarks in training.
• DenseBoxLandmark is the model incorporating landmark localization.
• DenseBoxEnsemble is the result of ensembling 10 DenseBox models with landmarks from different batch iterations.
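The test-scale arithmetic can be checked directly. This sketch assumes the model responds to faces about 50 pixels tall (the training face height), so a face of height h is detected at scale 50 / h; the variable names are illustrative.

```python
TRAIN_FACE_HEIGHT = 50  # pixels, the face height used in training patches

# Exponents -3.0, -2.7, ..., 1.2, built from integers to avoid float drift.
exponents = [(-30 + 3 * k) / 10 for k in range(15)]
scales = [2 ** e for e in exponents]

# A face of height h appears TRAIN_FACE_HEIGHT pixels tall at scale 50 / h.
largest_face = TRAIN_FACE_HEIGHT / scales[0]    # at the smallest scale
smallest_face = TRAIN_FACE_HEIGHT / scales[-1]  # at the largest scale
print(len(scales), largest_face, round(smallest_face, 1))
```

This gives 15 scales covering faces from roughly 22 up to 400 pixels, in line with the 20 to 400 pixel range stated above.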

Experiments
• KITTI Car Detection Task
• The key difficulty of the KITTI car detection task is that a large number of cars are small
and occluded. We selectively annotate 8 landmarks for large cars.
• The evaluation metric of the KITTI car detection task differs from general object detection.
KITTI requires an overlap of 70% for a true-positive bounding box, while other tasks such as
face detection require only 50% overlap.
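The effect of the stricter criterion can be seen on a single detection. This sketch assumes "overlap" means intersection over union, which the slide does not spell out; the function names and boxes are made up for illustration.

```python
def overlap(a, b):
    """Intersection over union of boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(det, gt, min_overlap):
    """A detection counts as correct if it overlaps the ground truth enough."""
    return overlap(det, gt) >= min_overlap


gt = (0, 0, 100, 100)
det = (20, 0, 120, 100)  # same size, shifted 20 pixels to the right

face_ok = is_true_positive(det, gt, 0.5)  # 50% criterion
car_ok = is_true_positive(det, gt, 0.7)   # 70% criterion (KITTI)
```

The shifted box overlaps the ground truth by two thirds, which passes the 50% criterion but fails the 70% one.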

Experiments
Different versions of DenseBox and Recall-Curve

Experiments
Method                        Moderate (%)   Easy (%)   Hard (%)
Regionlets[23]                76.45          84.75      59.70
AOG[19]                       74.26          84.24      60.51
3DVP[40]                      75.77          87.46      65.38
spCov_LBP                     77.40          87.19      60.60
DeepInsight                   84.40          84.59      76.09
NIPS ID 331                   87.14          88.33      76.11
DJML                          88.79          91.31      77.73
DenseBox (without landmark)   85.07          82.33      76.27
DenseBox (with landmark)      85.74          83.63      76.71

Experiments

Conclusion
• Performance can be boosted easily by incorporating landmark information.
• DenseBox achieves impressive performance on both face detection and car
detection tasks, demonstrating that it is highly suitable for such situations.
• The original DenseBox presented in this paper needs several seconds to process
one image, but this has been addressed in our later version.
