fully convolutional networks for semantic segmentation

UC Berkeley
Fully Convolutional Networks
for Semantic Segmentation
Jonathan Long* Evan Shelhamer* Trevor Darrell
Presented by: Gordon Christie
Slide credit: Jonathan Long

Overview
• Reinterpret standard classification convnets as “Fully
convolutional” networks (FCN) for semantic segmentation
• Use AlexNet, VGG, and GoogLeNet in experiments
• Novel architecture: combine information from different
layers for segmentation
• State-of-the-art segmentation for PASCAL VOC
2011/2012, NYUDv2, and SIFT Flow at the time
• Inference takes less than one fifth of a second for a typical
image

pixels in, pixels out
monocular depth estimation (Liu et al. 2015)
boundary prediction (Xie & Tu 2015)
semantic segmentation

“tabby cat”
1000-dim vector
< 1 millisecond
convnets perform classification
end-to-end learning

R-CNN
many seconds
“cat”
“dog”
R-CNN does detection

R-CNN
figure: Girshick et al.

< 1/5 second
end-to-end learning
???

“tabby cat”
a classification network

becoming fully convolutional

becoming fully convolutional

upsampling output

conv, pool, nonlinearity
upsampling
pixelwise output + loss
end-to-end, pixels-to-pixels network

Dense Predictions
• Shift-and-stitch: trick that yields dense predictions
without interpolation
• Upsampling via deconvolution
• Shift-and-stitch used in preliminary experiments, but not
included in final model
• Upsampling found to be more effective and efficient
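A minimal sketch of upsampling via deconvolution (transposed convolution), assuming NumPy and a single-channel score map. The helper names `bilinear_kernel` and `transposed_conv2d` are hypothetical and chosen only for illustration. The kernel here is a fixed bilinear one; in a network the same operation could also carry learned weights.

```python
import numpy as np

def bilinear_kernel(factor):
    """2-D bilinear interpolation kernel for upsampling by `factor` (illustrative helper)."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    pos = np.arange(size)
    weights_1d = 1 - np.abs(pos - center) / factor
    return np.outer(weights_1d, weights_1d)

def transposed_conv2d(x, kernel, stride):
    """Each input value stamps a scaled copy of `kernel` into the output, `stride` pixels apart."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

# Upsample a constant 3 x 3 coarse map by a factor of 2.
coarse = np.ones((3, 3))
up = transposed_conv2d(coarse, bilinear_kernel(2), stride=2)
print(up.shape)  # (8, 8); away from the border the constant value is preserved
```

The output is larger than 2x the input by the kernel overhang, so a real pipeline would crop it to the target size; that step is omitted here.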

Classifier to Dense FCN
• Convolutionalize proven classification architectures:
AlexNet, VGG, and GoogLeNet (reimplementation)
• Remove classification layer and convert all fully
connected layers to convolutions
• Append a 1x1 convolution with channel dimension 21 to
predict scores at each of the coarse output locations (20
categories + background for PASCAL)
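The conversion of a fully connected layer into a convolution can be sketched as follows, assuming NumPy and toy sizes. `fc_as_conv` is a hypothetical helper, not code from the authors. It reuses the fully connected weights unchanged as k x k filters, so on an input of the original size it reproduces the classifier output, and on a larger input it yields a grid of outputs.

```python
import numpy as np

def fc_as_conv(feature_map, fc_weights, fc_bias, k):
    """Apply a fully connected layer as a k x k convolution (stride 1, no padding).

    feature_map: (C, H, W); fc_weights: (num_out, C*k*k); fc_bias: (num_out,)
    Returns scores of shape (num_out, H-k+1, W-k+1).
    """
    c, h, w = feature_map.shape
    num_out = fc_weights.shape[0]
    kernels = fc_weights.reshape(num_out, c, k, k)  # same weights, viewed as filters
    out = np.empty((num_out, h - k + 1, w - k + 1))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            window = feature_map[:, i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(kernels, window, axes=3) + fc_bias
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3 * 4 * 4))   # toy FC layer: 3x4x4 input -> 5 outputs
b = rng.standard_normal(5)

x_small = rng.standard_normal((3, 4, 4))  # the size the FC layer was built for
x_large = rng.standard_normal((3, 6, 7))  # a larger input

dense = fc_as_conv(x_large, W, b, 4)
print(dense.shape)  # (5, 3, 4): one score vector per coarse output location
```

On `x_small` the convolutional form produces a single location whose scores equal `W @ x_small.ravel() + b`, which is the sense in which the classifier is reinterpreted rather than retrained.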

Classifier to Dense FCN
Cast ILSVRC classifiers into FCNs and compare
performance on the validation set of PASCAL VOC 2011

spectrum of deep features
combine where (local, shallow) with what (global, deep)
fuse features into deep jet
(cf. Hariharan et al. CVPR15 “hypercolumn”)

skip layers
interp + sum
interp + sum
dense output
end-to-end, joint learning
of semantics and location
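The "interp + sum" fusion can be sketched as below, assuming NumPy, 21 classes, and random toy score maps. The names `upsample2x` and `fuse` are hypothetical, and nearest-neighbour repetition stands in for the interpolation step to keep the example short. The variables `s32`, `s16`, and `s8` are illustrative class-score maps at strides 32, 16, and 8.

```python
import numpy as np

def upsample2x(scores):
    """Nearest-neighbour 2x upsampling of (K, H, W) scores (stand-in for interpolation)."""
    return scores.repeat(2, axis=1).repeat(2, axis=2)

def fuse(coarse_scores, fine_scores):
    """Upsample the coarse scores 2x and sum them with scores from a finer layer."""
    up = upsample2x(coarse_scores)
    if up.shape != fine_scores.shape:
        raise ValueError("fine scores must have exactly twice the coarse resolution")
    return up + fine_scores

rng = np.random.default_rng(0)
s32 = rng.standard_normal((21, 2, 2))  # scores from the deepest layer (stride 32)
s16 = rng.standard_normal((21, 4, 4))  # scores from a shallower layer (stride 16)
s8 = rng.standard_normal((21, 8, 8))   # scores from a still shallower layer (stride 8)

fcn16_scores = fuse(s32, s16)          # 1 skip
fcn8_scores = fuse(fcn16_scores, s8)   # 2 skips
print(fcn8_scores.shape)               # (21, 8, 8)
```

Each fusion doubles the resolution of the prediction, so the deep layer contributes the "what" and the shallower layers refine the "where".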

Comparison of skip FCNs
Results on a subset of the validation set of PASCAL VOC 2011

stride 32, no skips
stride 16, 1 skip
stride 8, 2 skips
ground truth
input image
skip layer refinement

training + testing
- train on a full image at a time, without patch sampling
- reshape network to take input of any size
- forward time is ~150ms for 500 x 500 x 21 output

Results – PASCAL VOC 2011/12
VOC 2011: 8498 training images (from additional labeled data)

Results – NYUDv2
Table 4. Results on NYUDv2. RGBD is early-fusion of the
RGB and depth channels at the input. HHA is the depth embedding
of [14] as horizontal disparity, height above ground, and
the angle of the local surface normal with the inferred gravity
direction. RGB-HHA is the jointly trained late fusion model
that sums RGB and HHA predictions.
                    pixel acc.  mean acc.  mean IU  f.w. IU
Gupta et al. [14]   60.3        -          28.6     47.0
FCN-32s RGB         60.0        42.2       29.2     43.9
FCN-32s RGBD        61.5        42.4       30.5     45.5
FCN-32s HHA         57.1        35.2       24.2     40.4
FCN-32s RGB-HHA     64.3        44.9       32.8     48.0
FCN-16s RGB-HHA     65.4        46.1       34.0     49.5
1449 RGB-D images with pixelwise labels → 40 categories
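The four metrics reported in the results tables (pixel accuracy, mean accuracy, mean IU, and frequency-weighted IU) can all be computed from a confusion matrix. The sketch below assumes NumPy and a matrix whose entry (i, j) counts pixels of true class i predicted as class j. The function name `segmentation_metrics` and the toy two-class matrix are illustrative, and every class is assumed to occur at least once so that no division by zero arises.

```python
import numpy as np

def segmentation_metrics(confusion):
    """Compute the four segmentation metrics from a (num_classes x num_classes) confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)            # n_ii: correctly labeled pixels per class
    true_total = confusion.sum(axis=1)      # t_i: pixels belonging to class i
    pred_total = confusion.sum(axis=0)      # pixels predicted as class i
    iu = correct / (true_total + pred_total - correct)  # per-class intersection over union
    return {
        "pixel_acc": correct.sum() / true_total.sum(),
        "mean_acc": np.mean(correct / true_total),
        "mean_iu": np.mean(iu),
        "fw_iu": (true_total * iu).sum() / true_total.sum(),
    }

# Toy example: 2 classes, 20 pixels in total.
metrics = segmentation_metrics([[8, 2],
                                [1, 9]])
print(metrics)
```

Mean IU penalizes both missed pixels and false alarms for each class, which is why it sits well below pixel accuracy in the tables when rare classes are segmented poorly.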

Results – SIFT Flow
2688 images with pixel labels
33 semantic categories, 3 geometric categories
Learn both label spaces jointly
→ learning and inference have similar performance and
computation to independent models
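One way to learn both label spaces jointly is a shared feature map feeding two prediction heads whose losses are summed. The sketch below assumes NumPy, 1x1-convolution heads, and softmax cross-entropy; the helper names, the 16-channel feature map, and the zero-initialized weights are all illustrative assumptions rather than details taken from the slides.

```python
import numpy as np

def conv1x1(features, weights):
    """1x1 convolution: features (D, H, W), weights (K, D) -> scores (K, H, W)."""
    return np.einsum("kd,dhw->khw", weights, features)

def pixelwise_cross_entropy(scores, labels):
    """Mean softmax cross-entropy over pixels; scores (K, H, W), integer labels (H, W)."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    rows, cols = np.indices(labels.shape)
    return -log_probs[labels, rows, cols].mean()

rng = np.random.default_rng(0)
features = rng.standard_normal((16, 4, 5))       # shared features from the network body
semantic_labels = rng.integers(0, 33, size=(4, 5))
geometric_labels = rng.integers(0, 3, size=(4, 5))

w_semantic = np.zeros((33, 16))                  # head for the 33 semantic categories
w_geometric = np.zeros((3, 16))                  # head for the 3 geometric categories

semantic_scores = conv1x1(features, w_semantic)
geometric_scores = conv1x1(features, w_geometric)
joint_loss = (pixelwise_cross_entropy(semantic_scores, semantic_labels)
              + pixelwise_cross_entropy(geometric_scores, geometric_labels))
print(joint_loss)  # with zero weights both heads are uniform: log(33) + log(3)
```

Because the heads share everything except their final layer, the extra cost of the second label space is small compared with running two independent models.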
Table 5. Results on SIFT Flow with class segmentation
(center) and geometric segmentation (right). Tighe [33] is
a non-parametric transfer method. Tighe 1 is an exemplar
SVM while 2 is SVM + MRF. Farabet is a multi-scale convnet
trained on class-balanced samples (1) or natural frequency
samples (2). Pinheiro is a multi-scale, recurrent convnet,
denoted RCNN3 (◦3). The metric for geometry is pixel accuracy.
                        pixel acc.  mean acc.  mean IU  f.w. IU  geom. acc.
Liu et al. [23]         76.7        -          -        -        -
Tighe et al. [33]       -           -          -        -        90.8
Tighe et al. [34] 1     75.6        41.1       -        -        -
Tighe et al. [34] 2     78.6        39.2       -        -        -
Farabet et al. [8] 1    72.3        50.8       -        -        -
Farabet et al. [8] 2    78.5        29.6       -        -        -
Pinheiro et al. [28]    77.7        29.8       -        -        -
FCN-16s                 85.2        51.7       39.5     76.1     94.3

FCN SDS* Truth Input
Relative to prior state-of-the-art SDS:
- 20% relative improvement for mean IoU
- 286× faster
*Simultaneous Detection and Segmentation, Hariharan et al. ECCV14
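The two headline figures are simple ratios. The slide does not give the underlying numbers, so the values below are assumptions taken from the results reported in the accompanying paper (mean IU on the VOC 2011 test set and approximate inference time per image); treat them as illustrative inputs rather than as part of this deck.

```python
# Assumed inputs (not stated on the slide): figures as reported in the paper.
sds_mean_iu = 52.6     # SDS, mean IU on VOC 2011 test
fcn_mean_iu = 62.7     # FCN-8s, mean IU on VOC 2011 test
sds_seconds = 50.0     # SDS, approximate inference time per image
fcn_seconds = 0.175    # FCN-8s, approximate inference time per image

relative_gain = (fcn_mean_iu - sds_mean_iu) / sds_mean_iu
speedup = sds_seconds / fcn_seconds

print(f"relative improvement in mean IU: {relative_gain:.1%}")
print(f"speedup: {speedup:.0f}x")
```

Under these inputs the relative gain comes out slightly above 19%, which the slide rounds to 20%.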

leaderboard
== segmentation with Caffe
[figure: leaderboard with 15 entries marked FCN]

conclusion
fully convolutional networks are fast, end-to-end models for pixelwise problems
- code in Caffe branch (merged soon)
- models for PASCAL VOC, NYUDv2, SIFT Flow, PASCAL-Context
caffe.berkeleyvision.org
github.com/BVLC/caffe
fcn.berkeleyvision.org
