UC Berkeley
Fully Convolutional Networks
for Semantic Segmentation
Jonathan Long* Evan Shelhamer* Trevor Darrell
1
Presented by: Gordon Christie
Slide credit: Jonathan Long
Overview
• Reinterpret standard classification convnets as “Fully
convolutional” networks (FCN) for semantic segmentation
• Use AlexNet, VGG, and GoogLeNet in experiments
• Novel architecture: combine information from different
layers for segmentation
• State-of-the-art segmentation for PASCAL VOC
2011/2012, NYUDv2, and SIFT Flow at the time
• Inference less than one fifth of a second for a typical
image
2
Slide credit: Jonathan Long
pixels in, pixels out
monocular depth estimation (Liu et al. 2015)
boundary prediction (Xie & Tu 2015)
semantic
segmentation
3
Slide credit: Jonathan Long
4
“tabby cat”
1000-dim vector
< 1 millisecond
convnets perform classification
end-to-end learning
Slide credit: Jonathan Long
R-CNN
5
many seconds
“cat”
“dog”
R-CNN does detection
Slide credit: Jonathan Long
6
R-CNN
figure: Girshick et al.
Slide credit: Jonathan Long
7
< 1/5 second
end-to-end learning
???
Slide credit: Jonathan Long
“tabby cat”
8
a classification network
Slide credit: Jonathan Long
9
becoming fully convolutional
Slide credit: Jonathan Long
10
becoming fully convolutional
Slide credit: Jonathan Long
11
upsampling output
Slide credit: Jonathan Long
conv, pool,
nonlinearity
upsampling
pixelwise
output + loss
end-to-end, pixels-to-pixels
network
12
Slide credit: Jonathan Long
Dense Predictions
• Shift-and-stitch: trick that yields dense predictions
without interpolation
• Upsampling via deconvolution
• Shift-and-stitch used in preliminary experiments, but not
included in final model
• Upsampling found to be more effective and efficient
13
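A minimal sketch of upsampling via deconvolution (transposed convolution) with a bilinear filter, the initialization mentioned in the editor's notes. It is 1-D and pure Python for brevity; the function names are illustrative and not taken from the authors' Caffe code.

```python
def bilinear_kernel_1d(factor):
    """1-D bilinear (triangle) filter for upsampling by an integer factor."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    return [1 - abs(i - center) / factor for i in range(size)]


def bilinear_kernel_2d(factor):
    """2-D bilinear filter as the outer product of the 1-D filter with itself."""
    k = bilinear_kernel_1d(factor)
    return [[a * b for b in k] for a in k]


def deconv_1d(signal, kernel, stride):
    """Transposed convolution: each input value stamps a scaled copy of the
    kernel into the output, with consecutive stamps `stride` samples apart."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


kernel = bilinear_kernel_1d(2)              # [0.25, 0.75, 0.75, 0.25]
upsampled = deconv_1d([1.0, 1.0, 1.0], kernel, stride=2)
```

Away from the borders a constant input stays constant after upsampling, which is what interpolation should do. In a network the same filter can be kept fixed or used as the starting point for learned upsampling.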
Classifier to Dense FCN
• Convolutionalize proven classification architectures:
AlexNet, VGG, and GoogLeNet (reimplementation)
• Remove classification layer and convert all fully
connected layers to convolutions
• Append 1x1 convolution with channel dimension 21 to
predict scores at each of the coarse output locations (20
categories + background for PASCAL)
14
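The conversion of a fully connected layer to a convolution can be sketched as follows. This is a single-channel toy example under assumed names (`fc`, `conv_valid`), not the authors' implementation: a fully connected unit is a dot product with a fixed-size patch, so sliding the same weights over a larger input yields a grid of outputs instead of a single one.

```python
def fc(weights, patch):
    """Fully connected unit: dot product of a weight grid with a same-size patch."""
    return sum(w * x
               for w_row, x_row in zip(weights, patch)
               for w, x in zip(w_row, x_row))


def conv_valid(weights, image):
    """The same weights slid over a larger image (stride 1, no padding)."""
    k = len(weights)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [[fc(weights, [row[c:c + k] for row in image[r:r + k]])
             for c in range(cols)]
            for r in range(rows)]


weights = [[1, 0],
           [0, 1]]

# On an input of the original size the "convolution" is exactly the FC output.
small = [[1, 2],
         [4, 5]]
single = conv_valid(weights, small)          # [[6]]

# On a larger input the same weights produce a coarse map of outputs.
large = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
score_map = conv_valid(weights, large)       # [[6, 8], [12, 14]]
```

The appended 1x1 scoring convolution then maps the feature vector at every location of this coarse map to 21 class scores.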
Classifier to Dense FCN
Cast ILSVRC classifiers into FCNs and compare
performance on validation set of PASCAL 2011
15
spectrum of deep features
combine where (local, shallow) with what (global, deep)
fuse features into deep jet
(cf. Hariharan et al. CVPR15 “hypercolumn”) 16
Slide credit: Jonathan Long
skip layers
interp + sum
interp + sum
dense output 17
end-to-end, joint learning
of semantics and location
Slide credit: Jonathan Long
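The "interp + sum" step can be sketched as below, assuming class scores have already been computed on both the coarse (deep) and the finer (shallow) layer. Nearest-neighbor 2x upsampling stands in for the bilinear or learned upsampling of the actual model, and the helper names are hypothetical.

```python
def upsample2x(grid):
    """Nearest-neighbor 2x upsampling of a 2-D score map (a simplification)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(coarse, fine):
    """Skip fusion: upsample the coarse scores and add the finer-layer scores."""
    up = upsample2x(coarse)
    return [[a + b for a, b in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse = [[1.0, 2.0]]                        # stride-32 scores ("what")
fine = [[0.5, 0.0, 0.0, 0.0],                # stride-16 scores ("where")
        [0.0, 0.0, 0.0, 0.5]]
fused = fuse(coarse, fine)
```

Applying the same step once more with an even shallower layer gives the stride-8 variant with two skips.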
skip layers
18
Comparison of skip FCNs
19
Results on subset of validation set of PASCAL VOC 2011
stride 32
no skips
stride 16
1 skip
stride 8
2 skips
ground truth
input image
skip layer refinement
20
Slide credit: Jonathan Long
training + testing
- train full image at a time without patch sampling
- reshape network to take input of any size
- forward time is ~150ms for 500 x 500 x 21 output
21
Slide credit: Jonathan Long
Results – PASCAL VOC 2011/12
VOC 2011: 8498 training images (from additional labeled data)
22
Results – NYUDv2
23
Table 4. Results on NYUDv2. RGBD is early-fusion of the RGB and depth
channels at the input. HHA is the depth embedding of [14] as horizontal
disparity, height above ground, and the angle of the local surface normal
with the inferred gravity direction. RGB-HHA is the jointly trained late
fusion model that sums RGB and HHA predictions.

                     pixel acc.  mean acc.  mean IU  f.w. IU
Gupta et al. [14]    60.3        -          28.6     47.0
FCN-32s RGB          60.0        42.2       29.2     43.9
FCN-32s RGBD         61.5        42.4       30.5     45.5
FCN-32s HHA          57.1        35.2       24.2     40.4
FCN-32s RGB-HHA      64.3        44.9       32.8     48.0
FCN-16s RGB-HHA      65.4        46.1       34.0     49.5
1449 RGB-D images with pixelwise labels → 40 categories
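The four metrics reported in the results tables (pixel accuracy, mean accuracy, mean IU, frequency-weighted IU) can all be derived from a confusion matrix. The sketch below uses the standard definitions and assumes every class occurs at least once in the ground truth, so no denominator is zero. The function name is illustrative.

```python
def segmentation_metrics(conf):
    """conf[i][j] = number of pixels of true class i predicted as class j."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    true_count = [sum(row) for row in conf]                    # pixels per true class
    pred_count = [sum(conf[i][j] for i in range(n)) for j in range(n)]
    correct = [conf[i][i] for i in range(n)]
    # Intersection over union per class: correct / (true + predicted - correct).
    iu = [correct[i] / (true_count[i] + pred_count[i] - correct[i])
          for i in range(n)]
    return {
        "pixel_acc": sum(correct) / total,
        "mean_acc": sum(correct[i] / true_count[i] for i in range(n)) / n,
        "mean_iu": sum(iu) / n,
        "fw_iu": sum(true_count[i] * iu[i] for i in range(n)) / total,
    }


# Two-class toy example with 10 pixels.
metrics = segmentation_metrics([[3, 1],
                                [1, 5]])
```

Mean IU weights every class equally, while the frequency-weighted variant weights each class by its share of pixels, which is why the two can rank methods differently.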
Results – SIFT Flow
2688 images with pixel labels
33 semantic categories, 3 geometric categories
Learn both label spaces jointly
→ learning and inference have similar performance and
computation as independent models
24
Table 5. Results on SIFT Flow with class segmentation (center) and
geometric segmentation (right). Tighe [33] is a non-parametric transfer
method. Tighe 1 is an exemplar SVM while 2 is SVM + MRF. Farabet is a
multi-scale convnet trained on class-balanced samples (1) or natural
frequency samples (2). Pinheiro is a multi-scale, recurrent convnet,
denoted RCNN3 (◦3). The metric for geometry is pixel accuracy.

                       pixel acc.  mean acc.  mean IU  f.w. IU  geom. acc.
Liu et al. [23]        76.7        -          -        -        -
Tighe et al. [33]      -           -          -        -        90.8
Tighe et al. [34] 1    75.6        41.1       -        -        -
Tighe et al. [34] 2    78.6        39.2       -        -        -
Farabet et al. [8] 1   72.3        50.8       -        -        -
Farabet et al. [8] 2   78.5        29.6       -        -        -
Pinheiro et al. [28]   77.7        29.8       -        -        -
FCN-16s                85.2        51.7       39.5     76.1     94.3
FCN SDS* Truth Input
25
Relative to prior state-of-the-art SDS:
- 20% relative improvement for mean IoU
- 286× faster
*Simultaneous Detection and Segmentation
Hariharan et al. ECCV14
Slide credit: Jonathan Long
leaderboard
== segmentation with Caffe
26
[Figure: leaderboard with entries labeled FCN]
Slide credit: Jonathan Long
conclusion
fully convolutional networks are fast,
end-to-end models for pixelwise problems
- code in Caffe branch (merged soon)
- models for PASCAL VOC, NYUDv2,
SIFT Flow, PASCAL-Context
27
caffe.berkeleyvision.org
github.com/BVLC/caffe
fcn.berkeleyvision.org
Slide credit: Jonathan Long


Editor's Notes

  • #2: Goal of the work is to use an FCN to predict the class at every pixel. Transfer existing classification models to dense prediction tasks.
  • #3: Note that using existing networks is transfer learning
  • #9: Note omissions; “activations”; fixed-size input, single-label output. Desire: efficient per-pixel output.
  • #14: “Final layer deconvolutional filters are fixed to bilinear interpolation, while intermediate upsampling layers are initialized to bilinear upsampling”. Changing only the filters and layer strides of a convnet can produce the same output as this shift-and-stitch trick.
  • #15: “Despite similar classification accuracy, our implementation of GoogLeNet did not match this segmentation result.”
  • #16: THESE ARE VAL NUMBERS. Just begun and they are already state of the art. They initialize using the classification models trained on ImageNet. Train with per-pixel multinomial loss and validate with mean intersection over union.
  • #19: “Max fusion made learning difficult due to gradient switching.” Decreasing the stride of pooling layers is the most straightforward way to obtain finer predictions. However, doing so is problematic for our VGG16-based net. Setting the pool5 layer to have stride 1 requires our convolutionalized fc6 to have a kernel size of 14 × 14 in order to maintain its receptive field size. In addition to their computational cost, we had difficulty learning such large filters. We made an attempt to re-architect the layers above pool5 with smaller filters, but were not successful in achieving comparable performance; one possible explanation is that the initialization from ImageNet-trained weights in the upper layers is important.
  • #20: Fixed = only fine tuning in final layer
  • #23: For the following 3 results, dropout was used where it was used in the original network. SDS: MCG proposals, feature extraction, SVM to classify, region refinement.
  • #24: Gupta: region proposals (using depth and RGB), deep features for depth and RGB, SVM classifier, segmentation. Gupta et al. encode depth differently (surface normals and height from ground included). RGBD (early fusion): little improvement, perhaps because it is difficult to propagate meaningful gradients through the model. To add depth information, we train on a model upgraded to take four-channel RGB-D input (early fusion).
  • #25: Semantic: bridge, mountain, sun, etc. Geometric: horizontal, vertical, sky. Farabet: multi-scale convnet, averaging class predictions across superpixels. Pinheiro: patch-based learning using multiple scales with RCNNs (recurrent convnets).
  • #26: + NYUD net for multi-modal input and SIFT Flow net for multi-task output
  • #27: Many segmentation methods powered by Caffe, most FCNs
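
The per-pixel multinomial loss mentioned in note #16 can be sketched as a softmax cross-entropy evaluated independently at every pixel. Averaging over pixels is an assumption made here for readability (summing is an equally valid normalization), and the function name is illustrative.

```python
import math


def pixelwise_loss(scores, labels):
    """Mean softmax cross-entropy over all pixels.

    scores[y][x] is a list of per-class scores for that pixel;
    labels[y][x] is the index of the true class.
    """
    total, count = 0.0, 0
    for score_row, label_row in zip(scores, labels):
        for pixel_scores, label in zip(score_row, label_row):
            # Numerically stable log-sum-exp of the class scores.
            peak = max(pixel_scores)
            log_z = peak + math.log(sum(math.exp(s - peak) for s in pixel_scores))
            total += log_z - pixel_scores[label]
            count += 1
    return total / count


# A 1 x 2 image with two classes and uninformative (all-zero) scores:
# every pixel contributes log(2), so the mean is log(2).
loss = pixelwise_loss([[[0.0, 0.0], [0.0, 0.0]]], [[0, 1]])
```

Because the loss is a sum of independent per-pixel terms, a whole image can be trained on in one forward and backward pass, without patch sampling.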