Generation of Synthetic Referring Expressions
for Object Segmentation in Videos
Author: Ioannis Kazakos
Table of Contents
1. Topic & Background
2. Relevant Literature
3. Motivation
4. Method
5. Experiments & Results
6. Conclusions
1. Topic & Background
Vision & Language
● Recently emerged research area
● Driven by the deep learning revolution and by independent successes in CV and NLP
○ CNNs, Object detection/segmentation models
○ LSTMs, Word embeddings
● Many applications
○ Autonomous driving
○ Assistance of visually impaired individuals
○ Interactive video editing
○ Navigation from vision and language etc.
Vision & Language Tasks
● Visual Question Answering, Agrawal et al. 2015
● Caption Generation, Vinyals et al. 2015
● Text to Images, Zhang et al. 2016
● And many more!
Object Segmentation with Referring Expressions
● Referring Expression
○ An accurate description of a specific object, but not of any other object in the current scene
○ Example:
■ “a woman” ❌
■ “a woman in red” ❌
■ “a woman in red on the right” ✅
■ “a woman in red top and blue shorts” ✅
● Object Segmentation
○ Assign a label to every pixel corresponding to the target object
Referring Expression Video Object Segmentation
2. Relevant Literature
Many works on images
● First work: “Segmentation from Natural Language Expressions”, Hu et al. 2016
● Subsequent works tried to jointly model vision and language features and leverage
attention to better capture dependencies between visual and linguistic features
● Most of these works use the Refer-It collection of datasets for training and evaluation
○ Three large-scale image datasets with referring expressions and segmentation masks
○ Collected on top of Microsoft COCO (Common Objects in Context)
○ RefCOCO, RefCOCO+ and RefCOCOg
● 142,209 referring expressions
● 50,000 objects
● 19,994 images
RefCOCO dataset
Expression = “right kid”
Expression = “left elephant”
Few works on videos
● “Video Object Segmentation with
Language Referring Expressions”,
Khoreva et al. 2018
○ DAVIS-2017: Big set of 78 object classes
○ Too few videos (150 in total)
○ They use a frame-based model
○ Pre-training on RefCOCO is used
● “Actor and action segmentation from a
sentence”, Gavrilyuk et al. 2018
○ A2D: Small set of object classes (only 8 actors)
○ J-HMDB: Single object in each video
DAVIS-2017
3. Motivation
Main Challenges
● Models
○ Temporal consistency across frames
○ Models’ size and complexity
● Data
○ No large-scale datasets for videos
○ Poor quality of crowdsourced referring expressions
■ ~10% fail to correctly describe the target object (no RE)
[Analysis of crowdsourced referring expressions on A2D and DAVIS-2017, from Bellver et al. 2020]
Method Inspiration
[Example frames from A2D and DAVIS-2017]
● Existing datasets include trivial cases where a single object from each class
appears
● In such cases an object can be identified using only its class e.g. saying “a
person” or “a horse”
● Existing large datasets for video object segmentation are labeled in terms of
object classes
● Annotating a large dataset with referring expressions requires tremendous
human effort
Basic Idea
Automatically generate synthetic referring expressions, starting from an object’s class and enhancing it with other cues, without any human annotation cost
Thesis Purpose
1. Propose a method for generating synthetic referring expressions for a large-scale video object segmentation dataset
2. Evaluate the effectiveness of the generated synthetic referring expressions for the task of video object segmentation with referring expressions
4. Method
YouTube-VIS Dataset
YouTube-VOS
→ Large-scale dataset for video object segmentation
→ Short YouTube videos of 3-6 seconds
→ 4,453 videos in total
→ 94 object categories
YouTube-VIS
→ Created on top of YouTube-VOS
→ 2,883 videos
→ 40 object classes
→ Exhaustively annotated = All objects belonging to
the 40 classes are labeled with pixel-wise masks.
● The formulation of our method allows its application to any other object
detection/segmentation dataset
● We apply our proposed method to the YouTube-VIS dataset
Overview
1. Ground-truth annotations
● Object class
● Bounding boxes
○ Relative size
○ Relative location
2. Faster R-CNN, Ren et al. 2015
● Enhanced with attribute head by Tang et al. 2020
● Pre-trained on Visual Genome dataset for attribute detection
○ Able to detect a predefined set of 201 attributes
○ Includes color and non-color attributes
○ Non-color attributes can be adjectives (“large”, “spotted”) or verbs (“surfing”)
Cues
1. Object Class (e.g. “a person”)
○ The class alone is enough only if a single object of this class is present in the video frame
○ However, in most cases more cues are necessary
Cues
2. Relative Size
○ The areas At and Ao of the target and of the other object’s bounding boxes are computed:
■ At >= 2Ao : “bigger” is added to the referring expression
■ At <= 0.5Ao : “smaller” is added, respectively
■ 0.5Ao < At < 2Ao : the relative size cue is not applicable
○ Similarly for more objects: “biggest”/“smallest” is used if the target is “bigger”/“smaller” than all other objects (see the sketch below)
“a bigger dog”
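Below is a minimal sketch of how the relative-size rule described above could be implemented. The factor-of-2 thresholds follow the bullets; the box format (x1, y1, x2, y2) and the function names are assumptions, not the thesis code.

```python
# Minimal sketch of the relative-size cue (illustrative only; box format and
# function names are assumptions, not the thesis implementation).

def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def relative_size_word(target_box, other_boxes, ratio=2.0):
    """Return 'bigger'/'smaller' (or 'biggest'/'smallest'), or None if not applicable."""
    a_t = box_area(target_box)
    areas = [box_area(b) for b in other_boxes]
    if not areas:
        return None
    if len(areas) == 1:
        a_o = areas[0]
        if a_t >= ratio * a_o:
            return "bigger"
        if a_t <= a_o / ratio:
            return "smaller"
        return None  # 0.5*Ao < At < 2*Ao: size cue not applicable
    # Several other objects: use the superlative only if it holds against all of them.
    if all(a_t >= ratio * a_o for a_o in areas):
        return "biggest"
    if all(a_t <= a_o / ratio for a_o in areas):
        return "smallest"
    return None

# Example: the target dog is more than twice as large as the other dog -> "a bigger dog"
print(relative_size_word((10, 10, 110, 110), [(120, 40, 170, 90)]))  # bigger
```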
Cues
3. Relative Location (1 or 2 other objects of the same class)
○ The most discriminative axis (X or Y) is determined using the bounding box boundaries
○ The maximum non-overlapping distance between the bounding boxes is calculated
○ If the distance is above a certain threshold, the relative location is added according to the axis found:
■ X-axis: “on the left” / “on the right”
■ Y-axis: “in the front” / “in the back”
○ For 3 objects, the relative locations of each pair of objects are combined (e.g. “in the middle”, “in the front left”, etc.) (see the sketch below)
Example: “rabbit on the left” (frame containing multiple detected rabbits)
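The pairwise relative-location rule can be sketched as follows. The pixel threshold value and the convention that the object lower in the frame is “in the front” are assumptions on top of the description above.

```python
# Minimal sketch of the relative-location cue for one pair of objects
# (illustrative only; the threshold value and the front/back convention are assumptions).

def relative_location_word(target_box, other_box, min_gap=20):
    """Boxes are (x1, y1, x2, y2) in pixels, with y growing downwards."""
    tx1, ty1, tx2, ty2 = target_box
    ox1, oy1, ox2, oy2 = other_box

    # Maximum non-overlapping distance along each axis (negative = boxes overlap on that axis).
    gap_x = max(ox1 - tx2, tx1 - ox2)
    gap_y = max(oy1 - ty2, ty1 - oy2)

    # The most discriminative axis is the one with the larger gap.
    if max(gap_x, gap_y) < min_gap:
        return None  # objects too close: location cue not applicable

    if gap_x >= gap_y:  # X-axis
        return "on the left" if tx2 <= ox1 else "on the right"
    # Y-axis; assumption: the object lower in the frame is "in the front".
    return "in the front" if ty1 >= oy2 else "in the back"

# Example: the target rabbit lies entirely to the left of the other rabbit
print(relative_location_word((5, 50, 60, 120), (150, 55, 210, 125)))  # on the left
```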
Cues
4. Attributes
○ The Faster R-CNN detections are matched to the target object using Intersection-over-Union
○ An attribute is added to the referring expression only if it is unique to the target object
○ Attributes can be colors, other adjectives (“spotted”, “large”) and verbs (“walking”, “surfing”)
○ We select up to 2 color attributes (e.g. “brown and black dog”) and 1 non-color attribute (e.g. “walking”) (see the sketch below)
Detected attributes: “white” (0.9250), “black” (0.8844), “brown” (0.8062) → generated expression: “a white rabbit”
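A possible implementation of the attribute cue is sketched below. The IoU matching and the uniqueness check follow the bullets above, while the detection format, threshold values and function names are assumptions.

```python
# Minimal sketch of the attribute cue (illustrative only; detection format,
# IoU threshold and count limits are assumptions based on the slide text).

def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def attribute_words(target_box, detections, colors, iou_thr=0.5,
                    max_colors=2, max_non_colors=1):
    """detections: list of dicts {"box": (x1, y1, x2, y2), "attributes": [...]}.
    Keeps only attributes of the matched detection that no other detection shares."""
    matched = max(detections, key=lambda d: box_iou(d["box"], target_box), default=None)
    if matched is None or box_iou(matched["box"], target_box) < iou_thr:
        return []
    others = [a for d in detections if d is not matched for a in d["attributes"]]
    unique = [a for a in matched["attributes"] if a not in others]
    picked_colors = [a for a in unique if a in colors][:max_colors]
    picked_other = [a for a in unique if a not in colors][:max_non_colors]
    return picked_colors + picked_other

# Example: only the target rabbit is detected as "white" -> ["white"]
dets = [{"box": (10, 10, 60, 60), "attributes": ["white", "black", "brown"]},
        {"box": (100, 10, 150, 60), "attributes": ["black", "brown"]}]
print(attribute_words((12, 11, 58, 59), dets, colors={"white", "black", "brown"}))
```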
SynthRef-YouTube-VIS
Examples of referring expressions generated with the proposed method (a sketch of how the cues could be composed into a full expression follows below)
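To make the end-to-end generation concrete, here is a hypothetical composition step that assembles the individual cues into one expression; the ordering and wording rules are assumptions and may differ from the thesis implementation.

```python
# Hypothetical composition of the cues into one expression (ordering and
# wording rules are assumptions; the thesis may compose them differently).
def compose_expression(object_class, colors=(), non_color=None, size_word=None, location=None):
    parts = ["a"]
    if size_word:
        parts.append(size_word)             # e.g. "bigger", "smallest"
    if colors:
        parts.append(" and ".join(colors))  # e.g. "brown and black"
    parts.append(object_class)
    if non_color:
        parts.append(non_color)             # e.g. "walking", "surfing"
    if location:
        parts.append(location)              # e.g. "on the left"
    return " ".join(parts)

print(compose_expression("rabbit", colors=["white"]))         # a white rabbit
print(compose_expression("dog", size_word="bigger"))          # a bigger dog
print(compose_expression("rabbit", location="on the left"))   # a rabbit on the left
```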
5. Experiments & Results
Model
We use the RefVOS model (Bellver et al. 2020) for the experiments:
● Frame-based model
● DeepLabv3 visual encoder
● BERT language encoder
● Multi-modal embedding obtained via multiplication (a minimal sketch of such a fusion follows below)
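Below is a minimal sketch of what a multiplicative vision-language fusion can look like, assuming a DeepLabv3-style feature map and a pooled BERT sentence embedding. The layer names, dimensions and the 1x1 projection layers are assumptions, not the actual RefVOS code.

```python
# Minimal sketch of multiplicative vision-language fusion (assumptions: feature
# sizes and projection layers; this is not the actual RefVOS implementation).
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    def __init__(self, visual_dim=256, text_dim=768, joint_dim=256):
        super().__init__()
        self.proj_visual = nn.Conv2d(visual_dim, joint_dim, kernel_size=1)
        self.proj_text = nn.Linear(text_dim, joint_dim)
        self.head = nn.Conv2d(joint_dim, 1, kernel_size=1)  # per-pixel mask logits

    def forward(self, visual_feats, text_embedding):
        # visual_feats: (B, visual_dim, H, W) from the visual encoder
        # text_embedding: (B, text_dim) pooled sentence embedding from the language encoder
        v = self.proj_visual(visual_feats)                    # (B, joint_dim, H, W)
        t = self.proj_text(text_embedding)[:, :, None, None]  # (B, joint_dim, 1, 1)
        joint = v * t                                         # element-wise multiplication
        return self.head(joint)                               # (B, 1, H, W) mask logits

fusion = MultiplicativeFusion()
logits = fusion(torch.randn(2, 256, 60, 60), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 1, 60, 60])
```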
Training Details
● Batch size of 8 video frames (2 GPUs)
● Frames are cropped/padded to 480x480
● SGD optimizer
● Learning rate policy depends on the target dataset
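For reference, a minimal PyTorch sketch of the optimizer setup described above; the learning rate and momentum values are assumptions, since the slides only state that the learning rate policy depends on the target dataset.

```python
# Minimal sketch of the optimizer setup (illustrative only; lr/momentum values
# are assumptions, and the model below is just a placeholder).
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=3)  # placeholder for the segmentation model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Each step would use a batch of 8 video frames, cropped/padded to 480x480.
```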
Evaluation Metrics
1. Region Similarity (J)
Jaccard Index (Intersection-over-Union) between the predicted and the ground-truth mask
2. Contour Accuracy (F)
F1-score of the contour-based precision Pc and recall Rc between the contour points of the predicted mask c(M) and of the ground truth c(G), computed via bipartite graph matching
3. Precision@X
Given a threshold X in the range [0.5, 0.9], a predicted mask for an object is counted as a true positive if its J is larger than X, and as a false positive otherwise. Precision is then computed as the ratio between the number of true positives and the total number of instances (a minimal sketch of J and Precision@X follows below)
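The region similarity and Precision@X metrics can be sketched on binary masks as follows (NumPy; this is illustrative, not the official DAVIS evaluation code, and the contour accuracy F is omitted because it requires contour extraction and bipartite matching).

```python
# Minimal sketch of region similarity (J) and Precision@X on binary masks
# (illustrative only; not the official DAVIS evaluation code).
import numpy as np

def jaccard(pred, gt):
    """pred, gt: boolean masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: count as a perfect prediction
    return np.logical_and(pred, gt).sum() / union

def precision_at(preds, gts, threshold):
    """Fraction of instances whose J exceeds the threshold (e.g. 0.5 ... 0.9)."""
    scores = [jaccard(p, g) for p, g in zip(preds, gts)]
    return sum(s > threshold for s in scores) / len(scores)

# Example with two toy 4x4 instances
p1 = np.zeros((4, 4), bool); p1[:2, :2] = True
g1 = np.zeros((4, 4), bool); g1[:2, :3] = True  # J = 4/6
p2 = np.zeros((4, 4), bool); p2[2:, 2:] = True
g2 = np.zeros((4, 4), bool); g2[2:, 2:] = True  # J = 1.0
print(jaccard(p1, g1), precision_at([p1, p2], [g1, g2], 0.7))  # ~0.667, 0.5
```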
Experiments
1. Extra pre-training of the model using the generated synthetic data and
evaluating on DAVIS-2017 and A2D Sentences datasets
Results on DAVIS-2017
[Results tables: evaluation on the DAVIS-2017 Validation split and on the Train & Validation splits, reported both without and with fine-tuning on the target dataset]
Qualitative Results on DAVIS-2017
[Qualitative comparison: model pre-trained only on RefCOCO vs. pre-trained on RefCOCO + SynthRef-YouTube-VIS]
Results on A2D Sentences
Referring expressions in A2D Sentences are focused on actions, consisting mostly of verbs and containing fewer attributes
Experiments
1. Pre-training the model using the generated synthetic data and evaluating on DAVIS-2017 and A2D Sentences datasets
2. Training on human vs synthetic referring expressions on the same videos
Refer-YouTube-VOS
● Seo et al. 2020 annotated the YouTube-VOS dataset with referring expressions
● This allowed a direct comparison of our synthetic referring expressions with human-produced
ones
Human vs Synthetic
Training:
1. Synthetic referring expressions from SynthRef-YouTube-VIS (our synthetic dataset)
2. Human-produced referring expressions from Refer-YouTube-VOS
Evaluation: On the test split of SynthRef-YouTube-VIS using human-produced referring expressions
from Refer-YouTube-VOS
Experiments
1. Pre-training the model using the generated synthetic data and evaluating on DAVIS-2017 and A2D Sentences datasets
2. Training on human vs synthetic referring expressions on the same videos
3. Ablation study
Ablation Study
● Impact of Synthetic Referring Expression Information (DAVIS-2017)
● Freezing the language branch for synthetic pre-training
6. Conclusions
1. Pre-training a model using the synthetic referring expressions, when it is additionally trained on real ones, increases its ability to generalize across different datasets.
2. Gains are higher when no fine-tuning is performed on the target dataset.
3. Synthetic referring expressions do not achieve better results than human-produced ones, but can be used as a complement without any additional annotation cost.
4. More information in the referring expressions yields better segmentation accuracy.
Future Work
● Extend the proposed method by adding more cues
○ Use scene-graph generation models to add relationships between objects (image from Xu et al. 2017)
● Apply the proposed method to other existing object detection/segmentation datasets
○ Create synthetic referring expressions for Microsoft COCO images, to be used interchangeably with RefCOCO
Thank you!
Questions?
