Generation of Synthetic Referring Expressions
for Object Segmentation in Videos
Author: Ioannis Kazakos
Table of Contents
1. Topic & Background
2. Relevant Literature
3. Motivation
4. Method
5. Experiments & Results
6. Conclusions
1. Topic & Background
Vision & Language
● Recently emerged research area
● Driven by the deep learning revolution and by independent successes in CV and NLP
○ CNNs, Object detection/segmentation models
○ LSTMs, Word embeddings
● Many applications
○ Autonomous driving
○ Assistance of visually impaired individuals
○ Interactive video editing
○ Navigation from vision and language etc.
Vision & Language Tasks
● Visual Question Answering, Agrawal et al. 2015
● Caption Generation, Vinyals et al. 2015
● Text to Images, Zhang et al. 2016
● And many more!
Object Segmentation with Referring Expressions
● Referring Expression
○ An accurate description of a specific object, but not of any other object in the current scene
○ Example:
■ “a woman” ❌
■ “a woman in red” ❌
■ “a woman in red on the right” ✅
■ “a woman in red top and blue shorts” ✅
● Object Segmentation
○ Assign a label to every pixel corresponding to the target object
Referring Expression Video Object Segmentation
2. Relevant Literature
Many works on images
● First work: “Segmentation from Natural Language Expressions”, Hu et al. 2016
● Subsequent works tried to jointly model vision and language features and leverage
attention to better capture dependencies between visual and linguistic features
● Most of these works use the Refer-It collection of datasets for training and evaluation
○ Three large-scale image datasets with referring expressions and segmentation masks
○ Collected on top of Microsoft COCO (Common Objects in Context)
○ RefCOCO, RefCOCO+ and RefCOCOg
● 142,209 referring expressions
● 50,000 objects
● 19,994 images
RefCOCO dataset
Expression = “right kid”
Expression = “left elephant”
Few works on videos
● “Video Object Segmentation with
Language Referring Expressions”,
Khoreva et al. 2018
○ DAVIS-2017: Big set of 78 object classes
○ Too few videos (150 in total)
○ They use a frame-based model
○ Pre-training on RefCOCO is used
● “Actor and action segmentation from a
sentence”, Gavrilyuk et al. 2018
○ A2D: Small set of object classes (only 8 actors)
○ J-HMDB: Single object in each video
DAVIS-2017
3. Motivation
Main Challenges
● Models
○ Temporal consistency across frames
○ Models’ size and complexity
● Data
○ No large-scale datasets for videos
○ Poor quality of crowdsourced referring expressions
■ ~10% fail to correctly describe the target object (no RE)
[Analysis of crowdsourced referring expressions on A2D and DAVIS-2017, from Bellver et al. 2020]
Method Inspiration
[Example frames from A2D and DAVIS-2017]
● Existing datasets include trivial cases where a single object from each class
appears
● In such cases an object can be identified using only its class e.g. saying “a
person” or “a horse”
● Existing large datasets for video object segmentation are labeled in terms of
object classes
● Annotating a large dataset with referring expressions requires tremendous
human effort
Basic Idea
Automatically generate synthetic referring expressions, starting from an object’s class and enhancing it with other cues, without any human annotation cost
Thesis Purpose
1. Propose a method for generating synthetic referring expressions for a large-scale video object segmentation dataset
2. Evaluate the effectiveness of the generated synthetic referring expressions for the task of video object segmentation with referring expressions
4. Method
YouTube-VIS Dataset
YouTube-VOS
→ Large-scale dataset for video object segmentation
→ Short YouTube videos of 3-6 seconds
→ 4,453 videos in total
→ 94 object categories
YouTube-VIS
→ Created on top of YouTube-VOS
→ 2,883 videos
→ 40 object classes
→ Exhaustively annotated = All objects belonging to
the 40 classes are labeled with pixel-wise masks.
● The formulation of our method allows its application to any other object
detection/segmentation dataset
● We apply our proposed method to the YouTube-VIS dataset
Overview
1. Ground-truth annotations
● Object class
● Bounding boxes
○ Relative size
○ Relative location
2. Faster R-CNN, Ren et al. 2015
● Enhanced with attribute head by Tang et al. 2020
● Pre-trained on Visual Genome dataset for attribute detection
○ Able to detect a predefined set of 201 attributes
○ Includes color and non-color attributes
○ Non-color attributes can be adjectives (“large”, “spotted”) or verbs (“surfing”)
Cues
1. Object Class (e.g. “a person”)
○ The class alone is enough only if a single object of this class is present in the video frame
○ However, in most cases more cues are necessary
Cues
2. Relative Size
○ The areas At and Ao of the target and of the other object’s bounding boxes are computed:
■ At >= 2Ao : “bigger” is added to the referring expression
■ At <= 0.5Ao : “smaller” is added, respectively
■ 0.5Ao < At < 2Ao : the relative size cue is not applicable
○ Similarly for more objects: “biggest”/“smallest” is used if the target is “bigger”/“smaller” than all other objects (see the sketch below)
“a bigger dog”
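Below is a minimal sketch of how the relative-size rule described above could be implemented. The factor-of-2 thresholds follow the bullets; the box format (x1, y1, x2, y2) and the function names are assumptions, not the thesis code.

```python
# Minimal sketch of the relative-size cue (illustrative only; box format and
# function names are assumptions, not the thesis implementation).

def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def relative_size_word(target_box, other_boxes, ratio=2.0):
    """Return 'bigger'/'smaller' (or 'biggest'/'smallest'), or None if not applicable."""
    a_t = box_area(target_box)
    areas = [box_area(b) for b in other_boxes]
    if not areas:
        return None
    if len(areas) == 1:
        a_o = areas[0]
        if a_t >= ratio * a_o:
            return "bigger"
        if a_t <= a_o / ratio:
            return "smaller"
        return None  # 0.5*Ao < At < 2*Ao: size cue not applicable
    # Several other objects: use the superlative only if it holds against all of them.
    if all(a_t >= ratio * a_o for a_o in areas):
        return "biggest"
    if all(a_t <= a_o / ratio for a_o in areas):
        return "smallest"
    return None

# Example: the target dog is more than twice as large as the other dog -> "a bigger dog"
print(relative_size_word((10, 10, 110, 110), [(120, 40, 170, 90)]))  # bigger
```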
Cues
3. Relative Location (1 or 2 other objects of the same class)
○ The most discriminative axis (X or Y) is determined using the bounding box boundaries
○ The maximum non-overlapping distance between the bounding boxes is calculated
○ If the distance is above a certain threshold, the relative location is added according to the axis found:
■ X-axis: “on the left” / “on the right”
■ Y-axis: “in the front” / “in the back”
○ For 3 objects, the relative locations of each pair of objects are combined (e.g. “in the middle”, “in the front left”, etc.) (see the sketch below)
Example: “rabbit on the left” (frame containing multiple detected rabbits)
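The pairwise relative-location rule can be sketched as follows. The pixel threshold value and the convention that the object lower in the frame is “in the front” are assumptions on top of the description above.

```python
# Minimal sketch of the relative-location cue for one pair of objects
# (illustrative only; the threshold value and the front/back convention are assumptions).

def relative_location_word(target_box, other_box, min_gap=20):
    """Boxes are (x1, y1, x2, y2) in pixels, with y growing downwards."""
    tx1, ty1, tx2, ty2 = target_box
    ox1, oy1, ox2, oy2 = other_box

    # Maximum non-overlapping distance along each axis (negative = boxes overlap on that axis).
    gap_x = max(ox1 - tx2, tx1 - ox2)
    gap_y = max(oy1 - ty2, ty1 - oy2)

    # The most discriminative axis is the one with the larger gap.
    if max(gap_x, gap_y) < min_gap:
        return None  # objects too close: location cue not applicable

    if gap_x >= gap_y:  # X-axis
        return "on the left" if tx2 <= ox1 else "on the right"
    # Y-axis; assumption: the object lower in the frame is "in the front".
    return "in the front" if ty1 >= oy2 else "in the back"

# Example: the target rabbit lies entirely to the left of the other rabbit
print(relative_location_word((5, 50, 60, 120), (150, 55, 210, 125)))  # on the left
```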
Cues
4. Attributes
○ The Faster R-CNN detections are matched to the target object using Intersection-over-Union
○ An attribute is added to the referring expression only if it is unique to the target object
○ Attributes can be colors, other adjectives (“spotted”, “large”) and verbs (“walking”, “surfing”)
○ We select up to 2 color attributes (e.g. “brown and black dog”) and 1 non-color attribute (e.g. “walking”) (see the sketch below)
Detected attributes: “white” (0.9250), “black” (0.8844), “brown” (0.8062) → generated expression: “a white rabbit”
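A possible implementation of the attribute cue is sketched below. The IoU matching and the uniqueness check follow the bullets above, while the detection format, threshold values and function names are assumptions.

```python
# Minimal sketch of the attribute cue (illustrative only; detection format,
# IoU threshold and count limits are assumptions based on the slide text).

def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def attribute_words(target_box, detections, colors, iou_thr=0.5,
                    max_colors=2, max_non_colors=1):
    """detections: list of dicts {"box": (x1, y1, x2, y2), "attributes": [...]}.
    Keeps only attributes of the matched detection that no other detection shares."""
    matched = max(detections, key=lambda d: box_iou(d["box"], target_box), default=None)
    if matched is None or box_iou(matched["box"], target_box) < iou_thr:
        return []
    others = [a for d in detections if d is not matched for a in d["attributes"]]
    unique = [a for a in matched["attributes"] if a not in others]
    picked_colors = [a for a in unique if a in colors][:max_colors]
    picked_other = [a for a in unique if a not in colors][:max_non_colors]
    return picked_colors + picked_other

# Example: only the target rabbit is detected as "white" -> ["white"]
dets = [{"box": (10, 10, 60, 60), "attributes": ["white", "black", "brown"]},
        {"box": (100, 10, 150, 60), "attributes": ["black", "brown"]}]
print(attribute_words((12, 11, 58, 59), dets, colors={"white", "black", "brown"}))
```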
SynthRef-YouTube-VIS
Examples of referring expressions generated with the proposed method (a sketch of how the cues could be composed into a full expression follows below)
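To make the end-to-end generation concrete, here is a hypothetical composition step that assembles the individual cues into one expression; the ordering and wording rules are assumptions and may differ from the thesis implementation.

```python
# Hypothetical composition of the cues into one expression (ordering and
# wording rules are assumptions; the thesis may compose them differently).
def compose_expression(object_class, colors=(), non_color=None, size_word=None, location=None):
    parts = ["a"]
    if size_word:
        parts.append(size_word)             # e.g. "bigger", "smallest"
    if colors:
        parts.append(" and ".join(colors))  # e.g. "brown and black"
    parts.append(object_class)
    if non_color:
        parts.append(non_color)             # e.g. "walking", "surfing"
    if location:
        parts.append(location)              # e.g. "on the left"
    return " ".join(parts)

print(compose_expression("rabbit", colors=["white"]))         # a white rabbit
print(compose_expression("dog", size_word="bigger"))          # a bigger dog
print(compose_expression("rabbit", location="on the left"))   # a rabbit on the left
```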
5. Experiments & Results
Model
We use the RefVOS model (Bellver et al. 2020) for the experiments:
● Frame-based model
● DeepLabv3 visual encoder
● BERT language encoder
● Multi-modal embedding obtained via multiplication (a minimal sketch of such a fusion follows below)
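Below is a minimal sketch of what a multiplicative vision-language fusion can look like, assuming a DeepLabv3-style feature map and a pooled BERT sentence embedding. The layer names, dimensions and the 1x1 projection layers are assumptions, not the actual RefVOS code.

```python
# Minimal sketch of multiplicative vision-language fusion (assumptions: feature
# sizes and projection layers; this is not the actual RefVOS implementation).
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    def __init__(self, visual_dim=256, text_dim=768, joint_dim=256):
        super().__init__()
        self.proj_visual = nn.Conv2d(visual_dim, joint_dim, kernel_size=1)
        self.proj_text = nn.Linear(text_dim, joint_dim)
        self.head = nn.Conv2d(joint_dim, 1, kernel_size=1)  # per-pixel mask logits

    def forward(self, visual_feats, text_embedding):
        # visual_feats: (B, visual_dim, H, W) from the visual encoder
        # text_embedding: (B, text_dim) pooled sentence embedding from the language encoder
        v = self.proj_visual(visual_feats)                    # (B, joint_dim, H, W)
        t = self.proj_text(text_embedding)[:, :, None, None]  # (B, joint_dim, 1, 1)
        joint = v * t                                         # element-wise multiplication
        return self.head(joint)                               # (B, 1, H, W) mask logits

fusion = MultiplicativeFusion()
logits = fusion(torch.randn(2, 256, 60, 60), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 1, 60, 60])
```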
Training Details
● Batch size of 8 video frames (2 GPUs)
● Frames are cropped/padded to 480x480
● SGD optimizer
● Learning rate policy depends on the target dataset
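For reference, a minimal PyTorch sketch of the optimizer setup described above; the learning rate and momentum values are assumptions, since the slides only state that the learning rate policy depends on the target dataset.

```python
# Minimal sketch of the optimizer setup (illustrative only; lr/momentum values
# are assumptions, and the model below is just a placeholder).
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=3)  # placeholder for the segmentation model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Each step would use a batch of 8 video frames, cropped/padded to 480x480.
```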
Evaluation Metrics
1. Region Similarity (J)
Jaccard Index (Intersection-over-Union) between the predicted and the ground-truth mask
2. Contour Accuracy (F)
F1-score of the contour-based precision Pc and recall Rc between the contour points of the predicted mask c(M) and of the ground truth c(G), computed via bipartite graph matching
3. Precision@X
Given a threshold X in the range [0.5, 0.9], a predicted mask for an object is counted as a true positive if its J is larger than X, and as a false positive otherwise. Precision is then computed as the ratio between the number of true positives and the total number of instances (a minimal sketch of J and Precision@X follows below)
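The region similarity and Precision@X metrics can be sketched on binary masks as follows (NumPy; this is illustrative, not the official DAVIS evaluation code, and the contour accuracy F is omitted because it requires contour extraction and bipartite matching).

```python
# Minimal sketch of region similarity (J) and Precision@X on binary masks
# (illustrative only; not the official DAVIS evaluation code).
import numpy as np

def jaccard(pred, gt):
    """pred, gt: boolean masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: count as a perfect prediction
    return np.logical_and(pred, gt).sum() / union

def precision_at(preds, gts, threshold):
    """Fraction of instances whose J exceeds the threshold (e.g. 0.5 ... 0.9)."""
    scores = [jaccard(p, g) for p, g in zip(preds, gts)]
    return sum(s > threshold for s in scores) / len(scores)

# Example with two toy 4x4 instances
p1 = np.zeros((4, 4), bool); p1[:2, :2] = True
g1 = np.zeros((4, 4), bool); g1[:2, :3] = True  # J = 4/6
p2 = np.zeros((4, 4), bool); p2[2:, 2:] = True
g2 = np.zeros((4, 4), bool); g2[2:, 2:] = True  # J = 1.0
print(jaccard(p1, g1), precision_at([p1, p2], [g1, g2], 0.7))  # ~0.667, 0.5
```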
Experiments
1. Extra pre-training of the model using the generated synthetic data and
evaluating on DAVIS-2017 and A2D Sentences datasets
Results on DAVIS-2017
[Results tables: evaluation on the DAVIS-2017 Validation split and on the Train & Validation splits, reported both without and with fine-tuning on the target dataset]
Qualitative Results on DAVIS-2017
[Qualitative comparison: model pre-trained only on RefCOCO vs. pre-trained on RefCOCO + SynthRef-YouTube-VIS]
Results on A2D Sentences
Referring expressions in A2D Sentences are focused on actions, consisting mostly of verbs and containing fewer attributes
Experiments
1. Pre-training the model using the generated synthetic data and evaluating on DAVIS-2017 and A2D Sentences datasets
2. Training on human vs synthetic referring expressions on the same videos
Refer-YouTube-VOS
● Seo et al. 2020 annotated the YouTube-VOS dataset with referring expressions
● This allowed a direct comparison of our synthetic referring expressions with human-produced
ones
Human vs Synthetic
Training:
1. Synthetic referring expressions from SynthRef-YouTube-VIS (our synthetic dataset)
2. Human-produced referring expressions from Refer-YouTube-VOS
Evaluation: On the test split of SynthRef-YouTube-VIS using human-produced referring expressions
from Refer-YouTube-VOS
Experiments
1. Pre-training the model using the generated synthetic data and evaluating on DAVIS-2017 and A2D Sentences datasets
2. Training on human vs synthetic referring expressions on the same videos
3. Ablation study
Ablation Study
● Impact of Synthetic Referring Expression Information (DAVIS-2017)
● Freezing the language branch for synthetic pre-training
6. Conclusions
1. Pre-training a model using the synthetic referring expressions, when it is additionally trained on real ones, increases its ability to generalize across different datasets.
2. Gains are higher when no fine-tuning is performed on the target dataset.
3. Synthetic referring expressions do not achieve better results than human-produced ones, but can be used as a complement without any additional annotation cost.
4. More information in the referring expressions yields better segmentation accuracy.
Future Work
● Extend the proposed method by adding more cues
○ Use scene-graph generation models to add relationships between objects (image from Xu et al. 2017)
● Apply the proposed method to other existing object detection/segmentation datasets
○ Create synthetic referring expressions for Microsoft COCO images, to be used interchangeably with RefCOCO
Thank you!
Questions?
