Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)

Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
Slides @DocXavi
Tutorial:
One Perceptron to Rule Them All
Part III: Language & Vision

2
Acknowledgments
Mariona Carós
Benet Oriol
Amaia Salvador
Santiago Pascual
Marta R. Costa-jussà
Francisco Roldan
Issey Masuda
Ionut Sorodoc
Carina Silberer
Gemma Boleda
Carles Ventura
Ioannis Kazakos
Míriam Bellver
Alba M. Herrera
Amanda Duarte

4
Outline
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks

5
Outline
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks

6
Encoder Decoder
Representation

7
#ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
Image Captioning with RNN

8
#DeepImageSent Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions."
CVPR 2015 (Slides by Marc Bolaños)
Image Captioning with RNN

9
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
Image Captioning with RNN & Attention

10
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
Image Captioning with RNN & Attention

11
Cornia, Marcella, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. "Meshed-Memory Transformer for Image
Captioning." CVPR 2020. [tweet]
Image Captioning with Transformers

12
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning

13
XAVI: “man has short hair”, “man with short hair”
AMAIA: “a woman wearing a black shirt”
BOTH: “two men wearing black glasses”
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning

14
Recipe Generation
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.

15
Recipe Generation
Title: Edamame corn salad
Ingredients
pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil
Instructions
- In a large bowl, combine edamame, corn, red onion, cilantro,
avocado, and red bell pepper.
- In a small bowl, whisk together olive oil, vinegar, salt, and
pepper.
- Pour dressing over edamame mixture and toss to coat.
- Cover and refrigerate for at least 1 hour before serving.
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.

16
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.
Fighting Data Bias in Captioning

17
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.
Fighting Data Bias in Captioning

18
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor
Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. [code]
Video Captioning

19
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural
Encoder for Video Representation with Application to Captioning, CVPR 2016.
[Figure labels: LSTM unit (2nd layer); time axis from t = 1 to t = T; image input; hidden state at t = T; first chunk of data]
Captioning: Video

20
Multimodal Machine Translation
Challenge on Multimodal Image Translation:
http://www.statmt.org/wmt17/multimodal-task.html#task1

21
Multimodal Machine Translation
Sulubacak, Umut, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann.
"Multimodal machine translation through visuals and speech." Machine Translation (2020): 1-51. [tweet]

22
Sign Language Translation with RNN+Att
Camgoz, Necati Cihan, et al. Neural Sign Language Translation. CVPR 2018.

23
Sign Language Translation with Transformers
Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden, “Sign Language Transformers: Joint
End-to-end Sign Language Recognition and Translation” CVPR 2020.

24
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading."
(2016).

25
Lip Reading
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level
Lipreading." (2016).

26
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild."
CVPR 2017

27
Lipreading: Watch, Listen, Attend & Spell
[Figure labels: audio features, image features]
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017

28
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
Attention over output states from audio and video is computed at each timestep.
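The per-timestep attention over both streams can be illustrated with a minimal sketch. It assumes dot-product scoring and random toy encoder outputs; the names `attend`, `video_states`, `audio_states` and `decoder_state` are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, states):
    """Dot-product attention (an assumption): returns the context vector and weights."""
    weights = softmax(states @ query)      # one weight per encoder step
    return weights @ states, weights       # weighted sum of encoder states

rng = np.random.default_rng(0)
dim = 8
video_states = rng.normal(size=(5, dim))    # toy video encoder outputs (5 steps)
audio_states = rng.normal(size=(12, dim))   # toy audio encoder outputs (12 steps)
decoder_state = rng.normal(size=dim)        # decoder state at the current output step

# One attention pass per modality, repeated at every decoding timestep.
video_context, video_weights = attend(decoder_state, video_states)
audio_context, audio_weights = attend(decoder_state, audio_states)

# The two context vectors are combined before predicting the next character.
fused = np.concatenate([video_context, audio_context])
```

Keeping one attention distribution per modality lets the two streams have different lengths, as they do here.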

29
Lipreading
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "Deep Lip Reading: a comparison of models and an online
application." Interspeech 2018.

30
Image Captioning Grounded on Detected Objects
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]

31
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
Image Captioning Grounded on Detected Objects

32
Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal
Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code]
Image Captioning Grounded on Heatmaps

33
Outline
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks

34
Encoder Decoder
Representation

35
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016.
Image Generation

36
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016. [code]
Image Generation

37
Image Synthesis
#StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas.
"Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]

38
Image Synthesis with Cycle Consistency
#MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by
redescription." CVPR 2019. [code]

39
Image Synthesis with Cycle Consistency
#MirrorGAN Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. "Mirrorgan: Learning text-to-image generation by
redescription." CVPR 2019. [code]

40
Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018
Image Generation via Scene Graphs

41
#Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual
Descriptions." CVPR 2019 [blog].

42
#CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to
compositions to videos." ECCV 2018
Video Generation by Composition

43
Saunders, B., Camgoz, N. C., & Bowden, R. (2020). Progressive Transformers for End-to-End Sign Language Production.
ECCV 2020.
Sign Language Generation with Transformers

44
Lucas Ventura, Amanda Duarte, Xavier Giro-i-Nieto, “Can Everybody Sign Now? Exploring Sign Language
Video Generation from 2D Poses”. ECCV SLRTP Workshop 2020.
Sign Language Generation (pose 2 pixels)

45
Outline
1. Generative Models
a. Text
b. Vision
2. Discriminative Models
a. Text
b. Vision
3. Representation Learning
4. Control Tasks

46
Encoder
Decoder
Representation
Encoder
Representation

47
Visual Question Answering
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA:
Visual question answering." CVPR 2015.

48
Visual Question Answering (VQA)
Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual
Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).

49
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter
prediction. CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
Visual Question Answering (VQA)

50
VQA: Dynamic Memory Networks
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for
Visual and Textual Question Answering." ICML 2016

51
Visual Reasoning
#Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick.
"CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017

52
Visual Reasoning: Programming
(Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry
Zitnick, Ross Girshick, “Inferring and Executing Programs for Visual Reasoning”. ICCV 2017
[Figure labels: Program Generator, Execution Engine]

53
Visual Reasoning: Relation Networks
#RN Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy
Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017.
Relation Networks concatenate every possible pair of objects with an encoded question, process each
combination with a shared MLP, sum the results, and find the answer with a second MLP.
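The pairing described on this slide can be sketched in a few lines. This is a toy illustration with random, untrained weights: every ordered pair of objects is concatenated with the question, a shared MLP `g` scores each combination, the scores are summed, and a second MLP `f` maps the sum to answer logits. The sizes and the helper names (`mlp`, `relation_network`) are assumptions made for the example.

```python
import numpy as np

def mlp(x, w1, w2):
    # Two-layer perceptron with a ReLU in between.
    return np.maximum(x @ w1, 0.0) @ w2

def relation_network(objects, question, g_params, f_params):
    n = objects.shape[0]
    # Concatenate every ordered pair of objects with the encoded question.
    pairs = [np.concatenate([objects[i], objects[j], question])
             for i in range(n) for j in range(n)]
    relations = mlp(np.stack(pairs), *g_params)    # shared MLP g on each pair
    return mlp(relations.sum(axis=0), *f_params)   # MLP f on the summed relations

rng = np.random.default_rng(0)
n_objects, d_obj, d_q, hidden, n_answers = 4, 6, 5, 16, 3
objects = rng.normal(size=(n_objects, d_obj))      # e.g. CNN feature-map cells
question = rng.normal(size=d_q)                    # e.g. final LSTM state
g_params = (rng.normal(size=(2 * d_obj + d_q, hidden)),
            rng.normal(size=(hidden, hidden)))
f_params = (rng.normal(size=(hidden, hidden)),
            rng.normal(size=(hidden, n_answers)))

logits = relation_network(objects, question, g_params, f_params)
```

Because the pairwise outputs are summed, the result does not depend on the order in which the objects are listed.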

54
Visual Dialog
Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual
Dialog." CVPR 2017 [Project]

55
Visual Dialog
Caros, Mariona, Maite Garolera, Petia Radeva, and Xavier Giro-i-Nieto. "Automatic Reminiscence Therapy for
Dementia." ICMR 2020. [talk]
Demo @ ICMR 2020 (Wednesday 11:00am)

56
Visual Dialog
Caros, Mariona, Maite Garolera, Petia Radeva, and Xavier Giro-i-Nieto. "Automatic Reminiscence Therapy for
Dementia." ICMR 2020. [talk]

57
Hate Speech Detection in Memes
Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, “Hate Speech in Pixels: Detection of Offensive Memes
towards Automatic Moderation”. NeurIPS 2019 AI for Good Workshop. [code]
Hate Speech Detection

58
Outline
1. Generative Models
a. Text
b. Image
2. Discriminative Models
a. Text
b. Image
3. Representation Learning
4. Control Tasks

59
Encoder
Decoder
Representation
Encoder
Representation

60
Niu, Yulei, Hanwang Zhang, Zhiwu Lu, and Shih-Fu Chang. "Variational Context: Exploiting Visual and Textual Context for
Grounding Referring Expressions." arXiv preprint arXiv:1907.03609 (2019).
Objects from Referring Expressions

61
Video Objects from Referring Expressions
Li, Zhenyang, Ran Tao, Efstratios Gavves, Cees GM Snoek, and Arnold WM Smeulders. "Tracking by natural language
specification." CVPR 2017. [code]

62
Video Object Detection with Transformers
Sadhu, A., Chen, K., & Nevatia, R. (2020). Video Object Grounding using Semantic Roles in Language Description. CVPR 2020.

63
#Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular
attention network for referring expression comprehension." CVPR 2018. [code]
Segments from Referring Expressions

64
Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV
2018.

65
Herrera-Palacio, Alba, Carles Ventura, and Xavier Giro-i-Nieto. "Video object linguistic grounding." ACM Multimedia
Workshops 2019.

66
#RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto.
"RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint
arXiv:2010.00263 (2020).

67
#RefVOS Bellver, Miriam, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, and Xavier Giro-i-Nieto.
"RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation." arXiv preprint
arXiv:2010.00263 (2020).

68
#SynthRef Ioannis Kazakos, Bellver, Miriam, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation
of Synthetic Referring Expressions for Object Segmentation” (submitted)
Synthetic Expressions w/ Scene Graphs

69
#SynthRef Ioannis Kazakos, Bellver, Miriam, Carles Ventura, Carina Silberer, and Xavier Giro-i-Nieto, “Generation
of Synthetic Referring Expressions for Object Segmentation” (submitted)

Segments from Questions
Gan, Chuang, Yandong Li, Haoxiang Li, Chen Sun, and Boqing Gong. "VQS: Linking segmentations to questions and
answers for supervised attention in vqa and question-focused semantic segmentation." ICCV 2017.

71
Outline
1. Generative Models
a. Text
b. Image
2. Discriminative Models
a. Text
b. Image
3. Representation Learning
4. Control Tasks

72
Encoder Encoder
Representation

73
Joint Representations (Embeddings)
#Devise Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep
visual-semantic embedding model." NIPS 2013

74
Zero-shot learning
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code]
No images of “cat” in the training set, but they can still be recognised as “cats” thanks to the
representations learned from text.
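A toy sketch of this idea, assuming an image encoder that already projects images into the word-vector space. The three-dimensional vectors below are invented for illustration, and `zero_shot_classify` is a hypothetical helper, not code from the cited work.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Word vectors learned from text only (made-up values).
word_vectors = {
    "dog":   np.array([1.0, 0.1, 0.0]),
    "truck": np.array([0.0, 0.1, 1.0]),
    "cat":   np.array([0.9, 0.4, 0.0]),   # class with no training images
}

def zero_shot_classify(image_embedding, word_vectors):
    # Pick the class whose word vector is closest to the projected image.
    return max(word_vectors,
               key=lambda name: cosine(image_embedding, word_vectors[name]))

# Hypothetical output of the image encoder for a photo of a cat.
image_embedding = np.array([0.8, 0.5, 0.05])
prediction = zero_shot_classify(image_embedding, word_vectors)
print(prediction)  # cat
```

The unseen class is reachable only because its word vector sits near those of visually similar seen classes.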

75
Multimodal Retrieval
Kiros, Ryan, Ruslan Salakhutdinov, and Richard S. Zemel. "Unifying visual-semantic embeddings with multimodal neural
language models." NeurIPS 2014 Deep Learning Workshop.

76
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.

77
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.

78
Image and text retrieval with joint embeddings.
Joint Neural Embeddings
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio
Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video]
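Retrieval in a joint embedding space reduces to a nearest-neighbour search across modalities. The sketch below assumes both encoders already output vectors in the same space; the two-dimensional embeddings are invented toy values.

```python
import numpy as np

def l2_normalize(x):
    # Unit-length rows, so that dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy outputs of an image encoder and a text (recipe) encoder.
image_embeddings = l2_normalize(np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]))
text_embeddings = l2_normalize(np.array([[0.2, 1.0], [1.0, 0.0], [1.0, 1.0]]))

similarity = image_embeddings @ text_embeddings.T   # rows: images, columns: texts

image_to_text = similarity.argmax(axis=1)   # best text for each image
text_to_image = similarity.argmax(axis=0)   # best image for each text
print(image_to_text, text_to_image)  # [1 0 2] [1 0 2]
```

The same similarity matrix serves both retrieval directions, which is the practical appeal of a shared space.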

79
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba,
“Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 [video]

80
joint
embedding
LSTM Bidirectional LSTM
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba,
“Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017

81
Representations
Sariyildiz, Mert Bulent, Julien Perez, and Diane Larlus. "Learning Visual Representations with Caption Annotations." ECCV
2020. [tweet]

82
Representations
#ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo]
Visual Task: Predict the visual categories for the masked image region.
Language Task: Predict the masked word (same as in language-only BERT).

83
Representations
#ViLBERT Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. "Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks." NeurIPS 2019. [MIT talk by Devi Parikh] [demo]
Multimodal Task: Predict whether the image corresponds to the caption.

84
Representations
#VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video
and language representation learning." ICCV 2019.

85
Representations
#VideoBERT Sun, Chen, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. "Videobert: A joint model for video
and language representation learning." ICCV 2019.
Rich representations can be used to retrieve matching video frames, which are encoded after vector
quantization.
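Vector quantization turns continuous frame features into discrete "visual words" that a BERT-style model can consume like text tokens. The sketch assumes a small flat codebook with made-up centroids; a real codebook would be learned by clustering features from many videos.

```python
import numpy as np

def quantize(features, centroids):
    """Map each feature vector to the index of its nearest centroid."""
    # Squared Euclidean distances, shape (num_features, num_centroids).
    distances = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

# Hypothetical codebook of 3 visual words in a 2-D feature space.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
# Toy features for four consecutive video clips.
features = np.array([[0.1, -0.1], [0.9, 1.2], [0.2, 1.9], [1.1, 0.8]])

tokens = quantize(features, centroids)
print(tokens)  # [0 1 2 1]
```

Retrieval then amounts to looking up stored frames that share a predicted token.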

86
Representations
#VirTex Karan Desai, Justin Johnson, “VirTex: Learning Visual Representations from Textual Annotations” arXiv 2020
[tweet]

87
Learning Language from Video
Doughty, Hazel, Ivan Laptev, Walterio Mayol-Cuevas, and Dima Damen. "Action Modifiers: Learning from Adverbs in
Instructional Videos." CVPR 2020.

88
Learning Language from Video
Surís, Dídac, Dave Epstein, Heng Ji, Shih-Fu Chang, and Carl Vondrick. "Learning to Learn Words from Visual Scenes." ECCV
2020.

89
Outline
1. Generative Models
a. Text
b. Image
2. Discriminative Models
a. Text
b. Image
3. Representation Learning
4. Control Tasks

90
Platforms for Embodied AI
#Habitat Savva, Manolis, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub et
al. "Habitat: A platform for embodied ai research." ICCV 2019. [site]

91
Navigation
Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor
Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language
Navigation." NeurIPS 2018.

92
Navigation
#R2R Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., ... & van den Hengel, A. Vision-and-language
navigation: Interpreting visually-grounded navigation instructions in real environments. CVPR 2018. [tweet]

93
Navigation
#RxR Alexander Ku and Peter Anderson and Roma Patel and Eugene Ie and Jason Baldridge, “Room-Across-Room:
Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding” EMNLP 2020.

94
Navigation
Ünal, Emre, Ozan Arkan Can, and Yücel Yemez. "Visually Grounded Language Learning For Robot Navigation." ACMMM
Workshops 2019.

95
Object manipulation
Hill, F., Lampinen, A. K., Schneider, R., Clark, S., Botvinick, M., McClelland, J. L., & Santoro, A. Environmental drivers of
systematicity and generalization in a situated agent. ICLR 2020. [talk]

96
Outline
1. Generative Models
a. Text
b. Image
2. Discriminative Models
a. Text
b. Image
3. Representation Learning
4. Control Tasks

97
My take home message
a. Text
b. Vision
a. Text
b. Vision
3. Feature Learning
4. Control Tasks

Xavier Giro-i-Nieto
@DocXavi
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
Go raibh maith agat / Thank you
Was this tutorial helpful? Please consider citing:
Giro-i-Nieto, X. One Perceptron to Rule Them All: Language,
Vision, Audio and Speech. In Proceedings of the 2020
International Conference on Multimedia Retrieval (pp. 7-8).
