The document surveys models and techniques for generative and discriminative tasks in language and vision, focusing on image and video captioning, visual question answering, and sign language translation. It references numerous studies and methods, including encoder-decoder representations, attention mechanisms, and multimodal machine translation. Key topics include combating data bias, dynamic memory networks, and advances in visual reasoning and object grounding through language.
Related topics: