Describing Images using Inferred Visual Dependency Representations

Desmond Elliott, Arjen P. de Vries

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


The Visual Dependency Representation (VDR) is an explicit model of the spatial relationships between objects in an image. In this paper we present an approach to training a VDR Parsing Model without the extensive human supervision used in previous work. Our approach is to find the objects mentioned in a given description using a state-of-the-art object detector, and to use successful detections to produce training data. The description of an unseen image is produced by first predicting its VDR over automatically detected objects, and then generating the text with a template-based generation model using the predicted VDR. The performance of our approach is comparable to a state-of-the-art multimodal deep neural network in images depicting actions.
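The training-data step described in the abstract (keep an image only when the objects mentioned in its description are successfully detected) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Detection` class, the function names, the exact word-matching rule, the confidence threshold of 0.5, and the requirement of at least two matched objects are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detector output: a class label, a score, and a bounding box."""
    label: str
    confidence: float
    box: tuple  # (x_min, y_min, x_max, y_max)


def mentioned_detections(description, detections, threshold=0.5):
    """Return detections whose label occurs as a word in the description.

    Exact lowercase word matching and the threshold are illustrative assumptions.
    """
    words = {w.strip(".,;:!?").lower() for w in description.split()}
    return [
        d for d in detections
        if d.label.lower() in words and d.confidence >= threshold
    ]


def make_training_instance(description, detections, threshold=0.5):
    """Build a training instance only when detection succeeds.

    Assumption: a spatial relationship needs at least two objects, so images
    with fewer than two matched detections are discarded (None is returned).
    """
    found = mentioned_detections(description, detections, threshold)
    if len(found) < 2:
        return None
    return {"description": description, "objects": found}
```

Under these assumptions, an image described as "A person rides a bike." with confident `person` and `bike` detections would yield a training instance, while an image whose mentioned objects are not detected would be skipped.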
Original language: English
Title of host publication: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers
Number of pages: 11
Publication status: Published - Jul 2015

