Abstract
Given a (static) scene, a human can effortlessly describe what is going on (who is doing what to whom, how, and why). The process requires knowledge about the world and about how it is perceived and described. In this paper we study the problem of interpreting and verbalizing visual information using abstract scenes created from collections of clip art images. We propose a model inspired by machine translation, operating over a large parallel corpus of visual relations and linguistic descriptions. We demonstrate that this approach produces human-like scene descriptions which are both fluent and relevant, outperforming a number of competitive alternatives based on templates, sentence-based retrieval, and a multi-modal neural language model.
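To give a flavour of the translation analogy (this is only a toy sketch, not the authors' model), one can think of the relations extracted from an abstract scene as the "source language" and the human-written descriptions as the "target language", with co-occurrence counts over a parallel corpus playing the role of a phrase table. All relation tuples, descriptions, and counts below are hypothetical.

```python
# Toy illustration of the MT analogy: visual relations as the source side,
# scene descriptions as the target side of a parallel corpus.
# Every tuple, string, and count here is made up for illustration only.
from collections import defaultdict

# Hypothetical parallel corpus: relations extracted from an abstract scene,
# paired with a human-written description of that scene.
parallel_corpus = [
    ([("boy", "kicks", "ball")], "the boy is kicking the ball"),
    ([("boy", "kicks", "ball")], "a boy kicks a red ball"),
    ([("girl", "holds", "balloon")], "the girl is holding a balloon"),
]

# "Phrase table": how often each visual relation co-occurs with a description.
phrase_table = defaultdict(lambda: defaultdict(int))
for relations, description in parallel_corpus:
    for relation in relations:
        phrase_table[relation][description] += 1

def describe(relations):
    """For each relation, pick the description it was most often paired with."""
    outputs = []
    for relation in relations:
        candidates = phrase_table.get(relation)
        if candidates:
            outputs.append(max(candidates, key=candidates.get))
    return "; ".join(outputs)

print(describe([("boy", "kicks", "ball")]))
# -> "the boy is kicking the ball"
```

A real system would of course generalise beyond exact matches (e.g. by scoring candidate descriptions with a language model), but the sketch shows how a parallel corpus of relations and descriptions can drive generation by retrieval and translation rather than by hand-written templates.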
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies |
| Place of Publication | Denver, Colorado |
| Publisher | Association for Computational Linguistics |
| Pages | 1505-1515 |
| Number of pages | 11 |
| Publication status | Published - 1 May 2015 |