Abstract
Automatically generating natural language descriptions of an image is a challenging problem in artificial intelligence that requires a good understanding of visual and textual signals and of the correlations between them. State-of-the-art image captioning methods struggle to approach human-level performance, especially when data is limited. In this paper, we propose to improve the performance of state-of-the-art image captioning models by incorporating two sources of prior knowledge: (i) a conditional latent topic attention that uses a set of latent variables (topics) as an anchor to generate highly probable words, and (ii) a regularization technique that exploits the inductive biases in the syntactic and semantic structure of captions and improves the generalization of image captioning models. Our experiments show that our method produces more human-interpretable captions and also leads to significant improvements on the MSCOCO dataset in both the full and low data regimes.
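For intuition only, the sketch below shows one plausible way a conditional latent-topic attention of the kind described in the abstract could be wired up in PyTorch: a set of learned topic embeddings each attends over image region features, and the resulting topic-conditioned contexts are mixed according to a topic distribution inferred from the decoder state. The module name, tensor shapes, and gating scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentTopicAttention(nn.Module):
    """Illustrative sketch (not the paper's code): latent topics act as
    anchors that attend over image regions; the decoder state decides how
    much each topic contributes to the final context vector."""

    def __init__(self, num_topics: int, feat_dim: int, hidden_dim: int):
        super().__init__()
        # Learned latent topic anchors (assumed to be free parameters here).
        self.topics = nn.Parameter(torch.randn(num_topics, hidden_dim))
        self.key = nn.Linear(feat_dim, hidden_dim)            # region features -> keys
        self.value = nn.Linear(feat_dim, hidden_dim)          # region features -> values
        self.topic_gate = nn.Linear(hidden_dim, num_topics)   # decoder state -> topic logits

    def forward(self, regions: torch.Tensor, decoder_state: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, feat_dim); decoder_state: (batch, hidden_dim)
        keys = self.key(regions)                               # (B, R, H)
        values = self.value(regions)                           # (B, R, H)

        # Each latent topic attends over the image regions independently.
        scores = torch.einsum('th,brh->btr', self.topics, keys) / keys.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)                       # (B, T, R)
        topic_ctx = torch.einsum('btr,brh->bth', attn, values)  # (B, T, H)

        # Condition on the decoder state: a distribution over topics mixes
        # the topic-specific contexts into a single context vector that the
        # caption decoder can use to predict the next word.
        topic_probs = F.softmax(self.topic_gate(decoder_state), dim=-1)   # (B, T)
        context = torch.einsum('bt,bth->bh', topic_probs, topic_ctx)      # (B, H)
        return context
```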
Original language | English |
---|---|
Title of host publication | Computer Vision – ECCV 2020 Workshops, Proceedings |
Editors | Adrien Bartoli, Andrea Fusiello |
Publisher | Springer |
Pages | 369-385 |
Number of pages | 17 |
ISBN (Electronic) | 978-3-030-66096-3 |
ISBN (Print) | 978-3-030-66095-6 |
DOIs | |
Publication status | Published - 3 Jan 2021 |
Event | Workshops held at the 16th European Conference on Computer Vision - Glasgow, United Kingdom |
Duration | 23 Aug 2020 → 28 Aug 2020 |
Internet address | https://eccv2020.eu |
Publication series
Name | Lecture Notes in Computer Science |
---|---|
Publisher | Springer |
Volume | 12536 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | Workshops held at the 16th European Conference on Computer Vision |
---|---|
Abbreviated title | ECCV 2020 |
Country/Territory | United Kingdom |
City | Glasgow |
Period | 23/08/20 → 28/08/20 |
Internet address | https://eccv2020.eu |