Injecting Prior Knowledge into Image Caption Generation

Arushi Goel*, Basura Fernando, Thanh Son Nguyen, Hakan Bilen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Automatically generating natural language descriptions from an image is a challenging problem in artificial intelligence that requires a good understanding of the visual and textual signals and the correlations between them. State-of-the-art methods in image captioning struggle to approach human-level performance, especially when data is limited. In this paper, we propose to improve the performance of state-of-the-art image captioning models by incorporating two sources of prior knowledge: (i) a conditional latent topic attention that uses a set of latent variables (topics) as an anchor to generate highly probable words, and (ii) a regularization technique that exploits the inductive biases in the syntactic and semantic structure of captions and improves the generalization of image captioning models. Our experiments validate that our method produces more human-interpretable captions and also leads to significant improvements on the MSCOCO dataset in both the full and low-data regimes.
Original language: English
Title of host publication: Computer Vision – ECCV 2020 Workshops, Proceedings
Editors: Adrien Bartoli, Andrea Fusiello
Publisher: Springer
Pages: 369-385
Number of pages: 17
ISBN (Electronic): 978-3-030-66096-3
ISBN (Print): 978-3-030-66095-6
Publication status: Published - 3 Jan 2021
Event: Workshops held at the 16th European Conference on Computer Vision - Glasgow, United Kingdom
Duration: 23 Aug 2020 - 28 Aug 2020
https://eccv2020.eu

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 12536
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: Workshops held at the 16th European Conference on Computer Vision
Abbreviated title: ECCV 2020
Country/Territory: United Kingdom
City: Glasgow
Period: 23/08/20 - 28/08/20
