Comparing Automatic Evaluation Measures for Image Description

Desmond Elliott, Frank Keller

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Image description is a new natural language generation task, where the aim is to generate a human-like description of an image. The evaluation of computer-generated text is a notoriously difficult problem; nevertheless, the quality of image descriptions has typically been measured using unigram BLEU and human judgements. The focus of this paper is to determine the correlation of automatic measures with human judgements for this task. We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets. The main finding is that unigram BLEU correlates only weakly with human judgements, while Meteor has the strongest correlation.
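As a rough illustration of the kind of correlation analysis the abstract describes (not the paper's actual evaluation code), the Python sketch below scores a few hypothetical image descriptions with smoothed unigram BLEU via NLTK and correlates the scores with invented human judgements using Spearman's rho from SciPy; the data, rating scale, and library choices are all assumptions made for illustration.

# Hypothetical sketch of a metric-vs-human correlation analysis.
# All data below are invented; this is not the paper's evaluation code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Hypothetical system outputs, reference descriptions, and human ratings.
hypotheses = [
    "a dog runs across the grass",
    "a man is riding a bike",
    "two people sit on a bench",
]
references = [
    ["a dog is running through a field of grass"],
    ["a person rides a bicycle down the street"],
    ["two people are sitting on a park bench"],
]
human_scores = [4.0, 3.5, 4.5]  # e.g. averaged judgements on a 1-5 scale

# Unigram BLEU: weight only 1-gram precision; smoothing keeps short
# sentences from collapsing to zero (cf. the paper's Smoothed BLEU condition).
smooth = SmoothingFunction().method1
bleu_scores = [
    sentence_bleu([r.split() for r in refs], hyp.split(),
                  weights=(1.0,), smoothing_function=smooth)
    for hyp, refs in zip(hypotheses, references)
]

# Rank correlation between the automatic scores and the human judgements.
rho, p_value = spearmanr(bleu_scores, human_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
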
Original language: English
Title of host publication: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers
Place of Publication: Baltimore, Maryland, USA
Publisher: Association for Computational Linguistics
Pages: 452-457
Number of pages: 6
Publication status: Published - Jun 2014
Event: 52nd Annual Meeting of the Association for Computational Linguistics - Baltimore, United States
Duration: 22 Jun 2014 - 27 Jun 2014
http://acl2014.org/home.htm

Conference

Conference: 52nd Annual Meeting of the Association for Computational Linguistics
Abbreviated title: ACL 2014
Country/Territory: United States
City: Baltimore
Period: 22/06/14 - 27/06/14
Internet address: http://acl2014.org/home.htm
