Putting Human Assessments of Machine Translation Systems in Order

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Human assessment is often considered the gold standard in evaluation of translation systems. But in order for the evaluation to be meaningful, the rankings obtained from human assessment must be consistent and repeatable. Recent analysis by Bojar et al. (2011) raised several concerns about the rankings derived from human assessments of English-Czech translation systems in the 2010 Workshop on Machine Translation. We extend their analysis to all of the ranking tasks from 2010 and 2011, and show through an extension of their reasoning that the ranking is naturally cast as an instance of finding the minimum feedback arc set in a tournament, a well-known NP-complete problem. All instances of this problem in the workshop data are efficiently solvable, but in some cases the rankings this formulation produces are surprisingly different from the ones previously published. This leads to strong caveats and recommendations for both producers and consumers of these rankings.
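As a rough illustration of the formulation named in the abstract, the sketch below ranks systems by choosing the ordering that contradicts the fewest pairwise human judgments, which is the weighted minimum feedback arc set objective on a tournament. It is a minimal sketch, not the paper's implementation: the function names, the exhaustive search, the weighting by raw judgment counts, and the pairwise counts themselves are all assumptions made for illustration.

```python
from itertools import permutations


def feedback_cost(order, wins):
    """Count the pairwise judgments that contradict the given ordering."""
    cost = 0
    for i, better in enumerate(order):
        for worse in order[i + 1:]:
            # Judgments that preferred the lower-ranked system are violations.
            cost += wins.get((worse, better), 0)
    return cost


def min_feedback_ranking(systems, wins):
    """Exhaustive search over orderings; only feasible for a few systems."""
    return min(permutations(systems), key=lambda order: feedback_cost(order, wins))


# Hypothetical data: wins[(a, b)] = times system a was judged better than b.
# The majority preferences form a cycle (A > B, B > C, C > A), so no ordering
# can agree with every judgment.
wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 7, ("C", "B"): 3,
    ("C", "A"): 6, ("A", "C"): 4,
}
systems = ["A", "B", "C"]

best = min_feedback_ranking(systems, wins)
print(best, feedback_cost(best, wins))
```

With these invented counts the search breaks the cycle at its weakest edge (C over A). Exhaustive search grows factorially with the number of systems, so it only conveys the objective; it says nothing about how the workshop instances were actually solved.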
Original language: English
Title of host publication: Proceedings of the Seventh Workshop on Statistical Machine Translation
Place of Publication: Montréal, Canada
Publisher: Association for Computational Linguistics
Number of pages: 9
Publication status: Published - 1 Jun 2012

