Analysing the Effect of Out-of-domain Data on SMT Systems

Barry Haddow, Philipp Koehn

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract / Description of output

In statistical machine translation (SMT), it is known that performance declines when the training data is in a different domain from the test data. Nevertheless, it is frequently necessary to supplement scarce in-domain training data with out-of-domain data. In this paper, we first try to relate the effect of the out-of-domain data on translation performance to measures of corpus similarity, then we separately analyse the effect of adding the out-of-domain data at different parts of the training pipeline (alignment, phrase extraction, and phrase scoring). Through experiments in 2 domains and 8 language pairs it is shown that the out-of-domain data improves coverage and translation of rare words, but may degrade the translation quality for more common words.
Original language: English
Title of host publication: Proceedings of the Seventh Workshop on Statistical Machine Translation
Place of Publication: Stroudsburg, PA, USA
Publisher: Association for Computational Linguistics
Number of pages: 11
ISBN (Print): 978-1-937284-20-6
Publication status: Published - 2012

