Abstract
In statistical machine translation (SMT), performance is known to decline when the training data comes from a different domain than the test data. Nevertheless, it is frequently necessary to supplement scarce in-domain training data with out-of-domain data. In this paper, we first try to relate the effect of the out-of-domain data on translation performance to measures of corpus similarity. We then separately analyse the effect of adding the out-of-domain data at different stages of the training pipeline (alignment, phrase extraction, and phrase scoring). Experiments in 2 domains and 8 language pairs show that the out-of-domain data improves coverage and the translation of rare words, but may degrade translation quality for more common words.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the Seventh Workshop on Statistical Machine Translation |
| Place of Publication | Stroudsburg, PA, USA |
| Publisher | Association for Computational Linguistics |
| Pages | 422-432 |
| Number of pages | 11 |
| ISBN (Print) | 978-1-937284-20-6 |
| Publication status | Published - 2012 |
| Title | Analysing the Effect of Out-of-domain Data on SMT Systems |