Abstract / Description of output
We provide a few insights on data selection for machine translation. We evaluate the quality of the new CzEng 1.0, a parallel data source used in WMT12. We describe a simple technique for reducing the out-of-vocabulary rate after phrase extraction. We discuss the benefits of tuning towards multiple reference translations for the English-Czech language pair. We introduce a novel approach to data selection by full-text indexing and search: we select sentences similar to the test set from a large monolingual corpus and explore several options for incorporating them in a machine translation system. We show that this method can improve translation quality. Finally, we describe our submitted system CU-TAMCH-BOJ.
Original language | English |
---|---|
Title of host publication | Proceedings of the Seventh Workshop on Statistical Machine Translation |
Place of Publication | Montreal, Canada |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 374-381 |
Number of pages | 8 |
Publication status | Published - 7 Jun 2012 |
Event | Seventh Workshop on Statistical Machine Translation, Montreal, Canada. Duration: 7 Jun 2012 → 8 Jun 2012. http://www.statmt.org/wmt12/ |
Conference
Conference | Seventh Workshop on Statistical Machine Translation |
---|---|
Country/Territory | Canada |
City | Montreal |
Period | 7/06/12 → 8/06/12 |
Internet address |