Selecting data for English-to-Czech machine translation

Ales Tamchyna, Petra Galuscakova, Amir Kamran, Milos Stanojevic, Ondrej Bojar

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

We provide a few insights on data selection for machine translation. We evaluate the quality of the new CzEng 1.0, a parallel data source used in WMT12. We describe a simple technique for reducing the out-of-vocabulary rate after phrase extraction. We discuss the benefits of tuning towards multiple reference translations for the English-Czech language pair. We introduce a novel approach to data selection by full-text indexing and search: we select sentences similar to the test set from a large monolingual corpus and explore several options for incorporating them into a machine translation system. We show that this method can improve translation quality. Finally, we describe our submitted system, CU-TAMCH-BOJ.
Original language: English
Title of host publication: Proceedings of the Seventh Workshop on Statistical Machine Translation
Place of Publication: Montreal, Canada
Publisher: Association for Computational Linguistics (ACL)
Pages: 374-381
Number of pages: 8
Publication status: Published - 7 Jun 2012
Event: Seventh Workshop on Statistical Machine Translation - Montreal, Canada
Duration: 7 Jun 2012 to 8 Jun 2012
http://www.statmt.org/wmt12/

Conference

Conference: Seventh Workshop on Statistical Machine Translation
Country/Territory: Canada
City: Montreal
Period: 7/06/12 to 8/06/12