Copied Monolingual Data Improves Low-Resource Neural Machine Translation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We train a neural machine translation (NMT) system to both translate source-language text and copy target-language text, thereby exploiting monolingual corpora in the target language. Specifically, we create a bitext from the monolingual text in the target language so that each source sentence is identical to the target sentence. This copied data is then mixed with the parallel corpus, and the NMT system is trained as normal, with no metadata to distinguish the two input languages. Our proposed method proves to be an effective way of incorporating monolingual data into low-resource NMT. On Turkish↔English and Romanian↔English translation tasks, we see gains of up to 1.2 BLEU over a strong baseline with back-translation. Further analysis shows that the linguistic phenomena behind these gains are different from and largely orthogonal to back-translation, with our copied corpus method improving accuracy on named entities and other words that should remain identical between the source and target languages.
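A minimal sketch of the copied-corpus construction described in the abstract: each monolingual target-language sentence is written out as both source and target, and the result is concatenated with the parallel corpus with no distinguishing metadata. The file names and the simple concatenation strategy are illustrative assumptions, not details from the paper.

```python
def build_copied_bitext(mono_path, out_src_path, out_tgt_path):
    # Each monolingual target sentence becomes both the source and the
    # target, so the model also learns to copy target-language text.
    with open(mono_path, encoding="utf-8") as mono, \
         open(out_src_path, "w", encoding="utf-8") as src_out, \
         open(out_tgt_path, "w", encoding="utf-8") as tgt_out:
        for line in mono:
            sentence = line.strip()
            if sentence:
                src_out.write(sentence + "\n")
                tgt_out.write(sentence + "\n")


def mix_corpora(parallel_src, parallel_tgt, copied_src, copied_tgt,
                mixed_src, mixed_tgt):
    # Concatenate the parallel and copied corpora; no language tag or
    # other metadata distinguishes the two kinds of source input.
    for out_path, parts in [(mixed_src, [parallel_src, copied_src]),
                            (mixed_tgt, [parallel_tgt, copied_tgt])]:
        with open(out_path, "w", encoding="utf-8") as out:
            for part in parts:
                with open(part, encoding="utf-8") as f:
                    out.writelines(f)


if __name__ == "__main__":
    # Hypothetical file names for a Turkish->English setup.
    build_copied_bitext("mono.en", "copied.src", "copied.tgt")
    mix_corpora("train.tr", "train.en", "copied.src", "copied.tgt",
                "mixed.src", "mixed.tgt")
```

The mixed files can then be fed to a standard NMT training pipeline unchanged, which is the point of the method: no architectural modification is required.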
Original language: English
Title of host publication: Proceedings of the Second Conference on Machine Translation
Subtitle of host publication: Part of EMNLP 2017
Publisher: Association for Computational Linguistics
Pages: 148–156
Number of pages: 9
Volume: 1 (Research Papers)
DOIs
Publication status: Published - 8 Sep 2017
Event: 2017 Conference on Machine Translation - Copenhagen, Denmark
Duration: 7 Sep 2017 – 8 Sep 2017

Conference

Conference: 2017 Conference on Machine Translation
Country: Denmark
City: Copenhagen
Period: 7/09/17 – 8/09/17
