Exploring diversity in back translation for low-resource machine translation

Laurie Burchell, Alexandra Birch, Kenneth Heafield

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Back translation is one of the most widely used methods for improving the performance of neural machine translation systems. Recent research has sought to enhance the effectiveness of this method by increasing the ‘diversity’ of the generated translations. We argue that the definitions and metrics used to quantify ‘diversity’ in previous work have been insufficient. This work puts forward a more nuanced framework for understanding diversity in training data, splitting it into lexical diversity and syntactic diversity. We present novel metrics for measuring these different aspects of diversity and carry out empirical analysis into the effect of these types of diversity on final neural machine translation model performance for low-resource English↔Turkish and mid-resource English↔Icelandic. Our findings show that generating back translation using nucleus sampling results in higher final model performance, and that this method of generation has high levels of both lexical and syntactic diversity. We also find evidence that lexical diversity is more important than syntactic for back translation performance.
Original language: English
Title of host publication: Proceedings of the 3rd Workshop on Deep Learning for Low-Resource NLP
Editors: Colin Cherry, Angela Fan, George Foster, Gholamreza (Reza) Haffari, Shahram Khadivi, Nanyun (Violet) Peng, Xiang Ren, Ehsan Shareghi, Swabha Swayamdipta
Place of Publication: Stroudsburg, PA, USA
Publisher: Association for Computational Linguistics
Pages: 67-79
Number of pages: 13
ISBN (Electronic): 9781955917971
Publication status: Published - 14 Jul 2022
Event: The 3rd Deep Learning for Low-Resource NLP Workshop - Seattle, United States
Duration: 14 Jul 2022 → 14 Jul 2022
Conference number: 3
https://sites.google.com/view/deeplo-2022/home

Workshop

Workshop: The 3rd Deep Learning for Low-Resource NLP Workshop
Abbreviated title: DeepLo 2022
Country/Territory: United States
City: Seattle
Period: 14/07/22 → 14/07/22
