Quality or quantity? On data scale and diversity in adapting large language models for low-resource translation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Despite the recent popularity of Large Language Models (LLMs) in Machine Translation (MT), their performance on low-resource translation still lags significantly behind that of Neural Machine Translation (NMT) models. In this paper, we explore what it would take to adapt LLMs for low-resource settings. In particular, we re-examine the role of two factors: a) the importance and application of parallel data, and b) diversity in Supervised Fine-Tuning (SFT). Recently, parallel data has been shown to be less important for MT using LLMs than in previous MT research. Similarly, diversity during SFT has been shown to promote significant transfer in LLMs across languages and tasks. However, for low-resource LLM-MT, we show that the opposite is true for both of these considerations: a) parallel data is critical during both pretraining and SFT, and b) diversity tends to cause interference, not transfer. Our experiments, conducted with 3 LLMs across 2 low-resource language groups (indigenous American and North-East Indian), reveal consistent patterns in both cases, underscoring the generalizability of our findings. We believe these insights will be valuable for scaling to massively multilingual LLM-MT models that can effectively serve lower-resource languages.
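
The adaptation recipe the abstract refers to, SFT on parallel data, can be illustrated with a minimal sketch. The model name and the toy sentence pair below are placeholders, not the paper's actual models or corpora: each parallel pair is rendered as a translation prompt, and the loss is computed on the target side only.

```python
# Minimal sketch of supervised fine-tuning (SFT) a causal LLM on parallel
# data. Placeholder model and data; not the paper's actual setup.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL = "facebook/opt-350m"  # small stand-in; the paper adapts 3 larger LLMs

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Hypothetical parallel corpus: (source prompt, target translation) pairs.
pairs = [
    ("Translate English to Mizo: The weather is nice today.",
     "<target-language sentence>"),  # placeholder reference translation
]

def encode(src, tgt):
    # Concatenate prompt and reference; mask the prompt in the labels so the
    # cross-entropy loss is computed on the target tokens only.
    prompt_ids = tokenizer(src + "\n", add_special_tokens=False).input_ids
    target_ids = tokenizer(tgt + tokenizer.eos_token,
                           add_special_tokens=False).input_ids
    input_ids = prompt_ids + target_ids
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": [-100] * len(prompt_ids) + target_ids,
    }

train_dataset = [encode(s, t) for s, t in pairs]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-mt-sketch",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=train_dataset,
)
trainer.train()
```

Masking the prompt tokens is the standard MT-style SFT objective; the paper's contribution concerns what data feeds this loop (parallel vs. non-parallel, narrow vs. diverse), not the objective itself.
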
Original language: English
Title of host publication: Proceedings of the Ninth Conference on Machine Translation
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Publisher: Association for Computational Linguistics
Pages: 1-17
Number of pages: 17
ISBN (Electronic): 9798891761797
Publication status: Published - 16 Nov 2024
Event: Ninth Conference on Machine Translation, Miami, United States
Duration: 15 Nov 2024 – 16 Nov 2024

Conference

Conference: Ninth Conference on Machine Translation
Abbreviated title: WMT24
Country/Territory: United States
City: Miami
Period: 15/11/24 – 16/11/24
