Synthesising a corpus of Gaelic traditional narrative with cross-lingual text expansion

Will Lamb*, Dongge Han, Ondrej Klejch, Beatrice Alex, Peter Bell

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Advances in large language modelling have disproportionately benefited high-resource languages due to their vastly greater training data reserves. This paper proposes a novel cross-lingual text expansion (XLTE) technique using multilingual large language models (MLLMs) to mitigate data sparsity in low-resource languages (LRLs). We apply XLTE to the domain of traditional Scottish Gaelic storytelling, aiming to generate coherent, long-form narrative texts by leveraging MLLMs’ cross-lingual capabilities. The effectiveness of this technique is demonstrated using GPT-4o, with supervised fine-tuning (SFT) providing a 57.2% reduction in perplexity and decreased neologism over baseline models. Despite these improvements, qualitative analyses reveal stylistic divergences between synthesised and genuine data. Nevertheless, XLTE offers a promising, scalable method for synthesising training sets in other languages and domains, opening avenues for further improvements in low-resource language modelling.
Original language: English
Title of host publication: Proceedings of the 5th Celtic Language Technology Workshop
Place of Publication: Abu Dhabi
Publisher: ACL Anthology
Pages: 12–26
Number of pages: 12
Publication status: Published - 24 Jan 2025
Event: Celtic Language Technology Conference 5 (CLTW5) - Abu Dhabi, United Arab Emirates
Duration: 20 Jan 2025 → …
https://cltworkshop.github.io

Conference

Conference: Celtic Language Technology Conference 5 (CLTW5)
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 20/01/25 → …