Abstract
Advances in large language modelling have disproportionately benefited high-resource languages due to their vastly greater training data reserves. This paper proposes a novel cross-lingual text expansion (XLTE) technique using multilingual large language models (MLLMs) to mitigate data sparsity in low-resource languages (LRLs). We apply XLTE to the domain of traditional Scottish Gaelic storytelling, aiming to generate coherent, long-form narrative texts by leveraging MLLMs’ cross-lingual capabilities. The effectiveness of this technique is demonstrated using GPT-4o, with supervised fine-tuning (SFT) providing a $57.2\%$ reduction in perplexity and decreased neologism over baseline models. Despite these improvements, qualitative analyses reveal stylistic divergences between synthesised and genuine data. Nevertheless, XLTE offers a promising, scalable method for synthesising training sets in other languages and domains, opening avenues for further improvements in low-resource language modelling.
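The perplexity figure above can be read as a relative change between two models scored on the same text. The following is a minimal sketch of that arithmetic, assuming perplexity is computed as the exponential of the mean negative log-probability per token. The function names and the token probabilities are illustrative assumptions, not the paper's evaluation code or data.

```python
import math


def perplexity(token_logprobs):
    """Perplexity as exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_reduction(baseline, improved):
    """Fractional drop from baseline to improved (0.5 means a 50% reduction)."""
    return (baseline - improved) / baseline


# Made-up log-probabilities for the same four tokens under two hypothetical models.
baseline_ppl = perplexity([math.log(0.1)] * 4)   # every token assigned probability 0.1
tuned_ppl = perplexity([math.log(0.25)] * 4)     # every token assigned probability 0.25
reduction = relative_reduction(baseline_ppl, tuned_ppl)

print(f"baseline={baseline_ppl:.2f} tuned={tuned_ppl:.2f} reduction={reduction:.1%}")
```

On this reading, a $57.2\%$ reduction would mean the fine-tuned model's perplexity is a little under half the baseline's on the evaluation text.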
Original language | English |
---|---|
Title of host publication | Proceedings of the 5th Celtic Language Technology Workshop |
Place of Publication | Abu Dhabi |
Publisher | ACL Anthology |
Pages | 12–26 |
Number of pages | 12 |
Publication status | Published - 24 Jan 2025 |
Event | Celtic Language Technology Conference 5 (CLTW5), Abu Dhabi, United Arab Emirates; duration: 20 Jan 2025 → …; https://cltworkshop.github.io |
Conference
Conference | Celtic Language Technology Conference 5 (CLTW5) |
---|---|
Country/Territory | United Arab Emirates |
City | Abu Dhabi |
Period | 20/01/25 → … |
Internet address | https://cltworkshop.github.io |
Fingerprint
Dive into the research topics of 'Synthesising a corpus of Gaelic traditional narrative with cross-lingual text expansion'. Together they form a unique fingerprint.
-
Gaelic Speech Recognition for Media, Education and Research
Bell, P. (Principal Investigator), Alex, B. (Co-investigator) & Lamb, W. (Co-investigator)
31/03/23 → 31/07/25
Project: Research
-
Decoding Hidden Heritages in Gaelic Traditional Narrative with Text-Mining and Phylogenetics
Lamb, W. (Principal Investigator) & Alex, B. (Co-investigator)
1/08/21 → 31/07/24
Project: Research
Research output
- 1 Conference contribution
-
Developing automatic speech recognition for Scottish Gaelic
Evans, L., Lamb, W., Sinclair, M. & Alex, B., 15 Jun 2022, Proceedings of the 4th Celtic Language Technology Workshop at LREC 2022 (CLTW 4). Fransen, T., Lamb, W. & Prys, D. (eds.). European Language Resources Association (ELRA), p. 110-120 (11 p.).
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
Open Access