A transformer-based standardisation system for Scottish Gaelic

Junfan Huang, Beatrice Alex, Michael Bauer, D.S. Jasin, Liang Yuchao, Robert Thomas, Will Lamb*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

The transition from rule-based to neural-based architectures has made it more difficult for low-resource languages like Scottish Gaelic to participate in modern language technologies. The performance of deep-learning approaches correlates with the availability of training data, and low-resource languages have limited data reserves by definition. Historical and non-standard orthographic texts could be used to supplement training data, but manual conversion of these texts is expensive and time-consuming. This paper describes the development of a neural-based orthographic standardisation system for Scottish Gaelic and compares it to an earlier rule-based system. The best performance yielded a precision of 93.92, a recall of 92.20 and a word error rate of 11.01. This was obtained using a transformer-based mixed teacher model which was trained with augmented data.
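
For readers unfamiliar with the word error rate quoted above, the following is a minimal sketch of the conventional definition: word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words, expressed as a percentage. The abstract does not state how the authors computed their figures, so the function name `word_error_rate`, the whitespace tokenisation and the percentage scaling are all assumptions made for illustration, not the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate as a percentage.

    Assumption: tokens are whitespace-separated words, and the score is
    the word-level Levenshtein distance divided by the reference length.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Under this definition, a four-word reference whose standardised output differs in one word scores 25.0, and an exact match scores 0.0. Because insertions are counted, the value can exceed 100 when the output is much longer than the reference.
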

Original language: English
Title of host publication: Proceedings of SIGUL 2023
Subtitle of host publication: 2nd Annual Meeting of the ELRA/ISCA SIG on Under-resourced Languages
Number of pages: 5
Publication status: Published - 19 Aug 2023

Keywords / Materials (for Non-textual outputs)

  • Scottish Gaelic
  • text standardisation
  • text normalisation
  • transformer
  • neural network
  • Natural Language Processing (NLP)
  • machine learning
