Abstract
The transition from rule-based to neural-based architectures has made it more difficult for low-resource languages like Scottish Gaelic to participate in modern language technologies. The performance of deep-learning approaches correlates with the availability of training data, and low-resource languages have limited data reserves by definition. Historical and non-standard orthographic texts could be used to supplement training data, but manual conversion of these texts is expensive and time-consuming. This paper describes the development of a neural-based orthographic standardisation system for Scottish Gaelic and compares it to an earlier rule-based system. The best performance yielded a precision of 93.92, a recall of 92.20 and a word error rate of 11.01. This was obtained using a transformer-based mixed teacher model which was trained with augmented data.
Original language | English |
---|---|
Title of host publication | Proceedings of SIGUL 2023 |
Subtitle of host publication | 2nd Annual Meeting of the ELRA/ISCA SIG on Under-resourced Languages |
Publisher | ISCA |
Pages | 108-112 |
Number of pages | 5 |
Publication status | Published - 19 Aug 2023 |
Keywords / Materials (for Non-textual outputs)
- Scottish Gaelic
- text standardisation
- text normalisation
- transformer
- neural network
- Natural Language Processing (NLP)
- machine learning