Annotated Reference Corpus of Scottish Gaelic (ARCOSG)



A gold-standard, grammatically tagged corpus of Scottish Gaelic


A representative, tagged corpus of Scottish Gaelic, divided into eight registers (four spoken, four written) of approximately 10k words each. The corpus is distributed as individual txt files. It was hand-tagged by Lamb, Arbuthnot and Naismith and separately verified by them. It uses the Brown-format tag separator ('/', e.g. 'agus/Cc') and an annotation scheme derived from the Irish PAROLE tagset (see Uí Dhonnchadha, E. and van Genabith, J. 2006. A Part-of-Speech tagger for Irish using finite state morphology and constraint grammar disambiguation. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 2241-2244).
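As an illustration of the word/tag format described above, the following minimal Python sketch splits one line of tagged text into (word, tag) pairs. It assumes that tokens are separated by whitespace and that the tag follows the last '/' in each token; neither assumption is confirmed by the corpus documentation, so check them against the annotation guidelines before relying on this. The function name `parse_tagged_line` is hypothetical, and only the token 'agus/Cc' is taken from the description.

```python
from typing import List, Tuple


def parse_tagged_line(line: str, sep: str = "/") -> List[Tuple[str, str]]:
    """Split a line of 'word/TAG' tokens into (word, tag) pairs.

    Assumption: tokens are whitespace-separated and the tag follows
    the LAST separator, so a word that itself contains '/' survives.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the final occurrence of the separator.
        word, found, tag = token.rpartition(sep)
        if not found or not word or not tag:
            raise ValueError(f"token is not in word{sep}TAG form: {token!r}")
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    # 'agus/Cc' comes from the corpus description; no other tags are assumed.
    print(parse_tagged_line("agus/Cc"))
```

Splitting on the last separator rather than the first is a design choice that keeps tokens such as dates or fractions intact if they occur in the files.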

The annotation scheme is described in a PDF included with the data: Lamb, W. and Naismith, S. (2014) Scottish Gaelic Part-of-Speech Annotation Guidelines.

This work was funded by Bòrd na Gàidhlig and the Carnegie Trust for the Universities of Scotland.
Date made available: May 2016
Publisher: University of Edinburgh
Temporal coverage: 1997 - 2002
Date of data production: 1997 - 2002
Geographical coverage: Highlands and Islands of Scotland
  • Developing an automatic part-of-speech tagger for Scottish Gaelic

    Danso, S. & Lamb, W., 23 Aug 2014, Proceedings of the Celtic Technology Workshop (CLTW 2014): A Workshop of the 25th International Conference on Computational Linguistics (COLING 2014), August 23, 2014, Dublin, Ireland. Judge, J., Lynn, T., Ward, M. & Ó Raghallaigh, B. (eds.). Vol. 1, pp. 1-5.

    Research output: Chapter in Book/Report/Conference proceeding; Chapter (peer-reviewed)

    Open Access
