Abstract
Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033% across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, which we audit manually to ensure reliability. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model's performance, both in comparison to existing open models and by language class.
Original language | English |
---|---|
Title of host publication | Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) |
Editors | Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki |
Place of Publication | Toronto, Canada |
Publisher | Association for Computational Linguistics |
Pages | 865-879 |
Number of pages | 15 |
ISBN (Electronic) | 9781959429715 |
Publication status | Published - 9 Jul 2023 |
Event | The 61st Annual Meeting of the Association for Computational Linguistics, Westin Harbour Castle, Toronto, Canada. Duration: 9 Jul 2023 → 14 Jul 2023. Conference number: 61. https://2023.aclweb.org/ |
Publication series
Name | Proceedings of the ACL Conference |
---|---|
Publisher | ACL |
ISSN (Print) | 0736-587X |
Conference
Conference | The 61st Annual Meeting of the Association for Computational Linguistics |
---|---|
Abbreviated title | ACL 2023 |
Country/Territory | Canada |
City | Toronto |
Period | 9 Jul 2023 → 14 Jul 2023 |