An open dataset and model for language identification

Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, Kenneth Heafield

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033% across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, which we audit manually to ensure reliability. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model's performance, both in comparison to existing open models and by language class.
Original language: English
Title of host publication: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Place of Publication: Toronto, Canada
Publisher: Association for Computational Linguistics
Pages: 865-879
Number of pages: 15
ISBN (Electronic): 9781959429715
Publication status: Published - 9 Jul 2023
Event: The 61st Annual Meeting of the Association for Computational Linguistics - Westin Harbour Castle, Toronto, Canada
Duration: 9 Jul 2023 - 14 Jul 2023
Conference number: 61
https://2023.aclweb.org/

Publication series

Name: Proceedings of the ACL Conference
Publisher: ACL
ISSN (Print): 0736-587X

Conference

Conference: The 61st Annual Meeting of the Association for Computational Linguistics
Abbreviated title: ACL 2023
Country/Territory: Canada
City: Toronto
Period: 9/07/23 - 14/07/23