Arabic dialect identification under scrutiny: Limitations of single-label classification

Amr Keleg, Walid Magdy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Automatic Arabic Dialect Identification (ADI) of text has gained great popularity since it was introduced in the early 2010s. Multiple datasets have been developed, and yearly shared tasks have been running since 2018. However, ADI systems are reported to fail to distinguish between the micro-dialects of Arabic. We argue that the currently adopted framing of the ADI task as a single-label classification problem is one of the main reasons for this failure. We highlight the incompleteness of the dialect labels as a limitation and demonstrate how it impacts the evaluation of ADI systems. A manual error analysis of the predictions of an ADI system, performed by 7 native speakers of different Arabic dialects, revealed that ≈67% of the validated errors are not true errors. Consequently, we propose framing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.
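The evaluation issue described in the abstract can be illustrated with a minimal sketch. The sentences, dialect labels, and function names below are hypothetical and are not taken from the paper or its dataset; the sketch only assumes that some sentences are valid in more than one dialect, so a prediction that disagrees with the single gold label is not necessarily a true error.

```python
from typing import List, Set


def single_label_accuracy(preds: List[str], gold: List[str]) -> float:
    """Fraction of predictions that match the one gold label exactly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def multi_label_accuracy(preds: List[str], valid: List[Set[str]]) -> float:
    """Fraction of predictions that fall inside the set of valid dialects."""
    return sum(p in v for p, v in zip(preds, valid)) / len(valid)


# Hypothetical toy data (invented for illustration, not from the paper).
preds = ["Egypt", "Syria", "Jordan", "Morocco"]
gold = ["Egypt", "Lebanon", "Palestine", "Tunisia"]
# Assumed sets of dialects in which each sentence is acceptable.
valid = [
    {"Egypt"},
    {"Lebanon", "Syria"},
    {"Palestine", "Jordan"},
    {"Tunisia"},
]

single = single_label_accuracy(preds, gold)
multi = multi_label_accuracy(preds, valid)

# Errors under the single-label view that the multi-label view accepts.
false_errors = sum(
    p != g and p in v for p, g, v in zip(preds, gold, valid)
)

print(f"single-label accuracy: {single:.2f}")
print(f"multi-label accuracy:  {multi:.2f}")
print(f"errors that are not true errors: {false_errors}")
```

In this toy setting, two of the three single-label errors are predictions of a dialect in which the sentence is also valid, which is the kind of discrepancy that a multi-label framing is meant to expose.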
Original language: English
Title of host publication: Proceedings of ArabicNLP 2023
Editors: Hassan Sawaf, Samhaa El-Beltagy, Wajdi Zaghouani, Walid Magdy, Ahmed Abdelali, Nadi Tomeh, Ibrahim Abu Farha, Nizar Habash, Salam Khalifa, Amr Keleg, Hatem Haddad, Imed Zitouni, Khalil Mrini, Rawan Almatham
Publisher: Association for Computational Linguistics
Pages: 385-398
Number of pages: 14
Edition: 1
ISBN (Electronic): 9781959429272
Publication status: Published - 7 Dec 2023
Event: The First Arabic Natural Language Processing Conference - Sentosa, Singapore
Duration: 7 Dec 2023 to 7 Dec 2023
Conference number: 1
https://arabicnlp2023.sigarab.org/

Conference

Conference: The First Arabic Natural Language Processing Conference
Abbreviated title: ArabicNLP 2023
Country/Territory: Singapore
City: Sentosa
Period: 7/12/23 to 7/12/23