ALDi: Quantifying the Arabic level of dialectness of text

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract / Description of output

Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications. To handle this variation, previous work in Arabic NLP has focused on Dialect Identification (DI) on the sentence or the token level. However, DI treats the task as binary, whereas we argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi), a continuous linguistic variable. We introduce the AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences (17% from news articles and 83% from user comments on those articles) which are manually labeled with their level of dialectness. We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora (including dialects and genres not included in AOC-ALDi), providing a more nuanced picture than traditional DI systems. Through case studies, we illustrate how ALDi can reveal Arabic speakers' stylistic choices in different situations, a useful property for sociolinguistic analyses.
Original language: English
Title of host publication: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Publisher: Association for Computational Linguistics
Pages: 10597-10611
Number of pages: 15
Edition: 28
ISBN (Electronic): 9798891760608
Publication status: Published - 1 Dec 2023
Event: The 2023 Conference on Empirical Methods in Natural Language Processing - Resorts World Convention Centre, Sentosa, Singapore
Duration: 6 Dec 2023 to 10 Dec 2023
Conference number: 28
https://2023.emnlp.org/

Conference

Conference: The 2023 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2023
Country/Territory: Singapore
City: Sentosa
Period: 6/12/23 to 10/12/23