The Effect of Arabic Dialect Familiarity on Data Annotation

Ibrahim Abu Farha, Walid Magdy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Data annotation is the foundation of most natural language processing (NLP) tasks. However, data annotation is complex and there is often no specific correct label, especially in subjective tasks. Data annotation is affected by the annotators' ability to understand the provided data. In the case of Arabic, this is important due to the large dialectal variety. In this paper, we analyse how Arabic speakers understand other dialects in written text. Also, we analyse the effect of dialect familiarity on the quality of data annotation, focusing on Arabic sarcasm detection. This is done by collecting third-party labels and comparing them to high-quality first-party labels. Our analysis shows that annotators tend to better identify their own dialect and they are prone to confuse dialects they are unfamiliar with. For task labels, annotators tend to perform better on their dialect or dialects they are familiar with. Finally, females tend to perform better than males on the sarcasm detection task. We suggest that to guarantee high-quality labels, researchers should recruit native dialect speakers for annotation.
Original language: English
Title of host publication: Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Editors: Houda Bouamor, Hend Al-Khalifa, Kareem Darwish, Owen Rambow, Fethi Bougares, Ahmed Abdelali, Nadi Tomeh, Salam Khalifa, Wajdi Zaghouani
Place of Publication: Stroudsburg, PA, USA
Publisher: Association for Computational Linguistics
Pages: 399-408
Number of pages: 10
ISBN (Print): 978-1-959429-27-2
Publication status: Published - 2 Feb 2023
Event: The Seventh Arabic Natural Language Processing Workshop, 2022 - Abu Dhabi, United Arab Emirates
Duration: 8 Dec 2022 → 8 Dec 2022
Conference number: 7
https://sites.google.com/view/wanlp2022/

Workshop

Workshop: The Seventh Arabic Natural Language Processing Workshop, 2022
Abbreviated title: WANLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 8/12/22 → 8/12/22