Multi-Dialect Arabic POS Tagging: A CRF Approach

Kareem Darwish, Hamdy Mubarak, Mohamed Eldesouki, Ahmed AbdelAli, Younes Samih, Randah Alharbi, Mohammed Attia, Walid Magdy, Laura Kallmeyer

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This paper introduces a new dataset of POS-tagged Arabic tweets in four major dialects, along with tagging guidelines. The data, which we are releasing publicly, includes tweets in Egyptian, Levantine, Gulf, and Maghrebi, with 350 tweets per dialect and train/test/development splits for 5-fold cross-validation. We use a Conditional Random Fields (CRF) sequence labeler to train POS taggers for each dialect, examine the effect of cross-dialect and joint-dialect training, and give benchmark results for the datasets. Using clitic n-grams, clitic metatypes, and stem templates as features, we trained a joint model that correctly tags four different dialects with an average accuracy of 89.3%.
Original language: English
Title of host publication: 11th edition of the Language Resources and Evaluation Conference
Place of Publication: Miyazaki, Japan
Publisher: European Language Resources Association (ELRA)
Pages: 93-98
Number of pages: 6
ISBN (Electronic): 979-10-95546-00-9
Publication status: E-pub ahead of print - 12 May 2018
Event: 11th Edition of the Language Resources and Evaluation Conference - Miyazaki, Japan
Duration: 7 May 2018 to 12 May 2018
http://lrec2018.lrec-conf.org/en/

Conference

Conference: 11th Edition of the Language Resources and Evaluation Conference
Abbreviated title: LREC 2018
Country/Territory: Japan
City: Miyazaki
Period: 7/05/18 to 12/05/18
