SMASH at Qur’an QA 2022: Creating Better Faithful Data Splits for Low-resourced Question Answering Scenarios

Amr Keleg, Walid Magdy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

The Qur'an QA 2022 shared task aims at assessing the possibility of building systems that can extract answers to religious questions given relevant passages from the Holy Qur'an. This paper describes SMASH's system that was used to participate in this shared task. Our experiments reveal a data leakage issue among the different splits of the dataset. This leakage problem hinders the reliability of using the models' performance on the development dataset as a proxy for the ability of the models to generalize to new unseen samples. After creating better faithful splits from the original dataset, the basic strategy of fine-tuning a language model pretrained on classical Arabic text yielded the best performance on the new evaluation split. The results achieved by the model suggest that the small-scale dataset is not enough to fine-tune large transformer-based language models in a way that generalizes well. Conversely, we believe that further attention could be paid to the type of questions that are being used to train the models given the sensitivity of the data.
Original language: English
Title of host publication: Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
Editors: Hend Al-Khalifa, Tamer Elsayed, Hamdy Mubarak, Abdulmohsen Al-Thubaity, Waleed Magdy, Kareem Darwish
Place of Publication: Paris, France
Publisher: European Language Resources Association (ELRA)
Pages: 136-145
Number of pages: 10
ISBN (Electronic): 979-10-95546-75-7
Publication status: Published - 20 Jun 2022
Event: 5th Workshop Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur’an QA and Fine-Grained Hate Speech Detection (OSACT 2022) @ LREC 2022 - Marseille, France
Duration: 20 Jun 2022 → 20 Jun 2022
Conference number: 5

Workshop

Workshop: 5th Workshop Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur’an QA and Fine-Grained Hate Speech Detection (OSACT 2022) @ LREC 2022
Abbreviated title: OSACT 2022
Country/Territory: France
City: Marseille
Period: 20/06/22 → 20/06/22

Keywords / Materials (for Non-textual outputs)

  • Question Answering
  • Reading Comprehension Question Answering
  • Arabic NLP
