Abstract
The Qur'an QA 2022 shared task aims at assessing the possibility of building systems that can extract answers to religious questions given relevant passages from the Holy Qur'an. This paper describes SMASH's system that was used to participate in this shared task. Our experiments reveal a data leakage issue among the different splits of the dataset. This leakage problem hinders the reliability of using the models' performance on the development dataset as a proxy for the ability of the models to generalize to new unseen samples. After creating more faithful splits from the original dataset, the basic strategy of fine-tuning a language model pretrained on classical Arabic text yielded the best performance on the new evaluation split. The results achieved by the model suggest that the small-scale dataset is not enough to fine-tune large transformer-based language models in a way that generalizes well. Conversely, we believe that further attention could be paid to the type of questions that are being used to train the models given the sensitivity of the data.
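The leakage issue described above can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not say how the new splits were built, so the grouping key (here, the question text), the field names, and the helper functions `leaked_groups` and `group_split` are all assumptions made for illustration. The idea is only that samples sharing a key should never straddle the training and development splits.

```python
import random
from collections import defaultdict


def leaked_groups(train, dev, key):
    """Return the group keys that occur in both splits (a leakage signal)."""
    return {key(s) for s in train} & {key(s) for s in dev}


def group_split(samples, key, dev_fraction=0.2, seed=0):
    """Split so that all samples sharing a key land in the same split."""
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s)
    keys = sorted(groups)                      # deterministic order before shuffling
    random.Random(seed).shuffle(keys)
    n_dev = max(1, round(len(keys) * dev_fraction))
    dev = [s for k in keys[:n_dev] for s in groups[k]]
    train = [s for k in keys[n_dev:] for s in groups[k]]
    return train, dev


# Toy data; the "question" and "passage" field names are hypothetical.
samples = [
    {"question": "q1", "passage": "p1"},
    {"question": "q1", "passage": "p2"},
    {"question": "q2", "passage": "p1"},
    {"question": "q3", "passage": "p3"},
    {"question": "q3", "passage": "p4"},
]


def by_question(sample):
    return sample["question"]


# A naive positional split puts the same question on both sides.
naive_train, naive_dev = samples[::2], samples[1::2]
naive_leak = leaked_groups(naive_train, naive_dev, by_question)

# A grouped split keeps every question entirely on one side.
train, dev = group_split(samples, by_question, dev_fraction=0.34)
grouped_leak = leaked_groups(train, dev, by_question)
```

This sketch groups by question only; overlap at the passage level would need its own check, for example by calling `leaked_groups` with a passage-based key.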
Original language | English |
---|---|
Title of host publication | Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection |
Editors | Hend Al-Khalifa, Tamer Elsayed, Hamdy Mubarak, Abdulmohsen Al-Thubaity, Waleed Magdy, Kareem Darwish |
Place of Publication | Paris, France |
Publisher | European Language Resources Association (ELRA) |
Pages | 136-145 |
Number of pages | 10 |
ISBN (Electronic) | 979-10-95546-75-7 |
Publication status | Published - 20 Jun 2022 |
Event | 5th Workshop Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur’an QA and Fine-Grained Hate Speech Detection (OSACT 2022) @ LREC 2022, Marseille, France; Duration: 20 Jun 2022 → 20 Jun 2022; Conference number: 5 |
Workshop
Workshop | 5th Workshop Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur’an QA and Fine-Grained Hate Speech Detection (OSACT 2022) @ LREC 2022 |
---|---|
Abbreviated title | OSACT 2022 |
Country/Territory | France |
City | Marseille |
Period | 20/06/22 → 20/06/22 |
Keywords
- Question Answering
- Reading Comprehension Question Answering
- Arabic NLP