Alnefaie, Sarah Saed M (2024) Artificial Intelligence for Answering Questions from the Holy Quran and Hadith. PhD thesis, University of Leeds.
Abstract
The aim of this thesis is to develop a system that uses artificial intelligence (AI) models to answer questions related to the Holy Quran and Hadith in classical Arabic (CA). The Holy Quran and Hadith are the two main sources of the Islamic religion.
To achieve this goal, two question-and-answer corpora were developed: one for the Holy Quran, named the Quran Question-Answer (QUQA), and another for Hadith, called the Hadith Question-Answer (HAQA). QUQA is an Arabic dataset focused on the Holy Quran, comprising 3,369 records and over 301,000 tokens. Since some questions may have multiple answers, there are a total of 2,189 unique questions. The verses in the answers represent nearly 47% of the Quran. In the Arabic HAQA dataset for Hadith, there are 1,598 records, over 290,000 tokens, and 1,366 questions.
After creating the datasets, various deep learning (DL) models were explored to obtain answers to the Arabic religious questions. These deep learning models are categorized into pre-trained language models (PLMs) and large language models (LLMs). PLMs, such as BERT, are smaller language models that are fine-tuned for specific downstream tasks. In contrast, LLMs, such as GPT-4, can execute tasks without requiring tailored training data.
When building a question-answering system using PLMs, the system comprises two tasks: passage retrieval (PR) and machine reading comprehension (MRC). In the PR task, the entire Quran or Hadith book is divided into paragraphs. These paragraphs, along with the question, serve as inputs to the model, which retrieves the paragraph containing the answer. There are several approaches to performing the PR task. The method that achieved the best results with the Quran is the Dense Representations Approach (DPR) using the AraBERT Base model. For the Hadith dataset, two approaches achieved good results, both using the CAMeL-BERT model: the hybrid approach and the relevance classification approach. In the MRC task, the inputs are the paragraph selected in the PR task and the question. The model identifies the specific answer within the paragraph. I evaluated the performance of approximately nine Arabic BERT-based models on the Quran and Hadith, employing various methods to improve performance and different datasets for training. Combining AraBERT Large and AraBERT Base achieved the highest results for the Quran, while CAMeL-BERT achieved the best results for the Hadith.
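The two-stage pipeline described above (retrieve a paragraph, then extract the answer from it) can be illustrated with a minimal sketch. This is not the thesis implementation: the bag-of-words `embed` function is a toy stand-in for a dense encoder such as a BERT model, the sentence-overlap `read` function is a toy stand-in for an MRC model, and all function names and the English placeholder passages are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def embed(text):
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages):
    """PR stage: return the passage most similar to the question."""
    q = embed(question)
    return max(passages, key=lambda p: cosine(q, embed(p)))


def read(question, passage):
    """Toy stand-in for the MRC stage: return the sentence of the
    passage that shares the most tokens with the question."""
    q_tokens = set(tokenize(question))
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_tokens & set(tokenize(s))))


# Hypothetical placeholder passages, not taken from any corpus.
passages = [
    "the river rises in the northern hills. it flows south to the sea.",
    "the market opens at dawn. traders sell grain and cloth.",
    "the school teaches reading. lessons begin after sunrise.",
]
question = "where do traders sell grain"

best_passage = retrieve(question, passages)
answer = read(question, best_passage)
print(best_passage)
print(answer)
```

In a real system of the kind the abstract describes, `embed` would be replaced by a fine-tuned encoder and `read` by a span-extraction model; the control flow between the two stages would stay the same.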
I also evaluated the effectiveness of LLMs (such as GPT-4) in answering questions related to the Quran and Hadith; however, the outcomes were unsatisfactory. Consequently, I implemented the Retrieval Augmented Generation (RAG) technique, which significantly enhanced the results for Quran-related answers. Nonetheless, these models require considerable further development to achieve a level of understanding comparable to that of humans.
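The general idea behind Retrieval Augmented Generation can be sketched as follows: retrieve the passages most relevant to the question, place them in the prompt, and ask the model to answer from that context. This is a minimal sketch under stated assumptions, not the method used in the thesis: the token-overlap ranking, the prompt wording, and the `generate` callable (a hypothetical wrapper around whichever LLM is used) are all illustrative.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def top_k(question, passages, k=2):
    """Rank passages by token overlap with the question (a toy
    stand-in for a real retriever) and keep the best k."""
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that grounds the model in retrieved context."""
    retrieved = top_k(question, passages, k)
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(retrieved, start=1))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def rag_answer(question, passages, generate, k=2):
    """Run retrieval, then pass the grounded prompt to `generate`,
    a hypothetical callable that wraps an LLM and returns its text."""
    return generate(build_prompt(question, passages, k))


# Hypothetical placeholder passages, not taken from any corpus.
passages = [
    "the river rises in the northern hills. it flows south to the sea.",
    "the market opens at dawn. traders sell grain and cloth.",
    "the school teaches reading. lessons begin after sunrise.",
]
question = "where do traders sell grain"
prompt = build_prompt(question, passages, k=2)

# A stub in place of a real model call, so the sketch runs offline.
reply = rag_answer(question, passages, generate=lambda p: f"({len(p)} chars seen)")
print(prompt)
print(reply)
```

Passing `generate` as a parameter keeps the sketch independent of any particular model API; a real deployment would supply a function that sends the prompt to the chosen LLM.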
Metadata
Supervisors: | Atwell, Eric and Alsalka, Mohammed Ammar |
---|---|
Awarding institution: | University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering (Leeds) |
Academic unit: | School of Computer Science |
Depositing User: | Ms Sarah Alnefaie |
Date Deposited: | 18 Mar 2025 14:22 |
Last Modified: | 18 Mar 2025 14:22 |
Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:36372 |
Download
Final eThesis - complete (pdf)
Filename: Final_Thesis.pdf
Licence:
This work is licensed under a Creative Commons Attribution NonCommercial ShareAlike 4.0 International License