Compression versus Machine Learning for Classifying Modern Arabic Code-Switching in Social Media and Classical Arabic Hadith

Abstract

This thesis aims to enrich Arabic resources by building several Arabic corpora and making them freely available to the Arabic research community. To this end, the Bangor Arabic–English Code-switching (BAEC) corpus, the Saudi Dialect Corpus (SDC), the Egyptian Dialect Corpus (EDC) and the Non-Authentic Hadith (NAH) corpus were built.

This thesis carries out the detection of code-switching in Arabic varieties and dialects from social media platforms to evaluate the prediction by partial matching (PPM) compression approach, comparing it with the support vector machine (SVM) classifier using character-based and word-based approaches. The aim was to test PPM compression on Modern Standard Arabic (MSA) and Arabic dialects before using it on Hadith. To the best of our knowledge, no previous study on detecting code-switching between Arabic and English using PPM compression has been published. The experimental results show that PPM compression achieved a higher accuracy rate than the SVM classifier when the training corpus correctly represented the language or dialect being studied.
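The idea behind compression-based classification can be illustrated with a minimal sketch: train one character model per language, then assign a text to the language whose model encodes it in the fewest bits. The sketch below is not the thesis's implementation. It assumes a fixed-order character n-gram model with add-one smoothing as a simplified stand-in for PPM (real PPM blends several context orders through an escape mechanism), and the class name `CharNGramModel`, the function `classify` and the order of 2 are hypothetical choices made for illustration.

```python
import math
from collections import Counter


class CharNGramModel:
    """Fixed-order character model with add-one smoothing.

    A simplified stand-in for PPM, used only to illustrate the principle.
    """

    def __init__(self, order=2):
        self.order = order
        self.context_counts = Counter()  # how often each context was seen
        self.ngram_counts = Counter()    # how often (context, next char) was seen
        self.alphabet = set()

    def train(self, text):
        # Pad with spaces so the first characters also have a full context.
        padded = " " * self.order + text
        self.alphabet.update(padded)
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.ngram_counts[(context, padded[i])] += 1
            self.context_counts[context] += 1

    def code_length(self, text, alphabet_size):
        """Return the number of bits needed to encode text under this model."""
        padded = " " * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            count = self.ngram_counts[(context, padded[i])]
            total = self.context_counts[context]
            probability = (count + 1) / (total + alphabet_size)
            bits -= math.log2(probability)
        return bits


def classify(text, models):
    """Pick the label whose model compresses the text best (fewest bits)."""
    # Use one shared alphabet size so the code lengths are comparable.
    symbols = set(text)
    for model in models.values():
        symbols |= model.alphabet
    return min(models, key=lambda label: models[label].code_length(text, len(symbols)))
```

With one model trained per language or dialect, `classify(text, models)` returns the label of the model that gives the shortest encoding. This also makes the dependence on training data visible: a model can only compress well the kind of text its training corpus represents.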

Classification experiments on Arabic Hadith were then performed to evaluate the PPM compression approach and compare it against machine learning and deep learning (DL) approaches. The experiments addressed two main classification tasks: Hadith component classification and Hadith authenticity classification. For the former, the experimental results show that deep learning classifiers achieved a higher classification accuracy than the other classifiers under study, although their execution time was high. For the latter, the experimental results show that Isnad was the part of a Hadith that gave the most effective automatic determination of authenticity. In addition, the results show that Matan can be used to judge Hadiths with up to 85% accuracy. These experiments were novel in their approach to Hadith authenticity classification because they investigated the use of the character-based text compression scheme PPM and DL classifiers.

Finally, the thesis investigated the automatic segmentation of Arabic Hadith using PPM compression. The experiments, run on several Hadith corpora with different structures, showed that PPM was effective in segmenting a Hadith into its two main components, Isnad and Matan. The main innovation in these experiments was their use of a character-based text compression method to segment the Hadiths.

Metadata

Supervisors:	Atwell, Eric and Alsalka, Mohammad
Related URLs:	Deep Learning vs Compression-Based vs Traditional Machine Learning Classifiers to Detect Hadith Authenticity (Related publication); Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study (Related publication); NAH Corpus (Research data); Automatic Hadith Segmentation using PPM Compression (Related publication); Non-authentic Hadith Corpus: Design and Methodology (Related publication)
Keywords:	PPM, Deep Learning, Machine Learning, Arabic NLP, Hadith, Code-switching, Corpus
Awarding institution:	University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering (Leeds) > School of Computing (Leeds)
Identification Number/EthosID:	uk.bl.ethos.878066
Depositing User:	Mrs Taghreed Tarmom
Date Deposited:	20 Apr 2023 13:30
Last Modified:	11 May 2023 09:53
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:32522

Download

Final eThesis - complete (pdf)

Filename: tarmom22ThesisV16.pdf

Licence:
This work is licensed under a Creative Commons Attribution NonCommercial ShareAlike 4.0 International License
