White Rose University Consortium logo
University of Leeds logo University of Sheffield logo York University logo

Ensemble Morphosyntactic Analyser for Classical Arabic

Alosaimy, Abdulrahman Mohammed S (2018) Ensemble Morphosyntactic Analyser for Classical Arabic. PhD thesis, University of Leeds.

alosaimy18thesisV76.pdf - Final eThesis - complete (pdf)
Available under License Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales.

Download (9Mb) | Preview


Classical Arabic (CA) is an influential language for Muslim lives around the world. It is the language of two sources of Islamic laws: the Quran and the Sunnah, the collection of traditions and sayings attributed to the prophet Mohammed. However, classical Arabic in general, and the Sunnah, in particular, is underexplored and under-resourced in the field of computational linguistics. This study examines the possible directions for adapting existing tools, specifically morphological analysers, designed for modern standard Arabic (MSA) to classical Arabic. Morphological analysers of CA are limited, as well as the data for evaluating them. In this study, we adapt existing analysers and create a validation data-set from the Sunnah books. Inspired by the advances in deep learning and the promising results of ensemble methods, we developed a systematic method for transferring morphological analysis that is capable of handling different labelling systems and various sequence lengths. In this study, we handpicked the best four open access MSA morphological analysers. Data generated from these analysers are evaluated before and after adaptation through the existing Quranic Corpus and the Sunnah Arabic Corpus. The findings are as follows: first, it is feasible to analyse under-resourced languages using existing comparable language resources given a small sufficient set of annotated text. Second, analysers typically generate different errors and this could be exploited. Third, an explicit alignment of sequences and the mapping of labels is not necessary to achieve comparable accuracies given a sufficient size of training dataset. Adapting existing tools is easier than creating tools from scratch. The resulting quality is dependent on training data size and number and quality of input taggers. Pipeline architecture performs less well than the End-to-End neural network architecture due to error propagation and limitation on the output format. A valuable tool and data for annotating classical Arabic is made freely available.

Item Type: Thesis (PhD)
Related URLs:
Keywords: Ensemble Morphological analysis Classical Arabic Sunnah Deep learning Pos tagging
Academic Units: The University of Leeds > Faculty of Engineering (Leeds) > School of Computing (Leeds)
Identification Number/EthosID: uk.bl.ethos.759822
Depositing User: Abdulrahman Alosaimy
Date Deposited: 03 Dec 2018 12:14
Last Modified: 18 Feb 2020 12:32
URI: http://etheses.whiterose.ac.uk/id/eprint/22359

You do not need to contact us to get a copy of this thesis. Please use the 'Download' link(s) above to get a copy.
You can contact us about this thesis. If you need to make a general enquiry, please see the Contact us page.

Actions (repository staff only: login required)