Ensemble Morphosyntactic Analyser for Classical Arabic

Abstract

Classical Arabic (CA) is an influential language in the lives of Muslims around the world. It is the language of two sources of Islamic law: the Quran and the Sunnah, the collection of traditions and sayings attributed to the prophet Mohammed. However, Classical Arabic in general, and the Sunnah in particular, is underexplored and under-resourced in the field of computational linguistics. This study examines possible directions for adapting existing tools designed for Modern Standard Arabic (MSA), specifically morphological analysers, to Classical Arabic.
Morphological analysers for CA are limited, as is the data for evaluating them. In this study, we adapt existing analysers and create a validation dataset from the Sunnah books. Inspired by advances in deep learning and the promising results of ensemble methods, we developed a systematic method for transferring morphological analysis that can handle different labelling systems and varying sequence lengths.
We handpicked the four best open-access MSA morphological analysers. The output of these analysers is evaluated before and after adaptation against the existing Quranic Corpus and the Sunnah Arabic Corpus. The findings are as follows. First, it is feasible to analyse an under-resourced language using the existing resources of a comparable language, given a small but sufficient set of annotated text. Second, analysers typically make different errors, and this can be exploited. Third, explicit alignment of sequences and mapping of labels are not necessary to achieve comparable accuracy, given a sufficiently large training dataset.
Adapting existing tools is easier than creating tools from scratch. The resulting quality depends on the size of the training data and on the number and quality of the input taggers. The pipeline architecture performs less well than the end-to-end neural network architecture because of error propagation and limitations on the output format. A valuable tool and dataset for annotating Classical Arabic are made freely available.
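
The second finding, that analysers make different errors and that this can be exploited, can be illustrated with a simple majority vote. The following is a minimal sketch and not the method developed in the thesis: the tagger names, the label maps, and the shared tagset are all hypothetical, and the sketch assumes every analyser emits exactly one tag per token. It relies on the explicit label mapping and fixed alignment that the third finding suggests an end-to-end model can do without.

```python
from collections import Counter

# Hypothetical label maps: each analyser uses its own tagset, so tags are
# first mapped to a shared (also hypothetical) coarse tagset.
LABEL_MAPS = {
    "tagger_a": {"NN": "NOUN", "VB": "VERB", "IN": "PREP"},
    "tagger_b": {"noun": "NOUN", "verb": "VERB", "prep": "PREP"},
    "tagger_c": {"N": "NOUN", "V": "VERB", "P": "PREP"},
}


def normalise(tagger, tags):
    """Map one analyser's tags to the shared tagset; unknown tags become UNK."""
    mapping = LABEL_MAPS[tagger]
    return [mapping.get(tag, "UNK") for tag in tags]


def majority_vote(predictions):
    """Combine per-token tags from several analysers by majority vote.

    `predictions` maps a tagger name to its list of tags for one sentence.
    All lists must have the same length (one tag per token).
    """
    normalised = [normalise(name, tags) for name, tags in predictions.items()]
    if len({len(tags) for tags in normalised}) != 1:
        raise ValueError("all analysers must tag the same number of tokens")
    voted = []
    for column in zip(*normalised):
        # most_common(1) returns the most frequent tag; ties are broken in
        # favour of the tag seen first, i.e. the earliest-listed analyser.
        voted.append(Counter(column).most_common(1)[0][0])
    return voted


# Three analysers tag a three-token sentence; each makes a different error.
predictions = {
    "tagger_a": ["NN", "VB", "NN"],      # wrong on the third token
    "tagger_b": ["noun", "verb", "prep"],
    "tagger_c": ["N", "N", "P"],         # wrong on the second token
}
ensemble_tags = majority_vote(predictions)
print(ensemble_tags)
```

Because the two errors fall on different tokens, the vote recovers the tag that two of the three analysers agree on at every position.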

Metadata

Supervisors:	Atwell, Eric
Related URLs:	Data (Research data); Sunnah Arabic Corpus (Research data); Sawaref Project Code (Project); A review of morphosyntactic analysers and tag-sets for Arabic corpus linguistics (Related publication); Tagging Classical Arabic Text using Available Morphological Analysers and Part of Speech Taggers (Related publication); Diacritization of a Highly Cited Text: A Classical Arabic Book as a Case (Related publication); Web-based Annotation Tool for Inflectional Language Resources (Related publication)
Keywords:	Ensemble; Morphological analysis; Classical Arabic; Sunnah; Deep learning; POS tagging
Awarding institution:	University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering (Leeds) > School of Computing (Leeds)
Identification Number/EthosID:	uk.bl.ethos.759822
Depositing User:	Abdulrahman Alosaimy
Date Deposited:	03 Dec 2018 12:14
Last Modified:	18 Feb 2020 12:32
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:22359

Download

Final eThesis - complete (pdf)

Filename: alosaimy18thesisV76.pdf

Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License

