Building the Arabic Learner Corpus and a System for Arabic Error Annotation

Abstract

Recent developments in learner corpora have highlighted their growing role in linguistic and computational research areas such as language teaching and natural language processing. However, Arabic still lacks a well-designed learner corpus that can be used for studies in these areas.
This thesis introduces a detailed and original methodology for developing a new learner corpus. This methodology, which represents the major contribution of the thesis, combines resources, proposed standards and tools developed for the Arabic Learner Corpus project. The resources include the Arabic Learner Corpus (ALC), which is the largest learner corpus for Arabic based on systematic design criteria. They also include the Error Tagset of Arabic, which was designed for annotating errors in Arabic and covers 29 error types under five broad categories.
The Guide on Design Criteria for Learner Corpus, an example of the proposed standards, was created based on a review of previous work and focuses on 11 aspects of corpus design criteria. The tools include the Computer-aided Error Annotation Tool for Arabic, which provides functions that facilitate error annotation, such as the smart-selection function and the auto-tagging function. The tools also include the ALC Search Tool, which was developed to enable searching the ALC and downloading the source files based on a number of determinants.
The project recruited 992 people, including language learners, data collectors, evaluators, annotators and collaborators, from more than 30 educational institutions in Saudi Arabia and the UK. The data of the Arabic Learner Corpus has been used in a number of projects for different purposes, including error detection and correction, native language identification, evaluation of Arabic analysers, applied linguistics studies and data-driven Arabic learning. This use of the ALC highlights the importance of developing the project.

Metadata

Supervisors:	Atwell, Eric
Related URLs:	Author Research data Research data
Keywords:	Arabic, Learner, Corpus
Awarding institution:	University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering (Leeds) > School of Computing (Leeds)
Identification Number/EthosID:	uk.bl.ethos.666598
Depositing User:	Abdullah Alfaifi
Date Deposited:	15 Sep 2015 13:18
Last Modified:	25 Nov 2015 13:49
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:9736

Download

Final eThesis - complete (pdf)

Filename: ALFAIFI_PhD_Thesis.pdf

Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License

