Arabic Dialect Texts Classification

Abstract

This study investigates how to classify Arabic dialects in text by extracting features that distinguish one dialect from another. Classification of Arabic dialect texts has received less research attention than the equivalent task for English and some other languages, largely because far fewer Arabic dialect text corpora are available. At the same time, Arabic dialects are increasingly used on social media, which makes social media text both an accepted medium of dialect communication and a suitable source for a corpus. We collected tweets from Twitter, and comments from Facebook and online newspapers, covering five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. The research sought to: 1) create a dataset of Arabic dialect texts for training and testing the classification system, 2) find appropriate features for classifying Arabic dialects, namely lexical (word and multi-word-unit) and grammatical variation across dialects, 3) build a more sophisticated filter to extract features from dialect text files written in Arabic script.
In this thesis, the first part describes the research motivation and explains why Arabic dialects were chosen as a research topic. The second part presents background information about the Arabic language and its dialects, and the literature review surveys previous research on the subject. The research methodology part describes an initial experiment in classifying Arabic dialects. The results of this experiment showed the need to create an Arabic dialect text corpus, which was built by exploring Twitter and online newspapers. The corpus was used to train an ensemble classifier; to improve classification accuracy, the corpus was then extended with tweets collected from Twitter based on spatial coordinate points and with comments from Facebook posts. The corpus was annotated with dialect labels and used in automatic dialect classification experiments. The last part of the thesis presents the classification results, conclusions, and future work.
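
As an illustration of the lexical approach described above, the following is a minimal sketch of dialect classification from word and word-pair features. It is not the thesis's system: the thesis mentions an ensemble classifier, whereas this sketch substitutes a single multinomial Naive Bayes model with add-one smoothing, and it uses word bigrams as a rough stand-in for multi-word units. The names `extract_features`, `NaiveBayesDialectClassifier`, and `TOY_CORPUS` are hypothetical, and the ten training phrases are invented toy examples, not data from the thesis corpus.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Return word unigrams plus word bigrams (a stand-in for multi-word units)."""
    tokens = text.split()
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayesDialectClassifier:
    """Multinomial Naive Bayes over lexical features (assumed stand-in model)."""

    def __init__(self):
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.doc_counts = Counter()                 # label -> number of texts
        self.vocabulary = set()

    def train(self, labelled_texts):
        for text, label in labelled_texts:
            self.doc_counts[label] += 1
            for feature in extract_features(text):
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus add-one smoothed log likelihood of each feature.
            score = math.log(self.doc_counts[label] / total_docs)
            label_total = sum(self.feature_counts[label].values())
            for feature in extract_features(text):
                count = self.feature_counts[label][feature]
                score += math.log((count + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy phrases, two per dialect group; NOT taken from the thesis corpus.
TOY_CORPUS = [
    ("انا عايز اروح دلوقتي", "Egyptian"),
    ("ازيك عامل ايه", "Egyptian"),
    ("انا بدي روح هلق", "Levantine"),
    ("كيفك شو اخبارك", "Levantine"),
    ("ابغى اروح الحين", "Gulf"),
    ("شلونك شخبارك", "Gulf"),
    ("اريد اروح هسه", "Iraqi"),
    ("شكو ماكو", "Iraqi"),
    ("بغيت نمشي دابا", "North African"),
    ("واش راك لاباس", "North African"),
]

classifier = NaiveBayesDialectClassifier()
classifier.train(TOY_CORPUS)
print(classifier.predict("انا عايز اشوفك دلوقتي"))
```

A corpus this small only demonstrates the mechanics. With realistic data, dialect-specific words and word pairs would carry most of the signal, and words shared across dialects would contribute little to the decision.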

Metadata

Supervisors:	Atwell, Eric
Keywords:	Arabic Dialect, Classification, Machine Learning, Corpora
Awarding institution:	University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering (Leeds) > School of Computing (Leeds)
Identification Number/EthosID:	uk.bl.ethos.772851
Depositing User:	Mrs Areej Alshutayri
Date Deposited:	16 Apr 2019 09:59
Last Modified:	11 May 2020 09:53
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:23600

Download

Final eThesis - complete (pdf)

Filename: Areej Alshutayri Thesis.pdf

Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License

