White Rose University Consortium logo
University of Leeds logo University of Sheffield logo York University logo

Arabic Dialect Texts Classification

Alshutayri, Areej Odah O. (2018) Arabic Dialect Texts Classification. PhD thesis, University of Leeds.

[img] Text
Areej Alshutayri Thesis.pdf - Final eThesis - complete (pdf)
Restricted until 1 May 2020.

Request a copy


This study investigates how to classify Arabic dialects in text by extracting features which show the differences between dialects. There has been a lack of research about classification of Arabic dialect texts, in comparison to English and some other languages, due to the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and some other languages. What is more, there is an increasing use of Arabic dialects in social media, so this text is now considered quite appropriate as a medium of communication and as a source of a corpus. We collected tweets from Twitter, comments from Facebook and online newspapers from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. The research sought to: 1) create a dataset of Arabic dialect texts to use in training and testing the system of classification, 2) find appropriate features to classify Arabic dialects: lexical (word and multi-word-unit) and grammatical variation across dialects, 3) build a more sophisticated filter to extract features from Arabic-character written dialect text files. In this thesis, the first part describes the research motivation to show the reason for choosing the Arabic dialects as a research topic. The second part presents some background information about the Arabic language and its dialects, and the literature review shows previous research about this subject. The research methodology part shows the initial experiment to classify Arabic dialects. The results of this experiment showed the need to create an Arabic dialect text corpus, by exploring Twitter and online newspaper. The corpus used to train the ensemble classifier and to improve the accuracy of classification the corpus was extended by collecting tweets from Twitter based on the spatial coordinate points and comments from Facebook posts. The corpus was annotated with dialect labels and used in automatic dialect classification experiments. The last part of this thesis presents the results of classification, conclusions and future work.

Item Type: Thesis (PhD)
Keywords: Arabic Dialect, Classification, Machine Learning, Corpora
Academic Units: The University of Leeds > Faculty of Engineering (Leeds) > School of Computing (Leeds)
Depositing User: Mrs Areej Alshutayri
Date Deposited: 16 Apr 2019 09:59
Last Modified: 16 Apr 2019 09:59
URI: http://etheses.whiterose.ac.uk/id/eprint/23600

Please use the 'Request a copy' link(s) above to request this thesis. This will be sent directly to someone who may authorise access.
You can contact us about this thesis. If you need to make a general enquiry, please see the Contact us page.

Actions (repository staff only: login required)