Open-source resources and standards for Arabic word structure analysis: Fine grained morphological analysis of Arabic text corpora

Abstract

Morphological analyzers are preprocessors for text analysis, and many text analytics applications need them to perform their tasks. The aim of this thesis is to develop standards, tools and resources that widen the scope of Arabic word structure analysis, particularly morphological analysis, to process Arabic text corpora of different domains, formats and genres, of both vowelized and non-vowelized text.

We want to morphologically tag our Arabic Corpus, but evaluation of existing morphological analyzers has highlighted shortcomings and shown that more research is required. Tag-assignment is significantly more complex for Arabic than for many languages. The morphological analyzer should add the appropriate linguistic information to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of a tag for a word, we need a subtag for each part.

Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis, particularly for probabilistic taggers which require training data, if some words can change grammatical tag depending on function and context; on the other hand, fine-grained distinctions may actually help to disambiguate other words in the local context. The SALMA – Tagger is a fine-grained morphological analyzer which depends mainly on linguistic information extracted from traditional Arabic grammar books and on prior-knowledge broad-coverage lexical resources, namely the SALMA – ABCLexicon.

More fine-grained tag sets may be more appropriate for some tasks. The SALMA – Tag Set is a theory standard for encoding, which captures long-established traditional fine-grained morphological features of Arabic, in a notation format intended to be compact yet transparent.

The SALMA – Tagger has been used to lemmatize the 176-million-word Arabic Internet Corpus. It has been proposed as a language-engineering toolkit for Arabic lexicography and for phonetically annotating the Qur’an with syllable and primary-stress information, as well as for fine-grained morphological tagging.

Metadata

Supervisors:	Atwell, E.
ISBN:	978-0-85731-148-1
Awarding institution:	University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering (Leeds) > School of Computing (Leeds)
Identification Number/EthosID:	uk.bl.ethos.546653
Depositing User:	Repository Administrator
Date Deposited:	27 Feb 2012 10:42
Last Modified:	07 Mar 2014 11:24
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:2165

Download

Filename: Sawalha_MSS_Computing_PhD_2011.pdf

Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License