Onyenwe, Ikechukwu Ekene (2017) Developing Methods and Resources for Automated Processing of the African Language Igbo. PhD thesis, University of Sheffield.
Abstract
Natural Language Processing (NLP) research is still in its infancy in Africa. Most languages in Africa have few or no NLP resources available, and Igbo is among those with none. In this study, we develop NLP resources to support NLP-based research in the Igbo language. The springboard is the development of a new part-of-speech (POS) tagset for Igbo (IgbTS), based on a slight adaptation of the EAGLES guidelines made necessary by language-internal features that EAGLES does not recognize. The tagset comes in three granularities: fine-grained (85 tags), medium-grained (70 tags) and coarse-grained (15 tags). The medium-grained tagset is intended to strike a balance between the other two for practical purposes. Following this is the preprocessing of Igbo electronic texts through normalization and tokenization. The tokenizer developed in this study uses the tagset's definition of a word token, and the outcome is an Igbo corpus (IgbC) of about one million tokens. The IgbTS was applied to a part of the IgbC to produce the first Igbo tagged corpus (IgbTC). To investigate the effectiveness, validity and reproducibility of the IgbTS, an inter-annotator agreement (IAA) exercise was undertaken, which led to the revision of the IgbTS where necessary. A novel automatic method was developed to bootstrap the manual annotation process by exploiting the by-products of this IAA exercise, in order to improve the IgbTC. To further improve the quality of the IgbTC, a committee-of-taggers approach was adopted to propose potentially erroneous instances in the IgbTC for correction. A novel automatic method was also developed and used that draws on knowledge of affixes to flag and correct all morphologically inflected words in the IgbTC whose tags are inconsistent with their morphologically inflected status.
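As an illustration only, the committee-of-taggers idea mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the thesis: it assumes each committee member has already produced one tag per token, and it flags a token for manual review when a strict majority of the committee agrees on a tag that differs from the tag in the corpus. The function name flag_suspect_tags, the parameter min_agreement and the tags "N", "V" and "ADJ" are hypothetical placeholders, not IgbTS tags.

```python
from collections import Counter


def flag_suspect_tags(corpus_tags, committee_predictions, min_agreement=None):
    """Flag positions where the committee's majority tag differs from the corpus tag.

    corpus_tags: list of tags currently stored in the tagged corpus.
    committee_predictions: one list of predicted tags per tagger,
        each the same length as corpus_tags.
    min_agreement: votes required for the committee to overrule the corpus
        tag (assumption: defaults to a strict majority of the taggers).
    Returns a list of (index, corpus_tag, committee_tag) tuples.
    """
    n_taggers = len(committee_predictions)
    if min_agreement is None:
        min_agreement = n_taggers // 2 + 1  # strict majority

    flagged = []
    for i, corpus_tag in enumerate(corpus_tags):
        # Count the votes the committee casts for this token.
        votes = Counter(preds[i] for preds in committee_predictions)
        top_tag, top_count = votes.most_common(1)[0]
        # Propose the token for review only if the committee is confident
        # and disagrees with the tag currently in the corpus.
        if top_count >= min_agreement and top_tag != corpus_tag:
            flagged.append((i, corpus_tag, top_tag))
    return flagged


# Toy data with placeholder tags (hypothetical, not taken from IgbTC).
corpus = ["N", "V", "N", "ADJ"]
committee = [
    ["N", "V", "V", "ADJ"],
    ["N", "V", "V", "N"],
    ["N", "N", "V", "ADJ"],
]
suspects = flag_suspect_tags(corpus, committee)
print(suspects)
```

In this sketch the flagged positions are only proposals; a human annotator would still decide whether the corpus tag or the committee tag is correct.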
Experiments towards the development of an automatic POS tagging system for Igbo using the IgbTC show good accuracy scores, comparable to those obtained for other languages on which these taggers have been tested, such as English. Accuracy on words not seen during the taggers' training (also called unknown words) is considerably lower, and lower still on unknown words that are morphologically complex, which indicates difficulty in handling morphologically complex words in Igbo. This was improved by adopting a morphological reconstruction method (a linguistically informed segmentation into stems and affixes) that reformatted these morphologically complex words into patterns learnable by machines. This enables taggers to use knowledge of the stems and associated affixes of these words during tagging to predict their appropriate tags. Interestingly, this method outperforms other methods that existing taggers use to handle unknown words, and achieves a marked increase in accuracy on morphologically inflected unknown words and on unknown words overall. Together, these developments constitute the first NLP toolkit for the Igbo language and a step towards achieving the objective of a Basic Language Resource Kit (BLARK) for the language. This IgboNLP toolkit will be made available to the NLP community and should encourage further research and development for the language.
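The distinction drawn above between overall accuracy and accuracy on unknown words can be made concrete with a short sketch. It assumes, as the abstract states, that an unknown word is one not seen in the taggers' training data; everything else (the function name tagging_accuracy, the placeholder tokens "w1" to "w4" and the tags "N" and "V") is hypothetical and chosen only for illustration. The figures it prints are for the toy data, not results from the thesis.

```python
def tagging_accuracy(train_vocab, tokens, gold_tags, predicted_tags):
    """Compute overall, known-word and unknown-word tagging accuracy.

    train_vocab: set of word forms seen during tagger training.
    tokens, gold_tags, predicted_tags: parallel lists for the test data.
    Returns a dict of accuracies; a value is None when its group is empty.
    """
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for token, gold, predicted in zip(tokens, gold_tags, predicted_tags):
        # A token is "unknown" if its form never occurred in training.
        group = "known" if token in train_vocab else "unknown"
        counts[group][1] += 1
        if gold == predicted:
            counts[group][0] += 1

    def ratio(correct, total):
        return correct / total if total else None

    total_correct = counts["known"][0] + counts["unknown"][0]
    total_tokens = counts["known"][1] + counts["unknown"][1]
    return {
        "overall": ratio(total_correct, total_tokens),
        "known": ratio(*counts["known"]),
        "unknown": ratio(*counts["unknown"]),
    }


# Toy data: placeholder tokens and tags (hypothetical, not Igbo data).
train_vocab = {"w1", "w2"}
tokens = ["w1", "w2", "w3", "w4"]
gold = ["N", "V", "N", "V"]
predicted = ["N", "V", "N", "N"]

scores = tagging_accuracy(train_vocab, tokens, gold, predicted)
print(scores)
```

Reporting the unknown-word figure separately matters because a high overall score can hide poor performance on the unseen, often morphologically complex, words.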
Metadata
Supervisors: | Hepple, Mark R |
---|---|
Publicly visible additional information: | My workplace email address is ie.onyenwe@unizik.edu.ng. |
Awarding institution: | University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Engineering (Sheffield) |
Identification Number/EthosID: | uk.bl.ethos.713301 |
Depositing User: | Mr Ikechukwu Ekene Onyenwe |
Date Deposited: | 05 May 2017 13:45 |
Last Modified: | 12 Oct 2018 09:38 |
Download
Filename: Ikechukwu.E.Onyenwe-PHD-ThesisComplete_version.pdf
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.