Daraselia, Sophiko (2019) Computational Analysis of Morphosyntactic Categories in Georgian. PhD thesis, University of Leeds.
Abstract
This thesis describes the development of part-of-speech (POS) tagging resources for the Georgian language, consisting of (i) a new morphosyntactic language model for POS tagging purposes; (ii) guidelines for tagging and post-editing; (iii) the KATAG tagset; and (iv) the trained parameter files that the probabilistic TreeTagger program needs in order to work on Georgian texts.
The thesis describes a new morphosyntactic model of Georgian designed for POS tagging, together with a tagset (KATAG) defined in accordance with that model and with a set of design principles and tagging guidelines.
A stochastic methodology is used here to perform tagging in Georgian: TreeTagger, a probabilistic part-of-speech tagging program, has been trained on Georgian texts. The justification for this choice is discussed. I use two tokenisation approaches in part-of-speech tagging. An accuracy of 92.41% was achieved using an enclitic tokenisation approach, and an accuracy of 87.13% using a non-enclitic tokenisation approach, corroborating my hypothesis that treating enclitic elements separately from their host words results in better tagging performance.
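The difference between the two tokenisation approaches can be illustrated with a minimal Python sketch. It is not the thesis's tokeniser: the function name `tokenise`, the one-item enclitic list and the naive suffix rule are all assumptions made for illustration, and the actual enclitic inventory and splitting rules are not given in this abstract. The example uses the Georgian additive particle -ც ("also, too"), as in მეც ("me too").

```python
# Hypothetical enclitic inventory, for illustration only; the thesis's
# actual list of enclitic elements is not given in this abstract.
ENCLITICS = ("ც",)  # additive particle, roughly "also/too"


def tokenise(text, split_enclitics):
    """Split text on whitespace; optionally detach a final enclitic.

    The suffix rule is deliberately naive: it would also split words
    that merely end in the same letter, so a real tokeniser would need
    a lexicon or further rules.
    """
    tokens = []
    for word in text.split():
        if split_enclitics:
            for enc in ENCLITICS:
                # Split only if a non-empty host word remains.
                if word.endswith(enc) and len(word) > len(enc):
                    tokens.extend([word[: -len(enc)], enc])
                    break
            else:
                tokens.append(word)
        else:
            tokens.append(word)
    return tokens


# Enclitic approach: host and enclitic become separate tokens.
print(tokenise("მეც მოვდივარ", split_enclitics=True))
# Non-enclitic approach: the word is kept whole.
print(tokenise("მეც მოვდივარ", split_enclitics=False))
```

Under the enclitic approach each host word and each enclitic receives its own tag, whereas under the non-enclitic approach the tagger has to assign a single tag to the combined form.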
To make the tagger easily adaptable to a range of inputs (type, variety or genre of text), the performance of the probabilistic TreeTagger program was evaluated on a test set covering five genres: academic, informal, legal, fiction and news.
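A genre-wise evaluation of this kind usually comes down to token-level accuracy, i.e. the share of tokens whose predicted tag matches the manually assigned one. The sketch below shows that calculation under stated assumptions: the function names, the record layout and the tags `N` and `V` are hypothetical (they are not KATAG tags), and the toy records are invented rather than taken from the thesis.

```python
from collections import defaultdict


def accuracy_by_genre(records):
    """records: iterable of (genre, gold_tag, predicted_tag), one per token."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for genre, gold, pred in records:
        total[genre] += 1
        correct[genre] += int(gold == pred)
    return {genre: correct[genre] / total[genre] for genre in total}


def overall_accuracy(records):
    """Token-level accuracy over all records, regardless of genre."""
    records = list(records)
    hits = sum(1 for _, gold, pred in records if gold == pred)
    return hits / len(records)


# Invented toy data: two genres, two tokens each.
toy_records = [
    ("news", "N", "N"),
    ("news", "V", "N"),   # a tagging error
    ("legal", "N", "N"),
    ("legal", "V", "V"),
]

print(accuracy_by_genre(toy_records))
print(overall_accuracy(toy_records))
```

Reporting accuracy per genre alongside the overall figure shows whether a tagger trained on one kind of text degrades on another.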
Metadata
Supervisors: | Sharoff, Serge and Nelson, Diane and Hardie, Andrew |
---|---|
Keywords: | Part-of-speech tagging, morphosyntactic annotation, corpus linguistics |
Awarding institution: | University of Leeds |
Academic Units: | The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds); The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures and Societies (Leeds); The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures and Societies (Leeds) > Linguistics & Phonetics (Leeds) |
Identification Number/EthosID: | uk.bl.ethos.789498 |
Depositing User: | Dr Sophiko Daraselia |
Date Deposited: | 15 Nov 2019 15:14 |
Last Modified: | 25 Mar 2021 16:45 |
Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:25313 |
Download
Final eThesis - complete (pdf)
Filename: Sophiko_thesis_2019.pdf
Description: Computational Analysis of Morphosyntactic Categories in Georgian
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License