
Named entity recognition: challenges in document annotation, gazetteer construction and disambiguation

Zhang, Ziqi (2013) Named entity recognition: challenges in document annotation, gazetteer construction and disambiguation. PhD thesis, University of Sheffield.

This is the latest version of this item.

Available under License Creative Commons Attribution-Noncommercial-No Derivative Works 2.0 UK: England & Wales.



The 'information explosion' has generated an unprecedented amount of published information, which is still growing at an astonishing rate. As the amount of information grows, managing it becomes increasingly challenging. A key to this challenge is the technology of Information Extraction, which automatically transforms unstructured textual data into a structured representation that can be interpreted and manipulated by machines. Named Entity Recognition (NER) is recognised as a fundamental task in Information Extraction; its goals are to identify references to named entities in unstructured documents and to classify them into pre-defined semantic categories. Further, due to the polysemous nature of natural language, name references are often ambiguous. Resolving ambiguity concerns recognising the true referent entity of a name reference; it is essentially a further named entity 'recognition' step and is often a compulsory process required by tasks built on top of NER.

This research presents a body of work aimed at addressing three research questions for NER. The first question concerns effective and efficient methods for training data annotation, which is the task of creating the training examples essential to machine learning based NER methods. The second question studies the automatic generation of background knowledge for NER in the form of gazetteers, which are often critical resources for improving the performance of NER methods. The third question addresses resolving ambiguous name references, a further 'recognition' step that ensures the output of NER is usable by many complex tasks and applications. For each research question, the related literature has been carefully studied and its limitations have been identified and discussed. New hypotheses and methods have been proposed, leading to a number of contributions:

- an approach to training data annotation for supervised NER methods, based on the study of annotator suitability and suitability based task allocation;
- a method of automatically expanding existing gazetteers of pre-defined semantic categories, exploiting the structure and knowledge of Wikipedia;
- a method of automatically generating untyped gazetteers for NER, based on the 'topic-representativeness' of words in documents;
- a method of named entity disambiguation based on maximising the semantic relatedness between candidate entities in a text discourse;
- a review of lexical semantic relatedness measures, and a new lexical semantic relatedness measure that harnesses knowledge from different resources.

The proposed methods have been evaluated by carefully designed experiments, following the standard practice in each related research area. The results have confirmed the validity of the corresponding hypotheses, as well as the empirical effectiveness of these methods. Overall, it is believed that this research has made a solid contribution to research on NER and related areas.

Item Type: Thesis (PhD)
Keywords: Information Extraction, Named Entity Recognition, Named Entity Disambiguation, gazetteer construction, semantic relatedness, document annotation
Academic Units: The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield)
The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield)
Identification Number/EthosID: uk.bl.ethos.570183
Depositing User: Dr Ziqi Zhang
Date Deposited: 25 Apr 2018 12:48
Last Modified: 25 Apr 2018 12:48
URI: http://etheses.whiterose.ac.uk/id/eprint/19276
