
Modelling entity instantiations

McKinlay, Andrew James (2013) Modelling entity instantiations. PhD thesis, University of Leeds.

thesis.pdf
Available under License Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales.


Abstract

The problem of automatically extracting structured information from texts is an important, unsolved problem within the field of Natural Language Processing. The extraction of such information can facilitate activities such as the building of knowledge bases, automatic summarisation and sentiment analysis. A human reader can easily discern the events described in a text, along with the participants and the relationships between them, but using a computer to automatically discover the same information is much more challenging. Particular focus has been given to extracting relations between the entities in a text, such as those representing geographical locations, personal and social relationships, and employment.

In this thesis, we consider two closely related entity relationships, which are interesting, frequent and have not been tackled previously, which we refer to collectively as entity instantiations. We define an entity instantiation as an entity relation in which a set of entities is introduced, and either a member or subset of this set is mentioned. In the example below, we see a set membership instantiation, between ‘several EU countries’ and ‘the UK’, along with a subset instantiation, between the same set and ‘the low countries’.

Inflation has increased sharply in several EU countries. In the UK, this has accompanied a drop in interest rates, but in the low countries rates have remained steady.

This thesis details the creation of the first corpus of entity instantiations. The final corpus consists of 4,521 instantiations, 2,118 of which are intersentential, and 2,403 of which are intrasentential, annotated over 75 Penn Treebank Wall Street Journal newswire texts.
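As a rough illustration of the definition above, the two instantiations in the example can be written down as simple records. This is only a sketch: the class and field names (`EntityInstantiation`, `set_mention`, `instance_mention`, `intersentential`) are hypothetical and are not taken from the thesis or its annotation scheme.

```python
from dataclasses import dataclass
from enum import Enum


class InstantiationType(Enum):
    SET_MEMBER = "set membership"
    SUBSET = "subset"


@dataclass(frozen=True)
class EntityInstantiation:
    # All field names are illustrative assumptions, not the thesis's schema.
    set_mention: str        # phrase that introduces the set of entities
    instance_mention: str   # phrase naming a member or subset of that set
    kind: InstantiationType
    intersentential: bool   # True if the two mentions are in different sentences


# The example from the abstract: the set is introduced in the first
# sentence, and both instances are mentioned in the second sentence.
examples = [
    EntityInstantiation(
        "several EU countries", "the UK", InstantiationType.SET_MEMBER, True
    ),
    EntityInstantiation(
        "several EU countries", "the low countries", InstantiationType.SUBSET, True
    ),
]

# Corpus sizes as stated in the abstract.
CORPUS_INTERSENTENTIAL = 2118
CORPUS_INTRASENTENTIAL = 2403
CORPUS_TOTAL = CORPUS_INTERSENTENTIAL + CORPUS_INTRASENTENTIAL
```

Both records share the same set mention, which reflects the abstract's point that one introduced set can take part in a set membership instantiation and a subset instantiation at the same time.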
The subsequent annotation study shows high levels of inter-annotator agreement and our corpus study analyses the annotated entity instantiations in terms of their internal structure, the distance between arguments and their syntactic relationship, finding a particularly strong link between syntactic parent-child relationships and sentence-internal entity instantiations.

To establish that the accurate automatic identification of entity instantiations is possible, we develop the first instantiation identification algorithm, which uses a supervised machine learning approach. The feature set draws on surface, syntactic, contextual, salience and knowledge features to aid classification. We separately apply our classifier to intersentential and intrasentential entity instantiations and experiment with both balanced data, with a 50/50 positive/negative split, and the original unbalanced corpus. The classifier records highly significant performance increases over both unigram-based and majority class baselines on the balanced data, and also on the original distribution of intrasentential instantiations.

In order to take advantage of the aforementioned link between syntax and intrasentential entity instantiations, tree kernels were employed to learn directly from the syntactic parse trees which contain the two potential participants in an intrasentential instantiation. The tree kernel features perform similarly to the unstructured feature set, with a much shorter development time. Combining tree kernels with unstructured features gives further improvements over both the baselines, and either method in isolation.

We also apply our entity instantiations to the difficult problem of implicit discourse relation classification, hypothesising that introducing features identifying the presence of an entity instantiation between the arguments of a discourse relation can improve classification performance.
Our experiments show that an entity instantiation is a strong indicator of the presence of an Expansion.Instantiation discourse relation. We create a binary Expansion.Instantiation classifier, based on the feature set detailed in Sporleder and Lascarides (2008), but augment it by adding entity instantiation features based on gold standard annotations. The classifier which includes entity instantiation data performs significantly better than the same classifier without entity instantiation data. We also experiment with the incorporation of machine-identified entity instantiations. However, our entity instantiation classifier is not sufficiently accurate to impact on discourse relation classification.
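
The abstract's comparison between balanced data (a 50/50 positive/negative split) and a majority class baseline can be illustrated with a small sketch. The abstract does not say how the balanced set was built or which metric was reported, so random undersampling and plain accuracy are assumptions here, and the function names and toy data are invented for illustration.

```python
import random
from collections import Counter


def balance(pairs, seed=0):
    """Randomly undersample the larger class to reach a 50/50 split.

    `pairs` is a list of (candidate, label) tuples with boolean labels.
    Undersampling is an assumed strategy, not the thesis's stated method.
    """
    rng = random.Random(seed)
    positives = [p for p in pairs if p[1]]
    negatives = [p for p in pairs if not p[1]]
    n = min(len(positives), len(negatives))
    sample = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(sample)
    return sample


def majority_baseline_accuracy(pairs):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(label for _, label in pairs)
    return counts.most_common(1)[0][1] / len(pairs)


# Toy candidate pairs with made-up proportions: 10 positive, 90 negative.
toy = [(f"pair{i}", i < 10) for i in range(100)]
balanced = balance(toy)

unbalanced_baseline = majority_baseline_accuracy(toy)
balanced_baseline = majority_baseline_accuracy(balanced)
```

On the toy data, always predicting the negative class looks strong on the skewed distribution but is no better than chance once the classes are balanced, which is why results on the two distributions are reported separately.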

Item Type: Thesis (PhD)
ISBN: 978-0-85731-482-6
Academic Units: The University of Leeds > Faculty of Engineering (Leeds) > School of Computing (Leeds)
Identification Number/EthosID: uk.bl.ethos.589282
Depositing User: Repository Administrator
Date Deposited: 06 Jan 2014 16:31
Last Modified: 07 Mar 2014 11:48
URI: http://etheses.whiterose.ac.uk/id/eprint/4936
