White Rose University Consortium logo
University of Leeds logo University of Sheffield logo York University logo

Unsupervised Learning of Multiword Expressions

Korkontzelos, Ioannis (2010) Unsupervised Learning of Multiword Expressions. PhD thesis, University of York.

Available under License Creative Commons Attribution-Noncommercial-No Derivative Works 2.0 UK: England & Wales.

Download (10Mb)


Multiword expressions are expressions consisting of two or more words that correspond to some conventional way of saying things (Manning & Schutze 1999). Due to the idiomatic nature of many of them and their high frequency of occurence in all sorts of text, they cause problems in many Natural Language Processing (NLP) applications and are frequently responsible for their shortcomings. Efficiently recognising multiword expressions and deciding the degree of their idiomaticity would be useful to all applications that require some degree of semantic processing, such as question-answering, summarisation, parsing, language modelling and language generation. In this thesis we investigate the issues of recognising multiword expressions, domainspecific or not, and of deciding whether they are idiomatic. Moreover, we inspect the extent to which multiword expressions can contribute to a basic NLP task such as shallow parsing and ways that the basic property of multiword expressions, idiomaticity, can be employed to define a novel task for Compositional Distributional Semantics (CDS). The results show that it is possible to recognise multiword expressions and decide their compositionality in an unsupervised manner, based on cooccurrence statistics and distributional semantics. Further, multiword expressions are beneficial for other fundamental applications of Natural Language Processing either by direct integration or as an evaluation tool. In particular, termhood-based methods, which are based on nestedness information, are shown to outperform unithood-based methods, which measure the strength of association among the constituents of a multi-word candidate term. A simple heuristic was proved to perform better than more sophisticated methods. A new graph-based algorithm employing sense induction is proposed to address multiword expression compositionality and is shown to perform better than a standard vector space model. Its parameters were estimated by an unsupervised scheme based on graph connectivity. Multiword expressions are shown to contribute to shallow parsing. Moreover, they are used to define a new evaluation task for distributional semantic composition models.

Item Type: Thesis (PhD)
Academic Units: The University of York > Computer Science (York)
Identification Number/EthosID: uk.bl.ethos.535066
Depositing User: Dr. Ioannis Korkontzelos
Date Deposited: 27 Feb 2012 15:57
Last Modified: 08 Sep 2016 12:20
URI: http://etheses.whiterose.ac.uk/id/eprint/2091

You do not need to contact us to get a copy of this thesis. Please use the 'Download' link(s) above to get a copy.
You can contact us about this thesis. If you need to make a general enquiry, please see the Contact us page.

Actions (repository staff only: login required)