Improving multilingual sentiment analysis using linguistic knowledge

Di Bari, Marilena (2015) Improving multilingual sentiment analysis using linguistic knowledge. PhD thesis, University of Leeds.

thesis_final.pdf - Final eThesis - complete (pdf)
Available under License Creative Commons Attribution Noncommercial 2.0 UK: England & Wales.


The need for the automatic analysis of opinions in written texts, which has been growing in recent years in several domains, has made Sentiment Analysis a very popular field (Liu 2012). In this area, systems have traditionally classified sentences as positive or negative solely according to the sentiment that words most frequently assume (e.g. “angry” negative, “beautiful” positive). Such strategies present two main limitations: (1) Multiple opinions often appear in the same sentence, each expressing a different sentiment on a different subject (e.g. a positive opinion is expressed on the plot of a film, but a negative one on the actors' performance). (2) The most frequent sentiment, as collected in sentiment dictionaries, does not take into account the fact that context often alters the orientation. Sentiment dictionaries have also been shown to have small coverage (Di Bari, Sharoff et al. 2013, Di Bari 2015).
As a consequence, I propose an automatic system based on deep linguistic knowledge, provided in particular by dependency parsing relations (Nivre 2005) and by attributes taken from the Appraisal framework (Martin and White 2005), a theory concerned with the language of evaluation, attitude and emotion within Systemic Functional Linguistics (Halliday 1978). As a basis for the creation of the automatic system, I tailored an annotation scheme called SentiML, inspired by previous works (Whitelaw, Garg et al. 2005, Bloom, Garg et al. 2007, Bloom and Argamon 2009), and carried out the annotation task in three languages (English, Italian and Russian) using MAE (Stubbs 2011). The resulting corpora consist of around 500 sentences and 9000 tokens for each language. The corpora contain both original texts and translations of different types: news, political speeches and TED talks (Cettolo, Girardi et al. 2012).
The foundation of SentiML lies in the fact that an opinion can be captured in a pair usually consisting of two words with different functions: a target, the expression the sentiment refers to, and a modifier, the expression conveying the sentiment. The pair formed by the target and the modifier is called an appraisal group. Along with these main categories, the annotation includes their attributes, among which the most important are the appraisal type according to the Appraisal framework (‘affect’, ‘appreciation’, ‘judgement’) and the orientation (‘positive’ or ‘negative’, both out-of-context and contextual).
A detailed manual analysis of the translation strategies (Baker 2002) and the appraisal types across the corpora, supported by insights from Corpus Linguistics, has been carried out. The most interesting expressions found during this analysis were afterwards analysed automatically, as a further evaluation of the system. Nonetheless, the main evaluation consists of a comparison with a rule-based system that makes use of existing tools such as a part-of-speech (POS) tagger and a sentiment dictionary. The main objective of this work is to demonstrate that the Appraisal framework and Sentiment Analysis can successfully support each other. The fact that this has been done not only for English, but in parallel for Italian and Russian (as one of the first applications of the Appraisal framework in these languages) and for different text types, makes the research unique. Moreover, because the methodology used to compare a variety of linguistic features (morphological, grammatical, lexical, syntactical) at work in sentiment analysis has been applied to three languages belonging to different families (Germanic, Romance and Slavonic), it is expected to be generalizable to other languages.
As far as practical applications are concerned, the automatic system could be used in any field in which written opinions need to be analysed. Meanwhile, the new individual resources, such as the annotated corpora and the MaltParser models for Italian and Russian, have been made publicly available.
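The appraisal group described in the abstract (a target, a modifier, an appraisal type, and an out-of-context versus contextual orientation) can be pictured as a small data structure. The Python sketch below is purely illustrative and is not the SentiML schema or the thesis system: the class name, field names, the toy dictionary and the three example groups are all assumptions invented here. It only shows how a dictionary lookup on the modifier can disagree with the contextual orientation, or find no entry at all.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy sentiment dictionary (out-of-context orientations only).
# The two entries echo the abstract's examples; a real dictionary is far larger.
PRIOR_ORIENTATION = {"angry": "negative", "beautiful": "positive"}


@dataclass
class AppraisalGroup:
    """Illustrative target/modifier pair; field names are assumptions."""
    target: str                  # expression the sentiment refers to
    modifier: str                # expression conveying the sentiment
    appraisal_type: str          # 'affect', 'appreciation' or 'judgement'
    contextual_orientation: str  # 'positive' or 'negative', as annotated in context

    def prior_orientation(self) -> Optional[str]:
        """Out-of-context orientation from the toy dictionary, or None if uncovered."""
        return PRIOR_ORIENTATION.get(self.modifier.lower())

    def context_changes_orientation(self) -> bool:
        """True when the dictionary orientation exists and differs from the contextual one."""
        prior = self.prior_orientation()
        return prior is not None and prior != self.contextual_orientation


# Invented examples, not taken from the SentiML corpora.
groups = [
    AppraisalGroup("plot", "beautiful", "appreciation", "positive"),
    AppraisalGroup("lie", "beautiful", "judgement", "negative"),
    AppraisalGroup("performance", "wooden", "appreciation", "negative"),
]

for g in groups:
    print(g.modifier, g.target, g.prior_orientation(), g.contextual_orientation)
```

In this toy setting, the second group illustrates context altering the dictionary orientation, and the third illustrates a modifier the dictionary does not cover.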

Item Type: Thesis (PhD)
Keywords: Systemic Functional Linguistics, Appraisal Framework, Sentiment Analysis, Translation Studies, Corpus Linguistics, annotation, SentiML annotation scheme, SentiML corpus, multilingual study, Russian, Italian
Academic Units: The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures and Societies (Leeds)
Identification Number/EthosID: uk.bl.ethos.680909
Depositing User: Mrs Marilena Di Bari
Date Deposited: 23 Mar 2016 13:06
Last Modified: 15 Oct 2018 13:21
URI: http://etheses.whiterose.ac.uk/id/eprint/11883
