
Human Annotation and Automatic Detection of Web Genres

Rezapour Asheghi, Noushin (2015) Human Annotation and Automatic Detection of Web Genres. PhD thesis, University of Leeds.

Noushin_Rezapour_asheghi_PhD_Thesis.pdf - Final eThesis - complete (pdf)
Available under License Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales.

Download (3193Kb) | Preview


Texts differ from each other along various dimensions such as topic, sentiment, authorship and genre. In this thesis, the dimension of text variation of interest is genre. Unlike topic classification, genre classification focuses on the functional purpose of documents and classifies them into categories such as news, review, online shop, personal home page and conversational forum. In other words, genre classification allows the identification of documents that are similar in terms of purpose, even if they are topically very diverse. Research on web genres has been motivated by the idea that finding information on the web can be made easier and more effective by automatic classification techniques that differentiate among web documents with respect to their genres. Following this idea, during the past two decades, researchers have investigated the performance of various genre classification algorithms in order to enhance search engines. As a result, current research on automatic web genre identification has produced several genre-annotated web corpora as well as a variety of supervised machine learning algorithms applied to these corpora. However, previous research suffers from shortcomings in corpus collection and annotation (in particular, low human reliability in genre annotation), which make the supervised machine learning results hard to assess and compare with each other, as no reliable benchmarks exist. This thesis addresses these shortcomings. First, we built the Leeds Web Genre Corpus Balanced-design (LWGC-B), the first reliably annotated corpus for web genres, using crowd-sourcing for genre annotation. This corpus, which was compiled using a focused search method, overcomes the drawbacks of previous genre annotation efforts, such as low inter-coder agreement and false correlation between genre and topic classes. Second, we use this corpus as a benchmark to determine the best features for closed-set supervised machine learning of web genres.
Third, we enhance the prevailing supervised machine learning paradigm by using semi-supervised graph-based approaches that make use of the graph structure of the web to improve classification results. Fourth, we successfully extended our annotation method to the Leeds Web Genre Corpus Random (LWGC-R), where the pages to be annotated are collected randomly by querying search engines. This randomly collected corpus also allowed us to investigate the coverage of the underlying genre inventory. The results show that our 15 genre categories are sufficient to cover the majority, but not the vast majority, of the random web pages. The unique property of the LWGC-R corpus (i.e. containing web pages that do not belong to any of the predefined genre classes, which we refer to as noise) allowed us, for the first time, to evaluate the performance of an open-set genre classification algorithm on a dataset with noise. The outcome of this experiment indicates that automatic open-set genre classification is a much more challenging task than closed-set genre classification because of noise. The results also show that automatic detection of some genre classes is more robust to noise than that of others.

Item Type: Thesis (PhD)
Academic Units: The University of Leeds > Faculty of Engineering (Leeds) > School of Computing (Leeds)
Identification Number/EthosID: uk.bl.ethos.657001
Depositing User: Leeds CMS
Date Deposited: 14 Jul 2015 12:20
Last Modified: 25 Jul 2018 09:50
URI: http://etheses.whiterose.ac.uk/id/eprint/9445
