Iria, José (2013) Learning for text mining: tackling the cost of feature and knowledge engineering. PhD thesis, University of Sheffield.
Abstract
Over the last decade, the state-of-the-art in text mining has moved
towards the adoption of machine learning as the main paradigm at the
heart of approaches. Despite significant advances, machine learning based
text mining solutions remain costly to design, develop and maintain
for real world problems. An important component of such cost
(feature engineering) concerns the effort required to understand which
features or characteristics of the data can be successfully exploited in
inducing a predictive model of the data. Another important component
of the cost (knowledge engineering) has to do with the effort in creating
labelled data, and in eliciting knowledge about the mining systems and
the data itself.
I present a series of approaches, methods and findings aimed at reducing
the cost of creating and maintaining document classification and
information extraction systems. They address the following questions:
Which classes of features lead to an improved classification accuracy in
the document classification and entity extraction tasks? How to reduce
the number of labelled examples needed to train machine learning based
document classification and information extraction systems, so
as to relieve domain experts from this costly task? How to effectively
represent knowledge about these systems and the data that they manipulate,
in order to make systems interoperable and results replicable?
I provide the reader with the background information necessary to
understand the above questions and the contributions to the state of the
art contained herein. The contributions include: the identification
of novel classes of features for the document classification task which
exploit the multimedia nature of documents and lead to improved
classification accuracy; a novel approach to domain adaptation for
text categorization which outperforms standard supervised and semi-supervised
methods while requiring considerably less supervision;
and a well-founded formalism for declaratively specifying text and
multimedia mining systems.
Metadata
| Field | Value |
|---|---|
| Awarding institution | University of Sheffield |
| Academic Units | The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield) |
| Identification Number/EthosID | uk.bl.ethos.577568 |
| Depositing User | EThOS Import Sheffield |
| Date Deposited | 29 Nov 2016 09:51 |
| Last Modified | 29 Nov 2016 09:51 |
| Open Archives Initiative ID (OAI ID) | oai:etheses.whiterose.ac.uk:14608 |