Domain and genre dependency in Statistical Machine Translation

Abstract

Statistical Machine Translation (SMT) is currently the most promising and widely studied paradigm in the broader field of Machine Translation. It is continuously explored in order to improve its performance and to address its current shortcomings, in particular the scarcity of large bilingual corpora, across a variety of domains and genres, that can be used as training data. One of the main trends is still to rely as much as possible on large collections of data that are already available, even when their content is not closely related to the specific translation task. By contrast, the possibility of using smaller but appropriately selected training sets - chosen case by case according to the textual variety of the documents to be translated - has not been extensively explored so far.
The goal of this research is to investigate whether the lack of large quantities of assorted data can be addressed by applying strategies commonly used in genre and domain classification (including unsupervised topic modeling and document dissimilarity techniques), in particular by performing subsampling experiments on bilingual corpora in order to obtain a good fit between the training data and the texts that need to be translated with SMT.
For the purposes of this study, existing freely available large corpora were found to be unsuitable for the selection of domain- or document-specific subsamples, so two new parallel corpora - English-Italian and English-German - were compiled by applying the "web as corpus" approach to websites containing translated content. Tests were then carried out on documents belonging to different varieties, which were translated with SMT systems built on subsamples of training data, using document dissimilarity measures to select the most suitable training documents.
This method has shown that the choice of subsampling strategy depends heavily on the text variety of each document considered. It has also proven that better translation results can be obtained from small samples of the training set than from all the available data, which brings further benefits in the form of quicker training times and lower use of computational resources.
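
The subsampling step described above (choosing, for each document to be translated, the training documents that are least dissimilar to it) can be illustrated with a minimal sketch. The abstract does not state which dissimilarity measures were used, so cosine dissimilarity over simple word counts is an assumption made here purely for illustration, and the function names (bag_of_words, cosine_dissimilarity, select_training_docs) are hypothetical rather than taken from the thesis.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity between two word-count vectors.

    Assumption: cosine is used only as an example measure; the thesis
    may rely on different dissimilarity measures.
    """
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty document is treated as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def select_training_docs(test_doc, candidates, k):
    """Pick the k (source, target) pairs whose source side is closest to test_doc."""
    target_vector = bag_of_words(test_doc)
    ranked = sorted(
        candidates,
        key=lambda pair: cosine_dissimilarity(target_vector, bag_of_words(pair[0])),
    )
    return ranked[:k]


# Toy parallel corpus: (English source, Italian target) document pairs.
corpus = [
    ("the patient received a dose of the drug", "il paziente ha ricevuto una dose del farmaco"),
    ("the court ruled on the appeal", "la corte si è pronunciata sul ricorso"),
    ("drug dose for the patient", "dose del farmaco per il paziente"),
]
selected = select_training_docs("patient drug dose", corpus, k=2)
```

In this toy setting the two medical pairs are selected and the legal one is left out; the selected pairs would then serve as the training data for an SMT system built for that specific input document.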

Metadata

Supervisors:	Sharoff, Serge and Babych, Bogdan and Thomas, Martin
Keywords:	statistical, machine translation, computational linguistics, genre, domain, document dissimilarity
Awarding institution:	University of Leeds
Academic Units:	The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures and Societies (Leeds)
Identification Number/EthosID:	uk.bl.ethos.643605
Depositing User:	Mr Marco Brunello
Date Deposited:	31 Mar 2015 09:20
Last Modified:	25 Nov 2015 13:48
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:8420

