Designing a General Framework for Text Alignment: Case Studies with Two South Asian Languages

Aswani, Niraj (2012) Designing a General Framework for Text Alignment: Case Studies with Two South Asian Languages. PhD thesis, University of Sheffield.

Text
Aswani,_Niraj.pdf
Available under License Creative Commons Attribution-Noncommercial-No Derivative Works 2.0 UK: England & Wales.

Abstract

Building machine translation systems for many South Asian languages (such as Hindi and Gujarati) using statistical methods is problematic. The primary reason is that there is insufficient parallel data from which to learn accurate word alignments. Additionally, these languages are morphologically rich and have free word order. When insufficient data makes it difficult to rely purely on statistical methods, research shows that better performance can be obtained by building hybrid systems that combine statistical methods with language-specific resources, such as morphological analysers or dictionaries. However, such language-specific resources are difficult to find for many South Asian languages. Since languages such as Hindi, Gujarati, Urdu, Bengali, Punjabi and Marathi are all very similar in structure, differing mainly in script and vocabulary, we hypothesise that it is possible to develop resources for one of these languages and generalize the approach to allow rapid bootstrapping of similar resources for the other closely related languages, with minimal effort and similar accuracies. To verify this, we develop a few resources for Hindi, including a sentence alignment algorithm, a morphological analyser and a transliteration similarity component, and generalize the approach to allow rapid bootstrapping of similar resources for Gujarati. We show that the approach works for both Hindi and Gujarati and achieves results comparable to similar state-of-the-art (SOA) resources available for these languages. We also hypothesise that it is possible to develop a high-performance hybrid word alignment algorithm that relies on such language-specific resources. To verify this, we design, implement and evaluate a novel English-Hindi hybrid word alignment system that uses the Hindi-specific resources we developed. We show not only that our word alignment system outperforms other SOA English-Hindi word alignment systems, but also how simple it is to adapt it to the English-Gujarati language pair.

Item Type: Thesis (PhD)
Academic Units: The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield)
The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield)
Depositing User: Mr Niraj Aswani
Date Deposited: 20 Aug 2012 15:01
Last Modified: 08 Aug 2013 08:49
URI: http://etheses.whiterose.ac.uk/id/eprint/2618
