Iakovenko, Olga
ORCID: 0000-0002-7801-6585
(2025)
Matrix Language Identification for Code-Switching.
PhD thesis, University of Sheffield.
Abstract
Code-switching (CS) refers to the alternating use of two or more languages within a single conversation. This phenomenon poses significant challenges for Automatic Speech Recognition (ASR) systems due to abrupt language transitions, mixed linguistic structures, and variations in phonetic and syntactic patterns. A critical factor in improving CS ASR performance is the accurate identification of the Matrix Language (MLang), the language that provides the syntactic and structural framework for CS utterances.
This thesis leverages two linguistic principles from the Matrix Language Frame (MLF) theory for MLang determination, the Morpheme Order Principle (MOP) and the System Morpheme Principle (SMP), to develop new Matrix Language Identity (MLID) methods for text and speech. Based on these two principles, three methods for textual MLID are defined: P1.1 (the singleton principle), P1.2 (the morpheme order principle), and P2 (the system morpheme principle). Additionally, this thesis introduces novel approaches for system morpheme discovery based on MLID theory. Two implementations of P2, one deterministic and one predictive, are applied to the SEAME and Miami CS datasets to identify system morphemes relevant to MLang determination. Furthermore, the applicability of the proposed MLID methods to naturalness assessment is explored. Simulated CS datasets were generated for four language pairs and rated for naturalness by native speakers. Using MLF theory, the adherence of the simulated CS text to the grammars of the participating languages was analysed through four zero-resource approaches: P2, P2 extracted, Relaxed P2, and Relaxed P2 extracted. Finally, this thesis investigates the impact of MLID on ASR performance. MLID was predicted from CS audio alongside ASR and word-level LID in a multitask learning framework. The proposed CS ASR system achieved a significantly lower MER than the baseline.
Metadata
| Supervisors: | Hain, Thomas |
|---|---|
| Awarding institution: | University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield) |
| Date Deposited: | 22 Dec 2025 10:18 |
| Last Modified: | 22 Dec 2025 10:18 |
| Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:37908 |
Download
Final eThesis - complete (pdf)
Embargoed until: 22 December 2026
Filename: THESIS__Matrix_Language_Identification_for_Code_Switching.pdf