Gow-Smith, Edward (2025) Space-aware subword tokenisation and complex word processing in language models. PhD thesis, University of Sheffield.
Abstract
This work investigates the limitations of how subword tokenisers handle spaces, and how the processing of spaces impacts the performance of language models, with a particular focus on the processing of complex words. Motivated by previous work demonstrating tokeniser limitations, we propose a simple and effective modification to state-of-the-art subword tokenisers: treating spaces as individual tokens. This ameliorates known issues with the morphological validity of tokenisers such as BPE, Unigram, and WordPiece, especially regarding the splitting of prefixes.

We extrinsically evaluate our space-aware tokeniser variants (BPE', Unigram', WordPiece') by pretraining a number of encoder-only transformer language models and finetuning them on a range of downstream tasks, both in the general domain and for the processing of complex words. For the latter, to extend the analysis of complex word processing beyond English for the first time, we introduce a new dataset (mCWIF) covering English, German, Turkish, and Finnish. Across datasets, our space-aware tokenisers substantially improve performance on complex word classification for English, German, and Finnish, with inconsistent results for Turkish. We suggest that the Turkish results arise because topically-relevant subwords in Turkish tend to occur either at the start of a word or in non-initial positions, but very rarely in both. In the general domain, we also find that all word boundary information can be removed from sequences while retaining equivalent performance, and that such information does not boost performance when included either explicitly through the input or implicitly through the pretraining task.

Our work contributes to the understanding of the impact of subword tokenisation, the limitations of state-of-the-art subword tokenisers, and how they can be improved for complex word processing. The main research questions are: What is the impact of poor-quality tokenisation, and what causes issues with subword tokenisers? How can subword tokenisers be improved? Can we introduce space information in an alternative way to improve the performance of language models?
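The core modification described above can be illustrated with a minimal sketch. The function below is a hypothetical pre-tokenisation step, not the thesis's actual implementation: it emits each space as a standalone token, so a downstream subword tokeniser never has to learn space-prefixed word variants (as in GPT-2-style byte-level BPE), and a prefix like "un" can map to the same subword whether or not it is word-initial.

```python
# Minimal sketch of space-aware pre-tokenisation (hypothetical helper,
# not the thesis code): spaces are emitted as individual tokens instead
# of being attached to the following word.

def space_aware_pretokenise(text: str) -> list[str]:
    """Split text into word tokens and single-space tokens."""
    tokens = []
    word = ""
    for ch in text:
        if ch == " ":
            if word:
                tokens.append(word)
                word = ""
            tokens.append(" ")  # each space becomes its own token
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens

print(space_aware_pretokenise("the unhappiest dog"))
# → ['the', ' ', 'unhappiest', ' ', 'dog']
```

A subword tokeniser (BPE, Unigram, or WordPiece) would then be trained and applied over these pre-tokenised units, with the space token passed through unchanged.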
Metadata
| Supervisors: | Villavicencio, Aline |
|---|---|
| Awarding institution: | University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield) |
| Date Deposited: | 09 Feb 2026 14:03 |
| Last Modified: | 09 Feb 2026 14:03 |
| Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:37615 |
Download
Final eThesis - complete (pdf)
Filename: Thesis_Corrected.pdf
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License