Alajrami, Ahmed J S
ORCID: https://orcid.org/0000-0003-0830-559X
(2025)
Understanding how Language Models Learn Under Constrained or Noisy Settings.
PhD thesis, University of Sheffield.
Abstract
Transformer-based language models (LMs) have become the foundation of modern Natural Language Processing (NLP), demonstrating strong performance across a wide range of tasks. Although typically trained on massive datasets with curated objectives and clean supervision, these models are increasingly deployed in real-world settings where data is noisy, incomplete, or otherwise imperfect. Understanding what drives their learning and how robust they are to such conditions is therefore essential.
This thesis investigates how different training signals, namely pre-training objectives, input structure, and instruction quality, affect the generalization behavior of LMs. Through three empirical studies, it examines model performance when traditional assumptions are relaxed or deliberately challenged. The first study explores the role of the pre-training objective, comparing linguistically intuitive objectives with arbitrary or non-linguistic alternatives. Surprisingly, even non-intuitive objectives yield models that acquire linguistic knowledge, suggesting that architecture and data distribution may matter more than the objective's semantic alignment. The second study focuses on the internal structure of input tokens. Motivated by psycholinguistic findings, it tests whether models can learn effectively from inputs that retain only partial character information. The results show that models remain robust even with severely reduced inputs, indicating a surprising resilience to information loss. The final study turns to instruction tuning. Contrary to the common assumption that clean, well-formed prompts are necessary for generalization, training with noisy or perturbed instructions often improves both robustness and performance, suggesting that strategically applied noise can act as a form of regularization.
Together, these findings deepen our understanding of how LMs learn under constrained or imperfect conditions. They show that models are not only effective when trained on clean, large-scale data but also surprisingly resilient, and at times improved, when faced with noisy or incomplete signals. This thesis offers practical insights for building robust, data-efficient language models suitable for real-world deployment.
Metadata
| Supervisors: | Aletras, Nikolaos |
|---|---|
| Keywords: | NLP, LLMs, Language Modeling |
| Awarding institution: | University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield) |
| Date Deposited: | 19 Jan 2026 09:57 |
| Last Modified: | 19 Jan 2026 09:57 |
| Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:38062 |
Download
Final eThesis - complete (pdf)
Filename: PhD Thesis_Ahmed Alajrami.pdf
Licence: This work is licensed under a Creative Commons Attribution NonCommercial NoDerivatives 4.0 International License.