Xue, Huiyin (ORCID: https://orcid.org/0000-0002-8705-6431) (2025) Exploring Efficient Methods for Transformer-based Foundation Language Models. PhD thesis, University of Sheffield.
Abstract
Transformer-based language models have achieved huge success in various natural language processing tasks, but their efficiency remains a critical challenge. The core limitations stem from the computationally intensive dot-product attention mechanism, which scales quadratically with input length, and the increasing memory footprint associated with large model sizes and extensive vocabularies. These issues hinder their deployment on resource-constrained devices and limit their ability to process long contexts.
This thesis explores novel methods to enhance the efficiency of Transformer-based models, focusing on architectural modifications during the pre-training stage. It is presented as a collection of three peer-reviewed publications, each addressing a specific component of the Transformer architecture. The first work introduces a parameter-efficient embedding layer that uses a hashing function to support an unlimited vocabulary with a fixed-size embedding matrix. This approach effectively breaks the rigid, one-to-one mapping between tokens and embeddings, significantly reducing memory consumption. The second work tackles the parameter redundancy within the multi-head attention mechanism. It proposes a more efficient alternative that uses a single shared projection matrix and multiple head embeddings, substantially reducing the number of attention-related parameters while preserving model performance. The final work systematically analyses the key design principles of the dot-product attention mechanism. The findings provide insights into what makes attention so effective and offer a foundation for developing more streamlined and efficient attention mechanisms in the future.
This thesis demonstrates that fine-grained architectural modifications during pre-training can yield substantial improvements in model efficiency. The proposed methods are orthogonal to existing post-training compression techniques, providing a complementary approach to creating more scalable and deployable language models. The findings collectively contribute to a deeper understanding of the core components of Transformer models and pave the way for designing next-generation language models that are both powerful and efficient.
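To make the first contribution concrete, the sketch below shows one way a hashing-based embedding layer can decouple vocabulary size from the embedding matrix, as described in the abstract. This is a minimal illustration under stated assumptions, not the thesis implementation; the class name, bucket count, and hashing scheme are all placeholders chosen for clarity.

```python
# Minimal sketch (assumed names and hashing scheme, not the thesis implementation):
# an embedding layer whose memory footprint is fixed regardless of vocabulary size.
import hashlib

import torch
import torch.nn as nn


def stable_hash(token: str) -> int:
    # Deterministic across processes, unlike Python's built-in hash().
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "little")


class HashedEmbedding(nn.Module):
    """Embedding table indexed by hashing token strings into a fixed number of buckets.

    There is no one-to-one token-to-row mapping: any token, including one never
    seen during training, hashes into one of `num_buckets` rows, so the parameter
    count is independent of vocabulary size (hash collisions are tolerated).
    """

    def __init__(self, num_buckets: int, embed_dim: int):
        super().__init__()
        self.num_buckets = num_buckets
        self.table = nn.Embedding(num_buckets, embed_dim)

    def forward(self, tokens: list[str]) -> torch.Tensor:
        ids = torch.tensor([stable_hash(t) % self.num_buckets for t in tokens],
                           dtype=torch.long)
        return self.table(ids)


# Any string maps to some row, so the effective vocabulary is unbounded.
emb = HashedEmbedding(num_buckets=50_000, embed_dim=768)
vectors = emb(["Transformer", "efficiency", "never-seen-token-xyz"])
print(vectors.shape)  # torch.Size([3, 768])
```

In this simplified form, hash collisions trade a small amount of representational precision for a memory cost that no longer grows with the vocabulary; the method proposed in the thesis may handle collisions differently, for example with multiple hash functions or additional learned components.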
Metadata
| Supervisors: | Aletras, Nikolaos |
|---|---|
| Keywords: | natural language processing, pretraining, language modeling, efficient methods, model design, micro design, embeddings, attention mechanism |
| Awarding institution: | University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield) |
| Date Deposited: | 27 Jan 2026 11:48 |
| Last Modified: | 27 Jan 2026 11:48 |
| Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:38089 |
Download
Final eThesis - complete (pdf)
Filename: Huiyin_PhD_thesis__final_.pdf
Description: pdf
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.