Processing Semantic Outliers and Low Frequency Phenomena with Large Language Models

Abstract

Distributional semantics is fundamental to modern language models and is the basis for learning linguistic vector representations. Because these representations are learned from statistical regularities, models make errors when they encounter phenomena that deviate from standard statistical patterns. This thesis examines two domains presenting semantic outliers: idiomatic expressions and semantic changes in Alzheimer's Disease (AD) speech.

The first part of this work tests two hypotheses about idiomatic expressions: whether high-quality embeddings can be trained using low-resource techniques, and whether modern LLMs have out-of-the-box representations that allow them to surpass smaller fine-tuned models. The second part explores whether semantic content alone provides sufficient diagnostic signal for AD detection when isolated from surface features.

Idiomatic expressions are non-compositional multiword expressions like 'break a leg', where the meaning cannot be derived from the component words. We apply existing low-resource techniques to idiomaticity tasks. Expression embeddings trained using 1-150 contexts and knowledge injection through Pattern-Exploiting Training both lead to improvements in English idiomaticity detection, though both show weaker results in Portuguese and Galician. Multi-billion parameter LLMs are evaluated on multiple idiomaticity detection datasets, with the best models achieving high F1 scores (>0.86) and analyses revealing strong idiomatic understanding.
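As a rough illustration of how an F1 score for idiomaticity detection can be computed, the sketch below scores predictions against gold labels. The label names, the example data, and the choice of macro averaging are assumptions made for illustration; the abstract does not state which F1 variant is reported for these datasets.

```python
def f1_for_label(gold, pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Hypothetical labels for four usages of a potentially idiomatic expression.
gold = ["idiomatic", "literal", "idiomatic", "literal"]
pred = ["idiomatic", "literal", "literal", "literal"]
score = macro_f1(gold, pred)
```

Macro averaging weights the idiomatic and literal classes equally, which matters when one usage is much rarer than the other.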

Cognitive impairment in AD manifests as semantic deterioration, including comprehension deficits, semantic paraphasias, and increased generic term usage. We develop an LLM-based pipeline that transforms speech across multiple languages by translating transcripts into English, generating summaries, and creating narrative storyboards. Validation using BLEU, chrF, and semantic similarity shows low surface-form overlap but high semantic preservation. Classifiers trained on transformed transcripts show minimal performance changes (±0.1 macro F1), confirming that semantic changes alone provide sufficient diagnostic information.
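To make the notion of surface-form overlap concrete, the following is a minimal sketch of a character n-gram F-score in the spirit of chrF. It is a simplified illustration, not the thesis's validation code and not a reimplementation of any particular library; the function names, the defaults (`max_n=6`, `beta=2.0`), and the example sentences are assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Average n-gram precision and recall over orders 1..max_n, then take F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Hypothetical sentences: a paraphrase shares meaning but few character n-grams.
original = "the boy takes a cookie"
paraphrase = "a child steals a biscuit"
same = simple_chrf(original, original)
para = simple_chrf(paraphrase, original)
```

A score like this captures only string overlap, so a faithful paraphrase can score low; that is why a separate semantic similarity measure is needed to check that meaning is preserved.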

These findings demonstrate that distributional semantic outliers can both challenge LLMs, as with idiomaticity, and present opportunities, as with AD detection.

Metadata

Supervisors:	Villa-Uriol, Maria-Cruz and Villavicencio, Aline
Keywords:	LLMs, Idiomaticity, Alzheimer's, Dementia, Non-compositionality
Awarding institution:	University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield)
Date Deposited:	30 Mar 2026 08:29
Last Modified:	30 Mar 2026 08:29
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:37925

Download

Final eThesis - complete (pdf)

Filename: Thesis.pdf

Licence:
This work is licensed under a Creative Commons Attribution NonCommercial NoDerivatives 4.0 International License

