Towards Phonetically-Informed Automatic Speaker Recognition

Abstract

This thesis explores novel applications of phonetic theory to enhance our understanding of
Automatic Speaker Recognition (ASR). Previous studies typically only explore the
performance of one phonetic feature in isolation; instead, this thesis explores bespoke,
systematically-validated combinations of many different phonetic features. Sociophonetic
tailoring is also uncommon in previous literature, so this thesis also explores how these
features can be fused together in optimised ways for different accents and speech styles. This
thesis finds that all of the tested phonetic features can be effective for ASR, but tailoring
approaches to different accents and speech styles is the most important consideration in terms
of overall performance. That said, higher formants were generally found to be most effective
for ASR whilst features relating to non-modal voicing were found to be least effective. As the
tested (socio)phonetic features are all explainable, a potential future application of these
findings is to improve the explainability of ASR systems. ASR systems are increasingly
present in modern society and they are undeniably powerful, but their inner workings are not
fully explainable; they are considered ‘black boxes’ by researchers like Rudin (2018) and
they are becoming increasingly distrusted by triers-of-fact (van der Veer et al., 2021). When
tested on their own, the bespoke combinations of explainable phonetic approaches performed
worse than state-of-the-art ASR systems, but this reflects the known trade-off between
explainability and performance (Moez et al., 2016). However, this thesis also finds that its
best-performing phonetic approaches to ASR do not have a detrimental impact on the
performance of off-the-shelf ASR systems when they are fused together; as a result, these
explainable, bespoke, combinatory phonetic approaches could be fused with ASR systems to
add an extra element of explainability to them without compromising performance.

Metadata

Supervisors:	Hughes, Vincent; Harrison, Philip; Watt, Dominic; Beet, Steve; Ravary, Ladan
Keywords:	Phonetics, Forensic Phonetics, Speaker Recognition, Automatic Speaker Recognition
Awarding institution:	University of York
Academic Units:	The University of York > Language and Linguistic Science (York)
Depositing User:	Mr Elliot Holmes
Date Deposited:	18 Jun 2025 11:25
Last Modified:	18 Jun 2025 11:25
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:37020

