Shi, Yanpei ORCID: https://orcid.org/0000-0001-8157-2630 (2021) Improving the Robustness of Speaker Recognition in Noise and Multi-Speaker Conditions Using Deep Neural Networks. PhD thesis, University of Sheffield.
Abstract
In speaker recognition, deep neural networks deliver state-of-the-art performance due to their large capacity and powerful feature extraction abilities. However, this performance can be severely degraded by interference from background noise and from other speakers.
This thesis focuses on new neural network architectures that are designed to overcome
such interference and thereby improve the robustness of the speaker recognition system.
To improve the noise robustness of the speaker recognition model, two novel network architectures are proposed. The first is the hierarchical attention network, which captures both local and global features in order to improve the robustness of the network. The experimental results show that it delivers results comparable to the published state of the art, reaching a 4.28% equal error rate using the VoxCeleb1 training and test sets. The second is a joint speech enhancement and speaker recognition system consisting of two networks: the first integrates speech enhancement and speaker recognition into one framework to better filter out noise, while the second feeds speaker embeddings into the speech enhancement network as additional input. This provides prior knowledge that improves the enhancement performance. The results show that a joint system with a speaker-dependent speech enhancement model delivers results comparable to the published state of the art, reaching a 4.15% equal error rate using the VoxCeleb1 training and test sets.
To overcome interference from other speakers, two novel approaches are proposed. The first, referred to as embedding de-mixing, separates the speaker and content properties of a two-speaker signal in an embedding space rather than in the signal space. The results show that the de-mixed embeddings are close in quality to the clean embeddings, and that a back-end speaker recognition model using the de-mixed embeddings reaches 96.9% speaker identification accuracy on the TIMIT dataset, compared with 98.5% using clean embeddings. The second approach is the first end-to-end weakly supervised speaker identification approach, based on a novel hierarchical transformer network architecture. The results show that the proposed model can capture the properties of both speakers in a single input utterance, achieving more than a 3% relative improvement over the baselines in all test conditions.
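The core idea of de-mixing in embedding space, rather than signal space, can be shown with a toy NumPy sketch: a linear de-mixer is fitted to recover one speaker's clean embedding from a two-speaker mixture embedding plus a reference embedding for the other speaker. The linear mixing process, least-squares training, and dimensions are illustrative assumptions, not the thesis model:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 500               # embedding dim and no. of training mixtures (toy)

# synthetic "clean" speaker embeddings and a fixed linear mixing process
e1 = rng.standard_normal((n, d))
e2 = rng.standard_normal((n, d))
A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))
mixed = e1 @ A.T + e2 @ B.T                # two-speaker mixture embeddings

# de-mixer: predict speaker 2's clean embedding from the mixture
# embedding concatenated with a reference embedding for speaker 1
X = np.concatenate([mixed, e1], axis=1)    # (n, 2d)
W, *_ = np.linalg.lstsq(X, e2, rcond=None)

# de-mix a held-out mixture and compare with the clean embedding
t1 = rng.standard_normal(d)
t2 = rng.standard_normal(d)
m = A @ t1 + B @ t2
est = np.concatenate([m, t1]) @ W
cos = est @ t2 / (np.linalg.norm(est) * np.linalg.norm(t2))
print(round(float(cos), 3))
```

Because the toy mixing here is linear, the de-mixed embedding matches the clean one almost exactly; real mixtures require a learned nonlinear de-mixing network, but the principle of operating on embeddings instead of waveforms is the same.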
Metadata
Supervisors: Hain, Thomas
Keywords: Speaker Recognition, Deep Neural Networks, Robust Speaker Recognition
Awarding institution: University of Sheffield
Academic Units: The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield)
Identification Number/EthosID: uk.bl.ethos.840404
Depositing User: Dr. Yanpei Shi
Date Deposited: 25 Oct 2021 15:35
Last Modified: 01 Dec 2021 10:54
Open Archives Initiative ID (OAI ID): oai:etheses.whiterose.ac.uk:29662
Download
Final eThesis - complete (pdf)
Filename: thesis_final_version.pdf
Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License