Shi, Yanpei ORCID: 0000-0001-8157-2630 (2021) Improving the Robustness of Speaker Recognition in Noise and Multi-Speaker Conditions Using Deep Neural Networks. PhD thesis, University of Sheffield.
Abstract
In speaker recognition, deep neural networks deliver state-of-the-art performance due to their large capacity and powerful feature extraction abilities. However, this performance can be severely degraded by interference from background noise and from other speakers. This thesis focuses on new neural network architectures designed to overcome such interference and thereby improve the robustness of speaker recognition systems.

To improve the noise robustness of the speaker recognition model, two novel network architectures are proposed. The first is a hierarchical attention network, which captures both local and global features in order to improve the robustness of the network. The experimental results show that it delivers performance comparable to published state-of-the-art methods, reaching a 4.28% equal error rate on the VoxCeleb1 training and test sets. The second approach is a joint speech enhancement and speaker recognition system consisting of two networks: the first integrates speech enhancement and speaker recognition into a single framework to better filter out noise, while the second additionally feeds speaker embeddings into the speech enhancement network. This provides the speech enhancement network with prior knowledge, which improves its performance. The results show that a joint system with a speaker-dependent speech enhancement model delivers performance comparable to published state-of-the-art methods, reaching a 4.15% equal error rate on the VoxCeleb1 training and test sets.

To overcome interfering speakers, two novel approaches are proposed. The first, referred to as embedding de-mixing, separates the speaker and content properties of a two-speaker signal in an embedding space rather than in the signal space. The results show that the de-mixed embeddings are close in quality to clean embeddings, and that a back-end speaker recognition model using the de-mixed embeddings reaches 96.9% speaker identification accuracy on the TIMIT dataset, compared with 98.5% using clean embeddings. The second approach is the first end-to-end weakly supervised speaker identification approach, based on a novel hierarchical transformer network architecture. The results show that the proposed model can capture the properties of two speakers from a single input utterance, and the hierarchical transformer network achieves more than a 3% relative improvement over the baselines in all test conditions.
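The abstract reports results as equal error rates (EER): the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (target trials rejected). As a minimal illustration of the metric — not code from the thesis — the following sketch computes the EER by sweeping a decision threshold over a toy list of verification-trial scores:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Equal error rate: the threshold at which the false acceptance
    rate (FAR) and false rejection rate (FRR) are closest.
    scores: similarity score per trial; labels: 1 = same speaker."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fars, frrs = [], []
    # Sweep candidate thresholds over the observed scores.
    for t in np.unique(scores):
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))   # impostors accepted
        frrs.append(np.mean(~accept[labels == 1]))  # targets rejected
    fars, frrs = np.array(fars), np.array(frrs)
    i = np.argmin(np.abs(fars - frrs))
    return (fars[i] + frrs[i]) / 2.0

# Toy trial list: a higher score means "more likely the same speaker".
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
print(round(equal_error_rate(scores, labels), 3))  # → 0.333
```

In practice the thesis evaluates on standard trial lists (e.g. the VoxCeleb1 test set), where the scores would be cosine similarities or PLDA scores between speaker embeddings; a 4.28% EER means FAR and FRR are both 4.28% at the crossover threshold.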
Metadata
Supervisors: Hain, Thomas
Keywords: Speaker Recognition, Deep Neural Networks, Robust Speaker Recognition
Awarding institution: University of Sheffield
Academic Units: The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield)
Identification Number/EthosID: uk.bl.ethos.840404
Depositing User: Dr. Yanpei Shi
Date Deposited: 25 Oct 2021 15:35
Last Modified: 01 Dec 2021 10:54
Download
Final eThesis - complete (pdf)
Filename: thesis_final_version.pdf
Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License