Ravenscroft, William ORCID: https://orcid.org/0000-0002-0780-3303 (2024) Speech Separation in Noisy Reverberant Acoustic Environments. PhD thesis, University of Sheffield.
Abstract
Speech separation remains a vital area of research for many modern technologies. The ubiquitous spread of deep neural networks (DNNs) in many areas of signal processing (SP) and machine learning (ML) research over the past decade has resulted in significant improvements in single-channel speech separation on a number of benchmark datasets, particularly for anechoic speech separation. Speech separation and enhancement in noisy and reverberant acoustic environments remains challenging, particularly for single-channel models, and this is the area on which this thesis primarily focuses.

In recent years, the temporal convolutional network (TCN) has been a popular sequence model, particularly for speech separation. Arguably the most popular such model, the convolutional time-domain audio separation network (Conv-TasNet), is analysed in the second part of this thesis to assess how well it performs on dereverberation tasks, with a view to improving combined speech enhancement and separation performance. It is shown that the network's optimal receptive field varies depending on the reverberation time of the data being dereverberated. Further to this, the weighted multi-dilation temporal convolutional network (WD-TCN) is proposed as an improvement to the TCN and is shown to give consistent improvements in dereverberation and separation tasks on various subsets of a corpus of noisy reverberant speech mixtures known as WHAMR. The WD-TCN allows the network to dynamically adjust its receptive field to focus on more or less localized temporal context. An alternative improvement, which adjusts the receptive field itself at the frame level, is also proposed. This model is referred to as the deformable temporal convolutional network (DTCN), as it uses deformable convolution in place of vanilla convolution, allowing the convolutional kernel a fully adaptive receptive field via a linear interpolation function.

The TCN is typically considered computationally lightweight compared to other DNN models; however, it does not model global sequence information beyond the mean and variance. Transformer networks that process high-level global context have led to breakthroughs in performance across various areas of machine learning, including speech separation. The third part of this thesis starts by investigating how multihead attention (MHA), the main mechanism used in Transformers, can be leveraged to encode global context in the encodings of the Conv-TasNet model. A multihead self-attention (MHSA) encoder is proposed and shows notable improvements in performance across a number of acoustic conditions. A series of MHA decoders is also proposed, yielding some, though less notable, improvements. Following on from this, a study on training signal lengths (TSLs) is performed on the dual-path (DP) Transformer-based speech separation model known as SepFormer. This study demonstrates that the choice of TSL is non-trivial and depends primarily on the signal length distribution of the training data, but also on the data segmentation strategy and the model architecture. Following this study, a DNN model referred to as the time-domain Conformer (TD-Conformer) is proposed as a more suitable analogue of the DP Transformer model, particularly for noisy reverberant speech separation. The Conformer is also notably more efficient to train than the DP Transformer in the context of the work presented in this thesis.
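As a toy illustration of the weighted multi-dilation mechanism described above, the following PyTorch sketch weights two depthwise convolution branches of different dilation with a dynamically computed per-channel gate. It is a minimal sketch under assumed layer sizes and a simple squeeze-and-excite style gating network, not the thesis's exact WD-TCN architecture.

```python
import torch
import torch.nn as nn

class WeightedMultiDilationBlock(nn.Module):
    """Illustrative sketch: two depthwise conv branches of different
    dilation, combined with a dynamically computed per-channel weight."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        pad_local = (kernel_size - 1) // 2
        pad_dilated = dilation * (kernel_size - 1) // 2
        self.local = nn.Conv1d(channels, channels, kernel_size,
                               padding=pad_local, groups=channels)
        self.dilated = nn.Conv1d(channels, channels, kernel_size,
                                 padding=pad_dilated, dilation=dilation,
                                 groups=channels)
        # Squeeze-and-excite style gate: global average pool -> sigmoid weight.
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        w = self.gate(x.mean(dim=-1)).unsqueeze(-1)  # (batch, channels, 1)
        # Convex combination steers the effective receptive field towards
        # more local (w -> 1) or more dilated (w -> 0) temporal context.
        return w * self.local(x) + (1.0 - w) * self.dilated(x)

if __name__ == "__main__":
    block = WeightedMultiDilationBlock(channels=64)
    print(block(torch.randn(2, 64, 500)).shape)  # torch.Size([2, 64, 500])
```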
It is shown that the TD-Conformer model, which uses convolution in place of Transformer layers for processing local context, leads to improved performance for combined speech separation and enhancement, with a reduction in computational complexity on shorter signal lengths. Evidence is also provided that this is likely due to the larger model size resulting in improved generalization properties of the network. A further chapter is dedicated to analysing the differences between DP models and Conformer models by proposing a model which combines the two, referred to as the convolutional separation Transformer (ConSepT) model. The analysis of this model highlights how the larger model size of the Conformer models helps them to generalize, and further demonstrates some of the redundancies in the SepFormer model.

In the final part of this thesis, the application of speech separation to multi-speaker automatic speech recognition (ASR) is explored. A novel method is proposed to fine-tune the TD-Conformer model for ASR in a transcription-free fashion. This is done by leveraging the embeddings of pre-trained ASR models and computing the differences between them. The proposed method is shown to result in a reduction in word error rate (WER) on a dataset of real doctor-patient conversations.
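The transcription-free fine-tuning idea can be illustrated with the sketch below, which assumes access to a frozen pre-trained ASR encoder and a reference signal. The function name, the L1 distance, and the commented training step are hypothetical placeholders rather than the thesis's exact method.

```python
import torch
import torch.nn.functional as F

def embedding_matching_loss(separated_wav, reference_wav, asr_encoder):
    """Hypothetical transcription-free loss: distance between frozen
    ASR-encoder embeddings of the separated output and of a reference
    signal, so no transcripts are needed to fine-tune the separator."""
    with torch.no_grad():                     # keep the ASR encoder fixed
        target_emb = asr_encoder(reference_wav)
    est_emb = asr_encoder(separated_wav)      # gradients reach the separator
    return F.l1_loss(est_emb, target_emb)

# Sketch of a fine-tuning step (separator, asr_encoder, and optimiser are
# assumed to be pre-instantiated objects; names are placeholders):
# loss = embedding_matching_loss(separator(mixture), reference, asr_encoder)
# loss.backward(); optimiser.step()
```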
Metadata
Supervisors: Hain, Thomas and Goetze, Stefan
Keywords: source separation, speech recognition, multi-speaker, speech enhancement, dereverberation
Awarding institution: University of Sheffield
Academic Units: The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield)
Depositing User: Dr William Ravenscroft
Date Deposited: 22 Aug 2024 11:08
Last Modified: 22 Aug 2024 11:08
Open Archives Initiative ID (OAI ID): oai:etheses.whiterose.ac.uk:35362
Download
Final eThesis - complete (pdf)
Filename: Will___PhD_Thesis___Final (2).pdf
Licence:
This work is licensed under a Creative Commons Attribution NonCommercial NoDerivatives 4.0 International License