Zhang, Jisi (2022) Time-domain Multi-channel Speech Separation for Overlapping Speech Recognition. PhD thesis, University of Sheffield.
Abstract
Despite the recent progress of automatic speech recognition (ASR) driven by deep learning, conversational speech recognition using distant microphones is still challenging. In natural environments, an utterance recorded by distant microphones is corrupted by noise and reverberation and overlapped with speech from competing speakers, all of which degrade speech recognition performance.
Speech separation techniques aim to recover the individual sources from a noisy mixture and have been shown to benefit robust ASR. Deep-learning-based separation approaches using a single microphone have moved towards directly processing time-domain signals and have outperformed time-frequency domain approaches. When multiple microphones are available, spatial information has been demonstrated to be beneficial for separation. This thesis investigates deep-learning-based approaches to time-domain separation using multiple microphones. The designed system is further applied to overlapping speech recognition in noisy environments. The three major contributions are summarised as follows.
Firstly, a fully-convolutional multi-channel time-domain separation network is developed. The system uses a neural network to automatically learn spatial features from multiple recordings. Different network architectures and multi-stage separation are also considered in the system design. Experiments show that the proposed system achieves better separation and recognition performance than a conventional time-frequency domain approach.
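For illustration, the following is a minimal PyTorch sketch of the general idea rather than the exact architecture developed in the thesis: a learned encoder operates on a reference channel, a second convolutional layer learns spatial features jointly from all microphone channels, and the concatenated features drive a mask-based separator whose outputs are decoded back to waveforms. All layer sizes, module names and the simple separator stack are illustrative assumptions.

```python
# A minimal sketch (not the thesis's exact architecture) of a multi-channel
# time-domain separation network: learned encoder on a reference channel,
# learned spatial features from all channels, mask estimation, and decoding.
import torch
import torch.nn as nn

class MultiChannelTimeDomainSeparator(nn.Module):
    def __init__(self, n_mics=2, n_speakers=2, n_filters=256, kernel=16, stride=8):
        super().__init__()
        # Learned analysis filterbank applied to the reference microphone.
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        # Spatial encoder: learns inter-channel features from all microphones.
        self.spatial = nn.Conv1d(n_mics, n_filters, kernel, stride=stride, bias=False)
        # Simple convolutional stack standing in for a temporal convolutional network.
        self.separator = nn.Sequential(
            nn.Conv1d(2 * n_filters, n_filters, 1), nn.PReLU(),
            nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.PReLU(),
            nn.Conv1d(n_filters, n_speakers * n_filters, 1),
        )
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)
        self.n_speakers, self.n_filters = n_speakers, n_filters

    def forward(self, mixture):                 # mixture: (batch, n_mics, samples)
        ref = self.encoder(mixture[:, :1])      # reference-channel features
        spa = self.spatial(mixture)             # learned spatial features
        masks = torch.sigmoid(self.separator(torch.cat([ref, spa], dim=1)))
        masks = masks.view(-1, self.n_speakers, self.n_filters, masks.shape[-1])
        est = [self.decoder(masks[:, s] * ref) for s in range(self.n_speakers)]
        return torch.cat(est, dim=1)            # (batch, n_speakers, samples)

# Usage: separate a two-channel mixture of two speakers.
model = MultiChannelTimeDomainSeparator()
separated = model(torch.randn(4, 2, 16000))
```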
Next, the time-domain separation system is extended to a speaker extraction system, which employs speaker identity information. A two-stage speaker conditioning mechanism is proposed to efficiently provide the speaker information to the extraction system. The proposed extraction system can simultaneously output multiple corresponding sources from a noisy mixture and further improves the recognition performance over the blind separation approach.
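A minimal sketch of speaker-conditioned extraction is given below, assuming a speaker encoder that maps an enrolment utterance to an embedding; the embedding is injected at two points of the extraction network by feature-wise scaling. This only illustrates the general idea of multi-stage speaker conditioning; the conditioning mechanism, layer sizes and names are assumptions, not the thesis's exact design.

```python
# A minimal sketch of speaker-conditioned extraction: an enrolment utterance is
# mapped to a speaker embedding, which scales the hidden features at two stages
# of the extraction network before mask estimation and decoding.
import torch
import torch.nn as nn

class SpeakerConditionedExtractor(nn.Module):
    def __init__(self, feat_dim=256, emb_dim=128):
        super().__init__()
        self.encoder = nn.Conv1d(1, feat_dim, 16, stride=8, bias=False)
        self.spk_encoder = nn.Sequential(                  # enrolment -> embedding
            nn.Conv1d(1, emb_dim, 16, stride=8, bias=False),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.stage1 = nn.Conv1d(feat_dim, feat_dim, 3, padding=1)
        self.stage2 = nn.Conv1d(feat_dim, feat_dim, 3, padding=1)
        self.cond1 = nn.Linear(emb_dim, feat_dim)          # first conditioning point
        self.cond2 = nn.Linear(emb_dim, feat_dim)          # second conditioning point
        self.mask = nn.Conv1d(feat_dim, feat_dim, 1)
        self.decoder = nn.ConvTranspose1d(feat_dim, 1, 16, stride=8, bias=False)

    def forward(self, mixture, enrolment):                 # (B, 1, T), (B, 1, T_enrol)
        feats = self.encoder(mixture)
        emb = self.spk_encoder(enrolment)                  # (B, emb_dim)
        h = torch.relu(self.stage1(feats)) * self.cond1(emb).unsqueeze(-1)
        h = torch.relu(self.stage2(h)) * self.cond2(emb).unsqueeze(-1)
        return self.decoder(torch.sigmoid(self.mask(h)) * feats)

# Usage: extract the enrolled speaker from a single-channel mixture.
model = SpeakerConditionedExtractor()
target = model(torch.randn(2, 1, 16000), torch.randn(2, 1, 32000))
```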
The third contribution studies unsupervised and semi-supervised learning approaches for building a separation system when only a limited amount of clean data is available. An existing unsupervised training strategy, in which a separation system is trained to predict mixtures, is improved in this work by exploiting teacher-student learning approaches.
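The mixture-prediction strategy referred to here resembles mixture invariant training; the sketch below illustrates only the teacher-student component, assuming `teacher` and `student` are separation models with the same interface as the first sketch above (mixture in, stacked source estimates out). The teacher's outputs on unlabelled mixtures act as pseudo-targets for the student under a permutation-invariant SI-SNR loss; the loss and training step are illustrative assumptions, not the thesis's exact recipe.

```python
# A minimal sketch of a teacher-student training step on unlabelled mixtures:
# the teacher's separated outputs serve as pseudo-targets, and the student is
# trained with a permutation-invariant negative SI-SNR loss against them.
import torch

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between two (batch, samples) signals."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return -10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def teacher_student_step(student, teacher, mixture, optimizer):
    with torch.no_grad():
        pseudo = teacher(mixture)              # (batch, 2, samples) pseudo-targets
    est = student(mixture)                     # (batch, 2, samples) student estimates
    # Permutation-invariant loss over the two possible speaker orderings.
    loss_a = si_snr_loss(est[:, 0], pseudo[:, 0]) + si_snr_loss(est[:, 1], pseudo[:, 1])
    loss_b = si_snr_loss(est[:, 0], pseudo[:, 1]) + si_snr_loss(est[:, 1], pseudo[:, 0])
    loss = torch.minimum(loss_a, loss_b).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```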
Metadata
| Supervisors | Barker, Jon |
| --- | --- |
| Keywords | speech separation, robust speech recognition |
| Awarding institution | University of Sheffield |
| Academic Units | The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield); The University of Sheffield > Faculty of Engineering (Sheffield) |
| Identification Number/EthosID | uk.bl.ethos.860663 |
| Depositing User | Mr Jisi Zhang |
| Date Deposited | 15 Aug 2022 08:22 |
| Last Modified | 01 Sep 2022 09:54 |
| Open Archives Initiative ID (OAI ID) | oai:etheses.whiterose.ac.uk:31141 |
Download
Final eThesis - complete (pdf)
Filename: JZhang_2022_July.pdf
Licence: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.