Distant Speech Recognition of Natural Spontaneous Multi-party Conversations

Abstract

Distant speech recognition (DSR) has gained wide interest recently. While deep networks keep improving ASR overall, the performance gap remains between using close-talking recordings and distant recordings. Therefore the work in this thesis aims at providing some insights for further improvement of DSR performance.

The investigation starts with collecting the first multi-microphone and multi-media corpus of natural spontaneous multi-party conversations in native English with the speaker location tracked, i.e. the Sheffield Wargame Corpus (SWC). The state-of-the-art recognition systems with the acoustic models trained standalone and adapted both show word error rates (WERs) above 40% on headset recordings and above 70% on distant recordings. A comparison between SWC and AMI corpus suggests a few unique properties in the real natural spontaneous conversations, e.g. the very short utterances and the emotional speech. Further experimental analysis based on simulated data and real data quantifies the impact of such influence factors on DSR performance, and illustrates the complex interaction among multiple factors which makes the treatment of each influence factor much more difficult.

The reverberation factor is studied further. It is shown that the reverberation effect on speech features could be accurately modelled with a temporal convolution in the complex spectrogram domain. Based on that a polynomial reverberation score is proposed to measure the distortion level of short utterances. Compared to existing reverberation metrics like C50, it avoids a rigid early-late-reverberation partition without compromising the performance on ranking the reverberation level of recording environments and channels. Furthermore, the existing reverberation measurement is signal independent thus unable to accurately estimate the reverberation distortion level in short recordings. Inspired by the phonetic analysis on the reverberation distortion via self-masking and overlap-masking, a novel partition of reverberation distortion into the intra-phone smearing and the inter-phone smearing is proposed, so that the reverberation distortion level is first estimated on each part and then combined.

Metadata

Supervisors:	Hain, Thomas and Guy, Brown
Keywords:	Speech recognition; distant speech recognition; LVCSR; deep neural network; Sheffield Wargame Corpus; beamforming; dereverberation; reverberation measurement; spontaneous speech recognition
Awarding institution:	University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Computer Science (Sheffield) The University of Sheffield > Faculty of Science (Sheffield) > Computer Science (Sheffield)
Identification Number/EthosID:	uk.bl.ethos.718827
Depositing User:	Miss Yulan Liu
Date Deposited:	03 Jul 2017 10:55
Last Modified:	12 Oct 2018 09:41
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:17691

Download

Final submission of the PhD thesis

Filename: thesis_final_20June2017.pdf

Description: Final submission of the PhD thesis

Licence:
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License

CLICK TO DOWNLOAD

[thumbnail of Final submission of the PhD thesis]

You do not need to contact us to get a copy of this thesis. Please use the 'Download' link(s) above to get a copy.
You can contact us about this thesis. If you need to make a general enquiry, please see the Contact us page.

Distant Speech Recognition of Natural Spontaneous Multi-party Conversations

Abstract

Metadata

Download

Final submission of the PhD thesis

Export

Statistics