Fu, Rong (2009) Robust Speaker Diarization for Single Channel Recorded Meetings. PhD thesis, University of York.
Available under License Creative Commons Attribution-Noncommercial-No Derivative Works 2.0 UK: England & Wales.
This thesis describes research into speaker diarization for recorded meetings. It explores the algorithms and implementation of an off-line speaker segmentation and clustering system for meetings that have been recorded using a single microphone. Speaker diarization is defined as the process of partitioning a spoken recording into speaker-homogeneous regions. A meeting recording contains different kinds of noise, and the length of the noise varies significantly; the average speech turn is short, and the number of speakers is unknown. To reduce the influence of these aural characteristics on the performance of the speaker diarization system, this thesis proposes four new algorithms. First, a new speech activity detection method adjusts the non-speech model complexity according to the noise length ratio. Second, a new speaker change point detection measure, derived from Fisher Linear Discriminant Analysis, helps detect short speaker turns. Third, the Equal Weight Penalty Criterion is formulated as a new model complexity selection criterion for training both the speakers' models and the Universal Background Model (UBM). It contains two penalty terms: one penalizes the model dimensions and removes mixtures with small mixing probability; the other penalizes the Kullback-Leibler divergence between the prior and posterior distributions of the mixing parameters, removing components that share the same location. The criterion can be adjusted through the prior distribution parameter delta, which controls how many components are used in the model. Fourth, a weight and mean adaptation method is developed to adapt potential speaker models from the UBM. In addition, a potential speaker merging termination scheme based on Normalized Cuts is introduced into the system.
With all the new techniques derived in this thesis combined, the error rate of the baseline system was reduced from 18.61% to 9.24% on the development set, from 18.89% to 10.50% on the evaluation set from the AMI corpus, and from 21.35% to 15.48% on the evaluation set from the ISL corpus. When the Normalized Cuts based potential speaker merging termination scheme was used, the error rate of the baseline system was reduced from 18.61% to 10.33% on the development set, from 18.89% to 9.99% on the evaluation set from the AMI corpus, and from 21.35% to 13.70% on the evaluation set from the ISL corpus.
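The abstract reports error rates before and after the new techniques but does not state the size of the reduction. As a rough aid to reading those figures, the sketch below derives the absolute reduction (in percentage points) and the relative reduction from the quoted numbers. The error rates are taken from the abstract; the names `RESULTS`, `reduction` and `summarize` are illustrative and do not come from the thesis, and the relative figures are simple arithmetic rather than results reported by the author.

```python
# Error rates (%) quoted in the abstract, as (baseline, improved) per data set.
# The dictionary layout and labels are assumptions made for this illustration.
RESULTS = {
    "all techniques": {
        "development": (18.61, 9.24),
        "AMI evaluation": (18.89, 10.50),
        "ISL evaluation": (21.35, 15.48),
    },
    "with Normalized Cuts termination": {
        "development": (18.61, 10.33),
        "AMI evaluation": (18.89, 9.99),
        "ISL evaluation": (21.35, 13.70),
    },
}


def reduction(baseline, improved):
    """Return (absolute reduction in percentage points, relative reduction in %)."""
    absolute = baseline - improved
    relative = 100.0 * absolute / baseline
    return absolute, relative


def summarize(results):
    """Flatten the nested results into rows of rounded reductions."""
    rows = []
    for system, data_sets in results.items():
        for name, (baseline, improved) in data_sets.items():
            absolute, relative = reduction(baseline, improved)
            rows.append((system, name, round(absolute, 2), round(relative, 1)))
    return rows


if __name__ == "__main__":
    for system, name, absolute, relative in summarize(RESULTS):
        print(f"{system:34s} {name:15s} "
              f"-{absolute:.2f} points ({relative:.1f}% relative)")
```

On the development set with all techniques combined, for example, this works out to a drop of 9.37 percentage points, or roughly half of the baseline error.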
| Item Type: | Thesis (PhD) |
| Keywords: | speaker diarization, speaker recognition, speech recognition |
| Academic Units: | The University of York > Computer Science (York) |
| Depositing User: | Ms Rong Fu |
| Date Deposited: | 08 Nov 2011 15:06 |
| Last Modified: | 08 Aug 2013 08:47 |