Speaker diarization: importance of the modulation spectrum and incorporating uncertainty modelling
Author(s)
McKnight, Simon Webster
Type
Thesis or dissertation
Abstract
This Thesis investigates speaker diarization, an area of research with several practical applications. It covers three main areas: (a) analysis of evaluation methodology based on diarization error rates (DERs) and the use of forgiveness collars; (b) the use of modulation spectrum features to distinguish speakers, including overlapping speakers, and an investigation of which parts of the modulation spectrum are most salient; and (c) uncertainty quantification for machine learning models and its use to improve performance.
The analyses of evaluation methodology highlight shortcomings in the use of DERs and sensitivity to the ground truth used. Initial research shows that using simple DERs without forgiveness collars can unfairly penalise diarization systems. However, human subject-based experiments are conducted and compared to state-of-the-art systems, showing that uniform forgiveness collars are not a satisfactory way of dealing with insignificant errors.
Modulation spectrum features are thought to be a promising way to generate features that distinguish speakers well, particularly the joint acoustic and modulation spectrum form. Unlike most research, this Thesis covers both temporal envelope and temporal fine structure. It highlights that modulation frequencies in the 0-0.5 Hz range and around the fundamental frequencies of the speakers are most useful, contradicting earlier research preferring the 1-16 Hz range. Training machine learning models on both modulation spectrum features and mel-frequency cepstral coefficients is shown to give better results than training on either alone. However, these features do not distinguish overlapping speakers as well as anticipated.
Machine learning models that also indicate confidence in their predictions are clearly more helpful than those that simply predict. This Thesis investigates models quantifying aleatoric and epistemic uncertainties, using output probability distributions and Monte Carlo dropout respectively, and their use in Kalman filters to improve performance. Results show performance improves for certain hyperparameters, both for single models and model ensembles.
Version
Open Access
Date Issued
2022-06
Date Awarded
2022-11
Copyright Statement
Creative Commons Attribution NonCommercial Licence
Advisor
Naylor, Patrick
Sponsor
Engineering and Physical Sciences Research Council
Grant Number
EP/N509486/1
Publisher Department
Electrical and Electronic Engineering
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)