Speaker diarization: importance of the modulation spectrum and incorporating uncertainty modelling
File(s)
McKnight-S-2022-PhD-Thesis.pdf (10.17 MB)
Thesis
Author(s)
McKnight, Simon Webster
Type
Thesis or dissertation
Abstract
This Thesis investigates speaker diarization, an important area of research that has several practical applications. It has three main areas: (a) analyses of evaluation methodology based on diarization error rates (DER) and the use of forgiveness collars; (b) use of modulation spectrum features to distinguish speakers, including overlapping speakers, and investigation of which parts of the modulation spectrum are most salient; and (c) uncertainty quantification of machine learning models and use of the resulting uncertainty estimates to improve performance.

The analyses of evaluation methodology highlight shortcomings in the use of DERs and sensitivity to the ground truth used. Initial research shows that using simple DERs without forgiveness collars can unfairly penalise diarization systems. However, human subject-based experiments are conducted and compared to state-of-the-art systems, showing that uniform forgiveness collars are not a satisfactory way of dealing with insignificant errors.

Modulation spectrum features are thought to be a promising way to generate features that distinguish speakers well, particularly the joint acoustic and modulation spectrum form. Unlike most research, this Thesis covers both temporal envelope and temporal fine structure. It highlights that modulation frequencies in the 0-0.5 Hz range and around the fundamental frequencies of the speakers are most useful, contradicting earlier research preferring the 1-16 Hz range. Training machine learning models on both modulation spectrum features and mel-frequency cepstral coefficients is shown to give better results than training on either alone. However, the models do not distinguish overlapping speakers as well as anticipated.

Machine learning models that also indicate confidence in their predictions are clearly more helpful than those that simply predict. This Thesis investigates models quantifying aleatoric and epistemic uncertainties, using output probability distributions and Monte Carlo dropout respectively, and their use in Kalman filters to improve performance. Results show performance improves for certain hyperparameters, both for single models and model ensembles.
Version
Open Access
Date Issued
2022-06
Date Awarded
2022-11
URI
http://hdl.handle.net/10044/1/112778
DOI
https://doi.org/10.25560/112778
Copyright Statement
Creative Commons Attribution NonCommercial Licence
License URL
https://creativecommons.org/licenses/by-nc/4.0/
Advisor
Naylor, Patrick
Sponsor
Engineering and Physical Sciences Research Council
Grant Number
EP/N509486/1
Publisher Department
Electrical and Electronic Engineering
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)