
‘Did the speaker change?’: Temporal tracking for overlapping speaker segmentation in multi-speaker scenarios

File: Hogg-AOT-2022-PhD-Thesis.pdf
Description: Thesis
Size: 7.8 MB
Format: Adobe PDF
Title: ‘Did the speaker change?’: Temporal tracking for overlapping speaker segmentation in multi-speaker scenarios
Authors: Hogg, Aidan
Item Type: Thesis or dissertation
Abstract: Diarization systems are an essential part of many speech processing applications, such as speaker indexing, improving automatic speech recognition (ASR) performance, and making single-speaker algorithms available for use in multi-speaker domains. This thesis focuses on the first task of the diarization process, speaker segmentation, which can be thought of as answering the question ‘Did the speaker change?’ in an audio recording. It starts by showing that time-varying pitch properties can be used to advantage within the segmentation step of a multi-talker diarization system. It then highlights that an individual’s pitch varies smoothly and can therefore be predicted by means of a Kalman filter. Subsequently, it shows that if the pitch is not predictable, this is most likely due to a change in speaker. A novel system is then proposed that uses this pitch-prediction approach for speaker change detection. The thesis goes on to demonstrate how voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing the onsets and end-points of a speaker’s utterance to be determined in the presence of an additional active speaker. This work is then extended to explore a new multimodal approach to overlapping speaker segmentation that tracks both the fundamental frequency (F0) and the direction of arrival (DoA) of each speaker simultaneously. The proposed multiple hypothesis tracking system, which tracks both features jointly, shows an improvement in segmentation performance compared with tracking these features separately. Lastly, the thesis focuses on the DoA estimation part of the proposed multimodal approach. It does this by exploring a polynomial extension to the multiple signal classification (MUSIC) algorithm, spatio-spectral polynomial (SSP)-MUSIC, and evaluating its performance with speech sound sources.
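The pitch-prediction idea in the abstract can be illustrated with a minimal sketch. This is not the system proposed in the thesis: the function name `detect_pitch_changes`, the random-walk state model, the variance values, and the gating threshold are all assumptions chosen for illustration. It runs a one-dimensional Kalman filter over a pitch track and flags frames where the observed pitch falls far outside the prediction.

```python
def detect_pitch_changes(f0_track, process_var=4.0, meas_var=9.0, gate=9.0):
    """Flag frames whose pitch is poorly predicted by a 1-D Kalman filter.

    f0_track: per-frame pitch estimates in Hz (voiced frames only).
    process_var, meas_var, gate: illustrative values, not from the thesis.
    """
    changes = []
    x = f0_track[0]  # state estimate: current pitch in Hz
    p = meas_var     # variance of the state estimate
    for k in range(1, len(f0_track)):
        # Predict: random-walk model, i.e. pitch is assumed to vary smoothly.
        x_pred = x
        p_pred = p + process_var
        # Innovation: distance between the observation and the prediction.
        innov = f0_track[k] - x_pred
        s = p_pred + meas_var  # innovation variance
        if innov * innov / s > gate:
            # Pitch was not predictable: record a candidate speaker change
            # and restart the filter on the new observation.
            changes.append(k)
            x, p = f0_track[k], meas_var
            continue
        # Update: standard Kalman correction.
        gain = p_pred / s
        x = x_pred + gain * innov
        p = (1.0 - gain) * p_pred
    return changes


# Hypothetical pitch track: one talker near 120 Hz, then another near 210 Hz.
track = [120, 121, 122, 121, 123, 210, 211, 209, 212]
print(detect_pitch_changes(track))  # [5]
```

In this toy track the jump at frame 5 produces an innovation far larger than the filter expects, so it is flagged, while the small frame-to-frame variations within each talker are absorbed by the prediction.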
Content Version: Open Access
Issue Date: Apr-2022
Date Awarded: Dec-2022
URI: http://hdl.handle.net/10044/1/108180
DOI: https://doi.org/10.25560/108180
Copyright Statement: Creative Commons Attribution NonCommercial Licence
Supervisor: Naylor, Patrick; Evers, Christine
Sponsor/Funder: Engineering and Physical Sciences Research Council
Funder's Grant Number: EP/L016796/1
Department: Electrical and Electronic Engineering
Publisher: Imperial College London
Qualification Level: Doctoral
Qualification Name: Doctor of Philosophy (PhD)
Appears in Collections: Electrical and Electronic Engineering PhD theses



This item is licensed under a Creative Commons License.