
Generation of realistic human behaviour

File: Vougioukas-K-2022-PhD-Thesis.pdf (Description: Thesis; Size: 51.63 MB; Format: Adobe PDF)
Title: Generation of realistic human behaviour
Authors: Vougioukas, Konstantinos
Item Type: Thesis or dissertation
Abstract: As the use of computers and robots in our everyday lives increases, so does the need for better interaction with these devices. Human-computer interaction relies on the ability to understand and generate human behavioural signals such as speech, facial expressions and motion. This thesis deals with the synthesis and evaluation of such signals, focusing not only on their intelligibility but also on their realism. Since these signals are often correlated, it is common for methods to drive the generation of one signal using another. The thesis begins by tackling the problem of speech-driven facial animation, proposing models capable of producing realistic animations from a single image and an audio clip. The goal of these models is to produce a video of a target person whose lips move in accordance with the driving audio. Particular focus is placed on a) generating spontaneous expressions such as blinks, b) achieving audio-visual synchrony and c) transferring or producing natural head motion. The second problem addressed in this thesis is video-driven speech reconstruction, which aims to convert a silent video into waveforms containing speech. The method proposed for solving this problem is capable of generating intelligible and accurate speech for both seen and unseen speakers. The spoken content is correctly captured thanks to a perceptual loss, which uses features from pre-trained speech-driven animation models. The ability of the video-to-speech model to run in real time allows its use in hearing assistive devices and telecommunications. The final work proposed in this thesis is a generic domain translation system that can be used for any translation problem, including those mapping across different modalities. The framework is made up of two networks performing translations in opposite directions and can be successfully applied to diverse translation problems, including speech-driven animation and video-driven speech reconstruction.
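The perceptual loss mentioned in the abstract compares generated and reference signals in the feature space of a frozen, pre-trained network rather than sample-by-sample. The following is a minimal illustrative sketch of that idea, not the thesis code: the fixed random linear map `W` stands in for a pre-trained speech-driven animation model, and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Frozen "feature extractor": a fixed linear map + nonlinearity, standing in
# for a pre-trained network whose weights are not updated during training.
W = rng.standard_normal((16, 64))

def features(x: np.ndarray) -> np.ndarray:
    """Map a length-64 signal to a 16-dim feature vector (frozen weights)."""
    return np.tanh(W @ x)

def perceptual_loss(generated: np.ndarray, reference: np.ndarray) -> float:
    """Mean absolute distance measured in feature space, not signal space."""
    return float(np.mean(np.abs(features(generated) - features(reference))))

reference = rng.standard_normal(64)
loss_same = perceptual_loss(reference, reference)   # identical signals: zero loss
loss_diff = perceptual_loss(-reference, reference)  # different signals: positive loss
```

Because the extractor is frozen, gradients of this loss steer the generator toward outputs whose high-level features match the reference, which is how the spoken content can be captured even when exact waveforms differ.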
Content Version: Open Access
Issue Date: Mar-2022
Date Awarded: Aug-2022
URI: http://hdl.handle.net/10044/1/99407
DOI: https://doi.org/10.25560/99407
Copyright Statement: Creative Commons Attribution NonCommercial Licence
Supervisor: Pantic, Maja
Sponsor/Funder: Engineering and Physical Sciences Research Council (EPSRC)
Funder's Grant Number: 2130174
Department: Computing
Publisher: Imperial College London
Qualification Level: Doctoral
Qualification Name: Doctor of Philosophy (PhD)
Appears in Collections:Computing PhD theses

This item is licensed under a Creative Commons License.