End-to-end multimodal continuous affect recognition
Author(s)
Tzirakis, Panagiotis
Type
Thesis or dissertation
Abstract
Automatic continuous affect recognition is a vital component of complete and natural interaction between humans and machines. However, the task is challenging, as human emotions lack temporal boundaries and are expressed through various modalities and in various ways across individuals and cultures. Moreover, the perception of emotion is highly subjective, which makes the task even more challenging.
In recent years, several advances have been made in determining emotional states with the use of Deep Neural Networks (DNNs).
Nevertheless, most of these methods either fail to consider multiple modalities when modelling affect, or do not fully exploit the strong representation capabilities of DNNs because they still rely on hand-engineered features.
In this thesis, we aim to provide deep learning methodologies and models for the multimodal continuous affect recognition task, trained in an end-to-end manner, i.e., utilising the raw input signal.
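As a rough illustration of what such an end-to-end model looks like, the following minimal PyTorch sketch maps a raw audio waveform directly to continuous arousal/valence predictions without hand-engineered features. This is illustrative code, not the thesis implementation; all layer sizes, kernel widths, and names are assumptions.

```python
# Minimal sketch (not the thesis code): an end-to-end model from raw
# waveform to per-frame continuous affect predictions.
import torch
import torch.nn as nn

class RawAudioAffectNet(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # 1-D convolutions learn features directly from the waveform,
        # replacing hand-engineered descriptors such as MFCCs.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=3), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        # A recurrent layer models the temporal evolution of emotion.
        self.rnn = nn.GRU(128, hidden, batch_first=True)
        # Two continuous outputs per time step: arousal and valence.
        self.head = nn.Linear(hidden, 2)

    def forward(self, wave):                  # wave: (batch, samples)
        x = self.frontend(wave.unsqueeze(1))  # (batch, 128, frames)
        x, _ = self.rnn(x.transpose(1, 2))    # (batch, frames, hidden)
        return self.head(x)                   # (batch, frames, 2)

model = RawAudioAffectNet()
print(model(torch.randn(2, 16000)).shape)    # torch.Size([2, 248, 2])
```

The whole stack is differentiable, so the convolutional front-end, temporal model, and regression head are optimised jointly from the raw signal.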
In particular, we show that utilising multiple modalities in an end-to-end deep learning model yields better performance than using a single modality, whether the model is trained under controlled or naturalistic environmental conditions. Although using multiple modalities can benefit the model's performance, how the information from the modalities is fused into the model matters. To this end, we investigate the integration of attention methods for fusing multiple modalities in a deep neural network architecture, and present the performance gains these methods provide over conventional approaches. Finally, given the high performance of end-to-end DNN models, we provide a toolkit with end-to-end learning capabilities.
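To make the fusion idea concrete, the sketch below shows one simple form of attention-based fusion: per-frame scalar weights over the modalities, rather than plain concatenation. Again, this is an illustrative assumption, not the architecture from the thesis; the feature dimensions and variable names are hypothetical.

```python
# Minimal sketch: attention over modalities at each time step.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Weights each modality per time step before summing them."""
    def __init__(self, dim=128):
        super().__init__()
        # A shared scorer assigns each modality a relevance score.
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                   # list of (batch, T, dim)
        stacked = torch.stack(feats, dim=2)     # (batch, T, M, dim)
        scores = self.score(stacked)            # (batch, T, M, 1)
        weights = torch.softmax(scores, dim=2)  # attention over modalities
        return (weights * stacked).sum(dim=2)   # (batch, T, dim)

fusion = AttentionFusion()
audio = torch.randn(2, 50, 128)  # hypothetical audio features
video = torch.randn(2, 50, 128)  # hypothetical visual features
print(fusion([audio, video]).shape)  # torch.Size([2, 50, 128])
```

Because the weights are recomputed at every time step, the model can lean on whichever modality is most informative at that moment, e.g. favouring audio when the face is occluded.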
Version
Open Access
Date Issued
2022-02
Date Awarded
2022-07
Copyright Statement
Creative Commons Attribution NonCommercial Licence
Advisor
Schuller, Björn
Sponsor
Engineering and Physical Sciences Research Council (Great Britain)
Grant Number
EP/L016796/1
Publisher Department
Computing
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)