Visually guided self supervised learning of speech representations
File(s) 2001.04316v1.pdf (475.01 KB)
Working paper
Author(s)
Shukla, Abhinav
Vougioukas, Konstantinos
Ma, Pingchuan
Petridis, Stavros
Pantic, Maja
Type
Working Paper
Abstract
Self-supervised representation learning has recently attracted considerable
research interest in both the audio and visual modalities. However, most works
focus on a single modality or feature alone, and there has been very limited
work studying the interaction between the two modalities for learning
self-supervised representations. We propose a framework for learning audio
representations guided by the visual modality in the context of audiovisual
speech. We employ a generative audio-to-video training scheme in which we
animate a still image corresponding to a given audio clip and optimize the
generated video to be as close as possible to the real video of the speech
segment. Through this process, the audio encoder network learns useful speech
representations that we evaluate on emotion recognition and speech recognition.
We achieve state-of-the-art results for emotion recognition and competitive
results for speech recognition. This demonstrates the potential of visual
supervision for learning audio representations, a novel and previously
unexplored direction for self-supervised learning. The proposed unsupervised
audio features can leverage a virtually unlimited amount of unlabelled
audiovisual speech data and have many potentially promising applications.
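As a rough illustration of the training scheme described in the abstract (an audio encoder whose embedding conditions a generator that animates a still image, optimized so the generated output matches the real video of the speech segment), the PyTorch sketch below shows one self-supervised training step. All module names, layer sizes, and the plain L1 reconstruction loss are illustrative assumptions and do not reproduce the authors' actual architecture or objectives.

import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    # Encodes a raw audio window into a fixed-size embedding (hypothetical sizes).
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, audio):             # audio: (B, 1, T)
        h = self.net(audio).squeeze(-1)   # (B, 128)
        return self.proj(h)               # (B, embed_dim)

class FrameGenerator(nn.Module):
    # Animates a still identity image conditioned on the audio embedding (toy decoder).
    def __init__(self, embed_dim=256, img_channels=3):
        super().__init__()
        self.fuse = nn.Conv2d(img_channels + embed_dim, 64, 3, padding=1)
        self.out = nn.Conv2d(64, img_channels, 3, padding=1)

    def forward(self, still_image, audio_emb):
        b, _, h, w = still_image.shape
        cond = audio_emb[:, :, None, None].expand(b, -1, h, w)
        x = torch.relu(self.fuse(torch.cat([still_image, cond], dim=1)))
        return torch.sigmoid(self.out(x))  # one generated video frame

# One self-supervised step: reconstruct the real frame from audio + still image.
encoder, generator = AudioEncoder(), FrameGenerator()
opt = torch.optim.Adam(list(encoder.parameters()) + list(generator.parameters()), lr=1e-4)

audio = torch.randn(8, 1, 16000)       # 1 s of 16 kHz audio (dummy data)
still = torch.rand(8, 3, 64, 64)       # still identity frame
real_frame = torch.rand(8, 3, 64, 64)  # target frame from the real video

fake_frame = generator(still, encoder(audio))
loss = nn.functional.l1_loss(fake_frame, real_frame)  # reconstruction loss
opt.zero_grad()
loss.backward()
opt.step()

After training with this kind of objective, the generator would be discarded and the audio encoder's embeddings reused as speech features for downstream tasks such as emotion recognition and speech recognition, as evaluated in the paper.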
Date Issued
2020-02-20
Citation
2020
Publisher
arXiv
Copyright Statement
© 2020 The Author(s)
Identifier
http://arxiv.org/abs/2001.04316v1
Subjects
eess.AS
cs.CV
cs.MM
Notes
Submitted to ICASSP 2020
Publication Status
Published