Watch and learn: mapping language and noisy real-world videos with
self-supervision
File(s)
2011.09634v1.pdf (1.46 MB)
Working paper
Author(s)
Zhong, Yujie
Xie, Linhai
Wang, Sen
Specia, Lucia
Miao, Yishu
Type
Working Paper
Abstract
In this paper, we teach machines to understand visuals and natural language
by learning the mapping between sentences and noisy video snippets without
explicit annotations. First, we define a self-supervised learning framework
that captures cross-modal information. We then introduce a novel adversarial
learning module to explicitly handle the noise in natural videos, where
the subtitle sentences are not guaranteed to correspond closely to the
video snippets. For training and evaluation, we contribute a new dataset,
'ApartmenTour', that contains a large number of online videos and subtitles. We
carry out experiments on the bidirectional retrieval tasks between sentences
and videos, and the results demonstrate that our proposed model achieves
state-of-the-art performance on both retrieval tasks and exceeds several strong
baselines. The dataset will be released soon.
Date Issued
2020-11-19
Citation
2020
Publisher
arXiv
Copyright Statement
© 2020 The Author(s)
Sponsor
Commission of the European Communities
U.S. Air Force
Identifier
http://arxiv.org/abs/2011.09634v1
Grant Number
678017
FA8655-20-1-7006
Subjects
cs.CV
Notes
NeurIPS 2020 Self-Supervised Learning Workshop
Publication Status
Published