Watch and learn: mapping language and noisy real-world videos with self-supervision

File | Description | Size | Format
2011.09634v1.pdf | Working paper | 1.49 MB | Adobe PDF
Title: Watch and learn: mapping language and noisy real-world videos with self-supervision
Authors: Zhong, Y
Xie, L
Wang, S
Specia, L
Miao, Y
Item Type: Working Paper
Abstract: In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. First, we define a self-supervised learning framework that captures cross-modal information. We then introduce a novel adversarial learning module that explicitly handles the noise in natural videos, where subtitle sentences are not guaranteed to correspond closely to the video snippets. For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles. We carry out experiments on bidirectional retrieval between sentences and videos, and the results demonstrate that our proposed model achieves state-of-the-art performance on both retrieval tasks and exceeds several strong baselines. The dataset will be released soon.
Issue Date: 19-Nov-2020
URI: http://hdl.handle.net/10044/1/84555
Publisher: arXiv
Copyright Statement: © 2020 The Author(s)
Sponsor/Funder: Commission of the European Communities
U.S. Air Force
Funder's Grant Number: 678017
Keywords: cs.CV
Notes: NeurIPS 2020 Self-Supervised Learning Workshop
Publication Status: Published
Appears in Collections: Computing