Disentangling content and motion for text-based neural video manipulation
File(s)
0443.pdf (2.74 MB)
Published version
Author(s)
Type
Conference Paper
Abstract
Giving machines the ability to imagine possible new objects or scenes from linguistic descriptions and produce their realistic renderings is arguably one of the most challenging problems in computer vision. Recent advances in deep generative models have led to new approaches that give promising results towards this goal. In this paper, we introduce a new method called DiCoMoGAN for manipulating videos with natural language, aiming to perform local and semantic edits on a video clip to alter the appearances of an object of interest. Our GAN architecture allows for better utilization of multiple observations by disentangling content and motion to enable controllable semantic edits. To this end, we introduce two tightly coupled networks: (i) a representation network for constructing a concise understanding of motion dynamics and temporally invariant content, and (ii) a translation network that exploits the extracted latent content representation to actuate the manipulation according to the target description. Our qualitative and quantitative evaluations demonstrate that DiCoMoGAN significantly outperforms existing frame-based methods, producing temporally coherent and semantically more meaningful results.
Date Issued
2022-11-21
Date Acceptance
2022-11-21
Citation
The 33rd British Machine Vision Conference Proceedings, 2022
Publisher
British Machine Vision Association
Journal / Book Title
The 33rd British Machine Vision Conference Proceedings
Copyright Statement
© 2022. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.
Source
British Machine Vision Conference (BMVC 2022)
Publication Status
Published
Start Date
2022-11-21
Finish Date
2022-11-24
Coverage Spatial
London, UK