Exploiting diverse modalities for enhanced image-to-image translation and face analysis
Author(s)
Uncu, Rumeysa
Type
Thesis or dissertation
Abstract
In recent years, deep learning, especially in generative models, has driven major advances in computer vision, particularly in image-to-image translation and facial analysis. A key factor in these developments is data augmentation, which helps models overcome the limitations of small and imbalanced datasets. This thesis explores how various modalities (text, semantic segmentation, and depth) can enhance computer vision tasks, with a focus on image manipulation. The fundamental goal is not only to address challenges in image manipulation but also to contribute meaningfully to advances in computer vision through the integration of informative cues.
We first introduce a method that uses word representations to augment labels for improved face attribute classification, demonstrating the effectiveness of continuous representations as a regularisation tool in deep neural networks.
Building on advances in generative models such as Generative Adversarial Networks, we then move to image-to-image translation. Specifically, we incorporate geometric information, such as depth and surface normals, to generate realistic facial images that preserve geometric and photometric quality. We follow this with a hierarchical architecture guided by semantic segmentation for facial image translation; this architecture processes local regions through dedicated networks, improving the quality and realism of the synthesised facial images.
These advances set the stage for text-guided image manipulation, reflecting the evolving landscape of generative models and incorporating diffusion models and large language models. We first address the challenge of unpaired datasets by generating target labels and retrieving pseudo-target images, enabling weak supervision for controllable editing. Lastly, we introduce prompt augmentation for self-supervised text-guided image manipulation; this method expands a single input prompt into a diverse set of target prompts, enabling coherent image transformations while preserving contextual information.
The connection between these chapters highlights a comprehensive approach to augmenting computer vision tasks using multiple modalities, with each work building on the insights of the previous one.
Version
Open Access
Date Issued
2024-03
Date Awarded
2024-11
Copyright Statement
Creative Commons Attribution NonCommercial NoDerivatives Licence
Advisor
Kim, Tae-Kyun
Publisher Department
Electrical and Electronic Engineering
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)