Generative models, such as Auto-Encoders, Generative Adversarial Networks (GANs), Generative Flows, and Diffusion Models, are fascinating for their ability to synthesize versatile visual and audio information from mere noise. This generative process usually requires a model to perceive and compress high-dimensional data into a compact, low-dimensional latent space, where each dimension encodes some valuable semantic variation in the original data space. How much we know about this latent space is vital, because it determines how well we can take advantage of the corresponding generative model. Imagine a GAN trained on human faces: if we know which dimension in its latent vector controls the concept "hair shape", we can synthesize multiple images of the same face to try different hairstyles without changing other facial attributes. Disentangling generative models makes them more fun to play with, which is the topic of this thesis. This thesis studies the unsupervised disentanglement of the latent space in GANs, focusing on the image domain and extending to multi-modal settings (image captioning and text-to-image synthesis). The proposed methods enable a GAN model to disentangle its latent space automatically, thus sparing the expensive effort of collecting semantic labels for the training data. Derived from disentanglement, this thesis also covers studies on model interpretability and human-controllable data synthesis. It contains three main topics:
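As a concrete illustration of such an edit, the sketch below varies a single latent dimension of a hypothetical pretrained generator `G` while freezing the rest. The dimension index, latent size, and traversal range are all illustrative assumptions; a dimension that cleanly controls hair shape only exists once the model is disentangled.

```python
import torch

def traverse_latent(G, hair_dim=7, steps=5, latent_dim=512):
    # Hypothetical sketch: G maps a latent vector z to an image.
    # `hair_dim` and `latent_dim` are assumptions for illustration.
    z = torch.randn(1, latent_dim)           # one random face
    images = []
    for value in torch.linspace(-3.0, 3.0, steps):
        z_edit = z.clone()
        z_edit[0, hair_dim] = value          # move only the "hair shape" axis
        images.append(G(z_edit))             # same face, different hairstyle
    return images
```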
First, we work on general-purpose disentanglement. We propose OOGAN, a novel GAN-based disentanglement framework with One-hot Sampling and Orthogonal Regularization. While previous works primarily tackle disentanglement learning through VAEs with various approximation-based methods, we show that GANs have a natural advantage in disentangling via an alternating latent-variable (noise) sampling method that is straightforward and robust.
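A minimal sketch of the two ingredients is given below, assuming a schedule that alternates one-hot and continuous codes and an orthogonality penalty on the weight matrix that consumes the code; the function names, the continuous sampling distribution, and the penalty form are illustrative assumptions, not the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def sample_code(batch, code_dim, one_hot=True):
    # One-hot sampling: each code activates exactly one latent "concept",
    # encouraging the generator to tie one dimension to one factor.
    if one_hot:
        idx = torch.randint(code_dim, (batch,))
        return F.one_hot(idx, code_dim).float()
    # Alternate with continuous codes so the whole simplex stays covered
    # (softmax of Gaussian noise is an assumed choice here).
    return torch.softmax(torch.randn(batch, code_dim), dim=1)

def orthogonal_regularization(W):
    # Penalize off-diagonal correlation between the rows of the weight
    # matrix that processes the code, pushing concept directions apart.
    WWt = W @ W.t()
    I = torch.eye(W.size(0), device=W.device)
    return ((WWt - I) ** 2).sum()
```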
Second, we work on a more specific task: disentangling coarse-level and fine-level style attributes in a GAN. The proposed PIVQGAN facilitates independent control and manipulation of coarse-level object arrangements (posture) and fine-level styling (identity), both for images synthesized from noise and for images sampled from real datasets. We design a Vector-Quantization module for better pose-identity disentanglement, together with a novel joint-training scheme that merges a GAN and an Auto-Encoder and enables several self-supervision tasks that help the model better separate the two attributes.
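A generic vector-quantization layer in the spirit of VQ-VAE is sketched below: snapping pose features to a small discrete codebook is one plausible way such a module discards fine style detail. The codebook size, feature shape, and loss weighting here are assumptions rather than PIVQGAN's exact design.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    # Generic VQ layer (VQ-VAE style); sizes are illustrative assumptions.
    def __init__(self, num_codes=64, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta

    def forward(self, z):                          # z: (batch, code_dim)
        d = torch.cdist(z, self.codebook.weight)   # distance to every code
        q = self.codebook(d.argmin(dim=1))         # nearest codebook entry
        # Commitment + codebook losses, straight-through gradient to z.
        loss = self.beta * ((q.detach() - z) ** 2).mean() \
             + ((q - z.detach()) ** 2).mean()
        q = z + (q - z).detach()
        return q, loss
```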
Lastly, we study two applications that take advantage of a better-disentangled GAN with mutual information learning. Focusing on text-to-image generation, we propose Text-and-Image Mutual-Translation Adversarial Networks (TIME), a lightweight yet effective model that jointly learns a T2I generator and an image-captioning discriminator within a single GAN framework by maximizing the mutual information between the latent representations of image and text. Focusing on sketch-to-image generation, we study the exemplar-based sketch-to-image synthesis task in a self-supervised manner, eliminating the need for paired sketch data through a better disentanglement between content information from the sketch and style information from an exemplar RGB image.
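As one common way to maximize such cross-modal mutual information, the sketch below uses an InfoNCE-style contrastive bound over paired image and text features, where matched pairs sit on the diagonal of the similarity matrix; TIME's actual objective may differ, and the temperature and feature shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def mutual_info_nce(img_feat, txt_feat, temperature=0.07):
    # InfoNCE-style lower bound on the mutual information between paired
    # image and text features; the i-th image matches the i-th caption.
    img = F.normalize(img_feat, dim=1)
    txt = F.normalize(txt_feat, dim=1)
    logits = img @ txt.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric loss: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```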