Description

Learning reliable and interpretable representations is one of the fundamental challenges in machine learning and computer vision. Over the last decade, deep neural networks have achieved remarkable success by learning conditional distributions over the data to solve different tasks. However, the representations learned by deep models do not always carry consistent meaning across variations: many latent factors remain highly entangled. As a result, extensive data annotation and sophisticated training techniques are required, and representations with undesirable characteristics are still produced from time to time. In this work, we are interested in learning disentangled representations that encode distinct aspects of the data separately. The objective is to decouple the latent factors in a representation space so that factorizable structure is obtained and consistent semantics are associated with different variables. The disentanglement can be learned in either a supervised or a self-supervised manner. In particular, we investigate three visual analysis tasks: viewpoint estimation, landmark localization, and large-pose recognition. We show that, by learning disentangled representations, deep models are efficient to train and robust to variations, achieving state-of-the-art performance under challenging conditions.
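As a minimal illustrative sketch (not the models studied in this work), a disentangled representation can be pictured as a latent vector whose coordinates are partitioned into named factors, each intended to carry one consistent semantic (e.g., viewpoint vs. identity). The factor names and dimensions below are hypothetical, chosen only to show the factorizable structure the objective describes.

```python
import numpy as np

def split_latent(z, factor_dims):
    """Partition a latent vector into named, non-overlapping factor blocks.

    z           -- 1-D latent vector produced by some encoder
    factor_dims -- ordered mapping of factor name -> block size;
                   sizes must sum to len(z)
    """
    assert sum(factor_dims.values()) == len(z), "factor sizes must cover z"
    factors, start = {}, 0
    for name, d in factor_dims.items():
        factors[name] = z[start:start + d]  # contiguous slice for this factor
        start += d
    return factors

# Hypothetical 8-D latent code: first 3 dims for viewpoint, last 5 for identity.
z = np.arange(8.0)
factors = split_latent(z, {"viewpoint": 3, "identity": 5})
```

In a disentangled model, edits confined to one block (say, `factors["viewpoint"]`) would ideally change only that aspect of the generated output, which is what "consistent semantics associated with different variables" asks for.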