Abstract
(type = abstract)
Understanding complex scenes is a fundamental requirement for any artificial agent that aims to operate under real-world conditions. Examples include autonomous vehicles that must navigate crowds and busy streets, and surveillance systems that must track and recognize individuals in crowds. In this work, we explore the challenges of applying deep learning methods to complex scenes while using minimal amounts of expensive, manually collected annotations. The goal of this research is to maximize the range of scene complexities our methods can handle, specifically for the tasks of semantic segmentation and counting, while minimizing annotation requirements, prior constraints, and distributional assumptions. This thesis presents the gradual removal of these forms of supervision: from weak supervision, prior constraints, and data distributional assumptions to no supervision, no prior constraints, and no data distributional assumptions. In subsequent chapters we define the types of supervision, annotation costs, scene complexities, and deep learning tasks we address. In particular, we develop the following methods:

1. Triple-S Network: This work presents a deep learning method for the simultaneous segmentation and counting of cranberries to aid yield estimation and sun-exposure prediction. Notably, supervision uses only low-cost center-point annotations. The approach, named Triple-S Network, incorporates a three-part loss with shape priors to promote better fitting to objects of known shape, which are typical in agricultural scenes. Our results improve overall segmentation performance by more than 6.74% and counting results by 22.91% compared to state-of-the-art methods. To train and evaluate the network, we collected the CRanberry Aerial Imagery Dataset (CRAID), the largest dataset of aerial drone imagery of cranberry fields.

2. Pseudo-Masks from Points: This work presents an approach that generalizes to a wide range of dataset complexities and is trainable from scratch, without any dependence on pre-trained backbones, classification, or separate refinement tasks. We use point annotations to generate reliable, on-the-fly pseudo-masks from refined and spatially filtered features; a minimal sketch of this idea appears after item 5. While our method requires point annotations that are only slightly more expensive than image-level annotations, we demonstrate state-of-the-art performance on benchmark datasets (PascalVOC 2012) and significantly outperform other state-of-the-art weakly supervised semantic segmentation methods on recent real-world datasets (CRAID, CityPersons, IAD, ADE20K, CityScapes), with performance boosts of up to 28.1% and 22.6% over our single-stage and multi-stage baselines, respectively.

3. H2O-Network: This work presents a self-supervised deep learning method that segments floods in satellite and aerial imagery by bridging the domain gap between low- and high-latency satellite imagery and by refining labels from coarse to fine. H2O-Net learns to synthesize signals highly correlated with water presence as a domain adaptation step for semantic segmentation in high-resolution satellite imagery. Our work also proposes a self-supervision mechanism, requiring no hand annotations, that generates high-quality ground truth data during training. We demonstrate that H2O-Net outperforms state-of-the-art semantic segmentation methods on satellite imagery by 10% in pixel accuracy and 12% in mIoU on the task of flood segmentation. We emphasize the generalizability of our model by transferring weights trained on satellite imagery to drone imagery, a very different sensor and domain.
4. Material and Texture Representation Learning (MATTER): In this work, we present our material- and texture-based self-supervision method, MATTER (MATerial and TExture Representation learning), which is inspired by classical material and texture methods. Material and texture can effectively describe any surface, including its tactile properties, color, and specularity. By extension, an effective representation of material and texture can describe the semantic classes strongly associated with them. MATTER leverages multi-temporal, spatially aligned remote sensing imagery over unchanged regions to learn invariance to illumination and viewing angle, as a mechanism for achieving consistent material and texture representations; a sketch of this consistency objective also appears after item 5. We show that our self-supervised pre-training method allows for up to 24.22% and 6.33% performance increases in unsupervised and fine-tuned setups, respectively, and up to 76% faster convergence on change detection, land cover classification, and semantic segmentation tasks.

5. Self-Supervised Object Detection from Egocentric Videos (DEVI): This work addresses the problem of self-supervised, class-agnostic object detection, which aims to locate all objects in a given view, regardless of category, without any annotations or pre-trained weights. Egocentric videos exhibit high scene complexity and irregular motion flows compared to typical video-understanding tasks. Our method, self-supervised object Detection from Egocentric VIdeos (DEVI), generalizes appearance-based methods to learn features that are category-specific and invariant to viewing angle and illumination conditions in highly ambiguous environments, in an end-to-end manner. Our approach leverages typical human behavior and its egocentric perception to sample diverse views of the same objects for our multi-view and scale-regression loss functions. With our learned cluster-residual module, we are able to effectively describe multi-category patches for better complex-scene understanding. DEVI provides a boost in performance on recent egocentric datasets, with gains of up to 4.11% AP50, 0.11% AR1, 1.32% AR10, and 5.03% AR100, while significantly reducing model complexity. We also demonstrate competitive performance on out-of-domain datasets without additional training or fine-tuning.
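To make the point supervision of item 2 concrete, below is a minimal, illustrative sketch, not the thesis implementation: pixels whose backbone features are similar to the feature at an annotated center point are collected into an on-the-fly pseudo-mask. The function name, the cosine-similarity rule, and the threshold are assumptions chosen for illustration.

    import torch
    import torch.nn.functional as F

    def pseudo_mask_from_points(features, points, sim_threshold=0.8):
        """features: (C, H, W) backbone feature map; points: [(row, col), ...]
        center-point annotations. Returns an (H, W) boolean pseudo-mask."""
        feats = F.normalize(features, dim=0)              # unit-norm feature per pixel
        _, H, W = feats.shape
        mask = torch.zeros(H, W, dtype=torch.bool)
        for r, c in points:
            seed = feats[:, r, c]                         # feature at the annotated point
            sim = torch.einsum("chw,c->hw", feats, seed)  # cosine similarity to the seed
            mask |= sim > sim_threshold                   # grow the mask around the point
        return mask

The resulting mask can then supervise a standard per-pixel objective, e.g. F.binary_cross_entropy_with_logits(logits, mask.float()), in place of expensive dense annotations.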
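Likewise, the multi-temporal consistency idea behind MATTER (item 4) can be sketched as a per-pixel invariance objective: two co-registered acquisitions of the same unchanged region, captured under different illumination and viewing conditions, should yield the same descriptor. The encoder interface and names below are assumptions, not the published implementation.

    import torch
    import torch.nn.functional as F

    def temporal_consistency_loss(encoder, img_t0, img_t1):
        """img_t0, img_t1: (B, 3, H, W) spatially aligned acquisitions of the
        same unchanged region at two times; encoder returns (B, D, H, W)."""
        z0 = F.normalize(encoder(img_t0), dim=1)  # per-pixel material/texture descriptors
        z1 = F.normalize(encoder(img_t1), dim=1)
        cos = (z0 * z1).sum(dim=1)                # per-pixel cosine similarity
        return (1.0 - cos).mean()                 # invariance drives similarity toward 1

Minimizing such a loss over many unchanged regions pushes the encoder toward descriptors that depend on material and texture rather than on acquisition conditions.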