Humans excel at identifying and locating multiple instances of objects or persons in a scene, despite large variations in lighting, pose, and scale. To achieve this level of robustness, humans naturally use high-level semantic information. The notion of semantics is hierarchical in nature; for example, a house consists of walls, a floor, and a ceiling. When certain entities in this hierarchy can be modeled using a functional form known a priori, we refer to the data as structured; when the functional form is unknown, the data is unstructured. In this thesis, we address the problem of discovering multiple instances of models in structured and unstructured data.

The first part of this thesis deals with structured data. We identify planar regions in an indoor scene using a single depth image from a Microsoft Kinect sensor. The clutter in indoor scenes, the depth-dependent measurement noise, and the unknown number of planar regions pose serious challenges for model discovery. We propose a scalable bottom-up approach that leverages a heteroscedastic, i.e., point-dependent, model of the measurement noise.

The second part of the thesis addresses multiple model discovery in unstructured data in a semi-supervised setup. We develop a framework for mean shift clustering in kernel spaces with a few user-specified pairwise constraints. A linear transformation of the initial kernel space is learned by the constrained minimization of a Bregman divergence based objective function. We automatically determine the adaptive bandwidth parameter to be used with mean shift clustering. Finally, we compare the performance with state-of-the-art semi-supervised clustering methods and show that kernel mean shift clustering performs particularly well when the number of clusters is large. We also propose several directions for future research.
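To make the adaptive-bandwidth mean shift idea concrete, the following is a minimal, input-space sketch: Gaussian mean shift where each data point carries its own bandwidth, chosen here as the distance to its k-th nearest neighbor. This is an illustration only; the thesis runs mean shift in a learned kernel space, and the k-nearest-neighbor bandwidth rule and all names below are our own assumptions, not the thesis's exact procedure.

```python
import math

def knn_bandwidths(points, k):
    # Adaptive, per-point bandwidth: distance to the k-th nearest neighbor
    # (index k, since each point's distance to itself is 0).
    bws = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        bws.append(max(dists[k], 1e-9))
    return bws

def mean_shift(points, k=2, iters=100, tol=1e-5):
    """Gaussian mean shift with a sample-point (adaptive) bandwidth."""
    bws = knn_bandwidths(points, k)
    modes = []
    for p in points:
        m = list(p)
        for _ in range(iters):
            # Weighted mean of all points, each under its own bandwidth.
            num = [0.0] * len(m)
            den = 0.0
            for q, h in zip(points, bws):
                w = math.exp(-math.dist(m, q) ** 2 / (2.0 * h * h))
                den += w
                for d in range(len(m)):
                    num[d] += w * q[d]
            new = [s / den for s in num]
            shift = math.dist(new, m)
            m = new
            if shift < tol:
                break
        modes.append(m)
    return modes
```

Points whose trajectories converge to the same mode are assigned to the same cluster, so the number of clusters need not be specified in advance.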
Using the planar regions detected from the first Kinect frame, a sequence of RGB-D images can be rapidly processed to dynamically generate a consistent 3D model of the scene. We also show that, for the kernel learning problem, ideas from group theory and semi-definite programming can be used to devise a more efficient algorithm that uses only linearly independent constraints.
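To illustrate how a heteroscedastic noise model enters planar-region estimation, here is a minimal weighted least-squares plane fit in which each depth point is weighted by the inverse variance of a depth-dependent noise model. The quadratic noise coefficients are illustrative assumptions (depth noise on the Kinect is commonly modeled as growing quadratically with distance), and the thesis's actual method is a bottom-up multi-plane discovery algorithm, not a single global fit.

```python
def kinect_sigma(z):
    # Point-dependent (heteroscedastic) depth noise in meters; the
    # quadratic growth with depth mimics common Kinect noise models,
    # but these coefficients are illustrative, not calibrated values.
    return 0.0012 + 0.0019 * (z - 0.4) ** 2

def solve3(M, v):
    # Gaussian elimination with partial pivoting for a 3x3 system M x = v.
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, 3):
            f = A[r][c] / A[c][c]
            for k in range(c, 4):
                A[r][k] -= f * A[c][k]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (A[r][3] - sum(A[r][k] * x[k] for k in range(r + 1, 3))) / A[r][r]
    return x

def weighted_plane_fit(points):
    """Fit z = a*x + b*y + c, weighting each point by 1 / sigma(z)^2."""
    # Accumulate the weighted normal equations A^T W A p = A^T W z.
    M = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0] * 3
    for x, y, z in points:
        w = 1.0 / kinect_sigma(z) ** 2
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                M[i][j] += w * row[i] * row[j]
            rhs[i] += w * row[i] * z
    return solve3(M, rhs)
```

Down-weighting distant, noisier measurements in this way keeps far-range points from corrupting the estimated plane parameters, which is the role the heteroscedastic noise model plays in the detection pipeline.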