Local features play an important role in many computer vision problems: they are highly discriminative and possess invariance properties. However, the spatial configuration of local features is also essential for recognition; spatial neighborhoods capture local geometry and collectively convey the shape of an object. In this dissertation we study explicit and implicit ways to exploit the joint feature-spatial arrangement in images for recognition problems. We introduce a framework that learns an embedded representation of images capturing both the similarity between features and their spatial arrangement. The framework is successfully applied to object recognition and localization, to feature matching across multiple images, and to regression from local features for viewpoint estimation. We also study implicit ways to exploit the feature-spatial manifold structure of the data, without explicit embedding, within a transductive learning paradigm for object localization: we learn the labels of local features from an object class in a manner that provides both spatial and feature smoothing over the labels. To achieve this, we adapt the Global and Local Consistency solution for label propagation to our implicit manifold model to infer the labels of local features. The approach achieves high accuracy with very low false positive rates on the inferred feature labels in test images.
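To make the label-propagation step concrete, the following is a minimal sketch of the Global and Local Consistency closed-form solution that the dissertation adapts. The affinity matrix `W`, the seed label matrix `Y`, and the smoothing weight `alpha` below are illustrative assumptions, not the dissertation's actual feature-spatial affinities; in the dissertation, `W` would instead encode the joint feature-spatial similarity between local features.

```python
import numpy as np

def propagate_labels(W, Y, alpha=0.9):
    """Closed-form Global and Local Consistency label propagation:
    F* = (1 - alpha) * (I - alpha * S)^{-1} Y, where S is the
    symmetrically normalized affinity matrix D^{-1/2} W D^{-1/2}.
    W: (n, n) symmetric nonnegative affinities between local features.
    Y: (n, c) seed labels, one-hot rows for labeled points, zeros otherwise.
    Returns the predicted class index for every point."""
    d = W.sum(axis=1)                       # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # assumes no isolated nodes
    S = D_inv_sqrt @ W @ D_inv_sqrt         # normalized affinity
    n = W.shape[0]
    # Solve (I - alpha * S) F = (1 - alpha) * Y instead of inverting.
    F = np.linalg.solve(np.eye(n) - alpha * S, (1 - alpha) * Y)
    return F.argmax(axis=1)

# Toy example: two tight pairs of points with weak cross-links,
# one labeled seed per pair (hypothetical data, for illustration only).
W = np.array([[0.0, 1.0, 0.01, 0.0],
              [1.0, 0.0, 0.0, 0.01],
              [0.01, 0.0, 0.0, 1.0],
              [0.0, 0.01, 1.0, 0.0]])
Y = np.array([[1.0, 0.0],   # point 0 seeded as class 0
              [0.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])  # point 3 seeded as class 1
labels = propagate_labels(W, Y)
print(labels.tolist())  # → [0, 0, 1, 1]
```

The labels diffuse along strong edges, so each unlabeled point inherits the class of the seed it is most strongly connected to; this is the spatial-and-feature smoothing effect described above.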