Abavisani, Mahdi. Sparse and low-rank representation-based methods for multimodal clustering and recognition. Retrieved from https://doi.org/doi:10.7282/t3-ysvh-8d53
Recent advances in technology have provided massive amounts of complex, high-dimensional, and multimodal data for computer vision and machine learning applications. This thesis uses sparse and low-rank representation-based techniques to introduce several approaches for leveraging the complementary information in multimodal and high-dimensional data for clustering and recognition tasks.

We start with a focus on subspace clustering algorithms. We extend the popular sparse and low-rank subspace clustering methods to multimodal subspace clustering algorithms that can integrate multiple high-dimensional modalities and represent them in low-dimensional joint subspaces. We then use convolutional neural networks (CNNs) to improve the proposed multimodal subspace clustering methods and develop deep multimodal subspace clustering networks. Furthermore, we design a framework for incorporating data augmentation techniques into subspace clustering networks.

In the second part of the thesis, we focus on developing multimodal classification approaches. We begin by introducing deep sparse representation-based classification (DSRC) and extending it to a multimodal version. We then propose novel approaches for two real-world applications with high-dimensional and multimodal data. First, we introduce a method that leverages the knowledge of multiple video streams in dynamic hand gesture recognition tasks and embeds that knowledge in each unimodal network. As a result, we improve the accuracy of the unimodal networks at test time while they continue to run in real time. Our second applied approach is a fusion method for combining the information in the texts and images of social media posts. Both texts and images are high-dimensional data, and in social media posts they can sometimes be uninformative or even misleading.
We present a method that filters out uninformative parts of text-image pairs and leverages their complementary information to detect crisis events in social media posts. Finally, we discuss possible future research directions.
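The self-expressiveness idea underlying the sparse subspace clustering methods mentioned above — each data point is written as a sparse combination of the other points, and the magnitudes of the resulting coefficients define an affinity graph that is then partitioned — can be illustrated with a minimal sketch. This is a generic, pure-NumPy illustration of the classical single-modality formulation, not the thesis's multimodal or deep variants: a simple ISTA loop stands in for an off-the-shelf Lasso solver, and grouping by connected components stands in for the spectral clustering step; all function names here are ours.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_codes(X, lam=0.05, n_iter=500):
    """Self-expressive sparse coding: column i of X is approximated as
    X @ C[:, i] with diag(C) = 0, solved per column by ISTA
    (proximal gradient descent on the Lasso objective)."""
    d, n = X.shape
    C = np.zeros((n, n))
    for i in range(n):
        idx = np.arange(n) != i          # exclude the point itself
        A, y = X[:, idx], X[:, i]
        L = np.linalg.norm(A, 2) ** 2 + 1e-12  # Lipschitz constant of the gradient
        w = np.zeros(idx.sum())
        for _ in range(n_iter):
            w = soft_threshold(w - (A.T @ (A @ w - y)) / L, lam / L)
        C[idx, i] = w
    return C

def components(W, tol=1e-8):
    """Group points by connected components of the affinity graph
    (the full method runs spectral clustering on W instead)."""
    n = W.shape[0]
    labels = -np.ones(n, dtype=int)
    cur = 0
    for s in range(n):
        if labels[s] >= 0:
            continue
        stack = [s]
        while stack:
            u = stack.pop()
            if labels[u] >= 0:
                continue
            labels[u] = cur
            stack.extend(np.nonzero(W[u] > tol)[0].tolist())
        cur += 1
    return labels

# Toy data: six points drawn from two 1-D subspaces of R^3.
X = np.array([[1., 2., 3., 0., 0., 0.],
              [0., 0., 0., 1., 2., 3.],
              [0., 0., 0., 0., 0., 0.]])
C = sparse_codes(X)
W = np.abs(C) + np.abs(C).T              # symmetric affinity matrix
labels = components(W)
```

Because each point can be reconstructed sparsely using only points from its own subspace, the affinity matrix comes out block-diagonal and the two subspaces are recovered as the two graph components. The multimodal extensions in the thesis couple such representations across modalities so that all modalities share the segmentation.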