Description

The effectiveness of machine learning (ML) in today's applications depends largely on the quality of the data representation used within the ML algorithms. While the high dimensionality of modern data often calls for lower-dimensional representations to make efficient use of available computational resources, the use of uncorrelated features is also known to enhance the performance of ML algorithms. An effective representation learning solution should therefore target both dimension reduction and uncorrelated feature extraction. Principal Component Analysis (PCA) and linear autoencoders are fundamental data preprocessing tools that are widely used for dimension reduction; when engineered properly, they can also extract uncorrelated features. At the same time, factors such as the ever-increasing volume of data and inherently distributed data generation impede the use of existing centralized solutions for representation learning, which require all data to be available at a single location. This thesis focuses on representation learning in the case of distributed data. Specifically, it tackles the distributed feature learning problem when data samples are scattered across an arbitrarily connected network. PCA, being the most widely used representation learning tool, is the main focus of this thesis. The overall objective is to solve the distributed PCA problem in 1) the batch setting and 2) the streaming setting. The goals include the development of algorithms that are both provably convergent at an optimal rate and efficient in terms of communication between nodes in the network.
Two novel linearly convergent algorithms for distributed PCA are proposed in the batch setting. The first, called the Distributed Sanger's Algorithm (DSA), is a single-time-scale method that converges to a neighborhood of the true eigenvectors of the covariance matrix of data distributed across an arbitrarily connected network. Although this algorithm converges fast, it reaches only a neighborhood of the optimal solution. We propose a second algorithm, called FAST-PCA (Fast and exAct diSTributed PCA), that uses a gradient-tracking technique to converge not only linearly but also exactly to the true solution. Extensive theoretical analysis proving convergence is provided for both algorithms, and numerical results further demonstrate the efficacy of the methods.
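The consensus-plus-local-update structure underlying this family of methods can be sketched as follows. This is a minimal illustration, not the thesis's exact algorithm: the local covariances, ring network, mixing weights, and step size below are all invented for the example, and the local update is the generic Sanger (generalized Hebbian) rule; FAST-PCA would additionally maintain gradient-tracking variables to remove the O(step-size) bias.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_nodes = 8, 2, 4  # data dimension, #components, #network nodes

# Synthetic global covariance, plus perturbed local copies recentred so
# that their average equals the global covariance C.
A = rng.standard_normal((d, d))
C = A @ A.T / d
perts = [rng.standard_normal((d, d)) for _ in range(n_nodes)]
C_local = [C + 0.05 * (P + P.T) for P in perts]
offset = sum(C_local) / n_nodes - C
C_local = [Ci - offset for Ci in C_local]

# Doubly stochastic mixing matrix for a ring network.
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W[i, i] = 0.5
    W[i, (i + 1) % n_nodes] = 0.25
    W[i, (i - 1) % n_nodes] = 0.25

alpha = 0.05  # constant step size (illustrative choice)
X = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(n_nodes)]
for _ in range(3000):
    X = [
        # consensus mixing with neighbors ...
        sum(W[i, j] * X[j] for j in range(n_nodes))
        # ... plus a local Sanger step: C_i X_i - X_i UT(X_i^T C_i X_i)
        + alpha * (C_local[i] @ X[i]
                   - X[i] @ np.triu(X[i].T @ C_local[i] @ X[i]))
        for i in range(n_nodes)
    ]

# Alignment of node 0's first column with the top eigenvector of C;
# close to 1 once the iterates reach the neighborhood of the solution.
evecs = np.linalg.eigh(C)[1]
top = evecs[:, ::-1][:, :r]
cos0 = abs(top[:, 0] @ X[0][:, 0]) / np.linalg.norm(X[0][:, 0])
```

With a constant step size the iterates settle in a neighborhood of the true eigenvectors, mirroring the DSA behavior described above; shrinking alpha shrinks the neighborhood at the cost of slower convergence.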
An important aspect of modern data is its ever-increasing volume, and thus ML algorithms in modern applications should be capable of adapting to new data. This motivates the study of the representation learning problem in the streaming data setting as well. The second part of this thesis tackles the problem of distributed PCA with streaming data. To that end, an algorithm called the Distributed Generalized Oja's Algorithm (DIEGO) is proposed that estimates multiple dominant eigenvectors of the population covariance matrix in the streaming setting. Theoretical analysis proving the convergence of the algorithm is provided. In the streaming setting, the distributed PCA problem is also analysed for a special case of the distributed network, namely the federated learning setup, which has a controller-agent architecture rather than an arbitrary mesh network. Numerical results studying the effect of different parameters demonstrate the efficiency of the algorithm in both distributed and federated settings. As a neural-network-based representation learning model, the algorithms are also implemented for training autoencoders with linear activation units.
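A streaming variant of the same consensus idea can be sketched with a generalized Oja update: each node mixes its iterate with its neighbors', takes a step along the instantaneous gradient computed from one fresh sample, and re-orthonormalizes. Everything here (Gaussian data model, fully connected averaging, step-size schedule, QR orthonormalization) is an illustrative assumption; DIEGO's exact update and analysis may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n_nodes = 6, 2, 3  # dimension, #components, #nodes

# Population covariance with a clear eigen-gap after the r-th eigenvalue.
eigvals = np.array([5.0, 3.0, 1.0, 0.5, 0.3, 0.1])
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
Sigma = Q @ np.diag(eigvals) @ Q.T
L = np.linalg.cholesky(Sigma)  # samples are L @ z with z ~ N(0, I)

W = np.full((n_nodes, n_nodes), 1.0 / n_nodes)  # complete-graph averaging

X = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(n_nodes)]
for t in range(1, 5001):
    alpha = 1.0 / (0.1 * t + 50.0)  # decaying step size (illustrative)
    X_new = []
    for i in range(n_nodes):
        s = L @ rng.standard_normal(d)  # fresh streaming sample at node i
        mix = sum(W[i, j] * X[j] for j in range(n_nodes))  # consensus step
        oja = np.outer(s, s @ X[i])     # instantaneous gradient s s^T X_i
        X_new.append(np.linalg.qr(mix + alpha * oja)[0])  # orthonormalize
    X = X_new

# Residual of node 0's estimate outside the true top-r eigenspace;
# it shrinks as more streaming samples are processed.
P = Q[:, :r]
err = np.linalg.norm(X[0] - P @ (P.T @ X[0]))
```

In the federated setup mentioned above, the mixing step would be replaced by the controller averaging the agents' iterates, which corresponds to the uniform weight matrix used in this sketch.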