Description
Title: Theory and methods for stochastic, accelerated, and distributed optimization
Date Created: 2022
Other Date: 2022-05 (degree)
Extent: 353 pages : illustrations
Description: This thesis consists of two parts. Part I (Chapters 1-3) concerns momentum-based first-order algorithms for stochastic optimization, where we have access only to stochastic (noisy) estimates of the gradient of the objective. This setting arises frequently in key problems in supervised learning, such as risk minimization for classification or regression, or saddle-point problems for distributionally robust learning. When gradients are deterministic and noise-free, it is well known that momentum-based optimization algorithms such as Nesterov’s accelerated gradient (AG) method or Polyak’s heavy-ball (HB) method have improved convergence rates compared to gradient descent. However, in the presence of persistent stochastic gradient errors, momentum-based algorithms amplify the noise in the gradients and are less robust to gradient errors unless the stepsize and momentum parameters are carefully tuned to the problem at hand. This motivates studying the distribution of the iterates of momentum algorithms as a function of the stepsize and momentum parameters, a setting in which principled strategies are lacking for ensuring the existence of a stationary distribution or for controlling the probability that the suboptimality exceeds a given threshold. In particular, existing results for momentum methods provide only limited guarantees on the expected suboptimality and do not typically characterize deviations from it. In Chapter 1, we show that many momentum algorithms, such as AG, HB, and their variants for constrained strongly convex optimization, converge to their equilibrium at the accelerated rate under suitable conditions on the parameters and the noise structure. These results shed further light on the effect of the parameters and on how much noise momentum algorithms can tolerate before diverging. 
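As a concrete illustration of the iteration studied here (a generic sketch, not the thesis's analysis; the quadratic objective, stepsize, and momentum values are illustrative assumptions), a heavy-ball update with additive gradient noise can be written as:

```python
import random

def heavy_ball(grad, x0, stepsize, momentum, noise_std, n_iters, seed=0):
    """Polyak heavy-ball iteration with noisy gradient estimates:
    x_{k+1} = x_k - alpha * (grad(x_k) + xi_k) + beta * (x_k - x_{k-1})."""
    rng = random.Random(seed)
    x_prev = x = x0
    for _ in range(n_iters):
        g = grad(x) + rng.gauss(0.0, noise_std)  # stochastic gradient estimate
        x, x_prev = x - stepsize * g + momentum * (x - x_prev), x
    return x

# Minimize f(x) = 0.5 * x**2 (so grad(x) = x). With exact gradients the
# iterates converge to the minimizer 0; with persistent noise (noise_std > 0)
# they instead fluctuate around it, which is the robustness issue above.
x_exact = heavy_ball(lambda x: x, x0=5.0, stepsize=0.1, momentum=0.5,
                     noise_std=0.0, n_iters=300)
```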
In Chapter 2, we consider the general class of momentum methods (GMM) subject to stochastic gradient noise, which includes AG and HB as special cases. Under a light-tail assumption on the noise structure, for strongly convex optimization, we characterize the entropic risk of the suboptimality over the iterations of GMM, which provides information about deviations from the expected suboptimality. We also obtain finite-time performance bounds on the probability that the suboptimality exceeds a given threshold, as a function of the parameters. Our framework allows us to systematically trade off the entropic risk associated with the suboptimality against the convergence rate to stationarity, and to optimize the parameters to achieve both speed and robustness to noise, where robustness is measured in terms of entropic risk, a coherent risk measure that captures deviations from the expected suboptimality in our setting. In Chapter 3, we develop an efficient variance reduction technique based on Romberg extrapolation for the recently developed stochastic accelerated primal-dual (SAPD) method, a momentum-based stochastic optimization algorithm. We show that our proposed algorithm improves upon SAPD both theoretically and practically, and we illustrate our results on a distributionally robust learning problem. Part I of the thesis applies to risk minimization and empirical risk minimization problems, but assumes that the associated dataset is available at a centralized location. In the era of big data, where data can be collected and stored by many devices and sensors simultaneously, there are also many settings in which the data is not centralized but is instead distributed (decentralized) over a network of computing nodes. 
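The extrapolation idea behind Chapter 3 can be illustrated generically: Romberg (Richardson) extrapolation combines two estimates computed at different discretization levels so that the leading-order error term cancels. The sketch below uses trapezoidal integration rather than SAPD iterates, purely as a stand-in for any estimator whose bias scales as O(h^2):

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals; error is O(h^2)."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return h * s

def romberg_step(coarse, fine):
    """Combine estimates at widths h and h/2 to cancel the O(h^2) term."""
    return (4.0 * fine - coarse) / 3.0

# Integrate sin on [0, pi]; the exact value is 2. The extrapolated estimate
# is far more accurate than either of the two raw estimates it combines.
coarse = trapezoid(math.sin, 0.0, math.pi, 8)
fine = trapezoid(math.sin, 0.0, math.pi, 16)
extrapolated = romberg_step(coarse, fine)
```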
Part II (Chapters 4-6) concerns this distributed data setting and focuses on the development of efficient distributed optimization algorithms in which predictive models are trained locally at each agent and only local parameter vectors are shared over a communication network. Here, developing communication-efficient algorithms is key to performance, as communication between the nodes is typically much more expensive than local computation at each node. In Chapter 4, we develop a new strategy for distributed averaging over a network, an operation that is a key building block for many distributed optimization algorithms. In particular, we propose a new randomized gossiping algorithm for distributed averaging that allows the nodes to identify “bottleneck edges” for communication in a distributed manner. By passing information through these bottleneck edges more frequently, we show that nodes can increase their communication efficiency on a variety of network structures. We show, theoretically and numerically, that our proposed algorithm not only improves the performance of randomized gossiping but also improves the convergence rate of distributed optimization algorithms such as EXTRA and DPGA-W on a broad family of network structures. In Chapter 5, we introduce a framework called the inexact decentralized accelerated augmented Lagrangian method (IDEAL) for designing primal methods in the decentralized optimization setting where the local functions are smooth and strongly convex. When coupled with accelerated gradient descent, this framework yields a novel primal algorithm whose convergence rate is optimal, matching recently derived lower bounds. We provide experimental results demonstrating the effectiveness of the proposed algorithm on highly ill-conditioned problems. Lastly, Chapter 6 proposes a distributed algorithm, called L-DQN, for solving empirical risk minimization problems under the primary/secondary communication model. 
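The randomized gossip averaging primitive that Chapter 4 builds on can be sketched as follows (a minimal version with uniform edge activation; the chapter's contribution, activating bottleneck edges non-uniformly, is not implemented here):

```python
import random

def randomized_gossip(values, edges, n_rounds, seed=0):
    """Each round, activate one edge (i, j) at random and replace both
    endpoints' values with their pairwise average. The network-wide average
    is preserved and the node values converge toward consensus."""
    rng = random.Random(seed)
    x = list(values)
    for _ in range(n_rounds):
        i, j = rng.choice(edges)
        x[i] = x[j] = 0.5 * (x[i] + x[j])
    return x

# Path graph on 4 nodes; all nodes converge toward the average, 2.5.
consensus = randomized_gossip([1.0, 2.0, 3.0, 4.0],
                              edges=[(0, 1), (1, 2), (2, 3)], n_rounds=400)
```

On a path graph the middle edge is the natural bottleneck; activating it more often than the uniform choice above is the kind of adjustment the chapter analyzes.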
L-DQN is a distributed limited-memory quasi-Newton method that supports asynchronous computations among the worker nodes. To our knowledge, it is the first distributed quasi-Newton method with provable global linear convergence guarantees in the asynchronous setting, where delays between nodes are present. Numerical experiments illustrate the theory and the practical performance of our method.
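For context on the underlying quasi-Newton machinery, here is a generic serial limited-memory BFGS two-loop recursion (not L-DQN's distributed asynchronous update; the quadratic test problem is an illustrative assumption):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lbfgs_direction(grad, pairs):
    """Two-loop recursion: approximate the Newton direction -H^{-1} grad from
    stored curvature pairs (s_i, y_i), where s_i = x_{i+1} - x_i and
    y_i = grad_{i+1} - grad_i, ordered oldest first."""
    q = list(grad)
    rhos = [1.0 / dot(y, s) for s, y in pairs]
    alphas = []
    for (s, y), rho in zip(reversed(pairs), reversed(rhos)):  # newest first
        a = rho * dot(s, q)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    s_last, y_last = pairs[-1]
    gamma = dot(s_last, y_last) / dot(y_last, y_last)  # initial Hessian scale
    r = [gamma * qi for qi in q]
    for (s, y), rho, a in zip(pairs, rhos, reversed(alphas)):  # oldest first
        beta = rho * dot(y, r)
        r = [ri + si * (a - beta) for ri, si in zip(r, s)]
    return [-ri for ri in r]

# Quadratic f(x) = x1**2 + 2.5 * x2**2, Hessian diag(2, 5). With exact
# curvature pairs along the axes, the recursion recovers the exact Newton
# direction -H^{-1} grad = (-1, -1) at gradient (2, 5).
pairs = [([1.0, 0.0], [2.0, 0.0]), ([0.0, 1.0], [0.0, 5.0])]
direction = lbfgs_direction([2.0, 5.0], pairs)
```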
Note: Ph.D.
Note: Includes bibliographical references
Genre: theses
Language: English
Collection: Graduate School - Newark Electronic Theses and Dissertations
Organization Name: Rutgers, The State University of New Jersey
Rights: The author owns the copyright to this work.