Open accessPosted Content

# ZeroSARAH: Efficient Nonconvex Finite-Sum Optimization with Zero Full Gradient Computation

02 Mar 2021-arXiv: Learning-
Abstract: We propose ZeroSARAH -- a novel variant of the variance-reduced method SARAH (Nguyen et al., 2017) -- for minimizing the average of a large number of nonconvex functions $\frac{1}{n}\sum_{i=1}^{n}f_i(x)$. To the best of our knowledge, in this nonconvex finite-sum regime, all existing variance-reduced methods, including SARAH, SVRG, SAGA and their variants, need to compute the full gradient over all $n$ data samples at the initial point $x^0$, and then periodically compute the full gradient once every few iterations (for SVRG, SARAH and their variants). Note that SVRG, SAGA and their variants typically achieve weaker convergence results than variants of SARAH: $n^{2/3}/\epsilon^2$ vs. $n^{1/2}/\epsilon^2$. Thus we focus on the variant of SARAH. The proposed ZeroSARAH and its distributed variant D-ZeroSARAH are the \emph{first} variance-reduced algorithms which \emph{do not require any full gradient computations}, not even for the initial point. Moreover, for both standard and distributed settings, we show that ZeroSARAH and D-ZeroSARAH obtain new state-of-the-art convergence results, which can improve the previous best-known result (given by e.g., SPIDER, SARAH, and PAGE) in certain regimes. Avoiding any full gradient computations (which are time-consuming steps) is important in many applications as the number of data samples $n$ usually is very large. Especially in the distributed setting, periodic computation of full gradient over all data samples needs to periodically synchronize all clients/devices/machines, which may be impossible or unaffordable. Thus, we expect that ZeroSARAH/D-ZeroSARAH will have a practical impact in distributed and federated learning where full device participation is impractical.

##### Citations
More

7 results found

Open accessPosted Content
25 Aug 2020-arXiv: Learning
Abstract: In this paper, we propose a novel stochastic gradient estimator -- ProbAbilistic Gradient Estimator (PAGE) -- for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p_t$ or reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability $1-p_t$. We give a simple formula for the optimal choice of $p_t$. Moreover, we prove the first tight lower bound $\Omega(n+\frac{\sqrt{n}}{\epsilon^2})$ for nonconvex finite-sum problems, which also leads to a tight lower bound $\Omega(b+\frac{\sqrt{b}}{\epsilon^2})$ for nonconvex online problems, where $b:= \min\{\frac{\sigma^2}{\epsilon^2}, n\}$. Then, we show that PAGE obtains the optimal convergence results $O(n+\frac{\sqrt{n}}{\epsilon^2})$ (finite-sum) and $O(b+\frac{\sqrt{b}}{\epsilon^2})$ (online) matching our lower bounds for both nonconvex finite-sum and online problems. Besides, we also show that for nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition, PAGE can automatically switch to a faster linear convergence rate $O(\cdot\log \frac{1}{\epsilon})$. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch showing that PAGE not only converges much faster than SGD in training but also achieves the higher test accuracy, validating the optimal theoretical results and confirming the practical superiority of PAGE.

13 Citations

Open accessJournal Article
Abstract: We propose a novel accelerated variance-reduced gradient method called ANITA for finite-sum optimization. In this paper, we consider both general convex and strongly convex settings. In the general convex setting, ANITA achieves the convergence result $O\big(n\min\big\{1+\log\frac{1}{\epsilon\sqrt{n}}, \log\sqrt{n}\big\} + \sqrt{\frac{nL}{\epsilon}} \big)$, which improves the previous best result $O\big(n\min\{\log\frac{1}{\epsilon}, \log n\}+\sqrt{\frac{nL}{\epsilon}}\big)$ given by Varag (Lan et al., 2019). In particular, for a very wide range of $\epsilon$, i.e., $\epsilon \in (0,\frac{L}{n\log^2\sqrt{n}}]\cup [\frac{1}{\sqrt{n}},+\infty)$, where $\epsilon$ is the error tolerance $f(x_T)-f^*\leq \epsilon$ and $n$ is the number of data samples, ANITA can achieve the optimal convergence result $O\big(n+\sqrt{\frac{nL}{\epsilon}}\big)$ matching the lower bound $\Omega\big(n+\sqrt{\frac{nL}{\epsilon}}\big)$ provided by Woodworth and Srebro (2016). To the best of our knowledge, ANITA is the \emph{first} accelerated algorithm which can \emph{exactly} achieve this optimal result $O\big(n+\sqrt{\frac{nL}{\epsilon}}\big)$ for general convex finite-sum problems. In the strongly convex setting, we also show that ANITA can achieve the optimal convergence result $O\Big(\big(n+\sqrt{\frac{nL}{\mu}}\big)\log\frac{1}{\epsilon}\Big)$ matching the lower bound $\Omega\Big(\big(n+\sqrt{\frac{nL}{\mu}}\big)\log\frac{1}{\epsilon}\Big)$ provided by Lan and Zhou (2015). Moreover, ANITA enjoys a simpler loopless algorithmic structure unlike previous accelerated algorithms such as Katyusha (Allen-Zhu, 2017) and Varag (Lan et al., 2019) where they use an inconvenient double-loop structure. Finally, the experimental results also show that ANITA converges faster than previous state-of-the-art Varag (Lan et al., 2019), validating our theoretical results and confirming the practical superiority of ANITA.

Topics: Convex function (50%)

6 Citations

Open accessPosted Content
Zhize Li1, Peter Richtárik1Institutions (1)
20 Jul 2021-arXiv: Learning
Abstract: Due to the high communication cost in distributed and federated learning, methods relying on compressed communication are becoming increasingly popular. Besides, the best theoretically and practically performing gradient-type methods invariably rely on some form of acceleration/momentum to reduce the number of communications (faster convergence), e.g., Nesterov's accelerated gradient descent (Nesterov, 2004) and Adam (Kingma and Ba, 2014). In order to combine the benefits of communication compression and convergence acceleration, we propose a \emph{compressed and accelerated} gradient method for distributed optimization, which we call CANITA. Our CANITA achieves the \emph{first accelerated rate} $O\bigg(\sqrt{\Big(1+\sqrt{\frac{\omega^3}{n}}\Big)\frac{L}{\epsilon}} + \omega\big(\frac{1}{\epsilon}\big)^{\frac{1}{3}}\bigg)$, which improves upon the state-of-the-art non-accelerated rate $O\left((1+\frac{\omega}{n})\frac{L}{\epsilon} + \frac{\omega^2+n}{\omega+n}\frac{1}{\epsilon}\right)$ of DIANA (Khaled et al., 2020b) for distributed general convex problems, where $\epsilon$ is the target error, $L$ is the smooth parameter of the objective, $n$ is the number of machines/devices, and $\omega$ is the compression parameter (larger $\omega$ means more compression can be applied, and no compression implies $\omega=0$). Our results show that as long as the number of devices $n$ is large (often true in distributed/federated learning), or the compression $\omega$ is not very high, CANITA achieves the faster convergence rate $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$, i.e., the number of communication rounds is $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$ (vs. $O\big(\frac{L}{\epsilon}\big)$ achieved by previous works). As a result, CANITA enjoys the advantages of both compression (compressed communication in each round) and acceleration (much fewer communication rounds).

Topics: Order (ring theory) (60%)

3 Citations

Open accessPosted Content
Haoyu Zhao1, Zhize Li2, Peter Richtárik2Institutions (2)
10 Aug 2021-arXiv: Learning
Abstract: Federated Averaging (FedAvg, also known as Local-SGD) (McMahan et al., 2017) is a classical federated learning algorithm in which clients run multiple local SGD steps before communicating their update to an orchestrating server. We propose a new federated learning algorithm, FedPAGE, able to further reduce the communication complexity by utilizing the recent optimal PAGE method (Li et al., 2021) instead of plain SGD in FedAvg. We show that FedPAGE uses much fewer communication rounds than previous local methods for both federated convex and nonconvex optimization. Concretely, 1) in the convex setting, the number of communication rounds of FedPAGE is $O(\frac{N^{3/4}}{S\epsilon})$, improving the best-known result $O(\frac{N}{S\epsilon})$ of SCAFFOLD (Karimireddy et al.,2020) by a factor of $N^{1/4}$, where $N$ is the total number of clients (usually is very large in federated learning), $S$ is the sampled subset of clients in each communication round, and $\epsilon$ is the target error; 2) in the nonconvex setting, the number of communication rounds of FedPAGE is $O(\frac{\sqrt{N}+S}{S\epsilon^2})$, improving the best-known result $O(\frac{N^{2/3}}{S^{2/3}\epsilon^2})$ of SCAFFOLD (Karimireddy et al.,2020) by a factor of $N^{1/6}S^{1/3}$, if the sampled clients $S\leq \sqrt{N}$. Note that in both settings, the communication cost for each round is the same for both FedPAGE and SCAFFOLD. As a result, FedPAGE achieves new state-of-the-art results in terms of communication complexity for both federated convex and nonconvex optimization.

2 Citations

Open accessPosted Content
07 Oct 2021-arXiv: Learning
Abstract: First proposed by Seide (2014) as a heuristic, error feedback (EF) is a very popular mechanism for enforcing convergence of distributed gradient-based optimization methods enhanced with communication compression strategies based on the application of contractive compression operators. However, existing theory of EF relies on very strong assumptions (e.g., bounded gradients), and provides pessimistic convergence rates (e.g., while the best known rate for EF in the smooth nonconvex regime, and when full gradients are compressed, is $O(1/T^{2/3})$, the rate of gradient descent in the same regime is $O(1/T)$). Recently, Richt\'{a}rik et al. (2021) proposed a new error feedback mechanism, EF21, based on the construction of a Markov compressor induced by a contractive compressor. EF21 removes the aforementioned theoretical deficiencies of EF and at the same time works better in practice. In this work we propose six practical extensions of EF21, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, proximal setting, momentum and bidirectional compression. Several of these techniques were never analyzed in conjunction with EF before, and in cases where they were (e.g., bidirectional compression), our rates are vastly superior.

1 Citations

##### References
More

29 results found

Journal Article
Chih-Chung Chang1, Chih-Jen Lin1Institutions (1)
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems theoretical convergence multiclass classification probability estimates and parameter selection are discussed in detail.

Topics: , ,  ... read more

37,868 Citations

Journal Article
Yann LeCun1, Yann LeCun2, Yoshua Bengio3, Geoffrey E. Hinton4  +1 moreInstitutions (5)
28 May 2015-Nature
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

Topics: Deep learning (59%), Object detection (53%),  ... read more

33,931 Citations

Open accessBook
14 Jan 2014-
Abstract: It was in the middle of the 1980s, when the seminal paper by Kar- markar opened a new epoch in nonlinear optimization The importance of this paper, containing a new polynomial-time algorithm for linear op- timization problems, was not only in its complexity bound At that time, the most surprising feature of this algorithm was that the theoretical pre- diction of its high efficiency was supported by excellent computational results This unusual fact dramatically changed the style and direc- tions of the research in nonlinear optimization Thereafter it became more and more common that the new methods were provided with a complexity analysis, which was considered a better justification of their efficiency than computational experiments In a new rapidly develop- ing field, which got the name "polynomial-time interior-point methods", such a justification was obligatory Afteralmost fifteen years of intensive research, the main results of this development started to appear in monographs [12, 14, 16, 17, 18, 19] Approximately at that time the author was asked to prepare a new course on nonlinear optimization for graduate students The idea was to create a course which would reflect the new developments in the field Actually, this was a major challenge At the time only the theory of interior-point methods for linear optimization was polished enough to be explained to students The general theory of self-concordant functions had appeared in print only once in the form of research monograph [12]

Topics:

3,359 Citations

Open accessBook
Shai Shalev-Shwartz1, Shai Ben-David2Institutions (2)
01 Jan 2015-
Abstract: Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides an extensive theoretical account of the fundamental ideas underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics of the field, the book covers a wide array of central topics that have not been addressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for an advanced undergraduate or beginning graduate course, the text makes the fundamentals and algorithms of machine learning accessible to students and non-expert readers in statistics, computer science, mathematics, and engineering.

Topics: , ,  ... read more

2,986 Citations

Open accessJournal Article
Abstract: In this paper, we introduce a new stochastic approximation type algorithm, namely, the randomized stochastic gradient (RSG) method, for solving an important class of nonlinear (possibly nonconvex) stochastic programming problems. We establish the complexity of this method for computing an approximate stationary point of a nonlinear programming problem. We also show that this method possesses a nearly optimal rate of convergence if the problem is convex. We discuss a variant of the algorithm which consists of applying a postoptimization phase to evaluate a short list of solutions generated by several independent runs of the RSG method, and we show that such modification allows us to improve significantly the large-deviation properties of the algorithm. These methods are then specialized for solving a class of simulation-based optimization problems in which only stochastic zeroth-order information is available.