Open Access Proceedings Article

Improving the Sample and Communication Complexity for Decentralized Non-Convex Optimization: Joint Gradient Estimation and Tracking

Haoran Sun, Songtao Lu, Mingyi Hong
Vol. 1, pp. 9217-9228
TLDR
This work proposes an algorithm named D-GET (decentralized gradient estimation and tracking), which jointly performs decentralized gradient estimation (which estimates the local gradient using a subset of local samples) and gradient tracking (which tracks the global full gradient using local estimates).
Abstract
Many modern large-scale machine learning problems benefit from decentralized and stochastic optimization. Recent works have shown that utilizing both decentralized computing and local stochastic gradient estimates can outperform state-of-the-art centralized algorithms in applications involving highly non-convex problems, such as training deep neural networks. In this work, we propose a decentralized stochastic algorithm to deal with certain smooth non-convex problems where there are m nodes in the system, and each node has a large number of samples (denoted as n). Unlike the majority of existing decentralized learning algorithms for either stochastic or finite-sum problems, our focus is on simultaneously reducing the total communication rounds among the nodes and the number of local data samples accessed. In particular, we propose an algorithm named D-GET (decentralized gradient estimation and tracking), which jointly performs decentralized gradient estimation (which estimates the local gradient using a subset of local samples) and gradient tracking (which tracks the global full gradient using local estimates). We show that, to achieve a certain ε-stationary solution of the deterministic finite-sum problem, the proposed algorithm achieves an O(mn^{1/2} ε^{-1}) sample complexity and an O(ε^{-1}) communication complexity. These bounds significantly improve upon the best existing bounds of O(mn ε^{-1}) and O(ε^{-1}), respectively. Similarly, for online problems, the proposed method achieves an O(m ε^{-3/2}) sample complexity and an O(ε^{-1}) communication complexity.

Department of ECE, University of Minnesota Twin Cities, Minneapolis, MN, USA; IBM Research AI, IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. Correspondence to: Haoran Sun <sun00111@umn.edu>, Songtao Lu <songtao@ibm.com>, Mingyi Hong <mhong@umn.edu>. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).
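To make the two ingredients in the abstract concrete, below is a minimal numerical sketch of joint gradient estimation and tracking on a toy decentralized least-squares problem. The mixing matrix W, the SARAH-style recursive estimator, the step size alpha, the restart period q, and the minibatch size are all illustrative assumptions; this is a sketch of the idea, not the authors' reference implementation of D-GET.

```python
# Minimal sketch, in the spirit of D-GET, on a toy decentralized least-squares
# problem: each node mixes with its neighbors, estimates its local gradient
# from a subset of local samples, and tracks the global gradient.
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 4, 50, 10                      # nodes, samples per node, dimension
A = rng.normal(size=(m, n, d))           # local features
b = rng.normal(size=(m, n))              # local targets

# Doubly stochastic mixing matrix for a ring of m nodes (illustrative choice).
W = np.zeros((m, m))
for i in range(m):
    W[i, i] = 0.5
    W[i, (i - 1) % m] = 0.25
    W[i, (i + 1) % m] = 0.25

def local_grad(i, x, idx):
    """Gradient of (1/|idx|) * sum_{s in idx} 0.5 * (A[i,s] @ x - b[i,s])^2."""
    r = A[i, idx] @ x - b[i, idx]
    return A[i, idx].T @ r / len(idx)

alpha, q, batch = 0.05, 10, 5            # step size, restart period, minibatch
x = np.zeros((m, d))                     # local iterates
v = np.array([local_grad(i, x[i], np.arange(n)) for i in range(m)])  # local estimates
y = v.copy()                             # tracked global gradient estimates

for t in range(1, 201):
    x_new = W @ x - alpha * y            # mix with neighbors, descend along y
    v_new = np.empty_like(v)
    for i in range(m):
        if t % q == 0:                   # periodic full local gradient
            v_new[i] = local_grad(i, x_new[i], np.arange(n))
        else:                            # recursive (SARAH-style) estimate from a minibatch
            idx = rng.choice(n, size=batch, replace=False)
            v_new[i] = v[i] + local_grad(i, x_new[i], idx) - local_grad(i, x[i], idx)
    y = W @ y + v_new - v                # gradient tracking update
    x, v = x_new, v_new

print("consensus error:", np.linalg.norm(x - x.mean(0)))
print("avg gradient norm:", np.linalg.norm(np.mean(
    [local_grad(i, x.mean(0), np.arange(n)) for i in range(m)], axis=0)))
```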


Citations
Posted Content

BRIDGE: Byzantine-resilient Decentralized Gradient Descent

TL;DR: A Byzantine-resilient decentralized gradient descent (BRIDGE) method for decentralized learning that is more efficient and scalable in high-dimensional settings and deployable in networks with topologies beyond the star topology.
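As a rough illustration of the screening idea behind Byzantine-resilient decentralized methods of this kind, the sketch below aggregates neighbor models with a coordinate-wise trimmed mean before a local gradient step. The trimming rule, the parameter f, and the helper names are assumptions chosen for illustration, not the paper's exact BRIDGE variant.

```python
# Screening sketch: discard the f largest and f smallest values per coordinate
# among the received neighbor models, then take a local gradient step.
import numpy as np

def trimmed_mean(models, f):
    """Coordinate-wise trimmed mean over a (num_models, d) array."""
    s = np.sort(models, axis=0)
    return s[f:len(models) - f].mean(axis=0)

def screened_step(x_i, neighbor_models, local_grad, f, step):
    """One screened update at node i; local_grad maps a model to its gradient."""
    screened = trimmed_mean(np.vstack([neighbor_models, x_i[None]]), f)
    return screened - step * local_grad(x_i)

# Toy usage: 5 honest neighbors near the origin plus 1 outlier model.
rng = np.random.default_rng(1)
neighbors = np.vstack([rng.normal(scale=0.1, size=(5, 3)), [[10.0, 10.0, 10.0]]])
x_new = screened_step(np.zeros(3), neighbors, lambda x: x, f=1, step=0.1)
print(x_new)   # the outlier is trimmed away before averaging
```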
Journal ArticleDOI

A General Framework for Decentralized Optimization With First-Order Methods

TL;DR: In this paper, the authors provide a general framework of decentralized first-order gradient methods that is applicable to directed and undirected communication networks and show that much of the existing work on optimization and consensus can be related explicitly to this framework.
Journal ArticleDOI

Distributed Learning Systems with First-Order Methods

TL;DR: A brief introduction of some distributed learning techniques that have recently been developed, namely lossy communication compression (e.g., quantization and sparsification), asynchronous communication, and decentralized communication are provided.
Posted Content

A fast randomized incremental gradient method for decentralized non-convex optimization

TL;DR: This work shows almost sure and mean-squared convergence to a first-order stationary point and describes regimes of practical significance where GT-SAGA achieves a network-independent convergence rate and outperforms existing approaches.
References
Book

Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers

TL;DR: It is argued that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas.
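For context, here is a compact sketch of global-variable consensus ADMM, the kind of distributed formulation this TL;DR refers to: local copies x_i are driven to agree with a shared variable z. The quadratic local losses, the penalty rho, and the variable names are illustrative choices, not an implementation from the book.

```python
# Consensus ADMM sketch: minimize sum_i 0.5*||A_i x - b_i||^2 subject to x_i = z.
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 4
A = [rng.normal(size=(8, d)) for _ in range(m)]
b = [rng.normal(size=8) for _ in range(m)]

rho = 1.0
x = np.zeros((m, d))
u = np.zeros((m, d))                       # scaled dual variables
z = np.zeros(d)                            # shared consensus variable

for _ in range(100):
    for i in range(m):                     # x-update: closed-form local solve
        x[i] = np.linalg.solve(A[i].T @ A[i] + rho * np.eye(d),
                               A[i].T @ b[i] + rho * (z - u[i]))
    z = (x + u).mean(axis=0)               # z-update: averaging across nodes
    u += x - z                             # dual update

print("consensus residual:", np.linalg.norm(x - z))
```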
Book

Parallel and Distributed Computation: Numerical Methods

TL;DR: This work discusses parallel and distributed architectures, complexity measures, and communication and synchronization issues, and it presents both Jacobi and Gauss-Seidel iterations, which serve as reference algorithms for many of the computational approaches addressed later.
Journal ArticleDOI

Distributed Subgradient Methods for Multi-Agent Optimization

TL;DR: The authors' convergence rate results explicitly characterize the tradeoff between a desired accuracy of the generated approximate optimal solutions and the number of iterations needed to achieve the accuracy.
Posted Content

Federated Learning: Strategies for Improving Communication Efficiency

TL;DR: Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables (e.g., low-rank or a random mask); and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
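A small sketch of the "sketched updates" idea is shown below, under the assumption of a random mask plus a simple sign quantizer with a single scale; the paper's actual pipeline (structured low-rank updates, random rotations, probabilistic quantization) is richer, and the function and parameter names here are hypothetical.

```python
# Lossy update compression sketch: subsample with a random mask, quantize kept
# entries to {-s, +s}, and rescale on decompression to offset the subsampling.
import numpy as np

rng = np.random.default_rng(0)

def compress_update(delta, keep_prob=0.1):
    """Random-mask subsample, then quantize kept entries to a single scale."""
    mask = rng.random(delta.shape) < keep_prob
    kept = delta[mask]
    s = np.abs(kept).mean() if kept.size else 0.0   # one scale per update
    return mask, np.sign(kept), s, keep_prob

def decompress_update(mask, signs, s, keep_prob):
    out = np.zeros(mask.shape)
    out[mask] = signs * s / keep_prob               # rescale for the dropped entries
    return out

delta = rng.normal(size=10_000)                     # a client's model update
rec = decompress_update(*compress_update(delta))
print("relative error:", np.linalg.norm(rec - delta) / np.linalg.norm(delta))  # lossy by design
```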
Journal ArticleDOI

Fast linear iterations for distributed averaging

TL;DR: This work considers the problem of finding a linear iteration that yields distributed averaging consensus over a network, i.e., that asymptotically computes the average of some initial values given at the nodes, and gives several extensions and variations on the basic problem.
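The linear iteration in question is x(k+1) = W x(k). The sketch below uses Metropolis weights on a small connected graph as one simple valid choice of W (the paper studies how to choose W optimally), so each node's value converges to the average of the initial values; the graph and weights are illustrative.

```python
# Distributed averaging sketch: repeated multiplication by a doubly stochastic
# weight matrix W drives every node's value to the network-wide average.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]   # a small connected graph
n = 5
deg = np.zeros(n, dtype=int)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

W = np.zeros((n, n))
for i, j in edges:                          # Metropolis weights on each edge
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))           # self-weights make each row sum to 1

x0 = np.array([3.0, -1.0, 4.0, 0.0, 9.0])   # initial node values
x = x0.copy()
for _ in range(100):
    x = W @ x                               # each node averages with its neighbors
print(x, "target:", x0.mean())
```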