Showing papers by "Dan Alistarh published in 2019"

PDF

Open Access

Proceedings Article•DOI•

SparCML: high-performance sparse communication for machine learning

[...]

Cedric Renggli¹, Saleh Ashkboos², Mehdi Aghagolzadeh³, Dan Alistarh², Torsten Hoefler¹ - Show less +1 more•Institutions (3)

ETH Zurich¹, Institute of Science and Technology Austria², Microsoft³

17 Nov 2019

TL;DR: SparCML as discussed by the authors extends MPI to support non-blocking (asynchronous) operations and low-precision data representations, in conjunction with efficient machine learning algorithms which can leverage these primitives.

...read moreread less

Abstract: Applying machine learning techniques to the quickly growing data in science and industry requires highly-scalable algorithms. Large datasets are most commonly processed "data parallel" distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication and thus scalability bottleneck for most machine learning workloads. We observe that frequently, many gradient values are (close to) zero, leading to sparse of sparsifyable communications. To exploit this insight, we analyze, design, and implement a set of communication-efficient protocols for sparse input data, in conjunction with efficient machine learning algorithms which can leverage these primitives. Our communication protocols generalize standard collective operations, by allowing processes to contribute arbitrary sparse input data vectors. Our generic communication library, SparCML1, extends MPI to support additional features, such as non-blocking (asynchronous) operations and low-precision data representations. As such, SparCML and its techniques will form the basis of future highly-scalable machine learning frameworks.

...read moreread less

76 citations

Proceedings Article•DOI•

Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations.

[...]

Shigang Li¹, Tal Ben-Nun¹, Salvatore Di Girolamo¹, Dan Alistarh², Torsten Hoefler¹ - Show less +1 more•Institutions (2)

ETH Zurich¹, Institute of Science and Technology Austria²

12 Aug 2019-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: This paper proposes eager-SGD, which relaxes the global synchronization for decentralized accumulation of gradients, and proposes to use two partial collectives: solo and majority, which theoretically prove the convergence of the algorithms and describe the partial collective in detail.

...read moreread less

Abstract: Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy.

...read moreread less

29 citations

Posted Content•

MLSys: The New Frontier of Machine Learning Systems

[...]

Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric T. Chung, Bill Dally, Jeffrey Dean, Inderjit S. Dhillon, Alexandros G. Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, Furong Huang, Martin Jaggi, Kevin Jamieson, Michael I. Jordan, Gauri Joshi, Rania Khalaf, Jason Knight, Jakub Konečný, Tim Kraska, Arun Kumar, Anastasios Kyrillidis, Aparna Lakshmiratan, Jing Li, Samuel Madden, H. Brendan McMahan, Erik Meijer, Ioannis Mitliagkas, Rajat Monga, Derek G. Murray, Kunle Olukotun, Dimitris S. Papailiopoulos, Gennady Pekhimenko, Theodoros Rekatsinas, Afshin Rostamizadeh, Christopher Ré, Christopher De Sa, Hanie Sedghi, Sercan Sen, Virginia Smith, Alexander J. Smola, Dawn Song, Evan R. Sparks, Ion Stoica, Vivienne Sze, Madeleine Udell, Joaquin Vanschoren, Shivaram Venkataraman, Rashmi Vinayak, Markus Weimer, Andrew Gordon Wilson, Eric P. Xing, Matei Zaharia, Ce Zhang, Ameet Talwalkar - Show less +65 more

29 Mar 2019-arXiv: Learning

TL;DR: It is proposed to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems forML, and ML optimized for metrics beyond predictive accuracy.

...read moreread less

Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.

...read moreread less

23 citations

Proceedings Article•

Distributed Learning over Unreliable Networks

[...]

Chen Yu¹, Hanlin Tang², Cedric Renggli³, Simon Kassing³, Ankit Singla³, Dan Alistarh⁴, Ce Zhang³, Ji Liu¹ - Show less +4 more•Institutions (4)

University of Rochester¹, Intel², ETH Zurich³, Institute of Science and Technology Austria⁴

24 May 2019

TL;DR: A novel theoretical analysis proving that distributedlearning over unreliable network can achieve comparable convergence rate to centralized or distributed learning over reliable networks and that the influence of the packet drop rate diminishes with the growth of the number of distributed parameter servers.

...read moreread less

Abstract: Most of today's distributed machine learning systems assume {\em reliable networks}: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message. At the same time, recent work exhibits the impressive tolerance of machine learning algorithms to errors or noise arising from relaxed communication or synchronization. In this paper, we connect these two trends, and consider the following question: {\em Can we design machine learning systems that are tolerant to network unreliability during training?} With this motivation, we focus on a theoretical problem of independent interest---given a standard distributed parameter server architecture, if every communication between the worker and the server has a non-zero probability $p$ of being dropped, does there exist an algorithm that still converges, and at what speed? The technical contribution of this paper is a novel theoretical analysis proving that distributed learning over unreliable network can achieve comparable convergence rate to centralized or distributed learning over reliable networks. Further, we prove that the influence of the packet drop rate diminishes with the growth of the number of \textcolor{black}{parameter servers}. We map this theoretical result onto a real-world scenario, training deep neural networks over an unreliable network layer, and conduct network simulation to validate the system improvement by allowing the networks to be unreliable.

...read moreread less

21 citations

Posted Content•

Powerset Convolutional Neural Networks

[...]

Chris Wendler¹, Dan Alistarh², Markus Püschel¹•Institutions (2)

ETH Zurich¹, Institute of Science and Technology Austria²

05 Sep 2019-arXiv: Learning

TL;DR: A novel class of convolutional neural networks for set functions, i.e., data indexed with the powerset of a finite set, derived as linear, shift-equivariant functions for various notions of shifts on set functions.

...read moreread less

Abstract: We present a novel class of convolutional neural networks (CNNs) for set functions, i.e., data indexed with the powerset of a finite set. The convolutions are derived as linear, shift-equivariant functions for various notions of shifts on set functions. The framework is fundamentally different from graph convolutions based on the Laplacian, as it provides not one but several basic shifts, one for each element in the ground set. Prototypical experiments with several set function classification tasks on synthetic datasets and on datasets derived from real-world hypergraphs demonstrate the potential of our new powerset CNNs.

...read moreread less

11 citations

Proceedings Article•

Powerset Convolutional Neural Networks

[...]

Chris Wendler¹, Markus Püschel¹, Dan Alistarh²•Institutions (2)

ETH Zurich¹, Institute of Science and Technology Austria²

01 Jan 2019

TL;DR: In this article, a novel class of convolutional neural networks (CNNs) for set functions is presented, which are derived as linear, shift-equivariant functions for various notions of shifts on set functions.

...read moreread less

11 citations

Proceedings Article•DOI•

Why extension-based proofs fail

[...]

Dan Alistarh¹, James Aspnes², Faith Ellen³, Rati Gelashvili³, Leqi Zhu³ - Show less +1 more•Institutions (3)

Institute of Science and Technology Austria¹, Yale University², University of Toronto³

23 Jun 2019

TL;DR: In this article, it was shown that it is impossible to deterministically solve k-set agreement among n > k ≥ k ≥ 2 processes in a wait-free manner.

...read moreread less

Abstract: It is impossible to deterministically solve wait-free consensus in an asynchronous system. The classic proof uses a valency argument, which constructs an infinite execution by repeatedly extending a finite execution. We introduce extension-based proofs, a class of impossibility proofs that are modelled as an interaction between a prover and a protocol and that include valency arguments. Using proofs based on combinatorial topology, it has been shown that it is impossible to deterministically solve k-set agreement among n > k ≥ 2 processes in a wait-free manner. However, it was unknown whether proofs based on simpler techniques were possible. We show that this impossibility result cannot be obtained by an extension-based proof and, hence, extension-based proofs are limited in power.

...read moreread less

10 citations

Proceedings Article•DOI•

Efficiency Guarantees for Parallel Incremental Algorithms under Relaxed Schedulers

[...]

Dan Alistarh¹, Giorgi Nadiradze¹, Nikita Koval¹•Institutions (1)

Institute of Science and Technology Austria¹

17 Jun 2019

TL;DR: In this article, the authors analyze the efficiency guarantees provided by a range of incremental algorithms when parallelized via relaxed schedulers and show that the overheads of relaxation will be outweighed by the improved scalability of the relaxed scheduler.

...read moreread less

Abstract: Several classic problems in graph processing and computational geometry are solved via incremental algorithms, which split computation into a series of small tasks acting on shared state, which gets updated progressively. While the sequential variant of such algorithms usually specifies a fixed (but sometimes random) order in which the tasks should be performed, a standard approach to parallelizing such algorithms is to relax this constraint to allow for out-of-order parallel execution. This is the case for parallel implementations of Dijkstra's single-source shortest-paths (SSSP) algorithm, and for parallel Delaunay mesh triangulation. While many software frameworks parallelize incremental computation in this way, it is still not well understood whether this relaxed ordering approach can still provide any complexity guarantees. In this paper, we address this problem, and analyze the efficiency guarantees provided by a range of incremental algorithms when parallelized via relaxed schedulers. We show that, for algorithms such as Delaunay mesh triangulation and sorting by insertion, schedulers with a maximum relaxation factor of k in terms of the maximum priority inversion allowed will introduce a maximum amount of wasted work of O(log n poly(k)), where n is the number of tasks to be executed. For SSSP, we show that the additional work is O(poly(k), dmax / wmin), where dmax is the maximum distance between two nodes, and wmin is the minimum such distance. In practical settings where n >> k, this suggests that the overheads of relaxation will be outweighed by the improved scalability of the relaxed scheduler. On the negative side, we provide lower bounds showing that certain algorithms will inherently incur a non-trivial amount of wasted work due to scheduler relaxation, even for relatively benign relaxed schedulers.

...read moreread less

7 citations

Posted Content•

Asynchronous Stochastic Subgradient Methods for General Nonsmooth Nonconvex Optimization

[...]

Vyacheslav Kungurtsev, Malcolm Egan, Bapi Chatterjee, Dan Alistarh

25 Sep 2019-arXiv: Optimization and Control

TL;DR: This paper introduces the first convergence analysis covering asynchronous methods in the case of general non-smooth, non-convex objectives and shows their overall successful asymptotic convergence as well as exploring how momentum, synchronization, and partitioning all affect performance.

...read moreread less

Abstract: Asynchronous distributed methods are a popular way to reduce the communication and synchronization costs of large-scale optimization. Yet, for all their success, little is known about their convergence guarantees in the challenging case of general non-smooth, non-convex objectives, beyond cases where closed-form proximal operator solutions are available. This is all the more surprising since these objectives are the ones appearing in the training of deep neural networks. In this paper, we introduce the first convergence analysis covering asynchronous methods in the case of general non-smooth, non-convex objectives. Our analysis applies to stochastic sub-gradient descent methods both with and without block variable partitioning, and both with and without momentum. It is phrased in the context of a general probabilistic model of asynchronous scheduling accurately adapted to modern hardware properties. We validate our analysis experimentally in the context of training deep neural network architectures. We show their overall successful asymptotic convergence as well as exploring how momentum, synchronization, and partitioning all affect performance.

...read moreread less

5 citations

Posted Content•

In Search of the Fastest Concurrent Union-Find Algorithm

[...]

Dan Alistarh¹, Alexander Fedorov², Nikita Koval³•Institutions (3)

Institute of Science and Technology Austria¹, JetBrains², Yandex³

14 Nov 2019-arXiv: Data Structures and Algorithms

TL;DR: This work evaluates and analyzes the performance of several concurrent Union-Find algorithms and optimization strategies across a wide range of platforms and workloads and finds one of the fastest algorithm variants is a sequential one that uses coarse-grained locking with the lock elision optimization to reduce synchronization cost and increase scalability.

...read moreread less

Abstract: Union-Find (or Disjoint-Set Union) is one of the fundamental problems in computer science; it has been well-studied from both theoretical and practical perspectives in the sequential case. Recently, there has been mounting interest in analyzing this problem in the concurrent scenario, and several asymptotically-efficient algorithms have been proposed. Yet, to date, there is very little known about the practical performance of concurrent Union-Find. This work addresses this gap. We evaluate and analyze the performance of several concurrent Union-Find algorithms and optimization strategies across a wide range of platforms (Intel, AMD, and ARM) and workloads (social, random, and road networks, as well as integrations into more complex algorithms). We first observe that, due to the limited computational cost, the number of induced cache misses is the critical determining factor for the performance of existing algorithms. We introduce new techniques to reduce this cost by storing node priorities implicitly and by using plain reads and writes in a way that does not affect the correctness of the algorithms. Finally, we show that Union-Find implementations are an interesting application for Transactional Memory (TM): one of the fastest algorithm variants we discovered is a sequential one that uses coarse-grained locking with the lock elision optimization to reduce synchronization cost and increase scalability.

...read moreread less

4 citations

Posted Content•

PopSGD: Decentralized Stochastic Gradient Descent in the Population Model

[...]

Giorgi Nadiradze, Amirmojtaba Sabour, Aditya Sharma, Ilia Markov, Vitaly Aksenov, Dan Alistarh - Show less +2 more

25 Sep 2019

TL;DR: It is proved that, under standard assumptions, SGD can converge even in this extremely loose, decentralized setting, for both convex and non-convex objectives, and surprisingly, in the former case, the algorithm can achieve linear speedup in the number of nodes.

...read moreread less

Abstract: The population model is a standard way to represent large-scale decentralized distributed systems, in which agents with limited computational power interact in randomly chosen pairs, in order to collectively solve global computational tasks. In contrast with synchronous gossip models, nodes are anonymous, lack a common notion of time, and have no control over their scheduling. In this paper, we examine whether large-scale distributed optimization can be performed in this extremely restrictive setting. We introduce and analyze a natural decentralized variant of stochastic gradient descent (SGD), called PopSGD, in which every node maintains a local parameter, and is able to compute stochastic gradients with respect to this parameter. Every pair-wise node interaction performs a stochastic gradient step at each agent, followed by averaging of the two models. We prove that, under standard assumptions, SGD can converge even in this extremely loose, decentralized setting, for both convex and non-convex objectives. Moreover, surprisingly, in the former case, the algorithm can achieve linear speedup in the number of nodes $n$. Our analysis leverages a new technical connection between decentralized SGD and randomized load-balancing, which enables us to tightly bound the concentration of node parameters. We validate our analysis through experiments, showing that PopSGD can achieve convergence and speedup for large-scale distributed learning tasks in a supercomputing environment.

...read moreread less

Book Chapter•DOI•

Scalable FIFO Channels for Programming via Communicating Sequential Processes

[...]

Nikita Koval¹, Dan Alistarh¹, Roman Elizarov²•Institutions (2)

Institute of Science and Technology Austria¹, JetBrains²

26 Aug 2019

TL;DR: This work focuses on communicating sequential processes (CSP) and actor models, which share data via explicit communication, and the common abstraction for communication between several processes is the channel.

...read moreread less

Abstract: Traditional concurrent programming involves manipulating shared mutable state. Alternatives to this programming style are communicating sequential processes (CSP) and actor models, which share data via explicit communication. These models have been known for almost half a century, and have recently had started to gain significant traction among modern programming languages. The common abstraction for communication between several processes is the channel. Although channels are similar to producer-consumer data structures, they have different semantics and support additional operations, such as the select expression. Despite their growing popularity, most known implementations of channels use lock-based data structures and can be rather inefficient.

...read moreread less

Posted Content•

Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees

[...]

Vyacheslav Kungurtsev¹, Malcolm Egan², Bapi Chatterjee³, Dan Alistarh³•Institutions (3)

Czech Technical University in Prague¹, French Institute for Research in Computer Science and Automation², Institute of Science and Technology Austria³

28 May 2019-arXiv: Optimization and Control

TL;DR: A new implementation strategy for shared-memory based training of deep neural networks, whereby concurrent parameter servers are utilized to train a partitioned but shared model in single- and multi-GPU settings without compromising accuracy is proposed.

...read moreread less

Abstract: Asynchronous distributed algorithms are a popular way to reduce synchronization costs in large-scale optimization, and in particular for neural network training. However, for nonsmooth and nonconvex objectives, few convergence guarantees exist beyond cases where closed-form proximal operator solutions are available. As most popular contemporary deep neural networks lead to nonsmooth and nonconvex objectives, there is now a pressing need for such convergence guarantees. In this paper, we analyze for the first time the convergence of stochastic asynchronous optimization for this general class of objectives. In particular, we focus on stochastic subgradient methods allowing for block variable partitioning, where the shared-memory-based model is asynchronously updated by concurrent processes. To this end, we first introduce a probabilistic model which captures key features of real asynchronous scheduling between concurrent processes; under this model, we establish convergence with probability one to an invariant set for stochastic subgradient methods with momentum. From the practical perspective, one issue with the family of methods we consider is that it is not efficiently supported by machine learning frameworks, as they mostly focus on distributed data-parallel strategies. To address this, we propose a new implementation strategy for shared-memory based training of deep neural networks, whereby concurrent parameter servers are utilized to train a partitioned but shared model in single- and multi-GPU settings. Based on this implementation, we achieve on average 1.2x speed-up in comparison to state-of-the-art training methods for popular image classification tasks without compromising accuracy.

...read moreread less

Proceedings Article•

In Search of the Fastest Concurrent Union-Find Algorithm.

[...]

Dan Alistarh¹, Alexander Fedorov², Nikita Koval¹•Institutions (2)

Institute of Science and Technology Austria¹, JetBrains²

01 Jan 2019

TL;DR: In this paper, the performance of concurrent disjoint-set union-find algorithms is evaluated and analyzed across a wide range of platforms (Intel, AMD, and ARM) and workloads (social, random, and road networks).

...read moreread less

Posted Content•

NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

[...]

Ali Ramezani-Kebrya¹, Fartash Faghri², Ilya Markov³, Vitalii Aksenov³, Dan Alistarh³, Daniel M. Roy² - Show less +2 more•Institutions (3)

École Polytechnique Fédérale de Lausanne¹, University of Toronto², Institute of Science and Technology Austria³

16 Aug 2019-arXiv: Learning

TL;DR: In this article, the authors proposed a new gradient quantization scheme for data-parallel stochastic gradient descent (QSGD), which has both stronger theoretical guarantees than QSGD, and matches and exceeds the empirical performance of the qSGDinf heuristic and of other compression methods.

...read moreread less

Abstract: As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel SGD is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees, however, for practical purposes, the authors proposed a heuristic variant which we call QSGDinf, which demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it has both stronger theoretical guarantees than QSGD, and matches and exceeds the empirical performance of the QSGDinf heuristic and of other compression methods.

...read moreread less

Posted Content•

Performance Prediction for Coarse-Grained Locking.

[...]

Vitaly Aksenov, Dan Alistarh, Petr Kuznetsov

25 Apr 2019-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: This work describes a simple model that can be used to predict the throughput of coarse-grained lock-based algorithms and shows that it works well for CLH lock, and is expected to work for other popular lock designs such as TTAS, MCS, etc.

...read moreread less

Abstract: A standard design pattern found in many concurrent data structures, such as hash tables or ordered containers, is an alternation of parallelizable sections that incur no data conflicts and critical sections that must run sequentially and are protected with locks. A lock can be viewed as a queue that arbitrates the order in which the critical sections are executed, and a natural question is whether we can use stochastic analysis to predict the resulting throughput. As a preliminary evidence to the affirmative, we describe a simple model that can be used to predict the throughput of coarse-grained lock-based algorithms. We show that our model works well for CLH lock, and we expect it to work for other popular lock designs such as TTAS, MCS, etc.

...read moreread less

Proceedings Article•DOI•

Lock-free channels for programming via communicating sequential processes: poster

[...]

Nikita Koval¹, Dan Alistarh¹, Roman Elizarov²•Institutions (2)

Institute of Science and Technology Austria¹, JetBrains²

16 Feb 2019

TL;DR: In this work, this work presents the first efficient lock-free channel algorithm, and compares it against Go and Kotlin baseline implementations.

...read moreread less

Abstract: Traditional concurrent programming involves manipulating shared mutable state. Alternatives to this programming style are communicating sequential processes (CSP) [1] and actor [2] models, which share data via explicit communication. Rendezvous channel is the common abstraction for communication between several processes, where senders and receivers perform a rendezvous handshake as a part of their protocol (senders wait for receivers and vice versa). Additionally to this, channels support the select expression. In this work, we present the first efficient lock-free channel algorithm, and compare it against Go [3] and Kotlin [4] baseline implementations.

...read moreread less

Journal Article•DOI•

Distributed Computing Column 76: Annual Review 2019

[...]

Dan Alistarh¹•Institutions (1)

Institute of Science and Technology Austria¹

04 Dec 2019-Sigact News

TL;DR: Panconesi and Srinivasan as discussed by the authors were the winners of the 2019 Edsger W. Dijkstra Prize in Distributed Computing for their paper "Randomized Distributed Edge Coloring via an Extension of the Cherno Hoe ding Bounds," which appeared in the SIAM Journal on Computing in 1997.

...read moreread less

Abstract: As this is the rst column I am editing, I would like to sincerely thank Jennifer Welch for inviting me as editor, and for being extremely helpful with the transition. I will do my best to maintain the very high standard set for the column by her, and by the previous editors. Following custom, the December issue is devoted to a review of some notable events related to distributed computing which occurred during the year. First, congratulations to Alessandro Panconesi and Aravind Srinivasan for being awarded the 2019 Edsger W. Dijkstra Prize in Distributed Computing for their paper \Randomized Distributed Edge Coloring via an Extension of the Cherno Hoe ding Bounds," which appeared in the SIAM Journal on Computing in 1997. The prize is jointly sponsored by ACM and EATCS, and is given alternately at PODC1 and DISC2; this year it was awarded at DISC.

...read moreread less