Showing papers by "Dan Alistarh published in 2018"

PDF

Open Access

Posted Content•

Model compression via distillation and quantization

[...]

Antonio Polino, Razvan Pascanu¹, Dan Alistarh²•Institutions (2)

Google¹, Institute of Science and Technology Austria²

15 Feb 2018-arXiv: Neural and Evolutionary Computing

TL;DR: This paper proposed quantized distillation and differentiable quantization to optimize the location of quantization points through stochastic gradient descent to better fit the behavior of the teacher model, and showed that quantized shallow students can reach similar accuracy levels to full-precision teacher models.

...read moreread less

Abstract: Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning. One aspect of the field receiving considerable attention is efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices. This paper focuses on this problem, and proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks. The first method we propose is called quantized distillation and leverages distillation during the training process, by incorporating distillation loss, expressed with respect to the teacher, into the training of a student network whose weights are quantized to a limited set of levels. The second method, differentiable quantization, optimizes the location of quantization points through stochastic gradient descent, to better fit the behavior of the teacher model. We validate both methods through experiments on convolutional and recurrent architectures. We show that quantized shallow students can reach similar accuracy levels to full-precision teacher models, while providing order of magnitude compression, and inference speedup that is linear in the depth reduction. In sum, our results enable DNNs for resource-constrained environments to leverage architecture and accuracy advances developed on more powerful devices.

...read moreread less

237 citations

Proceedings Article•

The Convergence of Sparsified Gradient Methods

[...]

Dan Alistarh¹, Torsten Hoefler², Mikael Johansson³, Nikola Konstantinov¹, Sarit Khirirat³, Cedric Renggli² - Show less +2 more•Institutions (3)

Institute of Science and Technology Austria¹, ETH Zurich², Royal Institute of Technology³

01 Jan 2018

TL;DR: The authors showed that sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD.

...read moreread less

Abstract: Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, gradient sparsification methods--where each node sorts gradients by magnitude, and only communicates a subset of the components, accumulating the rest locally--are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to \emph{three orders of magnitude}, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification. This is the question we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. Our analysis and empirical validation also reveal that these methods do require analytical conditions to converge well, justifying existing heuristics.

...read moreread less

202 citations

Proceedings Article•

Byzantine Stochastic Gradient Descent

[...]

Dan Alistarh¹, Zeyuan Allen-Zhu², Jerry Li³•Institutions (3)

Institute of Science and Technology Austria¹, Microsoft², Massachusetts Institute of Technology³

01 Jan 2018

TL;DR: In this article, the authors studied the problem of distributed stochastic optimization in an adversarial setting where, out of $m$ machines which compute Stochastic Gradient Descent (SGD) every iteration, an α-fraction are Byzantine, and may behave adversarially.

...read moreread less

Abstract: This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and may behave adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sample complexity and time complexity.

...read moreread less

161 citations

Posted Content•

The Convergence of Sparsified Gradient Methods

[...]

Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, Cedric Renggli - Show less +2 more

27 Sep 2018-arXiv: Learning

TL;DR: This article showed that sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD.

...read moreread less

Abstract: Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, gradient sparsification methods - where each node sorts gradients by magnitude, and only communicates a subset of the components, accumulating the rest locally - are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to three orders of magnitude, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification. This is the question we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. Our analysis and empirical validation also reveal that these methods do require analytical conditions to converge well, justifying existing heuristics.

...read moreread less

124 citations

Proceedings Article•

Model compression via distillation and quantization

[...]

Antonio Polino, Razvan Pascanu¹, Dan Alistarh²•Institutions (2)

Google¹, Institute of Science and Technology Austria²

15 Feb 2018

TL;DR: This paper proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks, and shows that quantized shallow students can reach similar accuracy levels to full-precision teacher models.

...read moreread less

112 citations

Proceedings Article•

Space-optimal majority in population protocols

[...]

Dan Alistarh¹, James Aspnes², Rati Gelashvili³•Institutions (3)

Institute of Science and Technology Austria¹, Yale University², University of Toronto³

07 Jan 2018

TL;DR: In this article, a lower bound of Ω(log n) states for any protocol which stabilizes in O(n 1−c) expected time, for any constant c > 0, was established.

...read moreread less

Abstract: Population protocols are a popular model of distributed computing, in which n agents with limited local state interact randomly, and cooperate to collectively compute global predicates. Inspired by recent developments in DNA programming, an extensive series of papers, across different communities, has examined the computability and complexity characteristics of this model. Majority, or consensus, is a central task in this model, in which agents need to collectively reach a decision as to which one of two states A or B had a higher initial count. Two metrics are important: the time that a protocol requires to stabilize to an output decision, and the state space size that each agent requires to do so. It is known that majority requires Ω(log log n) states per agent to allow for fast (poly-logarithmic time) stabilization, and that O(log2 n) states are sufficient. Thus, there is an exponential gap between the space upper and lower bounds for this problem. This paper addresses this question. On the negative side, we provide a new lower bound of Ω(log n) states for any protocol which stabilizes in O(n1−c) expected time, for any constant c > 0. This result is conditional on monotonicity and output assumptions, satisfied by all known protocols. Technically, it represents a departure from previous lower bounds, in that it does not rely on the existence of dense configurations. Instead, we introduce a new generalized surgery technique to prove the existence of incorrect executions for any algorithm which would contradict the lower bound. Subsequently, our lower bound also applies to general initial configurations, including ones with a leader. On the positive side, we give a new algorithm for majority which uses O(log n) states, and stabilizes in O(log2 n) expected time. Central to the algorithm is a new leaderless phase clock technique, which allows agents to synchronize in phases of Θ(n log n) consecutive interactions using O(log n) states per agent, exploiting a new connection between population protocols and power-of-two-choices load balancing mechanisms. We also employ our phase clock to build a leader election algorithm with a state space of size O(log n), which stabilizes in O(log2 n) expected time.

...read moreread less

87 citations

Posted Content•

Byzantine Stochastic Gradient Descent

[...]

Dan Alistarh¹, Zeyuan Allen-Zhu², Jerry Li³•Institutions (3)

Institute of Science and Technology Austria¹, Microsoft², Massachusetts Institute of Technology³

23 Mar 2018-arXiv: Learning

TL;DR: In this article, the authors studied the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which compute stochastically gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially.

...read moreread less

Abstract: This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sampling complexity and time complexity.

...read moreread less

80 citations

Journal Article•DOI•

Recent Algorithmic Advances in Population Protocols

[...]

Dan Alistarh¹, Rati Gelashvili²•Institutions (2)

Institute of Science and Technology Austria¹, University of Toronto²

24 Oct 2018-Sigact News

TL;DR: This work states that a population protocol consists of n agents with limited local state that interact randomly in pairs, according to an underlying communication graph, and cooperate to collectively compute global predicates.

...read moreread less

Abstract: Population protocols are a popular model of distributed computing, introduced by Angluin, Aspnes, Diamadi, Fischer, and Peralta [6] a little over a decade ago. In the meantime, the model has proved a useful abstraction for modeling various settings, from wireless sensor networks [35, 26], to gene regulatory networks [17], and chemical reaction networks [21]. In a nutshell, a population protocol consists of n agents with limited local state that interact randomly in pairs, according to an underlying communication graph, and cooperate to collectively compute global predicates. From a theoretical prospective, population protocols, with the restricted communication and computational power, are probably one of the simplest distributed model one can imagine. Perhaps surprisingly though, solutions to many classical distributed tasks are still possible. Moreover, these solutions often rely on interesting algorithmic ideas for design and interesting probabilistic techniques for analysis, while known lower bound results revolve around complex combinatorial arguments.

...read moreread less

55 citations

Posted Content•

SparCML: High-Performance Sparse Communication for Machine Learning

[...]

Cedric Renggli¹, Saleh Ashkboos², Mehdi Aghagolzadeh³, Dan Alistarh², Torsten Hoefler¹ - Show less +1 more•Institutions (3)

ETH Zurich¹, Institute of Science and Technology Austria², Microsoft³

22 Feb 2018-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: The generic communication library, SparCML1, extends MPI to support additional features, such as non-blocking (asynchronous) operations and low-precision data representations, and will form the basis of future highly-scalable machine learning frameworks.

...read moreread less

Abstract: Applying machine learning techniques to the quickly growing data in science and industry requires highly-scalable algorithms. Large datasets are most commonly processed "data parallel" distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication and thus scalability bottleneck for most machine learning workloads. We observe that frequently, many gradient values are (close to) zero, leading to sparse of sparsifyable communications. To exploit this insight, we analyze, design, and implement a set of communication-efficient protocols for sparse input data, in conjunction with efficient machine learning algorithms which can leverage these primitives. Our communication protocols generalize standard collective operations, by allowing processes to contribute arbitrary sparse input data vectors. Our generic communication library, SparCML, extends MPI to support additional features, such as non-blocking (asynchronous) operations and low-precision data representations. As such, SparCML and its techniques will form the basis of future highly-scalable machine learning frameworks.

...read moreread less

44 citations

Posted Content•

Distributed Learning over Unreliable Networks

[...]

Chen Yu¹, Hanlin Tang¹, Cedric Renggli², Simon Kassing², Ankit Singla², Dan Alistarh³, Ce Zhang², Ji Liu¹ - Show less +4 more•Institutions (3)

University of Rochester¹, ETH Zurich², Institute of Science and Technology Austria³

17 Oct 2018-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: In this paper, the authors consider the problem of designing machine learning systems that are tolerant to network unreliability during training and show that the influence of packet drop rate diminishes with the number of black parameter servers.

...read moreread less

Abstract: Most of today's distributed machine learning systems assume {\em reliable networks}: whenever two machines exchange information (e.g., gradients or models), the network should guarantee the delivery of the message. At the same time, recent work exhibits the impressive tolerance of machine learning algorithms to errors or noise arising from relaxed communication or synchronization. In this paper, we connect these two trends, and consider the following question: {\em Can we design machine learning systems that are tolerant to network unreliability during training?} With this motivation, we focus on a theoretical problem of independent interest---given a standard distributed parameter server architecture, if every communication between the worker and the server has a non-zero probability $p$ of being dropped, does there exist an algorithm that still converges, and at what speed? The technical contribution of this paper is a novel theoretical analysis proving that distributed learning over unreliable network can achieve comparable convergence rate to centralized or distributed learning over reliable networks. Further, we prove that the influence of the packet drop rate diminishes with the growth of the number of \textcolor{black}{parameter servers}. We map this theoretical result onto a real-world scenario, training deep neural networks over an unreliable network layer, and conduct network simulation to validate the system improvement by allowing the networks to be unreliable.

...read moreread less

35 citations

Proceedings Article•DOI•

The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory

[...]

Dan Alistarh¹, Christopher De Sa², Nikola Konstantinov¹•Institutions (2)

Institute of Science and Technology Austria¹, Cornell University²

23 Jul 2018

TL;DR: In this article, lock-free concurrent stochastic gradient descent (SGD) was shown to converge faster and with a wider range of parameters than previously known under asynchronous iterations, while exhibiting a fundamental trade-off between the maximum delay in the system and the rate at which SGD can converge.

...read moreread less

Abstract: Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks. Given the recent practical focus on distributed machine learning, significant work has been dedicated to the convergence properties of this algorithm under the inconsistent and noisy updates arising from execution in a distributed environment. However, surprisingly, the convergence properties of this classic algorithm in the standard shared-memory model are still not well-understood. In this work, we address this gap, and provide new convergence bounds for lock-free concurrent stochastic gradient descent, executing in the classic asynchronous shared memory model, against a strong adaptive adversary. Our results give improved upper and lower bounds on the "price of asynchrony'' when executing the fundamental SGD algorithm in a concurrent setting. They show that this classic optimization tool can converge faster and with a wider range of parameters than previously known under asynchronous iterations. At the same time, we exhibit a fundamental trade-off between the maximum delay in the system and the rate at which SGD can converge, which governs the set of parameters under which this algorithm can still work efficiently.

...read moreread less

Proceedings Article•DOI•

Distributionally Linearizable Data Structures

[...]

Dan Alistarh¹, Trevor Brown¹, Justin Kopinsky², Jerry Li², Giorgi Nadiradze³ - Show less +1 more•Institutions (3)

Institute of Science and Technology Austria¹, Massachusetts Institute of Technology², ETH Zurich³

11 Jul 2018

TL;DR: This work shows for the first time that, under a set of analytic assumptions, a family of relaxed concurrent data structures, including variants of MultiQueues, but also a new approximate counting algorithm called the MultiCounter, provides strong probabilistic guarantees on the degree of relaxation with respect to the sequential specification, in arbitrary concurrent executions.

...read moreread less

Abstract: Relaxed concurrent data structures have become increasingly popular, due to their scalability in graph processing and machine learning applications (\citeNguyen13, gonzalez2012powergraph ). Despite considerable interest, there exist families of natural, high performing randomized relaxed concurrent data structures, such as the popular MultiQueue~\citeMQ pattern for implementing relaxed priority queue data structures, for which no guarantees are known in the concurrent setting~\citeAKLN17. Our main contribution is in showing for the first time that, under a set of analytic assumptions, a family of relaxed concurrent data structures, including variants of MultiQueues, but also a new approximate counting algorithm we call the MultiCounter, provides strong probabilistic guarantees on the degree of relaxation with respect to the sequential specification, in arbitrary concurrent executions. We formalize these guarantees via a new correctness condition called distributional linearizability, tailored to concurrent implementations with randomized relaxations. Our result is based on a new analysis of an asynchronous variant of the classic power-of-two-choices load balancing algorithm, in which placement choices can be based on inconsistent, outdated information (this result may be of independent interest). We validate our results empirically, showing that the MultiCounter algorithm can implement scalable relaxed timestamps.

...read moreread less

Posted Content•

The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory

[...]

Dan Alistarh¹, Christopher De Sa², Nikola Konstantinov¹•Institutions (2)

Institute of Science and Technology Austria¹, Cornell University²

23 Mar 2018-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: In this article, lock-free concurrent stochastic gradient descent (SGD) was shown to converge faster and with a wider range of parameters than previously known under asynchronous iterations.

...read moreread less

Abstract: Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks. Given the recent practical focus on distributed machine learning, significant work has been dedicated to the convergence properties of this algorithm under the inconsistent and noisy updates arising from execution in a distributed environment. However, surprisingly, the convergence properties of this classic algorithm in the standard shared-memory model are still not well-understood. In this work, we address this gap, and provide new convergence bounds for lock-free concurrent stochastic gradient descent, executing in the classic asynchronous shared memory model, against a strong adaptive adversary. Our results give improved upper and lower bounds on the "price of asynchrony" when executing the fundamental SGD algorithm in a concurrent setting. They show that this classic optimization tool can converge faster and with a wider range of parameters than previously known under asynchronous iterations. At the same time, we exhibit a fundamental trade-off between the maximum delay in the system and the rate at which SGD can converge, which governs the set of parameters under which this algorithm can still work efficiently.

...read moreread less

Journal Article•DOI•

Communication-efficient randomized consensus

[...]

Dan Alistarh¹, James Aspnes², Valerie King³, Jared Saia⁴•Institutions (4)

Microsoft¹, Yale University², University of Victoria³, University of New Mexico⁴

01 Nov 2018-Distributed Computing

TL;DR: Overall, it is shown that it is possible to build an efficient message-passing implementation of a shared coin, and in the process (almost-optimally) solve the classic consensus problem in the asynchronous message-Passing model.

...read moreread less

Abstract: We consider the problem of consensus in the challenging classic model. In this model, the adversary is adaptive; it can choose which processors crash at any point during the course of the algorithm. Further, communication is via asynchronous message passing: there is no known upper bound on the time to send a message from one processor to another, and all messages and coin flips are seen by the adversary.

...read moreread less

Proceedings Article•DOI•

Synchronous Multi-GPU Deep Learning with Low-Precision Communication: An Experimental Study

[...]

Demjan Grubic, Leo Tam¹, Dan Alistarh², Ce Zhang³•Institutions (3)

Yale University¹, Institute of Science and Technology Austria², ETH Zurich³

01 Jan 2018

TL;DR: This paper conducts an empirical study to answer the question: can low-precision communication consistently improve the end-to-end performance of training modern neural networks, with no accuracy loss?

...read moreread less

Abstract: Training deep learning models has received tremendous research interest recently. In particular, there has been intensive research on reducing the communication cost of training when using multiple computational devices, through reducing the precision of the underlying data representation. Naturally, such methods induce system trade-offs—lowering communication precision could decrease communication overheads and improve scalability; but, on the other hand, it can also reduce the accuracy of training. In this paper, we study this trade-off space, and ask: Can low-precision communication consistently improve the end-to-end performance of training modern neural networks, with no accuracy loss? From the performance point of view, the answer to this question may appear deceptively easy: compressing communication through low precision should help when the ratio between communication and computation is high. However, this answer is less straightforward when we try to generalize this principle across various neural network architectures (e.g., AlexNet vs. ResNet), number of GPUs (e.g., 2 vs. 8 GPUs), machine configurations (e.g., EC2 instances vs. NVIDIA DGX-1), communication primitives (e.g., MPI vs. NCCL), and even different GPU architectures (e.g., Kepler vs. Pascal). Currently, it is not clear how a realistic realization of all these factors maps to the speed up provided by low-precision communication. In this paper, we conduct an empirical study to answer this question and report the insights.

...read moreread less

Proceedings Article•DOI•

Relaxed Schedulers Can Efficiently Parallelize Iterative Algorithms

[...]

Dan Alistarh¹, Trevor Brown¹, Justin Kopinsky², Giorgi Nadiradze³•Institutions (3)

Institute of Science and Technology Austria¹, Massachusetts Institute of Technology², ETH Zurich³

23 Jul 2018

TL;DR: This work presents an efficient method to deterministically parallelize iterative sequential algorithms, with provable runtime guarantees in terms of the number of executed tasks to completion.

...read moreread less

Abstract: There has been significant progress in understanding the parallelism inherent to iterative sequential algorithms: for many classic algorithms, the depth of the dependence structure is now well understood, and scheduling techniques have been developed to exploit this shallow dependence structure for efficient parallel implementations. A related, applied research strand has studied methods by which certain iterative task-based algorithms can be efficiently parallelized via relaxed concurrent priority schedulers. These allow for high concurrency when inserting and removing tasks, at the cost of executing superfluous work due to the relaxed semantics of the scheduler. In this work, we take a step towards unifying these two research directions, by showing that there exists a family of relaxed priority schedulers that can efficiently and deterministically execute classic iterative algorithms such as greedy maximal independent set (MIS) and matching. Our primary result shows that, given a randomized scheduler with an expected relaxation factor of k in terms of the maximum allowed priority inversions on a task, and any graph on n vertices, the scheduler is able to execute greedy MIS with only an additive factor of \poly(k) expected additional iterations compared to an exact (but not scalable) scheduler. This counter-intuitive result demonstrates that the overhead of relaxation when computing MIS is not dependent on the input size or structure of the input graph. Experimental results show that this overhead can be clearly offset by the gain in performance due to the highly scalable scheduler. In sum, we present an efficient method to deterministically parallelize iterative sequential algorithms, with provable runtime guarantees in terms of the number of executed tasks to completion.

...read moreread less

Journal Article•DOI•

ThreadScan: Automatic and Scalable Memory Reclamation

[...]

Dan Alistarh¹, William M. Leiserson², Alexander Matveev², Nir Shavit²•Institutions (2)

Institute of Science and Technology Austria¹, Massachusetts Institute of Technology²

01 May 2018

TL;DR: Instead of manually tracking access to memory locations as done in techniques like hazard pointers, or restricting shared accesses to specific epoch boundaries as in RCU, the algorithm, called ThreadScan, leverages operating system signaling to automatically detect which memory locations are being accessed by concurrent threads.

...read moreread less

Abstract: The concurrent memory reclamation problem is that of devising a way for a deallocating thread to verify that no other concurrent threads hold references to a memory block being deallocated. To date, in the absence of automatic garbage collection, there is no satisfactory solution to this problem; existing tracking methods like hazard pointers, reference counters, or epoch-based techniques like RCU are either prohibitively expensive or require significant programming expertise to the extent that implementing them efficiently can be worthy of a publication. None of the existing techniques are automatic or even semi-automated.In this article, we take a new approach to concurrent memory reclamation. Instead of manually tracking access to memory locations as done in techniques like hazard pointers, or restricting shared accesses to specific epoch boundaries as in RCU, our algorithm, called ThreadScan, leverages operating system signaling to automatically detect which memory locations are being accessed by concurrent threads.Initial empirical evidence shows that ThreadScan scales surprisingly well and requires negligible programming effort beyond the standard use of Malloc and Free.

...read moreread less

Proceedings Article•DOI•

Gradient compression for communication-limited convex optimization

[...]

Sarit Khirirat¹, Mikael Johansson¹, Dan Alistarh²•Institutions (2)

Royal Institute of Technology¹, Institute of Science and Technology Austria²

01 Dec 2018

TL;DR: This paper establishes and strengthens the convergence guarantees for gradient descent under a family of gradient compression techniques, and derives admissible step sizes and quantifies both the number of iterations and the numbers of bits that need to be exchanged to reach a target accuracy.

...read moreread less

Abstract: Data-rich applications in machine-learning and control have motivated an intense research on large-scale optimization. Novel algorithms have been proposed and shown to have optimal convergence rates in terms of iteration counts. However, their practical performance is severely degraded by the cost of exchanging high-dimensional gradient vectors between computing nodes. Several gradient compression heuristics have recently been proposed to reduce communications, but few theoretical results exist that quantify how they impact algorithm convergence. This paper establishes and strengthens the convergence guarantees for gradient descent under a family of gradient compression techniques. For convex optimization problems, we derive admissible step sizes and quantify both the number of iterations and the number of bits that need to be exchanged to reach a target accuracy. Finally, we validate the performance of different gradient compression techniques in simulations. The numerical results highlight the properties of different gradient compression algorithms and confirm that fast convergence with limited information exchange is possible.

...read moreread less

Proceedings Article•DOI•

Relaxed Schedulers Can Efficiently Parallelize Iterative Algorithms

[...]

Dan Alistarh¹, Trevor Brown¹, Justin Kopinsky², Giorgi Nadiradze³•Institutions (3)

Institute of Science and Technology Austria¹, Massachusetts Institute of Technology², ETH Zurich³

13 Aug 2018-arXiv: Data Structures and Algorithms

TL;DR: In this article, the authors show that there exists a family of relaxed priority schedulers that can efficiently and deterministically execute classic iterative algorithms such as greedy maximal independent set (MIS) and matching, with provable runtime guarantees in terms of the number of executed tasks to completion.

...read moreread less

Abstract: There has been significant progress in understanding the parallelism inherent to iterative sequential algorithms: for many classic algorithms, the depth of the dependence structure is now well understood, and scheduling techniques have been developed to exploit this shallow dependence structure for efficient parallel implementations. A related, applied research strand has studied methods by which certain iterative task-based algorithms can be efficiently parallelized via relaxed concurrent priority schedulers. These allow for high concurrency when inserting and removing tasks, at the cost of executing superfluous work due to the relaxed semantics of the scheduler. In this work, we take a step towards unifying these two research directions, by showing that there exists a family of relaxed priority schedulers that can efficiently and deterministically execute classic iterative algorithms such as greedy maximal independent set (MIS) and matching. Our primary result shows that, given a randomized scheduler with an expected relaxation factor of $k$ in terms of the maximum allowed priority inversions on a task, and any graph on $n$ vertices, the scheduler is able to execute greedy MIS with only an additive factor of poly($k$) expected additional iterations compared to an exact (but not scalable) scheduler. This counter-intuitive result demonstrates that the overhead of relaxation when computing MIS is not dependent on the input size or structure of the input graph. Experimental results show that this overhead can be clearly offset by the gain in performance due to the highly scalable scheduler. In sum, we present an efficient method to deterministically parallelize iterative sequential algorithms, with provable runtime guarantees in terms of the number of executed tasks to completion.

...read moreread less

Proceedings Article•DOI•

Fast Quantized Arithmetic on x86: Trading Compute for Data Movement

[...]

Alen Stojanov¹, Tyler M. Smith¹, Dan Alistarh², Markus Püschel¹•Institutions (2)

ETH Zurich¹, Institute of Science and Technology Austria²

01 Jan 2018

TL;DR: Clover, a new library for efficient computation using low-precision data, provides mathematical routines required by fundamental methods in optimization and sparse recovery, and supports data formats from 4-bit quantized to 32-bit IEEE-754 on current Intel processors.

...read moreread less

Abstract: We introduce Clover, a new library for efficient computation using low-precision data, providing mathematical routines required by fundamental methods in optimization and sparse recovery. Our library faithfully implements variants of stochastic quantization that guarantee convergence at low precision, and supports data formats from 4-bit quantized to 32-bit IEEE-754 on current Intel processors. In particular, we show that 4-bit can be implemented efficiently using Intel AVX despite the lack of native support for this data format. Experimental results with dot product, matrix-vector multiplication (MVM), gradient descent (GD), and iterative hard thresholding (IHT) demonstrate that the attainable speedups are in many cases close to linear with respect to the reduction of precision due to reduced data movement. Finally, for GD and IHT, we show examples of absolute speedup achieved by 4-bit versus 32-bit, by iterating until a given target error is achieved.

...read moreread less

Proceedings Article•DOI•

A Brief Tutorial on Distributed and Concurrent Machine Learning

[...]

Dan Alistarh¹•Institutions (1)

Institute of Science and Technology Austria¹

23 Jul 2018

TL;DR: This tutorial will focus on parallelization strategies for the fundamental stochastic gradient descent (SGD) algorithm, which is a key tool when training machine learning models, from classical instances such as linear regression, to state-of-the-art neural network architectures.

...read moreread less

Abstract: The area of machine learning has made considerable progress over the past decade, enabled by the widespread availability of large datasets, as well as by improved algorithms and models Given the large computational demands of machine learning workloads, parallelism, implemented either through single-node concurrency or through multi-node distribution, has been a third key ingredient to advances in machine learningThe goal of this tutorial is to provide the audience with an overview of standard distribution techniques in machine learning, with an eye towards the intriguing trade-offs between synchronization and communication costs of distributed machine learning algorithms, on the one hand, and their convergence, on the otherThe tutorial will focus on parallelization strategies for the fundamental stochastic gradient descent (SGD) algorithm, which is a key tool when training machine learning models, from classical instances such as linear regression, to state-of-the-art neural network architecturesThe tutorial will describe the guarantees provided by this algorithm in the sequential case, and then move on to cover both shared-memory and message-passing parallelization strategies, together with the guarantees they provide, and corresponding trade-offs The presentation will conclude with a broad overview of ongoing research in distributed and concurrent machine learning The tutorial will assume no prior knowledge beyond familiarity with basic concepts in algebra and analysis

...read moreread less

Proceedings Article•DOI•

Brief Announcement: Performance Prediction for Coarse-Grained Locking

[...]

Vitaly Aksenov¹, Dan Alistarh², Petr Kuznetsov³•Institutions (3)

Saint Petersburg State University of Information Technologies, Mechanics and Optics¹, Institute of Science and Technology Austria², Université Paris-Saclay³

23 Jul 2018

TL;DR: This work describes a simple model that can be used to predict the throughput of coarse-grained lock-based algorithms and shows that it works well for CLH lock, and is expected to work for other popular lock designs such as TTAS, MCS, etc.

...read moreread less

Abstract: A standard design pattern found in many concurrent data structures, such as hash tables or ordered containers, is an alternation of parallelizable sections that incur no data conflicts and critical sections that must run sequentially and are protected with locks. A lock can be viewed as a queue that arbitrates the order in which the critical sections are executed, and a natural question is whether we can use stochastic analysis to predict the resulting throughput. As a preliminary evidence to the affirmative, we describe a simple model that can be used to predict the throughput of coarse-grained lock-based algorithms. We show that our model works well for CLH lock, and we expect it to work for other popular lock designs such as TTAS, MCS, etc.

...read moreread less

Proceedings Article•

Synchronous Multi-GPU Training for Deep Learning with Low-Precision Communications: An Empirical Study.

[...]

Demjan Grubic, Leo Tam¹, Dan Alistarh², Ce Zhang³•Institutions (3)

Yale University¹, Institute of Science and Technology Austria², ETH Zurich³

01 Jan 2018

Posted Content•

Compressive Sensing with Low Precision Data Representation: Radio Astronomy and Beyond.

[...]

Nezihe Merve Gürel, Kaan Kara, Dan Alistarh, Ce Zhang

14 Feb 2018

TL;DR: This work focuses on a setting where this problem is especially acute compressive sensing frameworks for radio astronomy and asks: Can the precision of the data representation be lowered for all input data, with recovery guarantees and good practical performance?

...read moreread less

Abstract: Modern scientific instruments produce vast amounts of data, which can overwhelm the processing ability of computer systems. Lossy compression of data is an intriguing solution, but comes with its own drawbacks, such as potential signal loss, and the need for careful optimization of the compression ratio. In this work, we focus on a setting where this problem is especially acute: compressive sensing frameworks for interferometry and medical imaging. We ask the following question: can the precision of the data representation be lowered for all inputs, with recovery guarantees and practical performance? Our first contribution is a theoretical analysis of the normalized Iterative Hard Thresholding (IHT) algorithm when all input data, meaning both the measurement matrix and the observation vector are quantized aggressively. We present a variant of low precision normalized {IHT} that, under mild conditions, can still provide recovery guarantees. The second contribution is the application of our quantization framework to radio astronomy and magnetic resonance imaging. We show that lowering the precision of the data can significantly accelerate image recovery. We evaluate our approach on telescope data and samples of brain images using CPU and FPGA implementations achieving up to a 9x speed-up with negligible loss of recovery quality.

...read moreread less

Journal Article•DOI•

Inherent limitations of hybrid transactional memory

[...]

Dan Alistarh¹, Justin Kopinsky², Petr Kuznetsov³, Srivatsan Ravi⁴, Nir Shavit² - Show less +1 more•Institutions (4)

Microsoft¹, Massachusetts Institute of Technology², Université Paris-Saclay³, University of Southern California⁴

01 Jun 2018-Distributed Computing

TL;DR: A general model for HyTM implementations, which captures the ability of hardware transactions to buffer memory accesses and captures for the first time the trade-off between the degree of hardware-software TM concurrency and the amount of instrumentation overhead.

...read moreread less

Abstract: Several hybrid transactional memory (HyTM) schemes have recently been proposed to complement the fast, but best-effort nature of hardware transactional memory with a slow, reliable software backup. However, the costs of providing concurrency between hardware and software transactions in HyTM are still not well understood. In this paper, we propose a general model for HyTM implementations, which captures the ability of hardware transactions to buffer memory accesses. The model allows us to formally quantify and analyze the amount of overhead (instrumentation) caused by the potential presence of software transactions. We prove that (1) it is impossible to build a strictly serializable HyTM implementation that has both uninstrumented reads and writes, even for very weak progress guarantees, and (2) the instrumentation cost incurred by a hardware transaction in any progressive opaque HyTM is linear in the size of the transaction’s data set. We further describe two implementations which exhibit optimal instrumentation costs for two different progress conditions. In sum, this paper proposes the first formal HyTM model and captures for the first time the trade-off between the degree of hardware-software TM concurrency and the amount of instrumentation overhead.

...read moreread less

Posted Content•

Distributionally Linearizable Data Structures

[...]

Dan Alistarh¹, Trevor Brown¹, Justin Kopinsky², Jerry Li², Giorgi Nadiradze³ - Show less +1 more•Institutions (3)

Institute of Science and Technology Austria¹, Massachusetts Institute of Technology², ETH Zurich³

03 Apr 2018-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: In this paper, the authors show that a family of relaxed concurrent data structures, including variants of MultiQueue, but also a new approximate counting algorithm called the MultiCounter, provides strong probabilistic guarantees on the degree of relaxation with respect to the sequential specification, in arbitrary concurrent executions.

...read moreread less

Abstract: Relaxed concurrent data structures have become increasingly popular, due to their scalability in graph processing and machine learning applications. Despite considerable interest, there exist families of natural, high performing randomized relaxed concurrent data structures, such as the popular MultiQueue pattern for implementing relaxed priority queue data structures, for which no guarantees are known in the concurrent setting. Our main contribution is in showing for the first time that, under a set of analytic assumptions, a family of relaxed concurrent data structures, including variants of MultiQueues, but also a new approximate counting algorithm we call the MultiCounter, provides strong probabilistic guarantees on the degree of relaxation with respect to the sequential specification, in arbitrary concurrent executions. We formalize these guarantees via a new correctness condition called distributional linearizability, tailored to concurrent implementations with randomized relaxations. Our result is based on a new analysis of an asynchronous variant of the classic power-of-two-choices load balancing algorithm, in which placement choices can be based on inconsistent, outdated information (this result may be of independent interest). We validate our results empirically, showing that the MultiCounter algorithm can implement scalable relaxed timestamps, which in turn can improve the performance of the classic TL2 transactional algorithm by up to 3 times, for some settings of parameters.

...read moreread less

Posted Content•

The Transactional Conflict Problem

[...]

Dan Alistarh¹, Syed Kamran Haider², Raphael Kübler³, Giorgi Nadiradze³•Institutions (3)

Institute of Science and Technology Austria¹, University of Connecticut², ETH Zurich³

03 Apr 2018-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: In this article, a family of optimal online algorithms for transactional conflict problem is proposed, where the goal is to minimize the overall running time penalty for the conflicting transactions in a transactional system.

...read moreread less

Abstract: The transactional conflict problem arises in transactional systems whenever two or more concurrent transactions clash on a data item. While the standard solution to such conflicts is to immediately abort one of the transactions, some practical systems consider the alternative of delaying conflict resolution for a short interval, which may allow one of the transactions to commit. The challenge in the transactional conflict problem is to choose the optimal length of this delay interval so as to minimize the overall running time penalty for the conflicting transactions. In this paper, we propose a family of optimal online algorithms for the transactional conflict problem. Specifically, we consider variants of this problem which arise in different implementations of transactional systems, namely "requestor wins" and "requestor aborts" implementations: in the former, the recipient of a coherence request is aborted, whereas in the latter, it is the requestor which has to abort. Both strategies are implemented by real systems. We show that the requestor aborts case can be reduced to a classic instance of the ski rental problem, while the requestor wins case leads to a new version of this classical problem, for which we derive optimal deterministic and randomized algorithms. Moreover, we prove that, under a simplified adversarial model, our algorithms are constant-competitive with the offline optimum in terms of throughput. We validate our algorithmic results empirically through a hardware simulation of hardware transactional memory (HTM), showing that our algorithms can lead to non-trivial performance improvements for classic concurrent data structures.

...read moreread less

Proceedings Article•DOI•

The Transactional Conflict Problem

[...]

Dan Alistarh¹, Syed Kamran Haider², Raphael Kübler³, Giorgi Nadiradze³•Institutions (3)

Institute of Science and Technology Austria¹, University of Connecticut², ETH Zurich³

11 Jul 2018

TL;DR: This paper considers variants of this problem which arise in different implementations of transactional systems, namely " requestor wins'' and "requestor aborts'' implementations: in the former, the recipient of a coherence request is aborted, whereas in the latter, the requestor which has to abort.

...read moreread less

Abstract: The transactional conflict problem arises in transactional systems whenever two or more concurrent transactions clash on a data item. While the standard solution to such conflicts is to immediately abort one of the transactions, some practical systems consider the alternative of delaying conflict resolution for a short interval, which may allow one of the transactions to commit. The challenge in the transactional conflict problem is to choose the optimal length of this delay interval so as to minimize the overall running time penalty for the conflicting transactions. In this paper, we propose a family of optimal online algorithms for the transactional conflict problem. Specifically, we consider variants of this problem which arise in different implementations of transactional systems, namely "requestor wins'' and "requestor aborts'' implementations: in the former, the recipient of a coherence request is aborted, whereas in the latter, it is the requestor which has to abort. Both strategies are implemented by real systems. We show that the requestor aborts case can be reduced to a classic instance of the ski rental problem, while the requestor wins case leads to a new version of this classical problem, for which we derive optimal deterministic and randomized algorithms. Moreover, we prove that, under a simplified adversarial model, our algorithms are constant-competitive with the offline optimum in terms of throughput. We validate our algorithmic results empirically through a hardware simulation of hardware transactional memory (HTM), showing that our algorithms can lead to non-trivial performance improvements for classic concurrent data structures.

...read moreread less

Posted Content•

Why Extension-Based Proofs Fail

[...]

Dan Alistarh¹, James Aspnes², Faith Ellen³, Rati Gelashvili³, Leqi Zhu³ - Show less +1 more•Institutions (3)

Institute of Science and Technology Austria¹, Yale University², University of Toronto³

04 Nov 2018-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: This work introduces extension-based proofs, a class of impossibility proofs that are modelled as an interaction between a prover and a protocol and that include valency arguments.

...read moreread less

Abstract: We introduce extension-based proofs, a class of impossibility proofs that includes valency arguments. They are modelled as an interaction between a prover and a protocol. Using proofs based on combinatorial topology, it has been shown that it is impossible to deterministically solve k-set agreement among n > k > 1 processes in a wait-free manner in certain asynchronous models. However, it was unknown whether proofs based on simpler techniques were possible. We show that this impossibility result cannot be obtained for one of these models by an extension-based proof and, hence, extension-based proofs are limited in power.

...read moreread less

Posted Content•

Compressive Sensing with Low Precision Data Representation: Theory and Applications

[...]

Nezihe Merve Gürel, Kaan Kara, Alen Stojanov, Tyler M. Smith, Dan Alistarh, Markus Püschel, Ce Zhang - Show less +3 more

14 Feb 2018

TL;DR: A theoretical analysis of the Iterative Hard Thresholding (IHT) algorithm when all input data, that is, the measurement matrix and the observation, are quantized aggressively to as little as 2 bits per value shows that there exists a variant of low precision IHT that can still provide recovery guarantees.

...read moreread less

Abstract: Modern scientific instruments produce vast amounts of data, which can overwhelm the processing ability of computer systems. Lossy compression of data is an intriguing solution, but comes with its own dangers, such as potential signal loss, and the need for careful parameter optimization. In this work, we focus on a setting where this problem is especially acute -compressive sensing frameworks for radio astronomy- and ask: Can the precision of the data representation be lowered for all inputs, with both recovery guarantees and practical performance? Our first contribution is a theoretical analysis of the Iterative Hard Thresholding (IHT) algorithm when all input data, that is, the measurement matrix and the observation, are quantized aggressively to as little as 2 bits per value. Under reasonable constraints, we show that there exists a variant of low precision IHT that can still provide recovery guarantees. The second contribution is an analysis of our general quantized framework tailored to radio astronomy, showing that its conditions are satisfied in this case. We evaluate our approach using CPU and FPGA implementations, and show that it can achieve up to 9.19x speed up with negligible loss of recovery quality, on real telescope data.

...read moreread less