Analyzing Hogwild Parallel Gaussian Gibbs Sampling

Home
/
Papers
/
Analyzing Hogwild Parallel Gaussian Gibbs Sampling

Proceedings Article•

Analyzing Hogwild Parallel Gaussian Gibbs Sampling

Matthew J. Johnson¹, James Saunderson¹, Alan S. Willsky¹•Institutions (1)

05 Dec 2013-Vol. 26, pp 2715-2723

TL;DR: It is shown that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean.

read less

Abstract: Sampling inference methods are computationally difficult to scale for many models in part because global dependencies can reduce opportunities for parallel computation. Without strict conditional independence structure among variables, standard Gibbs sampling theory requires sample updates to be performed sequentially, even if dependence between most variables is not strong. Empirical work has shown that some models can be sampled effectively by going "Hogwild" and simply running Gibbs updates in parallel with only periodic global communication, but the successes and limitations of such a strategy are not well understood. As a step towards such an understanding, we study the Hogwild Gibbs sampling strategy in the context of Gaussian distributions. We develop a framework which provides convergence conditions and error bounds along with simple proofs and connections to methods in numerical linear algebra. In particular, we show that if the Gaussian precision matrix is generalized diagonally dominant, then any Hogwild Gibbs sampler, with any update schedule or allocation of variables to processors, yields a stable sampling process with the correct sample mean.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent

[...]

Christopher De Sa¹, Matthew Feldman¹, Christopher Ré¹, Kunle Olukotun¹•Institutions (1)

Stanford University¹

24 Jun 2017

TL;DR: The DMGC model is introduced, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and it is shown that it provides a way to both classify these algorithms and model their performance.

...read moreread less

Abstract: Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buck-wild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11X. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems.

...read moreread less

155 citations

Cites methods from "Analyzing Hogwild Parallel Gaussian..."

...Fast asynchronous variants of other algorithms have also been proposed, such as coordinate descent [27, 28] and Gibbs sampling [11, 19]....
[...]

Journal Article•DOI•

DimmWitted: a study of main-memory statistical analytics

[...]

Ce Zhang¹, Christopher Ré²•Institutions (2)

University of Wisconsin-Madison¹, Stanford University²

01 Aug 2014

TL;DR: In this paper, the authors study the tradeoff space of access methods and replication to support statistical analytics using first-order methods executed in the main memory of a NUMA machine.

...read moreread less

Abstract: We perform the first study of the tradeoff space of access methods and replication to support statistical analytics using first-order methods executed in the main memory of a Non-Uniform Memory Access (NUMA) machine. Statistical analytics systems differ from conventional SQL-analytics in the amount and types of memory incoherence that they can tolerate. Our goal is to understand tradeoffs in accessing the data in row- or column-order and at what granularity one should share the model and data for a statistical task. We study this new tradeoff space and discover that there are tradeoffs between hardware and statistical efficiency. We argue that our tradeoff study may provide valuable information for designers of analytics engines: for each system we consider, our prototype engine can run at least one popular task at least 100× faster. We conduct our study across five architectures using popular models, including SVMs, logistic regression, Gibbs sampling, and neural networks.

...read moreread less

139 citations

Book•

Scalable Algorithms for Data and Network Analysis

[...]

Shang-Hua Teng¹•Institutions (1)

University of Southern California¹

04 May 2016

TL;DR: This tutorial surveys a family of algorithmic techniques for the design of provably-good scalable algorithms and illustrates the use of these techniques by a few basic problems that are fundamental in network analysis, particularly for the identification of significant nodes and coherent clusters/communities insocial and information networks.

...read moreread less

Abstract: In the age of Big Data, efficient algorithms are now in higher demandmore than ever before. While Big Data takes us into the asymptoticworld envisioned by our pioneers, it also challenges the classical notionof efficient algorithms: Algorithms that used to be considered efficient,according to polynomial-time characterization, may no longer be adequatefor solving today's problems. It is not just desirable, but essential,that efficient algorithms should be scalable. In other words, their complexityshould be nearly linear or sub-linear with respect to the problemsize. Thus, scalability, not just polynomial-time computability, shouldbe elevated as the central complexity notion for characterizing efficientcomputation.In this tutorial, I will survey a family of algorithmic techniques forthe design of provably-good scalable algorithms. These techniques includelocal network exploration, advanced sampling, sparsification, andgeometric partitioning. They also include spectral graph-theoreticalmethods, such as those used for computing electrical flows and samplingfrom Gaussian Markov random fields. These methods exemplifythe fusion of combinatorial, numerical, and statistical thinking in networkanalysis. I will illustrate the use of these techniques by a few basicproblems that are fundamental in network analysis, particularly for theidentification of significant nodes and coherent clusters/communities insocial and information networks. I also take this opportunity to discusssome frameworks beyond graph-theoretical models for studying conceptualquestions to understand multifaceted network data that arisein social influence, network dynamics, and Internet economics.

...read moreread less

85 citations

Cites background or methods from "Analyzing Hogwild Parallel Gaussian..."

...Both papers [263, 179] address the question of how often these machines need to exchange boundary variables in order to ensure the distributed Gibbs process to converge properly....
[...]
...H-precision matrices were first considered in [263, 179] when studying a parallel implementation of the Gibbs process, known as Hogwild Gaussian Gibbs sampling....
[...]

DeepDive: A Data Management System for Automatic Knowledge Base Construction

[...]

Ce Zhang

01 Jan 2015

79 citations

Journal Article•DOI•

Big Learning with Bayesian methods

[...]

Jun Zhu, Jianfei Chen¹, Wenbo Hu¹, Bo Zhang¹•Institutions (1)

Tsinghua University¹

04 May 2017-National Science Review

TL;DR: A survey of the recent advances in Big learning with Bayesian methods, termed Big Bayesian Learning, including nonparametric Bayesian Methods for adaptively inferring model complexity, regularized Bayesian inference for improving the flexibility via posterior regularization, and scalable algorithms and systems based on stochastic subsampling and distributed computing for dealing with large-scale applications.

...read moreread less

Abstract: The explosive growth in data volume and the availability of cheap computing resources have sparked increasing interest in Big learning, an emerging subfield that studies scalable machine learning algorithms, systems and applications with Big Data. Bayesian methods represent one important class of statistical methods for machine learning, with substantial recent developments on adaptive, flexible and scalable Bayesian learning. This article provides a survey of the recent advances in Big learning with Bayesian methods, termed Big Bayesian Learning, including non-parametric Bayesian methods for adaptively inferring model complexity, regularized Bayesian inference for improving the flexibility via posterior regularization, and scalable algorithms and systems based on stochastic subsampling and distributed computing for dealing with large-scale applications. We also provide various new perspectives on the large-scale Bayesian modeling and inference.

...read moreread less

76 citations

Cites methods from "Analyzing Hogwild Parallel Gaussian..."

...The work [158] develops various variable partitioning strategies to achieve fast parallelization while maintaining the convergence to the target posterior and the work [159] analyzes the convergence and correctness of the asynchronous Gibbs sampler (a....
[...]

1
2
3
4
…
5
6
7
8
9
10
11
12
13

Collapse

References

PDF

Open Access

More filters

Book•

Nonnegative Matrices in the Mathematical Sciences

[...]

Abraham Berman

01 Aug 1979

TL;DR: 1. Matrices which leave a cone invariant 2. Nonnegative matrices 3. Semigroups of non negative matrices 4. Symmetric nonnegativeMatrices 5. Generalized inverse- Positivity 6. M-matrices 7. Iterative methods for linear systems 8. Finite Markov Chains

...read moreread less

Abstract: 1. Matrices which leave a cone invariant 2. Nonnegative matrices 3. Semigroups of nonnegative matrices 4. Symmetric nonnegative matrices 5. Generalized inverse- Positivity 6. M-matrices 7. Iterative methods for linear systems 8. Finite Markov Chains 9. Input-output analysis in economics 10. The Linear complementarity problem 11. Supplement 1979-1993 References Index.

...read moreread less

6,572 citations

Book•

Graphical Models, Exponential Families, and Variational Inference

[...]

Martin J. Wainwright¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

16 Dec 2008

TL;DR: The variational approach provides a complementary alternative to Markov chain Monte Carlo as a general source of approximation methods for inference in large-scale statistical models.

...read moreread less

Abstract: The formalism of probabilistic graphical models provides a unifying framework for capturing complex dependencies among random variables, and building large-scale multivariate statistical models. Graphical models have become a focus of research in many statistical, computational and mathematical fields, including bioinformatics, communication theory, statistical physics, combinatorial optimization, signal and image processing, information retrieval and statistical machine learning. Many problems that arise in specific instances — including the key problems of computing marginals and modes of probability distributions — are best studied in the general setting. Working with exponential family representations, and exploiting the conjugate duality between the cumulant function and the entropy for exponential families, we develop general variational representations of the problems of computing likelihoods, marginal probabilities and most probable configurations. We describe how a wide variety of algorithms — among them sum-product, cluster variational methods, expectation-propagation, mean field methods, max-product and linear programming relaxation, as well as conic programming relaxations — can all be understood in terms of exact or approximate forms of these variational representations. The variational approach provides a complementary alternative to Markov chain Monte Carlo as a general source of approximation methods for inference in large-scale statistical models.

...read moreread less

4,335 citations

"Analyzing Hogwild Parallel Gaussian..." refers background in this paper

...In many problems [12] one has access to the pair (J, h) and must compute or estimate the moment parameters μ and Σ (or just the diagonal) or generate samples from N (μ,Σ)....
[...]

Proceedings Article•

Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

[...]

Benjamin Recht¹, Christopher Ré¹, Stephen J. Wright¹, Feng Niu¹•Institutions (1)

University of Wisconsin-Madison¹

12 Dec 2011

TL;DR: In this paper, the authors present an update scheme called HOGWILD!, which allows processors access to shared memory with the possibility of overwriting each other's work, which achieves a nearly optimal rate of convergence.

...read moreread less

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.

...read moreread less

1,939 citations

Posted Content•

HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

[...]

Feng Niu, Benjamin Recht, Christopher Ré, Stephen J. Wright

28 Jun 2011-arXiv: Optimization and Control

TL;DR: This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking, and presents an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work.

...read moreread less

1,413 citations

"Analyzing Hogwild Parallel Gaussian..." refers background or methods in this paper

...[1] provides both a motivation for Hogwild Gibbs sampling as well as the Hogwild name....
[...]
...We refer to this strategy as “Hogwild Gibbs sampling” in reference to recent work [1] in which sequential computations for computing gradient steps were applied in parallel (without global coordination) to great beneficial effect....
[...]

Journal Article•

Distributed Algorithms for Topic Models

[...]

David Newman¹, Arthur U. Asuncion¹, Padhraic Smyth¹, Max Welling¹•Institutions (1)

University of California, Irvine¹

01 Dec 2009-Journal of Machine Learning Research

TL;DR: This work describes distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model and the Hierarchical Dirichet Process (HDP) model, and proposes a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data.

...read moreread less

Abstract: We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer.

...read moreread less

438 citations

"Analyzing Hogwild Parallel Gaussian..." refers methods in this paper

...The AD-LDA sampling algorithm is an instance of the strategy we have named Hogwild Gibbs, and Bekkerman et al. [5, Chapter 11] suggests applying the strategy to other latent variable models....
[...]
...The algorithms are supported by the standard Gibbs sampling analysis, and the authors point out that while heuristic parallel samplers such as the AD-LDA sampler offer easier implementation and often greater parallelism, they are currently not supported by much theoretical analysis....
[...]
...This Hogwild Gibbs sampling strategy has long been considered a useful hack, perhaps for preparing decent initial states for a proper serial Gibbs sampler, but extensive empirical work on Approximate Distributed Latent Dirichlet Allocation (AD-LDA) [2, 3, 4, 5, 6], which applies the strategy to generate samples from a collapsed LDA model, has demonstrated its effectiveness in sampling LDA models with the same predictive performance as those generated by standard serial Gibbs [2, Figure 3]....
[...]
...There have been recent advances in understanding some of the particular structure of AD-LDA [6], but a thorough theoretical explanation for the effectiveness and limitations of Hogwild Gibbs sampling is far from complete....
[...]
...The work of Ihler et al. [6] provides some understanding of the effectiveness of a variant of AD-LDA by bounding in terms of run-time quantities the one-step error probability induced by proceeding with sampling steps in parallel, thereby allowing an AD-LDA user to inspect the computed error bound after inference [6, Section 4.2]....
[...]