Stochastic variational inference

doi:10.5555/2567709.2502622

Home
/
Papers
/
Stochastic variational inference

Journal Article•DOI•

Stochastic variational inference

Matthew D. Hoffman¹, David M. Blei², Chong Wang³, John Paisley⁴•Institutions (4)

Adobe Systems¹, Princeton University², Carnegie Mellon University³, University of California, Berkeley⁴

01 Jan 2013-Journal of Machine Learning Research (JMLR.org)-Vol. 14, Iss: 1, pp 1303-1347

TL;DR: Stochastic variational inference lets us apply complex Bayesian models to massive data sets, and it is shown that the Bayesian nonparametric topic model outperforms its parametric counterpart.

read less

Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•

Auto-Encoding Variational Bayes

[...]

Diederik P. Kingma¹, Max Welling¹•Institutions (1)

University of Amsterdam¹

01 Jan 2014

TL;DR: A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.

...read moreread less

Abstract: How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.

...read moreread less

20,769 citations

Cites background from "Stochastic variational inference"

.... An advantage of wake-sleep is that it also applies to models with discrete latent variables. Wake-Sleep has the same computational complexity as AEVB per datapoint. Stochastic variational inference [HBWP13] has recently received increasing interest. Recently, [BJP12] introduced a control variate schemes to reduce the high variance of the na¨ıve gradient estimator discussed in section 2.1, and applied to...
[...]

Posted Content•

Auto-Encoding Variational Bayes

[...]

Diederik P. Kingma¹, Max Welling¹•Institutions (1)

University of Amsterdam¹

20 Dec 2013-arXiv: Machine Learning

TL;DR: In this paper, a stochastic variational inference and learning algorithm was proposed for directed probabilistic models with intractable posterior distributions and large datasets, which scales to large datasets.

...read moreread less

4,883 citations

Proceedings Article•

Dropout as a Bayesian approximation: representing model uncertainty in deep learning

[...]

Yarin Gal¹, Zoubin Ghahramani¹•Institutions (1)

University of Cambridge¹

19 Jun 2016

TL;DR: A new theoretical framework is developed casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian processes, which mitigates the problem of representing uncertainty in deep learning without sacrificing either computational complexity or test accuracy.

...read moreread less

Abstract: Deep learning tools have gained tremendous attention in applied machine learning. However such tools for regression and classification do not capture model uncertainty. In comparison, Bayesian models offer a mathematically grounded framework to reason about model uncertainty, but usually come with a prohibitive computational cost. In this paper we develop a new theoretical framework casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian processes. A direct result of this theory gives us tools to model uncertainty with dropout NNs - extracting information from existing models that has been thrown away so far. This mitigates the problem of representing uncertainty in deep learning without sacrificing either computational complexity or test accuracy. We perform an extensive study of the properties of dropout's uncertainty. Various network architectures and nonlinearities are assessed on tasks of regression and classification, using MNIST as an example. We show a considerable improvement in predictive log-likelihood and RMSE compared to existing state-of-the-art methods, and finish by using dropout's uncertainty in deep reinforcement learning.

...read moreread less

3,472 citations

Cites methods from "Stochastic variational inference"

...Recent advances in variational inference introduced new techniques into the field such as sampling-based variational inference and stochastic variational inference (Blei et al., 2012; Kingma & Welling, 2013; Rezende et al., 2014; Titsias & LázaroGredilla, 2014; Hoffman et al., 2013)....
[...]

Journal Article•DOI•

Variational Inference: A Review for Statisticians

[...]

David M. Blei¹, Alp Kucukelbir¹, Jon McAuliffe²•Institutions (2)

Columbia University¹, University of California, Berkeley²

27 Feb 2017-Journal of the American Statistical Association

TL;DR: For instance, mean-field variational inference as discussed by the authors approximates probability densities through optimization, which is used in many applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling.

...read moreread less

Abstract: One of the core problems of modern statistics is to approximate difficult-to-compute probability densities. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior density. In this article, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization. VI has been used in many applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of densities and then to find a member of that family which is close to the target density. Closeness is measured by Kullback–Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data...

...read moreread less

3,421 citations

Posted Content•

Stochastic Backpropagation and Approximate Inference in Deep Generative Models

[...]

Danilo Jimenez Rezende¹, Shakir Mohamed¹, Daan Wierstra¹•Institutions (1)

Google¹

16 Jan 2014-arXiv: Machine Learning

TL;DR: In this article, a generative and recognition model is proposed to represent approximate posterior distributions and act as a stochastic encoder of the data, which allows for joint optimisation of the parameters of both the generative model and the recognition model.

...read moreread less

Abstract: We marry ideas from deep neural networks and approximate Bayesian inference to derive a generalised class of deep, directed generative models, endowed with a new algorithm for scalable inference and learning. Our algorithm introduces a recognition model to represent approximate posterior distributions, and that acts as a stochastic encoder of the data. We develop stochastic back-propagation -- rules for back-propagation through stochastic variables -- and use this to develop an algorithm that allows for joint optimisation of the parameters of both the generative and recognition model. We demonstrate on several real-world data sets that the model generates realistic samples, provides accurate imputations of missing data and is a useful tool for high-dimensional data visualisation.

...read moreread less

3,316 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Maximum likelihood from incomplete data via the EM algorithm

[...]

Arthur P. Dempster¹, Nan M. Laird¹, Donald B. Rubin¹•Institutions (1)

Harvard University¹

01 Sep 1977-Journal of the royal statistical society series b-methodological

49,597 citations

Journal Article•DOI•

Equation of state calculations by fast computing machines

[...]

N. Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, Edward Teller - Show less +1 more

01 Jun 1953-Journal of Chemical Physics

TL;DR: In this article, a modified Monte Carlo integration over configuration space is used to investigate the properties of a two-dimensional rigid-sphere system with a set of interacting individual molecules, and the results are compared to free volume equations of state and a four-term virial coefficient expansion.

...read moreread less

Abstract: A general method, suitable for fast computing machines, for investigating such properties as equations of state for substances consisting of interacting individual molecules is described. The method consists of a modified Monte Carlo integration over configuration space. Results for the two‐dimensional rigid‐sphere system have been obtained on the Los Alamos MANIAC and are presented here. These results are compared to the free volume equation of state and to a four‐term virial coefficient expansion.

...read moreread less

35,161 citations

"Stochastic variational inference" refers methods in this paper

...In MCMC sampling, we construct a Markov chain over the hidden variables whose stationary distribution is the posterior of interest (Metropolis et al., 1953; Hastings, 1970; Geman and Geman, 1984; Gelfand and Smith, 1990; Robert and Casella, 2004)....
[...]

Journal Article•DOI•

Latent dirichlet allocation

[...]

David M. Blei¹, Andrew Y. Ng², Michael I. Jordan¹•Institutions (2)

University of California, Berkeley¹, Stanford University²

01 Mar 2003-Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

...read moreread less

Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

...read moreread less

30,570 citations

"Stochastic variational inference" refers background or methods in this paper

...…how to use the general algorithm of Section 2 to derive stochastic variational inference for two probabilistic topic models: latent Dirichlet allocation (LDA) (Blei et al., 2003) and its Bayesian nonparametric counterpart, the hierarchical Dirichlet process (HDP) topic model (Teh et al., 2006a)....
[...]
...Examples include Bayesian mixture models (Ghahramani and Beal, 2000; Attias, 2000), latent Dirichlet allocation (Blei et al., 2003), hidden Markov models (and many variants) (Rabiner, 1989; Fine et al., 1998; Fox et al., 2011b; Paisley and Carin, 2009), Kalman filters (and many variants) (Kalman,…...
[...]
...Typically, researchers find the “best” number of topics with cross-validation (Blei et al.,2003)....
[...]
...70 51 v3 [...
[...]
...We follow the notation of Blei et al. (2003)....
[...]

Proceedings Article•

Latent Dirichlet Allocation

[...]

David M. Blei¹, Andrew Y. Ng¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

03 Jan 2001

TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hof-mann's aspect model, also known as probabilistic latent semantic indexing (pLSI).

...read moreread less

Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hof-mann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

...read moreread less

25,546 citations

Journal Article•DOI•

A New Approach to Linear Filtering and Prediction Problems

[...]

R. E. Kalman¹•Institutions (1)

Research Institute for Advanced Studies¹

01 Mar 1960-Journal of Basic Engineering

23,905 citations