
Scalable importance tempering and Bayesian variable selection

01 Jul 2019-Journal of The Royal Statistical Society Series B-statistical Methodology (Wiley-Blackwell Publishing Ltd.)-Vol. 81, Iss: 3, pp 489-517
TL;DR: In this paper, a Monte Carlo algorithm is proposed to sample from high-dimensional probability distributions that combines Markov chain Monte Carlo (MCMC) and importance sampling, which can be applied to Bayesian Variable Selection problems.
Abstract: We propose a Monte Carlo algorithm to sample from high-dimensional probability distributions that combines Markov chain Monte Carlo (MCMC) and importance sampling. We provide a careful theoretical analysis, including guarantees on robustness to high-dimensionality, explicit comparison with standard MCMC and illustrations of the potential improvements in efficiency. Simple and concrete intuition is provided for when the novel scheme is expected to outperform standard ones. When applied to Bayesian Variable Selection problems, the novel algorithm is orders of magnitude more efficient than available alternative sampling schemes and allows one to perform fast and reliable fully Bayesian inference with tens of thousands of regressors.

Summary (5 min read)

1. Introduction

  • Sampling from high-dimensional probability distributions is a common task arising in many scientific areas, such as Bayesian statistics, machine learning and statistical physics.
  • The proposed scheme, which the authors call Tempered Gibbs Sampler (TGS), involves componentwise updating rather like Gibbs Sampling (GS), with improved mixing properties and associated importance weights which remain stable as dimension increases.
  • Through an appropriately designed tempering mechanism, TGS circumvents the main limitations of standard GS, such as the slow mixing induced by strong posterior correlations.

2. The Tempered Gibbs Sampling scheme

  • The authors shall assume the following condition on Z which is stronger than necessary, but which holds naturally for their purposes later on.
  • Z(x) is bounded away from 0, and bounded above on compact sets.
  • The functions h and hw have identical support from (2).

2.1. Illustrative example.

  • Consider the following illustrative example, where the target is a bivariate Gaussian with correlation ρ = 0.999.
  • The left of Figure 1 displays the first 200 iterations of GS.
  • Now the tempered conditional distributions of TGS allow the chain to move freely around the state space despite correlation.
  • For the vanilla scheme with fully tempered conditionals as in (4), however, this freedom comes at the price of high variability of the importance weights w(x(t)), which deteriorates the efficiency of the estimators ĥTGS defined in (3).
  • On the other hand, the TGS scheme that uses tempered conditionals as in (5), which the authors refer to as TGS-mixed here, achieves both fast mixing of the Markov chain x(t) and low variance of the importance weights w(x(t)).

3. Analysis of the algorithm

  • In this section the authors provide a careful theoretical and empirical analysis of the TGS algorithm.
  • The first aim is providing theoretical guarantees on the robustness of TGS, both in terms of variance of the importance sampling weights in high dimensions and mixing of the resulting Markov chain compared to the GS one.
  • The second aim is to provide understanding about which situations will be favourable to TGS and which ones will not.
  • Throughout the paper, the authors measure the efficiency of Monte Carlo algorithms through their asymptotic variances.

3.1. Robustness to high-dimensionality.

  • A major concern with classical importance tempering schemes is that they often collapse in high-dimensional scenarios (see e.g. Owen, 2013, Sec.9.1).
  • On the contrary, the importance sampling procedure associated with TGS is robust to high-dimensional scenarios.

3.2. Explicit comparison with standard Gibbs Sampling.

  • The authors now compare the efficiency of the Monte Carlo estimators produced by TGS with the ones produced by classical GS.
  • The following theorem shows that the efficiency of TGS estimators can never be worse than that of GS estimators by a factor larger than b².
  • In general it is possible for var(h, TGS) to be finite when var(h, GS) is not.
  • Choosing the relevant tuning parameter too small, however, will reduce the potential benefit obtained with TGS, with the latter collapsing to GS at zero, so that optimising involves a compromise between these extremes.
  • The optimal choice involves a trade-off between small variance of the importance sampling weights and fast mixing of the resulting Markov chain.

3.3. TGS and correlation structure.

  • Theorem 1 implies that, under suitable choices of g(xi|x−i), TGS never provides significantly worse (i.e. worse by more than a controllable constant factor) efficiency than GS.
  • On the other hand, TGS performances can be much better than standard GS.
  • The underlying reason is that the tempering mechanism can dramatically speed up the convergence of the TGS Markov chain x(t) to its stationary distribution fZ by reducing correlations in the target.
  • In fact, the covariance structure of fZ is substantially different from that of the original target f, and this can prevent the sampler from getting stuck in situations where GS would.
  • Clearly, the same property does not hold for GS, whose mixing time deteriorates as ρ→ 1.

3.5. When does TGS work and when does it not?

  • The previous two sections showed that in the bivariate case TGS can induce much faster mixing compared to GS.
  • The extent of the improvement depends on the correlation structure of the target.
  • Also, it is worth noting that TGS does not require prior knowledge of the global correlation structure, or of which variables are strongly correlated, in order to be implemented.
  • The reason for the presence or lack of improvements given by TGS lies in the different geometrical structure induced by positive and negative correlations.
  • Intuitively, the authors conjecture that if the limiting singular distribution for ρ→ 1 can be navigated with pairwise updates (i.e. moving on (xi, xj) “planes” rather than (xi) “lines” as for GS), then TGS should perform well (i.e. uniformly good mixing over ρ for good choice of β), otherwise it will not.

3.6. Controlling the frequency of coordinate updating.

  • In the absence of prior information on the structure of the problem under consideration, the latter is a desirable robustness property, as it prevents the algorithm from updating some coordinates too often and ignoring others.
  • In some contexts, the authors may want to invest more computational effort in updating some coordinates rather than others (see for example the Bayesian Variable Selection problems discussed below).
  • This can be done by multiplying the selection probability pi(x) by some weight function ηi(x−i), obtaining pi(x) = ηi(x−i) g(xi|x−i)/f(xi|x−i), while leaving the rest of the algorithm unchanged.
  • The authors call the resulting algorithm weighted Tempered Gibbs Sampling (wTGS).

4. Application to Bayesian Variable Selection

  • The authors shall illustrate the theoretical and methodological conclusions of Section 3 in an important class of statistical models where Bayesian computational issues are known to be particularly challenging.
  • Binary inclusion variables in Bayesian Variable Selection models typically possess the kind of pairwise and/or negative dependence structures conjectured to be conducive to successful application of TGS in Section 3.5 (see Section 4.5 for a more detailed discussion).
  • Therefore, in this section the authors provide a detailed application of TGS to sampling from the posterior distribution of Gaussian Bayesian Variable Selection models.
  • This is a widely used class of models where posterior inferences are computationally challenging due to the presence of high-dimensional discrete parameters.
  • The Gibbs Sampler is the standard choice of algorithm to draw samples from the posterior distribution (see Section B.6 in the supplement for more details).

4.1. Model specification.

  • Bayesian Variable Selection (BVS) models provide a natural and coherent framework to select a subset of explanatory variables in linear regression contexts (Chipman et al., 2001).
  • Under this model set-up, the continuous hyperparameters β and σ can be analytically integrated and one is left with an explicit expression for p(γ|Y ).

4.2. TGS for Bayesian Variable Selection.

  • For every value of i and γ−i, the authors set the tempered conditional distribution gi(γi|γ−i) to be the uniform distribution over {0, 1}.
  • Since the target state space is discrete, it is more efficient to replace the Gibbs step of updating γi conditional on i and γ−i, with its Metropolised version (see e.g. Liu, 1996).
  • The resulting specific instance of TGS is the following.
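The summary omits the algorithm listing, but the mechanics can be sketched on a toy example (our own illustration: the 4-state posterior table below is hypothetical and stands in for p(γ|Y), which in a real BVS model comes from integrating out the continuous parameters). With gi uniform on {0, 1}, the selection probabilities reduce to pi(γ) = (1/2)/p(γi|γ−i, Y), and the Metropolised tempered update always accepts the flip of γi:

```python
import random

# hypothetical unnormalised posterior over gamma in {0,1}^2,
# standing in for p(gamma|Y) in a real BVS model
POST = {(0, 0): 1.0, (1, 0): 6.0, (0, 1): 3.0, (1, 1): 2.0}

def cond_prob(gamma, i):
    """p(gamma_i = gamma[i] | gamma_{-i}, Y) from the table."""
    flipped = list(gamma)
    flipped[i] = 1 - flipped[i]
    return POST[tuple(gamma)] / (POST[tuple(gamma)] + POST[tuple(flipped)])

def tgs_bvs(n_iter, seed=7):
    rng = random.Random(seed)
    gamma = (0, 0)
    total_w = 0.0
    pip = [0.0, 0.0]
    for _ in range(n_iter):
        # p_i(gamma) = g_i/f_i = (1/2) / p(gamma_i | gamma_{-i}, Y)
        p = [0.5 / cond_prob(gamma, i) for i in (0, 1)]
        i = 0 if rng.random() < p[0] / (p[0] + p[1]) else 1
        # Metropolised tempered update: with g_i uniform on {0,1},
        # proposing the opposite value is always accepted, i.e. flip gamma_i
        flipped = list(gamma)
        flipped[i] = 1 - flipped[i]
        gamma = tuple(flipped)
        # importance weight 1/Z at the new state (d = 2 coordinates)
        w = 1.0 / (0.5 * sum(0.5 / cond_prob(gamma, j) for j in (0, 1)))
        total_w += w
        pip[0] += w * gamma[0]
        pip[1] += w * gamma[1]
    return [v / total_w for v in pip]

pips = tgs_bvs(50000)
```

With 50,000 iterations the weighted estimates recover the exact inclusion probabilities of the toy table, 8/12 and 5/12, to within Monte Carlo error.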

4.3. wTGS for BVS.

  • As discussed in Section 3.6, TGS updates each coordinate with the same frequency.
  • In a BVS context, however, this may be inefficient as the resulting sampler would spend most iterations updating variables that have low or negligible posterior inclusion probability, especially when p gets large.
  • Here p(γi = 1|Y ) denotes the posterior probability that γi equals 1, while p(γi = 1|γ−i, Y ) denotes the probability of the same event conditional on both the observed data Y and the current value of γ−i.
  • Note that with wTGS one can obtain a frequency of updating of the i-th component proportional to p(γi = 1|Y ) without knowing the actual value of p(γi = 1|Y ), but rather using only the conditional expressions p(γi = 1|γ−i, Y ).
  • Choosing a uniform frequency of updating favours exploration, as it forces the sampler to explore new regions of the space by flipping variables with low conditional inclusion probability.

4.4. Efficient implementation and Rao-Blackwellisation.

  • Compared to GS, TGS and wTGS provide substantially improved convergence properties at the price of an increased computational cost per iteration.
  • The additional cost is computing {p(γi|Y, γ−i)} for i = 1, . . . , p given γ ∈ {0, 1}^p, which can be done efficiently through vectorised operations as described in Section B.1 of the supplement.
  • Such efficient implementation is crucial to the successful application of these TGS schemes.
  • The resulting cost per iteration of TGS and wTGS is of order O(np + |γ|p).
  • See Section B.2 of the supplement for derivations of these expressions.

4.5. Computational complexity results for simple BVS scenarios

  • The authors study the computational complexity of TGS and wTGS in some simple BVS scenarios.
  • In the first case the posterior distribution p(γ|Y) features independent components, and thus it is the ideal case for GS, while in the second case it features some maximally correlated components, and thus it is a worst-case scenario for GS.
  • The authors' results show that the computational complexity of TGS and wTGS is not impacted by the change in correlation structure between the two scenarios.
  • See Section B.4 of the supplement for a quantitative example.
  • Intuitively, strongly correlated regressors provide the same type of information regarding Y .

4.5.3. Fully collinear case

  • The authors now consider the other extreme case, where there are maximally correlated regressors.
  • In particular, suppose that m out of the p available regressors are perfectly collinear among themselves and with the data vector (i.e. each regressor fully explains the data), while the other p−m regressors are orthogonal to the first m ones.
  • The XTX matrix resulting from the scenario described above is not full-rank.

5.1. Illustrative example.

  • The differences between GS, TGS and wTGS can be well illustrated considering a scenario where two regressors with good explanatory power are strongly correlated.
  • As a result, the Gibbs Sampler (GS) will get stuck in one of the two local modes corresponding to one variable being active and the other inactive.
  • All chains were started from the empty model (γi = 0 for every i).
  • TGS and wTGS, which have a roughly equivalent cost per iteration, were run for 30000 iterations, after a burn in of 5000 iterations.
  • For example variable 3 in this simulated data has a PIP of roughly 0.05.

5.2. Simulated data.

  • The authors compare TGS and wTGS under different simulated scenarios.
  • Scenarios analogous to the ones above have been previously considered in the literature.
  • More precisely, the authors define the relative efficiency of TGS over GS as Eff_TGS/Eff_GS = (ESS_TGS/T_TGS)/(ESS_GS/T_GS) = (σ²_GS T_GS)/(σ²_TGS T_TGS), (20), where σ²_GS and σ²_TGS are the variances of the Monte Carlo estimators produced by GS and TGS, respectively, while T_GS and T_TGS are the CPU times required to produce such estimators.
  • For each simulated dataset, the authors computed the relative efficiency defined by (20) for each PIP estimator, thus getting p values, one for each variable.
  • Table 1 reports the median of such p values for each dataset under consideration.
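The efficiency ratio in (20) is straightforward to compute once estimator variances and CPU times are available; a minimal helper, with hypothetical numbers not taken from the paper's experiments:

```python
def relative_efficiency(var_gs, time_gs, var_tgs, time_tgs):
    """Relative efficiency of TGS over GS as in eq. (20).

    Eff_TGS / Eff_GS = (ESS_TGS / T_TGS) / (ESS_GS / T_GS)
                     = (sigma2_GS * T_GS) / (sigma2_TGS * T_TGS),
    since effective sample size is inversely proportional to the
    estimator variance.
    """
    return (var_gs * time_gs) / (var_tgs * time_tgs)

# hypothetical numbers: TGS has 40x smaller variance but twice the
# CPU time, for a 20x overall efficiency gain
gain = relative_efficiency(var_gs=4.0, time_gs=1.0, var_tgs=0.1, time_tgs=2.0)
```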

5.3. Real data.

  • In this section the authors consider three real datasets with increasing number of covariates.
  • The authors compare wTGS to GS and the Hamming Ball (HB) sampler, a recently proposed sampling scheme designed for posterior distributions over discrete spaces, including BVS models (Titsias and Yau, 2017).
  • These two datasets are described in Section 5.3 of Rossell and Telesca (2017) and are freely available from the corresponding supplementary material.
  • Points close to the diagonal indicate estimates in accordance with each other across runs, while points far from the diagonal indicate otherwise.
  • The authors performed further experiments, in order to compare the wTGS performances with the ones of available R packages for BVS and some alternative methodology from the literature.

6. Discussion

  • The authors have introduced a novel Gibbs sampler variant, demonstrating its considerable potential both in toy examples as well as more realistic Bayesian Variable Selection models, and giving underpinning theory to support the use of the method and to explain its impressive convergence properties.
  • This would effectively reduce the probability of repeatedly updating the same coordinate in consecutive iterations, which, as shown in Proposition 5, can be interpreted as a rejected move.
  • TGS provides a generic way of mitigating the worst effects of dependence on Gibbs sampler convergence.
  • For alternative schemes (e.g. Papaspiliopoulos et al., 2018), the generic implementation requires the ability to perform Gibbs Sampling on generic linear transformations of the target, which is often not practical beyond the Gaussian case.
  • Given the results of Sections 4 and 5, it would be interesting to explore the use of the methodology proposed in this paper for other BVS models, such as models with more elaborate priors (e.g. Johnson and Rossell, 2012) or binary response variables.


warwick.ac.uk/lib-publications
Manuscript version: Author’s Accepted Manuscript
The version presented in WRAP is the author’s accepted manuscript and may differ from the published version or Version of Record.
Persistent WRAP URL: http://wrap.warwick.ac.uk/114513
How to cite: please refer to the published version for the most recent bibliographic citation information.
For more information, please contact the WRAP Team at: wrap@warwick.ac.uk.

Scalable Importance Tempering and Bayesian Variable Selection
Giacomo Zanella
Department of Decision Sciences, BIDSA and IGIER, Bocconi University, Milan, Italy.
Gareth Roberts
Department of Statistics, University of Warwick, Coventry, United Kingdom.
Summary. We propose a Monte Carlo algorithm to sample from high-dimensional probability distributions that combines Markov chain Monte Carlo (MCMC) and importance sampling. We provide a careful theoretical analysis, including guarantees on robustness to high-dimensionality, explicit comparison with standard MCMC and illustrations of the potential improvements in efficiency. Simple and concrete intuition is provided for when the novel scheme is expected to outperform standard ones. When applied to Bayesian Variable Selection problems, the novel algorithm is orders of magnitude more efficient than available alternative sampling schemes and allows one to perform fast and reliable fully Bayesian inference with tens of thousands of regressors.
1. Introduction
Sampling from high-dimensional probability distributions is a common task arising
in many scientific areas, such as Bayesian statistics, machine learning and statistical
physics. In this paper we propose and analyse a novel Monte Carlo scheme for generic,
high-dimensional target distributions that combines importance sampling and Markov
chain Monte Carlo (MCMC).
There have been many attempts to embed importance sampling within Monte Carlo schemes for Bayesian analysis, see for example Smith and Gelfand (1992); Gramacy et al. (2010) and beyond. However, except where Sequential Monte Carlo approaches can be adopted, pure Markov chain based schemes (i.e. ones which simulate from precisely the right target distribution with no need for subsequent importance sampling correction) have been far more successful. This is because MCMC methods are usually much more scalable to high-dimensional situations, see for example (Frieze et al., 1994; Belloni et al., 2009; Yang et al., 2016; Roberts and Rosenthal, 2016), whereas importance sampling weight variances tend to grow (often exponentially) with dimension. In this paper we propose a natural way to combine the best of MCMC and importance sampling in a way that is robust in high-dimensional contexts and ameliorates the slow mixing which plagues many Markov chain based schemes. The proposed scheme, which we call the Tempered Gibbs Sampler (TGS), involves componentwise updating rather like Gibbs Sampling (GS), with improved mixing properties and associated importance weights which remain stable as dimension increases. Through an appropriately designed tempering mechanism, TGS circumvents the main limitations of standard GS, such as the slow mixing induced by strong posterior correlations. It also avoids the requirement to visit all coordinates sequentially, instead iteratively making state-informed decisions as to which coordinate should be updated next.
Our scheme differs from classical simulated and parallel tempering (Marinari and Parisi, 1992; Geyer and Thompson, 1995) in that it tempers only the coordinate that is currently being updated, and compensates for the overdispersion induced by the tempered update by choosing to update components which are in the tail of their conditional distributions more frequently. The resulting dynamics can dramatically speed up convergence of the standard GS, both during the transient and the stationary phase of the algorithm. Moreover, TGS does not require multiple temperature levels (as in simulated and parallel tempering) and thus avoids the tuning issues related to choosing the number of levels and collection of temperatures, as well as the heavy computational burden induced by introducing multiple copies of the original state space.

We apply the novel sampling scheme to Bayesian Variable Selection problems, observing multiple orders of magnitude improvements compared to alternative Monte Carlo schemes. For example, TGS allows us to perform reliable, fully Bayesian inference for spike and slab models with over ten thousand regressors in less than two minutes using a simple R implementation and a single desktop computer.
The paper structure is as follows. The TGS scheme is introduced in Section 2. There we provide basic validity results and intuition on the potential improvement given by the novel scheme, together with an illustrative example. In Section 3 we develop a careful analysis of the proposed scheme. First we show that, unlike common tempering schemes, TGS is robust to high-dimensionality of the target, as the coordinate-wise tempering mechanism employed is actually improved rather than damaged by high-dimensionality. Secondly we show that TGS cannot perform worse than standard GS by more than a constant factor that can be chosen by the user (in our simulations we set it to 2), while being able to perform orders of magnitude better. Finally we provide concrete insight regarding the type of correlation structures where TGS will perform much better than GS and the ones where GS and TGS will perform similarly. In Section 4 we provide a detailed application to Bayesian Variable Selection problems, including computational complexity results. Section 5 contains simulation studies. We review our findings in Section 6. Short proofs are directly reported in the paper, while longer ones can be found in the online supplementary material.
2. The Tempered Gibbs Sampling scheme
Let f(x) be a probability distribution with x = (x1, . . . , xd) ∈ X1 × · · · × Xd = X. Each iteration of the classical random-scan Gibbs Sampler (GS) scheme proceeds by picking i from {1, . . . , d} uniformly at random and then sampling xi ∼ f(xi|x−i). We consider the following tempered version of the Gibbs Sampler, which depends on a collection of modified full conditionals denoted by {g(xi|x−i)}, with i ∈ {1, . . . , d} and x−i ∈ X−i. The only requirement on g(xi|x−i) is that, for all x−i, it is a probability density function on Xi absolutely continuous with respect to f(xi|x−i), with no need to be the actual full conditional of some global distribution g(x). The following functions play a crucial role in the definition of the Tempered Gibbs Sampling (TGS) algorithm:

pi(x) = g(xi|x−i) / f(xi|x−i)  for i = 1, . . . , d ;    Z(x) = (1/d) Σ_{i=1}^{d} pi(x) .   (1)
Algorithm TGS. At each iteration of the Markov chain do:
(a) (Coordinate selection) Sample i from {1, . . . , d} proportionally to pi(x).
(b) (Tempered update) Sample xi ∼ g(xi|x−i).
(c) (Importance weighting) Assign to the new state x a weight w(x) = Z(x)^(−1).
The Markov chain x(1), x(2), . . . induced by steps (a) and (b) of TGS is reversible with respect to fZ, which is a probability density function on X defined as (fZ)(x) = f(x)Z(x). We shall assume the following condition on Z, which is stronger than necessary but which holds naturally for our purposes later on:

Z(x) is bounded away from 0, and bounded above on compact sets.   (2)

Throughout the paper Z and w are the inverse of each other, i.e. w(x) = Z(x)^(−1) for all x ∈ X. As usual, we denote the space of f-integrable functions from X to R by L^1(X, f) and we write E_f[h] = ∫_X h(x)f(x)dx for every h ∈ L^1(X, f).
Proposition 1. fZ is a probability density function on X and the Markov chain x(1), x(2), . . . induced by steps (a) and (b) of TGS is reversible with respect to fZ. Assuming that (2) holds and that TGS is fZ-irreducible, then

ĥTGS_n = [ Σ_{t=1}^{n} w(x(t)) h(x(t)) ] / [ Σ_{t=1}^{n} w(x(t)) ] → E_f[h] , as n → ∞ ,   (3)

almost surely (a.s.) for every h ∈ L^1(X, f).
Proof. Reversibility w.r.t. f(x)Z(x) can be checked as in the proof of Proposition 6 in Section A.4 of the supplement. Representing f(x)Z(x) as a mixture of d probability densities on X we have

∫_X f(x)Z(x) dx = ∫_X (1/d) Σ_{i=1}^{d} f(x) g(xi|x−i)/f(xi|x−i) dx = (1/d) Σ_{i=1}^{d} ∫_X f(x−i) g(xi|x−i) dx = 1 .

The functions h and hw have identical support from (2). Moreover it is clear that h ∈ L^1(X, f) if and only if hw ∈ L^1(X, fZ) and that in fact

E_f[h] = ∫ h(x)f(x)dx = ∫ h(x)w(x)f(x)Z(x)dx = E_{fZ}[hw] .

Therefore from Theorem 17.0.1 of Meyn and Tweedie (1993), applied to both numerator and denominator, (3) holds since by hypothesis TGS is fZ-irreducible, so that (x(t))_{t=1}^∞ is ergodic. □

We note that fZ-irreducibility of TGS can be established in specific examples using standard techniques, see for example Roberts and Smith (1994). Moreover, under (2), conditions from that paper which imply f-irreducibility of the standard Gibbs Sampler readily extend to demonstrating that TGS is fZ-irreducible.
The implementation of TGS requires the user to specify a collection of densities {g(xi|x−i)}. Possible choices of these include tempered conditionals of the form

g(xi|x−i) = f^(β)(xi|x−i) = f(xi|x−i)^β / ∫_{Xi} f(yi|x−i)^β dyi ,   (4)

where β is a fixed value in (0, 1), and mixed conditionals of the form

g(xi|x−i) = (1/2) f(xi|x−i) + (1/2) f^(β)(xi|x−i) ,   (5)

with β ∈ (0, 1) and f^(β) defined as in (4). Note that the g(xi|x−i) in (5) are not the full conditionals of (1/2) f(x) + (1/2) f^(β)(x), as the latter would have mixing weights depending on x. Indeed the g(xi|x−i) in (5) are unlikely to be the full conditionals of any distribution.
The theory developed in Section 3 will provide insight into which choice for g(xi|x−i) leads to effective Monte Carlo methods. Moreover, we shall see that building g(xi|x−i) as a mixture of f(xi|x−i) and a flattened version of f(xi|x−i), as in (5), is typically a robust and efficient choice.
The modified conditionals need to be tractable, as we need to sample from them and evaluate their density. In many cases, if the original full conditionals f(xi|x−i) are tractable (e.g. Bernoulli, Normal, Beta or Gamma distributions), then so are the densities of the form f^(β)(xi|x−i). More generally, one can use any flattened version of f(xi|x−i) instead of f^(β)(xi|x−i). For example in Section 3.5 we provide an illustration using a t-distribution for g(xi|x−i) when f(xi|x−i) is normal.
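As a quick illustration of this tractability (our own example, not from the paper): powering a Beta(a, b) density and renormalising yields Beta(β(a − 1) + 1, β(b − 1) + 1), just as powering N(μ, σ²) yields N(μ, σ²/β). The sketch below checks the Beta case numerically:

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x."""
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * x ** (a - 1) * (1 - x) ** (b - 1)

def tempered_beta_params(a, b, beta):
    # f(x)^beta is proportional to x^(beta*(a-1)) * (1-x)^(beta*(b-1)),
    # the kernel of Beta(beta*(a-1)+1, beta*(b-1)+1)
    return beta * (a - 1) + 1, beta * (b - 1) + 1

a, b, beta = 5.0, 2.0, 0.3
at, bt = tempered_beta_params(a, b, beta)

# normalise f^beta numerically on a midpoint grid and compare with the
# closed-form tempered density at x = 0.5
n = 20000
grid = [(k + 0.5) / n for k in range(n)]
norm = sum(beta_pdf(x, a, b) ** beta for x in grid) / n
numeric = beta_pdf(0.5, a, b) ** beta / norm
closed = beta_pdf(0.5, at, bt)
```

The numerical and closed-form values agree up to the discretisation error of the grid, confirming that tempering keeps the conditional within the same parametric family.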
TGS has various potential advantages over GS. First, it makes an “informed choice” of which variable to update, choosing with higher probability coordinates whose value is currently in the tail of their conditional distribution. Secondly, it induces potentially longer jumps by sampling xi from a tempered distribution g(xi|x−i). Finally, as we will see in the next sections, the invariant distribution fZ has potentially much less correlation among variables compared to the original distribution f.
2.1. Illustrative example.
Consider the following illustrative example, where the target is a bivariate Gaussian with correlation ρ = 0.999. Posterior distributions with such strong correlations naturally arise in Bayesian modeling, e.g. in the context of hierarchical linear models with a large number of observations. The left of Figure 1 displays the first 200 iterations of GS. As expected, the strong correlation slows down the sampler dramatically and the chain hardly moves away from the starting point, in this case (3, 3). The center and right of Figure 1 display the first 200 iterations of TGS with modified conditionals given by (4) and (5), respectively, and β = 1 − ρ². See Section 3 for some discussion on the choice of β in practice. Now the tempered conditional distributions of TGS allow the chain to move freely around the state space despite correlation. However, the vanilla version with tempered conditionals (4) pays for this freedom with high variability of the importance weights, whereas the mixed version (5) achieves both fast mixing and stable weights.
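For readers who want to experiment, the following is a minimal, self-contained sketch of TGS with the mixed conditionals (5) on this bivariate Gaussian example (β = 1 − ρ², starting point (3, 3), as above). It is our own illustrative implementation, not the authors' code; the cap inside log_ratio is a numerical guard for states far out in a conditional tail, whose weights are essentially zero anyway.

```python
import math
import random

RHO = 0.999              # correlation of the bivariate Gaussian target
V = 1.0 - RHO ** 2       # conditional variance: f(xi|x-i) = N(RHO * x-i, V)
BETA = 1.0 - RHO ** 2    # tempering parameter, as in the illustration

def log_ratio(xi, xother):
    """log f^(beta)(xi|xother) - log f(xi|xother).

    Powering N(mu, V) by beta and renormalising gives N(mu, V/beta),
    so the log-ratio has a closed form. Capped to avoid overflow for
    states deep in a conditional tail (their weight is ~0 anyway).
    """
    lr = 0.5 * math.log(BETA) + (1.0 - BETA) * (xi - RHO * xother) ** 2 / (2.0 * V)
    return min(lr, 500.0)

def p_vec(x):
    # p_i(x) = g(xi|x-i)/f(xi|x-i) = 0.5 * (1 + f^(beta)/f) for the mixture (5)
    return [0.5 * (1.0 + math.exp(log_ratio(x[i], x[1 - i]))) for i in (0, 1)]

def tgs_mixed(n_iter, seed=1):
    rng = random.Random(seed)
    x = [3.0, 3.0]                       # same starting point as Figure 1
    states, weights = [], []
    for _ in range(n_iter):
        p = p_vec(x)
        # (a) coordinate selection, proportional to p_i(x)
        i = 0 if rng.random() < p[0] / (p[0] + p[1]) else 1
        # (b) tempered update: draw from the mixed conditional (5)
        mu = RHO * x[1 - i]
        sd = math.sqrt(V) if rng.random() < 0.5 else math.sqrt(V / BETA)
        x[i] = rng.gauss(mu, sd)
        # (c) importance weight w(x) = 1/Z(x), assigned to the new state
        z = 0.5 * sum(p_vec(x))
        states.append(tuple(x))
        weights.append(1.0 / z)
    return states, weights

states, weights = tgs_mixed(20000)
# self-normalised estimator (3) of E_f[x1] = 0
est = sum(w * s[0] for s, w in zip(states, weights)) / sum(weights)
```

The last line forms the self-normalised estimator (3) of E_f[x1] = 0. Because pi(x) > 1/2 for the mixture (5), every weight 1/Z(x) is bounded above by 2, in line with the weight stability discussed in Section 3.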

Citations
More filters
Journal ArticleDOI
TL;DR: The majority of algorithms used in practice today involve the Hastings algorithm, which generalizes the Metropolis algorithm to allow a much broader class of proposal distributions instead of just symmetric cases.
Abstract: In a 1970 Biometrika paper, W. K. Hastings developed a broad class of Markov chain algorithms for sampling from probability distributions that are difficult to sample from directly. The algorithm draws a candidate value from a proposal distribution and accepts the candidate with a probability that can be computed using only the unnormalized density of the target distribution, allowing one to sample from distributions known only up to a constant of proportionality. The stationary distribution of the corresponding Markov chain is the target distribution one is attempting to sample from. The Hastings algorithm generalizes the Metropolis algorithm to allow a much broader class of proposal distributions instead of just symmetric cases. An important class of applications for the Hastings algorithm corresponds to sampling from Bayesian posterior distributions, which have densities given by a prior density multiplied by a likelihood function and divided by a normalizing constant equal to the marginal likelihood. The marginal likelihood is typically intractable, presenting a fundamental barrier to implementation in Bayesian statistics. This barrier can be overcome by Markov chain Monte Carlo sampling algorithms. Amazingly, even after 50 years, the majority of algorithms used in practice today involve the Hastings algorithm. This article provides a brief celebration of the continuing impact of this ingenious algorithm on the 50th anniversary of its publication.

36 citations

Posted Content
TL;DR: It is shown that asymptotically BMS keeps any covariate with predictive power for either the outcome or censoring times, and discards other covariates, and argues for using simple models that are computationally practical yet attain good power to detect potentially complex effects, despite misspecification.
Abstract: We discuss the role of misspecification and censoring on Bayesian model selection in the contexts of right-censored survival and concave log-likelihood regression. Misspecification includes wrongly assuming the censoring mechanism to be non-informative. Emphasis is placed on additive accelerated failure time, Cox proportional hazards and probit models. We offer a theoretical treatment that includes local and non-local priors, and a general non-linear effect decomposition to improve power-sparsity trade-offs. We discuss a fundamental question: what solution can one hope to obtain when (inevitably) models are misspecified, and how to interpret it? Asymptotically, covariates that do not have predictive power for neither the outcome nor (for survival data) censoring times, in the sense of reducing a likelihood-associated loss, are discarded. Misspecification and censoring have an asymptotically negligible effect on false positives, but their impact on power is exponential. We show that it can be advantageous to consider simple models that are computationally practical yet attain good power to detect potentially complex effects, including the use of finite-dimensional basis to detect truly non-parametric effects. We also discuss algorithms to capitalize on sufficient statistics and fast likelihood approximations for Gaussian-based survival and binary models.

20 citations

Posted Content
TL;DR: This paper takes the reader on a chronological tour of Bayesian computation over the past two and a half centuries, and place all computational problems into a common framework, and describe all computational methods using a common notation.
Abstract: The Bayesian statistical paradigm uses the language of probability to express uncertainty about the phenomena that generate observed data. Probability distributions thus characterize Bayesian inference, with the rules of probability used to transform prior probability distributions for all unknowns - models, parameters, latent variables - into posterior distributions, subsequent to the observation of data. Conducting Bayesian inference requires the evaluation of integrals in which these probability distributions appear. Bayesian computation is all about evaluating such integrals in the typical case where no analytical solution exists. This paper takes the reader on a chronological tour of Bayesian computation over the past two and a half centuries. Beginning with the one-dimensional integral first confronted by Bayes in 1763, through to recent problems in which the unknowns number in the millions, we place all computational problems into a common framework, and describe all computational methods using a common notation. The aim is to help new researchers in particular - and more generally those interested in adopting a Bayesian approach to empirical work - make sense of the plethora of computational techniques that are now on offer; understand when and why different methods are useful; and see the links that do exist, between them all.

19 citations

Journal ArticleDOI
TL;DR: In this paper, adaptive Markov chain Monte Carlo (MCMC) algorithms are proposed for Bayesian variable selection in large-p, small-n settings, exploiting the observation that the majority of the p variables will be approximately uncorrelated a posteriori.
Abstract: The availability of datasets with large numbers of variables is rapidly increasing. The effective application of Bayesian variable selection methods for regression with these datasets has proved difficult since available Markov chain Monte Carlo methods do not perform well in typical problem sizes of interest. We propose new adaptive Markov chain Monte Carlo algorithms to address this shortcoming. The adaptive design of these algorithms exploits the observation that in large-p, small-n settings, the majority of the p variables will be approximately uncorrelated a posteriori. The algorithms adaptively build suitable nonlocal proposals that result in moves with squared jumping distance significantly larger than standard methods. Their performance is studied empirically in high-dimensional problems and speed-ups of up to four orders of magnitude are observed.

17 citations

Journal ArticleDOI
TL;DR: In this paper, a Monte Carlo algorithm is proposed to sample from high dimensional probability distributions that combines Markov chain Monte Carlo and importance sampling, which can be applied to Bayesian variable selection problems.
Abstract: We propose a Monte Carlo algorithm to sample from high dimensional probability distributions that combines Markov chain Monte Carlo and importance sampling. We provide a careful theoretical analysis, including guarantees on robustness to high dimensionality, explicit comparison with standard Markov chain Monte Carlo methods and illustrations of the potential improvements in efficiency. Simple and concrete intuition is provided for when the novel scheme is expected to outperform standard schemes. When applied to Bayesian variable-selection problems, the novel algorithm is orders of magnitude more efficient than available alternative sampling schemes and enables fast and reliable fully Bayesian inferences with tens of thousands of regressors.

11 citations

References
Journal ArticleDOI
15 Jul 1992-EPL
TL;DR: In this article, the authors proposed a new global optimization method (Simulated Tempering) for simulating effectively a system with a rough free-energy landscape (i.e., many coexisting states) at finite nonzero temperature.
Abstract: We propose a new global optimization method (Simulated Tempering) for simulating effectively a system with a rough free-energy landscape (i.e., many coexisting states) at finite nonzero temperature. This method is related to simulated annealing, but here the temperature becomes a dynamic variable, and the system is always kept at equilibrium. We analyse the method on the Random Field Ising Model, and we find a dramatic improvement over conventional Metropolis and cluster methods. We analyse and discuss the conditions under which the method has optimal performance.

1,723 citations

Journal ArticleDOI
TL;DR: A prometastatic program induced by TGF-β in the microenvironment that associates with a high risk of CRC relapse upon treatment is unveiled and could be exploited to improve the diagnosis and treatment of CRC.

874 citations

Journal ArticleDOI
TL;DR: This work proposes MCMC methods distantly related to simulated annealing, which simulate realizations from a sequence of distributions, allowing the distribution being simulated to vary randomly over time.
Abstract: Markov chain Monte Carlo (MCMC; the Metropolis-Hastings algorithm) has been used for many statistical problems, including Bayesian inference, likelihood inference, and tests of significance. Though the method generally works well, doubts about convergence often remain. Here we propose MCMC methods distantly related to simulated annealing. Our samplers mix rapidly enough to be usable for problems in which other methods would require eons of computing time. They simulate realizations from a sequence of distributions, allowing the distribution being simulated to vary randomly over time. If the sequence of distributions is well chosen, then the sampler will mix well and produce accurate answers for all the distributions. Even when there is only one distribution of interest, these annealing-like samplers may be the only known way to get a rapidly mixing sampler. These methods are essential for attacking very hard problems, which arise in areas such as statistical genetics. We illustrate the methods wi...

874 citations

Journal ArticleDOI
TL;DR: In this paper, the authors propose a partially noninformative prior structure related to a Natural Conjugate g-prior specification, where the amount of subjective information requested from the user is limited to the choice of a single scalar hyperparameter g0j.

815 citations

Journal ArticleDOI
TL;DR: The variable selection algorithm in the paper is substantially faster than previous Bayesian variable selection algorithms and compares favorably with a kernel-weighted local linear smoother.

639 citations

Frequently Asked Questions (13)
Q1. What are the contributions in "Scalable importance tempering and Bayesian variable selection"?

The Tempered Gibbs Sampler (TGS) proposed in this paper combines the best of MCMC and importance sampling in a way that is robust in high-dimensional contexts and ameliorates the slow mixing that plagues many Markov chain based schemes.

Another direction for further research might aim to reduce the cost per iteration of TGS when d is very large. A further possibility is to construct deterministic scan versions of TGS, which may be of value in contexts where deterministic scan Gibbs samplers are known to outperform random scan ones (see for example Roberts and Rosenthal, 2015). Further alternative methodology to overcome strong correlations in Gibbs Sampling includes the recently proposed adaptive MCMC approach of Duan et al. (2017) in the context of data augmentation models.

The authors ran wTGS for 500, 1000 and 30000 iterations for the DLD, TGFB172 and TGFB datasets, respectively, discarding the first 10% of samples as burn-in.

The standard way to draw samples from p(γ|Y ) is by performing Gibbs Sampling on the p components (γ1, . . . , γp), repeatedly choosing i ∈ {1, . . . , p} either in a random or deterministic scan fashion and then updating γi ∼ p(γi|Y, γ−i). 
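This random-scan update can be sketched in a few lines. The snippet below is purely illustrative: the unnormalised log-posterior is a hypothetical toy (an additive score per included variable minus a model-size penalty), not the marginal likelihood p(γ|Y) of any regression model, but the Bernoulli full-conditional update for each γi is the same mechanism described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(gamma, scores, penalty=2.0):
    # Toy unnormalised log-posterior over inclusion vectors gamma
    # (hypothetical stand-in for log p(gamma | Y)).
    return float(scores @ gamma - penalty * gamma.sum())

def gibbs_bvs(scores, n_iter=4000, rng=rng):
    p = len(scores)
    gamma = np.zeros(p)
    pips = np.zeros(p)
    for _ in range(n_iter):
        i = rng.integers(p)                    # random-scan coordinate choice
        g1 = gamma.copy(); g1[i] = 1.0
        g0 = gamma.copy(); g0[i] = 0.0
        # Full conditional p(gamma_i = 1 | Y, gamma_{-i}) via the log-odds
        a, b = log_post(g1, scores), log_post(g0, scores)
        prob1 = 1.0 / (1.0 + np.exp(b - a))
        gamma[i] = float(rng.random() < prob1)
        pips += gamma
    return pips / n_iter                       # ergodic-average PIP estimates

pip = gibbs_bvs(np.array([4.0, 3.0, -2.0, -2.0]))
# Coordinates with high scores receive high estimated inclusion probabilities.
```

Because the toy posterior factorises over coordinates, this chain mixes instantly; the strong posterior correlations that slow standard Gibbs Sampling (and motivate TGS) arise when the conditionals couple the γi.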

The faster mixing of wTGS for the most influential variables accelerates also the estimation of lower but non-negligible PIPs, such as coordinates 3 and 600 in Figures 4 and 5, respectively. 

A major concern with classical importance tempering schemes is that they often collapse in high-dimensional scenarios (see e.g. Owen, 2013, Sec.9.1). 

Binary inclusion variables in Bayesian Variable Selection models typically possess the kind of pairwise and/or negative dependence structures conjectured to be conducive to successful application of TGS in Section 3.5 (see Section 4.5 for a more detailed discussion). 

The qualitative conclusions of these simulations are not sensitive to various set-up details, such as: the value of d, the order of variables (especially in scenario 1) or the degree of symmetry. 

The reason is that the “overlap” between the target distribution f and a tempered version, such as g = f^(β) with β ∈ (0, 1), can be extremely low if f is a high-dimensional distribution.
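This weight degeneracy is easy to reproduce numerically. The sketch below is a hypothetical illustration (not the paper's analysis): for a standard Gaussian target f = N(0, I_d), the tempered density f^β is, after normalisation, N(0, I_d/β), so we can draw from g exactly and track how the normalised effective sample size of the importance weights w = f/g collapses as the dimension d grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def ess_fraction(d, beta=0.5, n=10_000, rng=rng):
    # Draw from the tempered proposal g = f^beta = N(0, I_d / beta).
    x = rng.normal(scale=1.0 / np.sqrt(beta), size=(n, d))
    sq = (x ** 2).sum(axis=1)
    # log w = log f - log g = -0.5 * (1 - beta) * ||x||^2, up to a constant
    # (additive constants cancel in the normalised ESS below).
    logw = -0.5 * (1.0 - beta) * sq
    logw -= logw.max()                  # stabilise before exponentiating
    w = np.exp(logw)
    # ESS / n = (sum w)^2 / (n * sum w^2), in (0, 1].
    return (w.sum() ** 2) / (n * (w ** 2).sum())

fracs = [ess_fraction(d) for d in (1, 10, 100)]
# The ESS fraction decays towards 0 as the dimension d grows.
```

The per-coordinate overlap between f and g multiplies across dimensions, so the effective sample size fraction decays geometrically in d, which is exactly the high-dimensional collapse referred to above.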

In the absence of prior information on the structure of the problem under consideration, the latter is a desirable robustness property, as it prevents the algorithm from updating some coordinates too often and ignoring others.

The authors performed 20 independent runs of each algorithm for each dataset with both c = n and c = p2, recording the resulting estimates of PIPs. 

The results suggest that, for appropriate choices of modified conditionals, the mixing time of x(t) is uniformly bounded regardless of the correlation structure of the target. 

The authors now show that, using the notion of de-initialising chains from Roberts and Rosenthal (2001), one can obtain a rather explicit understanding of the convergence behaviour of x(t) in the bivariate case.