Scalable importance tempering and Bayesian variable selection
Summary (5 min read)
1. Introduction
- Sampling from high-dimensional probability distributions is a common task arising in many scientific areas, such as Bayesian statistics, machine learning and statistical physics.
- The proposed scheme, which the authors call Tempered Gibbs Sampler (TGS), involves componentwise updating rather like Gibbs Sampling (GS), with improved mixing properties and associated importance weights which remain stable as dimension increases.
- Through an appropriately designed tempering mechanism, TGS circumvents the main limitations of standard GS, such as the slow mixing induced by strong posterior correlations.
2. The Tempered Gibbs Sampling scheme
- The authors shall assume the following condition on Z which is stronger than necessary, but which holds naturally for their purposes later on.
- Z(x) is bounded away from 0, and bounded above on compact sets.
- From (2), the functions h and h·w have identical support.
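For orientation, the quantities referenced above can be written out explicitly. The following is a reconstruction from the scheme as summarised here (the tempered conditionals g, the function Z, and the importance weights w); the exact normalisations should be checked against the paper:

$$
Z(x) = \frac{1}{d}\sum_{i=1}^{d}\frac{g(x_i \mid x_{-i})}{f(x_i \mid x_{-i})},\qquad
p_i(x) = \frac{g(x_i \mid x_{-i})}{d\,Z(x)\,f(x_i \mid x_{-i})},\qquad
w(x) = \frac{1}{Z(x)},
$$

so that each TGS iteration selects coordinate i with probability p_i(x), refreshes x_i ~ g(·|x−i), and corrects for the modified invariant distribution fZ through the weights w, as in the self-normalised estimator (3).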
2.1. Illustrative example.
- Consider the following illustrative example, where the target is a bivariate Gaussian with correlation ρ = 0.999.
- The left of Figure 1 displays the first 200 iterations of GS.
- The tempered conditional distributions of TGS allow the chain to move freely around the state space despite the strong correlation.
- With fully tempered conditionals, however, this comes at the price of high variability of the importance weights w(x(t)), which deteriorates the efficiency of the estimators ĥ_t^TGS defined in (3).
- On the other hand, the TGS scheme that uses tempered conditionals as in (5), which the authors refer to as TGS-mixed here, achieves both fast mixing of the Markov chain x(t) and low variance of the importance weights w(x(t)).
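As a concrete illustration, below is a minimal NumPy sketch of TGS-mixed on this bivariate Gaussian example, using the mixed conditional g = ½ f(xi|x−i) + ½ f^(β)(xi|x−i) of (5); the value β = 0.1 is an illustrative choice of ours, not taken from the paper. For a Gaussian conditional N(m, s²), the renormalised tempered conditional f^(β) is N(m, s²/β).

```python
import numpy as np
from scipy.stats import norm

rho, beta, d = 0.999, 0.1, 2    # target correlation; illustrative tempering level
s = np.sqrt(1.0 - rho**2)       # conditional std of f(x_i | x_-i)

def ratio_g_over_f(x, i):
    """Ratio g/f for the mixed conditional g = 0.5*f + 0.5*f^(beta), as in (5)."""
    m = rho * x[1 - i]                             # conditional mean (bivariate case)
    f_val = norm.pdf(x[i], m, s)
    f_beta = norm.pdf(x[i], m, s / np.sqrt(beta))  # renormalised tempered conditional
    return 0.5 + 0.5 * f_beta / f_val

def tgs_mixed(n_iter, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    states, weights = np.empty((n_iter, d)), np.empty(n_iter)
    for t in range(n_iter):
        r = np.array([ratio_g_over_f(x, i) for i in range(d)])
        states[t], weights[t] = x, 1.0 / r.mean()  # record state and w(x) = 1/Z(x)
        i = rng.choice(d, p=r / r.sum())           # tempered coordinate selection
        m = rho * x[1 - i]
        scale = s if rng.random() < 0.5 else s / np.sqrt(beta)
        x[i] = rng.normal(m, scale)                # draw x_i from the mixed conditional
    return states, weights

xs, w = tgs_mixed(20_000)
print("self-normalised estimate of E[x1]:", np.sum(w * xs[:, 0]) / w.sum())
```

Replacing the mixture by the pure tempered conditional f^(β) reproduces the high weight variability described above, while setting g = f recovers standard GS.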
3. Analysis of the algorithm
- In this section the authors provide a careful theoretical and empirical analysis of the TGS algorithm.
- The first aim is to provide theoretical guarantees on the robustness of TGS, in terms of both the variance of the importance sampling weights in high dimensions and the mixing of the resulting Markov chain relative to that of GS.
- The second aim is to provide understanding of which situations will be favourable to TGS and which will not.
- Throughout the paper, the authors measure the efficiency of Monte Carlo algorithms through their asymptotic variances.
3.1. Robustness to high-dimensionality.
- A major concern with classical importance tempering schemes is that they often collapse in high-dimensional scenarios (see e.g. Owen, 2013, Sec.9.1).
- On the contrary, the importance sampling procedure associated to TGS is robust to high-dimensional scenarios.
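The robustness can be made concrete with a one-line bound, assuming the mixed conditionals of (5), for which g(xi|x−i) ≥ ½ f(xi|x−i) pointwise:

$$
Z(x) = \frac{1}{d}\sum_{i=1}^{d}\frac{g(x_i \mid x_{-i})}{f(x_i \mid x_{-i})} \ \ge\ \frac{1}{2}
\qquad\Longrightarrow\qquad
w(x) = \frac{1}{Z(x)} \ \le\ 2 \quad \text{for every } x \text{ and every dimension } d,
$$

so the importance weights stay uniformly bounded as d grows, unlike classical tempering weights f(x)/f^(β)(x), which typically degrade rapidly with dimension because the overlap between f and f^(β) vanishes.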
3.2. Explicit comparison with standard Gibbs Sampling.
- The authors now compare the efficiency of the Monte Carlo estimators produced by TGS with the ones produced by classical GS.
- The following theorem shows that the efficiency of TGS estimators can never be worse than that of GS estimators by a factor larger than b².
- In general it is possible for var(h, TGS) to be finite when var(h, GS) is not.
- Choosing the tuning parameter too small, however, will reduce the potential benefit obtained with TGS, with the latter collapsing to GS when the parameter is 0, so that optimising it involves a compromise between these extremes.
- The optimal choice involves a trade-off between small variance of the importance sampling weights and fast mixing of the resulting Markov chain.
3.3. TGS and correlation structure.
- Theorem 1 implies that, under suitable choices of g(xi|x−i), TGS never provides significantly worse (i.e. worse by more than a controllable constant factor) efficiency than GS.
- On the other hand, TGS performance can be much better than that of standard GS.
- The underlying reason is that the tempering mechanism can dramatically speed up the convergence of the TGS Markov chain x(t) to its stationary distribution fZ by reducing correlations in the target.
- In fact, the covariance structure of fZ is substantially different from that of the original target f, and this can prevent the sampler from getting stuck in situations where GS would.
- Clearly, the same property does not hold for GS, whose mixing time deteriorates as ρ → 1.
3.5. When does TGS work and when does it not?
- The previous two sections showed that in the bivariate case TGS can induce much faster mixing compared to GS.
- The latter depends on the correlation structure of the target; the intuition is as follows.
- Also, it is worth noting that TGS does not require prior knowledge of the global correlation structure, or of which variables are strongly correlated, in order to be implemented.
- The reason for the presence or lack of improvements given by TGS lies in the different geometrical structure induced by positive and negative correlations.
- Intuitively, the authors conjecture that if the limiting singular distribution for ρ → 1 can be navigated with pairwise updates (i.e. moving on (xi, xj) “planes” rather than (xi) “lines” as for GS), then TGS should perform well (i.e. uniformly good mixing over ρ for a good choice of β); otherwise it will not.
3.6. Controlling the frequency of coordinate updating.
- In the absence of prior information on the structure of the problem under consideration, the latter is a desirable robustness property, as it prevents the algorithm from updating some coordinates too often and ignoring others.
- In some contexts, the authors may want to invest more computational effort in updating some coordinates rather than others (see for example the Bayesian Variable Selection problems discussed below).
- This can be done by multiplying the selection probability pi(x) by some weight function ηi(x−i), obtaining pi(x) ∝ ηi(x−i) g(xi|x−i)/f(xi|x−i), while leaving the rest of the algorithm unchanged.
- The authors call the resulting algorithm weighted Tempered Gibbs Sampling (wTGS).
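As a sketch of the modification, only the coordinate-selection step changes. The renormalisation of the importance weights below follows by analogy with the unweighted scheme and is an assumption of this sketch rather than a statement of the paper's exact formulas:

```python
import numpy as np

def wtgs_select(ratios, etas, rng):
    """Select a coordinate with probability p_i ∝ η_i(x_-i) * g(x_i|x_-i)/f(x_i|x_-i).

    ratios[i] = g(x_i|x_-i)/f(x_i|x_-i) and etas[i] = η_i(x_-i).
    Returns the chosen coordinate and the (assumed) importance weight 1/Z(x),
    with Z(x) taken to be the average of the η-weighted ratios.
    """
    r = np.asarray(etas) * np.asarray(ratios)
    i = rng.choice(len(r), p=r / r.sum())
    return i, 1.0 / r.mean()
```

Setting ηi ≡ 1 recovers plain TGS.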
4. Application to Bayesian Variable Selection
- The authors shall illustrate the theoretical and methodological conclusions of Section 3 in an important class of statistical models where Bayesian computational issues are known to be particularly challenging.
- Binary inclusion variables in Bayesian Variable Selection models typically possess the kind of pairwise and/or negative dependence structures conjectured to be conducive to successful application of TGS in Section 3.5 (see Section 4.5 for a more detailed discussion).
- Therefore, in this section the authors provide a detailed application of TGS to sampling from the posterior distribution of Gaussian Bayesian Variable Selection models.
- This is a widely used class of models where posterior inferences are computationally challenging due to the presence of high-dimensional discrete parameters.
- The Gibbs Sampler is the standard choice of algorithm to draw samples from the posterior distribution (see Section B.6 in the supplement for more details).
4.1. Model specification.
- Bayesian Variable Selection (BVS) models provide a natural and coherent framework to select a subset of explanatory variables in linear regression contexts (Chipman et al., 2001).
- Under this model set-up, the continuous hyperparameters β and σ can be analytically integrated and one is left with an explicit expression for p(γ|Y ).
4.2. TGS for Bayesian Variable Selection.
- For every value of i and γ−i, the authors set the tempered conditional distribution gi(γi|γ−i) to be the uniform distribution over {0, 1}.
- Since the target state space is discrete, it is more efficient to replace the Gibbs step of updating γi conditional on i and γ−i, with its Metropolised version (see e.g. Liu, 1996).
- The resulting specific instance of TGS is the following.
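The listing itself is not reproduced in this summary, but a sketch of one iteration follows from the description above: with gi uniform on {0, 1}, the selection probabilities become proportional to 1/p(γi|γ−i, Y), and the Metropolised update always accepts the flip of the selected coordinate. The helper cond_incl below, returning p(γi = 1|γ−i, Y), is a hypothetical placeholder for the model-specific computation (cf. Section B.1 of the supplement):

```python
import numpy as np

def tgs_bvs_step(gamma, rng, cond_incl):
    """One TGS iteration for BVS with uniform tempered conditionals on {0, 1}.

    gamma is a 0/1 numpy array of length p; cond_incl(gamma, i) is a
    placeholder returning p(gamma_i = 1 | gamma_-i, Y).
    """
    p = len(gamma)
    q1 = np.array([cond_incl(gamma, i) for i in range(p)])
    # conditional probability of the *current* value of each coordinate
    q_curr = np.where(gamma == 1, q1, 1.0 - q1)
    ratios = 0.5 / q_curr             # g_i/f_i with g_i = 1/2 on {0, 1}
    w = 1.0 / ratios.mean()           # importance weight of the current state
    i = rng.choice(p, p=ratios / ratios.sum())
    new_gamma = gamma.copy()
    new_gamma[i] = 1 - new_gamma[i]   # Metropolised step: the flip is always accepted
    return new_gamma, w
```

In a full run one records each visited γ together with its weight w before the flip, and estimates posterior quantities such as PIPs with the self-normalised estimator.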
4.3. wTGS for BVS.
- As discussed in Section 3.6, TGS updates each coordinate with the same frequency.
- In a BVS context, however, this may be inefficient as the resulting sampler would spend most iterations updating variables that have low or negligible posterior inclusion probability, especially when p gets large.
- Here p(γi = 1|Y ) denotes the posterior probability that γi equals 1, while p(γi = 1|γ−i, Y ) denotes the probability of the same event conditional on both the observed data Y and the current value of γ−i.
- Note that with wTGS one can obtain a frequency of updating of the i-th component proportional to p(γi = 1|Y ) without knowing the actual value of p(γi = 1|Y ), but rather using only the conditional expressions p(γi = 1|γ−i, Y ).
- Choosing a uniform frequency of updating favours exploration, as it forces the sampler to explore new regions of the space by flipping variables with low conditional inclusion probability.
4.4. Efficient implementation and Rao-Blackwellisation.
- Compared to GS, TGS and wTGS provide substantially improved convergence properties at the price of an increased computational cost per iteration.
- The additional cost is computing {p(γi|Y, γ−i)} for i = 1, . . . , p given γ ∈ {0, 1}p, which can be done efficiently through vectorised operations as described in Section B.1 of the supplement.
- Such efficient implementation is crucial to the successful application of these TGS schemes.
- The resulting cost per iteration of TGS and wTGS is of order O(np + |γ|p).
- See Section B.2 of the supplement for derivations of these expressions.
4.5. Computational complexity results for simple BVS scenarios
- The authors study the computational complexity of GS, TGS and wTGS in some simple BVS scenarios.
- In the first case the posterior distribution p(γ|Y) features independent components and is thus the ideal case for GS, while in the second case it features some maximally correlated components and is thus a worst-case scenario for GS.
- The authors' results show that the computational complexity of TGS and wTGS is not affected by the change in correlation structure between the two scenarios.
- See Section B.4 of the supplement for a quantitative example.
- Intuitively, strongly correlated regressors provide the same type of information regarding Y .
4.5.3. Fully collinear case
- The authors now consider the other extreme case, where there are maximally correlated regressors.
- In particular, suppose that m out of the p available regressors are perfectly collinear among themselves and with the data vector (i.e. each regressor fully explains the data), while the other p−m regressors are orthogonal to the first m ones.
- The XᵀX matrix resulting from the scenario described above is not full rank.
5.1. Illustrative example.
- The differences between GS, TGS and wTGS can be well illustrated considering a scenario where two regressors with good explanatory power are strongly correlated.
- As a result, the Gibbs Sampler (GS) will get stuck in one of the two local modes corresponding to one variable being active and the other inactive.
- All chains were started from the empty model (γi = 0 for every i).
- TGS and wTGS, which have a roughly equivalent cost per iteration, were run for 30000 iterations, after a burn-in of 5000 iterations.
- For example variable 3 in this simulated data has a PIP of roughly 0.05.
5.2. Simulated data.
- The authors compare the efficiency of GS, TGS and wTGS under different simulated scenarios.
- Scenarios analogous to the ones above have been previously considered in the literature.
- More precisely, the authors define the relative efficiency of TGS over GS as Eff_TGS / Eff_GS = (ESS_TGS / T_TGS) / (ESS_GS / T_GS) = σ²_GS T_GS / (σ²_TGS T_TGS), (20) where σ²_GS and σ²_TGS are the variances of the Monte Carlo estimators produced by GS and TGS, respectively, while T_GS and T_TGS are the CPU times required to produce such estimators.
- For each simulated dataset, the authors computed the relative efficiency defined by (20) for each PIP estimator, thus getting p values, one for each variable.
- Table 1 reports the median of such p values for each dataset under consideration.
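For concreteness, (20) amounts to the following computation, with ESS the effective sample size and T the CPU time of the corresponding run:

```python
def relative_efficiency(ess_tgs, t_tgs, ess_gs, t_gs):
    """Relative efficiency of TGS over GS as in (20): ratio of time-normalised ESS."""
    return (ess_tgs / t_tgs) / (ess_gs / t_gs)
```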
5.3. Real data.
- In this section the authors consider three real datasets with increasing number of covariates.
- The authors compare wTGS to GS and the Hamming Ball (HB) sampler, a recently proposed sampling scheme designed for posterior distributions over discrete spaces, including BVS models (Titsias and Yau, 2017).
- Two of these datasets are described in Section 5.3 of Rossell and Telesca (2017) and are freely available from the corresponding supplementary material.
- Points close to the diagonal indicate estimates that agree across runs, while points far from the diagonal indicate otherwise.
- The authors performed further experiments in order to compare the performance of wTGS with that of available R packages for BVS and some alternative methodologies from the literature.
6. Discussion
- The authors have introduced a novel Gibbs sampler variant, demonstrating its considerable potential both in toy examples and in more realistic Bayesian Variable Selection models, and giving underpinning theory to support the use of the method and to explain its impressive convergence properties.
- This would effectively reduce the probability of repeatedly updating the same coordinate in consecutive iterations, which, as shown in Proposition 5, can be interpreted as a rejected move.
- TGS provides a generic way of mitigating the worst effects of dependence on Gibbs sampler convergence.
- Compared with alternative approaches (…, 2017; Papaspiliopoulos et al., 2018), the generic implementations require the ability to perform Gibbs Sampling on generic linear transformations of the target, which is often not practical beyond the Gaussian case.
- Given the results of Sections 4 and 5, it would be interesting to explore the use of the methodology proposed in this paper for other BVS models, such as models with more elaborate priors (e.g. Johnson and Rossell, 2012) or binary response variables.
Frequently Asked Questions (13)
Q2. What future work is mentioned in the paper "Scalable importance tempering and Bayesian variable selection"?
Another direction for further research might aim to reduce the cost per iteration of TGS when d is very large. A further possibility for future research is to construct deterministic scan versions of TGS, which may be of value in contexts where deterministic scan Gibbs samplers are known to outperform random scan ones (see for example Roberts and Rosenthal, 2015). Further alternative methodologies to overcome strong correlations in Gibbs Sampling include the recently proposed adaptive MCMC approach of Duan et al. (2017) in the context of data augmentation models.
Q3. How many iterations did the authors run for the TGFB172 dataset?
The authors ran wTGS for 500, 1000 and 30000 iterations for the DLD, TGFB172 and TGFB datasets, respectively, discarding the first 10% of samples as burn-in.
Q4. What is the standard way to draw samples from p(|Y)?
The standard way to draw samples from p(γ|Y ) is by performing Gibbs Sampling on the p components (γ1, . . . , γp), repeatedly choosing i ∈ {1, . . . , p} either in a random or deterministic scan fashion and then updating γi ∼ p(γi|Y, γ−i).
Q5. What is the effect of wTGS on the estimation of PIPs?
The faster mixing of wTGS for the most influential variables accelerates also the estimation of lower but non-negligible PIPs, such as coordinates 3 and 600 in Figures 4 and 5, respectively.
Q6. What is the main concern with classical importance tempering schemes?
A major concern with classical importance tempering schemes is that they often collapse in high-dimensional scenarios (see e.g. Owen, 2013, Sec.9.1).
Q7. What is the common type of binary inclusion variable in Bayesian Variable Selection models?
Binary inclusion variables in Bayesian Variable Selection models typically possess the kind of pairwise and/or negative dependence structures conjectured to be conducive to successful application of TGS in Section 3.5 (see Section 4.5 for a more detailed discussion).
Q8. What are the qualitative conclusions of these simulations?
The qualitative conclusions of these simulations are not sensitive to various set-up details, such as: the value of d, the order of variables (especially in scenario 1) or the degree of symmetry.
Q9. Why is the overlap between the target distribution f and a tempered version so low?
The reason is that the “overlap” between the target distribution f and a tempered version, such as g = f^(β) with β ∈ (0, 1), can be extremely low if f is a high-dimensional distribution.
Q10. Why is updating every coordinate with the same frequency a desirable robustness property of TGS?
In the absence of prior information on the structure of the problem under consideration, this is a desirable robustness property, as it prevents the algorithm from updating some coordinates too often and ignoring others.
Q11. How many runs of each algorithm were performed for each dataset?
The authors performed 20 independent runs of each algorithm for each dataset with both c = n and c = p2, recording the resulting estimates of PIPs.
Q12. What is the effect of the modified conditional on the mixing time of x(t)?
The results suggest that, for appropriate choices of modified conditionals, the mixing time of x(t) is uniformly bounded regardless of the correlation structure of the target.
Q13. How can the authors obtain a more explicit understanding of the convergence behaviour of x(t)?
The authors show that, using the notion of deinitialising chains from Roberts and Rosenthal (2001), they can obtain a rather explicit understanding of the convergence behaviour of x(t) in the bivariate case.