
Showing papers in "Statistics and Computing in 2019"


Journal ArticleDOI
TL;DR: The authors showed that using predictive instead of in-sample goodness-of-fit scores improves the speed of Bayesian network structure learning and improves the accuracy of network reconstruction as well, as previously observed by Chickering and Heckerman.
Abstract: Learning the structure of Bayesian networks from data is known to be a computationally challenging, NP-hard problem. The literature has long investigated how to perform structure learning from data containing large numbers of variables, following a general interest in high-dimensional applications (“small n, large p”) in systems biology and genetics. More recently, data sets with large numbers of observations (the so-called “big data”) have become increasingly common; and these data sets are not necessarily high-dimensional, sometimes having only a few tens of variables depending on the application. We revisit the computational complexity of Bayesian network structure learning in this setting, showing that the common choice of measuring it with the number of estimated local distributions leads to unrealistic time complexity estimates for the most common class of score-based algorithms, greedy search. We then derive more accurate expressions under common distributional assumptions. These expressions suggest that the speed of Bayesian network learning can be improved by taking advantage of the availability of closed-form estimators for local distributions with few parents. Furthermore, we find that using predictive instead of in-sample goodness-of-fit scores improves speed; and we confirm that it improves the accuracy of network reconstruction as well, as previously observed by Chickering and Heckerman (Stat Comput 10: 55–62, 2000). We demonstrate these results on large real-world environmental and epidemiological data; and on reference data sets available from public repositories.

73 citations


Journal ArticleDOI
TL;DR: In empirical tests the new dynamic nested sampling method significantly improves calculation accuracy compared to standard nested sampling with the same number of samples; this increase in accuracy is equivalent to speeding up the computation by factors of up to 72 for parameter estimation and 7 for evidence calculations.
Abstract: We introduce dynamic nested sampling: a generalisation of the nested sampling algorithm in which the number of “live points” varies to allocate samples more efficiently. In empirical tests the new method significantly improves calculation accuracy compared to standard nested sampling with the same number of samples; this increase in accuracy is equivalent to speeding up the computation by factors of up to $$\sim 72$$ for parameter estimation and $$\sim 7$$ for evidence calculations. We also show that the accuracy of both parameter estimation and evidence calculations can be improved simultaneously. In addition, unlike in standard nested sampling, more accurate results can be obtained by continuing the calculation for longer. Popular standard nested sampling implementations can be easily adapted to perform dynamic nested sampling, and several dynamic nested sampling software packages are now publicly available.

73 citations


Journal ArticleDOI
TL;DR: In this paper, an alternative log-posterior gradient estimate for stochastic gradient MCMC is proposed, which uses control variates to reduce the variance of the gradient estimate; a separate zero-variance control-variate postprocessing step further reduces the variance of the MCMC output.
Abstract: It is well known that Markov chain Monte Carlo (MCMC) methods scale poorly with dataset size. A popular class of methods for solving this issue is stochastic gradient MCMC (SGMCMC). These methods use a noisy estimate of the gradient of the log-posterior, which reduces the per iteration computational cost of the algorithm. Despite this, there are a number of results suggesting that stochastic gradient Langevin dynamics (SGLD), probably the most popular of these methods, still has computational cost proportional to the dataset size. We suggest an alternative log-posterior gradient estimate for stochastic gradient MCMC which uses control variates to reduce the variance. We analyse SGLD using this gradient estimate, and show that, under log-concavity assumptions on the target distribution, the computational cost required for a given level of accuracy is independent of the dataset size. Next, we show that a different control-variate technique, known as zero variance control variates, can be applied to SGMCMC algorithms for free. This postprocessing step improves the inference of the algorithm by reducing the variance of the MCMC output. Zero variance control variates rely on the gradient of the log-posterior; we explore how the variance reduction is affected by replacing this with the noisy gradient estimate calculated by SGMCMC.
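
As a rough illustration of the control-variate gradient estimate described above, the sketch below runs SGLD-CV on a toy Gaussian mean model. The model, step size and reference point theta_hat are illustrative choices rather than the paper's setup; for this conjugate model the control variate happens to remove the subsampling noise entirely, which keeps the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x_i ~ N(2, 1); the target is the posterior of the mean theta
# under a N(0, 10^2) prior, so the posterior is essentially N(mean(x), 1/N).
N = 100_000
x = rng.normal(2.0, 1.0, size=N)

def grad_log_prior(theta):
    return -theta / 100.0

def grad_log_lik(theta, xi):          # x_i | theta ~ N(theta, 1)
    return xi - theta

# Fixed reference point (e.g. a rough posterior mode) and its full-data
# gradient, both computed once before sampling starts.
theta_hat = x.mean()
full_grad_hat = grad_log_lik(theta_hat, x).sum()

def sgld_cv(theta0, n_iter=5000, batch=100, step=2e-6):
    """SGLD with a control-variate gradient estimate (illustrative sketch)."""
    theta, out = theta0, []
    for _ in range(n_iter):
        idx = rng.integers(0, N, size=batch)
        # Control variate: full gradient at theta_hat plus a subsampled correction.
        correction = (N / batch) * (grad_log_lik(theta, x[idx])
                                    - grad_log_lik(theta_hat, x[idx])).sum()
        grad = grad_log_prior(theta) + full_grad_hat + correction
        theta = theta + 0.5 * step * grad + np.sqrt(step) * rng.normal()
        out.append(theta)
    return np.array(out)

samples = sgld_cv(theta_hat)
print(samples.mean(), samples.std())   # compare with x.mean() and 1/sqrt(N)
```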

72 citations


Journal ArticleDOI
TL;DR: In this work, deep Gaussian mixture models (DGMM) are introduced and discussed, a network of multiple layers of latent variables where, at each layer, the variables follow a mixture of Gaussian distributions.
Abstract: Deep learning is a hierarchical inference method formed by subsequent multiple layers of learning able to more efficiently describe complex relationships. In this work, deep Gaussian mixture models (DGMM) are introduced and discussed. A DGMM is a network of multiple layers of latent variables, where, at each layer, the variables follow a mixture of Gaussian distributions. Thus, the deep mixture model consists of a set of nested mixtures of linear models, which globally provide a nonlinear model able to describe the data in a very flexible way. In order to avoid overparameterized solutions, dimension reduction by factor models can be applied at each layer of the architecture, thus resulting in deep mixtures of factor analyzers.
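
To make the nested-mixture structure concrete, here is a minimal generative sketch of a two-layer deep Gaussian mixture: at each layer a component is drawn and the latent variable from the layer below is pushed through that component's linear-Gaussian map. Dimensions, weights and loadings are invented for illustration, and a shared isotropic noise level is used for brevity instead of component-specific covariances.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dgmm(n, layers):
    """Draw n samples from a deep Gaussian mixture model (illustrative sketch).

    `layers` lists, from the deepest layer to the observed one, tuples of
    (weights, means, loadings, noise_sd); each layer applies a randomly chosen
    linear-Gaussian transformation to the latent variable below it.
    """
    d_latent = layers[0][2].shape[2]          # dimension of the deepest latent z
    z = rng.normal(size=(n, d_latent))        # z ~ N(0, I) at the bottom
    for weights, means, loadings, noise_sd in layers:
        comp = rng.choice(len(weights), size=n, p=weights)
        z = (means[comp]
             + np.einsum('nij,nj->ni', loadings[comp], z)
             + noise_sd * rng.normal(size=(n, means.shape[1])))
    return z

# Two layers, latent dim 2 -> 2 -> observed dim 2, two components per layer.
layer = (np.array([0.5, 0.5]),                       # mixture weights
         np.array([[-3.0, 0.0], [3.0, 0.0]]),        # component means
         np.stack([np.eye(2), 0.5 * np.eye(2)]),     # component loading matrices
         0.3)                                        # isotropic noise sd
y = sample_dgmm(5000, [layer, layer])
print(y.mean(axis=0), y.std(axis=0))
```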

70 citations


Journal ArticleDOI
TL;DR: In this paper, the authors provide a new view of probabilistic ODE solvers as active inference agents operating on stochastic differential equation models that estimate the unknown initial value problem (IVP) solution from approximate observations of the solution derivative, as provided by the ODE dynamics.
Abstract: We study connections between ordinary differential equation (ODE) solvers and probabilistic regression methods in statistics. We provide a new view of probabilistic ODE solvers as active inference agents operating on stochastic differential equation models that estimate the unknown initial value problem (IVP) solution from approximate observations of the solution derivative, as provided by the ODE dynamics. Adding to this picture, we show that several multistep methods of Nordsieck form can be recast as Kalman filtering on q-times integrated Wiener processes. Doing so provides a family of IVP solvers that return a Gaussian posterior measure, rather than a point estimate. We show that some such methods have low computational overhead, nontrivial convergence order, and that the posterior has a calibrated concentration rate. Additionally, we suggest a step size adaptation algorithm which completes the proposed method to a practically useful implementation, which we experimentally evaluate using a representative set of standard codes in the DETEST benchmark set.
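
For readers who want to see what "Kalman filtering on q-times integrated Wiener processes" amounts to in practice, the sketch below builds the standard discrete-time transition and process-noise matrices of that prior for a step size h. This is the textbook construction of the prior, not the paper's solver code.

```python
import numpy as np
from math import factorial

def iwp_transition(q, h, sigma2=1.0):
    """Transition matrix A and process-noise covariance Q of a q-times
    integrated Wiener process prior over (x, x', ..., x^(q)), for step h."""
    d = q + 1
    A = np.zeros((d, d))
    Q = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            if j >= i:
                A[i, j] = h ** (j - i) / factorial(j - i)
            power = 2 * q + 1 - i - j
            Q[i, j] = sigma2 * h ** power / (power * factorial(q - i) * factorial(q - j))
    return A, Q

A, Q = iwp_transition(q=2, h=0.1)
print(A)   # upper-triangular, Taylor-like propagation of x, x', x''
print(Q)   # process noise injected mainly into the highest derivative
```

A probabilistic IVP solver then alternates the usual Kalman prediction step with these matrices and an update on the pseudo-observation that the modelled derivative matches the ODE vector field.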

60 citations


Journal ArticleDOI
TL;DR: This article attempts to place the emergence of probabilistic numerics as a mathematical-statistical research field within its historical context and to explore how its gradual development can be related both to applications and to a modern formal treatment.
Abstract: This article attempts to place the emergence of probabilistic numerics as a mathematical–statistical research field within its historical context and to explore how its gradual development can be related both to applications and to a modern formal treatment. We highlight in particular the parallel contributions of Sul’din and Larkin in the 1960s and how their pioneering early ideas have reached a degree of maturity in the intervening period, mediated by paradigms such as average-case analysis and information-based complexity. We provide a subjective assessment of the state of research in probabilistic numerics and highlight some difficulties to be addressed by future works.

42 citations


Journal ArticleDOI
TL;DR: In this paper, the authors formulate probabilistic numerical approximations to solutions of ordinary differential equations (ODEs) as problems in Gaussian process regression with nonlinear measurement functions.
Abstract: We formulate probabilistic numerical approximations to solutions of ordinary differential equations (ODEs) as problems in Gaussian process (GP) regression with nonlinear measurement functions. This is achieved by defining the measurement sequence to consist of the observations of the difference between the derivative of the GP and the vector field evaluated at the GP—which are all identically zero at the solution of the ODE. When the GP has a state-space representation, the problem can be reduced to a nonlinear Bayesian filtering problem and all widely used approximations to the Bayesian filtering and smoothing problems become applicable. Furthermore, all previous GP-based ODE solvers that are formulated in terms of generating synthetic measurements of the gradient field come out as specific approximations. Based on the nonlinear Bayesian filtering problem posed in this paper, we develop novel Gaussian solvers for which we establish favourable stability properties. Additionally, non-Gaussian approximations to the filtering problem are derived by the particle filter approach. The resulting solvers are compared with other probabilistic solvers in illustrative experiments.
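
The sketch below illustrates the nonlinear-measurement formulation on the logistic ODE: a once-integrated Wiener process prior over (x, x'), with an extended-Kalman-style update on the residual z = x' − f(x), which is zero along the true solution. The ODE, step size and linearised update are illustrative choices and correspond to only one of the Gaussian approximations the paper considers.

```python
import numpy as np

def f(x):            # logistic ODE: x' = x (1 - x)
    return x * (1.0 - x)

def df(x):           # its derivative (scalar Jacobian)
    return 1.0 - 2.0 * x

h, T, sigma2, R = 0.05, 10.0, 1.0, 0.0
A = np.array([[1.0, h], [0.0, 1.0]])                         # IWP(1) transition
Q = sigma2 * np.array([[h**3 / 3, h**2 / 2], [h**2 / 2, h]]) # IWP(1) process noise

m = np.array([0.1, f(0.1)])      # state = (x, x'), initialised at the IVP
P = np.zeros((2, 2))
traj = [m[0]]
for _ in range(int(T / h)):
    # Predict under the Gauss-Markov prior.
    m, P = A @ m, A @ P @ A.T + Q
    # Measurement z = x' - f(x), observed as 0; linearise f for the EKF update.
    z = m[1] - f(m[0])
    H = np.array([[-df(m[0]), 1.0]])
    S = H @ P @ H.T + R
    K = P @ H.T / S
    m = m - (K * z).ravel()
    P = P - K @ H @ P
    traj.append(m[0])

print(traj[-1])      # should approach the logistic fixed point x = 1
```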

38 citations


Journal ArticleDOI
TL;DR: A new way to find clusters in large vectors of time series using a measure of similarity between two time series, the generalized cross correlation; the procedure can be applied to large data sets and is useful for finding groups in dynamic factor models.
Abstract: We present a new way to find clusters in large vectors of time series by using a measure of similarity between two time series, the generalized cross correlation. This measure compares the determinant of the correlation matrix until some lag k of the bivariate vector with those of the two univariate time series. A matrix of similarities among the series based on this measure is used as input of a clustering algorithm. The procedure is automatic, can be applied to large data sets and it is useful to find groups in dynamic factor models. The cluster method is illustrated with some Monte Carlo experiments and a real data example.
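
A hedged sketch of the determinant comparison behind the similarity measure: the code computes the determinant of the lagged correlation matrix of the bivariate vector and of each univariate series, and turns their ratio into a similarity in [0, 1]. The exact normalisation used in the paper may differ, so treat this purely as an illustration of the idea.

```python
import numpy as np

def lagged_corr_det(series, k):
    """Determinant of the correlation matrix of (x_t, x_{t-1}, ..., x_{t-k})
    stacked over all series passed in `series` (each a 1-D array)."""
    lagged = [s[k - j: len(s) - j] for s in series for j in range(k + 1)]
    return np.linalg.det(np.corrcoef(np.vstack(lagged)))

def gcc_similarity(x, y, k=3):
    """Determinant-ratio similarity between two series up to lag k.

    Values near 1 indicate strong linear dependence at lags 0..k, values near 0
    little cross-dependence. The normalisation is illustrative and may differ
    from the definition in the paper."""
    num = lagged_corr_det([x, y], k)
    den = lagged_corr_det([x], k) * lagged_corr_det([y], k)
    return 1.0 - (num / den) ** (1.0 / (k + 1))

rng = np.random.default_rng(2)
e = rng.normal(size=500)
x = np.convolve(e, [1.0, 0.8], mode='same')        # two series driven by a common shock
y = np.convolve(e, [0.5, 1.0], mode='same') + 0.1 * rng.normal(size=500)
z = rng.normal(size=500)                           # an independent series
print(gcc_similarity(x, y), gcc_similarity(x, z))  # the first should be larger
```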

32 citations


Journal ArticleDOI
TL;DR: This work proposes to apply an additional smoothing operation to the Voronoi estimator, based on resampling the point pattern by independent random thinning, and shows that the resample-smoothing technique improves the estimation substantially.
Abstract: Voronoi estimators are non-parametric and adaptive estimators of the intensity of a point process. The intensity estimate at a given location is equal to the reciprocal of the size of the Voronoi/Dirichlet cell containing that location. Their major drawback is that they tend to paradoxically under-smooth the data in regions where the point density of the observed point pattern is high, and over-smooth where the point density is low. To remedy this behaviour, we propose to apply an additional smoothing operation to the Voronoi estimator, based on resampling the point pattern by independent random thinning. Through a simulation study we show that our resample-smoothing technique improves the estimation substantially. In addition, we study statistical properties such as unbiasedness and variance, and propose a rule-of-thumb and a data-driven cross-validation approach to choose the amount of smoothing to apply. Finally we apply our proposed intensity estimation scheme to two datasets: locations of pine saplings (planar point pattern) and motor vehicle traffic accidents (linear network point pattern).
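
A one-dimensional toy version of the resample-smoothing idea is easy to write down, since in 1D a point's Voronoi cell is the interval between the midpoints to its neighbours: thin the pattern with retention probability p, compute the Voronoi intensity estimate of the thinned pattern, rescale by 1/p, and average over independent thinnings. The paper works with planar and linear-network patterns, so this is only an illustration of the mechanism.

```python
import numpy as np

rng = np.random.default_rng(3)

def voronoi_intensity_1d(points, grid, lo, hi):
    """Voronoi intensity estimate on [lo, hi]: 1 / (length of the cell containing x)."""
    pts = np.sort(points)
    mids = (pts[:-1] + pts[1:]) / 2
    edges = np.concatenate(([lo], mids, [hi]))      # cell boundaries
    lengths = np.diff(edges)
    cell = np.searchsorted(edges, grid, side='right') - 1
    cell = np.clip(cell, 0, len(lengths) - 1)
    return 1.0 / lengths[cell]

def resample_smoothed(points, grid, lo, hi, p=0.2, n_thin=200):
    """Average Voronoi estimates over independent p-thinnings, rescaled by 1/p."""
    est = np.zeros_like(grid)
    for _ in range(n_thin):
        keep = points[rng.random(len(points)) < p]
        if len(keep) > 1:
            est += voronoi_intensity_1d(keep, grid, lo, hi) / p
    return est / n_thin

# Inhomogeneous Poisson process on [0, 1] with intensity 200 * x, by thinning.
cand = rng.uniform(0, 1, rng.poisson(200))
points = cand[rng.random(len(cand)) < cand]         # retain with probability x
grid = np.linspace(0.05, 0.95, 19)
print(np.round(resample_smoothed(points, grid, 0, 1), 1))  # roughly follows 200 * x
```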

32 citations


Journal ArticleDOI
TL;DR: In this paper, a penalized likelihood approach is employed for estimation and a general penalty term on the graph configurations can be used to induce different levels of sparsity and incorporate prior knowledge.
Abstract: Finite Gaussian mixture models are widely used for model-based clustering of continuous data. Nevertheless, since the number of model parameters scales quadratically with the number of variables, these models can be easily over-parameterized. For this reason, parsimonious models have been developed via covariance matrix decompositions or assuming local independence. However, these remedies do not allow for direct estimation of sparse covariance matrices nor do they take into account that the structure of association among the variables can vary from one cluster to the other. To this end, we introduce mixtures of Gaussian covariance graph models for model-based clustering with sparse covariance matrices. A penalized likelihood approach is employed for estimation and a general penalty term on the graph configurations can be used to induce different levels of sparsity and incorporate prior knowledge. Model estimation is carried out using a structural-EM algorithm for parameters and graph structure estimation, where two alternative strategies based on a genetic algorithm and an efficient stepwise search are proposed for inference. With this approach, sparse component covariance matrices are directly obtained. The framework results in a parsimonious model-based clustering of the data via a flexible model for the within-group joint distribution of the variables. Extensive simulated data experiments and application to illustrative datasets show that the method attains good classification performance and model quality. The general methodology for model-based clustering with sparse covariance matrices is implemented in the R package mixggm, available on CRAN.

30 citations


Journal ArticleDOI
TL;DR: In this article, a trimmed k-barycenter is proposed to trim the most discrepant distributions, which results in a gain in stability and robustness, highly convenient in this setting.
Abstract: A robust clustering method for probabilities in Wasserstein space is introduced. This new ‘trimmed k-barycenters’ approach relies on recent results on barycenters in Wasserstein space that allow intensive computation, as required by clustering algorithms to be feasible. The possibility of trimming the most discrepant distributions results in a gain in stability and robustness, highly convenient in this setting. As a remarkable application, we consider a parallelized clustering setup in which each of m units processes a portion of the data, producing a clustering report, encoded as k probabilities. We prove that the trimmed k-barycenter of the $$m\times k$$ reports produces a consistent aggregation which we consider the result of a ‘wide consensus’. We also prove that a weighted version of trimmed k-means algorithms based on k-barycenters in Wasserstein space keeps the descending character of the concentration step, guaranteeing convergence to local minima. We illustrate the methodology with simulated and real data examples. These include clustering populations by age distributions and analysis of cytometric data.

Journal ArticleDOI
TL;DR: A maximum penalized likelihood method for learning Bayesian networks from discrete or categorical data, which models the conditional distribution of a node given its parents by multi-logit regression instead of the commonly used multinomial distribution.
Abstract: Bayesian networks, with structure given by a directed acyclic graph (DAG), are a popular class of graphical models. However, learning Bayesian networks from discrete or categorical data is particularly challenging, due to the large parameter space and the difficulty in searching for a sparse structure. In this article, we develop a maximum penalized likelihood method to tackle this problem. Instead of the commonly used multinomial distribution, we model the conditional distribution of a node given its parents by multi-logit regression, in which an edge is parameterized by a set of coefficient vectors with dummy variables encoding the levels of a node. To obtain a sparse DAG, a group norm penalty is employed, and a blockwise coordinate descent algorithm is developed to maximize the penalized likelihood subject to the acyclicity constraint of a DAG. When interventional data are available, our method constructs a causal network, in which a directed edge represents a causal relation. We apply our method to various simulated and real data sets. The results show that our method is very competitive, compared to many existing methods, in DAG estimation from both interventional and high-dimensional observational data.

Journal ArticleDOI
TL;DR: In this paper, the authors extend the convergence analysis of probabilistic integrators for deterministic ordinary differential equations, as proposed by Conrad et al. (2017), to establish mean-square convergence in the uniform norm on discrete-or continuous-time solutions under relaxed regularity assumptions on the driving vector fields and their induced flows.
Abstract: Probabilistic integration of a continuous dynamical system is a way of systematically introducing discretisation error, at scales no larger than errors introduced by standard numerical discretisation, in order to enable thorough exploration of possible responses of the system to inputs. It is thus a potentially useful approach in a number of applications such as forward uncertainty quantification, inverse problems, and data assimilation. We extend the convergence analysis of probabilistic integrators for deterministic ordinary differential equations, as proposed by Conrad et al. (Stat Comput 27(4):1065–1082, 2017. https://doi.org/10.1007/s11222-016-9671-0), to establish mean-square convergence in the uniform norm on discrete- or continuous-time solutions under relaxed regularity assumptions on the driving vector fields and their induced flows. Specifically, we show that randomised high-order integrators for globally Lipschitz flows and randomised Euler integrators for dissipative vector fields with polynomially bounded local Lipschitz constants all have the same mean-square convergence rate as their deterministic counterparts, provided that the variance of the integration noise is not of higher order than the corresponding deterministic integrator. These and similar results are proven for probabilistic integrators where the random perturbations may be state-dependent, non-Gaussian, or non-centred random variables.

Journal ArticleDOI
TL;DR: A very efficient approach to the Monte Carlo estimation of the expected value of partial perfect information (EVPPI) that measures the average benefit of knowing the value of a subset of uncertain parameters involved in a decision model.
Abstract: In this paper, we develop a very efficient approach to the Monte Carlo estimation of the expected value of partial perfect information (EVPPI) that measures the average benefit of knowing the value of a subset of uncertain parameters involved in a decision model. The calculation of EVPPI is inherently a nested expectation problem, with an outer expectation with respect to one random variable X and an inner conditional expectation with respect to the other random variable Y. We tackle this problem by using a multilevel Monte Carlo (MLMC) method (Giles in Oper Res 56(3): 607–617, 2008) in which the number of inner samples for Y increases geometrically with level, so that the accuracy of estimating the inner conditional expectation improves and the cost also increases with level. We construct an antithetic MLMC estimator and provide sufficient assumptions on a decision model under which the antithetic property of the estimator is well exploited, and consequently a root-mean-square accuracy of $${\varepsilon }$$ can be achieved at a cost of $$O({\varepsilon }^{-2})$$ . Numerical results confirm the considerable computational savings compared to the standard, nested Monte Carlo method for some simple test cases and a more realistic medical application.
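
To make the nested-expectation structure explicit, the sketch below computes the plain nested Monte Carlo estimate of EVPPI for an invented two-decision toy model (net benefits X + Y and 0, with X and Y independent standard normals, so the true EVPPI is E[max(X, 0)] ≈ 0.399). The paper's contribution is the antithetic MLMC estimator that reaches a given accuracy far more cheaply; that construction is not shown here.

```python
import numpy as np

rng = np.random.default_rng(4)

def net_benefit(d, x, y):
    """Toy decision model: two options with net benefits x + y and 0."""
    return x + y if d == 1 else np.zeros_like(x)

def evppi_nested_mc(n_outer=2000, n_inner=500):
    """Standard nested Monte Carlo estimate of EVPPI for parameter X."""
    x = rng.normal(size=n_outer)
    y = rng.normal(size=(n_outer, n_inner))
    # Inner expectation over Y for each outer sample of X, for each decision.
    cond = np.stack([net_benefit(d, x[:, None], y).mean(axis=1) for d in (0, 1)])
    value_with_info = np.maximum(cond[0], cond[1]).mean()
    # Baseline: best decision without learning X.
    value_without = max(net_benefit(d, x, y[:, 0]).mean() for d in (0, 1))
    return value_with_info - value_without

# True value for this toy model is E[max(X, 0)] = 1/sqrt(2*pi), about 0.399.
print(evppi_nested_mc())
```

The cost of this plain estimator grows quickly with the target accuracy precisely because of the inner loop, which is what the antithetic MLMC construction in the paper avoids.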

Journal ArticleDOI
TL;DR: This paper shows how the previously proposed MALA method can be extended to exploit irreversible stochastic dynamics as proposal distributions in the I-Jump sampler and explores how irreversibility can increase the efficiency of the samplers in different situations.
Abstract: In this paper, we propose irreversible versions of the Metropolis–Hastings (MH) and Metropolis-adjusted Langevin algorithm (MALA) with a main focus on the latter. For the former, we show how one can simply switch between different proposal and acceptance distributions upon rejection to obtain an irreversible jump sampler (I-Jump). The resulting algorithm has a simple implementation akin to MH, but with the demonstrated benefits of irreversibility. We then show how the previously proposed MALA method can also be extended to exploit irreversible stochastic dynamics as proposal distributions in the I-Jump sampler. Our experiments explore how irreversibility can increase the efficiency of the samplers in different situations.

Journal ArticleDOI
TL;DR: A case study showing how to do Gibbs sampling in a fully data-parallel manner on a graphics processing unit, for a large class of exchangeable models that admit latent variable representations; the implementation is demonstrated on a Horseshoe Probit regression model and scales to thousands of predictors and millions of data points.
Abstract: Gibbs sampling is a widely used Markov chain Monte Carlo (MCMC) method for numerically approximating integrals of interest in Bayesian statistics and other mathematical sciences. Many implementations of MCMC methods do not extend easily to parallel computing environments, as their inherently sequential nature incurs a large synchronization cost. In the case study illustrated by this paper, we show how to do Gibbs sampling in a fully data-parallel manner on a graphics processing unit, for a large class of exchangeable models that admit latent variable representations. Our approach takes a systems perspective, with emphasis placed on efficient use of compute hardware. We demonstrate our method on a Horseshoe Probit regression model and find that our implementation scales effectively to thousands of predictors and millions of data points simultaneously.
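
The data-parallel step can be sketched with the familiar latent-variable Gibbs sampler for probit regression: each iteration draws one truncated-normal latent variable per observation, and these draws are independent given the coefficients, so they vectorise across the whole dataset (the part that maps onto a GPU). NumPy stands in for GPU code here, and a plain Gaussian prior replaces the Horseshoe prior of the paper to keep the sketch short.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(5)

# Simulated probit data.
n, p = 5000, 10
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

tau2 = 10.0                                        # prior: beta ~ N(0, tau2 * I)
V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)      # fixed conditional covariance
L = np.linalg.cholesky(V)

beta = np.zeros(p)
for it in range(500):
    # Data-parallel step: one truncated-normal latent variable per observation,
    # drawn for all n observations at once (the GPU-friendly part).
    mu = X @ beta
    lo = np.where(y == 1, -mu, -np.inf)            # z > 0 if y = 1
    hi = np.where(y == 1, np.inf, -mu)             # z < 0 if y = 0
    z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
    # Conjugate update of the regression coefficients.
    beta = V @ (X.T @ z) + L @ rng.normal(size=p)

print(np.round(beta, 2))
print(np.round(beta_true, 2))
```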

Journal ArticleDOI
TL;DR: The proposed algorithm divides the big dataset into smaller subsets and provides a simple method to aggregate the subset posteriors to approximate the full-data posterior; it is coined “Double-Parallel Monte Carlo.”
Abstract: This paper proposes a simple, practical, and efficient MCMC algorithm for Bayesian analysis of big data. The proposed algorithm divides the big dataset into smaller subsets and provides a simple method to aggregate the subset posteriors to approximate the full-data posterior. To further speed up computation, the proposed algorithm employs the population stochastic approximation Monte Carlo algorithm, a parallel MCMC algorithm, to simulate from each subset posterior. Since this algorithm involves two levels of parallelism, data parallelism and simulation parallelism, it is coined "Double-Parallel Monte Carlo." The validity of the proposed algorithm is justified mathematically and numerically.
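
The divide-and-aggregate structure can be illustrated on a conjugate Gaussian-mean model, where each subset posterior can be sampled exactly (standing in for a parallel MCMC run such as pop-SAMC) and a simple recentre-and-rescale step plays the role of the aggregation. The aggregation rule below is an illustrative stand-in that relies on the conjugacy of this toy model; it is not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(6)

# Full data: x_i ~ N(theta, 1) with a flat prior on theta, so the full
# posterior of theta is N(mean(x), 1/N).
N, K = 100_000, 10
x = rng.normal(1.5, 1.0, size=N)
subsets = np.array_split(x, K)

def subset_posterior_samples(xk, n_samp=2000):
    """Exact draws from the posterior given one subset (stand-in for a
    parallel MCMC run on that subset)."""
    return rng.normal(xk.mean(), 1.0 / np.sqrt(len(xk)), size=n_samp)

draws = np.array([subset_posterior_samples(s) for s in subsets])   # (K, n_samp)

# Aggregation sketch: recentre each subset chain at the average of the subset
# means and shrink the spread by sqrt(K), mimicking the full posterior here.
centre = draws.mean(axis=1, keepdims=True)
approx_full = (centre.mean() + (draws - centre) / np.sqrt(K)).ravel()

print(approx_full.mean(), approx_full.std())   # compare with:
print(x.mean(), 1.0 / np.sqrt(N))
```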

Journal ArticleDOI
TL;DR: A new method to estimate large-scale multivariate normal probabilities that combines a hierarchical representation with processing of the covariance matrix to decompose the n-dimensional problem into a sequence of smaller m-dimensional ones, together with an inexpensive block reordering strategy that provides improved accuracy in the overall probability computation.
Abstract: This paper presents a new method to estimate large-scale multivariate normal probabilities. The approach combines a hierarchical representation with processing of the covariance matrix that decomposes the n-dimensional problem into a sequence of smaller m-dimensional ones. It also includes a d-dimensional conditioning method that further decomposes the m-dimensional problems into smaller d-dimensional problems. The resulting two-level hierarchical-block conditioning method requires Monte Carlo simulations to be performed only in d dimensions, with $$d \ll n$$ , and allows the complexity of the algorithm’s major cost to be $$O(n \log n)$$ . The run-time cost of the method depends on two parameters, m and d, where m represents the diagonal block size and controls the sizes of the blocks of the covariance matrix that are replaced by low-rank approximations, and d allows a trade-off of accuracy for expensive computations in the evaluation of the probabilities of m-dimensional blocks. We also introduce an inexpensive block reordering strategy to provide improved accuracy in the overall probability computation. The downside of this method, as with other such conditioning approximations, is the absence of an internal estimate of its error to use in tuning the approximation. Numerical simulations on problems from 2D spatial statistics with dimensions up to 16,384 indicate that the algorithm achieves a $$1\%$$ error level and improves the run time over a one-level hierarchical Quasi-Monte Carlo method by a factor between 10 and 15.

Journal ArticleDOI
TL;DR: In this article, it is shown that the worst-case integration error in reproducing kernel Hilbert spaces of standard Monte Carlo methods with n random points decays as $$n^{-1/2}$$, and that re-weighting the random points, as in the Bayesian Monte Carlo method, can yield optimal approximation rates up to a logarithmic factor for Sobolev spaces on closed Riemannian manifolds.
Abstract: The worst case integration error in reproducing kernel Hilbert spaces of standard Monte Carlo methods with n random points decays as $$n^{-1/2}$$. However, the re-weighting of random points, as exemplified in the Bayesian Monte Carlo method, can sometimes be used to improve the convergence order. This paper contributes general theoretical results for Sobolev spaces on closed Riemannian manifolds, where we verify that such re-weighting yields optimal approximation rates up to a logarithmic factor. We also provide numerical experiments matching the theoretical results for some Sobolev spaces on the sphere $${\mathbb {S}}^2$$ and on the Grassmannian manifold $${\mathcal {G}}_{2,4}$$. Our theoretical findings also cover function spaces on more general sets such as the unit ball, the cube, and the simplex.

Journal ArticleDOI
TL;DR: In this article, Langevin diffusions with stationary distributions equal to well-known distributions from directional statistics are considered as toroidal analogues of the Ornstein-Uhlenbeck process and their likelihood function is a product of transition densities with no analytical expression.
Abstract: We introduce stochastic models for continuous-time evolution of angles and develop their estimation. We focus on studying Langevin diffusions with stationary distributions equal to well-known distributions from directional statistics, since such diffusions can be regarded as toroidal analogues of the Ornstein–Uhlenbeck process. Their likelihood function is a product of transition densities with no analytical expression, but that can be calculated by solving the Fokker–Planck equation numerically through adequate schemes. We propose three approximate likelihoods that are computationally tractable: (i) a likelihood based on the stationary distribution; (ii) toroidal adaptations of the Euler and Shoji–Ozaki pseudo-likelihoods; (iii) a likelihood based on a specific approximation to the transition density of the wrapped normal process. A simulation study compares, in dimensions one and two, the approximate transition densities to the exact ones, and investigates the empirical performance of the approximate likelihoods. Finally, two diffusions are used to model the evolution of the backbone angles of the protein G (PDB identifier 1GB1) during a molecular dynamics simulation. The software package sdetorus implements the estimation methods and applications presented in the paper.
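
The kind of diffusion discussed above is easy to simulate directly: the sketch below runs Euler–Maruyama for the Langevin diffusion whose stationary law is a von Mises distribution, wrapping the angle back to [−π, π) after every step. Parameter values are illustrative, and this is not the sdetorus implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

mu, kappa, sigma = 0.0, 2.0, 1.0        # stationary law: von Mises(mu, kappa)

def drift(theta):
    # Langevin drift (sigma^2 / 2) * d/dtheta log p(theta) for the von Mises density.
    return -0.5 * sigma**2 * kappa * np.sin(theta - mu)

def wrap(theta):
    """Wrap angles to [-pi, pi)."""
    return np.mod(theta + np.pi, 2 * np.pi) - np.pi

def simulate(theta0=np.pi / 2, dt=0.01, n_steps=100_000):
    theta = np.empty(n_steps)
    theta[0] = theta0
    for t in range(1, n_steps):
        step = drift(theta[t - 1]) * dt + sigma * np.sqrt(dt) * rng.normal()
        theta[t] = wrap(theta[t - 1] + step)
    return theta

path = simulate()
# The circular mean of the path should be close to mu = 0.
print(np.angle(np.mean(np.exp(1j * path))))
```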

Journal ArticleDOI
TL;DR: A Bayesian nonparametric prior based on a mixture of B-spline distributions is specified and provides more accurate Monte Carlo estimates in terms of $$L_1$$-error and uniform coverage probabilities than the Bernstein polynomial prior.
Abstract: We present a new Bayesian nonparametric approach to estimating the spectral density of a stationary time series. A nonparametric prior based on a mixture of B-spline distributions is specified and can be regarded as a generalization of the Bernstein polynomial prior of Petrone (Scand J Stat 26:373–393, 1999a; Can J Stat 27:105–126, 1999b) and Choudhuri et al. (J Am Stat Assoc 99(468):1050–1059, 2004). Whittle’s likelihood approximation is used to obtain the pseudo-posterior distribution. This method allows for a data-driven choice of the number of mixture components and the location of knots. Posterior samples are obtained using a Metropolis-within-Gibbs Markov chain Monte Carlo algorithm, and mixing is improved using parallel tempering. We conduct a simulation study to demonstrate that for complicated spectral densities, the B-spline prior provides more accurate Monte Carlo estimates in terms of $$L_1$$ -error and uniform coverage probabilities than the Bernstein polynomial prior. We apply the algorithm to annual mean sunspot data to estimate the solar cycle. Finally, we demonstrate the algorithm’s ability to estimate a spectral density with sharp features, using real gravitational wave detector data from LIGO’s sixth science run, recoloured to match the Advanced LIGO target sensitivity.
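
The Whittle approximation that the pseudo-posterior is built on is short enough to show directly: compute the periodogram at the Fourier frequencies and score a candidate spectral density against it. The sketch below evaluates it for an AR(1) series whose true spectral density is known; the B-spline mixture prior and the Metropolis-within-Gibbs sampler are not implemented here.

```python
import numpy as np

rng = np.random.default_rng(8)

def periodogram(x):
    """Periodogram at the positive Fourier frequencies 2*pi*j/n."""
    n = len(x)
    dft = np.fft.fft(x - x.mean())
    j = np.arange(1, (n - 1) // 2 + 1)
    return 2 * np.pi * j / n, np.abs(dft[j]) ** 2 / (2 * np.pi * n)

def whittle_loglik(spec_dens, freqs, pgram):
    """Whittle log-likelihood of a candidate spectral density."""
    f = spec_dens(freqs)
    return -np.sum(np.log(f) + pgram / f)

def ar1_spec(phi, sigma2=1.0):
    return lambda w: sigma2 / (2 * np.pi * (1 - 2 * phi * np.cos(w) + phi**2))

# Simulate an AR(1) with phi = 0.7 and check that the Whittle likelihood
# prefers candidate spectral densities near the truth.
n, phi = 2048, 0.7
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()
freqs, pgram = periodogram(x)
for cand in (0.3, 0.5, 0.7, 0.9):
    print(cand, round(whittle_loglik(ar1_spec(cand), freqs, pgram), 1))
```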

Journal ArticleDOI
TL;DR: In this paper, the integrand is modeled as an instance of a Gaussian stochastic process parameterized by a constant mean and a covariance kernel defined by a scale parameter times a parameterized function specifying how the integrand values at two different points in the domain are related.
Abstract: Automatic cubatures approximate integrals to user-specified error tolerances. For high-dimensional problems, it is difficult to adaptively change the sampling pattern, but one can automatically determine the sample size, n, given a reasonable, fixed sampling pattern. We take this approach here using a Bayesian perspective. We postulate that the integrand is an instance of a Gaussian stochastic process parameterized by a constant mean and a covariance kernel defined by a scale parameter times a parameterized function specifying how the integrand values at two different points in the domain are related. These hyperparameters are inferred or integrated out using integrand values via one of three techniques: empirical Bayes, full Bayes, or generalized cross-validation. The sample size, n, is increased until the half-width of the credible interval for the Bayesian posterior mean is no greater than the error tolerance. The process outlined above typically requires a computational cost of $$O(N_{\text {opt}}n^3)$$, where $$N_{\text {opt}}$$ is the number of optimization steps required to identify the hyperparameters. Our innovation is to pair low discrepancy nodes with matching covariance kernels to lower the computational cost to $$O(N_{\text {opt}} n \log n)$$. This approach is demonstrated explicitly with rank-1 lattice sequences and shift-invariant kernels. Our algorithm is implemented in the Guaranteed Automatic Integration Library (GAIL).

Journal ArticleDOI
TL;DR: In this article, a structure selection method for high-dimensional (€ d > 100€ ) sparse vine copulas is proposed, which uses a connection between the vine and structural equation models.
Abstract: We propose a novel structure selection method for high-dimensional ($$d > 100$$) sparse vine copulas. Current sequential greedy approaches for structure selection require calculating spanning trees in hundreds of dimensions and fitting the pair copulas and their parameters iteratively throughout the structure selection process. Our method uses a connection between the vine and structural equation models. The latter can be estimated very fast using the Lasso, also in very high dimensions, to obtain sparse models. Thus, we obtain a structure estimate independently of the chosen pair copulas and parameters. Additionally, we define the novel concept of regularization paths for R-vine matrices. It relates sparsity of the vine copula model in terms of independence copulas to a penalization coefficient in the structural equation models. We illustrate our approach and provide many numerical examples. These include simulations and data applications in high dimensions, showing the superiority of our approach to other existing methods.

Journal ArticleDOI
TL;DR: In this paper, an unrestricted skew-normal generalized hyperbolic (SUNGH) distribution is proposed for finite mixture modeling or clustering problems, which has several desirable properties, including an analytically tractable density and ease of computation for simulation and estimation of parameters.
Abstract: In this paper, we introduce an unrestricted skew-normal generalized hyperbolic (SUNGH) distribution for use in finite mixture modeling or clustering problems. The SUNGH is a broad class of flexible distributions that includes various other well-known asymmetric and symmetric families such as the scale mixtures of skew-normal, the skew-normal generalized hyperbolic and its corresponding symmetric versions. The class of distributions provides a much needed unified framework where the choice of the best fitting distribution can proceed quite naturally through either parameter estimation or by placing constraints on specific parameters and assessing through model choice criteria. The class has several desirable properties, including an analytically tractable density and ease of computation for simulation and estimation of parameters. We illustrate the flexibility of the proposed class of distributions in a mixture modeling context using a Bayesian framework and assess the performance using simulated and real data.

Journal ArticleDOI
TL;DR: In this work, surprisingly general conditions for equivalence of these disparate probabilistic linear solvers are presented, and connections between probabilistic linear solvers and projection methods for linear systems are described, providing a probabilistic interpretation of a far more general class of iterative methods.
Abstract: Several recent works have developed a new, probabilistic interpretation for numerical algorithms solving linear systems in which the solution is inferred in a Bayesian framework, either directly or by inferring the unknown action of the matrix inverse. These approaches have typically focused on replicating the behaviour of the conjugate gradient method as a prototypical iterative method. In this work, surprisingly general conditions for equivalence of these disparate methods are presented. We also describe connections between probabilistic linear solvers and projection methods for linear systems, providing a probabilistic interpretation of a far more general class of iterative methods. In particular, this provides such an interpretation of the generalised minimum residual method. A probabilistic view of preconditioning is also introduced. These developments unify the literature on probabilistic linear solvers and provide foundational connections to the literature on iterative solvers for linear systems.
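
The basic Bayesian view of a linear solver can be written in a few lines: place a Gaussian prior on the solution x of Ax = b, observe the projections SᵀAx = Sᵀb for some search directions S, and condition. The sketch below uses random search directions and a generic Gaussian conditioning step; the particular priors and direction choices under which this reproduces conjugate gradients or GMRES are what the paper characterises.

```python
import numpy as np

rng = np.random.default_rng(9)

def probabilistic_linear_solve(A, b, S, x0=None, Sigma0=None):
    """Gaussian inference for x in A x = b from projections S^T A x = S^T b.

    Returns the posterior mean and covariance under a N(x0, Sigma0) prior."""
    n = A.shape[0]
    x0 = np.zeros(n) if x0 is None else x0
    Sigma0 = np.eye(n) if Sigma0 is None else Sigma0
    M = S.T @ A                                   # observation operator
    gram = M @ Sigma0 @ M.T
    gain = np.linalg.solve(gram, M @ Sigma0).T    # Sigma0 M^T gram^{-1}
    mean = x0 + gain @ (S.T @ b - M @ x0)
    cov = Sigma0 - gain @ M @ Sigma0
    return mean, cov

n = 50
A = rng.normal(size=(n, n))
A = A @ A.T + n * np.eye(n)                       # symmetric positive definite
b = rng.normal(size=n)

for m in (10, 30, 50):
    S = rng.normal(size=(n, m))                   # m random search directions
    mean, cov = probabilistic_linear_solve(A, b, S)
    err = np.linalg.norm(mean - np.linalg.solve(A, b))
    print(m, round(err, 4), round(np.trace(cov), 2))  # error and uncertainty shrink
```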

Journal ArticleDOI
TL;DR: Two nonparametric Bayesian methods to cluster big data are proposed and applied to cluster genes by patterns of gene–gene interaction; they compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm.
Abstract: We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion which requires a single cycle (or few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel. In a second step, the method requires only the sufficient statistics of each of these local clusters to derive global clusters. Under simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool "Zodiac."

Journal ArticleDOI
TL;DR: A probabilistic model to cluster the nodes of a dynamic graph, accounting for the content of textual edges as well as their frequency, is developed, and an application to the Enron dataset is illustrated.
Abstract: The present paper develops a probabilistic model to cluster the nodes of a dynamic graph, accounting for the content of textual edges as well as their frequency. Vertices are clustered in groups which are homogeneous both in terms of interaction frequency and discussed topics. The dynamic graph is considered stationary on a latent time interval if the proportions of topics discussed between each pair of node groups do not change in time during that interval. A classification variational expectation-maximization (C-VEM) algorithm is adopted to perform inference. A model selection criterion is also derived to select the number of node groups, time clusters and topics. Experiments on simulated data are carried out to assess the proposed methodology. We finally illustrate an application to the Enron dataset.

Journal ArticleDOI
TL;DR: A natural gradient adaptation for the estimation process is presented, which relies on the property that the student-$$t$$ model naturally has an orthogonal parametrization and makes the Laplace approximation significantly more robust than the traditional approach using Newton’s method.
Abstract: We propose the Laplace method to derive approximate inference for Gaussian process (GP) regression in the location and scale parameters of the student-$$t$$ probabilistic model. This allows both mean and variance of data to vary as a function of covariates, with the attractive feature that the student-$$t$$ model has been widely used as a useful tool for robustifying data analysis. The challenge in the approximate inference for the model lies in the analytical intractability of the posterior distribution and the lack of concavity of the log-likelihood function. We present the natural gradient adaptation for the estimation process, which primarily relies on the property that the student-$$t$$ model naturally has an orthogonal parametrization. Due to this particular property of the model, the Laplace approximation becomes significantly more robust than the traditional approach using Newton’s method. We also introduce an alternative Laplace approximation by using the model’s Fisher information matrix. According to experiments, this alternative approximation provides very similar posterior approximations and predictive performance to the traditional Laplace approximation with the model’s Hessian matrix. However, the proposed Laplace–Fisher approximation is faster and more stable to calculate compared to the traditional Laplace approximation. We also compare both of these Laplace approximations with the Markov chain Monte Carlo (MCMC) method. We discuss how our approach can, in general, improve the inference algorithm in cases where the probabilistic model assumed for the data is not log-concave.

Journal ArticleDOI
TL;DR: A novel way to extend kernel methods for complete rankings to partial rankings is presented, via consistent Monte Carlo estimators for Gram matrices (matrices of kernel values between pairs of observations).
Abstract: In the modern age, rankings data are ubiquitous and they are useful for a variety of applications such as recommender systems, multi-object tracking and preference learning. However, most rankings data encountered in the real world are incomplete, which prevent the direct application of existing modelling tools for complete rankings. Our contribution is a novel way to extend kernel methods for complete rankings to partial rankings, via consistent Monte Carlo estimators for Gram matrices: matrices of kernel values between pairs of observations. We also present a novel variance-reduction scheme based on an antithetic variate construction between permutations to obtain an improved estimator for the Mallows kernel. The corresponding antithetic kernel estimator has lower variance, and we demonstrate empirically that it has a better performance in a variety of machine learning tasks. Both kernel estimators are based on extending kernel mean embeddings to the embedding of a set of full rankings consistent with an observed partial ranking. They form a computationally tractable alternative to previous approaches for partial rankings data. An overview of the existing kernels and metrics for permutations is also provided.
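
The Monte Carlo Gram-matrix idea can be illustrated for the Mallows kernel and top-k partial rankings: sample full rankings uniformly among those consistent with each observed partial ranking and average the kernel over the sampled pairs. The completion scheme and the kernel parameter below are illustrative, and the paper's antithetic variance-reduction construction is omitted.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(10)

def kendall_distance(a, b):
    """Number of discordant item pairs between two full rankings.

    Rankings are given as tuples of items ordered from best to worst."""
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(pos_b[x] > pos_b[y] for x, y in combinations(a, 2))

def mallows_kernel(a, b, lam=0.5):
    return np.exp(-lam * kendall_distance(a, b))

def random_completion(top, items):
    """Uniform full ranking consistent with an observed top-k ranking."""
    rest = [i for i in items if i not in top]
    return tuple(top) + tuple(rng.permutation(rest))

def mc_kernel(top_a, top_b, items, n_mc=200):
    """Monte Carlo estimate of the kernel between two top-k partial rankings."""
    vals = [mallows_kernel(random_completion(top_a, items),
                           random_completion(top_b, items))
            for _ in range(n_mc)]
    return float(np.mean(vals))

items = list(range(6))
print(mc_kernel((0, 1, 2), (0, 1, 3), items))   # similar top-3 lists
print(mc_kernel((0, 1, 2), (5, 4, 3), items))   # reversed preferences, smaller value
```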

Journal ArticleDOI
TL;DR: A method is introduced which can accurately detect changepoints in categorical data streams with fixed storage and computational requirements, and relies on the ability to adaptively monitor the category probabilities of a multinomial distribution, where temporal adaptivity is introduced using forgetting factors.
Abstract: The need for efficient tools is pressing in the era of big data, particularly in streaming data applications. As data streams are ubiquitous, the ability to accurately detect multiple changepoints, without affecting the continuous flow of data, is an important issue. Change detection for categorical data streams is understudied, and existing work commonly introduces fixed control parameters while providing little insight into how they may be chosen. This is ill-suited to the streaming paradigm, motivating the need for an approach that introduces few parameters which may be set without requiring any prior knowledge of the stream. This paper introduces such a method, which can accurately detect changepoints in categorical data streams with fixed storage and computational requirements. The detector relies on the ability to adaptively monitor the category probabilities of a multinomial distribution, where temporal adaptivity is introduced using forgetting factors. A novel adaptive threshold is also developed which can be computed given a desired false positive rate. This method is then compared to sequential and nonsequential change detectors in a large simulation study which verifies the usefulness of our approach. A real data set consisting of nearly 40 million events from a computer network is also investigated.
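
A simplified sketch of the adaptive monitoring idea: keep exponentially forgotten category counts at two different forgetting rates, turn them into probability estimates, and raise an alarm when the fast and slow estimates disagree by more than a threshold. The two-timescale comparison and the fixed threshold are placeholder choices, not the adaptive threshold derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(11)

def monitor_stream(stream, n_cat, lam_fast=0.98, lam_slow=0.999, threshold=0.3):
    """Flag time points where fast- and slow-forgetting estimates of the
    category probabilities disagree by more than `threshold` (sketch)."""
    fast = np.ones(n_cat)          # exponentially forgotten pseudo-counts
    slow = np.ones(n_cat)
    alarms = []
    for t, xt in enumerate(stream):
        onehot = np.eye(n_cat)[xt]
        fast = lam_fast * fast + onehot
        slow = lam_slow * slow + onehot
        p_fast, p_slow = fast / fast.sum(), slow / slow.sum()
        if t > 100 and np.abs(p_fast - p_slow).max() > threshold:
            alarms.append(t)
    return alarms

# Categorical stream with a change in category probabilities at t = 1000.
before = rng.choice(3, size=1000, p=[0.6, 0.3, 0.1])
after = rng.choice(3, size=1000, p=[0.1, 0.3, 0.6])
alarms = monitor_stream(np.concatenate([before, after]), n_cat=3)
print(alarms[:5] if alarms else "no alarm")   # alarms should begin soon after t = 1000
```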