
Showing papers in "Statistics and Computing in 2012"


Journal ArticleDOI
TL;DR: Approximate Bayesian Computation (ABC) methods, also known as likelihood-free techniques, have appeared in the past ten years as the most satisfactory approach to intractable likelihood problems, first in genetics then in a broader spectrum of applications as discussed by the authors.
Abstract: Approximate Bayesian Computation (ABC) methods, also known as likelihood-free techniques, have appeared in the past ten years as the most satisfactory approach to intractable likelihood problems, first in genetics then in a broader spectrum of applications. However, these methods suffer to some degree from calibration difficulties that make them rather volatile in their implementation and thus render them suspicious to the users of more traditional Monte Carlo methods. In this survey, we study the various improvements and extensions brought on the original ABC algorithm in recent years.

748 citations
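
For readers new to ABC, the following minimal rejection-ABC sketch in Python shows the basic accept/reject mechanism that the refinements surveyed in the paper build on. The toy Gaussian model, flat prior, summary statistic and tolerance are our own illustrative choices, not anything prescribed by the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "intractable" model: n i.i.d. N(theta, 1) observations, summarised by their mean
n_obs = 50
theta_true = 2.0
y_obs = rng.normal(theta_true, 1.0, size=n_obs)
s_obs = y_obs.mean()

# plain rejection ABC: draw theta from the prior, simulate data, and keep the draws
# whose simulated summary falls within eps of the observed summary
n_draws, eps = 100_000, 0.05
theta = rng.uniform(-10, 10, size=n_draws)            # flat prior on theta
y_sim = rng.normal(theta[:, None], 1.0, size=(n_draws, n_obs))
s_sim = y_sim.mean(axis=1)
posterior = theta[np.abs(s_sim - s_obs) <= eps]

print(f"kept {posterior.size} draws; approximate posterior mean = {posterior.mean():.3f}")
```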


Journal ArticleDOI
TL;DR: An adaptive SMC algorithm is proposed which admits a computational complexity that is linear in the number of samples and adaptively determines the simulation parameters.
Abstract: Approximate Bayesian computation (ABC) is a popular approach to address inference problems where the likelihood function is intractable, or expensive to calculate. To improve over Markov chain Monte Carlo (MCMC) implementations of ABC, the use of sequential Monte Carlo (SMC) methods has recently been suggested. Most effective SMC algorithms that are currently available for ABC have a computational complexity that is quadratic in the number of Monte Carlo samples (Beaumont et al., Biometrika 86:983–990, 2009; Peters et al., Technical report, 2008; Toni et al., J. Roy. Soc. Interface 6:187–202, 2009) and require the careful choice of simulation parameters. In this article an adaptive SMC algorithm is proposed which admits a computational complexity that is linear in the number of samples and adaptively determines the simulation parameters. We demonstrate our algorithm on a toy example and on a birth-death-mutation model arising in epidemiology.

530 citations


Journal ArticleDOI
TL;DR: The circumstances under which space-filling superiority holds are reviewed, some new arguments are provided and some motives to go beyond space- filling are clarified.
Abstract: When setting up a computer experiment, it has become a standard practice to select the inputs spread out uniformly across the available space. These so-called space-filling designs are now ubiquitous in corresponding publications and conferences. The statistical folklore is that such designs have superior properties when it comes to prediction and estimation of emulator functions. In this paper we review the circumstances under which this superiority holds, provide some new arguments and clarify the motives to go beyond space-filling. An overview of the state of the art of space-filling designs introduces and complements these results.

342 citations


Journal ArticleDOI
TL;DR: SUR (stepwise uncertainty reduction) strategies are derived from a Bayesian formulation of the problem of estimating a probability of failure of a function f using a Gaussian process model of f and aim at performing evaluations of f as efficiently as possible to infer the value of the probabilities of failure.
Abstract: This paper deals with the problem of estimating the volume of the excursion set of a function f: ℝ^d → ℝ above a given threshold, under a probability measure on ℝ^d that is assumed to be known. In the industrial world, this corresponds to the problem of estimating a probability of failure of a system. When only an expensive-to-simulate model of the system is available, the budget for simulations is usually severely limited and therefore classical Monte Carlo methods ought to be avoided. One of the main contributions of this article is to derive SUR (stepwise uncertainty reduction) strategies from a Bayesian formulation of the problem of estimating a probability of failure. These sequential strategies use a Gaussian process model of f and aim at performing evaluations of f as efficiently as possible to infer the value of the probability of failure. We compare these strategies to other strategies also based on a Gaussian process model for estimating a probability of failure.

330 citations


Journal ArticleDOI
TL;DR: A new robust adaptive Metropolis algorithm estimating the shape of the target distribution and simultaneously coercing the acceptance rate and showing promising behaviour in an example with Student target distribution having no finite second moment.
Abstract: The adaptive Metropolis (AM) algorithm of Haario, Saksman and Tamminen (Bernoulli 7(2):223–242, 2001) uses the estimated covariance of the target distribution in the proposal distribution. This paper introduces a new robust adaptive Metropolis algorithm estimating the shape of the target distribution and simultaneously coercing the acceptance rate. The adaptation rule is computationally simple, adding no extra cost compared with the AM algorithm. The adaptation strategy can be seen as a multidimensional extension of the previously proposed method adapting the scale of the proposal distribution in order to attain a given acceptance rate. The empirical results show promising behaviour of the new algorithm in an example with a Student target distribution having no finite second moment, where the AM covariance estimate is unstable. In the examples with finite second moments, the performance of the new approach seems to be competitive with the AM algorithm combined with scale adaptation.

267 citations
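
A minimal sketch of the kind of recursion described above, as we read it from the abstract: a Cholesky-type shape factor S of the proposal is updated by a rank-one correction that pushes the realised acceptance probability toward a target rate. The Cauchy example target, the step-size schedule and the target rate 0.234 are our own illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # heavy-tailed example: independent standard Cauchy margins (no finite second moment)
    return -np.sum(np.log1p(x ** 2))

def ram(n_iter=20_000, d=2, target_acc=0.234, gamma=0.66):
    x = np.zeros(d)
    S = np.eye(d)                              # Cholesky-type shape factor of the proposal
    lp = log_target(x)
    chain = np.empty((n_iter, d))
    for n in range(1, n_iter + 1):
        u = rng.normal(size=d)
        y = x + S @ u
        lq = log_target(y)
        alpha = np.exp(min(0.0, lq - lp))      # Metropolis acceptance probability
        if rng.uniform() < alpha:
            x, lp = y, lq
        # rank-one update of the shape, coercing the acceptance rate toward target_acc
        eta = min(1.0, d * n ** (-gamma))
        M = S @ (np.eye(d) + eta * (alpha - target_acc) * np.outer(u, u) / (u @ u)) @ S.T
        S = np.linalg.cholesky(0.5 * (M + M.T))
        chain[n - 1] = x
    return chain, S

chain, S = ram()
print("final proposal shape factor:\n", S)
```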


Journal ArticleDOI
TL;DR: It is shown that estimating a (non-zero) nugget can lead to surrogate models with better statistical properties, such as predictive accuracy and coverage, in a variety of common situations.
Abstract: Most surrogate models for computer experiments are interpolators, and the most common interpolator is a Gaussian process (GP) that deliberately omits a small-scale (measurement) error term called the nugget. The explanation is that computer experiments are, by definition, "deterministic", and so there is no measurement error. We think this is too narrow a focus for a computer experiment and a statistically inefficient way to model them. We show that estimating a (non-zero) nugget can lead to surrogate models with better statistical properties, such as predictive accuracy and coverage, in a variety of common situations.

240 citations
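
A small illustration of the point being made, using scikit-learn rather than the authors' code: fit a deterministic toy function once with an (essentially) interpolating GP and once with a GP whose nugget is estimated through a WhiteKernel, then compare out-of-sample accuracy and pointwise 95% coverage. The test function, design size and kernel choices are ours.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C

rng = np.random.default_rng(0)

def code(x):
    # stand-in for a deterministic computer code
    return np.sin(10 * x) + 0.2 * np.sin(35 * x)

X = rng.uniform(0, 1, size=(25, 1))
y = code(X).ravel()
X_new = np.linspace(0, 1, 200)[:, None]
y_new = code(X_new).ravel()

gp_interp = GaussianProcessRegressor(C(1.0) * RBF(0.1), normalize_y=True)     # ~ no nugget
gp_nugget = GaussianProcessRegressor(C(1.0) * RBF(0.1) + WhiteKernel(1e-3, (1e-10, 1e-1)),
                                     normalize_y=True)                        # estimated nugget

for name, gp in [("interpolator", gp_interp), ("with nugget", gp_nugget)]:
    gp.fit(X, y)
    mean, sd = gp.predict(X_new, return_std=True)
    rmse = float(np.sqrt(np.mean((mean - y_new) ** 2)))
    cover = float(np.mean(np.abs(y_new - mean) <= 1.96 * sd))  # pointwise 95% coverage
    print(f"{name}: rmse={rmse:.3f}, coverage={cover:.2f}")
```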


Journal ArticleDOI
TL;DR: An introduction to the slope heuristics and an overview of the theoretical and practical results about it are presented and a new practical approach is carried out and compared to the standard dimension jump method.
Abstract: Model selection is a general paradigm which includes many statistical problems. One of the most fruitful and popular approaches to carry it out is the minimization of a penalized criterion. Birgé and Massart (Probab. Theory Relat. Fields 138:33–73, 2006) have proposed a promising data-driven method to calibrate such criteria whose penalties are known up to a multiplicative factor: the "slope heuristics". Theoretical works validate this heuristic method in some situations and several papers report a promising practical behavior in various frameworks. The purpose of this work is twofold. First, an introduction to the slope heuristics and an overview of the theoretical and practical results about it are presented. Second, we focus on the practical difficulties occurring for applying the slope heuristics. A new practical approach is carried out and compared to the standard dimension jump method. All the practical solutions discussed in this paper in different frameworks are implemented and brought together in a Matlab graphical user interface called capushe. Supplemental Materials containing further information and an additional application, the capushe package and the datasets presented in this paper, are available on the journal Web site.

231 citations
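
For orientation, here is a sketch of the standard dimension-jump calibration that the paper takes as a baseline (the capushe package implements this and more; the toy model collection below and the penalty grid are our own, illustrative inputs): for each candidate penalty constant, select the model minimising the penalized contrast, locate the constant at which the selected dimension drops the most, and apply the factor-two rule.

```python
import numpy as np

def dimension_jump(dims, contrast, kappa_grid):
    """Dimension-jump calibration: select the model minimising contrast + kappa * dims for each
    kappa, find the kappa where the selected dimension drops the most, and return twice that
    value together with the finally selected dimension (the 'factor two' rule)."""
    dims, contrast = np.asarray(dims, float), np.asarray(contrast, float)
    selected = np.array([dims[np.argmin(contrast + k * dims)] for k in kappa_grid])
    drops = selected[:-1] - selected[1:]          # selected dimension is non-increasing in kappa
    kappa_final = 2.0 * kappa_grid[1:][np.argmax(drops)]
    return kappa_final, int(dims[np.argmin(contrast + kappa_final * dims)])

# toy collection of nested models: fast improvement up to dimension 5, then a linear "slope"
dims = np.arange(1, 51, dtype=float)
contrast = 1000.0 - np.where(dims <= 5, 60.0 * dims, 300.0 + (dims - 5.0))
print(dimension_jump(dims, contrast, np.linspace(0.0, 10.0, 2001)))
```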


Journal ArticleDOI
TL;DR: A novel strategy for simulating rare events and an associated Monte Carlo estimation of tail probabilities using a system of interacting particles and exploits a Feynman-Kac representation of that system to analyze their fluctuations.
Abstract: This paper discusses a novel strategy for simulating rare events and an associated Monte Carlo estimation of tail probabilities. Our method uses a system of interacting particles and exploits a Feynman-Kac representation of that system to analyze their fluctuations. Our precise analysis of the variance of a standard multilevel splitting algorithm reveals an opportunity for improvement. This leads to a novel method that relies on adaptive levels and produces, in the limit of an idealized version of the algorithm, estimates with optimal variance. The motivation for this theoretical work comes from problems occurring in watermarking and fingerprinting of digital contents, which represents a new field of applications of rare event simulation techniques. Some numerical results show performance close to the idealized version of our technique for these practical applications.

216 citations
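
A sketch of an idealised adaptive-level splitting estimator in the spirit of the abstract (not the authors' exact algorithm): to estimate P(score(X) > L) for a standard Gaussian X, repeatedly raise the level to the k-th smallest particle score, discard the particles below it, and regenerate them from the survivors with a conditioned pCN move. The Gaussian toy score and all tuning constants are ours.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def score(x):
    # rare-event score: first coordinate of a standard Gaussian vector
    return x[..., 0]

def ams_estimate(L, d=5, n_particles=200, n_discard=20, n_mcmc=10, rho=0.9):
    """Idealised adaptive multilevel splitting estimate of P(score(X) > L), X ~ N(0, I_d)."""
    X = rng.normal(size=(n_particles, d))
    log_p = 0.0
    while True:
        s = score(X)
        order = np.argsort(s)
        level = s[order[n_discard - 1]]           # adaptive level: n_discard-th smallest score
        if level >= L:
            break
        log_p += np.log(1.0 - n_discard / n_particles)
        idx = order[:n_discard]                   # particles to discard and regenerate
        X[idx] = X[rng.choice(order[n_discard:], size=n_discard)]   # clone survivors
        for _ in range(n_mcmc):                   # pCN move restricted to {score > level}
            prop = rho * X[idx] + np.sqrt(1 - rho ** 2) * rng.normal(size=(n_discard, d))
            ok = score(prop) > level
            X[idx[ok]] = prop[ok]
    return np.exp(log_p) * np.mean(score(X) > L)

print("splitting estimate:", ams_estimate(L=3.5), " exact:", norm.sf(3.5))
```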


Journal ArticleDOI
TL;DR: A novel family of mixture models wherein each component is modeled using a multivariate t-distribution with an eigen-decomposed covariance structure is put forth, known as the tEIGEN family.
Abstract: The last decade has seen an explosion of work on the use of mixture models for clustering. The use of the Gaussian mixture model has been common practice, with constraints sometimes imposed upon the component covariance matrices to give families of mixture models. Similar approaches have also been applied, albeit with less fecundity, to classification and discriminant analysis. In this paper, we begin with an introduction to model-based clustering and a succinct account of the state-of-the-art. We then put forth a novel family of mixture models wherein each component is modeled using a multivariate t-distribution with an eigen-decomposed covariance structure. This family, which is largely a t-analogue of the well-known MCLUST family, is known as the tEIGEN family. The efficacy of this family for clustering, classification, and discriminant analysis is illustrated with both real and simulated data. The performance of this family is compared to its Gaussian counterpart on three real data sets.

151 citations


Journal ArticleDOI
TL;DR: This article presents an ABC approximation designed to perform biased filtering for a Hidden Markov Model when the likelihood function is intractable and uses a sequential Monte Carlo algorithm to both fit and sample from the ABC approximation of the target probability density.
Abstract: Approximate Bayesian computation (ABC) has become a popular technique to facilitate Bayesian inference from complex models. In this article we present an ABC approximation designed to perform biased filtering for a Hidden Markov Model when the likelihood function is intractable. We use a sequential Monte Carlo (SMC) algorithm to both fit and sample from our ABC approximation of the target probability density. This approach is shown to, empirically, be more accurate w.r.t. the original filter than competing methods. The theoretical bias of our method is investigated; it is shown that the bias goes to zero at the expense of increased computational effort. Our approach is illustrated on a constrained sequential lasso for portfolio allocation to 15 constituents of the FTSE 100 share index.

128 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a global sensitivity analysis methodology for stochastic computer codes, for which the result of each code run is itself random and the framework of the joint modeling of the mean and dispersion of heteroscedastic data is used.
Abstract: The global sensitivity analysis method used to quantify the influence of uncertain input variables on the variability in numerical model responses has already been applied to deterministic computer codes; deterministic means here that the same set of input variables always gives the same output value. This paper proposes a global sensitivity analysis methodology for stochastic computer codes, for which the result of each code run is itself random. The framework of the joint modeling of the mean and dispersion of heteroscedastic data is used. To deal with the complexity of computer experiment outputs, nonparametric joint models are discussed and a new Gaussian process-based joint model is proposed. The relevance of these models is analyzed based upon two case studies. Results show that the joint modeling approach yields accurate sensitivity index estimators even when heteroscedasticity is strong.

Journal ArticleDOI
TL;DR: A new Monte Carlo algorithm for the consistent and unbiased estimation of multidimensional integrals and the efficient sampling from multiddimensional densities is described, inspired by the classical splitting method.
Abstract: We describe a new Monte Carlo algorithm for the consistent and unbiased estimation of multidimensional integrals and the efficient sampling from multidimensional densities. The algorithm is inspired by the classical splitting method and can be applied to general static simulation models. We provide examples from rare-event probability estimation, counting, and sampling, demonstrating that the proposed method can outperform existing Markov chain sampling methods in terms of convergence speed and accuracy.

Journal ArticleDOI
TL;DR: An efficient EM algorithm for optimization with provable numerical convergence properties is proposed and the methodology to handle missing values in a sparse regression context is extended.
Abstract: We propose an ℓ1-regularized likelihood method for estimating the inverse covariance matrix in the high-dimensional multivariate normal model in the presence of missing data. Our method is based on the assumption that the data are missing at random (MAR), which also covers the case of data missing completely at random. The implementation of the method is non-trivial as the observed negative log-likelihood generally is a complicated and non-convex function. We propose an efficient EM algorithm for optimization with provable numerical convergence properties. Furthermore, we extend the methodology to handle missing values in a sparse regression context. We demonstrate both methods on simulated and real data.

Journal ArticleDOI
TL;DR: In this paper, a discriminative latent mixture (DLM) model is proposed to fit the data in a latent orthonormal discriminant subspace with an intrinsic dimension lower than the dimension of the original space.
Abstract: Clustering in high-dimensional spaces is nowadays a recurrent problem in many scientific domains but remains a difficult task, both in terms of clustering accuracy and of interpretability of the results. This paper presents a discriminative latent mixture (DLM) model which fits the data in a latent orthonormal discriminative subspace with an intrinsic dimension lower than the dimension of the original space. By constraining model parameters within and between groups, a family of 12 parsimonious DLM models is obtained, able to fit a wide range of situations. An estimation algorithm, called the Fisher-EM algorithm, is also proposed for estimating both the mixture parameters and the discriminative subspace. Experiments on simulated and real datasets highlight the good performance of the proposed approach as compared to existing clustering methods, while providing a useful representation of the clustered data. The method is also applied to the clustering of mass spectrometry data.

Journal ArticleDOI
TL;DR: A latent Markov quantile regression model for longitudinal data with non-informative drop-out that allows exact inference through an ad hoc EM-type algorithm based on appropriate recursions is proposed.
Abstract: We propose a latent Markov quantile regression model for longitudinal data with non-informative drop-out. The observations, conditionally on covariates, are modeled through an asymmetric Laplace distribution. Random effects are assumed to be time-varying and to follow a first order latent Markov chain. This latter assumption is easily interpretable and allows exact inference through an ad hoc EM-type algorithm based on appropriate recursions. Finally, we illustrate the model on a benchmark data set.

Journal ArticleDOI
TL;DR: In this paper, a variant of the sequential Monte Carlo sampler by incorporating the partial rejection control mechanism of Liu (2001) is presented, which can reduce the variance of the incremental importance weights when compared with standard sequential Monte-Carlo samplers.
Abstract: We present a variant of the sequential Monte Carlo sampler by incorporating the partial rejection control mechanism of Liu (2001). We show that the resulting algorithm can be considered as a sequential Monte Carlo sampler with a modified mutation kernel. We prove that the new sampler can reduce the variance of the incremental importance weights when compared with standard sequential Monte Carlo samplers, and provide a central limit theorem. Finally, the sampler is adapted for application under the challenging approximate Bayesian computation modelling framework.

Journal ArticleDOI
TL;DR: A Bayesian extension of the latent block model for model-based block clustering of data matrices considers a block model where block parameters may be integrated out and produces a posterior defined over the number of clusters in rows and columns and cluster memberships.
Abstract: We introduce a Bayesian extension of the latent block model for model-based block clustering of data matrices. Our approach considers a block model where block parameters may be integrated out. The result is a posterior defined over the number of clusters in rows and columns and cluster memberships. The number of row and column clusters need not be known in advance as these are sampled along with cluster memberships using Markov chain Monte Carlo. This differs from existing work on latent block models, where the number of clusters is assumed known or is chosen using some information criteria. We analyze both simulated and real data to validate the technique.

Journal ArticleDOI
TL;DR: This work derives exact, explicit and tractable formulae for the posterior distribution of variables such as the number of change-points or their positions and demonstrates that several classical Bayesian model selection criteria can be computed exactly.
Abstract: In segmentation problems, inference on change-point position and model selection are two difficult issues due to the discrete nature of change-points. In a Bayesian context, we derive exact, explicit and tractable formulae for the posterior distribution of variables such as the number of change-points or their positions. We also demonstrate that several classical Bayesian model selection criteria can be computed exactly. All these results are based on an efficient strategy to explore the whole segmentation space, which is very large. We illustrate our methodology on both simulated data and a comparative genomic hybridization profile.

Journal ArticleDOI
TL;DR: This work proposes an exact estimation procedure to obtain the maximum likelihood estimates of the fixed-effects and variance components, using a stochastic approximation of the EM algorithm, and compares the performance of the normal and the SMN models with two real data sets.
Abstract: Nonlinear mixed-effects models are very useful to analyze repeated measures data and are used in a variety of applications. Normal distributions for random effects and residual errors are usually assumed, but such assumptions make inferences vulnerable to the presence of outliers. In this work, we introduce an extension of a normal nonlinear mixed-effects model considering a subclass of elliptical contoured distributions for both random effects and residual errors. This elliptical subclass, the scale mixtures of normal (SMN) distributions, includes heavy-tailed multivariate distributions, such as the Student-t, the contaminated normal and the slash, among others, and represents an interesting alternative for accommodating outliers while maintaining the elegance and simplicity of maximum likelihood theory. We propose an exact estimation procedure to obtain the maximum likelihood estimates of the fixed effects and variance components, using a stochastic approximation of the EM algorithm. We compare the performance of the normal and the SMN models with two real data sets.

Journal ArticleDOI
TL;DR: This work employs an information-theoretical framework that can be used to construct appropriate (approximately sufficient) statistics by combining different statistics until the loss of information is minimized, and demonstrates that such sets of statistics can be constructed for both parameter estimation and model selection problems.
Abstract: For nearly any challenging scientific problem evaluation of the likelihood is problematic if not impossible. Approximate Bayesian computation (ABC) allows us to employ the whole Bayesian formalism to problems where we can use simulations from a model, but cannot evaluate the likelihood directly. When summary statistics of real and simulated data are compared, rather than the data directly, information is lost, unless the summary statistics are sufficient. Sufficient statistics are, however, rarely available, and without them ABC inferences must be treated with caution. Previously other authors have attempted to combine different statistics in order to construct (approximately) sufficient statistics using search and information heuristics. Here we employ an information-theoretical framework that can be used to construct appropriate (approximately sufficient) statistics by combining different statistics until the loss of information is minimized. We start from a potentially large number of different statistics and choose the smallest set that captures (nearly) the same information as the complete set. We then demonstrate that such sets of statistics can be constructed for both parameter estimation and model selection problems, and we apply our approach to a range of illustrative and real-world model selection problems.
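
As a much simplified stand-in for the information-theoretical construction described above, the following sketch merely screens candidate summary statistics by their estimated marginal mutual information with the parameter, using scikit-learn's k-NN estimator; the paper's procedure combines statistics so that the joint loss of information is minimised, which a marginal screen like this does not capture. The toy simulator, candidate statistics and 0.05 threshold are ours.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

# toy simulator: theta -> 100 draws from N(theta, 1)
n_sim = 3000
theta = rng.uniform(-5, 5, size=n_sim)
data = rng.normal(theta[:, None], 1.0, size=(n_sim, 100))

# candidate summary statistics (some informative about theta, some not)
stats = np.column_stack([
    data.mean(axis=1),
    np.median(data, axis=1),
    data.std(axis=1),
    data.max(axis=1) - data.min(axis=1),
])
names = ["mean", "median", "sd", "range"]

# screen candidates by estimated mutual information with the parameter
mi = mutual_info_regression(stats, theta, random_state=0)
keep = [n for n, m in sorted(zip(names, mi), key=lambda t: -t[1]) if m > 0.05]
print(dict(zip(names, np.round(mi, 3))), "->", keep)
```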

Journal ArticleDOI
TL;DR: This work considers a variation of the CE method whose performance does not deteriorate as the dimension of the problem increases, and illustrates the algorithm via a high-dimensional estimation problem in risk management.
Abstract: The cross-entropy (CE) method is an adaptive importance sampling procedure that has been successfully applied to a diverse range of complicated simulation problems. However, recent research has shown that in some high-dimensional settings, the likelihood ratio degeneracy problem becomes severe and the importance sampling estimator obtained from the CE algorithm becomes unreliable. We consider a variation of the CE method whose performance does not deteriorate as the dimension of the problem increases. We then illustrate the algorithm via a high-dimensional estimation problem in risk management.
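
For orientation, here is the standard CE importance-sampling scheme that the paper improves upon, not the paper's variant: estimate P(X_1 + ... + X_d > gamma) for i.i.d. standard normals by iteratively tilting the proposal means toward the elite samples. In high dimensions the likelihood ratios in this plain version degenerate, which is exactly the problem the article addresses. The toy event and tuning constants are ours.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def ce_rare_event(gamma, d=10, n=5000, rho=0.1, n_iter=20):
    """Standard CE importance sampling for p = P(X_1 + ... + X_d > gamma), X_i i.i.d. N(0,1)."""
    mu = np.zeros(d)
    for _ in range(n_iter):
        X = rng.normal(mu, 1.0, size=(n, d))
        S = X.sum(axis=1)
        level = min(gamma, np.quantile(S, 1.0 - rho))          # adaptive intermediate level
        # likelihood ratio between the nominal N(0, I) density and the tilted proposal
        logW = norm.logpdf(X).sum(axis=1) - norm.logpdf(X, loc=mu).sum(axis=1)
        elite = S >= level
        w = np.exp(logW[elite])
        mu = (w[:, None] * X[elite]).sum(axis=0) / w.sum()     # CE update of the tilt
        if level >= gamma:
            break
    X = rng.normal(mu, 1.0, size=(n, d))
    S = X.sum(axis=1)
    logW = norm.logpdf(X).sum(axis=1) - norm.logpdf(X, loc=mu).sum(axis=1)
    return np.mean((S > gamma) * np.exp(logW))

gamma = 15.0
print("CE estimate:", ce_rare_event(gamma), " exact:", norm.sf(gamma / np.sqrt(10)))
```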

Journal ArticleDOI
TL;DR: This paper presents Smooth Functional Tempering, a new population Markov Chain Monte Carlo approach for posterior estimation of parameters, which tempers towards data features rather than tempering via approximations to the posterior that are more heavily rooted in the prior.
Abstract: Differential equations are used in modeling diverse system behaviors in a wide variety of sciences. Methods for estimating the differential equation parameters traditionally depend on the inclusion of initial system states and numerically solving the equations. This paper presents Smooth Functional Tempering, a new population Markov Chain Monte Carlo approach for posterior estimation of parameters. The proposed method borrows insights from parallel tempering and model based smoothing to define a sequence of approximations to the posterior. The tempered approximations depend on relaxations of the solution to the differential equation model, reducing the need for estimating the initial system states and obtaining a numerical differential equation solution. Rather than tempering via approximations to the posterior that are more heavily rooted in the prior, this new method tempers towards data features. Using our proposed approach, we observed faster convergence and robustness to both initial values and prior distributions that do not reflect the features of the data. Two variations of the method are proposed and their performance is examined through simulation studies and a real application to the chemical reaction dynamics of producing nylon.

Journal ArticleDOI
TL;DR: This paper constructs kernels that reproduce the computer code complexity by mimicking its interaction structure by constructing a Kriging model suited for a general interaction structure, and will take advantage of the absence of interaction between some inputs.
Abstract: Kriging models have been widely used in computer experiments for the analysis of time-consuming computer codes. Based on kernels, they are flexible and can be tuned to many situations. In this paper, we construct kernels that reproduce the computer code complexity by mimicking its interaction structure. While the standard tensor-product kernel implicitly assumes that all interactions are active, the new kernels are suited for a general interaction structure, and will take advantage of the absence of interaction between some inputs. The methodology is twofold. First, the interaction structure is estimated from the data, using an initial standard Kriging model, and represented by a so-called FANOVA graph. New FANOVA-based sensitivity indices are introduced to detect active interactions. Then this graph is used to derive the form of the kernel, and the corresponding Kriging model is estimated by maximum likelihood. The performance of the overall procedure is illustrated by several 3-dimensional and 6-dimensional simulated and real examples. A substantial improvement is observed when the computer code has a relatively high level of complexity.

Journal ArticleDOI
TL;DR: This paper reviews the method of nonparametric combination of dependent permutation tests and its main properties along with some new results in experimental and observational situations (robust testing, multi-sided alternatives and testing for survival functions).
Abstract: In recent years permutation testing methods have increased both in number of applications and in solving complex multivariate problems. When available, permutation tests are essentially exact and nonparametric in a conditional context, where conditioning is on the pooled observed data set, which is often a set of sufficient statistics under the null hypothesis. In contrast, the reference null distribution of most parametric tests is only known asymptotically. Thus, for most sample sizes of practical interest, the possible lack of efficiency of permutation solutions may be compensated by the lack of approximation of parametric counterparts. There are many complex multivariate problems, quite common in empirical sciences, which are difficult to solve outside the conditional framework and in particular outside the method of nonparametric combination (NPC) of dependent permutation tests. In this paper we review this method and its main properties, along with some new results for experimental and observational situations (robust testing, multi-sided alternatives and testing for survival functions).
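
A compact sketch of the NPC idea with Fisher's combining function, on a toy two-sample problem with two dependent response variables and one-sided partial tests of our own construction: the same permutations are used for every partial test so that their dependence is preserved, and the combined statistic is itself calibrated by permutation.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy two-sample data with two dependent response variables, shifted in group 1
n1, n2 = 30, 30
g = np.repeat([0, 1], [n1, n2])
Y = rng.normal(size=(n1 + n2, 2))
Y[g == 1] += [0.6, 0.4]

def partial_stats(y, g):
    # partial test statistics: one group-mean difference per variable (one-sided, "greater")
    return y[g == 1].mean(axis=0) - y[g == 0].mean(axis=0)

B = 2000
T_obs = partial_stats(Y, g)
T_perm = np.array([partial_stats(Y, rng.permutation(g)) for _ in range(B)])

# partial permutation p-values: the same permutations serve every partial test
p_obs = (1 + np.sum(T_perm >= T_obs, axis=0)) / (B + 1)
p_perm = (1 + np.sum(T_perm[None, :, :] >= T_perm[:, None, :], axis=1)) / (B + 1)

# Fisher nonparametric combination and its permutation-calibrated global p-value
psi = lambda p: -2.0 * np.log(p).sum(axis=-1)
p_global = (1 + np.sum(psi(p_perm) >= psi(p_obs))) / (B + 1)
print("partial p-values:", p_obs, " NPC global p-value:", p_global)
```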

Journal ArticleDOI
TL;DR: The generalized Pareto distribution beyond a given threshold is combined with a nonparametric estimation approach below the threshold and this semiparametric setup is shown to generalize a few existing approaches and enables density estimation over the complete sample space.
Abstract: This paper is concerned with extreme value density estimation. The generalized Pareto distribution (GPD) beyond a given threshold is combined with a nonparametric estimation approach below the threshold. This semiparametric setup is shown to generalize a few existing approaches and enables density estimation over the complete sample space. Estimation is performed via the Bayesian paradigm, which helps identify model components. Estimation of all model parameters, including the threshold and higher quantiles, and prediction for future observations is provided. Simulation studies suggest a few useful guidelines to evaluate the relevance of the proposed procedures. They also provide empirical evidence about the improvement of the proposed methodology over existing approaches. Models are then applied to environmental data sets. The paper is concluded with a few directions for future work.
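
A frequentist toy version of the semiparametric construction (the paper works in a Bayesian framework and also treats the threshold as unknown): a kernel density estimate below a fixed threshold u is glued to a generalized Pareto tail above it, with the two pieces weighted by the empirical tail mass. The sample, the 90% threshold and the `density` helper are illustrative choices of ours.

```python
import numpy as np
from scipy.stats import genpareto, gaussian_kde

rng = np.random.default_rng(0)
x = rng.standard_t(df=4, size=2000)            # heavy-tailed sample

u = np.quantile(x, 0.90)                       # fixed tail threshold
zeta = np.mean(x > u)                          # empirical mass above the threshold

# GPD for the exceedances, kernel density estimate for the body
c, loc, scale = genpareto.fit(x[x > u] - u, floc=0.0)
kde = gaussian_kde(x[x <= u])
body_mass = kde.integrate_box_1d(-np.inf, u)   # renormalise the KDE to the body region

def density(t):
    t = np.atleast_1d(t)
    return np.where(t <= u,
                    (1.0 - zeta) * kde(t) / body_mass,
                    zeta * genpareto.pdf(t - u, c, loc=0.0, scale=scale))

print(density(np.linspace(-6, 8, 5)))
```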

Journal ArticleDOI
TL;DR: This work quantifies the phenomenon that, in certain conditions, the contrast between the nearest and the farthest neighbouring points vanishes as the data dimensionality increases, by bounding, in a distribution-free manner, the tails of the probability that distances become meaningless.
Abstract: Distance concentration is the phenomenon that, in certain conditions, the contrast between the nearest and the farthest neighbouring points vanishes as the data dimensionality increases. It affects high dimensional data processing, analysis, retrieval, and indexing, which all rely on some notion of distance or dissimilarity. Previous work has characterised this phenomenon in the limit of infinite dimensions. However, real data is finite dimensional, and hence the infinite-dimensional characterisation is insufficient. Here we quantify the phenomenon more precisely, for the possibly high but finite dimensional case in a distribution-free manner, by bounding the tails of the probability that distances become meaningless. As an application, we show how this can be used to assess the concentration of a given distance function in some unknown data distribution solely on the basis of an available data sample from it. This can be used to test and detect problematic cases more rigorously than it is currently possible, and we demonstrate the working of this approach on both synthetic data and ten real-world data sets from different domains.
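
The phenomenon itself is easy to reproduce empirically; the few lines below (an illustration of ours, not the paper's bounds) print the relative contrast (D_max − D_min)/D_min from a random query point to a uniform sample as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# empirical relative contrast shrinks as the dimension grows
n = 1000
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))
    q = rng.uniform(size=d)
    dist = np.linalg.norm(X - q, axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast = {contrast:.3f}")
```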

Journal ArticleDOI
TL;DR: The Gaussian rank correlation as discussed by the authors is the usual correlation coefficient computed from the normal scores of the data and it has attractive robustness properties, in particular, its breakdown point is above 12%.
Abstract: The Gaussian rank correlation equals the usual correlation coefficient computed from the normal scores of the data. Although its influence function is unbounded, it still has attractive robustness properties. In particular, its breakdown point is above 12%. Moreover, the estimator is consistent and asymptotically efficient at the normal distribution. The correlation matrix obtained from pairwise Gaussian rank correlations is always positive semidefinite, and very easy to compute, also in high dimensions. We compare the properties of the Gaussian rank correlation with the popular Kendall and Spearman correlation measures. A simulation study confirms the good efficiency and robustness properties of the Gaussian rank correlation. In the empirical application, we show how it can be used for multivariate outlier detection based on robust principal component analysis.
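
The estimator itself is essentially one line: the Pearson correlation of the normal scores of the ranks. A quick sketch follows, with a contaminated toy sample (the contamination scheme is ours) to illustrate the robustness claim.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_rank_corr(x, y):
    """Pearson correlation of the normal scores of the ranks."""
    n = len(x)
    zx = norm.ppf(rankdata(x) / (n + 1))
    zy = norm.ppf(rankdata(y) / (n + 1))
    return np.corrcoef(zx, zy)[0, 1]

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + 0.8 * rng.normal(size=500)
y[:5] = 100.0                                  # a few gross outliers
print("Pearson:", np.corrcoef(x, y)[0, 1], " Gaussian rank:", gaussian_rank_corr(x, y))
```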

Journal ArticleDOI
TL;DR: A new class of distributions, multivariate t distributions with the Box-Cox transformation, is proposed for mixture modeling, which provides a unified framework to simultaneously handle outlier identification and data transformation, two interrelated issues.
Abstract: Cluster analysis is the automated search for groups of homogeneous observations in a data set. A popular modeling approach for clustering is based on finite normal mixture models, which assume that each cluster is modeled as a multivariate normal distribution. However, the normality assumption that each component is symmetric is often unrealistic. Furthermore, normal mixture models are not robust against outliers; they often require extra components for modeling outliers and/or give a poor representation of the data. To address these issues, we propose a new class of distributions, multivariate t distributions with the Box-Cox transformation, for mixture modeling. This class of distributions generalizes the normal distribution with the more heavy-tailed t distribution, and introduces skewness via the Box-Cox transformation. As a result, this provides a unified framework to simultaneously handle outlier identification and data transformation, two interrelated issues. We describe an Expectation-Maximization algorithm for parameter estimation along with transformation selection. We demonstrate the proposed methodology with three real data sets and simulation studies. Compared with a wealth of approaches including the skew-t mixture model, the proposed t mixture model with the Box-Cox transformation performs favorably in terms of accuracy in the assignment of observations, robustness against model misspecification, and selection of the number of components.

Journal ArticleDOI
TL;DR: The proposed methodology is particularly useful for analyzing multimodal asymmetric data as produced by major biotechnological platforms like flow cytometry and is provided with the help of an illustrative example.
Abstract: This paper deals with the problem of maximum likelihood estimation for a mixture of skew Student-t-normal distributions, which is a novel model-based tool for clustering heterogeneous (multiple groups) data in the presence of skewed and heavy-tailed outcomes. We present two analytically simple EM-type algorithms for iteratively computing the maximum likelihood estimates. The observed information matrix is derived for obtaining the asymptotic standard errors of parameter estimates. A small simulation study is conducted to demonstrate the superiority of the skew Student-t-normal distribution compared to the skew t distribution. The proposed methodology is particularly useful for analyzing multimodal asymmetric data as produced by major biotechnological platforms like flow cytometry. We provide such an application with the help of an illustrative example.

Journal ArticleDOI
TL;DR: This work considers how the tempered transitions algorithm may be tuned to increase the acceptance rates for a given number of temperatures and finds that the commonly assumed geometric spacing of temperatures is reasonable in many but not all applications.
Abstract: The method of tempered transitions was proposed by Neal (Stat. Comput. 6:353–366, 1996) for tackling the difficulties arising when using Markov chain Monte Carlo to sample from multimodal distributions. In common with methods such as simulated tempering and Metropolis-coupled MCMC, the key idea is to utilise a series of successively easier to sample distributions to improve movement around the state space. Tempered transitions does this by incorporating moves through these less modal distributions into the MCMC proposals. Unfortunately the improved movement between modes comes at a high computational cost with a low acceptance rate of expensive proposals. We consider how the algorithm may be tuned to increase the acceptance rates for a given number of temperatures. We find that the commonly assumed geometric spacing of temperatures is reasonable in many but not all applications.
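
A sketch of one tempered-transition move on a toy bimodal target, using the geometric temperature ladder discussed in the paper; the target, the random-walk step size and the number of rungs are our illustrative choices. The chain is pushed up through flatter powers π^β of the target and back down, and the whole excursion is accepted or rejected at once.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # toy bimodal target: equal mixture of N(-4, 1) and N(4, 1)
    return np.logaddexp(-0.5 * (x + 4.0) ** 2, -0.5 * (x - 4.0) ** 2)

def metropolis_step(x, beta, step=2.0):
    # random-walk Metropolis step leaving the tempered density pi^beta invariant
    y = x + step * rng.normal()
    return y if np.log(rng.uniform()) < beta * (log_target(y) - log_target(x)) else x

def tempered_transition(x0, betas, step=2.0):
    # betas[0] = 1 (cold target) > betas[1] > ... > betas[-1] (hottest level)
    x, log_r = x0, 0.0
    for i in range(1, len(betas)):             # upward pass through flatter distributions
        log_r += (betas[i] - betas[i - 1]) * log_target(x)
        x = metropolis_step(x, betas[i], step)
    for i in range(len(betas) - 1, 0, -1):     # downward pass back to the cold level
        x = metropolis_step(x, betas[i], step)
        log_r += (betas[i - 1] - betas[i]) * log_target(x)
    return (x, True) if np.log(rng.uniform()) < log_r else (x0, False)

n_levels, beta_min = 10, 0.05
betas = beta_min ** (np.arange(n_levels + 1) / n_levels)   # geometric temperature ladder

x, accepted, xs = 4.0, 0, []
for _ in range(2000):
    x, ok = tempered_transition(x, betas)
    accepted += ok
    xs.append(x)
print("acceptance rate:", accepted / 2000, " fraction in left mode:", np.mean(np.array(xs) < 0))
```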