
Showing papers in "arXiv: Statistics Theory in 2015"


Posted Content
TL;DR: A review of the most recent theoretical and methodological developments for random forests can be found in this article, with special attention given to the selection of parameters, the resampling mechanism, and variable importance measures.
Abstract: The random forest algorithm, proposed by L. Breiman in 2001, has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is versatile enough to be applied to large-scale problems, is easily adapted to various ad-hoc learning tasks, and returns measures of variable importance. The present article reviews the most recent theoretical and methodological developments for random forests. Emphasis is placed on the mathematical forces driving the algorithm, with special attention given to the selection of parameters, the resampling mechanism, and variable importance measures. This review is intended to provide non-experts easy access to the main ideas.

1,119 citations


Book ChapterDOI
TL;DR: In this article, a review of various global sensitivity analysis methods of model output is presented, in a complete methodological framework, in which three kinds of methods are distinguished: the screening (coarse sorting of the most influential inputs among a large number), the measures of importance (quantitative sensitivity indices) and the deep exploration of the model behaviour (measuring the effects of inputs over their whole range of variation).
Abstract: This chapter reviews, in a complete methodological framework, various global sensitivity analysis methods of model output. Numerous statistical and probabilistic tools (regression, smoothing, tests, statistical learning, Monte Carlo, …) aim at determining the model input variables that contribute most to a quantity of interest depending on the model output. This quantity can be, for instance, the variance of an output variable. Three kinds of methods are distinguished: the screening (coarse sorting of the most influential inputs among a large number), the measures of importance (quantitative sensitivity indices) and the deep exploration of the model behaviour (measuring the effects of inputs over their whole range of variation). A progressive application methodology is illustrated on a textbook example. A synthesis is given to place every method according to several axes, mainly the cost in number of model evaluations, the model complexity and the nature of the information provided.

744 citations


Posted Content
TL;DR: In this article, a general recipe for constructing MCMC samplers, including stochastic gradient versions, based on continuous Markov processes specified via two matrices is provided.
Abstract: Many recent Markov chain Monte Carlo (MCMC) samplers leverage continuous dynamics to define a transition kernel that efficiently explores a target distribution. In tandem, a focus has been on devising scalable variants that subsample the data and use stochastic gradients in place of full-data gradients in the dynamic simulations. However, such stochastic gradient MCMC samplers have lagged behind their full-data counterparts in terms of the complexity of dynamics considered since proving convergence in the presence of the stochastic gradient noise is non-trivial. Even with simple dynamics, significant physical intuition is often required to modify the dynamical system to account for the stochastic gradient noise. In this paper, we provide a general recipe for constructing MCMC samplers--including stochastic gradient versions--based on continuous Markov processes specified via two matrices. We constructively prove that the framework is complete. That is, any continuous Markov process that provides samples from the target distribution can be written in our framework. We show how previous continuous-dynamic samplers can be trivially "reinvented" in our framework, avoiding the complicated sampler-specific proofs. We likewise use our recipe to straightforwardly propose a new state-adaptive sampler: stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC). Our experiments on simulated data and a streaming Wikipedia analysis demonstrate that the proposed SGRHMC sampler inherits the benefits of Riemann HMC, with the scalability of stochastic gradient methods.

345 citations


Posted Content
TL;DR: This work provides a simple set of conditions under which projected gradient descent, when given a suitable initialization, converges geometrically to a statistically useful solution to the factorized optimization problem with rank constraints.
Abstract: Optimization problems with rank constraints arise in many applications, including matrix regression, structured PCA, matrix completion and matrix decomposition problems. An attractive heuristic for solving such problems is to factorize the low-rank matrix, and to run projected gradient descent on the nonconvex factorized optimization problem. The goal of this paper is to provide a general theoretical framework for understanding when such methods work well, and to characterize the nature of the resulting fixed point. We provide a simple set of conditions under which projected gradient descent, when given a suitable initialization, converges geometrically to a statistically useful solution. Our results are applicable even when the initial solution is outside any region of local convexity, and even when the problem is globally concave. Working in a non-asymptotic framework, we show that our conditions are satisfied for a wide range of concrete models, including matrix regression, structured PCA, matrix completion with real and quantized observations, matrix decomposition, and graph clustering problems. Simulation results show excellent agreement with the theoretical predictions.

311 citations


Journal ArticleDOI
TL;DR: In this paper, the authors explore the notion of elicitability for multi-dimensional functionals and give both necessary and sufficient conditions for strictly consistent scoring functions, and show that one dimensional functionals that are not elicitable can be a component of a higher order elicitable functional.
Abstract: A statistical functional, such as the mean or the median, is called elicitable if there is a scoring function or loss function such that the correct forecast of the functional is the unique minimizer of the expected score. Such scoring functions are called strictly consistent for the functional. The elicitability of a functional opens the possibility to compare competing forecasts and to rank them in terms of their realized scores. In this paper, we explore the notion of elicitability for multi-dimensional functionals and give both necessary and sufficient conditions for strictly consistent scoring functions. We cover the case of functionals with elicitable components, but we also show that one-dimensional functionals that are not elicitable can be a component of a higher order elicitable functional. In the case of the variance this is a known result. However, an important result of this paper is that spectral risk measures with a spectral measure with finite support are jointly elicitable if one adds the `correct' quantiles. A direct consequence of applied interest is that the pair (Value at Risk, Expected Shortfall) is jointly elicitable under mild conditions that are usually fulfilled in risk management applications.
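The definition of elicitability in the abstract above can be illustrated with a small sketch. It assumes the two textbook scoring functions (squared error for the mean, absolute error for the median) and a hypothetical helper `best_forecast` that searches a grid of candidate forecasts; none of these names come from the paper.

```python
def best_forecast(data, score, candidates):
    """Return the candidate point forecast with the smallest average score."""
    def average_score(candidate):
        return sum(score(candidate, y) for y in data) / len(data)
    return min(candidates, key=average_score)


def squared_error(forecast, outcome):
    # Strictly consistent for the mean.
    return (forecast - outcome) ** 2


def absolute_error(forecast, outcome):
    # Consistent for the median.
    return abs(forecast - outcome)


# Illustrative data with one large value, so mean and median differ.
data = [0, 0, 0, 1, 10]
grid = [i / 10 for i in range(101)]  # candidate forecasts 0.0, 0.1, ..., 10.0

mean_like = best_forecast(data, squared_error, grid)
median_like = best_forecast(data, absolute_error, grid)
print(mean_like, median_like)
```

On this data the squared-error minimizer lands on the sample mean and the absolute-error minimizer on the sample median, which is the sense in which each score "elicits" its functional.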

260 citations


Posted Content
TL;DR: This paper derived a bounding factor and a sharp inequality such that sensitivity analysis parameters must satisfy the inequality if an unmeasured confounder is to explain away the observed effect estimate or reduce it to a particular level.
Abstract: Unmeasured confounding may undermine the validity of causal inference with observational studies. Sensitivity analysis provides an attractive way to partially circumvent this issue by assessing the potential influence of unmeasured confounding on the causal conclusions. However, previous sensitivity analysis approaches often make strong and untestable assumptions such as having a confounder that is binary, or having no interaction between the effects of the exposure and the confounder on the outcome, or having only one confounder. Without imposing any assumptions on the confounder or confounders, we derive a bounding factor and a sharp inequality such that the sensitivity analysis parameters must satisfy the inequality if an unmeasured confounder is to explain away the observed effect estimate or reduce it to a particular level. Our approach is easy to implement and involves only two sensitivity parameters. Surprisingly, our bounding factor, which makes no simplifying assumptions, is no more conservative than a number of previous sensitivity analysis techniques that do make assumptions. Our new bounding factor implies not only the traditional Cornfield conditions that both the relative risk of the exposure on the confounder and that of the confounder on the outcome must satisfy, but also a high threshold that the maximum of these relative risks must satisfy. Furthermore, this new bounding factor can be viewed as a measure of the strength of confounding between the exposure and the outcome induced by a confounder.
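The abstract above does not spell out the bounding factor. A minimal sketch, assuming the form in which this result is usually quoted for relative risks of at least 1, is shown below; the names `rr_eu` (exposure-confounder relative risk) and `rr_ud` (confounder-outcome relative risk) are illustrative labels for the two sensitivity parameters.

```python
def bounding_factor(rr_eu, rr_ud):
    """Bounding factor RR_EU * RR_UD / (RR_EU + RR_UD - 1).

    Assumed form of the bound; both sensitivity parameters are taken
    to be relative risks of at least 1.
    """
    if rr_eu < 1 or rr_ud < 1:
        raise ValueError("sensitivity parameters are assumed to be >= 1")
    return rr_eu * rr_ud / (rr_eu + rr_ud - 1)


def adjusted_lower_bound(rr_obs, rr_eu, rr_ud):
    """Smallest true relative risk compatible with the observed one."""
    return rr_obs / bounding_factor(rr_eu, rr_ud)


# An observed relative risk of 2 with both sensitivity parameters equal to 2.
print(bounding_factor(2.0, 2.0))
print(adjusted_lower_bound(2.0, 2.0, 2.0))
```

Under this form, a confounder can explain away the observed estimate only if the bounding factor is at least as large as the observed relative risk.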

210 citations


Posted Content
TL;DR: This article studies the effects of bias correction on confidence interval coverage in the context of kernel density and local polynomial regression estimation, and proves that bias correction can be preferred to undersmoothing for minimizing coverage error and increasing robustness to tuning parameter choice.
Abstract: Nonparametric methods play a central role in modern empirical work. While they provide inference procedures that are more robust to parametric misspecification bias, they may be quite sensitive to tuning parameter choices. We study the effects of bias correction on confidence interval coverage in the context of kernel density and local polynomial regression estimation, and prove that bias correction can be preferred to undersmoothing for minimizing coverage error and increasing robustness to tuning parameter choice. This is achieved using a novel, yet simple, Studentization, which leads to a new way of constructing kernel-based bias-corrected confidence intervals. In addition, for practical cases, we derive coverage error optimal bandwidths and discuss easy-to-implement bandwidth selectors. For interior points, we show that the MSE-optimal bandwidth for the original point estimator (before bias correction) delivers the fastest coverage error decay rate after bias correction when second-order (equivalent) kernels are employed, but is otherwise suboptimal because it is too "large". Finally, for odd-degree local polynomial regression, we show that, as with point estimation, coverage error adapts to boundary points automatically when appropriate Studentization is used; however, the MSE-optimal bandwidth for the original point estimator is suboptimal. All the results are established using valid Edgeworth expansions and illustrated with simulated data. Our findings have important consequences for empirical work as they indicate that bias-corrected confidence intervals, coupled with appropriate standard errors, have smaller coverage error and are less sensitive to tuning parameter choices in practically relevant cases where additional smoothness is available.

202 citations


Posted Content
TL;DR: A unified analysis of the predictive risk of ridge regression and regularized discriminant analysis in a dense random effects model in a high-dimensional asymptotic regime and finds that predictive accuracy has a nuanced dependence on the eigenvalue distribution of the covariance matrix.
Abstract: We provide a unified analysis of the predictive risk of ridge regression and regularized discriminant analysis in a dense random effects model. We work in a high-dimensional asymptotic regime where $p, n \to \infty$ and $p/n \to \gamma \in (0, \, \infty)$, and allow for arbitrary covariance among the features. For both methods, we provide an explicit and efficiently computable expression for the limiting predictive risk, which depends only on the spectrum of the feature-covariance matrix, the signal strength, and the aspect ratio $\gamma$. Especially in the case of regularized discriminant analysis, we find that predictive accuracy has a nuanced dependence on the eigenvalue distribution of the covariance matrix, suggesting that analyses based on the operator norm of the covariance matrix may not be sharp. Our results also uncover several qualitative insights about both methods: for example, with ridge regression, there is an exact inverse relation between the limiting predictive risk and the limiting estimation risk given a fixed signal strength. Our analysis builds on recent advances in random matrix theory.

150 citations


Journal ArticleDOI
TL;DR: In this paper, a new method is proposed to determine the time-frequency content of time-dependent signals consisting of multiple oscillatory components, with time-varying amplitudes and instantaneous frequencies.
Abstract: A new method is proposed to determine the time-frequency content of time-dependent signals consisting of multiple oscillatory components, with time-varying amplitudes and instantaneous frequencies. Numerical experiments as well as a theoretical analysis are presented to assess its effectiveness.

136 citations


Posted Content
TL;DR: This approach breaks tree training into a model selection phase, followed by a model fitting phase where the best regression model consistent with these splits is found, and shows that the fitted regression tree concentrates around the optimal predictor with the same splits.
Abstract: We study the convergence of the predictive surface of regression trees and forests. To support our analysis we introduce a notion of adaptive concentration for regression trees. This approach breaks tree training into a model selection phase in which we pick the tree splits, followed by a model fitting phase where we find the best regression model consistent with these splits. We then show that the fitted regression tree concentrates around the optimal predictor with the same splits: as d and n get large, the discrepancy is with high probability bounded on the order of sqrt(log(d) log(n)/k) uniformly over the whole regression surface, where d is the dimension of the feature space, n is the number of training examples, and k is the minimum leaf size for each tree. We also provide rate-matching lower bounds for this adaptive concentration statement. From a practical perspective, our result enables us to prove consistency results for adaptively grown forests in high dimensions, and to carry out valid post-selection inference in the sense of Berk et al. [2013] for subgroups defined by tree leaves.

122 citations


Journal ArticleDOI
TL;DR: Functional Additive Regression uses a penalized least squares optimization approach to efficiently deal with high-dimensional problems involving a large number of functional predictors and can significantly outperform competing methods.
Abstract: We suggest a new method, called Functional Additive Regression, or FAR, for efficiently performing high-dimensional functional regression. FAR extends the usual linear regression model involving a functional predictor, $X(t)$, and a scalar response, $Y$, in two key respects. First, FAR uses a penalized least squares optimization approach to efficiently deal with high-dimensional problems involving a large number of functional predictors. Second, FAR extends beyond the standard linear regression setting to fit general nonlinear additive models. We demonstrate that FAR can be implemented with a wide range of penalty functions using a highly efficient coordinate descent algorithm. Theoretical results are developed which provide motivation for the FAR optimization criterion. Finally, we show through simulations and two real data sets that FAR can significantly outperform competing methods.

Posted Content
TL;DR: For both constant and decreasing step sizes in the Euler discretization, non-asymptotic bounds for the convergence to the target distribution $\pi$ in total variation distance are obtained.
Abstract: In this paper, we study a method to sample from a target distribution $\pi$ over $\mathbb{R}^d$ having a positive density with respect to the Lebesgue measure, known up to a normalisation factor. This method is based on the Euler discretization of the overdamped Langevin stochastic differential equation associated with $\pi$. For both constant and decreasing step sizes in the Euler discretization, we obtain non-asymptotic bounds for the convergence to the target distribution $\pi$ in total variation distance. A particular attention is paid to the dependency on the dimension $d$, to demonstrate the applicability of this method in the high dimensional setting. These bounds improve and extend the results of (Dalalyan 2014).
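The Euler discretization described in the abstract above can be sketched in a few lines. This is a one-dimensional illustration with a constant step size and a standard Gaussian target, so that the gradient of the log density is simply `-x`; the function name `ula_samples` and all parameter values are assumptions made for the example.

```python
import math
import random


def ula_samples(grad_log_density, x0, step, n_steps, rng):
    """Euler discretization of the overdamped Langevin SDE (no accept/reject step).

    Update: x <- x + step * grad_log_density(x) + sqrt(2 * step) * N(0, 1).
    """
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * grad_log_density(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples


# Target: standard Gaussian, log density -x^2/2 up to a constant, gradient -x.
rng = random.Random(0)
samples = ula_samples(lambda x: -x, x0=0.0, step=0.1, n_steps=20000, rng=rng)

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

For this particular target the constant-step chain has stationary variance 1 / (1 - step / 2) rather than 1, a small example of the discretization bias that non-asymptotic bounds of this kind have to account for.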

Journal ArticleDOI
TL;DR: The Bayesian group lasso is revisited with spike and slab priors for group variable selection, and Bayesian sparse group selection, also with spike and slab priors, is proposed to select variables both at the group level and within a group; simulations demonstrate that the posterior median estimator of the spike and slab models has excellent performance for both variable selection and estimation.
Abstract: The paper revisits the Bayesian group lasso and uses spike and slab priors for group variable selection. In the process, the connection of our model with penalized regression is demonstrated, and the role of posterior median for thresholding is pointed out. We show that the posterior median estimator has the oracle property for group variable selection and estimation under orthogonal designs, while the group lasso has suboptimal asymptotic estimation rate when variable selection consistency is achieved. Next we consider bi-level selection problem and propose the Bayesian sparse group selection again with spike and slab priors to select variables both at the group level and also within a group. We demonstrate via simulation that the posterior median estimator of our spike and slab models has excellent performance for both variable selection and estimation.

Journal ArticleDOI
TL;DR: In this article, the authors provide an asymptotic theory that allows the construction of simultaneous confidence bands for dependent change point tests, and explicitly allows us to determine the location of the change both in time and coordinates in high-dimensional time series.
Abstract: Consider $d$ dependent change point tests, each based on a CUSUM-statistic. We provide an asymptotic theory that allows us to deal with the maximum over all test statistics as both the sample size $n$ and $d$ tend to infinity. We achieve this either by a consistent bootstrap or an appropriate limit distribution. This allows for the construction of simultaneous confidence bands for dependent change point tests, and explicitly allows us to determine the location of the change both in time and coordinates in high-dimensional time series. If the underlying data has sample size greater than or equal to $n$ for each test, our conditions explicitly allow for the large $d$ small $n$ situation, that is, where $n/d\to0$. The setup for the high-dimensional time series is based on a general weak dependence concept. The conditions are very flexible and include many popular multivariate linear and nonlinear models from the literature, such as ARMA, GARCH and related models. The construction of the tests is completely nonparametric, so difficulties associated with parametric model selection, model fitting and parameter estimation are avoided. Among other things, the limit distribution for $\max_{1\leq h\leq d}\sup_{0\leq t\leq1}\vert \mathcal{W}_{t,h}-t\mathcal{W}_{1,h}\vert$ is established, where $\{\mathcal{W}_{t,h}\}_{1\leq h\leq d}$ denotes a sequence of dependent Brownian motions. As an application, we analyze all S&P 500 companies over a period of one year.
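For readers unfamiliar with the CUSUM statistic mentioned above, here is a minimal sketch for a single series. It computes the basic form, the maximum over k of |S_k - (k/n) S_n| / sqrt(n), and deliberately omits the variance normalization, the maximum over the $d$ coordinates, and the bootstrap; the function name is an assumption for the example.

```python
import math


def cusum_statistic(xs):
    """Return (max_k |S_k - (k/n) S_n| / sqrt(n), maximizing k) for one series."""
    n = len(xs)
    total = sum(xs)
    best_value, best_k = 0.0, 0
    partial = 0.0
    for k, x in enumerate(xs, start=1):
        partial += x
        value = abs(partial - k * total / n) / math.sqrt(n)
        if value > best_value:  # strict comparison keeps the earliest maximizer
            best_value, best_k = value, k
    return best_value, best_k


# A noiseless mean shift after the 50th observation.
series = [0.0] * 50 + [1.0] * 50
stat, change_at = cusum_statistic(series)
print(stat, change_at)
```

The maximizing index gives a natural estimate of the change location in time; in the multivariate setting of the abstract, the coordinate attaining the maximum would play the analogous role.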

Posted Content
TL;DR: A class of random graphs is introduced that meets many of the desiderata one would demand of a model to serve as the foundation for a statistical analysis of real-world networks, and is given a representation theorem via a straightforward specialization of Kallenberg's representation theorem.
Abstract: We introduce a class of random graphs that we argue meets many of the desiderata one would demand of a model to serve as the foundation for a statistical analysis of real-world networks. The class of random graphs is defined by a probabilistic symmetry: invariance of the distribution of each graph to arbitrary relabelings of its vertices. In particular, following Caron and Fox, we interpret a symmetric simple point process on $\mathbb{R}_+^2$ as the edge set of a random graph, and formalize the probabilistic symmetry as joint exchangeability of the point process. We give a representation theorem for the class of random graphs satisfying this symmetry via a straightforward specialization of Kallenberg's representation theorem for jointly exchangeable random measures on $\mathbb{R}_+^2$. The distribution of every such random graph is characterized by three (potentially random) components: a nonnegative real $I \in \mathbb{R}_+$, an integrable function $S: \mathbb{R}_+ \to \mathbb{R}_+$, and a symmetric measurable function $W: \mathbb{R}_+^2 \to [0,1]$ that satisfies several weak integrability conditions. We call the triple $(I,S,W)$ a graphex, in analogy to graphons, which characterize the (dense) exchangeable graphs on $\mathbb{N}$. Indeed, the model we introduce here contains the exchangeable graphs as a special case, as well as the "sparse exchangeable" model of Caron and Fox. We study the structure of these random graphs, and show that they can give rise to interesting structure, including sparse graph sequences. We give explicit equations for expectations of certain graph statistics, as well as the limiting degree distribution. We also show that certain families of graphexes give rise to random graphs that, asymptotically, contain an arbitrarily large fraction of the vertices in a single connected component.

Posted Content
TL;DR: An approach based on the log likelihood ratio statistic is considered and its asymptotic properties under model misspecification are analyzed, showing that the limiting distribution of the statistic is normal in the case of underfitting and obtaining its convergence rate in the case of overfitting.
Abstract: The stochastic block model (SBM) provides a popular framework for modeling community structures in networks. However, more attention has been devoted to problems concerning estimating the latent node labels and the model parameters than the issue of choosing the number of blocks. We consider an approach based on the log likelihood ratio statistic and analyze its asymptotic properties under model misspecification. We show the limiting distribution of the statistic in the case of underfitting is normal and obtain its convergence rate in the case of overfitting. These conclusions remain valid when the average degree grows at a polylog rate. The results enable us to derive the correct order of the penalty term for model complexity and arrive at a likelihood-based model selection criterion that is asymptotically consistent. Our analysis can also be extended to a degree-corrected block model (DCSBM). In practice, the likelihood function can be estimated using more computationally efficient variational methods or consistent label estimation algorithms, allowing the criterion to be applied to large networks.

Journal ArticleDOI
TL;DR: In this paper, the authors consider three methods for selecting a single objective prior and study whether or not the resulting prior is a reasonable overall prior in a variety of problems including the multinomial problem.
Abstract: In multi-parameter models, reference priors typically depend on the parameter or quantity of interest, and it is well known that this is necessary to produce objective posterior distributions with optimal properties. There are, however, many situations where one is simultaneously interested in all the parameters of the model or, more realistically, in functions of them that include aspects such as prediction, and it would then be useful to have a single objective prior that could safely be used to produce reasonable posterior inferences for all the quantities of interest. In this paper, we consider three methods for selecting a single objective prior and study, in a variety of problems including the multinomial problem, whether or not the resulting prior is a reasonable overall prior.

Posted Content
TL;DR: This work defines always valid p-values and confidence intervals that let users try to take advantage of data as fast as it becomes available, providing valid statistical inference whenever they make their decision.
Abstract: A/B tests are typically analyzed via frequentist p-values and confidence intervals; but these inferences are wholly unreliable if users endogenously choose sample sizes by *continuously monitoring* their tests. We define *always valid* p-values and confidence intervals that let users try to take advantage of data as fast as it becomes available, providing valid statistical inference whenever they make their decision. Always valid inference can be interpreted as a natural interface for a sequential hypothesis test, which empowers users to implement a modified test tailored to them. In particular, we show in an appropriate sense that the measures we develop trade off sample size and power efficiently, despite a lack of prior knowledge of the user's relative preference between these two goals. We also use always valid p-values to obtain multiple hypothesis testing control in the sequential context. Our methodology has been implemented in a large scale commercial A/B testing platform to analyze hundreds of thousands of experiments to date.

Posted Content
TL;DR: In this article, the authors discuss the possibilities and limitations of estimating the mean of a real-valued random variable from independent and identically distributed observations from a nonasymptotic point of view.
Abstract: We discuss the possibilities and limitations of estimating the mean of a real-valued random variable from independent and identically distributed observations from a non-asymptotic point of view. In particular, we define estimators with a sub-Gaussian behavior even for certain heavy-tailed distributions. We also prove various impossibility results for mean estimators.
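The abstract above does not name a specific estimator. As an illustration only, the classical median-of-means estimator is a standard example of a mean estimator whose deviations behave in a sub-Gaussian way under a finite-variance assumption; the sketch below assumes the observations arrive in i.i.d. order and silently drops any leftover observations that do not fill a block.

```python
import statistics


def median_of_means(xs, n_blocks):
    """Split xs into n_blocks equal consecutive blocks and return the median of block means."""
    if not 1 <= n_blocks <= len(xs):
        raise ValueError("n_blocks must be between 1 and len(xs)")
    block_size = len(xs) // n_blocks
    block_means = [
        statistics.fmean(xs[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]
    return statistics.median(block_means)


# One extreme observation ruins the empirical mean but not the median of means.
data = [1.0] * 8 + [1000.0]
print(statistics.fmean(data))
print(median_of_means(data, 3))
```

The number of blocks trades off robustness against efficiency: more blocks tolerate more outlying values but each block mean is noisier.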

Posted Content
TL;DR: This paper addresses the important question of how to choose k as n grows large, providing a theoretical upper bound on k such that the information loss due to the divide and conquer algorithm is negligible.
Abstract: This paper studies hypothesis testing and parameter estimation in the context of the divide and conquer algorithm. In a unified likelihood based framework, we propose new test statistics and point estimators obtained by aggregating various statistics from k subsamples of size n/k, where n is the sample size. In both low dimensional and high dimensional settings, we address the important question of how to choose k as n grows large, providing a theoretical upper bound on k such that the information loss due to the divide and conquer algorithm is negligible. In other words, the resulting estimators have the same inferential efficiencies and estimation rates as a practically infeasible oracle with access to the full sample. Thorough numerical results are provided to back up the theory.
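The aggregation idea in the abstract above can be sketched with a toy example. The sketch assumes simple averaging of subsample estimates (the abstract does not say which aggregation the paper uses) and takes the maximum likelihood estimate of an exponential rate as the per-subsample estimator; all names and parameter values are illustrative.

```python
import random


def rate_mle(xs):
    """Maximum likelihood estimate of an exponential rate: 1 / sample mean."""
    return len(xs) / sum(xs)


def divide_and_conquer(xs, k, estimator):
    """Average the estimator over k consecutive subsamples of size len(xs) // k."""
    size = len(xs) // k
    estimates = [estimator(xs[i * size:(i + 1) * size]) for i in range(k)]
    return sum(estimates) / k


rng = random.Random(1)
data = [rng.expovariate(2.0) for _ in range(10000)]  # true rate 2.0

full_estimate = rate_mle(data)                        # "oracle" using the full sample
split_estimate = divide_and_conquer(data, 10, rate_mle)
print(full_estimate, split_estimate)
```

Because the estimator is nonlinear, each subsample estimate carries a bias of order k/n that averaging does not remove, which gives some intuition for why k cannot grow too quickly with n.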

Journal ArticleDOI
TL;DR: A new framework for structure learning is proposed that is based on continuous spike and slab priors and uses latent variables to identify graphs and efficiently handles problems with hundreds of variables.
Abstract: Gaussian concentration graph models and covariance graph models are two classes of graphical models that are useful for uncovering latent dependence structures among multivariate variables. In the Bayesian literature, graphs are often determined through the use of priors over the space of positive definite matrices with fixed zeros, but these methods present daunting computational burdens in large problems. Motivated by the superior computational efficiency of continuous shrinkage priors for regression analysis, we propose a new framework for structure learning that is based on continuous spike and slab priors and uses latent variables to identify graphs. We discuss model specification, computation, and inference for both concentration and covariance graph models. The new approach produces reliable estimates of graphs and efficiently handles problems with hundreds of variables.

Journal ArticleDOI
TL;DR: In this article, the authors present an expository, general analysis of valid post-selection or post-regularization inference about a low-dimensional target parameter, $\alpha$, in the presence of a very high-dimensional nuisance parameter, which is estimated using modern selection or regularization methods.
Abstract: Here we present an expository, general analysis of valid post-selection or post-regularization inference about a low-dimensional target parameter, $\alpha$, in the presence of a very high-dimensional nuisance parameter, $\eta$, which is estimated using modern selection or regularization methods. Our analysis relies on high-level, easy-to-interpret conditions that allow one to clearly see the structures needed for achieving valid post-regularization inference. Simple, readily verifiable sufficient conditions are provided for a class of affine-quadratic models. We focus our discussion on estimation and inference procedures based on using the empirical analog of theoretical equations $$M(\alpha, \eta)=0$$ which identify $\alpha$. Within this structure, we show that setting up such equations in a manner such that the orthogonality/immunization condition $$\partial_\eta M(\alpha, \eta) = 0$$ at the true parameter values is satisfied, coupled with plausible conditions on the smoothness of $M$ and the quality of the estimator $\hat \eta$, guarantees that inference on the main parameter $\alpha$ based on testing or point estimation methods discussed below will be regular despite selection or regularization biases occurring in estimation of $\eta$. In particular, the estimator of $\alpha$ will often be uniformly consistent at the root-$n$ rate and uniformly asymptotically normal even though estimators $\hat \eta$ will generally not be asymptotically linear and regular. The uniformity holds over large classes of models that do not impose highly implausible "beta-min" conditions. We also show that inference can be carried out by inverting tests formed from Neyman's $C(\alpha)$ (orthogonal score) statistics.

Posted Content
TL;DR: A gentle introduction to Gaussian processes (GPs) for regression, classification, and dimensionality reduction.
Abstract: A gentle introduction to Gaussian processes (GPs). The three parts of the document consider GPs for regression, classification, and dimensionality reduction.

Journal ArticleDOI
TL;DR: This work derives finite-sample optimal CIs and sharp efficiency bounds under normal errors with known variance under the assumption that the regression function is known to lie in a convex function class, which covers most smoothness and/or shape assumptions used in econometrics.
Abstract: We consider the problem of constructing confidence intervals (CIs) for a linear functional of a regression function, such as its value at a point, the regression discontinuity parameter, or a regression coefficient in a linear or partly linear regression. Our main assumption is that the regression function is known to lie in a convex function class, which covers most smoothness and/or shape assumptions used in econometrics. We derive finite-sample optimal CIs and sharp efficiency bounds under normal errors with known variance. We show that these results translate to uniform (over the function class) asymptotic results when the error distribution is not known. When the function class is centrosymmetric, these efficiency bounds imply that minimax CIs are close to efficient at smooth regression functions. This implies, in particular, that it is impossible to form CIs that are tighter using data-dependent tuning parameters, and maintain coverage over the whole function class. We specialize our results to inference on the regression discontinuity parameter, and illustrate them in simulations and an empirical application.

Posted Content
TL;DR: The paper derives and implements an expectation-maximisation algorithm for estimating mixtures of Riemannian Gaussian distributions and applies it to texture classification in computer vision, showing that it yields significantly better performance than recent approaches.
Abstract: Data which lie in the space $\mathcal{P}_m$ of $m \times m$ symmetric positive definite matrices (sometimes called tensor data) play a fundamental role in applications including medical imaging, computer vision, and radar signal processing. An open challenge for these applications is to find a class of probability distributions which is able to capture the statistical properties of data in $\mathcal{P}_m$ as they arise in real-world situations. The present paper meets this challenge by introducing Riemannian Gaussian distributions on $\mathcal{P}_m$. Distributions of this kind were first considered by Pennec in 2006. However, the present paper gives an exact expression of their probability density function for the first time in the existing literature. This leads to two original contributions. First, a detailed study of statistical inference for Riemannian Gaussian distributions, uncovering the connection between maximum likelihood estimation and the concept of Riemannian centre of mass, widely used in applications. Second, the derivation and implementation of an expectation-maximisation algorithm for the estimation of mixtures of Riemannian Gaussian distributions. The paper applies this new algorithm to the classification of data in $\mathcal{P}_m$ (concretely, to the problem of texture classification in computer vision), showing that it yields significantly better performance in comparison to recent approaches.

Posted Content
TL;DR: The large sample properties of this method, without assuming normality, are studied, and it is proved that the test statistic of Tibshirani et al. (2016) is asymptotically valid, as the number of samples n grows and the dimension d of the regression problem stays fixed.
Abstract: Recently, Tibshirani et al. (2016) proposed a method for making inferences about parameters defined by model selection, in a typical regression setting with normally distributed errors. Here, we study the large sample properties of this method, without assuming normality. We prove that the test statistic of Tibshirani et al. (2016) is asymptotically valid, as the number of samples n grows and the dimension d of the regression problem stays fixed. Our asymptotic result holds uniformly over a wide class of nonnormal error distributions. We also propose an efficient bootstrap version of this test that is provably (asymptotically) conservative, and in practice, often delivers shorter intervals than those from the original normality-based approach. Finally, we prove that the test statistic of Tibshirani et al. (2016) does not enjoy uniform validity in a high-dimensional setting, when the dimension d is allowed to grow.

Posted Content
TL;DR: In this article, a randomization-based framework for estimating causal effects under interference between units is presented, which integrates three components: an experimental design that defines the probability distribution of treatment assignments, a mapping that relates experimental treatment assignments to exposures received by units in the experiment, and estimands that make use of the experiment to answer questions of substantive interest.
Abstract: This paper presents a randomization-based framework for estimating causal effects under interference between units. The framework integrates three components: (i) an experimental design that defines the probability distribution of treatment assignments, (ii) a mapping that relates experimental treatment assignments to exposures received by units in the experiment, and (iii) estimands that make use of the experiment to answer questions of substantive interest. Using this framework, we develop the case of estimating average unit-level causal effects from a randomized experiment with interference of arbitrary but known form. The resulting estimators are based on inverse probability weighting. We provide randomization-based variance estimators that account for the complex clustering that can occur when interference is present. We also establish consistency and asymptotic normality under local dependence assumptions. We discuss refinements including covariate-adjusted effect estimators and ratio estimation. We illustrate and assess empirical performance with a naturalistic simulation using network data from American high schools.
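The three components can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: an independent Bernoulli design, a three-level exposure mapping (own treatment, or control with and without a treated neighbour), and the hypothetical function names. The abstract only states that the estimators are based on inverse probability weighting; the code uses a Horvitz-Thompson form of that idea, with exposure probabilities computed exactly by enumerating all assignments of a small network.

```python
import itertools
import numpy as np

def exposure(z, adj):
    """Hypothetical exposure mapping: 2 = treated, 1 = control with a
    treated neighbour, 0 = control with no treated neighbour."""
    has_treated_nb = (adj @ z) > 0
    return np.where(z == 1, 2, np.where(has_treated_nb, 1, 0))

def exposure_probabilities(adj, p, n_levels=3):
    """Exact P(unit i receives exposure k) under an assumed Bernoulli(p) design,
    by enumerating all 2^n assignments (only feasible for small n)."""
    n = adj.shape[0]
    probs = np.zeros((n, n_levels))
    for bits in itertools.product([0, 1], repeat=n):
        z = np.array(bits)
        weight = p ** z.sum() * (1 - p) ** (n - z.sum())
        probs[np.arange(n), exposure(z, adj)] += weight
    return probs

def ht_mean(y, d, probs, level):
    """Horvitz-Thompson estimate of the average potential outcome at `level`."""
    mask = d == level
    return np.sum(y[mask] / probs[mask, level]) / len(y)

# Path graph on four units: 0 - 1 - 2 - 3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
probs = exposure_probabilities(adj, p=0.5)

# One realised assignment with made-up observed outcomes.
z_obs = np.array([1, 0, 0, 1])
d_obs = exposure(z_obs, adj)
y_obs = np.array([4.0, 1.0, 2.0, 3.0])
spillover_estimate = ht_mean(y_obs, d_obs, probs, level=1) - ht_mean(y_obs, d_obs, probs, level=0)
```

Contrasts between two exposure levels, such as `spillover_estimate` above, are differences of two such weighted means. The weights come from the design alone, which is why the exposure mapping has to be of known form.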

Posted Content
TL;DR: In this paper, the performance breakdown of DOA estimation at small sample sizes is linked to subspace leakage and to a root-swap phenomenon in the root-MUSIC algorithm, and new methods are proposed to alleviate both problems.
Abstract: Classical methods of DOA estimation such as the MUSIC algorithm are based on estimating the signal and noise subspaces from the sample covariance matrix. For a small number of samples, such methods are exposed to performance breakdown, as the sample covariance matrix can deviate substantially from the true covariance matrix. In this paper, the problem of DOA estimation performance breakdown is investigated. We consider the structure of the sample covariance matrix and the dynamics of the root-MUSIC algorithm. The performance breakdown in the threshold region is associated with the subspace leakage where some portion of the true signal subspace resides in the estimated noise subspace. In this paper, the subspace leakage is theoretically derived. We also propose a two-step method which improves the performance by modifying the sample covariance matrix such that the amount of the subspace leakage is reduced. Furthermore, we introduce a phenomenon named root-swap which occurs in the root-MUSIC algorithm in the low sample size region and degrades the performance of the DOA estimation. A new method is then proposed to alleviate this problem. Numerical examples and simulation results are given for uncorrelated and correlated sources to illustrate the improvement achieved by the proposed methods. Moreover, the proposed algorithms are combined with the pseudo-noise resampling method to further improve the performance.

Posted Content
TL;DR: It is proved that this regularization indeed forces the Laplacian to concentrate even in sparse graphs, establishing the validity of one of the simplest and fastest approaches to community detection -- regularized spectral clustering, under the stochastic block model.
Author(s): Le, Can M; Levina, Elizaveta; Vershynin, Roman
Abstract: We study random graphs with possibly different edge probabilities in the challenging sparse regime of bounded expected degrees. Unlike in the dense case, neither the graph adjacency matrix nor its Laplacian concentrate around their expectations due to the highly irregular distribution of node degrees. It has been empirically observed that simply adding a constant of order $1/n$ to each entry of the adjacency matrix substantially improves the behavior of the Laplacian. Here we prove that this regularization indeed forces the Laplacian to concentrate even in sparse graphs. As an immediate consequence in network analysis, we establish the validity of one of the simplest and fastest approaches to community detection -- regularized spectral clustering, under the stochastic block model. Our proof of concentration of the regularized Laplacian is based on Grothendieck's inequality and factorization, combined with paving arguments.
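The regularization step is simple enough to sketch directly. In the code below, the constant `tau`, the symmetric normalized form $I - D^{-1/2} A D^{-1/2}$ of the Laplacian, and the two-community sign split are assumptions made for illustration; the abstract only specifies that a constant of order $1/n$ is added to every entry of the adjacency matrix.

```python
import numpy as np

def regularized_laplacian(adj, tau=1.0):
    """Normalized Laplacian of the adjacency matrix after adding tau/n to every entry."""
    n = adj.shape[0]
    adj_reg = adj + tau / n               # the order-1/n regularization
    deg = adj_reg.sum(axis=1)             # regularized degrees, all strictly positive
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return np.eye(n) - d_inv_sqrt[:, None] * adj_reg * d_inv_sqrt[None, :]

def spectral_bipartition(adj, tau=1.0):
    """Split nodes into two groups by the sign of the second-smallest eigenvector."""
    lap = regularized_laplacian(adj, tau)
    _, eigvecs = np.linalg.eigh(lap)      # eigenvalues returned in ascending order
    return (eigvecs[:, 1] > 0).astype(int)

# Toy graph: two disjoint 4-cliques, an extreme case of two communities.
block = np.ones((4, 4)) - np.eye(4)
adj = np.block([[block, np.zeros((4, 4))],
                [np.zeros((4, 4)), block]])
labels = spectral_bipartition(adj)
```

For more than two communities, a common follow-up is to run k-means on the leading eigenvectors instead of thresholding a single one.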

Posted Content
TL;DR: In this article, the authors show that models under the Logistic missing mechanism are less identifiable than those under the Probit missing mechanism, and give necessary and sufficient conditions for identifiability of models under the Logistic missing mechanism, which can sometimes be checked in real data analysis.
Abstract: Missing data problems arise in many applied research studies. They may jeopardize statistical inference of the model of interest, if the missing mechanism is nonignorable, that is, the missing mechanism depends on the missing values themselves even conditional on the observed data. With a nonignorable missing mechanism, the model of interest is often not identifiable without imposing further assumptions. We find that even if the missing mechanism has a known parametric form, the model is not identifiable without specifying a parametric outcome distribution. Although it is fundamental for valid statistical inference, identifiability under nonignorable missing mechanisms is not established for many commonly-used models. In this paper, we first demonstrate identifiability of the normal distribution under monotone missing mechanisms. We then extend it to the normal mixture and $t$ mixture models with non-monotone missing mechanisms. We discover that models under the Logistic missing mechanism are less identifiable than those under the Probit missing mechanism. We give necessary and sufficient conditions for identifiability of models under the Logistic missing mechanism, which sometimes can be checked in real data analysis. We illustrate our methods using a series of simulations, and apply them to a real-life dataset.