Showing papers in "arXiv: Statistics Theory in 2014"


Journal Article
TL;DR: This paper proves a consistency result for Breiman's original random forest algorithm in the context of additive regression models, and sheds light on how random forests adapt to sparsity.
Abstract: Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45 (2001) 5--32] that combines several randomized decision trees and aggregates their predictions by averaging. Despite its wide usage and outstanding practical performance, little is known about the mathematical properties of the procedure. This disparity between theory and practice originates in the difficulty of simultaneously analyzing both the randomization process and the highly data-dependent tree structure. In the present paper, we take a step forward in forest exploration by proving a consistency result for Breiman's [Mach. Learn. 45 (2001) 5--32] original algorithm in the context of additive regression models. Our analysis also sheds an interesting light on how random forests can nicely adapt to sparsity.
1. Introduction. Random forests are an ensemble learning method for classification and regression that constructs a number of randomized decision trees during the training phase and predicts by averaging the results. Since its publication in the seminal paper of Breiman (2001), the procedure has become a major data analysis tool that performs well in practice in comparison with many standard methods. What has greatly contributed to the popularity of forests is the fact that they can be applied to a wide range of prediction problems and have few parameters to tune. Aside from being simple to use, the method is generally recognized for its accuracy and its ability to deal with small sample sizes, high-dimensional feature spaces and complex data structures. The random forest methodology has been successfully involved in many practical problems, including air quality prediction (winning code of the EMC data science global hackathon in 2012), chemoinformatics [Svetnik et al. (2003)], ecology [Prasad, Iverson and Liaw (2006), Cutler et al. (2007)], 3D

356 citations


Posted Content
TL;DR: To perform inference after model selection, this work proposes controlling the selective type I error, i.e., the error rate of a test given that it was performed, which recovers long-run frequency properties among selected hypotheses analogous to those that apply in the classical (non-adaptive) context.
Abstract: To perform inference after model selection, we propose controlling the selective type I error; i.e., the error rate of a test given that it was performed. By doing so, we recover long-run frequency properties among selected hypotheses analogous to those that apply in the classical (non-adaptive) context. Our proposal is closely related to data splitting and has a similar intuitive justification, but is more powerful. Exploiting the classical theory of Lehmann

296 citations


Posted Content
TL;DR: This article proposes a new class of Dirichlet–Laplace priors, which possess optimal posterior concentration and lead to efficient posterior computation.
Abstract: Penalized regression methods, such as $L_1$ regularization, are routinely used in high-dimensional applications, and there is a rich literature on optimality properties under sparsity assumptions. In the Bayesian paradigm, sparsity is routinely induced through two-component mixture priors having a probability mass at zero, but such priors encounter daunting computational problems in high dimensions. This has motivated an amazing variety of continuous shrinkage priors, which can be expressed as global-local scale mixtures of Gaussians, facilitating computation. In sharp contrast to the frequentist literature, little is known about the properties of such priors and the convergence and concentration of the corresponding posterior distribution. In this article, we propose a new class of Dirichlet--Laplace (DL) priors, which possess optimal posterior concentration and lead to efficient posterior computation exploiting results from normalized random measure theory. Finite sample performance of Dirichlet--Laplace priors relative to alternatives is assessed in simulated and real data examples.

278 citations


Journal Article
TL;DR: Under compatibility conditions on the design matrix, the posterior distribution is shown to contract at the optimal rate for recovery of the unknown sparse vector, and to give optimal prediction of the response vector.
Abstract: We study full Bayesian procedures for high-dimensional linear regression under sparsity constraints. The prior is a mixture of point masses at zero and continuous distributions. Under compatibility conditions on the design matrix, the posterior distribution is shown to contract at the optimal rate for recovery of the unknown sparse vector, and to give optimal prediction of the response vector. It is also shown to select the correct sparse model, or at least the coefficients that are significantly different from zero. The asymptotic shape of the posterior distribution is characterized and employed in the construction and study of credible sets for uncertainty quantification.

263 citations


Posted Content
TL;DR: This paper derives central limit and bootstrap theorems for probabilities that sums of centered high-dimensional random vectors hit hyperrectangles and sparsely convex sets.
Abstract: This paper derives central limit and bootstrap theorems for probabilities that sums of centered high-dimensional random vectors hit hyperrectangles and sparsely convex sets. Specifically, we derive Gaussian and bootstrap approximations for probabilities $\Pr(n^{-1/2}\sum_{i=1}^n X_i\in A)$ where $X_1,\dots,X_n$ are independent random vectors in $\mathbb{R}^p$ and $A$ is a hyperrectangle, or, more generally, a sparsely convex set, and show that the approximation error converges to zero even if $p=p_n\to \infty$ as $n \to \infty$ and $p \gg n$; in particular, $p$ can be as large as $O(e^{Cn^c})$ for some constants $c,C>0$. The result holds uniformly over all hyperrectangles, or more generally, sparsely convex sets, and does not require any restriction on the correlation structure among coordinates of $X_i$. Sparsely convex sets are sets that can be represented as intersections of many convex sets whose indicator functions depend only on a small subset of their arguments, with hyperrectangles being a special case.

254 citations


Journal Article
TL;DR: This paper establishes the optimal rate of convergence for graphon estimation, including estimation in a Hölder class with smoothness $\alpha$, where the rate for $\alpha\in(0,1)$ is, surprisingly, identical to the classical nonparametric rate.
Abstract: Network analysis is becoming one of the most active research areas in statistics. Significant advances have been made recently on developing theories, methodologies and algorithms for analyzing networks. However, there has been little fundamental study on optimal estimation. In this paper, we establish the optimal rate of convergence for graphon estimation. For the stochastic block model with $k$ clusters, we show that the optimal rate under the mean squared error is $n^{-1}\log k+k^2/n^2$. The minimax upper bound improves the existing results in the literature through a technique of solving a quadratic equation. When $k\leq\sqrt{n\log n}$, as the number of clusters $k$ grows, the minimax rate grows slowly with only a logarithmic order $n^{-1}\log k$. A key step to establish the lower bound is to construct a novel subset of the parameter space and then apply Fano's lemma, from which we see a clear distinction of the nonparametric graphon estimation problem from classical nonparametric regression, due to the lack of identifiability of the order of nodes in exchangeable random graph models. As an immediate application, we consider nonparametric graphon estimation in a Hölder class with smoothness $\alpha$. When the smoothness $\alpha\geq1$, the optimal rate of convergence is $n^{-1}\log n$, independent of $\alpha$, while for $\alpha\in(0,1)$, the rate is $n^{-2\alpha/(\alpha+1)}$, which is, to our surprise, identical to the classical nonparametric rate.

179 citations


Journal Article
TL;DR: In this article, a Bayesian approach to variable selection in the presence of high dimensional covariates based on a hierarchical model that places prior distributions on the regression coefficients as well as on the model space is proposed.
Abstract: We consider a Bayesian approach to variable selection in the presence of high dimensional covariates based on a hierarchical model that places prior distributions on the regression coefficients as well as on the model space. We adopt the well-known spike and slab Gaussian priors with a distinct feature, that is, the prior variances depend on the sample size through which appropriate shrinkage can be achieved. We show the strong selection consistency of the proposed method in the sense that the posterior probability of the true model converges to one even when the number of covariates grows nearly exponentially with the sample size. This is arguably the strongest selection consistency result that has been available in the Bayesian variable selection literature; yet the proposed method can be carried out through posterior sampling with a simple Gibbs sampler. Furthermore, we argue that the proposed method is asymptotically similar to model selection with the $L_0$ penalty. We also demonstrate through empirical work the fine performance of the proposed approach relative to some state of the art alternatives.

167 citations


Posted Content
TL;DR: In this paper, the authors propose the cross-quantilogram to measure the quantile dependence between two time series, and apply it to test the hypothesis that one time series has no directional predictability to another time series.
Abstract: This paper proposes the cross-quantilogram to measure the quantile dependence between two time series. We apply it to test the hypothesis that one time series has no directional predictability to another time series. We establish the asymptotic distribution of the cross-quantilogram and the corresponding test statistic. The limiting distributions depend on nuisance parameters. To construct consistent confidence intervals we employ the stationary bootstrap procedure; we show the consistency of this bootstrap. Also, we consider the self-normalized approach, which is shown to be asymptotically pivotal under the null hypothesis of no predictability. We provide simulation studies and two empirical applications. First, we use the cross-quantilogram to detect predictability from stock variance to excess stock return. Compared to existing tools used in the literature of stock return predictability, our method provides a more complete relationship between a predictor and stock return. Second, we investigate the systemic risk of individual financial institutions, such as JP Morgan Chase, Goldman Sachs and AIG. This article has supplementary materials online.
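
The abstract above does not spell out the statistic, so the following is only a minimal sketch under an assumed definition: the sample cross-quantilogram at lag $k$ is taken to be the correlation-type ratio of "quantile hits" $\psi_\alpha(u) = 1[u<0]-\alpha$, evaluated at $x_{1,t}-\hat q_1(\alpha_1)$ and $x_{2,t-k}-\hat q_2(\alpha_2)$, with unconditional sample quantiles. The function names `quantile_hit` and `cross_quantilogram` are hypothetical, and the bootstrap and self-normalized inference steps are not covered.

```python
import numpy as np

def quantile_hit(x, alpha):
    """Quantile-hit process 1[x_t < q_alpha] - alpha, with q_alpha the sample quantile.

    Assumption: unconditional sample quantiles are used; the paper may also
    allow conditional quantiles, which this sketch does not handle.
    """
    x = np.asarray(x, dtype=float)
    q = np.quantile(x, alpha)
    return (x < q).astype(float) - alpha

def cross_quantilogram(x1, x2, alpha1, alpha2, lag):
    """Sample cross-correlation of the quantile hits of x1[t] and x2[t - lag]."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    h1 = quantile_hit(x1, alpha1)
    h2 = quantile_hit(x2, alpha2)
    if len(h1) != len(h2):
        raise ValueError("series must have the same length")
    n = len(h1)
    u = h1[lag:]        # hits of x1 at times lag, ..., n-1
    v = h2[:n - lag]    # hits of x2 at times 0, ..., n-1-lag
    return float(np.sum(u * v) / np.sqrt(np.sum(u ** 2) * np.sum(v ** 2)))
```

By the Cauchy-Schwarz inequality the value lies in $[-1, 1]$. A value near zero at every lag is what the null hypothesis of no directional predictability would suggest for the chosen pair of quantile levels.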

147 citations


Posted Content
TL;DR: In this paper, the STIV estimator is proposed for linear instrumental variables models with many regressors, all of which could be endogenous, and confidence sets are derived by solving linear programs.
Abstract: This article considers inference in linear instrumental variables models with many regressors, all of which could be endogenous. We propose the STIV estimator. Identification robust confidence sets are derived by solving linear programs. We present results on rates of convergence, variable selection, and confidence sets which adapt to the sparsity, and analyze confidence bands for vectors of linear functions using bias correction. We also provide solutions for the case where some instruments are endogenous. The application is to the EASI demand system.

146 citations


Journal Article
TL;DR: The authors show that incorporating a simple correlation measure into the tuning parameter can lead to nearly optimal prediction performance of the Lasso even for highly correlated covariates. They also reveal that for moderately correlated covariates, the prediction performance can be mediocre irrespective of the choice of the tuning parameter.
Abstract: Although the Lasso has been extensively studied, the relationship between its prediction performance and the correlations of the covariates is not fully understood. In this paper, we give new insights into this relationship in the context of multiple linear regression. We show, in particular, that the incorporation of a simple correlation measure into the tuning parameter can lead to a nearly optimal prediction performance of the Lasso even for highly correlated covariates. However, we also reveal that for moderately correlated covariates, the prediction performance of the Lasso can be mediocre irrespective of the choice of the tuning parameter. We finally show that our results also lead to near-optimal rates for the least-squares estimator with total variation penalty.

129 citations


Journal Article
TL;DR: In this paper, the authors study the horseshoe estimator for the multivariate normal mean model in the situation that the mean vector is sparse in the nearly black sense, and show that if the number of nonzero parameters of the mean vector is known, the horseshoe estimator attains the minimax risk, possibly up to a multiplicative constant.
Abstract: We consider the horseshoe estimator due to Carvalho, Polson and Scott (2010) for the multivariate normal mean model in the situation that the mean vector is sparse in the nearly black sense. We assume the frequentist framework where the data is generated according to a fixed mean vector. We show that if the number of nonzero parameters of the mean vector is known, the horseshoe estimator attains the minimax $\ell_2$ risk, possibly up to a multiplicative constant. We provide conditions under which the horseshoe estimator combined with an empirical Bayes estimate of the number of nonzero means still yields the minimax risk. We furthermore prove an upper bound on the rate of contraction of the posterior distribution around the horseshoe estimator, and a lower bound on the posterior variance. These bounds indicate that the posterior distribution of the horseshoe prior may be more informative than that of other one-component priors, including the Lasso.

Journal Article
TL;DR: Wild binary segmentation (WBS) is a new technique for consistent estimation of the number and locations of multiple change-points in data, which does not require the choice of a window or span parameter and does not lead to a significant increase in computational complexity.
Abstract: We propose a new technique, called wild binary segmentation (WBS), for consistent estimation of the number and locations of multiple change-points in data. We assume that the number of change-points can increase to infinity with the sample size. Due to a certain random localisation mechanism, WBS works even for very short spacings between the change-points and/or very small jump magnitudes, unlike standard binary segmentation. On the other hand, despite its use of localisation, WBS does not require the choice of a window or span parameter, and does not lead to a significant increase in computational complexity. WBS is also easy to code. We propose two stopping criteria for WBS: one based on thresholding and the other based on what we term the `strengthened Schwarz information criterion'. We provide default recommended values of the parameters of the procedure and show that it offers very good practical performance in comparison with the state of the art. The WBS methodology is implemented in the R package wbs, available on CRAN. In addition, we provide a new proof of consistency of binary segmentation with improved rates of convergence, as well as a corresponding result for WBS.
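
The abstract describes WBS only at a high level, so the following is a rough sketch under stated assumptions rather than the authors' implementation (for that, the abstract points to the R package wbs). It assumes a piecewise-constant mean, the usual CUSUM contrast for a single split, randomly drawn sub-intervals as the "random localisation mechanism", and the thresholding stopping rule. The function names, the default `n_intervals`, and the way intervals are drawn are all hypothetical choices; the strengthened Schwarz information criterion is not implemented.

```python
import numpy as np

def cusum_abs(segment):
    """Absolute CUSUM contrast for every split of `segment` into left | right."""
    n = len(segment)
    if n < 2:
        return np.empty(0)
    k = np.arange(1, n)               # number of observations in the left part
    left = np.cumsum(segment)[:-1]    # sum of the first k observations
    right = segment.sum() - left      # sum of the remaining n - k observations
    return np.abs(np.sqrt((n - k) / (n * k)) * left
                  - np.sqrt(k / (n * (n - k))) * right)

def wild_binary_segmentation(x, threshold, n_intervals=100, seed=0):
    """Return sorted indices b such that a change is declared between x[b] and x[b+1]."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    if T < 2:
        return []
    rng = np.random.default_rng(seed)
    # Random intervals [a, b] with a < b, drawn once and reused at every level.
    starts = rng.integers(0, T - 1, size=n_intervals)
    ends = np.array([rng.integers(s + 1, T) for s in starts])
    change_points = []

    def search(s, e):
        if e - s < 1:
            return
        # The current segment itself plus every random interval inside it.
        candidates = [(s, e)] + [(a, b) for a, b in zip(starts, ends)
                                 if s <= a and b <= e]
        best_stat, best_split = 0.0, None
        for a, b in candidates:
            stats = cusum_abs(x[a:b + 1])
            j = int(np.argmax(stats))
            if stats[j] > best_stat:
                best_stat, best_split = stats[j], int(a + j)
        # Thresholding stopping rule: recurse only if the best contrast is large.
        if best_split is not None and best_stat > threshold:
            change_points.append(best_split)
            search(s, best_split)
            search(best_split + 1, e)

    search(0, T - 1)
    return sorted(change_points)
```

For example, calling `wild_binary_segmentation(x, threshold=1.0)` on a sequence that is 0 for 50 observations and 2 for the next 50 should report a single change after index 49. In practice the threshold has to be scaled to the noise level, which this sketch leaves to the caller.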

Journal Article
TL;DR: In this article, a goodness-of-fit test for the stochastic block model is proposed, which is based on the largest singular value of a residual matrix obtained by subtracting the estimated block mean effect from the adjacency matrix.
Abstract: The stochastic block model is a popular tool for studying community structures in network data. We develop a goodness-of-fit test for the stochastic block model. The test statistic is based on the largest singular value of a residual matrix obtained by subtracting the estimated block mean effect from the adjacency matrix. Asymptotic null distribution is obtained using recent advances in random matrix theory. The test is proved to have full power against alternative models with finer structures. These results naturally lead to a consistent sequential testing estimate of the number of communities.

Posted Content
TL;DR: In this paper, a variant of the Davis-Kahan theorem that relies only on a population eigenvalue separation condition is presented, making it more natural and convenient for direct application in statistical contexts.
Abstract: The Davis--Kahan theorem is used in the analysis of many statistical procedures to bound the distance between subspaces spanned by population eigenvectors and their sample versions. It relies on an eigenvalue separation condition between certain relevant population and sample eigenvalues. We present a variant of this result that depends only on a population eigenvalue separation condition, making it more natural and convenient for direct application in statistical contexts, and improving the bounds in some cases. We also provide an extension to situations where the matrices under study may be asymmetric or even non-square, and where interest is in the distance between subspaces spanned by corresponding singular vectors.

Posted Content
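TL;DR: In this paper, the authors address covariance matrix estimation and PCA in high dimensions through nonlinear shrinkage of the eigenvalues of the sample covariance matrix, consistently estimating an oracle shrinkage formula motivated on asymptotic grounds; a key tool is consistent estimation of the spectrum of the population covariance matrix.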
Abstract: Covariance matrix estimation and principal component analysis (PCA) are two cornerstones of multivariate analysis. Classic textbook solutions perform poorly when the dimension of the data is of a magnitude similar to the sample size, or even larger. In such settings, there is a common remedy for both statistical problems: nonlinear shrinkage of the eigenvalues of the sample covariance matrix. The optimal nonlinear shrinkage formula depends on unknown population quantities and is thus not available. It is, however, possible to consistently estimate an oracle nonlinear shrinkage, which is motivated on asymptotic grounds. A key tool to this end is consistent estimation of the set of eigenvalues of the population covariance matrix (also known as the spectrum), an interesting and challenging problem in its own right. Extensive Monte Carlo simulations demonstrate that our methods have desirable finite-sample properties and outperform previous proposals.

Journal Article
TL;DR: In this article, the authors proposed a nonparametric maximum likelihood approach to detect multiple change-points in the data sequence, which does not impose any parametric assumption on the underlying distributions.
Abstract: In multiple change-point problems, different data segments often follow different distributions, for which the changes may occur in the mean, scale or the entire distribution from one segment to another. Without the need to know the number of change-points in advance, we propose a nonparametric maximum likelihood approach to detecting multiple change-points. Our method does not impose any parametric assumption on the underlying distributions of the data sequence, which is thus suitable for detection of any changes in the distributions. The number of change-points is determined by the Bayesian information criterion and the locations of the change-points can be estimated via the dynamic programming algorithm and the use of the intrinsic order structure of the likelihood function. Under some mild conditions, we show that the new method provides consistent estimation with an optimal rate. We also suggest a prescreening procedure to exclude most of the irrelevant points prior to the implementation of the nonparametric likelihood method. Simulation studies show that the proposed method has satisfactory performance of identifying multiple change-points in terms of estimation accuracy and computation time.

Posted Content
TL;DR: The authors show that spline and wavelet series regression estimators for weakly dependent regressors attain the optimal uniform (i.e., sup-norm) convergence rate $(n/\log n)^{-p/(2p+d)}$ of Stone (1982), where $d$ is the number of regressors and $p$ is the smoothness of the regression function.
Abstract: We show that spline and wavelet series regression estimators for weakly dependent regressors attain the optimal uniform (i.e. sup-norm) convergence rate $(n/\log n)^{-p/(2p+d)}$ of Stone (1982), where $d$ is the number of regressors and $p$ is the smoothness of the regression function. The optimal rate is achieved even for heavy-tailed martingale difference errors with finite $(2+(d/p))$th absolute moment for $d/p<2$. We also establish the asymptotic normality of t statistics for possibly nonlinear, irregular functionals of the conditional mean function under weak conditions. The results are proved by deriving a new exponential inequality for sums of weakly dependent random matrices, which is of independent interest.

Journal Article
TL;DR: A theoretical guarantee is given for the procedure to accurately detect the communities with small misclassification rate under the setting where the number of clusters can grow with $N$, which matches the best-known result in the literature of computationally feasible community detection in SBM without outliers.
Abstract: Community detection, which aims to cluster $N$ nodes in a given graph into $r$ distinct groups based on the observed undirected edges, is an important problem in network data analysis. In this paper, the popular stochastic block model (SBM) is extended to the generalized stochastic block model (GSBM) that allows for adversarial outlier nodes, which are connected with the other nodes in the graph in an arbitrary way. Under this model, we introduce a procedure using convex optimization followed by the $k$-means algorithm with $k=r$. Both theoretical and numerical properties of the method are analyzed. A theoretical guarantee is given for the procedure to accurately detect the communities with small misclassification rate under the setting where the number of clusters can grow with $N$. This theoretical result matches the best-known result in the literature of computationally feasible community detection in SBM without outliers. Numerical results show that our method is both computationally fast and robust to different kinds of outliers, while some popular computationally fast community detection algorithms, such as spectral clustering applied to adjacency matrices or graph Laplacians, may fail to retrieve the major clusters due to a small portion of outliers. We apply a slight modification of our method to a political blogs data set, showing that our method is competent in practice and comparable to existing computationally feasible methods in the literature. To the best of the authors' knowledge, our result is the first in the literature in terms of clustering communities with fast growing numbers under the GSBM where a portion of arbitrary outlier nodes exist.

Posted Content
TL;DR: In this article, the authors provide an exposition of a flexible geometric framework for high dimensional estimation problems with constraints, including sparse recovery, matrix completion, quantization, linear and logistic regression, and generalized linear models.
Abstract: This tutorial provides an exposition of a flexible geometric framework for high dimensional estimation problems with constraints. The tutorial develops geometric intuition about high dimensional sets, justifies it with some results of asymptotic convex geometry, and demonstrates connections between geometric results and estimation problems. The theory is illustrated with applications to sparse recovery, matrix completion, quantization, linear and logistic regression and generalized linear models.

Journal Article
TL;DR: This article reviews Higher Criticism (HC) in both the testing and feature selection settings, together with the Rare/Weak (RW) model, a theoretical framework simultaneously controlling the size and prevalence of useful/significant items among the useless/null bulk. The RW model shows that HC has important advantages over better known procedures such as False Discovery Rate (FDR) control and Family-wise Error control (FwER), in particular, certain optimality properties.
Abstract: In modern high-throughput data analysis, researchers perform a large number of statistical tests, expecting to find perhaps a small fraction of significant effects against a predominantly null background. Higher Criticism (HC) was introduced to determine whether there are any nonzero effects; more recently, it was applied to feature selection, where it provides a method for selecting useful predictive features from a large body of potentially useful features, among which only a rare few will prove truly useful. In this article, we review the basics of HC in both the testing and feature selection settings. HC is a flexible idea, which adapts easily to new situations; we point out simple adaptions to clique detection and bivariate outlier detection. HC, although still early in its development, is seeing increasing interest from practitioners; we illustrate this with worked examples. HC is computationally effective, which gives it a nice leverage in the increasingly more relevant "Big Data" settings we see today. We also review the underlying theoretical "ideology" behind HC. The Rare/Weak (RW) model is a theoretical framework simultaneously controlling the size and prevalence of useful/significant items among the useless/null bulk. The RW model shows that HC has important advantages over better known procedures such as False Discovery Rate (FDR) control and Family-wise Error control (FwER), in particular, certain optimality properties. We discuss the rare/weak phase diagram, a way to visualize clearly the class of RW settings where the true signals are so rare or so weak that detection and feature selection are simply impossible, and a way to understand the known optimality properties of HC.
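
The abstract does not state the HC statistic itself. The sketch below assumes one common form from the HC literature: with sorted p-values $p_{(1)}\le\dots\le p_{(n)}$, the objective is $\sqrt{n}\,(i/n - p_{(i)})/\sqrt{p_{(i)}(1-p_{(i)})}$, maximized over $1\le i\le \alpha_0 n$. The function name `higher_criticism`, the default `alpha0=0.5`, and the clipping constant are assumptions made for illustration, and other variants of the normalization exist.

```python
import numpy as np

def higher_criticism(pvalues, alpha0=0.5):
    """Return (HC statistic, p-value at which the HC objective is maximized).

    Assumed form: sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    maximized over the smallest alpha0 * n p-values.
    """
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = len(p)
    if n == 0:
        raise ValueError("need at least one p-value")
    p = np.clip(p, 1e-12, 1.0 - 1e-12)      # guard against division by zero
    i = np.arange(1, n + 1)
    objective = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1.0 - p))
    m = max(1, int(alpha0 * n))             # only the smallest p-values are scanned
    k = int(np.argmax(objective[:m]))
    return float(objective[k]), float(p[k])
```

In the testing setting, a large value of the statistic would be taken as evidence that some effects are nonzero. In the feature selection setting, the returned p-value can serve as a data-driven cutoff: features with p-values at or below it are kept.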

Posted Content
TL;DR: In this article, the authors consider recovery of low-rank matrices from noisy data by shrinkage of singular values, in which a single, univariate nonlinearity is applied to each of the empirical singular values.
Abstract: We consider recovery of low-rank matrices from noisy data by shrinkage of singular values, in which a single, univariate nonlinearity is applied to each of the empirical singular values. We adopt an asymptotic framework, in which the matrix size is much larger than the rank of the signal matrix to be recovered, and the signal-to-noise ratio of the low-rank piece stays constant. For a variety of loss functions, including Mean Square Error (MSE - square Frobenius norm), the nuclear norm loss and the operator norm loss, we show that in this framework there is a well-defined asymptotic loss that we evaluate precisely in each case. In fact, each of the loss functions we study admits a unique admissible shrinkage nonlinearity dominating all other nonlinearities. We provide a general method for evaluating these optimal nonlinearities, and demonstrate our framework by working out simple, explicit formulas for the optimal nonlinearities in the Frobenius, nuclear and operator norm cases. For example, for a square low-rank n-by-n matrix observed in white noise with level $\sigma$, the optimal nonlinearity for MSE loss simply shrinks each data singular value $y$ to $\sqrt{y^2-4n\sigma^2 }$ (or to 0 if $y<2\sqrt{n}\sigma$).
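
The MSE shrinkage rule quoted in the abstract, $y \mapsto \sqrt{y^2-4n\sigma^2}$, can be sketched directly. This is an illustration under assumptions: the shrunk value is set to zero wherever the quantity under the square root is negative (that is, for $y<2\sqrt{n}\sigma$), the matrix is square, and the noise level $\sigma$ is known. The function names are hypothetical, and the nuclear and operator norm shrinkers are not shown.

```python
import numpy as np

def optimal_mse_shrinker(y, n, sigma):
    """Shrink singular values y of an n-by-n matrix observed in white noise of level sigma.

    Values with y**2 < 4 * n * sigma**2 are mapped to zero.
    """
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.maximum(y ** 2 - 4.0 * n * sigma ** 2, 0.0))

def denoise_square_matrix(Y, sigma):
    """Apply the shrinker to every empirical singular value of a square matrix Y."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    if Y.shape != (n, n):
        raise ValueError("this sketch only covers square matrices")
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * optimal_mse_shrinker(s, n, sigma)) @ Vt
```

Because the same univariate function is applied to each singular value and the singular vectors are left unchanged, the estimator costs one SVD plus an elementwise operation.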

Journal Article
TL;DR: The rate of convergence for the thresholded covariance matrix estimate is obtained for high-dimensional stationary and locally stationary time series; the theory substantially generalizes earlier ones by allowing dependence, by allowing nonstationarity and by relaxing the associated moment conditions.
Abstract: We consider estimation of covariance matrices and their inverses (a.k.a. precision matrices) for high-dimensional stationary and locally stationary time series. In the latter case the covariance matrices evolve smoothly in time, thus forming a covariance matrix function. Using the functional dependence measure of Wu [Proc. Natl. Acad. Sci. USA 102 (2005) 14150-14154 (electronic)], we obtain the rate of convergence for the thresholded estimate and illustrate how the dependence affects the rate of convergence. Asymptotic properties are also obtained for the precision matrix estimate which is based on the graphical Lasso principle. Our theory substantially generalizes earlier ones by allowing dependence, by allowing nonstationarity and by relaxing the associated moment conditions.

Posted Content
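TL;DR: The distance-to-a-measure (DTM), introduced by Chazal et al. (2011), and the kernel distance, introduced by Phillips et al. (2014), are smooth functions that provide useful topological information and are robust to noise and outliers; this paper derives limiting distributions and confidence sets for them and proposes a method for choosing tuning parameters.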
Abstract: Let P be a distribution with support S. The salient features of S can be quantified with persistent homology, which summarizes topological features of the sublevel sets of the distance function (the distance of any point x to S). Given a sample from P we can infer the persistent homology using an empirical version of the distance function. However, the empirical distance function is highly non-robust to noise and outliers. Even one outlier is deadly. The distance-to-a-measure (DTM), introduced by Chazal et al. (2011), and the kernel distance, introduced by Phillips et al. (2014), are smooth functions that provide useful topological information but are robust to noise and outliers. Chazal et al. (2014) derived concentration bounds for DTM. Building on these results, we derive limiting distributions and confidence sets, and we propose a method for choosing tuning parameters.
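
To illustrate why the DTM is less sensitive to outliers than the plain distance function, here is a minimal sketch of an empirical DTM. The definition is an assumption, since the abstract does not give it: with mass parameter $m\in(0,1]$ and $k=\lceil mn\rceil$, the empirical DTM at $x$ is taken to be the square root of the average squared distance from $x$ to its $k$ nearest sample points. The function name `empirical_dtm` is hypothetical.

```python
import numpy as np

def empirical_dtm(query, sample, m):
    """Empirical distance-to-a-measure at each query point.

    query:  array of shape (q, d)
    sample: array of shape (n, d)
    m:      mass parameter in (0, 1]; k = ceil(m * n) nearest neighbours are averaged
    """
    query = np.asarray(query, dtype=float)
    sample = np.asarray(sample, dtype=float)
    if not 0.0 < m <= 1.0:
        raise ValueError("m must lie in (0, 1]")
    k = int(np.ceil(m * len(sample)))
    # Squared Euclidean distances between every query point and every sample point.
    sq_dist = ((query[:, None, :] - sample[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.sort(sq_dist, axis=1)[:, :k]
    return np.sqrt(nearest.mean(axis=1))
```

With $k=1$ this reduces to the empirical distance function, where a single stray point changes the value at nearby locations completely. Averaging over $k$ neighbours dampens that effect.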

Journal Article
TL;DR: This paper presents three general results about the problem of estimating the mean of a Gaussian random vector constrained to a convex set: an exact computation of the main term in the estimation error by relating it to expected maxima of Gaussian processes, a theorem showing that the least squares estimator is always admissible up to a universal constant in any problem of the above kind, and a counterexample showing that the least squares estimator may not always be minimax rate-optimal.
Abstract: Consider the problem of estimating the mean of a Gaussian random vector when the mean vector is assumed to be in a given convex set. The most natural solution is to take the Euclidean projection of the data vector on to this convex set; in other words, performing "least squares under a convex constraint." Many problems in modern statistics and statistical signal processing theory are special cases of this general situation. Examples include the lasso and other high-dimensional regression techniques, function estimation problems, matrix estimation and completion, shape-restricted regression, constrained denoising, linear inverse problems, etc. This paper presents three general results about this problem, namely, (a) an exact computation of the main term in the estimation error by relating it to expected maxima of Gaussian processes (existing results only give upper bounds), (b) a theorem showing that the least squares estimator is always admissible up to a universal constant in any problem of the above kind and (c) a counterexample showing that least squares estimator may not always be minimax rate-optimal. The result from part (a) is then used to compute the error of the least squares estimator in two examples of contemporary interest.
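
"Least squares under a convex constraint" is the Euclidean projection of the data vector onto the constraint set. As a small illustration, the sketch below gives that projection for two convex sets chosen here as examples: a Euclidean ball, and the cone of nondecreasing vectors (a simple instance of shape-restricted regression, computed with the pool-adjacent-violators algorithm). The choice of sets and the function names are assumptions and are not taken from the paper.

```python
import numpy as np

def project_l2_ball(y, radius=1.0):
    """Euclidean projection of y onto the ball {theta : ||theta||_2 <= radius}."""
    y = np.asarray(y, dtype=float)
    norm = np.linalg.norm(y)
    return y.copy() if norm <= radius else y * (radius / norm)

def project_monotone(y):
    """Euclidean projection of y onto nondecreasing vectors (pool adjacent violators)."""
    blocks = []                       # each block stores [sum of values, count]
    for value in np.asarray(y, dtype=float):
        blocks.append([value, 1])
        # Merge the last two blocks while their means violate monotonicity.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    if not blocks:
        return np.empty(0)
    return np.concatenate([np.full(count, total / count) for total, count in blocks])
```

In both cases the estimator is the point of the constraint set closest to the observed vector, which is the object whose error the paper analyzes.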

Journal Article
TL;DR: The goal is to develop mathematical tools needed to study the consistency of graph-based machine learning algorithms for tasks such as clustering, by obtaining almost optimal conditions on the scaling of the size of the neighborhood over which the points are connected by an edge for the Γ-convergence to hold.
Abstract: We consider point clouds obtained as random samples of a measure on a Euclidean domain. A graph representing the point cloud is obtained by assigning weights to edges based on the distance between the points they connect. Our goal is to develop mathematical tools needed to study the consistency, as the number of available data points increases, of graph-based machine learning algorithms for tasks such as clustering. In particular, we study when the cut capacity, and more generally the total variation, on these graphs is a good approximation of the perimeter (total variation) in the continuum setting. We address this question in the setting of $\Gamma$-convergence. We obtain almost optimal conditions on the scaling, as the number of points increases, of the size of the neighborhood over which the points are connected by an edge for the $\Gamma$-convergence to hold. Taking the limit is enabled by a transportation-based metric which allows one to suitably compare functionals defined on different point clouds.
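
A minimal sketch of the discrete object in question, assuming unit edge weights between points closer than a hypothetical neighborhood size `eps`: the graph total variation of a labeling `u` is the sum of `|u_i - u_j|` over connected pairs, and for a 0/1 labeling it equals the cut capacity. The rescaling in the number of points and in `eps` under which the paper states $\Gamma$-convergence is deliberately left out, and the function name is an assumption for this example.

```python
import math

def graph_total_variation(points, u, eps):
    """Sum of |u[i] - u[j]| over unordered pairs of points at distance < eps.

    Unit weights are an assumption of this sketch; the abstract only says
    that weights depend on the distance between the points.
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < eps:
                total += abs(u[i] - u[j])
    return total

# Four points on a line, split into two clusters by a 0/1 labeling.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
labels = [0, 0, 1, 1]
print(graph_total_variation(points, labels, eps=1.5))
print(graph_total_variation(points, labels, eps=2.5))
```

Enlarging `eps` connects more pairs across the two clusters and so increases the cut, which hints at why the scaling of the neighborhood size matters for the limit.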

Posted Content
TL;DR: Under some regularity assumptions on the regression function, it is shown that the bias of an infinite forest decreases at a faster rate (with respect to the size of each tree) than a single tree, and infinite forests attain a strictly better risk rate than single trees.
Abstract: Random forests are a very effective and commonly used statistical method, but their full theoretical analysis is still an open problem. As a first step, simplified models such as purely random forests have been introduced, in order to shed light on the good performance of random forests. In this paper, we study the approximation error (the bias) of some purely random forest models in a regression framework, focusing in particular on the influence of the number of trees in the forest. Under some regularity assumptions on the regression function, we show that the bias of an infinite forest decreases at a faster rate (with respect to the size of each tree) than a single tree. As a consequence, infinite forests attain a strictly better risk rate (with respect to the sample size) than single trees. Furthermore, our results allow us to derive a minimum number of trees sufficient to reach the same rate as an infinite forest. As a by-product of our analysis, we also show a link between the bias of purely random forests and the bias of some kernel estimators.

Posted Content
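TL;DR: Under a standard assumption in complexity theory (NP not in P/poly), this paper demonstrates a gap between the minimax prediction risk for sparse linear regression achievable by polynomial-time algorithms and that achieved by optimal algorithms; in particular, when the design matrix is ill-conditioned, the loss achievable in polynomial time can be substantially greater.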
Abstract: Under a standard assumption in complexity theory (NP not in P/poly), we demonstrate a gap between the minimax prediction risk for sparse linear regression that can be achieved by polynomial-time algorithms, and that achieved by optimal algorithms. In particular, when the design matrix is ill-conditioned, the minimax prediction loss achievable by polynomial-time algorithms can be substantially greater than that of an optimal algorithm. This result is the first known gap between polynomial and optimal algorithms for sparse linear regression, and does not depend on conjectures in average-case complexity.

Posted Content
TL;DR: This paper studies the estimation of spectral projectors of the covariance operator of i.i.d. Gaussian random variables by their empirical counterparts, and derives sharp concentration bounds for bilinear forms of empirical spectral projectors in terms of the sample size and the effective dimension.
Abstract: Let $X,X_1,\dots, X_n$ be i.i.d. Gaussian random variables with zero mean and covariance operator $\Sigma={\mathbb E}(X\otimes X)$ taking values in a separable Hilbert space ${\mathbb H}.$ Let $$ {\bf r}(\Sigma):=\frac{{\rm tr}(\Sigma)}{\|\Sigma\|_{\infty}} $$ be the effective rank of $\Sigma,$ ${\rm tr}(\Sigma)$ being the trace of $\Sigma$ and $\|\Sigma\|_{\infty}$ being its operator norm. Let $$\hat \Sigma_n:=n^{-1}\sum_{j=1}^n (X_j\otimes X_j)$$ be the sample (empirical) covariance operator based on $(X_1,\dots, X_n).$ The paper deals with a problem of estimation of spectral projectors of the covariance operator $\Sigma$ by their empirical counterparts, the spectral projectors of $\hat \Sigma_n$ (empirical spectral projectors). The focus is on the problems where both the sample size $n$ and the effective rank ${\bf r}(\Sigma)$ are large. This framework includes and generalizes well known high-dimensional spiked covariance models. Given a spectral projector $P_r$ corresponding to an eigenvalue $\mu_r$ of covariance operator $\Sigma$ and its empirical counterpart $\hat P_r,$ we derive sharp concentration bounds for bilinear forms of empirical spectral projector $\hat P_r$ in terms of sample size $n$ and effective dimension ${\bf r}(\Sigma).$ Building upon these concentration bounds, we prove the asymptotic normality of bilinear forms of random operators $\hat P_r -{\mathbb E}\hat P_r$ under the assumptions that $n\to \infty$ and ${\bf r}(\Sigma)=o(n).$ In a special case of eigenvalues of multiplicity one, these results are rephrased as concentration bounds and asymptotic normality for linear forms of empirical eigenvectors. Other results include bounds on the bias ${\mathbb E}\hat P_r-P_r$ and a method of bias reduction as well as a discussion of possible applications to statistical inference in high-dimensional principal component analysis.
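
The two quantities defined at the start of this abstract, the effective rank ${\bf r}(\Sigma)$ and the uncentered sample covariance $\hat\Sigma_n$, can be sketched in finite dimension as follows. This is an illustration only: the function names are hypothetical, the operator norm is approximated by power iteration started from the all-ones vector (which is assumed not to be orthogonal to the top eigenvector), and the covariance is assumed to be nonzero.

```python
def trace(m):
    """Trace of a square matrix given as a list of rows."""
    return sum(m[i][i] for i in range(len(m)))

def operator_norm_psd(m, iters=200):
    """Approximate the largest eigenvalue of a symmetric PSD matrix by power iteration."""
    n = len(m)
    v = [1.0] * n  # assumed starting vector, see the note above
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient of the (unit-norm) final iterate.
    mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n))

def effective_rank(sigma):
    """r(Sigma) = tr(Sigma) / ||Sigma||_inf for a nonzero covariance matrix."""
    return trace(sigma) / operator_norm_psd(sigma)

def sample_covariance(xs):
    """Uncentered sample covariance n^{-1} sum_j X_j (x) X_j, matching the zero-mean setting."""
    n, d = len(xs), len(xs[0])
    return [[sum(x[a] * x[b] for x in xs) / n for b in range(d)] for a in range(d)]

# A spiked-type example: one large eigenvalue and four small ones.
sigma = [[4.0 if i == j == 0 else (1.0 if i == j else 0.0) for j in range(5)] for i in range(5)]
print(effective_rank(sigma))
```

The effective rank never exceeds the ambient dimension and equals it for a multiple of the identity, so it behaves like a dimension-type quantity that stays meaningful when the ambient space is infinite-dimensional.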

Posted Content
TL;DR: It is shown that i.i.d. random vectors satisfying a rather weak moment assumption can be used as measurement vectors in compressed sensing, and that the number of measurements required for exact reconstruction matches the best possible estimate, the one exhibited by a random Gaussian matrix.
Abstract: We prove that iid random vectors that satisfy a rather weak moment assumption can be used as measurement vectors in Compressed Sensing, and the number of measurements required for exact reconstruction is the same as the best possible estimate -- exhibited by a random Gaussian matrix. We also prove that this moment condition is necessary, up to a $\log \log$ factor. Applications to the Compatibility Condition and the Restricted Eigenvalue Condition in the noisy setup and to properties of neighbourly random polytopes are also discussed.

Journal ArticleDOI
TL;DR: A new empirical Bayes approach for inference in the $p \gg n$ normal linear model, whose novelty is the use of data in the prior in two ways, for centering and regularization; the resulting concentration rate results are relevant for both estimation and model selection.
Abstract: We propose a new empirical Bayes approach for inference in the $p \gg n$ normal linear model. The novelty is the use of data in the prior in two ways, for centering and regularization. Under suitable sparsity assumptions, we establish a variety of concentration rate results for the empirical Bayes posterior distribution, relevant for both estimation and model selection. Computation is straightforward and fast, and simulation results demonstrate the strong finite-sample performance of the empirical Bayes model selection procedure.