
Showing papers in "Annals of Statistics in 2014"


Journal ArticleDOI
TL;DR: In this paper, a general method for constructing confidence intervals and statistical tests for single or low-dimensional components of a large parameter vector in a high-dimensional model is proposed, which can be easily adjusted for multiplicity taking dependence among tests into account.
Abstract: We propose a general method for constructing confidence intervals and statistical tests for single or low-dimensional components of a large parameter vector in a high-dimensional model. It can be easily adjusted for multiplicity taking dependence among tests into account. For linear models, our method is essentially the same as in Zhang and Zhang [J. R. Stat. Soc. Ser. B Stat. Methodol. 76 (2014) 217–242]: we analyze its asymptotic properties and establish its asymptotic optimality in terms of semiparametric efficiency. Our method naturally extends to generalized linear models with convex loss functions. We develop the corresponding theory which includes a careful analysis for Gaussian, sub-Gaussian and bounded correlated designs.
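The construction described above can be sketched in a few lines for a single coordinate: fit an initial lasso, build a nodewise-lasso projection direction for the coordinate of interest, apply a one-step bias correction, and read off a normal-approximation interval. This is an illustrative sketch of the de-sparsified lasso idea, not the authors' exact procedure (R implementations such as the hdi package provide that); the tuning choices here (LassoCV, the residual-based noise estimate) are assumptions made for brevity.

```python
# Minimal sketch of a de-sparsified (de-biased) lasso confidence interval for one
# coefficient, in the spirit of the method described above. All tuning choices
# (LassoCV, the crude noise-variance estimate) are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LassoCV
from scipy import stats

def debiased_lasso_ci(X, y, j, level=0.95):
    n, p = X.shape
    # Initial lasso fit for the full regression.
    fit = LassoCV(cv=5).fit(X, y)
    beta = fit.coef_
    resid = y - X @ beta - fit.intercept_

    # Nodewise lasso: regress X_j on the remaining columns to get a projection
    # direction z_j that is nearly orthogonal to the other predictors.
    idx = np.delete(np.arange(p), j)
    node = LassoCV(cv=5).fit(X[:, idx], X[:, j])
    z = X[:, j] - node.predict(X[:, idx])

    # One-step bias correction and plug-in variance.
    denom = z @ X[:, j]
    b_debiased = beta[j] + z @ resid / denom
    df = max(n - np.sum(beta != 0) - 1, 1)
    sigma2 = resid @ resid / df
    se = np.sqrt(sigma2 * (z @ z)) / abs(denom)

    q = stats.norm.ppf(0.5 + level / 2)
    return b_debiased, (b_debiased - q * se, b_debiased + q * se)

# Example on synthetic data with one active coefficient.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, 0] * 1.0 + rng.standard_normal(200)
print(debiased_lasso_ci(X, y, j=0))
```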

619 citations


Journal ArticleDOI
TL;DR: Wild binary segmentation (WBS) as discussed by the authors is a new technique for consistent estimation of the number and locations of multiple change-points in data, which does not require the choice of a window or span parameter and does not lead to a significant increase in computational complexity.
Abstract: We propose a new technique, called wild binary segmentation (WBS), for consistent estimation of the number and locations of multiple change-points in data. We assume that the number of change-points can increase to infinity with the sample size. Due to a certain random localisation mechanism, WBS works even for very short spacings between the change-points and/or very small jump magnitudes, unlike standard binary segmentation. On the other hand, despite its use of localisation, WBS does not require the choice of a window or span parameter, and does not lead to a significant increase in computational complexity. WBS is also easy to code. We propose two stopping criteria for WBS: one based on thresholding and the other based on what we term the ‘strengthened Schwarz information criterion’. We provide default recommended values of the parameters of the procedure and show that it offers very good practical performance in comparison with the state of the art. The WBS methodology is implemented in the R package wbs, available on CRAN. In addition, we provide a new proof of consistency of binary segmentation with improved rates of convergence, as well as a corresponding result for WBS.
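A minimal sketch of the WBS recursion with a simple threshold stopping rule is given below. The CUSUM statistic and the random-interval idea follow the description above, but the number of intervals M, the noise-level estimate and the threshold constant are placeholder choices rather than the paper's recommended defaults (the wbs package on CRAN implements those).

```python
# Illustrative sketch of wild binary segmentation with a threshold stopping rule.
# M, the noise estimate and the threshold constant are placeholders, not the
# paper's recommended defaults.
import numpy as np

def cusum(x, s, e):
    """Return (split position b, |CUSUM| value) on x[s:e] (e exclusive)."""
    n = e - s
    csum = np.cumsum(x[s:e])
    total = csum[-1]
    k = np.arange(1, n)                      # k points in the left segment
    stat = (np.sqrt((n - k) / (n * k)) * csum[:-1]
            - np.sqrt(k / (n * (n - k))) * (total - csum[:-1]))
    b = np.argmax(np.abs(stat))
    return s + b, np.abs(stat[b])

def wbs(x, M=1000, threshold=None, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(x)
    if threshold is None:
        # Crude noise estimate from first differences; placeholder constant 1.3.
        sigma = np.std(np.diff(x)) / np.sqrt(2)
        threshold = 1.3 * sigma * np.sqrt(2 * np.log(n))
    # Draw M random intervals once; they are reused by every recursive call.
    starts = rng.integers(0, n - 1, size=M)
    ends = rng.integers(1, n, size=M)
    intervals = [(min(s, e), max(s, e) + 1) for s, e in zip(starts, ends) if abs(e - s) > 1]

    change_points = []
    def recurse(s, e):
        if e - s < 2:
            return
        cands = [(s, e)] + [(a, b) for a, b in intervals if s <= a and b <= e and b - a > 1]
        best_pos, best_val = max((cusum(x, a, b) for a, b in cands), key=lambda t: t[1])
        if best_val > threshold:
            change_points.append(best_pos)
            recurse(s, best_pos + 1)
            recurse(best_pos + 1, e)
    recurse(0, n)
    return sorted(change_points)

# Example: two change-points in the mean, one short segment.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 30), rng.normal(0, 1, 100)])
print(wbs(x, rng=rng))
```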

493 citations


Journal ArticleDOI
TL;DR: In this paper, the covariance test statistic is proposed to test the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path.
Abstract: In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix X. On the other hand, our proof for a general step in the lasso path places further technical assumptions on X and the generative model, but still allows for the important high-dimensional case p > n, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables. Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a $\chi^2_1$ distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than $\chi^2_1$ under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the $\ell_1$ penalty. Therefore, the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties, adaptivity and shrinkage, and its null distribution is tractable and asymptotically Exp(1).
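In its simplest setting, testing the first predictor to enter against the global null with unit-norm predictor columns and known σ, the covariance statistic reduces to T1 = λ1(λ1 − λ2)/σ², where λ1 ≥ λ2 are the first two knots of the lasso/LARS path, and T1 is compared to Exp(1). The sketch below covers only this special case; the rescaling by n reflects sklearn's lars_path convention and is my own bookkeeping, stated as an assumption rather than taken from the paper.

```python
# Hedged sketch of the covariance test in its simplest form: testing the first
# predictor to enter the lasso path against the global null, with unit-norm
# columns and known sigma, where the statistic reduces to
# T1 = lambda_1 * (lambda_1 - lambda_2) / sigma^2 and is compared to Exp(1).
import numpy as np
from sklearn.linear_model import lars_path

def covariance_test_first_step(X, y, sigma):
    n = X.shape[0]
    X = X - X.mean(axis=0)
    X = X / np.linalg.norm(X, axis=0)           # unit-norm columns
    y = y - y.mean()
    alphas, _, _ = lars_path(X, y, method="lasso")
    # sklearn reports knots divided by n; undo that to get max |x_j' y| scale.
    lam1, lam2 = n * alphas[0], n * alphas[1]
    T1 = lam1 * (lam1 - lam2) / sigma ** 2
    return T1, np.exp(-T1)                      # Exp(1) tail gives the p-value

# Under the global null the p-value should be roughly Uniform(0, 1).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = rng.standard_normal(100)                    # no true signal
print(covariance_test_first_step(X, y, sigma=1.0))
```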

425 citations


Journal ArticleDOI
TL;DR: In this paper, an algorithm inspired by sparse subspace clustering (SSC) was proposed to cluster noisy data, and a theory demonstrating its correctness was developed by using geometric functional analysis.
Abstract: Subspace clustering refers to the task of finding a multi-subspace representation that best fits a collection of points taken from a high-dimensional space. This paper introduces an algorithm inspired by sparse subspace clustering (SSC) (25) to cluster noisy data, and develops some novel theory demonstrating its correctness. In particular, the theory uses ideas from geometric functional analysis to show that the algorithm can accurately recover the underlying subspaces under minimal requirements on their orientation, and on the number of samples per subspace. Synthetic as well as real data experiments complement our theoretical study, illustrating our approach and demonstrating its effectiveness.
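A minimal sketch of the lasso-based SSC pipeline the paper builds on: write each point as a sparse combination of the remaining points, symmetrize the coefficient magnitudes into an affinity matrix, then apply spectral clustering. The penalty level lam is a placeholder, not the data-driven choice analyzed in the paper.

```python
# Minimal sketch of lasso-based sparse subspace clustering for noisy data.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc_lasso(X, n_clusters, lam=0.05):
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        # Sparse self-representation of point i in terms of the other points.
        fit = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        fit.fit(X[others].T, X[i])
        C[i, others] = fit.coef_
    W = np.abs(C) + np.abs(C).T                 # symmetric affinity matrix
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed",
                                random_state=0).fit_predict(W)
    return labels

# Two noisy 2-dimensional subspaces inside R^10.
rng = np.random.default_rng(0)
B1, B2 = rng.standard_normal((10, 2)), rng.standard_normal((10, 2))
X = np.vstack([rng.standard_normal((40, 2)) @ B1.T,
               rng.standard_normal((40, 2)) @ B2.T])
X += 0.05 * rng.standard_normal(X.shape)
print(ssc_lasso(X, n_clusters=2))
```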

297 citations


ReportDOI
TL;DR: An abstract approximation theorem applicable to a wide variety of problems, primarily in statistics, is proved. The bound in the main approximation theorem is nonasymptotic, and the theorem does not require uniform boundedness of the class of functions.
Abstract: This paper develops a new direct approach to approximating suprema of general empirical processes by a sequence of suprema of Gaussian processes, without taking the route of approximating whole empirical processes in the sup-norm. We prove an abstract approximation theorem applicable to a wide variety of statistical problems, such as construction of uniform confidence bands for functions. Notably, the bound in the main approximation theorem is nonasymptotic and the theorem allows for functions that index the empirical process to be unbounded and have entropy divergent with the sample size. The proof of the approximation theorem builds on a new coupling inequality for maxima of sums of random vectors, the proof of which depends on an effective use of Stein’s method for normal approximation, and some new empirical process techniques. We study applications of this approximation theorem to local and series empirical processes arising in nonparametric estimation via kernel and series methods, where the classes of functions change with the sample size and are non-Donsker. Importantly, our new technique is able to prove the Gaussian approximation for the supremum type statistics under weak regularity conditions, especially concerning the bandwidth and the number of series functions, in those examples.

257 citations


Journal ArticleDOI
TL;DR: Recently, the authors showed that trend filtering estimates adapt to the local level of smoothness much better than smoothing splines, and further, they exhibit a remarkable similarity to locally adaptive regression splines.
Abstract: We study trend filtering, a recently proposed tool of Kim et al. [SIAM Rev. 51 (2009) 339–360] for nonparametric regression. The trend filtering estimate is defined as the minimizer of a penalized least squares criterion, in which the penalty term sums the absolute $k$th order discrete derivatives over the input points. Perhaps not surprisingly, trend filtering estimates appear to have the structure of $k$th degree spline functions, with adaptively chosen knot points (we say “appear” here as trend filtering estimates are not really functions over continuous domains, and are only defined over the discrete set of inputs). This brings to mind comparisons to other nonparametric regression tools that also produce adaptive splines; in particular, we compare trend filtering to smoothing splines, which penalize the sum of squared derivatives across input points, and to locally adaptive regression splines [Ann. Statist. 25 (1997) 387–413], which penalize the total variation of the $k$th derivative. Empirically, we discover that trend filtering estimates adapt to the local level of smoothness much better than smoothing splines, and further, they exhibit a remarkable similarity to locally adaptive regression splines. We also provide theoretical support for these empirical findings; most notably, we prove that (with the right choice of tuning parameter) the trend filtering estimate converges to the true underlying function at the minimax rate for functions whose $k$th derivative is of bounded variation. This is done via an asymptotic pairing of trend filtering and locally adaptive regression splines, which have already been shown to converge at the minimax rate [Ann. Statist. 25 (1997) 387–413]. At the core of this argument is a new result tying together the fitted values of two lasso problems that share the same outcome vector, but have different predictor matrices.
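Because the estimator is defined by a convex program, it can be written down directly with a generic solver: with unit-spaced inputs, penalizing the kth discrete derivative amounts (up to constant factors) to an ℓ1 penalty on (k+1)st differences. The sketch below is a plain cvxpy transcription of that criterion; specialized algorithms are much faster in practice, and the choice of λ here is purely illustrative.

```python
# Generic-solver sketch of the trend filtering criterion described above:
# minimize 0.5 * ||y - beta||^2 + lam * ||D beta||_1, where D applies (k+1)
# first differences to beta (unit-spaced inputs assumed).
import numpy as np
import cvxpy as cp

def trend_filter(y, k=1, lam=5.0):
    n = len(y)
    # (k+1)-th order difference matrix, built by repeated first differences.
    D = np.eye(n)
    for _ in range(k + 1):
        D = np.diff(D, axis=0)
    beta = cp.Variable(n)
    objective = 0.5 * cp.sum_squares(y - beta) + lam * cp.norm1(D @ beta)
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value

# Piecewise-linear signal plus noise; k = 1 gives a piecewise-linear fit with
# adaptively chosen kinks.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
truth = np.where(x < 0.5, 2 * x, 2 - 2 * x)
y = truth + 0.1 * rng.standard_normal(200)
fit = trend_filter(y, k=1, lam=5.0)
print(np.round(fit[:5], 3))
```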

229 citations


Journal ArticleDOI
TL;DR: In this article, a unified theory for obtaining the strong oracle property via local linear approximation is provided, where the problem is localizable and the oracle estimator is well behaved.
Abstract: Folded concave penalization methods have been shown to enjoy the strong oracle property for high-dimensional sparse estimation. However, a folded concave penalization problem usually has multiple local solutions and the oracle property is established only for one of the unknown local solutions. A challenging fundamental issue still remains that it is not clear whether the local optimum computed by a given optimization algorithm possesses those nice theoretical properties. To close this important theoretical gap in over a decade, we provide a unified theory to show explicitly how to obtain the oracle solution via the local linear approximation algorithm. For a folded concave penalized estimation problem, we show that as long as the problem is localizable and the oracle estimator is well behaved, we can obtain the oracle estimator by using the one-step local linear approximation. In addition, once the oracle estimator is obtained, the local linear approximation algorithm converges, namely it produces the same estimator in the next iteration. The general theory is demonstrated by using four classical sparse estimation problems, i.e., sparse linear regression, sparse logistic regression, sparse precision matrix estimation and sparse quantile regression.
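The one-step LLA recipe described above is short to state: compute an initial estimate, form weights from the derivative of the folded concave penalty at that estimate, and solve the resulting weighted-ℓ1 problem. The sketch below does this for SCAD-penalized least squares with the conventional constant a = 3.7; the lasso initializer and the penalty level are illustrative assumptions.

```python
# Minimal sketch of the one-step local linear approximation (LLA) for
# SCAD-penalized least squares: initial lasso estimate, SCAD-derivative
# weights, then a weighted-l1 problem solved with a generic solver.
import numpy as np
import cvxpy as cp
from sklearn.linear_model import LassoCV

def scad_derivative(t, lam, a=3.7):
    t = np.abs(t)
    return np.where(t <= lam, lam,
                    np.where(t < a * lam, (a * lam - t) / (a - 1), 0.0))

def one_step_lla(X, y, lam):
    n, p = X.shape
    beta0 = LassoCV(cv=5).fit(X, y).coef_           # initial estimate
    w = scad_derivative(beta0, lam)                  # LLA weights p'_lam(|beta0|)
    beta = cp.Variable(p)
    obj = cp.sum_squares(y - X @ beta) / (2 * n) + cp.sum(cp.multiply(w, cp.abs(beta)))
    cp.Problem(cp.Minimize(obj)).solve()
    return beta.value

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))
beta_true = np.zeros(30); beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.standard_normal(200)
print(np.round(one_step_lla(X, y, lam=0.1)[:5], 2))
```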

209 citations


Journal ArticleDOI
TL;DR: In this paper, the authors define the partial distance correlation statistics with the help of a new Hilbert space, and develop and implement a test for zero partial distance correlations, and provide an unbiased estimator of squared distance covariance, and a neat solution to the problem of distance correlation for dissimilarities rather than distances.
Abstract: Distance covariance and distance correlation are scalar coefficients that characterize independence of random vectors in arbitrary dimension. Properties, extensions and applications of distance correlation have been discussed in the recent literature, but the problem of defining the partial distance correlation has remained an open question of considerable interest. The problem of partial distance correlation is more complex than partial correlation partly because the squared distance covariance is not an inner product in the usual linear space. For the definition of partial distance correlation, we introduce a new Hilbert space where the squared distance covariance is the inner product. We define the partial distance correlation statistics with the help of this Hilbert space, and develop and implement a test for zero partial distance correlation. Our intermediate results provide an unbiased estimator of squared distance covariance, and a neat solution to the problem of distance correlation for dissimilarities rather than distances.
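A sketch of the U-centered estimators underlying this construction, written from the description above as I understand it; the centering constants should be checked against the paper or the R package energy, which provides a reference implementation. U-center each pairwise distance matrix, take normalized inner products as bias-corrected squared distance correlations, and combine them as in partial correlation.

```python
# U-centered (bias-corrected) squared distance covariance and the partial
# distance correlation built from it; illustrative sketch.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def u_centered(x):
    """U-centered pairwise Euclidean distance matrix of the sample x."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    a = squareform(pdist(x))
    n = len(a)
    A = (a
         - a.sum(axis=1, keepdims=True) / (n - 2)
         - a.sum(axis=0, keepdims=True) / (n - 2)
         + a.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(A, 0.0)
    return A

def u_inner(A, B):
    """Unbiased squared distance covariance from two U-centered matrices."""
    n = len(A)
    return (A * B).sum() / (n * (n - 3))

def bias_corrected_dcor2(A, B):
    return u_inner(A, B) / np.sqrt(u_inner(A, A) * u_inner(B, B))

def pdcor(x, y, z):
    Ax, Ay, Az = u_centered(x), u_centered(y), u_centered(z)
    rxy = bias_corrected_dcor2(Ax, Ay)
    rxz = bias_corrected_dcor2(Ax, Az)
    ryz = bias_corrected_dcor2(Ay, Az)
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

# x and y are dependent only through z, so the partial coefficient is small.
rng = np.random.default_rng(0)
z = rng.standard_normal(100)
x = z + 0.3 * rng.standard_normal(100)
y = z + 0.3 * rng.standard_normal(100)
print(pdcor(x, y, z))
```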

195 citations


Journal ArticleDOI
TL;DR: This work studies how sharply the minimal error probability of the first kind changes as the error exponent of the second kind increases past the relative entropy of the two states, and obtains the second-order asymptotics for quantum hypothesis testing.
Abstract: In the asymptotic theory of quantum hypothesis testing, the minimal error probability of the first kind jumps sharply from zero to one when the error exponent of the second kind passes by the point of the relative entropy of the two states in an increasing way. This is well known as the direct part and strong converse of quantum Stein’s lemma. Here we look into the behavior of this sudden change and make it clear how the error of the first kind grows smoothly according to a lower order of the error exponent of the second kind, and hence we obtain the second-order asymptotics for quantum hypothesis testing. This actually implies quantum Stein’s lemma as a special case. Meanwhile, our analysis also yields tight bounds for the case of finite sample size. These results have potential applications in quantum information theory. Our method is elementary, based on basic linear algebra and probability theory. It deals with the achievability part and the optimality part in a unified fashion.
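For reference, the second-order expansion described here can be written in the following form, where $\beta_n^*(\varepsilon)$ is the minimal type-II error at type-I error level $\varepsilon$, $D$ the quantum relative entropy, $V$ the corresponding relative entropy variance, and $\Phi$ the standard normal cdf; the exact statement and regularity conditions are in the paper, so treat this display as a paraphrase from memory rather than a quotation:

$-\log\beta_n^*(\varepsilon) = n\,D(\rho\|\sigma) + \sqrt{n\,V(\rho\|\sigma)}\,\Phi^{-1}(\varepsilon) + O(\log n).$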

183 citations


Journal ArticleDOI
TL;DR: These results show that the final estimator attains an oracle statistical property due to the use of a nonconvex penalty, and the analysis improves upon existing results by providing a more refined sample complexity bound as well as an exact support recovery result for the final estimator.
Abstract: We provide theoretical analysis of the statistical and computational properties of penalized M-estimators that can be formulated as the solution to a possibly nonconvex optimization problem. Many important estimators fall in this category, including least squares regression with nonconvex regularization, generalized linear models with nonconvex regularization and sparse elliptical random design regression. For these problems, it is intractable to calculate the global solution due to the nonconvex formulation. In this paper, we propose an approximate regularization path-following method for solving a variety of learning problems with nonconvex objective functions. Under a unified analytic framework, we simultaneously provide explicit statistical and computational rates of convergence for any local solution attained by the algorithm. Computationally, our algorithm attains a global geometric rate of convergence for calculating the full regularization path, which is optimal among all first-order algorithms. Unlike most existing methods that only attain geometric rates of convergence for one single regularization parameter, our algorithm calculates the full regularization path with the same iteration complexity. In particular, we provide a refined iteration complexity bound to sharply characterize the performance of each stage along the regularization path. Statistically, we provide sharp sample complexity analysis for all the approximate local solutions along the regularization path. In particular, our analysis improves upon existing results by providing a more refined sample complexity bound as well as an exact support recovery result for the final estimator. These results show that the final estimator attains an oracle statistical property due to the usage of nonconvex penalty.

182 citations


Journal ArticleDOI
TL;DR: This work substantially simplifies the problem of structure search and estimation for an important class of causal models by establishing consistency of the (restricted) maximum likelihood estimator for low- and high-dimensional scenarios, while allowing for misspecification of the error distribution.
Abstract: We develop estimation for potentially high-dimensional additive structural equation models. A key component of our approach is to decouple order search among the variables from feature or edge selection in a directed acyclic graph encoding the causal structure. We show that the former can be done with nonregularized (restricted) maximum likelihood estimation while the latter can be efficiently addressed using sparse regression techniques. Thus, we substantially simplify the problem of structure search and estimation for an important class of causal models. We establish consistency of the (restricted) maximum likelihood estimator for low- and high-dimensional scenarios, and we also allow for misspecification of the error distribution. Furthermore, we develop an efficient computational algorithm which can deal with many variables, and the new method’s accuracy and performance is illustrated on simulated and real data.

Journal ArticleDOI
TL;DR: In this article, the adaptive robust Lasso (AR-Lasso) is proposed: a two-step penalized quantile regression in which data-driven weights in the L1-penalty ameliorate the bias induced by the penalty.
Abstract: Heavy-tailed high-dimensional data are commonly encountered in various scientific fields and pose great challenges to modern statistical analysis. A natural procedure to address this problem is to use penalized quantile regression with weighted L1-penalty, called weighted robust Lasso (WR-Lasso), in which weights are introduced to ameliorate the bias problem induced by the L1-penalty. In the ultra-high dimensional setting, where the dimensionality can grow exponentially with the sample size, we investigate the model selection oracle property and establish the asymptotic normality of the WR-Lasso. We show that only mild conditions on the model error distribution are needed. Our theoretical results also reveal that adaptive choice of the weight vector is essential for the WR-Lasso to enjoy these nice asymptotic properties. To make the WR-Lasso practically feasible, we propose a two-step procedure, called adaptive robust Lasso (AR-Lasso), in which the weight vector in the second step is constructed based on the L1-penalized quantile regression estimate from the first step. This two-step procedure is justified theoretically to possess the oracle property and the asymptotic normality. Numerical studies demonstrate the favorable finite-sample performance of the AR-Lasso.
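The two-step structure is easy to sketch with a generic solver: step 1 solves an L1-penalized quantile regression, step 2 re-solves it with a weighted L1 penalty whose weights are built from the step-1 estimate. The SCAD-derivative weight rule and the penalty levels below are illustrative stand-ins for the paper's construction, not the construction itself.

```python
# Sketch of a two-step adaptive robust Lasso in the spirit of the description
# above; weight rule, tau and lam are illustrative choices.
import numpy as np
import cvxpy as cp

def penalized_quantile(X, y, tau, weights, lam):
    n, p = X.shape
    beta = cp.Variable(p)
    r = y - X @ beta
    check_loss = cp.sum(cp.maximum(tau * r, (tau - 1) * r)) / n   # quantile loss
    penalty = lam * cp.sum(cp.multiply(weights, cp.abs(beta)))
    cp.Problem(cp.Minimize(check_loss + penalty)).solve()
    return beta.value

def ar_lasso(X, y, tau=0.5, lam=0.05, a=3.7):
    p = X.shape[1]
    beta1 = penalized_quantile(X, y, tau, np.ones(p), lam)        # step 1
    t = np.abs(beta1)
    # SCAD-derivative weights (scaled by 1/lam), one common adaptive choice.
    w = np.where(t <= lam, 1.0,
                 np.where(t < a * lam, (a * lam - t) / ((a - 1) * lam), 0.0))
    return penalized_quantile(X, y, tau, w, lam)                  # step 2

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 40))
beta_true = np.zeros(40); beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.standard_t(df=2, size=200)                # heavy-tailed noise
print(np.round(ar_lasso(X, y)[:5], 2))
```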

Journal ArticleDOI
TL;DR: In this paper, the authors derive confidence sets that allow us to separate topological signal from topological noise in persistent homology, a method for probing topological properties of point clouds and functions.
Abstract: Persistent homology is a method for probing topological properties of point clouds and functions. The method involves tracking the birth and death of topological features as one varies a tuning parameter. Features with short lifetimes are informally considered to be “topological noise,” and those with a long lifetime are considered to be “topological signal.” In this paper, we bring some statistical ideas to persistent homology. In particular, we derive confidence sets that allow us to separate topological signal from topological noise.

Journal ArticleDOI
TL;DR: In this article, the authors formalize the problem of detecting a community in a network into testing whether in a given (random) graph there is a subgraph that is unusually dense.
Abstract: We formalize the problem of detecting a community in a network into testing whether in a given (random) graph there is a subgraph that is unusually dense. Specifically, we observe an undirected and unweighted graph on $N$ nodes. Under the null hypothesis, the graph is a realization of an Erdős–Renyi graph with probability $p_{0}$. Under the (composite) alternative, there is an unknown subgraph of $n$ nodes where the probability of connection is $p_{1}>p_{0}$. We derive a detection lower bound for detecting such a subgraph in terms of $N$, $n$, $p_{0}$, $p_{1}$ and exhibit a test that achieves that lower bound. We do this both when $p_{0}$ is known and unknown. We also consider the problem of testing in polynomial-time. As an aside, we consider the problem of detecting a clique, which is intimately related to the planted clique problem. Our focus in this paper is in the quasi-normal regime where $np_{0}$ is either bounded away from zero, or tends to zero slowly.

Journal ArticleDOI
TL;DR: In this article, a Bayesian approach to variable selection in the presence of high dimensional covariates based on a hierarchical model that places prior distributions on the regression coefficients as well as on the model space is proposed.
Abstract: We consider a Bayesian approach to variable selection in the presence of high dimensional covariates based on a hierarchical model that places prior distributions on the regression coefficients as well as on the model space. We adopt the well-known spike and slab Gaussian priors with a distinct feature, that is, the prior variances depend on the sample size through which appropriate shrinkage can be achieved. We show the strong selection consistency of the proposed method in the sense that the posterior probability of the true model converges to one even when the number of covariates grows nearly exponentially with the sample size. This is arguably the strongest selection consistency result that has been available in the Bayesian variable selection literature; yet the proposed method can be carried out through posterior sampling with a simple Gibbs sampler. Furthermore, we argue that the proposed method is asymptotically similar to model selection with the $L_{0}$ penalty. We also demonstrate through empirical work the fine performance of the proposed approach relative to some state of the art alternatives.

Journal ArticleDOI
TL;DR: Ridge estimation is an extension of mode finding and is useful for understanding the structure of a density; it can also be used to find hidden structure in point cloud data.
Abstract: We study the problem of estimating the ridges of a density function. Ridge estimation is an extension of mode finding and is useful for understanding the structure of a density. It can also be used to find hidden structure in point cloud data. We show that, under mild regularity conditions, the ridges of the kernel density estimator consistently estimate the ridges of the true density. When the data are noisy measurements of a manifold, we show that the ridges are close and topologically similar to the hidden manifold. To find the estimated ridges in practice, we adapt the modified mean-shift algorithm proposed by Ozertem and Erdogmus [J. Mach. Learn. Res. 12 (2011) 1249–1286]. Some numerical experiments verify that the algorithm is accurate.

ReportDOI
TL;DR: A self-tuning square-root Lasso method is proposed that simultaneously resolves three important practical problems in high-dimensional regression analysis: it handles the unknown scale, heteroscedasticity and (drastic) non-Gaussianity of the noise, and it generates sharp bounds even in extreme cases, such as the infinite variance case and the noiseless case.
Abstract: We propose a self-tuning $\sqrt{\mathrm{Lasso}}$ method that simultaneously resolves three important practical problems in high-dimensional regression analysis, namely it handles the unknown scale, heteroscedasticity and (drastic) non-Gaussianity of the noise. In addition, our analysis allows for badly behaved designs, for example, perfectly collinear regressors, and generates sharp bounds even in extreme cases, such as the infinite variance case and the noiseless case, in contrast to Lasso. We establish various nonasymptotic bounds for $\sqrt{\mathrm{Lasso}}$ including prediction norm rate and sparsity. Our analysis is based on new impact factors that are tailored for bounding prediction norm. In order to cover heteroscedastic non-Gaussian noise, we rely on moderate deviation theory for self-normalized sums to achieve Gaussian-like results under weak conditions. Moreover, we derive bounds on the performance of ordinary least squares (OLS) applied to the model selected by $\sqrt{\mathrm{Lasso}}$ accounting for possible misspecification of the selected model. Under mild conditions, the rate of convergence of OLS post $\sqrt{\mathrm{Lasso}}$ is as good as $\sqrt{\mathrm{Lasso}}$’s rate. As an application, we consider the use of $\sqrt{\mathrm{Lasso}}$ and OLS post $\sqrt{\mathrm{Lasso}}$ as estimators of nuisance parameters in a generic semiparametric problem (nonlinear moment condition or $Z$-problem), resulting in a construction of $\sqrt{n}$-consistent and asymptotically normal estimators of the main parameters.
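The estimator itself is a small conic program: minimize the root-mean-square residual plus an ℓ1 penalty whose level does not depend on the noise scale. The sketch below uses the pivotal penalty level commonly associated with this method, c·√n·Φ⁻¹(1 − α/(2p)); take both the constant and the exact scaling as assumptions to verify against the paper.

```python
# Square-root lasso as a generic conic program:
# minimize ||y - X b||_2 / sqrt(n) + (lam / n) * ||b||_1,
# with a pivotal (noise-level-free) penalty lam; constants are assumptions.
import numpy as np
import cvxpy as cp
from scipy import stats

def sqrt_lasso(X, y, alpha=0.05, c=1.1):
    n, p = X.shape
    lam = c * np.sqrt(n) * stats.norm.ppf(1 - alpha / (2 * p))
    b = cp.Variable(p)
    obj = cp.norm(y - X @ b, 2) / np.sqrt(n) + (lam / n) * cp.norm1(b)
    cp.Problem(cp.Minimize(obj)).solve()
    return b.value

# Note that lam is never told the noise scale: that is the point.
rng = np.random.default_rng(0)
X = rng.standard_normal((150, 60))
beta_true = np.zeros(60); beta_true[:4] = [2.0, -2.0, 1.0, 1.0]
y = X @ beta_true + 3.0 * rng.standard_normal(150)
print(np.round(sqrt_lasso(X, y)[:6], 2))
```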

ReportDOI
TL;DR: In this paper, an anti-concentration property of the supremum of a Gaussian process is derived from an inequality leading to a generalized SBR condition for separable Gaussian processes.
Abstract: Modern construction of uniform confidence bands for nonparametric densities (and other functions) often relies on the classical Smirnov–Bickel–Rosenblatt (SBR) condition; see, for example, Giné and Nickl (2010). This condition requires the existence of a limit distribution of an extreme value type for the supremum of a studentized empirical process (equivalently, for the supremum of a Gaussian process with the same covariance function as that of the studentized empirical process). The principal contribution of this paper is to remove the need for this classical condition. We show that a considerably weaker sufficient condition is derived from an anti-concentration property of the supremum of the approximating Gaussian process, and we derive an inequality leading to such a property for separable Gaussian processes. We refer to the new condition as a generalized SBR condition. Our new result shows that the supremum does not concentrate too fast around any value. We then apply this result to derive a Gaussian multiplier bootstrap procedure for constructing honest confidence bands for nonparametric density estimators (this result can be applied in other nonparametric problems as well). An essential advantage of our approach is that it applies generically even in those cases where the limit distribution of the supremum of the studentized empirical process does not exist (or is unknown). This is of particular importance in problems where resolution levels or other tuning parameters have been chosen in a data-driven fashion, which is needed for adaptive constructions of the confidence bands. Furthermore, our approach is asymptotically honest at a polynomial rate: namely, the error in coverage level converges to zero at a fast, polynomial speed (with respect to the sample size). In sharp contrast, the approach based on extreme value theory is asymptotically honest only at a logarithmic rate: the error converges to zero at a slow, logarithmic speed. Finally, of independent interest is our introduction of a new, practical version of Lepski's method, which computes the optimal, non-conservative resolution levels via a Gaussian multiplier bootstrap method.

Journal ArticleDOI
TL;DR: In this article, the authors developed new methods for estimating the graphical structures and underlying parameters, namely, the row and column covariance and inverse covariance matrices from the matrix variate data.
Abstract: Undirected graphs can be used to describe matrix variate distributions. In this paper, we develop new methods for estimating the graphical structures and underlying parameters, namely, the row and column covariance and inverse covariance matrices from the matrix variate data. Under sparsity conditions, we show that one is able to recover the graphs and covariance matrices with a single random matrix from the matrix variate normal distribution. Our method extends, with suitable adaptation, to the general setting where replicates are available. We establish consistency and obtain the rates of convergence in the operator and the Frobenius norm. We show that having replicates will allow one to estimate more complicated graphical structures and achieve faster rates of convergence. We provide simulation evidence showing that we can recover graphical structures as well as estimating the precision matrices, as predicted by theory.

Journal ArticleDOI
TL;DR: In this paper, a multiscale space on which nonparametric priors and posteriors are naturally defined is introduced, and the authors prove Bernstein-von-Mises theorems for a variety of priors in the setting of Gaussian non-parametric regression and in the i.i.d. sampling model.
Abstract: We continue the investigation of Bernstein–von Mises theorems for nonparametric Bayes procedures from [Ann. Statist. 41 (2013) 1999–2028]. We introduce multiscale spaces on which nonparametric priors and posteriors are naturally defined, and prove Bernstein–von Mises theorems for a variety of priors in the setting of Gaussian nonparametric regression and in the i.i.d. sampling model. From these results we deduce several applications where posterior-based inference coincides with efficient frequentist procedures, including Donsker– and Kolmogorov–Smirnov theorems for the random posterior cumulative distribution functions. We also show that multiscale posterior credible bands for the regression or density function are optimal frequentist confidence bands.

Journal ArticleDOI
TL;DR: In this article, the problem of matrix denoising is solved by applying soft thresholding to the singular values of the noisy measurement, where the noise matrix has i.i.d. Gaussian entries.
Abstract: An unknown $m$ by $n$ matrix $X_{0}$ is to be estimated from noisy measurements $Y=X_{0}+Z$, where the noise matrix $Z$ has i.i.d. Gaussian entries. A popular matrix denoising scheme solves the nuclear norm penalization problem $\operatorname{min}_{X}\|Y-X\|_{F}^{2}/2+\lambda\|X\|_{*}$, where $\|X\|_{*}$ denotes the nuclear norm (sum of singular values). This is the analog, for matrices, of $\ell_{1}$ penalization in the vector case. It has been empirically observed that if $X_{0}$ has low rank, it may be recovered quite accurately from the noisy measurement $Y$. In a proportional growth framework where the rank $r_{n}$, number of rows $m_{n}$ and number of columns $n$ all tend to $\infty$ proportionally to each other ($r_{n}/m_{n}\rightarrow \rho$, $m_{n}/n\rightarrow \beta$), we evaluate the asymptotic minimax MSE $ \mathcal{M} (\rho,\beta)=\lim_{m_{n},n\rightarrow \infty}\inf_{\lambda}\sup_{\operatorname{rank}(X)\leq r_{n}}\operatorname{MSE}(X_{0},\hat{X}_{\lambda})$. Our formulas involve incomplete moments of the quarter- and semi-circle laws ($\beta=1$, square case) and the Marcenko–Pastur law ($\beta<1$, nonsquare case). For finite $m$ and $n$, we show that MSE increases as the nonzero singular values of $X_{0}$ grow larger. As a result, the finite-$n$ worst-case MSE, a quantity which can be evaluated numerically, is achieved when the signal $X_{0}$ is “infinitely strong.” The nuclear norm penalization problem is solved by applying soft thresholding to the singular values of $Y$. We also derive the minimax threshold, namely the value $\lambda^{*}(\rho)$, which is the optimal place to threshold the singular values. All these results are obtained for general (nonsquare, nonsymmetric) real matrices. Comparable results are obtained for square symmetric nonnegative-definite matrices.
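The denoiser described above has a closed form: take the SVD of Y and soft-threshold the singular values. The sketch below implements that step; the paper's optimal threshold λ*(ρ) is not reproduced here, and the λ used in the example is just an illustrative multiple of √n.

```python
# Nuclear-norm-penalized matrix denoising via singular value soft thresholding.
import numpy as np

def svt_denoise(Y, lam):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)          # soft-threshold singular values
    return (U * s_shrunk) @ Vt

rng = np.random.default_rng(0)
m, n, r = 100, 80, 5
X0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # low-rank signal
Y = X0 + rng.standard_normal((m, n))                              # i.i.d. Gaussian noise
Xhat = svt_denoise(Y, lam=1.5 * np.sqrt(n))                       # illustrative threshold
print(np.linalg.norm(Xhat - X0) / np.linalg.norm(X0))             # relative error
```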

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a nonparametric maximum likelihood approach to detect multiple change-points in the data sequence, which does not impose any parametric assumption on the underlying distributions.
Abstract: In multiple change-point problems, different data segments often follow different distributions, for which the changes may occur in the mean, scale or the entire distribution from one segment to another. Without the need to know the number of change-points in advance, we propose a nonparametric maximum likelihood approach to detecting multiple change-points. Our method does not impose any parametric assumption on the underlying distributions of the data sequence, which is thus suitable for detection of any changes in the distributions. The number of change-points is determined by the Bayesian information criterion and the locations of the change-points can be estimated via the dynamic programming algorithm and the use of the intrinsic order structure of the likelihood function. Under some mild conditions, we show that the new method provides consistent estimation with an optimal rate. We also suggest a prescreening procedure to exclude most of the irrelevant points prior to the implementation of the nonparametric likelihood method. Simulation studies show that the proposed method has satisfactory performance of identifying multiple change-points in terms of estimation accuracy and computation time.

Journal ArticleDOI
TL;DR: In this article, a new class of continuous shrinkage priors is proposed for sparse Bayesian factor models with sparsity assumptions on the true covariance matrix, and the convergence rates of these priors are derived for high-dimensional covariance matrices.
Abstract: Sparse Bayesian factor models are routinely implemented for parsimonious dependence modeling and dimensionality reduction in high-dimensional applications. We provide theoretical understanding of such Bayesian procedures in terms of posterior convergence rates in inferring high-dimensional covariance matrices where the dimension can be larger than the sample size. Under relevant sparsity assumptions on the true covariance matrix, we show that commonly-used point mass mixture priors on the factor loadings lead to consistent estimation in the operator norm even when $p\gg n$. One of our major contributions is to develop a new class of continuous shrinkage priors and provide insights into their concentration around sparse vectors. Using such priors for the factor loadings, we obtain similar rate of convergence as obtained with point mass mixture priors. To obtain the convergence rates, we construct test functions to separate points in the space of high-dimensional covariance matrices using insights from random matrix theory; the tools developed may be of independent interest. We also derive minimax rates and show that the Bayesian posterior rates of convergence coincide with the minimax rates upto a $\sqrt{\log n}$ term.

Journal ArticleDOI
TL;DR: In this paper, the problem of estimating the mean of a Gaussian random vector when the mean vector is assumed to be in a given convex set is considered. The most natural solution is to take the Euclidean projection of the data vector onto this convex set; in other words, to perform "least squares under a convex constraint".
Abstract: Consider the problem of estimating the mean of a Gaussian random vector when the mean vector is assumed to be in a given convex set. The most natural solution is to take the Euclidean projection of the data vector on to this convex set; in other words, performing “least squares under a convex constraint.” Many problems in modern statistics and statistical signal processing theory are special cases of this general situation. Examples include the lasso and other high-dimensional regression techniques, function estimation problems, matrix estimation and completion, shape-restricted regression, constrained denoising, linear inverse problems, etc. This paper presents three general results about this problem, namely, (a) an exact computation of the main term in the estimation error by relating it to expected maxima of Gaussian processes (existing results only give upper bounds), (b) a theorem showing that the least squares estimator is always admissible up to a universal constant in any problem of the above kind and (c) a counterexample showing that least squares estimator may not always be minimax rate-optimal. The result from part (a) is then used to compute the error of the least squares estimator in two examples of contemporary interest.
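A tiny concrete instance of "least squares under a convex constraint": projecting the data vector onto the cone of nondecreasing vectors (shape-restricted regression). The cvxpy sketch below is generic, so any of the other convex sets mentioned above could be substituted by changing the constraint list.

```python
# Euclidean projection of the data vector onto a convex set: here, the cone of
# nondecreasing vectors (isotonic regression).
import numpy as np
import cvxpy as cp

def project_isotonic(y):
    theta = cp.Variable(len(y))
    constraints = [cp.diff(theta) >= 0]           # nondecreasing constraint
    cp.Problem(cp.Minimize(cp.sum_squares(y - theta)), constraints).solve()
    return theta.value

rng = np.random.default_rng(0)
y = np.sort(rng.standard_normal(30)) + 0.5 * rng.standard_normal(30)
print(np.round(project_isotonic(y)[:10], 2))
```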

Journal ArticleDOI
TL;DR: In this paper, a penalized focused generalized method of moments (FGMM) criterion function is proposed to cope with the incidental endogeneity of high-dimensional regression, which achieves the dimension reduction and applies the instrumental variable methods.
Abstract: Most papers on high-dimensional statistics are based on the assumption that none of the regressors are correlated with the regression error, namely, they are exogenous. Yet, endogeneity can arise incidentally from a large pool of regressors in a high-dimensional regression. This causes the inconsistency of the penalized least-squares method and possible false scientific discoveries. A necessary condition for model selection consistency of a general class of penalized regression methods is given, which allows us to prove formally the inconsistency claim. To cope with the incidental endogeneity, we construct a novel penalized focused generalized method of moments (FGMM) criterion function. The FGMM effectively achieves the dimension reduction and applies the instrumental variable methods. We show that it possesses the oracle property even in the presence of endogenous predictors, and that the solution is also near global minimum under the over-identification assumption. Finally, we also show how the semi-parametric efficiency of estimation can be achieved via a two-step approach.

Journal ArticleDOI
TL;DR: The authors proposed a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features.
Abstract: For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ*. By contrast, our estimator is consistent for θ* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to [Formula: see text] if we multiply the baseline acceptance probabilities by c > 1 (and weight points with acceptance probability greater than 1), taking roughly [Formula: see text] times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
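The scheme described above can be sketched directly: fit a pilot model, accept each point with probability |y − p̃(x)|, fit an ordinary logistic regression on the accepted subsample, and add the pilot coefficients back as the post-hoc correction (in the subsampled population the log-odds equal the original log-odds minus the pilot log-odds). The pilot-fitting choice and the unregularized sklearn fits below are assumptions made for illustration.

```python
# Sketch of local case-control subsampling as described above: pilot fit,
# accept-reject with probability |y - p_tilde(x)|, logistic fit on the
# subsample, then add the pilot coefficients back.
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_case_control(X, y, pilot_frac=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(y)

    # Pilot fit on a small uniform subsample (one of several reasonable choices).
    pilot_idx = rng.choice(n, size=max(int(pilot_frac * n), 100), replace=False)
    pilot = LogisticRegression(C=1e6).fit(X[pilot_idx], y[pilot_idx])
    p_tilde = pilot.predict_proba(X)[:, 1]

    # Accept-reject: keep points whose response is conditionally rare given x.
    accept = rng.random(n) < np.abs(y - p_tilde)
    sub = LogisticRegression(C=1e6).fit(X[accept], y[accept])

    # Post-hoc correction: the subsample fit estimates (theta - theta_pilot).
    coef = sub.coef_.ravel() + pilot.coef_.ravel()
    intercept = sub.intercept_[0] + pilot.intercept_[0]
    return coef, intercept, accept.mean()

# Severely imbalanced synthetic data.
rng = np.random.default_rng(1)
n, p = 200000, 5
X = rng.standard_normal((n, p))
theta = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
logits = X @ theta - 5.0                        # rare positive class
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
coef, intercept, frac = local_case_control(X, y, rng=rng)
print(np.round(coef, 2), round(intercept, 2), f"subsample fraction ~ {frac:.3f}")
```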

Journal ArticleDOI
TL;DR: The introduced methodology is used to prove that commonly used families of prior distributions on densities, namely log-density priors and dyadic random density histograms, can indeed achieve optimal sup-norm rates of convergence.
Abstract: Building on ideas from Castillo and Nickl (6), a method is provided to study nonparametric Bayesian posterior convergence rates when 'strong' measures of distances, such as the sup-norm, are considered. In particular, we show that likelihood methods can achieve optimal minimax sup-norm rates in density estimation on the unit interval. The introduced methodology is used to prove that commonly used families of prior distributions on densities, namely log-density priors and dyadic random density histograms, can indeed achieve optimal sup-norm rates of convergence. New results are also derived in the Gaussian white noise model as a further illustration of the presented techniques. AMS 2000 subject classifications: Primary 62G20; secondary 62G05, 62G07.

Journal ArticleDOI
TL;DR: In this article, the identifiability of demographic models under the restriction that the population sizes are piecewise defined where each piece belongs to some family of biologically-motivated functions is examined.
Abstract: The sample frequency spectrum (SFS) is a widely-used summary statistic of genomic variation in a sample of homologous DNA sequences. It provides a highly efficient dimensional reduction of large-scale population genomic data and its mathematical dependence on the underlying population demography is well understood, thus enabling the development of efficient inference algorithms. However, it has been recently shown that very different population demographies can actually generate the same SFS for arbitrarily large sample sizes. Although in principle this nonidentifiability issue poses a thorny challenge to statistical inference, the population size functions involved in the counterexamples are arguably not so biologically realistic. Here, we revisit this problem and examine the identifiability of demographic models under the restriction that the population sizes are piecewise-defined where each piece belongs to some family of biologically-motivated functions. Under this assumption, we prove that the expected SFS of a sample uniquely determines the underlying demographic model, provided that the sample is sufficiently large. We obtain a general bound on the sample size sufficient for identifiability; the bound depends on the number of pieces in the demographic model and also on the type of population size function in each piece. In the cases of piecewise-constant, piecewise-exponential and piecewise-generalized-exponential models, which are often assumed in population genomic inferences, we provide explicit formulas for the bounds as simple functions of the number of pieces. Lastly, we obtain analogous results for the "folded" SFS, which is often used when there is ambiguity as to which allelic type is ancestral. Our results are proved using a generalization of Descartes' rule of signs for polynomials to the Laplace transform of piecewise continuous functions.

Journal ArticleDOI
TL;DR: In this article, an efficient estimator for the quadratic covariation or integrated co-volatility matrix of a multivariate continuous martingale based on noisy and nonsynchronous observations under high-frequency asymptotics is constructed.
Abstract: An efficient estimator is constructed for the quadratic covariation or integrated co-volatility matrix of a multivariate continuous martingale based on noisy and nonsynchronous observations under high-frequency asymptotics. Our approach relies on an asymptotically equivalent continuous-time observation model where a local generalised method of moments in the spectral domain turns out to be optimal. Asymptotic semi-parametric efficiency is established in the Cramer–Rao sense. Main findings are that nonsynchronicity of observation times has no impact on the asymptotics and that major efficiency gains are possible under correlation. Simulations illustrate the finite-sample behaviour.

Journal ArticleDOI
TL;DR: This paper proposed a consistent estimator of sharp bounds on the variance of the difference-in-means estimator in completely randomized experiments, which facilitates the asymptotically narrowest conservative Wald-type confidence intervals, with applications in randomized controlled and clinical trials.
Abstract: We propose a consistent estimator of sharp bounds on the variance of the difference-in-means estimator in completely randomized experiments. Generalizing Robins [Stat. Med. 7 (1988) 773–785], our results resolve a well-known identification problem in causal inference posed by Neyman [Statist. Sci. 5 (1990) 465–472. Reprint of the original 1923 paper]. A practical implication of our results is that the upper bound estimator facilitates the asymptotically narrowest conservative Wald-type confidence intervals, with applications in randomized controlled and clinical trials.