
Showing papers in "Annals of Statistics in 2011"


Journal ArticleDOI
TL;DR: In this article, the authors studied spectral clustering under the stochastic block model and showed that the eigenvectors of the normalized graph Laplacian asymptotically converge to the eigenvectors of a population normalized graph Laplacian.
Abstract: Networks or graphs can easily represent a diverse set of data sources that are characterized by interacting units or actors. Social networks, representing people who communicate with each other, are one example. Communities or clusters of highly connected actors form an essential feature in the structure of several empirical networks. Spectral clustering is a popular and computationally feasible method to discover these communities. The Stochastic Block Model (Holland et al., 1983) is a social network model with well-defined communities; each node is a member of one community. For a network generated from the Stochastic Block Model, we bound the number of nodes "misclustered" by spectral clustering. The asymptotic results in this paper are the first clustering results that allow the number of clusters in the model to grow with the number of nodes, hence the name high-dimensional. In order to study spectral clustering under the Stochastic Block Model, we first show that under the more general latent space model, the eigenvectors of the normalized graph Laplacian asymptotically converge to the eigenvectors of a "population" normalized graph Laplacian. Aside from the implication for spectral clustering, this provides insight into a graph visualization technique. Our method of studying the eigenvectors of random matrices is original. AMS 2000 subject classifications: Primary 62H30, 62H25; secondary 60B20.
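As a rough illustration of the pipeline this abstract describes (not the authors' code), the sketch below clusters a small simulated stochastic block model by running k-means on the leading eigenvectors of the normalized graph Laplacian; the block probabilities, sizes and helper names are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(A, k, seed=0):
    """k-means on the k leading eigenvectors of the normalized Laplacian
    L = D^{-1/2} A D^{-1/2} of an undirected graph with adjacency matrix A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))   # guard against isolated nodes
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)                        # eigenvalues in ascending order
    X = vecs[:, -k:]                                      # eigenvectors of the k largest eigenvalues
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)

# toy Stochastic Block Model with two equal communities (illustrative parameters)
rng = np.random.default_rng(0)
n, k = 100, 2
z = np.repeat([0, 1], n // 2)
P = np.where(z[:, None] == z[None, :], 0.30, 0.05)        # within/between block probabilities
A = rng.binomial(1, P)
A = np.triu(A, 1); A = A + A.T                            # symmetrize, no self-loops
labels = spectral_clusters(A, k)
```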

713 citations


Journal ArticleDOI
TL;DR: In this paper, a path algorithm for the generalized lasso problem is presented; it is based on solving the dual of the generalized lasso, which greatly facilitates computation of the path.
Abstract: We present a path algorithm for the generalized lasso problem. This problem penalizes the l1 norm of a matrix D times the coefficient vector, and has a wide range of applications, dictated by the choice of D. Our algorithm is based on solving the dual of the generalized lasso, which greatly facilitates computation of the path. For D = I (the usual lasso), we draw a connection between our approach and the well-known LARS algorithm. For an arbitrary D, we derive an unbiased estimate of the degrees of freedom of the generalized lasso fit. This estimate turns out to be quite intuitive in many applications.
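The paper's contribution is the path algorithm via the dual; as a hedged illustration of the generalized lasso objective itself, the sketch below solves a single-penalty instance with a generic convex solver (cvxpy), taking D to be a first-difference matrix (the fused lasso) as the example choice of D. The data and penalty level are made up for the example and this is not the authors' path algorithm.

```python
import numpy as np
import cvxpy as cp

# toy fused-lasso instance: recover a piecewise-constant signal
rng = np.random.default_rng(0)
n = 100
y = np.concatenate([np.zeros(50), 2.0 * np.ones(50)]) + rng.normal(0, 0.5, n)
X = np.eye(n)                        # identity design: pure signal approximation
D = np.diff(np.eye(n), axis=0)       # (n-1) x n first-difference operator

lam = 5.0                            # a single point on the path
beta = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(y - X @ beta) + lam * cp.norm1(D @ beta))
cp.Problem(objective).solve()
beta_hat = beta.value                # piecewise-constant fit
```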

683 citations


Journal ArticleDOI
TL;DR: In this article, a new nuclear-norm penalized estimator of A0 is proposed, and a general sharp oracle inequality is established for this estimator for arbitrary values of n, m1, m2 under the condition of isometry in expectation.
Abstract: This paper deals with the trace regression model where n entries or linear combinations of entries of an unknown m1 × m2 matrix A0 corrupted by noise are observed. We propose a new nuclear-norm penalized estimator of A0 and establish a general sharp oracle inequality for this estimator for arbitrary values of n, m1, m2 under the condition of isometry in expectation. Then this method is applied to the matrix completion problem. In this case, the estimator admits a simple explicit form, and we prove that it satisfies oracle inequalities with faster rates of convergence than in the previous works. They are valid, in particular, in the high-dimensional setting m1m2 ≫ n. We show that the obtained rates are optimal up to logarithmic factors in a minimax sense and also derive, for any fixed matrix A0, a nonminimax lower bound on the rate of convergence of our estimator, which coincides with the upper bound up to a constant factor. Finally, we show that our procedure provides an exact recovery of the rank of A0 with probability close to 1. We also discuss the statistical learning setting where there is no underlying model determined by A0, and the aim is to find the best trace regression model approximating the data. As a by-product, we show that, under the restricted eigenvalue condition, the usual vector Lasso estimator satisfies a sharp oracle inequality (i.e., an oracle inequality with leading constant 1).
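As an informal sketch of the kind of estimator the abstract refers to in the matrix completion case (soft-thresholding of the singular values of a rescaled matrix of observed entries, assuming uniform sampling), the toy below uses an arbitrary threshold and is not the exact estimator or tuning from the paper.

```python
import numpy as np

def soft_threshold_svd(M, tau):
    """Soft-threshold the singular values of M at level tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# toy matrix completion: rank-2 signal, ~30% of entries observed uniformly at random
rng = np.random.default_rng(0)
m1, m2, r = 60, 40, 2
A0 = rng.normal(size=(m1, r)) @ rng.normal(size=(r, m2))
mask = rng.random((m1, m2)) < 0.3
Y = np.where(mask, A0 + 0.1 * rng.normal(size=(m1, m2)), 0.0)

n_obs = mask.sum()
X = (m1 * m2 / n_obs) * Y            # unbiased proxy for A0 under uniform sampling
tau = 2.5                            # threshold level; in the paper this is set by theory
A_hat = soft_threshold_svd(X, tau)
```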

613 citations


Journal ArticleDOI
TL;DR: Simulations show excellent agreement with the high-dimensional scaling of the error predicted by the theory, and the consequences of the general results are illustrated for a number of specific learning models, including low-rank multivariate or multi-task regression, system identification in vector autoregressive processes, and recovery of low-rank matrices from random projections.
Abstract: We study an instance of high-dimensional inference in which the goal is to estimate a matrix Θ * ∈ ℝ m1×m2 on the basis of N noisy observations. The unknown matrix Θ * is assumed to be either exactly low rank, or "near" low-rank, meaning that it can be well-approximated by a matrix with low rank. We consider a standard M-estimator based on regularization by the nuclear or trace norm over matrices, and analyze its performance under high-dimensional scaling. We define the notion of restricted strong convexity (RSC) for the loss function, and use it to derive nonasymptotic bounds on the Frobenius norm error that hold for a general class of noisy observation models, and apply to both exactly low-rank and approximately low rank matrices. We then illustrate consequences of this general theory for a number of specific matrix models, including low-rank multivariate or multi-task regression, system identification in vector autoregressive processes and recovery of low-rank matrices from random projections. These results involve nonasymptotic random matrix theory to establish that the RSC condition holds, and to determine an appropriate choice of regularization parameter. Simulation results show excellent agreement with the high-dimensional scaling of the error predicted by our theory.

587 citations


Journal ArticleDOI
TL;DR: The use of clinical trial data in the construction of an individualized treatment rule leading to the highest mean response is considered, and estimation based on l1-penalized least squares is proposed.
Abstract: Because many illnesses show heterogeneous response to treatment, there is increasing interest in individualizing treatment to patients [11]. An individualized treatment rule is a decision rule that recommends treatment according to patient characteristics. We consider the use of clinical trial data in the construction of an individualized treatment rule leading to the highest mean response. This is a difficult computational problem because the objective function is the expectation of a weighted indicator function that is nonconcave in the parameters. Furthermore, there are frequently many pretreatment variables that may or may not be useful in constructing an optimal individualized treatment rule, yet cost and interpretability considerations imply that only a few variables should be used by the individualized treatment rule. To address these challenges, we consider estimation based on l1-penalized least squares. This approach is justified via a finite sample upper bound on the difference between the mean response due to the estimated individualized treatment rule and the mean response due to the optimal individualized treatment rule.
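A minimal sketch of the estimation strategy described above, assuming a binary treatment and a linear model with treatment-covariate interactions; the simulated trial, penalty level and the helper `recommend` are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

# toy trial: binary treatment A in {0, 1}, response depends on an X-by-A interaction
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
A = rng.binomial(1, 0.5, n)
Y = 1.0 + X[:, 0] + A * (0.5 - X[:, 1]) + rng.normal(0, 1, n)

# regress Y on main effects and treatment interactions with an l1 penalty
Z = np.hstack([X, A[:, None], A[:, None] * X])
model = Lasso(alpha=0.05).fit(Z, Y)

def recommend(x, model):
    """Recommend the treatment with the larger predicted mean response."""
    z0 = np.hstack([x, [0.0], 0.0 * x])
    z1 = np.hstack([x, [1.0], 1.0 * x])
    return int(model.predict(z1[None, :])[0] > model.predict(z0[None, :])[0])

rule = [recommend(x, model) for x in X[:5]]
```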

518 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of estimating a sparse linear regression vector β* under a Gaussian noise model, for the purpose of both prediction and model selection, and establish oracle inequalities for the prediction and l2 estimation errors of this estimator.
Abstract: We consider the problem of estimating a sparse linear regression vector β* under a Gaussian noise model, for the purpose of both prediction and model selection. We assume that prior knowledge is available on the sparsity pattern, namely the set of variables is partitioned into prescribed groups, only few of which are relevant in the estimation process. This group sparsity assumption suggests considering the Group Lasso method as a means to estimate β*. We establish oracle inequalities for the prediction and l2 estimation errors of this estimator. These bounds hold under a restricted eigenvalue condition on the design matrix. Under a stronger condition, we derive bounds for the estimation error for mixed (2, p)-norms with 1 ≤ p ≤ ∞. When p=∞, this result implies that a thresholded version of the Group Lasso estimator selects the sparsity pattern of β* with high probability. Next, we prove that the rate of convergence of our upper bounds is optimal in a minimax sense, up to a logarithmic factor, for all estimators over a class of group sparse vectors. Furthermore, we establish lower bounds for the prediction and l2 estimation errors of the usual Lasso estimator. Using this result, we demonstrate that the Group Lasso can achieve an improvement in the prediction and estimation errors as compared to the Lasso. An important application of our results is provided by the problem of estimating multiple regression equations simultaneously or multi-task learning. In this case, we obtain refinements of the results in [In Proc. of the 22nd Annual Conference on Learning Theory (COLT) (2009)], which allow us to establish a quantitative advantage of the Group Lasso over the usual Lasso in the multi-task setting. Finally, within the same setting, we show how our results can be extended to more general noise distributions, of which we only require the fourth moment to be finite. To obtain this extension, we establish a new maximal moment inequality, which may be of independent interest.
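For concreteness, a small proximal-gradient sketch of the Group Lasso objective assumed above (squared loss plus a sum of unweighted group l2 norms). The step size, penalty level and simulated group-sparse design are illustrative; the paper is about oracle inequalities for the estimator, not about this particular algorithm.

```python
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal gradient descent for 0.5/n * ||y - X b||^2 + lam * sum_g ||b_g||_2.
    `groups` is a list of index arrays partitioning the coefficients."""
    n, p = X.shape
    b = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - step * grad
        for g in groups:                               # block soft-thresholding
            norm_g = np.linalg.norm(z[g])
            z[g] = 0.0 if norm_g == 0 else max(0.0, 1 - step * lam / norm_g) * z[g]
        b = z
    return b

# toy group-sparse design: 10 groups of 5 coefficients, only 2 groups active
rng = np.random.default_rng(0)
n, p = 200, 50
groups = [np.arange(5 * j, 5 * j + 5) for j in range(10)]
beta = np.zeros(p); beta[:5] = 1.0; beta[5:10] = -1.0
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(0, 0.5, n)
beta_hat = group_lasso(X, y, groups, lam=0.1)
```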

404 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider median regression and a possibly infinite collection of quantile regressions in high-dimensional sparse models, and show that under general conditions l1-QR is consistent at a near-oracle rate.
Abstract: We consider median regression and, more generally, a possibly infinite collection of quantile regressions in high-dimensional sparse models. In these models, the number of regressors p is very large, possibly larger than the sample size n, but only at most s regressors have a nonzero impact on each conditional quantile of the response variable, where s grows more slowly than n. Since ordinary quantile regression is not consistent in this case, we consider l1-penalized quantile regression (l1-QR), which penalizes the l1-norm of regression coefficients, as well as the post-penalized QR estimator (post-l1-QR), which applies ordinary QR to the model selected by l1-QR. First, we show that under general conditions l1-QR is consistent at the near-oracle rate $\sqrt{s/n}\sqrt{\log(p\vee n)}$, uniformly in the compact set $\mathcal{U}\subset(0,1)$ of quantile indices. In deriving this result, we propose a partly pivotal, data-driven choice of the penalty level and show that it satisfies the requirements for achieving this rate. Second, we show that under similar conditions post-l1-QR is consistent at the near-oracle rate $\sqrt{s/n}\sqrt{\log(p\vee n)}$, uniformly over $\mathcal{U}$, even if the l1-QR-selected models miss some components of the true models, and the rate could be even closer to the oracle rate otherwise. Third, we characterize conditions under which l1-QR contains the true model as a submodel, and derive bounds on the dimension of the selected model, uniformly over $\mathcal{U}$; we also provide conditions under which hard-thresholding selects the minimal true model, uniformly over $\mathcal{U}$.
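An off-the-shelf analogue (not the authors' code): scikit-learn's QuantileRegressor minimizes the pinball loss plus an l1 penalty, so it can stand in for l1-QR at a single quantile, with a refit on the selected variables as a rough stand-in for post-l1-QR. The data, quantile and penalty level below are illustrative, and the penalty scaling differs from the data-driven choice analyzed in the paper.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

# sparse toy data: only the first 3 of 50 regressors matter, heavy-tailed noise
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] - X[:, 2] + rng.standard_t(df=3, size=n)

# l1-penalized median regression (tau = 0.5); alpha plays the role of the penalty level
l1_qr = QuantileRegressor(quantile=0.5, alpha=0.05, fit_intercept=True).fit(X, y)
selected = np.flatnonzero(np.abs(l1_qr.coef_) > 1e-8)

# post-l1-QR: refit unpenalized quantile regression on the selected variables
post = QuantileRegressor(quantile=0.5, alpha=0.0).fit(X[:, selected], y)
```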

382 citations


Journal ArticleDOI
TL;DR: This work investigates penalized least squares estimators with a Schatten-p quasi-norm penalty term and derives bounds for the kth entropy numbers of the quasi-convex Schatten class embeddings S_p^M → S_2^M, p < 1, which are of independent interest.
Abstract: Suppose that we observe entries or, more generally, linear combinations of entries of an unknown m × T matrix A corrupted by noise. We are particularly interested in the high-dimensional setting where the number mT of unknown entries can be much larger than the sample size N. Motivated by several applications, we consider estimation of matrix A under the assumption that it has small rank. This can be viewed as dimension reduction or sparsity assumption. In order to shrink toward a low-rank representation, we investigate penalized least squares estimators with a Schatten-p quasi-norm penalty term, p ≤ 1. We study these estimators under two possible assumptions—a modified version of the restricted isometry condition and a uniform bound on the ratio "empirical norm induced by the sampling operator/Frobenius norm." The main results are stated as nonasymptotic upper bounds on the prediction risk and on the Schatten-q risk of the estimators, where q ∈ [p, 2]. The rates that we obtain for the prediction risk are of the form rm/N (for m = T), up to logarithmic factors, where r is the rank of A. The particular examples of multi-task learning and matrix completion are worked out in detail. The proofs are based on tools from the theory of empirical processes. As a by-product, we derive bounds for the kth entropy numbers of the quasi-convex Schatten class embeddings S_p^M → S_2^M, p < 1, which are of independent interest.

331 citations


Journal ArticleDOI
TL;DR: In this article, the multivariate group Lasso was shown to exhibit a threshold for the recovery of the exact row pattern with high probability over the random design and noise that is specified by the sample complexity parameter θ(n, p, s) := n/[2ψ(B*) log(p − s)].
Abstract: We study the multivariate group Lasso, in which block regularization based on the ℓ1/ℓ2 norm is used for support union recovery, or recovery of the set of s rows for which B* is nonzero. Under high-dimensional scaling, we show that the multivariate group Lasso exhibits a threshold for the recovery of the exact row pattern with high probability over the random design and noise that is specified by the sample complexity parameter θ(n, p, s) := n/[2ψ(B*) log(p − s)]. Here n is the sample size, and ψ(B*) is a sparsity-overlap function measuring a combination of the sparsities and overlaps of the K regression coefficient vectors that constitute the model. We prove that the multivariate group Lasso succeeds for problem sequences (n, p, s) such that θ(n, p, s) exceeds a critical level θu, and fails for sequences such that θ(n, p, s) lies below a critical level θℓ. For the special case of the standard Gaussian ensemble, we show that θℓ = θu so that the characterization is sharp. The sparsity-overlap function ψ(B*) reveals that, if the design is uncorrelated on the active rows, ℓ1/ℓ2 regularization for multivariate regression never harms performance relative to an ordinary Lasso approach, and can yield substantial improvements in sample complexity (up to a factor of K) when the regression vectors are suitably orthogonal. For more general designs, it is possible for the ordinary Lasso to outperform the multivariate group Lasso. We complement our analysis with simulations that demonstrate the sharpness of our theoretical results, even for relatively small problems.

284 citations


Journal ArticleDOI
TL;DR: This work proposes adaptive penalization methods for variable selection in the semiparametric varying-coefficient partially linear model and proves that the methods possess the oracle property.
Abstract: The complexity of semiparametric models poses new challenges to statistical inference and model selection that frequently arise from real applications. In this work, we propose new estimation and variable selection procedures for the semiparametric varying-coefficient partially linear model. We first study quantile regression estimates for the nonparametric varying-coefficient functions and the parametric regression coefficients. To achieve nice efficiency properties, we further develop a semiparametric composite quantile regression procedure. We establish the asymptotic normality of proposed estimators for both the parametric and nonparametric parts and show that the estimators achieve the best convergence rate. Moreover, we show that the proposed method is much more efficient than the least-squares-based method for many non-normal errors and that it only loses a small amount of efficiency for normal errors. In addition, it is shown that the loss in efficiency is at most 11.1% for estimating varying coefficient functions and is no greater than 13.6% for estimating parametric components. To achieve sparsity with high-dimensional covariates, we propose adaptive penalization methods for variable selection in the semiparametric varying-coefficient partially linear model and prove that the methods possess the oracle property. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. Finally, we apply the new methods to analyze the plasma beta-carotene level data.

265 citations


Journal ArticleDOI
TL;DR: In this paper, the authors estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu [J. Amer. Statist. Assoc. 106 (2011) 672-684], taking into account the fact that direct observations of the idiosyncratic components are unavailable.
Abstract: The variance–covariance matrix plays a central role in the inferential theories of high-dimensional factor models in finance and economics. Popular regularization methods that directly exploit sparsity are not directly applicable to many financial problems. Classical methods of estimating the covariance matrices are based on strict factor models, assuming independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming a sparse error covariance matrix, we allow for the presence of cross-sectional correlation even after taking out common factors, which enables us to combine the merits of both methods. We estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu [J. Amer. Statist. Assoc. 106 (2011) 672–684], taking into account the fact that direct observations of the idiosyncratic components are unavailable. The impact of high dimensionality on the covariance matrix estimation based on the factor structure is then studied.
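A rough principal-components sketch of this idea (common factors removed, then the residual covariance sparsified by thresholding): the constant hard threshold used below is a stand-in for the entry-adaptive thresholding of Cai and Liu that the paper actually employs, and all dimensions and levels are illustrative.

```python
import numpy as np

def factor_cov(R, K, thresh):
    """Factor-based covariance estimate: K principal components plus a
    thresholded residual (idiosyncratic) covariance."""
    T, p = R.shape
    R = R - R.mean(axis=0)
    S = R.T @ R / T                                   # sample covariance
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1][:K]                  # K leading eigenpairs
    low_rank = (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T
    resid = S - low_rank                              # estimated idiosyncratic covariance
    off = resid - np.diag(np.diag(resid))
    off[np.abs(off) < thresh] = 0.0                   # hard-threshold off-diagonal entries
    return low_rank + np.diag(np.diag(resid)) + off

# toy factor model: 2 factors, 30 series, 500 time points
rng = np.random.default_rng(0)
T, p, K = 500, 30, 2
B = rng.normal(size=(p, K))
F = rng.normal(size=(T, K))
R = F @ B.T + rng.normal(0, 1, size=(T, p))
Sigma_hat = factor_cov(R, K=2, thresh=0.1)
```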

Journal ArticleDOI
TL;DR: In this paper, the authors consider a general, not necessarily linear, regression with Gaussian noise and study a related question, that is, to find a linear combination of approximating functions which is at the same time sparse and has small mean squared error (MSE).
Abstract: In high-dimensional linear regression, the goal pursued here is to estimate an unknown regression function using linear combinations of a suitable set of covariates. One of the key assumptions for the success of any statistical procedure in this setup is to assume that the linear combination is sparse in some sense, for example, that it involves only few covariates. We consider a general, not necessarily linear, regression with Gaussian noise and study a related question, that is, to find a linear combination of approximating functions, which is at the same time sparse and has small mean squared error (MSE). We introduce a new estimation procedure, called Exponential Screening, that shows remarkable adaptation properties. It adapts to the linear combination that optimally balances MSE and sparsity, whether the latter is measured in terms of the number of nonzero entries in the combination (l0 norm) or in terms of the global weight of the combination (l1 norm). The power of this adaptation result is illustrated by showing that Exponential Screening solves optimally and simultaneously all the problems of aggregation in Gaussian regression that have been discussed in the literature. Moreover, we show that the performance of the Exponential Screening estimator cannot be improved in a minimax sense, even if the optimal sparsity is known in advance. The theoretical and numerical superiority of Exponential Screening compared to state-of-the-art sparse procedures is also discussed.

Journal ArticleDOI
TL;DR: The Rank Selection Criterion (RSC) estimator introduced in this paper minimizes the Frobenius norm of the fit plus a regularization term proportional to the number of parameters in the reduced rank model.
Abstract: We introduce a new criterion, the Rank Selection Criterion (RSC), for selecting the optimal reduced rank estimator of the coefficient matrix in multivariate response regression models. The corresponding RSC estimator minimizes the Frobenius norm of the fit plus a regularization term proportional to the number of parameters in the reduced rank model. The rank of the RSC estimator provides a consistent estimator of the rank of the coefficient matrix; in general, the rank of our estimator is a consistent estimate of the effective rank, which we define to be the number of singular values of the target matrix that are appropriately large. The consistency results are valid not only in the classic asymptotic regime, when n, the number of responses, and p, the number of predictors, stay bounded, and m, the number of observations, grows, but also when either, or both, n and p grow, possibly much faster than m. We establish minimax optimal bounds on the mean squared errors of our estimators. Our finite sample performance bounds for the RSC estimator show that it achieves the optimal balance between the approximation error and the penalty term. Furthermore, our procedure has very low computational complexity, linear in the number of candidate models, making it particularly appealing for large scale problems. We contrast our estimator with the nuclear norm penalized least squares (NNP) estimator, which has an inherently higher computational complexity than RSC, for multivariate regression models. We show that NNP has estimation properties similar to those of RSC, albeit under stronger conditions. However, it is not as parsimonious as RSC. We offer a simple correction of the NNP estimator which leads to consistent rank estimation. We verify and illustrate our theoretical findings via an extensive simulation study.
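A compact sketch of the selection rule this abstract describes, in the abstract's notation (m observations, p predictors, n responses): truncate the SVD of the least-squares fit at each candidate rank and keep the rank minimizing the residual Frobenius norm squared plus a penalty proportional to the parameter count of the reduced rank model. The penalty level below is only of the right rough order, not the calibrated constant from the paper.

```python
import numpy as np

def rank_selection(X, Y, mu):
    """Pick the rank r minimizing ||Y - X B_r||_F^2 + mu * r, where B_r is the
    rank-r truncation of the least-squares coefficient matrix."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    fit = X @ B_ols
    U, s, Vt = np.linalg.svd(fit, full_matrices=False)
    best_r, best_crit = 0, np.inf
    for r in range(len(s) + 1):
        fit_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
        crit = np.linalg.norm(Y - fit_r) ** 2 + mu * r
        if crit < best_crit:
            best_r, best_crit = r, crit
    return best_r

# toy multivariate regression with a rank-2 coefficient matrix
rng = np.random.default_rng(0)
m, p, n, r = 200, 10, 8, 2
B0 = rng.normal(size=(p, r)) @ rng.normal(size=(r, n))
X = rng.normal(size=(m, p))
Y = X @ B0 + rng.normal(0, 1, size=(m, n))
r_hat = rank_selection(X, Y, mu=2 * (p + n))   # penalty ~ parameter count per unit rank (illustrative)
```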

Journal ArticleDOI
TL;DR: In this paper, a sparse linear discriminant analysis (LDA) is proposed that is asymptotically optimal under some sparsity conditions when the number of variables is much larger than the training sample size, and it is illustrated by classifying human cancer into two classes of leukemia based on a set of 7,129 genes and a training sample of size 72.
Abstract: In many social, economical, biological and medical studies, one objective is to classify a subject into one of several classes based on a set of variables observed from the subject. Because the probability distribution of the variables is usually unknown, the rule of classification is constructed using a training sample. The well-known linear discriminant analysis (LDA) works well for the situation where the number of variables used for classification is much smaller than the training sample size. Because of the advance in technologies, modern statistical studies often face classification problems with the number of variables much larger than the sample size, and the LDA may perform poorly. We explore when and why the LDA has poor performance and propose a sparse LDA that is asymptotically optimal under some sparsity conditions on the unknown parameters. For illustration of application, we discuss an example of classifying human cancer into two classes of leukemia based on a set of 7,129 genes and a training sample of size 72. A simulation is also conducted to check the performance of the proposed method.

Journal ArticleDOI
TL;DR: In this article, the authors show that under moderate sparsity levels, that is, 0 ≤ α ≤ 1/2, the analysis of variance (ANOVA) is essentially optimal under some conditions on the design.
Abstract: Testing for the significance of a subset of regression coefficients in a linear model, a staple of statistical analysis, goes back at least to the work of Fisher who introduced the analysis of variance (ANOVA). We study this problem under the assumption that the coefficient vector is sparse, a common situation in modern high-dimensional settings. Suppose we have p covariates and that under the alternative, the response only depends upon the order of p^(1−α) of those, 0 ≤ α ≤ 1. Under moderate sparsity levels, that is, 0 ≤ α ≤ 1/2, we show that ANOVA is essentially optimal under some conditions on the design. This is no longer the case under strong sparsity constraints, that is, α > 1/2. In such settings, a multiple comparison procedure is often preferred and we establish its optimality when α ≥ 3/4. However, these two very popular methods are suboptimal, and sometimes powerless, under moderately strong sparsity where 1/2 < α < 3/4. We suggest a method based on the higher criticism that is powerful in the whole range α > 1/2. This optimality property is true for a variety of designs, including the classical (balanced) multi-way designs and more modern “p > n” designs arising in genetics and signal processing. In addition to the standard fixed effects model, we establish similar results for a random effects model where the nonzero coefficients of the regression vector are normally distributed.

Journal ArticleDOI
TL;DR: In this paper, a general method of moments approach is proposed to fit a large class of probability models through empirical counts of certain patterns in a graph; general asymptotic properties of the empirical graph moments are established and consistency of the estimates is proved as the graph size grows.
Abstract: Probability models on graphs are becoming increasingly important in many applications, but statistical tools for fitting such models are not yet well developed. Here we propose a general method of moments approach that can be used to fit a large class of probability models through empirical counts of certain patterns in a graph. We establish some general asymptotic properties of empirical graph moments and prove consistency of the estimates as the graph size grows for all ranges of the average degree including Ω(1). Additional results are obtained for the important special case of degree distributions.
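For instance, the simplest empirical graph moments (edge, two-star and triangle counts) can be read off the adjacency matrix as below; a method-of-moments fit would then match such counts to their expectations under the model. This is a generic illustration, not code from the paper.

```python
import numpy as np

def graph_moments(A):
    """Empirical counts of small patterns in an undirected simple graph:
    edges, two-stars (paths of length 2) and triangles."""
    deg = A.sum(axis=1)
    edges = A.sum() / 2
    two_stars = (deg * (deg - 1)).sum() / 2
    triangles = np.trace(np.linalg.matrix_power(A, 3)) / 6
    return edges, two_stars, triangles

# toy Erdos-Renyi graph (illustrative size and density)
rng = np.random.default_rng(0)
n, prob = 200, 0.05
A = rng.binomial(1, prob, size=(n, n))
A = np.triu(A, 1); A = A + A.T          # symmetric, no self-loops
print(graph_moments(A))
```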

Journal ArticleDOI
TL;DR: In this paper, the consistency of the least squares estimator of a convex regression function when the predictor is multidimensional is investigated; the characterization, computation and consistency of the least squares estimator are also studied when the regression function is known, in addition, to be componentwise nonincreasing.
Abstract: This paper deals with the consistency of the least squares estimator of a convex regression function when the predictor is multidimensional. We characterize and discuss the computation of such an estimator via the solution of certain quadratic and linear programs. Mild sufficient conditions for the consistency of this estimator and its subdifferentials in fixed and stochastic design regression settings are provided. We also consider a regression function which is known to be convex and componentwise nonincreasing and discuss the characterization, computation and consistency of its least squares estimator.
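The quadratic program alluded to above has a standard form: fitted values and subgradients at the design points, subject to the supporting-hyperplane (convexity) constraints. A hedged cvxpy sketch on simulated two-dimensional data, with illustrative sizes:

```python
import numpy as np
import cvxpy as cp

# toy data from a convex function of a 2-d predictor
rng = np.random.default_rng(0)
n, d = 40, 2
X = rng.uniform(-1, 1, size=(n, d))
y = np.sum(X ** 2, axis=1) + rng.normal(0, 0.1, n)

g = cp.Variable(n)          # fitted values at the design points
xi = cp.Variable((n, d))    # subgradients at the design points
constraints = [g[j] >= g[i] + xi[i, :] @ (X[j] - X[i])
               for i in range(n) for j in range(n) if i != j]
prob = cp.Problem(cp.Minimize(cp.sum_squares(y - g)), constraints)
prob.solve()
g_hat = g.value             # least squares convex regression fit at the design points
```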

Journal ArticleDOI
TL;DR: In this paper, a two-stage estimation procedure is proposed to estimate the link function for the single index and the parameters in the single index, as well as the parameters in the linear component of the model, and asymptotic normality is established for both parametric components.
Abstract: In this paper, we study the estimation for a partial-linear single-index model. A two-stage estimation procedure is proposed to estimate the link function for the single index and the parameters in the single index, as well as the parameters in the linear component of the model. Asymptotic normality is established for both parametric components. For the index, a constrained estimating equation leads to an asymptotically more efficient estimator than existing estimators in the sense that it is of a smaller limiting variance. The estimator of the nonparametric link function achieves optimal convergence rates, and the structural error variance is obtained. In addition, the results facilitate the construction of confidence regions and hypothesis testing for the unknown parameters. A simulation study is performed and an application to a real dataset is illustrated. The extension to multiple indices is briefly sketched.

Journal ArticleDOI
TL;DR: The proposed selection rule leads to an estimator that is minimax adaptive over a scale of the anisotropic Nikol'skii classes; the main technical tools used in the derivations are uniform bounds on the L_s-norms of empirical processes developed recently by Goldenshluger and Lepski.
Abstract: We address the problem of density estimation with $\mathbb{L}_{s}$-loss by selection of kernel estimators. We develop a selection procedure and derive corresponding $\mathbb{L}_{s}$-risk oracle inequalities. It is shown that the proposed selection rule leads to the estimator being minimax adaptive over a scale of the anisotropic Nikol’skii classes. The main technical tools used in our derivations are uniform bounds on the $\mathbb{L}_{s}$-norms of empirical processes developed recently by Goldenshluger and Lepski [Ann. Probab. (2011), to appear].

Journal ArticleDOI
TL;DR: In this paper, the limiting laws of the coherence of an n × p random matrix in the high-dimensional setting where p can be much larger than n were derived and applied to the construction of compressed sensing matrices.
Abstract: Testing covariance structure is of significant interest in many areas of statistical analysis and construction of compressed sensing matrices is an important problem in signal processing. Motivated by these applications, we study in this paper the limiting laws of the coherence of an n × p random matrix in the high-dimensional setting where p can be much larger than n. Both the law of large numbers and the limiting distribution are derived. We then consider testing the bandedness of the covariance matrix of a high-dimensional Gaussian distribution which includes testing for independence as a special case. The limiting laws of the coherence of the data matrix play a critical role in the construction of the test. We also apply the asymptotic results to the construction of compressed sensing matrices.
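The coherence in question is simply the largest absolute sample correlation between distinct columns of the data matrix; a small numerical sketch with illustrative sizes (the comment states only the rough order of the limit, not the exact limiting law from the paper):

```python
import numpy as np

def coherence(X):
    """Largest absolute sample correlation between distinct columns of X."""
    Xc = X - X.mean(axis=0)
    Xn = Xc / np.linalg.norm(Xc, axis=0)
    C = Xn.T @ Xn
    np.fill_diagonal(C, 0.0)
    return np.abs(C).max()

# n x p Gaussian matrix with p much larger than n
rng = np.random.default_rng(0)
n, p = 50, 2000
X = rng.normal(size=(n, p))
print(coherence(X))    # of order sqrt(log(p) / n) in this high-dimensional regime
```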

Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of detecting whether or not in a given sensor network, there is a cluster of sensors which exhibit an "unusual behavior." Formally, suppose we are given a set of nodes and attach a random variable to each node.
Abstract: We consider the problem of detecting whether or not in a given sensor network, there is a cluster of sensors which exhibit an "unusual behavior." Formally, suppose we are given a set of nodes and attach a random variable to each node. We observe a realization of this process and want to decide between the following two hypotheses: under the null, the variables are i.i.d. standard normal; under the alternative, there is a cluster of variables that are i.i.d. normal with positive mean and unit variance, while the rest are i.i.d. standard normal. We also address surveillance settings where each sensor in the network collects information over time. The resulting model is similar, now with a time series attached to each node. We again observe the process over time and want to decide between the null, where all the variables are i.i.d. standard normal; and the alternative, where there is an emerging cluster of i.i.d. normal variables with positive mean and unit variance. The growth models used to represent the emerging cluster are quite general, and in particular include cellular automata used in modelling epidemics. In both settings, we consider classes of clusters that are quite general, for which we obtain a lower bound on their respective minimax detection rate, and show that some form of scan statistic, by far the most popular method in practice, achieves that same rate within a logarithmic factor. Our results are not limited to the normal location model, but generalize to any one-parameter exponential family when the anomalous clusters are large enough.

Journal ArticleDOI
TL;DR: In this article, the posterior distribution in a nonparametric inverse problem is shown to contract to the true parameter at a rate that depends on the smoothness of the parameter, and the smoothing and scale of the prior.
Abstract: The posterior distribution in a nonparametric inverse problem is shown to contract to the true parameter at a rate that depends on the smoothness of the parameter, and the smoothness and scale of the prior. Correct combinations of these characteristics lead to the minimax rate. The frequentist coverage of credible sets is shown to depend on the combination of prior and true parameter, with smoother priors leading to zero coverage and rougher priors to conservative coverage. In the latter case credible sets are of the correct order of magnitude. The results are numerically illustrated by the problem of recovering a function from observation of a noisy version of its primitive.

Journal ArticleDOI
TL;DR: In this paper, a new technique for estimating the link function nonparametrically is introduced, and an approach to multi-index modeling using adaptively defined linear projections of functional data is suggested.
Abstract: Fully nonparametric methods for regression from functional data have poor accuracy from a statistical viewpoint, reflecting the fact that their convergence rates are slower than nonparametric rates for the estimation of high-dimensional functions. This difficulty has led to an emphasis on the so-called functional linear model, which is much more flexible than common linear models in finite dimension, but nevertheless imposes structural constraints on the relationship between predictors and responses. Recent advances have extended the linear approach by using it in conjunction with link functions, and by considering multiple indices, but the flexibility of this technique is still limited. For example, the link may be modeled parametrically or on a grid only, or may be constrained by an assumption such as monotonicity; multiple indices have been modeled by making finite-dimensional assumptions. In this paper we introduce a new technique for estimating the link function nonparametrically, and we suggest an approach to multi-index modeling using adaptively defined linear projections of functional data. We show that our methods enable prediction with polynomial convergence rates. The finite sample performance of our methods is studied in simulations, and is illustrated by an application to a functional regression problem.

Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of robustly predicting as well as the best linear combination of d given functions in least squares regression, and variants of this problem including constraints on the parameters of the linear combination.
Abstract: We consider the problem of robustly predicting as well as the best linear combination of d given functions in least squares regression, and variants of this problem including constraints on the parameters of the linear combination. For the ridge estimator and the ordinary least squares estimator, and their variants, we provide new risk bounds of order d/n without logarithmic factor unlike some standard results, where n is the size of the training data. We also provide a new estimator with better deviations in presence of heavy-tailed noise. It is based on truncating differences of losses in a min-max framework and satisfies a d/n risk bound both in expectation and in deviations. The key common surprising factor of these results is the absence of exponential moment condition on the output distribution while achieving exponential deviations. All risk bounds are obtained through a PAC-Bayesian analysis on truncated differences of losses. Experimental results strongly back up our truncated min-max estimator.

Journal ArticleDOI
TL;DR: In this article, a fixed-point iterative scheme for computing this estimator is proposed, which only involves one-dimensional nonparametric smoothers, thereby avoiding the data sparsity problem caused by high model dimensionality.
Abstract: Single-index models are natural extensions of linear models and circumvent the so-called curse of dimensionality. They are becoming increasingly popular in many scientific fields including biostatistics, medicine, economics and financial econometrics. Estimating and testing the model index coefficients β is one of the most important objectives in the statistical analysis. However, the commonly used assumption on the index coefficients, ‖β‖ = 1, represents a nonregular problem: the true index is on the boundary of the unit ball. In this paper we introduce the EFM approach, a method of estimating functions, to study the single-index model. The procedure is to first relax the equality constraint to one with (d − 1) components of β lying in an open unit ball, and then to construct the associated (d − 1) estimating functions by projecting the score function to the linear space spanned by the residuals with the unknown link being estimated by kernel estimating functions. The root-n consistency and asymptotic normality for the estimator obtained from solving the resulting estimating equations are achieved, and a Wilks type theorem for testing the index is demonstrated. A noticeable result we obtain is that our estimator for β has smaller or equal limiting variance than the estimator of Carroll et al. [J. Amer. Statist. Assoc. 92 (1997) 447–489]. A fixed-point iterative scheme for computing this estimator is proposed. This algorithm only involves one-dimensional nonparametric smoothers, thereby avoiding the data sparsity problem caused by high model dimensionality. Numerical studies based on simulation and on applications suggest that this new estimating system is quite powerful and easy to implement.

ComponentDOI
TL;DR: A class of variable selection procedures for the linear parameters is developed by employing a nonconcave penalized quasi-likelihood, which is shown to have an asymptotic oracle property; the spline-based estimation procedure avoids solving large systems of equations and thus results in gains in computational simplicity.
Abstract: We study generalized additive partial linear models, proposing the use of polynomial spline smoothing for estimation of nonparametric functions, and deriving quasi-likelihood based estimators for the linear parameters. We establish asymptotic normality for the estimators of the parametric components. The procedure avoids solving large systems of equations as in kernel-based procedures and thus results in gains in computational simplicity. We further develop a class of variable selection procedures for the linear parameters by employing a nonconcave penalized quasi-likelihood, which is shown to have an asymptotic oracle property. Monte Carlo simulations and an empirical example are presented for illustration.

Journal ArticleDOI
TL;DR: In this paper, a focused information criterion (FIC) and a frequentist model average (FMA) estimator were proposed for generalized additive partial linear models (GAPLMs).
Abstract: We study model selection and model averaging in generalized additive partial linear models (GAPLMs). Polynomial spline is used to approximate nonparametric functions. The corresponding estimators of the linear parameters are shown to be asymptotically normal. We then develop a focused information criterion (FIC) and a frequentist model average (FMA) estimator on the basis of the quasi-likelihood principle and examine theoretical properties of the FIC and FMA. The major advantages of the proposed procedures over the existing ones are their computational expediency and theoretical reliability. Simulation experiments have provided evidence of the superiority of the proposed procedures. The approach is further applied to a real-world data example.

Journal ArticleDOI
TL;DR: In this article, the authors studied the general problem of model selection for active learning with a nested hierarchy of hypothesis classes and proposed an algorithm whose error rate provably converges to the best achievable error among classifiers in the hierarchy at a rate adaptive to both the complexity of the optimal classifier and the noise conditions.
Abstract: We study the rates of convergence in generalization error achievable by active learning under various types of label noise. Additionally, we study the general problem of model selection for active learning with a nested hierarchy of hypothesis classes and propose an algorithm whose error rate provably converges to the best achievable error among classifiers in the hierarchy at a rate adaptive to both the complexity of the optimal classifier and the noise conditions. In particular, we state sufficient conditions for these rates to be dramatically faster than those achievable by passive learning.

Journal ArticleDOI
TL;DR: In this article, the authors explore the limits of the autoregressive (AR) sieve bootstrap, and show that its applicability extends well beyond the realm of linear time series as has been previously thought.
Abstract: We explore the limits of the autoregressive (AR) sieve bootstrap, and show that its applicability extends well beyond the realm of linear time series as has been previously thought. In particular, for appropriate statistics, the AR-sieve bootstrap is valid for stationary processes possessing a general Wold-type autoregressive representation with respect to a white noise; in essence, this includes all stationary, purely nondeterministic processes, whose spectral density is everywhere positive. Our main theorem provides a simple and effective tool in assessing whether the AR-sieve bootstrap is asymptotically valid in any given situation. In effect, the large-sample distribution of the statistic in question must only depend on the first and second order moments of the process; prominent examples include the sample mean and the spectral density. As a counterexample, we show how the AR-sieve bootstrap is not always valid for the sample autocovariance even when the underlying process is linear.
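A bare-bones numerical sketch of the AR-sieve bootstrap for the sample mean (a statistic for which, per the abstract, the method is valid): fit an AR(p) by least squares, resample centered residuals, regenerate series, and recompute the mean. The order p, simulated series and replication count are illustrative; a real application would choose p by an order-selection rule such as AIC.

```python
import numpy as np

def ar_sieve_bootstrap_mean(x, p, B=200, seed=0):
    """AR-sieve bootstrap distribution of the sample mean of x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    xc = x - x.mean()
    # least-squares AR(p) fit on the centered series: column k holds lag k+1
    Z = np.column_stack([xc[p - k - 1:n - k - 1] for k in range(p)])
    phi, *_ = np.linalg.lstsq(Z, xc[p:], rcond=None)
    resid = xc[p:] - Z @ phi
    resid -= resid.mean()
    means = np.empty(B)
    for b in range(B):
        eps = rng.choice(resid, size=n + p, replace=True)   # i.i.d. resampled residuals
        sim = np.zeros(n + p)
        for t in range(p, n + p):
            sim[t] = phi @ sim[t - p:t][::-1] + eps[t]      # lags sim[t-1], ..., sim[t-p]
        means[b] = x.mean() + sim[p:].mean()                # re-centre around the observed mean
    return means

# toy stationary, nonlinear series; bootstrap the sample mean
rng = np.random.default_rng(1)
n = 400
e = rng.normal(size=n + 1)
x = 5.0 + e[1:] + 0.5 * e[:-1] ** 2
boot_means = ar_sieve_bootstrap_mean(x, p=5)
ci = np.percentile(boot_means, [2.5, 97.5])
```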

Journal ArticleDOI
TL;DR: In this paper, a Gaussian prior is devised for Gaussian nonparametric regression and the posterior contracts at the optimal rate in all Lr-norms, 1 ≤ r ≤ ∞, of the unknown parameter.
Abstract: The frequentist behavior of nonparametric Bayes estimates, more specifically, rates of contraction of the posterior distributions to shrinking Lr-norm neighborhoods, 1 ≤ r ≤ ∞, of the unknown parameter, are studied. A theorem for nonparametric density estimation is proved under general approximation-theoretic assumptions on the prior. The result is applied to a variety of common examples, including Gaussian process, wavelet series, normal mixture and histogram priors. The rates of contraction are minimax-optimal for 1 ≤ r ≤ 2, but deteriorate as r increases beyond 2. In the case of Gaussian nonparametric regression a Gaussian prior is devised for which the posterior contracts at the optimal rate in all Lr-norms, 1 ≤ r ≤ ∞.