Showing papers in "arXiv: Statistics Theory in 2009"

Journal Article•DOI•

A survey of cross-validation procedures for model selection

Sylvain Arlot¹, Alain Celisse•Institutions (1)

27 Jul 2009-arXiv: Statistics Theory

TL;DR: In this paper, a survey on the model selection performances of cross-validation procedures is presented, with a particular emphasis on distinguishing empirical statements from rigorous theoretical results, and guidelines are provided for choosing the best cross-validation procedure according to the particular features of the problem in hand.

Abstract: Used to estimate the risk of an estimator or to perform model selection, cross-validation is a widespread strategy because of its simplicity and its apparent universality. Many results exist on the model selection performances of cross-validation procedures. This survey intends to relate these results to the most recent advances of model selection theory, with a particular emphasis on distinguishing empirical statements from rigorous theoretical results. As a conclusion, guidelines are provided for choosing the best cross-validation procedure according to the particular features of the problem in hand.
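As an illustration of cross-validation used for model selection, here is a minimal sketch of V-fold cross-validation choosing between two candidate regression models. The candidate models, the fold assignment (index modulo V), the simulated data, and all function names are assumptions made for this example; none of them come from the survey.

```python
import random


def fit_constant(xs, ys):
    """Fit the constant model: predict the training mean everywhere."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Fit a straight line by ordinary least squares (closed form)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: mean_y + slope * (x - mean_x)


def v_fold_cv_risk(fit, xs, ys, v=5):
    """Estimate the squared-error risk of `fit` by V-fold cross-validation."""
    n = len(xs)
    total = 0.0
    for fold in range(v):
        # Hypothetical fold assignment: observation i belongs to fold i % v.
        train = [i for i in range(n) if i % v != fold]
        held_out = [i for i in range(n) if i % v == fold]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in held_out)
    return total / n


def select_model(candidates, xs, ys, v=5):
    """Return the candidate name with the smallest estimated risk, plus all risks."""
    risks = {name: v_fold_cv_risk(fit, xs, ys, v) for name, fit in candidates.items()}
    return min(risks, key=risks.get), risks


def make_data(n=60, seed=0):
    """Simulated data with a clear linear trend (illustrative only)."""
    rng = random.Random(seed)
    xs = [i / n for i in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


xs, ys = make_data()
best, risks = select_model({"constant": fit_constant, "linear": fit_linear}, xs, ys)
print(best, risks)
```

The choice of V and of the splitting scheme is exactly the kind of decision the survey's guidelines address; the values used here are arbitrary.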

2,720 citations

Journal Article•DOI•

Covariance regularization by thresholding

Peter J. Bickel¹, Elizaveta Levina•Institutions (1)

University of California, Berkeley¹

20 Jan 2009-arXiv: Statistics Theory

TL;DR: In this article, the authors consider regularizing a covariance matrix of $p$ variables estimated from $n$ observations by hard thresholding, and show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in a suitable sense, the variables are Gaussian or sub-Gaussian, and $(\log p)/n\to0$.

Abstract: This paper considers regularizing a covariance matrix of $p$ variables estimated from $n$ observations, by hard thresholding. We show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in a suitable sense, the variables are Gaussian or sub-Gaussian, and $(\log p)/n\to0$, and obtain explicit rates. The results are uniform over families of covariance matrices which satisfy a fairly natural notion of sparsity. We discuss an intuitive resampling scheme for threshold selection and prove a general cross-validation result that justifies this approach. We also compare thresholding to other covariance estimators in simulations and on an example from climate data.
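Hard thresholding of a sample covariance matrix can be sketched as follows: every entry whose absolute value falls below a threshold is set to zero. This is a minimal illustration, not the authors' implementation. The threshold is taken as `M * sqrt(log(p) / n)` with a hand-picked constant `M`, which is an assumption made for the example; the abstract instead describes selecting the threshold by a resampling scheme. All function names are hypothetical.

```python
import math
import random


def sample_covariance(rows):
    """Sample covariance (divisor n) of a list of n observations of length p."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n for j in range(p)]
        for i in range(p)
    ]


def hard_threshold(matrix, t):
    """Keep entries with |value| >= t and set all other entries to zero."""
    return [[v if abs(v) >= t else 0.0 for v in row] for row in matrix]


def demo(n=200, p=5, M=4.0, seed=0):
    """Threshold the sample covariance of independent standard normal variables."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    t = M * math.sqrt(math.log(p) / n)  # illustrative threshold, M chosen by hand
    return hard_threshold(sample_covariance(rows), t)


for row in demo():
    print([round(v, 3) for v in row])
```

In this toy setting the true covariance is the identity, so a well-chosen threshold should remove the noisy off-diagonal entries while keeping the diagonal.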

971 citations

Journal Article•DOI•

The pseudo-marginal approach for efficient Monte Carlo computations

Christophe Andrieu, Gareth O. Roberts

31 Mar 2009-arXiv: Statistics Theory

TL;DR: A powerful and flexible MCMC algorithm for stochastic simulation that builds on a pseudo-marginal method, showing how algorithms which are approximations to an idealized marginal algorithm, can share the same marginal stationary distribution as the idealized method.

Abstract: We introduce a powerful and flexible MCMC algorithm for stochastic simulation. The method builds on a pseudo-marginal method originally introduced in [Genetics 164 (2003) 1139--1160], showing how algorithms which are approximations to an idealized marginal algorithm, can share the same marginal stationary distribution as the idealized method. Theoretical results are given describing the convergence properties of the proposed method, and simple numerical examples are given to illustrate the promising empirical characteristics of the technique. Interesting comparisons with a more obvious, but inexact, Monte Carlo approximation to the marginal algorithm, are also given.
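The pseudo-marginal idea can be illustrated with a toy Metropolis-Hastings sampler in which the target density is never evaluated exactly: each evaluation is replaced by a non-negative unbiased estimate, and the estimate attached to the current state is stored and reused rather than refreshed. The sketch below is an assumption-laden toy, not the paper's algorithm: the target (a standard normal), the multiplicative uniform noise, the random-walk proposal, and all names are chosen only for illustration.

```python
import math
import random


def target_density(x):
    """Unnormalized standard normal density (the 'idealized' marginal target)."""
    return math.exp(-0.5 * x * x)


def noisy_density(x, rng):
    """Non-negative unbiased estimate of the target: exact value times mean-one noise."""
    weight = rng.uniform(0.5, 1.5)  # E[weight] = 1, so the estimate is unbiased
    return target_density(x) * weight


def pseudo_marginal_mh(n_iter, step, rng):
    """Random-walk Metropolis-Hastings driven by noisy density estimates."""
    x = 0.0
    est_x = noisy_density(x, rng)  # stored estimate for the current state
    samples = []
    for _ in range(n_iter):
        y = x + rng.gauss(0.0, step)
        est_y = noisy_density(y, rng)
        # Accept with probability min(1, est_y / est_x); the proposal is symmetric.
        if rng.random() * est_x < est_y:
            x, est_x = y, est_y  # keep the estimate that earned the acceptance
        samples.append(x)
    return samples


samples = pseudo_marginal_mh(20000, 2.0, random.Random(1))
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

Because the stored estimate is reused until the next acceptance, the chain still targets the exact marginal distribution; recomputing the current-state estimate at every iteration would instead give the inexact Monte Carlo approximation that the abstract contrasts with.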

723 citations

Journal Article•DOI•

Multivariate Archimedean copulas, $d$-monotone functions and $\ell_1$-norm symmetric distributions

Alexander J. McNeil, Johanna Nešlehová

26 Aug 2009-arXiv: Statistics Theory

TL;DR: It is shown that a necessary and sufficient condition for an Archimedean copula generator to generate a $d$-dimensional copula is that the generator is a $d$-monotone function.

Abstract: It is shown that a necessary and sufficient condition for an Archimedean copula generator to generate a $d$-dimensional copula is that the generator is a $d$-monotone function. The class of $d$-dimensional Archimedean copulas is shown to coincide with the class of survival copulas of $d$-dimensional $\ell_1$-norm symmetric distributions that place no point mass at the origin. The $d$-monotone Archimedean copula generators may be characterized using a little-known integral transform of Williamson [Duke Math. J. 23 (1956) 189--207] in an analogous manner to the well-known Bernstein--Widder characterization of completely monotone generators in terms of the Laplace transform. These insights allow the construction of new Archimedean copula families and provide a general solution to the problem of sampling multivariate Archimedean copulas. They also yield useful expressions for the $d$-dimensional Kendall function and Kendall's rank correlation coefficients and facilitate the derivation of results on the existence of densities and the description of singular components for Archimedean copulas. The existence of a sharp lower bound for Archimedean copulas with respect to the positive lower orthant dependence ordering is shown.

617 citations

Journal Article•DOI•

The composite absolute penalties family for grouped and hierarchical variable selection

Peng Zhao, Guilherme Rocha, Bin Yu

02 Sep 2009-arXiv: Statistics Theory

TL;DR: CAP is shown to improve on the predictive performance of the LASSO in a series of simulated experiments, including cases with $p\gg n$ and possibly mis-specified groupings, and iCAP is seen to be parsimonious in the experiments.

Abstract: Extracting useful information from high-dimensional data is an important focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the virtues of both regularization and sparsity, the $L_1$-penalized squared error minimization method Lasso has been popular in regression models and beyond. In this paper, we combine different norms including $L_1$ to form an intelligent penalty in order to add side information to the fitting of a regression or classification model to obtain reasonable estimates. Specifically, we introduce the Composite Absolute Penalties (CAP) family, which allows given grouping and hierarchical relationships between the predictors to be expressed. CAP penalties are built by defining groups and combining the properties of norm penalties at the across-group and within-group levels. Grouped selection occurs for nonoverlapping groups. Hierarchical variable selection is reached by defining groups with particular overlapping patterns. We propose using the BLASSO and cross-validation to compute CAP estimates in general. For a subfamily of CAP estimates involving only the $L_1$ and $L_{\infty}$ norms, we introduce the iCAP algorithm to trace the entire regularization path for the grouped selection problem. Within this subfamily, unbiased estimates of the degrees of freedom (df) are derived so that the regularization parameter is selected without cross-validation. CAP is shown to improve on the predictive performance of the LASSO in a series of simulated experiments, including cases with $p\gg n$ and possibly mis-specified groupings. When the complexity of a model is properly calculated, iCAP is seen to be parsimonious in the experiments.
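To make the construction of a composite penalty concrete, the sketch below computes a penalty of the form $\sum_k \|\beta_{G_k}\|_{\gamma_k}^{\gamma_0}$: a within-group norm $\gamma_k$ is applied to each group $G_k$, and the group norms are combined across groups with exponent $\gamma_0$. This form is an assumption based on the abstract's description of combining norms at the across-group and within-group levels; the function names and the example groups are hypothetical.

```python
import math


def group_norm(values, gamma):
    """L_gamma norm of a list of coefficients; gamma may be math.inf."""
    if math.isinf(gamma):
        return max(abs(v) for v in values)
    return sum(abs(v) ** gamma for v in values) ** (1.0 / gamma)


def cap_penalty(beta, groups, gamma0, gammas):
    """Sum over groups of (within-group norm) ** gamma0 (assumed CAP-style form)."""
    return sum(
        group_norm([beta[j] for j in group], gamma_k) ** gamma0
        for group, gamma_k in zip(groups, gammas)
    )


beta = [3.0, -4.0, 0.0, 1.0]
groups = [[0, 1], [2, 3]]  # nonoverlapping groups, as in grouped selection

# L1 across groups, L2 within groups.
print(cap_penalty(beta, groups, 1.0, [2.0, 2.0]))

# L1 across groups, L-infinity within groups (the norms of the iCAP subfamily).
print(cap_penalty(beta, groups, 1.0, [math.inf, math.inf]))
```

Overlapping index lists can be passed as `groups` in the same way, which is how the abstract describes encoding hierarchical relationships.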

592 citations

Journal Article•DOI•

On the conditions used to prove oracle results for the Lasso

Sara van de Geer, Peter Bühlmann

05 Oct 2009-arXiv: Statistics Theory

TL;DR: The restricted eigenvalue condition (Bickel et al., 2009) or the slightly weaker compatibility condition (van de Geer, 2007) are sufficient for oracle results.

Abstract: Oracle inequalities and variable selection properties for the Lasso in linear models have been established under a variety of different assumptions on the design matrix. We show in this paper how the different conditions and concepts relate to each other. The restricted eigenvalue condition (Bickel et al., 2009) or the slightly weaker compatibility condition (van de Geer, 2007) are sufficient for oracle results. We argue that both these conditions allow for a fairly general class of design matrices. Hence, optimality of the Lasso for prediction and estimation holds for more general situations than what it appears from coherence (Bunea et al., 2007b,c) or restricted isometry (Candes and Tao, 2005) assumptions.

591 citations

Posted Content•

Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls

Garvesh Raskutti¹, Martin J. Wainwright¹, Bin Yu¹•Institutions (1)

University of California, Berkeley¹

11 Oct 2009-arXiv: Statistics Theory

TL;DR: The results show that although computationally efficient $\ell_1$-relaxation methods can achieve the minimax rates up to constant factors, they require slightly stronger assumptions on the design matrix $X$ than algorithms involving least-squares over the $\ell_0$-ball.

Abstract: Consider the standard linear regression model $y = X\beta^* + w$, where $y \in \mathbb{R}^n$ is an observation vector, $X \in \mathbb{R}^{n \times p}$ is a design matrix, $\beta^* \in \mathbb{R}^p$ is the unknown regression vector, and $w \sim \mathcal{N}(0, \sigma^2 I)$ is additive Gaussian noise. This paper studies the minimax rates of convergence for estimation of $\beta^*$ for $\ell_r$-losses and in the $\ell_2$-prediction loss, assuming that $\beta^*$ belongs to an $\ell_q$-ball $\mathbb{B}_q(R_q)$ for some $q \in [0,1]$. We show that under suitable regularity conditions on the design matrix $X$, the minimax error in $\ell_2$-loss and $\ell_2$-prediction loss scales as $R_q \big(\frac{\log p}{n}\big)^{1-\frac{q}{2}}$. In addition, we provide lower bounds on minimax risks in $\ell_r$-norms, for all $r \in [1, +\infty]$, $r \neq q$. Our proofs of the lower bounds are information-theoretic in nature, based on Fano's inequality and results on the metric entropy of the balls $\mathbb{B}_q(R_q)$, whereas our proofs of the upper bounds are direct and constructive, involving direct analysis of least-squares over $\ell_q$-balls. For the special case $q = 0$, a comparison with $\ell_2$-risks achieved by computationally efficient $\ell_1$-relaxations reveals that although such methods can achieve the minimax rates up to constant factors, they require slightly stronger assumptions on the design matrix $X$ than algorithms involving least-squares over the $\ell_0$-ball.

423 citations

Journal Article•DOI•

Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing

David L. Donoho¹, Jared Tanner²•Institutions (2)

Stanford University¹, University of Edinburgh²

14 Jun 2009-arXiv: Statistics Theory

TL;DR: An extensive computational experiment and formal inferential analysis are conducted to test the hypothesis that phase transitions occurring in modern high-dimensional data analysis and signal processing are universal across a range of underlying matrix ensembles; the results are consistent with asymptotic large-$n$ universality, while finite-sample universality can be rejected.

Abstract: We review connections between phase transitions in high-dimensional combinatorial geometry and phase transitions occurring in modern high-dimensional data analysis and signal processing. In data analysis, such transitions arise as abrupt breakdown of linear model selection, robust data fitting or compressed sensing reconstructions, when the complexity of the model or the number of outliers increases beyond a threshold. In combinatorial geometry these transitions appear as abrupt changes in the properties of face counts of convex polytopes when the dimensions are varied. The thresholds in these very different problems appear in the same critical locations after appropriate calibration of variables. These thresholds are important in each subject area: for linear modelling, they place hard limits on the degree to which the now-ubiquitous high-throughput data analysis can be successful; for robustness, they place hard limits on the degree to which standard robust fitting methods can tolerate outliers before breaking down; for compressed sensing, they define the sharp boundary of the undersampling/sparsity tradeoff in undersampling theorems. Existing derivations of phase transitions in combinatorial geometry assume the underlying matrices have independent and identically distributed (iid) Gaussian elements. In applications, however, it often seems that Gaussianity is not required. We conducted an extensive computational experiment and formal inferential analysis to test the hypothesis that these phase transitions are universal across a range of underlying matrix ensembles. The experimental results are consistent with an asymptotic large-$n$ universality across matrix ensembles; finite-sample universality can be rejected.

371 citations

Journal Article•DOI•

The formal definition of reference priors

James O. Berger, José M. Bernardo, Dongchu Sun

01 Apr 2009-arXiv: Statistics Theory

TL;DR: It is shown how an explicit expression for the reference prior can be obtained under very weak regularity conditions and used to derive new reference priors both analytically and numerically.

Abstract: Reference analysis produces objective Bayesian inference, in the sense that inferential statements depend only on the assumed model and the available data, and the prior distribution used to make an inference is least informative in a certain information-theoretic sense. Reference priors have been rigorously defined in specific contexts and heuristically defined in general, but a rigorous general definition has been lacking. We produce a rigorous general definition here and then show how an explicit expression for the reference prior can be obtained under very weak regularity conditions. The explicit expression can be used to derive new reference priors both analytically and numerically.

344 citations

Journal Article•DOI•

Operator norm consistent estimation of large-dimensional sparse covariance matrices

Noureddine El Karoui

21 Jan 2009-arXiv: Statistics Theory

TL;DR: The estimator is shown to be consistent in operator norm when, for instance, $p\asymp n$ as $n\to\infty$.

Abstract: Estimating covariance matrices is a problem of fundamental importance in multivariate statistics. In practice it is increasingly frequent to work with data matrices $X$ of dimension $n\times p$, where $p$ and $n$ are both large. Results from random matrix theory show very clearly that in this setting, standard estimators like the sample covariance matrix perform in general very poorly. In this "large $n$, large $p$" setting, it is sometimes the case that practitioners are willing to assume that many elements of the population covariance matrix are equal to 0, and hence this matrix is sparse. We develop an estimator to handle this situation. The estimator is shown to be consistent in operator norm, when, for instance, we have $p\asymp n$ as $n\to\infty$. In other words the largest singular value of the difference between the estimator and the population covariance matrix goes to zero. This implies consistency of all the eigenvalues and consistency of eigenspaces associated to isolated eigenvalues. We also propose a notion of sparsity for matrices that is "compatible" with spectral analysis and is independent of the ordering of the variables.

333 citations

Journal Article•DOI•

Estimation of high-dimensional low-rank matrices

Angelika Rohde, Alexandre B. Tsybakov

29 Dec 2009-arXiv: Statistics Theory

TL;DR: In this paper, the authors consider the high-dimensional setting where the number of unknown entries can be much larger than the sample size, and derive nonasymptotic upper bounds on the prediction risk and on the Schatten-$q$ risk of the estimators.

Abstract: Suppose that we observe entries or, more generally, linear combinations of entries of an unknown $m\times T$-matrix $A$ corrupted by noise. We are particularly interested in the high-dimensional setting where the number $mT$ of unknown entries can be much larger than the sample size $N$. Motivated by several applications, we consider estimation of matrix $A$ under the assumption that it has small rank. This can be viewed as dimension reduction or sparsity assumption. In order to shrink toward a low-rank representation, we investigate penalized least squares estimators with a Schatten-$p$ quasi-norm penalty term, $p\leq1$. We study these estimators under two possible assumptions---a modified version of the restricted isometry condition and a uniform bound on the ratio "empirical norm induced by the sampling operator/Frobenius norm." The main results are stated as nonasymptotic upper bounds on the prediction risk and on the Schatten-$q$ risk of the estimators, where $q\in[p,2]$. The rates that we obtain for the prediction risk are of the form $rm/N$ (for $m=T$), up to logarithmic factors, where $r$ is the rank of $A$. The particular examples of multi-task learning and matrix completion are worked out in detail. The proofs are based on tools from the theory of empirical processes. As a by-product, we derive bounds for the $k$th entropy numbers of the quasi-convex Schatten class embeddings $S_p^M\hookrightarrow S_2^M$, $p<1$, which are of independent interest.

Journal Article•DOI•

Intersection Bounds: Estimation and Inference

Victor Chernozhukov, Sokbae Lee, Adam M. Rosen

20 Jul 2009-arXiv: Statistics Theory

TL;DR: A practical and novel method for inference on intersection bounds, namely bounds defined by either the infimum or supremum of a parametric or nonparametric function, or equivalently, the value of a linear programming problem with a potentially infinite constraint set is developed.

Abstract: We develop a practical and novel method for inference on intersection bounds, namely bounds defined by either the infimum or supremum of a parametric or nonparametric function, or equivalently, the value of a linear programming problem with a potentially infinite constraint set. We show that many bounds characterizations in econometrics, for instance bounds on parameters under conditional moment inequalities, can be formulated as intersection bounds. Our approach is especially convenient for models comprised of a continuum of inequalities that are separable in parameters, and also applies to models with inequalities that are non-separable in parameters. Since analog estimators for intersection bounds can be severely biased in finite samples, routinely underestimating the size of the identified set, we also offer a median-bias-corrected estimator of such bounds as a by-product of our inferential procedures. We develop theory for large sample inference based on the strong approximation of a sequence of series or kernel-based empirical processes by a sequence of "penultimate" Gaussian processes. These penultimate processes are generally not weakly convergent, and thus non-Donsker. Our theoretical results establish that we can nonetheless perform asymptotically valid inference based on these processes. Our construction also provides new adaptive inequality/moment selection methods. We provide conditions for the use of nonparametric kernel and series estimators, including a novel result that establishes strong approximation for any general series estimator admitting linearization, which may be of independent interest.

Journal Article•DOI•

Testing for jumps in a discretely observed process

Yacine Ait-Sahalia, Jean Jacod

02 Mar 2009-arXiv: Statistics Theory

TL;DR: In this paper, the authors proposed a new test to determine whether jumps are present in asset returns or other discretely sampled processes, which is valid for all Ito semimartingales, depends neither on the law of the process nor on the coefficients of the equation which it solves, does not require a preliminary estimation of these coefficients, and when there are jumps the test is applicable whether jumps have finite or infinite-activity and for an arbitrary Blumenthal--Getoor index.

Abstract: We propose a new test to determine whether jumps are present in asset returns or other discretely sampled processes. As the sampling interval tends to 0, our test statistic converges to 1 if there are jumps, and to another deterministic and known value (such as 2) if there are no jumps. The test is valid for all Ito semimartingales, depends neither on the law of the process nor on the coefficients of the equation which it solves, does not require a preliminary estimation of these coefficients, and when there are jumps the test is applicable whether jumps have finite or infinite-activity and for an arbitrary Blumenthal--Getoor index. We finally implement the test on simulations and asset returns data.
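The limiting behaviour described in the abstract can be illustrated with a ratio of power variations computed at two sampling frequencies. The sketch below assumes the statistic is the $p$-th power variation at sampling step $k\Delta$ divided by the one at step $\Delta$, with $p=4$ and $k=2$, for which the no-jump limit $k^{p/2-1}$ equals 2. That specific form, the simulated Brownian path, the single jump of size 5, and all function names are assumptions made for this example and are not stated in the abstract.

```python
import math
import random


def power_variation(path, p, stride):
    """Sum of |increment|^p over non-overlapping increments spanning `stride` steps."""
    return sum(
        abs(path[i] - path[i - stride]) ** p
        for i in range(stride, len(path), stride)
    )


def jump_ratio(path, p=4, k=2):
    """Ratio of power variations at the coarse (k steps) and fine (1 step) scales."""
    return power_variation(path, p, k) / power_variation(path, p, 1)


def brownian_path(n, rng, jump_at=None, jump_size=0.0):
    """Brownian motion on [0, 1] observed at n + 1 points, with an optional single jump."""
    dt = 1.0 / n
    x = 0.0
    path = [x]
    for i in range(1, n + 1):
        x += rng.gauss(0.0, math.sqrt(dt))
        if i == jump_at:
            x += jump_size  # hypothetical jump, for illustration only
        path.append(x)
    return path


continuous = brownian_path(20000, random.Random(0))
with_jump = brownian_path(20000, random.Random(0), jump_at=10001, jump_size=5.0)
print(jump_ratio(continuous))  # expected to be near 2 for a continuous path
print(jump_ratio(with_jump))   # expected to be near 1 when a jump is present
```

With a jump present, the largest increment dominates both sums and appears once at either sampling scale, which is why the ratio approaches 1.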

Journal Article•DOI•

Kernel dimension reduction in regression

Kenji Fukumizu, Francis Bach, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

13 Aug 2009-arXiv: Statistics Theory

TL;DR: In this paper, a new methodology for sufficient dimension reduction (SDR) is presented, which derives directly from the formulation of SDR in terms of the conditional independence of the covariate $X$ from the response $Y$, given the projection of $X$ on the central subspace.

Abstract: We present a new methodology for sufficient dimension reduction (SDR). Our methodology derives directly from the formulation of SDR in terms of the conditional independence of the covariate $X$ from the response $Y$, given the projection of $X$ on the central subspace [cf. J. Amer. Statist. Assoc. 86 (1991) 316--342 and Regression Graphics (1998) Wiley]. We show that this conditional independence assertion can be characterized in terms of conditional covariance operators on reproducing kernel Hilbert spaces and we show how this characterization leads to an $M$-estimator for the central subspace. The resulting estimator is shown to be consistent under weak conditions; in particular, we do not have to impose linearity or ellipticity conditions of the kinds that are generally invoked for SDR methods. We also present empirical results showing that the new methodology is competitive in practice.

Journal Article•DOI•

Finite sample approximation results for principal component analysis: a matrix perturbation approach

Boaz Nadler

21 Jan 2009-arXiv: Statistics Theory

TL;DR: A matrix perturbation view of the "phase transition phenomenon" and a simple linear-algebra based derivation of the eigenvalue and eigenvector overlap of sample PCA in the joint limit $p,n\to\infty$ with $p/n=c$ are presented.

Abstract: Principal component analysis (PCA) is a standard tool for dimensional reduction of a set of $n$ observations (samples), each with $p$ variables. In this paper, using a matrix perturbation approach, we study the nonasymptotic relation between the eigenvalues and eigenvectors of PCA computed on a finite sample of size $n$, and those of the limiting population PCA as $n\to\infty$. As in machine learning, we present a finite sample theorem which holds with high probability for the closeness between the leading eigenvalue and eigenvector of sample PCA and population PCA under a spiked covariance model. In addition, we also consider the relation between finite sample PCA and the asymptotic results in the joint limit $p,n\to\infty$, with $p/n=c$. We present a matrix perturbation view of the "phase transition phenomenon," and a simple linear-algebra based derivation of the eigenvalue and eigenvector overlap in this asymptotic limit. Moreover, our analysis also applies for finite $p,n$ where we show that although there is no sharp phase transition as in the infinite case, either as a function of noise level or as a function of sample size $n$, the eigenvector of sample PCA may exhibit a sharp "loss of tracking," suddenly losing its relation to the (true) eigenvector of the population PCA matrix. This occurs due to a crossover between the eigenvalue due to the signal and the largest eigenvalue due to noise, whose eigenvector points in a random direction.

Journal Article•DOI•

Smoothing splines estimators for functional linear regression

Christophe Crambes¹, Alois Kneip, Pascal Sarda¹•Institutions (1)

Institut de Mathématiques de Toulouse¹

25 Feb 2009-arXiv: Statistics Theory

TL;DR: In this paper, a smoothing splines estimator for the functional slope parameter, based on a slight modification of the usual penalty, is proposed, and the rates of convergence of its prediction error are shown to be optimal in the sense that they are minimax over large classes of possible slope functions and distributions of the predictive curves.

Abstract: The paper considers functional linear regression, where scalar responses $Y_1,...,Y_n$ are modeled in dependence of random functions $X_1,...,X_n$. We propose a smoothing splines estimator for the functional slope parameter based on a slight modification of the usual penalty. Theoretical analysis concentrates on the error in an out-of-sample prediction of the response for a new random function $X_{n+1}$. It is shown that rates of convergence of the prediction error depend on the smoothness of the slope function and on the structure of the predictors. We then prove that these rates are optimal in the sense that they are minimax over large classes of possible slope functions and distributions of the predictive curves. For the case of models with errors-in-variables the smoothing spline estimator is modified by using a denoising correction of the covariance matrix of discretized curves. The methodology is then applied to a real case study where the aim is to predict the maximum of the concentration of ozone by using the curve of this concentration measured the preceding day.

Journal Article•DOI•

Estimation of volatility functionals in the simultaneous presence of microstructure noise and jumps

Mark Podolskij¹, Mathias Vetter²•Institutions (2)

Aarhus University¹, Ruhr University Bochum²

04 Sep 2009-arXiv: Statistics Theory

TL;DR: In this paper, a new concept of modulated bipower variation for diffusion models with microstructure noise is proposed, which provides simple estimates for such important quantities as integrated volatility or integrated quarticity.

Abstract: We propose a new concept of modulated bipower variation for diffusion models with microstructure noise. We show that this method provides simple estimates for such important quantities as integrated volatility or integrated quarticity. Under mild conditions the consistency of modulated bipower variation is proven. Under further assumptions we prove stable convergence of our estimates with the optimal rate $n^{-{1}/{4}}$. Moreover, we construct estimates which are robust to finite activity jumps.

Journal Article•DOI•

PCA consistency in high dimension, low sample size context

Sungkyu Jung¹, James Stephen Marron•Institutions (1)

University of North Carolina at Chapel Hill¹

19 Nov 2009-arXiv: Statistics Theory

TL;DR: This work investigates the asymptotic behavior of the Principal Component (PC) directions and shows that if the first few eigenvalues of a population covariance matrix are large enough compared to the others, then the corresponding estimated PC directions are consistent or converge to the appropriate subspace (subspace consistency) and most other PC directions are strongly inconsistent.

Abstract: Principal Component Analysis (PCA) is an important tool of dimension reduction especially when the dimension (or the number of variables) is very high. Asymptotic studies where the sample size is fixed, and the dimension grows [i.e., High Dimension, Low Sample Size (HDLSS)] are becoming increasingly relevant. We investigate the asymptotic behavior of the Principal Component (PC) directions. HDLSS asymptotics are used to study consistency, strong inconsistency and subspace consistency. We show that if the first few eigenvalues of a population covariance matrix are large enough compared to the others, then the corresponding estimated PC directions are consistent or converge to the appropriate subspace (subspace consistency) and most other PC directions are strongly inconsistent. Broad sets of sufficient conditions for each of these cases are specified and the main theorem gives a catalogue of possible combinations. In preparation for these results, we show that the geometric representation of HDLSS data holds under general conditions, which includes a $\rho$-mixing condition and a broad range of sphericity measures of the covariance matrix.

Journal Article•DOI•

Some sharp performance bounds for least squares regression with $L_1$ regularization

Tong Zhang

20 Aug 2009-arXiv: Statistics Theory

TL;DR: In this article, the authors derived sharp performance bounds for least squares regression with $L_1$ regularization from parameter estimation accuracy and feature selection quality perspectives, and analyzed a two-stage procedure with selective penalization.

Abstract: We derive sharp performance bounds for least squares regression with $L_1$ regularization from parameter estimation accuracy and feature selection quality perspectives. The main result proved for $L_1$ regularization extends a similar result in [Ann. Statist. 35 (2007) 2313--2351] for the Dantzig selector. It gives an affirmative answer to an open question in [Ann. Statist. 35 (2007) 2358--2364]. Moreover, the result leads to an extended view of feature selection that allows less restrictive conditions than some recent work. Based on the theoretical insights, a novel two-stage $L_1$-regularization procedure with selective penalization is analyzed. It is shown that if the target parameter vector can be decomposed as the sum of a sparse parameter vector with large coefficients and another less sparse vector with relatively small coefficients, then the two-stage procedure can lead to improved performance.

Journal Article•DOI•

Estimating the degree of activity of jumps in high frequency data

Yacine Ait-Sahalia, Jean Jacod¹•Institutions (1)

Institut de Mathématiques de Jussieu¹

21 Aug 2009-arXiv: Statistics Theory

TL;DR: In this paper, a generalized index of jump activity is defined, and estimators of that index for a discretely sampled process are proposed and their properties derived. The estimators are applicable despite the presence of Brownian volatility in the process, which makes it more challenging to infer the characteristics of the small, infinite activity jumps.

Abstract: We define a generalized index of jump activity, propose estimators of that index for a discretely sampled process and derive the estimators' properties. These estimators are applicable despite the presence of Brownian volatility in the process, which makes it more challenging to infer the characteristics of the small, infinite activity jumps. When the method is applied to high frequency stock returns, we find evidence of infinitely active jumps in the data and estimate their index of activity.

Journal Article•DOI•

Functional linear regression that's interpretable

Gareth M. James¹, Jing Wang, Ji Zhu•Institutions (1)

University of Southern California¹

20 Aug 2009-arXiv: Statistics Theory

TL;DR: This article introduces a new approach which uses variable selection ideas, applied to various derivatives of $\beta(t)$, to produce estimates that are interpretable, flexible and accurate; the approach is demonstrated on simulated and real-world data sets.

Abstract: Regression models to relate a scalar $Y$ to a functional predictor $X(t)$ are becoming increasingly common. Work in this area has concentrated on estimating a coefficient function, $\beta(t)$, with $Y$ related to $X(t)$ through $\int\beta(t)X(t) dt$. Regions where $\beta(t)\neq0$ correspond to places where there is a relationship between $X(t)$ and $Y$. Alternatively, points where $\beta(t)=0$ indicate no relationship. Hence, for interpretation purposes, it is desirable for a regression procedure to be capable of producing estimates of $\beta(t)$ that are exactly zero over regions with no apparent relationship and have simple structures over the remaining regions. Unfortunately, most fitting procedures result in an estimate for $\beta(t)$ that is rarely exactly zero and has unnatural wiggles making the curve hard to interpret. In this article we introduce a new approach which uses variable selection ideas, applied to various derivatives of $\beta(t)$, to produce estimates that are both interpretable, flexible and accurate. We call our method "Functional Linear Regression That's Interpretable" (FLiRTI) and demonstrate it on simulated and real-world data sets. In addition, non-asymptotic theoretical bounds on the estimation error are presented. The bounds provide strong theoretical motivation for our approach.

Journal Article•DOI•

Corrections to LRT on Large Dimensional Covariance Matrix by RMT

Zhidong Bai, Dandan Jiang, Jianfeng Yao, Shurong Zheng

03 Feb 2009-arXiv: Statistics Theory

TL;DR: In this article, the authors explain why two likelihood ratio procedures for testing covariance matrices from Gaussian populations fail when the dimension is large compared to the sample size.

Abstract: In this paper, we give an explanation for the failure of two likelihood ratio procedures for testing covariance matrices from Gaussian populations when the dimension is large compared to the sample size. Next, using recent central limit theorems for linear spectral statistics of sample covariance matrices and of random F-matrices, we propose necessary corrections for these LR tests to cope with high-dimensional effects. The asymptotic distributions of these corrected tests under the null are given. Simulations demonstrate that the corrected LR tests yield a realized size close to the nominal level for both moderate p (around 20) and high dimension, while the traditional LR tests with chi-square approximation fail. Another contribution of the paper is that for testing the equality between two covariance matrices, the proposed correction applies equally to non-Gaussian populations, yielding a valid pseudo-likelihood ratio test.

Journal Article•DOI•

Quantile regression in partially linear varying coefficient models

Huixia Judy Wang, Zhongyi Zhu, Jianhui Zhou

18 Nov 2009-arXiv: Statistics Theory

TL;DR: In this article, a class of marginal partially linear quantile models with possibly varying coefficients is studied, where the functional coefficients are estimated by basis function approximations, and rank score tests for hypotheses on the coefficients are developed.

Abstract: Semiparametric models are often considered for analyzing longitudinal data for a good balance between flexibility and parsimony. In this paper, we study a class of marginal partially linear quantile models with possibly varying coefficients. The functional coefficients are estimated by basis function approximations. The estimation procedure is easy to implement, and it requires no specification of the error distributions. The asymptotic properties of the proposed estimators are established for the varying coefficients as well as for the constant coefficients. We develop rank score tests for hypotheses on the coefficients, including the hypotheses on the constancy of a subset of the varying coefficients. Hypothesis testing of this type is theoretically challenging, as the dimensions of the parameter spaces under both the null and the alternative hypotheses are growing with the sample size. We assess the finite sample performance of the proposed method by Monte Carlo simulation studies, and demonstrate its value by the analysis of an AIDS data set, where the modeling of quantiles provides more comprehensive information than the usual least squares approach.

Journal Article•DOI•

Adaptive estimation for Hawkes processes; application to genome analysis

Patricia Reynaud-Bouret, Sophie Schbath

17 Mar 2009-arXiv: Statistics Theory

TL;DR: In this article, a nonasymptotic penalized model selection approach for the detection of either favored or avoided distances between genomic events along DNA sequences is proposed, based on the Hawkes process.

Abstract: The aim of this paper is to provide a new method for the detection of either favored or avoided distances between genomic events along DNA sequences. These events are modeled by a Hawkes process. The biological problem is actually complex enough to need a nonasymptotic penalized model selection approach. We provide a theoretical penalty that satisfies an oracle inequality even for quite complex families of models. The resulting theoretical estimator is shown to be adaptive minimax for Hölderian functions with regularity in $(1/2,1]$: those aspects have not yet been studied for the Hawkes process. Moreover, we introduce an efficient strategy, named Islands, which is not classically used in model selection, but that happens to be particularly relevant to the biological question we want to answer. Since a multiplicative constant in the theoretical penalty is not computable in practice, we provide extensive simulations to find a data-driven calibration of this constant. The results obtained on real genomic data are consistent with biological knowledge and possibly refine it.

Journal Article•DOI•

Innovated higher criticism for detecting sparse signals in correlated noise

Peter A. Hall, Jiashun Jin

23 Feb 2009-arXiv: Statistics Theory

TL;DR: It turns out that the case of independent noise is the most difficult of all, from a statistical viewpoint, and that more accurate signal detection can be obtained when correlation is present, by exploiting the nature of correlation.

Abstract: Higher criticism is a method for detecting signals that are both sparse and weak. Although first proposed in cases where the noise variables are independent, higher criticism also has reasonable performance in settings where those variables are correlated. In this paper we show that, by exploiting the nature of the correlation, performance can be improved by using a modified approach which exploits the potential advantages that correlation has to offer. Indeed, it turns out that the case of independent noise is the most difficult of all, from a statistical viewpoint, and that more accurate signal detection (for a given level of signal sparsity and strength) can be obtained when correlation is present. We characterize the advantages of correlation by showing how to incorporate them into the definition of an optimal detection boundary. The boundary has particularly attractive properties when correlation decays at a polynomial rate or the correlation matrix is Toeplitz.
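
As a point of reference, the classical higher criticism statistic for independent noise can be sketched as follows. This is the standard form, not the innovated variant studied in the paper, which additionally exploits the correlation structure. The function names, the one-sided p-values, the default `alpha0=0.5`, and the synthetic scores are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def upper_tail_p_value(z):
    """One-sided p-value P(N(0,1) > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def higher_criticism(z_scores, alpha0=0.5):
    """Classical higher criticism statistic for independent z-scores.

    Computes the maximum, over the alpha0 * n smallest p-values, of
    sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))).
    """
    n = len(z_scores)
    p_sorted = sorted(upper_tail_p_value(z) for z in z_scores)
    best = -math.inf
    for i in range(1, max(1, int(alpha0 * n)) + 1):
        p = p_sorted[i - 1]
        if p <= 0.0 or p >= 1.0:
            continue  # skip degenerate p-values to avoid division by zero
        stat = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, stat)
    return best


# Hypothetical data: an idealised null sample (exact normal quantiles),
# and a copy in which 20 of the 1000 scores are replaced by a signal at 4.0.
n = 1000
null_scores = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
signal_scores = list(null_scores)
for j in range(490, 510):
    signal_scores[j] = 4.0

hc_null = higher_criticism(null_scores)
hc_signal = higher_criticism(signal_scores)
print(hc_null, hc_signal)
```

The statistic stays small on the null sample and becomes large once a small fraction of scores carries signal, which is the sparse detection behaviour the abstract refers to.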

Posted Content•

Sparse Principal Components Analysis

Iain M. Johnstone, Arthur Yu Lu

28 Jan 2009-arXiv: Statistics Theory

TL;DR: In this article, a simple "sparse PCA" algorithm is studied: select a subset of coordinates of largest variance, estimate eigenvectors from PCA on the selected subset, threshold, and re-express in the original basis.

Abstract: Principal components analysis (PCA) is a classical method for the reduction of dimensionality of data in the form of n observations (or cases) of a vector with p variables. For a simple model of factor analysis type, it is proved that ordinary PCA can produce a consistent (for n large) estimate of the principal factor if and only if p(n) is asymptotically of smaller order than n. There may be a basis in which typical signals have sparse representations: most co-ordinates have small signal energies. If such a basis (e.g. wavelets) is used to represent the signals, then the variation in many coordinates is likely to be small. Consequently, we study a simple "sparse PCA" algorithm: select a subset of coordinates of largest variance, estimate eigenvectors from PCA on the selected subset, threshold and reexpress in the original basis. We illustrate the algorithm on some exercise ECG data, and prove that in a single factor model, under an appropriate sparsity assumption, it yields consistent estimates of the principal factor.
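
The four steps named in the abstract (select, estimate, threshold, re-express) can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are assumed to be already expressed in the sparse basis, so the change of basis (for example a wavelet transform) is omitted and "re-express" reduces to placing the loadings back at their original coordinates. The function names, the power-iteration eigen-solver, the hard-threshold rule, and the simulated single-factor data are all hypothetical choices.

```python
import math
import random


def leading_eigenvector(matrix, iterations=200):
    """Power iteration for the top eigenvector of a symmetric PSD matrix."""
    m = len(matrix)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def sparse_pca(rows, k, threshold):
    """Leading sparse principal component of rows (n observations, p coordinates)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    variances = [sum(c[j] ** 2 for c in centered) / (n - 1) for j in range(p)]

    # Step 1: keep the k coordinates with the largest sample variance.
    selected = sorted(range(p), key=lambda j: variances[j], reverse=True)[:k]

    # Step 2: ordinary PCA restricted to the selected coordinates.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in selected]
           for a in selected]
    v = leading_eigenvector(cov)

    # Step 3: hard-threshold small loadings, then renormalise.
    v = [x if abs(x) >= threshold else 0.0 for x in v]
    norm = math.sqrt(sum(x * x for x in v))

    # Step 4: place the loadings back at their original coordinate positions.
    full = [0.0] * p
    if norm > 0.0:
        for position, j in enumerate(selected):
            full[j] = v[position] / norm
    return full


# Hypothetical single-factor data: 200 observations of 30 coordinates,
# with the factor loading equally on coordinates 0, 1 and 2.
rng = random.Random(0)
loading = [1.0 / math.sqrt(3.0)] * 3 + [0.0] * 27
data = []
for _ in range(200):
    u = rng.gauss(0.0, 1.0)
    data.append([3.0 * u * loading[j] + rng.gauss(0.0, 1.0) for j in range(30)])

estimate = sparse_pca(data, k=5, threshold=0.2)
support = [j for j, x in enumerate(estimate) if x != 0.0]
print(support)
```

Selecting by variance before running PCA keeps the eigenproblem small (here 5 by 5 instead of 30 by 30), and the threshold then removes the noise coordinates that slipped through the selection step.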

Journal Article•DOI•

General maximum likelihood empirical Bayes estimation of normal means

Wenhua Jiang¹, Cun-Hui Zhang¹•Institutions (1)

Rutgers University¹

12 Aug 2009-arXiv: Statistics Theory

TL;DR: Simulation experiments demonstrate that the GMLEB outperforms the James-Stein and several state-of-the-art threshold estimators in a wide range of settings without much downside.

Abstract: We propose a general maximum likelihood empirical Bayes (GMLEB) method for the estimation of a mean vector based on observations with i.i.d. normal errors. We prove that under mild moment conditions on the unknown means, the average mean squared error (MSE) of the GMLEB is within an infinitesimal fraction of the minimum average MSE among all separable estimators which use a single deterministic estimating function on individual observations, provided that the risk is of greater order than $(\log n)^5/n$. We also prove that the GMLEB is uniformly approximately minimax in regular and weak $\ell_p$ balls when the order of the length-normalized norm of the unknown means is between $(\log n)^{\kappa_1}/n^{1/(p\wedge2)}$ and $n/(\log n)^{\kappa_2}$. Simulation experiments demonstrate that the GMLEB outperforms the James--Stein and several state-of-the-art threshold estimators in a wide range of settings without much downside.

Journal Article•DOI•

A unified approach to model selection and sparse recovery using regularized least squares

Jinchi Lv, Yingying Fan

21 May 2009-arXiv: Statistics Theory

TL;DR: In this paper, the authors study the properties of regularization methods for model selection and sparse recovery under the unified framework of regularized least squares with concave penalties, and propose the sequentially and iteratively reweighted squares (SIRS) algorithm for sparse recovery.

Abstract: Model selection and sparse recovery are two important problems for which many regularization methods have been proposed. We study the properties of regularization methods in both problems under the unified framework of regularized least squares with concave penalties. For model selection, we establish conditions under which a regularized least squares estimator enjoys a nonasymptotic property, called the weak oracle property, where the dimensionality can grow exponentially with sample size. For sparse recovery, we present a sufficient condition that ensures the recoverability of the sparsest solution. In particular, we approach both problems by considering a family of penalties that give a smooth homotopy between $L_0$ and $L_1$ penalties. We also propose the sequentially and iteratively reweighted squares (SIRS) algorithm for sparse recovery. Numerical studies support our theoretical results and demonstrate the advantage of our new methods for model selection and sparse recovery.

Journal Article•DOI•

Hypothesis test for normal mixture models: The EM approach

Jiahua Chen, Pengfei Li

24 Aug 2009-arXiv: Statistics Theory

TL;DR: In this paper, the EM-test for homogeneity is applied to finite normal mixtures, and it is shown that the limiting distribution is a simple function of the $0.5\chi^2_0+0.5\chi^2_1$ and $\chi^2_1$ distributions when the mixing variances are equal but unknown, and the $\chi^2_2$ distribution when the variances are unequal and unknown.

Abstract: Normal mixture distributions are arguably the most important mixture models, and also the most technically challenging. The likelihood function of the normal mixture model is unbounded based on a set of random samples, unless an artificial bound is placed on its component variance parameter. Moreover, the model is not strongly identifiable so it is hard to differentiate between over dispersion caused by the presence of a mixture and that caused by a large variance, and it has infinite Fisher information with respect to mixing proportions. There has been extensive research on finite normal mixture models, but much of it addresses merely consistency of the point estimation or useful practical procedures, and many results require undesirable restrictions on the parameter space. We show that an EM-test for homogeneity is effective at overcoming many challenges in the context of finite normal mixtures. We find that the limiting distribution of the EM-test is a simple function of the $0.5\chi^2_0+0.5\chi^2_1$ and $\chi^2_1$ distributions when the mixing variances are equal but unknown and the $\chi^2_2$ when variances are unequal and unknown. Simulations show that the limiting distributions approximate the finite sample distribution satisfactorily. Two genetic examples are used to illustrate the application of the EM-test.
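
The reference distributions named in the abstract have closed-form tail probabilities, sketched below with the standard library only. The function names are hypothetical. Note the limits of this sketch: the abstract does not say how the equal-variance limit combines the $0.5\chi^2_0+0.5\chi^2_1$ and $\chi^2_1$ distributions, so the first two functions are building blocks only; just the $\chi^2_2$ tail corresponds directly to a stated limiting distribution (the unequal-variance case). Here $\chi^2_0$ is taken to be a point mass at zero, which is the usual convention.

```python
import math


def chi2_1_tail(x):
    """P(X > x) for X chi-square with 1 degree of freedom."""
    return 1.0 if x <= 0.0 else math.erfc(math.sqrt(x / 2.0))


def chi2_2_tail(x):
    """P(X > x) for X chi-square with 2 degrees of freedom."""
    return 1.0 if x <= 0.0 else math.exp(-x / 2.0)


def half_half_mixture_tail(x):
    """P(W > x) when W is 0 with probability 1/2 and chi-square(1) otherwise."""
    if x < 0.0:
        return 1.0
    return 0.5 * chi2_1_tail(x)


# Familiar 5% critical values for each reference distribution.
print(chi2_1_tail(3.841), half_half_mixture_tail(2.706), chi2_2_tail(5.991))
```

Because half of the mixture's mass sits at zero, its 5% critical value (about 2.706) is smaller than that of the plain $\chi^2_1$ distribution (about 3.841).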

Posted Content•

Making and Evaluating Point Forecasts

Tilmann Gneiting

04 Dec 2009-arXiv: Statistics Theory

TL;DR: In this paper, the authors demonstrate that the common practice of comparing and assessing point forecasting methods by means of an error measure or scoring function, such as the absolute error or the squared error, can lead to grossly misguided inferences, unless the scoring function and the forecasting task are carefully matched.

Abstract: Typically, point forecasting methods are compared and assessed by means of an error measure or scoring function, such as the absolute error or the squared error. The individual scores are then averaged over forecast cases, to result in a summary measure of the predictive performance, such as the mean absolute error or the (root) mean squared error. I demonstrate that this common practice can lead to grossly misguided inferences, unless the scoring function and the forecasting task are carefully matched. Effective point forecasting requires that the scoring function be specified ex ante, or that the forecaster receives a directive in the form of a statistical functional, such as the mean or a quantile of the predictive distribution. If the scoring function is specified ex ante, the forecaster can issue the optimal point forecast, namely, the Bayes rule. If the forecaster receives a directive in the form of a functional, it is critical that the scoring function be consistent for it, in the sense that the expected score is minimized when following the directive. A functional is elicitable if there exists a scoring function that is strictly consistent for it. Expectations, ratios of expectations and quantiles are elicitable. For example, a scoring function is consistent for the mean functional if and only if it is a Bregman function. It is consistent for a quantile if and only if it is generalized piecewise linear. Similar characterizations apply to ratios of expectations and to expectiles. Weighted scoring functions are consistent for functionals that adapt to the weighting in peculiar ways. Not all functionals are elicitable; for instance, conditional value-at-risk is not, despite its popularity in quantitative finance.
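
The matching between scoring function and functional can be checked numerically. The sketch below searches a grid of candidate point forecasts and reports which one minimizes the average score under squared error, absolute error, and a piecewise linear (pinball) score. The sample, the grid, and all function names are assumptions made for illustration.

```python
def mean_score(score, forecast, outcomes):
    """Average score of a single point forecast over all outcomes."""
    return sum(score(forecast, y) for y in outcomes) / len(outcomes)


def squared_error(x, y):
    return (x - y) ** 2


def absolute_error(x, y):
    return abs(x - y)


def pinball(alpha):
    """Piecewise linear score that is consistent for the alpha-quantile."""
    def score(x, y):
        return (1.0 - alpha) * (x - y) if x >= y else alpha * (y - x)
    return score


def best_forecast(score, outcomes, grid):
    """Grid point with the smallest average score."""
    return min(grid, key=lambda x: mean_score(score, x, outcomes))


# Hypothetical right-skewed sample: mean 5, median 3.
outcomes = [1, 1, 2, 2, 3, 4, 5, 9, 18]
grid = [i / 10 for i in range(0, 201)]

best_se = best_forecast(squared_error, outcomes, grid)
best_ae = best_forecast(absolute_error, outcomes, grid)
best_q90 = best_forecast(pinball(0.9), outcomes, grid)
print(best_se, best_ae, best_q90)
```

On this skewed sample the three scores favor three different forecasts (the mean, the median, and an upper quantile), so a forecaster who reports the mean is penalized under absolute error even though the forecast is optimal for the mean functional. This is the mismatch the abstract warns about.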
