Showing papers in "Annals of Statistics in 2004"


Journal ArticleDOI
TL;DR: A publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates is described.
Abstract: The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.

7,828 citations
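
A minimal sketch of computing the full Lasso path with a LARS-based routine, using scikit-learn's lars_path (an independent implementation, not the publicly available algorithm the paper refers to); the diabetes data and settings are only illustrative.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import lars_path

    X, y = load_diabetes(return_X_y=True)

    # method="lasso" applies the Lasso modification of LARS: the whole
    # solution path is obtained at roughly the cost of a single OLS fit.
    alphas, active, coefs = lars_path(X, y, method="lasso")
    print(coefs.shape)  # (n_features, n_knots): coefficients along the path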


Journal ArticleDOI
TL;DR: In this article, the oracle property and asymptotic normality established by Fan and Li for nonconcave penalized likelihood with a finite number of parameters are extended to settings where the number of parameters grows with the sample size, and the consistency of the sandwich formula for the covariance matrix is demonstrated.
Abstract: A class of variable selection procedures for parametric models via nonconcave penalized likelihood was proposed by Fan and Li to simultaneously estimate parameters and select important variables. They demonstrated that this class of procedures has an oracle property when the number of parameters is finite. However, in most model selection problems the number of parameters should be large and grow with the sample size. In this paper some asymptotic properties of the nonconcave penalized likelihood are established for situations in which the number of parameters tends to ∞ as the sample size increases. Under regularity conditions we have established an oracle property and the asymptotic normality of the penalized likelihood estimators. Furthermore, the consistency of the sandwich formula of the covariance matrix is demonstrated. Nonconcave penalized likelihood ratio statistics are discussed, and their asymptotic distributions under the null hypothesis are obtained by imposing some mild conditions on the penalty functions. The asymptotic results are augmented by a simulation study, and the newly developed methodology is illustrated by an analysis of a court case on the sexual discrimination of salary.

978 citations
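
For concreteness, a small sketch of the SCAD penalty of Fan and Li, the leading example of the nonconcave penalties discussed above (a = 3.7 is their commonly suggested default; the code is our own illustration, not from the paper).

    import numpy as np

    def scad_penalty(theta, lam, a=3.7):
        """SCAD penalty p_lambda(|theta|): linear near zero, quadratic in the
        middle, constant in the tail so large coefficients are not over-shrunk."""
        t = np.abs(theta)
        quad = -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1))
        flat = (a + 1) * lam**2 / 2
        return np.where(t <= lam, lam * t, np.where(t <= a * lam, quad, flat))

    print(scad_penalty(np.array([0.1, 0.5, 2.0, 10.0]), lam=0.5))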


Journal ArticleDOI
TL;DR: In this article, the authors show that the median probability model is often the optimal predictive model, which is defined as the model consisting of those variables which have overall posterior probability greater than or equal to 1/2 of being in a model.
Abstract: Often the goal of model selection is to choose a model for future prediction, and it is natural to measure the accuracy of a future prediction by squared error loss. Under the Bayesian approach, it is commonly perceived that the optimal predictive model is the model with highest posterior probability, but this is not necessarily the case. In this paper we show that, for selection among normal linear models, the optimal predictive model is often the median probability model, which is defined as the model consisting of those variables which have overall posterior probability greater than or equal to 1/2 of being in a model. The median probability model often differs from the highest probability model.

881 citations
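
A toy sketch (ours, not the paper's) of the selection rule: accumulate each variable's overall posterior inclusion probability across candidate models and keep those with probability at least 1/2; the model list and probabilities below are made up.

    import numpy as np

    def median_probability_model(models, post_probs, n_vars):
        """models: list of tuples of included variable indices;
        post_probs: posterior probabilities of the models (summing to 1)."""
        incl = np.zeros(n_vars)
        for m, p in zip(models, post_probs):
            incl[list(m)] += p
        return [j for j in range(n_vars) if incl[j] >= 0.5], incl

    # Toy example with three candidate models over three covariates.
    selected, incl = median_probability_model(
        models=[(0,), (0, 1), (2,)], post_probs=[0.40, 0.35, 0.25], n_vars=3)
    print(selected, incl)  # variable 0 is selected, inclusion probability 0.75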


Journal ArticleDOI
TL;DR: In this paper, higher criticism is used to test whether n normal means are all zero versus the alternative that a small fraction of the means is nonzero, and it is shown that higher criticism works well over a range of non-Gaussian cases.
Abstract: Higher criticism, or second-level significance testing, is a multiple-comparisons concept mentioned in passing by Tukey. It concerns a situation where there are many independent tests of significance and one is interested in rejecting the joint null hypothesis. Tukey suggested comparing the fraction of observed significances at a given α-level to the expected fraction under the joint null. In fact, he suggested standardizing the difference of the two quantities and forming a z-score; the resulting z-score tests the significance of the body of significance tests. We consider a generalization, where we maximize this z-score over a range of significance levels 0 < α ≤ α_0. We are able to show that the resulting higher criticism statistic is effective at resolving a very subtle testing problem: testing whether n normal means are all zero versus the alternative that a small fraction is nonzero. The subtlety of this “sparse normal means” testing problem can be seen from work of Ingster and Jin, who studied such problems in great detail. In their studies, they identified an interesting range of cases where the small fraction of nonzero means is so small that the alternative hypothesis exhibits little noticeable effect on the distribution of the p-values either for the bulk of the tests or for the few most highly significant tests. In this range, when the amplitude of nonzero means is calibrated with the fraction of nonzero means, the likelihood ratio test for a precisely specified alternative would still succeed in separating the two hypotheses. We show that the higher criticism is successful throughout the same region of amplitude sparsity where the likelihood ratio test would succeed. Since it does not require a specification of the alternative, this shows that higher criticism is in a sense optimally adaptive to unknown sparsity and size of the nonnull effects. While our theoretical work is largely asymptotic, we provide simulations in finite samples and suggest some possible applications. We also show that higher criticism works well over a range of non-Gaussian cases.

812 citations
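
A brief sketch of the higher criticism statistic as described above: the maximum of the standardized comparison between the observed and expected fraction of significances over 0 < α ≤ α_0 (here α_0 = 1/2; the simulated null p-values are only for illustration).

    import numpy as np

    def higher_criticism(pvals, alpha0=0.5):
        p = np.sort(np.asarray(pvals))
        n = len(p)
        i = np.arange(1, n + 1)
        # At level alpha = p_(i) the observed fraction of significances is i/n,
        # while the expected fraction under the joint null is alpha itself.
        z = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p) + 1e-12)
        k = max(1, int(alpha0 * n))
        return z[:k].max()

    rng = np.random.default_rng(0)
    print(higher_criticism(rng.uniform(size=10**4)))  # modest value under the null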


Journal ArticleDOI
TL;DR: In this article, the authors show that the problem of maximizing entropy and minimizing a related discrepancy or divergence between distributions can be viewed as dual problems, with the solution to each providing that to the other.
Abstract: We describe and develop a close relationship between two problems that have customarily been regarded as distinct: that of maximizing entropy, and that of minimizing worst-case expected loss. Using a formulation grounded in the equilibrium theory of zero-sum games between Decision Maker and Nature, these two problems are shown to be dual to each other, the solution to each providing that to the other. Although Topsoe described this connection for the Shannon entropy over 20 years ago, it does not appear to be widely known even in that important special case. We here generalize this theory to apply to arbitrary decision problems and loss functions. We indicate how an appropriate generalized definition of entropy can be associated with such a problem, and we show that, subject to certain regularity conditions, the above-mentioned duality continues to apply in this extended context. This simultaneously provides a possible rationale for maximizing entropy and a tool for finding robust Bayes acts. We also describe the essential identity between the problem of maximizing entropy and that of minimizing a related discrepancy or divergence between distributions. This leads to an extension, to arbitrary discrepancies, of a well-known minimax theorem for the case of Kullback–Leibler divergence (the “redundancy-capacity theorem” of information theory). For the important case of families of distributions having certain mean values specified, we develop simple sufficient conditions and methods for identifying the desired solutions. We use this theory to introduce a new concept of “generalized exponential family” linked to the specific decision problem under consideration, and we demonstrate that this shares many of the properties of standard exponential families. Finally, we show that the existence of an equilibrium in our game can be rephrased in terms of a “Pythagorean property” of the related divergence, thus generalizing previously announced results for Kullback–Leibler and Bregman divergences.

502 citations
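
A tiny numerical illustration of our own (not the paper's generalized framework) of the link between maximizing entropy under mean-value constraints and exponential families, for the classical Shannon case on a finite support: the maximizer has probabilities proportional to exp(θ·x).

    import numpy as np
    from scipy.optimize import minimize

    x = np.arange(6)          # support {0, ..., 5}
    m = 2.0                   # prescribed mean value

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return float(np.sum(p * np.log(p)))

    cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
            {"type": "eq", "fun": lambda p: p @ x - m}]
    res = minimize(neg_entropy, np.full(6, 1 / 6), bounds=[(0, 1)] * 6,
                   constraints=cons)
    p = res.x
    # Exponential-family form p_i proportional to exp(theta * x_i):
    # successive log-ratios should be (nearly) constant.
    print(np.round(np.diff(np.log(p)), 3))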


Journal ArticleDOI
TL;DR: An empirical Bayes approach to the estimation of possibly sparse sequences observed in Gaussian white noise is set out and investigated, using a mixture of an atom of probability at zero and a heavy-tailed density γ with the mixing weight chosen by marginal maximum likelihood.
Abstract: An empirical Bayes approach to the estimation of possibly sparse sequences observed in Gaussian white noise is set out and investigated. The prior considered is a mixture of an atom of probability at zero and a heavy-tailed density γ, with the mixing weight chosen by marginal maximum likelihood, in the hope of adapting between sparse and dense sequences. If estimation is then carried out using the posterior median, this is a random thresholding procedure. Other thresholding rules employing the same threshold can also be used. Probability bounds on the threshold chosen by the marginal maximum likelihood approach lead to overall risk bounds over classes of signal sequences of length n, allowing for sparsity of various kinds and degrees. The signal classes considered are nearly black sequences where only a proportion η is allowed to be nonzero, and sequences with normalized ℓ_p norm bounded by η, for η > 0 and 0 < p ≤ 2. Simulations show excellent performance. For appropriately chosen functions γ, the method is computationally tractable and software is available. The extension to a modified thresholding method relevant to the estimation of very sparse sequences is also considered.

489 citations
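
A compact sketch of the idea under simplifying assumptions of our own: unit-variance Gaussian noise and a Gaussian N(0, τ²) slab standing in for the heavy-tailed density γ that the paper requires (and that its software uses). The mixing weight is chosen by marginal maximum likelihood and the posterior median acts as a threshold.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import norm

    def fit_weight(x, tau=3.0):
        """Marginal ML for w in the prior (1 - w)*delta_0 + w*N(0, tau^2)."""
        def neg_loglik(w):
            f = (1 - w) * norm.pdf(x) + w * norm.pdf(x, scale=np.sqrt(1 + tau**2))
            return -np.sum(np.log(f))
        return minimize_scalar(neg_loglik, bounds=(1e-4, 1 - 1e-4),
                               method="bounded").x

    def posterior_median(x, w, tau=3.0):
        sgn, ax = np.sign(x), np.abs(x)
        s2 = tau**2 / (1 + tau**2)
        mean = s2 * ax                                    # posterior mean given "nonzero"
        num = w * norm.pdf(ax, scale=np.sqrt(1 + tau**2))
        p = num / (num + (1 - w) * norm.pdf(ax))          # posterior P(mu != 0 | x)
        # Solving posterior CDF = 1/2: the median is exactly 0 whenever p <= 1/2,
        # otherwise a shrunken value; clipping at 0 makes the rule a threshold.
        med = np.where(p > 0.5,
                       mean + np.sqrt(s2) * norm.ppf(
                           np.clip((p - 0.5) / p, 1e-12, 1 - 1e-12)),
                       0.0)
        return sgn * np.maximum(med, 0.0)

    rng = np.random.default_rng(0)
    mu = np.r_[np.zeros(950), rng.normal(0, 4, 50)]       # a sparse signal
    x = mu + rng.standard_normal(1000)
    w = fit_weight(x)
    print(w, np.mean(posterior_median(x, w) != 0))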


Journal ArticleDOI
TL;DR: A central limit theorem for the Monte Carlo estimates produced by these computational methods is established in this paper, and applies in a general framework which encompasses most of the sequential Monte Carlo methods that have been considered in the literature, including the resample-move algorithm of Gilks and Berzuini [J. R. Stat. Soc. Ser. B Stat. Methodol. 63 (2001) 127–146] and the residual resampling scheme.
Abstract: The term “sequential Monte Carlo methods” or, equivalently, “particle filters,” refers to a general class of iterative algorithms that perform Monte Carlo approximations of a given sequence of distributions of interest (π_t). We establish in this paper a central limit theorem for the Monte Carlo estimates produced by these computational methods. This result holds under minimal assumptions on the distributions π_t, and applies in a general framework which encompasses most of the sequential Monte Carlo methods that have been considered in the literature, including the resample-move algorithm of Gilks and Berzuini [J. R. Stat. Soc. Ser. B Stat. Methodol. 63 (2001) 127–146] and the residual resampling scheme. The corresponding asymptotic variances provide a convenient measurement of the precision of a given particle filter. We study, in particular, in some typical examples of Bayesian applications, whether and at which rate these asymptotic variances diverge in time, in order to assess the long term reliability of the considered algorithm.

481 citations
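
As a concrete example of one ingredient covered by the theorem, a short sketch of the residual resampling scheme mentioned above (our own illustration; the weights below are arbitrary).

    import numpy as np

    def residual_resample(weights, rng):
        """Residual resampling: keep floor(N * w_i) copies of particle i
        deterministically, then fill the remaining slots by a multinomial
        draw from the normalized residual weights."""
        N = len(weights)
        counts = np.floor(N * weights).astype(int)
        residual = N * weights - counts
        n_left = N - counts.sum()
        if n_left > 0:
            counts += rng.multinomial(n_left, residual / residual.sum())
        return np.repeat(np.arange(N), counts)   # indices of resampled particles

    rng = np.random.default_rng(1)
    w = rng.dirichlet(np.ones(10))
    print(residual_resample(w, rng))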


Journal ArticleDOI
TL;DR: In this paper, the authors developed a framework in which the false discovery proportion (FDP) is treated as a stochastic process and derived confidence thresholds for controlling the quantiles of the distribution of the FDP as well as controlling the number of false discoveries.
Abstract: This paper extends the theory of false discovery rates (FDR) pioneered by Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289–300]. We develop a framework in which the False Discovery Proportion (FDP)—the number of false rejections divided by the number of rejections—is treated as a stochastic process. After obtaining the limiting distribution of the process, we demonstrate the validity of a class of procedures for controlling the False Discovery Rate (the expected FDP). We construct a confidence envelope for the whole FDP process. From these envelopes we derive confidence thresholds, for controlling the quantiles of the distribution of the FDP as well as controlling the number of false discoveries. We also investigate methods for estimating the p-value distribution.

459 citations
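
For orientation, a sketch of the Benjamini–Hochberg step-up procedure cited above, which controls the FDR (the expected FDP) and is the baseline that the FDP-process framework extends; the code is a generic implementation of ours, not the paper's confidence-envelope method.

    import numpy as np

    def benjamini_hochberg(pvals, q=0.10):
        """Classical BH step-up rule controlling the FDR (expected FDP) at level q."""
        p = np.asarray(pvals)
        n = len(p)
        order = np.argsort(p)
        passed = p[order] <= q * np.arange(1, n + 1) / n
        k = passed.nonzero()[0].max() + 1 if passed.any() else 0
        rejected = np.zeros(n, dtype=bool)
        rejected[order[:k]] = True
        return rejected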


Journal ArticleDOI
TL;DR: In this article, a multistep-multiscale bootstrap was proposed for the exponential family of distributions with unknown expectation parameter vector, where the null hypothesis is represented as an arbitrary-shaped region with smooth boundaries.
Abstract: Approximately unbiased tests based on bootstrap probabilities are considered for the exponential family of distributions with unknown expectation parameter vector, where the null hypothesis is represented as an arbitrary-shaped region with smooth boundaries. This problem has been discussed previously in Efron and Tibshirani [Ann. Statist. 26 (1998) 1687–1718], and a corrected p-value with second-order asymptotic accuracy is calculated by the two-level bootstrap of Efron, Halloran and Holmes [Proc. Natl. Acad. Sci. U.S.A. 93 (1996) 13429–13434] based on the ABC bias correction of Efron [J. Amer. Statist. Assoc. 82 (1987) 171–185]. Our argument is an extension of their asymptotic theory, where the geometry, such as the signed distance and the curvature of the boundary, plays an important role. We give another calculation of the corrected p-value without finding the “nearest point” on the boundary to the observation, which is required in the two-level bootstrap and is an implementational burden in complicated problems. The key idea is to alter the sample size of the replicated dataset from that of the observed dataset. The frequency of the replicates falling in the region is counted for several sample sizes, and then the p-value is calculated by looking at the change in the frequencies along the changing sample sizes. This is the multiscale bootstrap of Shimodaira [Systematic Biology 51 (2002) 492–508], which is third-order accurate for the multivariate normal model. Here we introduce a newly devised multistep-multiscale bootstrap, calculating a third-order accurate p-value for the exponential family of distributions. In fact, our p-value is asymptotically equivalent to those obtained by the double bootstrap of Hall [The Bootstrap and Edgeworth Expansion (1992) Springer, New York] and the modified signed likelihood ratio of Barndorff-Nielsen [Biometrika 73 (1986) 307–322] ignoring O(n^{−3/2}) terms, yet the computation is less demanding and free from model specification. The algorithm is remarkably simple despite complexity of the theory behind it. The differences of the p-values are illustrated in simple examples, and the accuracies of the bootstrap methods are shown in a systematic way.

347 citations
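
A rough sketch, from our reading of Shimodaira's earlier single-step multiscale bootstrap rather than the multistep version introduced here: bootstrap probabilities computed at several replicate sizes n' = n/σ² are mapped to z-scores, the scaling law z(σ²) ≈ dσ + c/σ is fitted, and the corrected p-value follows from the fitted signed distance d and curvature c. The numbers fed in below are purely hypothetical.

    import numpy as np
    from scipy.stats import norm

    def au_pvalue(bp, scales):
        """bp[k]: fraction of replicates of size n/scales[k] falling in the
        hypothesis region; scales[k] = sigma^2. Fit sigma*z = d*sigma^2 + c
        and return the approximately unbiased p-value 1 - Phi(d - c)."""
        bp = np.clip(np.asarray(bp, dtype=float), 1e-6, 1 - 1e-6)
        sigma = np.sqrt(np.asarray(scales, dtype=float))
        z = norm.ppf(1 - bp)
        d, c = np.polyfit(sigma**2, sigma * z, 1)   # slope d, intercept c
        return 1 - norm.cdf(d - c)

    # Hypothetical bootstrap probabilities at five scales, for illustration only.
    print(au_pvalue(bp=[0.25, 0.22, 0.20, 0.18, 0.16],
                    scales=[2.0, 1.5, 1.0, 0.75, 0.5]))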


Journal ArticleDOI
TL;DR: In this paper, the asymptotic properties of the maximum likelihood estimator in a possibly nonstationary process of this kind for which the hidden state space is compact but not necessarily finite are investigated.
Abstract: An autoregressive process with Markov regime is an autoregressive process for which the regression function at each time point is given by a nonobservable Markov chain. In this paper we consider the asymptotic properties of the maximum likelihood estimator in a possibly nonstationary process of this kind for which the hidden state space is compact but not necessarily finite. Consistency and asymptotic normality are shown to follow from uniform exponential forgetting of the initial distribution for the hidden Markov chain conditional on the observations.

257 citations


Journal ArticleDOI
TL;DR: In this article, the local Whittle estimator in the nonstationary case (d > 1/2) was investigated, and it was shown that when the process has a polynomial trend of order α > 1/2 the estimator is inconsistent and converges in probability to unity.
Abstract: Asymptotic properties of the local Whittle estimator in the nonstationary case (d > 1/2) are explored. For 1/2 < d ≤ 1, the estimator is shown to be consistent, and its limit distribution and the rate of convergence depend on the value of d. For d = 1, the limit distribution is mixed normal. For d > 1 and when the process has a polynomial trend of order α > 1/2, the estimator is shown to be inconsistent and to converge in probability to unity.
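
For reference, a sketch of the standard local Whittle estimator whose nonstationary behavior is analyzed above (our own code; the bandwidth m = n^0.65 is just a common illustrative choice, not the paper's).

    import numpy as np
    from scipy.optimize import minimize_scalar

    def local_whittle(x, m):
        """Minimize the local Whittle objective
        R(d) = log(mean(lambda_j^{2d} * I_j)) - 2d * mean(log lambda_j)
        over the first m Fourier frequencies."""
        n = len(x)
        lam = 2 * np.pi * np.arange(1, m + 1) / n
        I = np.abs(np.fft.fft(x)[1:m + 1])**2 / (2 * np.pi * n)   # periodogram
        def R(d):
            return np.log(np.mean(lam**(2 * d) * I)) - 2 * d * np.mean(np.log(lam))
        return minimize_scalar(R, bounds=(-0.49, 2.0), method="bounded").x

    rng = np.random.default_rng(0)
    x = np.cumsum(rng.standard_normal(4000))       # a unit-root series, d = 1
    print(local_whittle(x, m=int(len(x)**0.65)))   # estimate should be near 1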


Journal ArticleDOI
TL;DR: In this article, the authors consider mean squared errors (MSE) of empirical predictors under a general setup, where ML or REML estimators are used for the second stage.
Abstract: The term “empirical predictor” refers to a two-stage predictor of a linear combination of fixed and random effects. In the first stage, a predictor is obtained but it involves unknown parameters; thus, in the second stage, the unknown parameters are replaced by their estimators. In this paper, we consider mean squared errors (MSE) of empirical predictors under a general setup, where ML or REML estimators are used for the second stage. We obtain second-order approximation to the MSE as well as an estimator of the MSE correct to the same order. The general results are applied to mixed linear models to obtain a second-order approximation to the MSE of the empirical best linear unbiased predictor (EBLUP) of a linear mixed effect and an estimator of the MSE of EBLUP whose bias is correct to second order. The general mixed linear model includes the mixed ANOVA model and the longitudinal model as special cases.

Journal ArticleDOI
TL;DR: This work uses martingales to study Bayesian consistency and derives sufficient conditions for both Hellinger and Kullback-Leibler consistency which do not rely on the use of a sieve.
Abstract: We use martingales to study Bayesian consistency. We derive sufficient conditions for both Hellinger and Kullback-Leibler consistency, which do not rely on the use of a sieve. Alternative sufficient conditions for Hellinger consistency are also found and demonstrated on examples.

Journal ArticleDOI
TL;DR: In this article, a class of estimators for the parameters of a generalized autoregressive conditional heteroscedastic (GARCH) sequence was proposed, which are consistent and asymptotically normal under mild conditions.
Abstract: We propose a class of estimators for the parameters of a GARCH(p, q) sequence. We show that our estimators are consistent and asymptotically normal under mild conditions. The quasi-maximum likelihood and the likelihood estimators are discussed in detail. We show that the maximum likelihood estimator is optimal. If the tail of the distribution of the innovations is polynomial, even a quasi-maximum likelihood estimator based on exponential density performs better than the standard normal density-based quasi-likelihood estimator of Lee and Hansen and Lumsdaine.
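
A bare-bones sketch of the Gaussian quasi-maximum likelihood estimator for a GARCH(1,1), one member of the class of estimators analyzed above (our own illustration; starting values, bounds and the simulated series are arbitrary).

    import numpy as np
    from scipy.optimize import minimize

    def garch11_qmle(y):
        """Gaussian QMLE for sigma2_t = omega + alpha*y_{t-1}^2 + beta*sigma2_{t-1}."""
        y = np.asarray(y, dtype=float)
        def neg_qll(params):
            omega, alpha, beta = params
            s2 = np.empty_like(y)
            s2[0] = y.var()
            for t in range(1, len(y)):
                s2[t] = omega + alpha * y[t - 1]**2 + beta * s2[t - 1]
            return 0.5 * np.sum(np.log(s2) + y**2 / s2)
        res = minimize(neg_qll, x0=[0.05, 0.1, 0.8],
                       bounds=[(1e-6, None), (0.0, 1.0), (0.0, 0.999)])
        return res.x   # (omega, alpha, beta)

    # Simulate a GARCH(1,1) series and re-estimate its parameters.
    rng = np.random.default_rng(0)
    omega, alpha, beta = 0.1, 0.1, 0.8
    y, s2 = np.zeros(3000), np.zeros(3000)
    s2[0] = omega / (1 - alpha - beta)
    y[0] = np.sqrt(s2[0]) * rng.standard_normal()
    for t in range(1, 3000):
        s2[t] = omega + alpha * y[t - 1]**2 + beta * s2[t - 1]
        y[t] = np.sqrt(s2[t]) * rng.standard_normal()
    print(garch11_qmle(y))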

Journal ArticleDOI
TL;DR: In this paper, a new approach for estimating and forecasting the volatility of financial time series is proposed, based on the assumption that the volatility can be approximated by a constant over some interval.
Abstract: This paper offers a new approach for estimating and forecasting the volatility of financial time series. No assumption is made about the parametric form of the processes. On the contrary, we only suppose that the volatility can be approximated by a constant over some interval. In such a framework, the main problem consists of filtering this interval of time homogeneity; then the estimate of the volatility can be simply obtained by local averaging. We construct a locally adaptive volatility estimate (LAVE) which can perform this task and investigate it both from the theoretical point of view and through Monte Carlo simulations. Finally, the LAVE procedure is applied to a data set of nine exchange rates and a comparison with a standard GARCH model is also provided. Both models appear to be capable of explaining many of the features of the data; nevertheless, the new approach seems to be superior to the GARCH method as far as the out-of-sample results are concerned.


Journal ArticleDOI
TL;DR: In this article, the authors consider a class of semiparametric regression models which are one-parameter extensions of the Cox [J. Roy. Statist. Ser. B 34 (1972) 187-220] model for right-censored univariate failure times.
Abstract: We consider a class of semiparametric regression models which are one-parameter extensions of the Cox [J. Roy. Statist. Soc. Ser. B 34 (1972) 187–220] model for right-censored univariate failure times. These models assume that the hazard given the covariates and a random frailty unique to each individual has the proportional hazards form multiplied by the frailty. The frailty is assumed to have mean 1 within a known one-parameter family of distributions. Inference is based on a nonparametric likelihood. The behavior of the likelihood maximizer is studied under general conditions where the fitted model may be misspecified. The joint estimator of the regression and frailty parameters as well as the baseline hazard is shown to be uniformly consistent for the pseudo-value maximizing the asymptotic limit of the likelihood. Appropriately standardized, the estimator converges weakly to a Gaussian process. When the model is correctly specified, the procedure is semiparametric efficient, achieving the semiparametric information bound for all parameter components. It is also proved that the bootstrap gives valid inferences for all parameters, even under misspecification. We demonstrate analytically the importance of the robust inference in several examples. In a randomized clinical trial, a valid test of the treatment effect is possible when other prognostic factors and the frailty distribution are both misspecified. Under certain conditions on the covariates, the ratios of the regression parameters are still identifiable. The practical utility of the procedure is illustrated on a non-Hodgkin’s lymphoma dataset.

Journal ArticleDOI
TL;DR: In this paper, the authors developed tests of the hypothesis of no effect for selected predictors in regression, without assuming a model for the conditional distribution of the response given the predictors.
Abstract: We develop tests of the hypothesis of no effect for selected predictors in regression, without assuming a model for the conditional distribution of the response given the predictors. Predictor effects need not be limited to the mean function and smoothing is not required. The general approach is based on sufficient dimension reduction, the idea being to replace the predictor vector with a lower-dimensional version without loss of information on the regression. Methodology using sliced inverse regression is developed in detail.
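
A compact sketch of sliced inverse regression, the dimension-reduction step on which the tests above are built (generic SIR, not the authors' code; the number of slices and the toy data are arbitrary).

    import numpy as np

    def sir_directions(X, y, n_slices=10):
        """Estimate sufficient dimension-reduction directions by SIR:
        whiten X, average the whitened predictors within slices of y,
        and take leading eigenvectors of the between-slice covariance."""
        n, p = X.shape
        Xc = X - X.mean(axis=0)
        evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        W = evecs @ np.diag(evals**-0.5) @ evecs.T     # whitening matrix
        Z = Xc @ W
        M = np.zeros((p, p))
        for idx in np.array_split(np.argsort(y), n_slices):
            m = Z[idx].mean(axis=0)
            M += (len(idx) / n) * np.outer(m, m)
        vals, vecs = np.linalg.eigh(M)
        return W @ vecs[:, ::-1], vals[::-1]           # directions, eigenvalues

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 5))
    y = X[:, 0] + 0.5 * rng.standard_normal(500)       # one active direction
    dirs, vals = sir_directions(X, y)
    print(np.round(vals, 2))                           # one dominant eigenvalue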

Journal ArticleDOI
TL;DR: In this paper, a multiscale likelihood decomposition of an L_2 function has been proposed, where a given likelihood function has an alternative representation as a product of conditional densities reflecting information in both the data and the parameter vector localized in position and scale.
Abstract: We describe here a framework for a certain class of multiscale likelihood factorizations wherein, in analogy to a wavelet decomposition of an L_2 function, a given likelihood function has an alternative representation as a product of conditional densities reflecting information in both the data and the parameter vector localized in position and scale. The framework is developed as a set of sufficient conditions for the existence of such factorizations, formulated in analogy to those underlying a standard multiresolution analysis for wavelets, and hence can be viewed as a multiresolution analysis for likelihoods. We then consider the use of these factorizations in the task of nonparametric, complexity penalized likelihood estimation. We study the risk properties of certain thresholding and partitioning estimators, and demonstrate their adaptivity and near-optimality, in a minimax sense over a broad range of function spaces, based on squared Hellinger distance as a loss function. In particular, our results provide an illustration of how properties of classical wavelet-based estimators can be obtained in a single, unified framework that includes models for continuous, count and categorical data types.

Journal ArticleDOI
TL;DR: In this paper, the authors discuss two goodness-of-fit testing problems: the first problem pertains to fitting an error distribution to an assumed nonlinear parametric regression model.
Abstract: This paper discusses two goodness-of-fit testing problems. The first problem pertains to fitting an error distribution to an assumed nonlinear parametric regression model, while the second pertains to fitting a parametric regression model when the error distribution is unknown. For the first problem the paper contains tests based on a certain martingale type transform of residual empirical processes. The advantage of this transform is that the corresponding tests are asymptotically distribution free. For the second problem the proposed asymptotically distribution free tests are based on innovation martingale transforms. A Monte Carlo study shows that the simulated level of the proposed tests is close to the asymptotic level for moderate sample sizes.

Journal ArticleDOI
TL;DR: In this article, a polynomial form of indicator functions is used to characterize the geometric structure of factorial designs with quantitative factors, and a new aberration criteria is proposed and some minimum aberration designs are presented.
Abstract: Factorial designs have broad applications in agricultural, engineering and scientific studies. In constructing and studying properties of factorial designs, traditional design theory treats all factors as nominal. However, this is not appropriate for experiments that involve quantitative factors. For designs with quantitative factors, level permutation of one or more factors in a design matrix could result in different geometric structures, and, thus, different design properties. In this paper indicator functions are introduced to represent factorial designs. A polynomial form of indicator functions is used to characterize the geometric structure of those designs. Geometric isomorphism is defined for classifying designs with quantitative factors. Based on indicator functions, a new aberration criteria is proposed and some minimum aberration designs are presented.

Journal ArticleDOI
TL;DR: In this paper, a local linear kernel estimator of the regression function x → g(x) is proposed and investigated; under mild regularity assumptions, asymptotic normality of the estimators of g(x) and its derivatives is established.
Abstract: A local linear kernel estimator of the regression function x → g(x) := E[Y_i | X_i = x], x ∈ R^d, of a stationary (d + 1)-dimensional spatial process {(Y_i, X_i), i ∈ Z^N} observed over a rectangular domain of the form I_n := {i = (i_1, ..., i_N) ∈ Z^N | 1 ≤ i_k ≤ n_k, k = 1, ..., N}, n = (n_1, ..., n_N) ∈ Z^N, is proposed and investigated. Under mild regularity assumptions, asymptotic normality of the estimators of g(x) and its derivatives is established. Appropriate choices of the bandwidths are proposed. The spatial process is assumed to satisfy some very general mixing conditions, generalizing classical time-series strong mixing concepts. The size of the rectangular domain I_n is allowed to tend to infinity at different rates depending on the direction in Z^N.
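
A one-dimensional sketch of a local linear kernel estimator (the paper treats random fields indexed by Z^N; this is only the basic weighted least squares construction, with a Gaussian kernel and an arbitrary bandwidth of our choosing).

    import numpy as np

    def local_linear(x0, X, Y, h):
        """Local linear fit at x0: weighted least squares of Y on (1, X - x0);
        the intercept estimates g(x0), the slope estimates g'(x0)."""
        w = np.exp(-0.5 * ((X - x0) / h)**2)
        D = np.column_stack([np.ones_like(X), X - x0])
        beta = np.linalg.solve(D.T @ (D * w[:, None]), D.T @ (w * Y))
        return beta[0]

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, 500)
    Y = np.sin(X) + 0.2 * rng.standard_normal(500)
    print(local_linear(0.5, X, Y, h=0.3), np.sin(0.5))  # estimate vs truth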

Journal ArticleDOI
TL;DR: In this article, an extension of classical X 2 goodness-of-fit tests to Bayesian model assessment is described, which essentially involves evaluating Pearson's goodness of fit statistic at a parameter value drawn from its posterior distribution, has the important property that it is asymptotically distributed as a X 2 random variable on K - 1 degrees of freedom, independently of the dimension of the underlying parameter vector.
Abstract: This article describes an extension of classical X 2 goodness-of-fit tests to Bayesian model assessment. The extension, which essentially involves evaluating Pearson's goodness-of-fit statistic at a parameter value drawn from its posterior distribution, has the important property that it is asymptotically distributed as a X 2 random variable on K - 1 degrees of freedom, independently of the dimension of the underlying parameter vector. By examining the posterior distribution of this statistic, global goodness-of-fit diagnostics are obtained. Advantages of these diagnostics include ease of interpretation, computational convenience and favorable power properties. The proposed diagnostics can be used to assess the adequacy of a broad class of Bayesian models, essentially requiring only a finite-dimensional parameter vector and conditionally independent observations.
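
A schematic sketch of the diagnostic as described above: for each posterior draw, transform the observations through the model CDF, bin them into K equiprobable cells and compute Pearson's statistic; the resulting values should look approximately χ² with K − 1 degrees of freedom when the model fits. The cdf argument, toy data and posterior draws below are placeholders of our own.

    import numpy as np
    from scipy.stats import norm

    def bayesian_chi2(y, posterior_draws, cdf, K=10):
        """cdf(y, theta): model CDF evaluated at the data for parameter theta.
        Returns Pearson's statistic for each posterior draw, using K
        equiprobable bins on the probability-integral-transform scale."""
        edges = np.linspace(0.0, 1.0, K + 1)
        expected = len(y) / K
        stats = []
        for theta in posterior_draws:
            counts, _ = np.histogram(cdf(y, theta), bins=edges)
            stats.append(np.sum((counts - expected)**2 / expected))
        return np.array(stats)

    # Toy check: N(mu, 1) data with posterior draws of mu around the truth.
    rng = np.random.default_rng(0)
    y = rng.normal(1.0, 1.0, 200)
    draws = rng.normal(y.mean(), 1 / np.sqrt(len(y)), 100)
    stats = bayesian_chi2(y, draws, cdf=lambda y, mu: norm.cdf(y - mu))
    print(stats.mean())   # roughly K - 1 = 9 when the model is adequate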

Journal ArticleDOI
TL;DR: In this paper, Gibbs and block Gibbs samplers for a Bayesian hierarchical version of the one-way random effects model are considered and drift and minorization conditions are established for the underlying Markov chains.
Abstract: We consider Gibbs and block Gibbs samplers for a Bayesian hierarchical version of the one-way random effects model. Drift and minorization conditions are established for the underlying Markov chains. The drift and minorization are used in conjunction with results from J. S. Rosenthal [J. Amer. Statist. Assoc. 90 (1995) 558–566] and G. O. Roberts and R. L. Tweedie [Stochastic Process. Appl. 80 (1999) 211–229] to construct analytical upper bounds on the distance to stationarity. These lead to upper bounds on the amount of burn-in that is required to get the chain within a prespecified (total variation) distance of the stationary distribution. The results are illustrated with a numerical example.

Journal ArticleDOI
TL;DR: In this paper, the authors studied the problem of estimating the coefficients of a diffusion (X t, t ≥ 0), where the estimation is based on discrete data X n Δ, n = 0, 1,..., N. The sampling frequency is constant, and asymptotics are taken as the number N of observations tends to infinity.
Abstract: We study the problem of estimating the coefficients of a diffusion (X t , t ≥ 0); the estimation is based on discrete data X n Δ, n = 0, 1,..., N. The sampling frequency Δ -1 is constant, and asymptotics are taken as the number N of observations tends to infinity. We prove that the problem of estimating both the diffusion coefficient (the volatility) and the drift in a nonparametric setting is ill-posed: the minimax rates of convergence for Sobolev constraints and squared-error loss coincide with that of a, respectively, first- and second-order linear inverse problem. To ensure ergodicity and limit technical difficulties we restrict ourselves to scalar diffusions living on a compact interval with reflecting boundary conditions. Our approach is based on the spectral analysis of the associated Markov semigroup. A rate-optimal estimation of the coefficients is obtained via the nonparametric estimation of an eigenvalue-eigenfunction pair of the transition operator of the discrete time Markov chain (X nΔ , n = 0, 1,..., N) in a suitable Sobolev norm, together with an estimation of its invariant density.

Journal ArticleDOI
TL;DR: It turns out that the two alternatives, while adding stability in the presence of outliers of moderate size, do not possess a substantially better breakdown behavior than estimation based on Normal mixtures.
Abstract: ML-estimation based on mixtures of Normal distributions is a widely used tool for cluster analysis. However, a single outlier can make the parameter estimation of at least one of the mixture components break down. Among others, the estimation of mixtures of t-distributions by McLachlan and Peel [Finite Mixture Models (2000) Wiley, New York] and the addition of a further mixture component accounting for “noise” by Fraley and Raftery [The Computer J. 41 (1998) 578–588] were suggested as more robust alternatives. In this paper, the definition of an adequate robustness measure for cluster analysis is discussed and bounds for the breakdown points of the mentioned methods are given. It turns out that the two alternatives, while adding stability in the presence of outliers of moderate size, do not possess a substantially better breakdown behavior than estimation based on Normal mixtures. If the number of clusters s is treated as fixed, r additional points suffice for all three methods to let the parameters of r clusters explode. Only in the case of r = s is this not possible for t-mixtures. The ability to estimate the number of mixture components, for example, by use of the Bayesian information criterion of Schwarz [Ann. Statist. 6 (1978) 461–464], and to isolate gross outliers as clusters of one point, is crucial for an improved breakdown behavior of all three techniques. Furthermore, a mixture of Normals with an improper uniform distribution is proposed to achieve more robustness in the case of a fixed number of components.

Journal ArticleDOI
TL;DR: A variety of methods for choosing training samples to allow utilization of improper objective priors are developed and successfully applied in challenging situations.
Abstract: Central to several objective approaches to Bayesian model selection is the use of training samples (subsets of the data), so as to allow utilization of improper objective priors. The most common prescription for choosing training samples is to choose them to be as small as possible, subject to yielding proper posteriors; these are called minimal training samples. When data can vary widely in terms of either information content or impact on the improper priors, use of minimal training samples can be inadequate. Important examples include certain cases of discrete data, the presence of censored observations, and certain situations involving linear models and explanatory variables. Such situations require more sophisticated methods of choosing training samples. A variety of such methods are developed in this paper, and successfully applied in challenging situations.

Journal ArticleDOI
TL;DR: In this article, a nonparametric adaptation theory is developed for the construction of confidence intervals for linear functionals, and a between class modulus of continuity captures the expected length of adaptive confidence intervals.
Abstract: A nonparametric adaptation theory is developed for the construction of confidence intervals for linear functionals. A between class modulus of continuity captures the expected length of adaptive confidence intervals. Sharp lower bounds are given for the expected length and an ordered modulus of continuity is used to construct adaptive confidence procedures which are within a constant factor of the lower bounds. In addition, minimax theory over nonconvex parameter spaces is developed.

Journal ArticleDOI
TL;DR: This article proposes a new complexity-penalized model selection method based on data-dependent penalties, and considers the binary classification problem where, given a random observation X ∈ R^d, one has to predict Y ∈ {0,1}.
Abstract: In this article, model selection via penalized empirical loss minimization in nonparametric classification problems is studied. Data-dependent penalties are constructed, which are based on estimates of the complexity of a small subclass of each model class, containing only those functions with small empirical loss. The penalties are novel since those considered in the literature are typically based on the entire model class. Oracle inequalities using these penalties are established, and the advantage of the new penalties over those based on the complexity of the whole model class is demonstrated.
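
A schematic sketch in the spirit of the paper's data-dependent penalties: estimate the empirical Rademacher complexity of only those candidate classifiers whose empirical loss is close to the minimum, rather than of the whole model class (finite class, Monte Carlo over random signs; all names and the toy data here are our own, not the paper's construction).

    import numpy as np

    def empirical_rademacher(preds, n_draws=200, rng=None):
        """preds: array of shape (n_classifiers, n_samples) with values in {-1, +1}.
        Monte Carlo estimate of E_sigma[ sup_f (1/n) * sum_i sigma_i f(x_i) ]."""
        rng = rng or np.random.default_rng(0)
        n = preds.shape[1]
        sup_vals = [np.max(preds @ rng.choice([-1.0, 1.0], size=n)) / n
                    for _ in range(n_draws)]
        return float(np.mean(sup_vals))

    def localized_penalty(preds, losses, eps=0.05):
        """Complexity of the subclass with empirical loss within eps of the best."""
        keep = losses <= losses.min() + eps
        return empirical_rademacher(preds[keep])

    # Toy usage with random classifiers on random labels.
    rng = np.random.default_rng(1)
    preds = rng.choice([-1.0, 1.0], size=(50, 200))
    ylab = rng.choice([-1.0, 1.0], size=200)
    losses = np.mean(preds != ylab, axis=1)
    print(localized_penalty(preds, losses), empirical_rademacher(preds))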