Author

Charles J. Stone

Bio: Charles J. Stone is an academic researcher from University of California, Berkeley. The author has contributed to research in topics: Density estimation & Probability density function. The author has an h-index of 22, co-authored 34 publications receiving 8126 citations.

Papers
Journal ArticleDOI
TL;DR: In this article, consistency of a sequence of weight functions is defined and sufficient conditions for consistency are obtained; consistent sequences of probability weight functions defined in terms of nearest neighbors are constructed, and the results are applied to verify the consistency of the associated nonparametric estimators and the consistency in Bayes risk of the approximate Bayes rules.
Abstract: Let $(X, Y)$ be a pair of random variables such that $X$ is $\mathbb{R}^d$-valued and $Y$ is $\mathbb{R}^{d'}$-valued. Given a random sample $(X_1, Y_1), \cdots, (X_n, Y_n)$ from the distribution of $(X, Y)$, the conditional distribution $P^Y(\bullet \mid X)$ of $Y$ given $X$ can be estimated nonparametrically by $\hat{P}_n^Y(A \mid X) = \sum^n_1 W_{ni}(X)I_A(Y_i)$, where the weight function $W_n$ is of the form $W_{ni}(X) = W_{ni}(X, X_1, \cdots, X_n), 1 \leqq i \leqq n$. The weight function $W_n$ is called a probability weight function if it is nonnegative and $\sum^n_1 W_{ni}(X) = 1$. Associated with $\hat{P}_n^Y(\bullet \mid X)$ in a natural way are nonparametric estimators of conditional expectations, variances, covariances, standard deviations, correlations and quantiles and nonparametric approximate Bayes rules in prediction and multiple classification problems. Consistency of a sequence $\{W_n\}$ of weight functions is defined and sufficient conditions for consistency are obtained. When applied to sequences of probability weight functions, these conditions are both necessary and sufficient. Consistent sequences of probability weight functions defined in terms of nearest neighbors are constructed. The results are applied to verify the consistency of the estimators of the various quantities discussed above and the consistency in Bayes risk of the approximate Bayes rules.

1,754 citations
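
A minimal sketch of the uniform nearest-neighbor probability weights described above, using NumPy; the helper names (`knn_weights`, `cond_expectation`) and the toy data are illustrative assumptions, not the paper's notation. Each of the $k$ nearest sample points gets weight $1/k$, so the weights are nonnegative and sum to one, as the definition of a probability weight function requires:

```python
import numpy as np

def knn_weights(x, X, k):
    """Uniform k-nearest-neighbor probability weights W_ni(x):
    weight 1/k on each of the k sample points nearest to x, 0 elsewhere,
    so the weights are nonnegative and sum to 1."""
    dist = np.linalg.norm(X - x, axis=1)
    w = np.zeros(len(X))
    w[np.argsort(dist)[:k]] = 1.0 / k
    return w

def cond_expectation(x, X, Y, k):
    """Plug-in estimate of E[Y | X = x] as sum_i W_ni(x) * Y_i."""
    return knn_weights(x, X, k) @ Y

# Toy check: Y = sin(X) + noise, so E[Y | X = pi/2] is about 1.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0 * np.pi, size=(500, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
print(cond_expectation(np.array([np.pi / 2]), X, Y, k=25))
```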

Journal ArticleDOI
TL;DR: In this article, it was shown that the optimal rate of convergence for an estimator $\hat{T}_n$ of a derivative of order $m$ of a $p$-times differentiable regression function, based on a training sample of size $n$, is $n^{-r}$ with $r = (p - m)/(2p + d)$ under the $L^q$ norm for $0 < q < \infty$, while $(n^{-1} \log n)^r$ is the optimal rate if $q = \infty$, under appropriate regularity conditions.
Abstract: Consider a $p$-times differentiable unknown regression function $\theta$ of a $d$-dimensional measurement variable. Let $T(\theta)$ denote a derivative of $\theta$ of order $m$ and set $r = (p - m)/(2p + d)$. Let $\hat{T}_n$ denote an estimator of $T(\theta)$ based on a training sample of size $n$, and let $\| \hat{T}_n - T(\theta)\|_q$ be the usual $L^q$ norm of the restriction of $\hat{T}_n - T(\theta)$ to a fixed compact set. Under appropriate regularity conditions, it is shown that the optimal rate of convergence for $\| \hat{T}_n - T(\theta)\|_q$ is $n^{-r}$ if $0 < q < \infty$; while $(n^{-1} \log n)^r$ is the optimal rate if $q = \infty$.

1,513 citations
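
As a worked instance of the rate formula (with illustrative values, not taken from the paper): estimating $\theta$ itself ($m = 0$) when $\theta$ has $p = 2$ derivatives gives

```latex
\[
  r = \frac{p - m}{2p + d} = \frac{2}{4 + d}, \qquad
  \|\hat{T}_n - T(\theta)\|_q \asymp
  \begin{cases}
    n^{-2/(4+d)}, & 0 < q < \infty, \\
    (n^{-1}\log n)^{2/(4+d)}, & q = \infty,
  \end{cases}
\]
```

so in one dimension the optimal rate is $n^{-2/5}$, and it slows as the dimension $d$ grows.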

Journal ArticleDOI
TL;DR: In this article, a variety of parametric and nonparametric models for a function $f$ of the joint distribution of a pair of random variables are discussed in relation to flexibility, dimensionality, and interpretability.
Abstract: Let $(X, Y)$ be a pair of random variables such that $X = (X_1, \cdots, X_J)$ and let $f$ be a function that depends on the joint distribution of $(X, Y).$ A variety of parametric and nonparametric models for $f$ are discussed in relation to flexibility, dimensionality, and interpretability. It is then supposed that each $X_j \in \lbrack 0, 1\rbrack,$ that $Y$ is real valued with mean $\mu$ and finite variance, and that $f$ is the regression function of $Y$ on $X.$ Let $f^\ast,$ of the form $f^\ast(x_1, \cdots, x_J) = \mu + f^\ast_1(x_1) + \cdots + f^\ast_J(x_J),$ be chosen subject to the constraints $Ef^\ast_j = 0$ for $1 \leq j \leq J$ to minimize $E\lbrack(f(X) - f^\ast(X))^2\rbrack.$ Then $f^\ast$ is the closest additive approximation to $f,$ and $f^\ast = f$ if $f$ itself is additive. Spline estimates of $f^\ast_j$ and its derivatives are considered based on a random sample from the distribution of $(X, Y).$ Under a common smoothness assumption on $f^\ast_j, 1 \leq j \leq J,$ and some mild auxiliary assumptions, these estimates achieve the same (optimal) rate of convergence for general $J$ as they do for $J = 1.$

1,239 citations
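
A minimal backfitting sketch of the additive fit $\mu + f^\ast_1(x_1) + \cdots + f^\ast_J(x_J)$, with a crude running-mean smoother standing in for the paper's spline estimates; the helper names and the smoother choice are illustrative assumptions, not the paper's method:

```python
import numpy as np

def smooth(x, r, span=0.2):
    """Running-mean smoother standing in for a univariate spline fit."""
    n = len(x)
    k = max(1, int(span * n))
    order = np.argsort(x)
    fitted = np.empty(n)
    r_sorted = r[order]
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - k), min(n, rank + k + 1)
        fitted[i] = r_sorted[lo:hi].mean()
    return fitted

def backfit(X, y, iters=20):
    """Fit y ~ mu + f_1(x_1) + ... + f_J(x_J) by cycling over the
    coordinates, smoothing the partial residuals, and re-centering each
    component so its sample mean is 0 (the constraint E f*_j = 0)."""
    n, J = X.shape
    mu = y.mean()
    f = np.zeros((n, J))
    for _ in range(iters):
        for j in range(J):
            partial = y - mu - f.sum(axis=1) + f[:, j]
            f[:, j] = smooth(X[:, j], partial)
            f[:, j] -= f[:, j].mean()
    return mu, f

# Usage: mu, f = backfit(X, y) with X an (n, J) design on [0, 1]^J.
```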

Journal ArticleDOI
TL;DR: In this paper, it was shown that $r = (p - m)/(2p + d)$ is the optimal (uniform) rate of convergence for estimators of the value at a point of an order-$m$ linear differential operator applied to a regression function, with an analogous result for nonparametric estimators of a density function.
Abstract: Let $d$ denote a positive integer, $\|x\| = (x^2_1 + \cdots + x^2_d)^{1/2}$ the Euclidean norm of $x = (x_1, \cdots, x_d) \in \mathbb{R}^d, k$ a nonnegative integer, $\mathscr{C}_k$ the collection of $k$ times continuously differentiable functions on $\mathbb{R}^d$, and $g_k$ the Taylor polynomial of degree $k$ about the origin corresponding to $g \in \mathscr{C}_k$. Let $M$ and $p > k$ denote positive constants and let $U$ be an open neighborhood of the origin of $\mathbb{R}^d$. Let $\mathscr{G}$ denote the collection of functions $g \in \mathscr{C}_k$ such that $|g(x) - g_k(x)| \leq M \|x\|^p$ for $x\in U$. Let $m \leq k$ be a nonnegative integer, let $\theta_0\in\mathscr{C}_m$ and set $\Theta = \{\theta_0 + g:g \in \mathscr{G}\}$. Let $L$ be a linear differential operator of order $m$ on $\mathscr{C}_m$ and set $T(\theta) = L\theta(0)$ for $\theta \in \Theta$. Let $(X, Y)$ be a pair of random variables such that $X$ is $\mathbb{R}^d$ valued and $Y$ is real valued. It is assumed that the distribution of $X$ is absolutely continuous and that its density is bounded away from zero and infinity on $U$. The conditional distribution of $Y$ given $X$ is assumed to be (say) normal, with a conditional variance which is bounded away from zero and infinity on $U$. The regression function of $Y$ on $X$ is assumed to belong to $\Theta$. It is shown that $r = (p - m)/(2p + d)$ is the optimal (uniform) rate of convergence for a sequence $\{\hat{T}_n\}$ of estimators of $T(\theta)$ such that $\hat{T}_n$ is based on a random sample of size $n$ from the distribution of $(X, Y)$. An analogous result is obtained for nonparametric estimators of a density function.

837 citations
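
To make the rate concrete (illustrative values, not from the paper): with $p = 2$ and $d = 1$, estimating $\theta(0)$ (so $m = 0$) versus $\theta'(0)$ (so $m = 1$) gives

```latex
\[
  r_{m=0} = \frac{2 - 0}{4 + 1} = \frac{2}{5}, \qquad
  r_{m=1} = \frac{2 - 1}{4 + 1} = \frac{1}{5},
\]
```

so each derivative taken in the functional $L\theta(0)$ costs a full unit of $p - m$ in the attainable rate.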

Journal ArticleDOI
01 Sep 1973

421 citations


Cited by
Journal ArticleDOI
TL;DR: This tutorial gives an overview of the basic ideas underlying Support Vector (SV) machines for function estimation, and includes a summary of currently used algorithms for training SV machines, covering both the quadratic programming part and advanced methods for dealing with large datasets.
Abstract: In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of currently used algorithms for training SV machines, covering both the quadratic (or convex) programming part and advanced methods for dealing with large datasets. Finally, we mention some modifications and extensions that have been applied to the standard SV algorithm, and discuss the aspect of regularization from a SV perspective.

10,696 citations
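
A small usage sketch of $\varepsilon$-insensitive SV regression as surveyed in the tutorial, via scikit-learn's `SVR`; the toy data and parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVR

# Fit sin(x) from noisy samples with an RBF kernel; epsilon sets the
# width of the insensitive tube and C the regularization trade-off.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 2.0 * np.pi, 200))[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

model = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma=1.0).fit(X, y)
print(model.predict([[np.pi / 2]]))   # close to 1.0
print(len(model.support_))            # points retained as support vectors
```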

Journal ArticleDOI
William S. Cleveland1
TL;DR: Robust locally weighted regression, as discussed by the author, is a method for smoothing a scatterplot in which the fitted value at $x_k$ is the value of a polynomial fit to the data using weighted least squares, where the weight for $(x_i, y_i)$ is large if $x_i$ is close to $x_k$ and small if it is not.
Abstract: The visual information on a scatterplot can be greatly enhanced, with little additional cost, by computing and plotting smoothed points. Robust locally weighted regression is a method for smoothing a scatterplot, $(x_i, y_i), i = 1, \cdots, n$, in which the fitted value at $x_k$ is the value of a polynomial fit to the data using weighted least squares, where the weight for $(x_i, y_i)$ is large if $x_i$ is close to $x_k$ and small if it is not. A robust fitting procedure is used that guards against deviant points distorting the smoothed points. Visual, computational, and statistical issues of robust locally weighted regression are discussed. Several examples, including data on lead intoxication, are used to illustrate the methodology.

10,225 citations
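
A minimal usage sketch via `statsmodels` (the data and parameter values are illustrative): `frac` sets the fraction of points in each local neighborhood, and the `it` robustness iterations downweight deviant points so outliers do not distort the smooth, as the abstract describes:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 300))
y = np.sin(x) + 0.2 * rng.standard_normal(300)
y[::50] += 3.0  # a few deviant points the robust fit should resist

# Returns an (n, 2) array: sorted x values and the smoothed fit.
smoothed = lowess(y, x, frac=0.25, it=3)
print(smoothed[:5])
```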

Journal ArticleDOI

9,941 citations

Journal ArticleDOI
TL;DR: The authors prove two results about this type of estimator that are unprecedented in several ways: with high probability $\hat{f}^\ast_n$ is at least as smooth as $f$, in any of a wide variety of smoothness measures.
Abstract: Donoho and Johnstone (1994) proposed a method for reconstructing an unknown function $f$ on $\lbrack 0, 1\rbrack$ from noisy data $d_i = f(t_i) + \sigma z_i$, $i = 0, \cdots, n - 1$, $t_i = i/n$, where the $z_i$ are independent and identically distributed standard Gaussian random variables. The reconstruction $\hat{f}^\ast_n$ is defined in the wavelet domain by translating all the empirical wavelet coefficients of $d$ toward 0 by an amount $\sigma \sqrt{2 \log(n)/n}$. The authors prove two results about this type of estimator. [Smooth]: with high probability $\hat{f}^\ast_n$ is at least as smooth as $f$, in any of a wide variety of smoothness measures. [Adapt]: the estimator comes nearly as close in mean square to $f$ as any measurable estimator can come, uniformly over balls in each of two broad scales of smoothness classes. These two properties are unprecedented in several ways. The present proof of these results develops new facts about abstract statistical inference and its connection with an optimal recovery model.

9,359 citations
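
A denoising sketch in the spirit of the wavelet-shrinkage estimator above, using PyWavelets soft thresholding. One assumption to flag: the threshold $\sigma\sqrt{2\log n}$ below is the usual scaling for unnormalized orthonormal-DWT coefficients, which corresponds to the abstract's $\sigma\sqrt{2\log(n)/n}$ rule for coefficients normalized by $1/\sqrt{n}$:

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
n = 1024
t = np.arange(n) / n
f = np.sin(4 * np.pi * t) * (t > 0.3)     # piecewise-smooth test signal
sigma = 0.2
d = f + sigma * rng.standard_normal(n)    # noisy data d_i = f(t_i) + sigma*z_i

# Shrink every empirical wavelet coefficient toward 0 (soft threshold).
coeffs = pywt.wavedec(d, "db4", mode="periodization")
thresh = sigma * np.sqrt(2.0 * np.log(n))
shrunk = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
f_hat = pywt.waverec(shrunk, "db4", mode="periodization")

print(np.mean((f_hat - f) ** 2) < np.mean((d - f) ** 2))  # True: noise reduced
```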

Journal ArticleDOI
TL;DR: In this article, penalized likelihood approaches are proposed to handle variable selection problems, and it is shown that the newly proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known.
Abstract: Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of ...

8,314 citations
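
This abstract matches the nonconcave penalized likelihood literature, and one well-known penalty with exactly the listed properties (symmetric, singular at the origin, bounded by a constant for large arguments) is the smoothly clipped absolute deviation (SCAD) penalty; assuming that identification, here is a sketch of its closed form:

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lam(theta): L1-like (lam * |theta|) near 0, which
    produces sparsity; a quadratic transition on (lam, a*lam]; constant
    lam^2 * (a + 1) / 2 beyond a*lam, which limits bias on large effects."""
    t = np.abs(theta)
    linear = lam * t
    quad = (2.0 * a * lam * t - t**2 - lam**2) / (2.0 * (a - 1.0))
    const = lam**2 * (a + 1.0) / 2.0
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))

# The penalty grows, then flattens: large coefficients are barely shrunk.
print(scad_penalty(np.array([0.0, 0.5, 1.0, 2.0, 10.0]), lam=1.0))
```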