
Showing papers in "Annals of Statistics in 1983"


Journal ArticleDOI
TL;DR: The EM algorithm is shown to converge to a local maximum or a stationary value of the (incomplete-data) likelihood function under conditions that are applicable to many practical situations.
Abstract: Two convergence aspects of the EM algorithm are studied: (i) does the EM algorithm find a local maximum or a stationary value of the (incomplete-data) likelihood function? (ii) does the sequence of parameter estimates generated by EM converge? Several convergence results are obtained under conditions that are applicable to many practical situations. Two useful special cases are: (a) if the unobserved complete-data specification can be described by a curved exponential family with compact parameter space, all the limit points of any EM sequence are stationary points of the likelihood function; (b) if the likelihood function is unimodal and a certain differentiability condition is satisfied, then any EM sequence converges to the unique maximum likelihood estimate. A list of key properties of the algorithm is included.
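
To make the object of study concrete, here is a minimal sketch (not from the paper) of an EM iteration for a two-component Gaussian mixture; the incomplete-data log-likelihood monitored in the loop never decreases from one iteration to the next, which is the behaviour whose limit points the paper characterises. The initialisation and stopping tolerance are illustrative.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_two_gaussians(x, n_iter=500, tol=1e-9):
    """EM for a two-component Gaussian mixture (illustrative sketch)."""
    w, mu1, mu2 = 0.5, x.min(), x.max()
    s1 = s2 = x.std()
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: component densities and responsibilities
        p1 = w * normal_pdf(x, mu1, s1)
        p2 = (1 - w) * normal_pdf(x, mu2, s2)
        # incomplete-data log-likelihood; it is non-decreasing across iterations
        ll = np.sum(np.log(p1 + p2))
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        r = p1 / (p1 + p2)
        # M-step: weighted sample moments
        w = r.mean()
        mu1, mu2 = np.sum(r * x) / r.sum(), np.sum((1 - r) * x) / (1 - r).sum()
        s1 = np.sqrt(np.sum(r * (x - mu1) ** 2) / r.sum())
        s2 = np.sqrt(np.sum((1 - r) * (x - mu2) ** 2) / (1 - r).sum())
    return w, (mu1, s1), (mu2, s2)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 1.0, 200)])
print(em_two_gaussians(x))
```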

3,414 citations


Journal ArticleDOI
TL;DR: Estimation is based on the minimum description length (MDL) principle of minimizing the total number of binary digits required to rewrite the observed data, when each observation is given with some precision.
Abstract: The minimum description length criterion, defined in terms of the number of bits required to write down the observed data, has been reformulated to extend the classical maximum likelihood principle. The principle permits estimation of the number of the parameters in statistical models in addition to their values and even of the way the parameters appear in the models; i.e., of the model structures. The principle rests on a new way to interpret and construct a universal prior distribution for the integers, which makes sense even when the parameter is an individual object. Truncated real-valued parameters are converted to integers by dividing them by their precision, and their prior is determined from the universal prior for the integers by optimizing the precision. 1. Introduction. In this paper we study estimation based upon the principle of minimizing the total number of binary digits required to rewrite the observed data, when each observation is given with some precision. Instead of attempting an absolutely shortest description, which would be futile, we look for the optimum relative to a class of parametrically given distributions. This Minimum Description Length (MDL) principle, which we introduced in a less comprehensive form in [25], turns out to degenerate to the more familiar Maximum Likelihood (ML) principle in case the number of parameters in the models is fixed, so that the description length of the parameters themselves can be ignored. In another extreme case, where the parameters determine the data, it similarly degenerates to Jaynes's principle of maximum entropy [14]. But the main power of the new criterion is that it permits estimates of the entire model: its parameters, their number, and even the way the parameters appear in the model; i.e., the model structure. Hence, there will be no need to supplement the estimated parameters with a separate hypothesis test to decide whether a model is adequately parameterized or, perhaps, over-parameterized.
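
As a hedged illustration of the two-part coding idea, the sketch below selects a polynomial regression order by minimising a total description length. It uses the familiar asymptotic cost of roughly half a bit times $\log_2 n$ per truncated real parameter, not the universal prior for the integers constructed in the paper, and the data-generating example is assumed.

```python
import numpy as np

def mdl_poly_degree(x, y, max_degree=8):
    """Pick a polynomial degree by a two-part description length.

    Sketch only: about (n/2)*log2(RSS/n) bits for the residuals plus
    (k/2)*log2(n) bits for k truncated real parameters (the familiar
    asymptotic form, not the paper's universal-prior construction).
    """
    n = len(y)
    best = None
    for deg in range(max_degree + 1):
        coef = np.polyfit(x, y, deg)
        rss = np.sum((y - np.polyval(coef, x)) ** 2)
        k = deg + 1
        bits = 0.5 * n * np.log2(rss / n) + 0.5 * k * np.log2(n)
        if best is None or bits < best[1]:
            best = (deg, bits)
    return best

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(0.0, 0.1, x.size)
print(mdl_poly_degree(x, y))   # degree with the shortest total description
```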

1,762 citations


Journal ArticleDOI
TL;DR: Random variables $X_1, \cdots, X_k$ are defined to be negatively associated (NA) if for every pair of disjoint subsets $A_1, A_2$ of $\{1, 2, \cdots, k\}$, $\operatorname{Cov}\lbrack f(X_i, i \in A_1), g(X_j, j \in A_2) \rbrack \leq 0$ for all nondecreasing functions $f, g$.
Abstract: Random variables $X_1, \cdots, X_k$ are said to be negatively associated (NA) if for every pair of disjoint subsets $A_1, A_2$ of $\{1, 2, \cdots, k\}, \operatorname{Cov}\lbrack f(X_i, i \in A_1), g(X_j, j \in A_2) \rbrack \leq 0$, for all nondecreasing functions $f, g$. The basic properties of negative association are derived. Especially useful is the property that nondecreasing functions of mutually exclusive subsets of NA random variables are NA. This property is shown not to hold for several other types of negative dependence recently proposed. One consequence is the inequality $P(X_i \leq x_i, i = 1, \cdots, k) \leq \prod^k_1P(X_i \leq x_i)$ for NA random variables $X_1, \cdots, X_k$, and the dual inequality resulting from reversing the inequalities inside the square brackets. In another application it is shown that negatively correlated normal random variables are NA. Other NA distributions are the (a) multinomial, (b) convolution of unlike multinomials, (c) multivariate hypergeometric, (d) Dirichlet, and (e) Dirichlet compound multinomial. Negative association is shown to arise in situations where the probability measure is permutation invariant. Applications of this are considered for sampling without replacement as well as for certain multiple ranking and selection procedures. In a somewhat striking example, NA and positive association, representing quite strong opposing types of dependence, are shown to exist side by side in models of categorical data analysis.
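
A quick Monte Carlo illustration (not from the paper) of the displayed product inequality for one of the NA families listed above, the multinomial; the sample size, cell probabilities, and thresholds are arbitrary choices.

```python
import numpy as np

# Check (up to Monte Carlo error) the NA product inequality
# P(X_1 <= x_1, ..., X_k <= x_k) <= prod_i P(X_i <= x_i)
# for multinomial counts, one of the NA families listed in the abstract.
rng = np.random.default_rng(0)
counts = rng.multinomial(20, [0.2, 0.3, 0.5], size=200_000)
thresholds = np.array([4, 6, 10])

joint = np.mean(np.all(counts <= thresholds, axis=1))
product = np.prod([np.mean(counts[:, i] <= thresholds[i]) for i in range(3)])
# the joint probability should not exceed the product (up to simulation error)
print(joint, product, joint <= product)
```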

1,410 citations


Journal ArticleDOI
TL;DR: In this paper, the connection between quasi-likelihood functions, exponential family models and nonlinear weighted least squares is examined and consistency and asymptotic normality of the parameter estimates are discussed under second moment assumptions.
Abstract: The connection between quasi-likelihood functions, exponential family models and nonlinear weighted least squares is examined. Consistency and asymptotic normality of the parameter estimates are discussed under second moment assumptions. The parameter estimates are shown to satisfy a property of asymptotic optimality similar in spirit to, but more general than, the corresponding optimal property of Gauss-Markov estimators.
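
A minimal sketch (an assumed example, not the paper's notation) of the kind of estimator being analysed: the quasi-likelihood estimating equation for a log link with variance function $V(\mu) = \mu$, solved by Fisher scoring, i.e. nonlinear weighted least squares, using only second-moment assumptions on the response.

```python
import numpy as np

def quasi_poisson_irls(X, y, n_iter=50, tol=1e-10):
    """Solve the quasi-likelihood estimating equations
         sum_i x_i (y_i - mu_i) = 0,   mu_i = exp(x_i' beta),
    for a log link with variance function V(mu) = mu, by iteratively
    reweighted least squares.  Illustrative sketch: only second-moment
    assumptions on y are used, not a full likelihood.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)             # quasi-score
        info = X.T @ (mu[:, None] * X)     # Fisher-scoring weight matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(0.5 + 0.8 * X[:, 1]))
print(quasi_poisson_irls(X, y))   # roughly [0.5, 0.8]
```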

763 citations


Journal ArticleDOI
TL;DR: The existence, support size, likelihood equations, and uniqueness of the maximum likelihood estimator of a mixing distribution are shown to be directly related to the properties of the convex hull of the likelihood set and the support hyperplanes of that hull.
Abstract: In this paper certain fundamental properties of the maximum likelihood estimator of a mixing distribution are shown to be geometric properties of the likelihood set. The existence, support size, likelihood equations, and uniqueness of the estimator are revealed to be directly related to the properties of the convex hull of the likelihood set and the support hyperplanes of that hull. It is shown using geometric techniques that the estimator exists under quite general conditions, with a support size no larger than the number of distinct observations. Analysis of the convex dual of the likelihood set leads to a dual maximization problem. A convergent algorithm is described. The defining equations for the estimator are compared with the usual parametric likelihood equations for finite mixtures. Sufficient conditions for uniqueness are given. Part II will deal with a special theory for exponential family mixtures.
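
For intuition, here is a small sketch (an assumed grid approximation with a normal location kernel, not the geometric algorithm of the paper) that computes mixing weights by a self-consistency iteration and evaluates the gradient function whose behaviour characterises the nonparametric MLE: at a maximiser the gradient function is at most 1 everywhere, with equality on the support.

```python
import numpy as np

def npmle_mixing_weights(x, grid, n_iter=500):
    """EM-type fixed-point iteration for the weights of a mixing
    distribution supported on a fixed grid (normal location kernel,
    unit variance).  Illustrative sketch; the paper characterises the
    exact NPMLE geometrically rather than via a grid approximation."""
    # likelihood matrix L[i, j] = density of x_i under component grid[j]
    L = np.exp(-0.5 * (x[:, None] - grid[None, :]) ** 2) / np.sqrt(2 * np.pi)
    w = np.full(grid.size, 1.0 / grid.size)
    for _ in range(n_iter):
        mix = L @ w                                  # mixture density at each x_i
        w = w * (L / mix[:, None]).mean(axis=0)      # self-consistency update
    # gradient function: <= 1 (up to numerics) everywhere at the NPMLE
    grad = (L / (L @ w)[:, None]).mean(axis=0)
    return w, grad

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(2.0, 1.0, 150)])
grid = np.linspace(-4.0, 4.0, 81)
w, grad = npmle_mixing_weights(x, grid)
print(np.flatnonzero(w > 1e-3), grad.max())   # few support points; max gradient approximately 1
```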

674 citations


Journal ArticleDOI
TL;DR: The kernel method is generalized to estimate counting process intensities by using kernel functions to smooth the nonparametric Nelson estimator of the cumulative intensity; uniform consistency and asymptotic normality of the resulting estimator are proved.
Abstract: The kernel function method developed during the last twenty-five years to estimate a probability density function essentially is a way of smoothing the empirical distribution function. This paper shows how one can generalize this method to estimate counting process intensities using kernel functions to smooth the nonparametric Nelson estimator for the cumulative intensity. The properties of the estimator for the intensity itself are investigated, and uniform consistency and asymptotic normality are proved. We also give an illustrative numerical example.
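
A minimal sketch of the construction for right-censored survival data: the Nelson estimator's increments at event times are smoothed with an Epanechnikov kernel. Ties and boundary effects near $t = 0$ are ignored, and all names and parameter values are illustrative.

```python
import numpy as np

def smoothed_intensity(times, events, t_grid, b=1.0):
    """Kernel-smoothed Nelson estimator of the intensity (hazard).

    times  : observed, possibly censored, times
    events : 1 for an event, 0 for censoring
    b      : bandwidth of the Epanechnikov kernel
    Sketch only: no tie handling, no boundary correction near t = 0.
    """
    order = np.argsort(times)
    t, d = times[order], events[order]
    n = len(t)
    at_risk = n - np.arange(n)          # number at risk just before each time
    increments = d / at_risk            # Nelson estimator jump sizes
    u = (t_grid[:, None] - t[None, :]) / b
    kernel = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    return (kernel * increments[None, :]).sum(axis=1) / b

rng = np.random.default_rng(0)
life = rng.exponential(2.0, 300)
cens = rng.exponential(3.0, 300)
obs, ev = np.minimum(life, cens), (life <= cens).astype(float)
grid = np.linspace(0.5, 4.0, 8)
print(smoothed_intensity(obs, ev, grid, b=0.8))   # roughly near the true hazard 0.5, noisier for large t
```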

502 citations


Journal ArticleDOI
TL;DR: In this article, lower bounds for estimation of the parameters of models with both parametric and nonparametric components are given in the form of representation theorems (for regular estimates) and asymptotic minimax bounds.
Abstract: Asymptotic lower bounds for estimation of the parameters of models with both parametric and nonparametric components are given in the form of representation theorems (for regular estimates) and asymptotic minimax bounds. The methods used involve: (i) the notion of a "Hellinger-differentiable (root-) density", where part of the differentiation is with respect to the nonparametric part of the model, to obtain appropriate scores; and (ii) calculation of the "effective score" for the real or vector (finite-dimensional) parameter of interest as that component of the score function orthogonal to all nuisance parameter "scores" (perhaps infinite-dimensional). The resulting asymptotic information for estimation of the parametric component of the model is just (4 times) the squared $L^2$-norm of the "effective score". A corollary of these results is a simple necessary condition for "adaptive estimation": adaptation is possible only if the scores for the parameter of interest are orthogonal to the scores for the nuisance function or nonparametric part of the model. Examples considered include the one-sample location model with and without symmetry, mixture models, the two-sample shift model, and Cox's proportional hazards model.
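
In symbols (restating the abstract, with $\dot\ell_\theta$ the score for the finite-dimensional parameter, $\mathcal{N}$ the closed linear span of the nuisance scores, and $\Pi(\cdot \mid \mathcal{N})$ the $L^2$ projection onto it; the factor 4 reflects the use of root-densities):

```latex
\[
  \ell^{*}_{\theta} \;=\; \dot\ell_{\theta} - \Pi\bigl(\dot\ell_{\theta} \,\big|\, \mathcal{N}\bigr),
  \qquad
  I^{*}(\theta) \;=\; 4\,\bigl\| \ell^{*}_{\theta} \bigr\|_{L^{2}}^{2}.
\]
% Adaptive estimation requires the projection term to vanish, i.e. the
% scores for theta must be orthogonal to all nuisance scores.
```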

406 citations


Journal ArticleDOI
TL;DR: Weak convergence results are proved for the product-limit estimator on the whole line, with applications to confidence band construction, estimation of mean lifetime, and the theory of $q$-functions.
Abstract: Weak convergence results are proved for the product-limit estimator on the whole line. Applications are given to confidence band construction, estimation of mean lifetime, and to the theory of $q$-functions. The results are obtained using stochastic calculus and in probability linear bounds for empirical processes.
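
For reference, a minimal sketch of the product-limit (Kaplan-Meier) estimator to which the weak convergence results apply; confidence band construction is not reproduced, and ties are handled only by sorting.

```python
import numpy as np

def product_limit(times, events):
    """Product-limit (Kaplan-Meier) estimator of the survival function.

    Returns the event times and the estimated survival just after each of
    them.  Illustrative sketch: untied data assumed, no confidence bands.
    """
    order = np.argsort(times)
    t, d = times[order], events[order]
    n = len(t)
    at_risk = n - np.arange(n)
    factors = np.where(d == 1, 1.0 - 1.0 / at_risk, 1.0)
    surv = np.cumprod(factors)
    return t[d == 1], surv[d == 1]

rng = np.random.default_rng(0)
life = rng.exponential(2.0, 200)
cens = rng.exponential(3.0, 200)
t, s = product_limit(np.minimum(life, cens), (life <= cens).astype(int))
print(t[:5], s[:5])
```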

387 citations


Journal ArticleDOI
TL;DR: The theoretical properties of fitting an infinite-variance stable law to outlier-prone data are derived, a more robust procedure is developed, and a statistic for describing and comparing the tail-shapes of arbitrary samples is proposed.
Abstract: Stable laws are often fit to outlier-prone data and, if the index $\alpha$ is estimated to be much less than two, then the normal law is rejected in favor of an infinite-variance stable law. This paper derives the theoretical properties of such a procedure. When the true distribution is stable, the distribution of the m.l.e. of $\alpha$ is non-regular if $\alpha = 2$. When the true distribution is not stable, the estimate of $\alpha$ is not a robust measure of the rate of decrease of the tail probabilities. A more robust procedure is developed, and a statistic for describing and comparing the tail-shapes of arbitrary samples is proposed.

378 citations


Journal ArticleDOI
TL;DR: The normal, Poisson, gamma, binomial, negative binomial and NEFGHS distributions are the six univariate natural exponential families with quadratic variance functions (QVF) as mentioned in this paper.
Abstract: The normal, Poisson, gamma, binomial, negative binomial, and NEFGHS distributions are the six univariate natural exponential families (NEF) with quadratic variance functions (QVF). This sequel to Morris (1982) treats certain statistical topics that can be handled within this unified NEF-QVF formulation, including unbiased estimation, Bhattacharyya and Cramer-Rao lower bounds, conditional distributions and moments, quadratic regression, conjugate prior distributions, moments of conjugate priors and posterior distributions, empirical Bayes and $G_2$ minimax, marginal distributions and their moments, parametric empirical Bayes, and characterizations.

257 citations


Journal ArticleDOI
TL;DR: This article shows that least squares cross-validation for density estimation is asymptotically optimal, rather than simply consistent, under tail conditions only slightly more severe than the hypothesis of finite variance.
Abstract: We prove that the method of cross-validation suggested by A. W. Bowman and M. Rudemo achieves its goal of minimising integrated square error, in an asymptotic sense. The tail conditions we impose are only slightly more severe than the hypothesis of finite variance, and so least squares cross-validation does not exhibit the pathological behaviour which has been observed for Kullback-Leibler cross-validation. This is apparently the first time that a cross-validatory procedure for density estimation has been shown to be asymptotically optimal, rather than simply consistent.
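
The criterion being minimised can be written down in a few lines. Here is a hedged sketch for a Gaussian kernel; the closed form for $\int \hat f^2$ uses the fact that the convolution of two $N(0, h^2)$ densities is $N(0, 2h^2)$, and the bandwidth grid is arbitrary.

```python
import numpy as np

def lscv_score(x, h):
    """Least squares cross-validation score for a Gaussian-kernel density
    estimate: the integral of f_hat^2 minus twice the mean of the
    leave-one-out estimates.  Minimising over h targets the integrated
    squared error.  Sketch only; O(n^2) and not numerically tuned."""
    n = len(x)
    diff = x[:, None] - x[None, :]

    def phi(u, s):                      # N(0, s^2) density
        return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2 * np.pi))

    int_fhat_sq = phi(diff, np.sqrt(2.0) * h).sum() / n ** 2
    loo = (phi(diff, h).sum(axis=1) - phi(0.0, h)) / (n - 1)
    return int_fhat_sq - 2.0 * loo.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=400)
hs = np.linspace(0.05, 1.0, 40)
h_cv = hs[np.argmin([lscv_score(x, h) for h in hs])]
print(h_cv)   # data-driven bandwidth
```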

Journal ArticleDOI
TL;DR: In this paper, the number and location of support points for the nonparametric maximum likelihood estimator of the mixing distribution were linked to sign changes in certain integrated polynomials.
Abstract: Geometric analysis of the mixture likelihood set for univariate exponential family densities yields results which tie the number and location of support points for the nonparametric maximum likelihood estimator of the mixing distribution to sign changes in certain integrated polynomials. One corollary is a very general uniqueness theorem for the estimator.

Journal ArticleDOI
TL;DR: In this paper, a kernel estimate of the hazard function from censored data is obtained by convolution smoothing of the empirical hazards, and conditions for asymptotic normality are investigated using the Hajek projection method.
Abstract: By convolution smoothing of the empirical hazards, a kernel estimate of the hazard function from censored data is obtained. Small and large sample expressions for the mean and the variance of the estimator are given. Conditions for asymptotic normality are investigated using the Hajek projection method.

Journal ArticleDOI
TL;DR: In this paper, limit theorems giving rates of convergence of nonparametric regression estimates obtained from smoothing splines are proved, and new results are obtained for the usual (linear) case.
Abstract: Limit theorems giving rates of convergence of nonparametric regression estimates obtained from smoothing splines are proved. The main emphasis is on nonlinear, robust smoothing splines, but new results are obtained for the usual (linear) case. It is assumed that the knots become asymptotically uniform in a vague sense. Convergence of derivatives is also investigated. The main mathematical tools are a linearization of the robust smoothing spline, and an approximation of the linear smoothing spline utilizing the Green's function of an associated boundary value problem.

Journal ArticleDOI
TL;DR: It is shown that if $\lim_n h = 0$ and $\lim_n nh^d = \infty$, then for every $\varepsilon > 0$ there exist constants $r, n_0 > 0$ such that $P(J_n \geq \varepsilon) \leq \exp(-rn)$ for $n \geq n_0$, where $J_n$ is the $L_1$ error of the kernel density estimate.
Abstract: Let $f$ be a density on $R^d$, and let $f_n$ be the kernel estimate of $f$, $f_n(x) = (nh^d)^{-1} \sum^n_{i=1} K((x - X_i)/h)$ where $h = h_n$ is a sequence of positive numbers, and $K$ is an absolutely integrable function with $\int K(x) dx = 1$. Let $J_n = \int |f_n(x) - f(x)| dx$. We show that when $\lim_nh = 0$ and $\lim_nnh^d = \infty$, then for every $\varepsilon > 0$ there exist constants $r, n_0 > 0$ such that $P(J_n \geq \varepsilon) \leq \exp(-rn), n \geq n_0$. Also, when $J_n \rightarrow 0$ in probability as $n \rightarrow \infty$ and $K$ is a density, then $\lim_nh = 0$ and $\lim_nnh^d = \infty$.
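
A small numerical illustration (assumed setup: standard normal data, Gaussian kernel, bandwidth $h = n^{-1/5}$) of the quantity $J_n$, which under the stated conditions concentrates around 0 exponentially fast in $n$.

```python
import numpy as np

def l1_error(n, h, grid):
    """Approximate J_n = int |f_n - f| for a standard normal sample and a
    Gaussian kernel, by the trapezoidal rule on a fixed grid.  Sketch only."""
    rng = np.random.default_rng(n)            # seeded by n for reproducibility
    x = rng.normal(size=n)
    f_true = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)
    u = (grid[:, None] - x[None, :]) / h
    f_n = np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    return np.trapz(np.abs(f_n - f_true), grid)

grid = np.linspace(-5.0, 5.0, 1001)
for n in (100, 1000, 5000):
    print(n, round(l1_error(n, h=n ** (-0.2), grid=grid), 4))   # J_n shrinks as n grows
```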

Journal ArticleDOI
TL;DR: In this paper, the statistical properties of a cubic smoothing spline and its derivative are analyzed and it is shown that unless unnatural boundary conditions hold, the integrated squared bias is dominated by local effects near the boundary.
Abstract: The statistical properties of a cubic smoothing spline and its derivative are analyzed. It is shown that unless unnatural boundary conditions hold, the integrated squared bias is dominated by local effects near the boundary. Similar effects are shown to occur in the regularized solution of a translation-kernel integral equation. These results are derived by developing a Fourier representation for a smoothing spline.

Journal ArticleDOI
TL;DR: Examples are given to illustrate the "linear" and "general" empirical Bayes approaches to estimation; a final example concerns testing the null hypothesis that a treatment has had no effect.
Abstract: Examples are given to illustrate the "linear" and "general" empirical Bayes approaches to estimation. A final example concerns testing the null hypothesis that a treatment has had no effect.
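
As a concrete, assumed instance of the "linear" approach (not one of the paper's own examples): shrink each observed mean toward the grand mean by a factor estimated from the between-group variability, taking the sampling variance as known and equal across groups.

```python
import numpy as np

def linear_empirical_bayes(x, sigma2):
    """Linear empirical Bayes estimates of the means theta_i from
    x_i ~ N(theta_i, sigma2): shrink toward the grand mean by a factor
    estimated from the data.  Sketch with known, equal sampling variance."""
    grand = x.mean()
    total_var = x.var(ddof=1)                 # estimates sigma2 + var(theta)
    shrink = min(1.0, sigma2 / total_var)     # estimated sigma2 / (sigma2 + tau2)
    return grand + (1.0 - shrink) * (x - grand)

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 0.5, 50)              # latent group means
x = theta + rng.normal(0.0, 1.0, 50)          # one observation per group, sigma2 = 1
eb = linear_empirical_bayes(x, sigma2=1.0)
# the empirical Bayes estimates typically have smaller total squared error
print(np.mean((x - theta) ** 2), np.mean((eb - theta) ** 2))
```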

Journal ArticleDOI
TL;DR: In this paper, the asymptotic behavior of symmetric statistics of arbitrary order is studied. But the authors use as a tool a randomization of the sample size, which they use as an application to describe all limit distributions of square integrable $U$-statistics.
Abstract: The asymptotic behaviour of symmetric statistics of arbitrary order is studied. As an application we describe all limit distributions of square integrable $U$-statistics. We use as a tool a randomization of the sample size. A sample of Poisson size $N_\lambda$ with $EN_\lambda = \lambda$ can be interpreted as a Poisson point process with intensity $\lambda$, and randomized symmetric statistics are its functionals. As $\lambda \rightarrow \infty$, the probability distributions of these functionals tend to the distributions of multiple Wiener integrals. This can be considered as a stronger form of the following well-known fact: properly normalized, a Poisson point process with intensity $\lambda$ approaches a Gaussian random measure, as $\lambda \rightarrow \infty$.

Journal ArticleDOI
TL;DR: In this article, a sufficient condition for second order efficiency of an estimator is presented, which is easily checked in the case of minimum contrast estimators, and the Fisher scoring method is also considered in the light of second-order efficiency.
Abstract: This paper presents a sufficient condition for second order efficiency of an estimator. The condition is easily checked in the case of minimum contrast estimators. The $\alpha^\ast$-minimum contrast estimator is defined and proved to be second order efficient for every $\alpha, 0 < \alpha < 1$. The Fisher scoring method is also considered in the light of second order efficiency. It is shown that a contrast function is associated with the second order tensor and the affine connection. This fact leads us to prove the above assertions in the differential geometric framework due to Amari.

Journal ArticleDOI
TL;DR: The notion of general balance due to Nelder is discussed in this paper in relation to the eigenvectors of an information matrix, combinatorial balance and the simple combinability of information from uncorrelated sources in an experiment.
Abstract: The notion of general balance due to Nelder is discussed in relation to the eigenvectors of an information matrix, combinatorial balance and the simple combinability of information from uncorrelated sources in an experiment.

Journal ArticleDOI
TL;DR: In this paper, the asymptotic accuracy of the bootstrap approximation to the distribution of a $k$-sample studentized mean was studied. And the authors showed that the approximation is robust to the number of samples.
Abstract: We study the asymptotic accuracy of the bootstrap approximation to the distribution of a $k$-sample studentized mean.
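
A minimal sketch of the object under study, the bootstrap distribution of a studentized mean, used here to form a percentile-$t$ interval. This is the one-sample version with arbitrary settings; the paper treats the $k$-sample case.

```python
import numpy as np

def bootstrap_studentized_mean(x, level=0.95, n_boot=5000, seed=0):
    """Bootstrap approximation to the distribution of the studentized mean
    sqrt(n) * (xbar - mu) / s, used to form a percentile-t interval.
    One-sample sketch only."""
    rng = np.random.default_rng(seed)
    n = len(x)
    xbar, s = x.mean(), x.std(ddof=1)
    t_star = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(x, size=n, replace=True)
        t_star[b] = np.sqrt(n) * (xb.mean() - xbar) / xb.std(ddof=1)
    lo, hi = np.quantile(t_star, [(1 - level) / 2, (1 + level) / 2])
    return xbar - hi * s / np.sqrt(n), xbar - lo * s / np.sqrt(n)

rng = np.random.default_rng(1)
print(bootstrap_studentized_mean(rng.exponential(1.0, 80)))   # interval for the mean (true value 1.0)
```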

Journal ArticleDOI
TL;DR: In this article, the authors use the counting process formulation of Andersen and Gill (1982) to develop asymptotic distribution theory for a class of intensity function regression models in which the usual exponential regression form is relaxed to an arbitrary non-negative twice differentiable form.
Abstract: The theory and application of the Cox (1972) failure time regression model has, almost without exception, assumed an exponential form for the dependence of the hazard function on regression variables. Other regression forms may be more natural or descriptive in some applications. For example, a linear relative risk regression model provides a convenient framework for studying epidemiologic risk factor interactions. This note uses the counting process formulation of Andersen and Gill (1982) to develop asymptotic distribution theory for a class of intensity function regression models in which the usual exponential regression form is relaxed to an arbitrary non-negative twice differentiable form. Some stability and regularity conditions, beyond those of Andersen and Gill, are required to show the consistency of the observed information matrix, which in general need not be positive semidefinite.

Journal ArticleDOI
TL;DR: In this paper, a unified treatment of the consistency properties of the ordinary least squares estimates in an autoregressive fitting of time series from nonstationary or stationary auto-regressive moving average models is given.
Abstract: A unified treatment of the consistency properties of the ordinary least squares estimates in an autoregressive fitting of time series from nonstationary or stationary autoregressive moving average models is given. For a given model, the orders of autoregressions which produce consistent estimates are obtained and the limiting values, hence the biases, of the estimates of other autoregressions are investigated.

Journal ArticleDOI
TL;DR: In this article, the authors considered nonparametric inference for hazard rates with censored serial data and derived strong approximation and simultaneous confidence bands for the Rosenblatt-Parzen estimators.
Abstract: This paper concerns nonparametric inference for hazard rates with censored serial data. The focus is upon "delta sequence" estimators of the form $h_n(x) = \int K_b(x, y) dH_n(y)$ with $K_b$ integrating to 1 and concentrating mass near $x$ as $b \rightarrow 0$. $H_n$ is the Nelson-Aalen empirical cumulative hazard. Strong approximation and simultaneous confidence bands are derived for Rosenblatt-Parzen estimators, with $K_b(x, y) = w((x - y)/b)/b, b = o(n^{-1}),$ and $w(\cdot)$ a well-behaved density. This work generalizes global deviation and mean square deviation results of Bickel and Rosenblatt and others to censored survival data. Simulations with exponential survival and censoring indicate the effect of censoring on bias, variance, and maximal absolute deviation. Data from a survival experiment with serial sacrifice are analysed.

Journal ArticleDOI
TL;DR: One- and two-sided confidence intervals for $T(F)$, of level $1 - \alpha + O(n^{-j/2})$ for any given $j$, are obtained for suitably regular functionals $T(\cdot)$ on the space of distribution functions; examples include approximate nonparametric confidence intervals for the mean and variance of a distribution on $R$.
Abstract: Let $T(\cdot)$ be a suitably regular functional on the space of distribution functions, $F$, on $R^s$. A method is given for obtaining the derivatives of $T$ at $F$. This is used to obtain asymptotic expansions for the distribution and quantiles of $T(F_n)$ where $F_n$ is the empirical distribution of a random sample of size $n$ from a distribution $F$ with an absolutely continuous component. One- and two-sided confidence intervals for $T(F)$ are given of level $1 - \alpha + O(n^{-j/2})$ for any given $j$. Examples include approximate nonparametric confidence intervals for the mean and variance of a distribution on $R$.

Journal ArticleDOI
TL;DR: For a first order non-explosive autoregressive process with unknown parameter $\beta \in (-1, 1)$, the least squares estimator of $\beta$ is shown to be asymptotically normally distributed uniformly in $\beta$ when data are collected according to a particular stopping rule.
Abstract: For a first order non-explosive autoregressive process with unknown parameter $\beta \in (-1, 1)$, it is shown that if data are collected according to a particular stopping rule, the least squares estimator of $\beta$ is asymptotically normally distributed uniformly in $\beta$. In the case of normal residuals, the stopping rule may be interpreted as sampling until the observed Fisher information reaches a preassigned level. The situation is contrasted with the fixed sample size case, where the estimator has a non-normal limiting distribution when $|\beta| = 1$.
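
A hedged simulation sketch of the stopping rule described above: with standard normal residuals, sample until the observed Fisher information $\sum_t x_{t-1}^2$ reaches a preassigned level $c$, then form the least squares estimate. The parameter values and $c$ are illustrative.

```python
import numpy as np

def sequential_ar1_lse(beta, c, max_n=200_000, seed=0):
    """Simulate x_t = beta * x_{t-1} + e_t with standard normal residuals
    and stop the first time the observed information sum(x_{t-1}^2) reaches
    the preassigned level c; return the least squares estimate of beta and
    the (random) sample size.  Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    x_prev, info, cross, n = 0.0, 0.0, 0.0, 0
    while info < c and n < max_n:
        x = beta * x_prev + rng.normal()
        info += x_prev ** 2              # observed Fisher information so far
        cross += x_prev * x
        x_prev = x
        n += 1
    return cross / info, n

for b in (0.3, 0.9, 0.99):
    est, n = sequential_ar1_lse(b, c=500.0, seed=3)
    print(b, round(est, 3), n)           # estimate and random stopping time
```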

Journal ArticleDOI
TL;DR: In this paper, the orthogonality condition ensuring that both information matrices are equal is examined in the model for repeated measurements designs which was considered e.g. by Cheng and Wu (1980).
Abstract: The information matrices of one design in a finer and a simpler linear model are compared to each other. The orthogonality condition ensuring that both matrices are equal is examined in the model for repeated measurements designs which was considered e.g. by Cheng and Wu (1980). Examples of unbalanced designs fulfilling the orthogonality condition are shown to be optimum. Moreover, nearly strongly balanced generalized latin squares are introduced and their universal optimality is proved, if the numbers of units and periods are sufficiently large.

Journal ArticleDOI
TL;DR: In this paper, a method for inverting a general Edgeworth expansion, so as to correct a statistic for the effects of non-normality, is presented, which is applied to the special case of the "Studentized" mean.
Abstract: We provide a method for inverting a general Edgeworth expansion, so as to correct a statistic for the effects of non-normality. This technique is applied to the special case of the "Studentized" mean. Explicit formulae are given for the correction terms.

Journal ArticleDOI
TL;DR: In this paper, a modified version of the K-S test is introduced that is more sensitive than the original Kolmogorov-Smirnov (K-S) test to deviations in the tails.
Abstract: It is well known that the Kolmogorov-Smirnov (K-S) test exhibits poor sensitivity to deviations from the hypothesized distribution that occur in the tails. A modified version of the K-S test is introduced that is more sensitive than the K-S test to deviations in the tails. The finite and infinite sample distribution along with the consistency properties of the proposed test are studied. Tables of critical values are provided for two versions of the test (one sensitive to heavy tail alternatives and one sensitive to light tail alternatives) and the finite sample properties of these two versions of the test are investigated.
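
The specific modification proposed in the paper is not reproduced here. As a generic illustration of how tail-weighting changes the statistic, the sketch below compares the plain K-S statistic with a variant that divides the empirical discrepancy by $\sqrt{F_0(1 - F_0)}$ before taking the supremum, which up-weights deviations in the tails; all settings are assumed.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(t):
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in np.ravel(t)])

def ks_and_tail_weighted(x, cdf0):
    """Plain K-S statistic and a tail-weighted variant (discrepancy divided
    by sqrt(F0 * (1 - F0)) before the supremum).  Generic illustration of
    tail weighting, not the statistic proposed in the paper."""
    n = len(x)
    u = np.sort(cdf0(x))
    ecdf_hi = np.arange(1, n + 1) / n          # F_n just after each order statistic
    ecdf_lo = np.arange(0, n) / n              # F_n just before
    d = np.maximum(np.abs(ecdf_hi - u), np.abs(ecdf_lo - u))
    weight = np.sqrt(np.clip(u * (1.0 - u), 1e-12, None))
    return d.max(), (d / weight).max()

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=500)             # heavier tails than the hypothesized normal
print(ks_and_tail_weighted(x, normal_cdf))     # the tail-weighted statistic reacts more strongly
```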

Journal ArticleDOI
TL;DR: In this paper, the authors review the application of cross-validation to the smoothing problem, and establish $L_1$ consistency for certain crossvalidated kernels and histograms.
Abstract: Application of nonparametric density estimators generally requires the specification of a "smoothing parameter." The kernel estimator, for example, is not fully defined until a window width, or scaling, for the kernels has been chosen. Many "data-driven" techniques have been suggested for the practical choice of smoothing parameter. Of these, the most widely studied is the method of cross-validation. Our own simulations, as well as those of many other investigators, indicate that cross-validated smoothing can be an extremely effective practical solution. However, many of the most basic properties of cross-validated estimators are unknown. Indeed, recent results show that cross-validated estimators can fail even to be consistent for seemingly well-behaved problems. In this paper we will review the application of cross-validation to the smoothing problem, and establish $L_1$ consistency for certain cross-validated kernels and histograms.