
Showing papers on "Outlier published in 1981"


Journal ArticleDOI
TL;DR: In this paper, two procedures for detecting observations with outlying values either in the response variable or in the explanatory variables in multiple regression are presented as half-normal plots, with envelopes derived from simulation to avoid overinterpretation of the data.
Abstract: SUMMARY The paper describes two procedures for detecting observations with outlying values either in the response variable or in the explanatory variables in multiple regression. These procedures are presented as half-normal plots with envelopes derived from simulation in order to avoid overinterpretation of the data. Analysis of a well-known data set leads to the use of a data transformation, a simple test for which is commended, and to some comments on the relationship with robust regression. The widespread availability of sophisticated computer software has made the fitting of multiple regression equations painless, but the very ease of these procedures may cause insufficient care to be given to checking and scrutiny of the data. Transcription, punching and data manipulation errors may lead to bad values both of the observations y and of the independent or explanatory variables x. It is the purpose of the present paper to describe and exemplify two plots which provide checks against such bad values. It is argued that these plots and a test for transformations should accompany any thorough regression analysis. Bad values of the response, or outliers, have long been detected by a variety of plots of residuals. Bad values of the explanatory variables are less easily detected even when they lead to an extreme point in the design space. Since the fitted equation will pass close to such an influential observation, the residual may not be especially large even after allowance has been made for the small variance of the fitted value at such a point. To detect such behaviour a quantity is needed which exhibits the dependence of the fitted model on each point or group of points. Discussions of such quantities are given by Hoaglin & Welsch (1978) and, more recently, by Belsley, Kuh & Welsch (1980, §2.1), by Cook & Weisberg (1980) and by Pregibon (1981). The quantity used here is a modification of a statistic proposed by Cook (1977), which is derived in §2. Two half-normal plots are presented in §3. As an aid to interpretation, envelopes to the plots are generated by simulation. Examples of the plots are given in §4, both for simulated data and for the stack loss data given by Brownlee (1965, p. 454). Transformations of this data set are discussed in §5. The paper concludes with some comments on the relationship with robust regression.

227 citations
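A minimal sketch of the simulated-envelope idea described above, applied to absolute studentized residuals rather than Atkinson's modified Cook statistic (the function name and the choice of 19 simulations are illustrative, not from the paper):

```python
import numpy as np

def halfnormal_envelope(X, y, n_sim=19, seed=0):
    """Sorted absolute studentized residuals plus a simulated envelope,
    to be plotted against half-normal quantiles. Points escaping the
    envelope are candidate bad values."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix
    h = np.diag(H)

    def abs_stud(yy):
        e = yy - H @ yy                          # residuals (I - H) yy
        s2 = e @ e / (n - p)                     # residual mean square
        return np.sort(np.abs(e / np.sqrt(s2 * (1 - h))))

    obs = abs_stud(y)
    # Studentized residuals depend only on the errors, so simulating
    # standard normal responses suffices for the null envelope.
    sims = np.array([abs_stud(rng.standard_normal(n)) for _ in range(n_sim)])
    return obs, sims.min(axis=0), sims.max(axis=0)
```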


Journal ArticleDOI
TL;DR: In this article, a detailed examination of the influence statistics of Cook (1977) and Andrews and Pregibon (1978) shows that two different types of influence are being measured; this is illustrated with examples derived from a set of data given by Mickey, Dunn, and Clark (1967).
Abstract: Statistics offered by Cook (1977) and Andrews and Pregibon (1978) purport to reveal influential observations in a regression analysis. Detailed examination of these statistics shows that two different types of influence are being measured and this is illustrated with examples derived from a set of data given by Mickey, Dunn, and Clark (1967). Recommendations are given for obtaining the best use of the statistics available.

158 citations
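For reference, the two ingredients the paper disentangles, residual size and leverage, both enter Cook's (1977) distance; a minimal sketch (function name is ours):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's (1977) distance D_i and leverage h_i for each case. D_i is
    large when a big studentized residual and a high-leverage design
    point coincide."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    h = np.einsum('ij,ij->i', X @ np.linalg.inv(X.T @ X), X)  # leverages
    s2 = e @ e / (n - p)
    r2 = e**2 / (s2 * (1 - h))          # squared studentized residuals
    return (r2 / p) * (h / (1 - h)), h
```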


Book ChapterDOI
01 Jan 1981
TL;DR: In this article, the authors present two time-series outlier models, point out their ordinary regression analogues and the corresponding outlier patterns, and present robust alternatives to the least-squares method of fitting autoregressive-moving-average models.
Abstract: Outliers in time series can wreak havoc with conventional least-squares procedures, just as in the case of ordinary regression. This paper presents two time-series outlier models, points out their ordinary regression analogues and the corresponding outlier patterns, and presents robust alternatives to the least-squares method of fitting autoregressive-moving-average models. The main emphasis is on robust estimation in the presence of additive outliers. This results in the problem having an errors-in-variables aspect. While several methods of robust estimation for this problem are presented, the most attractive approach is an approximate non-Gaussian maximum-likelihood type method which involves the use of a robust non-linear filter/one-sided interpolator with data-dependent scaling. Robust smoothing/two-sided outlier interpolation, forecasting, model selection, and spectral analysis are briefly mentioned, as are the problems of estimating location and dealing with trends, seasonality, and missing data. Some examples of applying the methodology are given.

93 citations
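A stripped-down illustration of the robust filtering idea for additive outliers, here a one-sided Huber-psi cleaner for an AR(1) with known parameters, not Martin's approximate non-Gaussian maximum-likelihood procedure:

```python
import numpy as np

def huber_psi(x, c=1.345):
    return np.clip(x, -c, c)

def robust_ar1_clean(y, phi, sigma, c=1.345):
    """One-sided robust filter/cleaner for an AR(1) observed with
    additive outliers: predictions are corrected by a bounded function
    of the prediction residual, so a single outlier cannot drag the
    filtered state far from its prediction."""
    x = np.empty(len(y))
    x[0] = y[0]
    for t in range(1, len(y)):
        pred = phi * x[t - 1]                     # one-step prediction
        r = (y[t] - pred) / sigma                 # scaled prediction residual
        x[t] = pred + sigma * huber_psi(r, c)     # bounded correction
    return x
```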


Book
01 Jan 1981
TL;DR: In this paper, the authors focus on two aspects of the errors-in-variables problem: variance estimation for the classical estimators of slope and intercept, and the detection of influential observations.
Abstract: This paper focuses on two aspects of the errors-in-variables problem: variance estimation of the classical estimators of slope and intercept, and the detection of influential observations. The behaviour of the jackknife, bootstrap, normal theory and influence function estimators of variability is examined under a number of sampling situations by Monte Carlo methods. In the multivariate case, perturbation analysis is used to calculate the influence function of the estimator of Gleser (1981). The connection to estimation in linear regression models is discussed. The role of the influence function in the detection of influential observations is considered and an illustration is given by a numerical example.

83 citations
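A sketch of one of the variability estimators the paper compares, the bootstrap, applied to the classical errors-in-variables slope with a known error-variance ratio (the estimator form is the standard one, not necessarily the paper's exact parametrization):

```python
import numpy as np

def eiv_slope(x, y, lam=1.0):
    """Classical errors-in-variables slope for a known ratio lam of the
    error variances (lam = 1 gives orthogonal regression)."""
    sxx, syy = np.var(x, ddof=1), np.var(y, ddof=1)
    sxy = np.cov(x, y)[0, 1]
    d = syy - lam * sxx
    return (d + np.sqrt(d * d + 4 * lam * sxy**2)) / (2 * sxy)

def bootstrap_slope_var(x, y, n_boot=999, seed=0):
    """Bootstrap estimate of the variance of the slope: resample cases
    with replacement and take the variance of the replicates."""
    rng = np.random.default_rng(seed)
    n = len(x)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, n, n)
        reps[b] = eiv_slope(x[i], y[i])
    return np.var(reps, ddof=1)
```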


Journal ArticleDOI
TL;DR: In this article, an approximation to the sequential updating of the distribution of location parameters of a linear time series model for non-normal observations is developed for a wide range of symmetric, unimodal error distributions and is both more realistic and elegant than the discrete Gaussian Sum approach.
Abstract: SUMMARY An approximation to the sequential updating of the distribution of location parameters of a linear time series model is developed for non-normal observations. The behaviour of the resulting non-linear recursive filtering algorithm is examined and shown to have certain desirable properties for a variety of non-normal error distributions. Illustrative examples are given and relationships with previous work on robustness and sequential estimation are mentioned. We consider here the problem of sequential estimation of the location vector of a linear time series model, termed the Dynamic Linear Model by Harrison and Stevens (1976). The straightforward, exact analysis obtained by assuming normal error and prior structure is generally lost when alternative error distributions are adopted, and yet considerations of realism or robustness may strongly suggest such non-normal assumptions. The multi-state model of Harrison and Stevens provides an approximate analysis based on a discrete variance mixture of normal distributions, an approach which has been extensively investigated in the engineering literature under the name of Gaussian Sum approximations; see, for example, Alspach and Sorenson (1971). Our aim in this paper is to provide an approximate, tractable, recursive updating procedure for the location parameters, which is applicable to a wide range of symmetric, unimodal error distributions and is both more realistic and more elegant than the discrete Gaussian Sum approach. In particular, for heavy-tailed distributions our procedures provide approximate Bayesian methods for time series analysis which extend considerably the work of Masreliez and Martin (1977) and have close connections with classical robustness ideas such as M-estimation and influence functions.

83 citations
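A minimal sketch of a Masreliez-type recursion of the kind developed here, for a scalar dynamic linear model with Student-t observation errors of known scale and degrees of freedom (an illustrative special case, not the paper's general algorithm):

```python
import numpy as np

def t_score(e, s, df):
    """Score -(d/de) log p(e) for a Student-t observation error of scale s."""
    return (df + 1) * e / (df * s**2 + e**2)

def robust_dlm_filter(y, G=1.0, W=0.1, s=1.0, df=4, m0=0.0, C0=1.0):
    """Masreliez-type recursive update for the scalar dynamic linear model
    theta_t = G*theta_{t-1} + w_t, w_t ~ N(0, W). Large prediction errors
    receive a bounded correction, unlike the Kalman filter's linear one."""
    m, C, out = m0, C0, []
    for yt in y:
        a, R = G * m, G * G * C + W                  # one-step prior
        e = yt - a                                   # prediction error
        m = a + R * t_score(e, s, df)                # bounded state update
        # Masreliez variance update: C = R - R^2 * score'(e)
        C = R - R**2 * (df + 1) * (df * s**2 - e**2) / (df * s**2 + e**2)**2
        out.append(m)
    return np.array(out)
```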


Journal ArticleDOI
TL;DR: In this article, three different approaches were tested: (i) a mixed-strategy approach alternating between polynomial surface fitting and local averaging, both weighted by distance and sample size, the choice between the alternatives being based on the local dispersion of data; (ii) the theory of regionalized variables with automatic choice of the degree of the polynomial function representing the generalized covariance distance; (iii) the use in (ii) of current genetic models of covariation of gene frequency with distance.
Abstract: Graphic display of a surface by isopleths is frequently useful. This technique has also been used, though not widely, to study the geographic distribution of gene frequencies. We were interested in the development of automatic procedures for this purpose. Three different approaches were tested: (i) a mixed-strategy approach alternating between polynomial surface fitting and local averaging, both weighted by distance and sample size, the choice between the alternatives being based on the local dispersion of data; (ii) the theory of regionalized variables with automatic choice of the degree of the polynomial function representing the generalized covariance distance; (iii) the use in (ii) of current genetic models of covariation of gene frequency with distance. The first method was found to be slightly superior to the second (though not significantly so) by the testing criteria adopted. The third method was definitely less satisfactory, as tested by a variety of criteria. Among them, a 'leave-out-s' technique of jackknifing provided a distribution of χ² values measuring the fit of the map to each data point. Outliers observed were justified on biological grounds. Various aspects and problems of automatic map construction are discussed. Practical results seem quite encouraging.

62 citations
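One ingredient of method (i), local averaging weighted by distance and sample size, might look as follows (the Gaussian distance weighting is our assumption, not necessarily the authors' exact scheme):

```python
import numpy as np

def local_average(grid_xy, data_xy, freq, n_samp, bandwidth):
    """Distance- and sample-size-weighted local average of gene
    frequencies at each map grid point."""
    grid_xy, data_xy = np.asarray(grid_xy), np.asarray(data_xy)
    out = np.empty(len(grid_xy))
    for k, g in enumerate(grid_xy):
        d2 = np.sum((data_xy - g) ** 2, axis=1)
        w = n_samp * np.exp(-d2 / (2 * bandwidth**2))  # Gaussian distance decay
        out[k] = np.sum(w * freq) / np.sum(w)
    return out
```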


Journal ArticleDOI
TL;DR: Weighted least squares estimation is used for fitting nonlinear models to Ames test data that are biologically plausible, incorporating mutagenicity, toxicity, and/or saturation.
Abstract: Weighted least squares estimation is used for fitting nonlinear models to Ames test data. The models are biologically plausible, incorporating mutagenicity, toxicity, and/or saturation. Regressions are weighted to compensate for variance that changes with mean and to attain outlier resistance.

61 citations
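A hedged sketch of the general approach: iteratively reweighted least squares with 1/mean weights to compensate for variance increasing with the mean, and Huber weights on the Pearson residuals for outlier resistance. Here `model` is a user-supplied mean function, not the paper's specific mutagenicity/toxicity/saturation form:

```python
import numpy as np
from scipy.optimize import least_squares

def irwls_fit(dose, count, model, theta0, n_iter=10, c=1.345):
    """Iteratively reweighted least squares for a nonlinear mean function
    model(dose, theta)."""
    theta = np.asarray(theta0, float)
    for _ in range(n_iter):
        mu = np.maximum(model(dose, theta), 1e-8)
        pearson = (count - mu) / np.sqrt(mu)
        w = np.minimum(1.0, c / np.maximum(np.abs(pearson), 1e-8))  # Huber weights

        def resid(th):
            m = np.maximum(model(dose, th), 1e-8)
            return np.sqrt(w / m) * (count - m)   # weighted residuals

        theta = least_squares(resid, theta).x
    return theta
```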


Journal ArticleDOI
TL;DR: In this paper, influence functions for a variety of parametric functions in multivariate analysis are obtained, including the generalized variance, the matrix of regression coefficients, the noncentrality matrix Σ⁻¹δ, and the matrix L, which is a generalization of 1 − R², canonical correlations, principal components and parameters that correspond to Pillai's statistic (1955), Hotelling's (1951) generalized T₀² and Wilks's Λ (1932).
Abstract: The influence function introduced by Hampel (1968, 1973, 1974) is a tool that can be used for outlier detection. Campbell (1978) has obtained the influence function for the Mahalanobis distance between two populations, which can be used for detecting outliers in discriminant analysis. In this paper influence functions for a variety of parametric functions in multivariate analysis are obtained. Influence functions for the generalized variance, the matrix of regression coefficients, the noncentrality matrix Σ⁻¹δ in multivariate analysis of variance and its eigenvalues, the matrix L, which is a generalization of 1 − R², canonical correlations, principal components and parameters that correspond to Pillai's statistic (1955), Hotelling's (1951) generalized T₀² and Wilks's Λ (1932), which can be used for outlier detection in multivariate analysis, are obtained. Devlin, Gnanadesikan and Kettenring (1975) have obtained the influence function for the population correlation coefficient in the bivariate case. It is shown in...

60 citations
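The bivariate special case mentioned at the end has a simple sample version; a sketch of the empirical influence of each observation on the correlation coefficient, using Devlin, Gnanadesikan and Kettenring's formula:

```python
import numpy as np

def corr_influence(x, y):
    """Sample influence function for the correlation coefficient:
    IF_i = u_i*v_i - (r/2)*(u_i^2 + v_i^2) with standardized values.
    Large values flag observations that dominate r."""
    r = np.corrcoef(x, y)[0, 1]
    u = (x - x.mean()) / x.std(ddof=1)
    v = (y - y.mean()) / y.std(ddof=1)
    return u * v - 0.5 * r * (u**2 + v**2)
```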


Journal ArticleDOI
TL;DR: Several tests of discordancy for an outlying value in a sample of data from Fisher's distribution are investigated, and recommendations are made about their use, as discussed by the authors. One of the tests is modified for application to testing the discordancy of several outliers in a single Fisher sample.
Abstract: Several tests of discordancy for an outlying value in a sample of data from Fisher's distribution are investigated, and recommendations made about their use. One of the tests is modified for application to testing discordancy of several outliers in a single Fisher sample. The various tests are applied to some samples of artificial data and palaeomagnetic data.

47 citations


Journal ArticleDOI
TL;DR: In this paper, three estimators that alter the usual sampling weights have been considered and the efficiencies of these estimators have been worked out in terms of the ratio of the mean squared error of the usual estimator of the population total to the mean square error of those estimators.
Abstract: The problem considered is the estimation of the population total of some characteristic from a simple random sample containing a few large or extreme observations. The effect of these large units in the sample is to distort the estimate of the population total. It is therefore important to correct the weights for such units or deflate their values at the estimation stage once they have been sampled and identified as unusually large units. In this paper, three estimators that alter the usual sampling weights have been considered. The efficiencies of these estimators have been worked out in terms of the ratio of the mean squared error of the usual estimator of the population total to the mean squared error of these estimators. A numerical study of these estimators is also discussed.

40 citations
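A Monte Carlo sketch of how such efficiencies can be computed, with a simple winsorized total standing in for the paper's weight-modifying estimators (the cutoff rule is illustrative, not one of the authors' forms):

```python
import numpy as np

def mc_efficiency(pop, n, cutoff, n_rep=10_000, seed=0):
    """Monte Carlo efficiency of a winsorized estimator of the population
    total relative to the usual expansion estimator N*ybar under simple
    random sampling without replacement.
    Efficiency = MSE(usual) / MSE(alternative), as in the paper."""
    rng = np.random.default_rng(seed)
    N, total = len(pop), pop.sum()
    mse_u = mse_w = 0.0
    for _ in range(n_rep):
        s = pop[rng.choice(N, size=n, replace=False)]
        mse_u += (N * s.mean() - total) ** 2
        mse_w += (N * np.minimum(s, cutoff).mean() - total) ** 2
    return mse_u / mse_w
```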


Journal ArticleDOI
TL;DR: This work presents a relatively simple alternative method for assessing the accuracy of the first-order Bonferroni upper bound that can be applied to any linear model and is suitable for routine use.
Abstract: At present, the first-order Bonferroni upper bound is the only practically useful tool for determining approximate critical values or p-values for the maximum absolute studentized residual as a criterion for detecting a single outlier in a linear model. Available methods for assessing the accuracy of this bound require numerical integration and are difficult to apply routinely. We present a relatively simple alternative method that can be applied to any linear model and is suitable for routine use. The application to analyses of 2^m factorial experiments and regression models is illustrated with several examples.
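For context, the first-order Bonferroni critical value itself is easy to compute when externally studentized residuals are used, since each follows a t-distribution on n - p - 1 degrees of freedom; a sketch:

```python
from scipy import stats

def bonferroni_critical(alpha, n, p):
    """First-order Bonferroni critical value for the maximum absolute
    externally studentized residual in a linear model with n cases and
    p parameters: the union bound spreads alpha over 2n tails."""
    return stats.t.ppf(1 - alpha / (2 * n), n - p - 1)
```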

Journal ArticleDOI
TL;DR: In this article, the authors used the influence curve introduced by Hampel (1974) as a heuristic tool in evaluating the robustness of some estimators of location for directional data discussed by Mardia (1972).
Abstract: Much recent research has centred on the robustness of estimators, particularly estimators of location and scale for linear data. However, the robustness of estimators for directional data has not been investigated comprehensively. Mardia (1975) and discussants mentioned some aspects of robustness. Collett (1980) investigated outliers in circular data. In this note, we will use the influence curve introduced by Hampel (1974) as a heuristic tool in evaluating the robustness of some estimators of location for directional data discussed by Mardia (1972). Similar calculations can be used to obtain influence curves for scale estimators. For the location model, we consider a circular distribution F which is unimodal and symmetric about the unknown direction μ. Let ρ be the known concentration parameter. The sample circular mean is the most commonly used measure of location. We express the circular population mean in functional form as
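The final sentence is truncated in this extract; the standard functional form of the circular mean direction, which it presumably introduces, is (our reconstruction, not recovered text):

```latex
\mu(F) \;=\; \arg\!\left( \int_{0}^{2\pi} e^{i\theta}\, \mathrm{d}F(\theta) \right)
```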


Journal ArticleDOI
TL;DR: In this paper, a new type of procedure for estimating the number of outliers in a sample is presented and compared with existing procedures and the probabilities of exact, under-, and overestimation with the different procedures are examined for two different contamination schemes.
Abstract: A new type of procedure for estimating the number of outliers in a sample is presented and compared with existing procedures. The probabilities of exact, under-, and overestimation with the different procedures are examined for two different contamination schemes.

ReportDOI
01 May 1981
TL;DR: In this paper, Weibull outlier tests based on three different statistics are investigated with respect to their power optimality under various alternative models, and the tabulated values allow one to identify 'treatment effects' resulting from unsuspected modifications to a process or to predict failure times in a life test.
Abstract: Weibull outlier tests based on three different statistics are investigated with respect to their power optimality under various alternative models. Two of these statistics are new in the context of outlier statistics, and one of these is shown to provide a more powerful test in certain situations than other, more classical outlier test statistics. Critical values of the three statistics were computer-generated and are tabulated. The tabulated values allow one to identify 'treatment effects' resulting from unsuspected modifications to a process or to predict failure times in a life test. Numerical examples are given.
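Critical values of such statistics are typically computer-generated by Monte Carlo; a sketch for one illustrative Weibull outlier statistic (not necessarily one of the report's three), using the transformation to the exponential scale for a known shape parameter:

```python
import numpy as np

def weibull_outlier_critical(n, shape, alpha=0.05, n_sim=100_000, seed=0):
    """Monte Carlo critical value for the statistic max(y)/sum(y), where
    y = x**shape puts a Weibull(shape) sample on the exponential scale.
    The null distribution is then free of the scale parameter."""
    rng = np.random.default_rng(seed)
    y = rng.weibull(shape, size=(n_sim, n)) ** shape
    stat = y.max(axis=1) / y.sum(axis=1)
    return np.quantile(stat, 1 - alpha)
```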

Journal ArticleDOI
TL;DR: In this article, it was shown that when all the correlations between residuals are smaller in absolute value than a certain tabulated value, the approximate test based on the first Bonferroni inequality will have size between the nominal value α and α − 2α².
Abstract: The maximum absolute studentized residual may be used for the detection of a single outlier in a general linear model, but approximations to its distribution are required. Ellenberg (1976, Biometrics 32, 637-645) has proposed the use of the second Bonferroni inequality. It is now shown that, when all the correlations between residuals are smaller in absolute value than a certain tabulated value, the approximate test based on the first Bonferroni inequality will have size between the nominal value α and α − 2α². The results are illustrated by two examples.
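For reference, the first and second Bonferroni inequalities that bracket the size of such a test, with A_i the event that the i-th studentized residual exceeds the critical value in absolute value (with each P(A_i) = α/n, the upper bound is the nominal α):

```latex
P\Bigl(\bigcup_{i=1}^{n} A_i\Bigr) \;\le\; \sum_{i=1}^{n} P(A_i),
\qquad
P\Bigl(\bigcup_{i=1}^{n} A_i\Bigr) \;\ge\; \sum_{i=1}^{n} P(A_i)
\;-\; \sum_{i<j} P(A_i \cap A_j).
```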

Book ChapterDOI
01 Jan 1981
TL;DR: The direct linear plot as mentioned in this paper is a method of plotting the Michaelis-Menten equation in parameter space that leads to robust estimates of the parameters, i.e., estimates based on few assumptions about the statistical structure of the data.
Abstract: Although the method of least squares is optimal for fitting equations to data under idealized conditions, it gives poor results if these conditions are violated. It is highly sensitive to the presence of outliers in the data and it requires correct weighting. The direct linear plot is a method of plotting the Michaelis-Menten equation in parameter space that leads to robust estimates of the parameters, i.e., estimates based on few assumptions about the statistical structure of the data. It performs well with the Michaelis-Menten equation, and similar methods can be applied with equal success to other two-parameter models, but it cannot easily be applied to models with more than two parameters to be estimated. For this reason there is interest in other robust methods, such as biweight regression, which resembles least-squares regression in which decreased or zero weight is assigned to outlying observations. Such methods provide good protection against outliers but need correct weights to work satisfactorily.
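The direct linear plot itself is easy to state in code: each observation (s_i, v_i) defines the line V = v_i + (v_i/s_i)K in (K, V) parameter space, and the estimates are the medians of the pairwise intersection coordinates (Eisenthal and Cornish-Bowden's method, as described in the chapter); a sketch:

```python
import numpy as np

def direct_linear_plot(s, v):
    """Direct linear plot estimates for the Michaelis-Menten equation
    v = V*s / (K + s). Returns (K_hat, V_hat) as medians over all
    pairwise line intersections, which makes them outlier-resistant."""
    K_ij, V_ij = [], []
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n):
            denom = v[i] / s[i] - v[j] / s[j]
            if denom == 0:
                continue                        # parallel lines: skip pair
            K = (v[j] - v[i]) / denom
            K_ij.append(K)
            V_ij.append(v[i] + (v[i] / s[i]) * K)
    return np.median(K_ij), np.median(V_ij)
```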

Journal ArticleDOI
TL;DR: Results show that no procedure is most powerful whether the actual number of outliers present in the sample is estimated exactly, underestimated, or overestimated, and the probabilities of inliers being detected as outliers are also substantial.
Abstract: The general problem of outlier detection and the five recursive outlier detection procedures considered in the study are defined. Methods are described for computing powers, probabilities of detecting one or more outliers, and probabilities of declaring more than one observation, including at least one inlier, to be outliers; the results are discussed. The results show that no procedure is most powerful whether the actual number of outliers present in the sample is estimated exactly, underestimated, or overestimated. The probabilities of inliers being detected as outliers are also substantial, particularly when outliers occur only on one side of the sample.
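A minimal example of a recursive procedure of this general kind (an outward Grubbs-type test, our illustrative choice, not necessarily one of the five studied):

```python
import numpy as np
from scipy import stats

def recursive_grubbs(x, alpha=0.05, max_out=None):
    """Repeatedly test and remove the most extreme observation with
    Grubbs's statistic until no further rejection. Returns the removed
    values and the remaining sample."""
    x = list(x)
    removed = []
    while len(x) > 2 and (max_out is None or len(removed) < max_out):
        a = np.asarray(x)
        n = len(a)
        i = int(np.argmax(np.abs(a - a.mean())))
        G = abs(a[i] - a.mean()) / a.std(ddof=1)
        # Grubbs critical value from the t-distribution
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        Gcrit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
        if G <= Gcrit:
            break
        removed.append(x.pop(i))
    return removed, x
```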

Journal ArticleDOI
TL;DR: In this article, a robust version of Neyman's optimal $C(\alpha)$ test is proposed for contamination neighborhoods, and the proposed robust test is shown to be asymptotically locally maximin.
Abstract: A robust version of Neyman's optimal $C(\alpha)$ test is proposed for contamination neighborhoods. The proposed robust test is shown to be asymptotically locally maximin among all asymptotic level $\alpha$ tests. Asymptotic efficiency of the test procedure at the ideal model is investigated. An outlier resistant version of Student's $t$-test is proposed.
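An outlier-resistant t-type test can be sketched with a Huber M-estimate of location and its asymptotic variance; this is in the spirit of, but not identical to, the paper's proposal:

```python
import numpy as np
from scipy import stats

def robust_t_test(x, mu0=0.0, c=1.345, n_iter=50):
    """One-sample t-type test based on a Huber M-estimate of location
    with a sandwich-form asymptotic variance: outliers are bounded by
    the psi-function instead of entering the mean and SD directly."""
    x = np.asarray(x, float)
    s = stats.median_abs_deviation(x, scale='normal')   # robust scale
    m = np.median(x)
    for _ in range(n_iter):                             # IRLS for the M-estimate
        r = (x - m) / s
        w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))
        m = np.sum(w * x) / np.sum(w)
    psi = np.clip((x - m) / s, -c, c)
    n = len(x)
    # Var(m) ~ s^2 * E[psi^2] / E[psi']^2 / n
    se = s * np.sqrt(np.mean(psi**2)) / np.mean(np.abs((x - m) / s) <= c) / np.sqrt(n)
    t = (m - mu0) / se
    return t, 2 * stats.norm.sf(abs(t))                 # statistic, two-sided p-value
```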

Journal ArticleDOI
TL;DR: The third Bonferroni (upper) bound, which, although conservative, is expensive to calculate, agrees with the second bound to at least four decimal places for all factor combinations considered in this paper.
Abstract: Accurate bounds are presented for the fractiles of the maximum normed residual (which is often used to test for a single outlier) for two- and three-way layouts. It is shown that the second Bonferroni bound of the critical value, while not conservative, is an excellent approximation to the critical value, being much more accurate than the first Bonferroni upper bound. The third Bonferroni (upper) bound, which, although conservative, is expensive to calculate, agrees with the second bound to at least four decimal places for all factor combinations considered in this paper.

Journal ArticleDOI
TL;DR: It is concluded that any automated data-processing method should contain an outlier rejection facility, but its results should be treated with caution.
Abstract: Outlying points often appear in standard curves for radioimmunoassay. We have examined the effect of outlying standard points on the ability of various automated data-processing routines, and of manual operators, to position a radioimmunoassay standard curve correctly. Manual operators were found to be highly subjective in their handling of a standard curve containing outlying points. Automated methods without outlier rejection capability produced standard curves that were significantly erroneous. In contrast, a data-processing method with automated outlier rejection capability successfully identified outliers, but occasionally rejected valid points--and consequently misplaced the standard curve. Visual identification of outliers is unsatisfactory. Automated identification can be more satisfactory, but some patterns of outliers make it less so. We conclude that any automated data-processing method should contain an outlier rejection facility, but its results should be treated with caution.
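The automated fit-reject-refit cycle the paper evaluates can be caricatured as follows (a generic sketch; `fit` is any user-supplied curve-fitting routine, not a specific radioimmunoassay package):

```python
import numpy as np

def fit_with_rejection(x, y, fit, k=3.0, max_rounds=5):
    """Fit a standard curve with automated outlier rejection: fit, drop
    points whose residuals exceed k robust standard deviations, refit.
    fit(x, y) must return a callable curve. Note the failure mode the
    paper observes: valid points can also be rejected."""
    keep = np.ones(len(x), dtype=bool)
    for _ in range(max_rounds):
        curve = fit(x[keep], y[keep])
        r = y - curve(x)
        s = 1.4826 * np.median(np.abs(r[keep]))     # MAD-based residual scale
        new_keep = np.abs(r) <= k * s
        if (new_keep == keep).all():
            break
        keep = new_keep
    return fit(x[keep], y[keep]), keep
```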

Journal ArticleDOI
TL;DR: In this article, a particular form of the three-parameter log-normal distribution was derived by utilizing a variate transform of the annual peak-flow series for fitting to a histogram.

01 Jul 1981
TL;DR: The results from a cooperative study initiated by ASTM E-19 to evaluate the quality of data to be expected from the current practice of liquid chromatography are discussed in this article.
Abstract: Results from a cooperative study initiated by ASTM E-19 to evaluate the quality of data to be expected from current practice of liquid chromatography are discussed. Seventy-eight laboratories participated in the program, using reversed-phase columns to analyze two samples, each in triplicate. The first sample was an easily separated four-component mixture; the second was a more complex six-component mixture. Mean values of the analytical data submitted were consistent with the known concentrations of the components in each sample, indicating a highly satisfactory degree of overall accuracy. However, the spread of data, expressed as percent relative standard deviation, revealed analytical problems for some laboratories. Relative standard deviations for the whole data set ranged from 6 to 11% in the first sample. The problems were more serious in the more complex sample, with relative standard deviations ranging from 9 to 16%. Removal of data from outlier laboratories using a multivariate analysis of the data reduced relative standard deviations in the first sample to a range of 3 to 5% and in the second sample to a range of 3 to 8%. This latter information was representative of the performance of about 90% of the participating laboratories.

01 Dec 1981
TL;DR: Outlier detection techniques when interest is on bivariate observations are presented and the techniques employed are derived from three different areas: the influence function for the correlation coefficient, regression diagnostics, and principal component tests.
Abstract: Outlier detection techniques when interest is on bivariate observations are presented. The techniques employed are derived from three different areas: (1) the influence function for the correlation coefficient, (2) regression diagnostics, and (3) principal component tests. The techniques are described briefly and references for further study are given. Simulations under the null case of no outliers yield percentage points of the sampling distribution of the statistics involved for test of hypothesis purposes. Additional simulations were performed modeling different types of outlier situations to enable comparison of the techniques. A computer code to calculate the statistics and present results is also included.
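One of the three approaches, the principal-component check, is compact enough to sketch: observations with extreme scores on the smallest-variance component depart from the linear structure of the bulk of the data (function name is ours):

```python
import numpy as np

def minor_pc_scores(x, y):
    """Standardized scores on the smallest-variance principal component
    of a bivariate sample, for use as an outlier-detection statistic."""
    Z = np.column_stack([x - np.mean(x), y - np.mean(y)])
    vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    return (Z @ vecs[:, 0]) / np.sqrt(vals[0])   # eigh sorts eigenvalues ascending
```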

Journal ArticleDOI
TL;DR: The usual assumptions in regression analysis are: the setup is true; the random errors are uncorrelated (and normally distributed); and the data contain no outliers as mentioned in this paper.
Abstract: The usual assumptions in regression analysis are: the setup is true; the random errors are uncorrelated (and normally distributed); and the data contain no outliers. Going beyond known recommendations to check these assumptions by inspection of the residuals, new proposals are discussed and illustrated by an example.

Journal ArticleDOI
TL;DR: When a random sample of size n from a standard normal population is contaminated by a normal outlier with either a different mean or a different variance, the maximum entropy median outperforms the sample median in terms of squared error loss as mentioned in this paper.