
Showing papers on "Outlier published in 1991"


Journal ArticleDOI
TL;DR: In this paper, critical values at the 95% confidence level for the two-tailed Q test, and related tests based upon subrange ratios, for the statistical rejection of outlying data have been interpolated by applying cubic regression analysis to the values originally published by Dixon.
Abstract: Critical values at the 95% confidence level for the two-tailed Q test, and related tests based upon subrange ratios, for the statistical rejection of outlying data have been interpolated by applying cubic regression analysis to the values originally published by Dixon. Corrections to errors in Dixon's original tables are also included. It is recommended that the newly generated 95% critical values be adopted by analytical chemists as the general standard for the rejection of outlier values.

409 citations
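As a concrete illustration of the test the paper tabulates, here is a minimal Python sketch of the two-tailed Q test for a single suspect value. The critical values used below are the commonly cited 95% two-tailed values for n = 3 to 10 and should be checked against the corrected tables in the paper itself; the function name is ours.

```python
# A minimal sketch of the two-tailed Dixon Q test (r10 ratio) at the 95%
# confidence level. Critical values are the commonly cited ones for
# n = 3..10; verify against Rorabacher's corrected tables before relying on them.

Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

def dixon_q_test(values):
    """Test the most extreme value in a small sample for rejection."""
    x = sorted(values)
    n = len(x)
    if n not in Q_CRIT_95:
        raise ValueError("table covers n = 3..10 only")
    data_range = x[-1] - x[0]
    if data_range == 0:
        return None, False
    gap_low = x[1] - x[0]        # gap if the suspect point is the minimum
    gap_high = x[-1] - x[-2]     # gap if the suspect point is the maximum
    if gap_high >= gap_low:
        q, suspect = gap_high / data_range, x[-1]
    else:
        q, suspect = gap_low / data_range, x[0]
    return suspect, q > Q_CRIT_95[n]

# Example: the high reading 0.410 is tested against the rest and rejected.
suspect, reject = dixon_q_test([0.189, 0.167, 0.187, 0.183, 0.186, 0.410])
print(suspect, reject)
```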


Journal ArticleDOI
TL;DR: Simulations show that there is substantial differential bias when comparing conditions with fewer than 10 observations against conditions with more than 20, and that with strongly skewed distributions and a cutoff of 3.0 standard deviations, differential bias can influence comparisons of conditions with even more observations.
Abstract: To remove the influence of spuriously long response times, many investigators compute “restricted means”, obtained by throwing out any response time more than 2.0, 2.5, or 3.0 standard deviations from the overall sample average. Because reaction time distributions are skewed, however, the computation of restricted means introduces a bias: the restricted mean underestimates the true average of the population of response times. This problem may be very serious when investigators compare restricted means across conditions with different numbers of observations, because the bias increases with sample size. Simulations show that there is substantial differential bias when comparing conditions with fewer than 10 observations against conditions with more than 20. With strongly skewed distributions and a cutoff of 3.0 standard deviations, differential bias can influence comparisons of conditions with even more observations.

328 citations
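A short simulation makes the bias mechanism easy to see. The lognormal stand-in for a skewed response time distribution and the specific cutoff are illustrative assumptions, not the paper's exact simulation design.

```python
# A small simulation, assuming a lognormal stand-in for a skewed response
# time distribution, illustrating that the "restricted mean" (mean after
# discarding points beyond k sample SDs) is biased downward and that the
# bias grows with sample size, so unequal-n comparisons are differentially biased.
import numpy as np

rng = np.random.default_rng(0)

def restricted_mean(x, k=3.0):
    m, s = x.mean(), x.std(ddof=1)
    return x[np.abs(x - m) <= k * s].mean()

true_mean = np.exp(0.5)                   # mean of lognormal(0, 1)
for n in (10, 20, 50, 200):
    est = [restricted_mean(rng.lognormal(0.0, 1.0, n)) for _ in range(5000)]
    print(f"n={n:4d}  bias of restricted mean: {np.mean(est) - true_mean:+.4f}")
```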


Journal ArticleDOI
TL;DR: It is shown that the Gibbs sampler brings considerable conceptual and computational simplicity to the problem of calculating posterior marginals and is notable for its ease of implementation.
Abstract: We consider the Bayesian analysis of outlier models. We show that the Gibbs sampler brings considerable conceptual and computational simplicity to the problem of calculating posterior marginals. Although other techniques for finding posterior marginals are available, the Gibbs sampling approach is notable for its ease of implementation. Allowing the probability of an outlier to be unknown introduces an extra parameter into the model but this turns out to involve only minor modification to the algorithm. We illustrate these ideas using a contaminated Gaussian distribution, a t-distribution, a contaminated binomial model and logistic regression.

117 citations
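For the contaminated Gaussian case the sampler reduces to a few conjugate draws per sweep. Below is a minimal sketch under simplifying assumptions not made in the paper: the baseline variance s2 and the inflation factor k2 are treated as known, mu has a flat prior, and the unknown outlier probability p has a Beta(1, 9) prior.

```python
# A minimal Gibbs-sampler sketch for the contaminated Gaussian model
#   y_i ~ (1 - p) N(mu, s2) + p N(mu, k2 * s2),
# with s2 and k2 known (an assumption made here for brevity), a flat prior
# on mu, and a Beta(1, 9) prior on the unknown outlier probability p.
import numpy as np

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 48), [8.0, -9.5]])  # two planted outliers
n, s2, k2 = len(y), 1.0, 100.0

mu, p = y.mean(), 0.1
z_sum = np.zeros(n)                       # running count: P(point is outlier)
n_iter, burn = 4000, 1000
for t in range(n_iter):
    # 1. Sample outlier indicators z_i given mu and p.
    d2 = (y - mu) ** 2
    w_out = p * np.exp(-d2 / (2 * k2 * s2)) / np.sqrt(k2 * s2)
    w_in = (1 - p) * np.exp(-d2 / (2 * s2)) / np.sqrt(s2)
    z = rng.random(n) < w_out / (w_out + w_in)
    # 2. Sample mu given z (precision-weighted Gaussian update).
    prec = np.where(z, 1 / (k2 * s2), 1 / s2)
    mu = rng.normal((prec * y).sum() / prec.sum(), 1 / np.sqrt(prec.sum()))
    # 3. Sample p given z (Beta-Binomial conjugacy).
    p = rng.beta(1 + z.sum(), 9 + n - z.sum())
    if t >= burn:
        z_sum += z
print("posterior outlier probabilities:", np.round(z_sum / (n_iter - burn), 2))
```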


Journal ArticleDOI
TL;DR: In this article, the authors illustrate the effect of outliers on classical statistics, such as the sample average, and use robust techniques to identify the outliers in univariate and multivariate data.
Abstract: In this tutorial we first illustrate the effect of outliers on classical statistics such as the sample average. This motivates the use of robust techniques. For univariate data the sample median is a robust estimator of location, and the dispersion can also be estimated robustly. The resulting ‘z-scores’ are well suited to detect outliers. The sample median can be generalized to very large data sets, which is useful for robust ‘averaging’ of curves or images. For multivariate data a robust regression procedure is described. Its standardized residuals allow us to identify the outliers. Finally, a survey of related approaches is given. (This review overlaps with earlier work by the same author, which appeared elsewhere.)

107 citations
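The robust 'z-scores' mentioned in the abstract can be computed from the sample median and the median absolute deviation (MAD); a minimal sketch:

```python
# Robust 'z-scores' based on the median and the median absolute deviation
# (MAD) rather than the mean and standard deviation; the factor 1.4826
# rescales the MAD to be consistent at the Gaussian.
import numpy as np

def robust_z(x):
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))
    return (x - med) / mad

x = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0])
print(np.round(robust_z(x), 1))   # the last point stands out clearly
```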


Journal ArticleDOI
TL;DR: This article examined the empirical bias and efficiency of Michael Parkinson's extreme value variance estimator for common stocks using an extensive NYSE/AMEX data base and found that the efficiency of the extreme value estimator significantly exceeds that of the close-close estimator.
Abstract: This article examines the empirical bias and efficiency of Michael Parkinson's extreme-value variance estimator for common stocks using an extensive NYSE/AMEX data base. Bias and efficiency are analyzed as a function of stock price level and trading volume. The results are sensitive to outliers in daily high and low prices. After an outlier screen is applied to the data, the efficiency of the extreme-value estimator significantly exceeds that of the close-close estimator for most price and volume groups. Copyright 1991 by University of Chicago Press.

87 citations
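Parkinson's estimator uses only daily highs and lows: the per-day variance estimate is (ln(H/L))^2 / (4 ln 2). The sketch below pairs it with the close-to-close estimator; the outlier screen shown is an illustrative placeholder, not the article's screening rule.

```python
# Parkinson's extreme-value variance estimator next to the usual
# close-to-close estimator. The screen on per-day range estimates is a
# stand-in for the article's outlier screen, not a reproduction of it.
import numpy as np

def parkinson_var(high, low, screen_z=4.0):
    x = np.log(high / low) ** 2 / (4 * np.log(2))          # per-day variance estimates
    keep = np.abs(x - np.median(x)) <= screen_z * x.std()  # drop wild high/low days
    return x[keep].mean()

def close_close_var(close):
    r = np.diff(np.log(close))
    return r.var(ddof=1)
```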


Book ChapterDOI
01 Jan 1991
TL;DR: In this paper, the small sample behavior of the robust distances is studied by means of simulation, and a projection-type algorithm is considered to overcome the computational complexity of the resampling algorithm.
Abstract: It is possible to detect outliers in multivariate point clouds by computing distances based on robust estimates of location and scale. It has been suggested to use the Minimum Volume Ellipsoid estimator, which can be computed using a resampling algorithm. In this paper the small sample behavior of the robust distances is studied by means of simulation. We obtain a correction factor yielding approximately correct coverage percentages for the corresponding ellipsoids. In addition, a projection-type algorithm is considered to overcome the computational complexity of the resampling algorithm. Advantages and disadvantages of the second algorithm are discussed.

83 citations
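The resampling idea can be sketched compactly: fit many random small subsets, inflate each candidate ellipsoid just enough to cover half the data, keep the smallest one, and compute robust distances from it. The chi-square cutoff and the consistency scaling below are standard choices; the small-sample correction factor derived in the chapter is deliberately omitted.

```python
# A resampling sketch in the spirit of the Minimum Volume Ellipsoid (MVE).
# The small-sample correction factor studied in the chapter is omitted.
import numpy as np
from scipy import stats

def mve_distances(X, n_trials=500, seed=2):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = (n + p + 1) // 2
    best_vol, best = np.inf, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=p + 1, replace=False)
        loc, cov = X[idx].mean(axis=0), np.cov(X[idx].T)
        if np.linalg.det(cov) <= 0:
            continue
        diff = X - loc
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        m2 = np.sort(d2)[h - 1]                     # inflate to cover h points
        vol = np.sqrt(np.linalg.det(cov) * m2 ** p)
        if vol < best_vol:
            # rescale so the h-th distance matches the chi-square median
            best_vol, best = vol, d2 * stats.chi2.ppf(0.5, p) / m2
    return best

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(50, 2)), [[6.0, 6.0], [7.0, -5.0]]])
d2 = mve_distances(X)
print(np.where(d2 > stats.chi2.ppf(0.975, df=2))[0])   # flags the planted rows
```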


Journal ArticleDOI
TL;DR: In this paper, the authors recommend the DFBETAS regression diagnostic statistic for crop yield trend estimation in the United States where an influential end-of-series outlier is suspected.
Abstract: Although ordinary least squares is not efficient when errors are not distributed normally, it generates better crop yield trend coefficient estimates than six alternative robust regression methods. This is because of the econometric properties of an uninterrupted series independent variable as well as the level of skewness typical of corn yields. The evaluation covers actual farm-level corn yield series as well as a set of "contaminated" data series and one thousand sets of Monte Carlo yield series. Where an influential end-of-series outlier is suspected, the DFBETAS regression diagnostic statistic is recommended.

53 citations


Journal ArticleDOI
TL;DR: In this article, a generalized extreme studentized residual (GESR) procedure was proposed to detect multiple y outliers in linear regression, and the performance of this procedure was compared with others by Monte Carlo techniques and found to be superior.
Abstract: This article is concerned with procedures for detecting multiple y outliers in linear regression. A generalized extreme studentized residual (GESR) procedure, which controls the type I error rate, is developed. An approximate formula to calculate the percentiles is given for large samples, and more accurate percentiles for n ≤ 25 are tabulated. The performance of this procedure is compared with others by Monte Carlo techniques and found to be superior. The procedure, however, fails in detecting y outliers that occur at high-leverage cases. For this, a two-phase procedure is suggested. In phase 1, a set of suspect observations is identified by GESR and one of the diagnostics applied sequentially. In phase 2, a backward testing is conducted using the GESR procedure to see which of the suspect cases are outliers. Several examples are analyzed.

46 citations
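The sequential flavor of the procedure can be sketched as below: repeatedly delete the observation with the largest externally studentized residual until the statistic falls below a cutoff. The cutoff here is a crude Bonferroni t-approximation standing in for the GESR percentiles tabulated in the article, and the function name is ours.

```python
# A simplified sequential scan for multiple y-outliers in linear regression.
# The cutoff is a Bonferroni t-approximation, not the article's percentiles;
# k_max caps the number of suspects examined.
import numpy as np
from scipy import stats

def sequential_outlier_scan(X, y, k_max=3, alpha=0.05):
    keep, flagged = np.arange(len(y)), []
    for _ in range(k_max):
        Xk, yk = X[keep], y[keep]
        n, p = Xk.shape
        beta = np.linalg.lstsq(Xk, yk, rcond=None)[0]
        lev = np.diag(Xk @ np.linalg.pinv(Xk.T @ Xk) @ Xk.T)  # leverages
        e = yk - Xk @ beta
        s2 = e @ e / (n - p)
        r = e / np.sqrt(s2 * (1 - lev))                    # internally studentized
        t = r * np.sqrt((n - p - 1) / (n - p - r ** 2))    # externally studentized
        i = int(np.argmax(np.abs(t)))
        if np.abs(t[i]) <= stats.t.ppf(1 - alpha / (2 * n), df=n - p - 1):
            break
        flagged.append(int(keep[i]))
        keep = np.delete(keep, i)
    return flagged
```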


Journal ArticleDOI
TL;DR: In this article, the problem of estimating missing observations in an infinite realization of a linear, possibly nonstationary, stochastic process when the model is known is addressed, and analytical expressions for the optimal estimators and their associated mean squared errors are obtained.
Abstract: The paper addresses the problem of estimating missing observations in an infinite realization of a linear, possibly nonstationary, stochastic process when the model is known. The general case of any possible distribution of missing observations in the time series is considered, and analytical expressions for the optimal estimators and their associated mean squared errors are obtained. These expressions involve solely the elements of the inverse or dual autocorrelation function of the series. This optimal estimator, the conditional expectation of the missing observations given the available ones, is equal to the estimator that results from filling the missing values in the series with arbitrary numbers, treating these numbers as additive outliers, and removing with intervention analysis the outlier effects from the invented numbers.

38 citations
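In the simplest AR(1) case the general expressions reduce to a familiar closed form: the conditional expectation of an interior missing value given the rest of the series is phi (x_{t-1} + x_{t+1}) / (1 + phi^2), and the coefficient is the negative of the lag-one inverse autocorrelation. A toy sketch:

```python
# A toy sketch for a stationary AR(1) process x_t = phi * x_{t-1} + a_t with
# one interior missing value: the optimal interpolator is
#   E[x_t | rest] = phi * (x_prev + x_next) / (1 + phi**2),
# whose coefficient equals minus the lag-1 inverse autocorrelation.

def ar1_interpolate(x_prev, x_next, phi):
    return phi * (x_prev + x_next) / (1 + phi ** 2)

print(ar1_interpolate(1.0, 0.2, phi=0.8))
```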


Journal ArticleDOI
TL;DR: In this paper, the influence of extreme observations on linear measurement error model estimators is investigated, and it is shown that these influences act in directions orthogonal to and along the fitted plane, rather than along the response axis and the predictor plane as with least squares estimators.
Abstract: SUMMARY Influence functions are used to show that extreme observations affect linear measurement error model estimators in directions orthogonal to and along the fitted plane, rather than vertically and horizontally as with least squares estimators. Influence diagnostics patterned after those for least squares and derived for generalized linear models by Pregibon (1981) are adapted to assess the influence of extreme observations on linear measurement error model estimators. Perturbation arguments and asymptotic theory are used to support the use of these diagnostics. It is well known that least squares estimators are influenced by extreme observations in the response and predictor variables. Geometrically, these influences are in directions defined by the axis of the response variable and the plane defined by the predictors. As shown in §3, measurement error model estimators are affected in directions that are perpendicular to and along the fitted plane. Thus, least squares influence diagnostics that measure the effects of observations in the vertical and the horizontal directions may fail to accurately indicate the true effects of extreme observations on measurement error model estimators. Kelly (1984) derives an influence function for a structural measurement error model in which the error distribution has finite fourth moments and the error covariance matrix is known up to a multiple. With the aid of this influence function, diagnostics such as Cook's D are defined by analogy to the corresponding least squares equivalents. Fuller (1987) defines a measurement error model estimator of the hat matrix using estimated predictor variable values. He also proposes using diagnostics and plotting techniques analogous to those of least squares with estimated model errors and estimated leverage values from the hat matrix. Miller (1989) provides partial theoretical justification for the use of these least-squares-based techniques by showing that the estimated model errors asymptotically are a Gaussian process. Apart from these efforts, little research has been conducted on influence diagnostics for measurement error models. The derivation of influence diagnostics for measurement error models in this paper proceeds as follows. Measurement error model estimators are briefly described in §2. A more comprehensive discussion of these estimators is given by Fuller (1987, Ch. 2). In §3, influence function arguments are used to establish the need for measurement error model diagnostics. Measurement error model estimators are expressed as iteratively reweighted least squares estimators in §4, leading to the adaptation of diagnostics for

37 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed residual plots where realized errors are represented by interval estimates, and they extended the same concept to other models, including regression models for lifetime data, by using an approach similar to that of Cox & Snell (1968).
Abstract: SUMMARY Use of the posterior distribution of the realized error terms for residual analysis in a linear model was advocated by Zellner (1975) and Zellner & Moulton (1985). The same idea was used by Chaloner & Brant (1988) to define outliers and calculate posterior probabilities of observations being outliers. This paper extends the same concept to other models, including regression models for lifetime data, by using an approach similar to that of Cox & Snell (1968). Residual plots are proposed where realized errors are represented by interval estimates. Incorporating censored observations into this framework is straightforward. 1. THE REALIZED ERROR TERMS. Cox & Snell (1968) proposed a general definition of residuals for models where each observation y_i can be written as y_i = g_i(θ, ε_i), with θ a vector of unknown parameters and the ε_i, for i = 1, ..., n, a sample of independent and identically distributed random variables from a known distribution. Suppose that the equation y_i = g_i(θ, ε_i) has a unique solution ε_i = h_i(y_i, θ). Cox & Snell define ε̂_i = h_i(y_i, θ̂) to be the residuals, where θ̂ is the maximum likelihood estimate of θ. In the Bayesian approach used here, define ε_i = h_i(y_i, θ), for i = 1, ..., n, to be the residuals. Each ε_i is just a function of the unknown parameters, and the posterior distribution is therefore straightforward to calculate. The posterior distribution can be examined for indications of possible departures from the assumed model and the presence of outliers. The posterior distribution of the realized errors is very different from the sampling distribution of their estimates. The posterior distribution represents the uncertainty about functions of the parameters; if the parameters θ and the observations y_i are known, then so are the realized errors ε_i = h_i(y_i, θ). With a large sample size, the posterior distribution of the ε_i will be approximately multivariate normal, centred at the posterior mean and with covariance matrix the posterior covariance matrix. An alternative approximation to the posterior mean of the realized
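For the ordinary normal linear model the idea is easy to sketch: draw from the standard conjugate posterior of the coefficients and error variance under a flat prior, and summarize the induced posterior of each realized error by an interval. This is a minimal illustration of the concept, not the paper's lifetime-data extension; the function name is ours.

```python
# Interval estimates for 'realized errors' e_i = y_i - x_i' beta in a normal
# linear model, using the conjugate posterior under a flat prior:
#   sigma2 | y ~ Inv-chi-square(n - p, s2),  beta | sigma2, y ~ N(beta_hat, sigma2 (X'X)^-1).
import numpy as np

def realized_error_intervals(X, y, n_draws=4000, seed=6):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)
    draws = np.empty((n_draws, n))
    for k in range(n_draws):
        sigma2 = (n - p) * s2 / rng.chisquare(n - p)      # scaled inverse-chi-square
        beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
        draws[k] = y - X @ beta                           # realized errors
    return np.percentile(draws, [2.5, 97.5], axis=0)      # 95% intervals

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([0.0, 1.0]) + rng.normal(size=30)
lo, hi = realized_error_intervals(X, y)
# intervals far from zero relative to the error scale flag potential outliers
```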

01 Jun 1991
TL;DR: This study aims to summarise the most relevant and recent literature with respect to outlier detection for time series and missing value estimation in traffic count data and transfer recently developed methods for dealing with outliers in general time series into a transport context.
Abstract: As part of a SERC-funded project, this study aims to summarise the most relevant and recent literature with respect to outlier detection for time series and missing value estimation in traffic count data. Many types of transport data are collected over time and are potentially suited to the application of time series analysis techniques, including accident data, ticket sales and traffic counts. Missing data or outliers in traffic counts can cause problems when analysing the data, for example in order to produce forecasts. At present it seems that little work has been undertaken to assess the merits of alternative methods to treat such data or to develop a more analytic approach. Here we intend to review current practices in the transport field and summarise more general time series techniques for handling outlying or missing data. The literature study forms the first stage of a research project aiming to establish the applicability of time series and other techniques in estimating missing values and outlier detection/replacement in a variety of transport data. Missing data and outliers can occur for a variety of reasons, for example the breakdown of automatic counters. Initial enquiries suggest that methods for patching such data can be crude. Local authorities are to be approached individually using a short questionnaire enquiry form in order to attempt to ascertain their current practices. Having reviewed current practices, the project aims to transfer recently developed methods for dealing with outliers in general time series into a transport context. It is anticipated that comparisons between possible methods could highlight an alternative and more analytical approach to current practices. A description of the main methods for detecting outliers in time series is given within the first section. In the second section practical applications of Box-Jenkins methods within a transport context are given. Current practices for dealing with outlying and missing data within transport are discussed in section three. Recommendations for methods to be used in our current research are followed by the appendices containing most of the mathematical detail.

Journal ArticleDOI
TL;DR: Concepts developed for the analysis of unbalanced data can help to suggest robust methods of analysis for balanced data.
Abstract: When balanced data are analyzed by a robust method, different weights are applied to data points depending on their apparent status as outliers. This results in the effective unbalancing of the data. Concepts developed for the analysis of unbalanced data can help to suggest robust methods of analysis for balanced data.

Journal ArticleDOI
TL;DR: In this article, the authors introduce new criteria for evaluating test statistics based on the $p$-values of the statistics, and present a constructive approach to finding the optimal statistic.
Abstract: We introduce new criteria for evaluating test statistics based on the p-values of the statistics. Given a set of test statistics, a good statistic is one which is robust in being reasonably sensitive to all departures from the null implied by that set. We present a constructive approach to finding the optimal statistic. We apply the criteria to two-sided problems; combining independent tests; testing that the mean of a spherical normal distribution is 0, and extensions to other spherically symmetric and exponential distributions; Bartlett's problem of testing the equality of several normal variances; and testing for one outlier in a normal linear model. For the most part, the optimal statistic is quite easy to use. Often, but not always, it is the likelihood ratio statistic.

Journal ArticleDOI
TL;DR: Particular attention is paid to the implementation of the MLE, the threshold signal-to-noise ratio, probability of outlier, and high SNR mean-squared-error performance, which are evaluated and compared for uniform and nonuniform arrays.
Abstract: The analysis is conducted using the Cramer-Rao lower bound, simulations, and performance modeling of maximum-likelihood estimation (MLE), assuming a single source of illumination and additive white Gaussian noise. Particular attention is paid to the implementation of the MLE, the threshold signal-to-noise ratio (SNR), the probability of outlier, and the high-SNR mean-squared-error (MSE) performance, which are evaluated and compared for uniform and nonuniform arrays. The conditions under which tradeoffs exist in choosing a particular geometry, and their significance, are determined.

Journal ArticleDOI
TL;DR: In this paper, the authors studied the relationship between US gasoline prices, crude oil prices, and the stock of gasoline and found that the US gasoline price is mainly influenced by the price of crude oil.
Abstract: This paper studies the dynamic relationships between US gasoline prices, crude oil prices, and the stock of gasoline. Using monthly data between January 1973 and December 1987, we find that the US gasoline price is mainly influenced by the price of crude oil. The stock of gasoline has little or no influence on the price of gasoline during the period before the second energy crisis, and seems to have some influence during the period after. We also find that the dynamic relationship between the prices of gasoline and crude oil changes over time, shifting from a longer lag response to a shorter lag response. Box-Jenkins ARIMA and transfer function models are employed in this study. These models are estimated using estimation procedures with and without outlier adjustment. For model estimation with outlier adjustment, an iterative procedure for the joint estimation of model parameters and outlier effects is employed. The forecasting performance of these models is carefully examined. For the purpose of illustration, we also analyze these time series using classical white-noise regression models. The results show the importance of using appropriate time-series methods in modeling and forecasting when the data are serially correlated. This paper also demonstrates the problems of time-series modeling when outliers are present.

01 Nov 1991
TL;DR: This paper shows that using robust estimators in place of least squares corresponds to assuming that data are corrupted by Gaussian noise whose variance fluctuates according to some given probability distribution, which uniquely determines the estimator.
Abstract: Least squares estimators are very common in statistics, but they lead to results that are very sensitive to outliers, and it has been proposed to minimize other measures of error that lead to 'robust' estimates. In this paper we show that using these robust estimators corresponds to assuming that data are corrupted by Gaussian noise whose variance fluctuates according to some given probability distribution, which uniquely determines the estimator.
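In practice such estimators are often computed by iteratively reweighted least squares, where the latent per-point noise variance reappears as a weight. A minimal sketch, assuming a Huber-type loss (one concrete robust estimator, not the paper's general construction):

```python
# Iteratively reweighted least squares (IRLS) with Huber weights. The latent
# per-point variance of the paper's Gaussian-mixture view shows up here as
# the reciprocal weight 1/w_i.
import numpy as np

def irls_huber(X, y, c=1.345, n_iter=50):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r)))   # robust scale (MAD)
        if s == 0:
            break
        u = np.abs(r / s)
        w = np.minimum(1.0, c / np.maximum(u, 1e-12))      # Huber weights psi(u)/u
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
    return beta

rng = np.random.default_rng(7)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=40)
y[:4] += 15.0                       # gross outliers
print(irls_huber(X, y))             # close to (1, 2) despite the outliers
```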

Journal ArticleDOI
TL;DR: Monte Carlo methods are used to evaluate the performance of the outlier technique as parameters of the true mortality process are varied and indicate that the screening ability of the technique may be very sensitive to how widespread quality-related mortality is among hospitals but insensitive to other factors generally thought to be important.
Abstract: Researchers have proposed that hospitals with excessive statistically unexplained mortality rates are more likely to have quality-of-care problems. The U.S. Health Care Financing Administration currently uses this statistical “outlier” approach to screen for poor quality in hospitals. Little is known, however, about the validity of this technique, since direct measures of quality are difficult to obtain. We use Monte Carlo methods to evaluate the performance of the outlier technique as parameters of the true mortality process are varied. Results indicate that the screening ability of the technique may be very sensitive to how widespread quality-related mortality is among hospitals but insensitive to other factors generally thought to be important.

Book ChapterDOI
01 Jan 1991
TL;DR: It is proved that the approximate algorithm based on p-subsets shares the equivariance and good breakdown properties of the exact estimator, and the same result is also valid for other high-breakdown-point estimators.
Abstract: Regression techniques with high breakdown point can withstand a substantial amount of outliers in the data. One such method is the least trimmed squares estimator. Unfortunately, its exact computation is quite difficult because the objective function may have a large number of local minima. Therefore, we have been using an approximate algorithm based on p-subsets. In this paper we prove that the algorithm shares the equivariance and good breakdown properties of the exact estimator. The same result is also valid for other high-breakdown-point estimators. Finally, the special case of one-dimensional location is discussed separately because of unexpected results concerning half samples.
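The p-subset algorithm itself is short to state: fit exactly through many random p-point subsets and keep the candidate that minimizes the sum of the h smallest squared residuals. A minimal sketch (trial count and coverage choice are illustrative):

```python
# The p-subset approximation to least trimmed squares (LTS): exact fits to
# random p-point subsets, scored by the sum of the h smallest squared residuals.
import numpy as np

def lts_p_subset(X, y, n_trials=1000, seed=4):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = (n + p + 1) // 2                         # coverage: just over half
    best_obj, best_beta = np.inf, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])   # exact fit to p points
        except np.linalg.LinAlgError:
            continue
        r2 = np.sort((y - X @ beta) ** 2)
        obj = r2[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

# beta_lts = lts_p_subset(X, y)   # compare with np.linalg.lstsq(X, y)[0]
```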

Journal ArticleDOI
TL;DR: Statistics that measure the influence of each observation on the parameter estimates and on the forecasts are introduced and seem to be useful to identify important events, such as additive outliers and trend shifts, in time series data.
Abstract: This article presents a methodology for building measures of influence in regression models with time series data. We introduce statistics that measure the influence of each observation on the parameter estimates and on the forecasts. These statistics take into account the autocorrelation of the sample. They can be easily computed using standard time series software. Their performance is analyzed in an example in which they are shown to be useful in identifying important events, such as additive outliers and trend shifts.

Journal ArticleDOI
TL;DR: Vankeerberghen et al., as discussed by the authors, applied robust statistical procedures to the analysis of chemical data, such as robust estimators of central location and dispersion, hypothesis tests using ranking and randomization methods, and robust regression based on ranking.

Journal ArticleDOI
TL;DR: A bound on the bias due to outliers is established and used to define a new policy for optimal experimental design aimed at providing greater protection against outliers than conventional D-optimal design.

Journal ArticleDOI
TL;DR: An iterative procedure for the joint estimation of model parameters and outlier effects is employed with the intervention analysis and it is found that this joint estimation procedure not only produces more reliable estimates of intervention effects, but also provides information on outliers, which is valuable in many respects.
Abstract: Time series analysis, particularly intervention analysis, is commonly employed in impact studies of environmental data. Environmental time series are susceptible to exogenous variations and often contain various types of outliers. Outliers, depending upon the time of their occurrences and nature, can have substantial impact on the estimates of intervention effects and their test statistics. Hence, outlier detection and adjustment should be an indispensable part of an intervention analysis. In this paper, an iterative procedure for the joint estimation of model parameters and outlier effects is employed with the intervention analysis. We find that this joint estimation procedure not only produces more reliable estimates of intervention effects, but also provides information on outliers, which is valuable in many respects. As a special case of outlier adjustment, this joint estimation procedure can also be used to estimate the values of missing data in a time series. Two data sets are used to illus...

Journal ArticleDOI
TL;DR: In this article, the permutation distribution of the trimmed mean in matched-pairs designs is studied and a polynomial-time algorithm for this inherently exponential-time problem is presented.
Abstract: The permutation distribution of the trimmed mean in matched-pairs designs is studied. This statistic is important since it is robust to both outliers and distributional assumptions. A polynomial-time algorithm for this inherently exponential-time problem is presented. The characteristic function of the distribution provides both the basis for the algorithm and a vehicle for calculating the asymptotic permutation distribution.
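Under the null in a matched-pairs design the signs of the paired differences are exchangeable, so the permutation distribution is over sign flips. The article computes this distribution exactly in polynomial time via its characteristic function; the sketch below only approximates it by Monte Carlo sampling of sign flips.

```python
# A Monte Carlo stand-in for the permutation distribution of the trimmed mean
# in a matched-pairs design (random sign flips of the paired differences).
import numpy as np
from scipy import stats

def trimmed_mean_perm_test(diffs, trim=0.2, n_perm=20000, seed=5):
    rng = np.random.default_rng(seed)
    obs = stats.trim_mean(diffs, trim)
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(diffs)))
    null = np.array([stats.trim_mean(row * diffs, trim) for row in signs])
    return float(np.mean(np.abs(null) >= abs(obs)))        # two-sided p-value

d = np.array([1.2, 0.4, 0.8, -0.3, 1.5, 0.9, 0.2, 12.0])  # one wild pair
print(trimmed_mean_perm_test(d))
```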

Journal ArticleDOI
TL;DR: Based on analyses of diagnosis-specific hospital discharge rates in Michigan, it is shown that a Poisson model with an extra variance component for the systematic variation is superior to several other probability models with regard to specification of the error structure.
Abstract: We consider methods for selecting the joint specification of the mean and variance functions in statistical models for rates or counts. Based on analyses of diagnosis-specific hospital discharge rates in Michigan, we show that a Poisson model with an extra variance component for the systematic variation is superior to several other probability models with regard to specification of the error structure. Further, the deviance residual appears superior to the Pearson residual. The proper specification of such variation is crucial for many types of analyses, such as identification of outliers and regression analyses designed to explain the systematic component of the variation.
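For Poisson counts with fitted means, the two residuals compared in the paper are easy to state side by side. A minimal sketch; the extra variance component for systematic variation that the paper recommends is not modeled here.

```python
# Pearson versus deviance residuals for Poisson counts y_i with fitted means mu_i.
import numpy as np

def poisson_residuals(y, mu):
    pearson = (y - mu) / np.sqrt(mu)
    # y * log(y / mu) is taken as 0 when y == 0
    term = np.where(y > 0, y * np.log(np.where(y > 0, y / mu, 1.0)), 0.0)
    deviance = np.sign(y - mu) * np.sqrt(2 * (term - (y - mu)))
    return pearson, deviance

y = np.array([0, 3, 7, 15])
mu = np.array([1.2, 3.5, 6.0, 8.0])
print(poisson_residuals(y, mu))
```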

01 Jan 1991
TL;DR: In this paper, the authors address the monumental challenge of developing exploratory analysis methods for large data sets and propose simple graphical methods that address some of the problems, such as finding outliers and tail structure, assessing central structure and comparing central structures.
Abstract: This report addresses the monumental challenge of developing exploratory analysis methods for large data sets. The goals of the report are to increase awareness of large data set problems and to contribute simple graphical methods that address some of the problems. The graphical methods focus on two- and three-dimensional data and common tasks such as finding outliers and tail structure, assessing central structure and comparing central structures. The methods handle large sample size problems through binning, incorporate information from statistical models and adapt image processing algorithms. Examples demonstrate the application of the methods to a variety of publicly available large data sets. The most novel application addresses the 'too many plots to examine' problem by using cognostics, computer guiding diagnostics, to prioritize plots. The particular application prioritizes views of computational fluid dynamics solution sets on the fly. That is, as each time step of a solution set is generated on a parallel processor, the cognostics algorithms assess virtual plots based on the previous time step. Work in such areas is in its infancy and the examples suggest numerous challenges that remain. 35 refs., 15 figs.

01 Jan 1991
TL;DR: The EM algorithm yields a simple identification procedure that facilitates maximum likelihood estimation for state-space models; applied to real data, it shows that the accuracy of the dynamic model is greatly improved by allowing for the possibility that the observed data contain outliers.
Abstract: An iterative identification method for a linear state-space model with outliers and/or missing data is proposed by applying the Expectation-Maximization (EM) algorithm. The EM algorithm yields a simple identification procedure to facilitate the maximum likelihood (ML) estimation for the state-space models. The missing data case is easily handled by the EM algorithm. The outliers are treated as missing data, and are detected by a maximum a posteriori (MAP) estimate of the occurrence of an outlier, which is modeled by a Bernoulli sequence; the EM algorithm is also applied to the MAP estimation. The fixed-interval smoothed estimate of the state vector is obtained simultaneously, since it is used for the parameter identification. The present algorithm is applied to real data to show that the accuracy of the dynamic model is greatly improved by introducing the possibility that the observed data contain outliers.


Book ChapterDOI
01 Jan 1991
TL;DR: In this article, the authors formulate the resulting four outlier types (mean-shift observation, variance-shift observation, mean-shift innovation, and variance-shift innovation outliers) as perturbation schemes, in the manner of Cook (1986).
Abstract: In the context of ARMA time series, two types of outlier have been treated in the literature. These are the observation (type I) and the innovation (type II) outlier. Commonly these outliers have been modelled as mean-shift outliers, but variance-shift outliers may also be considered. In this paper we formulate the resulting four outlier types (mean-shift observation outlier, variance-shift observation outlier, mean-shift innovation outlier, and variance-shift innovation outlier) as perturbation schemes, in the manner of Cook (1986). Then we propose three types of diagnostics for these perturbation schemes: residuals, which indicate the size of the perturbation in question; diagnostics for the potential influence of a perturbation; and diagnostics for the actual or observed influence of a perturbation.

Journal ArticleDOI
TL;DR: In this paper, a new family of robust estimators for ARMA models is defined by replacing the residual sample autocovariances in the least squares equations by autoc covariances based on ranks.
Abstract: In this paper we introduce a new family of robust estimators for ARMA models. These estimators are defined by replacing the residual sample autocovariances in the least squares equations by autocovariances based on ranks. The asymptotic normality of the proposed estimators is provided. The efficiency and robustness properties of these estimators are studied. An adequate choice of the score functions gives estimators which have high efficiency under normality and robustness in the presence of outliers. The score functions can also be chosen so that the resulting estimators are asymptotically as efficient as the maximum likelihood estimators for a given distribution.