
Showing papers on "Outlier" published in 1990


Journal ArticleDOI
TL;DR: Novel estimators of the kinematical properties of clusters of galaxies are proposed that are both resistant in the presence of outliers and robust for a broad range of non-Gaussian underlying populations.
Abstract: The novel estimators proposed for the kinematical properties of clusters of galaxies are both resistant in the presence of outliers and robust for a broad range of non-Gaussian underlying populations. Extensive simulations for a number of common situations realizable in small-to-large samples of cluster radial velocities allow the identification of minimum variance estimators. Also explored is the estimation of confidence intervals, using the jackknife and bootstrap resampling techniques. These methods are compared to simple formulas based on sample estimates of central location and scale. Estimators of confidence intervals on scale require resampling. 61 refs.
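The resistant estimation idea in this abstract can be sketched with the sample median, the MAD, and a bootstrap percentile interval for scale. This is a minimal illustration with invented velocity data, not the paper's minimum-variance estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented "cluster radial velocities": a Gaussian core plus two interlopers.
v = np.concatenate([rng.normal(5000, 300, 50), [12000, 13000]])

def mad_scale(x):
    # Median absolute deviation, scaled for consistency at the Gaussian.
    return 1.4826 * np.median(np.abs(x - np.median(x)))

loc = np.median(v)      # resistant location estimate
scale = mad_scale(v)    # resistant scale estimate

# Bootstrap percentile confidence interval for the scale estimate;
# the abstract notes that confidence intervals on scale require resampling.
boot = np.array([mad_scale(rng.choice(v, v.size)) for _ in range(2000)])
ci = np.percentile(boot, [2.5, 97.5])
print(loc, scale, ci)
```

Unlike the sample mean and standard deviation, the median and MAD are barely moved by the two interlopers.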

1,496 citations


Journal ArticleDOI
TL;DR: This work proposes to compute distances based on very robust estimates of location and covariance, better suited to expose the outliers in a multivariate point cloud, to avoid the masking effect.
Abstract: Detecting outliers in a multivariate point cloud is not trivial, especially when there are several outliers. The classical identification method does not always find them, because it is based on the sample mean and covariance matrix, which are themselves affected by the outliers. That is how the outliers get masked. To avoid the masking effect, we propose to compute distances based on very robust estimates of location and covariance. These robust distances are better suited to expose the outliers. In the case of regression data, the classical least squares approach masks outliers in a similar way. Also here, the outliers may be unmasked by using a highly robust regression method. Finally, a new display is proposed in which the robust regression residuals are plotted versus the robust distances. This plot classifies the data into regular observations, vertical outliers, good leverage points, and bad leverage points. Several examples are discussed.
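The masking effect and the robust-distance fix can be illustrated numerically. The sketch below uses crude concentration steps on the half-sample with the smallest current distances; this is a simplified stand-in for the paper's high-breakdown estimator, and the data, cutoff, and rescaling are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
# Bivariate Gaussian cloud plus a small clump of planted outliers.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 0.5, (8, 2))])

def mahal_sq(X, mu, cov):
    # Squared Mahalanobis distances of the rows of X from (mu, cov).
    d = X - mu
    return np.einsum('ij,jk,ik->i', d, np.linalg.inv(cov), d)

# Crude concentration steps: repeatedly refit location and scatter on the
# half-sample with the smallest current distances.  (Illustrative only;
# the paper's estimator uses a proper high-breakdown search.)
h = X.shape[0] // 2 + 1
idx = rng.choice(X.shape[0], h, replace=False)
for _ in range(20):
    mu, cov = X[idx].mean(0), np.cov(X[idx].T)
    idx = np.argsort(mahal_sq(X, mu, cov))[:h]

# Rescale the trimmed scatter so the median distance matches the chi^2_2 median.
cov *= np.median(mahal_sq(X, mu, cov)) / 1.386

robust_d2 = mahal_sq(X, mu, cov)
classical_d2 = mahal_sq(X, X.mean(0), np.cov(X.T))
flag = robust_d2 > 9.21          # chi^2_2 0.99 cutoff
print(flag.sum())
```

The classical distances of the planted outliers come out far smaller than their robust distances, which is exactly the masking the abstract describes.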

1,419 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose to compute distances based on very robust estimates of location and covariance, which may be unmasked by using a highly robust regression method, and a new display is proposed in which the robust regression residuals are plotted versus the robust distances.
Abstract: Detecting outliers in a multivariate point cloud is not trivial, especially when there are several outliers. The classical identification method does not always find them, because it is based on the sample mean and covariance matrix, which are themselves affected by the outliers. To avoid this masking effect, we propose to compute distances based on very robust estimates of location and covariance. In the case of regression data, the outliers also may be unmasked by using a highly robust regression method. A new display is proposed in which the robust regression residuals are plotted versus the robust distances

156 citations


Journal ArticleDOI
TL;DR: In this article, it was shown that the most surprising observation must lie at one of the vertices of the convex hull and that the observation with the maximum Mahalanobis distance from the sample mean must also lie on the convex hull.
Abstract: SUMMARY The conditional predictive ordinate (CPO) is a Bayesian diagnostic which detects surprising observations. It has been used in a variety of situations such as univariate samples, the multivariate normal distribution and regression models. Results are presented about the most surprising observation which has minimum CPO. For the multivariate normal distribution it is shown that the most surprising observation must lie at one of the vertices of the convex hull. It is also shown that the observation with maximum Mahalanobis distance from the sample mean must lie on the convex hull. Results are given for the expected number of vertices on the convex hull when the sample is contaminated. An alternative, closely related diagnostic, the ratio ordinate measure, is presented. A numerical comparison of the two measures is given.
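The convex-hull property of the maximum Mahalanobis distance is easy to check numerically (random invented data; scipy's ConvexHull is used for the hull):

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))     # invented bivariate sample

mu = X.mean(0)
cov_inv = np.linalg.inv(np.cov(X.T))
d = X - mu
d2 = np.einsum('ij,jk,ik->i', d, cov_inv, d)   # squared Mahalanobis distances

hull = ConvexHull(X)
# The observation with maximum Mahalanobis distance from the sample mean
# is a vertex of the convex hull, as the abstract states.
print(int(np.argmax(d2)) in set(hull.vertices))
```

This holds because the squared Mahalanobis distance is a convex function of the observation, so its maximum over a finite sample is attained at an extreme point.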

151 citations


Journal ArticleDOI
TL;DR: In this paper, the authors identify influential observations in univariate autoregressive integrated moving average time series models and measure their effects on the estimated parameters of the model, and the sensitivity of the parameters to the presence of either additive or innovational outliers is analyzed.
Abstract: This article studies how to identify influential observations in univariate autoregressive integrated moving average time series models and how to measure their effects on the estimated parameters of the model. The sensitivity of the parameters to the presence of either additive or innovational outliers is analyzed, and influence statistics based on the Mahalanobis distance are presented. The statistic linked to additive outliers is shown to be very useful for indicating the robustness of the fitted model to the given data set. Its application is illustrated using a relevant set of historical data.

92 citations


Journal ArticleDOI
TL;DR: In this paper, the concept of partial probability weighted moments (PPWM) is introduced to estimate a distribution from censored samples, and unbiased estimators of PPWM are derived.

85 citations


Journal ArticleDOI
TL;DR: Preliminary screening for outliers via outsider labelling rules is investigated and letter‐value‐based methods are adapted to cope with censored observations by means of the product limit estimator.
Abstract: Some simple data analytic techniques are discussed in relation to skewed data, as might arise in reliability or survival studies, for example. Preliminary screening for outliers via outsider labelling rules is investigated and letter-value-based methods are adapted to cope with censored observations by means of the product limit estimator.

67 citations


Book ChapterDOI
TL;DR: In this article, the authors explore the properties of residuals from a rank-based fit of the model and present diagnostic techniques that detect outlying cases and cases that have an influential effect on the rankbased fit.
Abstract: Residual plots and diagnostic techniques have become important tools in examining the least squares fit of a linear model. In this article we explore the properties of the residuals from a rank-based fit of the model. We present diagnostic techniques that detect outlying cases and cases that have an influential effect on the rank-based fit. We show that the residuals from this fit can be used to detect curvature not accounted for by the fitted model. Furthermore, our diagnostic techniques inherit the excellent efficiency properties of the rank-based fit over a wide class of error distributions, including asymmetric distributions. We illustrate these techniques with several examples.

54 citations


Journal ArticleDOI
TL;DR: In this paper, a measure of the effect of single observations on a logarithmic Bayes factor is presented, defined via the difference in the logarithms of the Bayes factors conditional first on all the data and then omitting an observation.
Abstract: SUMMARY In this paper we consider a measure of the effect of single observations on a logarithmic Bayes factor defined via the difference in the logarithms of the Bayes factors conditional first on all the data and then omitting an observation. The measure is related to the conditional predictive ordinate. The form of the measure and examples of its use are presented for a variety of situations, normal samples, linear models, log linear models and the checking of distributional assumptions.

52 citations


Journal ArticleDOI
TL;DR: The authors proposed several diagnostic statistics to help identify outliers and influential cases and developed new, more general asymptotic tests for vanishing tetrads for variables with nonnormal distributions and derive a simultaneous test for multiple tetrad differences.
Abstract: The TETRAD search procedure has several limitations: It does not screen data for outliers; it relies on Wishart's test for vanishing tetrads that assumes a multinormal distribution for the random variables; and the significance tests do not take into account that multiple tetrad differences are being tested. I propose several ways to overcome these problems. First, I present several diagnostic statistics to help identify outliers and influential cases. Then I develop new, more general asymptotic tests for vanishing tetrads for variables with nonnormal distributions and derive a simultaneous test for multiple tetrad differences. Finally, the tests are extended to apply to tetrad differences of covariances as well as differences of correlations computed for random variables with “arbitrary” distributions.

48 citations


Journal ArticleDOI
TL;DR: Two general approaches to outlier detection in the context of a bioavailability/bioequivalence study are discussed, using data from a three-way crossover experiment in the pharmaceutical industry.
Abstract: This paper concerns techniques for detection of a potential outlier or extreme observation in a bioavailability/bioequivalence study. A bioavailability analysis that includes possible outlying values may affect the decision on bioequivalence. We consider a general crossover model that takes into account period and formulation effects. We derive two test procedures, the likelihood distance and the estimates distance, to detect potential outliers. We show that the two procedures relate to a chi-square distribution with three degrees of freedom. The main purpose of this paper is to exhibit and discuss these two general approaches of outlier detection in the context of a bioavailability/bioequivalence study. To illustrate these approaches, we use data from a three-way crossover experiment in the pharmaceutical industry that concerned the comparison of the bioavailability of two test formulations and a standard (reference) formulation of a drug. This example demonstrates the influence of an outlying value in the study of bioequivalence.

Posted Content
TL;DR: In this article, the problem of estimating missing observations in linear, possibly nonstationary, stochastic processes when the model is known is addressed, and analytical expressions for the optimal estimators and their associated mean squared errors are obtained.
Abstract: The paper addresses the problem of estimating missing observations in linear, possibly nonstationary, stochastic processes when the model is known. The general case of any possible distribution of missing observations in the time series is considered, and analytical expressions for the optimal estimators and their associated mean squared errors are obtained. These expressions involve solely the elements of the inverse or dual autocorrelation function of the series. This optimal estimator (the conditional expectation of the missing observations given the available ones) is equal to the estimator that results from filling the missing values in the series with arbitrary numbers, treating these numbers as additive outliers, and removing the outlier effects from the invented numbers using intervention analysis.

Journal ArticleDOI
TL;DR: In this article, five procedures for detecting outliers in linear regression are compared: sequential testing of the maximum internally studentized residual or maximum externally studentized (cross-validatory) residual, Marasinghe's multistage procedure, and two procedures based on recursive residuals, calculated on adaptively ordered observations.
Abstract: Five procedures for detecting outliers in linear regression are compared: sequential testing of the maximum internally studentized residual or maximum externally studentized (cross-validatory) residual, Marasinghe's multistage procedure, and two procedures based on recursive residuals, calculated on adaptively ordered observations. All of these procedures initially test a no-outliers hypothesis, and they have an underlying unity in their general approach to the outlier identification problem. Which procedure is most effective depends on the number and placement of outliers in the data. The multistage procedure is very effective in some cases, but requires prespecifying a value k, the maximum number of outliers one can then detect; the procedure can suffer severely if the chosen value for k is either larger or smaller than the number of outliers actually in the data.
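One of the compared ingredients, the externally studentized (cross-validatory) residual, can be computed in closed form from a single fit via the standard deletion formula. A minimal sketch with an invented data set and one planted outlier (the sequential tests themselves, with their critical values, are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
y[5] += 8.0                    # plant a single additive outlier

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
r = y - H @ y                          # ordinary residuals
h = np.diag(H)                         # leverages
p = X.shape[1]
s2 = r @ r / (n - p)

# Externally studentized residuals: each residual is scaled by the
# error variance estimated with that case deleted.
s2_del = ((n - p) * s2 - r**2 / (1 - h)) / (n - p - 1)
t = r / np.sqrt(s2_del * (1 - h))
print(np.argmax(np.abs(t)))
```

A sequential procedure would compare max |t_i| against a suitably adjusted critical value, remove the flagged case, and repeat.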

Proceedings ArticleDOI
01 Sep 1990
TL;DR: This work investigates the use of Robust Estimation in an application requiring the accurate location of the centres of circular objects in an image and provides an approach to parameter estimation in contaminated data distributions capable of greater accuracy.
Abstract: We investigate the use of Robust Estimation in an application requiring the accurate location of the centres of circular objects in an image. A common approach used throughout computer vision for extracting shape information from a data set is to fit a feature model using the Least Squares method. The well known sensitivity of this method to outliers is traditionally accommodated by outlier rejection methods. These usually consist of heuristic applications of model templates or data trimming. Robust Estimation offers a theoretical framework for assessing such rejection schemes, and more importantly, provides an approach to parameter estimation in contaminated data distributions capable of greater accuracy.

Journal ArticleDOI
TL;DR: In this paper, the conditional outside rate, a performance criterion for assessing outlier labeling procedures, is introduced, which represents the probability that an observation, yi, is labeled an outlier, conditional on its distance from the center, e = yi − μ.
Abstract: In this article we compare Rosner's (1983) extension of the outlier t test (the generalized ESD procedure) to the boxplot rules, Tukey's (1977) resistant outlier labeling approach. The underlying principles are contrasted, and the behavior of the procedures is compared in a variety of sampling situations. To facilitate power-type comparisons of procedures that are based on different paradigms, the conditional outside rate, a performance criterion for assessing outlier labeling procedures, is introduced. This quantity represents the probability that an observation, yi, is labeled an outlier, conditional on its distance from the center, e = yi − μ. Some exact expressions, as well as simple approximations, and a Monte Carlo swindle are developed. The conditional outside rate, together with the outside rate per observation and the some-outside rate per sample, are applied to compare performance of the classical and boxplot rules, based on Monte Carlo simulation of samples of sizes n = 10, 30, and 50...
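Tukey's boxplot labeling rule referenced above is simple to state; a minimal sketch (the k = 1.5 inner-fence constant is the standard choice, and the data are invented):

```python
import numpy as np

def boxplot_outliers(y, k=1.5):
    # Flag points lying more than k interquartile ranges outside the quartiles.
    q1, q3 = np.percentile(y, [25, 75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return (y < lo) | (y > hi)

y = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 9.7, 15.0])
print(boxplot_outliers(y))   # only the 15.0 should be labeled
```

Because the fences are built from quartiles, the rule is resistant: the outlier itself barely moves the cutoffs, unlike mean-and-standard-deviation rules.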

Journal ArticleDOI
TL;DR: The performance of fuzzy calibration and the least median of squares method for detecting outliers is compared in a Monte Carlo simulation; fuzzy calibration is found to perform better than least squares.

Journal ArticleDOI
TL;DR: It is shown that the existence of a single aberrant observation, innovation, or intervention causes an ARMA model to be misidentified using unadjusted autocorrelation (acf) and partial autocorrelation estimates.
Abstract: Fox (1972), Box and Tiao (1975), and Abraham and Box (1979) have proposed methods for detecting outliers in time series whose ARMA form is known (or identified). We show that the existence of a single aberrant observation, innovation, or intervention causes an ARMA model to be misidentified using unadjusted autocorrelation (acf) and partial autocorrelation estimates. The magnitude, location, type of outlier, and in some cases the ARMA's parameters, affect the identification outcome. We use variance inflation, signal-to-noise ratios, and acf critical values to determine an ARMA model's susceptibility to misidentification. Numerical and simulation examples suggest how to iteratively use the outlier detection methods in practice.
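The misidentification mechanism can be seen in a small simulation: a single additive outlier inflates the sample variance and shrinks the sample autocorrelations toward zero, pushing an AR series toward a white-noise appearance (the AR(1) model, outlier size, and seed are invented for the example):

```python
import numpy as np

def acf(x, nlags):
    # Sample autocorrelation function (biased covariance estimator).
    x = x - x.mean()
    c0 = x @ x / len(x)
    return np.array([x[:-k] @ x[k:] / len(x) / c0 for k in range(1, nlags + 1)])

rng = np.random.default_rng(4)
n = 500
e = rng.normal(size=n)
y = np.empty(n)
y[0] = e[0]
for t in range(1, n):          # AR(1) with phi = 0.8
    y[t] = 0.8 * y[t - 1] + e[t]

z = y.copy()
z[250] += 15.0                 # one additive outlier

print(acf(y, 3).round(2), acf(z, 3).round(2))
```

The contaminated series shows visibly smaller autocorrelations, which is the kind of distortion that leads to ARMA misidentification from unadjusted acf estimates.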

Journal ArticleDOI
TL;DR: A definition of the term outlier is stated, and a general framework for carrying out a qualitative analysis of data is constructed which, when applied to the analysis of outliers, presents certain advantages with respect to the classical approach.
Abstract: Summary We propose an outline which enables us to analyse in a generic way the errors which can affect experimental observations. From this we state a definition for the term outlier and typify the problems of outlier identification techniques. This allows us to construct a general framework in which to carry out a qualitative analysis of data which, when applied to the analysis of outliers, presents certain advantages with respect to the classical approach.

Journal ArticleDOI
TL;DR: In this article, exact conditional methods for identifying outliers in logistic regression data are proposed, using an explicit enumeration of all possible responses consistent with the observed value of the sufficient statistic.
Abstract: SUMMARY We consider exact conditional methods for identifying outliers in logistic regression data. Tests for a single outlier and multiple outliers are developed assuming a logistic slippage model. The p-values for these tests are determined using an explicit enumeration of all possible responses consistent with the observed value of the sufficient statistic. Justifications are given for preferring this computationally intensive approach to standard methods based on asymptotic approximations. The techniques are applied to two examples.

Journal ArticleDOI
TL;DR: The authors developed diagnostic tools to measure the effect of individual observations on the parameter estimates and the fit in order to make the data rejection decision of the data-analyst less ad hoc.
Abstract: SUMMARY In the fitting of finite mixtures of distributions to empirical data it is often felt necessary to exclude certain observations in order to achieve a satisfactory fit. This is particularly true of fisheries lengthfrequency data where the object is to model a discrete number of age classes by a mixture of normal components and the presence of outliers in the sample can have a large effect on the parameter estimates and the fit. This paper develops diagnostic tools to measure the effect of individual observations on the parameter estimates and the fit in order to make the data rejection decision of the data-analyst less ad hoc. Specifically, we show how to compute the parameter influence curves and introduce a statistic similar to "Cook's distance" used in regression diagnostics.

Journal ArticleDOI
TL;DR: A novel scheme for direction-of-arrival (DOA) estimation is presented that provides estimates that are robust against outliers and distributional uncertainties and employs decentralized processing in which each subarray site provides a robust estimate of the number of sources accompanied by its corresponding reliability statistics.
Abstract: A novel scheme for direction-of-arrival (DOA) estimation is presented. The procedure provides estimates that are robust against outliers and distributional uncertainties. It also employs decentralized processing in which each subarray site provides a robust estimate of the number of sources accompanied by its corresponding reliability statistics, so that only the reliable estimates of the number of sources are combined at the fusion center. A robust technique is used to combine the corresponding DOA estimates from the subarray sites. Simulation results show that the scheme performs consistently when the outlier noise is present, whereas the performance of the corresponding nonrobust method deteriorates quickly even with a slight change of the noise environment. This is especially significant at a low signal-to-noise ratio.

Journal ArticleDOI
TL;DR: In this article, the use of outlier detection, reliability measures from the theory of survey network analysis, and the concept of condition numbers from linear algebra are applied to the statistical and numerical analysis of Global Positioning System (GPS) phase-difference data in baseline adjustments.
Abstract: The use of outlier detection, reliability measures from the theory of survey network analysis, and the concept of condition numbers from linear algebra are applied to the statistical and numerical analysis of Global Positioning System (GPS) phase-difference data in baseline adjustments. Outlier tests are given for the cases of a priori and a posteriori variance factors. Expressions for internal and external reliability measures are treated for the case in which blunders are assumed to be present in the data. These measures are evaluated for a GPS baseline for the cases of fixed integer and floating point ambiguity unknowns. Preliminary testing with real data reveals that the use of covariance analysis augmented with reliability analysis may not be sufficient for the complete assessment of parameters and observations. The numerical sensitivity of the matrix of normal equations in adjustments with floating ambiguity parameters is a potential problem that can be detected with the computation of condition numbers. The use of all these analysis tools in standard surveying engineering practice is suggested.

Journal ArticleDOI
TL;DR: The algorithm is structured so that, for a specified sample size, each object has a given probability of being chosen for at least one sample; it provides a reasonable ability to detect outliers as well as much of the major structure in the data set.

Journal ArticleDOI
TL;DR: In this article, Cook's likelihood displacement is used to measure the impact of individual observations on the time series estimates and a diagnostic that compares the estimates of the innovation variance with and without a particular observation is studied in detail.
Abstract: . Cook's likelihood displacement is a convenient measure of the impact of a model perturbation on parameter estimates. A commonly used model perturbation in regression is the deletion of a case, or equation. A natural model perturbation in the time series context is the deletion of an observation, or a group of observations. Diagnostics that measure the impact of individual observations on the time series estimates are explored in this paper. A diagnostic that compares the estimates of the innovation variance with and without a particular observation is studied in detail.

Journal ArticleDOI
TL;DR: In this paper, an iterative identification method for a linear state-space model with outliers and missing data is proposed by applying the Expectation-Maximization (EM) algorithm.

Journal ArticleDOI
TL;DR: A knowledge-based expert system has been developed to emulate the heuristics of experienced analysts; it can perform data analyses, suggest an appropriate distribution, detect outliers, and provide means to justify a design flood on physical grounds.
Abstract: Single-station flood frequency analysis is an important element in hydrotechnical planning and design. In Canada, no single statistical distribution has been specified for floods; hence, the conventional approach is to select a distribution based on its fit to the observed sample. This selection is not straightforward owing to typically short record lengths and attendant sampling error, magnified influence of apparent outliers, and limited evidence of two populations. Nevertheless, experienced analysts confidently select a distribution for a station based only on a few heuristics. A knowledge-based expert system has been developed to emulate these expert heuristics. It can perform data analyses, suggest an appropriate distribution, detect outliers, and provide means to justify a design flood on physical grounds. If the sample is too small to give reliable quantile estimates, the system performs a Bayesian analysis to combine regional information with station-specific data. The system was calibrated and te...

Journal ArticleDOI
TL;DR: In this article, a discussion is given of concepts suggested to describe the outlier behavior of probability distributions, and the statistical consequences of outlier-proneness or outlier-resistance, as defined in those concepts, are studied.

Journal ArticleDOI
TL;DR: Extended tables of critical values are presented for the reduction in standard deviation tests for single and paired outliers recommended for collaborative studies; a confidence interval approach is suggested as a means to treat all sizes of collaborative studies in a uniform manner.
Abstract: Extended tables of critical values for the reduction in standard deviation tests for single and paired outliers recommended for collaborative studies are presented. Critical values for the single outlier test are derived mathematically and those for the paired test are derived from computer simulations. The single outlier test becomes more and more stringent as the size of the study increases. A confidence interval approach is suggested as a means to treat all sizes of collaborative studies in a uniform manner.

Journal ArticleDOI
TL;DR: When a multinormal distribution procedure and Hawkins' procedure were both applied, the two subsets produced did not differ greatly; Hawkins' procedure seems the better method for detecting outliers.

Journal ArticleDOI
01 Mar 1990
TL;DR: In this article, outlier-contaminated normal errors in regression problems are modelled by exponential power distributions and the resulting maximum likelihood estimators are shown to involve Lp minimisations (1 < p ≤ 2).
Abstract: Outlier-contaminated normal errors in regression problems are modelled by exponential power distributions and the resulting maximum likelihood estimators are shown to involve Lp minimisations (1 < p ≤ 2). It is shown that Lp estimation is minimax outlier-robust and minimax covariance-robust over the neighbourhood of exponential power distributions. Efficiency loss is negligible. Recursive gradient-type Lp estimators are derived and shown to be convergent and consistent. The major limitation on outlier robustness is seen to be the requirement for convergence of the recursive minimisation. The algorithm is validated with an application in adaptive control.