
Showing papers on "Outlier" published in 1989


Journal ArticleDOI
TL;DR: The box plot is applied to tabular data from two recently published articles to show how readers can use box plots to improve the interpretation of data in complex tables and recommend that the box plot be used more frequently.
Abstract: Exploratory data analysis involves the use of statistical techniques to identify patterns that may be hidden in a group of numbers. One of these techniques is the "box plot," which is used to visually summarize and compare groups of data. The box plot uses the median, the approximate quartiles, and the lowest and highest data points to convey the level, spread, and symmetry of a distribution of data values. It can also be easily refined to identify outlier data values and can be easily constructed by hand. We apply box plots to tabular data from two recently published articles to show how readers can use box plots to improve the interpretation of data in complex tables. The box plot, like other visual methods, is more than a substitute for a table: It is a tool that can improve our reasoning about quantitative information. We recommend that the box plot be used more frequently.

531 citations
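
The box plot construction described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the article's own procedure: the function name `box_plot_summary`, the interpolated quartile convention, and the 1.5 x IQR fence rule for flagging outliers are assumptions (the abstract only says the plot "can be easily refined to identify outlier data values").

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Return box plot statistics and any points flagged as outliers."""
    data = sorted(values)
    # Quartiles by linear interpolation; other quartile conventions exist.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed fence rule: points farther than whisker * IQR from the box
    # are treated as outliers.
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],    # lowest point that is not an outlier
        "whisker_high": inside[-1],  # highest point that is not an outlier
        "outliers": outliers,
    }


summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(summary["outliers"])  # [50]
```

The five returned statistics are what a hand-drawn box plot needs: the box spans `q1` to `q3`, a line marks the median, and the whiskers stop at the most extreme points inside the fences.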


Book
22 May 1989
TL;DR: This book covers distributions, recurrence relations, bounds, and approximations for order statistics, including order statistics from a sample containing a single outlier (with applications in robustness studies) and record values.
Abstract: 1: The Distribution of Order Statistics.- Exercises.- 2: Recurrence Relations and Identities for Order Statistics.- 2.0. Introduction.- 2.1. Relations for single moments.- 2.2. Relations for product moments.- 2.3. Relations for covariances.- 2.4. Results for symmetric populations.- 2.5. Results for normal population.- 2.6. Results for two related populations.- 2.7. Results for exchangeable variates.- Exercises.- 3: Bounds on Expectations of Order Statistics.- 3.0. Introduction.- 3.1. Universal bounds in the i.i.d. case.- 3.2. Variations on the Samuelson-Scott theme.- 3.3. Bounds via maximal dependence.- 3.4. Restricted families of parent distributions.- Exercises.- 4: Approximations to Moments of Order Statistics.- 4.0. Introduction.- 4.1. Uniform order statistics and moments.- 4.2. David and Johnson's approximation.- 4.3. Clark and Williams' approximation.- 4.4. Plackett's approximation.- 4.5. Saw's error analysis.- 4.6. Sugiura's orthogonal inverse expansion.- 4.7. Joshi's modified bounds and approximations.- 4.8. Joshi and Balakrishnan's improved bounds for extremes.- Exercises.- 5: Order Statistics From a Sample Containing a Single Outlier.- 5.0. Introduction.- 5.1. Distributions of order statistics.- 5.2. Relations for single moments.- 5.3. Relations for product moments.- 5.4. Relations for covariances.- 5.5. Results for symmetric outlier model.- 5.6 Results for two related outlier models.- 5.7. Functional behaviour of order statistics.- 5.8. Applications in robustness studies.- Exercises.- 6: Record Values.- 6.0. Introduction.- 6.1. Record values.- 6.2. Bounds on mean record values.- 6.3. Record values in dependent sequences.- Exercises.- References.- Author Index.

229 citations


Journal ArticleDOI
TL;DR: In this article, a new set of computational procedures is proposed for estimating magnetotelluric response functions from time series of natural source electromagnetic field variations. The procedures combine the remote reference method, which is effective at minimizing bias errors in the response, with robust processing, and introduce a nonparametric jackknife estimator for the confidence limits on the response functions.
Abstract: A new set of computational procedures are proposed for estimating the magnetotelluric response functions from time series of natural source electromagnetic field variations. These combine the remote reference method, which is effective at minimizing bias errors in the response, with robust processing, which is useful for removing contamination by outliers and other departures from Gauss-Markov optimality on regression estimates. In addition, a nonparametric jackknife estimator for the confidence limits on the response functions is introduced. The jackknife offers many advantages over conventional approaches, including robustness to heterogeneity of residual variance, relative insensitivity to correlations induced by the spectral analysis of finite data sequences, and computational simplicity. These techniques are illustrated using long-period magnetotelluric data from the EMSLAB Lincoln line. The paper concludes with a cautionary note about leverage effects by high power events in the dependent variables that are not necessarily removable by any robust method based on regression residuals.

204 citations
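
The jackknife idea mentioned in the abstract above can be illustrated with a generic delete-one jackknife for a scalar statistic. This is only a sketch under stated assumptions: the function name `jackknife_standard_error` and the sample values are hypothetical, and the code does not reproduce the paper's magnetotelluric response-function estimator.

```python
import math
import statistics


def jackknife_standard_error(sample, estimator):
    """Delete-one jackknife standard error of estimator(sample)."""
    sample = list(sample)
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    # Recompute the estimate n times, each time leaving one observation out.
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    mean_loo = sum(leave_one_out) / n
    # Jackknife variance: (n - 1) / n times the sum of squared deviations.
    variance = (n - 1) / n * sum((t - mean_loo) ** 2 for t in leave_one_out)
    return math.sqrt(variance)


data = [2.0, 4.0, 4.0, 5.0, 9.0]
se_mean = jackknife_standard_error(data, statistics.mean)
se_median = jackknife_standard_error(data, statistics.median)
```

For the sample mean the jackknife standard error coincides with the familiar s / sqrt(n), which makes a convenient sanity check. The same function applies unchanged to estimators with no simple variance formula.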


Journal ArticleDOI
TL;DR: In this article, the authors derived the resulting increase in the mean square of the l-step-ahead forecast error and showed that this increase is due to a carryover effect of the additive outlier on the forecast, and a bias in the estimates of the autoregressive and moving average coefficients.

134 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a method for distinguishing an observational outlier from an innovational one using regression analysis techniques, and a four-step procedure for modeling time series in the presence of outliers.
Abstract: Some statistics used in regression analysis are considered for detection of outliers in time series. Approximations and asymptotic distributions of these statistics are considered. A method is proposed for distinguishing an observational outlier from an innovational one. A four-step procedure for modeling time series in the presence of outliers is also proposed, and an example is presented to illustrate the methodology.

128 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a method to identify statistical outliers, which are candidates for interpretation as true geochemical anomalies, and isolate a multi-element subset that is representative of the geochemical background.

114 citations


Journal ArticleDOI
TL;DR: This paper proposes a new algorithm to obtain an eigenvalue decomposition for the sample covariance matrix of a multivariate dataset, referred to as ROPRC, which is based on the rotation technique employed by Ammann and Van Ness (1988a,b) to obtain a robust solution to an errors-in-variables problem.
Abstract: This paper proposes a new algorithm to obtain an eigenvalue decomposition for the sample covariance matrix of a multivariate dataset. The algorithm is based on the rotation technique employed by Ammann and Van Ness (1988a,b) to obtain a robust solution to an errors-in-variables problem. When this rotation technique is combined with an iterative reweighting of the data, a robust eigenvalue decomposition is obtained. This robust eigenvalue decomposition has important applications to principal component analysis. Monte Carlo simulations are performed to compare ordinary principal component analysis using the standard eigenvalue decomposition with this algorithm, referred to as ROPRC. It is seen that ROPRC is reasonably efficient compared to an eigenvalue decomposition when Gaussian data is available, and that ROPRC is much better than the eigenvalue decomposition if outliers are present or if the data has a heavy-tailed distribution. The algorithm returns useful numerical diagnostic information in the form o...

91 citations


Journal ArticleDOI
TL;DR: A unified and comprehensive approach to the analysis of classifier performance is presented and the theoretical analysis of the bootstrap method is presented for quadratic classifiers.
Abstract: An expression for expected classifier performance previously derived by the authors (ibid., vol. 11, no. 8, p. 873-855, Aug. 1989) is applied to a variety of error estimation methods, and a unified and comprehensive approach to the analysis of classifier performance is presented. After the error expression is introduced, it is applied to three cases: (1) a given classifier and a finite test set; (2) given test distributions and a finite design set; and (3) finite and independent design and test sets. For all cases, the expected values and variances of the classifier errors are presented. Although the study of Case 1 does not produce any new results, it is important to confirm that the proposed approach produces the known results, and also to show how these results are modified when the design set becomes finite, as in Cases 2 and 3. The error expression is used to compute the bias between the leave-one-out and resubstitution errors for quadratic classifiers. The effect of outliers in design samples on the classification error is discussed. Finally, the theoretical analysis of the bootstrap method is presented for quadratic classifiers.

82 citations


Journal ArticleDOI
TL;DR: In this article, the authors discuss the use of normal order statistics plots, based on deviance residuals, to check distributional assumptions in regression models and discuss a method for discriminating between competing models with different error distributions.
Abstract: SUMMARY We discuss the use of normal order statistics plots, based on deviance residuals, to check distributional assumptions in regression models. Continuous and discrete error distributions are considered, as are censored data. Misspecified error distributions and discrimination between competing models are discussed, with an example. Residual plots to detect inadequacies in normal linear regression models have a long history. An example is the normal scores or rankit plot: a plot of ordered residuals against normal order statistics, which is used to detect outliers and to check distributional assumptions. Under- or overdispersion in such a plot may also indicate a misspecified systematic component of the model. The purpose of this paper is to discuss such plots for regressions with nonnormal errors, such as generalized linear models, for which deviance residuals are commonly used. Deviance residuals are known to be approximately normal in many cases. In this paper we briefly describe the properties of rankit plots based on them for continuous distributions, outline analogous results for discrete and censored data, and describe a method for discriminating between competing models with different error distributions. Most roads lead to Rome for the normal distribution in the sense that many definitions of residuals are functions of (y - μ)/σ. Not so for other distributions, for which various

51 citations


Journal ArticleDOI
TL;DR: In this paper, a new procedure for identifying outliers or influential observations is proposed, which uses recursive residuals calculated on observations that have been ordered according to their Studentized residuals, values of Cook's D, or another regression diagnostic of the user's choice.
Abstract: A new procedure for identifying outliers or influential observations is proposed. The procedure uses recursive residuals, calculated on observations that have been ordered according to their Studentized residuals, values of Cook's D, or another regression diagnostic of the user's choice. Under the model, these recursive residuals, appropriately standardized, have approximate Student's t distributions. Thus, convenient critical values are available for deciding which observations merit scrutiny and, perhaps, special treatment. The power of the test procedure to identify one or more outliers is investigated through simulation, and its dependence on the number and configuration of the outliers, that is, their placement with respect to the main body of the data, explored. The proposed procedure and two variations of it, also based on these recursive residuals, are compared with alternatives based on internally or externally Studentized residuals. The use of recursive residuals, calculated on adaptively-ordered observations, increases power and helps to combat the masking of one outlier by another when multiple outliers are present in configurations that create masking.

45 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a partially adaptive estimator of ARMA models which includes least absolute deviation (LAD or L1), least squares, Lp, and optimal Lp as special or limiting cases.

Journal ArticleDOI
TL;DR: In this article, the robust regression M-estimate is proposed to estimate the impedance of magnetotelluric (MT) data and compared with the remote reference (RR) method.
Abstract: In situations in which Gaussian error assumptions are not valid, estimation procedures based on the least squares (LS) algorithm can be seriously misleading. It is then essential to use statistical procedures that are robust, in the sense of being relatively resistant or insensitive to the presence of a moderate number of outliers (abnormal data) superimposed on a common Gaussian noise background. This paper demonstrates the implementation of the robust regression M-estimate to magnetotelluric (MT) data. Like the LS estimate, the M-estimate minimizes the difference between prediction and observation, but it differs from the LS estimate in that it defines the measure of misfit in a way that does not allow a few bad points to dominate the estimate. Starting with the description of this estimate, several algorithms for computation are discussed and applied to estimate MT impedance. Using synthetic and real data, it is shown that, in comparison with the remote reference (RR) method (which is based on LS), robust procedures yield impedance estimates that are no worse than RR, and are often better.
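
The way an M-estimate keeps a few bad points from dominating the fit can be sketched with iteratively reweighted least squares for a straight line. This is a simplified illustration, not the paper's magnetotelluric implementation: the Huber weight function, the tuning constant `k=1.345`, the scale estimate based on the median absolute residual, and all names are assumptions.

```python
import statistics


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def huber_line_fit(x, y, k=1.345, iterations=50):
    """Huber M-estimate of a line via iteratively reweighted least squares."""
    w = [1.0] * len(x)  # the first pass is ordinary least squares
    for _ in range(iterations):
        intercept, slope = weighted_line_fit(x, y, w)
        residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        # Robust scale from the median absolute residual (assumed choice).
        scale = 1.4826 * statistics.median([abs(r) for r in residuals])
        if scale == 0.0:
            break
        # Huber weights: full weight for small residuals, reduced weight
        # for residuals larger than k * scale.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r)
             for r in residuals]
    return intercept, slope


xs = [float(i) for i in range(10)]
ys = [2.0 + 3.0 * xi for xi in xs]
ys[4] = 60.0  # one gross outlier in the response
ls_intercept, ls_slope = weighted_line_fit(xs, ys, [1.0] * len(xs))
m_intercept, m_slope = huber_line_fit(xs, ys)
```

On this synthetic data the least-squares line is pulled toward the outlier, while the reweighted fit returns to the line through the remaining nine points. As the abstracts on this page note, downweighting by residual size does not protect against high-leverage points.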

Journal ArticleDOI
TL;DR: In this article, an empirical study of efficiency at the plant level, requiring production and financial data, was done using frontier function specifications, and outlier diagnostic tests consistently flag the same subset of efficient and inefficient observations as the frontier models and additionally clarify ranking discrepancies among the frontier model specifications.
Abstract: An empirical study of efficiency at the plant level, requiring production and financial data, was done using frontier function specifications. It is not evident from the implementation of the production-frontier models that different methodologies will consistently flag the same observations as being efficient or inefficient. As a result, outlier diagnostics for individual observations and for subsets of observations are used to achieve a relative index of influentiality within the spectrum of efficiency. These outlier diagnostic tests consistently flag the same subset of efficient and inefficient observations as the frontier models and additionally clarify ranking discrepancies among the frontier model specifications.

Journal ArticleDOI
U.J. Dixit
TL;DR: In this article, the maximum likelihood estimators and moment estimators are derived for samples from the Gamma distribution in the presence of outliers. The estimators are compared empirically when all three parameters are unknown and when one of the three parameters is known, and their bias and mean square error (MSE) are investigated numerically.
Abstract: The maximum likelihood estimators and moment estimators are derived for samples from the Gamma distribution in the presence of outliers. These estimators are compared empirically when all three parameters are unknown and when one of the three parameters is known; their bias and mean square error (MSE) are investigated with the help of numerical techniques.

Journal ArticleDOI
TL;DR: In this article, the authors investigated the forecasting efficiency of an expert system, an automatic time series modeling system, when applied to a quarterly earnings per share series, and found that the intervention analysis which specifically models the outlier may enhance forecasting efficiency.
Abstract: The purpose of this study is to investigate the forecasting efficiency of an expert system, an automatic time series modeling system, when applied to a quarterly earnings per share series. The Bethlehem Steel quarterly earnings series has a severe outlier problem and the intervention analysis which specifically models the outlier may enhance forecasting efficiency. The purpose of this study is to re-examine the intervention analysis of Bethlehem Steel's quarterly earnings per share series behaviour previously analyzed in Hopwood and McKeown (1986). The very large $1 billion loss of Bethlehem Steel in 1982 created an outlier in the quarterly earnings series that may distort the autocorrelation and partial autocorrelation function estimates such that traditional time series models may not be appropriate [Box and Tiao (1975)]. The intervention analysis model reported in Hopwood and McKeown is substantiated in this analysis; however, the intervention modeling effort is expanded by using the intervention model i...

Journal ArticleDOI
TL;DR: In this article, the authors investigate properties of a diagnostic-envelope method for evaluating normal probability plots of regression residuals that was proposed by Atkinson (1981), implemented by BMDP (Hardwick 1987), and extended to logistic regression by Landwehr, Pregibon, and Shoemaker (1984).
Abstract: We investigate properties of a diagnostic-envelope method for evaluating normal probability plots of regression residuals that was proposed by Atkinson (1981), implemented by BMDP (Hardwick 1987), and extended to logistic regression by Landwehr, Pregibon, and Shoemaker (1984). The envelope's stability properties and joint residual vector-inclusion levels, undocumented so far in the literature, are explored here with several examples. Alternative resistant techniques for creating envelopes are considered; interpretations that can be derived from these plots are discussed. A resistant version of the envelopes shows good stability and good sensitivity to outlying residuals; both full-normal and half-normal probability plots with this envelope method provide useful information to the analyst.

Journal ArticleDOI
TL;DR: Strong evidence is presented to show that, at present, the field of design metrics is only sufficiently mature to allow outlier analysis to be recommended as a useful aid to the software design process.
Abstract: The tutorial paper considers the question of how the software designer is to make use of the many design metrics that have been proposed by researchers over the past few years. The activity of software design is an iterative process of decision making. Design methodologies provide qualitative criteria for this decision-making process. In contrast, design metrics claim to provide objective, quantitative guidance. Three application methods for metrics are identified: prediction, quality control, and outlier analysis. Strong evidence is presented to show that, at present, the field of design metrics is only sufficiently mature to allow outlier analysis to be recommended as a useful aid to the software design process. The paper demonstrates this technique by evaluating an example design using the information flow metric.

Journal ArticleDOI
01 Dec 1989
TL;DR: A practical method is developed for outlier detection in autoregressive modelling that has the interpretation of a Mahalanobis distance function and requires minimal additional computation once a model is fitted.
Abstract: A practical method is developed for outlier detection in autoregressive modelling. It has the interpretation of a Mahalanobis distance function and requires minimal additional computation once a model is fitted. It can be of use to detect both innovation outliers and additive outliers. Both simulated data and real data are used for illustration, including one data set from water resources.

Journal ArticleDOI
Klaus Danzer
TL;DR: The fundamentals of robust statistics are explained in their importance for analytical chemistry and the efficiency of several medians and robust confidence intervals is demonstrated by means of examples.
Abstract: The fundamentals of robust statistics are explained in their importance for analytical chemistry. Robust statistical techniques are resistant against uncertainties concerning the data, like outliers or divergencies from the normal distribution. The efficiency of several medians and robust confidence intervals is demonstrated by means of examples. The bases of robust regression are described and the advantages of its use are shown for practically relevant examples.

Journal ArticleDOI
TL;DR: In this paper, a test procedure to detect outliers in the one-parameter exponential distribution based on prediction is presented, which can be used to detect more than one outlier and the required percentage points can be easily determined.
Abstract: In this paper, we present a test procedure to detect outliers in the one-parameter exponential distribution based on prediction. The distribution of the test statistic is obtained. The proposed test can be used to detect more than one outlier and the required percentage points can be easily determined. Furthermore, the test provides a simple procedure to detect whether a given set of data is free from outliers or spurious observations.

Journal ArticleDOI
TL;DR: In this paper, the authors generalize these results to the case when the order statistics arise from two related sets of independent and non-identically distributed random variables and apply them to simplify the evaluation of the moments of order statistics in an outlier model.
Abstract: Some recurrence relations among moments of order statistics from two related sets of variables are quite well-known in the i.i.d. case and are due to Govindarajulu (1963a, Technometrics, 5, 514–518 and 1966, J. Amer. Statist. Assoc., 61, 248–258). In this paper, we generalize these results to the case when the order statistics arise from two related sets of independent and non-identically distributed random variables. These relations can be employed to simplify the evaluation of the moments of order statistics in an outlier model for symmetrically distributed random variables.

Journal ArticleDOI
TL;DR: Outlier-resistant algorithms that detect a change from a given nominal stationary process to another such process are given, both in the absence and the presence of outliers.
Abstract: Outlier-resistant algorithms that detect a change from a given nominal stationary process to another such process are given. The nominal processes are assumed to be mutually independent and to satisfy some general regularity conditions. The outlier sequences are assumed to be independently and identically distributed and independent of the nominal processes. The proposed algorithms are sequential and consist of uniformly bounded steps. The asymptotic performance of the algorithms is analyzed, both in the absence and the presence of outliers. Breakdown points and influence functions are defined and analyzed. The algorithms are studied in more detail for Gaussian autoregressive nominal processes.

Journal ArticleDOI
TL;DR: A random walk model is found that is adequate for the returns here and is extended to real-time applications and on-line outlier rejection.
Abstract: The use of recursive techniques based on Kalman filter algorithms for identification of time series system models for Doppler lidar returns and the subsequent filtering and smoothing of measured data is explored. The form of possible stochastic system models is reviewed, and reiterative maximum likelihood and innovation spectral tests are used for identification. It is found that a random walk model is adequate for the returns here, and possible explanations for this are considered. Examples are given to illustrate the extension of our method to real-time applications and on-line outlier rejection.
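
A scalar Kalman filter for a random walk model with a simple innovation gate gives a rough picture of recursive filtering with on-line outlier rejection. This is a hedged sketch, not the paper's algorithm: the function name, the noise variances, the 3-sigma gate, and the measurement values are all hypothetical.

```python
def filter_random_walk(measurements, process_var, measurement_var, gate=3.0):
    """Kalman filter for x[t] = x[t-1] + w, z[t] = x[t] + v, with gating.

    Returns the filtered estimates and the indices of rejected measurements.
    """
    estimate = measurements[0]      # initialise from the first measurement
    variance = measurement_var
    estimates = [estimate]
    rejected = []
    for index, z in enumerate(measurements[1:], start=1):
        # Predict: a random walk keeps the estimate and inflates its variance.
        variance += process_var
        innovation = z - estimate
        innovation_var = variance + measurement_var
        # Assumed gate: reject when the innovation exceeds gate standard
        # deviations of its predicted spread.
        if innovation ** 2 > gate ** 2 * innovation_var:
            rejected.append(index)
        else:
            gain = variance / innovation_var
            estimate += gain * innovation
            variance *= 1.0 - gain
        estimates.append(estimate)
    return estimates, rejected


readings = [1.0, 1.1, 0.9, 1.0, 25.0, 1.05, 0.95]
estimates, rejected = filter_random_walk(readings, 0.01, 0.04)
print(rejected)  # [4]
```

A rejected measurement leaves the estimate unchanged while the prediction variance keeps growing, so the filter becomes more willing to accept the next reading after a gap.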

Proceedings ArticleDOI
22 Nov 1989
TL;DR: In this paper, the problem of robust estimation of autoregressive parameters in the presence of outliers is considered, and several M-estimates (maximum likelihood type) corresponding to different cost functions show good efficiency robustness against innovation outliers.
Abstract: The problem of robust estimation of autoregressive parameters in the presence of outliers is considered. The least squares estimate lacks efficiency robustness when innovation outliers are present. Several M-estimates (maximum likelihood type) corresponding to different cost functions show good efficiency robustness against innovation outliers. The M-estimate with Welsch cost function is found to be the best in a comparative simulation study. However, in the case of additive outliers, M-estimates are not robust and they give large bias errors. Generalized M-estimates are recommended for the additive outlier case. A simulation study shows that a combination of Welsch function as the weight function and Andrews or Welsch function as the cost function produces the best performance in generalized M-estimates.

Journal ArticleDOI
01 Dec 1989
TL;DR: In this article, a simple relation satisfied by the covariances of order statistics in the i.i.d. case was derived and generalized to the case when the variables are independent and non-identically distributed.
Abstract: We derive a simple relation satisfied by the covariances of order statistics in the i.i.d. case and then generalize it to the case when the variables are independent and non-identically distributed. This relation could be employed successfully either to check the calculations or to reduce the amount of direct computations involved in evaluating the covariances of order statistics from an outlier model.

01 Jan 1989
TL;DR: This paper illustrates some simple exploratory procedures for detecting outliers with examples from near-infrared and mid-infrared spectroscopy using partial least-squares regression as the calibration method.
Abstract: Outlier samples can have very detrimental effects on the performances of multivariate calibration methods, as these methods are generally not very robust. Often, the software implementations of these methods do not check for outliers. If outliers are not detected, invalid predictions may result. This paper illustrates some simple exploratory procedures for detecting outliers with examples from near-infrared and mid-infrared spectroscopy using partial least-squares regression as the calibration method. 8 refs., 9 figs., 1 tab.

Journal ArticleDOI
TL;DR: In this article, the authors present the results of an empirical power study of three prominent goodness-of-fit tests for exponentiality due to Shapiro and Wilk (1972), Durbin (1975), and Tiku (1980) by considering the mixture-exponential and outlier-exponential models as alternatives.
Abstract: We present the results of an empirical power study of three prominent goodness-of-fit tests for exponentiality due to Shapiro and Wilk (1972), Durbin (1975), and Tiku (1980) by considering the mixture-exponential and outlier-exponential models as alternatives. This study is on similar lines as those of Dyer and Harbin (1981) and Balakrishnan (1983). We show that Tiku's test is on the whole considerably more powerful than the other two tests.

Journal ArticleDOI
TL;DR: In this paper, likelihood ratio tests and intuitive tests are suggested to test the rank of Γ in the presence of mean-slippage and dispersion-slippage outliers, respectively, and the rank is estimated through the information criterion of Zhao, Krishnaiah and Bai and a modified criterion proposed by the author.
Abstract: Likelihood ratio tests for mean-slippage outlier(s) and Roy's (1953) union-intersection principle for dispersion-slippage outlier(s) are applied to signal processing data. Procedures to get approximate critical values of the tests are given. Then likelihood ratio tests and intuitive tests are suggested to test the rank of Γ in the presence of mean-slippage and dispersion-slippage outliers, respectively, where Γ is the covariance matrix of the observation vector in signal processing. Finally, the estimate of the rank of Γ is obtained through the information criterion developed by Zhao, Krishnaiah and Bai (1986a,b) and through a modified information criterion proposed by the author.

Journal ArticleDOI
TL;DR: In this paper, a chi-square distribution based on one of four identification coefficients is fitted to the presence-absence data and tested for significant misfit using the Kolmogorov-Smirnov method; members with a probability below a set cutoff level are considered so different from the majority of group members that they should be removed.

Journal ArticleDOI
TL;DR: In this paper, a geometrically consistent procedure based on the Euclidean distance is proposed, which involves the least absolute deviation (LAD) regression and a new permutation test for matched pairs (PTMP).
Abstract: The effects of outliers on linear regression are examined. The sensitivity of classical least‐squares (LS) procedures to outliers is shown to be associated with the geometric inconsistency between the data space and the analysis space. This is illustrated for both estimation and inference. A geometrically consistent procedure based on the Euclidean distance is proposed. This procedure involves the least absolute deviation (LAD) regression and a new permutation test for matched pairs (PTMP). Comparisons made with LS techniques demonstrate that the proposed procedure is more resistant to the existence of outliers in the data set and leads to more intuitive results. Applications and illustrations using meteorological and climatological data are also discussed.
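
The resistance of least absolute deviation (LAD) regression to outliers can be checked on a small example. The sketch below covers only the LAD line fit, not the permutation test for matched pairs, and is not the authors' implementation. It relies on the standard result that, for a straight-line fit, some LAD-optimal line passes through at least two data points, so a brute-force search over pairs suffices for small samples; the function name and data are hypothetical.

```python
from itertools import combinations


def lad_line_fit(x, y):
    """Exact LAD fit of y = intercept + slope * x by searching point pairs."""
    points = list(zip(x, y))
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = a + b * x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # Sum of absolute deviations of all points from this candidate line.
        cost = sum(abs(yi - (intercept + slope * xi)) for xi, yi in points)
        if best is None or cost < best[0]:
            best = (cost, intercept, slope)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best[1], best[2]


xs = [float(i) for i in range(10)]
ys = [2.0 + 3.0 * xi for xi in xs]
ys[4] = 60.0  # one gross outlier in the response
lad_intercept, lad_slope = lad_line_fit(xs, ys)
```

The search costs on the order of n cubed operations, which is fine for illustration but not for large data sets, where a linear programming formulation is the usual route.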