scispace - formally typeset

Showing papers on "Outlier published in 1995"


Proceedings Article
20 Aug 1995
TL;DR: This paper examines C4.5, a decision tree algorithm that is already quite robust - few algorithms have been shown to consistently achieve higher accuracy - and extends its pruning method to fully remove the effect of outliers, resulting in improvements on many databases.
Abstract: Finding and removing outliers is an important problem in data mining. Errors in large databases can be extremely common, so an important property of a data mining algorithm is robustness with respect to errors in the database. Most sophisticated methods in machine learning address this problem to some extent, but not fully, and can be improved by addressing the problem more directly. In this paper we examine C4.5, a decision tree algorithm that is already quite robust - few algorithms have been shown to consistently achieve higher accuracy. C4.5 incorporates a pruning scheme that partially addresses the outlier removal problem. In our ROBUST-C4.5 algorithm we extend the pruning method to fully remove the effect of outliers, and this results in improvement on many databases.

259 citations


Journal ArticleDOI
TL;DR: In this article, the authors describe the development of a new technique for identifying outlier coefficients in meta-analytic data sets, referred to as the sample-adjusted meta-analytic deviancy statistic or SAMD, which takes into account the sample size on which each study is based when determining outlier status.
Abstract: This article describes the development of a new technique for identifying outlier coefficients in meta-analytic data sets. Denoted the sample-adjusted meta-analytic deviancy statistic, or SAMD, this technique takes into account the sample size on which each study is based when determining outlier status. An empirical test of the SAMD statistic with an actual meta-analytic data set resulted in a substantial reduction in residual variabilities and a corresponding increase in the percentage of variance accounted for by statistical artifacts after removal of outlier study coefficients. Moreover, removal of these coefficients helped to clarify what was a confusing and difficult-to-explain finding in this meta-analysis. It is suggested that analysis for outliers become a routine part of meta-analysis methodology. Limitations and directions for future research are discussed.
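The core idea behind a sample-adjusted deviancy statistic is easy to sketch: scale each study's deviation from the mean of the remaining effect sizes by the sampling-error standard deviation implied by that study's sample size. The function below is a simplified reading of that idea, using the usual large-sample approximation for the sampling variance of a correlation; it is not the authors' exact SAMD formula, and the study values are made up for illustration.

```python
import math

def samd_scores(rs, ns):
    """Rough sketch of a sample-adjusted deviancy statistic: compare
    each study's correlation with the mean of the remaining studies,
    scaling the difference by the sampling-error SD implied by that
    study's sample size (approximation: SE = (1 - r_bar^2) / sqrt(n - 1))."""
    scores = []
    for i, (r, n) in enumerate(zip(rs, ns)):
        others = [x for j, x in enumerate(rs) if j != i]
        r_bar = sum(others) / len(others)
        se = (1 - r_bar ** 2) / math.sqrt(n - 1)
        scores.append((r - r_bar) / se)
    return scores

# hypothetical meta-analytic data: the last study deviates strongly
rs = [0.30, 0.28, 0.33, 0.80]
ns = [100, 120, 90, 200]
scores = samd_scores(rs, ns)
```

Note how the large sample size of the deviant study makes its score more extreme: a big deviation that sampling error cannot explain is exactly what flags an outlier study.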

252 citations


Journal ArticleDOI
TL;DR: The authors' robust rules improve the performances of the existing PCA algorithms significantly when outliers are present and perform excellently for fulfilling various PCA-like tasks such as obtaining the first principal component vector, the first k principal component vectors, and directly finding the subspace spanned by the first k principal component vectors without solving for each vector individually.
Abstract: This paper applies statistical physics to the problem of robust principal component analysis (PCA). The commonly used PCA learning rules are first related to energy functions. These functions are generalized by adding a binary decision field with a given prior distribution so that outliers in the data are dealt with explicitly in order to make PCA robust. Each of the generalized energy functions is then used to define a Gibbs distribution from which a marginal distribution is obtained by summing over the binary decision field. The marginal distribution defines an effective energy function, from which self-organizing rules have been developed for robust PCA. Under the presence of outliers, both the standard PCA methods and the existing self-organizing PCA rules studied in the literature of neural networks perform quite poorly. By contrast, the robust rules proposed here resist outliers well and perform excellently for fulfilling various PCA-like tasks such as obtaining the first principal component vector, the first k principal component vectors, and directly finding the subspace spanned by the first k principal component vectors without solving for each vector individually. Comparative experiments have been made, and the results show that the authors' robust rules improve the performances of the existing PCA algorithms significantly when outliers are present.

244 citations


Journal ArticleDOI
TL;DR: In this paper, the Stahel-Donoho estimators (t, V) of multivariate location and scatter are defined as a weighted mean and a weighted covariance matrix with weights of the form w(r), where w is a weight function and r is a measure of "outlyingness", obtained by considering all univariate projections of the data.
Abstract: The Stahel-Donoho estimators (t, V) of multivariate location and scatter are defined as a weighted mean and a weighted covariance matrix with weights of the form w(r), where w is a weight function and r is a measure of “outlyingness,” obtained by considering all univariate projections of the data. These estimators have a high breakdown point in all dimensions and are order-√n consistent. The asymptotic bias of V for point-mass contamination for suitable weight functions is compared with that of Rousseeuw's minimum volume ellipsoid (MVE) estimator. A simulation shows that for a suitable w, t and V exhibit high efficiency for both normal and Cauchy distributions and are better than their competitors for normal data with point-mass contamination. The performances of the estimators for detecting outliers are compared for both a real and a synthetic data set.
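The projection-based "outlyingness" measure at the heart of these estimators can be sketched in two dimensions. The exact definition takes the supremum over all directions; the sketch below approximates it with a grid of directions and uses the median and MAD as the robust centre and scale, omitting the subsequent weighting step that produces (t, V). The data set here is invented for illustration.

```python
import math

def outlyingness(data, n_dirs=180):
    """Approximate Stahel-Donoho outlyingness: for each point, the
    largest robustly standardized distance |proj - median| / MAD over
    a grid of projection directions (the exact definition maximizes
    over all univariate projections)."""
    n = len(data)
    out = [0.0] * n
    for k in range(n_dirs):
        theta = math.pi * k / n_dirs
        ux, uy = math.cos(theta), math.sin(theta)
        proj = [x * ux + y * uy for x, y in data]
        med = sorted(proj)[n // 2]
        mad = sorted(abs(p - med) for p in proj)[n // 2]
        if mad == 0:
            continue  # degenerate direction; skip it
        for i, p in enumerate(proj):
            out[i] = max(out[i], abs(p - med) / mad)
    return out

# 20 points on a line plus one gross outlier off the line
pts = [(0.1 * i, 0.1 * i) for i in range(20)] + [(5.0, -5.0)]
scores = outlyingness(pts)
```

Directions nearly orthogonal to the bulk of the data are the ones that expose the off-line point, which is why the maximum over many projections is taken rather than any single one.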

237 citations


Journal ArticleDOI
TL;DR: It is demonstrated analytically that MINPRAN distinguishes good fits from fits to random data and that MINPRAN finds accurate fits and nearly the correct number of inliers, regardless of the percentage of true inliers.
Abstract: MINPRAN is a new robust estimator capable of finding good fits in data sets containing more than 50% outliers. Unlike other techniques that handle large outlier percentages, MINPRAN does not rely on a known error bound for the good data. Instead, it assumes the bad data are randomly distributed within the dynamic range of the sensor. Based on this, MINPRAN uses random sampling to search for the fit and the inliers to the fit that are least likely to have occurred randomly. It runs in time O(N² + SN log N), where S is the number of random samples and N is the number of data points. We demonstrate analytically that MINPRAN distinguishes good fits from fits to random data and that MINPRAN finds accurate fits and nearly the correct number of inliers, regardless of the percentage of true inliers. We confirm MINPRAN's properties experimentally on synthetic data and show it compares favorably to least median of squares. Finally, we apply MINPRAN to fitting planar surface patches and eliminating outliers in range data taken from complicated scenes.
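The random-sampling search that MINPRAN builds on can be illustrated with a generic RANSAC-style line fit. This is only a sketch of the search component: it scores candidates with a fixed inlier tolerance, whereas MINPRAN itself replaces that tolerance with its "least likely to have occurred randomly" criterion. The data set and parameters are invented for illustration.

```python
import random

def ransac_line(points, trials=500, tol=0.1, seed=1):
    """RANSAC-style sketch: repeatedly fit a line through two randomly
    chosen points and keep the fit with the most inliers within `tol`.
    (MINPRAN instead selects the fit least likely to be random.)"""
    rng = random.Random(seed)
    best_model, best_count = None, -1
    for _ in range(trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical pair; skip in this sketch
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        count = sum(abs(y - (m * x + b)) <= tol for x, y in points)
        if count > best_count:
            best_model, best_count = (m, b), count
    return best_model, best_count

# 20 points on y = 2x + 1 plus 30 gross outliers (60% contamination)
rng = random.Random(42)
pts = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]
pts += [(rng.uniform(0, 2), rng.uniform(-5, 10)) for _ in range(30)]
(slope, intercept), inliers = ransac_line(pts)
```

Even with a 60% outlier fraction, enough two-point samples land on the true line that the search recovers it, which is the property the abstract's ">50% outliers" claim formalizes.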

235 citations


Journal ArticleDOI
TL;DR: Two definitions of Bayesian residuals are proposed for binary regression data; unlike classical residuals, they have continuous-valued posterior distributions which can be graphed to learn about outlying observations and used for outlier detection.
Abstract: In a binary response regression model, classical residuals are difficult to define and interpret due to the discrete nature of the response variable. In contrast, Bayesian residuals have continuous-valued posterior distributions which can be graphed to learn about outlying observations. Two definitions of Bayesian residuals are proposed for binary regression data. Plots of the posterior distributions of the basic 'observed - fitted' residuals can be helpful in outlier detection. Alternatively, the notion of a tolerance random variable can be used to define latent data residuals that are functions of the tolerance random variables and the parameters. In the probit setting, these residuals are attractive in that a priori they are a sample from a standard normal distribution, and therefore the corresponding posterior distributions are easy to interpret. These residual definitions are illustrated in examples and contrasted with classical outlier detection methods for binary data.

129 citations


Proceedings ArticleDOI
20 Jun 1995
TL;DR: It is demonstrated that proper modelling of degeneracy in the presence of outliers enables the detection of outliers which would otherwise be missed.
Abstract: New methods are reported for the detection of multiple solutions (degeneracy) when estimating the fundamental matrix, with specific emphasis on robustness in the presence of data contamination (outliers). The fundamental matrix can be used as a first step in the recovery of structure from motion. If the set of correspondences is degenerate then this structure cannot be accurately recovered and many solutions will explain the data equally well. It is essential that we are alerted to such eventualities. However, current feature matchers are very prone to mismatching, giving a high rate of contamination within the data. Such contamination can make a degenerate data set appear non-degenerate, thus the need for robust methods becomes apparent. The paper presents such methods with a particular emphasis on providing a method that will work on real imagery and with an automated (non-perfect) feature detector and matcher. It is demonstrated that proper modelling of degeneracy in the presence of outliers enables the detection of outliers which would otherwise be missed. Results using real image sequences are presented. All processing - point matching, degeneracy detection and outlier detection - is automatic.

94 citations


Journal ArticleDOI
TL;DR: A robust principal components regression procedure based on the ellipsoidal multivariate trimming (MVT) and the least median of squares (LMS) methods is proposed as an outlier detection tool.

92 citations


Journal ArticleDOI
Andre Lucas1
TL;DR: In this paper, the authors considered unit root tests based on robust estimators with a high breakdown point and high efficiency, and derived the asymptotic distribution of these tests.

81 citations



Patent
26 Apr 1995
TL;DR: In this paper, an improved method for determining when a set of multivariate data (such as a chromatogram or a spectrum) is an outlier is provided, which involves using a procedure such as Principal Component Analysis to create a model describing a calibration set of spectra or chromatograms which is known to be normal, and to create residuals describing the portion of a particular spectrum or Chromatogram which is not described by the model.
Abstract: An improved method is provided for determining when a set of multivariate data (such as a chromatogram or a spectrum) is an outlier. The method involves using a procedure such as Principal Component Analysis to create a model describing a calibration set of spectra or chromatograms which is known to be normal, and to create residuals describing the portion of a particular spectrum or chromatogram which is not described by the model. The improvement comprises using an average residual spectrum calculated for the calibration set, rather than the origin of the model as a reference point for comparing a spectrum or chromatogram obtained from an unknown sample. The present invention also includes separating a complex set of data into various sub-parts such as sub-chromatograms or sub-spectra, so that outliers in any sub-part can be more readily detected. In one particular embodiment, the invention is directed towards a method for dividing a chromatogram into the sub-parts of peak information, baseline shape, baseline offset, and noise.

Journal ArticleDOI
15 Mar 1995
TL;DR: A novel technique is presented to reject outliers from an m-dimensional data set when the underlying model is a hyperplane (a line in two dimensions, a plane in three dimensions); using matrix perturbation theory, it also provides an error model for the solution once the contaminants have been removed.
Abstract: Least squares minimization is by nature global and, hence, vulnerable to distortion by outliers. We present a novel technique to reject outliers from an m-dimensional data set when the underlying model is a hyperplane (a line in two dimensions, a plane in three dimensions). The technique has a sound statistical basis and assumes that Gaussian noise corrupts the otherwise valid data. The majority of alternative techniques available in the literature focus on ordinary least squares, where a single variable is designated to be dependent on all others - a model that is often unsuitable in practice. The method presented here operates in the more general framework of orthogonal regression, and uses a new regression diagnostic based on eigendecomposition. It subsumes the traditional residuals scheme and, using matrix perturbation theory, provides an error model for the solution once the contaminants have been removed.

Journal ArticleDOI
TL;DR: In this paper, the power of the Student t test and the Wilcoxon-Mann-Whitney test declines substantially when samples are obtained from outlier-prone densities, including mixed-normal, Cauchy, lognormal, and mixed-uniform densities.
Abstract: In this study, methods are examined that can be described, somewhat paradoxically, as robust nonparametric statistics. Although nonparametric tests effectively control the probability of Type I errors through rank randomization, they do not always control the probability of Type II errors and power, which can be grossly inflated or deflated by the shape of distributions. The power of the Student t test and the Wilcoxon-Mann-Whitney test declines substantially when samples are obtained from outlier-prone densities, including mixed-normal, Cauchy, lognormal, and mixed-uniform densities. However, the nonparametric test acquires an advantage, because outliers influence the t test to a relatively greater extent. Under these conditions, an outlier detection and downweighting (ODD) procedure, usually associated with parametric significance tests, augments the power of both the t test and the Wilcoxon-Mann-Whitney test.

Journal ArticleDOI
TL;DR: In this paper, high breakdown robust methods are used in conjunction with a robust distance measure defined relative to the minimum volume ellipsoid estimator to statistically and graphically depict both multivariate outliers and leverage points.
Abstract: Given that most data used for production studies have not been accumulated for such purposes, it is important that the quantitative tools for messy data which can affect the accuracy of computed technical efficiency measures be found. In this study, high breakdown robust methods are used in conjunction with a robust distance measure defined relative to the minimum volume ellipsoid estimator. The standardized robust residuals from the high breakdown estimators and the robust distance measures are used to statistically and graphically depict both multivariate outliers and leverage points. Once these points are found, their relationship to those observations that exhibit strong technically efficient or inefficient behavior, scale inefficiency and/or unusual production characteristics is analyzed for three linerboard manufacturing facilities. Additionally, the impact of the outliers and leverage points on the estimated least squares coefficients which are used by the corrected ordinary least squares methodology to compute the full-frontier technical efficiency measures is explored. Finally, a sensitivity analysis of the impact of outliers and leverage points on the computed linear programming based technical efficiency measures is presented.

Journal ArticleDOI
TL;DR: A new method is proposed, Piecewise Linear Online Trending (PLOT), which is statistically based, performs significantly better than existing approaches, adapts to process variability and noisy data, and recognizes and eliminates outliers, remaining robust even in their presence.

Journal ArticleDOI
TL;DR: The authors argue that it is hazardous to conduct cross-national marketing research without evaluating the potential influential effects of multivariate outliers, which are observations distinct from the majority of cases.
Abstract: Structural equation modeling with latent variables is being used more frequently in international marketing research. However, the authors argue that it is hazardous to conduct cross-national marke...

Journal ArticleDOI
TL;DR: The procedure is illustrated and compared to existing robust methods, using several data sets known to contain multiple outliers and its performance and robustness are additionally tested by a Monte Carlo study on the simulated data sets.

Journal ArticleDOI
TL;DR: It is shown that, both in a classical and a Bayesian framework, the presence of additive outliers moves 'standard' inference towards stationarity; basing inference on an independent Student-t instead of a Gaussian likelihood yields results that are less sensitive to the presence of outliers.

Journal ArticleDOI
TL;DR: In this paper, a robust hierarchical Bayes method is developed to smooth small area means when a number of covariates are available, which is particularly suited when one or more outliers are present in the data.

Journal ArticleDOI
TL;DR: In this paper, an alternative definition of the finite-sample breakdown point is proposed, which is invariant with respect to reparameterization and compatible with the Donoho-Huber breakdown point in linear regression situations.
Abstract: We propose an alternative definition of the finite-sample breakdown point. This breakdown point is invariant with respect to reparameterization and compatible with the Donoho-Huber breakdown point in linear regression situations. It also overcomes certain limitations of the definition proposed by Stromberg and Ruppert and can be used in a wide range of estimation problems. We investigate the breakdown properties of some nonlinear regression estimators. These results alert us to the danger of using familiar M estimators with data sets containing outliers and to the advantages of using estimators based on Hampel's proposal, such as S estimators.
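The finite-sample breakdown point the abstract refers to is the smallest fraction of arbitrarily bad observations that can carry an estimator beyond all bounds. A minimal illustration of the concept (in the simple location setting, not the nonlinear-regression setting the authors study): one corrupted value ruins the sample mean, while the median tolerates contamination of just under half the sample.

```python
from statistics import mean, median

def contaminate(sample, k, value=1e9):
    """Replace the first k observations with an arbitrarily bad value."""
    return [value] * k + sample[k:]

clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9]

# One outlier ruins the mean (finite-sample breakdown point 1/n) ...
m1 = mean(contaminate(clean, 1))
# ... while the median still sits near 10 with 4 of 10 values corrupted.
md4 = median(contaminate(clean, 4))
```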

Journal ArticleDOI
TL;DR: In this work the feasibility of using genetic algorithms for LMS is demonstrated by means of curved analytical calibration and pharmacokinetic data contaminated with outliers.

Journal ArticleDOI
01 Jun 1995-Test
TL;DR: Outliers can strongly influence sample autocorrelations and hence the identification of time series models; the results of this article indicate that different types of outlier can have qualitatively different effects.
Abstract: It is well known that outliers can strongly influence sample autocorrelations and hence the identification of time series models. The results of this article indicate that different types of outlier can have qualitatively different effects.
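The effect is easy to reproduce: a single additive outlier inflates the variance term of the sample autocorrelation far more than the covariance term, dragging the lag-1 coefficient of a persistent series toward zero. The series below is a toy example, not from the article.

```python
def acf1(x):
    """Lag-1 sample autocorrelation."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    c1 = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    return c1 / c0

# A slowly varying (highly persistent) series ...
clean = [i % 20 for i in range(100)]
# ... and the same series with one additive outlier.
dirty = clean[:]
dirty[50] = 500.0

r_clean = acf1(clean)
r_dirty = acf1(dirty)
```

This is why identification tools based on sample autocorrelations can suggest white noise for a contaminated series that is in fact strongly dependent.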


Journal ArticleDOI
Visa Koivunen1
TL;DR: A class of nonlinear regression filters based on robust estimation theory is introduced to recover a high-quality image from degraded observations and effectively attenuate both impulsive and nonimpulsive noise while recovering the signal structure and preserving interesting details.
Abstract: A class of nonlinear regression filters based on robust estimation theory is introduced. The goal of the filtering is to recover a high-quality image from degraded observations. Models for desired image structures and contaminating processes are employed, but deviations from strict assumptions are allowed since the assumptions on signal and noise are typically only approximately true. The robustness of filters is usually addressed only in a distributional sense, i.e., the actual error distribution deviates from the nominal one. In this paper, the robustness is considered in a broad sense since the outliers may also be due to an inappropriate signal model, or there may be more than one statistical population present in the processing window, causing biased estimates. Two filtering algorithms minimizing a least trimmed squares criterion are provided. The design of the filters is simple since no scale parameters or context-dependent threshold values are required. Experimental results using both real and simulated data are presented. The filters effectively attenuate both impulsive and nonimpulsive noise while recovering the signal structure and preserving interesting details.
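The least trimmed squares criterion can be sketched for the simplest case, a constant fit within a filter window: minimize the sum of the h smallest squared residuals, which amounts to ignoring the most deviant window samples. For the one-dimensional location case the optimal h-subset can be found among contiguous windows of the sorted sample, which the sketch below exploits; this is an illustration of the criterion, not the authors' image-filtering algorithms, and the window values are made up.

```python
def lts_location(window, h=None):
    """Least-trimmed-squares location estimate for a filter window:
    among contiguous subsets of h sorted samples, return the mean of
    the one with the smallest sum of squared deviations."""
    s = sorted(window)
    n = len(s)
    if h is None:
        h = n // 2 + 1  # trim just under half the samples
    best_val, best_score = None, float("inf")
    for i in range(n - h + 1):
        sub = s[i:i + h]
        m = sum(sub) / h
        score = sum((v - m) ** 2 for v in sub)
        if score < best_score:
            best_val, best_score = m, score
    return best_val

# impulse-corrupted window around a true signal level of about 10
window = [10.1, 9.9, 255.0, 10.0, 10.2, 0.0, 9.8]
est = lts_location(window)
```

Because the criterion discards the worst residuals outright, no scale parameter or impulse-detection threshold is needed, which matches the design simplicity the abstract emphasizes.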

Journal ArticleDOI
TL;DR: The nearest neighbor classification rule is extended to reject outlier data and is implemented with an analog electronic circuit; a continuous membership function is derived from an optimization formulation of the classification rule.

Journal ArticleDOI
TL;DR: In this paper, the authors describe three multiple comparison procedures based on the trimmed mean - a measure of location having a standard error that is relatively unaffected by heavy tails and outliers - and compare them to two methods for comparing means.
Abstract: Two common goals when choosing a method for performing all pairwise comparisons of J independent groups are controlling experiment wise Type I error and maximizing power. Typically groups are compared in terms of their means, but it has been known for over 30 years that the power of these methods becomes highly unsatisfactory under slight departures from normality toward heavy-tailed distributions. An approach to this problem, well-known in the statistical literature, is to replace the sample mean with a measure of location having a standard error that is relatively unaffected by heavy tails and outliers. One possibility is to use the trimmed mean. This paper describes three such multiple comparison procedures and compares them to two methods for comparing means.
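The trimmed mean mentioned above is simple to compute: sort the sample and discard a fixed fraction of observations from each tail before averaging. The sketch below uses 20% trimming, a common choice in this literature, on an invented sample with two gross errors.

```python
def trimmed_mean(sample, prop=0.2):
    """Mean after discarding the lowest and highest `prop` fraction
    of the sorted observations (20% trimming by default)."""
    s = sorted(sample)
    g = int(prop * len(s))
    core = s[g:len(s) - g]
    return sum(core) / len(core)

# eight well-behaved values near 5 plus two gross outliers
data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 99.0, -40.0]
tm = trimmed_mean(data)
plain = sum(data) / len(data)
```

The ordinary mean is dragged far from 5 by the two bad values, while the trimmed mean ignores them; its standard error is likewise insensitive to heavy tails, which is what restores power in the comparisons the abstract describes.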


Journal ArticleDOI
TL;DR: In this article, the authors consider priors and likelihoods for the location problem which have bounded but nonvanishing influence on posterior moments, building on Dawid's (1973) sufficient conditions for the posterior distribution of θ to approach the prior distribution as x tends to infinity.
Abstract: Let x be a single observation from a distribution having unknown location parameter θ. Dawid (1973) provided sufficient conditions for the posterior distribution of θ to approach the prior distribution as x tends to infinity, so that an outlier has bounded and vanishing influence on the posterior distribution. We present a result closely related to Dawid's theorem. This enables us to consider priors and likelihoods for the location problem which have bounded but nonvanishing influence on posterior moments. Examples are given.

Journal ArticleDOI
TL;DR: Consistency, continuity and asymptotic normality of M-estimators for stationary sequences are obtained in the process, and robust estimates are derived when an outlier-contaminated sample of a and s is provided.
Abstract: Let a and s denote the interarrival times and service times in a GI/GI/1 queue. Let a(n), s(n) be the r.v.s with distributions as the estimated distributions of a and s from iid samples of a and s of sizes n. Let w be a r.v. with the stationary distribution π of the waiting times of the queue with input (a, s). We consider the problem of estimating E[w^α], α > 0, and π via simulations when (a(n), s(n)) are used as input. Conditions for the accuracy of the asymptotic estimate, continuity of the asymptotic variance and uniformity in the rate of convergence to the estimate are obtained. We also obtain rates of convergence for sample moments, the empirical process and the quantile process for the regenerative processes. Robust estimates are also obtained when an outlier-contaminated sample of a and s is provided. In the process we obtain consistency, continuity and asymptotic normality of M-estimators for stationary sequences. Some robustness results for Markov processes are included.

Book ChapterDOI
01 Jan 1995
TL;DR: Classical and robust/resistant procedures for the estimation of population parameters and the identification of multiple outliers in univariate and multivariate populations are reviewed.
Abstract: Classical and robust/resistant procedures for the estimation of population parameters and the identification of multiple outliers in univariate and multivariate populations are reviewed. The successful identification of anomalous observations depends on the statistical procedures employed. Commercial industries, local communities, and government agencies such as the United States Environmental Protection Agency (U.S. EPA) often need to assess the extent of contamination at polluted sites. Identification of these contaminants having potentially adverse effects on human health is especially important in various ecological and environmental applications. An environmental scientist typically generates and analyzes large amounts of multidimensional data. These practitioners often need to identify experimental conditions and results which look suspicious and are significantly different from the rest of the data. The classical Mahalanobis distance (MD) and its variants (e.g., multivariate kurtosis) are routinely used to identify these anomalies. These test statistics depend upon the estimates of population location and scale. The presence of anomalous observations usually results in distorted and unreliable maximum likelihood estimates (MLEs) and ordinary least-squares (OLS) estimates of the population parameters. These in turn result in deflated and distorted classical MDs and lead to masking effects. This means that the results from statistical tests and inference based upon these classical estimates may be misleading. For example, in an environmental monitoring application, it is possible that the classification procedure based upon the distorted estimates may classify a contaminated sample as coming from the clean population and a clean sample as coming from the contaminated part of the site. This in turn can lead to incorrect remediation decisions.
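The classical Mahalanobis distance the review describes can be sketched directly for two-dimensional data, with the 2×2 covariance inverse written out by hand. Note the caveat the abstract raises: the sample mean and covariance used here are themselves distorted by the outliers being hunted, which is what produces masking; the data below are invented.

```python
def mahalanobis_2d(data):
    """Classical squared Mahalanobis distances for 2-D points, using
    the sample mean and (n-1)-divisor sample covariance. These classical
    estimates are themselves pulled toward outliers (the masking problem
    the review discusses)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    d2 = []
    for x, y in data:
        dx, dy = x - mx, y - my
        # (dx, dy) S^{-1} (dx, dy)^T with the 2x2 inverse expanded
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2

pts = [(1.0, 1.1), (1.2, 1.0), (0.9, 0.9), (1.1, 1.2), (1.0, 1.0), (8.0, 8.0)]
d2 = mahalanobis_2d(pts)
```

A useful check on the computation: with the sample mean and (n−1)-divisor covariance, the squared distances always sum to (n−1)p, here 5 × 2 = 10, so a single gross outlier can never look arbitrarily extreme on this scale - one reason robust alternatives such as MVE-based distances are preferred.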