
Showing papers on "Outlier" published in 2005


Posted Content
TL;DR: In this article, the authors evaluated measures for making comparisons of errors across time series and found that the Root Mean Square Error (RMSE) is not reliable and is therefore inappropriate for comparing accuracy across series; relative-error measures such as the GMRAE, MdRAE, and MdAPE are recommended instead.
Abstract: This study evaluated measures for making comparisons of errors across time series. We analyzed 90 annual and 101 quarterly economic time series. We judged error measures on reliability, construct validity, sensitivity to small changes, protection against outliers, and their relationship to decision making. The results lead us to recommend the Geometric Mean of the Relative Absolute Error (GMRAE) when the task involves calibrating a model for a set of time series. The GMRAE compares the absolute error of a given method to that from the random walk forecast. For selecting the most accurate methods, we recommend the Median RAE (MdRAE) when few series are available and the Median Absolute Percentage Error (MdAPE) otherwise. The Root Mean Square Error (RMSE) is not reliable, and is therefore inappropriate for comparing accuracy across series.
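To make these definitions concrete, here is a minimal Python sketch (my illustration, not from the paper) of the relative absolute error against the random-walk benchmark and its GMRAE, MdRAE, and MdAPE aggregations; the function names are my own, and zero benchmark errors are assumed not to occur.

```python
import numpy as np

def relative_absolute_error(actual, forecast):
    """RAE: absolute error of a method divided by the absolute error of the
    naive random-walk forecast (the previous observation). Assumes consecutive
    observations differ, so the benchmark error is never zero."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    rw_forecast = actual[:-1]                      # random walk predicts the previous value
    method_err = np.abs(actual[1:] - forecast[1:])
    rw_err = np.abs(actual[1:] - rw_forecast)
    return method_err / rw_err

def gmrae(actual, forecast):
    """Geometric Mean of the Relative Absolute Error."""
    return float(np.exp(np.mean(np.log(relative_absolute_error(actual, forecast)))))

def mdrae(actual, forecast):
    """Median Relative Absolute Error."""
    return float(np.median(relative_absolute_error(actual, forecast)))

def mdape(actual, forecast):
    """Median Absolute Percentage Error."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.median(np.abs((actual - forecast) / actual)) * 100)

# toy series and forecasts
y = [100, 102, 101, 105, 107]
yhat = [100, 101, 103, 104, 108]
print(gmrae(y, yhat), mdrae(y, yhat), mdape(y, yhat))
```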

1,009 citations


Journal ArticleDOI
TL;DR: There is no good reason to continue to use the mean ± 2 SD rule, originally proposed as a 'filter' to identify approximately 2.5% of the data at each extreme for further inspection, at a time when computers to do the drudgery of numerical operations were not widely available and no other practical methods existed.
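For reference, the criticized rule amounts to flagging values more than two standard deviations from the mean; a minimal sketch (my illustration, not from the paper):

```python
import numpy as np

def two_sd_filter(x):
    """Classic 'mean +/- 2 SD' rule: flag values outside mean +/- 2*SD.
    For normally distributed data this marks roughly 2.5% of points at each tail."""
    x = np.asarray(x, float)
    mu, sd = x.mean(), x.std(ddof=1)
    return (x < mu - 2 * sd) | (x > mu + 2 * sd)

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 14.7])
print(two_sd_filter(data))   # only the last value is flagged
```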

686 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: A novel feature bagging approach for detecting outliers in very large, high dimensional and noisy databases is proposed, which combines results from multiple outlier detection algorithms that are applied using different sets of features.
Abstract: Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel feature bagging approach for detecting outliers in very large, high dimensional and noisy databases is proposed. It combines results from multiple outlier detection algorithms that are applied using different sets of features. Every outlier detection algorithm uses a small subset of features that are randomly selected from the original feature set. As a result, each outlier detector identifies different outliers, and thus assigns to all data records outlier scores that correspond to their probability of being outliers. The outlier scores computed by the individual outlier detection algorithms are then combined in order to find better-quality outliers. Experiments performed on several synthetic and real life data sets show that the proposed methods for combining outputs from multiple outlier detection algorithms provide non-trivial improvements over the base algorithm.
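A rough sketch of the feature-bagging idea (my own simplification, not the authors' implementation): run a base outlier scorer on random feature subsets and combine the scores, here by simple averaging. LOF from scikit-learn stands in for the base detector, and the subset-size rule is an assumption.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def feature_bagging_scores(X, n_rounds=10, seed=0):
    """Average outlier scores from a base detector run on random feature subsets."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_rounds):
        k = rng.integers(d // 2 + 1, d + 1)           # subset size between d/2+1 and d
        cols = rng.choice(d, size=k, replace=False)
        lof = LocalOutlierFactor(n_neighbors=10).fit(X[:, cols])
        scores += -lof.negative_outlier_factor_       # larger value = more outlying
    return scores / n_rounds

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
X[:5] += 6                                            # plant a few outliers
print(np.argsort(feature_bagging_scores(X))[-5:])     # indices of the top-scoring points
```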

622 citations


Journal ArticleDOI
TL;DR: Results from both simulated and clinical diffusion data sets indicate that the RESTORE method improves tensor estimation compared to the commonly used linear and nonlinear least‐squares tensor fitting methods and a recently proposed method based on the Geman–McClure M‐estimator.
Abstract: Signal variability in diffusion weighted imaging (DWI) is influenced by both thermal noise and spatially and temporally varying artifacts such as subject motion and cardiac pulsation. In this paper, the effects of DWI artifacts on estimated tensor values, such as trace and fractional anisotropy, are analyzed using Monte Carlo simulations. A novel approach for robust diffusion tensor estimation, called RESTORE (for robust estimation of tensors by outlier rejection), is proposed. This method uses iteratively reweighted least-squares regression to identify potential outliers and subsequently exclude them. Results from both simulated and clinical diffusion data sets indicate that the RESTORE method improves tensor estimation compared to the commonly used linear and nonlinear least-squares tensor fitting methods and a recently proposed method based on the Geman–McClure M-estimator. The RESTORE method could potentially remove the need for cardiac gating in DWI acquisitions and should be applicable to other MR imaging techniques that use univariate or multivariate regression to fit MRI data to a model.
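The core of RESTORE, iteratively reweighted least squares that down-weights likely outliers before refitting, can be sketched for a generic linear model; this is a simplified illustration rather than the authors' diffusion-tensor code, and the Geman-McClure weight form and MAD-based scale are my choices.

```python
import numpy as np

def irls_geman_mcclure(X, y, n_iter=50, eps=1e-8):
    """Iteratively reweighted least squares with Geman-McClure weights
    w(r) = 1 / (1 + (r/c)**2)**2, which strongly down-weight large residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]                   # ordinary LS start
    w = np.ones(len(y))
    for _ in range(n_iter):
        r = y - X @ beta
        c = 1.4826 * np.median(np.abs(r - np.median(r))) + eps    # robust scale (MAD)
        w = 1.0 / (1.0 + (r / c) ** 2) ** 2
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
        if np.allclose(beta_new, beta, atol=1e-10):
            break
        beta = beta_new
    return beta, w            # small final weights flag potential outliers

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = 2 + 3 * X[:, 1] + rng.normal(scale=0.1, size=100)
y[:5] += 10                                                       # corrupted measurements
beta, w = irls_geman_mcclure(X, y)
print(beta.round(3), w[:5].round(3))
```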

621 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the information used for sensor localization is fundamentally local with regard to the network topology and used to reformulate the problem within a graphical model framework, and that judicious message construction can result in better estimates.
Abstract: Automatic self-localization is a critical need for the effective use of ad hoc sensor networks in military or civilian applications. In general, self-localization involves the combination of absolute location information (e.g., from a global positioning system) with relative calibration information (e.g., distance measurements between sensors) over regions of the network. Furthermore, it is generally desirable to distribute the computational burden across the network and minimize the amount of intersensor communication. We demonstrate that the information used for sensor localization is fundamentally local with regard to the network topology and use this observation to reformulate the problem within a graphical model framework. We then present and demonstrate the utility of nonparametric belief propagation (NBP), a recent generalization of particle filtering, for both estimating sensor locations and representing location uncertainties. NBP has the advantage that it is easily implemented in a distributed fashion, admits a wide variety of statistical models, and can represent multimodal uncertainty. Using simulations of small to moderately sized sensor networks, we show that NBP may be made robust to outlier measurement errors by a simple model augmentation, and that judicious message construction can result in better estimates. Furthermore, we provide an analysis of NBP's communications requirements, showing that typically only a few messages per sensor are required, and that even low bit-rate approximations of these messages can be used with little or no performance impact.

586 citations


Journal ArticleDOI
01 Jul 2005
TL;DR: A robust moving least-squares technique for reconstructing a piecewise smooth surface from a potentially noisy point cloud is introduced, based on a new robust statistics method for outlier detection: the forward-search paradigm.
Abstract: We introduce a robust moving least-squares technique for reconstructing a piecewise smooth surface from a potentially noisy point cloud. We use techniques from robust statistics to guide the creation of the neighborhoods used by the moving least squares (MLS) computation. This leads to a conceptually simple approach that provides a unified framework for not only dealing with noise, but also for enabling the modeling of surfaces with sharp features. Our technique is based on a new robust statistics method for outlier detection: the forward-search paradigm. Using this powerful technique, we locally classify regions of a point-set into multiple outlier-free smooth regions. This classification allows us to project points on a locally smooth region rather than a surface that is smooth everywhere, thus defining a piecewise smooth surface and increasing the numerical stability of the projection operator. Furthermore, by treating the points across the discontinuities as outliers, we are able to define sharp features. One of the nice features of our approach is that it automatically disregards outliers during the surface-fitting phase.

584 citations


Proceedings ArticleDOI
20 Jun 2005
TL;DR: This paper describes a novel multi-view matching framework based on a new type of invariant feature that is used in an automatic 2D panorama stitcher that has been extensively tested on hundreds of sample inputs.
Abstract: This paper describes a novel multi-view matching framework based on a new type of invariant feature. Our features are located at Harris corners in discrete scale-space and oriented using a blurred local gradient. This defines a rotationally invariant frame in which we sample a feature descriptor, which consists of an 8 × 8 patch of bias/gain normalised intensity values. The density of features in the image is controlled using a novel adaptive non-maximal suppression algorithm, which gives a better spatial distribution of features than previous approaches. Matching is achieved using a fast nearest neighbour algorithm that indexes features based on their low frequency Haar wavelet coefficients. We also introduce a novel outlier rejection procedure that verifies a pairwise feature match based on a background distribution of incorrect feature matches. Feature matches are refined using RANSAC and used in an automatic 2D panorama stitcher that has been extensively tested on hundreds of sample inputs.
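The adaptive non-maximal suppression step can be illustrated with a short brute-force sketch (my own simplification, not the authors' implementation): each corner receives a suppression radius equal to its distance to the nearest sufficiently stronger corner, and the points with the largest radii are kept.

```python
import numpy as np

def adaptive_nms(points, strengths, n_keep, c_robust=0.9):
    """Keep the n_keep interest points with the largest suppression radii, where
    radius(i) = distance to the nearest point j with c_robust * strength[j] > strength[i]."""
    points = np.asarray(points, float)
    strengths = np.asarray(strengths, float)
    radii = np.full(len(points), np.inf)
    for i in range(len(points)):
        stronger = c_robust * strengths > strengths[i]
        if stronger.any():
            radii[i] = np.linalg.norm(points[stronger] - points[i], axis=1).min()
    return np.argsort(-radii)[:n_keep]        # indices of the retained corners

rng = np.random.default_rng(0)
pts = rng.uniform(0, 100, size=(500, 2))
strength = rng.uniform(size=500)
print(adaptive_nms(pts, strength, n_keep=50))
```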

467 citations


Journal ArticleDOI
TL;DR: It is demonstrated that important processes such as the input of metals from contamination sources and the contribution of sea-salts via marine aerosols to the soil can be identified and separated.

394 citations


Proceedings ArticleDOI
17 Oct 2005
TL;DR: A novel and robust approach to the point set registration problem in the presence of large amounts of noise and outliers is proposed; a closed-form expression for the L2 distance between two Gaussian mixtures is derived, which leads to a computationally efficient registration algorithm.
Abstract: This paper proposes a novel and robust approach to the point set registration problem in the presence of large amounts of noise and outliers. Each of the point sets is represented by a mixture of Gaussians and the point set registration is treated as a problem of aligning the two mixtures. We derive a closed-form expression for the L2 distance between two Gaussian mixtures, which in turn leads to a computationally efficient registration algorithm. This new algorithm has an intuitive interpretation, is simple to implement and exhibits inherent statistical robustness. Experimental results indicate that our algorithm achieves very good performance in terms of both robustness and accuracy.
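The closed form follows from the identity ∫ N(x; μ1, Σ1) N(x; μ2, Σ2) dx = N(μ1; μ2, Σ1 + Σ2); a small sketch of the resulting mixture L2 distance (my illustration, not the authors' code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss_overlap(m1, S1, m2, S2):
    """Integral of the product of two Gaussian densities: N(m1; m2, S1 + S2)."""
    return multivariate_normal.pdf(m1, mean=m2, cov=S1 + S2)

def gmm_l2_distance(w1, mu1, cov1, w2, mu2, cov2):
    """Squared L2 distance between two Gaussian mixtures, in closed form."""
    def cross(wa, ma, Sa, wb, mb, Sb):
        return sum(wa[i] * wb[j] * gauss_overlap(ma[i], Sa[i], mb[j], Sb[j])
                   for i in range(len(wa)) for j in range(len(wb)))
    return (cross(w1, mu1, cov1, w1, mu1, cov1)
            - 2 * cross(w1, mu1, cov1, w2, mu2, cov2)
            + cross(w2, mu2, cov2, w2, mu2, cov2))

# two 2-component mixtures in 2-D, identity covariances
I = np.eye(2)
w = [0.5, 0.5]
muA = [np.array([0.0, 0.0]), np.array([5.0, 0.0])]
muB = [np.array([0.2, 0.1]), np.array([5.1, -0.1])]
print(gmm_l2_distance(w, muA, [I, I], w, muB, [I, I]))
```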

380 citations


Journal ArticleDOI
TL;DR: An in-memory and disk-based implementation of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets showing that the algorithm scales well in both cases are presented.
Abstract: A new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large and high-dimensional data set are proposed. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest-neighbors. Outliers are those points scoring the largest values of weight. The algorithm HilOut makes use of the notion of space-filling curve to linearize the data set, and it consists of two phases. The first phase provides an approximate solution, within a rough factor, after the execution of at most d + 1 sorts and scans of the data set, with temporal cost quadratic in d and linear in N and in k, where d is the number of dimensions of the data set and N is the number of points in the data set. During this phase, the algorithm isolates points that are candidates to be outliers and reduces this set at each iteration. If the size of this set becomes n, then the algorithm stops, reporting the exact solution. The second phase calculates the exact solution with a final scan examining further the candidate outliers that remained after the first phase. Experimental results show that the algorithm always stops, reporting the exact solution, during the first phase after much less than d + 1 steps. We present both an in-memory and a disk-based implementation of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets showing that the algorithm scales well in both cases.
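The weight definition itself is simple to state in code; the brute-force sketch below (quadratic in N, unlike HilOut's space-filling-curve strategy) scores each point by the sum of distances to its k nearest neighbors. Not the authors' implementation.

```python
import numpy as np

def knn_weight_scores(X, k=5):
    """Weight of a point = sum of distances to its k nearest neighbors;
    the points with the largest weights are the distance-based outliers."""
    X = np.asarray(X, float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # full distance matrix
    np.fill_diagonal(D, np.inf)                                  # ignore self-distances
    return np.sort(D, axis=1)[:, :k].sum(axis=1)                 # sum of k smallest per row

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:3] += 8                                                       # plant three outliers
print(np.argsort(-knn_weight_scores(X, k=5))[:3])                # indices of the top-3 outliers
```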

348 citations


Journal ArticleDOI
TL;DR: This work shows that robust iteratively reweighted least squares (IRLS) at the 2nd level is a computationally efficient technique that both increases statistical power and decreases false positive rates in the presence of outliers and provides software to implement IRLS in group neuroimaging analyses.

Journal ArticleDOI
01 Mar 2005
TL;DR: A Bayesian approach to mixture modelling based on Student-t distributions, which are heavier tailed than Gaussians and hence more robust, is developed, which includes Gaussian mixtures as a special case.
Abstract: Bayesian approaches to density estimation and clustering using mixture distributions allow the automatic determination of the number of components in the mixture. Previous treatments have focussed on mixtures having Gaussian components, but these are well known to be sensitive to outliers, which can lead to excessive sensitivity to small numbers of data points and consequent over-estimates of the number of components. In this paper we develop a Bayesian approach to mixture modelling based on Student-t distributions, which are heavier tailed than Gaussians and hence more robust. By expressing the Student-t distribution as a marginalization over additional latent variables we are able to derive a tractable variational inference algorithm for this model, which includes Gaussian mixtures as a special case. Results on a variety of real data sets demonstrate the improved robustness of our approach.

Journal ArticleDOI
01 Apr 2005
TL;DR: A novel multilevel hierarchical Kohonen Net for an intrusion detection system in which each layer operates on a small subset of the feature space is superior to a single-layer K-Map operating on the whole feature space in detecting a variety of attacks in terms of detection rate as well as false positive rate.
Abstract: A novel multilevel hierarchical Kohonen Net (K-Map) for an intrusion detection system is presented. Each level of the hierarchical map is modeled as a simple winner-take-all K-Map. One significant advantage of this multilevel hierarchical K-Map is its computational efficiency. Unlike other statistical anomaly detection methods such as nearest neighbor approach, K-means clustering or probabilistic analysis that employ distance computation in the feature space to identify the outliers, our approach does not involve costly point-to-point computation in organizing the data into clusters. Another advantage is the reduced network size. We use the classification capability of the K-Map on selected dimensions of data set in detecting anomalies. Randomly selected subsets that contain both attacks and normal records from the KDD Cup 1999 benchmark data are used to train the hierarchical net. We use a confidence measure to label the clusters. Then we use the test set from the same KDD Cup 1999 benchmark to test the hierarchical net. We show that a hierarchical K-Map in which each layer operates on a small subset of the feature space is superior to a single-layer K-Map operating on the whole feature space in detecting a variety of attacks in terms of detection rate as well as false positive rate.

Journal ArticleDOI
Charu C. Aggarwal1, S. Yu1
01 Apr 2005
TL;DR: New techniques for outlier detection that find the outliers by studying the behavior of projections from the data set are discussed.
Abstract: The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are most important for high-dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms have been proposed for outlier detection that use several concepts of proximity in order to find the outliers based on their relationship to the other points in the data. However, in high-dimensional space, the data are sparse and concepts using the notion of proximity fail to retain their effectiveness. In fact, the sparsity of high-dimensional data can be understood in a different way so as to imply that every point is an equally good outlier from the perspective of distance-based definitions. Consequently, for high-dimensional data, the notion of finding meaningful outliers becomes substantially more complex and nonobvious. In this paper, we discuss new techniques for outlier detection that find the outliers by studying the behavior of projections from the data set.

Journal ArticleDOI
TL;DR: In this article, the authors employ robust estimation and apply Extreme Bounds Analysis (EBA) to deal with the problem of model uncertainty and find that the use of robust estimation affects the list of variables that are significant determinants of economic growth.
Abstract: Two important problems exist in cross-country growth studies: outliers and model uncertainty. Employing Sala-i-Martin’s (1997a,b) data set, we first use robust estimation and analyze to what extent outliers influence OLS regressions. We then use both OLS and robust estimation techniques in applying the Extreme Bounds Analysis (EBA) to deal with the problem of model uncertainty. We find that the use of robust estimation affects the list of variables that are significant determinants of economic growth. Also the magnitude of the impact of these variables differs sometimes under the various approaches.

Journal ArticleDOI
TL;DR: In this article, an improved F approximation is presented that gives accurate outlier rejection points for Mahalanobis-type distances based on Rousseeuw's minimum covariance determinant (MCD) estimator.
Abstract: Mahalanobis-type distances in which the shape matrix is derived from a consistent, high-breakdown robust multivariate location and scale estimator have an asymptotic chi-squared distribution as is the case with those derived from the ordinary covariance matrix. For example, Rousseeuw's minimum covariance determinant (MCD) is a robust estimator with a high breakdown. However, even in quite large samples, the chi-squared approximation to the distances of the sample data from the MCD center with respect to the MCD shape is poor. We provide an improved F approximation that gives accurate outlier rejection points for various sample sizes.
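A sketch of the practical workflow using scikit-learn's MinCovDet for the MCD fit; note that the plain chi-squared cutoff in the last line is exactly the approximation the paper shows to be inaccurate in finite samples, so their F approximation would replace it. Not the authors' code.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(3), np.eye(3), size=200)
X[:5] += 6                                   # contaminate a few rows

mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)                      # squared robust (MCD-based) distances

# naive chi-squared rejection point; the paper's F approximation would be used instead
cutoff = chi2.ppf(0.975, df=X.shape[1])
print(np.where(d2 > cutoff)[0])
```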

Journal ArticleDOI
TL;DR: A new method is presented to detect outliers by discovering frequent patterns (or frequent itemsets) from the data set, defining a measure called FPOF (Frequent Pattern Outlier Factor) and proposing the FindFPOF algorithm to discover outliers.
Abstract: An outlier in a dataset is an observation or a point that is considerably dissimilar to or inconsistent with the remainder of the data. Detection of such outliers is important for many applications and has recently attracted much attention in the data mining research community. In this paper, we present a new method to detect outliers by discovering frequent patterns (or frequent itemsets) from the data set. The outliers are defined as the data transactions that contain fewer frequent patterns in their itemsets. We define a measure called FPOF (Frequent Pattern Outlier Factor) to detect the outlier transactions and propose the FindFPOF algorithm to discover outliers. The experimental results have shown that our approach outperformed the existing methods in identifying interesting outliers.
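A brute-force sketch of the FPOF idea as described (my reading of the definition, not the authors' code): enumerate frequent itemsets exhaustively, then score each transaction by the normalized sum of supports of the frequent itemsets it contains; low scores indicate outliers.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.4, max_len=3):
    """Exhaustively enumerate itemsets whose support is at least min_support."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    freq = {}
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            support = sum(set(combo) <= t for t in transactions) / n
            if support >= min_support:
                freq[frozenset(combo)] = support
    return freq

def fpof_scores(transactions, min_support=0.4):
    """FPOF(t): sum of supports of frequent itemsets contained in t, divided by
    the number of frequent itemsets; small values indicate outlier transactions."""
    freq = frequent_itemsets(transactions, min_support)
    return [sum(s for iset, s in freq.items() if iset <= t) / len(freq)
            for t in transactions]

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"x", "y"}]
print(fpof_scores(data))      # the {"x", "y"} transaction gets the lowest score
```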

Journal ArticleDOI
TL;DR: Open set TCM-kNN (transduction confidence machine-k nearest neighbors), suitable for multiclass authentication operational scenarios that have to include a rejection option for classes never enrolled in the gallery, is shown to be suitable for PSEI (pattern specific error inhomogeneities) error analysis in order to identify difficult to recognize faces.
Abstract: This paper motivates and describes a novel realization of transductive inference that can address the open set face recognition task. Open set operates under the assumption that not all the test probes have mates in the gallery. It either detects the presence of some biometric signature within the gallery and finds its identity or rejects it, i.e., it provides for the "none of the above" answer. The main contribution of the paper is open set TCM-kNN (transduction confidence machine-k nearest neighbors), which is suitable for multiclass authentication operational scenarios that have to include a rejection option for classes never enrolled in the gallery. Open set TCM-kNN, driven by the relation between transduction and Kolmogorov complexity, provides a local estimation of the likelihood ratio needed for detection tasks. We provide extensive experimental data to show the feasibility, robustness, and comparative advantages of open set TCM-kNN on open set identification and watch list (surveillance) tasks using challenging FERET data. Last, we analyze the error structure driven by the fact that most of the errors in identification are due to a relatively small number of face patterns. Open set TCM-kNN is shown to be suitable for PSEI (pattern specific error inhomogeneities) error analysis in order to identify difficult to recognize faces. PSEI analysis improves biometric performance by removing a small number of those difficult to recognize faces responsible for much of the original error in performance and/or by using data fusion.

Journal ArticleDOI
Fredrik Athley1
TL;DR: Approximations to the mean square estimation error and probability of outlier are derived that can be used to predict the threshold region performance of the ML estimator with high accuracy and alleviate the need for time-consuming computer simulations when evaluating the threshold region performance.
Abstract: This paper presents a performance analysis of the maximum likelihood (ML) estimator for finding the directions of arrival (DOAs) with a sensor array. The asymptotic properties of this estimator are well known. In this paper, the performance under conditions of low signal-to-noise ratio (SNR) and a small number of array snapshots is investigated. It is well known that the ML estimator exhibits a threshold effect, i.e., a rapid deterioration of estimation accuracy below a certain SNR or number of snapshots. This effect is caused by outliers and is not captured by standard techniques such as the Cramér-Rao bound and asymptotic analysis. In this paper, approximations to the mean square estimation error and probability of outlier are derived that can be used to predict the threshold region performance of the ML estimator with high accuracy. Both the deterministic ML and stochastic ML estimators are treated for the single-source and multisource estimation problems. These approximations alleviate the need for time-consuming computer simulations when evaluating the threshold region performance. For the special case of a single stochastic source signal and a single snapshot, it is shown that the ML estimator is not statistically efficient as SNR → ∞ due to the effect of outliers.

Journal ArticleDOI
TL;DR: This article reviews the most commonly used robust multivariate regression and exploratory methods that have appeared since 1996 in the field of chemometrics and puts special emphasis on the robust versions of chemometric standard tools like PCA and PLS and the corresponding robust estimates of regression, location and scatter on which they are based.
Abstract: Outliers may hamper proper classical multivariate analysis, and lead to incorrect conclusions. To remedy the problem of outliers, robust methods are developed in statistics and chemometrics. Robust methods reduce or remove the effect of outlying data points and allow the ‘good’ data to primarily determine the result. This article reviews the most commonly used robust multivariate regression and exploratory methods that have appeared since 1996 in the field of chemometrics. Special emphasis is put on the robust versions of chemometric standard tools like PCA and PLS and the corresponding robust estimates of regression, location and scatter on which they are based. Copyright © 2006 John Wiley & Sons, Ltd.

Proceedings ArticleDOI
21 Jun 2005
TL;DR: In this article, a mean-shift based clustering procedure is proposed for robust filtering of a noisy set of points sampled from a smooth surface using a kernel density estimation technique for point clustering.
Abstract: In this paper, we develop a method for robust filtering of a noisy set of points sampled from a smooth surface. The main idea of the method consists of using a kernel density estimation technique for point clustering. Specifically, we use a mean-shift based clustering procedure. With every point of the input data we associate a local likelihood measure capturing the probability that a 3D point is located on the sampled surface. The likelihood measure takes into account the normal directions estimated at the scattered points. Our filtering procedure suppresses noise of different amplitudes and allows for an easy detection of outliers, which are then automatically removed by simple thresholding. The remaining set of maximum likelihood points delivers an accurate point-based approximation of the surface. We also show that while some established meshing techniques often fail to reconstruct the surface from original noisy point scattered data, they work well in conjunction with our filtering method.

Patent
27 Apr 2005
TL;DR: In this article, a multi-view matching framework based on a new class of invariant features is presented, where features are located at Harris corners in scale-space and oriented using a blurred local gradient.
Abstract: A system and process for identifying corresponding points among multiple images of a scene is presented. This involves a multi-view matching framework based on a new class of invariant features. Features are located at Harris corners in scale-space and oriented using a blurred local gradient. This defines a similarity invariant frame in which to sample a feature descriptor. The descriptor actually formed is a bias/gain normalized patch of intensity values. Matching is achieved using a fast nearest neighbor procedure that uses indexing on low frequency Haar wavelet coefficients. A simple 6 parameter model for patch matching is employed, and the noise statistics are analyzed for correct and incorrect matches. This leads to a simple match verification procedure based on a per feature outlier distance.

Proceedings ArticleDOI
05 Mar 2005
TL;DR: The Chebyshev outlier detection method does not ascertain the reason for the outlier; it identifies potential outlier data, allowing for domain experts to investigate the cause.
Abstract: During data collection and analysis, it is often necessary to identify and possibly remove outliers that exist. An objective method for identifying outliers to be removed is critical. Many automated outlier detection methods are available. However, many are limited by assumptions of a distribution or require upper and lower predefined boundaries in which the data should exist. If there is a known distribution for the data, then using that distribution can aid in finding outliers. Often, a distribution is not known, or the experimenter does not want to make an assumption about a certain distribution. Also, enough information may not exist about a set of data to be able to determine reliable upper and lower boundaries. For these cases, an outlier detection method, using the empirical data and based upon Chebyshev's inequality, was formed. This method allows for detection of multiple outliers, not just one at a time. This method also assumes that the data are independent measurements and that a relatively small percentage of outliers are contained in the data. Chebyshev's inequality gives a bound of what percentage of the data falls outside of k standard deviations from the mean. This calculation holds no assumptions about the distribution of the data. If the data are known to be unimodal without a known distribution, then the method can be improved by using the unimodal Chebyshev inequality. The Chebyshev outlier detection method uses the Chebyshev inequality to calculate upper and lower outlier detection limits. Data values that are not within the range of the upper and lower limits would be considered data outliers. Outliers could be due to erroneous data or could indicate that the data are correct but highly unusual. This algorithm does not ascertain the reason for the outlier; it identifies potential outlier data, allowing for domain experts to investigate the cause.
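The core calculation is direct: Chebyshev's inequality P(|X − μ| ≥ kσ) ≤ 1/k² gives k = 1/√p for a chosen outside-fraction p, and the detection limits are μ ± kσ. A minimal single-pass sketch (the choice of p is mine, and any staged refinement of the limits is omitted):

```python
import numpy as np

def chebyshev_limits(x, p=0.05):
    """Outlier detection limits from Chebyshev's inequality: for any distribution,
    at most a fraction p of the data lies outside mean +/- k*SD with k = 1/sqrt(p)."""
    x = np.asarray(x, float)
    k = 1.0 / np.sqrt(p)
    mu, sigma = x.mean(), x.std(ddof=1)
    return mu - k * sigma, mu + k * sigma

def chebyshev_outliers(x, p=0.05):
    lower, upper = chebyshev_limits(x, p)
    x = np.asarray(x, float)
    return (x < lower) | (x > upper)

data = np.concatenate([np.random.default_rng(0).normal(50, 5, 500), [120.0, -40.0]])
print(np.where(chebyshev_outliers(data))[0])   # flags the two planted extreme values
```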

Book ChapterDOI
19 Jun 2005
TL;DR: An Outlier Removal Clustering (ORC) algorithm that provides outlier detection and data clustering simultaneously and has a lower error on datasets with overlapping clusters than the competing methods is presented.
Abstract: We present an Outlier Removal Clustering (ORC) algorithm that provides outlier detection and data clustering simultaneously. The method employs both clustering and outlier discovery to improve estimation of the centroids of the generative distribution. The proposed algorithm consists of two stages. The first stage consists of a pure K-means process, while the second stage iteratively removes the vectors which are far from their cluster centroids. We provide experimental results on three different synthetic datasets and three map images which were corrupted by lossy compression. The results indicate that the proposed method has a lower error on datasets with overlapping clusters than the competing methods.
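A rough sketch of the two-stage procedure as described (my approximation, not the authors' code): run K-means, then repeatedly drop points whose distance to their centroid exceeds a fraction of the current maximum distance, refitting the centroids each round.

```python
import numpy as np
from sklearn.cluster import KMeans

def orc(X, n_clusters=3, threshold=0.9, n_rounds=5, random_state=0):
    """Outlier Removal Clustering sketch: K-means, then iterative removal of the
    points farthest from their centroids, with centroid re-estimation each round."""
    X = np.asarray(X, float)
    keep = np.arange(len(X))
    for _ in range(n_rounds):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X[keep])
        d = np.linalg.norm(X[keep] - km.cluster_centers_[km.labels_], axis=1)
        keep = keep[d <= threshold * d.max()]
    return keep, km.cluster_centers_

rng = np.random.default_rng(0)
clusters = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ((0, 0), (4, 0), (2, 3))])
X = np.vstack([clusters, rng.uniform(-3, 7, size=(10, 2))])    # add scattered outliers
keep, centers = orc(X)
print(len(X) - len(keep), "points removed; centroids:\n", centers.round(2))
```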

Book ChapterDOI
Chuanhai Liu1
14 Jul 2005
TL;DR: In this article, the maximum likelihood estimators of the robit model with a known number of degrees of freedom were shown to be robust to outliers, and the authors proposed a Data Augmentation (DA) algorithm for Bayesian inference with the Robit regression model.
Abstract: Logistic and probit regression models are commonly used in practice to analyze binary response data, but the maximum likelihood estimators of these models are not robust to outliers. This paper considers a robit regression model, which replaces the normal distribution in the probit regression model with a t-distribution with a known or unknown number of degrees of freedom. It is shown that (i) the maximum likelihood estimators of the robit model with a known number of degrees of freedom are robust; (ii) the robit link with about seven degrees of freedom provides an excellent approximation to the logistic link; and (iii) the robit link with a large number of degrees of freedom approximates the probit link. The maximum likelihood estimates can be obtained using efficient EM-type algorithms. EM-type algorithms also provide information that can be used to identify outliers, to which the maximum likelihood estimates of the logistic and probit regression coefficient would be sensitive. The EM algorithms for robit regression are easily modified to obtain efficient Data Augmentation (DA) algorithms for Bayesian inference with the robit regression model. The DA algorithms for robit regression model are much simpler to implement than the existing Gibbs sampler for the logistic regression model. A numerical example illustrates the methodology.
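A compact sketch of robit regression by direct maximum likelihood rather than the paper's EM algorithm: replace the probit's normal CDF with a Student-t CDF with ν = 7 degrees of freedom and maximize the Bernoulli likelihood numerically. The use of scipy.optimize and all names are my choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import t as student_t

def fit_robit(X, y, df=7):
    """Robit regression: P(y = 1 | x) = T_df(x @ beta), with T_df the CDF of a
    Student-t distribution with df degrees of freedom (df = 7 mimics the logit)."""
    X, y = np.asarray(X, float), np.asarray(y, float)

    def neg_log_lik(beta):
        p = np.clip(student_t.cdf(X @ beta, df), 1e-12, 1 - 1e-12)  # avoid log(0)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    return minimize(neg_log_lik, x0=np.zeros(X.shape[1]), method="BFGS").x

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_beta = np.array([-0.5, 1.5])
y = (rng.uniform(size=500) < student_t.cdf(X @ true_beta, 7)).astype(float)
print(fit_robit(X, y).round(2))      # estimates should be close to (-0.5, 1.5)
```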

Book ChapterDOI
23 Aug 2005
TL;DR: In this paper, the problem of outlier detection in categorical data is defined as an optimization problem from a global viewpoint, and a local search heuristic based algorithm is proposed to find feasible solutions.
Abstract: In this paper, we formally define the problem of outlier detection in categorical data as an optimization problem from a global viewpoint. Moreover, we present a local-search heuristic based algorithm for efficiently finding feasible solutions. Experimental results on real datasets and large synthetic datasets demonstrate the superiority of our model and algorithm.

Journal ArticleDOI
TL;DR: It is found that strong polarization of the magnetic field channels and the distribution of response function errors are two important parameters for noise detection.
Abstract: Magnetotelluric (MT) response function estimates can be severely disturbed by the effects of cultural noise. Methods to isolate and remove these disturbances are typically based on time-series editing, robust statistics, remote reference processing, or some combination of the above. Robust remote reference processing can improve the data quality at a local site, but only if synchronous recordings of at least one additional site are available and if electromagnetic noise between these sites is uncorrelated. If these prerequisites are not met, we suggest an alternative approach for noise removal, based on a combination of frequency domain editing with subsequent single site robust processing. The data pre-selection relies on a thorough visual inspection of a variety of statistical parameters such as spectral power densities, coherences, the distribution of response functions and their errors, etc. Extreme outliers and particularly noisy data segments are excluded from further data processing by setting threshold values for individual parameters. Examples from Namibia and Jordan illustrate that this scheme can improve data quality significantly. However, the examples also suggest that it is not possible to establish generally valid rules for selection as they depend strongly on the local noise conditions. High coherence, for example, can indicate a good signal-to-noise ratio or strongly correlated noise. However, we found that strong polarization of the magnetic field channels and the distribution of response function errors are two important parameters for noise detection.

Journal ArticleDOI
Peiliang Xu1
TL;DR: Sign-constrained robust least squares as discussed by the authors is a robust estimation method that employs a maximum possible number of good data to derive the robust solution, and thus will not be affected by partial near-multi-collinearity among part of the data or if some data are clustered together.
Abstract: The findings of this paper are summarized as follows: (1) We propose a sign-constrained robust estimation method, which can tolerate 50% of data contamination and meanwhile achieve high, least-squares-comparable efficiency. Since the objective function is identical with least squares, the method may also be called sign-constrained robust least squares. An iterative version of the method has been implemented and shown to be capable of resisting against more than 50% of contamination. As a by-product, a robust estimate of scale parameter can also be obtained. Unlike the least median of squares method and repeated medians, which use a least possible number of data to derive the solution, the sign-constrained robust least squares method attempts to employ a maximum possible number of good data to derive the robust solution, and thus will not be affected by partial near multi-collinearity among part of the data or if some of the data are clustered together; (2) although M-estimates have been reported to have a breakdown point of 1/(t+1), we have shown that the weights of observations can readily deteriorate such results and bring the breakdown point of M-estimates of Huber’s type to zero. The same zero breakdown point of the L1-norm method is also derived, again due to the weights of observations; (3) by assuming a prior distribution for the signs of outliers, we have developed the concept of subjective breakdown point, which may be thought of as an extension of stochastic breakdown by Donoho and Huber but can be important in explaining real-life problems in Earth Sciences and image reconstruction; and finally, (4) we have shown that the least median of squares method can still break down with a single outlier, even if no highly concentrated good data nor highly concentrated outliers exist.

Journal ArticleDOI
TL;DR: Although Horn's algorithm undoubtedly is an improvement compared with older methods for outlier detection, reliable statistical identification of outliers in reference data remains a challenge.
Abstract: Background: Medical laboratory reference data may be contaminated with outliers that should be eliminated before estimation of the reference interval. A statistical test for outliers has been proposed by Paul S. Horn and coworkers (Clin Chem 2001;47:2137–45). The algorithm operates in 2 steps: (a) mathematically transform the original data to approximate a gaussian distribution; and (b) establish detection limits (Tukey fences) based on the central part of the transformed distribution. Methods: We studied the specificity of Horn’s test algorithm (probability of false detection of outliers), using Monte Carlo computer simulations performed on 13 types of probability distributions covering a wide range of positive and negative skewness. Distributions with 3% of the original observations replaced by random outliers were used to also examine the sensitivity of the test (probability of detection of true outliers). Three data transformations were used: the Box and Cox function (used in the original Horn’s test), the Manly exponential function, and the John and Draper modulus function. Results: For many of the probability distributions, the specificity of Horn’s algorithm was rather poor compared with the theoretical expectation. The cause for such poor performance was at least partially related to remaining nongaussian kurtosis (peakedness). The sensitivity showed great variation, dependent on both the type of underlying distribution and the location of the outliers (upper and/or lower tail). Conclusion: Although Horn’s algorithm undoubtedly is an improvement compared with older methods for outlier detection, reliable statistical identification of outliers in reference data remains a challenge.
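The two steps translate directly into code; a minimal sketch of the tested procedure (my illustration, not the paper's simulation code): Box-Cox transform the data toward normality, then apply Tukey fences on the transformed values.

```python
import numpy as np
from scipy import stats

def horn_outliers(x, fence=1.5):
    """Horn's two-step rule: (a) Box-Cox transform toward a Gaussian shape (data must
    be strictly positive); (b) flag values outside the Tukey fences
    [Q1 - fence*IQR, Q3 + fence*IQR] computed on the transformed data."""
    x = np.asarray(x, float)
    z, _lmbda = stats.boxcox(x)
    q1, q3 = np.percentile(z, [25, 75])
    iqr = q3 - q1
    return (z < q1 - fence * iqr) | (z > q3 + fence * iqr)

rng = np.random.default_rng(0)
sample = np.concatenate([rng.lognormal(mean=1.0, sigma=0.3, size=200), [40.0]])
print(np.where(horn_outliers(sample))[0])   # the appended extreme value is flagged
```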

Journal ArticleDOI
TL;DR: There was wide variability in the point estimates and confidence or probability intervals of risk-adjusted mortality depending on statistical model, but little variability relative to the choice of predictors, and the use of hierarchical models is recommended.