Journal ArticleDOI

A fast algorithm for the minimum covariance determinant estimator

01 Aug 1999-Technometrics (Taylor & Francis Group)-Vol. 41, Iss: 3, pp 212-223
TL;DR: For small datasets, FAST-MCD typically finds the exact MCD, whereas for larger datasets it gives more accurate results than existing algorithms and is faster by orders of magnitude.
Abstract: The minimum covariance determinant (MCD) method of Rousseeuw is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations (out of n) whose covariance matrix has the lowest determinant. Until now, applications of the MCD were hampered by the computation time of existing algorithms, which were limited to a few hundred objects in a few dimensions. We discuss two important applications of larger size, one about a production process at Philips with n = 677 objects and p = 9 variables, and a dataset from astronomy with n = 137,256 objects and p = 27 variables. To deal with such problems we have developed a new algorithm for the MCD, called FAST-MCD. The basic ideas are an inequality involving order statistics and determinants, and techniques which we call “selective iteration” and “nested extensions.” For small datasets, FAST-MCD typically finds the exact MCD, whereas for larger datasets it gives more accurate results than existing algorithms and is faster by orders of magnitude.
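As a minimal illustration of the MCD objective described above, the sketch below fits a robust location and scatter estimate with scikit-learn's MinCovDet, which implements a FAST-MCD-style algorithm; the synthetic data, the planted outliers, and the support_fraction value (playing the role of h/n) are assumptions for illustration, not taken from the article.

```python
# Sketch: robust location/scatter via the MCD, using scikit-learn's MinCovDet
# (a FAST-MCD-style implementation). Data and parameters are illustrative.
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))            # clean observations
X[:25] += 8.0                            # plant a cluster of outliers

# support_fraction plays the role of h/n: the fraction of observations whose
# covariance matrix should have minimal determinant.
mcd = MinCovDet(support_fraction=0.75, random_state=0).fit(X)

print(mcd.location_)        # robust location estimate
print(mcd.covariance_)      # robust scatter estimate
d2 = mcd.mahalanobis(X)     # squared robust distances, useful for flagging outliers
```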
Citations
Journal ArticleDOI
TL;DR: Comparisons of DNA methylation in eight diverse plant and animal genomes found that patterns of methylation are very similar in flowering plants with methylated cytosines detected in all sequence contexts, whereas CG methylation predominates in animals.
Abstract: Cytosine DNA methylation is a heritable epigenetic mark present in many eukaryotic organisms. Although DNA methylation likely has a conserved role in gene silencing, the levels and patterns of DNA methylation appear to vary drastically among different organisms. Here we used shotgun genomic bisulfite sequencing (BS-Seq) to compare DNA methylation in eight diverse plant and animal genomes. We found that patterns of methylation are very similar in flowering plants with methylated cytosines detected in all sequence contexts, whereas CG methylation predominates in animals. Vertebrates have methylation throughout the genome except for CpG islands. Gene body methylation is conserved with clear preference for exons in most organisms. Furthermore, genes appear to be the major target of methylation in Ciona and honey bee. Among the eight organisms, the green alga Chlamydomonas has the most unusual pattern of methylation, having non-CG methylation enriched in exons of genes rather than in repeats and transposons. In addition, the Dnmt1 cofactor Uhrf1 has a conserved function in maintaining CG methylation in both transposons and gene bodies in the mouse, Arabidopsis, and zebrafish genomes.

1,111 citations

Journal ArticleDOI
TL;DR: The ROBPCA approach, which combines projection pursuit ideas with robust scatter matrix estimation, yields more accurate estimates at noncontaminated datasets and more robust estimates at contaminated data.
Abstract: We introduce a new method for robust principal component analysis (PCA). Classical PCA is based on the empirical covariance matrix of the data and hence is highly sensitive to outlying observations. Two robust approaches have been developed to date. The first approach is based on the eigenvectors of a robust scatter matrix such as the minimum covariance determinant or an S-estimator and is limited to relatively low-dimensional data. The second approach is based on projection pursuit and can handle high-dimensional data. Here we propose the ROBPCA approach, which combines projection pursuit ideas with robust scatter matrix estimation. ROBPCA yields more accurate estimates at noncontaminated datasets and more robust estimates at contaminated data. ROBPCA can be computed rapidly, and is able to detect exact-fit situations. As a by-product, ROBPCA produces a diagnostic plot that displays and classifies the outliers. We apply the algorithm to several datasets from chemometrics and engineering.
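The first robust-PCA approach mentioned in the abstract (taking the eigenvectors of a robust scatter matrix such as the MCD) can be sketched in a few lines. This is only an illustration of that first approach, not of ROBPCA itself; the helper function robust_pca_via_mcd and the synthetic data are hypothetical.

```python
# Sketch: robust PCA via the eigenvectors of a robust scatter matrix (the MCD).
# Suitable only for relatively low-dimensional data, as noted in the abstract.
import numpy as np
from sklearn.covariance import MinCovDet

def robust_pca_via_mcd(X, n_components):
    mcd = MinCovDet(random_state=0).fit(X)
    eigvals, eigvecs = np.linalg.eigh(mcd.covariance_)
    order = np.argsort(eigvals)[::-1]              # largest robust variance first
    components = eigvecs[:, order[:n_components]]
    scores = (X - mcd.location_) @ components      # robustly centered scores
    return components, scores

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
X[:15, 0] += 10.0                                  # a few outlying points
components, scores = robust_pca_via_mcd(X, n_components=2)
```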

935 citations

Journal ArticleDOI
19 Apr 2016-PLOS ONE
TL;DR: This paper aims to be a new well-founded basis for unsupervised anomaly detection research by publishing the source code and the datasets, and reveals the strengths and weaknesses of the different approaches for the first time.
Abstract: Anomaly detection is the process of identifying unexpected items or events in datasets, which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied on unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection, fraud detection as well as in the life science and medical domain. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides the anomaly detection performance, the computational effort, the impact of parameter settings, and the global/local anomaly detection behavior are outlined. As a conclusion, we give advice on algorithm selection for typical real-world tasks.

737 citations

Proceedings ArticleDOI
24 Aug 2008
TL;DR: This paper proposes a novel approach named ABOD (Angle-Based Outlier Detection) and some variants assessing the variance in the angles between the difference vectors of a point to the other points and shows ABOD to perform especially well on high-dimensional data.
Abstract: Detecting outliers in a large set of data objects is a major data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in the full-dimensional Euclidean data space. In high-dimensional data, these approaches are bound to deteriorate due to the notorious "curse of dimensionality". In this paper, we propose a novel approach named ABOD (Angle-Based Outlier Detection) and some variants assessing the variance in the angles between the difference vectors of a point to the other points. This way, the effects of the "curse of dimensionality" are alleviated compared to purely distance-based approaches. A main advantage of our new approach is that our method does not rely on any parameter selection influencing the quality of the achieved ranking. In a thorough experimental evaluation, we compare ABOD to the well-established distance-based method LOF for various artificial and a real world data set and show ABOD to perform especially well on high-dimensional data.
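A naive sketch of the angle-based idea follows: for each point, look at the angles spanned by the difference vectors to all pairs of other points and score the point by the variance of those (distance-weighted) angle terms, with low variance indicating an outlier. The O(n^3) loop and the exact weighting (dividing the dot product by the squared norms) reflect one reading of the ABOD proposal and should be treated as assumptions, not the authors' reference implementation.

```python
# Sketch: naive angle-based outlier factor. Low score => likely outlier.
import numpy as np

def abof(X):
    n = X.shape[0]
    scores = np.empty(n)
    for a in range(n):
        diffs = np.delete(X, a, axis=0) - X[a]     # difference vectors from point a
        vals = []
        for i in range(len(diffs)):
            for j in range(i + 1, len(diffs)):
                b, c = diffs[i], diffs[j]
                # angle term weighted by the squared lengths (assumed weighting)
                vals.append((b @ c) / ((b @ b) * (c @ c)))
        scores[a] = np.var(vals)                   # low variance => outlier
    return scores

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(60, 3)), [[12.0, 12.0, 12.0]]])
print(np.argsort(abof(X))[:3])                     # indices of the most outlying points
```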

706 citations

Journal ArticleDOI
TL;DR: The proposed TMCD method allows for the accurate, robust, and efficient estimation of partial volume model parameters, which is crucial to a variety of brain MRI data analysis procedures such as the accurate estimation of tissue volumes and the accurate delineation of the cortical surface.

621 citations


Cites methods from "A fast algorithm for the minimum covariance determinant estimator"

  • ...A well-known approach by Geman and Geman (1984) to solve the optimization problem (7) globally could also be employed, but since this method is much more time consuming than ICM, we prefer to use the latter....


References
Book
01 Jan 1987
TL;DR: A monograph on robust regression and outlier detection, covering simple and multiple regression, the special case of one-dimensional location, algorithms, outlier diagnostics, and related statistical techniques.
Abstract: 1. Introduction. 2. Simple Regression. 3. Multiple Regression. 4. The Special Case of One-Dimensional Location. 5. Algorithms. 6. Outlier Diagnostics. 7. Related Statistical Techniques. References. Table of Data Sets. Index.

6,955 citations

Book
01 Jan 1986
TL;DR: A monograph on robust statistics covering one-dimensional estimators and tests, multidimensional estimators, estimation of covariance matrices and multivariate location, and robust estimation and testing in linear models.
Abstract: 1. Introduction and Motivation. 2. One-Dimensional Estimators. 3. One-Dimensional Tests. 4. Multidimensional Estimators. 5. Estimation of Covariance Matrices and Multivariate Location. 6. Linear Models: Robust Estimation. 7. Linear Models: Robust Testing. 8. Complements and Outlook. References. Index.

3,818 citations

Journal ArticleDOI
TL;DR: The sum of squared residuals is replaced by the median of the squared residuals, giving an estimator that can resist the effect of nearly 50% contamination in the data; in the special case of simple regression, it corresponds to finding the narrowest strip covering half of the observations.
Abstract: Classical least squares regression consists of minimizing the sum of the squared residuals. Many authors have produced more robust versions of this estimator by replacing the square by something else, such as the absolute value. In this article a different approach is introduced in which the sum is replaced by the median of the squared residuals. The resulting estimator can resist the effect of nearly 50% of contamination in the data. In the special case of simple regression, it corresponds to finding the narrowest strip covering half of the observations. Generalizations are possible to multivariate location, orthogonal regression, and hypothesis testing in linear models.
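The least-median-of-squares objective can be illustrated for simple regression with a brute-force search over candidate lines, as sketched below; the pair-of-points candidate set and the synthetic data are assumptions for illustration, not the estimator's recommended algorithm.

```python
# Sketch: least median of squares for simple regression, by minimizing the
# median of squared residuals over lines through pairs of sample points.
import itertools
import numpy as np

def lms_simple_regression(x, y):
    best = None
    for i, j in itertools.combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue                                   # skip vertical candidate lines
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        crit = np.median((y - (intercept + slope * x)) ** 2)
        if best is None or crit < best[0]:
            best = (crit, intercept, slope)
    return best[1], best[2]

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=40)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=40)
y[:8] += 15.0                                          # 20% vertical outliers
print(lms_simple_regression(x, y))                     # should roughly recover (2.0, 0.5)
```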

3,713 citations


"A fast algorithm for the minimum co..." refers methods in this paper

  • ...PERFORMANCE OF FAST-MCD: To get an idea of the performance of the overall algorithm, we start by applying FAST-MCD to some small datasets taken from Rousseeuw and Leroy (1987). To be precise, these were all regression datasets, but we ran FAST-MCD only on the explanatory variables; that is, not using the response variable....


  • ...Positive-breakdown methods such as the MVE and least trimmed squares regression (Rousseeuw 1984) are increasingly being used in practice, for example in finance, chemistry, electrical engineering, process control, and computer vision (Meer, Mintz, Rosenfeld, and Kim 1991)....


  • ...Moreover, S-PLUS automatically provides the diagnostic plot of Rousseeuw and van Zomeren (1990), which plots the robust residuals versus the robust distances....


Book
01 Jan 1980
TL;DR: A monograph that brings together the scattered results of outlier theory, concentrating on outlier tests known or expected to be optimal in some way, and adding a number of new results and conjectures.
Abstract: The problem of outliers is one of the oldest in statistics, and during the last century and a half interest in it has waxed and waned several times. Currently it is once again an active research area after some years of relative neglect, and recent work has solved a number of old problems in outlier theory, and identified new ones. The major results are, however, scattered amongst many journal articles, and for some time there has been a clear need to bring them together in one place. That was the original intention of this monograph: but during execution it became clear that the existing theory of outliers was deficient in several areas, and so the monograph also contains a number of new results and conjectures. In view of the enormous volume of literature on the outlier problem and its cousins, no attempt has been made to make the coverage exhaustive. The material is concerned almost entirely with the use of outlier tests that are known (or may reasonably be expected) to be optimal in some way. Such topics as robust estimation are largely ignored, being covered more adequately in other sources. The numerous ad hoc statistics proposed in the early work on the grounds of intuitive appeal or computational simplicity also are not discussed in any detail.

2,180 citations

Journal ArticleDOI
TL;DR: This work proposes to compute distances based on very robust estimates of location and covariance, better suited to expose the outliers in a multivariate point cloud, to avoid the masking effect.
Abstract: Detecting outliers in a multivariate point cloud is not trivial, especially when there are several outliers. The classical identification method does not always find them, because it is based on the sample mean and covariance matrix, which are themselves affected by the outliers. That is how the outliers get masked. To avoid the masking effect, we propose to compute distances based on very robust estimates of location and covariance. These robust distances are better suited to expose the outliers. In the case of regression data, the classical least squares approach masks outliers in a similar way. Also here, the outliers may be unmasked by using a highly robust regression method. Finally, a new display is proposed in which the robust regression residuals are plotted versus the robust distances. This plot classifies the data into regular observations, vertical outliers, good leverage points, and bad leverage points. Several examples are discussed.
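A minimal sketch of this idea: compute robust distances from MCD estimates of location and scatter and flag observations whose distance exceeds a chi-square cutoff. The 0.975 quantile cutoff and the use of scikit-learn's MinCovDet are assumptions made for illustration, not details taken from the cited paper.

```python
# Sketch: robust Mahalanobis-type distances from MCD estimates, with a
# chi-square cutoff (assumed 0.975 quantile) for flagging outliers.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(4)
p = 4
X = rng.normal(size=(400, p))
X[:20] += 6.0                                       # outliers that can mask each other classically

mcd = MinCovDet(random_state=0).fit(X)
robust_dist = np.sqrt(mcd.mahalanobis(X))           # robust distances
cutoff = np.sqrt(chi2.ppf(0.975, df=p))             # conventional cutoff (assumption)
outliers = np.where(robust_dist > cutoff)[0]
print(len(outliers), "flagged observations")
```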

1,419 citations