Journal ArticleDOI

A fast algorithm for the minimum covariance determinant estimator

01 Aug 1999-Technometrics (Taylor & Francis Group)-Vol. 41, Iss: 3, pp 212-223
TL;DR: For small datasets, FAST-MCD typically finds the exact MCD, whereas for larger datasets it gives more accurate results than existing algorithms and is faster by orders of magnitude.
Abstract: The minimum covariance determinant (MCD) method of Rousseeuw is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations (out of n) whose covariance matrix has the lowest determinant. Until now, applications of the MCD were hampered by the computation time of existing algorithms, which were limited to a few hundred objects in a few dimensions. We discuss two important applications of larger size, one about a production process at Philips with n = 677 objects and p = 9 variables, and a dataset from astronomy with n = 137,256 objects and p = 27 variables. To deal with such problems we have developed a new algorithm for the MCD, called FAST-MCD. The basic ideas are an inequality involving order statistics and determinants, and techniques which we call “selective iteration” and “nested extensions.” For small datasets, FAST-MCD typically finds the exact MCD, whereas for larger datasets it gives more accurate results than existing algorithms and is faster by orders of magnitude.
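As a minimal illustration of the MCD objective described above, the sketch below fits a robust location and scatter estimate with scikit-learn's MinCovDet, which implements a FAST-MCD-style algorithm; the synthetic data, the planted outliers, and the support_fraction value (playing the role of h/n) are assumptions for illustration, not taken from the article.

```python
# Sketch: robust location/scatter via the MCD, using scikit-learn's MinCovDet
# (a FAST-MCD-style implementation). Data and parameters are illustrative.
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))            # clean observations
X[:25] += 8.0                            # plant a cluster of outliers

# support_fraction plays the role of h/n: the fraction of observations whose
# covariance matrix should have minimal determinant.
mcd = MinCovDet(support_fraction=0.75, random_state=0).fit(X)

print(mcd.location_)        # robust location estimate
print(mcd.covariance_)      # robust scatter estimate
d2 = mcd.mahalanobis(X)     # squared robust distances, useful for flagging outliers
```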
Citations
Journal ArticleDOI
TL;DR: Comparisons of DNA methylation in eight diverse plant and animal genomes found that patterns of methylation are very similar in flowering plants with methylated cytosines detected in all sequence contexts, whereas CG methylation predominates in animals.
Abstract: Cytosine DNA methylation is a heritable epigenetic mark present in many eukaryotic organisms. Although DNA methylation likely has a conserved role in gene silencing, the levels and patterns of DNA methylation appear to vary drastically among different organisms. Here we used shotgun genomic bisulfite sequencing (BS-Seq) to compare DNA methylation in eight diverse plant and animal genomes. We found that patterns of methylation are very similar in flowering plants with methylated cytosines detected in all sequence contexts, whereas CG methylation predominates in animals. Vertebrates have methylation throughout the genome except for CpG islands. Gene body methylation is conserved with clear preference for exons in most organisms. Furthermore, genes appear to be the major target of methylation in Ciona and honey bee. Among the eight organisms, the green alga Chlamydomonas has the most unusual pattern of methylation, having non-CG methylation enriched in exons of genes rather than in repeats and transposons. In addition, the Dnmt1 cofactor Uhrf1 has a conserved function in maintaining CG methylation in both transposons and gene bodies in the mouse, Arabidopsis, and zebrafish genomes.

1,111 citations

Journal ArticleDOI
TL;DR: The ROBPCA approach, which combines projection pursuit ideas with robust scatter matrix estimation, yields more accurate estimates at noncontaminated datasets and more robust estimates at contaminated data.
Abstract: We introduce a new method for robust principal component analysis (PCA). Classical PCA is based on the empirical covariance matrix of the data and hence is highly sensitive to outlying observations. Two robust approaches have been developed to date. The first approach is based on the eigenvectors of a robust scatter matrix such as the minimum covariance determinant or an S-estimator and is limited to relatively low-dimensional data. The second approach is based on projection pursuit and can handle high-dimensional data. Here we propose the ROBPCA approach, which combines projection pursuit ideas with robust scatter matrix estimation. ROBPCA yields more accurate estimates at noncontaminated datasets and more robust estimates at contaminated data. ROBPCA can be computed rapidly, and is able to detect exact-fit situations. As a by-product, ROBPCA produces a diagnostic plot that displays and classifies the outliers. We apply the algorithm to several datasets from chemometrics and engineering.
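The first robust-PCA approach mentioned in the abstract (taking the eigenvectors of a robust scatter matrix such as the MCD) can be sketched in a few lines. This is only an illustration of that first approach, not of ROBPCA itself; the helper function robust_pca_via_mcd and the synthetic data are hypothetical.

```python
# Sketch: robust PCA via the eigenvectors of a robust scatter matrix (the MCD).
# Suitable only for relatively low-dimensional data, as noted in the abstract.
import numpy as np
from sklearn.covariance import MinCovDet

def robust_pca_via_mcd(X, n_components):
    mcd = MinCovDet(random_state=0).fit(X)
    eigvals, eigvecs = np.linalg.eigh(mcd.covariance_)
    order = np.argsort(eigvals)[::-1]              # largest robust variance first
    components = eigvecs[:, order[:n_components]]
    scores = (X - mcd.location_) @ components      # robustly centered scores
    return components, scores

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
X[:15, 0] += 10.0                                  # a few outlying points
components, scores = robust_pca_via_mcd(X, n_components=2)
```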

935 citations

Journal ArticleDOI
19 Apr 2016-PLOS ONE
TL;DR: This paper aims to be a new well-founded basis for unsupervised anomaly detection research by publishing the source code and the datasets, and reveals the strengths and weaknesses of the different approaches for the first time.
Abstract: Anomaly detection is the process of identifying unexpected items or events in datasets, which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied on unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection, fraud detection as well as in the life science and medical domain. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides the anomaly detection performance, the computational effort, the impact of parameter settings, and the global/local anomaly detection behavior are outlined. As a conclusion, we give advice on algorithm selection for typical real-world tasks.

737 citations

Proceedings ArticleDOI
24 Aug 2008
TL;DR: This paper proposes a novel approach named ABOD (Angle-Based Outlier Detection) and some variants assessing the variance in the angles between the difference vectors of a point to the other points and shows ABOD to perform especially well on high-dimensional data.
Abstract: Detecting outliers in a large set of data objects is a major data mining task aiming at finding different mechanisms responsible for different groups of objects in a data set. All existing approaches, however, are based on an assessment of distances (sometimes indirectly by assuming certain distributions) in the full-dimensional Euclidean data space. In high-dimensional data, these approaches are bound to deteriorate due to the notorious "curse of dimensionality". In this paper, we propose a novel approach named ABOD (Angle-Based Outlier Detection) and some variants assessing the variance in the angles between the difference vectors of a point to the other points. This way, the effects of the "curse of dimensionality" are alleviated compared to purely distance-based approaches. A main advantage of our new approach is that our method does not rely on any parameter selection influencing the quality of the achieved ranking. In a thorough experimental evaluation, we compare ABOD to the well-established distance-based method LOF for various artificial and a real world data set and show ABOD to perform especially well on high-dimensional data.
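A naive sketch of the angle-based idea follows: for each point, look at the angles spanned by the difference vectors to all pairs of other points and score the point by the variance of those (distance-weighted) angle terms, with low variance indicating an outlier. The O(n^3) loop and the exact weighting (dividing the dot product by the squared norms) reflect one reading of the ABOD proposal and should be treated as assumptions, not the authors' reference implementation.

```python
# Sketch: naive angle-based outlier factor. Low score => likely outlier.
import numpy as np

def abof(X):
    n = X.shape[0]
    scores = np.empty(n)
    for a in range(n):
        diffs = np.delete(X, a, axis=0) - X[a]     # difference vectors from point a
        vals = []
        for i in range(len(diffs)):
            for j in range(i + 1, len(diffs)):
                b, c = diffs[i], diffs[j]
                # angle term weighted by the squared lengths (assumed weighting)
                vals.append((b @ c) / ((b @ b) * (c @ c)))
        scores[a] = np.var(vals)                   # low variance => outlier
    return scores

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(60, 3)), [[12.0, 12.0, 12.0]]])
print(np.argsort(abof(X))[:3])                     # indices of the most outlying points
```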

706 citations

Journal ArticleDOI
TL;DR: The proposed TMCD method allows for the accurate, robust, and efficient estimation of partial volume model parameters, which is crucial to a variety of brain MRI data analysis procedures such as the accurate estimation of tissue volumes and the accurate delineation of the cortical surface.

621 citations


Cites methods from "A fast algorithm for the minimum covariance determinant estimator"

  • ...A well-known approach by Geman and Geman (1984) to solve the optimization problem (7) globally could also be employed, but since this method is much more time consuming than ICM, we prefer to use the latter....


References
Book
01 Jan 1987
TL;DR: A monograph on robust regression and outlier detection, covering simple and multiple regression, the special case of one-dimensional location, algorithms, outlier diagnostics, and related statistical techniques.
Abstract: 1. Introduction. 2. Simple Regression. 3. Multiple Regression. 4. The Special Case of One-Dimensional Location. 5. Algorithms. 6. Outlier Diagnostics. 7. Related Statistical Techniques. References. Table of Data Sets. Index.

6,955 citations

Book
01 Jan 1986
TL;DR: A monograph on robust statistics covering one-dimensional estimators and tests, multidimensional estimators, estimation of covariance matrices and multivariate location, and robust estimation and testing in linear models.
Abstract: 1. Introduction and Motivation. 2. One-Dimensional Estimators. 3. One-Dimensional Tests. 4. Multidimensional Estimators. 5. Estimation of Covariance Matrices and Multivariate Location. 6. Linear Models: Robust Estimation. 7. Linear Models: Robust Testing. 8. Complements and Outlook. References. Index.

3,818 citations

Journal ArticleDOI
TL;DR: The sum of squared residuals is replaced by the median of the squared residuals, giving an estimator that can resist the effect of nearly 50% contamination in the data; in the special case of simple regression, it corresponds to finding the narrowest strip covering half of the observations.
Abstract: Classical least squares regression consists of minimizing the sum of the squared residuals. Many authors have produced more robust versions of this estimator by replacing the square by something else, such as the absolute value. In this article a different approach is introduced in which the sum is replaced by the median of the squared residuals. The resulting estimator can resist the effect of nearly 50% of contamination in the data. In the special case of simple regression, it corresponds to finding the narrowest strip covering half of the observations. Generalizations are possible to multivariate location, orthogonal regression, and hypothesis testing in linear models.
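The least-median-of-squares objective can be illustrated for simple regression with a brute-force search over candidate lines, as sketched below; the pair-of-points candidate set and the synthetic data are assumptions for illustration, not the estimator's recommended algorithm.

```python
# Sketch: least median of squares for simple regression, by minimizing the
# median of squared residuals over lines through pairs of sample points.
import itertools
import numpy as np

def lms_simple_regression(x, y):
    best = None
    for i, j in itertools.combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue                                   # skip vertical candidate lines
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        crit = np.median((y - (intercept + slope * x)) ** 2)
        if best is None or crit < best[0]:
            best = (crit, intercept, slope)
    return best[1], best[2]

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=40)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=40)
y[:8] += 15.0                                          # 20% vertical outliers
print(lms_simple_regression(x, y))                     # should roughly recover (2.0, 0.5)
```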

3,713 citations


"A fast algorithm for the minimum co..." refers methods in this paper

  • ...PERFORMANCE OF FAST-MCD: To get an idea of the performance of the overall algorithm, we start by applying FAST-MCD to some small datasets taken from Rousseeuw and Leroy (1987). To be precise, these were all regression datasets, but we ran FAST-MCD only on the explanatory variables; that is, not using the response variable....


  • ...Positive-breakdown methods such as the MVE and least trimmed squares regression (Rousseeuw 1984) are increasingly being used in practice, for example in finance, chemistry, electrical engineering, process control, and computer vision (Meer, Mintz, Rosenfeld, and Kim 1991)....


  • ...Moreover, S-PLUS automatically provides the diagnostic plot of Rousseeuw and van Zomeren (1990), which plots the robust residuals versus the robust distances....


Book
01 Jan 1980
TL;DR: A monograph that brings together the scattered results of outlier theory, concentrating on outlier tests known or expected to be optimal in some way, and adding a number of new results and conjectures.
Abstract: The problem of outliers is one of the oldest in statistics, and during the last century and a half interest in it has waxed and waned several times. Currently it is once again an active research area after some years of relative neglect, and recent work has solved a number of old problems in outlier theory, and identified new ones. The major results are, however, scattered amongst many journal articles, and for some time there has been a clear need to bring them together in one place. That was the original intention of this monograph: but during execution it became clear that the existing theory of outliers was deficient in several areas, and so the monograph also contains a number of new results and conjectures. In view of the enormous volume of literature on the outlier problem and its cousins, no attempt has been made to make the coverage exhaustive. The material is concerned almost entirely with the use of outlier tests that are known (or may reasonably be expected) to be optimal in some way. Such topics as robust estimation are largely ignored, being covered more adequately in other sources. The numerous ad hoc statistics proposed in the early work on the grounds of intuitive appeal or computational simplicity also are not discussed in any detail.

2,180 citations

Journal ArticleDOI
TL;DR: This work proposes to compute distances based on very robust estimates of location and covariance, better suited to expose the outliers in a multivariate point cloud, to avoid the masking effect.
Abstract: Detecting outliers in a multivariate point cloud is not trivial, especially when there are several outliers. The classical identification method does not always find them, because it is based on the sample mean and covariance matrix, which are themselves affected by the outliers. That is how the outliers get masked. To avoid the masking effect, we propose to compute distances based on very robust estimates of location and covariance. These robust distances are better suited to expose the outliers. In the case of regression data, the classical least squares approach masks outliers in a similar way. Also here, the outliers may be unmasked by using a highly robust regression method. Finally, a new display is proposed in which the robust regression residuals are plotted versus the robust distances. This plot classifies the data into regular observations, vertical outliers, good leverage points, and bad leverage points. Several examples are discussed.
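A minimal sketch of this idea: compute robust distances from MCD estimates of location and scatter and flag observations whose distance exceeds a chi-square cutoff. The 0.975 quantile cutoff and the use of scikit-learn's MinCovDet are assumptions made for illustration, not details taken from the cited paper.

```python
# Sketch: robust Mahalanobis-type distances from MCD estimates, with a
# chi-square cutoff (assumed 0.975 quantile) for flagging outliers.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(4)
p = 4
X = rng.normal(size=(400, p))
X[:20] += 6.0                                       # outliers that can mask each other classically

mcd = MinCovDet(random_state=0).fit(X)
robust_dist = np.sqrt(mcd.mahalanobis(X))           # robust distances
cutoff = np.sqrt(chi2.ppf(0.975, df=p))             # conventional cutoff (assumption)
outliers = np.where(robust_dist > cutoff)[0]
print(len(outliers), "flagged observations")
```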

1,419 citations