Book

Algorithms for clustering data

01 Jan 1988
About: This book was published on 1988-01-01 and is currently open access. It has received 8,586 citations to date. It focuses on the topics: Cluster analysis & Correlation clustering.
Citations
Journal ArticleDOI
TL;DR: An efficient segmentation algorithm is developed based on a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image and it is shown that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties.
Abstract: This paper addresses the problem of segmenting an image into regions. We define a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image. We then develop an efficient segmentation algorithm based on this predicate, and show that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties. We apply the algorithm to image segmentation using two different kinds of local neighborhoods in constructing the graph, and illustrate the results with both real and synthetic images. The algorithm runs in time nearly linear in the number of graph edges and is also fast in practice. An important characteristic of the method is its ability to preserve detail in low-variability image regions while ignoring detail in high-variability regions.

5,791 citations
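The merging rule this abstract describes can be made concrete with a short sketch: sort edges by weight and greedily union components whenever the joining edge is no heavier than either component's internal variation plus a size-dependent tolerance. This is a minimal illustration of the graph-merging idea, assuming the common threshold function k/|C|; all names and the parameter k are illustrative, not the authors' code.

```python
# Minimal sketch of the greedy graph-merging idea (Felzenszwalb-Huttenlocher
# style). The threshold tau(C) = k / |C| and every name here are illustrative
# assumptions, not the authors' code.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n  # largest edge weight inside each component

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, ra, rb, w):
        self.parent[ra] = rb
        self.size[rb] += self.size[ra]
        self.internal[rb] = max(self.internal[ra], self.internal[rb], w)

def segment(n_vertices, edges, k=300.0):
    """edges: iterable of (weight, u, v); returns the final union-find."""
    uf = UnionFind(n_vertices)
    for w, u, v in sorted(edges):  # nondecreasing weight order
        ru, rv = uf.find(u), uf.find(v)
        if ru == rv:
            continue
        # Greedy merge test: the joining edge must be no heavier than the
        # internal variation of either component plus its tolerance k/|C|.
        if w <= min(uf.internal[ru] + k / uf.size[ru],
                    uf.internal[rv] + k / uf.size[rv]):
            uf.union(ru, rv, w)
    return uf
```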

Journal ArticleDOI
TL;DR: Clustering algorithms for data sets appearing in statistics, computer science, and machine learning are surveyed, and their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts are illustrated.
Abstract: Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, a primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. This diversity, on one hand, equips us with many tools; on the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications on some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measures and cluster validation, are also discussed.

5,744 citations
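As a small illustration of the "proximity measure" topic the survey touches on, two measures that such surveys commonly compare are sketched below; the snippet is illustrative and not drawn from the paper itself.

```python
import numpy as np

def euclidean(x, y):
    # Straight-line distance; smaller means more similar.
    return float(np.linalg.norm(x - y))

def cosine_similarity(x, y):
    # Angle-based similarity; scale-invariant, larger means more similar.
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x, y = np.array([1.0, 0.0, 2.0]), np.array([2.0, 1.0, 0.0])
print(euclidean(x, y), cosine_similarity(x, y))
```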

Journal ArticleDOI
TL;DR: This work presents a simple and efficient implementation of Lloyd's k-means clustering algorithm, which it calls the filtering algorithm, and establishes the practical efficiency of the algorithm's running time.
Abstract: In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.

5,288 citations
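For reference, a plain version of Lloyd's two alternating steps, the assignment and update steps that the paper's filtering algorithm accelerates with a kd-tree, is sketched below. The kd-tree filtering itself is omitted, and all function and parameter names are illustrative assumptions.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iters=100, seed=0):
    """Plain Lloyd's iteration; the paper's kd-tree filtering is omitted."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point to its nearest center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center to the mean of its assigned points;
        # a center that lost all its points is kept where it is.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any()
            else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```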

Journal ArticleDOI
TL;DR: This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.
Abstract: This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. With each algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and review current and further research on the algorithm. These 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development.

4,944 citations

Journal ArticleDOI
TL;DR: An overview of this emerging field is provided, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases.
Abstract: ■ Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

4,782 citations

References
Book
01 Feb 1975

6,068 citations

Book
01 Dec 1973

5,169 citations

Journal ArticleDOI
TL;DR: In this paper, the basic problem of interconnecting a given set of terminals with a shortest possible network of direct links is considered, and simple and practical procedures are given for solving this problem both graphically and computationally.
Abstract: The basic problem considered is that of interconnecting a given set of terminals with a shortest possible network of direct links. Simple and practical procedures are given for solving this problem both graphically and computationally. It develops that these procedures also provide solutions for a much broader class of problems, containing other examples of practical interest.

4,395 citations
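The procedure now known as Prim's algorithm, one simple way to grow such a shortest connection network from an arbitrary terminal by repeatedly adding the cheapest link to an unconnected terminal, can be sketched briefly; the adjacency-list format and names below are assumptions for illustration.

```python
import heapq

def prim_mst(adj):
    """adj: {node: [(weight, neighbor), ...]}; returns MST edges (u, v, w)."""
    start = next(iter(adj))
    visited = {start}
    frontier = [(w, start, v) for w, v in adj[start]]
    heapq.heapify(frontier)
    tree = []
    while frontier and len(visited) < len(adj):
        w, u, v = heapq.heappop(frontier)  # cheapest link leaving the tree
        if v in visited:
            continue
        visited.add(v)
        tree.append((u, v, w))
        for w2, v2 in adj[v]:
            if v2 not in visited:
                heapq.heappush(frontier, (w2, v, v2))
    return tree

# Example: four terminals with direct-link costs.
graph = {
    "a": [(1, "b"), (4, "c")],
    "b": [(1, "a"), (2, "c"), (6, "d")],
    "c": [(4, "a"), (2, "b"), (3, "d")],
    "d": [(6, "b"), (3, "c")],
}
print(prim_mst(graph))  # [('a', 'b', 1), ('b', 'c', 2), ('c', 'd', 3)]
```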

Journal ArticleDOI
TL;DR: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters to provide a variety of clustering solutions.
Abstract: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level. Thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. Thus, the latter group of rules would appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data dependent.

3,551 citations
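One representative stopping rule of the kind such studies evaluate is the Calinski-Harabasz index, the ratio of between-cluster to within-cluster dispersion; choosing the number of clusters that maximizes it is a common heuristic. The abstract does not list the 30 procedures tested, so the sketch below is purely illustrative.

```python
import numpy as np

def calinski_harabasz(points, labels):
    """Between- vs. within-cluster dispersion ratio; larger is better."""
    labels = np.asarray(labels)
    n, k = len(points), len(set(labels))
    overall_mean = points.mean(axis=0)
    between = within = 0.0
    for j in set(labels):
        cluster = points[labels == j]
        center = cluster.mean(axis=0)
        # Between-cluster term, weighted by cluster size.
        between += len(cluster) * ((center - overall_mean) ** 2).sum()
        # Within-cluster scatter around the cluster's own center.
        within += ((cluster - center) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))
```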

Journal ArticleDOI
TL;DR: Sibson gives an O(n²) algorithm for single-linkage clustering, and proves that this algorithm achieves the theoretically optimal lower time bound for obtaining a single-linkage dendrogram.
Abstract: Main point: Sibson gives an O(n²) algorithm for single-linkage clustering, and proves that this algorithm achieves the theoretically optimal lower time bound for obtaining a single-linkage dendrogram. This improves upon the naive O(n³) implementation of single-linkage clustering. A single-linkage dendrogram is a tree, where each level of the tree corresponds to a different threshold dissimilarity measure h. The nodes of a dataset are grouped into "equivalence classes" c(h) at each level of the dendrogram, where two classes C_i and C_j are merged if there is a pair of "OTUs" (vertices) v_i ∈ C_i and v_j ∈ C_j such that the dissimilarity between v_i and v_j is less than h, i.e., D(v_i, v_j) < h. For example, consider a set of 10 vertices v_1, ..., v_10 with a dissimilarity matrix D, where D_ij equals the dissimilarity between v_i and v_j. Suppose we take four cutoff dissimilarity measures h_1, h_2, h_3, h_4 and produce the dendrogram according to these thresholds. An example illustrating how the 10 vertices are grouped into equivalence classes at each level is shown in Figure 1. Since no dissimilarity is at or below 1, each vertex or "OTU" is its own equivalence class at the level corresponding to h_1 = 1. At the next level, however, some classes have been merged because several dissimilarity measures are below h_2 = 2. We see that c(h_2) consists of 6 equivalence classes, c(h_3) has 3 equivalence classes, and c(h_4 = 4) aggregates all the vertices into one equivalence class. In single-linkage clustering, the number of levels in the tree is determined by the nearest-neighbor criterion: at each level, at least one new merge is made between two clusters, and the merge is made for clusters C_i and C_j if the minimal distance between vertices v_i ∈ C_i and v_j ∈ C_j is the smallest such distance across all the clusters. In other words, the nearest neighbors between clusters C_i and C_j are found, and if these neighbors are closer than all other nearest-neighbor pairs, then C_i and C_j are merged.

1,208 citations
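For contrast with Sibson's O(n²) SLINK, the naive agglomeration it improves upon can be sketched directly from the nearest-neighbor criterion above: repeatedly merge the two clusters whose closest pair of members is nearest, recording the merge height. The implementation below is an illustrative assumption, not Sibson's algorithm.

```python
import numpy as np

def single_linkage(D):
    """D: symmetric (n, n) dissimilarity matrix; returns merges (A, B, h)."""
    n = len(D)
    clusters = [{i} for i in range(n)]
    merges = []
    while len(clusters) > 1:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-linkage distance: minimum over cross-cluster pairs.
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], h))  # merge height h
        clusters[a] = clusters[a] | clusters[b]
        del clusters[b]
    return merges
```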