Proceedings Article

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

TL;DR: DBSCAN, a new clustering algorithm relying on a density-based notion of clusters and designed to discover clusters of arbitrary shape, is presented; it requires only one input parameter and supports the user in determining an appropriate value for it.
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
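The density-based notion of clusters in the abstract can be sketched in a few lines. The following is a minimal brute-force illustration (quadratic in n; the paper instead uses a spatial index for neighborhood queries), not the authors' implementation; `eps` and `min_pts` are the conventional names for the Eps-neighborhood radius and the density threshold.

```python
from collections import deque

def region_query(points, i, eps):
    # Indices of all points within distance eps of points[i] (brute force, O(n)).
    p = points[i]
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)   # None = unvisited, -1 = noise, k = cluster id
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1          # provisional noise; may later join as a border point
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:     # noise reachable from a core point: border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(points, eps=2.0, min_pts=3)
# → [0, 0, 0, 0, 1, 1, 1, 1, -1]  (two clusters, one noise point)
```

Note how the cluster grows only through core points, so arbitrarily shaped clusters emerge from chains of dense neighborhoods, which is the property the abstract contrasts with convex-cluster partitioning methods.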


Citations
Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research.
  • Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
  • Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
  • Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface; the algorithms in the toolkit cover data pre-processing, classification, regression, clustering, association rules, and visualization

20,196 citations

Journal ArticleDOI
TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.
Abstract: Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.

9,627 citations

Journal ArticleDOI
01 Jun 2010
TL;DR: A brief overview of clustering is provided, well known clustering methods are summarized, the major challenges and key issues in designing clustering algorithms are discussed, and some of the emerging and useful research directions are pointed out.
Abstract: Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty in designing a general purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.

6,601 citations

Journal ArticleDOI
TL;DR: Clustering algorithms for data sets appearing in statistics, computer science, and machine learning are surveyed, and their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts are illustrated.
Abstract: Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, such as proximity measures and cluster validation, are also discussed.

5,744 citations


Cites background from "A density-based algorithm for disco..."

  • ...3) Many novel algorithms have been developed to cluster large-scale data sets, especially in the context of data mining [44], [45], [85], [135], [213], [248]....


  • ...DBSCAN requires that the density in a neighborhood for an object should be high enough if it belongs to a cluster....


  • ...Clustering Algorithms:
      A. Distance and Similarity Measures (see also Table I)
      B. Hierarchical
         - Agglomerative: single linkage, complete linkage, group average linkage, median linkage, centroid linkage, Ward's method, balanced iterative reducing and clustering using hierarchies (BIRCH), clustering using representatives (CURE), robust clustering using links (ROCK)
         - Divisive: divisive analysis (DIANA), monothetic analysis (MONA)
      C. Squared Error-Based (Vector Quantization): K-means, iterative self-organizing data analysis technique (ISODATA), genetic K-means algorithm (GKA), partitioning around medoids (PAM)
      D. pdf Estimation via Mixture Densities: Gaussian mixture density decomposition (GMDD), AutoClass
      E. Graph Theory-Based: Chameleon, Delaunay triangulation graph (DTG), highly connected subgraphs (HCS), clustering identification via connectivity kernels (CLICK), cluster affinity search technique (CAST)
      F. Combinatorial Search Techniques-Based: genetically guided algorithm (GGA), TS clustering, SA clustering
      G. Fuzzy: fuzzy c-means (FCM), mountain method (MM), possibilistic c-means clustering algorithm (PCM), fuzzy c-shells (FCS)
      H. Neural Networks-Based: learning vector quantization (LVQ), self-organizing feature map (SOFM), ART, simplified ART (SART), hyperellipsoidal clustering network (HEC), self-splitting competitive learning network (SPLL)
      I. Kernel-Based: kernel K-means, support vector clustering (SVC)
      J. Sequential Data: sequence similarity, indirect sequence clustering, statistical sequence clustering
      K. Large-Scale Data Sets (see also Table II): CLARA, CURE, CLARANS, BIRCH, DBSCAN, DENCLUE, WaveCluster, FC, ART
      L. Data Visualization and High-Dimensional Data: PCA, ICA, projection pursuit, Isomap, LLE, CLIQUE, OptiGrid, ORCLUS
      M....


  • ...BIRCH was generalized into a broader framework in [101], with two algorithm realizations named BUBBLE and BUBBLE-FM. d) Density-based approach, e.g., density-based spatial clustering of applications with noise (DBSCAN) [85] and density-based clustering (DENCLUE) [135]....


  • ...DBSCAN uses an R*-tree structure for more efficient queries....

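To illustrate why a spatial index matters for these region queries, here is a toy grid-bucket index. It is a hypothetical stand-in for the R*-tree, not the R*-tree itself, but it shows the same effect: a neighborhood query of radius eps inspects only the 3x3 block of cells around a point instead of all n points.

```python
from collections import defaultdict

class GridIndex:
    """Toy 2-D spatial index: bucket points into square cells of side eps,
    so a region query only examines the 3x3 neighborhood of cells."""
    def __init__(self, points, eps):
        self.eps = eps
        self.points = points
        self.cells = defaultdict(list)
        for i, (x, y) in enumerate(points):
            self.cells[(int(x // eps), int(y // eps))].append(i)

    def region_query(self, i):
        # Indices of all points within eps of points[i].
        x, y = self.points[i]
        cx, cy = int(x // self.eps), int(y // self.eps)
        out = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in self.cells.get((cx + dx, cy + dy), []):
                    px, py = self.points[j]
                    if (px - x) ** 2 + (py - y) ** 2 <= self.eps ** 2:
                        out.append(j)
        return out

points = [(0.0, 0.0), (0.5, 0.5), (3.0, 3.0), (0.4, 0.1)]
idx = GridIndex(points, eps=1.0)
idx.region_query(0)   # indices of points within eps of points[0]
```

With roughly uniform data, each query touches O(1) cells, which is the spirit of the paper's O(log n) R*-tree region queries; a real R*-tree additionally handles rectangles, balancing, and disk pages.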

Journal ArticleDOI
21 May 2015-Cell
TL;DR: Drop-seq will accelerate biological discovery by enabling routine transcriptional profiling at single-cell resolution: cells are separated into nanoliter-sized aqueous droplets, a different barcode is associated with each cell's RNAs, and all of them are sequenced together.

5,506 citations


Cites background from "A density-based algorithm for disco..."

  • ...Point clouds on the t-SNE map represent candidate cell types; density clustering (Ester et al., 1996) identified these regions....


References
Book
01 Jan 1990
TL;DR: A monograph on cluster analysis introducing partitioning around medoids (PAM), clustering large applications (CLARA), fuzzy analysis, agglomerative nesting (AGNES), divisive analysis (DIANA), and monothetic analysis (MONA).
Abstract: 1. Introduction. 2. Partitioning Around Medoids (Program PAM). 3. Clustering large Applications (Program CLARA). 4. Fuzzy Analysis. 5. Agglomerative Nesting (Program AGNES). 6. Divisive Analysis (Program DIANA). 7. Monothetic Analysis (Program MONA). Appendix 1. Implementation and Structure of the Programs. Appendix 2. Running the Programs. Appendix 3. Adapting the Programs to Your Needs. Appendix 4. The Program CLUSPLOT. References. Author Index. Subject Index.

10,537 citations


Additional excerpts

  • ...For each of the discovered clusterings the silhouette coefficient (Kaufman & Rousseeuw 1990) is calculated, and finally, the clustering with the maximum silhouette coefficient is chosen as the “natural” clustering....


  • ...Clustering Algorithms There are two basic types of clustering algorithms (Kaufman & Rousseeuw 1990): partitioning and hierarchical algorithms....


  • ...Kaufman L., and Rousseeuw P.J. 1990....

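The silhouette-based selection of a "natural" clustering mentioned in the excerpts above can be made concrete. This is a small self-contained sketch of the silhouette coefficient of Kaufman & Rousseeuw, the mean over all points of (b - a) / max(a, b), assuming at least two clusters; the function name is illustrative.

```python
def silhouette_coefficient(points, labels):
    """Mean silhouette over all points.
    For point i: a = mean distance to its own cluster,
                 b = smallest mean distance to any other cluster,
                 s(i) = (b - a) / max(a, b); singletons get s(i) = 0.
    Assumes labels contain at least two distinct clusters."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = {}
    for idx, l in enumerate(labels):
        clusters.setdefault(l, []).append(idx)

    scores = []
    for idx, l in enumerate(labels):
        own = [j for j in clusters[l] if j != idx]
        if not own:                      # singleton cluster
            scores.append(0.0)
            continue
        a = sum(dist(points[idx], points[j]) for j in own) / len(own)
        b = min(sum(dist(points[idx], points[j]) for j in js) / len(js)
                for l2, js in clusters.items() if l2 != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_coefficient(points, [0, 0, 1, 1])   # near 1: well separated
bad = silhouette_coefficient(points, [0, 1, 0, 1])    # negative: clusters mixed up
```

Running the candidate clusterings through this score and keeping the maximum is exactly the model-selection recipe the excerpt describes.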

Book
01 Jan 1988

8,586 citations

Proceedings ArticleDOI
01 May 1990
TL;DR: The R*-tree is designed which incorporates a combined optimization of area, margin and overlap of each enclosing rectangle in the directory which clearly outperforms the existing R-tree variants.
Abstract: The R-tree, one of the most popular access methods for rectangles, is based on the heuristic optimization of the area of the enclosing rectangle in each inner node. By running numerous experiments in a standardized testbed under highly varying data, queries and operations, we were able to design the R*-tree, which incorporates a combined optimization of area, margin and overlap of each enclosing rectangle in the directory. Using our standardized testbed in an exhaustive performance comparison, it turned out that the R*-tree clearly outperforms the existing R-tree variants: Guttman's linear and quadratic R-tree and Greene's variant of the R-tree. This superiority of the R*-tree holds for different types of queries and operations, such as map overlay, for both rectangles and multidimensional points in all experiments. From a practical point of view the R*-tree is very attractive for two reasons: (1) it efficiently supports point and spatial data at the same time, and (2) its implementation cost is only slightly higher than that of other R-trees.

4,686 citations


"A density-based algorithm for disco..." refers background or methods in this paper

  • ...Brinkhoff T., Kriegel H.-P., Schneider R., and Seeger B. 1994 Efficient Multi-Step Processing of Spatial Joins, Proc....


  • ...clusters found by a partitioning algorithm is convex, which is very restrictive. [This] is acceptable for moderate values of n, but it is prohibitive for applications on large databases. Ng & Han (1994) explore partitioning algorithms for KDD in spatial databases....


  • ...Unfortunately, the run time of this approach is prohibitive for large n, because it implies O(n) calls of CLARANS. Jain (1988) explores a density-based approach to identify clusters in k-dimensional point sets....


  • ...CLARANS assumes that all objects to be clustered can reside in main memory at the same time, which does not hold for large databases. Furthermore, the run time of CLARANS is prohibitive on large databases. Therefore, Ester, Kriegel & Xu (1995) present several focusing techniques which address both of these problems by focusing the clustering process on the relevant parts of the database....


  • ...clusters found by a partitioning algorithm is convex, which is very restrictive. [This] is acceptable for moderate values of n, but it is prohibitive for applications on large databases. Ng & Han (1994) explore partitioning algorithms for KDD in spatial databases. An algorithm called CLARANS (Clustering Large Applications based on RANdomized Search) is introduced which is an improved k-medoid method. Compared to former k-medoid algorithms, CLARANS is more effective and more efficient. An experimental evaluation indicates that CLARANS runs efficiently on databases of thousands of objects. Ng & Han (1994) also discuss methods to determine the “natural” number k_nat of clusters in a database....


Proceedings Article
12 Sep 1994
TL;DR: The analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms.
Abstract: Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. In this paper, we explore whether clustering methods have a role to play in spatial data mining. To this end, we develop a new clustering method called CLARANS, which is based on randomized search. We also develop two spatial data mining algorithms that use CLARANS. Our analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms. Furthermore, experiments conducted to compare the performance of CLARANS with that of existing clustering methods show that CLARANS is the most efficient.

1,999 citations


"A density-based algorithm for disco..." refers background in this paper

  • ...Ng & Han (1994) also discuss methods to determine the “natural” number k_nat of clusters in a database....


  • ...Ng & Han (1994) explore partitioning algorithms for...


Journal ArticleDOI
01 Oct 1994
TL;DR: This work surveys data modeling, querying, data structures and algorithms, and system architecture for spatial database systems, with the emphasis on describing known technology in a coherent manner, rather than listing open problems.
Abstract: We propose a definition of a spatial database system as a database system that offers spatial data types in its data model and query language, and supports spatial data types in its implementation, providing at least spatial indexing and spatial join methods. Spatial database systems offer the underlying database technology for geographic information systems and other applications. We survey data modeling, querying, data structures and algorithms, and system architecture for such systems. The emphasis is on describing known technology in a coherent manner, rather than listing open problems.

744 citations


"A density-based algorithm for disco..." refers background in this paper

  • ...Spatial Database Systems (SDBS) (Gueting 1994) are database systems for the management of spatial data....
