Proceedings Article

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

TL;DR: DBSCAN, a new clustering algorithm relying on a density-based notion of clusters and designed to discover clusters of arbitrary shape, is presented; it requires only one input parameter and supports the user in determining an appropriate value for it.
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
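The density-based notion of clusters in the abstract can be sketched in a few lines. The following is a minimal brute-force illustration (quadratic in n; the paper instead uses a spatial index for neighborhood queries), not the authors' implementation; `eps` and `min_pts` are the conventional names for the Eps-neighborhood radius and the density threshold.

```python
from collections import deque

def region_query(points, i, eps):
    # Indices of all points within distance eps of points[i] (brute force, O(n)).
    p = points[i]
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)   # None = unvisited, -1 = noise, k = cluster id
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1          # provisional noise; may later join as a border point
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:     # noise reachable from a core point: border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(points, eps=2.0, min_pts=3)
# → [0, 0, 0, 0, 1, 1, 1, 1, -1]  (two clusters, one noise point)
```

Note how the cluster grows only through core points, so arbitrarily shaped clusters emerge from chains of dense neighborhoods, which is the property the abstract contrasts with convex-cluster partitioning methods.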


Citations
Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research.
  • Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
  • Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
  • Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface; the algorithms in the toolkit cover data pre-processing, classification, regression, clustering, association rules, and visualization

20,196 citations

Journal ArticleDOI
TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.
Abstract: Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.

9,627 citations

Journal ArticleDOI
01 Jun 2010
TL;DR: A brief overview of clustering is provided, well known clustering methods are summarized, the major challenges and key issues in designing clustering algorithms are discussed, and some of the emerging and useful research directions are pointed out.
Abstract: Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty in designing a general purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.

6,601 citations

Journal ArticleDOI
TL;DR: Clustering algorithms for data sets appearing in statistics, computer science, and machine learning are surveyed, and their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts are illustrated.
Abstract: Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, such as proximity measures and cluster validation, are also discussed.

5,744 citations


Cites background from "A density-based algorithm for disco..."

  • ...3) Many novel algorithms have been developed to cluster large-scale data sets, especially in the context of data mining [44], [45], [85], [135], [213], [248]....


  • ...DBSCAN requires that the density in a neighborhood for an object should be high enough if it belongs to a cluster....


  • ...Clustering Algorithms:
      A. Distance and Similarity Measures (see also Table I)
      B. Hierarchical
         - Agglomerative: single linkage, complete linkage, group average linkage, median linkage, centroid linkage, Ward's method, balanced iterative reducing and clustering using hierarchies (BIRCH), clustering using representatives (CURE), robust clustering using links (ROCK)
         - Divisive: divisive analysis (DIANA), monothetic analysis (MONA)
      C. Squared Error-Based (Vector Quantization): K-means, iterative self-organizing data analysis technique (ISODATA), genetic K-means algorithm (GKA), partitioning around medoids (PAM)
      D. pdf Estimation via Mixture Densities: Gaussian mixture density decomposition (GMDD), AutoClass
      E. Graph Theory-Based: Chameleon, Delaunay triangulation graph (DTG), highly connected subgraphs (HCS), clustering identification via connectivity kernels (CLICK), cluster affinity search technique (CAST)
      F. Combinatorial Search Techniques-Based: genetically guided algorithm (GGA), TS clustering, SA clustering
      G. Fuzzy: fuzzy c-means (FCM), mountain method (MM), possibilistic c-means clustering algorithm (PCM), fuzzy c-shells (FCS)
      H. Neural Networks-Based: learning vector quantization (LVQ), self-organizing feature map (SOFM), ART, simplified ART (SART), hyperellipsoidal clustering network (HEC), self-splitting competitive learning network (SPLL)
      I. Kernel-Based: kernel K-means, support vector clustering (SVC)
      J. Sequential Data: sequence similarity, indirect sequence clustering, statistical sequence clustering
      K. Large-Scale Data Sets (see also Table II): CLARA, CURE, CLARANS, BIRCH, DBSCAN, DENCLUE, WaveCluster, FC, ART
      L. Data Visualization and High-Dimensional Data: PCA, ICA, projection pursuit, Isomap, LLE, CLIQUE, OptiGrid, ORCLUS
      M....


  • ...BIRCH was generalized into a broader framework in [101], with two algorithm realizations named BUBBLE and BUBBLE-FM. d) Density-based approach, e.g., density-based spatial clustering of applications with noise (DBSCAN) [85] and density-based clustering (DENCLUE) [135]....


  • ...DBSCAN uses an R*-tree structure for more efficient queries....

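To illustrate why a spatial index matters for these region queries, here is a toy grid-bucket index. It is a hypothetical stand-in for the R*-tree, not the R*-tree itself, but it shows the same effect: a neighborhood query of radius eps inspects only the 3x3 block of cells around a point instead of all n points.

```python
from collections import defaultdict

class GridIndex:
    """Toy 2-D spatial index: bucket points into square cells of side eps,
    so a region query only examines the 3x3 neighborhood of cells."""
    def __init__(self, points, eps):
        self.eps = eps
        self.points = points
        self.cells = defaultdict(list)
        for i, (x, y) in enumerate(points):
            self.cells[(int(x // eps), int(y // eps))].append(i)

    def region_query(self, i):
        # Indices of all points within eps of points[i].
        x, y = self.points[i]
        cx, cy = int(x // self.eps), int(y // self.eps)
        out = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in self.cells.get((cx + dx, cy + dy), []):
                    px, py = self.points[j]
                    if (px - x) ** 2 + (py - y) ** 2 <= self.eps ** 2:
                        out.append(j)
        return out

points = [(0.0, 0.0), (0.5, 0.5), (3.0, 3.0), (0.4, 0.1)]
idx = GridIndex(points, eps=1.0)
idx.region_query(0)   # indices of points within eps of points[0]
```

With roughly uniform data, each query touches O(1) cells, which is the spirit of the paper's O(log n) R*-tree region queries; a real R*-tree additionally handles rectangles, balancing, and disk pages.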

Journal ArticleDOI
21 May 2015-Cell
TL;DR: Drop-seq will accelerate biological discovery by enabling routine transcriptional profiling at single-cell resolution: cells are separated into nanoliter-sized aqueous droplets, a different barcode is associated with each cell's RNAs, and all of them are sequenced together.

5,506 citations


Cites background from "A density-based algorithm for disco..."

  • ...Point clouds on the t-SNE map represent candidate cell types; density clustering (Ester et al., 1996) identified these regions....


References
Book
01 Jan 1990
TL;DR: A monograph on cluster analysis introducing partitioning around medoids (PAM), clustering large applications (CLARA), fuzzy analysis, agglomerative nesting (AGNES), divisive analysis (DIANA), and monothetic analysis (MONA).
Abstract: 1. Introduction. 2. Partitioning Around Medoids (Program PAM). 3. Clustering large Applications (Program CLARA). 4. Fuzzy Analysis. 5. Agglomerative Nesting (Program AGNES). 6. Divisive Analysis (Program DIANA). 7. Monothetic Analysis (Program MONA). Appendix 1. Implementation and Structure of the Programs. Appendix 2. Running the Programs. Appendix 3. Adapting the Programs to Your Needs. Appendix 4. The Program CLUSPLOT. References. Author Index. Subject Index.

10,537 citations


Additional excerpts

  • ...For each of the discovered clusterings the silhouette coefficient (Kaufman & Rousseeuw 1990) is calculated, and finally, the clustering with the maximum silhouette coefficient is chosen as the “natural” clustering....


  • ...Clustering Algorithms There are two basic types of clustering algorithms (Kaufman & Rousseeuw 1990): partitioning and hierarchical algorithms....


  • ...Kaufman L., and Rousseeuw P.J. 1990....

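The silhouette-based selection of a "natural" clustering mentioned in the excerpts above can be made concrete. This is a small self-contained sketch of the silhouette coefficient of Kaufman & Rousseeuw, the mean over all points of (b - a) / max(a, b), assuming at least two clusters; the function name is illustrative.

```python
def silhouette_coefficient(points, labels):
    """Mean silhouette over all points.
    For point i: a = mean distance to its own cluster,
                 b = smallest mean distance to any other cluster,
                 s(i) = (b - a) / max(a, b); singletons get s(i) = 0.
    Assumes labels contain at least two distinct clusters."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = {}
    for idx, l in enumerate(labels):
        clusters.setdefault(l, []).append(idx)

    scores = []
    for idx, l in enumerate(labels):
        own = [j for j in clusters[l] if j != idx]
        if not own:                      # singleton cluster
            scores.append(0.0)
            continue
        a = sum(dist(points[idx], points[j]) for j in own) / len(own)
        b = min(sum(dist(points[idx], points[j]) for j in js) / len(js)
                for l2, js in clusters.items() if l2 != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_coefficient(points, [0, 0, 1, 1])   # near 1: well separated
bad = silhouette_coefficient(points, [0, 1, 0, 1])    # negative: clusters mixed up
```

Running the candidate clusterings through this score and keeping the maximum is exactly the model-selection recipe the excerpt describes.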

Book
01 Jan 1988

8,586 citations

Proceedings ArticleDOI
01 May 1990
TL;DR: The R*-tree is designed which incorporates a combined optimization of area, margin and overlap of each enclosing rectangle in the directory which clearly outperforms the existing R-tree variants.
Abstract: The R-tree, one of the most popular access methods for rectangles, is based on the heuristic optimization of the area of the enclosing rectangle in each inner node. By running numerous experiments in a standardized testbed under highly varying data, queries and operations, we were able to design the R*-tree, which incorporates a combined optimization of area, margin and overlap of each enclosing rectangle in the directory. Using our standardized testbed in an exhaustive performance comparison, it turned out that the R*-tree clearly outperforms the existing R-tree variants: Guttman's linear and quadratic R-tree and Greene's variant of the R-tree. This superiority of the R*-tree holds for different types of queries and operations, such as map overlay, for both rectangles and multidimensional points in all experiments. From a practical point of view the R*-tree is very attractive for two reasons: (1) it efficiently supports point and spatial data at the same time, and (2) its implementation cost is only slightly higher than that of other R-trees.

4,686 citations


"A density-based algorithm for disco..." refers background or methods in this paper

  • ...Brinkhoff T., Kriegel H.-P., Schneider R., and Seeger B. 1994 Efficient Multi-Step Processing of Spatial Joins, Proc....


  • ...clusters found by a partitioning algorithm is convex, which is very restrictive. [This] is acceptable for moderate values of n, but it is prohibitive for applications on large databases. Ng & Han (1994) explore partitioning algorithms for KDD in spatial databases....


  • ...Unfortunately, the run time of this approach is prohibitive for large n, because it implies O(n) calls of CLARANS. Jain (1988) explores a density-based approach to identify clusters in k-dimensional point sets....


  • ...CLARANS assumes that all objects to be clustered can reside in main memory at the same time, which does not hold for large databases. Furthermore, the run time of CLARANS is prohibitive on large databases. Therefore, Ester, Kriegel & Xu (1995) present several focusing techniques which address both of these problems by focusing the clustering process on the relevant parts of the database....


  • ...clusters found by a partitioning algorithm is convex, which is very restrictive. [This] is acceptable for moderate values of n, but it is prohibitive for applications on large databases. Ng & Han (1994) explore partitioning algorithms for KDD in spatial databases. An algorithm called CLARANS (Clustering Large Applications based on RANdomized Search) is introduced which is an improved k-medoid method. Compared to former k-medoid algorithms, CLARANS is more effective and more efficient. An experimental evaluation indicates that CLARANS runs efficiently on databases of thousands of objects. Ng & Han (1994) also discuss methods to determine the “natural” number k_nat of clusters in a database....


Proceedings Article
12 Sep 1994
TL;DR: The analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms.
Abstract: Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. In this paper, we explore whether clustering methods have a role to play in spatial data mining. To this end, we develop a new clustering method called CLARANS, which is based on randomized search. We also develop two spatial data mining algorithms that use CLARANS. Our analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms. Furthermore, experiments conducted to compare the performance of CLARANS with that of existing clustering methods show that CLARANS is the most efficient.

1,999 citations


"A density-based algorithm for disco..." refers background in this paper

  • ...Ng & Han (1994) also discuss methods to determine the “natural” number k_nat of clusters in a database....


  • ...Ng & Han (1994) explore partitioning algorithms for...


Journal ArticleDOI
01 Oct 1994
TL;DR: This work surveys data modeling, querying, data structures and algorithms, and system architecture for spatial database systems, with the emphasis on describing known technology in a coherent manner, rather than listing open problems.
Abstract: We propose a definition of a spatial database system as a database system that offers spatial data types in its data model and query language, and supports spatial data types in its implementation, providing at least spatial indexing and spatial join methods. Spatial database systems offer the underlying database technology for geographic information systems and other applications. We survey data modeling, querying, data structures and algorithms, and system architecture for such systems. The emphasis is on describing known technology in a coherent manner, rather than listing open problems.

744 citations


"A density-based algorithm for disco..." refers background in this paper

  • ...Spatial Database Systems (SDBS) (Gueting 1994) are database systems for the management of spatial data....
