Topic

CURE data clustering algorithm

About: CURE data clustering algorithm is a research topic. Over the lifetime, 13766 publications have been published within this topic receiving 461296 citations.

...read moreread less

Papers published on a yearly basis

1 / 2

Papers

PDF

Open Access

More filters

Journal Article•DOI•

NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set

[...]

Malika Charrad, Nadia Ghazzali, Véronique Boiteau, Azam Niknafs

03 Nov 2014-Journal of Statistical Software

TL;DR: The R package NbClust provides 30 indices which determine the number of clusters in a data set and it offers also the best clustering scheme from different results to the user.

...read moreread less

Abstract: Clustering is the partitioning of a set of objects into groups (clusters) so that objects within a group are more similar to each others than objects in different groups. Most of the clustering algorithms depend on some assumptions in order to define the subgroups present in a data set. As a consequence, the resulting clustering scheme requires some sort of evaluation as regards its validity. The evaluation procedure has to tackle difficult problems such as the quality of clusters, the degree with which a clustering scheme fits a specific data set and the optimal number of clusters in a partitioning. In the literature, a wide variety of indices have been proposed to find the optimal number of clusters in a partitioning of a data set during the clustering process. However, for most of indices proposed in the literature, programs are unavailable to test these indices and compare them. The R package NbClust has been developed for that purpose. It provides 30 indices which determine the number of clusters in a data set and it offers also the best clustering scheme from different results to the user. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, to help determining the most appropriate number of clusters for the data set of interest.

...read moreread less

1,912 citations

Book Chapter•DOI•

A framework for clustering evolving data streams

[...]

Charu C. Aggarwal, Jiawei Han¹, Jianyong Wang¹, Philip S. Yu•Institutions (1)

University of Illinois at Urbana–Champaign¹

09 Sep 2003

TL;DR: A fundamentally different philosophy for data stream clustering is discussed which is guided by application-centered requirements and uses the concepts of a pyramidal time frame in conjunction with a microclustering approach.

...read moreread less

Abstract: The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream renders most traditional algorithms too inefficient. In recent years, a few one-pass clustering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evolution of the data and do not address the following issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discovering and exploring clusters over different portions of the stream. The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not very useful from an application point of view. For example, a simple one-pass clustering algorithm over an entire data stream of a few years is dominated by the outdated history of the stream. The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters. At the same time, it is not possible to simultaneously perform dynamic clustering over all possible time horizons for a data stream of even moderately large volume. This paper discusses a fundamentally different philosophy for data stream clustering which is guided by application-centered requirements. The idea is divide the clustering process into an online component which periodically stores detailed summary statistics and an offine component which uses only this summary statistics. The offine component is utilized by the analyst who can use a wide variety of inputs (such as time horizon or number of clusters) in order to provide a quick understanding of the broad clusters in the data stream. The problems of efficient choice, storage, and use of this statistical data for a fast data stream turns out to be quite tricky. For this purpose, we use the concepts of a pyramidal time frame in conjunction with a microclustering approach. Our performance experiments over a number of real and synthetic data sets illustrate the effectiveness, efficiency, and insights provided by our approach.

...read moreread less

1,836 citations

Journal Article•DOI•

Scatter/Gather: a cluster-based approach to browsing large document collections

[...]

Douglass R. Cutting¹, David R. Karger¹, Jan O. Pedersen¹, John W Tukey¹•Institutions (1)

PARC¹

01 Jun 1992

TL;DR: A document browsing technique that employs docum-ent clustering as its primary operation is presented and a fast (linear time) clustering algorithm is presented that provides a powerful new access paradigm.

...read moreread less

Abstract: Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval.We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.

...read moreread less

1,596 citations

Journal Article•DOI•

A new approach to clustering

[...]

Enrique H. Ruspini¹•Institutions (1)

University of California, Los Angeles¹

01 Jul 1969-Information & Computation

TL;DR: A new method of representation of the reduced data, based on the idea of “fuzzy sets,” is proposed to avoid some of the problems of current clustering procedures and to provide better insight into the structure of the original data.

...read moreread less

Abstract: A general formulation of data reduction and clustering processes is proposed. These procedures are regarded as mappings or transformations of the original space onto a “representation” or “code” space subjected to some constraints. Current clustering methods, as well as three other data reduction techniques, are specified within the framework of this formulation. A new method of representation of the reduced data, based on the idea of “fuzzy sets,” is proposed to avoid some of the problems of current clustering procedures and to provide better insight into the structure of the original data.

...read moreread less

1,452 citations

Journal Article•DOI•

Subspace clustering for high dimensional data: a review

[...]

Lance Parsons¹, Ehtesham Haque¹, Huan Liu¹•Institutions (1)

Arizona State University¹

01 Jun 2004-Sigkdd Explorations

TL;DR: A survey of the various subspace clustering algorithms along with a hierarchy organizing the algorithms by their defining characteristics is presented, comparing the two main approaches using empirical scalability and accuracy tests and discussing some potential applications where sub space clustering could be particularly useful.

...read moreread less

Abstract: Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Often in high dimensional data, many dimensions are irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms localize the search for relevant dimensions allowing them to find clusters that exist in multiple, possibly overlapping subspaces. There are two major branches of subspace clustering based on their search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches find dense regions in low dimensional spaces and combine them to form clusters. This paper presents a survey of the various subspace clustering algorithms along with a hierarchy organizing the algorithms by their defining characteristics. We then compare the two main approaches to subspace clustering using empirical scalability and accuracy tests and discuss some potential applications where subspace clustering could be particularly useful.

...read moreread less

1,419 citations

Collapse

Network Information

Performance

Metrics

14,192

Papers

510,508

Citations

No. of papers in the topic in previous years
Year	Papers
2023	93
2022	323
2021	7
2020	5
2019	19
2018	113

CURE data clustering algorithm

Papers published on a yearly basis

Papers

Trending Questions (10)

Network Information

Related Topics (5)

Performance

Metrics