
Showing papers on "Cluster analysis published in 2002"


Journal ArticleDOI
TL;DR: This work develops and analyzes low-energy adaptive clustering hierarchy (LEACH), a protocol architecture for microsensor networks that combines the ideas of energy-efficient cluster-based routing and media access together with application-specific data aggregation to achieve good performance in terms of system lifetime, latency, and application-perceived quality.
Abstract: Networking together hundreds or thousands of cheap microsensor nodes allows users to accurately monitor a remote environment by intelligently combining the data from the individual nodes. These networks require robust wireless communication protocols that are energy efficient and provide low latency. We develop and analyze low-energy adaptive clustering hierarchy (LEACH), a protocol architecture for microsensor networks that combines the ideas of energy-efficient cluster-based routing and media access together with application-specific data aggregation to achieve good performance in terms of system lifetime, latency, and application-perceived quality. LEACH includes a new, distributed cluster formation technique that enables self-organization of large numbers of nodes, algorithms for adapting clusters and rotating cluster head positions to evenly distribute the energy load among all the nodes, and techniques to enable distributed signal processing to save communication resources. Our results show that LEACH can improve system lifetime by an order of magnitude compared with general-purpose multihop approaches.
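The distributed, rotating cluster-head election LEACH describes can be pictured with a short sketch. This is an illustrative Python sketch with my own function names, not the authors' code; it assumes the paper's randomized rotation scheme, in which each node elects itself with a round-dependent threshold so the energy-hungry cluster-head role circulates.

```python
import random

def leach_threshold(p, r):
    """Election threshold for round r, where p is the desired fraction of
    cluster heads; nodes that served recently are excluded via `eligible`."""
    return p / (1 - p * (r % int(1 / p)))

def elect_cluster_heads(nodes, p, r, eligible):
    """Each eligible node independently becomes a cluster head with
    probability leach_threshold(p, r), so the role rotates over rounds."""
    t = leach_threshold(p, r)
    return [n for n in nodes if n in eligible and random.random() < t]
```

Over roughly 1/p rounds every node serves as cluster head about once, spreading the energy load evenly, which is one of the mechanisms behind the lifetime gains reported.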

10,296 citations


Journal ArticleDOI
TL;DR: This work presents a simple and efficient implementation of Lloyd's k-means clustering algorithm, which it calls the filtering algorithm, and establishes the practical efficiency of the algorithm's running time.
Abstract: In k-means clustering, we are given a set of n data points in d-dimensional space ℝd and an integer k, and the problem is to determine a set of k points in ℝd, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
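The Lloyd iteration that the filtering algorithm accelerates can be sketched in plain Python. This sketch omits the kd-tree (the paper's actual contribution) and only shows the assignment and centroid-update steps, with names of my choosing:

```python
import random

def lloyd_kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: repeat (assign each point to its nearest
    center, recompute each center as its cluster's mean). The filtering
    algorithm speeds up the assignment step with a kd-tree, omitted here."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: nearest center by squared Euclidean distance
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # update step: move each center to its cluster mean (keep old center
        # if the cluster emptied)
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers
```

On well-separated data the assignment step does less work per iteration in the filtered version, which is exactly the data-sensitive behaviour the analysis quantifies.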

5,288 citations


Journal ArticleDOI
TL;DR: This work reviews a general methodology for model-based clustering that provides a principled statistical approach to important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be used, and how should outliers be handled.
Abstract: Cluster analysis is the automated search for groups of related observations in a dataset. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as how many clusters are there, which clustering method should be used, and how should outliers be handled. We review a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology and discuss recent development...

4,123 citations


Proceedings Article
01 Jan 2002
TL;DR: This paper presents an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in ℝn, learns a distance metric over ℝn that respects these relationships.
Abstract: Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many "plausible" ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider "similar." For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in ℝn, learns a distance metric over ℝn that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance.
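Metrics of this kind are typically parameterized as d_A(x, y) = sqrt((x − y)ᵀ A (x − y)) for a positive semidefinite matrix A. As a hedged illustration, the helper below evaluates such a distance for a diagonal A, one natural special case of that parameterization; the function name and interface are mine:

```python
import math

def mahalanobis(x, y, a_diag):
    """Distance d_A(x, y) = sqrt((x-y)^T A (x-y)) for a diagonal metric
    matrix A given by its diagonal entries a_diag. With a_diag all ones this
    reduces to the ordinary Euclidean distance."""
    return math.sqrt(sum(a * (xi - yi) ** 2
                         for a, xi, yi in zip(a_diag, x, y)))
```

Learning then amounts to choosing A so that similar pairs end up close under d_A and dissimilar pairs far apart, which the paper casts as a convex program.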

3,176 citations


Journal ArticleDOI
TL;DR: The novelty of the approach is that it does not use a model selection criterion to choose one among a set of preestimated candidate models; instead, it seamlessly integrates estimation and model selection in a single algorithm.
Abstract: This paper proposes an unsupervised algorithm for learning a finite mixture model from multivariate data. The adjective "unsupervised" is justified by two properties of the algorithm: 1) it is capable of selecting the number of components and 2) unlike the standard expectation-maximization (EM) algorithm, it does not require careful initialization. The proposed method also avoids another drawback of EM for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. The novelty of our approach is that we do not use a model selection criterion to choose one among a set of preestimated candidate models; instead, we seamlessly integrate estimation and model selection in a single algorithm. Our technique can be applied to any type of parametric mixture model for which it is possible to write an EM algorithm; in this paper, we illustrate it with experiments involving Gaussian mixtures. These experiments testify to the good performance of our approach.

2,182 citations


Journal ArticleDOI
TL;DR: Genesis integrates various tools for microarray data analysis such as filters, normalization and visualization tools, distance measures as well as common clustering algorithms including hierarchical clustering, self-organizing maps, k-means, principal component analysis, and support vector machines.
Abstract: Summary: A versatile, platform-independent and easy-to-use Java suite for large-scale gene expression analysis was developed. Genesis integrates various tools for microarray data analysis such as filters, normalization and visualization tools, distance measures as well as common clustering algorithms including hierarchical clustering, self-organizing maps, k-means, principal component analysis, and support vector machines. The results of the clustering are transparent across all implemented methods and enable the analysis of the outcome of different algorithms and parameters. Additionally, mapping of gene expression data onto chromosomal sequences was implemented to enhance promoter analysis and investigation of transcriptional control mechanisms. Availability: http://genome.tugraz.at Contact: zlatko.trajanoski@tugraz.at

1,768 citations


Journal ArticleDOI
TL;DR: In this article, the authors present the results of a large library of cosmological N-body simulations, using power-law initial spectra, showing that, when transformed under the self-similarity scaling, the scale-free spectra define a nonlinear locus that is clearly shallower than would be required under stable clustering.
Abstract: We present the results of a large library of cosmological N-body simulations, using power-law initial spectra. The nonlinear evolution of the matter power spectra is compared with the predictions of existing analytic scaling formulae based on the work of Hamilton et al. The scaling approach has assumed that highly nonlinear structures obey `stable clustering' and are frozen in proper coordinates. Our results show that, when transformed under the self-similarity scaling, the scale-free spectra define a nonlinear locus that is clearly shallower than would be required under stable clustering. Furthermore, the small-scale nonlinear power increases as both the power-spectrum index n and the density parameter Omega decrease, and this evolution is not well accounted for by the previous scaling formulae. This breakdown of stable clustering can be understood as resulting from the modification of dark-matter haloes by continuing mergers. These effects are naturally included in the analytic `halo model' for nonlinear structure; using this approach we are able to fit both our scale-free results and also our previous CDM data. This approach is more accurate than the commonly used Peacock-Dodds formula and should be applicable to more general power spectra. Code to evaluate nonlinear power spectra using this method is available online; following publication, we will make the power-law simulation data available through the Virgo website.

1,693 citations


Journal ArticleDOI
TL;DR: An improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix is created, and a Python and a Perl interface to the C Clustering Library is generated, thereby combining the flexibility of a scripting language with the speed of C.
Abstract: SUMMARY We have implemented k-means clustering, hierarchical clustering and self-organizing maps in a single multipurpose open-source library of C routines, callable from other C and C++ programs. Using this library, we have created an improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix. In addition, we generated a Python and a Perl interface to the C Clustering Library, thereby combining the flexibility of a scripting language with the speed of C. AVAILABILITY The C Clustering Library and the corresponding Python C extension module Pycluster were released under the Python License, while the Perl module Algorithm::Cluster was released under the Artistic License. The GUI code Cluster 3.0 for Windows, Macintosh and Linux/Unix, as well as the corresponding command-line program, were released under the same license as the original Cluster code. The complete source code is available at http://bonsai.ims.u-tokyo.ac.jp/mdehoon/software/cluster. Alternatively, Algorithm::Cluster can be downloaded from CPAN, while Pycluster is also available as part of the Biopython distribution.

1,493 citations


Journal ArticleDOI
TL;DR: In this article, the authors identify two shortcomings of existing research on the clustering phenomenon and argue for the need to establish a specific theory of the cluster where learning occupies centre stage.
Abstract: A number of possible advantages of industry agglomeration—or spatial clustering—have been identified in the research literature, notably those related to shared costs for infrastructure, the build-up of a skilled labour force, transaction efficiency, and knowledge spillovers leading to firm learning and innovation. We identify two shortcomings of existing research on the clustering phenomenon. First, the abundance of theoretical concepts and explanations stands in sharp contrast with the general lack of work aimed at validating these mechanisms empirically and the contradictory evidence found in recent empirical work in the field. Second, there is still a lack of a unified theoretical framework for analyzing spatial clustering. In an attempt to remedy the latter shortcoming, this paper investigates the nature of the cluster from a knowledge-creation or learning perspective. We argue for the need to establish a specific theory of the cluster where learning occupies centre stage. The basic requirements for ...

1,454 citations


Journal ArticleDOI
TL;DR: An on-demand distributed clustering algorithm for multi-hop packet radio networks that takes into consideration the ideal degree, transmission power, mobility, and battery power of mobile nodes, and is aimed to reduce the computation and communication costs.
Abstract: In this paper, we propose an on-demand distributed clustering algorithm for multi-hop packet radio networks. These types of networks, also known as ad hoc networks, are dynamic in nature due to the mobility of nodes. The association and dissociation of nodes to and from clusters perturb the stability of the network topology, and hence a reconfiguration of the system is often unavoidable. However, it is vital to keep the topology stable as long as possible. The clusterheads, forming a dominant set in the network, determine the topology and its stability. The proposed weight-based distributed clustering algorithm takes into consideration the ideal degree, transmission power, mobility, and battery power of mobile nodes. The time required to identify the clusterheads depends on the diameter of the underlying graph. We try to keep the number of nodes in a cluster around a pre-defined threshold to facilitate the optimal operation of the medium access control (MAC) protocol. The non-periodic procedure for clusterhead election is invoked on demand and aims to reduce the computation and communication costs. The clusterheads, operating in "dual" power mode, connect the clusters, which helps in routing messages from a node to any other node. We observe a trade-off between the uniformity of the load handled by the clusterheads and the connectivity of the network. Simulation experiments are conducted to evaluate the performance of our algorithm in terms of the number of clusterheads, reaffiliation frequency, and dominant set updates. Results show that our algorithm performs better than existing ones and is also tunable to different kinds of network conditions.
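A weight-based election of this kind can be pictured as ranking nodes by a combined score over the four quantities the abstract lists. The sketch below is illustrative only: the field names and weighting factors are mine, not the paper's.

```python
def combined_weight(node, c1=0.7, c2=0.2, c3=0.05, c4=0.05):
    """Fitness of a node for clusterhead duty: a weighted sum of its degree
    difference from the ideal, sum of distances to neighbours, mobility, and
    consumed battery power. Lower weight means better suited. The factor
    values are illustrative placeholders."""
    return (c1 * node["degree_diff"] + c2 * node["dist_sum"]
            + c3 * node["mobility"] + c4 * node["battery_used"])

def elect(nodes, **factors):
    """Pick the node with the smallest combined weight as clusterhead."""
    return min(nodes, key=lambda n: combined_weight(n, **factors))
```

Tuning the factors trades one concern against another, e.g. emphasizing low mobility yields more stable but possibly less load-balanced clusters, which mirrors the trade-off the paper observes.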

1,419 citations


Journal ArticleDOI
TL;DR: In this paper, a Gaussian kernel based clustering method using support vector machines (SVM) is proposed to find the minimal enclosing sphere, which can separate into several components, each enclosing a separate cluster of points.
Abstract: We present a novel clustering method using the approach of support vector machines. Data points are mapped by means of a Gaussian kernel to a high dimensional feature space, where we search for the minimal enclosing sphere. This sphere, when mapped back to data space, can separate into several components, each enclosing a separate cluster of points. We present a simple algorithm for identifying these clusters. The width of the Gaussian kernel controls the scale at which the data is probed while the soft margin constant helps cope with outliers and overlapping clusters. The structure of a dataset is explored by varying the two parameters, maintaining a minimal number of support vectors to assure smooth cluster boundaries. We demonstrate the performance of our algorithm on several datasets.

Journal ArticleDOI
TL;DR: This article evaluates the performance of three clustering algorithms, hard K-Means, single linkage, and a simulated annealing (SA) based technique, in conjunction with four cluster validity indices, namely the Davies-Bouldin index, Dunn's index, the Calinski-Harabasz index, and a recently developed index I.
Abstract: In this article, we evaluate the performance of three clustering algorithms, hard K-Means, single linkage, and a simulated annealing (SA) based technique, in conjunction with four cluster validity indices, namely the Davies-Bouldin index, Dunn's index, the Calinski-Harabasz index, and a recently developed index I. Based on a relation between the index I and Dunn's index, a lower bound on the value of the former is theoretically estimated in order to obtain a unique hard K-partition when the data set has distinct substructures. The effectiveness of the different validity indices and clustering methods in automatically evolving the appropriate number of clusters is demonstrated experimentally for both artificial and real-life data sets with the number of clusters varying from two to ten. Once the appropriate number of clusters is determined, the SA-based clustering technique is used for proper partitioning of the data into the said number of clusters.
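As a concrete example of a validity index, Dunn's index (one of the four evaluated) is the smallest inter-cluster distance divided by the largest intra-cluster distance; larger values indicate compact, well-separated clusters. A minimal 1-D Python sketch, with my own naming:

```python
def dunn_index(clusters):
    """Dunn's index for a partition given as lists of 1-D values:
    min inter-cluster distance / max intra-cluster distance.
    Higher is better (compact, well-separated clusters)."""
    def d(a, b):
        return abs(a - b)
    # smallest distance between points in different clusters
    inter = min(d(a, b)
                for i, ci in enumerate(clusters)
                for cj in clusters[i + 1:]
                for a in ci for b in cj)
    # largest distance between points in the same cluster (the diameter)
    diam = max(d(a, b) for c in clusters for a in c for b in c)
    return inter / diam
```

Sweeping the number of clusters and taking the partition that maximizes such an index is the basic mechanism by which validity indices "evolve" the appropriate cluster count.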

Journal ArticleDOI
TL;DR: A new clustering method is proposed, called CLARANS, whose aim is to identify spatial structures that may be present in the data, and two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes are developed.
Abstract: Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, it proposes a new clustering method called CLARANS, whose aim is to identify spatial structures that may be present in the data. Experimental results indicate that, when compared with existing clustering methods, CLARANS is very efficient and effective. Second, the paper investigates how CLARANS can handle not only point objects, but also polygon objects efficiently. One of the methods considered, called the IR-approximation, is very efficient in clustering convex and nonconvex polygon objects. Third, building on top of CLARANS, the paper develops two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes. Both algorithms can discover knowledge that is difficult to find with existing spatial data mining algorithms.

Proceedings ArticleDOI
10 Dec 2002
TL;DR: A communication protocol named LEACH (low-energy adaptive clustering hierarchy) is modified and its stochastic cluster-head selection algorithm is extended by a deterministic component to reduce the power consumption of wireless microsensor networks.
Abstract: This paper focuses on reducing the power consumption of wireless microsensor networks. Therefore, a communication protocol named LEACH (low-energy adaptive clustering hierarchy) is modified. We extend LEACH's stochastic cluster-head selection algorithm by a deterministic component. Depending on the network configuration an increase of network lifetime by about 30% can be accomplished. Furthermore, we present a new approach to define lifetime of microsensor networks using three new metrics FND (First Node Dies), HNA (Half of the Nodes Alive), and LND (Last Node Dies).
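The three lifetime metrics can be computed directly from the round in which each node dies. A small sketch; note that the exact HNA convention (taken here as the round of the ⌈n/2⌉-th death, i.e. when half the nodes remain alive) is my assumption, not necessarily the paper's definition:

```python
def lifetime_metrics(death_rounds):
    """Given the round in which each sensor node died, compute the three
    network-lifetime metrics: FND (first node dies), HNA (half of the nodes
    alive; here the round of the ceil(n/2)-th death), LND (last node dies)."""
    rounds = sorted(death_rounds)
    n = len(rounds)
    return {"FND": rounds[0],
            "HNA": rounds[(n + 1) // 2 - 1],
            "LND": rounds[-1]}
```

Reporting all three gives a fuller picture than a single "lifetime" number: FND matters when every node is critical, LND when any surviving node keeps the application useful.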

Journal ArticleDOI
TL;DR: The standard scale-free network model is extended to include a "triad formation step" and the clustering coefficient is shown to be tunable simply by changing a control parameter---the average number of triad formation trials per time step.
Abstract: We extend the standard scale-free network model to include a "triad formation step." We analyze the geometric properties of networks generated by this algorithm both analytically and by numerical calculations, and find that our model possesses the same characteristics as the standard scale-free networks, such as the power-law degree distribution and the small average geodesic length, but with high clustering at the same time. In our model, the clustering coefficient is also shown to be tunable simply by changing a control parameter: the average number of triad formation trials per time step.
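A growth process of this flavor, preferential attachment plus an occasional triangle-closing step, can be sketched as below. This is a loose illustration under my own seed-clique initialization and parameter names, not the authors' exact model:

```python
import random

def triad_pa_network(n, m, p_triad, seed=0):
    """Grow a graph to n nodes; each new node attaches to m targets. A target
    is chosen by preferential attachment (degree-weighted `stubs` pool), but
    with probability p_triad the next edge instead goes to a neighbour of the
    previous target, closing a triangle and raising the clustering."""
    rng = random.Random(seed)
    adj = {i: set(range(m + 1)) - {i} for i in range(m + 1)}  # seed clique
    stubs = [i for i in adj for _ in adj[i]]  # each node appears degree times
    for v in range(m + 1, n):
        adj[v] = set()
        last = None
        while len(adj[v]) < m:
            if last is not None and rng.random() < p_triad:
                cand = [u for u in adj[last] if u != v and u not in adj[v]]
            else:
                cand = []
            t = rng.choice(cand) if cand else rng.choice(stubs)
            if t == v or t in adj[v]:
                continue  # resample on self-loops or duplicate edges
            adj[v].add(t)
            adj[t].add(v)
            stubs += [v, t]
            last = t
    return adj
```

Raising `p_triad` increases the clustering coefficient while leaving the degree-weighted attachment, and hence the power-law degree distribution, in place, which is the tunability the abstract describes.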

Book
01 Jan 2002
TL;DR: This chapter discusses Parameter Optimisation Algorithms, Density Modelling and Clustering, Single-Layer Networks, and Radial Basis Functions.
Abstract: Introduction.- Parameter Optimisation Algorithms.- Density Modelling and Clustering.- Single-Layer Networks.- The Multi-Layer Perceptron.- Radial Basis Functions.- Visualisation and Latent Variable Models.- Sampling.- Bayesian Techniques.- Gaussian Processes.- Linear Algebra and Matrices.- Algorithm Error Analysis.- Function Index.- Subject Index.

Journal ArticleDOI
TL;DR: This paper compares the performance of Ant-Miner with CN2, a well-known data mining algorithm for classification, in six public domain data sets and provides evidence that Ant-Miner is competitive with CN2 with respect to predictive accuracy and that the rule lists it discovers are considerably simpler than those discovered by CN2.
Abstract: The paper proposes an algorithm for data mining called Ant-Miner (ant-colony-based data miner). The goal of Ant-Miner is to extract classification rules from data. The algorithm is inspired by both research on the behavior of real ant colonies and some data mining concepts as well as principles. We compare the performance of Ant-Miner with CN2, a well-known data mining algorithm for classification, in six public domain data sets. The results provide evidence that: 1) Ant-Miner is competitive with CN2 with respect to predictive accuracy, and 2) the rule lists discovered by Ant-Miner are considerably simpler (smaller) than those discovered by CN2.

Journal ArticleDOI
TL;DR: In this paper, the performance of many available graphical and statistical methodologies used to classify water samples is compared, including: Collins bar diagram, pie diagram, Stiff pattern diagram, Schoeller plot, Piper diagram, Q-mode hierarchical cluster analysis, K-means clustering, principal components analysis, and fuzzy k-means clustering.
Abstract: A robust classification scheme for partitioning water chemistry samples into homogeneous groups is an important tool for the characterization of hydrologic systems. In this paper we test the performance of the many available graphical and statistical methodologies used to classify water samples including: Collins bar diagram, pie diagram, Stiff pattern diagram, Schoeller plot, Piper diagram, Q-mode hierarchical cluster analysis, K-means clustering, principal components analysis, and fuzzy k-means clustering. All the methods are discussed and compared as to their ability to cluster, ease of use, and ease of interpretation. In addition, several issues related to data preparation, database editing, data-gap filling, data screening, and data quality assurance are discussed and a database construction methodology is presented.

Journal ArticleDOI
TL;DR: It is shown that the eigenvectors of a kernel matrix which defines the implicit mapping provide a means to estimate the number of clusters inherent within the data, and a computationally simple iterative procedure is presented for the subsequent feature space partitioning of the data.
Abstract: The article presents a method for both the unsupervised partitioning of a sample of data and the estimation of the possible number of inherent clusters which generate the data. This work exploits the notion that performing a nonlinear data transformation into some high-dimensional feature space increases the probability of the linear separability of the patterns within the transformed space and therefore simplifies the associated data structure. It is shown that the eigenvectors of a kernel matrix which defines the implicit mapping provide a means to estimate the number of clusters inherent within the data, and a computationally simple iterative procedure is presented for the subsequent feature space partitioning of the data.

Journal ArticleDOI
TL;DR: The approach assigns genes to context-dependent and potentially overlapping 'transcription modules', thus overcoming the main limitations of traditional clustering methods, and uses the method to elucidate regulatory properties of cellular pathways and to characterize cis-regulatory elements.
Abstract: Standard clustering methods can classify genes successfully when applied to relatively small data sets, but have limited use in the analysis of large-scale expression data, mainly owing to their assignment of a gene to a single cluster. Here we propose an alternative method for the global analysis of genome-wide expression data. Our approach assigns genes to context-dependent and potentially overlapping ‘transcription modules’, thus overcoming the main limitations of traditional clustering methods. We use our method to elucidate regulatory properties of cellular pathways and to characterize cis-regulatory elements. By applying our algorithm systematically to all of the available expression data on Saccharomyces cerevisiae, we identify a comprehensive set of overlapping transcriptional modules. Our results provide functional predictions for numerous genes, identify relations between modules and present a global view on the transcriptional network.

Journal ArticleDOI
TL;DR: A new prediction-based resampling method, Clest, is developed, to estimate the number of clusters in a dataset, and was generally found to be more accurate and robust than the six existing methods considered in the study.
Abstract: Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems, such as the classification of tumors. An important statistical problem associated with tumor classification is the identification of new tumor classes using gene-expression profiles. Two essential aspects of this clustering problem are: to estimate the number of clusters, if any, in a dataset; and to allocate tumor samples to these clusters, and assess the confidence of cluster assignments for individual samples. Here we address the first of these problems. We have developed a new prediction-based resampling method, Clest, to estimate the number of clusters in a dataset. The performance of the new and existing methods was compared using simulated data and gene-expression data from four recently published cancer microarray studies. Clest was generally found to be more accurate and robust than the six existing methods considered in the study. Focusing on prediction accuracy in conjunction with resampling produces accurate and robust estimates of the number of clusters.

01 Jan 2002
TL;DR: The authors compare these two approaches using data simulated from a setting where true group membership is known; the results indicate that LC substantially outperforms the K-means technique.
Abstract: Recent developments in latent class (LC) analysis and associated software to include continuous variables offer a model-based alternative to more traditional clustering approaches such as K-means. In this paper, the authors compare these two approaches using data simulated from a setting where true group membership is known. The authors choose a setting favourable to K-means by simulating data according to the assumptions made in both discriminant analysis (DISC) and K-means clustering. Since the information on true group membership is used in DISC but not in clustering approaches in general, the authors use the results obtained from DISC as a gold standard in determining an upper bound on the best possible outcome that might be expected from a clustering technique. The results indicate that LC substantially outperforms the K-means technique. A truly surprising result is that the LC performance is so good that it is virtually indistinguishable from the performance of DISC.

Journal ArticleDOI
TL;DR: It is found that the connectivity structure of the Internet presents statistical distributions settled in a well-defined stationary state and the large-scale properties are characterized by a scale-free topology consistent with previous observations.
Abstract: We study the large-scale topological and dynamical properties of real Internet maps at the autonomous system level, collected in a 3-yr time interval. We find that the connectivity structure of the Internet presents statistical distributions settled in a well-defined stationary state. The large-scale properties are characterized by a scale-free topology consistent with previous observations. Correlation functions and clustering coefficients exhibit a remarkable structure due to the underlying hierarchical organization of the Internet. The study of the Internet time evolution shows a growth dynamics with aging features typical of recently proposed growing network models. We compare the properties of growing network models with the present real Internet data analysis.

Journal ArticleDOI
TL;DR: The use of cluster analysis as an exploratory data analysis tool requires a powerful program system to test different data preparation, processing and clustering methods, including the ability to present the results in a number of easy-to-grasp graphics.

01 Jan 2002
TL;DR: This paper undertakes the first extensive review and empirical comparison of all proposed techniques for mining time series data and introduces a novel algorithm that is empirically shown to be superior to all others in the literature.
Abstract: In recent years, there has been an explosion of interest in mining time series databases. As with most computer science problems, representation of the data is the key to efficient and effective solutions. One of the most commonly used representations is piecewise linear approximation. This representation has been used by various researchers to support clustering, classification, indexing and association rule mining of time series data. A variety of algorithms have been proposed to obtain this representation, with several algorithms having been independently rediscovered several times. In this paper, we undertake the first extensive review and empirical comparison of all proposed techniques. We show that all these algorithms have fatal flaws from a data mining perspective. We introduce a novel algorithm that we empirically show to be superior to all others in the literature.
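Of the basic segmentation approaches such a review covers, the sliding-window scheme is the easiest to sketch: grow a segment until the straight line between its endpoints misses some point by more than a tolerance, then start a new segment. The following is an illustrative sketch under my own naming, not the paper's algorithm:

```python
def sliding_window_pla(series, max_error):
    """Sliding-window piecewise linear approximation of a 1-D series.
    Segments are returned as lists of values; consecutive segments share
    their boundary point's successor as the new start."""
    def seg_error(seg):
        # max vertical deviation from the line through the segment endpoints
        n = len(seg) - 1
        if n == 0:
            return 0.0
        y0, y1 = seg[0], seg[-1]
        return max(abs(y - (y0 + (y1 - y0) * i / n)) for i, y in enumerate(seg))
    segments, start, end = [], 0, 2
    while end <= len(series):
        if seg_error(series[start:end]) > max_error:
            segments.append(series[start:end - 1])  # close the segment
            start, end = end - 1, end + 1           # restart at breaking point
        else:
            end += 1
    segments.append(series[start:])
    return segments
```

Sliding-window is greedy and online; the bottom-up and top-down families the review compares trade that streaming property for globally better segmentations.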

Proceedings ArticleDOI
04 Nov 2002
TL;DR: It is suggested that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance.
Abstract: Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections. In this paper we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance. We present a new class of clustering algorithms called constrained agglomerative algorithms that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.

Proceedings ArticleDOI
12 May 2002
TL;DR: The main features of the adaptation of an immune network model include: automatic determination of the population size, combination of local with global search, defined convergence criterion, and capability of locating and maintaining stable local optima solutions.
Abstract: This paper presents the adaptation of an immune network model, originally proposed to perform information compression and data clustering, to solve multimodal function optimization problems. The algorithm is described theoretically and empirically compared with similar approaches from the literature. The main features of the algorithm include: automatic determination of the population size, combination of local with global search (exploitation plus exploration of the fitness landscape), defined convergence criterion, and capability of locating and maintaining stable local optima solutions.

Proceedings ArticleDOI
05 Jun 2002
TL;DR: This work considers the question of whether there exists a simple and practical approximation algorithm for k-means clustering, and presents a local improvement heuristic based on swapping centers in and out that yields a (9+ε)-approximation algorithm.
Abstract: In k-means clustering we are given a set of n data points in d-dimensional space ℜ^d and an integer k, and the problem is to determine a set of k points in ℜ^d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the extremely high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9+ε)-approximation algorithm. We show that the approximation factor is almost tight, by giving an example for which the algorithm achieves an approximation factor of (9-ε). To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice.
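The core move of the heuristic, swapping one center out and one candidate in whenever that lowers the cost, can be sketched as follows. This is a simplified single-swap local search restricted to data points as candidate centers, written for clarity rather than efficiency; it is not the paper's analyzed algorithm, and the seeding and tie-breaking choices are our own:

```python
import numpy as np

def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).sum())

def single_swap_heuristic(points, k, seed=0):
    """Greedy 1-swap local search over data points as candidate centers.

    Repeatedly replaces one current center with one data point whenever
    the swap strictly lowers the k-means cost; stops at a local optimum.
    """
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:
                trial = centers.copy()
                trial[i] = p
                # strict-improvement threshold avoids cycling on ties
                if kmeans_cost(points, trial) < kmeans_cost(points, centers) - 1e-12:
                    centers = trial
                    improved = True
    return centers
```

In the paper the swap step is combined with Lloyd's iterations, which refine each center to its cluster's continuous centroid between swaps; only the pure swap search is shown here.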

Journal ArticleDOI
TL;DR: The DDMCMC paradigm provides a unifying framework in which the role of many existing segmentation algorithms are revealed as either realizing Markov chain dynamics or computing importance proposal probabilities and generalizes these segmentation methods in a principled way.
Abstract: This paper presents a computational paradigm called Data-Driven Markov Chain Monte Carlo (DDMCMC) for image segmentation in the Bayesian statistical framework. The paper contributes to image segmentation in four aspects. First, it designs efficient and well-balanced Markov Chain dynamics to explore the complex solution space and, thus, achieves a nearly global optimal solution independent of initial segmentations. Second, it presents a mathematical principle and a K-adventurers algorithm for computing multiple distinct solutions from the Markov chain sequence and, thus, it incorporates intrinsic ambiguities in image segmentation. Third, it utilizes data-driven (bottom-up) techniques, such as clustering and edge detection, to compute importance proposal probabilities, which drive the Markov chain dynamics and achieve tremendous speedup in comparison to the traditional jump-diffusion methods. Fourth, the DDMCMC paradigm provides a unifying framework in which the roles of many existing segmentation algorithms, such as edge detection, clustering, region growing, split-merge, snake/balloon, and region competition, are revealed as either realizing Markov chain dynamics or computing importance proposal probabilities. Thus, the DDMCMC paradigm combines and generalizes these segmentation methods in a principled way. The DDMCMC paradigm adopts seven parametric and nonparametric image models for intensity and color at various regions. We test the DDMCMC paradigm extensively on both color and gray-level images and some results are reported in this paper.

Proceedings ArticleDOI
07 Aug 2002
TL;DR: This work describes a streaming algorithm that effectively clusters large data streams and provides empirical evidence of the algorithm's performance on synthetic and real data streams.
Abstract: Streaming data analysis has recently attracted attention in numerous applications including telephone records, Web documents and click streams. For such analysis, single-pass algorithms that consume a small amount of memory are critical. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm's performance on synthetic and real data streams.
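To illustrate the single-pass, small-memory constraint the abstract emphasizes: a common baseline for streaming clustering is sequential (online) k-means, which keeps only k centroids and their counts, so memory stays O(k) regardless of stream length. The sketch below is this generic baseline, not the paper's algorithm; the class name and first-k-points seeding are our own choices:

```python
import numpy as np

class OnlineKMeans:
    """Single-pass (sequential) k-means: each arriving point only
    nudges its nearest centroid, so memory stays O(k)."""

    def __init__(self, k):
        self.k = k
        self.centers = None   # k x dim array once seeded
        self.counts = None    # points assigned to each centroid so far
        self._seed = []

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if self.centers is None:
            # seed the centroids with the first k points of the stream
            self._seed.append(x)
            if len(self._seed) == self.k:
                self.centers = np.stack(self._seed)
                self.counts = np.ones(self.k, dtype=int)
            return
        i = int(((self.centers - x) ** 2).sum(axis=1).argmin())
        self.counts[i] += 1
        # running-mean update pulls the winning centroid toward x
        self.centers[i] += (x - self.centers[i]) / self.counts[i]
```

Each point is touched exactly once and then discarded, which is the defining property of the streaming setting; more sophisticated algorithms of the kind the paper describes add provable cost guarantees on top of this one-pass discipline.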