
Showing papers on "Cluster analysis published in 2004"


Journal ArticleDOI
TL;DR: It is proved that, with appropriate bounds on node density and intracluster and intercluster transmission ranges, HEED can asymptotically almost surely guarantee connectivity of clustered networks.
Abstract: Topology control in a sensor network balances load on sensor nodes and increases network scalability and lifetime. Clustering sensor nodes is an effective topology control approach. We propose a novel distributed clustering approach for long-lived ad hoc sensor networks. Our proposed approach does not make any assumptions about the presence of infrastructure or about node capabilities, other than the availability of multiple power levels in sensor nodes. We present a protocol, HEED (Hybrid Energy-Efficient Distributed clustering), that periodically selects cluster heads according to a hybrid of the node residual energy and a secondary parameter, such as node proximity to its neighbors or node degree. HEED terminates in O(1) iterations, incurs low message overhead, and achieves fairly uniform cluster head distribution across the network. We prove that, with appropriate bounds on node density and intracluster and intercluster transmission ranges, HEED can asymptotically almost surely guarantee connectivity of clustered networks. Simulation results demonstrate that our proposed approach is effective in prolonging the network lifetime and supporting scalable data aggregation.

4,889 citations


Journal ArticleDOI
TL;DR: 40 selected thresholding methods from various categories are compared in the context of nondestructive testing applications as well as for document images, and the thresholding algorithms that perform uniformly better over nondestructive testing and document image applications are identified.
Abstract: We conduct an exhaustive survey of image thresholding methods, categorize them, express their formulas under a uniform notation, and finally carry out a performance comparison. The thresholding methods are categorized according to the information they exploit, such as histogram shape, measurement space clustering, entropy, object attributes, spatial correlation, and local gray-level surface. 40 selected thresholding methods from various categories are compared in the context of nondestructive testing applications as well as for document images. The comparison is based on combined performance measures. We identify the thresholding algorithms that perform uniformly better over nondestructive testing and document image applications. © 2004 SPIE and IS&T. (DOI: 10.1117/1.1631316)

4,543 citations
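
The survey's "measurement space clustering" category is easiest to see concretely. As an illustration not taken from the paper itself, here is a minimal numpy sketch of one representative method from that category, Otsu's between-class-variance threshold, assuming an 8-bit grayscale image:

```python
import numpy as np

def otsu_threshold(image):
    """Threshold maximizing between-class variance (Otsu's method), a
    representative of the survey's 'measurement space clustering' category:
    the gray-level histogram is split into two clusters."""
    hist = np.bincount(image.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                 # gray-level probabilities
    omega = np.cumsum(p)                  # class-0 mass up to each level
    mu = np.cumsum(p * np.arange(256))    # first moment up to each level
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.nanargmax(np.where(np.isfinite(sigma_b), sigma_b, np.nan)))

# Synthetic bimodal "image": two gray-level populations.
rng = np.random.default_rng(0)
img = np.concatenate([rng.normal(60, 20, 2000), rng.normal(180, 20, 2000)])
img = np.clip(img, 0, 255).astype(np.uint8).reshape(40, 100)
print(otsu_threshold(img))                # threshold lands between the two modes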


Journal ArticleDOI
TL;DR: This work has implemented k-means clustering, hierarchical clustering and self-organizing maps in a single multipurpose open-source library of C routines, callable from other C and C++ programs.
Abstract: Summary: We have implemented k-means clustering, hierarchical clustering and self-organizing maps in a single multipurpose open-source library of C routines, callable from other C and C++ programs. Using this library, we have created an improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix. In addition, we generated a Python and a Perl interface to the C Clustering Library, thereby combining the flexibility of a scripting language with the speed of C. Availability: The C Clustering Library and the corresponding Python C extension module Pycluster were released under the Python License, while the Perl module Algorithm::Cluster was released under the Artistic License. The GUI code Cluster 3.0 for Windows, Macintosh and Linux/Unix, as well as the corresponding command-line program, were released under the same license as the original Cluster code. The complete source code is available at http://bonsai.ims.u-tokyo.ac.jp/mdehoon/software/cluster. Alternatively, Algorithm::Cluster can be downloaded from CPAN, while Pycluster is also available as part of the Biopython distribution.

2,815 citations
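
A brief usage sketch of the Python interface described above. This assumes Pycluster is installed; the argument names follow the library's documented k-means and hierarchical routines, but treat the exact signatures as illustrative rather than authoritative:

```python
import numpy as np
from Pycluster import kcluster, treecluster  # Python wrapper of the C Clustering Library

data = np.random.rand(50, 4)   # 50 items (e.g. genes) x 4 features (e.g. conditions)

# k-means: 3 clusters, 10 random restarts, Euclidean distance ('e').
clusterid, error, nfound = kcluster(data, nclusters=3, npass=10, dist='e')
print(clusterid[:10], error, nfound)

# Hierarchical clustering with average linkage ('a'); cut the tree into 3 clusters.
tree = treecluster(data, method='a', dist='e')
print(tree.cut(3)[:10])
```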


Proceedings Article
01 Dec 2004
TL;DR: This work proposes that a 'local' scale should be used to compute the affinity between each pair of points and suggests exploiting the structure of the eigenvectors to infer automatically the number of groups.
Abstract: We study a number of open issues in spectral clustering: (i) Selecting the appropriate scale of analysis, (ii) Handling multi-scale data, (iii) Clustering with irregular background clutter, and, (iv) Finding automatically the number of groups. We first propose that a 'local' scale should be used to compute the affinity between each pair of points. This local scaling leads to better clustering especially when the data includes multiple scales and when the clusters are placed within a cluttered background. We further suggest exploiting the structure of the eigenvectors to infer automatically the number of groups. This leads to a new algorithm in which the final randomly initialized k-means stage is eliminated.

2,206 citations
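
The paper's local-scaling idea is compact enough to sketch. Below is a numpy version of the locally scaled affinity, with sigma_i taken as the distance to each point's K-th nearest neighbor (the paper suggests K = 7); the eigenvector-based estimation of the number of groups is omitted:

```python
import numpy as np

def local_scaling_affinity(X, K=7):
    """Affinity with per-point scales, as proposed above:
    A_ij = exp(-d_ij^2 / (sigma_i * sigma_j)), where sigma_i is the distance
    from x_i to its K-th nearest neighbor (K = 7 in the paper)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sigma = np.sort(D, axis=1)[:, K]   # index 0 is the point itself
    A = np.exp(-D**2 / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(A, 0.0)
    return A

# Two clusters at very different scales: local scaling handles both.
X = np.vstack([np.random.randn(30, 2), 10 + 0.1 * np.random.randn(30, 2)])
A = local_scaling_affinity(X)
print(A.shape)   # (60, 60); cross-cluster affinities are near zero
```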


Journal ArticleDOI
TL;DR: In this comprehensive survey, a large number of existing approaches to biclustering are analyzed, and they are classified in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications.
Abstract: A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix have been proposed. The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred to in the literature as coclustering and direct clustering, among other names, and has also been used in fields such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications.

2,123 citations
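
Many of the surveyed algorithms score candidate biclusters with the mean squared residue of Cheng and Church; here is a small numpy sketch of that score (the search procedure itself is not shown):

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Cheng-Church mean squared residue of the submatrix A[rows][:, cols]:
    the mean over (i, j) of (a_ij - a_iJ - a_Ij + a_IJ)^2, where a_iJ and
    a_Ij are row and column means and a_IJ is the overall mean. The score
    is 0 for a bicluster whose rows differ only by an additive shift."""
    S = A[np.ix_(rows, cols)]
    residue = S - S.mean(axis=1, keepdims=True) - S.mean(axis=0, keepdims=True) + S.mean()
    return (residue ** 2).mean()

A = np.random.rand(20, 10)
A[:5, :4] = np.arange(4) + np.arange(5)[:, None]          # planted additive pattern
print(mean_squared_residue(A, np.arange(5), np.arange(4)))  # ~0 for the planted bicluster
```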


Journal ArticleDOI
TL;DR: A new method for detecting and sorting spikes from multiunit recordings that combines the wavelet transform with superparamagnetic clustering, which allows automatic classification of the data without assumptions such as low variance or Gaussian distributions, is introduced.
Abstract: This study introduces a new method for detecting and sorting spikes from multiunit recordings. The method combines the wavelet transform, which localizes distinctive spike features, with superparamagnetic clustering, which allows automatic classification of the data without assumptions such as low variance or Gaussian distributions. Moreover, an improved method for setting amplitude thresholds for spike detection is proposed. We describe several criteria for implementation that render the algorithm unsupervised and fast. The algorithm is compared to other conventional methods using several simulated data sets whose characteristics closely resemble those of in vivo recordings. For these data sets, we found that the proposed algorithm outperformed conventional methods.

2,050 citations
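
The improved amplitude threshold mentioned in the abstract is based on a robust noise estimate, Thr = 4 * sigma_n with sigma_n = median(|x|)/0.6745. Below is a sketch of detection plus wavelet feature extraction, using PyWavelets as an assumed dependency; the superparamagnetic clustering stage is omitted:

```python
import numpy as np
import pywt  # PyWavelets, assumed available, for the feature-extraction step

def detect_spikes(x, window=64):
    """Threshold crossings using the robust noise estimate from the paper:
    sigma_n = median(|x|) / 0.6745, threshold = 4 * sigma_n."""
    thr = 4.0 * np.median(np.abs(x)) / 0.6745
    idx = np.flatnonzero((x[1:] > thr) & (x[:-1] <= thr))  # upward crossings
    half = window // 2
    return np.array([x[i - half:i + half] for i in idx
                     if half <= i < len(x) - half])

def wavelet_features(spikes, wavelet='haar', level=4):
    """Multilevel wavelet coefficients per spike; the paper then keeps the
    few coefficients whose distributions deviate most from normality."""
    return np.array([np.concatenate(pywt.wavedec(s, wavelet, level=level))
                     for s in spikes])

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 30000)
x[5000:5010] += 8.0                       # one synthetic spike
spikes = detect_spikes(x)
print(spikes.shape, wavelet_features(spikes).shape)
```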


Journal ArticleDOI
TL;DR: Nonnegative matrix factorization, an algorithm based on decomposition by parts that can reduce the dimension of expression data from thousands of genes to a handful of metagenes, is described; it is found to be less sensitive to a priori selection of genes or initial conditions and able to detect alternative or context-dependent patterns of gene expression in complex biological systems.
Abstract: We describe here the use of nonnegative matrix factorization (NMF), an algorithm based on decomposition by parts that can reduce the dimension of expression data from thousands of genes to a handful of metagenes. Coupled with a model selection mechanism, adapted to work for any stochastic clustering algorithm, NMF is an efficient method for identification of distinct molecular patterns and provides a powerful method for class discovery. We demonstrate the ability of NMF to recover meaningful biological information from cancer-related microarray data. NMF appears to have advantages over other methods such as hierarchical clustering or self-organizing maps. We found it less sensitive to a priori selection of genes or initial conditions and able to detect alternative or context-dependent patterns of gene expression in complex biological systems. This ability, similar to semantic polysemy in text, provides a general method for robust molecular pattern discovery.

1,818 citations
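
A minimal sketch of the factorization step only, using the standard Lee-Seung multiplicative updates for the Frobenius objective; the paper's model-selection and consensus machinery is not shown. Sample-to-metagene assignment follows the dominant coefficient:

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Multiplicative updates minimizing ||V - WH||_F^2 for nonnegative V.
    For expression data (genes x samples), columns of W are 'metagenes'
    and rows of H are metagene expression patterns across samples."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.randn(200, 30))    # stand-in for nonnegative expression data
W, H = nmf(V, rank=3)
clusters = H.argmax(axis=0)             # each sample assigned to its dominant metagene
print(np.bincount(clusters))
```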


Proceedings ArticleDOI
04 Jul 2004
TL;DR: It is proved that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering, which indicates that unsupervised dimension reduction is closely related to unsupervised learning.
Abstract: Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used method for unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering. A new lower bound for the K-means objective function is derived: the total variance minus the eigenvalues of the data covariance matrix. These results indicate that unsupervised dimension reduction is closely related to unsupervised learning. Several implications are discussed. On dimension reduction, the result provides new insights into the observed effectiveness of PCA-based data reductions, beyond the conventional noise-reduction explanation that PCA, via singular value decomposition, provides the best low-dimensional linear approximation of the data. On learning, the result suggests effective techniques for K-means data clustering. DNA gene expression and Internet newsgroups are analyzed to illustrate our results. Experiments indicate that the new bounds are within 0.5-1.5% of the optimal values.

1,431 citations
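
The stated bound is easy to check numerically. A sketch, under my reading of the result: with centered data, the optimal K-means objective is bounded below by the total variance minus the sum of the K-1 largest eigenvalues of the data scatter matrix. The tiny Lloyd's implementation here is illustrative, not the paper's method:

```python
import numpy as np

def kmeans_sse(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - C[None], axis=-1).argmin(axis=1)
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else C[j] for j in range(k)])
    return ((X - C[labels]) ** 2).sum()

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1, (100, 5)) for m in (0, 6, 12)])
Xc = X - X.mean(axis=0)

K = 3
total_ss = (Xc ** 2).sum()                        # total variance about the mean
eigvals = np.sort(np.linalg.eigvalsh(Xc.T @ Xc))[::-1]
bound = total_ss - eigvals[:K - 1].sum()          # PCA-derived lower bound

print(bound, "<=", kmeans_sse(X, K))              # bound holds for the achieved SSE
```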


Journal ArticleDOI
TL;DR: The contribution of this paper is a method that substantially reduces the computational requirements of grouping algorithms based on spectral partitioning, making it feasible to apply them to very large grouping problems.
Abstract: Spectral graph theoretic methods have recently shown great promise for the problem of image segmentation. However, due to the computational demands of these approaches, applications to large problems such as spatiotemporal data and high resolution imagery have been slow to appear. The contribution of this paper is a method that substantially reduces the computational requirements of grouping algorithms based on spectral partitioning, making it feasible to apply them to very large grouping problems. Our approach is based on a technique for the numerical solution of eigenfunction problems known as the Nyström method. This method allows one to extrapolate the complete grouping solution using only a small number of samples. In doing so, we leverage the fact that there are far fewer coherent groups in a scene than pixels.

1,420 citations
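
A one-shot sketch of the Nyström extension for a Gaussian affinity: eigenvectors computed on a small sampled block are extrapolated to the remaining points through the cross-affinity block. The paper's additional orthogonalization of the approximate eigenvectors is omitted here:

```python
import numpy as np

def nystrom_eigvecs(X, n_samples=50, n_vecs=4, sigma=1.0, seed=0):
    """Approximate the leading eigenvectors of the full n x n Gaussian
    affinity matrix from the sampled block A (m x m) and the cross block
    B (m x (n-m)) only: eigenvectors of A are extended to the remaining
    points via B^T U diag(1/lambda)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    idx = rng.permutation(n)
    s, rest = idx[:n_samples], idx[n_samples:]

    def affinity(P, Q):
        D2 = ((P[:, None] - Q[None]) ** 2).sum(-1)
        return np.exp(-D2 / (2 * sigma ** 2))

    A = affinity(X[s], X[s])                 # m x m sampled block
    B = affinity(X[s], X[rest])              # m x (n - m) cross block
    lam, U = np.linalg.eigh(A)
    lam, U = lam[::-1][:n_vecs], U[:, ::-1][:, :n_vecs]
    V = np.empty((n, n_vecs))
    V[s] = U
    V[rest] = B.T @ U / lam                  # Nystrom extension
    return V

X = np.vstack([np.random.randn(200, 2), 8 + np.random.randn(200, 2)])
print(nystrom_eigvecs(X).shape)   # (400, 4), at a fraction of the full eigensolve cost
```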


Journal ArticleDOI
TL;DR: A survey of the various subspace clustering algorithms along with a hierarchy organizing the algorithms by their defining characteristics is presented, comparing the two main approaches using empirical scalability and accuracy tests and discussing some potential applications where subspace clustering could be particularly useful.
Abstract: Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Often in high dimensional data, many dimensions are irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms localize the search for relevant dimensions allowing them to find clusters that exist in multiple, possibly overlapping subspaces. There are two major branches of subspace clustering based on their search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches find dense regions in low dimensional spaces and combine them to form clusters. This paper presents a survey of the various subspace clustering algorithms along with a hierarchy organizing the algorithms by their defining characteristics. We then compare the two main approaches to subspace clustering using empirical scalability and accuracy tests and discuss some potential applications where subspace clustering could be particularly useful.

1,419 citations


Proceedings ArticleDOI
07 Mar 2004
TL;DR: A protocol is presented, HEED (hybrid energy-efficient distributed clustering), that periodically selects cluster heads according to a hybrid of their residual energy and a secondary parameter, such as node proximity to its neighbors or node degree, which outperforms weight-based clustering protocols in terms of several cluster characteristics.
Abstract: Prolonged network lifetime, scalability, and load balancing are important requirements for many ad-hoc sensor network applications. Clustering sensor nodes is an effective technique for achieving these goals. In this work, we propose a new energy-efficient approach for clustering nodes in ad-hoc sensor networks. Based on this approach, we present a protocol, HEED (hybrid energy-efficient distributed clustering), that periodically selects cluster heads according to a hybrid of their residual energy and a secondary parameter, such as node proximity to its neighbors or node degree. HEED does not make any assumptions about the distribution or density of nodes, or about node capabilities, e.g., location-awareness. The clustering process terminates in O(1) iterations, and does not depend on the network topology or size. The protocol incurs low overhead in terms of processing cycles and messages exchanged. It also achieves fairly uniform cluster head distribution across the network. A careful selection of the secondary clustering parameter can balance load among cluster heads. Our simulation results demonstrate that HEED outperforms weight-based clustering protocols in terms of several cluster characteristics. We also apply our approach to a simple application to demonstrate its effectiveness in prolonging the network lifetime and supporting data aggregation.
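
The O(1) termination claim follows from HEED's probability-doubling schedule, sketched below from a single node's point of view. The constants C_prob = 0.05 and p_min = 1e-4 are illustrative assumptions, and the tentative-head announcements and lowest-cost head selection are omitted:

```python
def heed_iterations(e_residual, e_max, c_prob=0.05, p_min=1e-4):
    """Iterations a HEED node runs before deciding: CH_prob starts at
    C_prob * E_residual / E_max (floored at p_min) and doubles every
    iteration until it reaches 1, so the count is O(log(1/p_min)),
    i.e. constant in the network size. The messaging between nodes
    (tentative-head announcements, joining the lowest-cost head) is
    omitted from this single-node view."""
    ch_prob = max(c_prob * e_residual / e_max, p_min)
    iterations = 1
    while ch_prob < 1.0:
        ch_prob = min(2.0 * ch_prob, 1.0)
        iterations += 1
    return iterations

for e in (1.0, 0.5, 0.01):   # full, half, and nearly depleted battery
    print(e, heed_iterations(e, e_max=1.0))
```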

Journal ArticleDOI
TL;DR: This paper divides cluster analysis for gene expression data into three categories, presents specific challenges pertinent to each clustering category and introduces several representative approaches, and suggests the promising trends in this field.
Abstract: DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increases the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and also new algorithms have recently been proposed specifically aiming at gene expression data. These clustering algorithms have been proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest the promising trends in this field.

Proceedings ArticleDOI
22 Aug 2004
TL;DR: The generality of the weighted kernel k-means objective function is shown, and the spectral clustering objective of normalized cut is derived as a special case, leading to a novel weighted kernel k-means algorithm that monotonically decreases the normalized cut.
Abstract: Kernel k-means and spectral clustering have both been used to identify clusters that are non-linearly separable in input space. Despite significant research, these methods have remained only loosely related. In this paper, we give an explicit theoretical connection between them. We show the generality of the weighted kernel k-means objective function, and derive the spectral clustering objective of normalized cut as a special case. Given a positive definite similarity matrix, our results lead to a novel weighted kernel k-means algorithm that monotonically decreases the normalized cut. This has important implications: a) eigenvector-based algorithms, which can be computationally prohibitive, are not essential for minimizing normalized cuts, and b) various techniques, such as local search and acceleration schemes, may be used to improve the quality as well as the speed of kernel k-means. Finally, we present results on several interesting data sets, including diametrical clustering of large gene-expression matrices and a handwriting recognition data set.
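
A sketch of the unweighted special case operating directly on a precomputed kernel matrix; the paper's algorithm is the weighted version, with weights chosen so that the objective matches the normalized cut:

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Unweighted kernel k-means on a precomputed kernel matrix K.
    Distance of point i to the (implicit) mean of cluster c:
      K_ii - 2/|c| * sum_{j in c} K_ij + 1/|c|^2 * sum_{j,l in c} K_jl."""
    rng = np.random.default_rng(seed)
    n = len(K)
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            mask = labels == c
            nc = max(mask.sum(), 1)
            dist[:, c] = (np.diag(K) - 2 * K[:, mask].sum(1) / nc
                          + K[np.ix_(mask, mask)].sum() / nc**2)
        new = dist.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

X = np.vstack([np.random.randn(60, 2), [4, 4] + np.random.randn(60, 2)])
D2 = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-D2 / 2.0)                 # Gaussian kernel (positive definite)
print(np.bincount(kernel_kmeans(K, 2)))
```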

Journal ArticleDOI
TL;DR: In experiments with magnetoencephalographic and functional magnetic resonance imaging data, the method was able to show that expected components are reliable; furthermore, it pointed out components whose interpretation was not obvious but whose reliability should incite the experimenter to investigate the underlying technical or physical phenomena.

Journal ArticleDOI
01 Aug 2004
TL;DR: Two variants of fuzzy c-means clustering with spatial constraints, using kernel methods, are proposed, inducing a class of robust non-Euclidean distance measures for the original data space to derive new objective functions and thus clustering the non-Euclidean structures in data.
Abstract: Fuzzy c-means clustering (FCM) with spatial constraints (FCM_S) is an effective algorithm suitable for image segmentation. Its effectiveness contributes not only to the introduction of fuzziness for the belongingness of each pixel but also to the exploitation of spatial contextual information. Although the contextual information can raise its insensitivity to noise to some extent, FCM_S still lacks enough robustness to noise and outliers and is not suitable for revealing the non-Euclidean structure of the input data due to its use of Euclidean distance (L2 norm). In this paper, to overcome the above problems, we first propose two variants, FCM_S1 and FCM_S2, of FCM_S to simplify its computation and then extend them, including FCM_S, to corresponding robust kernelized versions KFCM_S, KFCM_S1 and KFCM_S2 by the kernel methods. Our main motives for using the kernel methods are: inducing a class of robust non-Euclidean distance measures for the original data space to derive new objective functions and thus clustering the non-Euclidean structures in data; and enhancing the robustness of the original clustering algorithms to noise and outliers while still retaining computational simplicity. The experiments on artificial and real-world datasets show that our proposed algorithms, especially with spatial constraints, are more effective.
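
A sketch of the kernelized FCM core with a Gaussian kernel and no spatial term, whose update rules follow from the objective sum_ik u_ik^m (1 - K(x_k, v_i)). The KFCM_S variants add a neighborhood penalty on top of this; that term is omitted here, so treat this as a simplified illustration:

```python
import numpy as np

def kfcm(X, c, m=2.0, sigma=1.0, n_iter=50, eps=1e-9, seed=0):
    """Kernelized fuzzy c-means (Gaussian kernel, no spatial term).
    Updates derived from the objective sum_ik u_ik^m * (1 - K(x_k, v_i)):
      u_ik  proportional to (1 - K(x_k, v_i))^(-1/(m-1))
      v_i = sum_k u_ik^m K(x_k, v_i) x_k / sum_k u_ik^m K(x_k, v_i)."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), c, replace=False)].astype(float)
    for _ in range(n_iter):
        D2 = ((X[:, None] - V[None]) ** 2).sum(-1)       # n x c squared distances
        Kxv = np.exp(-D2 / (2 * sigma**2))               # kernel evaluations
        d = np.maximum(1.0 - Kxv, eps) ** (-1.0 / (m - 1))
        U = d / d.sum(axis=1, keepdims=True)             # fuzzy memberships
        W = (U ** m) * Kxv                               # n x c prototype weights
        V = (W.T @ X) / (W.sum(axis=0)[:, None] + eps)
    return U, V

X = np.vstack([np.random.randn(80, 2), [5, 5] + np.random.randn(80, 2)])
U, V = kfcm(X, c=2)
print(np.bincount(U.argmax(axis=1)), V.round(2))
```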

Journal ArticleDOI
25 Jun 2004
TL;DR: This formulation is motivated from a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of “agnostic learning” problem.
Abstract: We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, v) is labeled either + or − depending on whether u and v have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of − edges between clusters (equivalently, minimizes the number of disagreements: the number of − edges inside clusters plus the number of + edges between clusters). This formulation is motivated by a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of "agnostic learning" problem. An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and maximizing agreements. For minimizing disagreements, we give a constant factor approximation. For maximizing agreements we give a PTAS, building on ideas of Goldreich, Goldwasser, and Ron (1998) and de la Vega (1996). We also show how to extend some of these results to graphs with edge labels in [−1, +1], and give some results for the case of random noise.
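
To make the objective concrete, here is a simple randomized pivot heuristic together with the disagreement count it tries to keep small. This is an illustration only, not the paper's constant-factor approximation or PTAS:

```python
import numpy as np

def pivot_clustering(S, seed=0):
    """Randomized pivot heuristic for correlation clustering: repeatedly
    pick a random unclustered vertex and cluster it with all of its
    unclustered '+' neighbors. S[u, v] is +1 (similar) or -1 (different)."""
    rng = np.random.default_rng(seed)
    labels = -np.ones(len(S), dtype=int)
    c = 0
    for u in rng.permutation(len(S)):
        if labels[u] == -1:
            members = (labels == -1) & (S[u] > 0)
            members[u] = True
            labels[members] = c
            c += 1
    return labels

def disagreements(S, labels):
    """Number of '-' edges inside clusters plus '+' edges between clusters."""
    same = labels[:, None] == labels[None, :]
    bad = ((S > 0) & ~same) | ((S < 0) & same)
    np.fill_diagonal(bad, False)
    return bad.sum() // 2

# Two planted clusters with ~5% of the edge labels flipped.
truth = np.repeat([0, 1], 30)
S = np.where(truth[:, None] == truth[None, :], 1, -1)
flip = np.random.default_rng(1).random(S.shape) < 0.05
S = np.where(flip | flip.T, -S, S)
print(disagreements(S, pivot_clustering(S)))
```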

Journal ArticleDOI
TL;DR: This paper explores the feature selection problem and issues through FSSEM (Feature Subset Selection using Expectation-Maximization (EM) clustering) and through two different performance criteria for evaluating candidate feature subsets: scatter separability and maximum likelihood.
Abstract: In this paper, we identify two issues involved in developing an automated feature subset selection algorithm for unlabeled data: the need for finding the number of clusters in conjunction with feature selection, and the need for normalizing the bias of feature selection criteria with respect to dimension. We explore the feature selection problem and these issues through FSSEM (Feature Subset Selection using Expectation-Maximization (EM) clustering) and through two different performance criteria for evaluating candidate feature subsets: scatter separability and maximum likelihood. We present proofs on the dimensionality biases of these feature criteria, and present a cross-projection normalization scheme that can be applied to any criterion to ameliorate these biases. Our experiments show the need for feature selection, the need for addressing these two issues, and the effectiveness of our proposed solutions.
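
One of the two criteria, scatter separability trace(S_w^{-1} S_b), is compact enough to sketch; FSSEM's wrapper search and the cross-projection normalization are not shown:

```python
import numpy as np

def scatter_separability(X, labels):
    """trace(S_w^{-1} S_b): within-cluster scatter S_w versus between-cluster
    scatter S_b; larger means better-separated clusters. Note the paper
    shows this criterion is biased with respect to dimension, which is what
    motivates its cross-projection normalization."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    return np.trace(np.linalg.solve(Sw, Sb))

X = np.vstack([np.random.randn(50, 3), [4, 4, 0] + np.random.randn(50, 3)])
y = np.repeat([0, 1], 50)
print(scatter_separability(X, y))         # all three features
print(scatter_separability(X[:, :2], y))  # subset without the uninformative feature
```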

Proceedings ArticleDOI
04 Jul 2004
TL;DR: Experimental results demonstrate that the unified approach produces better clusters than both individual approaches as well as previously proposed semi-supervised clustering algorithms.
Abstract: Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraint-based methods that guide the clustering algorithm towards a better grouping of the data, and 2) distance-function learning methods that adapt the underlying similarity metric used by the clustering algorithm. This paper provides new methods for the two approaches as well as presents a new semi-supervised clustering algorithm that integrates both of these techniques in a uniform, principled framework. Experimental results demonstrate that the unified approach produces better clusters than both individual approaches as well as previously proposed semi-supervised clustering algorithms.

Journal ArticleDOI
Eibe Frank, Mark Hall, Len Trigg, Geoffrey Holmes, Ian H. Witten
TL;DR: The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection, which are common data mining problems in bioinformatics research.
Abstract: Summary: The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection, which are common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods complemented by graphical user interfaces for data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. Availability: http://www.cs.waikato.ac.nz/ml/weka

Proceedings ArticleDOI
22 Aug 2004
TL;DR: A probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering and experimental results demonstrate the advantages of the proposed framework.
Abstract: Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
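
HMRF-KMeans minimizes a posterior-energy objective that penalizes constraint violations softly. As a simpler cousin for illustration, here is a COP-KMeans-style sketch that enforces must-link and cannot-link pairs as hard rules during assignment; it is not the paper's algorithm:

```python
import numpy as np

def violates(i, c, labels, must, cannot):
    """Would assigning point i to cluster c break a pairwise constraint?"""
    for a, b in must:
        if i in (a, b):
            j = b if i == a else a
            if labels[j] not in (-1, c):   # partner already placed elsewhere
                return True
    for a, b in cannot:
        if i in (a, b):
            j = b if i == a else a
            if labels[j] == c:             # partner already in this cluster
                return True
    return False

def cop_kmeans(X, k, must=(), cannot=(), n_iter=20, seed=0):
    """Hard-constrained k-means: each point takes the nearest center whose
    assignment does not violate a constraint."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        labels[:] = -1
        for i in range(len(X)):
            for c in np.argsort(np.linalg.norm(X[i] - C, axis=1)):
                if not violates(i, c, labels, must, cannot):
                    labels[i] = c
                    break
            else:
                raise ValueError(f"constraints unsatisfiable at point {i}")
        C = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                      else C[c] for c in range(k)])
    return labels

X = np.vstack([np.random.randn(30, 2), [3, 0] + np.random.randn(30, 2)])
print(np.bincount(cop_kmeans(X, 2, must=[(0, 1)], cannot=[(0, 59)])))
```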

Journal ArticleDOI
TL;DR: A natural bicriteria measure for assessing the quality of a clustering that avoids the drawbacks of existing measures is motivated and a simple recursive heuristic is shown to have poly-logarithmic worst-case guarantees under the new measure.
Abstract: We motivate and develop a natural bicriteria measure for assessing the quality of a clustering that avoids the drawbacks of existing measures. A simple recursive heuristic is shown to have poly-logarithmic worst-case guarantees under the new measure. The main result of the article is the analysis of a popular spectral algorithm. One variant of spectral clustering turns out to have effective worst-case guarantees; another finds a "good" clustering, if one exists.

Proceedings ArticleDOI
01 Nov 2004
TL;DR: It is found empirically that the multi-view versions of k-means and EM greatly improve on their single-view counterparts, and negative results for agglomerative hierarchical multi-view clustering are obtained.
Abstract: We consider clustering problems in which the available attributes can be split into two independent subsets, such that either subset suffices for learning. Example applications of this multi-view setting include clustering of Web pages which have an intrinsic view (the pages themselves) and an extrinsic view (e.g., anchor texts of inbound hyperlinks); multi-view learning has so far been studied in the context of classification. We develop and study partitioning and agglomerative, hierarchical multi-view clustering algorithms for text data. We find empirically that the multi-view versions of k-means and EM greatly improve on their single-view counterparts. By contrast, we obtain negative results for agglomerative hierarchical multi-view clustering. Our analysis explains this surprising phenomenon.
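
A compact sketch of the alternating idea behind multi-view k-means: assignments computed in one view re-estimate the centroids in the other view, with the views taking turns. The paper's EM variant and termination details are simplified away:

```python
import numpy as np

def multiview_kmeans(views, k, n_iter=20, seed=0):
    """Alternating multi-view k-means sketch: at step t, centroids are
    fitted in view t mod 2 from the current assignments, then points are
    reassigned in that same view; the labels carry over to the other view."""
    rng = np.random.default_rng(seed)
    n = len(views[0])
    labels = rng.integers(k, size=n)
    for t in range(n_iter):
        X = views[t % 2]                               # alternate the views
        C = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                      else X[rng.integers(n)] for c in range(k)])
        labels = np.linalg.norm(X[:, None] - C[None], axis=-1).argmin(axis=1)
    return labels

# Two conditionally independent views of the same two-cluster structure.
n = 100
truth = np.repeat([0, 1], n // 2)
view1 = truth[:, None] * 4 + np.random.randn(n, 3)
view2 = truth[:, None] * 4 + np.random.randn(n, 5)
print(np.bincount(multiview_kmeans([view1, view2], k=2)))
```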

Proceedings ArticleDOI
15 Nov 2004
TL;DR: An efficient algorithm, the L method, is proposed that finds the "knee" in a '# of clusters vs. clustering evaluation metric' graph; using the knee to determine the number of clusters is a well-known but not particularly well-understood approach.
Abstract: Many clustering and segmentation algorithms suffer from the limitation that the number of clusters/segments is specified by a human user. It is often impractical to expect a human with sufficient domain knowledge to be available to select the number of clusters/segments to return. We investigate techniques to determine the number of clusters or segments to return from hierarchical clustering and segmentation algorithms. We propose an efficient algorithm, the L method, that finds the "knee" in a '# of clusters vs. clustering evaluation metric' graph. Using the knee is a well-known, but not particularly well-understood, method to determine the number of clusters. We explore the feasibility of this method and attempt to determine in which situations it will and will not work. We also compare the L method to existing methods based on both the accuracy of the number of clusters determined and efficiency. Our results show favorable performance on these criteria compared to the existing methods evaluated.
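
A sketch of the L method itself: fit a line to each side of every candidate knee and pick the split that minimizes the size-weighted total root-mean-squared error:

```python
import numpy as np

def l_method(x, y):
    """For each candidate knee c, fit one line to the points left of c and
    one to the points right of c; return the x at the split minimizing the
    size-weighted total RMSE. x = number of clusters, y = evaluation metric."""
    def rmse(xs, ys):
        coef = np.polyfit(xs, ys, 1)
        return np.sqrt(np.mean((np.polyval(coef, xs) - ys) ** 2))

    n = len(x)
    best_c, best_err = None, np.inf
    for c in range(2, n - 2):                 # at least 2 points on each side
        err = (c / n) * rmse(x[:c], y[:c]) + ((n - c) / n) * rmse(x[c:], y[c:])
        if err < best_err:
            best_c, best_err = c, err
    return x[best_c]                          # suggested number of clusters

ks = np.arange(2, 40)
metric = np.where(ks < 8, 100.0 - 10 * ks, 22.0 - 0.2 * ks)  # sharp knee at k = 8
print(l_method(ks, metric))
```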

Journal ArticleDOI
TL;DR: A cluster validity index and its fuzzification are described, which can provide a measure of the goodness of clustering on different partitions of a data set, and results demonstrating the superiority of the PBM-index in appropriately determining the number of clusters are provided.

Proceedings ArticleDOI
Hua-Jun Zeng1, Qi-Cai He2, Zheng Chen1, Wei-Ying Ma1, Jinwen Ma2 
25 Jul 2004
TL;DR: This paper reformalizes the clustering problem as a salient phrase ranking problem, and first extracts and ranks salient phrases as candidate cluster names, based on a regression model learned from human labeled training data.
Abstract: Organizing Web search results into clusters facilitates users' quick browsing through search results. Traditional clustering techniques are inadequate since they don't generate clusters with highly readable names. In this paper, we reformalize the clustering problem as a salient phrase ranking problem. Given a query and the ranked list of documents (typically a list of titles and snippets) returned by a certain Web search engine, our method first extracts and ranks salient phrases as candidate cluster names, based on a regression model learned from human labeled training data. The documents are assigned to relevant salient phrases to form candidate clusters, and the final clusters are generated by merging these candidate clusters. Experimental results verify our method's feasibility and effectiveness.

Journal ArticleDOI
TL;DR: This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets, and shows that there are a set of criterion functions that consistently outperform the rest.
Abstract: This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves a total of seven different criterion functions, three of which are introduced in this paper and four that have been proposed in the past. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there are a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
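
For unit-length document vectors under cosine similarity, one criterion in this family (commonly denoted I2 in this line of work) reduces to the sum over clusters of the norm of each cluster's composite vector; a small sketch:

```python
import numpy as np

def i2_criterion(X, labels):
    """I2-style criterion: with unit-length document vectors and cosine
    similarity, the within-cluster similarity mass of cluster S_r equals
    ||D_r||^2 for the composite vector D_r = sum of the cluster's vectors,
    so the criterion is sum_r ||D_r||; partitional algorithms maximize it."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)     # unit-length documents
    return sum(np.linalg.norm(X[labels == c].sum(axis=0))
               for c in np.unique(labels))

rng = np.random.default_rng(3)
X = np.vstack([np.abs(rng.normal([5, 1], 1, (20, 2))),   # two topical groups
               np.abs(rng.normal([1, 5], 1, (20, 2)))])
good = np.repeat([0, 1], 20)
bad = rng.integers(2, size=40)
print(i2_criterion(X, good), ">", i2_criterion(X, bad))  # coherent beats random
```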

Journal ArticleDOI
TL;DR: FlexMix implements a general framework for fitting discrete mixtures of regression models in the R statistical computing environment and provides the E-step and all data handling, while the M-step can be supplied by the user to easily define new models.
Abstract: FlexMix implements a general framework for fitting discrete mixtures of regression models in the R statistical computing environment: three variants of the EM algorithm can be used for parameter estimation, regressors and responses may be multivariate with arbitrary dimension, data may be grouped, e.g., to account for multiple observations per individual, the usual formula interface of the S language is used for convenient model specification, and a modular concept of driver functions allows many different types of regression models to be interfaced. Existing drivers implement mixtures of standard linear models, generalized linear models and model-based clustering. FlexMix provides the E-step and all data handling, while the M-step can be supplied by the user to easily define new models.

Book
04 Aug 2004
TL;DR: This book presents methods for the analysis of microarray gene expression data, covering cleaning and normalization, cluster analysis of tissue samples and genes, discriminant analysis, supervised classification of tissue samples, and the linking of microarray data with survival analysis.
Abstract: Preface. 1. Microarrays in Gene Expression Studies. 2. Cleaning and Normalization. 3. Some Cluster Analysis Methods. 4. Clustering of Tissue Samples. 5. Screening and Clustering of Genes. 6. Discriminant Analysis. 7. Supervised Classification of Tissue Samples. 8. Linking Microarray Data with Survival Analysis. References. Author Index. Subject Index.

Journal ArticleDOI
TL;DR: This paper proposes the concept of feature saliency and introduces an expectation-maximization algorithm to estimate it, in the context of mixture-based clustering, and extends the criterion and algorithm to simultaneously estimate the feature saliencies and the number of clusters.
Abstract: Clustering is a common unsupervised learning technique used to discover group structure in a set of data. While there exist many algorithms for clustering, the important issue of feature selection, that is, what attributes of the data should be used by the clustering algorithms, is rarely touched upon. Feature selection for clustering is difficult because, unlike in supervised learning, there are no class labels for the data and, thus, no obvious criteria to guide the search. Another important problem in clustering is the determination of the number of clusters, which clearly impacts and is influenced by the feature selection issue. In this paper, we propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it, in the context of mixture-based clustering. Due to the introduction of a minimum message length model selection criterion, the saliency of irrelevant features is driven toward zero, which corresponds to performing feature selection. The criterion and algorithm are then extended to simultaneously estimate the feature saliencies and the number of clusters.

Journal ArticleDOI
TL;DR: The RNSC algorithm is developed to efficiently partition networks into clusters using a cost function and provides an accurate and scalable method of detecting and predicting protein complexes within a PPI network.
Abstract: Motivation: Understanding principles of cellular organization and function can be enhanced if we detect known and predict still undiscovered protein complexes within the cell's protein-protein interaction (PPI) network. Such predictions may be used as an inexpensive tool to direct biological experiments. The increasing amount of available PPI data necessitates an accurate and scalable approach to protein complex identification. Results: We have developed the Restricted Neighborhood Search Clustering Algorithm (RNSC) to efficiently partition networks into clusters using a cost function. We applied this cost-based clustering algorithm to PPI networks of Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans to identify and predict protein complexes. We have determined functional and graph-theoretic properties of true protein complexes from the MIPS database. Based on these properties, we defined filters to distinguish between identified network clusters and true protein complexes. Conclusions: Our application of the cost-based clustering algorithm provides an accurate and scalable method of detecting and predicting protein complexes within a PPI network. Availability: The RNSC algorithm and data processing code are available upon request from the authors. Supplementary Information: Supplementary data are available at http://www.cs.utoronto.ca/~juris/data/ppi04/