
Showing papers on "Cluster analysis" published in 2000


Proceedings ArticleDOI
04 Jan 2000
TL;DR: The Low-Energy Adaptive Clustering Hierarchy (LEACH) as mentioned in this paper is a clustering-based protocol that utilizes randomized rotation of local cluster base stations (cluster-heads) to evenly distribute the energy load among the sensors in the network.
Abstract: Wireless distributed microsensor systems will enable the reliable monitoring of a variety of environments for both civil and military applications. In this paper, we look at communication protocols, which can have significant impact on the overall energy dissipation of these networks. Based on our findings that the conventional protocols of direct transmission, minimum-transmission-energy, multi-hop routing, and static clustering may not be optimal for sensor networks, we propose LEACH (Low-Energy Adaptive Clustering Hierarchy), a clustering-based protocol that utilizes randomized rotation of local cluster base stations (cluster-heads) to evenly distribute the energy load among the sensors in the network. LEACH uses localized coordination to enable scalability and robustness for dynamic networks, and incorporates data fusion into the routing protocol to reduce the amount of information that must be transmitted to the base station. Simulations show that LEACH can achieve as much as a factor of 8 reduction in energy dissipation compared with conventional routing protocols. In addition, LEACH is able to distribute energy dissipation evenly throughout the sensors, doubling the useful system lifetime for the networks we simulated.
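
The rotation rule itself is compact. Below is a minimal Python sketch of the randomized cluster-head election, using the threshold from the paper, T(n) = P / (1 - P (r mod 1/P)) for nodes that have not served as heads in the last 1/P rounds; the function names and the bookkeeping structure are illustrative, not taken from the authors' code.

```python
import random

def leach_threshold(p, r, eligible):
    """Election threshold T(n) for round r.

    p: desired fraction of cluster-heads per round.
    eligible: True if the node has not served as a head in the
    last 1/p rounds (the set G in the paper); others get 0."""
    if not eligible:
        return 0.0
    return p / (1 - p * (r % int(round(1 / p))))

def elect_heads(node_ids, p, r, served_recently):
    """Each node draws a uniform random number and becomes a
    cluster-head if the draw falls below its threshold."""
    return [n for n in node_ids
            if random.random() < leach_threshold(p, r, n not in served_recently)]

# Example: 100 nodes, 5% desired heads, round 3, no recent heads yet.
heads = elect_heads(range(100), p=0.05, r=3, served_recently=set())
```

The threshold rises to 1 as the epoch of 1/P rounds ends, so every node serves as head once per epoch, which is what spreads the energy load evenly.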

12,497 citations


Journal ArticleDOI
TL;DR: The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.
Abstract: The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques and methods imported from statistical learning theory have been receiving increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. In spite of almost 50 years of research and development in this field, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.

6,527 citations


23 May 2000
TL;DR: This paper compares the two main approaches to document clustering, agglomerative hierarchical clustering and K-means, and indicates that the bisecting K-means technique is better than the standard K-means approach and as good as or better than the hierarchical approaches tested, for a variety of cluster evaluation metrics.
Abstract: This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a “standard” K-means algorithm and a variant of K-means, “bisecting” K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to “get the best of both worlds.” However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good as or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data.
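
As a concrete reference for the variant being compared, here is a minimal sketch of bisecting K-means: keep splitting a selected cluster with 2-means until K clusters remain. Splitting the largest cluster is one of the selection rules the paper considers; using scikit-learn's KMeans for the 2-way splits is our implementation choice, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    """Split the largest cluster with 2-means until k clusters exist."""
    rng = np.random.RandomState(seed)
    clusters = [np.arange(len(X))]        # start with all points in one cluster
    while len(clusters) < k:
        # pick the largest cluster and bisect it
        members = clusters.pop(max(range(len(clusters)),
                                   key=lambda i: len(clusters[i])))
        km = KMeans(n_clusters=2, n_init=n_trials,
                    random_state=rng.randint(2**31)).fit(X[members])
        clusters.append(members[km.labels_ == 0])
        clusters.append(members[km.labels_ == 1])
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels
```

Each bisection is linear in the points it touches, which is where the overall linear-time behavior noted in the abstract comes from.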

2,899 citations


Journal ArticleDOI
TL;DR: The two-stage procedure (first using SOM to produce the prototypes, which are then clustered in the second stage) is found to perform well when compared with direct clustering of the data and to reduce the computation time.
Abstract: The self-organizing map (SOM) is an excellent tool in the exploratory phase of data mining. It projects input space on prototypes of a low-dimensional regular grid that can be effectively utilized to visualize and explore properties of the data. When the number of SOM units is large, to facilitate quantitative analysis of the map and the data, similar units need to be grouped, i.e., clustered. In this paper, different approaches to clustering of the SOM are considered. In particular, the use of hierarchical agglomerative clustering and partitive clustering using K-means are investigated. The two-stage procedure (first using SOM to produce the prototypes, which are then clustered in the second stage) is found to perform well when compared with direct clustering of the data and to reduce the computation time.
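
A minimal sketch of that two-stage procedure, assuming the third-party minisom package for the SOM stage and scikit-learn for the second stage (grid size, iteration count, and the final number of clusters are illustrative choices, not values from the paper):

```python
import numpy as np
from minisom import MiniSom              # third-party SOM implementation
from sklearn.cluster import KMeans

X = np.random.rand(1000, 8)              # stand-in for the real data

# Stage 1: train a SOM; its codebook vectors serve as prototypes.
som = MiniSom(15, 15, X.shape[1], sigma=2.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, 5000)
prototypes = som.get_weights().reshape(-1, X.shape[1])   # 225 prototypes

# Stage 2: cluster the prototypes instead of the raw data.
proto_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(prototypes)

# Each data point inherits the cluster of its best-matching prototype.
bmu = np.argmin(((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1), axis=1)
labels = proto_labels[bmu]
```

The computational saving comes from stage 2 operating on a few hundred prototypes rather than on all data points.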

2,387 citations


Proceedings Article
19 Aug 2000
TL;DR: An efficient node-deletion algorithm is introduced to find submatrices in expression data that have low mean squared residue scores and it is shown to perform well in finding co-regulation patterns in yeast and human.
Abstract: An efficient node-deletion algorithm is introduced to find submatrices in expression data that have low mean squared residue scores and it is shown to perform well in finding co-regulation patterns in yeast and human. This introduces "biclustering", or simultaneous clustering of both genes and conditions, to knowledge discovery from expression data. This approach overcomes some problems associated with traditional clustering methods, by allowing automatic discovery of similarity based on a subset of attributes, simultaneous clustering of genes and conditions, and overlapped grouping that provides a better representation for genes with multiple functions or regulated by many factors.
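
The score being minimized is the Cheng-Church mean squared residue: for row set I and column set J, H(I, J) is the mean of (a_ij - a_iJ - a_Ij + a_IJ)^2 over the submatrix, where a_iJ, a_Ij and a_IJ are the row, column and submatrix means. A short sketch of the scoring step alone (the greedy node-deletion search that drives the score down is omitted):

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Mean squared residue of the bicluster A[rows][:, cols].
    Low scores indicate rows and columns that vary coherently."""
    sub = A[np.ix_(rows, cols)]
    row_mean = sub.mean(axis=1, keepdims=True)   # a_iJ
    col_mean = sub.mean(axis=0, keepdims=True)   # a_Ij
    all_mean = sub.mean()                        # a_IJ
    return ((sub - row_mean - col_mean + all_mean) ** 2).mean()
```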

2,213 citations


Journal ArticleDOI
TL;DR: A method for assessing mixture models in a cluster analysis setting with the integrated completed likelihood, which appears to be more robust than BIC to violation of some of the mixture model assumptions and can select a number of clusters leading to a sensible partitioning of the data.
Abstract: We propose a method for assessing mixture models in a cluster analysis setting with the integrated completed likelihood. For this purpose, the observed data are assigned to unknown clusters using a maximum a posteriori operator. Then, the integrated completed likelihood (ICL) is approximated using the Bayesian information criterion (BIC). Numerical experiments on simulated and real data show that the resulting ICL criterion performs well both for choosing a mixture model and a relevant number of clusters. In particular, ICL appears to be more robust than BIC to violation of some of the mixture model assumptions and it can select a number of clusters leading to a sensible partitioning of the data.
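
In one common reading of the criterion, ICL equals BIC plus an entropy penalty on the posterior cluster memberships, so solutions with ambiguous assignments are penalized. A hedged sketch using scikit-learn Gaussian mixtures; the sign and the factor of 2 below follow sklearn's lower-is-better BIC convention and are our adaptation, so treat them as assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def icl(gm, X, eps=1e-12):
    """BIC plus the entropy of the posterior assignments (lower is better)."""
    tau = gm.predict_proba(X)                      # posterior membership matrix
    entropy = -np.sum(tau * np.log(tau + eps))
    return gm.bic(X) + 2.0 * entropy

# Choose the number of components by minimizing ICL.
X = np.random.randn(500, 2)
scores = {k: icl(GaussianMixture(k, random_state=0).fit(X), X) for k in range(1, 7)}
best_k = min(scores, key=scores.get)
```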

1,418 citations


Journal ArticleDOI
TL;DR: This paper develops a robust hierarchical clustering algorithm ROCK that employs links and not distances when merging clusters, and indicates that ROCK not only generates better quality clusters than traditional algorithms, but it also exhibits good scalability properties.

1,383 citations


Journal ArticleDOI
TL;DR: The superiority of the GA-clustering algorithm over the commonly used K-means algorithm is extensively demonstrated for four artificial and three real-life data sets.

1,337 citations


Proceedings ArticleDOI
01 Aug 2000
TL;DR: This work presents a new technique for clustering large datasets, using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets the authors call canopies, and presents experimental results on grouping bibliographic citations from the reference sections of research papers.
Abstract: Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once, for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under reasonable assumptions about the cheap distance metric, this reduction in computational cost comes without any loss in clustering accuracy. Canopies can be applied to many domains and used with a variety of clustering approaches, including Greedy Agglomerative Clustering, K-means and Expectation-Maximization. We present experimental results on grouping bibliographic citations from the reference sections of research papers. Here the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude and decreases error in comparison to a previously used algorithm by 25%.
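
The canopy construction is a short greedy loop over two thresholds T1 > T2. The sketch below is illustrative: the thresholds and the cheap L1 metric are placeholders, whereas the paper pairs canopies with a domain-specific cheap distance (for citations, one computed from an inverted index).

```python
import numpy as np

def canopies(X, t1, t2, cheap_dist):
    """Greedy canopy construction (t1 > t2).

    cheap_dist(center, data) returns approximate distances from one
    point to all points. Points within t2 of a canopy center stop
    being candidate centers; points within t1 join the (overlapping)
    canopy."""
    assert t1 > t2
    candidates = set(range(len(X)))
    out = []
    while candidates:
        seed = candidates.pop()
        d = cheap_dist(X[seed], X)
        canopy = {i for i in range(len(X)) if d[i] < t1}
        out.append(canopy)
        candidates -= {i for i in canopy if d[i] < t2}
    return out

# Expensive exact distances are then computed only within each canopy.
X = np.random.rand(200, 50)
groups = canopies(X, t1=3.0, t2=1.5,
                  cheap_dist=lambda c, data: np.abs(data - c).sum(axis=1))
```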

Journal ArticleDOI
TL;DR: It is concluded that the combination of predictive modeling with systematic experimental verification will be required to gain a deeper insight into living organisms, therapeutic targeting and bioengineering.
Abstract: Advances in molecular biological, analytical, and computational technologies are enabling us to systematically investigate the complex molecular processes underlying biological systems. In particular, using high-throughput gene expression assays, we are able to measure the output of the gene regulatory network. We aim here to review data mining and modeling approaches for conceptualizing and unraveling the functional relationships implicit in these datasets. Clustering of co-expression profiles allows us to infer shared regulatory inputs and functional pathways. We discuss various aspects of clustering, ranging from distance measures to clustering algorithms and multiple-cluster memberships. More advanced analysis aims to infer causal connections between genes directly, i.e., who is regulating whom and how. We discuss several approaches to the problem of reverse engineering of genetic networks, from discrete Boolean networks, to continuous linear and non-linear models. We conclude that the combination of predictive modeling with systematic experimental verification will be required to gain a deeper insight into living organisms, therapeutic targeting, and bioengineering.

Journal ArticleDOI
TL;DR: The use of the ECM algorithm to fit this t mixture model is described and examples of its use are given in the context of clustering multivariate data in the presence of atypical observations in the form of background noise.
Abstract: Normal mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster sets of continuous multivariate data. However, for a set of data containing a group or groups of observations with longer than normal tails or atypical observations, the use of normal components may unduly affect the fit of the mixture model. In this paper, we consider a more robust approach by modelling the data by a mixture of t distributions. The use of the ECM algorithm to fit this t mixture model is described and examples of its use are given in the context of clustering multivariate data in the presence of atypical observations in the form of background noise.
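
The robustness comes from the E-step of the (E)CM fit: each observation receives a weight u = (ν + p) / (ν + δ²), where p is the dimension and δ² the Mahalanobis distance from the component mean, so atypical points are automatically downweighted. A sketch of that weighting step for a single component (the surrounding ECM loop is omitted):

```python
import numpy as np

def t_weights(X, mu, sigma, nu):
    """E-step weights for one t component: u_j = (nu + p) / (nu + d_j^2).
    Points far from mu (large Mahalanobis distance) get small weights,
    which is what makes the fit robust to outliers and background noise."""
    p = X.shape[1]
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(sigma), diff)
    return (nu + p) / (nu + d2)
```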

Proceedings ArticleDOI
12 Nov 2000
TL;DR: Two results regarding the quality of the clustering found by a popular spectral algorithm are presented, one proffers worst case guarantees whilst the other shows that if there exists a "good" clustering then the spectral algorithm will find one close to it.
Abstract: We propose a new measure for assessing the quality of a clustering. A simple heuristic is shown to give worst-case guarantees under the new measure. Then we present two results regarding the quality of the clustering found by a popular spectral algorithm. One proffers worst case guarantees whilst the other shows that if there exists a "good" clustering then the spectral algorithm will find one close to it.

Journal ArticleDOI
TL;DR: An algorithm, based on iterative clustering, that identifies subsets of the genes and samples such that when one of these is used to cluster the other, stable and significant partitions emerge.
Abstract: We present a coupled two-way clustering approach to gene microarray data analysis. The main idea is to identify subsets of the genes and samples, such that when one of these is used to cluster the other, stable and significant partitions emerge. The search for such subsets is a computationally complex task. We present an algorithm, based on iterative clustering, that performs such a search. This analysis is especially suitable for gene microarray data, where the contributions of a variety of biological mechanisms to the gene expression levels are entangled in a large body of experimental data. The method was applied to two gene microarray data sets, on colon cancer and leukemia. By identifying relevant subsets of the data and focusing on them we were able to discover partitions and correlations that were masked and hidden when the full dataset was used in the analysis. Some of these partitions have clear biological interpretation; others can serve to identify possible directions for future research.

Proceedings ArticleDOI
12 Nov 2000
TL;DR: This work gives constant-factor approximation algorithms for the k-median problem in the data stream model of computation in a single pass, and shows negative results implying that these algorithms cannot be improved in a certain sense.
Abstract: We study clustering under the data stream model of computation where: given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. The data stream model is relevant to new classes of applications involving massive data sets, such as Web click stream analysis and multimedia data analysis. We give constant-factor approximation algorithms for the k-median problem in the data stream model of computation in a single pass. We also show negative results implying that our algorithms cannot be improved in a certain sense.
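
The spirit of such single-pass algorithms can be sketched with the standard small-space layering: cluster each arriving chunk, keep only its weighted centers, and cluster the retained centers at the end. This is a rough illustration of the scheme, not the paper's algorithm or its approximation guarantees, and k-means stands in for the k-median subroutine:

```python
import numpy as np
from sklearn.cluster import KMeans

def stream_cluster(chunks, k):
    """One pass over the stream: per-chunk clustering, then a final
    clustering of the weighted per-chunk centers."""
    centers, weights = [], []
    for chunk in chunks:                     # each chunk is seen exactly once
        km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(chunk)
        centers.append(km.cluster_centers_)
        weights.append(np.bincount(km.labels_, minlength=k))
    final = KMeans(n_clusters=k, n_init=5, random_state=0)
    final.fit(np.vstack(centers), sample_weight=np.concatenate(weights))
    return final.cluster_centers_

stream = (np.random.rand(1000, 4) for _ in range(5))   # stands in for a real stream
centers = stream_cluster(stream, k=3)
```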

01 Jan 2000
TL;DR: Comparing four popular similarity measures in conjunction with several clustering techniques, cosine and extended Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean performs poorest.
Abstract: Clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different metrics. We observe that in domains such as YAHOO that provide a categorization by human experts, a useful criterion for comparisons across similarity metrics is indeed available. We then compare four popular similarity measures (Euclidean, cosine, Pearson correlation and extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hyper-graph partitioning, generalized k-means, weighted graph partitioning), on high dimensional sparse data representing web documents. Performance is measured against a human-imposed classification into news categories and industry categories. We conduct a number of experiments and use t-tests to assure statistical significance of results. Cosine and extended Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean performs poorest. Also, weighted graph partitioning approaches are clearly superior to all others.
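
The measures under comparison are easy to state side by side. A sketch (the conversion of Euclidean distance into a similarity is one common convention, not necessarily the paper's exact transformation):

```python
import numpy as np

def cosine(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def extended_jaccard(x, y):
    """Reduces to the ordinary Jaccard coefficient for binary vectors."""
    dot = x @ y
    return dot / (x @ x + y @ y - dot)

def euclidean_similarity(x, y):
    return 1.0 / (1.0 + np.linalg.norm(x - y))   # one way to map distance to (0, 1]

x, y = np.random.rand(1000), np.random.rand(1000)
print(cosine(x, y), pearson(x, y), extended_jaccard(x, y), euclidean_similarity(x, y))
```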

Book ChapterDOI
26 Jun 2000
TL;DR: A method to learn object class models from unlabeled and unsegmented cluttered scenes for the purpose of visual object recognition that achieves very good classification results on human faces and rear views of cars.
Abstract: We present a method to learn object class models from unlabeled and unsegmented cluttered scenes for the purpose of visual object recognition. We focus on a particular type of model where objects are represented as flexible constellations of rigid parts (features). The variability within a class is represented by a joint probability density function (pdf) on the shape of the constellation and the output of part detectors. In a first stage, the method automatically identifies distinctive parts in the training set by applying a clustering algorithm to patterns selected by an interest operator. It then learns the statistical shape model using expectation maximization. The method achieves very good classification results on human faces and rear views of cars.

Proceedings Article
29 Apr 2000
TL;DR: This paper proposes a method for linear text segmentation which is twice as accurate and over seven times as fast as the state-of-the-art (Reynar, 1998).
Abstract: This paper describes a method for linear text segmentation which is twice as accurate and over seven times as fast as the state-of-the-art (Reynar, 1998). Inter-sentence similarity is replaced by rank in the local context. Boundary locations are discovered by divisive clustering.

Proceedings Article
Sid Ray, Rose H. Turi
01 Jan 2000
TL;DR: This paper presents a simple validity measure based on the intra-cluster and inter-cluster distance measures which allows the number of clusters to be determined automatically, and tests it on synthetic images for which the number of clusters is known as well as on natural images.
Abstract: The main disadvantage of the k-means algorithm is that the number of clusters, K, must be supplied as a parameter. In this paper we present a simple validity measure based on the intra-cluster and inter-cluster distance measures which allows the number of clusters to be determined automatically. The basic procedure involves producing all the segmented images for 2 clusters up to Kmax clusters, where Kmax represents an upper limit on the number of clusters. Then our validity measure is calculated to determine which is the best clustering by finding the minimum value for our measure. The validity measure is tested on synthetic images for which the number of clusters is known, and is also implemented for natural images.
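
A sketch of the procedure as described: compute the measure (mean intra-cluster squared distance divided by the minimum squared distance between cluster centers, smaller is better) for each K from 2 to Kmax and keep the minimizer. The exact normalization in the paper may differ slightly, so treat this as the general form; scikit-learn's KMeans is our stand-in for the segmentation step.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def validity(X, labels, centers):
    """Intra-cluster compactness over inter-cluster separation."""
    intra = np.mean(((X - centers[labels]) ** 2).sum(axis=1))
    inter = min(((a - b) ** 2).sum() for a, b in combinations(centers, 2))
    return intra / inter

def best_k(X, k_max):
    scores = {}
    for k in range(2, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        scores[k] = validity(X, km.labels_, km.cluster_centers_)
    return min(scores, key=scores.get)     # smallest validity wins
```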

Book
03 Feb 2000
TL;DR: An edited volume on Symbolic Data Analysis and the SODAS project, with chapters by H.H. Bock, E. Diday and others covering symbolic objects, similarity and dissimilarity measures, symbolic factor analysis, discrimination, and clustering methods for symbolic data.
Abstract: E. Diday: Symbolic Data Analysis and the SODAS Project: Purpose, History, Perspective.- H.H. Bock: The Classical Data Situation.- H.H. Bock: Symbolic Data.- H.H. Bock, E. Diday: Symbolic Objects.- V. Stephan, G. Hebrail, Y. Lechevallier: Generation of Symbolic Objects from Relational Databases.- P. Bertrand, F. Goupil: Descriptive Statistics for Symbolic Data.- M. Noirhomme-Fraiture, M. Rouard: Visualizing and Editing Symbolic Objects.- Similarity and Dissimilarity: F. Esposito, D. Malerba, V. Tamma, H.H. Bock: Classical Resemblance Measures.- H.H. Bock: Dissimilarity Measures for Probability Distributions.- F. Esposito, D. Malerba, V. Tamma: Dissimilarity Measures for Symbolic Objects.- F. Esposito, D. Malerba, F. Lisi: Matching Symbolic Objects.- Symbolic Factor Analysis: H.H.Bock: Classical Principal Component Analysis.- A. Chouakria, P. Cazes, E. Diday: Symbolic Principal Component Analysis.- N.C. Lauro, F. Palumbo, R. Verde: Factorial Discriminant Analysis on Symbolic Objects.- Discrimination: Assigning Symbolic Objects to Classes: J. Rasson, S. Lissoir: Classical Methods of Discrimination.- J. Rasson, S. Lissoir: Symbolic Kernel Discriminant Analysis.- E. Perinel, Y. Lechevalier: Symbolic Discrimination Rules.- M. Bravo Llatas, J. Garcia-Santesmases: Segmentation Trees for Stratified Data.- Clustering Methods for Symbolic Objects: M. Chavent, H.H. Bock: Clustering Problem, Clustering Methods for Classical Data.- M. Chavent: Criterion-Based Divisive Clustering for Symbolic Data.- P. Brito: Hierarchical and Pyramidal Clustering with Complete Symbolic Objects.- G. Polaillon: Pyramidal Classification for Interval Data Using Galois Lattice Reduction.- M. Gettler-Summa, C. Pardoux: Symbolic Approaches for Three-way Data.-Illustrative Benchmark Analysis: R. Bisdorff: Introduction.- R. Bisdorff: Professional Careers of Retired Working Persons.- A. Iztueta, P. Calvo: Labour Force Survey.- F. Goupil, M. Touati, E. Diday, R. Moult: Census Data from the Office for National Statistics.- A. Morineau: The SODAS Software Package.

Journal ArticleDOI
TL;DR: This paper investigates and develops a methodology that serves to automatically identify a subset of aROIs (algorithmically detected ROIs) using different image processing algorithms (IPAs) and appropriate clustering procedures, and compares aROIs with hROIs (human identified ROIs) as a criterion for evaluating and selecting bottom-up, context-free algorithms.
Abstract: Many machine vision applications, such as compression, pictorial database querying, and image understanding, often need to analyze in detail only a representative subset of the image, which may be arranged into sequences of loci called regions-of-interest (ROIs). We have investigated and developed a methodology that serves to automatically identify such a subset of aROIs (algorithmically detected ROIs) using different image processing algorithms (IPAs), and appropriate clustering procedures. In human perception, an internal representation directs top-down, context-dependent sequences of eye movements to fixate on similar sequences of hROIs (human identified ROIs). In the paper, we introduce our methodology and we compare aROIs with hROIs as a criterion for evaluating and selecting bottom-up, context-free algorithms. An application is finally discussed.

Proceedings Article
30 Jul 2000
TL;DR: In this paper, two types of instance-level clustering constraints, must-link and cannot-link constraints, are proposed to aid the search of possible organizations of a data set.
Abstract: Clustering algorithms conduct a search through the space of possible organizations of a data set. In this paper, we propose two types of instance-level clustering constraints (must-link and cannot-link constraints) and show how they can be incorporated into a clustering algorithm to aid that search. For three of the four data sets tested, our results indicate that the incorporation of surprisingly few such constraints can increase clustering accuracy while decreasing runtime. We also investigate the relative effects of each type of constraint and find that the type that contributes most to accuracy improvements depends on the behavior of the clustering algorithm without constraints.
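
One simple way to honor such constraints during clustering is to make the assignment step constraint-aware: each point takes its nearest center that violates no must-link or cannot-link with already-assigned points. The sketch below is a k-means-style illustration of the idea only; the paper incorporates the constraints into a different base clustering algorithm.

```python
import numpy as np

def constrained_assign(X, centers, must, cannot):
    """Greedy constraint-respecting assignment. `must` and `cannot`
    are lists of index pairs; points that cannot be placed without a
    violation keep the label -1."""
    labels = -np.ones(len(X), dtype=int)
    for i in range(len(X)):
        for c in np.argsort(((centers - X[i]) ** 2).sum(axis=1)):
            ok = all(labels[b if a == i else a] in (-1, c)
                     for a, b in must if i in (a, b))
            ok = ok and all(labels[b if a == i else a] != c
                            for a, b in cannot if i in (a, b))
            if ok:
                labels[i] = c
                break
    return labels

X = np.random.rand(12, 2)
centers = X[:3]                              # illustrative initial centers
labels = constrained_assign(X, centers, must=[(0, 1)], cannot=[(0, 2)])
```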

Journal ArticleDOI
TL;DR: An integrated method for clustering of QRS complexes is presented which includes basis function representation and self-organizing neural networks (NNs), and which outperforms both a published supervised learning method and a conventional template cross-correlation clustering method.
Abstract: An integrated method for clustering of QRS complexes is presented which includes basis function representation and self-organizing neural networks (NNs). Each QRS complex is decomposed into Hermite basis functions and the resulting coefficients and width parameter are used to represent the complex. By means of this representation, unsupervised self-organizing NNs are employed to cluster the data into 25 groups. Using the MIT-BIH arrhythmia database, the resulting clusters are found to exhibit a very low degree of misclassification (1.5%). The integrated method outperforms, on the MIT-BIH database, both a published supervised learning method and a conventional template cross-correlation clustering method.
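
The representation step can be sketched directly: expand each beat in Hermite basis functions (a physicists' Hermite polynomial times a Gaussian window) and keep the least-squares coefficients as features. The orders, the width, and the toy beat below are illustrative; in the paper the width parameter is itself part of the representation rather than fixed.

```python
import numpy as np
from scipy.special import eval_hermite, factorial

def hermite_basis(n, t, sigma):
    """Orthonormal Hermite basis function of order n with width sigma."""
    x = t / sigma
    norm = 1.0 / np.sqrt(sigma * 2.0**n * factorial(n) * np.sqrt(np.pi))
    return norm * eval_hermite(n, x) * np.exp(-x**2 / 2)

def qrs_features(beat, t, n_basis=6, sigma=0.02):
    """Least-squares Hermite coefficients of one QRS complex; these
    coefficients form the feature vector fed to the clustering stage."""
    B = np.column_stack([hermite_basis(n, t, sigma) for n in range(n_basis)])
    coeffs, *_ = np.linalg.lstsq(B, beat, rcond=None)
    return coeffs

t = np.linspace(-0.1, 0.1, 200)          # 200 samples around the fiducial point
beat = np.exp(-(t / 0.02) ** 2)          # toy stand-in for a real QRS complex
features = qrs_features(beat, t)
```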

Journal ArticleDOI
Charu C. Aggarwal, Philip S. Yu
16 May 2000
TL;DR: Very general techniques for projected clustering are discussed which are able to construct clusters in arbitrarily aligned subspaces of lower dimensionality, which is substantially more general and realistic than currently available techniques.
Abstract: High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that in high dimensional data, even the concept of proximity or clustering may not be meaningful. We discuss very general techniques for projected clustering which are able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than currently available techniques which limit the method to only projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to redefine clustering for high dimensional applications by searching for hidden subspaces with clusters which are created by inter-attribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable, and are likely to trade off with better accuracy.

Proceedings ArticleDOI
01 Jul 2000
TL;DR: A novel implementation of the recently introduced information bottleneck method for unsupervised document clustering that first finds word clusters that capture most of the mutual information about the set of documents, and then finds document clusters that preserve the information about the word clusters.
Abstract: We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained word clusters, Ỹ, maximally preserve the information on the documents. The resulting joint distribution, p(X, Ỹ), contains most of the original information about the documents, I(X; Ỹ) ≈ I(X; Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word clusters is preserved. Thus, we first find word clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.
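
In the agglomerative variant of the information bottleneck, the cost of merging two word clusters is the merge prior times the weighted Jensen-Shannon divergence between their document distributions; this equals the mutual information lost by the merge. A sketch of just that cost (the bookkeeping of the full agglomeration is omitted):

```python
import numpy as np

def js_divergence(p, q, w1, w2, eps=1e-12):
    """Jensen-Shannon divergence with mixture weights w1, w2."""
    pi1, pi2 = w1 / (w1 + w2), w2 / (w1 + w2)
    m = pi1 * p + pi2 * q
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return pi1 * kl(p, m) + pi2 * kl(q, m)

def merge_cost(p_y1, p_y2, p_x_given_y1, p_x_given_y2):
    """Information lost by merging word clusters y1 and y2."""
    return (p_y1 + p_y2) * js_divergence(p_x_given_y1, p_x_given_y2, p_y1, p_y2)
```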

Journal ArticleDOI
TL;DR: The growing self-organizing map (GSOM) is presented in detail and the effect of a spread factor, which can be used to measure and control the spread of the GSOM, is investigated.
Abstract: The growing self-organizing map (GSOM) algorithm is presented in detail and the effect of a spread factor, which can be used to measure and control the spread of the GSOM, is investigated. The spread factor is independent of the dimensionality of the data and as such can be used as a controlling measure for generating maps with different dimensionality, which can then be compared and analyzed with better accuracy. The spread factor is also presented as a method of achieving hierarchical clustering of a data set with the GSOM. Such hierarchical clustering allows the data analyst to identify significant and interesting clusters at a higher level of the hierarchy, and continue with finer clustering of the interesting clusters only. Therefore, only a small map is created in the beginning with a low spread factor, which can be generated for even a very large data set. Further analysis is conducted on selected sections of the data and of smaller volume. Therefore, this method facilitates the analysis of even very large data sets.
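
In the GSOM literature the spread factor usually enters through a growth threshold GT = -D ln(SF), where D is the data dimensionality: a node spawns neighbors once its accumulated quantization error exceeds GT. We state this formula from memory of the GSOM papers, so treat it as an assumption; it does show why SF itself is dimension-independent, since D is folded into the threshold.

```python
import numpy as np

def growth_threshold(dim, spread_factor):
    """GSOM growth threshold GT = -D * ln(SF). A higher spread factor
    lowers the threshold and therefore grows a larger, finer map."""
    return -dim * np.log(spread_factor)

# Coarse first-pass map (SF = 0.1), then finer maps on selected clusters.
print(growth_threshold(10, 0.1), growth_threshold(10, 0.9))
```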

Journal ArticleDOI
01 Feb 2000
TL;DR: This work describes a novel approach for clustering collections of sets, and its application to the analysis and mining of categorical data, based on an iterative method for assigning and propagating weights on the categorical values in a table.
Abstract: We describe a novel approach for clustering collections of sets, and its application to the analysis and mining of categorical data. By “categorical data,” we mean tables with fields that cannot be naturally ordered by a metric – e.g., the names of producers of automobiles, or the names of products offered by a manufacturer. Our approach is based on an iterative method for assigning and propagating weights on the categorical values in a table; this facilitates a type of similarity measure arising from the co-occurrence of values in the dataset. Our techniques can be studied analytically in terms of certain types of non-linear dynamical systems.
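
A minimal sketch of one additive variant of such weight propagation (the approach admits a family of combiner and normalization rules; the sum combiner and per-field normalization here are a single illustrative instance, not the paper's definitive configuration):

```python
import numpy as np
from collections import defaultdict

def propagate_weights(rows, n_iter=20):
    """Iterative weight propagation over categorical values: each
    value's new weight sums, over rows containing it, the weights of
    the co-occurring values; weights are then normalized per field.
    Values that co-occur often reinforce each other."""
    n_fields = len(rows[0])
    w = defaultdict(lambda: 1.0)                 # (field, value) -> weight
    for _ in range(n_iter):
        new = defaultdict(float)
        for row in rows:
            for f in range(n_fields):
                new[(f, row[f])] += sum(w[(g, row[g])]
                                        for g in range(n_fields) if g != f)
        for f in range(n_fields):                # per-field normalization
            keys = [k for k in new if k[0] == f]
            norm = np.sqrt(sum(new[k] ** 2 for k in keys))
            for k in keys:
                new[k] /= norm
        w = defaultdict(lambda: 1.0, new)
    return dict(w)

rows = [("honda", "red"), ("honda", "blue"), ("toyota", "red")]
weights = propagate_weights(rows)
```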