
Showing papers on "Cluster analysis" published in 2000


Proceedings ArticleDOI
04 Jan 2000
TL;DR: The Low-Energy Adaptive Clustering Hierarchy (LEACH) as mentioned in this paper is a clustering-based protocol that utilizes randomized rotation of local cluster base stations (cluster-heads) to evenly distribute the energy load among the sensors in the network.
Abstract: Wireless distributed microsensor systems will enable the reliable monitoring of a variety of environments for both civil and military applications. In this paper, we look at communication protocols, which can have significant impact on the overall energy dissipation of these networks. Based on our findings that the conventional protocols of direct transmission, minimum-transmission-energy, multi-hop routing, and static clustering may not be optimal for sensor networks, we propose LEACH (Low-Energy Adaptive Clustering Hierarchy), a clustering-based protocol that utilizes randomized rotation of local cluster base stations (cluster-heads) to evenly distribute the energy load among the sensors in the network. LEACH uses localized coordination to enable scalability and robustness for dynamic networks, and incorporates data fusion into the routing protocol to reduce the amount of information that must be transmitted to the base station. Simulations show that LEACH can achieve as much as a factor of 8 reduction in energy dissipation compared with conventional routing protocols. In addition, LEACH is able to distribute energy dissipation evenly throughout the sensors, doubling the useful system lifetime for the networks we simulated.
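
The rotation rule itself is compact. Below is a minimal Python sketch of the randomized cluster-head election, using the threshold from the paper, T(n) = P / (1 - P (r mod 1/P)) for nodes that have not served as heads in the last 1/P rounds; the function names and the bookkeeping structure are illustrative, not taken from the authors' code.

```python
import random

def leach_threshold(p, r, eligible):
    """Election threshold T(n) for round r.

    p: desired fraction of cluster-heads per round.
    eligible: True if the node has not served as a head in the
    last 1/p rounds (the set G in the paper); others get 0."""
    if not eligible:
        return 0.0
    return p / (1 - p * (r % int(round(1 / p))))

def elect_heads(node_ids, p, r, served_recently):
    """Each node draws a uniform random number and becomes a
    cluster-head if the draw falls below its threshold."""
    return [n for n in node_ids
            if random.random() < leach_threshold(p, r, n not in served_recently)]

# Example: 100 nodes, 5% desired heads, round 3, no recent heads yet.
heads = elect_heads(range(100), p=0.05, r=3, served_recently=set())
```

The threshold rises to 1 as the epoch of 1/P rounds ends, so every node serves as head once per epoch, which is what spreads the energy load evenly.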

12,497 citations


Journal ArticleDOI
TL;DR: The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.
Abstract: The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques and methods imported from statistical learning theory have been receiving increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. In spite of almost 50 years of research and development in this field, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.

6,527 citations


23 May 2000
TL;DR: This paper compares the two main approaches to document clustering, agglomerative hierarchical clustering and K-means, and indicates that the bisecting K-means technique is better than the standard K-means approach and as good as or better than the hierarchical approaches tested, for a variety of cluster evaluation metrics.
Abstract: This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a “standard” K-means algorithm and a variant of K-means, “bisecting” K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to “get the best of both worlds.” However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good as or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data.
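
As a concrete reference for the variant being compared, here is a minimal sketch of bisecting K-means: keep splitting a selected cluster with 2-means until K clusters remain. Splitting the largest cluster is one of the selection rules the paper considers; using scikit-learn's KMeans for the 2-way splits is our implementation choice, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    """Split the largest cluster with 2-means until k clusters exist."""
    rng = np.random.RandomState(seed)
    clusters = [np.arange(len(X))]        # start with all points in one cluster
    while len(clusters) < k:
        # pick the largest cluster and bisect it
        members = clusters.pop(max(range(len(clusters)),
                                   key=lambda i: len(clusters[i])))
        km = KMeans(n_clusters=2, n_init=n_trials,
                    random_state=rng.randint(2**31)).fit(X[members])
        clusters.append(members[km.labels_ == 0])
        clusters.append(members[km.labels_ == 1])
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels
```

Each bisection is linear in the points it touches, which is where the overall linear-time behavior noted in the abstract comes from.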

2,899 citations


Journal ArticleDOI
TL;DR: The two-stage procedure (first using SOM to produce the prototypes, which are then clustered in the second stage) is found to perform well when compared with direct clustering of the data and to reduce the computation time.
Abstract: The self-organizing map (SOM) is an excellent tool in the exploratory phase of data mining. It projects input space on prototypes of a low-dimensional regular grid that can be effectively utilized to visualize and explore properties of the data. When the number of SOM units is large, to facilitate quantitative analysis of the map and the data, similar units need to be grouped, i.e., clustered. In this paper, different approaches to clustering of the SOM are considered. In particular, the use of hierarchical agglomerative clustering and partitive clustering using K-means are investigated. The two-stage procedure (first using SOM to produce the prototypes, which are then clustered in the second stage) is found to perform well when compared with direct clustering of the data and to reduce the computation time.
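
A minimal sketch of that two-stage procedure, assuming the third-party minisom package for the SOM stage and scikit-learn for the second stage (grid size, iteration count, and the final number of clusters are illustrative choices, not values from the paper):

```python
import numpy as np
from minisom import MiniSom              # third-party SOM implementation
from sklearn.cluster import KMeans

X = np.random.rand(1000, 8)              # stand-in for the real data

# Stage 1: train a SOM; its codebook vectors serve as prototypes.
som = MiniSom(15, 15, X.shape[1], sigma=2.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, 5000)
prototypes = som.get_weights().reshape(-1, X.shape[1])   # 225 prototypes

# Stage 2: cluster the prototypes instead of the raw data.
proto_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(prototypes)

# Each data point inherits the cluster of its best-matching prototype.
bmu = np.argmin(((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1), axis=1)
labels = proto_labels[bmu]
```

The computational saving comes from stage 2 operating on a few hundred prototypes rather than on all data points.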

2,387 citations


Proceedings Article
19 Aug 2000
TL;DR: An efficient node-deletion algorithm is introduced to find submatrices in expression data that have low mean squared residue scores and it is shown to perform well in finding co-regulation patterns in yeast and human.
Abstract: An efficient node-deletion algorithm is introduced to find submatrices in expression data that have low mean squared residue scores and it is shown to perform well in finding co-regulation patterns in yeast and human. This introduces "biclustering", or simultaneous clustering of both genes and conditions, to knowledge discovery from expression data. This approach overcomes some problems associated with traditional clustering methods, by allowing automatic discovery of similarity based on a subset of attributes, simultaneous clustering of genes and conditions, and overlapped grouping that provides a better representation for genes with multiple functions or regulated by many factors.
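
The score being minimized is the Cheng-Church mean squared residue: for row set I and column set J, H(I, J) is the mean of (a_ij - a_iJ - a_Ij + a_IJ)^2 over the submatrix, where a_iJ, a_Ij and a_IJ are the row, column and submatrix means. A short sketch of the scoring step alone (the greedy node-deletion search that drives the score down is omitted):

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Mean squared residue of the bicluster A[rows][:, cols].
    Low scores indicate rows and columns that vary coherently."""
    sub = A[np.ix_(rows, cols)]
    row_mean = sub.mean(axis=1, keepdims=True)   # a_iJ
    col_mean = sub.mean(axis=0, keepdims=True)   # a_Ij
    all_mean = sub.mean()                        # a_IJ
    return ((sub - row_mean - col_mean + all_mean) ** 2).mean()
```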

2,213 citations


Journal ArticleDOI
TL;DR: A method for assessing mixture models in a cluster analysis setting with the integrated completed likelihood, which appears to be more robust than BIC to violation of some of the mixture model assumptions and can select a number of clusters leading to a sensible partitioning of the data.
Abstract: We propose a method for assessing mixture models in a cluster analysis setting with the integrated completed likelihood. For this purpose, the observed data are assigned to unknown clusters using a maximum a posteriori operator. Then, the integrated completed likelihood (ICL) is approximated using the Bayesian information criterion (BIC). Numerical experiments on simulated and real data show that the resulting ICL criterion performs well both for choosing a mixture model and a relevant number of clusters. In particular, ICL appears to be more robust than BIC to violation of some of the mixture model assumptions and it can select a number of clusters leading to a sensible partitioning of the data.
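
In one common reading of the criterion, ICL equals BIC plus an entropy penalty on the posterior cluster memberships, so solutions with ambiguous assignments are penalized. A hedged sketch using scikit-learn Gaussian mixtures; the sign and the factor of 2 below follow sklearn's lower-is-better BIC convention and are our adaptation, so treat them as assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def icl(gm, X, eps=1e-12):
    """BIC plus the entropy of the posterior assignments (lower is better)."""
    tau = gm.predict_proba(X)                      # posterior membership matrix
    entropy = -np.sum(tau * np.log(tau + eps))
    return gm.bic(X) + 2.0 * entropy

# Choose the number of components by minimizing ICL.
X = np.random.randn(500, 2)
scores = {k: icl(GaussianMixture(k, random_state=0).fit(X), X) for k in range(1, 7)}
best_k = min(scores, key=scores.get)
```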

1,418 citations


Journal ArticleDOI
TL;DR: This paper develops a robust hierarchical clustering algorithm ROCK that employs links and not distances when merging clusters, and indicates that ROCK not only generates better quality clusters than traditional algorithms, but it also exhibits good scalability properties.

1,383 citations


Journal ArticleDOI
TL;DR: The superiority of the GA-clustering algorithm over the commonly used K-means algorithm is extensively demonstrated for four artificial and three real-life data sets.

1,337 citations


Proceedings ArticleDOI
01 Aug 2000
TL;DR: This work presents a new technique for clustering large datasets, using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets the authors call canopies, and presents experimental results on grouping bibliographic citations from the reference sections of research papers.
Abstract: Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once, for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under reasonable assumptions about the cheap distance metric, this reduction in computational cost comes without any loss in clustering accuracy. Canopies can be applied to many domains and used with a variety of clustering approaches, including Greedy Agglomerative Clustering, K-means and Expectation-Maximization. We present experimental results on grouping bibliographic citations from the reference sections of research papers. Here the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude and decreases error in comparison to a previously used algorithm by 25%.
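
The canopy construction is a short greedy loop over two thresholds T1 > T2. The sketch below is illustrative: the thresholds and the cheap L1 metric are placeholders, whereas the paper pairs canopies with a domain-specific cheap distance (for citations, one computed from an inverted index).

```python
import numpy as np

def canopies(X, t1, t2, cheap_dist):
    """Greedy canopy construction (t1 > t2).

    cheap_dist(center, data) returns approximate distances from one
    point to all points. Points within t2 of a canopy center stop
    being candidate centers; points within t1 join the (overlapping)
    canopy."""
    assert t1 > t2
    candidates = set(range(len(X)))
    out = []
    while candidates:
        seed = candidates.pop()
        d = cheap_dist(X[seed], X)
        canopy = {i for i in range(len(X)) if d[i] < t1}
        out.append(canopy)
        candidates -= {i for i in canopy if d[i] < t2}
    return out

# Expensive exact distances are then computed only within each canopy.
X = np.random.rand(200, 50)
groups = canopies(X, t1=3.0, t2=1.5,
                  cheap_dist=lambda c, data: np.abs(data - c).sum(axis=1))
```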

Journal ArticleDOI
TL;DR: It is concluded that the combination of predictive modeling with systematic experimental verification will be required to gain a deeper insight into living organisms, therapeutic targeting and bioengineering.
Abstract: Advances in molecular biological, analytical, and computational technologies are enabling us to systematically investigate the complex molecular processes underlying biological systems. In particular, using high-throughput gene expression assays, we are able to measure the output of the gene regulatory network. We aim here to review data mining and modeling approaches for conceptualizing and unraveling the functional relationships implicit in these datasets. Clustering of co-expression profiles allows us to infer shared regulatory inputs and functional pathways. We discuss various aspects of clustering, ranging from distance measures to clustering algorithms and multiple-cluster memberships. More advanced analysis aims to infer causal connections between genes directly, i.e., who is regulating whom and how. We discuss several approaches to the problem of reverse engineering of genetic networks, from discrete Boolean networks, to continuous linear and non-linear models. We conclude that the combination of predictive modeling with systematic experimental verification will be required to gain a deeper insight into living organisms, therapeutic targeting, and bioengineering.

Journal ArticleDOI
TL;DR: The use of the ECM algorithm to fit this t mixture model is described and examples of its use are given in the context of clustering multivariate data in the presence of atypical observations in the form of background noise.
Abstract: Normal mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster sets of continuous multivariate data. However, for a set of data containing a group or groups of observations with longer than normal tails or atypical observations, the use of normal components may unduly affect the fit of the mixture model. In this paper, we consider a more robust approach by modelling the data by a mixture of t distributions. The use of the ECM algorithm to fit this t mixture model is described and examples of its use are given in the context of clustering multivariate data in the presence of atypical observations in the form of background noise.
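
The robustness comes from the E-step of the (E)CM fit: each observation receives a weight u = (ν + p) / (ν + δ²), where p is the dimension and δ² the Mahalanobis distance from the component mean, so atypical points are automatically downweighted. A sketch of that weighting step for a single component (the surrounding ECM loop is omitted):

```python
import numpy as np

def t_weights(X, mu, sigma, nu):
    """E-step weights for one t component: u_j = (nu + p) / (nu + d_j^2).
    Points far from mu (large Mahalanobis distance) get small weights,
    which is what makes the fit robust to outliers and background noise."""
    p = X.shape[1]
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(sigma), diff)
    return (nu + p) / (nu + d2)
```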

Proceedings ArticleDOI
12 Nov 2000
TL;DR: Two results regarding the quality of the clustering found by a popular spectral algorithm are presented, one proffers worst case guarantees whilst the other shows that if there exists a "good" clustering then the spectral algorithm will find one close to it.
Abstract: We propose a new measure for assessing the quality of a clustering. A simple heuristic is shown to give worst-case guarantees under the new measure. Then we present two results regarding the quality of the clustering found by a popular spectral algorithm. One proffers worst case guarantees whilst the other shows that if there exists a "good" clustering then the spectral algorithm will find one close to it.

Journal ArticleDOI
TL;DR: An algorithm, based on iterative clustering, that identifies subsets of the genes and samples such that when one of these is used to cluster the other, stable and significant partitions emerge.
Abstract: We present a coupled two-way clustering approach to gene microarray data analysis. The main idea is to identify subsets of the genes and samples, such that when one of these is used to cluster the other, stable and significant partitions emerge. The search for such subsets is a computationally complex task. We present an algorithm, based on iterative clustering, that performs such a search. This analysis is especially suitable for gene microarray data, where the contributions of a variety of biological mechanisms to the gene expression levels are entangled in a large body of experimental data. The method was applied to two gene microarray data sets, on colon cancer and leukemia. By identifying relevant subsets of the data and focusing on them we were able to discover partitions and correlations that were masked and hidden when the full dataset was used in the analysis. Some of these partitions have clear biological interpretation; others can serve to identify possible directions for future research.

Proceedings ArticleDOI
12 Nov 2000
TL;DR: This work gives constant-factor approximation algorithms for the k-median problem in the data stream model of computation in a single pass, and shows negative results implying that these algorithms cannot be improved in a certain sense.
Abstract: We study clustering under the data stream model of computation where: given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. The data stream model is relevant to new classes of applications involving massive data sets, such as Web click stream analysis and multimedia data analysis. We give constant-factor approximation algorithms for the k-median problem in the data stream model of computation in a single pass. We also show negative results implying that our algorithms cannot be improved in a certain sense.
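
The spirit of such single-pass algorithms can be sketched with the standard small-space layering: cluster each arriving chunk, keep only its weighted centers, and cluster the retained centers at the end. This is a rough illustration of the scheme, not the paper's algorithm or its approximation guarantees, and k-means stands in for the k-median subroutine:

```python
import numpy as np
from sklearn.cluster import KMeans

def stream_cluster(chunks, k):
    """One pass over the stream: per-chunk clustering, then a final
    clustering of the weighted per-chunk centers."""
    centers, weights = [], []
    for chunk in chunks:                     # each chunk is seen exactly once
        km = KMeans(n_clusters=k, n_init=5, random_state=0).fit(chunk)
        centers.append(km.cluster_centers_)
        weights.append(np.bincount(km.labels_, minlength=k))
    final = KMeans(n_clusters=k, n_init=5, random_state=0)
    final.fit(np.vstack(centers), sample_weight=np.concatenate(weights))
    return final.cluster_centers_

stream = (np.random.rand(1000, 4) for _ in range(5))   # stands in for a real stream
centers = stream_cluster(stream, k=3)
```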

01 Jan 2000
TL;DR: Comparing four popular similarity measures in conjunction with several clustering techniques, cosine and extended Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean performs poorest.
Abstract: Clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different metrics. We observe that in domains such as YAHOO that provide a categorization by human experts, a useful criterion for comparisons across similarity metrics is indeed available. We then compare four popular similarity measures (Euclidean, cosine, Pearson correlation and extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hyper-graph partitioning, generalized k-means, weighted graph partitioning), on high dimensional sparse data representing web documents. Performance is measured against a human-imposed classification into news categories and industry categories. We conduct a number of experiments and use t-tests to assure statistical significance of results. Cosine and extended Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean performs poorest. Also, weighted graph partitioning approaches are clearly superior to all others.
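
The measures under comparison are easy to state side by side. A sketch (the conversion of Euclidean distance into a similarity is one common convention, not necessarily the paper's exact transformation):

```python
import numpy as np

def cosine(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def extended_jaccard(x, y):
    """Reduces to the ordinary Jaccard coefficient for binary vectors."""
    dot = x @ y
    return dot / (x @ x + y @ y - dot)

def euclidean_similarity(x, y):
    return 1.0 / (1.0 + np.linalg.norm(x - y))   # one way to map distance to (0, 1]

x, y = np.random.rand(1000), np.random.rand(1000)
print(cosine(x, y), pearson(x, y), extended_jaccard(x, y), euclidean_similarity(x, y))
```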

Book ChapterDOI
26 Jun 2000
TL;DR: A method to learn object class models from unlabeled and unsegmented cluttered scenes for the purpose of visual object recognition that achieves very good classification results on human faces and rear views of cars.
Abstract: We present a method to learn object class models from unlabeled and unsegmented cluttered scenes for the purpose of visual object recognition. We focus on a particular type of model where objects are represented as flexible constellations of rigid parts (features). The variability within a class is represented by a joint probability density function (pdf) on the shape of the constellation and the output of part detectors. In a first stage, the method automatically identifies distinctive parts in the training set by applying a clustering algorithm to patterns selected by an interest operator. It then learns the statistical shape model using expectation maximization. The method achieves very good classification results on human faces and rear views of cars.

Proceedings Article
29 Apr 2000
TL;DR: This paper proposes a method for linear text segmentation which is twice as accurate and over seven times as fast as the state-of-the-art (Reynar, 1998).
Abstract: This paper describes a method for linear text segmentation which is twice as accurate and over seven times as fast as the state-of-the-art (Reynar, 1998). Inter-sentence similarity is replaced by rank in the local context. Boundary locations are discovered by divisive clustering.

Proceedings Article
Sid Ray, Rose H. Turi
01 Jan 2000
TL;DR: This paper presents a simple validity measure based on the intra-cluster and inter-cluster distance measures which allows the number of clusters to be determined automatically, and tests it on synthetic images for which the number of clusters is known as well as on natural images.
Abstract: The main disadvantage of the k-means algorithm is that the number of clusters, K, must be supplied as a parameter. In this paper we present a simple validity measure based on the intra-cluster and inter-cluster distance measures which allows the number of clusters to be determined automatically. The basic procedure involves producing all the segmented images for 2 clusters up to Kmax clusters, where Kmax represents an upper limit on the number of clusters. Then our validity measure is calculated to determine which is the best clustering by finding the minimum value for our measure. The validity measure is tested on synthetic images for which the number of clusters is known, and is also implemented for natural images.
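
A sketch of the procedure as described: compute the measure (mean intra-cluster squared distance divided by the minimum squared distance between cluster centers, smaller is better) for each K from 2 to Kmax and keep the minimizer. The exact normalization in the paper may differ slightly, so treat this as the general form; scikit-learn's KMeans is our stand-in for the segmentation step.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def validity(X, labels, centers):
    """Intra-cluster compactness over inter-cluster separation."""
    intra = np.mean(((X - centers[labels]) ** 2).sum(axis=1))
    inter = min(((a - b) ** 2).sum() for a, b in combinations(centers, 2))
    return intra / inter

def best_k(X, k_max):
    scores = {}
    for k in range(2, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        scores[k] = validity(X, km.labels_, km.cluster_centers_)
    return min(scores, key=scores.get)     # smallest validity wins
```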

Book
03 Feb 2000
TL;DR: An edited volume on Symbolic Data Analysis and the SODAS project, with chapters by H.H. Bock, E. Diday and others covering symbolic objects, similarity and dissimilarity measures, symbolic factor analysis, discrimination, and clustering methods for symbolic data.
Abstract: E. Diday: Symbolic Data Analysis and the SODAS Project: Purpose, History, Perspective.- H.H. Bock: The Classical Data Situation.- H.H. Bock: Symbolic Data.- H.H. Bock, E. Diday: Symbolic Objects.- V. Stephan, G. Hebrail, Y. Lechevallier: Generation of Symbolic Objects from Relational Databases.- P. Bertrand, F. Goupil: Descriptive Statistics for Symbolic Data.- M. Noirhomme-Fraiture, M. Rouard: Visualizing and Editing Symbolic Objects.- Similarity and Dissimilarity: F. Esposito, D. Malerba, V. Tamma, H.H. Bock: Classical Resemblance Measures.- H.H. Bock: Dissimilarity Measures for Probability Distributions.- F. Esposito, D. Malerba, V. Tamma: Dissimilarity Measures for Symbolic Objects.- F. Esposito, D. Malerba, F. Lisi: Matching Symbolic Objects.- Symbolic Factor Analysis: H.H.Bock: Classical Principal Component Analysis.- A. Chouakria, P. Cazes, E. Diday: Symbolic Principal Component Analysis.- N.C. Lauro, F. Palumbo, R. Verde: Factorial Discriminant Analysis on Symbolic Objects.- Discrimination: Assigning Symbolic Objects to Classes: J. Rasson, S. Lissoir: Classical Methods of Discrimination.- J. Rasson, S. Lissoir: Symbolic Kernel Discriminant Analysis.- E. Perinel, Y. Lechevalier: Symbolic Discrimination Rules.- M. Bravo Llatas, J. Garcia-Santesmases: Segmentation Trees for Stratified Data.- Clustering Methods for Symbolic Objects: M. Chavent, H.H. Bock: Clustering Problem, Clustering Methods for Classical Data.- M. Chavent: Criterion-Based Divisive Clustering for Symbolic Data.- P. Brito: Hierarchical and Pyramidal Clustering with Complete Symbolic Objects.- G. Polaillon: Pyramidal Classification for Interval Data Using Galois Lattice Reduction.- M. Gettler-Summa, C. Pardoux: Symbolic Approaches for Three-way Data.-Illustrative Benchmark Analysis: R. Bisdorff: Introduction.- R. Bisdorff: Professional Careers of Retired Working Persons.- A. Iztueta, P. Calvo: Labour Force Survey.- F. Goupil, M. Touati, E. Diday, R. Moult: Census Data from the Office for National Statistics.- A. Morineau: The SODAS Software Package.

Journal ArticleDOI
TL;DR: This paper investigates and develops a methodology that serves to automatically identify a subset of aROIs (algorithmically detected ROIs) using different image processing algorithms (IPAs) and appropriate clustering procedures, and compares aROIs with hROIs (human identified ROIs) as a criterion for evaluating and selecting bottom-up, context-free algorithms.
Abstract: Many machine vision applications, such as compression, pictorial database querying, and image understanding, often need to analyze in detail only a representative subset of the image, which may be arranged into sequences of loci called regions-of-interest (ROIs). We have investigated and developed a methodology that serves to automatically identify such a subset of aROIs (algorithmically detected ROIs) using different image processing algorithms (IPAs), and appropriate clustering procedures. In human perception, an internal representation directs top-down, context-dependent sequences of eye movements to fixate on similar sequences of hROIs (human identified ROIs). In the paper, we introduce our methodology and we compare aROIs with hROIs as a criterion for evaluating and selecting bottom-up, context-free algorithms. An application is finally discussed.

Proceedings Article
30 Jul 2000
TL;DR: In this paper, two types of instance-level clustering constraints, must-link and cannot-link constraints, are proposed to aid the search of possible organizations of a data set.
Abstract: Clustering algorithms conduct a search through the space of possible organizations of a data set. In this paper, we propose two types of instance-level clustering constraints (must-link and cannot-link constraints) and show how they can be incorporated into a clustering algorithm to aid that search. For three of the four data sets tested, our results indicate that the incorporation of surprisingly few such constraints can increase clustering accuracy while decreasing runtime. We also investigate the relative effects of each type of constraint and find that the type that contributes most to accuracy improvements depends on the behavior of the clustering algorithm without constraints.
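
One simple way to honor such constraints during clustering is to make the assignment step constraint-aware: each point takes its nearest center that violates no must-link or cannot-link with already-assigned points. The sketch below is a k-means-style illustration of the idea only; the paper incorporates the constraints into a different base clustering algorithm.

```python
import numpy as np

def constrained_assign(X, centers, must, cannot):
    """Greedy constraint-respecting assignment. `must` and `cannot`
    are lists of index pairs; points that cannot be placed without a
    violation keep the label -1."""
    labels = -np.ones(len(X), dtype=int)
    for i in range(len(X)):
        for c in np.argsort(((centers - X[i]) ** 2).sum(axis=1)):
            ok = all(labels[b if a == i else a] in (-1, c)
                     for a, b in must if i in (a, b))
            ok = ok and all(labels[b if a == i else a] != c
                            for a, b in cannot if i in (a, b))
            if ok:
                labels[i] = c
                break
    return labels

X = np.random.rand(12, 2)
centers = X[:3]                              # illustrative initial centers
labels = constrained_assign(X, centers, must=[(0, 1)], cannot=[(0, 2)])
```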

Journal ArticleDOI
TL;DR: An integrated method for clustering of QRS complexes is presented which includes basis function representation and self-organizing neural networks (NNs), and which outperforms both a published supervised learning method and a conventional template cross-correlation clustering method.
Abstract: An integrated method for clustering of QRS complexes is presented which includes basis function representation and self-organizing neural networks (NNs). Each QRS complex is decomposed into Hermite basis functions and the resulting coefficients and width parameter are used to represent the complex. By means of this representation, unsupervised self-organizing NNs are employed to cluster the data into 25 groups. Using the MIT-BIH arrhythmia database, the resulting clusters are found to exhibit a very low degree of misclassification (1.5%). The integrated method outperforms, on the MIT-BIH database, both a published supervised learning method and a conventional template cross-correlation clustering method.
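
The representation step can be sketched directly: expand each beat in Hermite basis functions (a physicists' Hermite polynomial times a Gaussian window) and keep the least-squares coefficients as features. The orders, the width, and the toy beat below are illustrative; in the paper the width parameter is itself part of the representation rather than fixed.

```python
import numpy as np
from scipy.special import eval_hermite, factorial

def hermite_basis(n, t, sigma):
    """Orthonormal Hermite basis function of order n with width sigma."""
    x = t / sigma
    norm = 1.0 / np.sqrt(sigma * 2.0**n * factorial(n) * np.sqrt(np.pi))
    return norm * eval_hermite(n, x) * np.exp(-x**2 / 2)

def qrs_features(beat, t, n_basis=6, sigma=0.02):
    """Least-squares Hermite coefficients of one QRS complex; these
    coefficients form the feature vector fed to the clustering stage."""
    B = np.column_stack([hermite_basis(n, t, sigma) for n in range(n_basis)])
    coeffs, *_ = np.linalg.lstsq(B, beat, rcond=None)
    return coeffs

t = np.linspace(-0.1, 0.1, 200)          # 200 samples around the fiducial point
beat = np.exp(-(t / 0.02) ** 2)          # toy stand-in for a real QRS complex
features = qrs_features(beat, t)
```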

Journal ArticleDOI
Charu C. Aggarwal, Philip S. Yu
16 May 2000
TL;DR: Very general techniques for projected clustering are discussed which are able to construct clusters in arbitrarily aligned subspaces of lower dimensionality, which is substantially more general and realistic than currently available techniques.
Abstract: High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that in high dimensional data, even the concept of proximity or clustering may not be meaningful. We discuss very general techniques for projected clustering which are able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than currently available techniques which limit the method to only projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to redefine clustering for high dimensional applications by searching for hidden subspaces with clusters which are created by inter-attribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable, and are likely to trade off with better accuracy.

Proceedings ArticleDOI
01 Jul 2000
TL;DR: A novel implementation of the recently introduced information bottleneck method for unsupervised document clustering that first finds word clusters that capture most of the mutual information about the set of documents, and then finds document clusters that preserve the information about the word clusters.
Abstract: We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained word clusters, Ỹ, maximally preserve the information on the documents. The resulting joint distribution, p(X, Ỹ), contains most of the original information about the documents, I(X; Ỹ) ≈ I(X; Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word clusters is preserved. Thus, we first find word clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.
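
In the agglomerative variant of the information bottleneck, the cost of merging two word clusters is the merge prior times the weighted Jensen-Shannon divergence between their document distributions; this equals the mutual information lost by the merge. A sketch of just that cost (the bookkeeping of the full agglomeration is omitted):

```python
import numpy as np

def js_divergence(p, q, w1, w2, eps=1e-12):
    """Jensen-Shannon divergence with mixture weights w1, w2."""
    pi1, pi2 = w1 / (w1 + w2), w2 / (w1 + w2)
    m = pi1 * p + pi2 * q
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return pi1 * kl(p, m) + pi2 * kl(q, m)

def merge_cost(p_y1, p_y2, p_x_given_y1, p_x_given_y2):
    """Information lost by merging word clusters y1 and y2."""
    return (p_y1 + p_y2) * js_divergence(p_x_given_y1, p_x_given_y2, p_y1, p_y2)
```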

Journal ArticleDOI
TL;DR: The growing self-organizing map (GSOM) is presented in detail and the effect of a spread factor, which can be used to measure and control the spread of the GSOM, is investigated.
Abstract: The growing self-organizing map (GSOM) algorithm is presented in detail and the effect of a spread factor, which can be used to measure and control the spread of the GSOM, is investigated. The spread factor is independent of the dimensionality of the data and as such can be used as a controlling measure for generating maps with different dimensionality, which can then be compared and analyzed with better accuracy. The spread factor is also presented as a method of achieving hierarchical clustering of a data set with the GSOM. Such hierarchical clustering allows the data analyst to identify significant and interesting clusters at a higher level of the hierarchy, and continue with finer clustering of the interesting clusters only. Therefore, only a small map is created in the beginning with a low spread factor, which can be generated for even a very large data set. Further analysis is conducted on selected sections of the data and of smaller volume. Therefore, this method facilitates the analysis of even very large data sets.
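
In the GSOM literature the spread factor usually enters through a growth threshold GT = -D ln(SF), where D is the data dimensionality: a node spawns neighbors once its accumulated quantization error exceeds GT. We state this formula from memory of the GSOM papers, so treat it as an assumption; it does show why SF itself is dimension-independent, since D is folded into the threshold.

```python
import numpy as np

def growth_threshold(dim, spread_factor):
    """GSOM growth threshold GT = -D * ln(SF). A higher spread factor
    lowers the threshold and therefore grows a larger, finer map."""
    return -dim * np.log(spread_factor)

# Coarse first-pass map (SF = 0.1), then finer maps on selected clusters.
print(growth_threshold(10, 0.1), growth_threshold(10, 0.9))
```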

Journal ArticleDOI
01 Feb 2000
TL;DR: This work describes a novel approach for clustering collections of sets, and its application to the analysis and mining of categorical data, based on an iterative method for assigning and propagating weights on the categorical values in a table.
Abstract: We describe a novel approach for clustering collections of sets, and its application to the analysis and mining of categorical data. By “categorical data,” we mean tables with fields that cannot be naturally ordered by a metric – e.g., the names of producers of automobiles, or the names of products offered by a manufacturer. Our approach is based on an iterative method for assigning and propagating weights on the categorical values in a table; this facilitates a type of similarity measure arising from the co-occurrence of values in the dataset. Our techniques can be studied analytically in terms of certain types of non-linear dynamical systems.
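
A minimal sketch of one additive variant of such weight propagation (the approach admits a family of combiner and normalization rules; the sum combiner and per-field normalization here are a single illustrative instance, not the paper's definitive configuration):

```python
import numpy as np
from collections import defaultdict

def propagate_weights(rows, n_iter=20):
    """Iterative weight propagation over categorical values: each
    value's new weight sums, over rows containing it, the weights of
    the co-occurring values; weights are then normalized per field.
    Values that co-occur often reinforce each other."""
    n_fields = len(rows[0])
    w = defaultdict(lambda: 1.0)                 # (field, value) -> weight
    for _ in range(n_iter):
        new = defaultdict(float)
        for row in rows:
            for f in range(n_fields):
                new[(f, row[f])] += sum(w[(g, row[g])]
                                        for g in range(n_fields) if g != f)
        for f in range(n_fields):                # per-field normalization
            keys = [k for k in new if k[0] == f]
            norm = np.sqrt(sum(new[k] ** 2 for k in keys))
            for k in keys:
                new[k] /= norm
        w = defaultdict(lambda: 1.0, new)
    return dict(w)

rows = [("honda", "red"), ("honda", "blue"), ("toyota", "red")]
weights = propagate_weights(rows)
```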