
Showing papers on "Cluster analysis" published in 2005


Journal ArticleDOI
TL;DR: Clustering algorithms for data sets appearing in statistics, computer science, and machine learning are surveyed, and their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive effort, are illustrated.
Abstract: Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several closely related topics, proximity measures and cluster validation, are also discussed.

5,744 citations


Journal ArticleDOI
TL;DR: This work presents tools for hierarchical clustering of imaged objects according to the shapes of their boundaries, learning of probability models for clusters of shapes, and testing of newly observed shapes under competing probability models.
Abstract: Using a differential-geometric treatment of planar shapes, we present tools for: 1) hierarchical clustering of imaged objects according to the shapes of their boundaries, 2) learning of probability models for clusters of shapes, and 3) testing of newly observed shapes under competing probability models. Clustering at any level of hierarchy is performed using a minimum variance type criterion and a Markov process. Statistical means of clusters provide shapes to be clustered at the next higher level, thus building a hierarchy of shapes. Using finite-dimensional approximations of spaces tangent to the shape space at sample means, we (implicitly) impose probability models on the shape space, and results are illustrated via random sampling and classification (hypothesis testing). Together, hierarchical clustering and hypothesis testing provide an efficient framework for shape retrieval. Examples are presented using shapes and images from ETH, Surrey, and AMCOM databases.

2,858 citations


Journal ArticleDOI
TL;DR: With the categorizing framework, the efforts toward building an integrated system for intelligent feature selection are continued, and an illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms.
Abstract: This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines in selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.

2,605 citations


Journal ArticleDOI
TL;DR: This paper surveys and summarizes previous works that investigated the clustering of time series data in various application domains, and covers the general-purpose clustering algorithms commonly used in time series clustering studies.

2,336 citations


Proceedings ArticleDOI
01 Dec 2005
TL;DR: This paper proposes and analyzes parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences, and shows that there is a bijection between regular exponential families and a large class of Bregman divergences, called regular Bregman divergences.
Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information.

1,723 citations
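
A minimal sketch of Bregman hard clustering, assuming synthetic positive-valued data and a hand-rolled generalized I-divergence; the property the sketch leans on, which the paper establishes, is that for any Bregman divergence the optimal cluster representative in the update step is the plain arithmetic mean.

```python
import numpy as np

def squared_euclidean(x, mu):
    """Squared Euclidean distance: the Bregman divergence of ||x||^2."""
    return np.sum((x - mu) ** 2, axis=-1)

def i_divergence(x, mu):
    """Generalized I-divergence (relative entropy), for positive data."""
    return np.sum(x * np.log(x / mu) - x + mu, axis=-1)

def bregman_kmeans(X, k, divergence, n_iter=100, seed=0):
    """Hard clustering with a pluggable Bregman divergence. The update
    step is always the arithmetic mean, whatever the divergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center under the chosen divergence.
        d = np.stack([divergence(X, c) for c in centers])   # shape (k, n)
        labels = d.argmin(axis=0)
        # Update step: cluster means (empty clusters keep their center).
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = rng.gamma(shape=5.0, scale=1.0, size=(200, 4))   # positive toy data
labels, centers = bregman_kmeans(X, k=3, divergence=i_divergence)
```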


Journal ArticleDOI
TL;DR: A simple model for semantic growth is described, in which each new word or concept is connected to an existing network by differentiating the connectivity pattern of an existing node, which generates appropriate small-world statistics and power-law connectivity distributions.

1,224 citations


Journal ArticleDOI
TL;DR: An algebro-geometric solution to the problem of segmenting an unknown number of subspaces of unknown and varying dimensions from sample data points and applications of GPCA to computer vision problems such as face clustering, temporal video segmentation, and 3D motion segmentation from point correspondences in multiple affine views are presented.
Abstract: This paper presents an algebro-geometric solution to the problem of segmenting an unknown number of subspaces of unknown and varying dimensions from sample data points. We represent the subspaces with a set of homogeneous polynomials whose degree is the number of subspaces and whose derivatives at a data point give normal vectors to the subspace passing through the point. When the number of subspaces is known, we show that these polynomials can be estimated linearly from data; hence, subspace segmentation is reduced to classifying one point per subspace. We select these points optimally from the data set by minimizing a certain distance function, thus dealing automatically with moderate noise in the data. A basis for the complement of each subspace is then recovered by applying standard PCA to the collection of derivatives (normal vectors). Extensions of GPCA that deal with data in a high-dimensional space and with an unknown number of subspaces are also presented. Our experiments on low-dimensional data show that GPCA outperforms existing algebraic algorithms based on polynomial factorization and provides a good initialization to iterative techniques such as k-subspaces and expectation maximization. We also present applications of GPCA to computer vision problems such as face clustering, temporal video segmentation, and 3D motion segmentation from point correspondences in multiple affine views.

1,162 citations


Journal ArticleDOI
TL;DR: A theoretical framework, based on the concept of mutual information between data partitions, is developed for the analysis and evaluation of the proposed clustering combination strategy, which extracts a consistent clustering from the various partitions in a clustering ensemble.
Abstract: We explore the idea of evidence accumulation (EAC) for combining the results of multiple clusterings. First, a clustering ensemble, a set of object partitions, is produced. Given a data set (n objects or patterns in d dimensions), different ways of producing data partitions are: 1) applying different clustering algorithms and 2) applying the same clustering algorithm with different values of parameters or initializations. Further, combinations of different data representations (feature spaces) and clustering algorithms can also provide a multitude of significantly different data partitionings. We propose a simple framework for extracting a consistent clustering, given the various partitions in a clustering ensemble. According to the EAC concept, each partition is viewed as independent evidence of data organization; individual data partitions are combined, based on a voting mechanism, to generate a new n × n similarity matrix between the n patterns. The final data partition of the n patterns is obtained by applying a hierarchical agglomerative clustering algorithm on this matrix. We have developed a theoretical framework for the analysis of the proposed clustering combination strategy and its evaluation, based on the concept of mutual information between data partitions. Stability of the results is evaluated using bootstrapping techniques. A detailed discussion of an evidence accumulation-based clustering algorithm, using a split and merge strategy based on the K-means clustering algorithm, is presented. Experimental results of the proposed method on several synthetic and real data sets are compared with other combination strategies, and with individual clustering results produced by well-known clustering algorithms.

1,131 citations
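
A compact sketch of the evidence-accumulation recipe, assuming a toy k-means generator and illustrative ensemble parameters (30 runs with random k): co-memberships are voted into an n × n co-association matrix, which is then cut by a hierarchical algorithm (single link here, one of the linkages the paper considers).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

def kmeans(X, k, n_iter=50, rng=None):
    """Plain k-means, used only to generate the ensemble of partitions."""
    rng = rng or np.random.default_rng()
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def evidence_accumulation(X, n_runs=30, k_range=(5, 15), final_k=3, seed=0):
    """Vote co-memberships into a co-association matrix, then cut it."""
    rng = np.random.default_rng(seed)
    n = len(X)
    coassoc = np.zeros((n, n))
    for _ in range(n_runs):
        k = rng.integers(k_range[0], k_range[1] + 1)
        labels = kmeans(X, k, rng=rng)
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= n_runs
    # Distance = 1 - fraction of runs in which two points co-clustered.
    dist = 1.0 - coassoc
    Z = linkage(dist[np.triu_indices(n, k=1)], method='single')
    return fcluster(Z, t=final_k, criterion='maxclust')

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
print(evidence_accumulation(X)[:10])
```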


Journal ArticleDOI
TL;DR: A centralized routing protocol called base-station controlled dynamic clustering protocol (BCDCP) is proposed, which distributes the energy dissipation evenly among all sensor nodes to improve network lifetime and average energy savings, and is compared to other clustering-based schemes.
Abstract: Wireless sensor networks consist of small battery powered devices with limited energy resources. Once deployed, the small sensor nodes are usually inaccessible to the user, and thus replacement of the energy source is not feasible. Hence, energy efficiency is a key design issue that needs to be enhanced in order to improve the life span of the network. Several network layer protocols have been proposed to improve the effective lifetime of a network with a limited energy supply. In this article we propose a centralized routing protocol called base-station controlled dynamic clustering protocol (BCDCP), which distributes the energy dissipation evenly among all sensor nodes to improve network lifetime and average energy savings. The performance of BCDCP is then compared to clustering-based schemes such as low-energy adaptive clustering hierarchy (LEACH), LEACH-centralized (LEACH-C), and power-efficient gathering in sensor information systems (PEGASIS). Simulation results show that BCDCP reduces overall energy consumption and improves network lifetime relative to these schemes.

922 citations


Journal ArticleDOI
TL;DR: This article presents a comprehensive survey of recently proposed clustering algorithms, classified based on their objectives, with descriptions of the mechanisms, evaluations of performance and cost, and discussions of the advantages and disadvantages of each clustering scheme.
Abstract: Clustering is an important research topic for mobile ad hoc networks (MANETs) because clustering makes it possible to guarantee basic levels of system performance, such as throughput and delay, in the presence of both mobility and a large number of mobile terminals. A large variety of approaches for ad hoc clustering have been presented, whereby different approaches typically focus on different performance metrics. This article presents a comprehensive survey of recently proposed clustering algorithms, which we classify based on their objectives. This survey provides descriptions of the mechanisms, evaluations of their performance and cost, and discussions of advantages and disadvantages of each clustering scheme. With this article, readers can have a more thorough and nuanced understanding of ad hoc clustering and the research trends in this area.

914 citations


Journal ArticleDOI
TL;DR: In this article, the authors review the battery of techniques available for validating clustering results, with a particular focus on their application to post-genomic data analysis.
Abstract: Motivation: The discovery of novel biological knowledge from the ab initio analysis of post-genomic data relies upon the use of unsupervised processing methods, in particular clustering techniques. Much recent research in bioinformatics has therefore been focused on the transfer of clustering methods introduced in other scientific fields and on the development of novel algorithms specifically designed to tackle the challenges posed by post-genomic data. The partitions returned by a clustering algorithm are commonly validated using visual inspection and concordance with prior biological knowledge; whether the clusters actually correspond to the real structure in the data is somewhat less frequently considered. Suitable computational cluster validation techniques are available in the general data-mining literature, but have been given only a fraction of the same attention in bioinformatics. Results: This review paper aims to familiarize the reader with the battery of techniques available for the validation of clustering results, with a particular focus on their application to post-genomic data analysis. Synthetic and real biological datasets are used to demonstrate the benefits, and also some of the perils, of analytical cluster validation. Availability: The software used in the experiments is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/ Contact: J.Handl@postgrad.manchester.ac.uk Supplementary information: Enlarged colour plots are provided in the Supplementary Material, which is available at http://dbkweb.ch.umist.ac.uk/handl/clustervalidation/
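
As a concrete taste of computational cluster validation, the snippet below scores k-means partitions of synthetic data with one internal index (silhouette) and one external index (adjusted Rand); the indices and data are illustrative choices, not the specific battery used in the review.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data with known structure, so internal and external
# validation can be compared side by side.
X, true_labels = make_blobs(n_samples=300, centers=4, random_state=0)

for k in (2, 3, 4, 5, 6):
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          round(silhouette_score(X, pred), 3),                # internal: cohesion vs. separation
          round(adjusted_rand_score(true_labels, pred), 3))   # external: agreement with truth
```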

Journal ArticleDOI
TL;DR: A generative mixture-model approach to clustering directional data based on the von Mises-Fisher distribution, which arises naturally for data distributed on the unit hypersphere, and derives and analyzes two variants of the Expectation Maximization framework for estimating the mean and concentration parameters of this mixture.
Abstract: Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2 normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multi-variate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces.
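
The abstract notes that spherical kmeans falls out of both EM variants as a special case; below is a small self-contained sketch of that special case, with synthetic data assumed, in which points and cluster centers live on the unit hypersphere and assignments use cosine similarity.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=100, seed=0):
    """k-means with cosine similarity: points are projected onto the unit
    hypersphere and each center is the renormalized mean direction."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)     # unit-normalize rows
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        sim = X @ centers.T                   # cosine similarity to each center
        labels = sim.argmax(axis=1)
        for j in range(k):
            if np.any(labels == j):
                m = X[labels == j].sum(axis=0)
                centers[j] = m / np.linalg.norm(m)        # mean direction
    return labels, centers

rng = np.random.default_rng(2)
docs = rng.normal(size=(300, 50))   # stand-in for rows of a tf-idf matrix
labels, centers = spherical_kmeans(docs, k=4)
```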

Proceedings ArticleDOI
07 Apr 2005
TL;DR: This paper proposes a novel clustering schema, EECS, for wireless sensor networks, which better suits periodical data gathering applications and elects cluster heads with more residual energy through local radio communication while achieving a well-distributed set of cluster heads.
Abstract: Data gathering is a common but critical operation in many applications of wireless sensor networks. Innovative techniques that improve energy efficiency to prolong the network lifetime are highly required. Clustering is an effective topology control approach in wireless sensor networks, which can increase network scalability and lifetime. In this paper, we propose a novel clustering schema EECS for wireless sensor networks, which better suits the periodical data gathering applications. Our approach elects cluster heads with more residual energy through local radio communication while achieving a well-distributed set of cluster heads; furthermore, it introduces a novel method to balance the load among the cluster heads. Simulation results show that EECS outperforms LEACH significantly, prolonging the network lifetime by over 35%.
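
A toy illustration of the kind of residual-energy-based head election the abstract describes; the communication radius, field size, and highest-energy-in-neighbourhood rule are illustrative assumptions, not the exact EECS protocol.

```python
import numpy as np

def elect_cluster_heads(positions, energy, comm_radius):
    """Toy election: within each local radio neighbourhood, the node with
    the most residual energy becomes a cluster head."""
    n = len(positions)
    d = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    heads = []
    for i in range(n):
        neighbours = np.where(d[i] <= comm_radius)[0]   # includes i itself
        # i wins if no neighbour has strictly more residual energy.
        if energy[i] >= energy[neighbours].max():
            heads.append(i)
    return heads

rng = np.random.default_rng(0)
pos = rng.uniform(0, 100, size=(50, 2))   # 50 nodes in a 100m x 100m field
en = rng.uniform(0.2, 1.0, size=50)       # residual energy (toy values, in J)
print(elect_cluster_heads(pos, en, comm_radius=25))
```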

Journal Article
TL;DR: The main analysis tasks of preprocessing, classification, clustering, information extraction and visualization are described, and a number of successful applications of text mining are discussed.
Abstract: The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. In this article, we discuss text mining as a young and interdisciplinary field in the intersection of the related areas information retrieval, machine learning, statistics, computational linguistics and especially data mining. We describe the main analysis tasks: preprocessing, classification, clustering, information extraction and visualization. In addition, we briefly discuss a number of successful applications of text mining.

Proceedings ArticleDOI
17 Oct 2005
TL;DR: It is shown that dense representations outperform equivalent keypoint-based ones on these tasks and that SVM or mutual information based feature selection starting from a dense codebook further improves the performance.
Abstract: Visual codebook based quantization of robust appearance descriptors extracted from local image patches is an effective means of capturing image statistics for texture analysis and scene classification. Codebooks are usually constructed by using a method such as k-means to cluster the descriptor vectors of patches sampled either densely ('textons') or sparsely ('bags of features' based on keypoints or salience measures) from a set of training images. This works well for texture analysis in homogeneous images, but the images that arise in natural object recognition tasks have far less uniform statistics. We show that for dense sampling, k-means over-adapts to this, clustering centres almost exclusively around the densest few regions in descriptor space and thus failing to code other informative regions. This gives suboptimal codes that are no better than using randomly selected centres. We describe a scalable acceptance-radius based clusterer that generates better codebooks and study its performance on several image classification tasks. We also show that dense representations outperform equivalent keypoint-based ones on these tasks and that SVM or mutual information based feature selection starting from a dense codebook further improves the performance.
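
One plausible reading of an "acceptance-radius based clusterer" is an online, leader-style pass over the descriptors, sketched below under that assumption; the running-mean centre update and the radius value are illustrative, not the paper's exact procedure.

```python
import numpy as np

def radius_clusterer(descriptors, radius):
    """Leader-style sketch: a descriptor within `radius` of an existing
    centre reinforces that centre; otherwise it founds a new centre."""
    centers, counts = [], []
    for x in descriptors:
        if centers:
            d = np.linalg.norm(np.array(centers) - x, axis=1)
            j = d.argmin()
            if d[j] <= radius:
                counts[j] += 1
                centers[j] += (x - centers[j]) / counts[j]   # running mean
                continue
        centers.append(x.astype(float))                      # new centre
        counts.append(1)
    return np.array(centers)

desc = np.random.default_rng(0).normal(size=(1000, 128))     # toy SIFT-like vectors
codebook = radius_clusterer(desc, radius=14.0)
print(len(codebook))
```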

Journal ArticleDOI
TL;DR: This work defines both a measure of local community structure and an algorithm that infers the hierarchy of communities that enclose a given vertex by exploring the graph one vertex at a time, and uses this algorithm to extract meaningful local clustering information in the large recommender network of an online retailer.
Abstract: Although the inference of global community structure in networks has recently become a topic of great interest in the physics community, all such algorithms require that the graph be completely known. Here, we define both a measure of local community structure and an algorithm that infers the hierarchy of communities that enclose a given vertex by exploring the graph one vertex at a time. This algorithm runs in time O(k²d) for general graphs when d is the mean degree and k is the number of vertices to be explored. For graphs where exploring a new vertex is time consuming, the running time is linear, O(k). We show that on computer-generated graphs the average behavior of this technique approximates that of algorithms that require global knowledge. As an application, we use this algorithm to extract meaningful local clustering information in the large recommender network of an online retailer.
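
A simplified greedy expansion in the spirit of exploring the graph one vertex at a time; the scoring below (fraction of edges internal to the growing community) is an illustrative stand-in for the paper's local modularity measure.

```python
def local_community(adj, seed_vertex, k):
    """Greedy local exploration: starting from one vertex, repeatedly pull
    in the frontier vertex that maximizes the fraction of internal edges."""
    community = {seed_vertex}
    for _ in range(k - 1):
        frontier = {u for v in community for u in adj[v]} - community
        if not frontier:
            break
        def score(u):
            trial = community | {u}
            internal = sum(1 for v in trial for w in adj[v] if w in trial and v < w)
            boundary = sum(1 for v in trial for w in adj[v] if w not in trial)
            return internal / max(internal + boundary, 1)
        community.add(max(frontier, key=score))
    return community

# adjacency as a dict of sets: a tiny graph with two obvious communities
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(local_community(adj, seed_vertex=0, k=3))   # -> {0, 1, 2}
```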

Journal ArticleDOI
TL;DR: It is shown that the empirically derived solution clearly dominates the traditional P-A-D-R typology of Miles and Snow, and implications and directions for future research are provided.
Abstract: The Miles and Snow strategic type framework is re-examined with respect to interrelationships with several theoretically relevant batteries of variables, including SBU strategic capabilities, environmental uncertainty, and performance. A newly developed constrained, multi-objective, classification methodology is modified to empirically derive an alternative quantitative typology using survey data obtained from 709 firms in three countries (China, Japan, United States). We compare the Miles and Snow typology to the classification empirically derived utilizing this combinatorial optimization clustering procedure. With respect to both variable battery associations and objective statistical criteria, we show that the empirically derived solution clearly dominates the traditional P-A-D-R typology of Miles and Snow. Implications and directions for future research are provided. Copyright © 2004 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A new step is introduced to the k-means clustering process to iteratively update variable weights based on the current partition of data, a formula for weight calculation is proposed, and the convergence theorem of the new clustering process is given.
Abstract: This paper proposes a k-means type clustering algorithm that can automatically calculate variable weights. A new step is introduced to the k-means clustering process to iteratively update variable weights based on the current partition of data, and a formula for weight calculation is proposed. The convergence theorem of the new clustering process is given. The variable weights produced by the algorithm measure the importance of variables in clustering and can be used in variable selection in data mining applications where large and complex real data are often involved. Experimental results on both synthetic and real data have shown that the new algorithm outperformed the standard k-means type algorithms in recovering clusters in data.
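
A hedged sketch of a k-means variant with automatic variable weights; the closed-form weight update below follows the general form used in this family of algorithms, with the exponent beta a user-chosen parameter, and should be read as an illustration rather than the paper's exact algorithm.

```python
import numpy as np

def wkmeans(X, k, beta=3.0, n_iter=50, seed=0):
    """k-means with per-variable weights. After each partition, the
    within-cluster dispersion D_j of each variable drives the update
    w_j = 1 / sum_t (D_j / D_t)^(1/(beta-1)): noisy variables get small weights."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.full(m, 1.0 / m)
    centers = X[rng.choice(n, k, replace=False)]
    for _ in range(n_iter):
        # Weighted assignment step.
        d = ((X[:, None, :] - centers[None]) ** 2 * w ** beta).sum(-1)
        labels = d.argmin(1)
        # Centre update.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
        # Weight update from per-variable within-cluster dispersion.
        D = sum(((X[labels == j] - centers[j]) ** 2).sum(0) for j in range(k))
        D = np.maximum(D, 1e-12)                     # guard zero dispersion
        w = 1.0 / ((D[:, None] / D[None, :]) ** (1.0 / (beta - 1.0))).sum(axis=1)
    return labels, centers, w

rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [4, 0], [0, 4])])
noise = rng.normal(0, 3, size=(150, 3))              # irrelevant variables
labels, centers, w = wkmeans(np.hstack([blobs, noise]), k=3)
print(w.round(3))   # weights should concentrate on the two informative variables
```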

Journal ArticleDOI
TL;DR: A novel document clustering method which aims to cluster the documents into different semantic classes by using locality preserving indexing (LPI), an unsupervised approximation of the supervised linear discriminant analysis (LDA) method, which gives the intuitive motivation of the method.
Abstract: We propose a novel document clustering method which aims to cluster the documents into different semantic classes. The document space is generally of high dimensionality and clustering in such a high dimensional space is often infeasible due to the curse of dimensionality. By using locality preserving indexing (LPI), the documents can be projected into a lower-dimensional semantic space in which the documents related to the same semantics are close to each other. Different from previous document clustering methods based on latent semantic indexing (LSI) or nonnegative matrix factorization (NMF), our method tries to discover both the geometric and discriminating structures of the document space. Theoretical analysis of our method shows that LPI is an unsupervised approximation of the supervised linear discriminant analysis (LDA) method, which gives the intuitive motivation of our method. Extensive experimental evaluations are performed on the Reuters-21578 and TDT2 data sets.
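
A rough sketch of the LPI idea under simplifying assumptions: a cosine nearest-neighbour graph supplies the locality structure, and a small ridge term stands in for the SVD preprocessing the full method uses to keep the generalized eigenproblem well posed.

```python
import numpy as np
from scipy.linalg import eigh

def lpi_embed(X, n_neighbors=5, dim=2, ridge=1e-6):
    """Sketch of locality preserving indexing for a docs-by-terms matrix X:
    build a cosine p-nearest-neighbour graph, then solve the generalized
    eigenproblem  X^T L X a = lam X^T D X a  for the smallest eigenvalues."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T                                      # cosine similarities
    W = np.zeros_like(S)
    for i in range(len(S)):
        nbrs = np.argsort(S[i])[-(n_neighbors + 1):]   # nearest neighbours (incl. self)
        W[i, nbrs] = S[i, nbrs]
    W = np.maximum(W, W.T)                             # symmetrize the graph
    D = np.diag(W.sum(1))
    L = D - W                                          # graph Laplacian
    A = X.T @ L @ X + ridge * np.eye(X.shape[1])
    B = X.T @ D @ X + ridge * np.eye(X.shape[1])
    vals, vecs = eigh(A, B)                            # ascending eigenvalues
    return X @ vecs[:, :dim]                           # low-dimensional semantic space

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(60, 30)))   # stand-in for a tf-idf matrix
Y = lpi_embed(X, n_neighbors=5, dim=3)  # cluster Y with k-means afterwards
```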

Proceedings ArticleDOI
15 Aug 2005
TL;DR: In this paper, clusters generated from the training data provide the basis for data smoothing and neighborhood selection, and the new approach is shown to consistently outperform other state-of-the-art collaborative filtering algorithms.
Abstract: Memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. In the past, the memory-based approach has been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. Alternatively, the model-based approach has been proposed to alleviate these problems, but this approach tends to limit the range of users. In this paper, we present a novel approach that combines the advantages of these two approaches by introducing a smoothing-based method. In our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. As a result, we provide higher accuracy as well as increased efficiency in recommendations. Empirical studies on two datasets (EachMovie and MovieLens) show that our proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms.
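
A toy version of cluster-based smoothing, assuming a ratings matrix with NaNs for unrated items: users are clustered on their (crudely imputed) ratings, and each user's missing entries are filled with the item means of their cluster. The imputation and fallback rules are illustrative, not the paper's exact scheme.

```python
import numpy as np

def cluster_smoothed_ratings(R, k=4, n_iter=30, seed=0):
    """Cluster users, then smooth missing ratings with cluster item means."""
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(R)
    global_mean = np.where(mask, R, 0).sum(0) / np.maximum(mask.sum(0), 1)
    filled = np.where(mask, R, global_mean)        # crude init for clustering
    labels = rng.integers(0, k, len(R))
    for _ in range(n_iter):                        # simple k-means on users
        centers = np.array([filled[labels == j].mean(0) if np.any(labels == j)
                            else filled[rng.integers(len(R))] for j in range(k)])
        labels = ((filled[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    smoothed = R.copy()
    Rz = np.where(mask, R, 0.0)
    for j in range(k):
        members = labels == j
        cnt = mask[members].sum(0)
        item_mean = np.where(cnt > 0, Rz[members].sum(0) / np.maximum(cnt, 1),
                             global_mean)          # fall back to global mean
        for u in np.where(members)[0]:
            miss = ~mask[u]
            smoothed[u, miss] = item_mean[miss]    # smooth unrated items
    return smoothed, labels

rng = np.random.default_rng(1)
R = rng.uniform(1, 5, size=(30, 12))
R[rng.random(R.shape) < 0.6] = np.nan              # 60% of ratings unobserved
smoothed, labels = cluster_smoothed_ratings(R)
```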

Journal ArticleDOI
TL;DR: A unified representation for multiple clusterings is introduced and a probabilistic model of consensus is proposed using a finite mixture of multinomial distributions in a space of clusterings in order to define a new consensus function related to the classical intraclass variance criterion.
Abstract: Clustering ensembles have emerged as a powerful method for improving both the robustness as well as the stability of unsupervised classification solutions. However, finding a consensus clustering from multiple partitions is a difficult problem that can be approached from graph-based, combinatorial, or statistical perspectives. This study extends previous research on clustering ensembles in several respects. First, we introduce a unified representation for multiple clusterings and formulate the corresponding categorical clustering problem. Second, we propose a probabilistic model of consensus using a finite mixture of multinomial distributions in a space of clusterings. A combined partition is found as a solution to the corresponding maximum-likelihood problem using the EM algorithm. Third, we define a new consensus function that is related to the classical intraclass variance criterion using the generalized mutual information definition. Finally, we demonstrate the efficacy of combining partitions generated by weak clustering algorithms that use data projections and random data splits. A simple explanatory model is offered for the behavior of combinations of such weak clustering components. Combination accuracy is analyzed as a function of several parameters that control the power and resolution of component partitions as well as the number of partitions. We also analyze clustering ensembles with incomplete information and the effect of missing cluster labels on the quality of overall consensus. Experimental results demonstrate the effectiveness of the proposed methods on several real-world data sets.

Proceedings ArticleDOI
12 Dec 2005
TL;DR: An energy-efficient unequal clustering mechanism for periodical data gathering in wireless sensor networks is proposed; it partitions the nodes into clusters of unequal size, so that clusters closer to the base station can preserve some energy for inter-cluster data forwarding.
Abstract: Clustering provides an effective way for prolonging the lifetime of a wireless sensor network. Current clustering algorithms usually utilize two techniques, selecting cluster heads with more residual energy and rotating cluster heads periodically, to distribute the energy consumption among nodes in each cluster and extend the network lifetime. However, they rarely consider the hot spots problem in multihop wireless sensor networks. When cluster heads cooperate with each other to forward their data to the base station, the cluster heads closer to the base station are burdened with heavy relay traffic and tend to die early, leaving areas of the network uncovered and causing network partition. To address the problem, we propose an energy-efficient unequal clustering (EEUC) mechanism for periodical data gathering in wireless sensor networks. It partitions the nodes into clusters of unequal size, and clusters closer to the base station have smaller sizes than those farther away from the base station. Thus cluster heads closer to the base station can preserve some energy for the inter-cluster data forwarding. We also propose an energy-aware multihop routing protocol for the inter-cluster communication. Simulation results show that our unequal clustering mechanism balances the energy consumption well among all sensor nodes and achieves an obvious improvement on the network lifetime.
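
A tiny sketch of the unequal-clustering intuition: a cluster head's competition radius shrinks as it gets closer to the base station, so nearby clusters stay small and their heads keep energy for relaying. The linear form and the constant c are illustrative assumptions, not the paper's exact parameterization.

```python
def competition_radius(d_to_bs, d_min, d_max, r_max, c=0.5):
    """Radius grows linearly with distance to the base station: heads
    near the base station claim smaller clusters."""
    return (1 - c * (d_max - d_to_bs) / (d_max - d_min)) * r_max

# nodes between 50m and 200m from the base station, max cluster radius 40m
for d in (50, 100, 150, 200):
    print(d, round(competition_radius(d, 50, 200, 40), 1))   # 20.0 ... 40.0
```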

Journal ArticleDOI
TL;DR: The experimental evaluation shows that, contrary to the common belief, partitional algorithms always lead to better solutions than agglomerative algorithms, making them ideal for clustering large document collections due not only to their relatively low computational requirements but also to their higher clustering quality.
Abstract: Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, clustering algorithms that build meaningful hierarchies out of large document collections are ideal tools for their interactive visualization and exploration as they provide data-views that are consistent, predictable, and at different levels of granularity. This paper focuses on document clustering algorithms that build such hierarchical solutions and (i) presents a comprehensive study of partitional and agglomerative algorithms that use different criterion functions and merging schemes, and (ii) presents a new class of clustering algorithms called constrained agglomerative algorithms, which combine features from both partitional and agglomerative approaches that allows them to reduce the early-stage errors made by agglomerative methods and hence improve the quality of clustering solutions. The experimental evaluation shows that, contrary to the common belief, partitional algorithms always lead to better solutions than agglomerative algorithms, making them ideal for clustering large document collections due not only to their relatively low computational requirements but also to their higher clustering quality. Furthermore, the constrained agglomerative methods consistently lead to better solutions than agglomerative methods alone and for many cases they outperform partitional methods, as well.

Proceedings Article
01 Jan 2005
TL;DR: This paper shows how optimizing the Q function can be reformulated as a spectral relaxation problem and proposes two new spectral clustering algorithms that seek to maximize Q, and indicates that the new algorithms are efficient and effective at finding both good clusterings and the appropriate number of clusters across a variety of real-world graph data sets.
Abstract: Clustering nodes in a graph is a useful general technique in data mining of large network data sets. In this context, Newman and Girvan [9] recently proposed an objective function for graph clustering called the Q function which allows automatic selection of the number of clusters. Empirically, higher values of the Q function have been shown to correlate well with good graph clusterings. In this paper we show how optimizing the Q function can be reformulated as a spectral relaxation problem and propose two new spectral clustering algorithms that seek to maximize Q. Experimental results indicate that the new algorithms are efficient and effective at finding both good clusterings and the appropriate number of clusters across a variety of real-world graph data sets. In addition, the spectral algorithms are much faster for large sparse graphs, scaling roughly linearly with the number of nodes n in the graph, compared to O(n³) for previous clustering algorithms using the Q function.
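
For reference, the quantity being maximized: a minimal numpy implementation of the Newman-Girvan Q on a toy graph (the spectral relaxation itself is beyond a short sketch).

```python
import numpy as np

def modularity(A, labels):
    """Newman-Girvan Q for an undirected graph with adjacency matrix A:
    Q = (1/2m) * sum_ij (A_ij - k_i * k_j / 2m) * [c_i == c_j]."""
    k = A.sum(axis=1)                 # node degrees
    two_m = k.sum()                   # twice the number of edges
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

# two triangles joined by a single edge: a clear two-community split
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))   # ~0.357
```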

Journal ArticleDOI
TL;DR: The key idea is to view clustering as a supervised classification problem, in which the “true” class labels are estimated, and the resulting “prediction strength” measure assesses how many groups can be predicted from the data, and how well.
Abstract: This article proposes a new quantity for assessing the number of groups or clusters in a dataset. The key idea is to view clustering as a supervised classification problem, in which we must also estimate the “true” class labels. The resulting “prediction strength” measure assesses how many groups can be predicted from the data, and how well. In the process, we develop novel notions of bias and variance for unlabeled data. Prediction strength performs well in simulation studies, and we apply it to clusters of breast cancer samples from a DNA microarray study. Finally, some consistency properties of the method are established.
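
A condensed sketch of the prediction-strength computation, assuming k-means as the clustering procedure and a single train/test split (the article averages over folds): it measures how often pairs sharing a test cluster also share the cluster they fall into under the training centroids.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def prediction_strength(X, k, seed=0):
    """Minimum, over test clusters, of the pairwise co-membership rate
    when test points are classified by the training centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    tr, te = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tr)
    te_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(te)
    tr_assign = km_tr.predict(te)       # test points under training centroids
    strengths = []
    for c in range(k):
        members = np.where(te_labels == c)[0]
        n_c = len(members)
        if n_c < 2:
            continue
        same = tr_assign[members][:, None] == tr_assign[members][None, :]
        strengths.append((same.sum() - n_c) / (n_c * (n_c - 1)))
    return min(strengths)

X, _ = make_blobs(n_samples=400, centers=3, random_state=1)
for k in (2, 3, 4, 5):
    print(k, round(prediction_strength(X, k), 3))   # pick largest k with high strength
```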

Journal ArticleDOI
TL;DR: A novel approach to the automatic acquisition of taxonomies or concept hierarchies from a text corpus based on Formal Concept Analysis, which model the context of a certain term as a vector representing syntactic dependencies which are automatically acquired from the text corpus with a linguistic parser.
Abstract: We present a novel approach to the automatic acquisition of taxonomies or concept hierarchies from a text corpus. The approach is based on Formal Concept Analysis (FCA), a method mainly used for the analysis of data, i.e. for investigating and processing explicitly given information. We follow Harris' distributional hypothesis and model the context of a certain term as a vector representing syntactic dependencies which are automatically acquired from the text corpus with a linguistic parser. On the basis of this context information, FCA produces a lattice that we convert into a special kind of partial order constituting a concept hierarchy. The approach is evaluated by comparing the resulting concept hierarchies with hand-crafted taxonomies for two domains: tourism and finance. We also directly compare our approach with hierarchical agglomerative clustering as well as with Bi-Section-KMeans as an instance of a divisive clustering algorithm. Furthermore, we investigate the impact of using different measures weighting the contribution of each attribute as well as of applying a particular smoothing technique to cope with data sparseness.

Journal ArticleDOI
TL;DR: A hierarchical clustering method that minimizes a joint between-within measure of distance between clusters, by defining a cluster distance and objective function in terms of Euclidean distance, or any power of Euclidean distance in the interval (0,2].
Abstract: We propose a hierarchical clustering method that minimizes a joint between-within measure of distance between clusters. This method extends Ward's minimum variance method, by defining a cluster distance and objective function in terms of Euclidean distance, or any power of Euclidean distance in the interval (0,2]. Ward's method is obtained as the special case when the power is 2. The ability of the proposed extension to identify clusters with nearly equal centers is an important advantage over geometric or cluster center methods. The between-within distance statistic determines a clustering method that is ultrametric and space-dilating; and for powers strictly less than 2, determines a consistent test of homogeneity and a consistent clustering procedure. The clustering procedure is applied to three problems: classification of tumors by microarray gene expression data, classification of dermatology diseases by clinical and histopathological attributes, and classification of simulated multivariate normal data.
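
The between-within cluster distance at the heart of the method, sketched for Euclidean data; alpha = 2 recovers Ward's criterion, while powers strictly less than 2 give the consistent extension the abstract describes.

```python
import numpy as np

def e_distance(A, B, alpha=1.0):
    """Joint between-within distance between clusters A and B:
    e(A,B) = (n*m/(n+m)) * (2*mean d(a,b) - mean d(a,a') - mean d(b,b')),
    with d the Euclidean distance raised to alpha in (0, 2]."""
    def mean_dist(X, Y):
        return (np.linalg.norm(X[:, None] - Y[None], axis=-1) ** alpha).mean()
    n, m = len(A), len(B)
    return (n * m / (n + m)) * (2 * mean_dist(A, B) - mean_dist(A, A) - mean_dist(B, B))

rng = np.random.default_rng(0)
A = rng.normal(0, 1, size=(40, 2))
B = rng.normal(3, 1, size=(40, 2))          # a well-separated cluster
print(e_distance(A, B), e_distance(A, A))   # large positive vs. exactly 0
```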

Journal ArticleDOI
Juan Merlo1, Basile Chaix, Min Yang, John Lynch, Lennart Råstam 
TL;DR: The statistical idea of clustering emerges as appropriate for quantifying "contextual phenomena", which are of central relevance in social epidemiology.
Abstract: Study objective: This didactical essay is directed to readers disposed to approach multilevel regression analysis (MLRA) in a more conceptual than mathematical way. However, it specifically develops an epidemiological vision on multilevel analysis with particular emphasis on measures of health variation (for example, intraclass correlation). Such measures have been underused in the literature as compared with more traditional measures of association (for example, regression coefficients) in the investigation of contextual determinants of health. A link is provided, which will be comprehensible to epidemiologists, between MLRA and social epidemiological concepts, particularly between the statistical idea of clustering and the concept of contextual phenomenon. Design and participants: The study uses an example based on hypothetical data on systolic blood pressure (SBP) from 25 000 people living in 39 neighbourhoods. As the focus is on the empty MLRA model, the study does not use any independent variable but focuses mainly on SBP variance between people and between neighbourhoods. Results: The intraclass correlation (ICC = 0.08) indicated appreciable clustering of individual SBP within the neighbourhoods, showing that 8% of the total individual differences in SBP occurred at the neighbourhood level and might be attributable to contextual neighbourhood factors or to the different composition of neighbourhoods. Conclusions: The statistical idea of clustering emerges as appropriate for quantifying "contextual phenomena" that are of central relevance in social epidemiology. Both concepts convey that people from the same neighbourhood are more similar to each other than to people from different neighbourhoods with respect to the health outcome variable.
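
A one-line worked version of the variance partition behind the abstract's ICC of 0.08; the variance components are hypothetical numbers chosen to reproduce that value.

```python
# Intraclass correlation from an empty two-level model:
# ICC = between-neighbourhood variance / total variance.
var_between = 16.0   # illustrative variance components; chosen so that
var_within = 184.0   # the hypothetical SBP data yield ICC = 0.08
icc = var_between / (var_between + var_within)
print(icc)   # 0.08 -> 8% of SBP differences lie at the neighbourhood level
```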

Journal Article
TL;DR: This work presents the Relevant Component Analysis algorithm, which is a simple and efficient algorithm for learning a Mahalanobis metric, and shows that RCA is the solution of an interesting optimization problem, founded on an information theoretic basis.
Abstract: Many learning algorithms use a metric defined over the input space as a principal tool, and their performance critically depends on the quality of this metric. We address the problem of learning metrics using side-information in the form of equivalence constraints. Unlike labels, we demonstrate that this type of side-information can sometimes be automatically obtained without the need of human intervention. We show how such side-information can be used to modify the representation of the data, leading to improved clustering and classification. Specifically, we present the Relevant Component Analysis (RCA) algorithm, which is a simple and efficient algorithm for learning a Mahalanobis metric. We show that RCA is the solution of an interesting optimization problem, founded on an information theoretic basis. If dimensionality reduction is allowed within RCA, we show that it is optimally accomplished by a version of Fisher's linear discriminant that uses constraints. Moreover, under certain Gaussian assumptions, RCA can be viewed as a Maximum Likelihood estimation of the within class covariance matrix. We conclude with extensive empirical evaluations of RCA, showing its advantage over alternative methods.
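
A short sketch of the RCA computation under the usual chunklet formulation: equivalence constraints group points into chunklets, the pooled within-chunklet covariance is estimated, and its inverse square root whitens the data; the toy chunklets below are illustrative.

```python
import numpy as np
from scipy.linalg import inv, sqrtm

def rca_transform(chunklets):
    """Center each chunklet (a set of points known to share a label),
    pool the inner covariance, and whiten with its inverse square root.
    Distances in the transformed space correspond to the learned
    Mahalanobis metric C^-1."""
    centered = np.vstack([c - c.mean(axis=0) for c in chunklets])
    C = centered.T @ centered / len(centered)   # within-chunklet covariance
    return np.real(inv(sqrtm(C)))               # whitening transform W

rng = np.random.default_rng(0)
chunklets = [rng.normal(mu, [1.0, 0.1], size=(20, 2)) for mu in ([0, 0], [5, 0])]
W = rca_transform(chunklets)
X_new = chunklets[0] @ W.T                      # data under the learned metric
```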

Proceedings ArticleDOI
04 Apr 2005
TL;DR: This work proposes an unequal clustering size (UCS) model for network organization, which can lead to more uniform energy dissipation among the cluster head nodes, thus increasing network lifetime, and expands this approach to homogeneous sensor networks.
Abstract: Organizing wireless sensor networks into clusters enables the efficient utilization of the limited energy resources of the deployed sensor nodes. However, the problem of unbalanced energy consumption exists, and it is tightly bound to the role and to the location of a particular node in the network. If the network is organized into heterogeneous clusters, where some more powerful nodes take on the cluster head role to control network operation, it is important to ensure that energy dissipation of these cluster head nodes is balanced. Oftentimes the network is organized into clusters of equal size, but such equal clustering results in an unequal load on the cluster head nodes. Instead, we propose an unequal clustering size (UCS) model for network organization, which can lead to more uniform energy dissipation among the cluster head nodes, thus increasing network lifetime. Also, we expand this approach to homogeneous sensor networks and show that UCS can lead to more uniform energy dissipation in a homogeneous network as well.