Book

Algorithms for clustering data

01 Jan 1988-
About: The article was published on 1988-01-01 and is currently open access. It has received 8586 citations to date. The article focuses on the topics: Cluster analysis & Correlation clustering.
Citations
Book ChapterDOI
27 Sep 2010
TL;DR: A methodology that integrates expression data with signalling pathways is proposed as a necessary step for pattern recognition within CC signalling pathways; the results provide a top-down interpretation approach in which biologists interact with the patterns recognized inside signalling pathways.
Abstract: Cervical Cancer (CC) results from infection with high-risk human papillomaviruses. mRNA microarray expression data provide biologists with evidence of cellular compensatory gene expression mechanisms in CC progression. Pattern recognition of signalling pathways through expression data can reveal interesting insights for the understanding of CC. Consequently, gene expression data should be submitted to different pre-processing tasks. In this paper we propose a methodology based on the integration of expression data and signalling pathways as a necessary step for pattern recognition within CC signalling pathways. Our results provide a top-down interpretation approach in which biologists interact with the patterns recognized inside signalling pathways.
Proceedings ArticleDOI
29 Jul 2010
TL;DR: The fuzzy C-means algorithm is used as the deterministic algorithm for ant optimization, and the partitions obtained from the ant-based algorithm were better optimized than those from randomly initialized hard C-means.
Abstract: Ant-based techniques draw biological inspiration from the behavior of these social insects. Data clustering techniques are classification algorithms that have a wide range of applications, from biology to image processing and data presentation. Since real-life ants do perform clustering and sorting of objects among their many activities, we expect that a study of ant colonies can provide new insights for clustering techniques. The aim of clustering is to separate a set of data points into self-similar groups such that points belonging to the same group are more similar to one another than to points belonging to different groups. Each group is called a cluster. Data may be clustered using an iterative version of the fuzzy C-means (FCM) algorithm, but the drawback of the FCM algorithm is that it is very sensitive to cluster center initialization because the search is based on the hill-climbing heuristic. The ant-based algorithm provides a relevant partition of data without any knowledge of the initial cluster centers. In the past, researchers have used ant-based algorithms that are based on stochastic principles coupled with the k-means algorithm. This work proposes using the fuzzy C-means algorithm as the deterministic algorithm for ant optimization. The proposed model is used after reformulation, and the partitions obtained from the ant-based algorithm were better optimized than those from randomly initialized hard C-means. The proposed technique executes the ant fuzzy algorithm in parallel for multiple clusters, which would enhance the speed and accuracy of cluster formation for the required system problem.
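The fuzzy C-means iteration mentioned in the abstract above can be sketched as follows. This is a minimal illustration of the textbook FCM updates only, not the ant-based method of the cited paper; the function names `memberships` and `fuzzy_c_means`, the fuzzifier default `m=2.0`, and the stopping tolerance are assumptions made for this example. The initial centers are passed in explicitly, since the abstract identifies sensitivity to initialization as FCM's main drawback.

```python
import math

def memberships(points, centers, m):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    rows = []
    for x in points:
        dists = [math.dist(x, c) for c in centers]
        if 0.0 in dists:
            # Point sits exactly on one or more centers: share membership among them.
            hits = dists.count(0.0)
            rows.append([1.0 / hits if d == 0.0 else 0.0 for d in dists])
        else:
            # u_ik is proportional to d_ik^(-2/(m-1)), normalized over clusters.
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            rows.append([v / total for v in inv])
    return rows

def fuzzy_c_means(points, initial_centers, m=2.0, max_iter=100, tol=1e-6):
    """Alternate membership and center updates until centers stop moving (m > 1)."""
    centers = [list(c) for c in initial_centers]
    dim = len(points[0])
    for _ in range(max_iter):
        u = memberships(points, centers, m)
        new_centers = []
        for i, old in enumerate(centers):
            w = [row[i] ** m for row in u]
            total = sum(w)
            if total == 0.0:
                # No point has any membership in this cluster; leave it in place.
                new_centers.append(old)
                continue
            # Each center is the mean of the points weighted by u^m.
            new_centers.append(
                [sum(wk * x[d] for wk, x in zip(w, points)) / total for d in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)
```

Calling `fuzzy_c_means(points, initial_centers)` with different `initial_centers` on the same data is a simple way to observe the initialization sensitivity the abstract describes.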
Posted Content
TL;DR: The observation that network nodes can be grouped, based on their utility values, into clusters that portray different delivery capabilities is exploited to transform the basic forwarding strategy so that a packet is forwarded through clusters of increasing delivery capability.
Abstract: Dynamic replication is a widespread multi-copy routing approach for efficiently coping with the intermittent connectivity in mobile opportunistic networks. According to it, a node forwards a message replica to an encountered node based on a utility value that captures the latter's fitness for delivering the message to the destination. The popularity of the approach stems from its flexibility to operate effectively in networks with diverse characteristics without requiring special customization. Nonetheless, its drawback is the tendency to produce a high number of replicas that consume limited resources such as energy and storage. To tackle the problem, we make the observation that network nodes can be grouped, based on their utility values, into clusters that portray different delivery capabilities. We exploit this finding to transform the basic forwarding strategy, which is to move a packet using nodes of increasing utility, and instead forward it through clusters of increasing delivery capability. The new strategy works in synergy with the basic dynamic replication algorithms and is fully configurable, in the sense that it can be used with virtually any utility function. We also extend our approach to work with two utility functions at the same time, a feature that is especially efficient in mobile networks that exhibit social characteristics. By conducting experiments in a wide set of real-life networks, we empirically show that our method is robust in reducing the overall number of replicas in networks with diverse connectivity characteristics without at the same time hindering delivery efficiency.
DissertationDOI
01 Jan 2009
TL;DR: An adjusted k-means approach and a misclassification error score criterion are proposed and applied to real data on the phylogeny of the owlet-nightjars, showing that the phylogenetic tree constructed by Dumbacher et al. (2003) attains the minimum misclassification error score compared with several other methods.
Abstract: The reconstruction of phylogenetic trees is one of the most important and interesting problems in evolutionary studies. Many methods for constructing phylogenetic trees have been proposed in the literature. Each method has its own criterion and is based on a selected evolutionary model. However, the topologies of trees constructed by different methods may be quite different. Topology errors may be due to an unsuitable criterion or evolutionary model. Since many trees can be built by different methods, we are interested in selecting a valid tree. In this study, we propose an adjusted k-means approach and a misclassification error score criterion to solve the problem. This approach evaluates the trees by looking at the features of the data from a statistical point of view. It can provide an objective criterion for selecting a valid tree from a statistical perspective. We apply the approach to real data on the phylogeny of the owlet-nightjars. The results show that the phylogenetic tree constructed by Dumbacher et al. (2003) attains the minimum misclassification error score compared with several other methods.
01 Jan 2006
TL;DR: The variance of spatial topological relationships, and how to restore these relationships from the geometric relationships between points, are studied, and an algorithm for approximating outline curves from messy data is presented.
Abstract: The variance of spatial topological relationships, and how to restore these relationships from the geometric relationships between points, are studied. An algorithm for approximating outline curves from messy data is presented. Test results in a virtual navigation environment are also given.
References
Book
01 Feb 1975

6,068 citations

Book
01 Dec 1973

5,169 citations

Journal ArticleDOI
TL;DR: In this paper, the basic problem of interconnecting a given set of terminals with a shortest possible network of direct links is considered, and a set of simple and practical procedures is given for solving this problem both graphically and computationally.
Abstract: The basic problem considered is that of interconnecting a given set of terminals with a shortest possible network of direct links. Simple and practical procedures are given for solving this problem both graphically and computationally. It develops that these procedures also provide solutions for a much broader class of problems, containing other examples of practical interest.

4,395 citations
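The shortest-connection problem in the entry above is what is now usually called the minimum spanning tree problem. One common way to solve it computationally is to grow the network one terminal at a time, always adding the cheapest link that reaches a terminal not yet connected. The sketch below assumes terminals are points in the plane with Euclidean link lengths; the function name `shortest_connection_network` is hypothetical, and the heap-based bookkeeping is a modern convenience, not a claim about the procedures in the cited paper.

```python
import heapq
import math

def shortest_connection_network(points):
    """Return (total_length, links) for a minimum spanning tree over the points.

    links is a list of (i, j) index pairs; terminal 0 seeds the tree.
    """
    n = len(points)
    if n < 2:
        return 0.0, []
    in_tree = [False] * n
    in_tree[0] = True
    # Heap entries: (link length, terminal already in the tree, candidate terminal).
    heap = [(math.dist(points[0], points[j]), 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    links, total = [], 0.0
    while heap and len(links) < n - 1:
        length, i, j = heapq.heappop(heap)
        if in_tree[j]:
            continue  # stale entry: j was connected through a shorter link
        in_tree[j] = True
        links.append((i, j))
        total += length
        # Offer links from the newly connected terminal to every outsider.
        for k in range(n):
            if not in_tree[k]:
                heapq.heappush(heap, (math.dist(points[j], points[k]), j, k))
    return total, links
```

The same tree underlies single-linkage clustering: removing the longest links of the minimum spanning tree splits the terminals into single-linkage clusters.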

Journal ArticleDOI
TL;DR: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets containing 2, 3, 4, or 5 distinct nonoverlapping clusters, which were analyzed by four hierarchical clustering methods to provide a variety of clustering solutions.
Abstract: A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level. Thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. Thus, the latter group of rules would appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data dependent.

3,551 citations

Journal ArticleDOI
TL;DR: Sibson gives an $O(n^2)$ algorithm for single-linkage clustering, and proves that this algorithm achieves the theoretically optimal lower time bound for obtaining a single-linkage dendrogram.
Abstract: Main point: Sibson gives an $O(n^2)$ algorithm for single-linkage clustering, and proves that this algorithm achieves the theoretically optimal lower time bound for obtaining a single-linkage dendrogram. This improves upon the naive $O(n^3)$ implementation of single-linkage clustering. A single-linkage dendrogram is a tree, where each level of the tree corresponds to a different threshold dissimilarity measure $h$. The nodes of a dataset are grouped into "equivalence classes" $c(h)$ at each level of the dendrogram, where two classes $C_i$ and $C_j$ are merged if there is a pair of "OTUs" (vertices) $v_i \in C_i$ and $v_j \in C_j$ such that the dissimilarity measure between $v_i$ and $v_j$ is less than $h$, or $D(v_i, v_j) < h$. For example, consider a set of 10 vertices $v_1, \ldots, v_{10}$ for which the dissimilarity matrix $D$ is given below, with $D_{ij}$ equal to the dissimilarity between $v_i$ and $v_j$. Suppose we take four cutoff dissimilarity measures $h_1, h_2, h_3, h_4$ and produce the dendrogram according to these thresholds. An example illustrating how the 10 vertices are grouped into equivalence classes at each level is shown in Figure 1. Since no dissimilarity is at or below 1, each vertex or "OTU" is its own equivalence class at the level corresponding to $h_1 = 1$. At the next level, however, we see that some classes have been merged together because several dissimilarity measures are below $h_2 = 2$. We can see that $c(h_2)$ consists of 6 equivalence classes, $c(h_3)$ has 3 equivalence classes, and $c(h_4 = 4)$ aggregates all the vertices into one equivalence class. In single-linkage clustering, the number of levels in the tree is determined by the nearest-neighbor criterion: at each level, at least one new merge is made between two clusters, and the merge is made for clusters $C_i$ and $C_j$ if the minimal distance between vertices $v_i \in C_i$ and $v_j \in C_j$ is the smallest such distance across all the clusters.
In other words, the nearest neighbors between clusters $C_j$ and $C_i$ are found, and if these neighbors are closer than all the other nearest-neighbor pairs, then $C_i$ and C …

1,208 citations
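The threshold view of single linkage described in the entry above (two classes merge when some pair of their vertices has dissimilarity below $h$) can be sketched with a union-find structure. This is a minimal illustration of the equivalence classes $c(h)$ at a single threshold, not Sibson's $O(n^2)$ algorithm; the function name `equivalence_classes` is hypothetical, and vertices are indexed from 0.

```python
def equivalence_classes(D, h):
    """Group vertices 0..n-1 into single-linkage classes at threshold h.

    D is a symmetric dissimilarity matrix (list of lists). Vertices i and j
    are linked when D[i][j] < h, and classes are the connected groups.
    """
    n = len(D)
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the class representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if D[i][j] < h:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri  # merge the two classes

    classes = {}
    for v in range(n):
        classes.setdefault(find(v), []).append(v)
    return sorted(classes.values())
```

Evaluating the function at an increasing sequence of thresholds reproduces the levels of the dendrogram, although doing so repeats work that a dedicated algorithm avoids.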