
Showing papers on "Cluster analysis published in 2009"


Proceedings ArticleDOI
Jia Deng1, Wei Dong1, Richard Socher1, Li-Jia Li1, Kai Li1, Li Fei-Fei1 
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations


Journal ArticleDOI
TL;DR: A thorough exposition of community structure, or clustering, is attempted, from the definition of the main elements of the problem, to the presentation of most methods developed, with a special focus on techniques designed by statistical physicists.
Abstract: The modern science of networks has brought significant advances to our understanding of complex systems. One of the most relevant features of graphs representing real systems is community structure, or clustering, i.e. the organization of vertices in clusters, with many edges joining vertices of the same cluster and comparatively few edges joining vertices of different clusters. Such clusters, or communities, can be considered as fairly independent compartments of a graph, playing a role similar to that of, e.g., the tissues or the organs in the human body. Detecting communities is of great importance in sociology, biology and computer science, disciplines where systems are often represented as graphs. This problem is very hard and not yet satisfactorily solved, despite the huge effort of a large interdisciplinary community of scientists working on it over the past few years. We will attempt a thorough exposition of the topic, from the definition of the main elements of the problem, to the presentation of most methods developed, with a special focus on techniques designed by statistical physicists, from the discussion of crucial issues like the significance of clustering and how methods should be tested and compared against each other, to the description of applications to real networks.

9,057 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the new models developed for the structure program allow structure to be detected at lower levels of divergence, or with less data, than the original structure models or principal components methods, and that they are not biased towards detecting structure when it is not present.
Abstract: Genetic clustering algorithms require a certain amount of data to produce informative results. In the common situation that individuals are sampled at several locations, we show how sample group information can be used to achieve better results when the amount of data is limited. New models are developed for the structure program, both for the cases of admixture and no admixture. These models work by modifying the prior distribution for each individual's population assignment. The new prior distributions allow the proportion of individuals assigned to a particular cluster to vary by location. The models are tested on simulated data, and illustrated using microsatellite data from the CEPH Human Genome Diversity Panel. We demonstrate that the new models allow structure to be detected at lower levels of divergence, or with less data, than the original structure models or principal components methods, and that they are not biased towards detecting structure when it is not present. These models are implemented in a new version of structure which is freely available online at http://pritch.bsd.uchicago.edu/structure.html.

3,105 citations


Journal ArticleDOI
TL;DR: dlib-ml contains an extensible linear algebra toolkit with built in BLAS support, and implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking.
Abstract: There are many excellent toolkits which provide support for developing machine learning software in Python, R, Matlab, and similar environments. Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language. Towards this end, dlib-ml contains an extensible linear algebra toolkit with built in BLAS support. It also houses implementations of algorithms for performing inference in Bayesian networks and kernel-based methods for classification, regression, clustering, anomaly detection, and feature ranking. To enable easy use of these tools, the entire library has been developed with contract programming, which provides complete and precise documentation as well as powerful debugging tools.

2,701 citations


Book
12 Oct 2009
TL;DR: This book provides a broad survey of models and efficient algorithms for Nonnegative Matrix Factorization (NMF), including NMFs various extensions and modifications, especially Nonnegative Tensor Factorizations (NTF) and Nonnegative Tucker Decompositions (NTD).
Abstract: This book provides a broad survey of models and efficient algorithms for Nonnegative Matrix Factorization (NMF), including its various extensions and modifications, especially Nonnegative Tensor Factorizations (NTF) and Nonnegative Tucker Decompositions (NTD). NMF/NTF and their extensions are increasingly used as tools in signal and image processing and data analysis, having garnered interest due to their capability to provide new insights and relevant information about the complex latent relationships in experimental data sets. It is suggested that NMF can provide meaningful components with physical interpretations; for example, in bioinformatics, NMF and its extensions have been successfully applied to gene expression, sequence analysis, the functional characterization of genes, clustering and text mining. As such, the authors focus on the algorithms that are most useful in practice, looking at those that are fastest, most robust, and most suitable for large-scale models. Key features: Acts as a single-source reference guide to NMF, collating information that is widely dispersed in the current literature, including the authors' own recently developed techniques in the subject area. Uses generalized cost functions such as Bregman, Alpha and Beta divergences to present practical implementations of several types of robust algorithms, in particular Multiplicative, Alternating Least Squares, Projected Gradient and Quasi-Newton algorithms. Provides a comparative analysis of the different methods in order to identify approximation error and complexity. Includes pseudo-code and optimized MATLAB source code for almost all algorithms presented in the book. The increasing interest in nonnegative matrix and tensor factorizations, as well as decompositions and sparse representation of data, will ensure that this book is essential reading for engineers, scientists, researchers, industry practitioners and graduate students across signal and image processing; neuroscience; data mining and data analysis; computer science; bioinformatics; speech processing; biomedical engineering; and multimedia.

2,136 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed algorithm requires significantly less computation time than partitioning around medoids while achieving comparable performance.
Abstract: This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm and tests several methods for selecting initial medoids. The proposed algorithm calculates the distance matrix once and uses it for finding new medoids at every iterative step. To evaluate the proposed algorithm, we use some real and artificial data sets and compare the results with those of other algorithms in terms of the adjusted Rand index. Experimental results show that the proposed algorithm requires significantly less computation time than partitioning around medoids (PAM) while achieving comparable performance.

1,629 citations


Journal ArticleDOI
TL;DR: The authors showed that the revised ALE‐algorithm overcomes conceptual problems of former meta‐analyses and increases the specificity of the ensuing results without losing the sensitivity of the original approach, and may provide a methodologically improved tool for coordinate‐based meta‐analyses on functional imaging data.
Abstract: A widely used technique for coordinate-based meta-analyses of neuroimaging data is activation likelihood estimation (ALE). ALE assesses the overlap between foci based on modeling them as probability distributions centered at the respective coordinates. In this Human Brain Project/Neuroinformatics research, the authors present a revised ALE algorithm addressing drawbacks associated with former implementations. The first change pertains to the size of the probability distributions, which had to be specified by the user. To provide a more principled solution, the authors analyzed fMRI data of 21 subjects, each normalized into MNI space using nine different approaches. This analysis provided quantitative estimates of between-subject and between-template variability for 16 functionally defined regions, which were then used to explicitly model the spatial uncertainty associated with each reported coordinate. Secondly, instead of testing for an above-chance clustering between foci, the revised algorithm assesses above-chance clustering between experiments. The spatial relationship between foci in a given experiment is now assumed to be fixed and ALE results are assessed against a null-distribution of random spatial association between experiments. Critically, this modification entails a change from fixed- to random-effects inference in ALE analysis allowing generalization of the results to the entire population of studies analyzed. By comparative analysis of real and simulated data, the authors showed that the revised ALE-algorithm overcomes conceptual problems of former meta-analyses and increases the specificity of the ensuing results without losing the sensitivity of the original approach. It may thus provide a methodologically improved tool for coordinate-based meta-analyses on functional imaging data.

1,609 citations


Proceedings ArticleDOI
01 Sep 2009
TL;DR: A system that can match and reconstruct 3D scenes from extremely large collections of photographs such as those found by searching for a given city on Internet photo sharing sites and is designed to scale gracefully with both the size of the problem and the amount of available computation.
Abstract: We present a system that can match and reconstruct 3D scenes from extremely large collections of photographs such as those found by searching for a given city (e.g., Rome) on Internet photo sharing sites. Our system uses a collection of novel parallel distributed matching and reconstruction algorithms, designed to maximize parallelism at each stage in the pipeline and minimize serialization bottlenecks. It is designed to scale gracefully with both the size of the problem and the amount of available computation. We have experimented with a variety of alternative algorithms at each stage of the pipeline and report on which ones work best in a parallel computing environment. Our experimental results demonstrate that it is now possible to reconstruct cities consisting of 150K images in less than a day on a cluster with 500 compute cores.

1,454 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This work proposes a method based on sparse representation (SR) to cluster data drawn from multiple low-dimensional linear or affine subspaces embedded in a high-dimensional space and applies this method to the problem of segmenting multiple motions in video.
Abstract: We propose a method based on sparse representation (SR) to cluster data drawn from multiple low-dimensional linear or affine subspaces embedded in a high-dimensional space. Our method is based on the fact that each point in a union of subspaces has a SR with respect to a dictionary formed by all other data points. In general, finding such a SR is NP hard. Our key contribution is to show that, under mild assumptions, the SR can be obtained `exactly' by using l1 optimization. The segmentation of the data is obtained by applying spectral clustering to a similarity matrix built from this SR. Our method can handle noise, outliers as well as missing data. We apply our subspace clustering algorithm to the problem of segmenting multiple motions in video. Experiments on 167 video sequences show that our approach significantly outperforms state-of-the-art methods.

1,411 citations


Journal ArticleDOI
TL;DR: This survey tries to clarify the different problem definitions related to subspace clustering in general; the specific difficulties encountered in this field of research; the varying assumptions, heuristics, and intuitions forming the basis of different approaches; and how several prominent solutions tackle different problems.
Abstract: As a prolific research area in data mining, subspace clustering and related problems induced a vast quantity of proposed solutions. However, many publications compare a new proposition—if at all—with one or two competitors, or even with a so-called “naive” ad hoc solution, but fail to clarify the exact problem definition. As a consequence, even if two solutions are thoroughly compared experimentally, it will often remain unclear whether both solutions tackle the same problem or, if they do, whether they agree in certain tacit assumptions and how such assumptions may influence the outcome of an algorithm. In this survey, we try to clarify: (i) the different problem definitions related to subspace clustering in general; (ii) the specific difficulties encountered in this field of research; (iii) the varying assumptions, heuristics, and intuitions forming the basis of different approaches; and (iv) how several prominent solutions tackle different problems.

1,206 citations


Journal ArticleDOI
TL;DR: In this article, it is shown that it is easy to calculate standard errors that are robust to simultaneous correlation across both firms and time, and that any statistical package with a clustering command can be used to easily calculate these standard errors.
Abstract: When estimating finance panel regressions, it is common practice to adjust standard errors for correlation either across firms or across time. These procedures are valid only if the residuals are correlated either across time or across firms, but not across both. This note shows that it is very easy to calculate standard errors that are robust to simultaneous correlation across both firms and time. The covariance estimator is equal to the estimator that clusters by firm, plus the estimator that clusters by time, minus the usual heteroskedasticity-robust OLS covariance matrix. Any statistical package with a clustering command can be used to easily calculate these standard errors.

Proceedings ArticleDOI
01 Sep 2009
TL;DR: A new dataset, H3D, is built of annotations of humans in 2D photographs with 3D joint information, inferred using anthropometric constraints, to address the classic problems of detection, segmentation and pose estimation of people in images with a novel definition of a part, a poselet.
Abstract: We address the classic problems of detection, segmentation and pose estimation of people in images with a novel definition of a part, a poselet. We postulate two criteria (1) It should be easy to find a poselet given an input image (2) it should be easy to localize the 3D configuration of the person conditioned on the detection of a poselet. To permit this we have built a new dataset, H3D, of annotations of humans in 2D photographs with 3D joint information, inferred using anthropometric constraints. This enables us to implement a data-driven search procedure for finding poselets that are tightly clustered in both 3D joint configuration space as well as 2D image appearance. The algorithm discovers poselets that correspond to frontal and profile faces, pedestrians, head and shoulder views, among others. Each poselet provides examples for training a linear SVM classifier which can then be run over the image in a multiscale scanning mode. The outputs of these poselet detectors can be thought of as an intermediate layer of nodes, on top of which one can run a second layer of classification or regression. We show how this permits detection and localization of torsos or keypoints such as left shoulder, nose, etc. Experimental results show that we obtain state of the art performance on people detection in the PASCAL VOC 2007 challenge, among other datasets. We are making publicly available both the H3D dataset as well as the poselet parameters for use by other researchers.

Journal ArticleDOI
TL;DR: Examining the overall organization of the brain network using graph analysis shows a strong negative association between the normalized characteristic path length λ of the resting-state brain network and intelligence quotient (IQ), suggesting that human intellectual performance is likely to be related to how efficiently the brain integrates information between multiple brain regions.
Abstract: Our brain is a complex network in which information is continuously processed and transported between spatially distributed but functionally linked regions. Recent studies have shown that the functional connections of the brain network are organized in a highly efficient small-world manner, indicating a high level of local neighborhood clustering, together with the existence of more long-distance connections that ensure a high level of global communication efficiency within the overall network. Such an efficient network architecture of our functional brain raises the question of a possible association between how efficiently the regions of our brain are functionally connected and our level of intelligence. Examining the overall organization of the brain network using graph analysis, we show a strong negative association between the normalized characteristic path length λ of the resting-state brain network and intelligence quotient (IQ). This suggests that human intellectual performance is likely to be related to how efficiently our brain integrates information between multiple brain regions. Most pronounced effects between normalized path length and IQ were found in frontal and parietal regions. Our findings indicate a strong positive association between the global efficiency of functional brain networks and intellectual performance.

Book
17 Feb 2009
TL;DR: Introduction Chemoinformatics-Chemometrics-Statistics This Book Historical Remarks about Chemometrics Bibliography Starting Examples Univariate Statistics-A Reminder Multivariate Data Definitions Basic Preprocessing Covariance and Correlation Distances and Similarities Multivariate Outlier Identification Linear Latent Variables
Abstract: Introduction Chemoinformatics-Chemometrics-Statistics This Book Historical Remarks about Chemometrics Bibliography Starting Examples Univariate Statistics-A Reminder Multivariate Data Definitions Basic Preprocessing Covariance and Correlation Distances and Similarities Multivariate Outlier Identification Linear Latent Variables Summary Principal Component Analysis (PCA) Concepts Number of PCA Components Centering and Scaling Outliers and Data Distribution Robust PCA Algorithms for PCA Evaluation and Diagnostics Complementary Methods for Exploratory Data Analysis Examples Summary Calibration Concepts Performance of Regression Models Ordinary Least Squares Regression Robust Regression Variable Selection Principal Component Regression Partial Least Squares Regression Related Methods Examples Summary Classification Concepts Linear Classification Methods Kernel and Prototype Methods Classification Trees Artificial Neural Networks Support Vector Machine Evaluation Examples Summary Cluster Analysis Concepts Distance and Similarity Measures Partitioning Methods Hierarchical Clustering Methods Fuzzy Clustering Model-Based Clustering Cluster Validity and Clustering Tendency Measures Examples Summary Preprocessing Concepts Smoothing and Differentiation Multiplicative Signal Correction Mass Spectral Features Appendix 1: Symbols and Abbreviations Appendix 2: Matrix Algebra Appendix 3: Introduction to R Index References appear at the end of each chapter

Journal ArticleDOI
TL;DR: This paper focuses on a measure originally defined for unweighted networks: the global clustering coefficient, and proposes a generalization of this coefficient that retains the information encoded in the weights of ties.

Proceedings ArticleDOI
29 Sep 2009
TL;DR: Two methods for learning robust distance measures are presented: a logistic discriminant approach which learns the metric from a set of labelled image pairs (LDML) and a nearest neighbour approach which computes the probability for two images to belong to the same class (MkNN).
Abstract: Face identification is the problem of determining whether two face images depict the same person or not. This is difficult due to variations in scale, pose, lighting, background, expression, hairstyle, and glasses. In this paper we present two methods for learning robust distance measures: (a) a logistic discriminant approach which learns the metric from a set of labelled image pairs (LDML) and (b) a nearest neighbour approach which computes the probability for two images to belong to the same class (MkNN). We evaluate our approaches on the Labeled Faces in the Wild data set, a large and very challenging data set of faces from Yahoo! News. The evaluation protocol for this data set defines a restricted setting, where a fixed set of positive and negative image pairs is given, as well as an unrestricted one, where faces are labelled by their identity. We are the first to present results for the unrestricted setting, and show that our methods benefit from this richer training data, much more so than the current state-of-the-art method. Our results of 79.3% and 87.5% correct for the restricted and unrestricted setting respectively, significantly improve over the current state-of-the-art result of 78.5%. Confidence scores obtained for face identification can be used for many applications e.g. clustering or recognition from a single training example. We show that our learned metrics also improve performance for these tasks.

Journal ArticleDOI
01 Aug 2009
TL;DR: This paper proposes a novel graph clustering algorithm, SA-Cluster, based on both structural and attribute similarities through a unified distance measure, which partitions a large graph associated with attributes into k clusters so that each cluster contains a densely connected subgraph with homogeneous attribute values.
Abstract: The goal of graph clustering is to partition vertices in a large graph into different clusters based on various criteria such as vertex connectivity or neighborhood similarity. Graph clustering techniques are very useful for detecting densely connected groups in a large graph. Many existing graph clustering methods mainly focus on the topological structure for clustering, but largely ignore the vertex properties, which are often heterogeneous. In this paper, we propose a novel graph clustering algorithm, SA-Cluster, based on both structural and attribute similarities through a unified distance measure. Our method partitions a large graph associated with attributes into k clusters so that each cluster contains a densely connected subgraph with homogeneous attribute values. An effective method is proposed to automatically learn the degree of contributions of structural similarity and attribute similarity. Theoretical analysis is provided to show that SA-Cluster converges. Extensive experimental results demonstrate the effectiveness of SA-Cluster through comparison with the state-of-the-art graph clustering and summarization methods.

Journal ArticleDOI
TL;DR: A novel technique for unsupervised change detection in multitemporal satellite images using principal component analysis (PCA) and k-means clustering and Experimental results confirm the effectiveness of the proposed approach.
Abstract: In this letter, we propose a novel technique for unsupervised change detection in multitemporal satellite images using principal component analysis (PCA) and k-means clustering. The difference image is partitioned into h × h nonoverlapping blocks. S (S ≤ h²) orthonormal eigenvectors are extracted through PCA of the h × h nonoverlapping block set to create an eigenvector space. Each pixel in the difference image is represented by an S-dimensional feature vector, which is the projection of the h × h difference image data onto the generated eigenvector space. The change detection is achieved by partitioning the feature vector space into two clusters using k-means clustering with k = 2 and then assigning each pixel to one of the two clusters using the minimum Euclidean distance between the pixel's feature vector and the mean feature vector of each cluster. Experimental results confirm the effectiveness of the proposed approach.

Proceedings ArticleDOI
14 Jun 2009
TL;DR: Under the assumption that the views are uncorrelated given the cluster label, it is shown that the separation conditions required for the algorithm to be successful are significantly weaker than prior results in the literature.
Abstract: Clustering data in high dimensions is believed to be a hard problem in general. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via Principal Components Analysis (PCA) or random projections, before clustering. Here, we consider constructing such projections using multiple views of the data, via Canonical Correlation Analysis (CCA). Under the assumption that the views are uncorrelated given the cluster label, we show that the separation conditions required for the algorithm to be successful are significantly weaker than prior results in the literature. We provide results for mixtures of Gaussians and mixtures of log concave distributions. We also provide empirical support from audio-visual speaker clustering (where we desire the clusters to correspond to speaker ID) and from hierarchical Wikipedia document clustering (where one view is the words in the document and the other is the link structure).

Journal ArticleDOI
TL;DR: A recent proof of NP-hardness of Euclidean sum-of-squares clustering, due to Drineas et al. (Mach. Learn. 56:9–33, 2004), is not valid, and an alternate short proof is provided.
Abstract: A recent proof of NP-hardness of Euclidean sum-of-squares clustering, due to Drineas et al. (Mach. Learn. 56:9–33, 2004), is not valid. An alternate short proof is provided.

Journal ArticleDOI
TL;DR: The QMEAN server provides access to two scoring functions successfully tested at the eighth round of the community-wide blind test experiment CASP, which derives a quality estimate on the basis of the geometrical analysis of single models and a weighted all-against-all comparison of the models from the ensemble provided by the user.
Abstract: Model quality estimation is an essential component of protein structure prediction, since ultimately the accuracy of a model determines its usefulness for specific applications. Usually, in the course of protein structure prediction a set of alternative models is produced, from which subsequently the most accurate model has to be selected. The QMEAN server provides access to two scoring functions successfully tested at the eighth round of the community-wide blind test experiment CASP. The user can choose between the composite scoring function QMEAN, which derives a quality estimate on the basis of the geometrical analysis of single models, and the clustering-based scoring function QMEANclust which calculates a global and local quality estimate based on a weighted all-against-all comparison of the models from the ensemble provided by the user. The web server performs a ranking of the input models and highlights potentially problematic regions for each model. The QMEAN server is available at http://swissmodel.expasy.org/qmean.

Journal ArticleDOI
TL;DR: This article defines a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering are captured by different metric families, and proposes a modified version of BCubed that avoids the problems found with other metrics.
Abstract: There is a wide set of evaluation metrics available to compare the quality of text clustering algorithms. In this article, we define a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering are captured by different metric families. These formal constraints are validated in an experiment involving human assessments, and compared with other constraints proposed in the literature. Our analysis of a wide range of metrics shows that only BCubed satisfies all formal constraints. We also extend the analysis to the problem of overlapping clustering, where items can simultaneously belong to more than one cluster. As BCubed cannot be directly applied to this task, we propose a modified version of BCubed that avoids the problems found with other metrics.

Journal ArticleDOI
TL;DR: This paper introduces an energy efficient heterogeneous clustered scheme for wireless sensor networks based on weighted election probabilities of each node to become a cluster head according to the residual energy in each node.

Journal ArticleDOI
TL;DR: A new spectral-spatial classification scheme for hyperspectral images is proposed that improves the classification accuracies and provides classification maps with more homogeneous regions, when compared to pixel-wise classification.
Abstract: A new spectral-spatial classification scheme for hyperspectral images is proposed. The method combines the results of a pixel-wise support vector machine classification and the segmentation map obtained by partitional clustering using majority voting. The ISODATA algorithm and Gaussian mixture resolving techniques are used for image clustering. Experimental results are presented for two hyperspectral airborne images. The developed classification scheme improves the classification accuracies and provides classification maps with more homogeneous regions, when compared to pixel-wise classification. The proposed method performs particularly well for classification of images with large spatial structures and when different classes have dissimilar spectral responses and a comparable number of pixels.

Proceedings Article
01 Jan 2009
TL;DR: Recently, researchers have started to explore automated clustering techniques that help to identify samples exhibiting similar behavior, allowing an analyst to discard reports of samples that have been seen before and focus on novel, interesting threats.
Abstract: Anti-malware companies receive thousands of malware samples every day. To process this large quantity, a number of automated analysis tools were developed. These tools execute a malicious program in a controlled environment and produce reports that summarize the program’s actions. Of course, the problem of analyzing the reports still remains. Recently, researchers have started to explore automated clustering techniques that help to identify samples that exhibit similar behavior. This allows an analyst to discard reports of samples that have been seen before, while focusing on novel, interesting threats. Unfortunately, previous techniques do not scale well and frequently fail to generalize the observed activity well enough to recognize

Journal ArticleDOI
01 Mar 2009
TL;DR: An up-to-date overview that is fully devoted to evolutionary algorithms for clustering, is not limited to any particular kind of evolutionary approach, and comprises advanced topics like multiobjective and ensemble-based evolutionary clustering.
Abstract: This paper presents a survey of evolutionary algorithms designed for clustering tasks. It tries to reflect the profile of this area by focusing more on those subjects that have been given more importance in the literature. In this context, most of the paper is devoted to partitional algorithms that look for hard clusterings of data, though overlapping (i.e., soft and fuzzy) approaches are also covered in the paper. The paper is original in what concerns two main aspects. First, it provides an up-to-date overview that is fully devoted to evolutionary algorithms for clustering, is not limited to any particular kind of evolutionary approach, and comprises advanced topics like multiobjective and ensemble-based evolutionary clustering. Second, it provides a taxonomy that highlights some very important aspects in the context of evolutionary data clustering, namely: fixed or variable number of clusters; cluster-oriented or nonoriented operators; context-sensitive or context-insensitive operators; guided or unguided operators; binary, integer, or real encodings; and centroid-based, medoid-based, label-based, tree-based, or graph-based representations, among others. A number of references are provided that describe applications of evolutionary algorithms for clustering in different domains, such as image processing, computer security, and bioinformatics. The paper ends by addressing some important issues and open questions that can be subject of future research.

Journal ArticleDOI
TL;DR: This paper proposes an algorithm (EAGLE) to detect both the overlapping and hierarchical properties of complex community structure together and deals with the set of maximal cliques and adopts an agglomerative framework.
Abstract: Clustering and community structure are crucial for many network systems and the related dynamic processes. It has been shown that communities are usually overlapping and hierarchical. However, previous methods investigate these two properties of community structure separately. This paper proposes an algorithm (EAGLE) to detect both the overlapping and hierarchical properties of complex community structure together. This algorithm deals with the set of maximal cliques and adopts an agglomerative framework. The quality function of modularity is extended to evaluate the goodness of a cover. Applications to real-world networks give excellent results.
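The extended modularity used to score a cover (a set of possibly overlapping communities) is commonly stated as EQ = (1/2m) Σ_C Σ_{v,w∈C} [A_vw − k_v k_w / 2m] / (O_v O_w), where O_v is the number of communities containing node v. A sketch, assuming a dense adjacency matrix and communities given as sets of node indices (illustrative, not EAGLE itself):

```python
import numpy as np

def extended_modularity(A, cover):
    """Extended modularity EQ of an overlapping cover.

    A:     symmetric 0/1 adjacency matrix (numpy array)
    cover: list of communities, each a set of node indices
    """
    k = A.sum(axis=1)          # node degrees
    m2 = A.sum()               # equals 2m for an undirected graph
    O = np.zeros(A.shape[0])   # O_v: number of communities containing v
    for C in cover:
        for v in C:
            O[v] += 1
    eq = 0.0
    for C in cover:
        for v in C:
            for w in C:
                eq += (A[v, w] - k[v] * k[w] / m2) / (O[v] * O[w])
    return eq / m2
```

For a non-overlapping cover (all O_v = 1) this reduces to standard Newman modularity, which is the sanity check one would expect of such an extension.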

Book ChapterDOI
22 Nov 2009
TL;DR: This paper proposes a parallel k-means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique and demonstrates that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.
Abstract: Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volume of information produced by the progress of technology makes clustering of very large-scale data a challenging task. In order to deal with the problem, many researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel k-means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.
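The core idea maps naturally onto MapReduce: the map phase assigns each point to its nearest centroid, and the reduce phase averages the points assigned to each centroid. A single iteration can be sketched in sequential Python that mirrors the map/shuffle/reduce structure (an illustration of the decomposition, not the paper's Hadoop code):

```python
import numpy as np

def kmeans_iteration(points, centroids):
    """One MapReduce-style k-means iteration.

    Map:    emit (nearest_centroid_index, point) for each point.
    Shuffle: group emitted pairs by centroid index.
    Reduce: average each group to produce the new centroids.
    """
    # Map phase: each point goes to its nearest centroid
    pairs = [(int(np.argmin(np.linalg.norm(centroids - p, axis=1))), p)
             for p in points]
    # Shuffle phase: group by key
    groups = {}
    for j, p in pairs:
        groups.setdefault(j, []).append(p)
    # Reduce phase: new centroid = mean of its group (unchanged if empty)
    return np.array([np.mean(groups[j], axis=0) if j in groups else centroids[j]
                     for j in range(len(centroids))])
```

Because each mapper only needs the current centroids (a small broadcast) and each reducer only sums its own group, the communication cost per iteration is independent of the dataset size, which is what lets the scheme scale on commodity hardware.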

Journal ArticleDOI
TL;DR: An exact version of nonnegative matrix factorization is defined, and it is established that it is equivalent to a problem in polyhedral combinatorics, that it is NP-hard, and that a polynomial-time local search heuristic exists.
Abstract: Nonnegative matrix factorization (NMF) has become a prominent technique for the analysis of image databases, text databases, and other information retrieval and clustering applications. The problem is most naturally posed as continuous optimization. In this report, we define an exact version of NMF. Then we establish several results about exact NMF: (i) that it is equivalent to a problem in polyhedral combinatorics; (ii) that it is NP-hard; and (iii) that a polynomial-time local search heuristic exists.
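While this paper studies the complexity of *exact* NMF, the approximate problem it builds on (minimize ‖V − WH‖ with W, H ≥ 0) is usually solved with Lee–Seung multiplicative updates. A minimal sketch for orientation (standard algorithm, not this paper's contribution; the small epsilon guards against division by zero):

```python
import numpy as np

def nmf(V, r, iters=500, seed=0):
    """Approximate V ~ W @ H with nonnegative factors of inner rank r,
    using Lee-Seung multiplicative updates for the Frobenius objective."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 1e-3   # strictly positive initialization
    H = rng.random((r, n)) + 1e-3
    for _ in range(iters):
        # each update keeps its factor nonnegative and never increases the error
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H
```

Exact NMF asks when the residual can be driven to zero with the smallest possible r, which is where the polyhedral-combinatorics equivalence and NP-hardness results enter.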

Journal ArticleDOI
TL;DR: A methodology using Geographical Information Systems (GIS) and Kernel Density Estimation to study the spatial patterns of injury-related road accidents in London, UK, and a clustering methodology using environmental data and results from the first section in order to create a classification of road accident hotspots are presented.
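The kernel density estimation step behind such hotspot maps can be sketched as follows: each accident location contributes a Gaussian bump, and the summed surface is evaluated on a grid. This is a generic 2-D KDE illustration (the grid and bandwidth choices are assumptions, not the paper's parameters):

```python
import numpy as np

def kde_grid(xs, ys, grid_x, grid_y, bandwidth):
    """Gaussian kernel density surface over 2-D point locations.

    xs, ys:           coordinates of the events (e.g. accident sites)
    grid_x, grid_y:   1-D arrays of grid coordinates to evaluate on
    bandwidth:        kernel standard deviation (smoothing radius)
    Returns a (len(grid_y), len(grid_x)) density array.
    """
    gx, gy = np.meshgrid(grid_x, grid_y)
    density = np.zeros_like(gx, dtype=float)
    for x, y in zip(xs, ys):
        density += np.exp(-((gx - x) ** 2 + (gy - y) ** 2)
                          / (2 * bandwidth ** 2))
    return density / (2 * np.pi * bandwidth ** 2 * len(xs))
```

Peaks in the resulting surface mark candidate hotspots, which can then be fed to a clustering stage alongside environmental attributes.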