Showing papers on "Cluster analysis published in 2012"


Journal ArticleDOI
TL;DR: A new superpixel algorithm is introduced, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels and is faster and more memory efficient, improves segmentation performance, and is straightforward to extend to supervoxel generation.
Abstract: Computer vision applications have come to rely increasingly on superpixels in recent years, but it is not always clear what constitutes a good superpixel algorithm. In an effort to understand the benefits and drawbacks of existing methods, we empirically compare five state-of-the-art superpixel algorithms for their ability to adhere to image boundaries, speed, memory efficiency, and their impact on segmentation performance. We then introduce a new superpixel algorithm, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels. Despite its simplicity, SLIC adheres to boundaries as well as or better than previous methods. At the same time, it is faster and more memory efficient, improves segmentation performance, and is straightforward to extend to supervoxel generation.
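
To make the "adapted k-means" concrete, here is a deliberately naive Python sketch of the SLIC objective: k-means over joint color and position features, with grid seeding and a compactness weight. It clusters in raw RGB and assigns pixels globally, whereas real SLIC works in CIELAB and limits each center's search to a 2S x 2S window (the source of its speed); the function and parameter names are ours, not the paper's.

```python
import numpy as np

def slic_like(image, n_segments=100, compactness=10.0, n_iter=10):
    """Toy SLIC-style superpixels: k-means over (r, g, b, y, x) features.

    `image` is an (H, W, 3) float array. Spatial coordinates are scaled
    by compactness/S, mirroring the SLIC distance measure."""
    h, w, _ = image.shape
    step = max(1, int(np.sqrt(h * w / n_segments)))       # grid interval S
    yy, xx = np.mgrid[0:h, 0:w]
    feats = np.concatenate(
        [image.reshape(-1, 3),
         yy.reshape(-1, 1) * compactness / step,
         xx.reshape(-1, 1) * compactness / step], axis=1)
    # Seed the cluster centers on a regular grid, as SLIC does.
    grid_y = yy[step // 2::step, step // 2::step]
    grid_x = xx[step // 2::step, step // 2::step]
    centers = feats[(grid_y * w + grid_x).ravel()]
    for _ in range(n_iter):
        # Global nearest-center assignment (real SLIC searches locally).
        labels = ((feats[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(len(centers)):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(axis=0)
    return labels.reshape(h, w)

# labels = slic_like(np.random.rand(64, 64, 3), n_segments=16)
```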

7,849 citations


Journal ArticleDOI

TL;DR: A new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets to reduce sequence redundancy and improve the performance of other sequence analyses is developed.
Abstract: Summary: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ~24 cores and a quasi-linear speedup for up to ~8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions. Availability: http://cd-hit.org. Contact: [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
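
For readers unfamiliar with CD-HIT's core loop, a minimal sketch of greedy incremental clustering follows. difflib's ratio stands in for the real sequence-identity computation; the short-word (k-mer) filter that gives CD-HIT its speed, and the new parallelization this paper adds, are omitted.

```python
from difflib import SequenceMatcher

def greedy_cluster(seqs, identity=0.9):
    """Greedy incremental clustering in the style of CD-HIT: process
    sequences longest-first; each joins the first cluster whose
    representative it matches at >= `identity`, else founds a new one."""
    clusters = []                      # list of (representative, members)
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if SequenceMatcher(None, rep, seq).ratio() >= identity:
                members.append(seq)
                break
        else:                          # no representative was similar enough
            clusters.append((seq, [seq]))
    return clusters

# greedy_cluster(["ACGTACGTAC", "ACGTACGTAA", "TTTTCCCCGG"], identity=0.8)
```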

5,959 citations


Book
06 Jun 2012
TL;DR: An up-to-date, self-contained introduction to a state-of-the-art machine learning approach, Ensemble Methods: Foundations and Algorithms shows how these accurate methods are used in real-world tasks and gives the necessary groundwork to carry out further research in this evolving field.
Abstract: An up-to-date, self-contained introduction to a state-of-the-art machine learning approach, Ensemble Methods: Foundations and Algorithms shows how these accurate methods are used in real-world tasks. It gives you the necessary groundwork to carry out further research in this evolving field. After presenting background and terminology, the book covers the main algorithms and theories, including Boosting, Bagging, Random Forest, averaging and voting schemes, the Stacking method, mixture of experts, and diversity measures. It also discusses multiclass extension, noise tolerance, error-ambiguity and bias-variance decompositions, and recent progress in information theoretic diversity. Moving on to more advanced topics, the author explains how to achieve better performance through ensemble pruning and how to generate better clustering results by combining multiple clusterings. In addition, he describes developments of ensemble methods in semi-supervised learning, active learning, cost-sensitive learning, class-imbalance learning, and comprehensibility enhancement.

1,834 citations


Posted Content
TL;DR: This paper proposes and studies an algorithm, called sparse subspace clustering, to cluster data points that lie in a union of low-dimensional subspaces, and demonstrates the effectiveness of the proposed algorithm through experiments on synthetic data as well as the two real-world problems of motion segmentation and face clustering.
Abstract: In many real-world problems, we are dealing with collections of high-dimensional data, such as images, videos, text and web documents, DNA microarray data, and more. Often, high-dimensional data lie close to low-dimensional structures corresponding to several classes or categories the data belongs to. In this paper, we propose and study an algorithm, called Sparse Subspace Clustering (SSC), to cluster data points that lie in a union of low-dimensional subspaces. The key idea is that, among infinitely many possible representations of a data point in terms of other points, a sparse representation corresponds to selecting a few points from the same subspace. This motivates solving a sparse optimization program whose solution is used in a spectral clustering framework to infer the clustering of data into subspaces. Since solving the sparse optimization program is in general NP-hard, we consider a convex relaxation and show that, under appropriate conditions on the arrangement of subspaces and the distribution of data, the proposed minimization program succeeds in recovering the desired sparse representations. The proposed algorithm can be solved efficiently and can handle data points near the intersections of subspaces. Another key advantage of the proposed algorithm with respect to the state of the art is that it can deal with data nuisances, such as noise, sparse outlying entries, and missing entries, directly by incorporating the model of the data into the sparse optimization program. We demonstrate the effectiveness of the proposed algorithm through experiments on synthetic data as well as the two real-world problems of motion segmentation and face clustering.
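
A compact sketch of the SSC pipeline as the abstract describes it: sparse self-representation via a convex (Lasso) program, then spectral clustering on the symmetrized coefficient magnitudes. The choice of scikit-learn solvers and the parameter names are ours; the paper's full formulation also handles noise, sparse outlying entries, and missing entries, which this sketch ignores.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def sparse_subspace_clustering(X, n_clusters, alpha=0.01):
    """Sketch of SSC: write each point as a sparse combination of the
    other points (Lasso as the convex relaxation of the sparse program),
    then spectrally cluster the symmetrized coefficient magnitudes.
    X is (n_samples, n_features)."""
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)  # forbid trivial self-representation
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(X[others].T, X[i])         # min ||x_i - D c||^2 + alpha ||c||_1
        C[i, others] = lasso.coef_
    affinity = np.abs(C) + np.abs(C).T       # symmetric affinity graph
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(affinity)
```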

1,521 citations


Journal ArticleDOI
TL;DR: A recently developed very efficient (linear time) hierarchical clustering algorithm is described, which can also be viewed as a hierarchical grid‐based algorithm.
Abstract: We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally, we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm. This review adds to the earlier version, Murtagh F, Contreras P. Algorithms for hierarchical clustering: an overview, Wiley Interdiscip Rev: Data Mining Knowl Discov 2012, 2, 86–97. WIREs Data Mining Knowl Discov 2017, 7:e1219. doi: 10.1002/widm.1219
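
For orientation, this is what a standard agglomerative run looks like in Python's SciPy (the review itself points to R implementations): a basic Ward example, not the linear-time grid-based algorithm the paper describes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),      # two well-separated blobs
               rng.normal(3, 0.3, (20, 2))])
Z = linkage(X, method='ward')                    # full merge history (dendrogram)
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
```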

977 citations


Proceedings ArticleDOI
12 Aug 2012
TL;DR: This work shows that by using a combination of four novel ideas the authors can search and mine truly massive time series for the first time, and shows that in large datasets exact search under DTW can be performed much more quickly than with the current state-of-the-art Euclidean distance search algorithms.
Abstract: Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. The difficulty of scaling search to large datasets largely explains why most academic work on time series data mining has plateaued at considering a few million time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine truly massive time series for the first time. We demonstrate the following extremely unintuitive fact: in large datasets we can exactly search under DTW much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We show that our ideas allow us to solve higher-level time series data mining problems such as motif discovery and clustering at scales that would otherwise be untenable. In addition to mining massive datasets, we will show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower powered devices than are currently possible.
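
The paper's ideas include cascading lower bounds and early abandoning; as one representative ingredient, here is a sketch of the classic LB_Keogh lower bound, which lets a search loop skip the expensive full DTW computation whenever the bound already exceeds the best-so-far distance. This is the simple O(n*r) envelope construction, not the paper's optimized variant.

```python
import numpy as np

def lb_keogh(query, candidate, r=5):
    """LB_Keogh lower bound on the (squared) DTW distance.

    Build a band envelope of width r around the candidate; only the parts
    of the query that escape the envelope can contribute to DTW, so the
    sum of those excursions never exceeds the true distance."""
    n = len(candidate)
    total = 0.0
    for i, q in enumerate(query):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        u = candidate[lo:hi].max()      # upper envelope at position i
        l = candidate[lo:hi].min()      # lower envelope at position i
        if q > u:
            total += (q - u) ** 2
        elif q < l:
            total += (q - l) ** 2
    return total

# lb_keogh(np.sin(np.linspace(0, 6, 100)), np.cos(np.linspace(0, 6, 100)))
```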

969 citations


Journal ArticleDOI
TL;DR: An implementation of Pearson correlation calculation that can lead to substantial speedup on data with a relatively small number of missing entries is presented, along with the package flashClust, which implements the original clustering algorithm and in practice achieves order approximately n², leading to substantial time savings when clustering large data sets.
Abstract: Many high-throughput biological data analyses require the calculation of large correlation matrices and/or clustering of a large number of objects. The standard R function for calculating Pearson correlation can handle calculations without missing values efficiently, but is inefficient when applied to data sets with a relatively small number of missing values. We present an implementation of Pearson correlation calculation that can lead to substantial speedup on data with a relatively small number of missing entries. Further, we parallelize all calculations and thus achieve further speedup on systems where parallel processing is available. A robust correlation measure, the biweight midcorrelation, is implemented in a similar manner and provides comparable speed. The functions cor and bicor for fast Pearson and biweight midcorrelation, respectively, are part of the updated, freely available R package WGCNA. The hierarchical clustering algorithm implemented in the R function hclust is an order n³ (n is the number of clustered objects) version of a publicly available clustering algorithm (Murtagh 2012). We present the package flashClust that implements the original algorithm, which in practice achieves order approximately n², leading to substantial time savings when clustering large data sets.
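
The missing-data trick can be reproduced in a few matrix products: mask the NaNs, then compute pairwise-complete counts, sums, and cross-products in one vectorized pass instead of a slow per-pair loop. A NumPy sketch (the paper's implementation is an R package; this is our translation of the idea), assuming every variable pair shares enough complete rows that no denominator degenerates.

```python
import numpy as np

def pearson_pairwise(X):
    """Pairwise-complete Pearson correlation via masked matrix products.

    X is (samples, variables) with NaNs for missing entries. For every
    variable pair, counts, sums, and cross-products are restricted to the
    rows where both are observed, using a handful of BLAS calls."""
    mask = (~np.isnan(X)).astype(float)
    Xz = np.nan_to_num(X)                 # zero out the missing entries
    n = mask.T @ mask                     # pairwise-complete counts
    s = Xz.T @ mask                       # sum of x_i over jointly observed rows
    ss = (Xz ** 2).T @ mask               # sum of x_i^2 over jointly observed rows
    cov = Xz.T @ Xz - s * s.T / n         # unnormalized covariance
    var = ss - s ** 2 / n                 # unnormalized per-pair variances
    return cov / np.sqrt(var * var.T)
```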

839 citations


Journal ArticleDOI
TL;DR: It is shown that consensus clustering can be combined with any existing method in a self-consistent way, enhancing considerably both the stability and the accuracy of the resulting partitions.
Abstract: The community structure of complex networks reveals both their organization and hidden relationships among their constituents. Most community detection methods currently available are not deterministic, and their results typically depend on the specific random seeds, initial conditions and tie-break rules adopted for their execution. Consensus clustering is used in data analysis to generate stable results out of a set of partitions delivered by stochastic methods. Here we show that consensus clustering can be combined with any existing method in a self-consistent way, enhancing considerably both the stability and the accuracy of the resulting partitions. This framework is also particularly suitable to monitor the evolution of community structure in temporal networks. An application of consensus clustering to a large citation network of physics papers demonstrates its capability to keep track of the birth, death and diversification of topics.
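
The mechanics are easy to state in code: run the stochastic method many times, record how often each pair of points lands in the same cluster (the consensus matrix), and partition that matrix. A sketch using k-means on feature vectors as the stand-in stochastic method, rather than a community detection algorithm on a network; it assumes scikit-learn >= 1.2, where AgglomerativeClustering accepts `metric='precomputed'`.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def consensus_clustering(X, n_clusters, n_runs=50, seed=0):
    """Run a stochastic clustering method many times, count how often
    each pair of points co-occurs in a cluster (the consensus matrix),
    and cluster that matrix to obtain one stable partition."""
    n = X.shape[0]
    consensus = np.zeros((n, n))
    for r in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=1,
                        random_state=seed + r).fit_predict(X)
        consensus += labels[:, None] == labels[None, :]
    consensus /= n_runs
    # 1 - consensus behaves like a distance between points.
    final = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric='precomputed', linkage='average')
    return final.fit_predict(1.0 - consensus)
```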

727 citations


Journal ArticleDOI
TL;DR: The results of the analysis show that for multi-label classification the best performing methods overall are random forests of predictive clustering trees (RF-PCT) and hierarchy of multi-label classifiers (HOMER), followed by binary relevance (BR) and classifier chains (CC).

711 citations


Journal ArticleDOI
01 Jul 2012
TL;DR: The problems motivating subspace clustering are sketched, different definitions and usages of subspaces for clustering are described, and exemplary algorithmic solutions are discussed.
Abstract: Subspace clustering refers to the task of identifying clusters of similar objects or data records (vectors) where the similarity is defined with respect to a subset of the attributes (i.e., a subspace of the data space). The subspace is not necessarily (and actually is usually not) the same for different clusters within one clustering solution. In this article, the problems motivating subspace clustering are sketched, different definitions and usages of subspaces for clustering are described, and exemplary algorithmic solutions are discussed. Finally, we sketch current research directions. © 2012 Wiley Periodicals, Inc.

666 citations


Proceedings Article
27 Jun 2012
TL;DR: In this article, the authors present a general mathematical framework for the study of both linear and non-linear autoencoders, including the Boolean autoencoder, in which learning is equivalent to a clustering problem that can be solved in polynomial time when the number of clusters is small and becomes NP-complete when the number of clusters is large.
Abstract: Autoencoders play a fundamental role in unsupervised learning and in deep architectures for transfer learning and other tasks. In spite of their fundamental role, only linear autoencoders over the real numbers have been solved analytically. Here we present a general mathematical framework for the study of both linear and non-linear autoencoders. The framework allows one to derive an analytical treatment for the most non-linear autoencoder, the Boolean autoencoder. Learning in the Boolean autoencoder is equivalent to a clustering problem that can be solved in polynomial time when the number of clusters is small and becomes NP-complete when the number of clusters is large. The framework sheds light on the different kinds of autoencoders, their learning complexity, their horizontal and vertical composability in deep architectures, their critical points, and their fundamental connections to clustering, Hebbian learning, and information theory.

Journal ArticleDOI
TL;DR: A system is described that uses natural language processing and data mining techniques to extract situation awareness information from Twitter messages generated during various disasters and crises, such as hurricanes and floods.
Abstract: The described system uses natural language processing and data mining techniques to extract situation awareness information from Twitter messages generated during various disasters and crises.

Journal ArticleDOI
09 Aug 2012-Sensors
TL;DR: A comprehensive and fine-grained survey on clustering routing protocols proposed in the literature for WSNs, and a novel taxonomy of WSN clustering routing methods based on complete and detailed clustering attributes, are presented.
Abstract: The past few years have witnessed increased interest in the potential use of wireless sensor networks (WSNs) in a wide range of applications and it has become a hot research area. Based on network structure, routing protocols in WSNs can be divided into two categories: flat routing and hierarchical or clustering routing. Owing to a variety of advantages, clustering is becoming an active branch of routing technology in WSNs. In this paper, we present a comprehensive and fine-grained survey on clustering routing protocols proposed in the literature for WSNs. We outline the advantages and objectives of clustering for WSNs, and develop a novel taxonomy of WSN clustering routing methods based on complete and detailed clustering attributes. In particular, we systematically analyze a few prominent WSN clustering routing protocols and compare these different approaches according to our taxonomy and several significant metrics. Finally, we summarize and conclude the paper with some future directions.

Book ChapterDOI
01 Jan 2012
TL;DR: This chapter will summarize recent results and technical tricks that are needed to make effective use of K-means clustering for learning large-scale representations of images and connect these results to other well-known algorithms to make clear when K-Means can be most useful.
Abstract: Many algorithms are available to learn deep hierarchies of features from unlabeled data, especially images. In many cases, these algorithms involve multi-layered networks of features (e.g., neural networks) that are sometimes tricky to train and tune and are difficult to scale up to many machines effectively. Recently, it has been found that K-means clustering can be used as a fast alternative training method. The main advantage of this approach is that it is very fast and easily implemented at large scale. On the other hand, employing this method in practice is not completely trivial: K-means has several limitations, and care must be taken to combine the right ingredients to get the system to work well. This chapter will summarize recent results and technical tricks that are needed to make effective use of K-means clustering for learning large-scale representations of images. We will also connect these results to other well-known algorithms to make clear when K-means can be most useful and convey intuitions about its behavior that are useful for debugging and engineering new systems.
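
A condensed sketch of the recipe the chapter describes: per-patch normalization, a k-means dictionary, and a soft "triangle" encoding. The whitening step, which the chapter stresses as essential in practice, is omitted here for brevity, and the function names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_features(patches, n_centroids=256, seed=0):
    """Learn a k-means dictionary from (n_patches, n_pixels) data, then
    encode inputs with the triangle activation f_k = max(0, mean(z) - z_k),
    where z_k is the distance to centroid k."""
    P = patches - patches.mean(axis=1, keepdims=True)
    P /= P.std(axis=1, keepdims=True) + 1e-8       # per-patch normalization
    km = KMeans(n_clusters=n_centroids, n_init=3, random_state=seed).fit(P)

    def encode(x):
        z = np.linalg.norm(km.cluster_centers_ - x, axis=1)
        return np.maximum(0.0, z.mean() - z)       # sparse, nonnegative code
    return km, encode
```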

Book ChapterDOI
01 Aug 2012
TL;DR: This chapter will study the key challenges of the clustering problem, as it applies to the text domain, and discuss the key methods used for text clustering, and their relative advantages.
Abstract: Clustering is a widely studied data mining problem in the text domains. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this chapter, we will provide a detailed survey of the problem of text clustering. We will study the key challenges of the clustering problem, as it applies to the text domain. We will discuss the key methods used for text clustering, and their relative advantages. We will also discuss a number of recent advances in the area in the context of social network and linked data.

Posted Content
TL;DR: In this paper, a set of discriminative patches which can serve as a fully unsupervised mid-level visual representation is discovered. The patches could correspond to parts, objects, "visual phrases", etc., but are not restricted to be any one of them.
Abstract: The goal of this paper is to discover a set of discriminative patches which can serve as a fully unsupervised mid-level visual representation. The desired patches need to satisfy two requirements: 1) to be representative, they need to occur frequently enough in the visual world; 2) to be discriminative, they need to be different enough from the rest of the visual world. The patches could correspond to parts, objects, "visual phrases", etc. but are not restricted to be any one of them. We pose this as an unsupervised discriminative clustering problem on a huge dataset of image patches. We use an iterative procedure which alternates between clustering and training discriminative classifiers, while applying careful cross-validation at each step to prevent overfitting. The paper experimentally demonstrates the effectiveness of discriminative patches as an unsupervised mid-level visual representation, suggesting that it could be used in place of visual words for many tasks. Furthermore, discriminative patches can also be used in a supervised regime, such as scene classification, where they demonstrate state-of-the-art performance on the MIT Indoor-67 dataset.
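
A rough sketch of the alternating procedure on pre-extracted patch descriptors: cluster, train a linear classifier per cluster against a "rest of the visual world" negative set, and rebuild each cluster from its classifier's top firings. The careful cross-validation (training and firing on disjoint folds) that the paper stresses is omitted, so this toy version would overfit; the names and constants are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def discriminative_patches(patches, negatives, k=20, rounds=3):
    """Alternate between clustering patch descriptors and training one
    linear classifier per cluster against a large negative set; each
    cluster is then re-formed from its classifier's top-scoring patches."""
    labels = KMeans(n_clusters=k, n_init=3, random_state=0).fit_predict(patches)
    for _ in range(rounds):
        new_labels = np.full(len(patches), -1)   # -1 = discarded patch
        for c in range(k):
            members = patches[labels == c]
            if len(members) < 3:                 # drop degenerate clusters
                continue
            X = np.vstack([members, negatives])
            y = np.r_[np.ones(len(members)), np.zeros(len(negatives))]
            clf = LinearSVC(C=0.1).fit(X, y)
            top = np.argsort(clf.decision_function(patches))[-len(members):]
            new_labels[top] = c                  # later clusters may overwrite
        labels = new_labels
    return labels
```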

Journal ArticleDOI
TL;DR: A bottom-up aggregation approach to image segmentation that takes into account intensity and texture distributions in a local area around each region and incorporates priors based on the geometry of the regions, providing a complete hierarchical segmentation of the image.
Abstract: We present a bottom-up aggregation approach to image segmentation. Beginning with an image, we execute a sequence of steps in which pixels are gradually merged to produce larger and larger regions. In each step, we consider pairs of adjacent regions and provide a probability measure to assess whether or not they should be included in the same segment. Our probabilistic formulation takes into account intensity and texture distributions in a local area around each region. It further incorporates priors based on the geometry of the regions. Finally, posteriors based on intensity and texture cues are combined using a “mixture of experts” formulation. This probabilistic approach is integrated into a graph coarsening scheme, providing a complete hierarchical segmentation of the image. The algorithm complexity is linear in the number of the image pixels and it requires almost no user-tuned parameters. In addition, we provide a novel evaluation scheme for image segmentation algorithms, attempting to avoid human semantic considerations that are out of scope for segmentation algorithms. Using this novel evaluation scheme, we test our method and provide a comparison to several existing segmentation algorithms.

Journal ArticleDOI
TL;DR: An unsupervised distribution-free change detection approach for synthetic aperture radar (SAR) images based on an image fusion strategy and a novel fuzzy clustering algorithm that exhibited lower error than its predecessors.
Abstract: This paper presents an unsupervised distribution-free change detection approach for synthetic aperture radar (SAR) images based on an image fusion strategy and a novel fuzzy clustering algorithm. The image fusion technique is introduced to generate a difference image by using complementary information from a mean-ratio image and a log-ratio image. In order to restrain the background information and enhance the information of changed regions in the fused difference image, wavelet fusion rules based on an average operator and minimum local area energy are chosen to fuse the wavelet coefficients for a low-frequency band and a high-frequency band, respectively. A reformulated fuzzy local-information C-means clustering algorithm is proposed for classifying changed and unchanged regions in the fused difference image. It incorporates the information about spatial context in a novel fuzzy way for the purpose of enhancing the changed information and of reducing the effect of speckle noise. Experiments on real SAR images show that the image fusion strategy integrates the advantages of the log-ratio operator and the mean-ratio operator and gains a better performance. The change detection results obtained by the improved fuzzy clustering algorithm exhibited lower error than its predecessors.
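
The paper's classifier extends fuzzy c-means with local spatial information; for reference, plain FCM looks like this. A minimal sketch: X is an (n_points, n_features) array (for change detection, per-pixel features of the fused difference image), and the changed/unchanged map comes from thresholding the memberships.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=50, seed=0):
    """Plain fuzzy c-means. Returns the membership matrix U
    (n_points x c) and the c cluster centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-10
        U = 1.0 / d ** (2.0 / (m - 1.0))                 # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return U, centers
```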

Proceedings Article
08 Aug 2012
TL;DR: A new technique to detect randomly generated domains without reversing is presented, finding that most of the DGA-generated domains that a bot queries would result in Non-Existent Domain (NXDomain) responses, and that bots from the same botnet (with the same DGA) would generate similar NXDomain traffic.
Abstract: Many botnet detection systems employ a blacklist of known command and control (C&C) domains to detect bots and block their traffic. Similar to signature-based virus detection, such a botnet detection approach is static because the blacklist is updated only after running an external (and often manual) process of domain discovery. As a response, botmasters have begun employing domain generation algorithms (DGAs) to dynamically produce a large number of random domain names and select a small subset for actual C&C use. That is, a C&C domain is randomly generated and used for a very short period of time, thus rendering detection approaches that rely on static domain lists ineffective. Naturally, if we know how a domain generation algorithm works, we can generate the domains ahead of time and still identify and block botnet C&C traffic. The existing solutions are largely based on reverse engineering of the bot malware executables, which is not always feasible. In this paper we present a new technique to detect randomly generated domains without reversing. Our insight is that most of the DGA-generated (random) domains that a bot queries would result in Non-Existent Domain (NXDomain) responses, and that bots from the same botnet (with the same DGA) would generate similar NXDomain traffic. Our approach uses a combination of clustering and classification algorithms. The clustering algorithm clusters domains based on the similarity in the make-ups of domain names as well as the groups of machines that queried these domains. The classification algorithm is used to assign the generated clusters to models of known DGAs. If a cluster cannot be assigned to a known model, then a new model is produced, indicating a new DGA variant or family. We implemented a prototype system and evaluated it on real-world DNS traffic obtained from large ISPs in North America. We report the discovery of twelve DGAs. Half of them are variants of known (botnet) DGAs, and the other half are brand new DGAs that have never been reported before.
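
The first clustering stage can be approximated in a few lines: represent each NXDomain name by its character 2-gram counts and group similar make-ups. The real system also clusters by the sets of hosts querying each domain and then classifies clusters against models of known DGAs; this fragment, with illustrative parameters, covers only the name-similarity part.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering

def cluster_nxdomains(domains, n_clusters=2):
    """Group NXDomain names by the similarity of their character
    make-up, here modeled as L2-normalized character 2-gram counts."""
    vec = CountVectorizer(analyzer='char', ngram_range=(2, 2))
    X = vec.fit_transform(domains).toarray().astype(float)
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-10
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)

# labels = cluster_nxdomains(["kqzx7a.com", "p3vw9r.com", "mail.example.com"])
```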

Book
01 Jul 2012
TL;DR: This semi-structured heterogeneous network modeling leads to a series of new principles and powerful methodologies for mining interconnected data, including: (1) rank-based clustering and classification; (2) meta-path-based similarity search and mining; (3) relation strength-aware mining, and many other potential developments.
Abstract: Real-world physical and abstract data objects are interconnected, forming gigantic, interconnected networks. By structuring these data objects and interactions between these objects into multiple types, such networks become semi-structured heterogeneous information networks. Most real-world applications that handle big data, including interconnected social media and social networks, scientific, engineering, or medical information systems, online e-commerce systems, and most database systems, can be structured into heterogeneous information networks. Therefore, effective analysis of large-scale heterogeneous information networks poses an interesting but critical challenge. In this book, we investigate the principles and methodologies of mining heterogeneous information networks. Departing from many existing network models that view interconnected data as homogeneous graphs or networks, our semi-structured heterogeneous information network model leverages the rich semantics of typed nodes and links in a network and uncovers surprisingly rich knowledge from the network. This semi-structured heterogeneous network modeling leads to a series of new principles and powerful methodologies for mining interconnected data, including: (1) rank-based clustering and classification; (2) meta-path-based similarity search and mining; (3) relation strength-aware mining, and many other potential developments. This book introduces this new research frontier and points out some promising research directions. Table of Contents: Introduction / Ranking-Based Clustering / Classification of Heterogeneous Information Networks / Meta-Path-Based Similarity Search / Meta-Path-Based Relationship Prediction / Relation Strength-Aware Clustering with Incomplete Attributes / User-Guided Clustering via Meta-Path Selection / Research Frontiers

Journal ArticleDOI
TL;DR: How data mining technologies (in each area of classification, clustering, and association) have been used for a multitude of purposes, including research in the biomedical and healthcare fields are introduced.
Abstract: As a new concept that emerged in the mid-1990s, data mining can help researchers gain both novel and deep insights and can facilitate unprecedented understanding of large biomedical datasets. Data mining can uncover new biomedical and healthcare knowledge for clinical and administrative decision making as well as generate scientific hypotheses from large experimental data, clinical databases, and/or biomedical literature. This review first introduces data mining in general (e.g., the background, definition, and process of data mining), discusses the major differences between statistics and data mining, and then speaks to the uniqueness of data mining in the biomedical and healthcare fields. A brief summary of various data mining algorithms used for classification, clustering, and association, as well as their respective advantages and drawbacks, is also presented. Suggested guidelines on how to use data mining algorithms in each area of classification, clustering, and association are offered along with three examples of how data mining has been used in the healthcare industry. Given the successful application of data mining by health-related organizations, which has helped to predict health insurance fraud and under-diagnosed patients, and to identify and classify at-risk people with the goal of reducing healthcare costs, we introduce how data mining technologies (in each area of classification, clustering, and association) have been used for a multitude of purposes, including research in the biomedical and healthcare fields. A discussion of the technologies available to enable the prediction of healthcare costs (including length of hospital stay), disease diagnosis and prognosis, and the discovery of hidden biomedical and healthcare patterns from related databases is offered along with a discussion of the use of data mining to discover such relationships as those between health conditions and a disease, relationships among diseases, and relationships among drugs. The article concludes with a discussion of the problems that hamper the clinical use of data mining by health professionals.

Proceedings Article
22 Jul 2012
TL;DR: A new unsupervised learning algorithm, namely Nonnegative Discriminative Feature Selection (NDFS), which exploits the discriminative information and feature correlation simultaneously to select a better feature subset.
Abstract: In this paper, a new unsupervised learning algorithm, namely Nonnegative Discriminative Feature Selection (NDFS), is proposed. To exploit the discriminative information in unsupervised scenarios, we perform spectral clustering to learn the cluster labels of the input samples, during which the feature selection is performed simultaneously. The joint learning of the cluster labels and feature selection matrix enables NDFS to select the most discriminative features. To learn more accurate cluster labels, a nonnegative constraint is explicitly imposed on the class indicators. To reduce the redundant or even noisy features, an ℓ2,1-norm minimization term is added to the objective function, which guarantees that the feature selection matrix is sparse in rows. Our algorithm exploits the discriminative information and feature correlation simultaneously to select a better feature subset. A simple yet efficient iterative algorithm is designed to optimize the proposed objective function. Experimental results on different real-world datasets demonstrate the encouraging performance of our algorithm over state-of-the-art methods.
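
Reading the abstract's ingredients together (a spectral clustering term on cluster indicators F, joint regression to a feature selection matrix W, nonnegativity, and row sparsity), the objective has roughly the following shape; the notation and the placement of the trade-off weights α and β are our reconstruction, not copied from the paper:

```latex
\min_{F,\,W}\ \operatorname{Tr}\!\left(F^{\top} L F\right)
  \;+\; \alpha \left( \left\lVert X^{\top} W - F \right\rVert_F^2
  \;+\; \beta \left\lVert W \right\rVert_{2,1} \right)
\qquad \text{s.t.}\quad F^{\top} F = I,\;\; F \ge 0
```

Here L is the graph Laplacian of the sample similarity graph, the ℓ2,1 penalty zeroes out whole rows of W (discarding the corresponding features), and the nonnegativity constraint keeps F interpretable as cluster indicators.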

Book ChapterDOI
07 Oct 2012
TL;DR: The paper experimentally demonstrates the effectiveness of discriminative patches as an unsupervised mid-level visual representation, suggesting that it could be used in place of visual words for many tasks.
Abstract: The goal of this paper is to discover a set of discriminative patches which can serve as a fully unsupervised mid-level visual representation. The desired patches need to satisfy two requirements: 1) to be representative, they need to occur frequently enough in the visual world; 2) to be discriminative, they need to be different enough from the rest of the visual world. The patches could correspond to parts, objects, "visual phrases", etc. but are not restricted to be any one of them. We pose this as an unsupervised discriminative clustering problem on a huge dataset of image patches. We use an iterative procedure which alternates between clustering and training discriminative classifiers, while applying careful cross-validation at each step to prevent overfitting. The paper experimentally demonstrates the effectiveness of discriminative patches as an unsupervised mid-level visual representation, suggesting that it could be used in place of visual words for many tasks. Furthermore, discriminative patches can also be used in a supervised regime, such as scene classification, where they demonstrate state-of-the-art performance on the MIT Indoor-67 dataset.

MonographDOI
01 Jul 2012
TL;DR: A textbook covering probability, statistical inference, nonparametric statistics, density estimation, regression, multivariate analysis, clustering, classification and data mining, censored and truncated data, time series analysis, and spatial point processes.
Abstract: Table of contents: 1. Introduction; 2. Probability; 3. Statistical inference; 4. Probability distribution functions; 5. Nonparametric statistics; 6. Density estimation or data smoothing; 7. Regression; 8. Multivariate analysis; 9. Clustering, classification and data mining; 10. Nondetections: censored and truncated data; 11. Time series analysis; 12. Spatial point processes; Appendices; Index.

Journal ArticleDOI
TL;DR: This paper compares the efficacy of three different implementations of techniques aimed to extend fuzzy c-means (FCM) clustering to VL data and concludes by demonstrating the VL algorithms on a dataset with 5 billion objects and presenting a set of recommendations regarding the use of different VL FCM clustering schemes.
Abstract: Very large (VL) data or big data are any data that you cannot load into your computer's working memory. This is not an objective definition, but a definition that is easy to understand and one that is practical, because there is a dataset too big for any computer you might use; hence, this is VL data for you. Clustering is one of the primary tasks used in the pattern recognition and data mining communities to search VL databases (including VL images) in various applications, and so, clustering algorithms that scale well to VL data are important and useful. This paper compares the efficacy of three different implementations of techniques aimed to extend fuzzy c-means (FCM) clustering to VL data. Specifically, we compare methods that are based on 1) sampling followed by noniterative extension; 2) incremental techniques that make one sequential pass through subsets of the data; and 3) kernelized versions of FCM that provide approximations based on sampling, including three proposed algorithms. We use both loadable and VL datasets to conduct the numerical experiments that facilitate comparisons based on time and space complexity, speed, quality of approximations to batch FCM (for loadable data), and assessment of matches between partitions and ground truth. Empirical results show that random sampling plus extension FCM, bit-reduced FCM, and approximate kernel FCM are good choices to approximate FCM for VL data. We conclude by demonstrating the VL algorithms on a dataset with 5 billion objects and presenting a set of recommendations regarding the use of different VL FCM clustering schemes.
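
Scheme 1 (sampling plus noniterative extension) is easy to sketch: run literal FCM on a manageable random sample, then compute memberships for every remaining point in a single pass from the fixed centers. This reuses the `fuzzy_c_means` sketch given under the SAR change-detection entry above; the sample size is an illustrative knob, not a value from the paper.

```python
import numpy as np

def fcm_sample_extend(X, c=3, m=2.0, sample=10_000, seed=0):
    """Sampling + noniterative extension for fuzzy c-means on data too
    large to iterate over. `fuzzy_c_means` is the sketch defined under
    the SAR change-detection entry earlier in this listing."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample, len(X)), replace=False)
    _, centers = fuzzy_c_means(X[idx], c=c, m=m)       # iterate on the sample
    # One-shot membership computation for every point (no iteration).
    d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-10
    U = 1.0 / d ** (2.0 / (m - 1.0))
    return U / U.sum(axis=1, keepdims=True), centers
```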

Proceedings ArticleDOI
16 Jun 2012
TL;DR: The proposed framework and theoretical foundations are illustrated with examples in video summarization and image classification using representatives and can be extended to detect and reject outliers in datasets, and to efficiently deal with new observations and large datasets.
Abstract: We consider the problem of finding a few representatives for a dataset, i.e., a subset of data points that efficiently describes the entire dataset. We assume that each data point can be expressed as a linear combination of the representatives and formulate the problem of finding the representatives as a sparse multiple measurement vector problem. In our formulation, both the dictionary and the measurements are given by the data matrix, and the unknown sparse codes select the representatives via convex optimization. In general, we do not assume that the data are low-rank or distributed around cluster centers. When the data do come from a collection of low-rank models, we show that our method automatically selects a few representatives from each low-rank model. We also analyze the geometry of the representatives and discuss their relationship to the vertices of the convex hull of the data. We show that our framework can be extended to detect and reject outliers in datasets, and to efficiently deal with new observations and large datasets. The proposed framework and theoretical foundations are illustrated with examples in video summarization and image classification using representatives.
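
Dropping the affine constraint of the full formulation, the row-sparse program can be approximated with an off-the-shelf group-Lasso solver: the rows of the coefficient matrix that survive identify the representatives. A sketch in which scikit-learn's MultiTaskLasso stands in for the paper's solver and the selection threshold is arbitrary.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

def find_representatives(D, alpha=0.05):
    """Approximate min ||D - D C||_F^2 + alpha * sum_i ||c^i||_2, so that
    only a few rows of C are nonzero; those rows index the representatives.
    D is (n_features, n_points): both the dictionary and the measurements."""
    model = MultiTaskLasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    model.fit(D, D)                 # each column of D is reconstructed from D
    C = model.coef_.T               # coef_ is (n_tasks, n_features) = C^T
    row_norms = np.linalg.norm(C, axis=1)
    reps = np.flatnonzero(row_norms > 1e-6 * row_norms.max())
    return reps, C
```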

Journal ArticleDOI
01 Jun 2012-Energy
TL;DR: This paper provides, illustrates, and discusses an overview of the clustering techniques used to establish suitable customer groupings, included in a general scheme for analysing electrical load pattern data, with links to relevant literature references.

Proceedings Article
01 Jan 2012
TL;DR: Symmetric NMF is proposed as a general framework for graph clustering, which inherits the advantages of NMF by enforcing nonnegativity on the clustering assignment matrix, and serves as a potential basis for many extensions.
Abstract: Nonnegative matrix factorization (NMF) provides a lower rank approximation of a nonnegative matrix, and has been successfully used as a clustering method. In this paper, we offer some conceptual understanding for the capabilities and shortcomings of NMF as a clustering method. Then, we propose Symmetric NMF (SymNMF) as a general framework for graph clustering, which inherits the advantages of NMF by enforcing nonnegativity on the clustering assignment matrix. Unlike NMF, however, SymNMF is based on a similarity measure between data points, and factorizes a symmetric matrix containing pairwise similarity values (not necessarily nonnegative). We compare SymNMF with the widely used spectral clustering methods, and give an intuitive explanation of why SymNMF captures the cluster structure embedded in the graph representation more naturally. In addition, we develop a Newton-like algorithm that exploits second-order information efficiently, so as to show the feasibility of SymNMF as a practical framework for graph clustering. Our experiments on artificial graph data, text data, and image data demonstrate the substantially enhanced clustering quality of SymNMF over spectral clustering and NMF. Therefore, SymNMF is able to achieve better clustering results on both linear and nonlinear manifolds, and serves as a potential basis for many extensions.
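
The factorization itself fits in a few lines: approximate the symmetric similarity matrix A by H Hᵀ with H ≥ 0, and read cluster labels off the largest entry in each row of H. This sketch uses a simple damped multiplicative update rather than the Newton-like algorithm the paper develops; the damping factor is a conventional choice, not the paper's.

```python
import numpy as np

def symnmf(A, k, n_iter=200, seed=0):
    """Sketch of SymNMF: min ||A - H H^T||_F^2 over H >= 0, where A is a
    symmetric (n x n) pairwise similarity matrix. Returns labels and H."""
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k))
    beta = 0.5                                   # damping for stability
    for _ in range(n_iter):
        AH = A @ H
        HHtH = H @ (H.T @ H)
        # Damped multiplicative update; H stays elementwise nonnegative.
        H = H * (1.0 - beta + beta * AH / np.maximum(HHtH, 1e-10))
    return H.argmax(axis=1), H
```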

Journal ArticleDOI
TL;DR: The rapid advances of high-throughput sequencing technologies have dramatically prompted metagenomic studies of microbial communities in various environments; the massive quantity and complexity of these sequence data pose tremendous challenges in data analysis, and sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families.
Abstract: The rapid advances of high-throughput sequencing technologies dramatically prompted metagenomic studies of microbial communities that exist in various environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations and their functions and interactions. However, the massive quantity and the comprehensive complexity of these sequence data pose tremendous challenges in data analysis. These challenges include but are not limited to ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families. In addition, clustering analysis also addresses the challenges in metagenomics. Thus, a large redundant data set can be represented with a small non-redundant set, where each cluster can be represented by a single entry or a consensus. Artifacts can be rapidly detected through clustering. Errors can be identified, filtered or corrected by using consensus from sequences within clusters.

Journal ArticleDOI
TL;DR: A fuzzy-logic-based clustering approach with an extension to energy prediction has been proposed to prolong the lifetime of WSNs by evenly distributing the workload, and the simulation results show that the proposed approach is more efficient than other distributed algorithms.
Abstract: In order to collect information more efficiently, wireless sensor networks (WSNs) are partitioned into clusters. Clustering provides an effective way to prolong the lifetime of WSNs. Current clustering approaches often use two methods: selecting cluster heads with more residual energy, and rotating cluster heads periodically, to distribute the energy consumption among nodes in each cluster and extend the network lifetime. However, most of the previous algorithms have not considered the expected residual energy, which is the predicted remaining energy for being selected as a cluster head and running a round. In this paper, a fuzzy-logic-based clustering approach with an extension to energy prediction has been proposed to prolong the lifetime of WSNs by evenly distributing the workload. The simulation results show that the proposed approach is more efficient than other distributed algorithms. It is believed that the technique presented in this paper could be further applied to large-scale wireless sensor networks.