Showing papers on "Cluster analysis" published in 2011

Journal Article•DOI•

REVIGO Summarizes and Visualizes Long Lists of Gene Ontology Terms

Fran Supek, Matko Bošnjak, Nives Škunca, Tomislav Šmuc

18 Jul 2011-PLOS ONE

TL;DR: REVIGO is a Web server that summarizes long, unintelligible lists of GO terms by finding a representative subset of the terms using a simple clustering algorithm that relies on semantic similarity measures.

Abstract: Outcomes of high-throughput biological experiments are typically interpreted by statistical testing for enriched gene functional categories defined by the Gene Ontology (GO). The resulting lists of GO terms may be large and highly redundant, and thus difficult to interpret. REVIGO is a Web server that summarizes long, unintelligible lists of GO terms by finding a representative subset of the terms using a simple clustering algorithm that relies on semantic similarity measures. Furthermore, REVIGO visualizes this non-redundant GO term set in multiple ways to assist in interpretation: multidimensional scaling and graph-based visualizations accurately render the subdivisions and the semantic relationships in the data, while treemaps and tag clouds are also offered as alternative views. REVIGO is freely available at http://revigo.irb.hr/.

4,919 citations

Journal Article•DOI•

Robust Inference with Multi-way Clustering

A. Colin Cameron, Jonah B. Gelbach, Douglas L. Miller¹•Institutions (1)

University of California, Davis¹

01 Apr 2011-Journal of Business & Economic Statistics

TL;DR: The authors proposed a variance estimator for the OLS estimator as well as for nonlinear estimators such as logit, probit, and GMM that enables cluster-robust inference when there is two-way or multiway clustering that is nonnested.

Abstract: In this article we propose a variance estimator for the OLS estimator as well as for nonlinear estimators such as logit, probit, and GMM. This variance estimator enables cluster-robust inference when there is two-way or multiway clustering that is nonnested. The variance estimator extends the standard cluster-robust variance estimator or sandwich estimator for one-way clustering (e.g., Liang and Zeger 1986; Arellano 1987) and relies on similar relatively weak distributional assumptions. Our method is easily implemented in statistical packages, such as Stata and SAS, that already offer cluster-robust standard errors when there is one-way clustering. The method is demonstrated by a Monte Carlo analysis for a two-way random effects model; a Monte Carlo analysis of a placebo law that extends the state–year effects example of Bertrand, Duflo, and Mullainathan (2004) to two dimensions; and by application to studies in the empirical literature where two-way clustering is present.
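
For the two-way case, the construction described in this abstract is commonly written as a sum of one-way cluster-robust estimators. The notation below (clustering dimensions $G$ and $H$, and $G \cap H$ for the clusters formed by their intersection) is assumed here for illustration and is not taken from the listing:

```latex
% Two-way cluster-robust variance, built from three one-way
% (sandwich) estimators: cluster on G, cluster on H, and subtract
% the estimator clustered on the intersection G \cap H so that
% the overlap is not counted twice.
\[
  \hat{V}_{\text{two-way}}[\hat{\beta}]
    = \hat{V}_{G}[\hat{\beta}]
    + \hat{V}_{H}[\hat{\beta}]
    - \hat{V}_{G \cap H}[\hat{\beta}]
\]
```

Each term on the right is a standard one-way estimator, which is consistent with the abstract's remark that the method is easy to implement in packages that already offer one-way cluster-robust standard errors.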

2,542 citations

Proceedings Article•

An analysis of single-layer networks in unsupervised feature learning

Adam Coates¹, Andrew Y. Ng¹, Honglak Lee²•Institutions (2)

Stanford University¹, University of Michigan²

14 Jun 2011

TL;DR: The authors show that the number of hidden nodes in the model may be more important to achieving high performance than the learning algorithm or the depth of the model, applying several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to the CIFAR, NORB, and STL datasets using only single-layer networks.

Abstract: A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several simple factors, such as the number of hidden nodes in the model, may be more important to achieving high performance than the learning algorithm or the depth of the model. Specifically, we will apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to CIFAR, NORB, and STL datasets using only single-layer networks. We then present a detailed analysis of the effect of changes in the model setup: the receptive field size, number of hidden nodes (features), the step-size ("stride") between extracted features, and the effect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are critical to achieving high performance, so critical, in fact, that when these parameters are pushed to their limits, we achieve state-of-the-art performance on both CIFAR-10 and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyperparameters to tune beyond the model structure itself, and is very easy to implement. Despite the simplicity of our system, we achieve accuracy beyond all previously published results on the CIFAR-10 and NORB datasets (79.6% and 97.2% respectively).

2,091 citations

Proceedings Article•

KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework

Jesús Alcalá-Fdez¹, Alberto Fernández², Julián Luengo¹, Joaquín Derrac¹, Salvador García³, Luciano Sánchez⁴, Francisco Herrera¹•Institutions (4)

University of Granada¹, Central University of Venezuela², University of Jaén³, University of Oviedo⁴

01 Jan 2011

TL;DR: The aim of this paper is to present three new aspects of KEEL: KEEL-dataset, a data set repository that includes the data set partitions in the KEEL format; guidelines for including new algorithms in KEEL, helping researchers compare their results with the many approaches already included in the KEEL software; and a module of statistical procedures for contrasting the results of an experimental study.

Abstract: KEEL (Knowledge Extraction based on Evolutionary Learning) is an open source software tool that supports data management and a designer of experiments. KEEL pays special attention to the implementation of evolutionary learning and soft computing based techniques for Data Mining problems including regression, classification, clustering, pattern mining and so on. The aim of this paper is to present three new aspects of KEEL: KEEL-dataset, a data set repository which includes the data set partitions in the KEEL format and shows some results of algorithms in these data sets; some guidelines for including new algorithms in KEEL, helping the researchers to make their methods easily accessible to other authors and to compare the results of many approaches already included within the KEEL software; and a module of statistical procedures developed in order to provide to the researcher a suitable tool to contrast the results obtained in any experimental study. A case of study is given to illustrate a complete case of application within this experimental analysis framework.

2,057 citations

Book•

Mining of Massive Datasets

Anand Rajaraman¹, Jeffrey D. Ullman²•Institutions (2)

Walmart Labs¹, Stanford University²

01 Oct 2011

TL;DR: This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets, and explains the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing.

Abstract: The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike.

1,795 citations

Journal Article•DOI•

Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm

Fionn Murtagh¹, Pierre Legendre•Institutions (1)

Association for Computing Machinery¹

27 Nov 2011-arXiv: Machine Learning

TL;DR: This paper surveys the differing interpretations in the literature and the differing implementations of the Ward agglomerative algorithm in commonly used software systems, including differing expressions of the agglomerative criterion.

Abstract: The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. However there are different interpretations in the literature and there are different implementations of the Ward agglomerative algorithm in commonly used software systems, including differing expressions of the agglomerative criterion. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward's hierarchical clustering method.
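
As background for the "agglomerative criterion" mentioned above, the textbook form of Ward's merge cost is sketched below. The symbols ($A$, $B$ for clusters, $c_A$, $c_B$ for their centroids) are notation assumed here, not taken from the listing:

```latex
% Increase in the total within-cluster error sum of squares
% caused by merging clusters A and B with centroids c_A and c_B.
% At each agglomeration step, Ward's method merges the pair
% (A, B) that minimizes this quantity.
\[
  \Delta(A, B)
    = \frac{|A|\,|B|}{|A| + |B|}\,
      \lVert c_A - c_B \rVert^{2}
\]
```

Whether a given implementation expects distances or squared distances as input is one place where expressions of this criterion can diverge between software systems.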

1,290 citations

Journal Article•DOI•

Finding Statistically Significant Communities in Networks

Andrea Lancichinetti¹, Filippo Radicchi², José J. Ramasco³, Santo Fortunato•Institutions (3)

Polytechnic University of Turin¹, Howard Hughes Medical Institute², Spanish National Research Council³

29 Apr 2011-PLOS ONE

TL;DR: OSLOM (Order Statistics Local Optimization Method), the first method capable of detecting clusters in networks while accounting for edge directions, edge weights, overlapping communities, hierarchies and community dynamics, is presented.

Abstract: Community structure is one of the main structural features of networks, revealing both their internal organization and the similarity of their elementary units. Despite the large variety of methods proposed to detect communities in graphs, there is a strong need for multi-purpose techniques, able to handle different types of datasets and the subtleties of community structure. In this paper we present OSLOM (Order Statistics Local Optimization Method), the first method capable of detecting clusters in networks while accounting for edge directions, edge weights, overlapping communities, hierarchies and community dynamics. It is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations, which is estimated with tools of Extreme and Order Statistics. OSLOM can be used alone or as a refinement procedure of partitions/covers delivered by other techniques. We have also implemented sequential algorithms combining OSLOM with other fast techniques, so that the community structure of very large networks can be uncovered. Our method has a performance comparable to that of the best existing algorithms on artificial benchmark graphs. Several applications on real networks are shown as well. OSLOM is implemented in freely available software (http://www.oslom.org), and we believe it will be a valuable tool in the analysis of networks.

1,205 citations

Journal Article•DOI•

A Level Set Method for Image Segmentation in the Presence of Intensity Inhomogeneities With Application to MRI

Chunming Li¹, Rui Huang², Zhaohua Ding¹, J. C. Gatenby¹, Dimitris N. Metaxas², John C. Gore¹•Institutions (2)

Vanderbilt University¹, Rutgers University²

01 Jul 2011-IEEE Transactions on Image Processing

TL;DR: A novel region-based method for image segmentation is proposed that simultaneously segments the image and estimates the bias field; the estimated bias field can be used for intensity inhomogeneity correction (or bias correction).

Abstract: Intensity inhomogeneity often occurs in real-world images, which presents a considerable challenge in image segmentation. The most widely used image segmentation algorithms are region-based and typically rely on the homogeneity of the image intensities in the regions of interest, which often fail to provide accurate segmentation results due to the intensity inhomogeneity. This paper proposes a novel region-based method for image segmentation, which is able to deal with intensity inhomogeneities in the segmentation. First, based on the model of images with intensity inhomogeneities, we derive a local intensity clustering property of the image intensities, and define a local clustering criterion function for the image intensities in a neighborhood of each point. This local clustering criterion function is then integrated with respect to the neighborhood center to give a global criterion of image segmentation. In a level set formulation, this criterion defines an energy in terms of the level set functions that represent a partition of the image domain and a bias field that accounts for the intensity inhomogeneity of the image. Therefore, by minimizing this energy, our method is able to simultaneously segment the image and estimate the bias field, and the estimated bias field can be used for intensity inhomogeneity correction (or bias correction). Our method has been validated on synthetic images and real images of various modalities, with desirable performance in the presence of intensity inhomogeneities. Experiments show that our method is more robust to initialization, faster and more accurate than the well-known piecewise smooth model. As an application, our method has been used for segmentation and bias correction of magnetic resonance (MR) images with promising results.

1,201 citations

Journal Article•DOI•

A survey on clustering algorithms for wireless sensor networks

Olutayo Boyinbode, Hanh Le, Makoto Takizawa

25 May 2011-International Journal of Space-Based and Situated Computing

TL;DR: This paper synthesises existing clustering algorithms in WSNs and highlights the challenges in clustering.

Abstract: A wireless sensor network (WSN) consisting of a large number of tiny sensors can be an effective tool for gathering data in diverse kinds of environments. The data collected by each sensor is communicated to the base station, which forwards the data to the end user. Clustering is introduced to WSNs because it has proven to be an effective approach to provide better data aggregation and scalability for large WSNs. Clustering also conserves the limited energy resources of the sensors. This paper synthesises existing clustering algorithms in WSNs and highlights the challenges in clustering.

1,097 citations

Proceedings Article•DOI•

Patterns of temporal variation in online media

Jaewon Yang¹, Jure Leskovec¹•Institutions (1)

Stanford University¹

09 Feb 2011

TL;DR: This work develops the K-Spectral Centroid (K-SC) clustering algorithm that effectively finds cluster centroids with the authors' similarity measure and presents a simple model that reliably predicts the shape of attention by using information about only a small number of participants.

Abstract: Online content exhibits rich temporal dynamics, and diverse realtime user generated content further intensifies this process. However, temporal patterns by which online content grows and fades over time, and by which different pieces of content compete for attention remain largely unexplored. We study temporal patterns associated with online content and how the content's popularity grows and fades over time. The attention that content receives on the Web varies depending on many factors and occurs on very different time scales and at different resolutions. In order to uncover the temporal dynamics of online content we formulate a time series clustering problem using a similarity metric that is invariant to scaling and shifting. We develop the K-Spectral Centroid (K-SC) clustering algorithm that effectively finds cluster centroids with our similarity measure. By applying an adaptive wavelet-based incremental approach to clustering, we scale K-SC to large data sets. We demonstrate our approach on two massive datasets: a set of 580 million Tweets, and a set of 170 million blog posts and news media articles. We find that K-SC outperforms the K-means clustering algorithm in finding distinct shapes of time series. Our analysis shows that there are six main temporal shapes of attention of online content. We also present a simple model that reliably predicts the shape of attention by using information about only a small number of participants. Our analyses offer insight into common temporal patterns of the content on the Web and broaden the understanding of the dynamics of human attention.
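
To make "invariant to scaling and shifting" concrete, here is a minimal Python sketch of one way such a distance could be written. It is an illustration under stated assumptions, not the paper's implementation: the function names (`shift`, `scaled_distance`, `invariant_distance`), the `max_shift` parameter, and the zero-padding used when shifting are all hypothetical choices. For a fixed shift, the best scaling factor is the least-squares coefficient, so only the shift needs to be searched.

```python
import math


def shift(series, q):
    """Shift a series by q steps (positive = later), padding with zeros.

    Zero-padding is an assumption made for this sketch.
    """
    n = len(series)
    if q >= 0:
        return [0.0] * min(q, n) + list(series[: max(n - q, 0)])
    return list(series[-q:]) + [0.0] * min(-q, n)


def scaled_distance(x, y):
    """Relative residual of x after the best least-squares rescaling of y."""
    norm_x = math.sqrt(sum(v * v for v in x))
    if norm_x == 0.0:
        raise ValueError("x must not be the all-zero series")
    norm_y_sq = sum(v * v for v in y)
    if norm_y_sq == 0.0:
        return 1.0  # nothing to rescale, so the whole of x is residual
    # Closed-form minimizer of ||x - alpha * y|| over alpha.
    alpha = sum(a * b for a, b in zip(x, y)) / norm_y_sq
    residual = math.sqrt(sum((a - alpha * b) ** 2 for a, b in zip(x, y)))
    return residual / norm_x


def invariant_distance(x, y, max_shift):
    """Smallest scaled distance over all shifts of y within +/- max_shift."""
    return min(
        scaled_distance(x, shift(y, q))
        for q in range(-max_shift, max_shift + 1)
    )


if __name__ == "__main__":
    peak = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
    # Same shape, three times taller and one step later.
    later_and_taller = [0.0, 0.0, 3.0, 6.0, 3.0, 0.0]
    print(invariant_distance(peak, later_and_taller, max_shift=2))
```

Under this definition two series with the same shape but different volume and peak time are treated as near-identical, which is the property a clustering of "temporal shapes" needs.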

1,041 citations

Journal Article•DOI•

A novel clustering approach: Artificial Bee Colony (ABC) algorithm

Dervis Karaboga¹, Celal Ozturk¹•Institutions (1)

Erciyes University¹

01 Jan 2011

TL;DR: The ABC algorithm is compared with the Particle Swarm Optimization (PSO) algorithm and nine other classification techniques from the literature; simulation results indicate that it can be used efficiently for multivariate data clustering.

Abstract: The Artificial Bee Colony (ABC) algorithm, which is one of the most recently introduced optimization algorithms, simulates the intelligent foraging behavior of a honey bee swarm. Clustering analysis, used in many disciplines and applications, is an important tool and a descriptive task seeking to identify homogeneous groups of objects based on the values of their attributes. In this work, ABC is used for data clustering on benchmark problems and the performance of the ABC algorithm is compared with the Particle Swarm Optimization (PSO) algorithm and nine other classification techniques from the literature. Thirteen typical test data sets from the UCI Machine Learning Repository are used to demonstrate the results of the techniques. The simulation results indicate that the ABC algorithm can efficiently be used for multivariate data clustering.

Proceedings Article•

Co-regularized Multi-view Spectral Clustering

Abhishek Kumar¹, Piyush Rai², Hal Daumé¹•Institutions (2)

University of Maryland, College Park¹, University of Utah²

12 Dec 2011

TL;DR: A spectral clustering framework is proposed that finds clusterings consistent across multiple views of the data by co-regularizing the clustering hypotheses, and two co-regularization schemes are proposed to accomplish this.

Abstract: In many clustering problems, we have access to multiple views of the data, each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit the same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have the same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.

Autoencoders, unsupervised learning and deep architectures

Pierre Baldi¹•Institutions (1)

University of California, Irvine¹

02 Jul 2011

TL;DR: The framework sheds light on the different kinds of autoencoders, their learning complexity, their horizontal and vertical composability in deep architectures, their critical points, and their fundamental connections to clustering, Hebbian learning, and information theory.

Abstract: Autoencoders play a fundamental role in unsupervised learning and in deep architectures for transfer learning and other tasks. In spite of their fundamental role, only linear autoencoders over the real numbers have been solved analytically. Here we present a general mathematical framework for the study of both linear and non-linear autoencoders. The framework allows one to derive an analytical treatment for the most non-linear autoencoder, the Boolean autoencoder. Learning in the Boolean autoencoder is equivalent to a clustering problem that can be solved in polynomial time when the number of clusters is small and becomes NP complete when the number of clusters is large. The framework sheds light on the different kinds of autoencoders, their learning complexity, their horizontal and vertical composability in deep architectures, their critical points, and their fundamental connections to clustering, Hebbian learning, and information theory.

Journal Article•DOI•

A global averaging method for dynamic time warping, with applications to clustering

François Petitjean¹, Alain Ketterlin¹, Pierre Gançarski¹•Institutions (1)

University of Strasbourg¹

01 Mar 2011-Pattern Recognition

TL;DR: A global technique for averaging a set of sequences is developed, which avoids using iterative pairwise averaging and is thus insensitive to ordering effects, and a new strategy to reduce the length of the resulting average sequence is described.

Journal Article•

Subspace Clustering

René Vidal

01 Mar 2011-IEEE Signal Processing Magazine

TL;DR: This article presented a review of existing subspace clustering algorithms together with an experimental evaluation on the motion segmentation and face clustering problems in computer vision.

Abstract: Over the past few decades, significant progress has been made in clustering high-dimensional data sets distributed around a collection of linear and affine subspaces. This article presented a review of such progress, which included a number of existing subspace clustering algorithms together with an experimental evaluation on the motion segmentation and face clustering problems in computer vision.

Proceedings Article•

A Co-training Approach for Multi-view Spectral Clustering

Abhishek Kumar¹, Hal Daumé¹•Institutions (1)

University of Maryland, College Park¹

28 Jun 2011

TL;DR: A spectral clustering algorithm with a flavor of co-training is proposed for the multi-view setting, where multiple views of the data are available and each can be used independently for clustering.

Abstract: We propose a spectral clustering algorithm for the multi-view setting where we have access to multiple views of the data, each of which can be independently used for clustering. Our spectral clustering algorithm has a flavor of co-training, which is already a widely used idea in semi-supervised learning. We work on the assumption that the true underlying clustering would assign a point to the same cluster irrespective of the view. Hence, we constrain our approach to only search for the clusterings that agree across the views. Our algorithm does not have any hyperparameters to set, which is a major advantage in unsupervised learning. We empirically compare with a number of baseline methods on synthetic and real-world datasets to show the efficacy of the proposed algorithm.

Journal Article•DOI•

Spectral clustering and the high-dimensional stochastic blockmodel

Karl Rohe¹, Sourav Chatterjee¹, Bin Yu¹•Institutions (1)

University of California, Berkeley¹

01 Aug 2011-Annals of Statistics

TL;DR: The authors study spectral clustering under the Stochastic Block Model, bounding the number of misclustered nodes, and show that under the more general latent space model the eigenvectors of the normalized graph Laplacian asymptotically converge to the eigenvectors of a population normalized graph Laplacian.

Abstract: Networks or graphs can easily represent a diverse set of data sources that are characterized by interacting units or actors. Social networks, representing people who communicate with each other, are one example. Communities or clusters of highly connected actors form an essential feature in the structure of several empirical networks. Spectral clustering is a popular and computationally feasible method to discover these communities. The Stochastic Block Model (Holland et al., 1983) is a social network model with well defined communities; each node is a member of one community. For a network generated from the Stochastic Block Model, we bound the number of nodes "misclustered" by spectral clustering. The asymptotic results in this paper are the first clustering results that allow the number of clusters in the model to grow with the number of nodes, hence the name high-dimensional. In order to study spectral clustering under the Stochastic Block Model, we first show that under the more general latent space model, the eigenvectors of the normalized graph Laplacian asymptotically converge to the eigenvectors of a "population" normalized graph Laplacian. Aside from the implication for spectral clustering, this provides insight into a graph visualization technique. Our method of studying the eigenvectors of random matrices is original. AMS 2000 subject classifications: Primary 62H30, 62H25; secondary 60B20.

Journal Article•DOI•

Assessing and Improving Methods Used in Operational Taxonomic Unit-Based Approaches for 16S rRNA Gene Sequence Analysis

Patrick D. Schloss¹, Sarah L. Westcott¹•Institutions (1)

University of Michigan¹

15 May 2011-Applied and Environmental Microbiology

TL;DR: The ability to quickly and accurately assign sequences to OTUs and then obtain taxonomic information for those OTUs will greatly improve OTU-based analyses and overcome many of the challenges encountered with phylotype-based methods.

Abstract: In spite of technical advances that have provided increases in orders of magnitude in sequencing coverage, microbial ecologists still grapple with how to interpret the genetic diversity represented by the 16S rRNA gene. Two widely used approaches put sequences into bins based on either their similarity to reference sequences (i.e., phylotyping) or their similarity to other sequences in the community (i.e., operational taxonomic units [OTUs]). In the present study, we investigate three issues related to the interpretation and implementation of OTU-based methods. First, we confirm the conventional wisdom that it is impossible to create an accurate distance-based threshold for defining taxonomic levels and instead advocate for a consensus-based method of classifying OTUs. Second, using a taxonomic-independent approach, we show that the average neighbor clustering algorithm produces more robust OTUs than other hierarchical and heuristic clustering algorithms. Third, we demonstrate several steps to reduce the computational burden of forming OTUs without sacrificing the robustness of the OTU assignment. Finally, by blending these solutions, we propose a new heuristic that has a minimal effect on the robustness of OTUs and significantly reduces the necessary time and memory requirements. The ability to quickly and accurately assign sequences to OTUs and then obtain taxonomic information for those OTUs will greatly improve OTU-based analyses and overcome many of the challenges encountered with phylotype-based methods.

Journal Article•DOI•

Particle Swarm Optimization in Wireless-Sensor Networks: A Brief Survey

Raghavendra V. Kulkarni¹, Ganesh K. Venayagamoorthy¹•Institutions (1)

Missouri University of Science and Technology¹

01 Mar 2011

TL;DR: Issues in WSNs are outlined, PSO is introduced and its suitability for WSN applications is discussed, and a brief survey of how PSO is tailored to address these issues is presented.

Abstract: Wireless-sensor networks (WSNs) are networks of autonomous nodes used for monitoring an environment. Developers of WSNs face challenges that arise from communication link failures, memory and computational constraints, and limited energy. Many issues in WSNs are formulated as multidimensional optimization problems, and approached through bioinspired techniques. Particle swarm optimization (PSO) is a simple, effective, and computationally efficient optimization algorithm. It has been applied to address WSN issues such as optimal deployment, node localization, clustering, and data aggregation. This paper outlines issues in WSNs, introduces PSO, and discusses its suitability for WSN applications. It also presents a brief survey of how PSO is tailored to address these issues.

Proceedings Article•DOI•

Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks

Paolo Boldi¹, Marco Rosa¹, Massimo Santini¹, Sebastiano Vigna¹•Institutions (1)

University of Milan¹

28 Mar 2011

TL;DR: In this paper, a Layered Label Propagation algorithm is proposed that can reorder very large graphs (billions of nodes); the implementation uses task decomposition to exploit multi-core architectures, making it possible to reorder graphs of more than 600 million nodes in a few hours.

Abstract: We continue the line of research on graph compression started with WebGraph, but we move our focus to the compression of social networks in a proper sense (e.g., LiveJournal): the approaches that have been used for a long time to compress web graphs rely on a specific ordering of the nodes (lexicographical URL ordering) whose extension to general social networks is not trivial. In this paper, we propose a solution that mixes clusterings and orders, and devise a new algorithm, called Layered Label Propagation, that builds on previous work on scalable clustering and can be used to reorder very large graphs (billions of nodes). Our implementation uses task decomposition to perform aggressively on multi-core architecture, making it possible to reorder graphs of more than 600 million nodes in a few hours. Experiments performed on a wide array of web graphs and social networks show that combining the order produced by the proposed algorithm with the WebGraph compression framework provides a major increase in compression with respect to all currently known techniques, both on web graphs and on social networks. These improvements make it possible to analyse in main memory significantly larger graphs.

Journal Article•DOI•

Parallel Spectral Clustering in Distributed Systems

Wen-Yen Chen¹, Yangqiu Song², Hongjie Bai³, Chih-Jen Lin⁴, Edward Y. Chang³•Institutions (4)

Yahoo!¹, Microsoft², Google³, National Taiwan University⁴

01 Mar 2011-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This work investigates two representative ways of approximating the dense similarity matrix and picks the strategy of sparsifying the matrix via retaining nearest neighbors and investigates its parallelization, which can effectively handle large problems.

Abstract: Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nystrom method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863, we show that our parallel algorithm can effectively handle large problems.

Journal Article•DOI•

Density-based clustering

Hans-Peter Kriegel¹, Peer Kröger¹, Jörg Sander², Arthur Zimek¹•Institutions (2)

Ludwig Maximilian University of Munich¹, University of Alberta²

01 May 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: Clustering is the task of identifying groups or clusters in a data set; in density-based clustering, a cluster is a set of data objects spread in the data space over a contiguous region of high density of objects.

Abstract: Clustering refers to the task of identifying groups or clusters in a data set. In density-based clustering, a cluster is a set of data objects spread in the data space over a contiguous region of high density of objects. Density-based clusters are separated from each other by contiguous regions of low density of objects. Data objects located in low-density regions are typically considered noise or outliers. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1: 231–240. DOI: 10.1002/widm.30. This article is categorized under: Technologies > Structure Discovery and Clustering
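
The definition above can be made concrete with a small sketch in the style of DBSCAN, a well-known density-based method that the abstract itself does not name. The function name `density_cluster` and the parameters `eps` (neighborhood radius) and `min_pts` (density threshold) are assumptions chosen for illustration.

```python
import math

def density_cluster(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n              # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts: # j is itself dense, so keep expanding
                queue.extend(nbrs)
    return labels
```

Points in contiguous dense regions end up sharing a label, while isolated points in low-density regions keep the label -1, matching the noise or outlier role described in the abstract.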

Journal Article•DOI•

A survey of clustering ensemble algorithms

Sandro Vega-Pons, José Ruiz-Shulcloper

21 Nov 2011-International Journal of Pattern Recognition and Artificial Intelligence

TL;DR: An overview of clustering ensemble methods that can be very useful for the community of clustering practitioners is presented, together with a taxonomy of these techniques and some important applications.

Abstract: Cluster ensemble has proved to be a good alternative when facing cluster analysis problems. It consists of generating a set of clusterings from the same dataset and combining them into a final clustering. The goal of this combination process is to improve the quality of individual data clusterings. Due to the increasing appearance of new methods, their promising results and the great number of applications, we consider that it is necessary to make a critical analysis of the existing techniques and future projections. This paper presents an overview of clustering ensemble methods that can be very useful for the community of clustering practitioners. The characteristics of several methods are discussed, which may help in the selection of the most appropriate one to solve a problem at hand. We also present a taxonomy of these techniques and illustrate some important applications.
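
The idea of generating several clusterings and combining them into a final one can be sketched with a co-association (evidence accumulation) scheme. This is one common family of ensemble methods and is not claimed to be the approach favored by the survey; the function name `consensus_clustering` and the `threshold` parameter are assumptions.

```python
def consensus_clustering(partitions, threshold=0.5):
    """Combine several labelings of the same n objects into one labeling.

    Two objects are linked when the fraction of input partitions that place
    them in the same cluster exceeds `threshold`; the consensus clusters are
    the connected components of those links.
    """
    n = len(partitions[0])
    m = len(partitions)
    parent = list(range(n))          # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            together = sum(1 for p in partitions if p[i] == p[j])
            if together / m > threshold:
                parent[find(i)] = find(j)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

Because only co-membership is counted, the input partitions may use different label names and even different numbers of clusters.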

Journal Article•DOI•

Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction

Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni

31 Mar 2011-International Journal of Computer Applications

TL;DR: A survey of current knowledge-discovery and data mining techniques used in medical research, particularly heart disease prediction, finds that Decision Tree performs best, that Bayesian classification sometimes reaches similar accuracy, and that other predictive methods do not perform well.

Abstract: The successful application of data mining in highly visible fields like e-business, marketing and retail has led to its application in other industries and sectors. Among the sectors just discovering it is healthcare. The healthcare environment is still "information rich" but "knowledge poor". There is a wealth of data available within healthcare systems. However, there is a lack of effective analysis tools to discover hidden relationships and trends in data. This research paper intends to provide a survey of current techniques of knowledge discovery in databases using data mining techniques that are in use in today's medical research, particularly in heart disease prediction. A number of experiments have been conducted to compare the performance of predictive data mining techniques on the same dataset, and the outcome reveals that Decision Tree performs best and Bayesian classification sometimes has similar accuracy to Decision Tree, but other predictive methods like KNN, Neural Networks, and classification based on clustering do not perform well. The second conclusion is that the accuracy of Decision Tree and Bayesian classification further improves after applying a genetic algorithm to reduce the actual data size to get the optimal subset of attributes sufficient for heart disease prediction.

Journal Article•DOI•

Weighted dynamic time warping for time series classification

Young-Seon Jeong¹, Myong K. Jeong¹, Olufemi A. Omitaomu²•Institutions (2)

Rutgers University¹, Oak Ridge National Laboratory²

01 Sep 2011-Pattern Recognition

TL;DR: A novel distance measure, called a weighted DTW (WDTW), which is a penalty-based DTW that penalizes points with higher phase difference between a reference point and a testing point in order to prevent minimum distance distortion caused by outliers is proposed.
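
A penalty-based DTW of this kind can be sketched as below. The sketch assumes equal-length sequences, a squared pointwise difference, and a logistic weight that grows with the phase difference |i - j|; the names `wdtw_distance`, `g`, and `w_max` are illustrative choices, not taken from the listing.

```python
import math

def wdtw_distance(a, b, g=0.1, w_max=1.0):
    """Weighted DTW distance between two equal-length numeric sequences."""
    n = len(a)
    assert len(b) == n, "this sketch assumes equal-length sequences"
    mid = n / 2.0
    # weight[k] increases with the phase difference k = |i - j| (assumed logistic form).
    weight = [w_max / (1.0 + math.exp(-g * (k - mid))) for k in range(n)]

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # Pointwise cost is scaled by the penalty for warping far from the diagonal.
            d = weight[abs(i - j)] * (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][n])
```

With a larger `g`, alignments between points that are far apart in time are penalized more steeply, which limits how strongly a single outlier can pull the warping path away from the diagonal.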

Journal Article•DOI•

Robust statistics for outlier detection

Peter J. Rousseeuw¹, Mia Hubert¹•Institutions (1)

Katholieke Universiteit Leuven¹

01 Jan 2011-Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery

TL;DR: An overview of several robust methods and outlier detection tools for univariate, low‐dimensional, and high‐dimensional data such as estimation of location and scatter, linear regression, principal component analysis, and classification are presented.

Abstract: When analyzing data, outlying observations cause problems because they may strongly influence the result. Robust statistics aims at detecting the outliers by searching for the model fitted by the majority of the data. We present an overview of several robust methods and outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data such as estimation of location and scatter, linear regression, principal component analysis, and classification. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1: 73-79. DOI: 10.1002/widm.2. This article is categorized under: Algorithmic Development > Biological Data Mining; Algorithmic Development > Spatial and Temporal Data Mining; Application Areas > Health Care; Technologies > Structure Discovery and Clustering
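
For the univariate case, fitting "the majority of the data" can be illustrated with the median as a robust location estimate and the median absolute deviation (MAD) as a robust scale estimate. The function name `robust_outliers` and the cutoff of 3.0 are assumptions made for this sketch, not values taken from the listing.

```python
import statistics

def robust_outliers(values, cutoff=3.0):
    """Return the values whose robust z-score exceeds `cutoff` in absolute value."""
    med = statistics.median(values)
    # The factor 1.4826 makes the MAD consistent with the standard deviation
    # when the data are normally distributed.
    mad = 1.4826 * statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []    # at least half the values equal the median; no usable scale
    return [v for v in values if abs(v - med) / mad > cutoff]
```

Unlike the mean and standard deviation, the median and MAD barely move when a few extreme values are present, so the outliers cannot mask themselves by inflating the scale estimate.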

Proceedings Article•DOI•

Sparsity-based image denoising via dictionary learning and structural clustering

Weisheng Dong¹, Xin Li², Lei Zhang, Guangming Shi¹•Institutions (2)

Xidian University¹, West Virginia University²

20 Jun 2011

TL;DR: A double-header l1-optimization problem where the regularization involves both dictionary learning and structural clustering is formulated and a new denoising algorithm built upon clustering-based sparse representation (CSR) is proposed.

Abstract: Where does the sparsity in image signals come from? Local and nonlocal image models have supplied complementary views toward the regularity in natural images: the former attempts to construct or learn a dictionary of basis functions that promotes the sparsity, while the latter connects the sparsity with the self-similarity of the image source by clustering. In this paper, we present a variational framework for unifying the above two views and propose a new denoising algorithm built upon clustering-based sparse representation (CSR). Inspired by the success of l1-optimization, we have formulated a double-header l1-optimization problem where the regularization involves both dictionary learning and structural clustering. A surrogate-function based iterative shrinkage solution has been developed to solve the double-header l1-optimization problem and a probabilistic interpretation of the CSR model is also included. Our experimental results have shown convincing improvements over the state-of-the-art denoising technique BM3D on the class of regular texture images. The PSNR performance of CSR denoising is at least comparable and often superior to other competing schemes including BM3D on a collection of 12 generic natural images.

Journal Article•DOI•

clusterMaker: a multi-algorithm clustering plugin for Cytoscape

John H. Morris¹, Leonard Apeltsin¹, Aaron M. Newman², Jan Baumbach³, Tobias Wittkop⁴, Gang Su⁵, Gary D. Bader⁶, Thomas E. Ferrin¹•Institutions (6)

University of California, San Francisco¹, Stanford University², Max Planck Society³, Buck Institute for Research on Aging⁴, University of Michigan⁵, University of Toronto⁶

09 Nov 2011-BMC Bioinformatics

TL;DR: The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function.

Abstract: In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present clusterMaker, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. clusterMaker is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view), k-means, k-medoid, SCPS, AutoSOME, and native (Java) MCL. Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast Saccharomyces cerevisiae; and the cluster analysis of the vicinal oxygen chelate (VOC) enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. Cytoscape session files for all three scenarios are provided in the Additional Files section. 
The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function. Several of these visualizations and algorithms are only available to Cytoscape users through the clusterMaker plugin. clusterMaker is available via the Cytoscape plugin manager.

Journal Article•DOI•

Discovering Activities to Recognize and Track in a Smart Environment

Parisa Rashidi¹, Diane J. Cook¹, Lawrence B. Holder¹, Maureen Schmitter-Edgecombe¹•Institutions (1)

Washington State University¹

01 Apr 2011-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This paper introduces an automated approach to activity tracking that identifies frequent activities that naturally occur in an individual's routine and can then track the occurrence of regular activities to monitor functional health and to detect changes in an individual's patterns and lifestyle.

Abstract: The machine learning and pervasive sensing technologies found in smart homes offer unprecedented opportunities for providing health monitoring and assistance to individuals experiencing difficulties living independently at home. In order to monitor the functional health of smart home residents, we need to design technologies that recognize and track activities that people normally perform as part of their daily routines. Although approaches do exist for recognizing activities, the approaches are applied to activities that have been preselected and for which labeled training data are available. In contrast, we introduce an automated approach to activity tracking that identifies frequent activities that naturally occur in an individual's routine. With this capability, we can then track the occurrence of regular activities to monitor functional health and to detect changes in an individual's patterns and lifestyle. In this paper, we describe our activity mining and tracking approach, and validate our algorithms on data collected in physical smart environments.

Journal Article•DOI•

The MOPED framework: Object recognition and pose estimation for manipulation

Alvaro Collet¹, Manuel Martinez¹, Siddhartha S. Srinivasa²•Institutions (2)

Carnegie Mellon University¹, Intel²

01 Sep 2011-The International Journal of Robotics Research

TL;DR: MOPED, a framework for Multiple Object Pose Estimation and Detection that seamlessly integrates single-image and multi-image object recognition and pose estimation in one optimized, robust, and scalable framework is presented.

Abstract: We present MOPED, a framework for Multiple Object Pose Estimation and Detection that seamlessly integrates single-image and multi-image object recognition and pose estimation in one optimized, robust, and scalable framework. We address two main challenges in computer vision for robotics: robust performance in complex scenes, and low latency for real-time operation. We achieve robust performance with Iterative Clustering Estimation (ICE), a novel algorithm that iteratively combines feature clustering with robust pose estimation. Feature clustering quickly partitions the scene and produces object hypotheses. The hypotheses are used to further refine the feature clusters, and the two steps iterate until convergence. ICE is easy to parallelize, and easily integrates single- and multi-camera object recognition and pose estimation. We also introduce a novel object hypothesis scoring function based on M-estimator theory, and a novel pose clustering algorithm that robustly handles recognition outliers. We achieve scalability and low latency with an improved feature matching algorithm for large databases, a GPU/CPU hybrid architecture that exploits parallelism at all levels, and an optimized resource scheduler. We provide extensive experimental results demonstrating state-of-the-art performance in terms of recognition, scalability, and latency in real-world robotic applications.
