scispace - formally typeset
Search or ask a question

Showing papers on "Cluster analysis published in 2011"


Journal ArticleDOI
18 Jul 2011-PLOS ONE
TL;DR: REVIGO is a Web server that summarizes long, unintelligible lists of GO terms by finding a representative subset of the terms using a simple clustering algorithm that relies on semantic similarity measures.
Abstract: Outcomes of high-throughput biological experiments are typically interpreted by statistical testing for enriched gene functional categories defined by the Gene Ontology (GO). The resulting lists of GO terms may be large and highly redundant, and thus difficult to interpret. REVIGO is a Web server that summarizes long, unintelligible lists of GO terms by finding a representative subset of the terms using a simple clustering algorithm that relies on semantic similarity measures. Furthermore, REVIGO visualizes this non-redundant GO term set in multiple ways to assist in interpretation: multidimensional scaling and graph-based visualizations accurately render the subdivisions and the semantic relationships in the data, while treemaps and tag clouds are also offered as alternative views. REVIGO is freely available at http://revigo.irb.hr/.

4,919 citations


Journal ArticleDOI
TL;DR: The authors proposed a variance estimator for the OLS estimator as well as for nonlinear estimators such as logit, probit, and GMM that enables cluster-robust inference when there is two-way or multiway clustering that is nonnested.
Abstract: In this article we propose a variance estimator for the OLS estimator as well as for nonlinear estimators such as logit, probit, and GMM. This variance estimator enables cluster-robust inference when there is two-way or multiway clustering that is nonnested. The variance estimator extends the standard cluster-robust variance estimator or sandwich estimator for one-way clustering (e.g., Liang and Zeger 1986; Arellano 1987) and relies on similar relatively weak distributional assumptions. Our method is easily implemented in statistical packages, such as Stata and SAS, that already offer cluster-robust standard errors when there is one-way clustering. The method is demonstrated by a Monte Carlo analysis for a two-way random effects model; a Monte Carlo analysis of a placebo law that extends the state–year effects example of Bertrand, Duflo, and Mullainathan (2004) to two dimensions; and by application to studies in the empirical literature where two-way clustering is present.

2,542 citations


Proceedings Article
14 Jun 2011
TL;DR: In this paper, the authors show that the number of hidden nodes in the model may be more important to achieving high performance than the learning algorithm or the depth of the model, and they apply several othe-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to CIFAR, NORB, and STL datasets using only single-layer networks.
Abstract: A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several simple factors, such as the number of hidden nodes in the model, may be more important to achieving high performance than the learning algorithm or the depth of the model. Specifically, we will apply several othe-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to CIFAR, NORB, and STL datasets using only singlelayer networks. We then present a detailed analysis of the eect of changes in the model setup: the receptive field size, number of hidden nodes (features), the step-size (“stride”) between extracted features, and the eect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are critical to achieving high performance—so critical, in fact, that when these parameters are pushed to their limits, we achieve state-of-the-art performance on both CIFAR-10 and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyperparameters to tune beyond the model structure itself, and is very easy to implement. Despite the simplicity of our system, we achieve accuracy beyond all previously published results on the CIFAR-10 and NORB datasets (79.6% and 97.2% respectively).

2,091 citations


Proceedings Article
01 Jan 2011
TL;DR: The aim of this paper is to present three new aspects of KEEL: KEEL-dataset, a data set repository which includes the data set partitions in theKEELformat and some guidelines for including new algorithms in KEEL, helping the researcher to compare the results of many approaches already included within the KEEL software.
Abstract: (Knowledge Extraction based onEvolutionary Learning) tool, an open source software that supports datamanagement and a designer of experiments. KEEL pays special attentionto the implementation of evolutionary learning and soft computing basedtechniques for Data Mining problems including regression, classification,clustering, pattern mining and so on.The aim of this paper is to present three new aspects of KEEL: KEEL-dataset, a data set repository which includes the data set partitions in theKEELformatandshowssomeresultsofalgorithmsinthesedatasets; someguidelines for including new algorithms in KEEL, helping the researcherstomaketheirmethodseasilyaccessibletootherauthorsandtocomparetheresults of many approaches already included within the KEEL software;and a module of statistical procedures developed in order to provide to theresearcher a suitable tool to contrast the results obtained in any experimen-talstudy.Acaseofstudyisgiventoillustrateacompletecaseofapplicationwithin this experimental analysis framework.

2,057 citations


Book
01 Oct 2011
TL;DR: This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets, and explains the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing.
Abstract: The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike.

1,795 citations


Journal ArticleDOI
TL;DR: There are different interpretations in the literature and there are different implementations of the Ward agglomerative algorithm in commonly used software systems, including differing expressions of the aggLomerative criterion as mentioned in this paper.
Abstract: The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. However there are different interpretations in the literature and there are different implementations of the Ward agglomerative algorithm in commonly used software systems, including differing expressions of the agglomerative criterion. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward's hierarchical clustering method.

1,290 citations


Journal ArticleDOI
29 Apr 2011-PLOS ONE
TL;DR: OSLOM (Order Statistics Local Optimization Method), the first method capable to detect clusters in networks accounting for edge directions, edge weights, overlapping communities, hierarchies and community dynamics, is presented.
Abstract: Community structure is one of the main structural features of networks, revealing both their internal organization and the similarity of their elementary units. Despite the large variety of methods proposed to detect communities in graphs, there is a big need for multi-purpose techniques, able to handle different types of datasets and the subtleties of community structure. In this paper we present OSLOM (Order Statistics Local Optimization Method), the first method capable to detect clusters in networks accounting for edge directions, edge weights, overlapping communities, hierarchies and community dynamics. It is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations, which is estimated with tools of Extreme and Order Statistics. OSLOM can be used alone or as a refinement procedure of partitions/covers delivered by other techniques. We have also implemented sequential algorithms combining OSLOM with other fast techniques, so that the community structure of very large networks can be uncovered. Our method has a comparable performance as the best existing algorithms on artificial benchmark graphs. Several applications on real networks are shown as well. OSLOM is implemented in a freely available software (http://www.oslom.org), and we believe it will be a valuable tool in the analysis of networks.

1,205 citations


Journal ArticleDOI
TL;DR: A novel region-based method for image segmentation, which is able to simultaneously segment the image and estimate the bias field, and the estimated bias field can be used for intensity inhomogeneity correction (or bias correction).
Abstract: Intensity inhomogeneity often occurs in real-world images, which presents a considerable challenge in image segmentation. The most widely used image segmentation algorithms are region-based and typically rely on the homogeneity of the image intensities in the regions of interest, which often fail to provide accurate segmentation results due to the intensity inhomogeneity. This paper proposes a novel region-based method for image segmentation, which is able to deal with intensity inhomogeneities in the segmentation. First, based on the model of images with intensity inhomogeneities, we derive a local intensity clustering property of the image intensities, and define a local clustering criterion function for the image intensities in a neighborhood of each point. This local clustering criterion function is then integrated with respect to the neighborhood center to give a global criterion of image segmentation. In a level set formulation, this criterion defines an energy in terms of the level set functions that represent a partition of the image domain and a bias field that accounts for the intensity inhomogeneity of the image. Therefore, by minimizing this energy, our method is able to simultaneously segment the image and estimate the bias field, and the estimated bias field can be used for intensity inhomogeneity correction (or bias correction). Our method has been validated on synthetic images and real images of various modalities, with desirable performance in the presence of intensity inhomogeneities. Experiments show that our method is more robust to initialization, faster and more accurate than the well-known piecewise smooth model. As an application, our method has been used for segmentation and bias correction of magnetic resonance (MR) images with promising results.

1,201 citations


Journal ArticleDOI
TL;DR: This paper synthesises existing clustering algorithms news's and highlights the challenges in clustering.
Abstract: A wireless sensor network (WSN) consisting of a large number of tiny sensors can be an effective tool for gathering data in diverse kinds of environments. The data collected by each sensor is communicated to the base station, which forwards the data to the end user. Clustering is introduced to WSNs because it has proven to be an effective approach to provide better data aggregation and scalability for large WSNs. Clustering also conserves the limited energy resources of the sensors. This paper synthesises existing clustering algorithms in WSNs and highlights the challenges in clustering.

1,097 citations


Proceedings ArticleDOI
09 Feb 2011
TL;DR: This work develops the K-Spectral Centroid (K-SC) clustering algorithm that effectively finds cluster centroids with the authors' similarity measure and presents a simple model that reliably predicts the shape of attention by using information about only a small number of participants.
Abstract: Online content exhibits rich temporal dynamics, and diverse realtime user generated content further intensifies this process. However, temporal patterns by which online content grows and fades over time, and by which different pieces of content compete for attention remain largely unexplored.We study temporal patterns associated with online content and how the content's popularity grows and fades over time. The attention that content receives on the Web varies depending on many factors and occurs on very different time scales and at different resolutions. In order to uncover the temporal dynamics of online content we formulate a time series clustering problem using a similarity metric that is invariant to scaling and shifting. We develop the K-Spectral Centroid (K-SC) clustering algorithm that effectively finds cluster centroids with our similarity measure. By applying an adaptive wavelet-based incremental approach to clustering, we scale K-SC to large data sets.We demonstrate our approach on two massive datasets: a set of 580 million Tweets, and a set of 170 million blog posts and news media articles. We find that K-SC outperforms the K-means clustering algorithm in finding distinct shapes of time series. Our analysis shows that there are six main temporal shapes of attention of online content. We also present a simple model that reliably predicts the shape of attention by using information about only a small number of participants. Our analyses offer insight into common temporal patterns of the content on theWeb and broaden the understanding of the dynamics of human attention.

1,041 citations


Journal ArticleDOI
01 Jan 2011
TL;DR: Simulation results indicate that ABC algorithm can efficiently be used for multivariate data clustering and is compared with Particle Swarm Optimization (PSO) algorithm and other nine classification techniques from the literature.
Abstract: Artificial Bee Colony (ABC) algorithm which is one of the most recently introduced optimization algorithms, simulates the intelligent foraging behavior of a honey bee swarm. Clustering analysis, used in many disciplines and applications, is an important tool and a descriptive task seeking to identify homogeneous groups of objects based on the values of their attributes. In this work, ABC is used for data clustering on benchmark problems and the performance of ABC algorithm is compared with Particle Swarm Optimization (PSO) algorithm and other nine classification techniques from the literature. Thirteen of typical test data sets from the UCI Machine Learning Repository are used to demonstrate the results of the techniques. The simulation results indicate that ABC algorithm can efficiently be used for multivariate data clustering.

Proceedings Article
12 Dec 2011
TL;DR: A spectral clustering framework is proposed that achieves this goal by co-regularizing the clustering hypotheses, and two co- regularization schemes are proposed to accomplish this.
Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.

02 Jul 2011
TL;DR: The framework sheds light on the different kinds of autoencoders, their learning complexity, their horizontal and vertical composability in deep architectures, their critical points, and their fundamental connections to clustering, Hebbian learning, and information theory.
Abstract: Autoencoders play a fundamental role in unsupervised learning and in deep architectures for transfer learning and other tasks. In spite of their fundamental role, only linear autoencoders over the real numbers have been solved analytically. Here we present a general mathematical framework for the study of both linear and non-linear autoencoders. The framework allows one to derive an analytical treatment for the most non-linear autoencoder, the Boolean autoencoder. Learning in the Boolean autoencoder is equivalent to a clustering problem that can be solved in polynomial time when the number of clusters is small and becomes NP complete when the number of clusters is large. The framework sheds light on the different kinds of autoencoders, their learning complexity, their horizontal and vertical composability in deep architectures, their critical points, and their fundamental connections to clustering, Hebbian learning, and information theory.

Journal ArticleDOI
TL;DR: A global technique for averaging a set of sequences is developed, which avoids using iterative pairwise averaging and is thus insensitive to ordering effects, and a new strategy to reduce the length of the resulting average sequence is described.

Journal Article
TL;DR: This article presented a review of existing subspace clustering algorithms together with an experimental evaluation on the motion segmentation and face clustering problems in computer vision.
Abstract: Over the past few decades, significant progress has been made in clustering high-dimensional data sets distributed around a collection of linear and affine subspaces. This article presented a review of such progress, which included a number of existing subspace clustering algorithms together with an experimental evaluation on the motion segmentation and face clustering problems in computer vision.

Proceedings Article
28 Jun 2011
TL;DR: A spectral clustering algorithm for the multi-view setting where the authors have access to multiple views of the data, each of which can be independently used for clustering, which has a flavor of co-training.
Abstract: We propose a spectral clustering algorithm for the multi-view setting where we have access to multiple views of the data, each of which can be independently used for clustering. Our spectral clustering algorithm has a flavor of co-training, which is already a widely used idea in semi-supervised learning. We work on the assumption that the true underlying clustering would assign a point to the same cluster irrespective of the view. Hence, we constrain our approach to only search for the clusterings that agree across the views. Our algorithm does not have any hyperparameters to set, which is a major advantage in unsupervised learning. We empirically compare with a number of baseline methods on synthetic and real-world datasets to show the efficacy of the proposed algorithm.

Journal ArticleDOI
TL;DR: In this article, the authors studied spectral clustering under the stochastic block model and showed that the eigenvectors of the normalized graph Laplacian asymptotically converges to the eigens of a population normalized graph.
Abstract: Networks or graphs can easily represent a diverse set of data sources that are characterized by interacting units or actors. Social ne tworks, representing people who communicate with each other, are one example. Communities or clusters of highly connected actors form an essential feature in the structure of several empirical networks. Spectral clustering is a popular and computationally feasi ble method to discover these communities. The Stochastic Block Model (Holland et al., 1983) is a social network model with well defined communities; each node is a member of one community. For a network generated from the Stochastic Block Model, we bound the number of nodes "misclus- tered" by spectral clustering. The asymptotic results in th is paper are the first clustering results that allow the number of clusters in the model to grow with the number of nodes, hence the name high-dimensional. In order to study spectral clustering under the Stochastic Block Model, we first show that under the more general latent space model, the eigenvectors of the normalized graph Laplacian asymptotically converge to the eigenvectors of a "population" normal- ized graph Laplacian. Aside from the implication for spectral clustering, this provides insight into a graph visualization technique. Our method of studying the eigenvectors of random matrices is original. AMS 2000 subject classifications: Primary 62H30, 62H25; secondary 60B20.

Journal ArticleDOI
TL;DR: The ability to quickly and accurately assign sequences to OTUs and then obtain taxonomic information for those OTUs will greatly improve OTU-based analyses and overcome many of the challenges encountered with phylotype-based methods.
Abstract: In spite of technical advances that have provided increases in orders of magnitude in sequencing coverage, microbial ecologists still grapple with how to interpret the genetic diversity represented by the 16S rRNA gene. Two widely used approaches put sequences into bins based on either their similarity to reference sequences (i.e., phylotyping) or their similarity to other sequences in the community (i.e., operational taxonomic units [OTUs]). In the present study, we investigate three issues related to the interpretation and implementation of OTU-based methods. First, we confirm the conventional wisdom that it is impossible to create an accurate distance-based threshold for defining taxonomic levels and instead advocate for a consensus-based method of classifying OTUs. Second, using a taxonomic-independent approach, we show that the average neighbor clustering algorithm produces more robust OTUs than other hierarchical and heuristic clustering algorithms. Third, we demonstrate several steps to reduce the computational burden of forming OTUs without sacrificing the robustness of the OTU assignment. Finally, by blending these solutions, we propose a new heuristic that has a minimal effect on the robustness of OTUs and significantly reduces the necessary time and memory requirements. The ability to quickly and accurately assign sequences to OTUs and then obtain taxonomic information for those OTUs will greatly improve OTU-based analyses and overcome many of the challenges encountered with phylotype-based methods.

Journal ArticleDOI
01 Mar 2011
TL;DR: Issues in WSNs are outlined,PSO is introduced, and its suitability for WSN applications is discussed, and a brief survey of how PSO is tailored to address these issues is presented.
Abstract: Wireless-sensor networks (WSNs) are networks of autonomous nodes used for monitoring an environment. Developers of WSNs face challenges that arise from communication link failures, memory and computational constraints, and limited energy. Many issues in WSNs are formulated as multidimensional optimization problems, and approached through bioinspired techniques. Particle swarm optimization (PSO) is a simple, effective, and computationally efficient optimization algorithm. It has been applied to address WSN issues such as optimal deployment, node localization, clustering, and data aggregation. This paper outlines issues in WSNs, introduces PSO, and discusses its suitability for WSN applications. It also presents a brief survey of how PSO is tailored to address these issues.

Proceedings ArticleDOI
28 Mar 2011
TL;DR: In this paper, a layered label propagation (LBP) algorithm is proposed to reorder very large graphs (billions of nodes) using task decomposition to perform aggressively on multi-core architecture, making it possible to compress graphs of more than 600 millions nodes in a few hours.
Abstract: We continue the line of research on graph compression started with WebGraph, but we move our focus to the compression of social networks in a proper sense (e.g., LiveJournal): the approaches that have been used for a long time to compress web graphs rely on a specific ordering of the nodes (lexicographical URL ordering) whose extension to general social networks is not trivial. In this paper, we propose a solution that mixes clusterings and orders, and devise a new algorithm, called Layered Label Propagation, that builds on previous work on scalable clustering and can be used to reorder very large graphs (billions of nodes). Our implementation uses task decomposition to perform aggressively on multi-core architecture, making it possible to reorder graphs of more than 600 millions nodes in a few hours.Experiments performed on a wide array of web graphs and social networks show that combining the order produced by the proposed algorithm with the WebGraph compression framework provides a major increase in compression with respect to all currently known techniques, both on web graphs and on social networks. These improvements make it possible to analyse in main memory significantly larger graphs.

Journal ArticleDOI
TL;DR: This work investigates two representative ways of approximating the dense similarity matrix and picks the strategy of sparsifying the matrix via retaining nearest neighbors and investigates its parallelization, which can effectively handle large problems.
Abstract: Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nystrom method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863, we show that our parallel algorithm can effectively handle large problems.

Journal ArticleDOI
TL;DR: In this article, a density-based clustering is defined as the task of identifying groups or clusters in a data set, a cluster is a set of data objects spread in the data space over a contiguous region of high density of objects.
Abstract: Clustering refers to the task of identifying groups or clusters in a data set. In density-based clustering, a cluster is a set of data objects spread in the data space over a contiguous region of high density of objects. Density-based clusters are separated from each other by contiguous regions of low density of objects. Data objects located in low-density regions are typically considered noise or outliers. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 231–240 DOI: 10.1002/widm.30 This article is categorized under: Technologies > Structure Discovery and Clustering

Journal ArticleDOI
TL;DR: An overview of clustering ensemble methods that can be very useful for the community of clusters practitioners is presented and a taxonomy of these techniques is presented to illustrate some important applications.
Abstract: Cluster ensemble has proved to be a good alternative when facing cluster analysis problems. It consists of generating a set of clusterings from the same dataset and combining them into a final clustering. The goal of this combination process is to improve the quality of individual data clusterings. Due to the increasing appearance of new methods, their promising results and the great number of applications, we consider that it is necessary to make a critical analysis of the existing techniques and future projections. This paper presents an overview of clustering ensemble methods that can be very useful for the community of clustering practitioners. The characteristics of several methods are discussed, which may help in the selection of the most appropriate one to solve a problem at hand. We also present a taxonomy of these techniques and illustrate some important applications.

Journal ArticleDOI
TL;DR: A survey of current techniques of knowledge discovery in databases using data mining techniques that are in use in today’s medical research particularly in Heart Disease Prediction reveals that Decision Tree outperforms and some time Bayesian classification is having similar accuracy as of decision tree but other predictive methods are not performing well.
Abstract: The successful application of data mining in highly visible fields like e-business, marketing and retail has led to its application in other industries and sectors. Among these sectors just discovering is healthcare. The healthcare environment is still „information rich‟ but „knowledge poor‟. There is a wealth of data available within the healthcare systems. However, there is a lack of effective analysis tools to discover hidden relationships and trends in data. This research paper intends to provide a survey of current techniques of knowledge discovery in databases using data mining techniques that are in use in today‟s medical research particularly in Heart Disease Prediction. Number of experiment has been conducted to compare the performance of predictive data mining technique on the same dataset and the outcome reveals that Decision Tree outperforms and some time Bayesian classification is having similar accuracy as of decision tree but other predictive methods like KNN, Neural Networks, Classification based on clustering are not performing well. The second conclusion is that the accuracy of the Decision Tree and Bayesian Classification further improves after applying genetic algorithm to reduce the actual data size to get the optimal subset of attribute sufficient for heart disease prediction.

Journal ArticleDOI
TL;DR: A novel distance measure, called a weighted DTW (WDTW), which is a penalty-based DTW that penalizes points with higher phase difference between a reference point and a testing point in order to prevent minimum distance distortion caused by outliers is proposed.

Journal ArticleDOI
TL;DR: An overview of several robust methods and outlier detection tools for univariate, low‐dimensional, and high‐dimensional data such as estimation of location and scatter, linear regression, principal component analysis, and classification are presented.
Abstract: When analyzing data, outlying observations cause problems because they may strongly influence the result. Robust statistics aims at detecting the outliers by searching for the model fitted by the majority of the data. We present an overview of several robust methods and outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data such as estimation of location and scatter, linear regression, principal component analysis, and classification. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 73-79 DOI: 10.1002/widm.2 This article is categorized under: Algorithmic Development > Biological Data Mining Algorithmic Development > Spatial and Temporal Data Mining Application Areas > Health Care Technologies > Structure Discovery and Clustering

Proceedings ArticleDOI
20 Jun 2011
TL;DR: A double-header l1-optimization problem where the regularization involves both dictionary learning and structural structuring is formulated and a new denoising algorithm built upon clustering-based sparse representation (CSR) is proposed.
Abstract: Where does the sparsity in image signals come from? Local and nonlocal image models have supplied complementary views toward the regularity in natural images — the former attempts to construct or learn a dictionary of basis functions that promotes the sparsity; while the latter connects the sparsity with the self-similarity of the image source by clustering. In this paper, we present a variational framework for unifying the above two views and propose a new denoising algorithm built upon clustering-based sparse representation (CSR). Inspired by the success of l 1 -optimization, we have formulated a double-header l 1 -optimization problem where the regularization involves both dictionary learning and structural structuring. A surrogate-function based iterative shrinkage solution has been developed to solve the double-header l 1 -optimization problem and a probabilistic interpretation of CSR model is also included. Our experimental results have shown convincing improvements over state-of-the-art denoising technique BM3D on the class of regular texture images. The PSNR performance of CSR denoising is at least comparable and often superior to other competing schemes including BM3D on a collection of 12 generic natural images.

Journal ArticleDOI
TL;DR: The Cytoscape plugin clusterMaker as mentioned in this paper provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function.
Abstract: In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present clusterMaker, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. clusterMaker is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view), k-means, k-medoid, SCPS, AutoSOME, and native (Java) MCL. Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast Saccharomyces cerevisiae; and the cluster analysis of the vicinal oxygen chelate (VOC) enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. Cytoscape session files for all three scenarios are provided in the Additional Files section. The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function. Several of these visualizations and algorithms are only available to Cytoscape users through the clusterMaker plugin. clusterMaker is available via the Cytoscape plugin manager.

Journal ArticleDOI
TL;DR: This paper introduces an automated approach to activity tracking that identifies frequent activities that naturally occur in an individual's routine and can then track the occurrence of regular activities to monitor functional health and to detect changes in anindividual's patterns and lifestyle.
Abstract: The machine learning and pervasive sensing technologies found in smart homes offer unprecedented opportunities for providing health monitoring and assistance to individuals experiencing difficulties living independently at home. In order to monitor the functional health of smart home residents, we need to design technologies that recognize and track activities that people normally perform as part of their daily routines. Although approaches do exist for recognizing activities, the approaches are applied to activities that have been preselected and for which labeled training data are available. In contrast, we introduce an automated approach to activity tracking that identifies frequent activities that naturally occur in an individual's routine. With this capability, we can then track the occurrence of regular activities to monitor functional health and to detect changes in an individual's patterns and lifestyle. In this paper, we describe our activity mining and tracking approach, and validate our algorithms on data collected in physical smart environments.

Journal ArticleDOI
TL;DR: MOPED, a framework for Multiple Object Pose Estimation and Detection that seamlessly integrates single-image and multi-image object recognition and pose estimation in one optimized, robust, and scalable framework is presented.
Abstract: We present MOPED, a framework for Multiple Object Pose Estimation and Detection that seamlessly integrates single-image and multi-image object recognition and pose estimation in one optimized, robust, and scalable framework. We address two main challenges in computer vision for robotics: robust performance in complex scenes, and low latency for real-time operation. We achieve robust performance with Iterative Clustering Estimation (ICE), a novel algorithm that iteratively combines feature clustering with robust pose estimation. Feature clustering quickly partitions the scene and produces object hypotheses. The hypotheses are used to further refine the feature clusters, and the two steps iterate until convergence. ICE is easy to parallelize, and easily integrates single- and multi-camera object recognition and pose estimation. We also introduce a novel object hypothesis scoring function based on M-estimator theory, and a novel pose clustering algorithm that robustly handles recognition outliers. We achieve scalability and low latency with an improved feature matching algorithm for large databases, a GPU/CPU hybrid architecture that exploits parallelism at all levels, and an optimized resource scheduler. We provide extensive experimental results demonstrating state-of-the-art performance in terms of recognition, scalability, and latency in real-world robotic applications.