Showing papers on "Cluster analysis published in 2015"

PDF

Open Access

Proceedings Article•DOI•

FaceNet: A unified embedding for face recognition and clustering

[...]

Florian Schroff¹, Dmitry Kalenichenko¹, James Philbin¹•Institutions (1)

07 Jun 2015

TL;DR: A system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure offace similarity, and achieves state-of-the-art face recognition performance using only 128-bytes perface.

...read moreread less

Abstract: Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

...read moreread less

8,289 citations

Proceedings Article•DOI•

FaceNet: A Unified Embedding for Face Recognition and Clustering

[...]

Florian Schroff¹, Dmitry Kalenichenko¹, James Philbin¹•Institutions (1)

Google¹

12 Mar 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: FaceNet as discussed by the authors uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches, and achieves state-of-the-art face recognition performance using only 128 bytes per face.

...read moreread less

Abstract: Despite significant recent advances in the field of face recognition, implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors. Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128-bytes per face. On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result by 30% on both datasets. We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.

...read moreread less

4,560 citations

Journal Article•DOI•

A Practitioner’s Guide to Cluster-Robust Inference

[...]

A. Colin Cameron, Douglas L. Miller

31 Mar 2015-Journal of Human Resources

TL;DR: This work considers statistical inference for regression when data are grouped into clusters, with regression model errors independent across clusters but correlated within clusters, when the number of clusters is large and default standard errors can greatly overstate estimator precision.

...read moreread less

Abstract: We consider statistical inference for regression when data are grouped into clus- ters, with regression model errors independent across clusters but correlated within clusters. Examples include data on individuals with clustering on village or region or other category such as industry, and state-year dierences-in-dierences studies with clustering on state. In such settings default standard errors can greatly overstate es- timator precision. Instead, if the number of clusters is large, statistical inference after OLS should be based on cluster-robust standard errors. We outline the basic method as well as many complications that can arise in practice. These include cluster-specic �xed eects, few clusters, multi-way clustering, and estimators other than OLS.

...read moreread less

3,236 citations

Journal Article•DOI•

ClustVis: a web tool for visualizing clustering of multivariate data using Principal Component Analysis and heatmap

[...]

Tauno Metsalu¹, Jaak Vilo¹•Institutions (1)

University of Tartu¹

01 Jul 2015-Nucleic Acids Research

TL;DR: A web tool called ClustVis that aims to have an intuitive user interface for the Principal Component Analysis and heatmap plots and is freely available at http://biit.cs.ut.ee/clustvis/.

...read moreread less

Abstract: The Principal Component Analysis (PCA) is a widely used method of reducing the dimensionality of high-dimensional data, often followed by visualizing two of the components on the scatterplot. Although widely used, the method is lacking an easy-to-use web interface that scientists with little programming skills could use to make plots of their own data. The same applies to creating heatmaps: it is possible to add conditional formatting for Excel cells to show colored heatmaps, but for more advanced features such as clustering and experimental annotations, more sophisticated analysis tools have to be used. We present a web tool called ClustVis that aims to have an intuitive user interface. Users can upload data from a simple delimited text file that can be created in a spreadsheet program. It is possible to modify data processing methods and the final appearance of the PCA and heatmap plots by using drop-down menus, text boxes, sliders etc. Appropriate defaults are given to reduce the time needed by the user to specify input parameters. As an output, users can download PCA plot and heatmap in one of the preferred file formats. This web server is freely available at http://biit.cs.ut.ee/clustvis/.

...read moreread less

2,293 citations

Journal Article•DOI•

Clumpak: a program for identifying clustering modes and packaging population structure inferences across K

[...]

Naama M. Kopelman¹, Jonathan Mayzel¹, Mattias Jakobsson², Noah A. Rosenberg³, Itay Mayrose¹ - Show less +1 more•Institutions (3)

Tel Aviv University¹, Uppsala University², Stanford University³

01 Sep 2015-Molecular Ecology Resources

TL;DR: Clumpak, available at http://clumpak.tau.ac.il, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology by automating the postprocessing of results of model‐based population structure analyses.

...read moreread less

Abstract: The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population-genetic data analysis. Application of model-based clustering programs often entails a number of steps, in which the user considers different modelling assumptions, compares results across different predetermined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present CLUMPAK (Cluster Markov Packager Across K), a method that automates the postprocessing of results of model-based population structure analyses. For analysing multiple independent runs at a single K value, CLUMPAK identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software CLUMPP. Next, CLUMPAK identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in CLUMPP and simplifying the comparison of clustering results across different K values. CLUMPAK incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. CLUMPAK, available at http://clumpak.tau.ac.il, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology.

...read moreread less

2,252 citations

Proceedings Article•DOI•

GraRep: Learning Graph Representations with Global Structural Information

[...]

Cao Shaosheng¹, Wei Lu², Qiongkai Xu³•Institutions (3)

Xidian University¹, Singapore University of Technology and Design², IBM³

17 Oct 2015

TL;DR: A novel model for learning vertex representations of weighted graphs that integrates global structural information of the graph into the learning process and significantly outperforms other state-of-the-art methods in such tasks.

...read moreread less

Abstract: In this paper, we present {GraRep}, a novel model for learning vertex representations of weighted graphs. This model learns low dimensional vectors to represent vertices appearing in a graph and, unlike existing work, integrates global structural information of the graph into the learning process. We also formally analyze the connections between our work and several previous research efforts, including the DeepWalk model of Perozzi et al. as well as the skip-gram model with negative sampling of Mikolov et al. We conduct experiments on a language network, a social network as well as a citation network and show that our learned global representations can be effectively used as features in tasks such as clustering, classification and visualization. Empirical results demonstrate that our representation significantly outperforms other state-of-the-art methods in such tasks.

...read moreread less

1,565 citations

Journal Article•DOI•

Time-series clustering - A decade review

[...]

Saeed Aghabozorgi¹, Ali Seyed Shirkhorshidi¹, Teh Ying Wah¹•Institutions (1)

Information Technology University¹

01 Oct 2015-Information Systems

TL;DR: This review will expose four main components of time-series clustering and is aimed to represent an updated investigation on the trend of improvements in efficiency, quality and complexity of clustering time- series approaches during the last decade and enlighten new paths for future works.

...read moreread less

1,235 citations

Journal Article•DOI•

A Comprehensive Survey of Clustering Algorithms

[...]

Dongkuan Xu¹, Yingjie Tian¹•Institutions (1)

Chinese Academy of Sciences¹

12 Aug 2015-Annals of Data Science

TL;DR: This review paper begins at the definition of clustering, takes the basic elements involved in the clustering process, such as the distance or similarity measurement and evaluation indicators, into consideration, and analyzes the clustered algorithms from two perspectives, the traditional ones and the modern ones.

...read moreread less

Abstract: Data analysis is used as a common method in modern science research, which is across communication science, computer science and biology science. Clustering, as the basic composition of data analysis, plays a significant role. On one hand, many tools for cluster analysis have been created, along with the information increase and subject intersection. On the other hand, each clustering algorithm has its own strengths and weaknesses, due to the complexity of information. In this review paper, we begin at the definition of clustering, take the basic elements involved in the clustering process, such as the distance or similarity measurement and evaluation indicators, into consideration, and analyze the clustering algorithms from two perspectives, the traditional ones and the modern ones. All the discussed clustering algorithms will be compared in detail and comprehensively shown in Appendix Table 22.

...read moreread less

1,234 citations

Journal Article•DOI•

K-Profiles: A Nonlinear Clustering Method for Pattern Detection in High Dimensional Data.

[...]

Kai Wang¹, Qing Zhao², Jianwei Lu², Tianwei Yu¹•Institutions (2)

Emory University¹, Tongji University²

03 Aug 2015-BioMed Research International

TL;DR: The nonlinear K-profiles clustering method is designed, which can be seen as the nonlinear counterpart of the K-means clustering algorithm, and has a built-in statistical testing procedure that ensures genes not belonging to any cluster do not impact the estimation of cluster profiles.

...read moreread less

Abstract: With modern technologies such as microarray, deep sequencing, and liquid chromatography-mass spectrometry (LC-MS), it is possible to measure the expression levels of thousands of genes/proteins simultaneously to unravel important biological processes. A very first step towards elucidating hidden patterns and understanding the massive data is the application of clustering techniques. Nonlinear relations, which were mostly unutilized in contrast to linear correlations, are prevalent in high-throughput data. In many cases, nonlinear relations can model the biological relationship more precisely and reflect critical patterns in the biological systems. Using the general dependency measure, Distance Based on Conditional Ordered List (DCOL) that we introduced before, we designed the nonlinear K-profiles clustering method, which can be seen as the nonlinear counterpart of the K-means clustering algorithm. The method has a built-in statistical testing procedure that ensures genes not belonging to any cluster do not impact the estimation of cluster profiles. Results from extensive simulation studies showed that K-profiles clustering not only outperformed traditional linear K-means algorithm, but also presented significantly better performance over our previous General Dependency Hierarchical Clustering (GDHC) algorithm. We further analyzed a gene expression dataset, on which K-profile clustering generated biologically meaningful results.

...read moreread less

1,005 citations

Journal Article•DOI•

PyEMMA 2: A Software Package for Estimation, Validation, and Analysis of Markov Models.

[...]

Martin K. Scherer¹, Benjamin Trendelkamp-Schroer¹, Fabian Paul¹, Guillermo Pérez-Hernández¹, Moritz Hoffmann¹, Nuria Plattner¹, Christoph Wehmeyer¹, Jan-Hendrik Prinz¹, Frank Noé¹ - Show less +5 more•Institutions (1)

Free University of Berlin¹

14 Oct 2015-Journal of Chemical Theory and Computation

TL;DR: The open-source Python package PyEMMA is presented, derived a systematic and accurate way to coarse-grain MSMs to few states and to illustrate the structures of the metastable states of the system.

...read moreread less

Abstract: Markov (state) models (MSMs) and related models of molecular kinetics have recently received a surge of interest as they can systematically reconcile simulation data from either a few long or many short simulations and allow us to analyze the essential metastable structures, thermodynamics, and kinetics of the molecular system under investigation. However, the estimation, validation, and analysis of such models is far from trivial and involves sophisticated and often numerically sensitive methods. In this work we present the open-source Python package PyEMMA (http://pyemma.org) that provides accurate and efficient algorithms for kinetic model construction. PyEMMA can read all common molecular dynamics data formats, helps in the selection of input features, provides easy access to dimension reduction algorithms such as principal component analysis (PCA) and time-lagged independent component analysis (TICA) and clustering algorithms such as k-means, and contains estimators for MSMs, hidden Markov models, an...

...read moreread less

809 citations

Book•

Data Mining: The Textbook

[...]

Charu C. Aggarwal

27 Apr 2015

TL;DR: This textbook explores the different aspects of data mining from the fundamentals to the complex data types and their applications, capturing the wide diversity of problem domains for data mining issues.

...read moreread less

Abstract: This textbook explores the different aspects of data mining from the fundamentals to the complex data types and their applications, capturing the wide diversity of problem domains for data mining issues. It goes beyond the traditional focus on data mining problems to introduce advanced data types such as text, time series, discrete sequences, spatial data, graph data, and social networks. Until now, no single book has addressed all these topics in a comprehensive and integrated way. The chapters of this book fall into one of three categories: Fundamental chapters: Data mining has four main problems, which correspond to clustering, classification, association pattern mining, and outlier analysis. These chapters comprehensively discuss a wide variety of methods for these problems. Domain chapters: These chapters discuss the specific methods used for different domains of data such as text data, time-series data, sequence data, graph data, and spatial data. Application chapters: These chapters study important applications such as stream mining, Web mining, ranking, recommendations, social networks, and privacy preservation. The domain chapters also have an applied flavor. Appropriate for both introductory and advanced data mining courses, Data Mining: The Textbook balances mathematical details and intuition. It contains the necessary mathematical details for professors and researchers, but it is presented in a simple and intuitive style to improve accessibility for students and industrial practitioners (including those with a limited mathematical background). Numerous illustrations, examples, and exercises are included, with an emphasis on semantically interpretable examples. Praise for Data Mining: The Textbook - As I read through this book, I have already decided to use it in my classes. This is a book written by an outstanding researcher who has made fundamental contributions to data mining, in a way that is both accessible and up to date. The book is complete with theory and practical use cases. Its a must-have for students and professors alike!" -- Qiang Yang, Chair of Computer Science and Engineering at Hong Kong University of Science and Technology"This is the most amazing and comprehensive text book on data mining. It covers not only the fundamental problems, such as clustering, classification, outliers and frequent patterns, and different data types, including text, time series, sequences, spatial data and graphs, but also various applications, such as recommenders, Web, social network and privacy. It is a great book for graduate students and researchers as well as practitioners." -- Philip S. Yu, UIC Distinguished Professor and Wexler Chair in Information Technology at University of Illinois at Chicago

...read moreread less

Journal Article•DOI•

Image Segmentation Using K -means Clustering Algorithm and Subtractive Clustering Algorithm

[...]

Nameirakpam Dhanachandra¹, Khumanthem Manglem¹, Yambem Jina Chanu¹•Institutions (1)

National Institute of Technology, Manipur¹

01 Jan 2015-Procedia Computer Science

TL;DR: This paper presents k-means clustering algorithm, an unsupervised algorithm used to segment the interest area from the background, and subtractive cluster, a data clustering method, which generates the centroid based on the potential value of the data points.

...read moreread less

Proceedings Article•DOI•

A review of feature selection methods with applications

[...]

Alan Jovic¹, Karla Brkić¹, Nikola Bogunović¹•Institutions (1)

University of Zagreb¹

25 May 2015

TL;DR: This review considers most of the commonly used FS techniques, including standard filter, wrapper, and embedded methods, and provides insight into FS for recent hybrid approaches and other advanced topics.

...read moreread less

Abstract: Feature selection (FS) methods can be used in data pre-processing to achieve efficient data reduction. This is useful for finding accurate data models. Since exhaustive search for optimal feature subset is infeasible in most cases, many search strategies have been proposed in literature. The usual applications of FS are in classification, clustering, and regression tasks. This review considers most of the commonly used FS techniques. Particular emphasis is on the application aspects. In addition to standard filter, wrapper, and embedded methods, we also provide insight into FS for recent hybrid approaches and other advanced topics.

...read moreread less

Posted Content•

Deep clustering: Discriminative embeddings for segmentation and separation

[...]

John R. Hershey¹, Zhuo Chen², Jonathan Le Roux¹, Shinji Watanabe¹•Institutions (2)

Mitsubishi Electric Research Laboratories¹, Columbia University²

18 Aug 2015-arXiv: Neural and Evolutionary Computing

TL;DR: Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6dB, and the same model does surprisingly well with three-speakers mixtures.

...read moreread less

Abstract: We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function that to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.

...read moreread less

Journal Article•DOI•

Diffusion maps for high-dimensional single-cell analysis of differentiation data.

[...]

Laleh Haghverdi¹, Florian Buettner¹, Fabian J. Theis¹•Institutions (1)

Technische Universität München¹

15 Sep 2015-Bioinformatics

TL;DR: In this paper, the authors proposed the use of diffusion maps to deal with the problem of defining differentiation trajectories, which enables the establishment of a pseudotemporal ordering of single cells in a high-dimensional gene expression space.

...read moreread less

Abstract: Motivation: Single-cell technologies have recently gained popularity in cellular differentiation studies regarding their ability to resolve potential heterogeneities in cell populations. Analyzing such high-dimensional single-cell data has its own statistical and computational challenges. Popular multivariate approaches are based on data normalization, followed by dimension reduction and clustering to identify subgroups. However, in the case of cellular differentiation, we would not expect clear clusters to be present but instead expect the cells to follow continuous branching lineages. Results: Here, we propose the use of diffusion maps to deal with the problem of defining differentiation trajectories. We adapt this method to single-cell data by adequate choice of kernel width and inclusion of uncertainties or missing measurement values, which enables the establishment of a pseudotemporal ordering of single cells in a high-dimensional gene expression space. We expect this output to reflect cell differentiation trajectories, where the data originates from intrinsic diffusion-like dynamics. Starting from a pluripotent stage, cells move smoothly within the transcriptional landscape towards more differentiated states with some stochasticity along their path. We demonstrate the robustness of our method with respect to extrinsic noise (e.g. measurement noise) and sampling density heterogeneities on simulated toy data as well as two single-cell quantitative polymerase chain reaction datasets (i.e. mouse haematopoietic stem cells and mouse embryonic stem cells) and an RNA-Seq data of human pre-implantation embryos. We show that diffusion maps perform considerably better than Principal Component Analysis and are advantageous over other techniques for non-linear dimension reduction such as t-distributed Stochastic Neighbour Embedding for preserving the global structures and pseudotemporal ordering of cells. Availability and implementation: The Matlab implementation of diffusion maps for single-cell data is available at https://www.helmholtz-muenchen.de/icb/single-cell-diffusion-map. Contact: fbuettner.phys@gmail.com, fabian.theis@helmholtz-muenchen.de Supplementary information: Supplementary data are available at Bioinformatics online.

...read moreread less

Proceedings Article•DOI•

Diversity-induced Multi-view Subspace Clustering

[...]

Xiaochun Cao¹, Changqing Zhang¹, Huazhu Fu², Si Liu³, Hua Zhang¹ - Show less +1 more•Institutions (3)

Tianjin University¹, Nanyang Technological University², Chinese Academy of Sciences³

07 Jun 2015

TL;DR: A multi-view clustering framework, called Diversity-induced Multi-view Subspace Clustering (DiMSC), is proposed for this task, which extends the existing subspace clustering into the multi- view domain, and utilizes the Hilbert Schmidt Independence Criterion (HSIC) as a diversity term to explore the complementarity of multi-View representations.

...read moreread less

Abstract: In this paper, we focus on how to boost the multi-view clustering by exploring the complementary information among multi-view features. A multi-view clustering framework, called Diversity-induced Multi-view Subspace Clustering (DiMSC), is proposed for this task. In our method, we extend the existing subspace clustering into the multi-view domain, and utilize the Hilbert Schmidt Independence Criterion (HSIC) as a diversity term to explore the complementarity of multi-view representations, which could be solved efficiently by using the alternating minimizing optimization. Compared to other multi-view clustering methods, the enhanced complementarity reduces the redundancy between the multi-view representations, and improves the accuracy of the clustering results. Experiments on both image and video face clustering well demonstrate that the proposed method outperforms the state-of-the-art methods.

...read moreread less

Journal Article•DOI•

Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

[...]

Ricardo J. G. B. Campello¹, Davoud Moulavi², Arthur Zimek³, Jörg Sander²•Institutions (3)

University of São Paulo¹, University of Alberta², Ludwig Maximilian University of Munich³

22 Jul 2015-ACM Transactions on Knowledge Discovery From Data

TL;DR: An integrated framework for density-based cluster analysis, outlier detection, and data visualization is introduced, consisting of an algorithm to compute hierarchical estimates of the level sets of a density, following Hartigan’s classic model of density-contour clusters and trees.

...read moreread less

Abstract: An integrated framework for density-based cluster analysis, outlier detection, and data visualization is introduced in this article. The main module consists of an algorithm to compute hierarchical estimates of the level sets of a density, following Hartigan’s classic model of density-contour clusters and trees. Such an algorithm generalizes and improves existing density-based clustering techniques with respect to different aspects. It provides as a result a complete clustering hierarchy composed of all possible density-based clusters following the nonparametric model adopted, for an infinite range of density thresholds. The resulting hierarchy can be easily processed so as to provide multiple ways for data visualization and exploration. It can also be further postprocessed so that: (i) a normalized score of “outlierness” can be assigned to each data object, which unifies both the global and local perspectives of outliers into a single definition; and (ii) a “flat” (i.e., nonhierarchical) clustering solution composed of clusters extracted from local cuts through the cluster tree (possibly corresponding to different density thresholds) can be obtained, either in an unsupervised or in a semisupervised way. In the unsupervised scenario, the algorithm corresponding to this postprocessing module provides a global, optimal solution to the formal problem of maximizing the overall stability of the extracted clusters. If partially labeled objects or instance-level constraints are provided by the user, the algorithm can solve the problem by considering both constraints violations/satisfactions and cluster stability criteria. An asymptotic complexity analysis, both in terms of running time and memory space, is described. Experiments are reported that involve a variety of synthetic and real datasets, including comparisons with state-of-the-art, density-based clustering and (global and local) outlier detection methods.

...read moreread less

Journal Article•DOI•

Identification of cell types from single-cell transcriptomes using a novel clustering method

[...]

Chen Xu¹, Zhengchang Su¹•Institutions (1)

University of North Carolina at Charlotte¹

15 Jun 2015-Bioinformatics

TL;DR: A novel algorithm named SNN-Cliq is described that clusters single-cell transcriptomes using the concept of shared nearest neighbor that shows advantages in handling high-dimensional data.

...read moreread less

Abstract: Motivation The recent advance of single-cell technologies has brought new insights into complex biological phenomena. In particular, genome-wide single-cell measurements such as transcriptome sequencing enable the characterization of cellular composition as well as functional variation in homogenic cell populations. An important step in the single-cell transcriptome analysis is to group cells that belong to the same cell types based on gene expression patterns. The corresponding computational problem is to cluster a noisy high dimensional dataset with substantially fewer objects (cells) than the number of variables (genes). Results In this article, we describe a novel algorithm named shared nearest neighbor (SNN)-Cliq that clusters single-cell transcriptomes. SNN-Cliq utilizes the concept of shared nearest neighbor that shows advantages in handling high-dimensional data. When evaluated on a variety of synthetic and real experimental datasets, SNN-Cliq outperformed the state-of-the-art methods tested. More importantly, the clustering results of SNN-Cliq reflect the cell types or origins with high accuracy. Availability and implementation The algorithm is implemented in MATLAB and Python. The source code can be downloaded at http://bioinfo.uncc.edu/SNNCliq.

...read moreread less

Journal Article•DOI•

Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features.

[...]

Azadeh Nikfarjam¹, Abeed Sarker¹, Karen O'Connor¹, Rachel Ginn¹, Graciela Gonzalez¹ - Show less +1 more•Institutions (1)

Arizona State University¹

01 May 2015-Journal of the American Medical Informatics Association

TL;DR: A machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.

...read moreread less

Journal Article•DOI•

Data mining for the Internet of Things: literature review and challenges

[...]

Feng Chen¹, Pan Deng¹, Jiafu Wan², Daqiang Zhang³, Athanasios V. Vasilakos⁴, Xiaohui Rong - Show less +2 more•Institutions (4)

Chinese Academy of Sciences¹, South China University of Technology², Tongji University³, Luleå University of Technology⁴

01 Jan 2015-International Journal of Distributed Sensor Networks

TL;DR: A systematic way to review data mining in knowledge view, technique view, and application view, including classification, clustering, association analysis, time series analysis and outlier analysis is given.

...read moreread less

Abstract: The massive data generated by the Internet of Things (IoT) are considered of high business value, and data mining algorithms can be applied to IoT to extract hidden information from data. In this paper, we give a systematic way to review data mining in knowledge view, technique view, and application view, including classification, clustering, association analysis, time series analysis and outlier analysis. And the latest application cases are also surveyed. As more and more devices connected to IoT, large volume of data should be analyzed, the latest algorithms should be modified to apply to big data. We reviewed these algorithms and discussed challenges and open research issues. At last a suggested big data mining system is proposed.

...read moreread less

Neural Codes for Image Retrieval

[...]

David Stutz

01 Jan 2015

TL;DR: A thorough discussion of several state-of-the-art techniques in image retrieval by considering the associated subproblems: image description, descriptor compression, nearest-neighbor search and query expansion, and the combined use of deep architectures and hand-crafted image representations for accurate and efficient image retrieval.

...read moreread less

Abstract: This seminar report focuses on using convolutional neural networks for image retrieval. Firstly, we give a thorough discussion of several state-of-the-art techniques in image retrieval by considering the associated subproblems: image description, descriptor compression, nearest-neighbor search and query expansion. We discuss both the aggregation of local descriptors using clustering and metric learning techniques as well as global descriptors. Subsequently, we briefly introduce the basic concepts of deep convolutional neural networks, focusing on the architecture proposed by Krizhevsky et al. [KSH12]. We discuss different types of layers commonly used in recent architectures, for example convolutional layers, non-linearity and rectification layers, pooling layers as well as local contrast normalization layers. Finally, we shortly review supervised training techniques based on stochastic gradient descent and regularization techniques such as dropout and weight decay. Finally, following Babenko et al. [BSCL14], we discuss the use of feature activations in intermediate layers as image representation for image retrieval. After presenting experiments and comparing convolutional neural networks for image retrieval with other state-of-the-art techniques, we conclude by motivating the combined use of deep architectures and hand-crafted image representations for accurate and efficient image retrieval.

...read moreread less

Journal Article•DOI•

Swarm v2: highly-scalable and high-resolution amplicon clustering

[...]

Frédéric Mahé¹, Torbjørn Rognes², Torbjørn Rognes³, Christopher Quince⁴, Colomban de Vargas⁵, Colomban de Vargas⁶, Micah Dunthorn¹ - Show less +3 more•Institutions (6)

Kaiserslautern University of Technology¹, University of Oslo², Oslo University Hospital³, University of Warwick⁴, University of Paris⁵, Centre national de la recherche scientifique⁶

10 Dec 2015-PeerJ

TL;DR: Swarm v2 has two important novel features: a new algorithm for d = 1 that allows the computation time of the program to scale linearly with increasing amounts of data; and the new fastidious option that reduces under-grouping by grafting low abundant OTUs onto larger ones.

...read moreread less

Abstract: Previously we presented Swarm v1, a novel and open source amplicon clustering program that produced fine-scale molecular operational taxonomic units (OTUs), free of arbitrary global clustering thresholds and input-order dependency. Swarm v1 worked with an initial phase that used iterative single-linkage with a local clustering threshold (d), followed by a phase that used the internal abundance structures of clusters to break chained OTUs. Here we present Swarm v2, which has two important novel features: (1) a new algorithm for d = 1 that allows the computation time of the program to scale linearly with increasing amounts of data; and (2) the new fastidious option that reduces under-grouping by grafting low abundant OTUs (e.g., singletons and doubletons) onto larger ones. Swarm v2 also directly integrates the clustering and breaking phases, dereplicates sequencing reads with d = 0, outputs OTU representatives in fasta format, and plots individual OTUs as two-dimensional networks.

...read moreread less

Proceedings Article•DOI•

k-Shape: Efficient and Accurate Clustering of Time Series

[...]

John Paparrizos¹, Luis Gravano¹•Institutions (1)

Columbia University¹

27 May 2015

TL;DR: K-Shape as discussed by the authors uses a normalized version of the cross-correlation measure in order to consider the shapes of time series while comparing them, and develops a method to compute cluster centroids, which are used in every iteration to update the assignment of the time series to clusters.

...read moreread less

Abstract: The proliferation and ubiquity of temporal data across many disciplines has generated substantial interest in the analysis and mining of time series. Clustering is one of the most popular data mining methods, not only due to its exploratory power, but also as a preprocessing step or subroutine for other techniques. In this paper, we present k-Shape, a novel algorithm for time-series clustering. k-Shape relies on a scalable iterative refinement procedure, which creates homogeneous and well-separated clusters. As its distance measure, k-Shape uses a normalized version of the cross-correlation measure in order to consider the shapes of time series while comparing them. Based on the properties of that distance measure, we develop a method to compute cluster centroids, which are used in every iteration to update the assignment of time series to clusters. To demonstrate the robustness of k-Shape, we perform an extensive experimental evaluation of our approach against partitional, hierarchical, and spectral clustering methods, with combinations of the most competitive distance measures. k-Shape outperforms all scalable approaches in terms of accuracy. Furthermore, k-Shape also outperforms all non-scalable (and hence impractical) combinations, with one exception that achieves similar accuracy results. However, unlike k-Shape, this combination requires tuning of its distance measure and is two orders of magnitude slower than k-Shape. Overall, k-Shape emerges as a domain-independent, highly accurate, and highly efficient clustering approach for time series with broad applications.

...read moreread less

Proceedings Article•DOI•

Multi-view Subspace Clustering

[...]

Hongchang Gao¹, Feiping Nie¹, Xuelong Li, Heng Huang¹•Institutions (1)

University of Texas at Arlington¹

07 Dec 2015

TL;DR: A novel multi-view subspace clustering method that performs clustering on the subspace representation of each view simultaneously and proposes to use a common cluster structure to guarantee the consistence among different views.

...read moreread less

Abstract: For many computer vision applications, the data sets distribute on certain low-dimensional subspaces. Subspace clustering is to find such underlying subspaces and cluster the data points correctly. In this paper, we propose a novel multi-view subspace clustering method. The proposed method performs clustering on the subspace representation of each view simultaneously. Meanwhile, we propose to use a common cluster structure to guarantee the consistence among different views. In addition, an efficient algorithm is proposed to solve the problem. Experiments on four benchmark data sets have been performed to validate our proposed method. The promising results demonstrate the effectiveness of our method.

...read moreread less

Proceedings Article•

Large-scale multi-view spectral clustering via bipartite graph

[...]

Yeqing Li¹, Feiping Nie¹, Heng Huang¹, Junzhou Huang¹•Institutions (1)

University of Texas at Arlington¹

25 Jan 2015

TL;DR: A novel large-scale multi-view spectral clustering approach based on the bipartite graph that uses local manifold fusion to integrate heterogeneous features and can be easily extended to handle the out-of-sample problem.

...read moreread less

Abstract: In this paper, we address the problem of large-scale multi-view spectral clustering. In many real-world applications, data can be represented in various heterogeneous features or views. Different views often provide different aspects of information that are complementary to each other. Several previous methods of clustering have demonstrated that better accuracy can be achieved using integrated information of all the views than just using each view individually. One important class of such methods is multi-view spectral clustering, which is based on graph Laplacian. However, existing methods are not applicable to large-scale problem for their high computational complexity. To this end, we propose a novel large-scale multi-view spectral clustering approach based on the bipartite graph. Our method uses local manifold fusion to integrate heterogeneous features. To improve efficiency, we approximate the similarity graphs using bipartite graphs. Furthermore, we show that our method can be easily extended to handle the out-of-sample problem. Extensive experimental results on five benchmark datasets demonstrate the effectiveness and efficiency of the proposed method, where our method runs up to nearly 3000 times faster than the state-of-the-art methods.

...read moreread less

Journal Article•DOI•

A clustering approach to domestic electricity load profile characterisation using smart metering data

[...]

Fintan McLoughlin¹, Aidan Duffy¹, Michael Conlon¹•Institutions (1)

Dublin Institute of Technology¹

01 Mar 2015-Applied Energy

TL;DR: The availability of increasing amounts of data to electricity utilities through the implementation of domestic smart metering campaigns has meant that traditional ways of analysing meter reading information such as descriptive statistics has become increasingly difficult.

...read moreread less

Proceedings Article•DOI•

Low-Rank Tensor Constrained Multiview Subspace Clustering

[...]

Changqing Zhang¹, Huazhu Fu², Si Liu³, Guangcan Liu⁴, Xiaochun Cao⁵ - Show less +1 more•Institutions (5)

Tianjin University¹, Nanyang Technological University², National University of Singapore³, Nanjing University of Information Science and Technology⁴, Chinese Academy of Sciences⁵

07 Dec 2015

TL;DR: A low-rank tensor constraint is introduced to explore the complementary information from multiple views and, accordingly, a novel method called Low-rank Tensor constrained Multiview Subspace Clustering (LT-MSC) is established.

...read moreread less

Abstract: In this paper, we explore the problem of multiview subspace clustering. We introduce a low-rank tensor constraint to explore the complementary information from multiple views and, accordingly, establish a novel method called Low-rank Tensor constrained Multiview Subspace Clustering (LT-MSC). Our method regards the subspace representation matrices of different views as a tensor, which captures dexterously the high order correlations underlying multiview data. Then the tensor is equipped with a low-rank constraint, which models elegantly the cross information among different views, reduces effectually the redundancy of the learned subspace representations, and improves the accuracy of clustering as well. The inference process of the affinity matrix for clustering is formulated as a tensor nuclear norm minimization problem, constrained with an additional L2,1-norm regularizer and some linear equalities. The minimization problem is convex and thus can be solved efficiently by an Augmented Lagrangian Alternating Direction Minimization (AL-ADM) method. Extensive experimental results on four benchmark datasets show the effectiveness of our proposed LT-MSC method.

...read moreread less

Proceedings Article•DOI•

Superpixel segmentation using Linear Spectral Clustering

[...]

Zhengqin Li¹, Jiansheng Chen¹•Institutions (1)

Tsinghua University¹

07 Jun 2015

TL;DR: A superpixel segmentation algorithm called Linear Spectral Clustering (LSC), which produces compact and uniform superpixels with low computational costs and is able to preserve global properties of images.

...read moreread less

Abstract: We present in this paper a superpixel segmentation algorithm called Linear Spectral Clustering (LSC), which produces compact and uniform superpixels with low computational costs. Basically, a normalized cuts formulation of the superpixel segmentation is adopted based on a similarity metric that measures the color similarity and space proximity between image pixels. However, instead of using the traditional eigen-based algorithm, we approximate the similarity metric using a kernel function leading to an explicitly mapping of pixel values and coordinates into a high dimensional feature space. We revisit the conclusion that by appropriately weighting each point in this feature space, the objective functions of weighted K-means and normalized cuts share the same optimum point. As such, it is possible to optimize the cost function of normalized cuts by iteratively applying simple K-means clustering in the proposed feature space. LSC is of linear computational complexity and high memory efficiency and is able to preserve global properties of images. Experimental results show that LSC performs equally well or better than state of the art superpixel segmentation algorithms in terms of several commonly used evaluation metrics in image segmentation.

...read moreread less

Journal Article•DOI•

Lift : Multi-Label Learning with Label-Specific Features

[...]

Min-Ling Zhang¹, Lei Wu¹•Institutions (1)

Southeast University¹

01 Jan 2015-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Lift is proposed, an intuitive yet effective algorithm that constructs features specific to each label by conducting clustering analysis on its positive and negative instances, and then performs training and testing by querying the clustering results.

...read moreread less

Abstract: Multi-label learning deals with the problem where each example is represented by a single instance (feature vector) while associated with a set of class labels. Existing approaches learn from multi-label data by manipulating with identical feature set, i.e. the very instance representation of each example is employed in the discrimination processes of all class labels. However, this popular strategy might be suboptimal as each label is supposed to possess specific characteristics of its own. In this paper, another strategy to learn from multi-label data is studied, where label-specific features are exploited to benefit the discrimination of different class labels. Accordingly, an intuitive yet effective algorithm named Lift , i.e. multi-label learning with Label specIfic FeaTures , is proposed. Lift firstly constructs features specific to each label by conducting clustering analysis on its positive and negative instances, and then performs training and testing by querying the clustering results. Comprehensive experiments on a total of 17 benchmark data sets clearly validate the superiority of Lift against other well-established multi-label learning algorithms as well as the effectiveness oflabel-specific features.

...read moreread less

Journal Article•DOI•

TSclust: An R Package for Time Series Clustering

[...]

Pablo Montero, José A. Vilar

01 Jan 2015-Journal of Statistical Software

TL;DR: The R package TSclust is aimed to implement a large set of well-established peer-reviewed time series dissimilarity measures, including measures based on raw data, extracted features, underlying parametric models, complexity levels, and forecast behaviors.

...read moreread less

Abstract: Time series clustering is an active research area with applications in a wide range of fields. One key component in cluster analysis is determining a proper dissimilarity measure between two data objects, and many criteria have been proposed in the literature to assess dissimilarity between two time series. The R package TSclust is aimed to implement a large set of well-established peer-reviewed time series dissimilarity measures, including measures based on raw data, extracted features, underlying parametric models, complexity levels, and forecast behaviors. Computation of these measures allows the user to perform clustering by using conventional clustering algorithms. TSclust also includes a clustering procedure based on p values from checking the equality of generating models, and some utilities to evaluate cluster solutions. The implemented dissimilarity functions are accessible individually for an easier extension and possible use out of the clustering context. The main features of TSclust are described and examples of its use are presented.

...read moreread less

Collapse