Proceedings ArticleDOI

Data weaving: scaling up the state-of-the-art in data clustering

Ron Bekkerman et al.
pp. 1083-1092
TL;DR
This paper proposes data weaving, a novel method for parallelizing sequential clustering algorithms, and uses it to parallelize multi-modal ITC, yielding the powerful DataLoom algorithm.
Abstract
The enormous volume and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable, but rarely both. This paper is concerned with information-theoretic clustering (ITC), which has historically been considered the state of the art in clustering multi-dimensional data. Most existing ITC methods are computationally expensive and do not scale easily. The few ITC methods that do scale well (e.g., via parallelization) are often outperformed by their inherently sequential counterparts. First, we justify this observation theoretically. We then propose data weaving, a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal: it allows simultaneous clustering of several types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which yields the powerful DataLoom algorithm. In our experiments on small datasets, DataLoom performs practically identically to its expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate its scalability, we simultaneously clustered the rows and columns of a contingency table with over 120 billion entries.
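The abstract leaves the ITC objective implicit. As a point of reference (a sketch of the standard two-modal formulation, not quoted from this paper), information-theoretic co-clustering seeks a row clustering \tilde{X} and a column clustering \tilde{Y} of a contingency table that preserve as much of the row-column mutual information as possible:

\max_{\tilde{X}, \tilde{Y}} \; I(\tilde{X}; \tilde{Y}) \;=\; \sum_{\tilde{x}, \tilde{y}} p(\tilde{x}, \tilde{y}) \log \frac{p(\tilde{x}, \tilde{y})}{p(\tilde{x})\, p(\tilde{y})}

One common multi-modal generalization, which appears to be the setting here, sums such pairwise terms over interacting pairs of modalities and optimizes all clusterings simultaneously.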

Citations
Journal ArticleDOI

Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD)

TL;DR: In this paper, two strategies, mini-batch and adaptive sampling, are proposed to effectively reduce the number of SGD updates (i.e., projections onto the PSD cone).
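Not taken from the cited paper: a minimal sketch of why reducing updates matters. In Mahalanobis distance metric learning (DML), each SGD update is followed by a projection of the metric matrix back onto the PSD cone, which costs a full eigendecomposition; batching updates means one projection per mini-batch rather than per example. All names and the hinge-loss setup below are illustrative assumptions.

import numpy as np

def project_psd(M):
    # Project a symmetric matrix onto the PSD cone by clipping
    # negative eigenvalues to zero (the expensive step being reduced).
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T

def minibatch_dml_sgd(triplets, d, lr=0.01, batch_size=32):
    # Hypothetical mini-batch SGD for a Mahalanobis matrix M on
    # (anchor, positive, negative) triplets, with hinge loss
    # max(0, 1 + d_M(a, p) - d_M(a, n)).
    M = np.eye(d)
    grads = []
    for a, p, n in triplets:
        dp, dn = a - p, a - n
        if 1 + dp @ M @ dp - dn @ M @ dn > 0:  # margin violated
            grads.append(np.outer(dp, dp) - np.outer(dn, dn))
        if len(grads) == batch_size:
            # one PSD projection per mini-batch instead of per update
            M = project_psd(M - lr * np.mean(grads, axis=0))
            grads = []
    if grads:
        M = project_psd(M - lr * np.mean(grads, axis=0))
    return M
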
Proceedings ArticleDOI

Sparse Latent Semantic Analysis

TL;DR: A new model called Sparse LSA is proposed, which produces a sparse projection matrix via ℓ1 regularization; it achieves performance similar to LSA, but is more efficient in projection computation and storage, and also explains the topic-word relationships well.
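The exact formulation is in the paper; as a hedged sketch of a Sparse LSA-style objective, the dense LSA projection is replaced by an \ell_1-penalized projection matrix A:

\min_{U, A} \; \frac{1}{2} \| X - U A \|_F^2 + \lambda \| A \|_1 \quad \text{s.t.} \quad U^\top U = I

where X is the document-term matrix; the sparsity of A ties each latent topic to a small set of words, which is what makes the topic-word relationships interpretable and the projection cheap to compute and store.
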
Posted Content

Efficient Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD)

TL;DR: This work develops hybrid approaches that combine the strength of adaptive sampling with that of mini-batch online learning techniques to further improve the computational efficiency of SGD for distance metric learning (DML), and proves theoretical guarantees for both the adaptive sampling and mini-batch based approaches.
Patent

Method and system for characterizing web content

TL;DR: In this article, the authors present a method of processing Web activity data that includes obtaining a database of website organizational data and generating from it a data structure comprising an item identifier and a website category corresponding to that identifier.
Proceedings ArticleDOI

Improving clustering stability with combinatorial MRFs

TL;DR: This work aims to improve clustering stability by diminishing the influence of algorithmic inconsistencies and enhancing the signal that comes from the data, proposing a mechanism that takes m clusterings as input and outputs m clusterings of comparable quality that are in higher agreement with each other.
References
Journal ArticleDOI

Two-mode multi-partitioning

TL;DR: By reanalyzing double k-means, which identifies a single partition for each mode of the data, a relevant extension is discussed that allows specifying multiple partitions of one mode, conditional on the partition of the other.
Proceedings ArticleDOI

Robust information-theoretic clustering

TL;DR: The proposed framework, Robust Information-theoretic Clustering (RIC), is orthogonal to any known clustering algorithm: given a preliminary clustering, RIC purifies the clusters of noise and adjusts the clustering so as to simultaneously determine the most natural number and shape of the clusters.
Book ChapterDOI

Parallel density-based clustering of complex objects

TL;DR: This paper shows how simple lower-bounding distance functions can be used to parallelize the density-based clustering algorithm DBSCAN, and shows that the result sets computed by the various slaves can be effectively and efficiently merged into a global result by means of cluster connectivity graphs.
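Independent of the paper's specific bounds, the filtering idea is simple enough to sketch (the names below are assumptions, not the paper's API): a cheap lower bound on an expensive object distance lets each slave's DBSCAN range query prune most candidates without ever computing the exact distance.

def range_query(q, points, eps, lower_bound, exact_dist):
    # Neighbors of q within radius eps, assuming
    # lower_bound(a, b) <= exact_dist(a, b) for all a, b.
    neighbors = []
    for p in points:
        if lower_bound(q, p) > eps:
            continue  # bound already exceeds eps; exact distance cannot qualify
        if exact_dist(q, p) <= eps:
            neighbors.append(p)
    return neighbors

The merge step then only needs to know which local clusters on different slaves share or reach common points, recorded as edges of a cluster connectivity graph.
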
Proceedings Article

Constrained Co-clustering of Gene Expression Data

TL;DR: An iterative co-clustering algorithm is presented that exploits user-defined constraints while minimizing the sum-squared residue, an objective function introduced for gene expression data clustering by Cho et al. (2004).
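For context, the sum-squared residue of Cho et al. (2004), as commonly stated (a sketch, not quoted from the cited paper): for entry a_{ij} with row cluster I and column cluster J,

h_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}

where a_{iJ} is the mean of row i over the columns in J, a_{Ij} is the mean of column j over the rows in I, and a_{IJ} is the co-cluster mean; the algorithm seeks row and column clusterings minimizing \sum_{i,j} h_{ij}^2.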