Proceedings ArticleDOI

Data weaving: scaling up the state-of-the-art in data clustering

Ron Bekkerman et al.
pp. 1083-1092
TL;DR
This paper proposes data weaving, a novel method for parallelizing sequential clustering algorithms, and uses it to parallelize multi-modal ITC, yielding the powerful DataLoom algorithm.
Abstract
The enormous volume and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable, but rarely both. This paper is concerned with information-theoretic clustering (ITC), which has historically been considered the state of the art in clustering multi-dimensional data. Most existing ITC methods are computationally expensive and do not scale easily. The few ITC methods that do scale well (e.g., via parallelization) are often outperformed by their inherently sequential counterparts. First, we justify this observation theoretically. We then propose data weaving, a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal: it allows simultaneous clustering of several types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which yields the powerful DataLoom algorithm. In our experiments on small datasets, DataLoom performs practically identically to its expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate its scalability, we simultaneously clustered the rows and columns of a contingency table with over 120 billion entries.
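The abstract leaves the ITC objective implicit. As a point of reference (a sketch of the standard two-modal formulation, not quoted from this paper), information-theoretic co-clustering seeks a row clustering \tilde{X} and a column clustering \tilde{Y} of a contingency table that preserve as much of the row-column mutual information as possible:

\max_{\tilde{X}, \tilde{Y}} \; I(\tilde{X}; \tilde{Y}) \;=\; \sum_{\tilde{x}, \tilde{y}} p(\tilde{x}, \tilde{y}) \log \frac{p(\tilde{x}, \tilde{y})}{p(\tilde{x})\, p(\tilde{y})}

One common multi-modal generalization, which appears to be the setting here, sums such pairwise terms over interacting pairs of modalities and optimizes all clusterings simultaneously.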

Citations
Journal ArticleDOI

Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD)

TL;DR: In this paper, two strategies, mini-batch and adaptive sampling, are proposed to effectively reduce the number of SGD updates (i.e., projections onto the PSD cone).
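Not taken from the cited paper: a minimal sketch of why reducing updates matters. In Mahalanobis distance metric learning (DML), each SGD update is followed by a projection of the metric matrix back onto the PSD cone, which costs a full eigendecomposition; batching updates means one projection per mini-batch rather than per example. All names and the hinge-loss setup below are illustrative assumptions.

import numpy as np

def project_psd(M):
    # Project a symmetric matrix onto the PSD cone by clipping
    # negative eigenvalues to zero (the expensive step being reduced).
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T

def minibatch_dml_sgd(triplets, d, lr=0.01, batch_size=32):
    # Hypothetical mini-batch SGD for a Mahalanobis matrix M on
    # (anchor, positive, negative) triplets, with hinge loss
    # max(0, 1 + d_M(a, p) - d_M(a, n)).
    M = np.eye(d)
    grads = []
    for a, p, n in triplets:
        dp, dn = a - p, a - n
        if 1 + dp @ M @ dp - dn @ M @ dn > 0:  # margin violated
            grads.append(np.outer(dp, dp) - np.outer(dn, dn))
        if len(grads) == batch_size:
            # one PSD projection per mini-batch instead of per update
            M = project_psd(M - lr * np.mean(grads, axis=0))
            grads = []
    if grads:
        M = project_psd(M - lr * np.mean(grads, axis=0))
    return M
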
Proceedings ArticleDOI

Sparse Latent Semantic Analysis

TL;DR: A new model called Sparse LSA is proposed, which produces a sparse projection matrix via ℓ1 regularization; it achieves performance similar to LSA, but is more efficient in projection computation and storage, and also explains the topic-word relationships well.
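The exact formulation is in the paper; as a hedged sketch of a Sparse LSA-style objective, the dense LSA projection is replaced by an \ell_1-penalized projection matrix A:

\min_{U, A} \; \frac{1}{2} \| X - U A \|_F^2 + \lambda \| A \|_1 \quad \text{s.t.} \quad U^\top U = I

where X is the document-term matrix; the sparsity of A ties each latent topic to a small set of words, which is what makes the topic-word relationships interpretable and the projection cheap to compute and store.
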
Posted Content

Efficient Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD)

TL;DR: This work develops hybrid approaches that combine the strength of adaptive sampling with that of mini-batch online learning techniques to further improve the computational efficiency of SGD for distance metric learning (DML), and proves theoretical guarantees for both the adaptive sampling and mini-batch based approaches.
Patent

Method and system for characterizing web content

TL;DR: In this article, the authors present a method of processing Web activity data that includes obtaining a database of website organizational data and generating from it a data structure comprising an item identifier and a website category corresponding to that identifier.
Proceedings ArticleDOI

Improving clustering stability with combinatorial MRFs

TL;DR: This work aims to improve clustering stability by diminishing the influence of algorithmic inconsistencies and enhancing the signal that comes from the data, proposing a mechanism that takes m clusterings as input and outputs m clusterings of comparable quality that are in higher agreement with each other.
References
Journal ArticleDOI

Two-mode multi-partitioning

TL;DR: By reanalyzing double k-means, which identifies a single partition for each mode of the data, a relevant extension is discussed that allows specifying multiple partitions of one mode, conditional on the partition of the other.
Proceedings ArticleDOI

Robust information-theoretic clustering

TL;DR: The proposed framework, Robust Information-theoretic Clustering (RIC), is orthogonal to any known clustering algorithm: given a preliminary clustering, RIC purifies the clusters of noise and adjusts the clustering so as to simultaneously determine the most natural number and shape of the clusters.
Book ChapterDOI

Parallel density-based clustering of complex objects

TL;DR: This paper shows how simple lower-bounding distance functions can be used to parallelize the density-based clustering algorithm DBSCAN, and shows that the result sets computed by the various slaves can be effectively and efficiently merged into a global result by means of cluster connectivity graphs.
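Independent of the paper's specific bounds, the filtering idea is simple enough to sketch (the names below are assumptions, not the paper's API): a cheap lower bound on an expensive object distance lets each slave's DBSCAN range query prune most candidates without ever computing the exact distance.

def range_query(q, points, eps, lower_bound, exact_dist):
    # Neighbors of q within radius eps, assuming
    # lower_bound(a, b) <= exact_dist(a, b) for all a, b.
    neighbors = []
    for p in points:
        if lower_bound(q, p) > eps:
            continue  # bound already exceeds eps; exact distance cannot qualify
        if exact_dist(q, p) <= eps:
            neighbors.append(p)
    return neighbors

The merge step then only needs to know which local clusters on different slaves share or reach common points, recorded as edges of a cluster connectivity graph.
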
Proceedings Article

Constrained Co-clustering of Gene Expression Data

TL;DR: An iterative co-clustering algorithm is presented that exploits user-defined constraints while minimizing the sum-squared residue, an objective function introduced for gene expression data clustering by Cho et al. (2004).
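For context, the sum-squared residue of Cho et al. (2004), as commonly stated (a sketch, not quoted from the cited paper): for entry a_{ij} with row cluster I and column cluster J,

h_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}

where a_{iJ} is the mean of row i over the columns in J, a_{Ij} is the mean of column j over the rows in I, and a_{IJ} is the co-cluster mean; the algorithm seeks row and column clusterings minimizing \sum_{i,j} h_{ij}^2.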