Proceedings ArticleDOI
Data weaving: scaling up the state-of-the-art in data clustering
Ron Bekkerman, Martin B. Scholz +1 more
pp. 1083-1092
TLDR
This paper proposes data weaving - a novel method for parallelizing sequential clustering algorithms - and uses data weaving to parallelize multi-modal ITC, which results in the powerful DataLoom algorithm.
Abstract:
The enormous amount and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable. This paper is concerned with information-theoretic clustering (ITC) that has historically been considered the state-of-the-art in clustering multi-dimensional data. Most existing ITC methods are computationally expensive and not easily scalable. Those few ITC methods that scale well (using, e.g., parallelization) are often outperformed by the others, which are of an inherently sequential nature. First, we justify this observation theoretically. We then propose data weaving - a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal - it allows simultaneous clustering of a few types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which results in proposing a powerful DataLoom algorithm. In our experimentation with small datasets, DataLoom shows practically identical performance compared to expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate the scalability, we simultaneously clustered rows and columns of a contingency table with over 120 billion entries.
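The abstract contrasts sequential ITC with parallel alternatives but does not spell out the sequential baseline. As a rough illustration only (this is not the paper's DataLoom algorithm, and all function names here are invented for the sketch), a toy sequential co-clustering loop in the ITC spirit greedily reassigns one row or column at a time to maximize the mutual information between the row-cluster and column-cluster variables of the aggregated contingency table:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y), in bits, of a joint count/probability table (rows x cols)."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over rows
    py = joint.sum(axis=0, keepdims=True)   # marginal over columns
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

def clustered_joint(counts, row_labels, col_labels, kr, kc):
    """Aggregate a contingency table into a kr x kc cluster-level table."""
    out = np.zeros((kr, kc))
    for i, r in enumerate(row_labels):
        for j, c in enumerate(col_labels):
            out[r, c] += counts[i, j]
    return out

def sequential_itc(counts, kr=2, kc=2, iters=5, seed=0):
    """Greedy sequential co-clustering: move one object at a time to the
    cluster that maximizes I(row-cluster; col-cluster). Illustrative only."""
    rng = np.random.default_rng(seed)
    n, m = counts.shape
    rows = rng.integers(0, kr, n)
    cols = rng.integers(0, kc, m)
    for _ in range(iters):
        for i in range(n):                      # reassign rows one by one
            best, best_mi = rows[i], -1.0
            for r in range(kr):
                rows[i] = r
                mi = mutual_information(clustered_joint(counts, rows, cols, kr, kc))
                if mi > best_mi:
                    best, best_mi = r, mi
            rows[i] = best
        for j in range(m):                      # then reassign columns
            best, best_mi = cols[j], -1.0
            for c in range(kc):
                cols[j] = c
                mi = mutual_information(clustered_joint(counts, rows, cols, kr, kc))
                if mi > best_mi:
                    best, best_mi = c, mi
            cols[j] = best
    return rows, cols
```

Because each move depends on the labels produced by all previous moves, this loop is inherently sequential, which is exactly the scalability obstacle the paper's data weaving method is designed to address.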
Citations
Journal ArticleDOI
Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD)
TL;DR: Two strategies within SGD, mini-batch and adaptive sampling, are proposed to effectively reduce the number of updates (i.e., projections onto the PSD cone) in SGD.
Proceedings ArticleDOI
Sparse Latent Semantic Analysis.
TL;DR: A new model called Sparse LSA is proposed, which produces a sparse projection matrix via ℓ1 regularization; it achieves performance similar to LSA but is more efficient in projection computation and storage, and better explains topic-word relationships.
Posted Content
Efficient Distance Metric Learning by Adaptive Sampling and Mini-Batch Stochastic Gradient Descent (SGD)
TL;DR: This work develops hybrid approaches that combine the strength of adaptive sampling with that of mini-batch online learning techniques to further improve the computational efficiency of SGD for DML, and proves theoretical guarantees for both the adaptive sampling and mini-batch based approaches.
Patent
Method and system for characterizing web content
TL;DR: In this article, the authors present a method of processing Web activity data, which includes obtaining a database of website organizational data and generating from it a data structure comprising an item identifier and a website category corresponding to that identifier.
Proceedings ArticleDOI
Improving clustering stability with combinatorial MRFs
TL;DR: This work aims at improving clustering stability by attempting to diminish the influence of algorithmic inconsistencies and enhance the signal that comes from the data, proposing a mechanism that takes m clusterings as input and outputs m clusterings of comparable quality that are in higher agreement with each other.
References
Journal ArticleDOI
Two-mode multi-partitioning
Roberto Rocci,Maurizio Vichi +1 more
TL;DR: By reanalyzing the double k-means, which identifies a unique partition for each mode of the data, a relevant extension is discussed that allows specifying multiple partitions of one mode, conditional on the partition of the other.
Proceedings ArticleDOI
Robust information-theoretic clustering
TL;DR: The proposed framework, Robust Information-theoretic Clustering (RIC), is orthogonal to any known clustering algorithm: given a preliminary clustering, RIC purifies the clusters of noise and adjusts the clustering so as to simultaneously determine the most natural number and shape of the clusters.
Book ChapterDOI
Parallel density-based clustering of complex objects
TL;DR: This paper shows how simple lower-bounding distance functions can be used to parallelize the density-based clustering algorithm DBSCAN, and shows that the different result sets computed by the various slaves can effectively and efficiently be merged into a global result by means of cluster connectivity graphs.
Proceedings Article
Constrained Co-clustering of Gene Expression Data
TL;DR: An iterative co-clustering algorithm is proposed which exploits user-defined constraints while minimizing the sum-squared residues, an objective function introduced for gene expression data clustering by Cho et al. (2004).
Journal Article
Parallel density-based clustering of complex objects
TL;DR: In this paper, simple lower-bounding distance functions are used to parallelize the density-based clustering algorithm DBSCAN, and the result sets computed by the various slaves are merged into a global result by means of cluster connectivity graphs.
Related Papers (5)
A comprehensive study on clustering approaches for big data mining
Divya Pandove, Shivani Goel +1 more