Proceedings ArticleDOI

Data weaving: scaling up the state-of-the-art in data clustering

26 Oct 2008, pp. 1083-1092
TL;DR: This paper proposes data weaving, a novel method for parallelizing sequential clustering algorithms, and uses it to parallelize multi-modal ITC, yielding the powerful DataLoom algorithm.
Abstract: The enormous amount and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable, but rarely both. This paper is concerned with information-theoretic clustering (ITC), which has historically been considered the state of the art in clustering multi-dimensional data. Most existing ITC methods are computationally expensive and not easily scalable. The few ITC methods that scale well (using, e.g., parallelization) are often outperformed by others of an inherently sequential nature. First, we justify this observation theoretically. We then propose data weaving, a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal: it allows simultaneous clustering of several types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which yields the powerful DataLoom algorithm. In our experiments on small datasets, DataLoom shows practically identical performance to expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate the scalability, we simultaneously clustered the rows and columns of a contingency table with over 120 billion entries.
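The abstract describes clustering the rows and columns of a contingency table simultaneously. For orientation, below is a minimal sketch of the mutual-information objective that information-theoretic co-clustering methods in this family typically maximize; the function name and the exact objective are illustrative assumptions, not the authors' DataLoom implementation.

```python
# Hedged sketch: the mutual-information objective that information-theoretic
# co-clustering typically maximizes over row/column cluster assignments.
# Illustrative only; not the authors' DataLoom implementation.
import numpy as np

def coclustering_mutual_information(counts, row_labels, col_labels):
    """I(row clusters; column clusters) of an aggregated contingency table.

    counts     : (n_rows, n_cols) nonnegative co-occurrence table
    row_labels : array of length n_rows with cluster ids in [0, kr)
    col_labels : array of length n_cols with cluster ids in [0, kc)
    """
    kr, kc = row_labels.max() + 1, col_labels.max() + 1
    agg = np.zeros((kr, kc))
    # Collapse the full table into a kr x kc cluster-level table.
    np.add.at(agg, (row_labels[:, None], col_labels[None, :]), counts)
    p = agg / agg.sum()                    # joint distribution over cluster pairs
    pr, pc = p.sum(axis=1), p.sum(axis=0)  # cluster marginals
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / np.outer(pr, pc)[nz])))
```

A sequential ITC algorithm would repeatedly move a single row or column to the cluster that most increases such an objective; moves like these interact, which is presumably why, as the abstract notes, naive parallel variants tend to underperform their sequential counterparts.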
Citations
Journal ArticleDOI
Qi Qian, Rong Jin, Jinfeng Yi, Lijun Zhang, Shenghuo Zhu
TL;DR: In this paper, two strategies within SGD, i.e., mini-batch and adaptive sampling, are proposed to effectively reduce the number of updates (i.e. projections onto the PSD cone) in SGD.
Abstract: Distance metric learning (DML) is an important task that has found applications in many domains. The high computational cost of DML arises from the large number of variables to be determined and the constraint that a distance metric has to be a positive semi-definite (PSD) matrix. Although stochastic gradient descent (SGD) has been successfully applied to improve the efficiency of DML, it can still be computationally expensive because, to ensure that the solution is a PSD matrix, it has to project the updated distance metric onto the PSD cone at every iteration, an expensive operation. We address this challenge by developing two strategies within SGD, i.e. mini-batch and adaptive sampling, to effectively reduce the number of updates (i.e. projections onto the PSD cone) in SGD. We also develop hybrid approaches that combine the strength of adaptive sampling with that of mini-batch online learning techniques to further improve the computational efficiency of SGD for DML. We prove the theoretical guarantees for both adaptive sampling and mini-batch based approaches for DML. We also conduct an extensive empirical study to verify the effectiveness of the proposed algorithms for DML.
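The expensive step the abstract refers to is the projection of the updated metric onto the PSD cone at every SGD iteration. A minimal sketch of the standard Frobenius-norm projection (not the paper's code) shows why it is costly: it requires a full eigendecomposition of the d x d metric.

```python
# Minimal sketch of the PSD-cone projection that makes plain SGD for DML
# expensive (not the paper's code): project a symmetric matrix onto the
# nearest PSD matrix in Frobenius norm by clipping negative eigenvalues.
import numpy as np

def project_to_psd(M):
    sym = (M + M.T) / 2.0                   # symmetrize first
    eigvals, eigvecs = np.linalg.eigh(sym)  # O(d^3) eigendecomposition
    eigvals = np.clip(eigvals, 0.0, None)   # zero out negative eigenvalues
    return (eigvecs * eigvals) @ eigvecs.T  # V diag(lambda_+) V^T
```

Mini-batching and adaptive sampling reduce how often this cubic-cost step has to run, which is where the reported efficiency gains come from.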

86 citations

Proceedings ArticleDOI
Xi Chen, Yanjun Qi, Bing Bai, Qihang Lin, Jaime G. Carbonell
01 Jan 2011
TL;DR: A new model called Sparse LSA is proposed, which produces a sparse projection matrix via ℓ1 regularization and achieves performance similar to LSA, but is more efficient in projection computation and storage, and also better explains the topic-word relationships.
Abstract: Latent semantic analysis (LSA), as one of the most popular unsupervised dimension reduction tools, has a wide range of applications in text mining and information retrieval. The key idea of LSA is to learn a projection matrix that maps the high dimensional vector space representations of documents to a lower dimensional latent space, i.e., the so-called latent topic space. In this paper, we propose a new model called Sparse LSA, which produces a sparse projection matrix via ℓ1 regularization. Compared to the traditional LSA, Sparse LSA selects only a small number of relevant words for each topic and hence provides a compact representation of topic-word relationships. Moreover, Sparse LSA is computationally very efficient with much less memory usage for storing the projection matrix. Furthermore, we propose two important extensions of Sparse LSA: group structured Sparse LSA and non-negative Sparse LSA. We conduct experiments on several benchmark datasets and compare Sparse LSA and its extensions with several widely used methods, e.g. LSA, Sparse Coding and LDA. Empirical results suggest that Sparse LSA achieves performance similar to LSA, but is more efficient in projection computation and storage, and also better explains the topic-word relationships.
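The sparsity in Sparse LSA comes from the ℓ1 penalty on the projection matrix. As a hedged illustration (not the authors' solver), the soft-thresholding proximal operator below is the standard mechanism by which an ℓ1 penalty drives small matrix entries to exactly zero.

```python
# Hedged sketch: the soft-thresholding operator used by typical l1-regularized
# solvers; an l1 penalty on the projection matrix zeroes out small entries,
# which is what gives Sparse LSA its compact topic-word map.
# Illustrative only, not the authors' optimization code.
import numpy as np

def soft_threshold(A, lam):
    """Entrywise proximal operator of lam * ||A||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - lam, 0.0)

# Example: a dense (toy-sized) topic-word projection matrix becomes sparse.
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 50))            # 10 topics x 50 words
A_sparse = soft_threshold(A, lam=1.0)
print(f"nonzero fraction: {np.mean(A_sparse != 0):.2f}")
```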

40 citations


Cites methods from "Data weaving: scaling up the state-..."

  • ...For RCV1, we remove the words appearing fewer than 10 times and standard stopwords; pre-process the data according to [2] (3); and convert it into a 53-class classification task....


Posted Content
Qi Qian, Rong Jin, Jinfeng Yi, Lijun Zhang, Shenghuo Zhu
TL;DR: This work develops hybrid approaches that combine the strength of adaptive sampling with that of mini-batch online learning techniques to further improve the computational efficiency of SGD for DML and proves the theoretical guarantees for both adaptive sampling and mini-batch based approaches for DML.
Abstract: Distance metric learning (DML) is an important task that has found applications in many domains. The high computational cost of DML arises from the large number of variables to be determined and the constraint that a distance metric has to be a positive semi-definite (PSD) matrix. Although stochastic gradient descent (SGD) has been successfully applied to improve the efficiency of DML, it can still be computationally expensive because in order to ensure that the solution is a PSD matrix, it has to, at every iteration, project the updated distance metric onto the PSD cone, an expensive operation. We address this challenge by developing two strategies within SGD, i.e. mini-batch and adaptive sampling, to effectively reduce the number of updates (i.e., projections onto the PSD cone) in SGD. We also develop hybrid approaches that combine the strength of adaptive sampling with that of mini-batch online learning techniques to further improve the computational efficiency of SGD for DML. We prove the theoretical guarantees for both adaptive sampling and mini-batch based approaches for DML. We also conduct an extensive empirical study to verify the effectiveness of the proposed algorithms for DML.

19 citations


Cites methods from "Data weaving: scaling up the state-..."

  • ...2009) comprised of the documents from the 30 most popular categories and rcv20 is the subset of a large rcv1 dataset (Bekkerman and Scholz 2008) consisting of documents from the 20 most popular categories. Following Chechik et al. (2010), we reduce the dimensionality of these document datasets to 200 by principal component analysis (PCA)....


Patent
10 Aug 2009
TL;DR: In this article, the authors present a method of processing Web activity data, which includes obtaining a database of Website organizational data and generating a data structure from the database of website organizational data comprising an item identifier and a Website category corresponding to the item identifier.
Abstract: An exemplary embodiment of the present invention provides a method of processing Web activity data. The method includes obtaining a database of Website organizational data. The method also includes generating a data structure from the database of Website organizational data comprising an item identifier and a Website category corresponding to the item identifier. The method also includes generating a reduced-rank classification structure from the data structure, the reduced-rank classification structure including a category grouping corresponding to one or more of the Website categories.

15 citations

Proceedings ArticleDOI
28 Jun 2009
TL;DR: This work aims at improving clustering stability by attempting to diminish the influence of algorithmic inconsistencies and enhance the signal that comes from the data; it proposes a mechanism that takes m clusterings as input and outputs m clusterings of comparable quality, which are in higher agreement with each other.
Abstract: As clustering methods are often sensitive to parameter tuning, obtaining stability in clustering results is an important task. In this work, we aim at improving clustering stability by attempting to diminish the influence of algorithmic inconsistencies and enhance the signal that comes from the data. We propose a mechanism that takes m clusterings as input and outputs m clusterings of comparable quality, which are in higher agreement with each other. We call our method the Clustering Agreement Process (CAP). To preserve the clustering quality, CAP uses the same optimization procedure as used in clustering. In particular, we study the stability problem of randomized clustering methods (which usually produce different results at each run). We focus on methods that are based on inference in a combinatorial Markov Random Field (or Comraf, for short) of a simple topology. We instantiate CAP as inference within a more complex, bipartite Comraf. We test the resulting system on four datasets, three of which are medium-sized text collections, while the fourth is a large-scale user/movie dataset. First, in all the four cases, our system significantly improves the clustering stability measured in terms of the macro-averaged Jaccard index. Second, in all the four cases our system managed to significantly improve clustering quality as well, achieving the state-of-the-art results. Third, our system significantly improves stability of consensus clustering built on top of the randomized clustering solutions.
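Stability here is reported in terms of the macro-averaged Jaccard index between clusterings. The sketch below implements one common reading of that measure (best-match Jaccard per cluster, averaged over clusters); the paper's exact definition may differ, and the function name is ours.

```python
# Hedged sketch of a macro-averaged Jaccard index between two clusterings,
# under one common definition: for each cluster of the first clustering take
# its best Jaccard overlap with any cluster of the second, then average.
# The paper's exact definition may differ.
def macro_jaccard(labels_a, labels_b):
    """labels_a, labels_b: cluster ids, one per data point, in the same order."""
    def as_clusters(labels):
        groups = {}
        for idx, lab in enumerate(labels):
            groups.setdefault(lab, set()).add(idx)
        return list(groups.values())

    def jaccard(s, t):
        return len(s & t) / len(s | t)

    clusters_a, clusters_b = as_clusters(labels_a), as_clusters(labels_b)
    best = [max(jaccard(s, t) for t in clusters_b) for s in clusters_a]
    return sum(best) / len(best)

print(macro_jaccard([0, 0, 1, 1, 2], [1, 1, 0, 0, 0]))  # ~0.67
```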

10 citations


Cites background or methods from "Data weaving: scaling up the state-..."

  • ...when the user u was assigned into a cluster ũ, and the movie m was assigned into a cluster m̃ (for a discussion, see [5])....


  • ...Improving Clustering Stability with Combinatorial MRFs. Ron Bekkerman, Martin Scholz, and Krishnamurthy Viswanathan (HP Labs, 1501 Page Mill Rd, Palo Alto, CA 94304). ABSTRACT: As clustering methods are often sensitive to parameter tuning, obtaining stability in clustering results is an important task....


  • ...[5] R. Bekkerman and M. Scholz....


  • ...Instead, we apply its parallelized version, called DataLoom [5]....


  • ...For evaluating our collaborative filtering results based on the constructed clusterings of users and movies, we follow Bekkerman and Scholz [5] who compute the Area Under the ROC Curve (or AUC, in short) for a constructed ranking of the user/movie pairs (see Section 5....


References
Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
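The LDA described here uses variational inference. For a concrete sense of the document-topic representation it produces, here is a hedged usage sketch with scikit-learn's off-the-shelf variational implementation; this is an independent implementation chosen for illustration (the DataLoom experiments reportedly used a Gibbs-sampling LDA instead).

```python
# Hedged sketch: fitting LDA with scikit-learn's variational implementation.
# Independent, off-the-shelf implementation for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "clustering of documents by topic",
    "parallel algorithms scale to large data",
    "topic models represent documents as mixtures of topics",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # each row: topic distribution of a doc
print(doc_topics.round(2))
```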

30,570 citations


"Data weaving: scaling up the state-..." refers background in this paper

  • ...In LDA, each document is represented as a distribution of topics, and parameters of those distributions are learned from the data....


  • ...against the standard uni-modal k-means, as well as against Latent Dirichlet Allocation (LDA) [5]—a popular generative model for representing document collections....


  • ...Per-dataset results (mean ± std where reported):

        Dataset    k-means  LDA         IT-CC       SCC         2-way DataLoom (det.)  2-way DataLoom (stoch.)  3-way DataLoom (stoch.)
        acheyer    24.7     44.3 ± 0.4  39.0 ± 0.6  46.1 ± 0.3  43.7 ± 0.5             42.4 ± 0.5               46.7 ± 0.3
        mgondek    37.0     68.0 ± 0.8  61.3 ± 1.5  63.4 ± 1.1  63.3 ± 1.8             64.6 ± 1.2               73.8 ± 1.7
        sanders-r  45.5     63.8 ± 0.4  56.1 ± 0.7  60.2 ± 0.4  59.8 ± 0.9             61.3 ± 0.8               66.5 ± 0.2
        20NG       16.1     56.7 ± 0.6  54.2 ± 0.7  57.7 ± 0.2  55.1 ± 0.7             55.6 ± 0.7               N/A

    ...against the standard uni-modal k-means, as well as against Latent Dirichlet Allocation (LDA) [5], a popular generative model for representing document collections....


  • ...We used Xuerui Wang's LDA implementation [25] that applies Gibbs sampling with 10000 sampling iterations....


Proceedings Article
03 Jan 2001
TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

25,546 citations

Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
06 Dec 2004
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
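The abstract defines the programming model: a user-supplied map function emits intermediate key/value pairs and a reduce function merges all values sharing a key. The single-process word-count sketch below only illustrates that contract; the real system's value lies in the distributed runtime, partitioning, and fault tolerance described above.

```python
# Minimal single-process sketch of the MapReduce contract (word count).
# Illustrates only the map/reduce functions users write, not the runtime.
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Merge all intermediate values associated with the same key.
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):      # map phase
            intermediate[k].append(v)        # shuffle: group values by key
    results = {}
    for k, vs in intermediate.items():       # reduce phase
        for out_k, out_v in reduce_fn(k, vs):
            results[out_k] = out_v
    return results

print(run_mapreduce([("d1", "data weaving weaves data")], map_fn, reduce_fn))
```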

20,309 citations


"Data weaving: scaling up the state-..." refers background in this paper

  • ...It has recently been discussed that the same kind of parallelization works very well in combination with the popular MapReduce paradigm [10]....


Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Journal ArticleDOI
U.M. Fayyad
TL;DR: Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.
Abstract: Current computing and storage technology is rapidly outstripping society's ability to make meaningful use of the torrent of available data. Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.

4,806 citations