Open Access Journal Article

Distributed Algorithms for Topic Models

TLDR
This work describes distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model and the Hierarchical Dirichlet Process (HDP) model, and proposes a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data.
Abstract
We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting: processors concurrently perform Gibbs sampling over local data, followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second algorithm uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer.
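To make the first LDA scheme concrete, below is a minimal single-machine Python sketch of approximate distributed Gibbs sampling: each "processor" sweeps its data shard against a private copy of the global topic-word counts, and the per-shard count deltas are then merged into the global state. The function names, hyperparameter defaults, and toy corpus are illustrative assumptions, and the sequential loop stands in for concurrent processors; this is a sketch of the general idea, not the authors' implementation.

```python
import numpy as np

def gibbs_sweep(docs, z, ndk, nkw, nk, alpha, beta, rng):
    """One collapsed-Gibbs sweep over a shard's documents (illustrative, unoptimized)."""
    K, V = nkw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # remove the word's current assignment from all counts
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # full conditional p(z = k | rest) for collapsed LDA
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

def adlda(shards, K, V, alpha=0.1, beta=0.01, n_iter=20, seed=0):
    """Single-machine simulation of the approximate distributed scheme:
    each shard sweeps against a private copy of the global topic-word
    counts, then the per-shard deltas are merged additively."""
    rng = np.random.default_rng(seed)
    global_nkw = np.zeros((K, V), dtype=np.int64)
    state = []
    for docs in shards:
        # random initial topic assignments and per-document topic counts
        z = [list(rng.integers(0, K, size=len(doc))) for doc in docs]
        ndk = np.zeros((len(docs), K), dtype=np.int64)
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                ndk[d, z[d][i]] += 1
                global_nkw[z[d][i], w] += 1
        state.append((docs, z, ndk))
    for _ in range(n_iter):
        new_global = global_nkw.copy()
        for docs, z, ndk in state:        # on a cluster these sweeps run concurrently
            local_nkw = global_nkw.copy()          # stale snapshot of global counts
            nk = local_nkw.sum(axis=1)
            gibbs_sweep(docs, z, ndk, local_nkw, nk, alpha, beta, rng)
            new_global += local_nkw - global_nkw   # accumulate this shard's delta
        global_nkw = new_global
    return global_nkw

# toy usage: two shards of word-id documents over a 6-word vocabulary
shards = [[[0, 1, 2, 0], [1, 1, 3]], [[4, 5, 4], [2, 3, 5, 5]]]
topic_word_counts = adlda(shards, K=2, V=6)
```

The merge step is the source of the approximation the abstract mentions: each shard samples against counts that are stale between synchronizations, which is why the scheme approximates, rather than exactly reproduces, single-processor Gibbs sampling.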


Citations
Journal Article

Stochastic variational inference

TL;DR: Stochastic variational inference lets us apply complex Bayesian models to massive data sets, and it is shown that the Bayesian nonparametric topic model outperforms its parametric counterpart.
Journal Article

topicmodels: An R Package for Fitting Topic Models

TL;DR: The R package topicmodels provides basic infrastructure for fitting topic models based on data structures from the text mining package tm; the fitted models can be used to estimate the similarity between documents, as well as between documents and a set of specified keywords, via an additional layer of latent variables.
Proceedings Article

Distributed large-scale natural graph factorization

TL;DR: This work proposes a novel factorization technique that relies on partitioning a graph so as to minimize the number of neighboring vertices, rather than edges, across partitions; the decomposition is based on a streaming algorithm.
Journal Article

An architecture for parallel topic models

TL;DR: This paper describes a high-performance sampling architecture for inference of latent topic models on a cluster of workstations, and shows that the architecture is entirely general and can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
Proceedings Article

Gaia: geo-distributed machine learning approaching LAN speeds

TL;DR: A new, general geo-distributed ML system, Gaia, is introduced that decouples the communication within a data center from the communication between data centers, enabling different communication and consistency models for each.
References
Journal Article

Latent Dirichlet Allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article

Latent Dirichlet Allocation

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Journal Article

Finding scientific topics

TL;DR: A generative model for documents, introduced by Blei, Ng, and Jordan, is described, and a Markov chain Monte Carlo algorithm is presented for inference in this model; the algorithm is used to analyze abstracts from PNAS, with Bayesian model selection establishing the number of topics.