Learning Topic Models -- Going beyond SVD
Sanjeev Arora, Rong Ge, Ankur Moitra
- pp 1-10
TLDR
This paper formally justifies Nonnegative Matrix Factorization (NMF), an analog of SVD in which all vectors are nonnegative, as a main tool for learning topic models, and gives the first polynomial-time algorithm that neither assumes one topic per document nor recovers only the span of the topic vectors.
Abstract:
Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation (LDA) model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF), an analog of SVD where all vectors are nonnegative, as a main tool in this context. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. Perhaps the most attractive feature of our algorithm is that it generalizes to yet more realistic models that incorporate topic-topic correlations, such as the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM). We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD -- just as NMF has come to replace SVD in many applications.
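The document model and the separability assumption described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm. It assumes the usual reading of separability, which the abstract does not spell out: every topic has at least one "anchor" word that has nonzero probability under that topic and zero probability under all others. The toy matrix `A` and the helper names `find_anchor_words`, `document_word_distribution`, and `sample_document` are hypothetical and chosen only for illustration.

```python
import random


def find_anchor_words(A, tol=1e-12):
    """Return one anchor word index per topic, or None if some topic has none.

    A[i][k] is the probability of word i under topic k (each column sums to 1).
    Assumed definition: word i anchors topic k if A[i][k] > 0 and
    A[i][j] == 0 for every other topic j.
    """
    n_topics = len(A[0])
    anchors = [None] * n_topics
    for i, row in enumerate(A):
        support = [k for k, p in enumerate(row) if p > tol]
        if len(support) == 1 and anchors[support[0]] is None:
            anchors[support[0]] = i
    return anchors if all(a is not None for a in anchors) else None


def document_word_distribution(A, weights):
    """Mix the topic columns of A using convex weights (one weight per topic)."""
    return [sum(p * w for p, w in zip(row, weights)) for row in A]


def sample_document(A, weights, length, rng):
    """Draw `length` word indices from the document's word distribution."""
    dist = document_word_distribution(A, weights)
    return rng.choices(range(len(A)), weights=dist, k=length)


# Toy topic matrix: 5 words (rows) x 2 topics (columns), made up for this sketch.
A = [
    [0.5, 0.0],  # word 0 appears only in topic 0 (anchor)
    [0.0, 0.4],  # word 1 appears only in topic 1 (anchor)
    [0.3, 0.2],
    [0.1, 0.3],
    [0.1, 0.1],
]

anchors = find_anchor_words(A)
dist = document_word_distribution(A, [0.25, 0.75])
doc = sample_document(A, [0.25, 0.75], length=20, rng=random.Random(0))
```

Here each document's word distribution is a convex combination of the topic columns, so it is again a probability distribution. The anchor-word check only tests the structural property on a known matrix; recovering `A` from sampled documents is the hard problem the paper addresses.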
Citations
Journal ArticleDOI
Tensor decompositions for learning latent variable models
TL;DR: In this article, the authors consider a wide class of latent variable models, including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation, which exploit a certain tensor structure in their low-order observable moments (typically, of second and third-order).
Journal ArticleDOI
A model of text for experimentation in the social sciences
TL;DR: A hierarchical mixed membership model for analyzing topical content of documents, in which mixing weights are parameterized by observed covariates, is posited, enabling researchers to introduce elements of the experimental design that informed document collection into the model, within a generally applicable framework.
Proceedings Article
A Practical Algorithm for Topic Modeling with Provable Guarantees
Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu
TL;DR: This article presents an algorithm for learning topic models that is both provable and practical; it produces results comparable to the best MCMC implementations while running orders of magnitude faster.
Proceedings Article
Provable Bounds for Learning Some Deep Representations
TL;DR: This work gives algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others, based upon a novel idea of observing correlations among features and using these to infer the underlying edge structure via a global graph recovery procedure.
Journal ArticleDOI
Fast and Robust Recursive Algorithms for Separable Nonnegative Matrix Factorization
TL;DR: This paper presents a family of fast recursive algorithms for separable nonnegative matrix factorization, a problem equivalent to hyperspectral unmixing under the linear mixing model and the pure-pixel assumption, and proves they are robust under any small perturbations of the input data matrix.
References
Journal ArticleDOI
Maximum likelihood from incomplete data via the EM algorithm
Journal ArticleDOI
Latent Dirichlet allocation
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Journal ArticleDOI
Indexing by Latent Semantic Analysis
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Journal ArticleDOI
Learning the parts of objects by non-negative matrix factorization
TL;DR: An algorithm for non-negative matrix factorization is demonstrated that is able to learn parts of faces and semantic features of text and is in contrast to other methods that learn holistic, not parts-based, representations.
Book
Independent Component Analysis
TL;DR: Independent component analysis is presented as a statistical generative model that gives a proper probabilistic formulation of the ideas underpinning sparse coding and can be interpreted as providing a Bayesian prior.