Learning Topic Models -- Going beyond SVD
Sanjeev Arora, Rong Ge, Ankur Moitra, et al.
- pp 1-10
TLDR
This paper formally justifies Nonnegative Matrix Factorization (NMF), an analog of SVD in which all vectors are nonnegative, as a main tool in this context, and gives the first polynomial-time algorithm for learning topic models without the above two limitations.
Abstract
Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings; perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works, both in machine learning and in theory, have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. Perhaps the most attractive feature of our algorithm is that it generalizes to yet more realistic models that incorporate topic-topic correlations, such as the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM). We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD -- just as NMF has come to replace SVD in many applications.
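The separability assumption has a simple geometric reading: each topic owns an "anchor" row, and under row normalization those anchor rows are the vertices of a simplex containing every other row. The sketch below illustrates this geometry with a greedy successive-projection-style anchor search in plain NumPy. It is a minimal toy illustration of the idea, not the paper's actual algorithm, and all names (`find_anchors`, the toy data) are made up for this example.

```python
import numpy as np

def find_anchors(X, k):
    """Greedy successive-projection anchor search (a sketch, not the
    paper's algorithm): normalize rows onto the simplex, then repeatedly
    pick the row farthest from the span of the rows chosen so far.
    On noiseless separable data the picked rows are the anchor rows."""
    R = X / X.sum(axis=1, keepdims=True)  # rows now lie on the simplex
    anchors = []
    for _ in range(k):
        i = int(np.argmax(np.linalg.norm(R, axis=1)))
        anchors.append(i)
        u = R[i] / np.linalg.norm(R[i])   # direction of the new anchor
        R = R - np.outer(R @ u, u)        # project it out of every row
    return anchors

# Toy separable instance: rows of X are convex combinations of k topic
# vectors, and the first k rows are "pure" (one topic each).
rng = np.random.default_rng(0)
k, n, d = 3, 40, 6
H = rng.random((k, d)) + 0.1            # k topic vectors over d words
W = rng.dirichlet(np.ones(k), size=n)   # per-document topic proportions
W[:k] = np.eye(k)                       # plant the pure (anchor) rows
X = W @ H
print("anchors:", sorted(find_anchors(X, k)))  # recovers the planted rows
```

Because the Euclidean norm is convex, its maximum over the normalized rows is attained at a vertex of their convex hull, which is why the greedy pick lands on a pure row at each step; with noisy data a robust variant of this vertex search is needed.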
Citations
Posted Content
Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model
TL;DR: In this paper, the authors consider a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel.
Proceedings ArticleDOI
Scalable Collapsed Inference for High-Dimensional Topic Models
Rashidul Islam, James R. Foulds, et al.
TL;DR: This paper develops an online inference algorithm for topic models which leverages stochasticity to scale well in the number of documents, sparsity to scale well in the number of topics, and which operates in the collapsed representation of the topic model for improved accuracy and run-time performance.
Convex Analysis for Processing Hyperspectral Images and Data from Hadamard Spaces
TL;DR: In this article, the authors combine convex analysis and hyperspectral image processing, and propose an algorithm to solve the missing-pixel defect problem by minimizing the Kullback-Leibler divergence in the case when no prior knowledge of pure spectra is given.
Dissertation
Learning mixed membership models with a separable latent structure: Theory, provably efficient algorithms, and applications
TL;DR: In a wide spectrum of problems in science and engineering that includes hyperspectral imaging, gene expression analysis, and machine learning tasks such as topic modeling, the observed data is high-dimensional and can be modeled as arising from a data-specific probabilistic mixture of a small collection of latent factors.
Dissertation
Constrained Matrix and Tensor Factorization: Theory, Algorithms, and Applications
TL;DR: This dissertation presents a meta-modelling system that automates the labor-intensive, time-consuming, and therefore expensive process of systematically cataloging the individual components of a system.
References
Journal ArticleDOI
Maximum likelihood from incomplete data via the EM algorithm
Journal ArticleDOI
Latent dirichlet allocation
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Journal ArticleDOI
Indexing by Latent Semantic Analysis
TL;DR: A new method for automatic indexing and retrieval that takes advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries.
Journal ArticleDOI
Learning the parts of objects by non-negative matrix factorization
TL;DR: An algorithm for non-negative matrix factorization is demonstrated that is able to learn parts of faces and semantic features of text and is in contrast to other methods that learn holistic, not parts-based, representations.
Book
Independent Component Analysis
TL;DR: Independent component analysis is a statistical generative model that can be seen as a proper probabilistic formulation of the ideas underpinning sparse coding, and can be interpreted as providing a Bayesian prior.