Open Access · Proceedings Article · DOI

Learning Topic Models -- Going beyond SVD

TL;DR
This paper formally justifies Nonnegative Matrix Factorization (NMF), an analog of SVD in which all vectors are nonnegative, as a main tool in this context, and gives the first polynomial-time algorithm for learning topic models without the above two limitations.
Abstract
Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings; perhaps the canonical application is uncovering thematic structure in a corpus of documents. A number of foundational works, both in machine learning and in theory, have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. a distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular. Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: they either need to assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves. This paper formally justifies Nonnegative Matrix Factorization (NMF), an analog of SVD where all vectors are nonnegative, as a main tool in this context. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. Perhaps the most attractive feature of our algorithm is that it generalizes to yet more realistic models that incorporate topic-topic correlations, such as the Correlated Topic Model (CTM) and the Pachinko Allocation Model (PAM). We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD, just as NMF has come to replace SVD in many applications.
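Under the separability assumption, each topic has an anchor word that occurs only under that topic, and locating those anchor words is the combinatorial core of the NMF step. The sketch below is a minimal illustration of that idea, assuming a generic successive-projection heuristic applied to a row-normalized word matrix `A`; it is not the paper's exact procedure, and the function name and toy data are invented for the example.

```python
import numpy as np

def find_anchor_rows(A, k):
    """Greedy successive-projection heuristic: pick k rows of A that
    approximately span the remaining rows, i.e. candidate anchor words
    under separability. Illustrative sketch, not the paper's algorithm."""
    A = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # l1-normalize rows
    residual = A.copy()
    anchors = []
    for _ in range(k):
        norms = np.linalg.norm(residual, axis=1)
        j = int(np.argmax(norms))                  # farthest remaining row is an extreme point
        anchors.append(j)
        u = residual[j] / max(norms[j], 1e-12)
        residual = residual - np.outer(residual @ u, u)  # project out its direction
    return anchors

# Toy usage: 3 topics over 6 words, separable by construction
# (rows 0-2 are anchor words, rows 3-5 are convex combinations of them).
rng = np.random.default_rng(0)
W = np.vstack([np.eye(3), rng.dirichlet(np.ones(3), size=3)])  # 6 words x 3 topics
T = rng.dirichlet(np.ones(6), size=3)                          # 3 topics x 6 features
A = W @ T
print(sorted(find_anchor_rows(A, 3)))  # expected: [0, 1, 2]
```

In the toy example the first three rows are anchors by construction, so the heuristic should recover {0, 1, 2}; once anchors are known, the remaining factorization reduces to convex regression.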


Citations
Posted Content

Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model

TL;DR: In this paper, the authors consider a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel.
Proceedings Article · DOI

Scalable Collapsed Inference for High-Dimensional Topic Models

TL;DR: This paper develops an online inference algorithm for topic models which leverages stochasticity to scale well in the number of documents, sparsity to scale well in the number of topics, and which operates in the collapsed representation of the topic model for improved accuracy and run-time performance.

Convex Analysis for Processing Hyperspectral Images and Data from Hadamard Spaces

TL;DR: In this article, the authors combine convex analysis and hyperspectral image processing, and propose an algorithm to solve the missing-pixel defect problem by minimizing the Kullback-Leibler divergence in the case when no prior knowledge of pure spectra is given.
Dissertation

Learning mixed membership models with a separable latent structure: Theory, provably efficient algorithms, and applications

Weicong Ding
TL;DR: In a wide spectrum of problems in science and engineering that includes hyperspectral imaging, gene expression analysis, and machine learning tasks such as topic modeling, the observed data is high-dimensional and can be modeled as arising from a data-specific probabilistic mixture of a small collection of latent factors.
Dissertation

Constrained Matrix and Tensor Factorization: Theory, Algorithms, and Applications

Kejun Huang
TL;DR: This dissertation presents a meta-modelling system that automates the labor-intensive, and therefore expensive, process of systematically cataloging the individual components of a system.
References
Journal Article · DOI

Latent Dirichlet Allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Journal Article · DOI

Indexing by Latent Semantic Analysis

TL;DR: A new method for automatic indexing and retrieval is described that takes advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Journal Article · DOI

Learning the parts of objects by non-negative matrix factorization

TL;DR: An algorithm for non-negative matrix factorization is demonstrated that is able to learn parts of faces and semantic features of text and is in contrast to other methods that learn holistic, not parts-based, representations.
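For context, parts-based factorizations of this kind are commonly computed with multiplicative updates that preserve nonnegativity at every step. The sketch below implements the standard Frobenius-norm variant of the Lee-Seung updates as a minimal illustration; the variable names and toy matrix are assumptions for the example, not taken from the paper.

```python
import numpy as np

def nmf_multiplicative(V, k, iters=200, eps=1e-9, seed=0):
    """Multiplicative updates minimizing ||V - W H||_F^2.
    V: (m, n) nonnegative matrix; returns W (m, k), H (k, n).
    Minimal sketch; production code would add convergence checks."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update H; ratios keep entries nonnegative
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update W; objective is non-increasing
    return W, H

# Toy usage on a random nonnegative matrix.
V = np.random.default_rng(1).random((20, 12))
W, H = nmf_multiplicative(V, k=4)
print(np.linalg.norm(V - W @ H))  # reconstruction error shrinks with more iterations
```

Because each update multiplies by a nonnegative ratio, no projection step is needed to maintain the nonnegativity constraints, which is what makes this scheme attractive for parts-based learning.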
Book

Independent Component Analysis

TL;DR: Independent component analysis is a statistical generative model that can be viewed as a proper probabilistic formulation of the ideas underpinning sparse coding, and can be interpreted as providing a Bayesian prior.