Proceedings Article (Open Access)
Sparse Additive Generative Models of Text
Jacob Eisenstein, Amr Ahmed, Eric P. Xing
pp. 1041–1048
Abstract:
Generative models of text typically associate a multinomial with every class label or topic. Even in simple models this requires the estimation of thousands of parameters; in multi-faceted latent variable models, standard approaches require additional latent "switching" variables for every token, complicating inference. In this paper, we propose an alternative generative model for text. The central idea is that each class label or latent topic is endowed with a model of the deviation in log-frequency from a constant background distribution. This approach has two key advantages: we can enforce sparsity to prevent overfitting, and we can combine generative facets through simple addition in log space, avoiding the need for latent switching variables. We demonstrate the applicability of this idea to a range of scenarios: classification, topic modeling, and more complex multifaceted generative models.
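The abstract's central idea can be sketched in a few lines of NumPy. The vocabulary, facet names, and deviation values below are invented for illustration; the sketch only shows how sparse log-frequency deviations from a background distribution combine additively without any latent switching variable:

```python
import numpy as np

# Toy vocabulary and a uniform background log-frequency distribution.
vocab = ["the", "game", "score", "election", "vote"]
V = len(vocab)
background = np.log(np.full(V, 1.0 / V))  # background log-probabilities

# Sparse deviations for two hypothetical facets: most entries are exactly
# zero, so each facet only stores how a handful of words deviate from
# the background.
eta_sports = np.array([0.0, 2.0, 1.5, 0.0, 0.0])
eta_politics = np.array([0.0, 0.0, 0.0, 2.0, 1.5])

def word_distribution(*etas):
    """Combine facets by simple addition in log space, then normalize."""
    logits = background + sum(etas)
    probs = np.exp(logits - logits.max())  # subtract max for stability
    return probs / probs.sum()

# A single facet yields one class-conditional word distribution:
p_sports = word_distribution(eta_sports)
# Two facets combine by addition, with no per-token switching variable:
p_both = word_distribution(eta_sports, eta_politics)
```

Here the deviations are hand-set; in the actual model they are estimated with a sparsity-inducing prior so that most entries are driven to zero.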
Citations
Journal Article
Structural topic models for open-ended survey responses
Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley, Chris Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, David G. Rand
TL;DR: The structural topic model makes analyzing open-ended responses easier and more revealing, and can be used to estimate treatment effects; the approach is illustrated with analyses of text from surveys and experiments.
Journal Article
stm: An R Package for Structural Topic Models
TL;DR: This paper demonstrates how to use the R package stm for structural topic modeling, which allows researchers to flexibly estimate a topic model that includes document-level metadata.
Journal Article
A model of text for experimentation in the social sciences
TL;DR: A hierarchical mixed-membership model for analyzing the topical content of documents is posited, in which mixing weights are parameterized by observed covariates; this enables researchers to incorporate elements of the experimental design that informed document collection into the model, within a generally applicable framework.
Proceedings Article
Discovering geographical topics in the twitter stream
TL;DR: An algorithm is presented that models diversity in tweets along topical and geographical dimensions together with each user's interest distribution; by exploiting sparse factorial coding of the attributes, it can handle a large and diverse set of covariates efficiently.
Proceedings Article
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
TL;DR: Pretrained LMs are found to degenerate into toxic text even from seemingly innocuous prompts; an empirical assessment of several controllable generation methods finds that while data- or compute-intensive methods are more effective at steering away from toxicity than simpler solutions, no current method is failsafe against neural toxic degeneration.
References
Journal Article
Regression Shrinkage and Selection via the Lasso
TL;DR: A new method for estimation in linear models, called the lasso, is proposed; it minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant.
Journal Article
Latent dirichlet allocation
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article
Latent Dirichlet Allocation
TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Journal Article
Sparse Bayesian learning and the relevance vector machine
TL;DR: It is demonstrated that by exploiting a probabilistic Bayesian learning framework, the 'relevance vector machine' (RVM) can derive accurate prediction models which typically utilise dramatically fewer basis functions than a comparable SVM while offering a number of additional advantages.
Posted Content
Supervised Topic Models
David M. Blei, Jon McAuliffe
TL;DR: Supervised latent Dirichlet allocation (sLDA), a statistical model of labeled documents that accommodates a variety of response types, is proposed, together with an approximate maximum-likelihood procedure for parameter estimation that relies on variational methods to handle intractable posterior expectations.