Proceedings ArticleDOI
Evaluating topic models for digital libraries
David Newman, Youn Noh, Edmund M. Talley, Sarvnaz Karimi, Timothy Baldwin +4 more
- pp 215-224
TLDR
This large-scale user study includes over 70 human subjects evaluating and scoring almost 500 topics learned from collections spanning a wide range of genres and domains, and shows that a scoring model based on pointwise mutual information of word pairs, using Wikipedia, Google and MEDLINE as external data sources, performs well at predicting human scores.
Abstract:
Topic models could have a huge impact on improving the ways users find and discover content in digital libraries and search interfaces through their ability to automatically learn and apply subject tags to each and every item in a collection, and their ability to dynamically create virtual collections on the fly. However, much remains to be done to tap this potential, and empirically evaluate the true value of a given topic model to humans. In this work, we sketch out some sub-tasks that we suggest pave the way towards this goal, and present methods for assessing the coherence and interpretability of topics learned by topic models. Our large-scale user study includes over 70 human subjects evaluating and scoring almost 500 topics learned from collections from a wide range of genres and domains. We show how a scoring model, based on pointwise mutual information of word pairs using Wikipedia, Google and MEDLINE as external data sources, performs well at predicting human scores. This automated scoring of topics is an important first step to integrating topic modeling into digital libraries.
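To make the scoring idea in the abstract concrete, here is a minimal sketch of a PMI-based topic score. It is an illustration under stated assumptions, not the authors' implementation: the function name `pmi_topic_score` is hypothetical, co-occurrence is counted at the whole-document level over a small in-memory reference corpus (standing in for an external source such as Wikipedia), pairwise scores are averaged, and word pairs that never co-occur are scored 0.0.

```python
import math
from itertools import combinations


def pmi_topic_score(topic_words, reference_docs):
    """Mean pointwise mutual information over all pairs of topic words.

    Assumptions (not taken from the paper): probabilities are estimated
    from document-level co-occurrence in `reference_docs`, and a pair
    that never co-occurs contributes 0.0 instead of negative infinity.
    """
    # Represent each reference document as a set of lowercase tokens.
    doc_sets = [set(doc.lower().split()) for doc in reference_docs]
    n_docs = len(doc_sets)
    if n_docs == 0:
        raise ValueError("reference_docs must not be empty")

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for tokens in doc_sets if all(w in tokens for w in words))

    words = [w.lower() for w in topic_words]
    pair_scores = []
    for w1, w2 in combinations(words, 2):
        joint = doc_freq(w1, w2)
        if joint == 0:
            # Simplifying assumption: no evidence of association.
            pair_scores.append(0.0)
            continue
        p1 = doc_freq(w1) / n_docs
        p2 = doc_freq(w2) / n_docs
        p12 = joint / n_docs
        # PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) )
        pair_scores.append(math.log(p12 / (p1 * p2)))

    # A topic with fewer than two words has no pairs to score.
    return sum(pair_scores) / len(pair_scores) if pair_scores else 0.0
```

Under these assumptions, a topic whose top words tend to appear together in the reference corpus receives a higher score than one whose words rarely co-occur. A practical version would likely use a much larger reference corpus, a sliding co-occurrence window, and a smoothing scheme in place of the zero fallback.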
Citations
Proceedings ArticleDOI
LDAvis: A method for visualizing and interpreting topics
TL;DR: LDAvis, a web-based interactive visualization of topics estimated using Latent Dirichlet Allocation and built using a combination of R and D3, is presented, and a novel method for choosing which terms to present to a user to aid in the task of topic interpretation is proposed.
Proceedings Article
Automatic Evaluation of Topic Coherence
TL;DR: A simple co-occurrence measure based on pointwise mutual information over Wikipedia data is able to achieve results for the task at or nearing the level of inter-annotator correlation, and that other Wikipedia-based lexical relatedness methods also achieve strong results.
Proceedings Article
Exploring Topic Coherence over Many Models and Many Topics
TL;DR: Two new automated semantic evaluations are applied to three distinct latent topic models, revealing that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation of documents and words in a corpus.
Proceedings ArticleDOI
Termite: visualization techniques for assessing textual topic models
TL;DR: A novel saliency measure for selecting relevant terms and a seriation algorithm that both reveals clustering structure and promotes the legibility of related terms are contributed to Termite, a visual analysis tool for assessing topic model quality.
Journal ArticleDOI
Topic Modeling in Management Research: Rendering New Theory from Textual Data
Timothy R. Hannigan, Richard F. J. Haans, Keyvan Vakili, Hovig Tchalian, Vern Glaser, Milo Shaoqing Wang, Sarah Kaplan, P. Devereaux Jennings +7 more
TL;DR: This article uses topic modeling to reveal phenomenon-based constructs and grounded conceptual relationships in textual documents. However, it does not consider the relationships among concepts in the documents.
References
Journal ArticleDOI
Latent Dirichlet Allocation
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article
Latent Dirichlet Allocation
TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Journal ArticleDOI
Finding scientific topics
TL;DR: A generative model for documents is described, introduced by Blei, Ng, and Jordan, and a Markov chain Monte Carlo algorithm is presented for inference in this model, which is used to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics.
Journal ArticleDOI
Hierarchical Dirichlet Processes
TL;DR: This work considers problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups, and considers a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process.
Proceedings ArticleDOI
Dynamic topic models
David M. Blei, John Lafferty +1 more
TL;DR: A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections, and dynamic topic models provide a qualitative window into the contents of a large document collection.