scispace - formally typeset
Proceedings ArticleDOI

Evaluating topic models for digital libraries

Reads0
Chats0
TLDR
This large-scale user study includes over 70 human subjects evaluating and scoring almost 500 topics learned from collections from a wide range of genres and domains and shows how scoring model -- based on pointwise mutual information of word-pair using Wikipedia, Google and MEDLINE as external data sources - performs well at predicting human scores.
Abstract
Topic models could have a huge impact on improving the ways users find and discover content in digital libraries and search interfaces through their ability to automatically learn and apply subject tags to each and every item in a collection, and their ability to dynamically create virtual collections on the fly. However, much remains to be done to tap this potential, and empirically evaluate the true value of a given topic model to humans. In this work, we sketch out some sub-tasks that we suggest pave the way towards this goal, and present methods for assessing the coherence and interpretability of topics learned by topic models. Our large-scale user study includes over 70 human subjects evaluating and scoring almost 500 topics learned from collections from a wide range of genres and domains. We show how scoring model -- based on pointwise mutual information of word-pair using Wikipedia, Google and MEDLINE as external data sources - performs well at predicting human scores. This automated scoring of topics is an important first step to integrating topic modeling into digital libraries

read more

Citations
More filters
Proceedings ArticleDOI

LDAvis: A method for visualizing and interpreting topics

TL;DR: LDAvis, a web-based interactive visualization of topics estimated using Latent Dirichlet Allocation that is built using a combination of R and D3, and a novel method for choosing which terms to present to a user to aid in the task of topic interpretation is proposed.
Proceedings Article

Automatic Evaluation of Topic Coherence

TL;DR: A simple co-occurrence measure based on pointwise mutual information over Wikipedia data is able to achieve results for the task at or nearing the level of inter-annotator correlation, and that other Wikipedia-based lexical relatedness methods also achieve strong results.
Proceedings Article

Exploring Topic Coherence over Many Models and Many Topics

TL;DR: Two new automated semantic evaluations to three distinct latent topic models are applied, revealing that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation of documents and words in a corpus.
Proceedings ArticleDOI

Termite: visualization techniques for assessing textual topic models

TL;DR: A novel saliency measure for selecting relevant terms and a seriation algorithm that both reveals clustering structure and promotes the legibility of related terms are contributed to Termite, a visual analysis tool for assessing topic model quality.
Journal ArticleDOI

Topic Modeling in Management Research: Rendering New Theory from Textual Data

TL;DR: For example, this article used topic modeling to reveal phenomenon-based constructs and grounded conceptual relationships in textual documents. But, they did not consider the relationship between concepts and concepts in the documents.
References
More filters
Journal ArticleDOI

Latent dirichlet allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article

Latent Dirichlet Allocation

TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hof-mann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Journal ArticleDOI

Finding scientific topics

TL;DR: A generative model for documents is described, introduced by Blei, Ng, and Jordan, and a Markov chain Monte Carlo algorithm is presented for inference in this model, which is used to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics.
Journal ArticleDOI

Hierarchical Dirichlet Processes

TL;DR: This work considers problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups, and considers a hierarchical model, specifically one in which the base measure for the childDirichlet processes is itself distributed according to a Dirichlet process.
Proceedings ArticleDOI

Dynamic topic models

TL;DR: A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections, and dynamic topic models provide a qualitative window into the contents of a large document collection.
Related Papers (5)