Proceedings Article

Reading Tea Leaves: How Humans Interpret Topic Models

07 Dec 2009, Vol. 22, pp. 288-296
TL;DR: New quantitative methods for measuring semantic meaning in inferred topics are presented, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood.
Abstract: Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.
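
The "quantitative methods" in question are human intrusion tasks (the intrusion-detection test attributed to Chang et al., 2009 in the excerpts further down this page): annotators are shown a topic's top words plus one out-of-place word and asked to spot it. Below is a minimal sketch of how such an instance could be assembled; the `topic_word_probs` layout, the probability cutoff, and the helper name are assumptions for illustration, not the paper's code:

```python
import random

def word_intrusion_instance(topic_word_probs, topic_id, n_top=5, seed=0):
    """Assemble one word-intrusion question for human annotators.

    topic_word_probs: dict mapping topic id -> {word: probability} (assumed format).
    Returns the shuffled options to display and the intruder to score against.
    """
    rng = random.Random(seed)
    this_topic = topic_word_probs[topic_id]
    top_words = sorted(this_topic, key=this_topic.get, reverse=True)[:n_top]

    # Intruder candidates: words that are improbable in this topic but rank
    # among the top words of some other topic.
    candidates = []
    for other_id, other in topic_word_probs.items():
        if other_id == topic_id:
            continue
        for w in sorted(other, key=other.get, reverse=True)[:n_top]:
            if w not in top_words and this_topic.get(w, 0.0) < 1e-3:
                candidates.append(w)
    if not candidates:
        raise ValueError("no suitable intruder word found")
    intruder = rng.choice(candidates)

    options = top_words + [intruder]
    rng.shuffle(options)
    return options, intruder
```

Aggregating the fraction of annotators who pick the true intruder gives a per-topic score, which is the kind of measure the large-scale user studies compare against held-out likelihood.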


Citations
Book
24 Aug 2012
TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Abstract: Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.

8,059 citations

22 May 2010
TL;DR: This work describes a Natural Language Processing software framework based on the idea of document streaming, i.e. processing corpora document after document in a memory-independent fashion; it implements several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size.
Abstract: Large corpora are ubiquitous in today's world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). We identify a gap in existing VSM implementations: their scalability and ease of use. We describe a Natural Language Processing software framework based on the idea of document streaming, i.e. processing corpora document after document, in a memory-independent fashion. In this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library, DML-CZ.
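
The central design idea is that the corpus is never held in memory at once: documents are streamed and processed one at a time, so memory use stays constant regardless of corpus size. A framework-agnostic Python sketch of that pattern follows; the file layout (one document per line), the tokenizer, and `update_model` are placeholder assumptions:

```python
from typing import Iterator, List

def stream_documents(path: str) -> Iterator[List[str]]:
    """Yield one tokenized document at a time from a (possibly huge) text file.

    Only the current document is held in RAM, so the same iterator can back
    LSA/LDA training code that accepts any iterable of documents.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.lower().split()  # placeholder tokenizer

# The corpus can be iterated over repeatedly without loading it all at once:
# for doc in stream_documents("corpus.txt"):
#     update_model(doc)   # hypothetical incremental update step
```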

3,965 citations


Cites background from "Reading Tea Leaves: How Humans Inte..."

  • ...…on topical models has recently picked up pace, especially in the field of generative topic models such as Latent Dirichlet Allocation (Blei et al., 2003), their hierarchical extensions (Teh et al., 2006), topic quality assessment and visualisation (Chang et al., 2009; Blei and Lafferty, 2009)....


Posted Content
TL;DR: This position paper defines interpretability and describes when interpretability is needed (and when it is not), and suggests a taxonomy for rigorous evaluation and exposes open questions towards a more rigorous science of interpretable machine learning.
Abstract: As machine learning systems become ubiquitous, there has been a surge of interest in interpretable machine learning: systems that provide explanation for their outputs. These explanations are often used to qualitatively assess other criteria such as safety or non-discrimination. However, despite the interest in interpretability, there is very little consensus on what interpretable machine learning is and how it should be measured. In this position paper, we first define interpretability and describe when interpretability is needed (and when it is not). Next, we suggest a taxonomy for rigorous evaluation and expose open questions towards a more rigorous science of interpretable machine learning.

2,589 citations


Additional excerpts


  • ...The common intrusion-detection test [Chang et al., 2009] in topic models is a form of the forward simulation/prediction task: we ask the human to find the difference between the model’s true output and some corrupted output as a way to determine whether the human has correctly understood what the model’s true output is....


Journal ArticleDOI
TL;DR: A survey of automated text analysis for political science can be found in this article, where the authors provide guidance on how to validate the output of the models and clarify misconceptions and errors in the literature.
Abstract: Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have hindered their use in political science research. Here lies the promise of automated text analysis: it substantially reduces the costs of analyzing large collections of text. We provide a guide to this exciting new area of research and show how, in many instances, the methods have already obtained part of their promise. But there are pitfalls to using automated methods—they are no substitute for careful thought and close reading and require extensive and problem-specific validation. We survey a wide range of new methods, provide guidance on how to validate the output of the models, and clarify misconceptions and errors in the literature. To conclude, we argue that for automated text methods to become a standard tool for political scientists, methodologists must contribute new methods and new methods of validation. Language is the medium for politics and political conflict. Candidates debate and state policy positions during a campaign. Once elected, representatives write and debate legislation. After laws are passed, bureaucrats solicit comments before they issue regulations. Nations regularly negotiate and then sign peace treaties, with language that signals the motivations and relative power of the countries involved. News reports document the day-to-day affairs of international relations that provide a detailed picture of conflict and cooperation. Individual candidates and political parties articulate their views through party platforms and manifestos. Terrorist groups even reveal their preferences and goals through recruiting materials, magazines, and public statements. These examples, and many others throughout political science, show that to understand what politics is about we need to know what political actors are saying and writing. Recognizing that language is central to the study of politics is not new. To the contrary, scholars of politics have long recognized that much of politics is expressed in words. But scholars have struggled when using texts to make inferences about politics. The primary problem is volume: there are simply too many political texts. Rarely are scholars able to manually read all the texts in even moderately sized corpora. And hiring coders to manually read all documents is still very expensive.

2,044 citations

Proceedings ArticleDOI
21 Aug 2011
TL;DR: An algorithm to recommend scientific articles to users of an online community that combines the merits of traditional collaborative filtering and probabilistic topic modeling and can form recommendations about both existing and newly published articles is developed.
Abstract: Researchers have access to large online archives of scientific articles. As a consequence, finding relevant papers has become more difficult. Newly formed online communities of researchers sharing citations provide a new way to solve this problem. In this paper, we develop an algorithm to recommend scientific articles to users of an online community. Our approach combines the merits of traditional collaborative filtering and probabilistic topic modeling. It provides an interpretable latent structure for users and items, and can form recommendations about both existing and newly published articles. We study a large subset of data from CiteULike, a bibliography sharing service, and show that our algorithm provides a more effective recommender system than traditional collaborative filtering.
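
The abstract does not give the model's form, but a common way to combine the two ingredients it names is to represent each item by its topic proportions plus a collaborative offset learned from user feedback, and to score user-item pairs by an inner product. The toy sketch below illustrates that idea; all shapes, the offset scale, and the variable names are assumptions, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, n_topics = 4, 6, 3                        # toy sizes (assumed)
theta = rng.dirichlet(np.ones(n_topics), n_items)           # per-item topic proportions from a topic model
epsilon = 0.1 * rng.standard_normal((n_items, n_topics))    # item offsets learned from user feedback
U = rng.standard_normal((n_users, n_topics))                 # user latent vectors

V = theta + epsilon          # item vector = interpretable topic part + collaborative offset
scores = U @ V.T             # predicted affinity of every user for every item

# A brand-new article with no feedback can still be scored through theta alone:
cold_start_scores = U @ theta.T
```

Because the topic-proportion part exists for any document, a newly published article can be recommended before it has received any user feedback, which is the property the abstract highlights for new articles.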

1,600 citations

References
Proceedings ArticleDOI
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations


"Reading Tea Leaves: How Humans Inte..." refers background in this paper

  • ...Amazon Mechanical Turk has been successfully used in the past to develop gold-standard data for natural language processing [22] and to label images [23]....


Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
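
Read as a generative recipe, the model described here draws per-document topic proportions from a Dirichlet, then draws each word by first picking a topic and then picking a word from that topic's distribution. A small Python sketch of that process; the number of topics, vocabulary size, and hyperparameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, alpha, eta = 3, 8, 0.1, 0.01        # topics, vocabulary size, hyperparameters (illustrative)
beta = rng.dirichlet(np.full(V, eta), K)  # each topic is a distribution over the vocabulary

def generate_document(n_words: int) -> list[int]:
    """Sample one document (a list of word ids) from the LDA generative process."""
    theta = rng.dirichlet(np.full(K, alpha))   # document-specific topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)             # topic assignment for this word
        w = rng.choice(V, p=beta[z])           # word drawn from that topic
        words.append(int(w))
    return words

doc = generate_document(20)   # a toy document of 20 word ids
```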

30,570 citations


"Reading Tea Leaves: How Humans Inte..." refers background or methods or result in this paper

  • ...The performance of pLSI degrades with larger numbers of topics, suggesting that overfitting [4] might affect interpretability as well as predictive power....


  • ...Because the direct computation of the posterior is intractable, we employ variational inference [4] and set the symmetric Dirichlet prior parameter to 1. CTM: In LDA, the components of θ_d are nearly independent (i.e., θ_d is statistically neutral)....


  • ...Latent Dirichlet allocation (LDA) [4] and the correlated topic model (CTM) [5] treat each document’s topic assignment as a multinomial random variable drawn from a symmetric Dirichlet and logistic normal prior, respectively....


  • ...Models either use measures based on held-out likelihood [4, 5] or an external task that is independent of the topic space such as sentiment detection [10] or information retrieval [11]....


  • ...In this work we study three topic models: probabilistic latent semantic indexing (pLSI) [3], latent Dirichlet allocation (LDA) [4], and the correlated topic model (CTM) [5], which are all mixed membership models [17]....


Proceedings Article
03 Jan 2001
TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.
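
The variational algorithms mentioned here optimize a lower bound on the log marginal likelihood of a document, using a factorized variational distribution q with per-document parameters γ and per-word parameters φ_n (standard notation for this model):

```latex
\log p(\mathbf{w} \mid \alpha, \beta)
  \ge \mathbb{E}_{q}\!\left[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\right]
    - \mathbb{E}_{q}\!\left[\log q(\theta, \mathbf{z})\right],
\qquad
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma)\prod_{n=1}^{N} q(z_n \mid \phi_n).
```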

25,546 citations

Journal ArticleDOI
TL;DR: A new method for automatic indexing and retrieval is described that takes advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
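
A compact sketch of the procedure as described: truncate the SVD of the term-document matrix, represent documents by their factor weights, fold a query in as a pseudo-document, and rank by cosine. The toy matrix, the rank k=2 (the paper uses ca. 100 factors), and the explicit fold-in formula are illustrative assumptions based on the standard LSI recipe:

```python
import numpy as np

# Toy term-document matrix X: rows = terms, columns = documents (counts are illustrative).
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 1, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2                                     # number of retained factors (paper: ~100)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

doc_vecs = (np.diag(s_k) @ Vt_k).T        # each document as a k-dim vector of factor weights

# Fold a query in as a pseudo-document: q_vec = q^T U_k diag(1/s_k).
q = np.array([1, 0, 1, 0], dtype=float)   # raw term counts of the query (illustrative)
q_vec = q @ U_k @ np.diag(1.0 / s_k)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

scores = [cosine(q_vec, d) for d in doc_vecs]  # documents above a cosine threshold are returned
```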

12,443 citations