
Showing papers by "Ning Zheng published in 2008"


Ning Zheng1
01 Jan 2008
TL;DR: This dissertation argues that the research has evolved from focusing on fast search/document retrieval to creating interpretable models of entire corpora (i.e., databases), and proposes new measures, including the “KL percentage,” that provide absolute evaluations of the accuracy or “informativeness” of all topics in the model.
Abstract: Massive databases with free-style text fields are a common feature of virtually all types of organizations from hospitals to aviation companies to governmental agencies. Perhaps the most promising approaches for intelligent, automatic text analysis are called “topic models”. Yet, it is likely also true that all topic models generate at least some topics that do not correspond to anything human analysts understand and can act upon. In this dissertation, we begin by synthesizing the literature on text modeling and information retrieval. We argue that the research has evolved from focusing on fast search/document retrieval to creating interpretable models of entire corpora, i.e., databases. We also argue that the topic model literature has largely failed to address statistical issues relating to data limitations, rare topics, and the associated effects on topic model accuracy. Next, we clarify the limitations of the standard measure of topic model accuracy, perplexity, for cases in which topic interpretability and accuracy are important. Then, we propose new measures including the “KL percentage” that provide absolute evaluations of the accuracy or “informativeness” of all topics in the model. Computational experiments show that the proposed measures are more sensitive and give different data requirement estimates than perplexity.
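The abstract contrasts perplexity with the proposed “KL percentage,” but does not define either measure precisely. As a rough illustration only, the sketch below computes standard held-out perplexity for a topic model and a per-topic KL divergence from the corpus-wide word distribution; the specific functions, matrices, and the background-distribution comparison are assumptions for illustration, not the dissertation's actual definition of the KL percentage.

```python
import numpy as np

def perplexity(doc_word_counts, doc_topic, topic_word):
    """Held-out perplexity: exp(-log-likelihood / total word count).
    p(w | d) = sum_k p(k | d) * p(w | k)."""
    word_probs = doc_topic @ topic_word            # shape (D, V)
    log_lik = np.sum(doc_word_counts * np.log(word_probs))
    return np.exp(-log_lik / doc_word_counts.sum())

def topic_kl_from_background(topic_word, background):
    """KL(topic || background) per topic: how far each topic's word
    distribution departs from the corpus-wide distribution. A topic
    near the background carries little information."""
    return np.sum(topic_word * np.log(topic_word / background), axis=1)

# Tiny synthetic example: 2 topics over a 4-word vocabulary.
topic_word = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.1, 0.7]])   # p(w | k)
doc_topic = np.array([[0.9, 0.1],
                      [0.2, 0.8]])              # p(k | d)
counts = np.array([[5, 1, 1, 1],
                   [1, 1, 1, 5]])               # observed word counts
background = counts.sum(axis=0) / counts.sum()  # corpus-wide p(w)

ppl = perplexity(counts, doc_topic, topic_word)
kl = topic_kl_from_background(topic_word, background)
```

A low perplexity says the model predicts held-out words well on average, but, as the abstract notes, it says nothing about whether any individual topic is interpretable; a per-topic divergence measure evaluates each topic separately.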

1 citation