Proceedings ArticleDOI

Detecting topic evolution in scientific literature: how can citations help?

02 Nov 2009, pp. 957-966
TL;DR: An iterative topic evolution learning framework is proposed by adapting the Latent Dirichlet Allocation model to the citation network, and a novel inheritance topic model is developed; the results clearly show that citations help in understanding topic evolution.
Abstract: Understanding how topics in scientific literature evolve is an interesting and important problem. Previous work simply models each paper as a bag of words and, at most, also considers the impact of authors. However, the impact of one document on another as captured by citations, an important element inherent in scientific literature, has not been considered. In this paper, we address the problem of understanding topic evolution by leveraging citations, and develop citation-aware approaches. We propose an iterative topic evolution learning framework by adapting the Latent Dirichlet Allocation model to the citation network, and develop a novel inheritance topic model. We evaluate the effectiveness and efficiency of our approaches and compare them with state-of-the-art approaches on a large collection of more than 650,000 research papers from the last 16 years and the citation network enabled by CiteSeerX. The results clearly show that citations help in understanding topic evolution.
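
The paper's inheritance topic model is not reproduced on this page, but the general idea can be sketched: fit LDA per time window, then score how strongly topics in one window inherit from the previous window along citation edges. The gensim-based fragment below is a minimal illustration only; the function names (fit_window, doc_topics, inheritance_score) and the simple overlap measure are assumptions, not the paper's method.

```python
# Sketch: per-window LDA plus a crude citation-based topic-inheritance score.
from gensim import corpora, models

def fit_window(texts, num_topics=10):
    """texts: list of tokenized documents in one time window."""
    dictionary = corpora.Dictionary(texts)
    bows = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary,
                          passes=5, random_state=0)
    return lda, dictionary, bows

def doc_topics(lda, bow, num_topics):
    """Dense topic distribution for one document."""
    dist = [0.0] * num_topics
    for k, p in lda.get_document_topics(bow, minimum_probability=0.0):
        dist[k] = p
    return dist

def inheritance_score(topics_citing, topics_cited, citations):
    """Average topic overlap along citation edges (i cites j); a crude
    stand-in for the paper's citation-aware coupling, not its formula."""
    scores = [sum(a * b for a, b in zip(topics_citing[i], topics_cited[j]))
              for i, j in citations]
    return sum(scores) / len(scores) if scores else 0.0
```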

Citations
Journal ArticleDOI
TL;DR: Different models, such as topic over time (TOT), dynamic topic models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, etc. are discussed.
Abstract: Topic models provide a convenient way to analyze large amounts of unclassified text. A topic contains a cluster of words that frequently occur together. Topic modeling can connect words with similar meanings and distinguish between uses of words with multiple meanings. This paper covers two categories within the field of topic modeling. The first discusses methods of topic modeling, of which four are considered: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and the correlated topic model (CTM). The second category, topic evolution models, models topics by taking an important factor, time, into account. In this category, different models are discussed, such as topic over time (TOT), dynamic topic models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, etc.
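
As a quick illustration of two of the surveyed methods, the scikit-learn sketch below fits LSA (SVD over a tf-idf matrix) and LDA (a probabilistic model over raw term counts) on a toy corpus; the corpus and parameter choices are illustrative only, not from the survey.

```python
# Hedged sketch contrasting LSA and LDA with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = ["topic models cluster words", "citations link scientific papers",
        "words with similar meanings share topics"]

# LSA: truncated SVD over a tf-idf matrix.
X_tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit(X_tfidf)

# LDA: probabilistic model over raw term counts.
vec = CountVectorizer()
X_counts = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_counts)

# Top words per LDA topic.
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```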

243 citations

Journal ArticleDOI
TL;DR: This work identifies two important problems on the way to using topic models in qualitative studies: the lack of a good quality metric that closely matches human judgement in understanding topics, and the need to indicate the specific subtopics that a given qualitative study may be most interested in mining.
Abstract: Qualitative studies, such as sociological research, opinion analysis and media studies, can benefit greatly from automated topic mining provided by topic models such as latent Dirichlet allocation (LDA). However, examples of qualitative studies that employ topic modelling as a tool are currently few and far between. In this work, we identify two important problems on the way to using topic models in qualitative studies: the lack of a good quality metric that closely matches human judgement in understanding topics, and the need to indicate the specific subtopics that a given qualitative study may be most interested in mining. For the first problem, we propose a new quality metric, tf-idf coherence, that reflects human judgement more accurately than regular coherence, and conduct an experiment to verify this claim. For the second problem, we propose an interval semi-supervised approach (ISLDA) in which certain predefined sets of keywords that define the topics researchers are interested in are restricted to specific intervals of topic assignments. Our experiments show that ISLDA is better than LDA for topic extraction in terms of tf-idf coherence, the number of topics matched to predefined keywords, and topic stability. We also present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.
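
The exact tf-idf coherence formula is defined in the paper itself; the sketch below only illustrates the general idea, replacing the document counts in the standard UMass coherence with summed tf-idf mass. The function name, the min-based mass aggregation, and the smoothing constant are assumptions.

```python
# Hedged sketch of a tf-idf-weighted topic coherence.
import math

def tfidf_coherence(top_words, docs_tfidf, eps=1e-12):
    """top_words: a topic's top terms, most probable first.
    docs_tfidf: list of {term: tfidf_weight} dicts, one per document."""
    def mass(*terms):
        # Total tf-idf mass of documents containing all the given terms.
        return sum(min(d[t] for t in terms)
                   for d in docs_tfidf if all(t in d for t in terms))
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            w_m, w_l = top_words[m], top_words[l]
            # UMass-style log ratio, with counts replaced by tf-idf mass.
            score += math.log((mass(w_m, w_l) + eps) / (mass(w_l) + eps))
    return score
```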

204 citations

Book
13 Jul 2017
TL;DR: Applications of Topic Models describes the recent academic and industrial applications of topic models and reviews their successful use by researchers to help understand fiction, non-fiction, scientific publications, and political texts.
Abstract: How can a single person understand what’s going on in a collection of millions of documents? This is an increasingly widespread problem: sifting through an organization’s e-mails, understanding a decade worth of newspapers, or characterizing a scientific field’s research. This monograph explores the ways that humans and computers make sense of document collections through tools called topic models. Topic models are a statistical framework that help users understand large document collections; not just to find individual documents but to understand the general themes present in the collection. Applications of Topic Models describes the recent academic and industrial applications of topic models. In addition to topic models’ effective application to traditional problems like information retrieval, visualization, statistical inference, multilingual modeling, and linguistic understanding, Applications of Topic Models also reviews topic models’ ability to unlock large text collections for qualitative analysis. It reviews their successful use by researchers to help understand fiction, non-fiction, scientific publications, and political texts. Applications of Topic Models is aimed at readers with some knowledge of document processing, a basic understanding of probability, and an interest in many application domains. It discusses the information needs of each application area, and how those specific needs affect models, curation procedures, and interpretations. By the end of the monograph, it is hoped that readers will be excited enough to embark on building their own topic models. It should also be of interest to topic model experts, as the coverage of diverse applications may expose models and approaches they had not seen before.

177 citations

Proceedings ArticleDOI
28 Mar 2011
TL;DR: The topic evolution graphs obtained from the ACM corpus provide an effective and concrete summary of the corpus, with a remarkably rich topology that is congruent with the authors' background knowledge.
Abstract: In this paper we study how to discover the evolution of topics over time in a time-stamped document collection. Our approach is uniquely designed to capture the rich topology of topic evolution inherent in the corpus. Instead of characterizing the evolving topics at fixed time points, we conceptually define a topic as a quantized unit of evolutionary change in content and discover topics together with the time of their appearance in the corpus. Discovered topics are then connected to form a topic evolution graph using a measure derived from the underlying document network. Our approach allows an inhomogeneous distribution of topics over time and does not impose any topological restriction on topic evolution graphs. We evaluate our algorithm on the ACM corpus. The topic evolution graphs obtained from the ACM corpus provide an effective and concrete summary of the corpus, with a remarkably rich topology that is congruent with our background knowledge. At a finer resolution, the graphs reveal concrete information about the corpus that was previously unknown to us, suggesting the utility of our approach as a navigational tool for the corpus.
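
The paper derives its edge measure from the underlying document network; the sketch below substitutes plain cosine similarity between per-window topic-word distributions as an illustrative stand-in, using networkx to hold the resulting graph. The threshold value is an assumption.

```python
# Sketch: connect similar topics in consecutive windows into an evolution graph.
import networkx as nx
import numpy as np

def evolution_graph(topics_by_window, threshold=0.3):
    """topics_by_window: list of (num_topics x vocab_size) arrays, one per
    time window. Adds an edge when consecutive-window topics are similar."""
    G = nx.DiGraph()
    for t in range(1, len(topics_by_window)):
        prev, curr = topics_by_window[t - 1], topics_by_window[t]
        for i, p in enumerate(prev):
            for j, c in enumerate(curr):
                # Cosine similarity as a stand-in for the paper's measure.
                sim = float(p @ c / (np.linalg.norm(p) * np.linalg.norm(c)))
                if sim >= threshold:
                    G.add_edge((t - 1, i), (t, j), weight=sim)
    return G
```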

76 citations

Journal ArticleDOI
TL;DR: An examination of how research topics evolve, based on topic trends, evolution dynamics, and semantic word shifts in the IR domain, shows that a major topic usually evolves from an adjusting status to a mature status, sometimes passing through a re-adjusting status along the way.

74 citations

References
Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
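
The generative process just described can be written out directly; the NumPy sketch below uses toy sizes and hyperparameters (V, K, alpha, eta are illustrative values, not settings from the paper).

```python
# Sketch of LDA's generative process: per-topic word distributions, a
# per-document topic mixture, and a topic assignment for each word.
import numpy as np

rng = np.random.default_rng(0)
V, K, alpha, eta = 20, 3, 0.1, 0.01          # vocab size, topics, priors (toy)
beta = rng.dirichlet([eta] * V, size=K)      # per-topic word distributions

def generate_document(doc_len=50):
    theta = rng.dirichlet([alpha] * K)       # per-document topic mixture
    z = rng.choice(K, size=doc_len, p=theta) # topic assignment per word
    return [rng.choice(V, p=beta[k]) for k in z]

doc = generate_document()                    # a list of word ids
```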

30,570 citations


"Detecting topic evolution in scient..." refers background or methods in this paper

  • ...Here, we use one of the most popular models in machine learning and information retrieval, the Latent Dirichlet Allocation (LDA) [3] framework, to generate topics....

  • ...The LDA model on the bag of words [3] was extended to model 1) the impact of authors [29, 33]; 2) the impact of the direction-sensitive messages sent between social entities (e.g., ...).

  • ...All these hyperparameter settings simply follow the tradition of topic modeling [3]....

  • ...Since the Latent Dirichlet Allocation (LDA) model [3] has been extensively adopted in information retrieval [4, 12, 25, 29, 33, 18, 22, 24], as the first step we extend LDA for topic evolution analysis....

Proceedings Article
03 Jan 2001
TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, the mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

25,546 citations

Journal ArticleDOI
TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
Abstract: The experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other, more elaborate text representations. These results depend crucially on the choice of effective term-weighting systems. This paper summarizes the insights gained in automatic term weighting, and provides baseline single-term indexing models with which other more elaborate content analysis procedures can be compared.
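
The single-term weighting this paper analyzes is essentially tf-idf; a minimal sketch follows. Smoothing and normalization conventions vary across systems, so this is one common variant, not the paper's exact formulation.

```python
# Sketch: tf-idf weight of a term in a document, given a document collection.
import math

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)                 # term frequency
    df = sum(1 for d in docs if term in d)          # document frequency
    idf = math.log(len(docs) / df) if df else 0.0   # inverse doc frequency
    return tf * idf

docs = [["weighted", "single", "terms"], ["term", "weighting", "systems"]]
print(tfidf("terms", docs[0], docs))
```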

9,460 citations

Journal ArticleDOI
TL;DR: A generative model for documents, introduced by Blei, Ng, and Jordan, is described, and a Markov chain Monte Carlo algorithm for inference in this model is presented; the algorithm is used to analyze abstracts from PNAS, with Bayesian model selection establishing the number of topics.
Abstract: A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying “hot topics” by examining temporal dynamics and tagging abstracts to illustrate semantic content.
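
The collapsed Gibbs sampler for LDA admits a compact implementation; the sketch below follows the standard update p(z_i = k | rest) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ). Sizes, priors, and the iteration count are illustrative (the citing paper quotes 500-1,000 sweeps to convergence), not settings from this reference.

```python
# Sketch: collapsed Gibbs sampling for LDA over word-id documents.
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=500):
    """docs: list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(0)
    n_dk = np.zeros((len(docs), K))          # topic counts per document
    n_kw = np.zeros((K, V))                  # word counts per topic
    n_k = np.zeros(K)                        # total word counts per topic
    z = []
    for d, doc in enumerate(docs):           # random initialization
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Collapsed conditional over topics for this word.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                  # resample and restore counts
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_kw, n_dk
```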

5,680 citations


"Detecting topic evolution in scient..." refers background or methods in this paper

  • ..., [14]) reported that LDA under Gibbs sampling normally requires around 500-1,000 iterations to reach convergence....

  • ...Collapsed Gibbs sampler can be used to infer the LDA posterior probabilities [14]....

Journal ArticleDOI
TL;DR: This work considers problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups, and considers a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process.
Abstract: We consider problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the well-known clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes ...
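
For a concrete entry point, gensim ships an online variant of the hierarchical Dirichlet process as HdpModel, which lets the number of topics be inferred rather than fixed in advance. The usage sketch below is illustrative; the toy corpus and the printed topic count are assumptions.

```python
# Hedged usage sketch of gensim's HDP implementation on a toy corpus.
from gensim import corpora
from gensim.models import HdpModel

texts = [["topic", "evolution", "citation"],
         ["dirichlet", "process", "mixture"],
         ["citation", "network", "papers"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

hdp = HdpModel(corpus, id2word=dictionary, random_state=0)
print(hdp.print_topics(num_topics=5, num_words=3))
```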

3,755 citations


"Detecting topic evolution in scient..." refers methods in this paper

  • ...It can be easily extended to any dynamic number of topics using algorithms such as Hierarchical Dirichlet Process [30]....
