A collective topic model for milestone paper discovery
TL;DR: A collective topic model over papers, authors and published venues, based on probabilistic latent semantic analysis (PLSA), uses authorship, published venues and citation relations to quantify paper importance. Experiments indicate that it outperforms a previous model that considers only papers at milestone paper discovery.
Abstract: Prior art forms the foundation for future work in academic research. However, the increasingly large number of publications makes it difficult for researchers to effectively discover the most important previous works on the topic of their research. In this paper, we study the automatic discovery of the core papers for a research area. We propose a collective topic model on three types of objects: papers, authors and published venues. We model each of these objects as a bag of citations. Based on probabilistic latent semantic analysis (PLSA), authorship, published venues and citation relations are used for quantifying paper importance. Our method handles milestone paper discovery for different cases of input objects. Experiments on the ACL Anthology Network (AAN) indicate that our model is superior in milestone paper discovery when compared to a previous model which considers only papers.
Summary (2 min read)
- Academic literature surveying plays a vital role in academic research; researchers can learn what has been done, what research gaps might exist and what potential research directions to work on.
- Academic search engines such as Google Scholar and CiteSeerX enable researchers to find related literature or prior art.
- The authors' experimental results show that paper importance is well captured by their model; authorship and published venues have considerable influence on milestone paper discovery.
3. PROBABILISTIC TOPIC MODEL
- The importance of a paper depends on a variety of factors, including the authority of authors, the publication venue and co-citation relationship with other papers.
- Since authors and venues are linked with documents in the academic document collection, the authors build a “virtual document” for each author and venue by aggregating all documents associated with that author or venue (they call the result author document and venue document, respectively).
- This way, for each author or venue the authors also derive a bag of citation IDs.
- Based on prior work, the authors assume that the multiple-typed documents (paper document, author document and venue document) share a common set of latent topics and that each topic is represented as a distribution over citations.
- Then, the problem of milestone paper discovery is defined as follows.
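The construction of author and venue "virtual documents" described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout (keys `authors`, `venue`, `citations`) and the function name `build_virtual_documents` are assumptions made for the example, and the toy papers are made up.

```python
from collections import Counter, defaultdict

def build_virtual_documents(papers):
    """Aggregate each paper's bag of citation IDs into author and venue documents.

    `papers` is a list of dicts with hypothetical keys 'authors', 'venue'
    and 'citations' (an assumed layout, not one given in the paper).
    """
    author_docs = defaultdict(Counter)
    venue_docs = defaultdict(Counter)
    for paper in papers:
        cited = Counter(paper["citations"])       # paper document: bag of citations
        for author in paper["authors"]:
            author_docs[author].update(cited)     # author document
        venue_docs[paper["venue"]].update(cited)  # venue document
    return dict(author_docs), dict(venue_docs)

# Made-up toy collection.
papers = [
    {"id": "P1", "authors": ["A", "B"], "venue": "ACL", "citations": ["C1", "C2"]},
    {"id": "P2", "authors": ["A"], "venue": "EMNLP", "citations": ["C2", "C3"]},
]
author_docs, venue_docs = build_virtual_documents(papers)
```

Each author or venue ends up with a bag of citation IDs in the same form as a paper document, which is what lets the three document types share one set of topics.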
3.1 Model Description
- Table 1 describes meanings of the notations used in their model.
- Each document is represented as the distribution over topics and each topic is represented as the distribution over citations.
- Then, the process of generating an academic document is as follows: for each citation in that document, first sample a topic zk from the paper topic distribution δ(z; d), the author topic distribution ζ(z; a) or the venue topic distribution ψ(z; v), depending on the document type.
- Then, draw a citation c from φ(·; zk), the distribution of the sampled topic zk in the topic citation distribution φ(c; z).
- The authors developed their model based on PLSA.
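The generative process above can be sketched in a few lines. This is an illustrative sketch only: the function name `generate_document`, the array shapes and the toy numbers are assumptions, and `topic_dist` stands for whichever row of δ, ζ or ψ matches the document type.

```python
import numpy as np

def generate_document(topic_dist, phi, n_citations, rng):
    """Sample a bag of citation indices for one document.

    topic_dist: shape (K,), the document's distribution over topics
                (a row of delta, zeta or psi, depending on document type).
    phi:        shape (K, C), row k is topic k's distribution over citations.
    """
    n_topics, n_cites = phi.shape
    citations = []
    for _ in range(n_citations):
        z = rng.choice(n_topics, p=topic_dist)  # sample a topic z_k
        c = rng.choice(n_cites, p=phi[z])       # draw a citation from phi(. ; z_k)
        citations.append(int(c))
    return citations

rng = np.random.default_rng(0)
# Made-up parameters: 2 topics over 4 citations.
phi = np.array([[0.7, 0.3, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5]])
delta_d = np.array([1.0, 0.0])  # a paper that only draws from topic 0
doc = generate_document(delta_d, phi, n_citations=5, rng=rng)
```

Because this toy paper puts all its mass on topic 0, every sampled citation index is 0 or 1.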
3.2 Parameter Inference
- The authors use the Expectation-Maximization (EM) algorithm for parameter inference.
- Each E-step computes the lower bound function Q of L(θ).
- In the first E-step, the posterior probabilities are randomly initialized.
- The ACL Anthology Network (AAN) was used in their experiments.
- This dataset is also used in previous work; thus, the authors can use it to compare their model with that work.
- Figure 2 shows the perplexity scores during model estimation for different values of k.
- From this graph, the authors can see that a value of k around 150 is appropriate for this dataset, since it gives the lowest perplexity score among all tested values.
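The EM procedure and the perplexity-based choice of k can be sketched with standard PLSA updates. This is a hedged sketch, not the authors' exact update equations: it assumes that paper, author and venue documents are stacked as rows of one count matrix so that they share the topic citation distribution φ, and the names `plsa_em` and `perplexity` are hypothetical. Following the summary, the posteriors are randomly initialized before the first M-step.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """Standard PLSA fitted by EM on a (D, C) matrix of citation counts.

    Assumption: rows may stack paper, author and venue documents so that
    all document types share phi. Returns theta (D, K) and phi (K, C).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_cites = counts.shape
    # Random initial posteriors P(z | d, c), normalized over topics.
    post = rng.random((n_docs, n_cites, n_topics))
    post /= post.sum(axis=2, keepdims=True)
    for _ in range(n_iters):
        # M-step: re-estimate parameters from expected counts.
        weighted = counts[:, :, None] * post           # (D, C, K)
        phi = weighted.sum(axis=0).T                   # (K, C)
        phi /= phi.sum(axis=1, keepdims=True) + 1e-12
        theta = weighted.sum(axis=1)                   # (D, K)
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12
        # E-step: posterior over topics for every (document, citation) pair.
        post = theta[:, None, :] * phi.T[None, :, :]   # (D, C, K)
        post /= post.sum(axis=2, keepdims=True) + 1e-12
    return theta, phi

def perplexity(counts, theta, phi):
    """exp of the negative average log-likelihood per citation occurrence."""
    p = theta @ phi                                    # (D, C)
    log_lik = (counts * np.log(p + 1e-12)).sum()
    return float(np.exp(-log_lik / counts.sum()))

# Made-up toy counts with two obvious blocks of co-cited papers.
counts = np.array([[3, 2, 0, 0],
                   [2, 3, 0, 0],
                   [0, 0, 3, 2],
                   [0, 0, 2, 3]], dtype=float)
theta, phi = plsa_em(counts, n_topics=2)
score = perplexity(counts, theta, phi)
```

To mirror the model selection described above, one would fit the model for several values of k and keep the value with the lowest perplexity; whether perplexity was measured on training or held-out data is not stated in this summary.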
4.2 Experimental Results
4.2.1 Results of Topic Milestone Paper Discovery
- Each topic is represented as a mixture of citations in their model.
- Those citations can be ranked by φ(ci; zk), and the citations ranked at the top for each topic zk are considered topic milestone papers.
- Table 2 presents the topic milestone papers for Sentiment Analysis reported in previous work, while Table 3 shows the authors' results.
- Finally, their model can also indicate popular topics for an author or a venue.
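Ranking citations by φ(ci; zk) to read off topic milestone papers can be sketched as below. The function name `topic_milestones`, the citation IDs and the probabilities are made up for illustration; only the ranking rule comes from the text above.

```python
import numpy as np

def topic_milestones(phi, citation_ids, top_n=3):
    """For each topic, return the top_n citations ranked by phi(c; z)."""
    milestones = {}
    for k, row in enumerate(phi):
        order = np.argsort(-row)[:top_n]  # indices by descending probability
        milestones[k] = [(citation_ids[i], float(row[i])) for i in order]
    return milestones

# Made-up topic citation distribution: 2 topics over 4 citations.
phi = np.array([[0.60, 0.30, 0.10, 0.00],
                [0.05, 0.05, 0.20, 0.70]])
citation_ids = ["P01", "P02", "P03", "P04"]  # hypothetical paper IDs
top = topic_milestones(phi, citation_ids, top_n=2)
```

The same descending sort applied to a row of the author or venue topic distribution would give the most popular topics for that author or venue.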
Cites background from "A collective topic model for milest..."
...Lu et al. (2014) proposed a topic model which uses authorship, published venues, and citation relations among scientific documents to detect topics and identify the most notable works in the corpus....
...The collective topic model (CTM) proposed by Lu et al. (2014) simultaneously discovers topics and related milestone papers in the corpus by modeling papers, authors, and published venues as a bag of citations based on the PLSA model....
Cites background or methods from "A collective topic model for milest..."
...In our model, different from [6, 17], we use the topics extracted from textual information....
...Lu et al. extend the method by considering additional factors that influence the importance of papers, such as authorship and published venues....
...Thus, the topics described in [6, 17] are too general but imprecise....
...Although [6, 17] use “topic” in the description of their methods, the topic defined in [6, 17] is actually a cluster of documents....
...In [6, 17], the reference for a document is determined by sampling cited documents according to the topic-document distribution....
Cites background from "A collective topic model for milest..."
...Topic model is a common technology for the evolution of research themes [31,32] and discovery of high quality papers....
"A collective topic model for milest..." refers methods in this paper
...EM iteratively executes two steps, an E-step and an M-step, until L(θ) converges....
"A collective topic model for milest..." refers background in this paper
... X. W. Wang, C. Zhai, and D. Roth....
...Mei and Zhai used temporal text mining techniques to discover latent themes from text and constructed theme evolution graphs....
... Q. Mei and C. Zhai....