
A collective topic model for milestone paper discovery

03 Jul 2014-pp 1019-1022
TL;DR: A collective topic model based on probabilistic latent semantic analysis (PLSA) uses authorship, published venues and citation relations to quantify paper importance; experiments indicate that it is superior in milestone paper discovery to a previous model which considers only papers.
Abstract: Prior art lies at the foundation of future work in academic research. However, the increasingly large number of publications makes it difficult for researchers to effectively discover the most important previous works on the topic of their research. In this paper, we study the automatic discovery of the core papers for a research area. We propose a collective topic model on three types of objects: papers, authors and published venues. We model each of these objects as a bag of citations. Based on probabilistic latent semantic analysis (PLSA), authorship, published venues and citation relations are used for quantifying paper importance. Our method addresses milestone paper discovery for different types of input objects. Experiments on the ACL Anthology Network (AAN) indicate that our model is superior in milestone paper discovery when compared to a previous model which considers only papers.

Summary (2 min read)

1. INTRODUCTION

  • Academic literature surveying plays a vital role in academic research; researchers can learn what has been done, what research gaps might exist and what potential research directions to work on.
  • Academic search engines such as Google Scholar and CiteSeerX enable researchers to find related literature or prior art.
  • The authors' experimental results show that paper importance is well captured by their model; authorship and published venues have considerable influence on milestone paper discovery.

3. PROBABILISTIC TOPIC MODEL

  • The importance of a paper depends on a variety of factors, including the authority of authors, the publication venue and co-citation relationship with other papers.
  • Since authors and venues are linked with documents in the academic document collection, the authors build a “virtual document” for each author and venue by aggregating all documents associated with that author or venue (they call the result author document and venue document, respectively).
  • This way, for each author or venue the authors also derive a bag of citation IDs.
  • Based on [3], the authors assume that the multiple-typed documents (paper document, author document and venue document) share a common set of latent topics and that each topic is represented as a distribution over citations.
  • The problem of milestone paper discovery is then defined as follows.

3.1 Model Description

  • Table 1 describes meanings of the notations used in their model.
  • Each document is represented as the distribution over topics and each topic is represented as the distribution over citations.
  • Then, the process of generating an academic document is as follows: for each citation in that document, firstly sample a topic zk according to the distribution from paper topic distribution δ(z; d) or author topic distribution ζ(z; a) or venue topic distribution ψ(z; v) based on the document type.
  • Then, draw a citation c from the sampled topic distribution φ(:; zk) in topic citation distribution φ(c; z).
  • The authors developed their model based on PLSA [4].

3.2 Parameter Inference

  • The authors use the Expectation-Maximization (EM) algorithm for parameter inference.
  • Each E-step computes the lower bound function Q of L(θ).
  • In the first E-step, the posterior probabilities are randomly initialized.

4.1 Dataset

  • The ACL Anthology Network (AAN) [7] was used in their experiments.
  • This dataset is also used in previous work [8]; thus, the authors can use it to perform some comparisons with [8].
  • Figure 2 shows the perplexity scores during model estimation for different values of k.
  • From this graph, the authors can see that a value of k around 150 is appropriate for this dataset, since it gives the lowest perplexity score among all tested values.

4.2 Experimental Results

  • Results of Topic Milestone Paper Discovery (Section 4.2.1): Each topic is represented as a mixture of citations in their model.
  • Those citations can be ranked based on φ(ci; zk) and citations ranking at the top for each topic zk are considered as topic milestone papers.
  • Table 2 presents topic milestone papers for Sentiment Analysis in [8] while Table 3 shows their results.
  • Finally, their model can also indicate popular topics for an author or a venue.


A Collective Topic model for Milestone Paper Discovery
Ziyu Lu
The University of Hong Kong
Pokfulam Road, Hong Kong
zylu@cs.hku.hk
Nikos Mamoulis
The University of Hong Kong
Pokfulam Road, Hong Kong
nikos@cs.hku.hk
David W. Cheung
The University of Hong Kong
Pokfulam Road, Hong Kong
dcheung@cs.hku.hk
ABSTRACT
Prior art lies at the foundation of future work in academic research. However, the increasingly large number of publications makes it difficult for researchers to effectively discover the most important previous works on the topic of their research. In this paper, we study the automatic discovery of the core papers for a research area. We propose a collective topic model on three types of objects: papers, authors and published venues. We model each of these objects as a bag of citations. Based on probabilistic latent semantic analysis (PLSA), authorship, published venues and citation relations are used for quantifying paper importance. Our method addresses milestone paper discovery for different types of input objects. Experiments on the ACL Anthology Network (AAN) indicate that our model is superior in milestone paper discovery when compared to a previous model which considers only papers.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information
Search and Retrieval—clustering, retrieval models
Keywords
Topic Model; Milestone Paper; Paper Importance
1. INTRODUCTION
Academic literature surveying plays a vital role in academic research; researchers can learn what has been done, what research gaps might exist and what potential research directions to work on. Academic search engines such as Google Scholar (http://scholar.google.com) and CiteSeerX (http://citeseerx.ist.psu.edu/index) enable researchers to find related literature or prior art. However, the overwhelming number of publications makes it difficult to quickly obtain the most important set of previous works on a subject.
SIGIR’14, July 6–11, 2014, Gold Coast, Queensland, Australia.
Copyright 2014 ACM 978-1-4503-2257-7/14/07 ...$15.00.
http://dx.doi.org/10.1145/2600428.2609499.
Some citation recommendation systems have been designed to recommend appropriate citations for academic works [5, 1]. However, academic search engines and recommendation systems typically return sets of papers selected by semantic similarity; paper importance is not considered. It is therefore essential to have models for milestone paper discovery, i.e., for discovering the core set of papers which best represent previous work on a research topic. Wang et al. [8] studied topic milestone paper discovery and developed a generative model for theme and topic evolution. They modeled each paper as a “bag of citations” and measured paper impact based on co-citation relations. However, [8] considered only co-citation relations for topic milestone paper discovery, so their model could be biased against recently published papers (which are rarely cited by others). Thus, the issue of milestone paper discovery is not completely addressed.
In this paper, we propose a collective topic model for objects of different types (i.e., papers, authors, and venues) in academic networks. The collective topic model quantifies paper importance based on authorship, published venue reputation and co-citation relationships in the academic collection. We investigate the identification of milestone papers in different cases. Our experimental results show that paper importance is well captured by our model; authorship and published venues have considerable influence on milestone paper discovery.
2. RELATED WORK
Previous work exists in citation recommendation [5, 1], i.e., recommending appropriate references to researchers. For example, [5] designed a translation model between citation contexts and reference words, and recommended a list of citations by using long queries such as sentences or a manuscript. Bethard and Jurafsky [1] designed a feature-based learning model for literature retrieval; a list of references is recommended using the abstract of the input object (i.e., a paper) as the query. These works perform recommendations based on semantic analysis, but paper importance or ranking is not considered. In addition, some works on topic evolution may enable researchers to see how the research in a particular area evolves. Mei and Zhai [6] used temporal text mining techniques to discover latent themes from text and constructed theme evolution graphs. However, such methods cannot identify the “micro-view” of a research field, e.g., the milestone papers for that area.
Wang et al. [8] used milestone paper discovery as an application when developing a generative topic model for research theme evolution. They modeled a paper as a bag of citations and used the co-citation relationships for evaluating paper impact between topics. However, their results can be biased against recently published papers, since they did not take into account additional factors that influence the importance of papers, such as authorship and published venues. In our work, we propose a topic model that considers these additional influence factors.
Figure 1: Overview of our collective topic model
3. PROBABILISTIC TOPIC MODEL
The importance of a paper depends on a variety of factors, including the authority of its authors, the publication venue and its co-citation relationships with other papers. We borrow the idea of modeling a paper as a bag of citations from [8] in order to consider the co-citation factor in the importance of a paper; thus each paper is represented as a bag of citation IDs. Since authors and venues are linked with documents in the academic document collection, we build a “virtual document” for each author and venue by aggregating all documents associated with that author or venue (we call the results author document and venue document, respectively). This way, for each author or venue we also derive a bag of citation IDs. Based on [3], we assume that the multiple-typed documents (paper document, author document and venue document) share a common set of latent topics and that each topic is represented as a distribution over citations. We model these documents using a probabilistic topic model which takes the probability of a paper being cited as the measure of its importance. The problem of milestone paper discovery is then defined as follows.
Milestone paper discovery: Given an academic document collection D (where the papers Q, the author set A, the published venues V and the citations C are known), the model outputs a set of core papers M consisting of the citation IDs (paper IDs) that rank at the top, based on co-citations, within a topic or a document (where the document may be a paper document, an author document or a venue document).
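The construction of author and venue documents described in this section can be sketched as follows. This is only an illustration under assumed inputs, not the implementation used in the paper: the record layout (`authors`, `venue`, `cites`), the function name `build_bags` and the toy paper IDs are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: each paper lists its authors, its venue
# and the IDs of the papers it cites.
papers = {
    "P1": {"authors": ["alice"], "venue": "ACL", "cites": ["C1", "C2"]},
    "P2": {"authors": ["alice", "bob"], "venue": "EMNLP", "cites": ["C2", "C3"]},
    "P3": {"authors": ["bob"], "venue": "ACL", "cites": ["C2"]},
}

def build_bags(papers):
    """Return bags of citation IDs for paper, author and venue documents."""
    paper_bags = {}
    author_bags = defaultdict(Counter)
    venue_bags = defaultdict(Counter)
    for paper_id, record in papers.items():
        bag = Counter(record["cites"])
        paper_bags[paper_id] = bag
        # An author document aggregates the citations of all papers by that author.
        for author in record["authors"]:
            author_bags[author].update(bag)
        # A venue document aggregates the citations of all papers in that venue.
        venue_bags[record["venue"]].update(bag)
    return paper_bags, dict(author_bags), dict(venue_bags)

paper_bags, author_bags, venue_bags = build_bags(papers)
print(author_bags["alice"])
print(venue_bags["ACL"])
```

The counts in these bags play the role of the entries of the citation-paper, citation-author and citation-venue matrices used by the model.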
3.1 Model Description
An overview of our model is shown in Figure 1. Table 1 describes the meanings of the notations used in our model. We assume that for all documents there is a common set of k latent topics. Each document is represented as a distribution over topics and each topic is represented as a distribution over citations. Then, the process of generating an academic document is as follows: for each citation in that document, first sample a topic $z_k$ from the paper-topic distribution δ(z; d), the author-topic distribution ζ(z; a) or the venue-topic distribution ψ(z; v), depending on the document type. Then, draw a citation c from the distribution $\phi(\cdot; z_k)$ of the sampled topic in the topic-citation distribution φ(c; z).
Table 1: Notations used in our collective model
Symbols      Description
d, a         d for a paper, a for an author
v, c, z      v for a venue, c for a citation, z for a topic
T, k         T for topic set, k for topic number
N            The citation-paper matrix
U            The citation-author matrix
E            The citation-venue matrix
φ(c; z)      The topic-citation distribution
δ(z; d)      The paper-topic distribution
ζ(z; a)      The author-topic distribution
ψ(z; v)      The venue-topic distribution
α, β, γ      Relative weights for d, a, v
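As an illustration of the generative process of this subsection, the sketch below samples a bag of citations for a single document. The function name `sample_document`, the list-based layout (`phi[k][i]` for φ(c_i; z_k), `topic_dist[k]` for the document's topic distribution) and the toy numbers are assumptions made for the example, not part of the paper.

```python
import random

def sample_document(topic_dist, phi, n_citations, rng):
    """Sample a bag of citations for one document.

    topic_dist[k] is the probability of topic z_k for this document
    (a row of delta, zeta or psi, depending on the document type);
    phi[k][i] is the probability of citation c_i under topic z_k.
    """
    topics = range(len(topic_dist))
    citations = range(len(phi[0]))
    doc = []
    for _ in range(n_citations):
        # First sample a topic from the document's topic distribution ...
        z = rng.choices(topics, weights=topic_dist)[0]
        # ... then draw a citation from that topic's citation distribution.
        c = rng.choices(citations, weights=phi[z])[0]
        doc.append(c)
    return doc

# Toy example: two topics over four citations; the document uses topic 0 only.
phi = [[0.7, 0.3, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]
doc = sample_document([1.0, 0.0], phi, n_citations=20, rng=random.Random(42))
print(doc)
```

Because the example document puts all its mass on topic 0, every sampled citation is citation 0 or citation 1.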
We developed our model based on PLSA [4]. We can have the following joint model for citations based on documents in different types:
$$p(c_i \mid d_j) = \sum_k \phi(c_i; z_k)\,\delta(z_k; d_j) \tag{1}$$
$$p(c_i \mid a_n) = \sum_k \phi(c_i; z_k)\,\zeta(z_k; a_n) \tag{2}$$
$$p(c_i \mid v_m) = \sum_k \phi(c_i; z_k)\,\psi(z_k; v_m) \tag{3}$$
and the parameter set θ to be estimated is:
$$\theta = \{\phi(c; z),\ \delta(z; d),\ \zeta(z; a),\ \psi(z; v) \mid c \in C,\ d \in D,\ a \in A,\ v \in V,\ z \in T\}$$
In order to estimate the parameters θ, we maximize the likelihood of the document collection D given θ. The log-likelihood function is
$$L(\theta) = \sum_i \Big( \alpha \sum_j N_{ij} \log p(c_i \mid d_j) + \beta \sum_n U_{in} \log p(c_i \mid a_n) + \gamma \sum_m E_{im} \log p(c_i \mid v_m) \Big)$$
where $N_{ij}$ is the number of occurrences of citation $c_i$ in paper $d_j$ in the citation-paper matrix, $U_{in}$ is the number of occurrences of citation $c_i$ cited by author $a_n$, and $E_{im}$ is the number of occurrences of citation $c_i$ cited by papers in venue $v_m$. α, β, γ are the relative weights of the three document types d, a, v.
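The log-likelihood above can be evaluated directly from the count matrices and the current parameters. The following is a minimal sketch that assumes a plain list-of-lists layout chosen for this example: `counts[i][o]` for the count of citation i in document o, `dists[o][k]` for the topic distribution of document o, and `phi[k][i]` for the topic-citation distribution. The function names are hypothetical.

```python
import math

def mixture(phi, topic_dist, i):
    """p(c_i | document) = sum_k phi(c_i; z_k) * P(z_k | document), as in Equations 1 to 3."""
    return sum(phi[k][i] * topic_dist[k] for k in range(len(phi)))

def weighted_term(counts, dists, phi):
    """sum_i sum_o counts[i][o] * log p(c_i | o) for one document type."""
    total = 0.0
    for i, row in enumerate(counts):
        for o, n in enumerate(row):
            if n > 0:  # zero counts contribute nothing to the sum
                total += n * math.log(mixture(phi, dists[o], i))
    return total

def log_likelihood(N, U, E, delta, zeta, psi, phi, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted log-likelihood over paper, author and venue documents."""
    return (alpha * weighted_term(N, delta, phi)
            + beta * weighted_term(U, zeta, phi)
            + gamma * weighted_term(E, psi, phi))

# Toy check: one topic, two equally likely citations, one document of each type.
phi = [[0.5, 0.5]]
one_doc = [[1.0]]
counts = [[1], [1]]
ll = log_likelihood(counts, counts, counts, one_doc, one_doc, one_doc, phi)
print(ll)
```

In the toy check each of the six observed citations has probability 0.5, so the value equals 6 log 0.5.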
3.2 Parameter Inference
We use the Expectation-Maximization (EM) algorithm for parameter inference. EM iteratively executes two steps, an E-step and an M-step, until L(θ) converges [2].
Each E-step computes the lower bound function Q of L(θ). In this process, the posterior probabilities $p(z_k \mid c, o)$ (where o can be d, a or v) are re-computed using the new parameter values from the previous M-step:
$$p(z_k \mid c_i, d_j) = \frac{\phi(c_i; z_k)\,\delta(z_k; d_j)}{\sum_{k'} \phi(c_i; z_{k'})\,\delta(z_{k'}; d_j)}$$
$$p(z_k \mid c_i, a_n) = \frac{\phi(c_i; z_k)\,\zeta(z_k; a_n)}{\sum_{k'} \phi(c_i; z_{k'})\,\zeta(z_{k'}; a_n)}$$

Table 2: Topic milestone papers (top-10 papers) for Sentiment Analysis from [8].
φ(c_i; z_k) Venue Paper Title
0.0785 EMNLP’02 Thumbs Up? Sentiment Classification Using Machine Learning Techniques
0.0672 ACL’02 Thumbs Up Or Thumbs Down? Semantic Orientation Applied To Unsupervised Classification Of Reviews
0.0483 HLT’05 Recognizing Contextual Polarity In Phrase-Level Sentiment Analysis
0.0436 ACL’04 A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based On Minimum Cuts
0.0365 ACL’97 Predicting The Semantic Orientation Of Adjectives
0.0312 COLING’04 Determining The Sentiment Of Opinions
0.0307 HLT’05 Extracting Product Features And Opinions From Reviews
0.0287 EMNLP’03 Towards Answering Opinion Questions: Separating Facts From Opinions And Identifying The Polarity Of Opinion Sentences
0.0279 EMNLP’03 Learning Extraction Patterns For Subjective Expressions
0.0169 ACL’05 Seeing Stars: Exploiting Class Relationships For Sentiment Categorization With Respect To Rating Scales
Table 3: Topic milestone papers (top-10 papers) for Sentiment Analysis in our collective model.
φ(c_i; z_k) Venue Paper Title
0.1317 EMNLP’02 Thumbs Up? Sentiment Classification Using Machine Learning Techniques
0.0747 ACL’04 A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based On Minimum Cuts
0.0689 ACL’02 Thumbs Up Or Thumbs Down? Semantic Orientation Applied To Unsupervised Classification Of Reviews
0.0482 HLT’05 Recognizing Contextual Polarity In Phrase-Level Sentiment Analysis
0.0351 ACL’05 Seeing Stars: Exploiting Class Relationships For Sentiment Categorization With Respect To Rating Scales
0.0304 ACL’07 Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification
0.0215 ACL’07 Structured Models for Fine-to-Coarse Sentiment Analysis
0.0210 EMNLP’08 Learning with Compositional Semantics as Structural Inference for Subsentential Sentiment Analysis
0.0195 EMNLP’08 Multilingual Subjectivity Analysis Using Machine Translation
0.0177 ACL-IJCNLP’09 Co-Training for Cross-Lingual Sentiment Classification
Figure 2: The perplexity over different k for our model (x-axis: iteration number, 1 to 181; y-axis: perplexity, 170000 to 300000; one curve each for k = 50, 100, 120, 150, 200, 300 and 500)
$$p(z_k \mid c_i, v_m) = \frac{\phi(c_i; z_k)\,\psi(z_k; v_m)}{\sum_{k'} \phi(c_i; z_{k'})\,\psi(z_{k'}; v_m)}$$
In the first E-step, the posterior probabilities are randomly initialized. The M-step that follows each E-step re-estimates the θ which maximizes Q as follows:
$$\delta(z_k; d_j) = \frac{\sum_i N_{ij}\, p(z_k \mid c_i, d_j)}{\sum_i N_{ij}}$$
$$\zeta(z_k; a_n) = \frac{\sum_i U_{in}\, p(z_k \mid c_i, a_n)}{\sum_i U_{in}}$$
$$\psi(z_k; v_m) = \frac{\sum_i E_{im}\, p(z_k \mid c_i, v_m)}{\sum_i E_{im}}$$
$$\phi(c_i; z_k) \propto \alpha\, \frac{\sum_j N_{ij}\, p(z_k \mid c_i, d_j)}{\sum_{i'} \sum_j N_{i'j}\, p(z_k \mid c_{i'}, d_j)} + \beta\, \frac{\sum_n U_{in}\, p(z_k \mid c_i, a_n)}{\sum_{i'} \sum_n U_{i'n}\, p(z_k \mid c_{i'}, a_n)} + \gamma\, \frac{\sum_m E_{im}\, p(z_k \mid c_i, v_m)}{\sum_{i'} \sum_m E_{i'm}\, p(z_k \mid c_{i'}, v_m)}$$
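The EM updates of Section 3.2 can be sketched in plain Python as below. This is a hedged illustration, not the authors' code. It makes several assumptions: the list-of-lists layout (`counts[i][o]`, `dists[o][k]`, `phi[k][i]`) and the names `normalize` and `em_step` are invented for the example; the parameters rather than the posteriors are randomly initialized; and the topic-citation update is renormalized so that each topic's distribution sums to 1.

```python
import random

def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they sum to 0)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]

def em_step(counts_by_type, dists_by_type, phi, weights):
    """One EM iteration over the three document types.

    counts_by_type: [N, U, E], each indexed as counts[i][o]
    dists_by_type:  [delta, zeta, psi], each indexed as dists[o][k]
    phi:            phi[k][i], the topic-citation distribution
    weights:        (alpha, beta, gamma)
    """
    K, C = len(phi), len(phi[0])
    new_phi = [[0.0] * C for _ in range(K)]
    new_dists_by_type = []
    for counts, dists, w in zip(counts_by_type, dists_by_type, weights):
        doc_acc = [[0.0] * K for _ in dists]     # sum_i n_io * p(z_k | c_i, o)
        cit_acc = [[0.0] * C for _ in range(K)]  # sum_o n_io * p(z_k | c_i, o)
        for i in range(C):
            for o, n in enumerate(counts[i]):
                if n == 0:
                    continue
                # E-step: posterior over topics for citation i in document o.
                post = normalize([phi[k][i] * dists[o][k] for k in range(K)])
                for k in range(K):
                    doc_acc[o][k] += n * post[k]
                    cit_acc[k][i] += n * post[k]
        # M-step: document-topic distributions (delta, zeta or psi).
        new_dists_by_type.append([normalize(row) for row in doc_acc])
        # M-step: add this type's weighted, per-topic normalized share to phi.
        for k in range(K):
            norm = sum(cit_acc[k])
            if norm > 0:
                for i in range(C):
                    new_phi[k][i] += w * cit_acc[k][i] / norm
    return new_dists_by_type, [normalize(row) for row in new_phi]

# Toy data: 4 citations, 2 papers, 1 author (who wrote both), 2 venues.
rng = random.Random(0)
N = [[3, 0], [2, 0], [0, 3], [0, 2]]
U = [[3], [2], [3], [2]]
E = [[3, 0], [2, 0], [0, 3], [0, 2]]
K = 2
phi = [normalize([rng.random() for _ in range(4)]) for _ in range(K)]
dists = [[normalize([rng.random() for _ in range(K)]) for _ in m[0]] for m in (N, U, E)]
for _ in range(50):
    dists, phi = em_step([N, U, E], dists, phi, (1.0, 1.0, 1.0))
delta, zeta, psi = dists
print(phi)
```

Since the posteriors over topics sum to 1, normalizing each row of `doc_acc` is equivalent to dividing by the document's total citation count, as in the update formulas for δ, ζ and ψ.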
4. EXPERIMENTS
4.1 Dataset
The ACL Anthology Network (AAN) [7] was used in our experiments. It contains 19408 papers by 15824 authors, published in 342 venues, which have received 85367 citations. This dataset was also used in previous work [8]; thus, we can use it to perform some comparisons with [8].
Before testing our model, we have to determine the number of topics k. Perplexity is commonly used to evaluate the performance of topic models; thus, we select the value of k which minimizes perplexity. Figure 2 shows the perplexity scores during model estimation for different values of k. From this graph, we can see that a value of k around 150 is appropriate for this dataset, since it gives the lowest perplexity score among all tested values. Therefore, we perform our experiments using k = 150. In addition, we treat the three document types equally and set the weights α = β = γ = 1.
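The paper does not spell out its perplexity formula. A common definition, assumed here, is the exponential of the negative average log-likelihood per observed citation. The sketch below applies it to one document type; the function name and the `counts[i][o]`, `dists[o][k]`, `phi[k][i]` layout are hypothetical choices for the example.

```python
import math

def perplexity(counts, dists, phi):
    """exp(-(log-likelihood) / (number of observed citations)) for one document type."""
    log_lik, total = 0.0, 0
    for i, row in enumerate(counts):
        for o, n in enumerate(row):
            if n > 0:
                p = sum(phi[k][i] * dists[o][k] for k in range(len(phi)))
                log_lik += n * math.log(p)
                total += n
    return math.exp(-log_lik / total)

# Toy check: a single topic that is uniform over 4 citations.
phi_uniform = [[0.25, 0.25, 0.25, 0.25]]
counts = [[1], [2], [1], [1]]
value = perplexity(counts, [[1.0]], phi_uniform)
print(value)
```

Under this definition a model that is uniform over 4 citations has perplexity 4, so lower values indicate a model that concentrates probability on the citations actually observed. To select k, one would fit the model for each candidate value and keep the one with the lowest perplexity.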
4.2 Experimental Results
4.2.1 Results of Topic Milestone Paper Discovery
Each topic is represented as a mixture of citations in our model. Within each topic $z_k$, $\phi(c_i; z_k)$ indicates the importance of citation (i.e., the cited paper) $c_i$. Citations can thus be ranked by $\phi(c_i; z_k)$, and the citations ranking at the top for each topic $z_k$ are considered topic milestone papers. In order to compare with previous work [8], we use the top-10 papers for the topic Sentiment Analysis as an example. Table 2 presents the topic milestone papers for Sentiment Analysis in [8], while Table 3 shows our results. Five papers appear in both lists and five differ. The top-1 paper is the same, but the order of the overlapping papers is slightly different; for example, the paper ranked second by our model is ranked only fourth in [8]. Our list also contains more recently published papers: the five papers unique to Table 3 were published between 2007 and 2009, whereas all papers in Table 2 date from 2005 or earlier. The reason behind these differences is that the previous model does not consider factors such as published venues and authorship, which influence paper importance.
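Ranking the citations of one topic by their topic-citation probability can be sketched as follows. The function name `top_citations`, the paper IDs and the probabilities are made up for the illustration and do not come from the experiments.

```python
def top_citations(phi_k, citation_ids, n=10):
    """Return the n citations with the largest phi(c_i; z_k), best first."""
    order = sorted(range(len(phi_k)), key=lambda i: phi_k[i], reverse=True)
    return [(citation_ids[i], phi_k[i]) for i in order[:n]]

# Hypothetical topic over four cited papers.
phi_k = [0.10, 0.55, 0.05, 0.30]
ids = ["P-a", "P-b", "P-c", "P-d"]
top2 = top_citations(phi_k, ids, n=2)
print(top2)
```

With n = 10 and the row of φ that belongs to a topic such as Sentiment Analysis, the same call would produce a list in the form of Tables 2 and 3.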
4.2.2 Results of Venue Milestone Paper Discovery
In previous work, only topic milestone paper discovery has been studied. Our model is more general in the sense that it can also identify milestone papers for a given venue or author. In the next experiment, the probability of a citation given a venue is computed by Equation 3 and papers are ranked based on p(c|v). Here we take only the venue ACL as an example; Table 4 reports the important papers for ACL. All ten papers come from three venues: ACL, NAACL and JCL (Journal of Computational Linguistics).
Table 4: Milestone papers (top-10) for ACL in our collective model.
p(c|v) Paper Title Venue
0.0084 Building A Large Annotated Corpus Of English: The Penn Treebank JCL’93
0.0074 The Mathematics Of Statistical Machine Translation: Parameter Estimation JCL’93
0.0059 Statistical Phrase-Based Translation NAACL’03
0.0057 Bleu: A Method For Automatic Evaluation Of Machine Translation ACL’02
0.0056 A Hierarchical Phrase-Based Model For Statistical Machine Translation ACL’05
0.0056 A Systematic Comparison Of Various Statistical Alignment Models JCL’03
0.0050 Minimum Error Rate Training In Statistical Machine Translation ACL’03
0.0050 A Maximum-Entropy-Inspired Parser NAACL’00
0.0044 Stochastic Inversion Transduction Grammars And Bilingual Parsing Of Parallel Corpora JCL’97
0.0044 Accurate Unlexicalized Parsing ACL’03
Table 5: Author milestone papers (top-10) for the author Bo Pang.
p(c|a) Paper Title Venue
0.0461* Thumbs Up? Sentiment Classification Using Machine Learning Techniques EMNLP’02
0.0375 Thumbs Up Or Thumbs Down? Semantic Orientation Applied To Unsupervised Classification Of Reviews ACL’02
0.0253* A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based On Minimum Cuts ACL’04
0.0241 Predicting The Semantic Orientation Of Adjectives ACL’97
0.0172 Extracting Product Features And Opinions From Reviews HLT’05
0.0152 Towards Answering Opinion Questions: Separating Facts From Opinions And Identifying The Polarity Of Opinion Sentences EMNLP’03
0.0147 Learning Extraction Patterns For Subjective Expressions EMNLP’03
0.0093* Seeing Stars: Exploiting Class Relationships For Sentiment Categorization With Respect To Rating Scales ACL’05
0.0086 Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification ACL’07
0.0065 Learning Subjective Language JCL’04
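Venue milestone papers are obtained by mixing the topic-citation distributions with the venue's topic distribution, as in Equation 3, and ranking the result. The sketch below uses invented numbers and a hypothetical function name; the same function applies to authors when given a row of ζ instead of ψ.

```python
def citation_given_document(phi, topic_dist):
    """p(c_i | document) for every citation i, given the document's topic distribution."""
    n_topics, n_citations = len(phi), len(phi[0])
    return [sum(phi[k][i] * topic_dist[k] for k in range(n_topics))
            for i in range(n_citations)]

# Hypothetical model with two topics over three citations.
phi = [[0.6, 0.4, 0.0],
       [0.0, 0.2, 0.8]]
psi_v = [0.5, 0.5]  # assumed topic distribution of one venue

p_c_given_v = citation_given_document(phi, psi_v)
# Citation indices ordered from most to least probable for this venue.
ranking = sorted(range(len(p_c_given_v)), key=lambda i: p_c_given_v[i], reverse=True)
print(p_c_given_v, ranking[0])
```

In this toy case the third citation (index 2) has the highest probability, 0.4, and would head the venue's milestone list.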
4.2.3 Results of Author Milestone Paper Discovery
Similarly, author milestone papers can be ranked based on the probability of a citation given an author, p(c|a) (see Equation 2). We use the author Bo Pang (the first author of the top-1 paper in Table 3) as an example. Table 5 shows the citations that Bo Pang is most likely to cite (* indicates self-citation). The result overlaps heavily with Table 2 (8 of the 10 papers). This might indicate that author citation patterns are regular and that the author factor has an influence similar to that of the citation relationships.
4.2.4 Topic Involvements
Finally, our model can also indicate popular topics for an author or a venue. ψ(z; v) indicates how well a topic z represents a venue v, and ζ(z; a) shows how well a topic z represents an author a. Therefore, we can find the most popular topics for a specific venue or author. The top-3 topics for ACL are Named entity extraction, Statistical parsing and Statistical machine translation. The top-3 topics for Bo Pang are Sentiment analysis, Opinion extraction and Paraphrase generation.
5. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed a collective topic model for multiple-typed objects in the academic network, in order to address the issue of milestone paper discovery. The model is based on PLSA and incorporates authorship, published venues and citation relations. Our method can not only discover topic milestone papers, as discussed in previous work, but also explore venue milestone papers and author milestone papers. In addition, it can find representative topics for an author or a venue. Experiments on a real dataset, AAN, show that our model can better evaluate the impact of papers and that its results are not biased against new publications. Directions for future work include the investigation of more complicated models with biased mechanisms and the integration of this model into existing academic literature search/recommendation systems.
6. ACKNOWLEDGMENTS
This work was supported by grant HKU 715711E from
Hong Kong RGC. The authors would like to thank the anony-
mous reviewers for their valuable comments.
7. REFERENCES
[1] S. Bethard and D. Jurafsky. Who should I cite: learning literature search models from citation behavior. In CIKM ’10, pages 609–618. ACM, 2010.
[2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] H. Deng, B. Zhao, and J. Han. Collective topic modeling for heterogeneous networks. In SIGIR ’11, pages 1109–1110, New York, NY, USA, 2011. ACM.
[4] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR ’99, pages 50–57, New York, NY, USA, 1999. ACM.
[5] W. Huang, S. Kataria, C. Caragea, P. Mitra, C. L. Giles, and L. Rokach. Recommending citations: translating papers into references. In CIKM ’12, pages 1910–1914. ACM, 2012.
[6] Q. Mei and C. Zhai. Discovering evolutionary theme patterns from text: An exploration of temporal text mining. In KDD ’05, pages 198–207, New York, NY, USA, 2005. ACM.
[7] D. R. Radev, P. Muthukrishnan, V. Qazvinian, and A. Abu-Jbara. The ACL Anthology Network corpus. Language Resources and Evaluation Journal, pages 1–26, 2013.
[8] X. W. Wang, C. Zhai, and D. Roth. Understanding evolution of research themes: a probabilistic generative model for citations. In KDD ’13, pages 1115–1123. ACM, 2013.
Citations
More filters
Journal ArticleDOI
TL;DR: By modeling reviews as a three-order tensor, a refined tensor topic model (TTM) for text tensors inspired by Tucker decomposition is proposed and outperforms existing topic models in modeling texts with a user-item-word structure.

38 citations

Journal ArticleDOI
TL;DR: This paper proposes a citation- content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both document citation relations and the con-tent of the document itself via a probabilistic generative model and tests the algorithm on two online datasets to demonstrate that it effectively discovers important topics and reflects the topic evolution of important research themes.
Abstract: Researchers across the globe have been increasingly interested in the manner in which important research topics evolve over time within the corpus of scientific literature. In a dataset of scientific articles, each document can be considered to comprise both the words of the document itself and its citations of other documents. In this paper, we propose a citation- content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both document citation relations and the con-tent of the document itself via a probabilistic generative model. The citation-content-LDA topic model exploits a two-level topic model that includes the citation information for ‘father’ topics and text information for sub-topics. The model parameters are estimated by a collapsed Gibbs sampling algorithm. We also propose a topic evolution algorithm that runs in two steps: topic segmentation and topic dependency relation calculation. We have tested the proposed citation-content-LDA model and topic evolution algorithm on two online datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS), to demonstrate that our algorithm effectively discovers important topics and reflects the topic evolution of important research themes. According to our evaluation metrics, citation-content-LDA outperforms both content-LDA and citation-LDA.

13 citations


Cites background from "A collective topic model for milest..."

  • ...Lu et al. (2014) proposed a topic model which uses authorship, published venues, and citation relations among scientific documents to detect topics and identify the most notable works in the corpus....

    [...]

  • ...The collective topic model (CTM) proposed by Lu et al. (2014) simultaneously discovers topics and related milestone papers in the corpus by modeling papers, authors, and published venues as a bag of citations based on the PLSA model....

    [...]

Book ChapterDOI
03 Jun 2018
TL;DR: This work creatively combine both topics of text and the influence of topics over citation networks to discover influential articles, and proposes MTID, a scalable generative model, which generates the network with these two parameters.
Abstract: Discovering important papers in different academic topics is known as topic-sensitive influential paper discovery. Previous works mainly find the influential papers based on the structure of citation networks but neglect the text information, while the text of documents gives a more precise description of topics. In our paper, we creatively combine both topics of text and the influence of topics over citation networks to discover influential articles. The observation on three standard citation networks shows that the existence of citations between papers is related to the topic of citing papers and the importance of cited papers. Based on this finding, we introduce two parameters to describe the topic distribution and the importance of a document. We then propose MTID, a scalable generative model, which generates the network with these two parameters. The experiment confirms superiority of MTID over other topic-based methods, in terms of at least 50% better citation prediction in recall, precision and mean reciprocal rank. In discovering influential articles in different topics, MTID not only identifies papers with high citations, but also succeeds in discovering other important papers, including papers about standard datasets and the rising stars.

3 citations


Cites background or methods from "A collective topic model for milest..."

  • ...In our model, different from [6, 17], we use the topics extracted from textual information....

    [...]

  • ...Lu et al.[6] extend the method by considering additional factors that influence the importance of papers, such as authorship and published venues....

    [...]

  • ...Thus, the topics described in [6, 17] are too general but imprecise....

    [...]

  • ...Although [6, 17] use “topic” in the discription of their methods, the topic defined in [6, 17] is actually a cluster of documents....

    [...]

  • ...In [6, 17], the reference for a document is determined by sampling cited documents according to the topicdocument distribution....

    [...]

Journal ArticleDOI
24 Jun 2016
TL;DR: An algorithm for social circle tag detection by multiple linear regression is proposed that is more effective than other relevant methods and successfully combines different categories of features in social circles.
Abstract: A social circle is a category of strong social relationships, such as families, classmates and good friends and so on. The information diffusion among members of online social circles is frequent and credible. The research of users’ online social circles has become popular in recent years. Many scholars propose methods for detecting users’ online social circles. On the other hand, the social meanings and the tags of a social circle are also important for the analysis of a social circle. However, little work involves the tags discovery of social circles. This paper proposes an algorithm for social circle tag detection by multiple linear regression. The model solves the data sparse problem of tags in social circles and successfully combines different categories of features in social circles. We also redmap the concept of the social circle into "reference circles" of an academic paper. We evaluate our method in datasets of both Facebook and Microsoft Academic Search, and prove that it is more effective than other relevant methods.

2 citations


Cites background from "A collective topic model for milest..."

  • ...Topic model is a common technology for the evolution of research themes [31,32] and discovery of high quality papers [33]....

    [...]

Proceedings ArticleDOI
05 Jan 2020
TL;DR: This paper constructs the TIG on the ACL Anthology Network dataset and leverages it to analyze the properties of seminal papers, observing that seminal papers disseminate knowledge across different communities, trigger more research within their own communities and, apart from introducing new ideas, string together ideas from different communities.
Abstract: Every scientific article attempts to derive knowledge from existing literature and augment it with new insights. This dynamics of knowledge is commonly explored through references (it draws knowledge from) and citations (its impact on the field). In this paper, we propose to explore this phenomenon through construction of a topic influence graph (TIG) based on topic similarity between articles. More importantly, out of the set of possible TIGs, we determine an optimal TIG by using knowledge from citation graphs. Construction of TIG leverages traditional network analysis tools like community (sub-field) identification. In this paper, we construct the TIG on the ACL Anthology Network (AAN) dataset and leverage it to analyze the properties of seminal papers. Interestingly, we observe that seminal papers disseminate knowledge across different communities, trigger more research within its own community and apart from introducing new ideas, string together ideas from different communities.

2 citations


Cites methods from "A collective topic model for milest..."

  • ...Further, topic models have been employed for retrieving relevant papers [7]....

    [...]

References
Journal ArticleDOI
01 Dec 2013
TL;DR: The ACL Anthology Network is introduced, a comprehensive manually curated networked database of citations, collaborations, and summaries in the field of Computational Linguistics and a number of statistics about the network including the most cited authors, the most central collaborators, as well as network statistics.
Abstract: We introduce the ACL Anthology Network (AAN), a comprehensive manually curated networked database of citations, collaborations, and summaries in the field of Computational Linguistics. We also present a number of statistics about the network including the most cited authors, the most central collaborators, as well as network statistics about the paper citation, author citation, and author collaboration networks.

332 citations


"A collective topic model for milest..." refers methods in this paper

  • ...Experiments on a real dataset, AAN, show that our model can better evaluate the impact of papers and that its results are not biased against new publications....

    [...]

  • ...The ACL Anthology Network (AAN) [7] was used in our experiments....

    [...]

Proceedings ArticleDOI
29 Oct 2012
TL;DR: This work proposes a method that "translates" research papers into references by considering the citations and their contexts from existing papers as parallel data written in two different "languages" using the translation model to create a relationship between these two "vocabularies".
Abstract: When we write or prepare to write a research paper, we always have appropriate references in mind. However, there are most likely references we have missed that should have been read and cited. As such, a good citation recommendation system would improve not only our paper but, overall, the efficiency and quality of literature search. Usually, a citation's context contains explicit words explaining the citation. Using this, we propose a method that "translates" research papers into references. By considering the citations and their contexts from existing papers as parallel data written in two different "languages", we adopt the translation model to create a relationship between these two "vocabularies". Experiments on both the CiteSeer and CiteULike datasets show that our approach outperforms other baseline methods and increases the precision, recall and f-measure by at least 5% to 10%, respectively. In addition, our approach runs much faster in both the training and recommending stages, which demonstrates the effectiveness and scalability of our work.

144 citations


"A collective topic model for milest..." refers background in this paper

  • ...Some citation recommendation systems have been designed to recommend appropriate citations for academic works [5, 1]....

    [...]

  • ...For example, [5] designed a translation model between citation contexts and reference words, and recommended a list of citations by using long queries such as sentences or a manuscript....

    [...]

  • ...Previous work exists in citation recommendation [5, 1]; i....

    [...]

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This work proposes a new task for evaluating the resulting retrieval models, where the retrieval system takes only an abstract as its input and must produce as output the list of references at the end of the abstract's article.
Abstract: Scientists depend on literature search to find prior work that is relevant to their research ideas. We introduce a retrieval model for literature search that incorporates a wide variety of factors important to researchers, and learns the weights of each of these factors by observing citation patterns. We introduce features like topical similarity and author behavioral patterns, and combine these with features from related work like citation count and recency of publication. We present an iterative process for learning weights for these features that alternates between retrieving articles with the current retrieval model, and updating model weights by training a supervised classifier on these articles. We propose a new task for evaluating the resulting retrieval models, where the retrieval system takes only an abstract as its input and must produce as output the list of references at the end of the abstract's article. We evaluate our model on a collection of journal, conference and workshop articles from the ACL Anthology Reference Corpus. Our model achieves a mean average precision of 28.7, a 12.8 point improvement over a term similarity baseline, and a significant improvement both over models using only features from related work and over models without our iterative learning.

140 citations


"A collective topic model for milest..." refers background in this paper

  • ...Some citation recommendation systems have been designed to recommend appropriate citations for academic works [5, 1]....

    [...]

  • ...Bethard and Jurafsky [1] designed a feature-based learning model for literature retrieval....

    [...]

  • ...Previous work exists in citation recommendation [5, 1]; i....

    [...]

  • ...[1] S. Bethard and D. Jurafsky....

    [...]

Proceedings ArticleDOI
11 Aug 2013
TL;DR: This paper proposes a novel way of analyzing literature citation to explore the research topics and the theme evolution by modeling article citation relations with a probabilistic generative model, and demonstrates that Citation-LDA can effectively discover the evolution of research themes, with better formed topics than (conventional) Content-LDA.
Abstract: Understanding how research themes evolve over time in a research community is useful in many ways (e.g., revealing important milestones and discovering emerging major research trends). In this paper, we propose a novel way of analyzing literature citation to explore the research topics and the theme evolution by modeling article citation relations with a probabilistic generative model. The key idea is to represent a research paper by a "bag of citations" and model such a "citation document" with a probabilistic topic model. We explore the extension of a particular topic model, i.e., Latent Dirichlet Allocation (LDA), for citation analysis, and show that such a Citation-LDA can facilitate discovering of individual research topics as well as the theme evolution from multiple related topics, both of which in turn lead to the construction of evolution graphs for characterizing research themes. We test the proposed Citation-LDA on two datasets: the ACL Anthology Network (AAN) of natural language research literatures and PubMed Central (PMC) archive of biomedical and life sciences literatures, and demonstrate that Citation-LDA can effectively discover the evolution of research themes, with better formed topics than (conventional) Content-LDA.

54 citations


"A collective topic model for milest..." refers background or methods or result in this paper

  • ...Table 2: Topic milestone papers (top-10 papers) for Sentiment Analysis from [8]....

    [...]

  • ...The top-2 paper in our model is ranked highly, compared with that in [8] (top-4)....

    [...]

  • ...This dataset is also used in previous work [8]; thus, we can use it to perform some comparisons with [8]....

    [...]

  • ...However, [8] only considered co-citation relations for topic milestone paper discovery....

    [...]

  • ...In order to compare with previous work [8], we use the top-10 papers for the topic Sentiment Analysis as an example....

    [...]

Proceedings ArticleDOI
24 Jul 2011
TL;DR: Experimental results demonstrate the effectiveness of the joint probabilistic topic model for simultaneously modeling the contents of multi-typed objects of a heterogeneous information network for the tasks of topic modeling and object clustering.
Abstract: In this paper, we propose a joint probabilistic topic model for simultaneously modeling the contents of multi-typed objects of a heterogeneous information network. The intuition behind our model is that different objects of the heterogeneous network share a common set of latent topics so as to adjust the multinomial distributions over topics for different objects collectively. Experimental results demonstrate the effectiveness of our approach for the tasks of topic modeling and object clustering.

17 citations