Journal ArticleDOI

Topic discovery and evolution in scientific literature based on content and citations

01 Oct 2017-Journal of Zhejiang University Science C (Zhejiang University Press)-Vol. 18, Iss: 10, pp 1511-1524
TL;DR: This paper proposes a citation-content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both document citation relations and the content of the document itself via a probabilistic generative model, and tests the algorithm on two online datasets to demonstrate that it effectively discovers important topics and reflects the topic evolution of important research themes.
Abstract: Researchers across the globe have been increasingly interested in the manner in which important research topics evolve over time within the corpus of scientific literature. In a dataset of scientific articles, each document can be considered to comprise both the words of the document itself and its citations of other documents. In this paper, we propose a citation-content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both document citation relations and the content of the document itself via a probabilistic generative model. The citation-content-LDA topic model exploits a two-level topic model that includes the citation information for ‘father’ topics and text information for sub-topics. The model parameters are estimated by a collapsed Gibbs sampling algorithm. We also propose a topic evolution algorithm that runs in two steps: topic segmentation and topic dependency relation calculation. We have tested the proposed citation-content-LDA model and topic evolution algorithm on two online datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS), to demonstrate that our algorithm effectively discovers important topics and reflects the topic evolution of important research themes. According to our evaluation metrics, citation-content-LDA outperforms both content-LDA and citation-LDA.
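The abstract describes a two-step topic-evolution procedure (topic segmentation followed by topic dependency relation calculation) built on top of a topic model. The sketch below illustrates that general idea with plain per-window LDA and a simple top-word overlap as the dependency measure; the model, similarity measure, and parameter values are illustrative assumptions, not the paper's citation-content-LDA.

```python
# Sketch of a two-step topic-evolution analysis in the spirit of the paper:
# (1) segment the corpus into time windows and fit a topic model per window,
# (2) link topics across adjacent windows by the similarity of their word
#     distributions. This uses plain LDA, not citation-content-LDA.
from gensim import corpora
from gensim.models import LdaModel

def fit_window_models(windows, num_topics=10):
    """windows: list of lists of tokenized documents, one list per time period."""
    models = []
    for docs in windows:
        dictionary = corpora.Dictionary(docs)
        corpus = [dictionary.doc2bow(d) for d in docs]
        lda = LdaModel(corpus, id2word=dictionary,
                       num_topics=num_topics, passes=10, random_state=0)
        models.append(lda)
    return models

def topic_dependencies(models, top_n=50, threshold=0.2):
    """Link topic z in window t to topic z' in window t+1 when their
    top-word sets overlap strongly (Jaccard similarity as a simple proxy)."""
    links = []
    for t in range(len(models) - 1):
        lda_a, lda_b = models[t], models[t + 1]
        for za in range(lda_a.num_topics):
            words_a = {w for w, _ in lda_a.show_topic(za, topn=top_n)}
            for zb in range(lda_b.num_topics):
                words_b = {w for w, _ in lda_b.show_topic(zb, topn=top_n)}
                jac = len(words_a & words_b) / len(words_a | words_b)
                if jac >= threshold:
                    links.append((t, za, t + 1, zb, jac))
    return links
```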
Citations
Journal ArticleDOI
TL;DR: By analyzing the concept of Smart PSS, this paper questions the convergence between digital and service orientations and considers how digital technologies are used to enable decisions along the PSS lifecycle and/or at different planning levels.

95 citations

Posted ContentDOI
10 Apr 2020-medRxiv
TL;DR: It was observed that current COVID-19 research places more emphasis on clinical characterization, epidemiological study, and virus transmission than research on other CoV infections does, while topics about diagnostics, therapeutics, vaccines, genomics, and pathogenesis account for less than 10%, or even less than 4%, of all COVID-19 publications, much lower than for other CoV infections.
Abstract: Background Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a virus that causes severe respiratory illness in humans and has resulted in the current global outbreak of novel coronavirus disease (COVID-19). This study aimed to evaluate the characteristics of publications involving coronaviruses as well as COVID-19 by using topic modeling. Methods We extracted all abstracts and retained the most informative words from the COVID-19 Open Research Dataset, which contains 35,092 pieces of coronavirus-related literature published up to March 20, 2020. Using Latent Dirichlet Allocation modeling, we trained a topic model from the corpus, analyzed the semantic relationships between topics, and compared the topic distribution between COVID-19 and other CoV infections. Results Eight topics emerged overall: clinical characterization, pathogenesis research, therapeutics research, epidemiological study, virus transmission, vaccines research, virus diagnostics, and viral genomics. It was observed that current COVID-19 research puts more emphasis on clinical characterization, epidemiological study, and virus transmission. In contrast, topics about diagnostics, therapeutics, vaccines, genomics, and pathogenesis account for less than 10%, or even less than 4%, of all COVID-19 publications, much lower than for other CoV infections. Conclusions These results identified knowledge gaps in the area of COVID-19 and offered directions for future research.
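As a rough illustration of the workflow described above (not the authors' exact pipeline), the sketch below trains an LDA model on tokenized abstracts with gensim and compares the average topic distribution of two document subsets, such as COVID-19 papers versus other coronavirus papers; the preprocessing choices and the index variables for the two groups are assumptions.

```python
# Minimal sketch: train LDA on preprocessed abstracts, then compare average
# topic proportions between two subsets of the corpus.
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

def train_lda(tokenized_abstracts, num_topics=8):
    dictionary = corpora.Dictionary(tokenized_abstracts)
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_abstracts]
    lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
                   passes=10, random_state=0)
    return lda, dictionary, corpus

def mean_topic_share(lda, corpus, doc_indices):
    """Average topic distribution over a subset of documents."""
    shares = np.zeros(lda.num_topics)
    for i in doc_indices:
        for topic_id, prob in lda.get_document_topics(corpus[i],
                                                      minimum_probability=0.0):
            shares[topic_id] += prob
    return shares / len(doc_indices)

# covid_idx and other_cov_idx would come from the dataset's metadata
# (hypothetical names here):
# diff = mean_topic_share(lda, corpus, covid_idx) - mean_topic_share(lda, corpus, other_cov_idx)
```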

22 citations

Book ChapterDOI
29 May 2017
TL;DR: This work analyzes bioinformatics literature using topic modeling to identify the “hot” topics in that area in order to make informed choices about research topics.
Abstract: Scientists exploring a new area of research are interested in knowing the “hot” topics in that area in order to make informed choices. With exponential growth in scientific literature, identifying such trends manually is not easy. Topic modeling has emerged as an effective approach to analyze large volumes of text. While this approach has been applied to literature in other scientific areas, there has been no formal analysis of bioinformatics literature.

10 citations


Cites background from "Topic discovery and evolution in sc..."

  • ...For instance, this kind of analytical data-driven insight can benefit researchers as they delve into new areas by providing knowledge of current popular topics and how the focus on different topics has shifted through time [1,24]....


  • ...While there are several topic modeling algorithms [6,10,11], Latent Dirichlet Allocation (LDA) [6] is one of the most widely used approaches and has been shown to be effective at finding distinct topics from a corpus [7,24]....


  • ...Within a particular domain, researchers are increasingly interested in exploring scientific literature to gain insights on how research develops and evolves over time [24]....


Posted Content
TL;DR: This work uses a method to discover latent topics in tweets from Colombian Twitter news accounts in order to identify the most prominent events in the country, with an emphasis on security, violence and crime-related tweets.
Abstract: Cultural and social dynamics are important concepts that must be understood in order to grasp what a community cares about. To that end, an excellent source of information on what occurs in a community is the news, especially in recent years, when mass media giants use social networks to communicate and interact with their audience. In this work, we use a method to discover latent topics in tweets from Colombian Twitter news accounts in order to identify the most prominent events in the country. We pay particular attention to security, violence and crime-related tweets because of the violent environment that surrounds Colombian society. The latent topic discovery method that we use builds vector representations of the tweets by using FastText and finds clusters of tweets through the K-means clustering algorithm. The number of clusters is found by measuring the $C_V$ coherence for a range of number of topics of the Latent Dirichlet Allocation (LDA) model. We finally use Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction to visualise the tweets vectors. Once the clusters related to security, violence and crime are identified, we proceed to apply the same method within each cluster to perform a fine-grained analysis in which specific events mentioned in the news are grouped together. Our method is able to discover event-specific sets of news, which is the baseline to perform an extensive analysis of how people engage in Twitter threads on the different types of news, with an emphasis on security, violence and crime-related tweets.
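A compact sketch of the pipeline described above, assuming gensim for FastText and LDA coherence, scikit-learn for K-means, and umap-learn for the projection; all parameter values are illustrative rather than the authors' settings.

```python
# Pipeline sketch: average FastText word vectors per tweet, pick the number of
# clusters from the C_V coherence of LDA models, cluster with K-means, and
# project the tweet vectors with UMAP for visualization.
import numpy as np
from gensim import corpora
from gensim.models import FastText, LdaModel, CoherenceModel
from sklearn.cluster import KMeans
import umap

def tweet_vectors(tokenized_tweets, dim=100):
    ft = FastText(sentences=tokenized_tweets, vector_size=dim,
                  epochs=10, min_count=2)
    return np.array([
        np.mean([ft.wv[w] for w in tw if w in ft.wv] or [np.zeros(dim)], axis=0)
        for tw in tokenized_tweets
    ])

def best_k_by_coherence(tokenized_tweets, k_range=range(5, 21)):
    dictionary = corpora.Dictionary(tokenized_tweets)
    corpus = [dictionary.doc2bow(t) for t in tokenized_tweets]
    scores = {}
    for k in k_range:
        lda = LdaModel(corpus, id2word=dictionary, num_topics=k, random_state=0)
        cm = CoherenceModel(model=lda, texts=tokenized_tweets,
                            dictionary=dictionary, coherence='c_v')
        scores[k] = cm.get_coherence()
    return max(scores, key=scores.get)

def cluster_and_project(vectors, k):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(vectors)
    embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(vectors)
    return labels, embedding
```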

6 citations


Cites background from "Topic discovery and evolution in sc..."

  • ...[30] presents an LDA-based model that relates the topic of a scientific paper with the content of the documents that it cites....


Proceedings ArticleDOI
01 May 2018
TL;DR: An analysis of bioinformatics scholarly literature comprising 143,000 research papers published between 1987 and 2018 is conducted, examining research trends through temporal analysis to identify exciting areas of research and predict future trends.
Abstract: Bioinformatics is an emerging field that is constantly evolving as technology progresses and new biomedical discoveries are made. Bioinformatics research has led to several scientific breakthroughs in the past two decades and remains an active driver of scientific progress and technological advances. In this work, we conduct an analysis of bioinformatics scholarly literature consisting of 143,000 research papers between 1987 and 2018. We apply topic modeling to identify the salient themes in bioinformatics research. We examine the research trends by performing temporal analysis to determine exciting areas of research and predict future trends. In addition, we evaluate the impact of bioinformatics research on the industry by cross-linking the literature with patent databases. We also survey the author backgrounds and the publishing journals, both of which were found to have changed significantly within the past decade. This study provides valuable insight on the progress and current state of bioinformatics research.
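As a hedged sketch of the temporal-analysis step described above: given a fitted document-topic matrix and each paper's publication year, one can compute yearly topic shares and rank topics by their growth. The fitted matrix and year metadata are assumed inputs; this is not the authors' exact analysis.

```python
# Temporal trend sketch: yearly topic prevalence and a simple growth ranking.
import numpy as np

def yearly_topic_shares(doc_topic, years):
    """doc_topic: (n_docs, n_topics) array of topic proportions;
    years: array of publication years, one per document."""
    uniq = np.sort(np.unique(years))
    shares = np.array([doc_topic[years == y].mean(axis=0) for y in uniq])
    return uniq, shares  # shares[t, z] = average weight of topic z in year t

def growth_slopes(uniq_years, shares):
    """Least-squares slope of each topic's yearly share ("rising" topics)."""
    return np.array([np.polyfit(uniq_years, shares[:, z], 1)[0]
                     for z in range(shares.shape[1])])
```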

4 citations


Cites methods from "Topic discovery and evolution in sc..."

  • ...Topic modeling is a frequently used method to categorize research areas in different fields [7][10][20]....


  • ...LDA is the most commonly used topic modeling algorithm due to its effectiveness in identifying topics within a corpus [7][20]....


References
Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

30,570 citations


"Topic discovery and evolution in sc..." refers methods in this paper

  • ...…content-LDA and citation-LDA models as our baseline, using the title to represent the papers in both PAMI and CS. Perplexity, proposed by Blei et al. (2003), is an important criterion used to show the generalization power of a model on unseen data and the number of topics....

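The abstract above describes LDA with variational inference, and the excerpt cites perplexity (Blei et al., 2003) as the criterion for generalization power and for choosing the number of topics. Below is a minimal sketch using scikit-learn's variational-Bayes LDA and a hypothetical train/held-out split; it is not the paper's own evaluation code.

```python
# Minimal sketch: fit LDA for several topic counts and compare held-out
# perplexity (lower is better). X_train / X_heldout are assumed to be
# document-term matrices, e.g. from CountVectorizer.
from sklearn.decomposition import LatentDirichletAllocation

def heldout_perplexity(X_train, X_heldout, n_topics):
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    learning_method="batch",
                                    random_state=0).fit(X_train)
    return lda.perplexity(X_heldout)

# Typical model-selection loop over candidate topic counts:
# for k in (5, 10, 20, 40):
#     print(k, heldout_perplexity(X_train, X_heldout, k))
```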

Proceedings Article
03 Jan 2001
TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

25,546 citations

Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

14,696 citations


"Topic discovery and evolution in sc..." refers methods in this paper

  • ...Among these methods, PageRank, which employs the random walk concept (Brin and Page, 1998), is applied most often to link analysis in web page ranking applications....

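The excerpt above refers to random-walk PageRank (Brin and Page, 1998). A small power-iteration sketch on a toy adjacency matrix is shown below; it illustrates only the ranking idea, not the search engine's implementation or the paper's citation graph.

```python
# Power-iteration PageRank sketch with damping and uniform handling of
# dangling nodes; the adjacency matrix is a toy example.
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-8, max_iter=100):
    """adj[i][j] = 1 if node i links to node j."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out_degree = adj.sum(axis=1)
    # Dangling nodes (no out-links) distribute their rank uniformly.
    transition = np.where(out_degree[:, None] > 0,
                          adj / np.maximum(out_degree[:, None], 1),
                          1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * transition.T @ rank
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank

print(pagerank([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]]))
```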

Journal Article
TL;DR: Google, as discussed by the authors, is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.

13,327 citations

Journal ArticleDOI
TL;DR: The generative model for documents introduced by Blei, Ng, and Jordan is described, and a Markov chain Monte Carlo algorithm for inference in this model is presented and used to analyze abstracts from PNAS, with Bayesian model selection establishing the number of topics.
Abstract: A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying “hot topics” by examining temporal dynamics and tagging abstracts to illustrate semantic content.

5,680 citations


"Topic discovery and evolution in sc..." refers methods in this paper

  • ...An inference is necessary for obtaining model parameters θd and φz via the collapsed Gibbs sampling algorithm (Griffiths and Steyvers, 2004)....

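The excerpt above estimates θd and φz with collapsed Gibbs sampling (Griffiths and Steyvers, 2004). Below is a compact sketch of the standard collapsed Gibbs sampler for plain LDA, with illustrative hyperparameters; the paper's citation-content-LDA builds on this machinery with citation information, which the sketch does not include.

```python
# Compact collapsed Gibbs sampler for plain LDA (Griffiths and Steyvers, 2004).
# Hyperparameters and iteration count are illustrative.
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of lists of word ids in [0, V). Returns (theta, phi)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))          # topic counts per document
    nkw = np.zeros((K, V))          # word counts per topic
    nk = np.zeros(K)                # total words per topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):  # initialize counts from random assignments
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Collapsed conditional p(z_i = k | z_-i, w)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)
    phi = (nkw + beta) / (nk[:, None] + V * beta)
    return theta, phi
```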