Showing papers on "Dynamic topic model published in 2018"


Journal ArticleDOI
TL;DR: In this article, a bipartite network of documents and words is used to detect the number of topics and hierarchically cluster both the words and documents, which leads to better topic models than LDA.
Abstract: One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach that infers the latent topical structure of a collection of documents. Despite their success—particularly of the most widely used variant called latent Dirichlet allocation (LDA)—and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, for example, a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. We obtain a fresh view of the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. We achieve this by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods (using a stochastic block model (SBM) with nonparametric priors), we obtain a more versatile and principled framework for topic modeling (for example, it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. Our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.

148 citations
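
As an illustration of the bipartite representation this approach builds on, here is a minimal sketch in Python. The toy corpus and the use of networkx are assumptions for illustration only; the paper's actual inference fits a nonparametric stochastic block model to such a network rather than using networkx.

```python
# Sketch: represent a toy corpus as a bipartite document-word network.
import networkx as nx

docs = {
    "d1": "topic models infer latent structure",
    "d2": "community detection in complex networks",
    "d3": "latent structure of complex networks",
}

g = nx.Graph()
for doc_id, text in docs.items():
    g.add_node(doc_id, bipartite=0)           # document nodes
    for word in text.split():
        g.add_node(word, bipartite=1)         # word nodes (shared across docs)
        if g.has_edge(doc_id, word):
            g[doc_id][word]["weight"] += 1    # weight counts repeated occurrences
        else:
            g.add_edge(doc_id, word, weight=1)

# Fitting an SBM to this network clusters documents and words jointly,
# playing the role that topic inference plays in LDA.
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```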


Posted Content
TL;DR: In this article, the authors extend the class of tractable priors from Wiener processes to the generic class of Gaussian processes (GPs) to explore topics that develop smoothly over time, that have a long-term memory or are temporally concentrated (for event detection).
Abstract: Dynamic topic models (DTMs) model the evolution of prevalent themes in literature, online media, and other forms of text over time. DTMs assume that word co-occurrence statistics change continuously and therefore impose continuous stochastic process priors on their model parameters. These dynamical priors make inference much harder than in regular topic models, and also limit scalability. In this paper, we present several new results around DTMs. First, we extend the class of tractable priors from Wiener processes to the generic class of Gaussian processes (GPs). This allows us to explore topics that develop smoothly over time, that have a long-term memory or are temporally concentrated (for event detection). Second, we show how to perform scalable approximate inference in these models based on ideas around stochastic variational inference and sparse Gaussian processes. This way we can train a rich family of DTMs to massive data. Our experiments on several large-scale datasets show that our generalized model allows us to find interesting patterns that were not accessible by previous approaches.

24 citations
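
The key modeling choice here is the covariance kernel placed on topic trajectories. A minimal sketch of that idea, with arbitrary illustrative hyperparameters (not values from the paper): sampling one word's latent topic weight over time under the classic Wiener-process prior versus a smoother RBF Gaussian-process prior.

```python
import numpy as np

t = np.arange(1, 51, dtype=float)             # 50 time steps

wiener = np.minimum.outer(t, t)               # Wiener kernel: k(s, t) = min(s, t)
rbf = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / 5.0 ** 2)  # RBF kernel

rng = np.random.default_rng(0)
jitter = 1e-8 * np.eye(len(t))                # numerical stability
path_wiener = rng.multivariate_normal(np.zeros(len(t)), wiener + jitter)
path_rbf = rng.multivariate_normal(np.zeros(len(t)), rbf + jitter)

# path_wiener drifts like Brownian motion (the standard DTM prior);
# path_rbf varies smoothly, illustrating why the kernel choice matters.
```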


Journal ArticleDOI
TL;DR: A model named iosLDA boosts the performance of topic modeling via the joint discovery of latent topics and the differing objective and subjective power hidden in every word, and has lower computational complexity than supervised LDA, especially as the number of topics increases.
Abstract: It is observed that distinct words in a given document have either strong or weak ability in delivering facts (i.e., the objective sense) or expressing opinions (i.e., the subjective sense) depending on the topics they associate with. Motivated by the intuitive assumption that different words have varying degrees of discriminative power in delivering the objective or the subjective sense with respect to their assigned topics, a model named identified objective–subjective latent Dirichlet allocation (iosLDA) is proposed in this paper. In the iosLDA model, the simple Polya urn model adopted in traditional topic models is modified by incorporating a probabilistic generative process, in which a novel “Bag-of-Discriminative-Words” (BoDW) representation for the documents is obtained; each document has two BoDW representations, with regard to the objective and subjective senses respectively, which are employed in joint objective and subjective classification instead of the traditional Bag-of-Topics representation. The experiments reported on documents and images demonstrate that: 1) the BoDW representation is more predictive than the traditional ones; 2) iosLDA boosts the performance of topic modeling via the joint discovery of latent topics and the different objective and subjective power hidden in every word; and 3) iosLDA has lower computational complexity than supervised LDA, especially under an increasing number of topics.

22 citations


Journal ArticleDOI
TL;DR: This paper proposes a new latent variable model where latent topics and their proportionals are learned by incorporating the prior based on Dirichlet mixture model, and carries out the inference for LDMM according to the variational Bayes and the collapsed variationalBayes.

20 citations


Journal ArticleDOI
TL;DR: This paper proposes new learning algorithms for activity analysis in video, based on the expectation-maximization approach and variational Bayes inference, together with an anomaly localization procedure elegantly embedded in the topic modeling framework.
Abstract: Semisupervised and unsupervised systems provide operators with invaluable support and can tremendously reduce the operators’ load. In the light of the necessity to process large volumes of video data and provide autonomous decisions, this paper proposes new learning algorithms for activity analysis in video. The activities and behaviors are described by a dynamic topic model. Two novel learning algorithms based on the expectation maximization approach and variational Bayes inference are proposed. Theoretical derivations of the posterior estimates of model parameters are given. The designed learning algorithms are compared with the Gibbs sampling inference scheme introduced earlier in the literature. A detailed comparison of the learning algorithms is presented on real video data. We also propose an anomaly localization procedure, elegantly embedded in the topic modeling framework. It is shown that the developed learning algorithms can achieve 95% success rate. The proposed framework can be applied to a number of areas, including transportation systems, security, and surveillance.

15 citations


Journal ArticleDOI
TL;DR: Wang et al. propose to assume that each topic is a probability distribution over concepts and each concept is a probability distribution over words, adding a latent concept layer between the topic layer and the word layer of the traditional three-layer assumption.
Abstract: Recently, topic modeling has been widely used to discover the abstract topics in the multimedia field. Most existing topic models are based on the assumption of a three-layer hierarchical Bayesian structure, i.e. each document is modeled as a probability distribution over topics, and each topic is a probability distribution over words. However, this assumption is not optimal. Intuitively, it is more reasonable to assume that each topic is a probability distribution over concepts and each concept is a probability distribution over words, i.e. to add a latent concept layer between the topic layer and the word layer of the traditional three-layer assumption. In this paper, we verify the proposed assumption by incorporating it into two representative topic models, obtaining two novel topic models. Extensive experiments were conducted on the proposed models and the corresponding baselines, and the results show that the proposed models significantly outperform the baselines in terms of case studies and perplexity, which means the new assumption is more reasonable than the traditional one.

15 citations
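
The proposed four-layer generative process (document → topic → concept → word) can be made concrete with a small ancestral-sampling sketch. All distributions and the vocabulary below are toy values invented for illustration, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

doc_topic = np.array([0.7, 0.3])                      # P(topic | document)
topic_concept = np.array([[0.8, 0.2],                 # P(concept | topic)
                          [0.1, 0.9]])
concept_word = np.array([[0.5, 0.4, 0.1],             # P(word | concept)
                         [0.1, 0.2, 0.7]])
vocab = ["network", "model", "image"]

def sample_word():
    z = rng.choice(2, p=doc_topic)         # draw a topic for this token
    c = rng.choice(2, p=topic_concept[z])  # draw a concept given the topic
    w = rng.choice(3, p=concept_word[c])   # draw a word given the concept
    return vocab[w]

print([sample_word() for _ in range(8)])
```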


Journal ArticleDOI
TL;DR: A parallel sparse partially collapsed Gibbs sampler is proposed and compared with existing samplers, and it is proved that the partially collapsed samplers scale well with the size of the corpus and can be used in more modeling situations than the ordinary collapsed sampler.
Abstract: Topic models, and more specifically the class of latent Dirichlet allocation (LDA), are widely used for probabilistic modeling of text. Markov chain Monte Carlo (MCMC) sampling from the posterior distribution…

13 citations
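
For context, the baseline being parallelized here is the standard collapsed Gibbs sampler for LDA. A minimal serial sketch with a toy corpus and arbitrary hyperparameters (the paper's contribution, a parallel sparse partially collapsed sampler, is more involved than this):

```python
import numpy as np

docs = [[0, 1, 2, 1], [2, 3, 3, 0], [1, 1, 2, 3]]   # word ids per document
V, K, alpha, beta = 4, 2, 0.1, 0.01                 # toy vocab/topic sizes, priors
rng = np.random.default_rng(0)

# count tables and initial random topic assignments
ndk = np.zeros((len(docs), K))          # document-topic counts
nkw = np.zeros((K, V))                  # topic-word counts
nk = np.zeros(K)                        # topic totals
z = [[rng.integers(K) for _ in doc] for doc in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1

for sweep in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                  # remove current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # full conditional of the collapsed sampler
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k                  # re-add with the new topic
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

print(np.round((nkw + beta) / (nk[:, None] + V * beta), 2))  # topic-word estimates
```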


Posted Content
TL;DR: This work presents a semi-automatic transfer topic labeling method, using the coding instructions of the Comparative Agendas Project to label topics, and shows that it works well for a majority of the topics it estimates, but finds that institution-specific topics require manual input.
Abstract: Topic models are widely used in natural language processing, allowing researchers to estimate the underlying themes in a collection of documents. Most topic models use unsupervised methods and hence require the additional step of attaching meaningful labels to estimated topics. This process of manual labeling is not scalable and suffers from human bias. We present a semi-automatic transfer topic labeling method that seeks to remedy these problems. Domain-specific codebooks form the knowledge-base for automated topic labeling. We demonstrate our approach with a dynamic topic model analysis of the complete corpus of UK House of Commons speeches 1935-2014, using the coding instructions of the Comparative Agendas Project to label topics. We show that our method works well for a majority of the topics we estimate; but we also find that institution-specific topics, in particular on subnational governance, require manual input. We validate our results using human expert coding.

10 citations
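
The transfer-labeling idea can be sketched as matching a topic's top words against a domain codebook. The codebook entries and topic words below are invented examples; the paper uses the Comparative Agendas Project coding instructions as its knowledge base.

```python
# Sketch: assign each estimated topic the codebook label whose keywords
# overlap most with the topic's top words; no overlap flags the topic
# for manual labeling (e.g., institution-specific topics).
codebook = {
    "Health": {"nhs", "hospital", "doctor", "patient", "health"},
    "Defence": {"army", "navy", "war", "defence", "troops"},
}

def label_topic(top_words, codebook):
    """Return the codebook label with the largest keyword overlap, or None."""
    scores = {label: len(set(top_words) & kws) / len(kws)
              for label, kws in codebook.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(label_topic(["hospital", "patient", "waiting", "nhs"], codebook))
```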


Proceedings Article
31 Mar 2018
TL;DR: This paper extends the class of tractable priors from Wiener processes to the generic class of Gaussian processes (GPs), allowing the exploration of topics that develop smoothly over time, have a long-term memory, or are temporally concentrated (for event detection).
Abstract: Dynamic topic models (DTMs) model the evolution of prevalent themes in literature, online media, and other forms of text over time. DTMs assume that word co-occurrence statistics change continuously and therefore impose continuous stochastic process priors on their model parameters. These dynamical priors make inference much harder than in regular topic models, and also limit scalability. In this paper, we present several new results around DTMs. First, we extend the class of tractable priors from Wiener processes to the generic class of Gaussian processes (GPs). This allows us to explore topics that develop smoothly over time, that have a long-term memory or are temporally concentrated (for event detection). Second, we show how to perform scalable approximate inference in these models based on ideas around stochastic variational inference and sparse Gaussian processes. This way we can train a rich family of DTMs to massive data. Our experiments on several large-scale datasets show that our generalized model allows us to find interesting patterns that were not accessible by previous approaches.

10 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: This paper presents a domain Expert Identification method based on an improved dynamic LDA algorithm that addresses the shortcomings of existing methods and considers both the semantic information of the domain and expert authority.
Abstract: In recent years, human society has been transferring from an information society to a knowledge society. Experts mastering professional knowledge are becoming ever more valuable resources in society, so Expert Identification, also known as Expert Finding, has become an important research field. Existing Expert Identification work is mainly based on traditional information retrieval or standard topic models. Expert finding still faces many problems, such as missing semantic information, or inaccuracy when changes over time are not taken into consideration. This paper presents a domain Expert Identification method with an improved dynamic LDA algorithm which solves these shortcomings of existing methods. Based on the standard LDA model, this method divides a corpus with a large time span according to time in order to apply the dynamic LDA model, and combines profile modelling and file modelling for expert modelling. In addition, this method considers both the semantic information of the domain and expert authority. Experiments show its feasibility and effectiveness, and its advantage over the traditional static topic model. It has opened up new application fields for the dynamic topic model.

9 citations
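
The time-sliced core of this method — partitioning a long-span corpus by period and fitting a topic model per slice — can be sketched with gensim. The documents and slicing below are invented for illustration; the paper additionally combines profile modelling and file modelling and weighs expert authority.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus already split into time slices (tokenized documents).
slices = {
    "2010-2013": [["topic", "model", "text"], ["expert", "retrieval", "text"]],
    "2014-2017": [["neural", "expert", "embedding"], ["neural", "topic", "model"]],
}

for period, docs in slices.items():
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    # one LDA model per slice approximates the "dynamic LDA" division by time
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                   random_state=0, passes=10)
    print(period, lda.print_topics(num_words=3))
```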


Proceedings ArticleDOI
01 Jun 2018
TL;DR: This work introduces a novel unsupervised neural dynamic topic model named the Recurrent Neural Network–Replicated Softmax Model (RNN-RSM), in which the topics discovered at each time step influence topic discovery in the subsequent time steps, and introduces a metric to quantify the capability of a dynamic topic model to capture word evolution in topics over time.
Abstract: Dynamic topic modeling facilitates the identification of topical trends over time in temporal collections of unstructured documents. We introduce a novel unsupervised neural dynamic topic model named the Recurrent Neural Network-Replicated Softmax Model (RNN-RSM), where the discovered topics at each time step influence the topic discovery in the subsequent time steps. We account for the temporal ordering of documents by explicitly modeling a joint distribution of latent topical dependencies over time, using distributional estimators with temporal recurrent connections. Applying RNN-RSM to 19 years of articles on NLP research, we demonstrate that, compared to state-of-the-art topic models, RNN-RSM shows better generalization, topic interpretation, evolution, and trends. We also introduce a metric (named SPAN) to quantify the capability of a dynamic topic model to capture word evolution in topics over time.

Journal ArticleDOI
TL;DR: The authors' results show a better performance of mDTM in terms of the quality of the mined information compared to prior research, and showcase mDTM as a promising tool for the effective mining of microblogs in a rapidly changing global information space.
Abstract: In this paper, the authors build on prior literature to develop an adaptive and time-varying metadata-enabled dynamic topic model (mDTM) and apply it to a large Weibo dataset using an online Gibbs sampler for parameter estimation. Their approach simultaneously captures the maximum number of inherent dynamic features of microblogs, thereby setting it apart from other online document mining methods in the extant literature. In summary, the authors' results show a better performance of mDTM in terms of the quality of the mined information compared to prior research and showcase mDTM as a promising tool for the effective mining of microblogs in a rapidly changing global information space.

Proceedings ArticleDOI
19 Jul 2018
TL;DR: Novel applications of the Negative-Binomial augmentation trick yield simple, efficient, closed-form updates of all the required conditional posteriors, resulting in far lower computational requirements as well as less sensitivity to initial conditions compared to existing approaches.
Abstract: The abundance of digital text has led to extensive research on topic models that reason about documents using latent representations. Since for many online or streaming textual sources, such as news outlets, the number and nature of topics change over time, there have been several efforts to address such situations using dynamic versions of topic models. Unfortunately, existing approaches encounter more complex inference when their model parameters are varied over time, resulting in high computational complexity and performance degradation. This paper introduces the DM-DTM, a dual Markov chain dynamic topic model, for characterizing a corpus that evolves over time. This model uses a gamma Markov chain and a Dirichlet Markov chain to allow the topic popularities and word-topic assignments, respectively, to vary smoothly over time. Novel applications of the Negative-Binomial augmentation trick yield simple, efficient, closed-form updates of all the required conditional posteriors, resulting in far lower computational requirements as well as less sensitivity to initial conditions, as compared to existing approaches. Moreover, via a gamma process prior, the number of desired topics is inferred directly from the data rather than being pre-specified, and can vary as the data changes. Empirical comparisons using multiple real-world corpora demonstrate a clear superiority of DM-DTM over strong baselines for both static and dynamic topic models.
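
The gamma Markov chain that lets topic popularity drift smoothly can be sketched in a few lines. The coupling constant below is an arbitrary illustrative choice; the chain is mean-preserving, with variance shrinking as the coupling grows.

```python
import numpy as np

rng = np.random.default_rng(0)
T, c = 50, 20.0            # time steps; c controls smoothness (illustrative)
lam = np.empty(T)
lam[0] = 1.0               # initial topic popularity
for t in range(1, T):
    # Gamma(shape = c * lam[t-1], rate = c):
    # E[lam_t | lam_{t-1}] = lam_{t-1}, Var = lam_{t-1} / c
    lam[t] = rng.gamma(shape=c * lam[t - 1], scale=1.0 / c)

print(np.round(lam[:10], 3))
```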

Journal ArticleDOI
TL;DR: An improved framework called SONMFSR (Soft Orthogonal NMF with Sparse Representation) makes full use of soft orthogonality and sparsity constraints to tackle practical NMF problems, and exhibits great potential in real-world applications.

Posted Content
TL;DR: The authors create Viscovery, a platform for opinion summarization and trend tracking that analyzes a stream of opinions recovered from forums, using dynamic topic models to uncover the hidden structure of topics behind opinions and to characterize vocabulary dynamics.
Abstract: Opinions in forums and social networks are released by millions of people, owing to the increasing number of users that use Web 2.0 platforms to opine about brands and organizations. For enterprises or government agencies it is almost impossible to track what people say, producing a gap between user needs/expectations and organizational actions. To bridge this gap we create Viscovery, a platform for opinion summarization and trend tracking that is able to analyze a stream of opinions recovered from forums. To do this we use dynamic topic models, which uncover the hidden structure of topics behind opinions and characterize vocabulary dynamics. We extend dynamic topic models for incremental learning, a key aspect needed in Viscovery for model updating in near-real time. In addition, we include sentiment analysis in Viscovery, allowing positive and negative words to be separated for a specific topic at different levels of granularity. Viscovery allows representative opinions and terms in each topic to be visualized. At a coarse level of granularity, the dynamics of the topics can be analyzed using a 2D topic embedding, suggesting longitudinal topic merging or segmentation. In this paper we report our experience developing this platform, sharing lessons learned and opportunities that arise from the use of sentiment analysis and topic modeling in real-world applications.

Proceedings Article
01 Jan 2018
TL;DR: This paper proposes a Multi-Scale Dynamic Topic Model (MS-DTM) and a complementary Incremental Multi-Scale Dynamic Topic Model (IMS-DTM) inference method that can be used to capture latent topics and their dynamics simultaneously, at different scales.
Abstract: Dynamic topic models (DTMs) are commonly used for mining latent topics in evolving web corpora. In this paper, we note that a major limitation of the conventional DTM-based models is that they assume a predetermined and fixed scale of topics. In reality, however, topics may have varying spans, and topics of multiple scales can co-exist in a single web or social media data stream. Therefore, DTMs that assume a fixed epoch length may not be able to effectively capture latent topics and thus negatively affect accuracy. In this paper, we propose a Multi-Scale Dynamic Topic Model (MS-DTM) and a complementary Incremental Multi-Scale Dynamic Topic Model (IMS-DTM) inference method that can be used to capture latent topics and their dynamics simultaneously, at different scales. In this model, topic-specific feature distributions are generated based on a multi-scale feature distribution of the previous epochs; moreover, multiple scales of the current epoch are analyzed together through a novel multi-scale incremental Gibbs sampling technique. We show that the proposed model significantly improves efficiency and effectiveness compared to single-scale DTMs and prior models that consider only multiple scales of the past.

Journal ArticleDOI
Zhinan Gou, Lixin Han, Ling Sun, Jun Zhu, Hong Yan
TL;DR: This paper introduces a new method for constructing DTMs based on variational autoencoders and factor graphs, which uses re-parameterization of the variational lower bound to generate a lower-bound estimator that is optimized directly by standard stochastic gradient descent.
Abstract: Topic models are widely used in various fields of machine learning and statistics. Among them, the dynamic topic model (DTM) is the most popular time-series topic model for dynamic representations of text corpora. A major challenge is that posterior inference in a DTM requires a complex reasoning process with a high computational cost, and even a tiny change to the model requires restructuring the inference. For these reasons, the flexibility and generality of DTMs are poor, and they are difficult to apply. In this paper, we introduce a new method for constructing DTMs based on variational autoencoders and factor graphs. This model uses re-parameterization of the variational lower bound to generate a lower-bound estimator that is optimized directly by the standard stochastic gradient descent method. At the same time, the optimization process is simplified by integrating a dynamic factor graph in the state space to achieve a better model. The experimental dataset is a journal paper corpus from DBLP that mainly focuses on natural language processing and spans twenty-five years (1984–2009). Experimental results comparing several state-of-the-art baselines indicate that the proposed method is effective and feasible.
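
The re-parameterization step this method relies on is the standard trick from variational autoencoders: write the latent draw as a deterministic function of the variational parameters plus independent noise, so the lower-bound estimator is differentiable in those parameters. A minimal sketch with illustrative shapes and values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, -0.1, 0.5])         # encoder mean for one document (toy)
log_sigma = np.array([-1.0, -0.5, -1.5])

eps = rng.standard_normal(mu.shape)     # noise, independent of parameters
z = mu + np.exp(log_sigma) * eps        # differentiable in mu and log_sigma

# z would feed the decoder; the variational lower bound can then be
# optimized directly with standard stochastic gradient descent.
print(z)
```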

01 Jan 2018
TL;DR: This work outlines a research agenda for approaching that task by using LDA as a base, in combination with the observation of state transitions in topics at consecutive time steps, in order to omit the fixed number of topics k.
Abstract: Scientific communities are always changing and evolving. Topics of today might split or even disappear in the future; other topics might merge or appear at some time. Nowadays, the closest we come to picturing these developments are dynamic topic models, which come with a fixed number of topics k. It would be desirable to omit k. This work outlines a research agenda for approaching that task by using LDA as a base, in combination with the observation of state transitions in topics at consecutive times.

Dissertation
01 Jan 2018
TL;DR: The application of topic models, a machine learning algorithm, to detect behaviour patterns in different types of data produced by a monitoring system is presented, suggesting potential for dynamic topic models to identify changes in routines that could aid early diagnosis of chronic diseases.
Abstract: Healthcare systems worldwide are facing growing demands on their resources due to an ageing population and an increase in the prevalence of chronic diseases. Innovative residential healthcare monitoring systems, using a variety of sensors, are being developed to help address these needs. Interpreting the vast wealth of data generated is key to fully exploiting the benefits offered by a monitoring system. This thesis presents the application of topic models, a machine learning algorithm, to detect behaviour patterns in different types of data produced by a monitoring system. Latent Dirichlet Allocation was applied to real-world activity data with corresponding ground-truth labels of daily routines. The results from an existing dataset and a novel dataset collected using a custom mobile phone app demonstrated that the patterns found are the equivalent of routines. Long-term monitoring can identify changes that could indicate an alteration in health status. Dynamic topic models were applied to simulated long-term activity datasets to detect changes in the structure of daily routines. It was shown that the changes occurring in the simulated data can successfully be detected. This result suggests potential for dynamic topic models to identify changes in routines that could aid early diagnosis of chronic diseases. Furthermore, chronic conditions, such as diabetes and obesity, are related to quality of diet. Current research findings on the association between eating behaviours, especially snacking, and the impact on diet quality and health are often conflicting. One problem is the lack of consistent definitions for different types of eating event. The novel application of Latent Dirichlet Allocation to three nutrition datasets is described. The results demonstrated that combinations of food groups representative of eating event types can be detected. Moreover, labels assigned to these combinations showed good agreement with alternative methods for labelling eating event types.

Patent
04 May 2018
TL;DR: In this article, a dynamic short-text cluster searching method is proposed: short text stream data are used to build a short-term topic model, a long-term historical topic model is synthesized to amend the short-term topic model in the data stream to obtain the probability distribution of topics and feature words, clustering is performed using the conditional probability of the text and the topics, and dynamic, accurate keyword searching is achieved.
Abstract: The invention discloses a dynamic short-text cluster searching method. In this method, short text stream data are used to build a short-term topic model, and a long-term historical topic model is synthesized to amend the short-term topic model in the data stream, yielding the probability distribution of topics and feature words; clustering is performed using the conditional probability of the text and the topics, forming dynamic, accurate keyword searching. By building the dynamic topic model, a keyword searching function that changes over time is realized; problems such as the sparsity of short text data and information loss are addressed by a multinomial mixture topic model; and the efficiency and performance of information searching are improved.
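
One simple reading of "synthesizing a long-term historical topic model to amend the short-term topic model" is a convex combination of the two word distributions. The mixing weight and distributions below are assumptions for illustration only; the patent does not specify its correction scheme at this level of detail.

```python
import numpy as np

short_term = np.array([0.6, 0.3, 0.1])   # P(word | topic) from the recent stream (toy)
long_term = np.array([0.4, 0.2, 0.4])    # P(word | topic) from history (toy)
w = 0.7                                   # trust placed in the short-term model

amended = w * short_term + (1 - w) * long_term
amended /= amended.sum()                  # keep it a valid distribution
print(amended)
```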

Proceedings ArticleDOI
01 Nov 2018
TL;DR: An unsupervised learning approach is used to identify in-trouble students with higher effectiveness and without preparing additional labeled data sets; it obtains better temporal clusters with dynamic topic modeling and is well suited to the early in-trouble student identification task.
Abstract: Early in-trouble student identification in an academic credit system is a popular and challenging task in the educational data mining field. Only the first few semesters of each student can be observed for the task, so that in-trouble students can be recognized soon enough to have time to improve their study performance. The task can be tackled with different machine learning approaches. In this paper, we use an unsupervised learning approach, which identifies such students with higher effectiveness and without preparing additional labeled data sets. In this approach, a temporal cluster analysis method is proposed based on the temporal clusters returned by dynamic topic models. In addition, we consider temporal characteristics of each student's study performance to form a pattern from the temporal clusters he/she belongs to over time. Similar students share similar patterns, allowing us to determine the pattern types of in-trouble students and recognize them more accurately. In an evaluation study, experimental results show that our method outperforms other unsupervised and supervised learning methods with higher Recall and F-measure values. It also obtains better temporal clusters with dynamic topic modeling. As a result, our method is suitable for the early in-trouble student identification task.

Proceedings ArticleDOI
09 May 2018
TL;DR: A novel method to predict the influence of a new paper by collaboratively learning the latent vectors of paper features and their correlations through the Factorization Machine method, which does not require citation information to evaluate paper quality.
Abstract: An increasing number of papers are published every year. Researchers want to find new high-quality papers, which is a challenging task due to the lack of citation information. In this paper, we propose a novel method to predict the influence of a new paper by collaboratively learning the latent vectors of paper features and their correlations. We propose the concept of topic-related authority to integrate the dynamic topic model with paper citations, so as to learn how content and authors influence a paper's quality. We adopt the Factorization Machine method to collaboratively learn the latent vectors of correlations between different paper features. Compared with traditional methods, our approach does not require citation information to evaluate a paper's quality, which makes it appropriate for newly published papers. We conduct an extensive evaluation against a real dataset crawled from the ACM Digital Library. The results show that our method outperforms the other methods.
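
For reference, the second-order Factorization Machine prediction used here to learn feature correlations looks like this. The weights below are random placeholders standing in for learned parameters, and the O(nk) pairwise-interaction identity is the standard FM formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                    # feature count (paper attributes), latent dim
x = rng.random(n)              # one paper's feature vector (toy)
w0, w = 0.1, rng.normal(size=n)          # global bias, linear weights
V = rng.normal(scale=0.1, size=(n, k))   # latent factor vectors

# pairwise interactions in O(n*k) via the standard FM identity:
# sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
interactions = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
y_hat = w0 + w @ x + interactions
print(y_hat)
```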