Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. Over the lifetime of the topic, 5,351 publications have been published, receiving 212,555 citations. The topic is also known as: LDA.


Papers
Journal Article
TL;DR: A new technique to semantically analyze knowledge flows across countries by using publication and citation data is proposed; the results indicate that Japanese researchers focus on research areas such as efficient use of Photovoltaic, Energy Conversion and Superconductors (to produce low-cost renewable energy).
Abstract: In this paper we propose a new technique to semantically analyze knowledge flows across countries by using publication and citation data. We start with the identification of research topics produced by a given source country. Then, we collect papers, published by authors outside the source country, that cite the identified research topics. Finally, we group each set of citing papers separately to determine the scholarly impact of the actual identified research topics in the cited topics. The research topics are identified using our proposed topic model with distance matrix, an extension of the classic Latent Dirichlet Allocation model. We also present a case study to illustrate the use of our proposed techniques in the subject area of Energy during 2004-2009 using the Scopus database. We compare the Japanese and Chinese papers that cite the scientific literature produced by researchers from the United States in order to show the difference in the use of the same knowledge. The results indicate that Japanese researchers focus on research areas such as efficient use of Photovoltaic, Energy Conversion and Superconductors (to produce low-cost renewable energy). In contrast with the Japanese researchers, Chinese researchers focus on the areas of Power Systems, Power Grids and Solar Cells production. Such analyses are useful for understanding the dynamics of the relevant knowledge flows across nations.

37 citations

Journal Article
TL;DR: Local users' sentiments extracted from Geo-tweets data from January to December 2016 are explored from spatial and temporal perspectives, revealing patterns that demonstrate the associations between the nature of Twitter content and the characteristics of places and users.
Abstract: Sentiment affects every aspect of people's lives and has a strong impact on their mental health. This paper explores local users' sentiments extracted from Geo-tweets data from January to December 2016, analyzed from spatial and temporal perspectives. Because of the large amount of noisy data and the complicated procedure for extracting local users, a workflow is created to help other researchers reproduce, replicate or extend the procedures using similar Geo-tweet datasets. The workflow is shared at Harvard Dataverse (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6N9VUF). Using the processed data, each tweet's sentiment is classified according to its content. Then, the overall temporal variations in the total number of positive, neutral, and negative sentiments are analyzed at monthly, daily and hourly levels. From a spatial perspective, the Local Indicators of Spatial Association (LISA) statistical method is employed to discover spatial clusters. In order to explore the content of positive sentiments, this paper applies the Latent Dirichlet Allocation (LDA) model to classify the Geo-tweets with positive sentiments into different topics. Combining the geospatial information with the topics, some patterns are found which demonstrate the associations between the nature of Twitter content and the characteristics of places and users. For example, weekend events and gatherings of friends and family are the times when users prefer to post positive tweets. In the western part of the US, users tend to post more photos to share great moments on Twitter than users in other parts of the US.

37 citations

Journal Article
TL;DR: To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification, and a novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated.
Abstract: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2. Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.

37 citations
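
The CGC abstract above describes LDA as a hierarchical Bayesian model in which each document is a random mixture over latent topics and each topic is a distribution over words. A minimal sketch of that generative story follows, using only the Python standard library. The topic vocabularies and the concentration values are hypothetical, chosen for illustration. The `hellinger` helper is an assumption as well: it is one common way to compare two points on the topic simplex, not the pairwise similarity measure formulated in the paper.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, topics, length, rng):
    """Generate one document under the LDA generative story.

    alpha  : Dirichlet concentration parameters, one per topic
    topics : list of per-topic word distributions (dict: word -> probability)
    Returns the document's topic mixture and its list of words.
    """
    theta = sample_dirichlet(alpha, rng)  # document-level topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to theta.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word from the chosen topic's word distribution.
        vocab = list(topics[z])
        w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        words.append(w)
    return theta, words


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 means identical)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical two-topic model; words and probabilities are illustrative only.
    topics = [
        {"gene": 0.5, "lifespan": 0.3, "mutant": 0.2},
        {"neuron": 0.6, "axon": 0.4},
    ]
    theta_a, doc_a = generate_document([0.5, 0.5], topics, 8, rng)
    theta_b, doc_b = generate_document([0.5, 0.5], topics, 8, rng)
    print(doc_a)
    print(doc_b)
    print("distance between topic mixtures:", round(hellinger(theta_a, theta_b), 3))
```

In practice the topic mixtures would come from posterior inference on real text rather than from sampling, but the same distance could then be used to rank documents by closeness to a query document.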

Proceedings Article
15 Apr 2007
TL;DR: A latent Dirichlet-tree allocation (LDTA) model - a correlated latent semantic model - for unsupervised language model adaptation is proposed and empirical results show that the LDTA model has a faster training convergence than the LDA model with the same initial flat model.
Abstract: We propose a latent Dirichlet-tree allocation (LDTA) model - a correlated latent semantic model - for unsupervised language model adaptation. The LDTA model extends the latent Dirichlet allocation (LDA) model by replacing a Dirichlet prior with a Dirichlet-tree prior over the topic proportions. Latent topics under the same subtree are expected to be more correlated than topics under different subtrees. The LDTA model falls back to the LDA model when a depth-one Dirichlet-tree is used, and the model fits into the variational Bayes inference framework employed in the LDA model. Empirical results show that the LDTA model has faster training convergence than the LDA model with the same initial flat model. Experimental results show that the LDTA-adapted LM performed better than the LDA-adapted LM on the Mandarin RT04-eval set when the models were trained using a small text corpus, while both models had the same recognition performance when the models were trained using a big text corpus. We observed a 0.4% absolute CER reduction after LM adaptation using LSA marginals.

37 citations
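
The LDTA abstract above replaces the Dirichlet prior over topic proportions with a Dirichlet-tree prior and notes that a depth-one tree recovers plain LDA. The sketch below illustrates only that prior, not the authors' model or their variational inference. It assumes the usual Dirichlet-tree construction: each internal node draws branch probabilities from its own Dirichlet, and a leaf topic's proportion is the product of the branch probabilities on its path from the root. The tuple-based tree encoding and all concentration values are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_dirichlet_tree(node, rng, mass=1.0, out=None):
    """Sample topic proportions from a Dirichlet-tree prior.

    node is either ("leaf", topic_id) or ("node", alphas, children), where
    alphas holds one concentration parameter per child (illustrative encoding).
    Returns a dict mapping topic_id to its proportion.
    """
    if out is None:
        out = {}
    if node[0] == "leaf":
        out[node[1]] = mass  # product of branch probabilities along the path
        return out
    _, alphas, children = node
    branch = sample_dirichlet(alphas, rng)  # split this node's mass among its children
    for b, child in zip(branch, children):
        sample_dirichlet_tree(child, rng, mass * b, out)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Two subtrees of two topics each: topics 0 and 1 share a branch draw,
    # as do topics 2 and 3, so topics within a subtree are coupled.
    tree = ("node", [1.0, 1.0], [
        ("node", [2.0, 2.0], [("leaf", 0), ("leaf", 1)]),
        ("node", [2.0, 2.0], [("leaf", 2), ("leaf", 3)]),
    ])
    print(sample_dirichlet_tree(tree, rng))

    # A depth-one tree is a single Dirichlet draw, i.e. the ordinary LDA prior.
    flat = ("node", [0.5, 0.5, 0.5], [("leaf", 0), ("leaf", 1), ("leaf", 2)])
    print(sample_dirichlet_tree(flat, rng))
```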

Journal Article
01 May 2016
TL;DR: A ranking mechanism capable of identifying the top-k social audience members on Twitter based on an index that has the potential to be adopted in real-world applications for differentiating prospective customers from the general audience and enabling market segmentation for better business decision making is presented.
Abstract: Even though social media offers plenty of business opportunities, identifying the right audience from the massive amount of social media data is highly challenging for a company with finite resources and marketing budgets. In this paper, we present a ranking mechanism that is capable of identifying the top-k social audience members on Twitter based on an index. Data from three different Twitter business account owners were used in our experiments to validate this ranking mechanism. The results show that the index developed using a combination of semi-supervised and supervised learning methods is indeed generic enough to retrieve relevant audience members from the three different data sets. This approach of combining Fuzzy Match, Twitter Latent Dirichlet Allocation and Support Vector Machine Ensemble is able to leverage the content of account owners to construct seed words and training data sets with minimal annotation effort. We conclude that this ranking mechanism has the potential to be adopted in real-world applications for differentiating prospective customers from the general audience and enabling market segmentation for better business decision making. An approach to rank the high-value social audience (HVSA) on Twitter is proposed. An HVSA index is developed using various methods with minimal annotation effort. Top-k HVSA members are identified from three data sets of different nature. A pooling strategy and Average [email protected] are recommended for the HVSA ranking. Audience segmentation on the ranked HVSA enables better decision making.

37 citations


Network Information
Related Topics (5)
Cluster analysis: 146.5K papers, 2.9M citations, 86% related
Support vector machine: 73.6K papers, 1.7M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Feature extraction: 111.8K papers, 2.1M citations, 84% related
Convolutional neural network: 74.7K papers, 2M citations, 83% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  323
2022  842
2021  418
2020  429
2019  473
2018  446