Topic

Latent Dirichlet allocation

About: Latent Dirichlet allocation is a research topic. Over its lifetime, 5,351 publications have been published within this topic, receiving 212,555 citations. The topic is also known as: LDA.


Papers
Journal Article
TL;DR: A sentiment classification method called AS LDA is introduced, which assumes that words in subjective documents consist of two parts: sentiment element words and auxiliary words, which are sampled from sentiment topics and auxiliary topics, respectively.

32 citations

Proceedings Article
01 Jan 2009
TL;DR: A new extension of the CTM method to enable modeling with multi-field topics in a global graphical structure, and a mean-field variational algorithm to allow joint learning of multinomial topic models from discrete data and Gaussian-style topic models for real-valued data are proposed.
Abstract: Popular methods for probabilistic topic modeling like the Latent Dirichlet Allocation (LDA, [1]) and Correlated Topic Models (CTM, [2]) share an important property, i.e., using a common set of topics to model all the data. This property can be too restrictive for modeling complex data entries where multiple fields of heterogeneous data jointly provide rich information about each object or event. We propose a new extension of the CTM method to enable modeling with multi-field topics in a global graphical structure, and a mean-field variational algorithm to allow joint learning of multinomial topic models from discrete data and Gaussian-style topic models for real-valued data. We conducted experiments with both simulated and real data, and observed that the multi-field CTM outperforms a conventional CTM in both likelihood maximization and perplexity reduction. A deeper analysis of the simulated data reveals that the superior performance is the result of successful discovery of the mapping among field-specific topics and observed data.

32 citations

Journal Article
TL;DR: This paper proposes an unsupervised approach for aspect term extraction, a guided Latent Dirichlet Allocation (LDA) model that uses minimal aspect seed words from each aspect category to guide the model in identifying the hidden topics of interest to the user.
Abstract: Aspect level sentiment analysis is a fine-grained task in sentiment analysis. It extracts aspects and their corresponding sentiment polarity from opinionated text. The first subtask of identifying the opinionated aspects is called aspect extraction, which is the focus of the work. Social media platforms are an enormous resource of unlabeled data. However, data annotation for fine-grained tasks is quite expensive and laborious. Hence unsupervised models would be highly appreciated. The proposed model is an unsupervised approach for aspect term extraction, a guided Latent Dirichlet Allocation (LDA) model that uses minimal aspect seed words from each aspect category to guide the model in identifying the hidden topics of interest to the user. The guided LDA model is enhanced by guiding inputs using regular expressions based on linguistic rules. The model is further enhanced by multiple pruning strategies, including a BERT-based semantic filter, which incorporates semantics to strengthen situations where co-occurrence statistics might fail to serve as a differentiator. The thresholds for these semantic filters have been estimated using a Particle Swarm Optimization strategy. The proposed model is expected to overcome the disadvantage of basic LDA models that fail to differentiate the overlapping topics that represent each aspect category. The work has been evaluated on the restaurant domain of the SemEval 2014, 2015 and 2016 datasets and has reported an F-measure of 0.81, 0.74 and 0.75 respectively, which is competitive in comparison to the state-of-the-art unsupervised baselines and appreciable even with respect to the supervised baselines.

32 citations
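
The guided LDA abstract above describes steering topics with a few seed words per aspect category. One common way to do this, assumed here and not confirmed as the paper's method, is to give seed words a larger pseudo-count in the topic-word Dirichlet prior of their category. The sketch below only builds such a prior; the function names, the vocabulary, the seed lists, and the `base` and `boost` values are all hypothetical.

```python
def seeded_topic_word_prior(vocab, seed_words, base=0.01, boost=1.0):
    """Build one Dirichlet pseudo-count row per topic.

    Every word starts at `base`; a seed word gets `base + boost`
    in the row of the topic it is meant to anchor.
    """
    index = {word: i for i, word in enumerate(vocab)}
    prior = []
    for seeds in seed_words:
        row = [base] * len(vocab)
        for word in seeds:
            if word in index:  # ignore seeds missing from the vocabulary
                row[index[word]] += boost
        prior.append(row)
    return prior


def expected_topic_word(prior):
    """Mean of each Dirichlet row, i.e. the prior word distribution per topic."""
    return [[value / sum(row) for value in row] for row in prior]


# Hypothetical restaurant-domain vocabulary and aspect seeds (illustrative only).
vocab = ["pizza", "pasta", "waiter", "service", "price", "cheap"]
seed_words = [["pizza", "pasta"], ["waiter", "service"], ["price"]]

prior = seeded_topic_word_prior(vocab, seed_words)
expected = expected_topic_word(prior)
```

Before any data is seen, each topic already places most of its prior mass on its own seed words, which is what nudges a sampler or variational procedure toward the aspect categories of interest.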

Proceedings Article
23 Oct 2009
TL;DR: It is observed that, on almost all metrics, information gain performs best at all keyword numbers, while the LDA-based metrics perform similarly to chi-square and document frequency thresholding.
Abstract: Text categorization is the task of automatically assigning unlabeled text documents to some predefined category labels by means of an induction algorithm. Since the data in text categorization are high-dimensional, feature selection is broadly used in text categorization systems for reducing the dimensionality. In the literature, there are some widely known metrics such as information gain and document frequency thresholding. Recently, a generative graphical model called latent Dirichlet allocation (LDA), which can be used to model and discover the underlying topic structures of textual data, was proposed. In this paper, we use the hidden topic analysis of LDA for feature selection and compare it with the classical feature selection metrics in text categorization. For the experiments, we use SVM as the classifier and tf∗idf weighting for weighting the terms. We observed that, on almost all metrics, information gain performs best at all keyword numbers, while the LDA-based metrics perform similarly to chi-square and document frequency thresholding.

32 citations
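
The feature selection abstract above compares LDA-based metrics against information gain. As a minimal sketch of the information gain baseline only, the code below scores each term by how much knowing its presence or absence reduces class entropy. The toy documents, labels, and helper names are hypothetical and are not taken from the paper's experiments.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)]."""
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    h_class = entropy(list(Counter(labels).values()))
    h_conditional = 0.0
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:  # skip an empty partition
            h_conditional += (size / n) * entropy(list(part.values()))
    return h_class - h_conditional


# Toy corpus: each document is a set of terms (illustrative only).
docs = [
    {"goal", "match", "team"},
    {"goal", "league", "team"},
    {"election", "vote", "team"},
    {"election", "party", "vote"},
]
labels = ["sport", "sport", "politics", "politics"]

vocab = sorted(set().union(*docs))
scores = {term: information_gain(docs, labels, term) for term in vocab}
top_terms = sorted(vocab, key=lambda t: scores[t], reverse=True)[:3]
```

In this toy corpus, terms that appear in only one class (such as "goal" or "vote") score highest, while "team", which occurs in both classes, scores lower. Keeping the top-scoring terms is the dimensionality reduction step that precedes classification.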

Posted Content
04 Aug 2020, medRxiv
TL;DR: An intelligent clustering-based classification and topic extraction model (named TClustVID) that analyzes COVID-19-related public tweets to extract significant sentiments with high accuracy, and that showed higher performance compared to the traditional classifiers determined by clustering criteria.
Abstract: COVID-19, caused by SARS-CoV-2, varies greatly in its severity but presents serious respiratory symptoms with vascular and other complications, particularly in older adults. The disease can be spread by both symptomatic and asymptomatic infected individuals, uncertainty remains over key aspects of its infectivity, no effective remedy yet exists, and the disease causes severe economic effects globally. For these reasons, COVID-19 is the subject of intense and widespread discussion on social media platforms including Facebook and Twitter. These public forums substantially impact public opinion in some cases and exacerbate widespread panic and the spread of misinformation during the crisis. Thus, this work aimed to design an intelligent clustering-based classification and topic extraction model (named TClustVID) that analyzes COVID-19-related public tweets to extract significant sentiments with high accuracy. We gathered COVID-19 Twitter datasets from the IEEE Dataport repository and employed a range of data preprocessing methods to clean the raw data, then applied tokenization and produced a word-to-index dictionary. Thereafter, different classification methods were applied to the Twitter datasets, which enabled exploration of the performance of traditional and TClustVID classification methods. TClustVID showed higher performance compared to the traditional classifiers determined by clustering criteria. Finally, we extracted significant topic clusters from TClustVID, split them into positive, neutral and negative clusters and implemented latent Dirichlet allocation for extraction of popular COVID-19 topics. This approach identified common prevailing public opinions and concerns related to COVID-19, as well as attitudes to infection prevention strategies held by people from different countries concerning the current pandemic situation.

32 citations


Network Information
Related Topics (5)
Cluster analysis: 146.5K papers, 2.9M citations, 86% related
Support vector machine: 73.6K papers, 1.7M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Feature extraction: 111.8K papers, 2.1M citations, 84% related
Convolutional neural network: 74.7K papers, 2M citations, 83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    323
2022    842
2021    418
2020    429
2019    473
2018    446