Journal Article

Latent Dirichlet allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
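The three-level structure described in the abstract (corpus-level parameters, per-document topic proportions, per-word topic assignments) can be illustrated with a minimal sampling sketch. This is only an illustration of the generative side of the model, not the paper's variational inference or EM procedure; the function names, the toy topics, and the Dirichlet parameter values below are assumptions chosen for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, topics, n_words, rng):
    """Generate one document under an LDA-style generative process.

    alpha  : Dirichlet parameter, one entry per topic (illustrative values)
    topics : topics[k][v] is the probability of word id v under topic k
    """
    # Per-document topic proportions (the "topic probabilities").
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(alpha)))
    word_ids = list(range(len(topics[0])))
    words = []
    for _ in range(n_words):
        # Choose a topic for this word, then a word from that topic.
        z = rng.choices(topic_ids, weights=theta)[0]
        w = rng.choices(word_ids, weights=topics[z])[0]
        words.append(w)
    return theta, words


# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
toy_topics = [
    [0.5, 0.4, 0.05, 0.05],  # topic 0 favors word ids 0 and 1
    [0.05, 0.05, 0.4, 0.5],  # topic 1 favors word ids 2 and 3
]
theta, words = generate_document([0.5, 0.5], toy_topics, 20, random.Random(0))
```

In this sketch `theta` plays the role of the explicit low-dimensional document representation mentioned in the abstract; inference would run the other way, estimating `theta` and the topics from observed words.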


Citations
Journal Article
TL;DR: Experimental results demonstrate that the usage of tag information can significantly improve accuracy, diversification and novelty of recommendations.
Abstract: Personalized recommender systems are confronting great challenges of accuracy, diversification and novelty, especially when the data set is sparse and lacks accessorial information, such as user profiles, item attributes and explicit ratings. Collaborative tags contain rich information about personalized preferences and item contents, and are therefore potential to help in providing better recommendations. In this article, we propose a recommendation algorithm based on an integrated diffusion on user–item–tag tripartite graphs. We use three benchmark data sets, Del.icio.us, MovieLens and BibSonomy, to evaluate our algorithm. Experimental results demonstrate that the usage of tag information can significantly improve accuracy, diversification and novelty of recommendations.

231 citations


Cites methods from "Latent dirichlet allocation"

  • ...eworks of collaborative filtering [11] and iterative diffusion algorithm [9], as well as some more complicated methods such as Probabilistic Latent Semantic Analysis [36], Latent Dirichlet Allocation [37] and Iterative Latent Semantic Analysis [38]. Systematic investigation on tag-aware recommendation algorithms must be very helpful in the future design of recommender systems. 5. Acknowledgement We ack...

    [...]

Proceedings Article
Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, Weizhu Chen
16 Jul 2011
TL;DR: This paper develops a Bayesian inference mechanism to conceptualize words and short text by using a probabilistic knowledgebase that is as rich as the authors' mental world in terms of the concepts it contains and brings significant improvements in short text understanding.
Abstract: Most text mining tasks, including clustering and topic detection, are based on statistical methods that treat text as bags of words. Semantics in the text is largely ignored in the mining process, and mining results often have low interpretability. One particular challenge faced by such approaches lies in short text understanding, as short texts lack enough content from which statistical conclusions can be drawn easily. In this paper, we improve text understanding by using a probabilistic knowledgebase that is as rich as our mental world in terms of the concepts (of worldly facts) it contains. We then develop a Bayesian inference mechanism to conceptualize words and short text. We conducted comprehensive experiments on conceptualizing textual terms, and clustering short pieces of text such as Twitter messages. Compared to purely statistical methods such as latent semantic topic modeling or methods that use existing knowledge-bases (e.g., WordNet, Freebase and Wikipedia), our approach brings significant improvements in short text understanding as reflected by the clustering accuracy.

231 citations


Cites background from "Latent dirichlet allocation"

  • ...Statistical topic modeling [Blei et al., 2003; Blei and Lafferty, 2009] also requires sufficient words in a document to infer the document topic distribution....

    [...]

  • ...Compared with traditional latent semantic analysis (LSA) [Deerwester et al., 1990] and topic modeling such as latent Dirichlet allocation (LDA) [Blei et al., 2003], explicit semantic analysis (ESA) has the advantage of providing semantics that are interpretable by human beings....

    [...]

Journal Article
TL;DR: This paper represents a complete, multilateral and systematic review of opinion mining and sentiment analysis to classify available methods and compare their advantages and drawbacks, in order to have better understanding of available challenges and solutions to clarify the future direction.
Abstract: Opinion mining is considered as a subfield of natural language processing, information retrieval and text mining. Opinion mining is the process of extracting human thoughts and perceptions from unstructured texts, which with regard to the emergence of online social media and mass volume of users’ comments, has become to a useful, attractive and also challenging issue. There are varieties of researches with different trends and approaches in this area, but the lack of a comprehensive study to investigate them from all aspects is tangible. In this paper we represent a complete, multilateral and systematic review of opinion mining and sentiment analysis to classify available methods and compare their advantages and drawbacks, in order to have better understanding of available challenges and solutions to clarify the future direction. For this purpose, we present a proper framework of opinion mining accompanying with its steps and levels and then we completely monitor, classify, summarize and compare proposed techniques for aspect extraction, opinion classification, summary production and evaluation, based on the major validated scientific works. In order to have a better comparison, we also propose some factors in each category, which help to have a better understanding of advantages and disadvantages of different methods.

231 citations


Cites methods from "Latent dirichlet allocation"

  • ...Their results show that systems based on LDA provide useful information about their staff members....

    [...]

  • ...Ma et al. (2015) proposed an approach of probabilistic topic model based on LDA in order to semantic search over citizens opinions about city issues on online platforms....

    [...]

  • ...Also LDA and LSA use the bag of words represented in documents, so they can be used only in document level opinion mining....

    [...]

  • ...Since this approach, uses statistical methods like latent semantic analysis (LSA) (Hofmann 1999) and latent Dirichlet allocation (LDA) (Blei et al. 2003), it is called statistical models too....

    [...]

Journal Article
08 Oct 2018 - PLOS ONE
TL;DR: This paper describes the collection and fine-grained annotation of a cyberbullying corpus for English and Dutch and performs a series of binary classification experiments to determine the feasibility of automatic cyberbullying detection.
Abstract: While social media offer great communication opportunities, they also increase the vulnerability of young people to threatening situations online. Recent studies report that cyberbullying constitutes a growing problem among youngsters. Successful prevention depends on the adequate detection of potentially harmful messages and the information overload on the Web requires intelligent systems to identify potential risks automatically. The focus of this paper is on automatic cyberbullying detection in social media text by modelling posts written by bullies, victims, and bystanders of online bullying. We describe the collection and fine-grained annotation of a cyberbullying corpus for English and Dutch and perform a series of binary classification experiments to determine the feasibility of automatic cyberbullying detection. We make use of linear support vector machines exploiting a rich feature set and investigate which information sources contribute the most for the task. Experiments on a hold-out test set reveal promising results for the detection of cyberbullying-related posts. After optimisation of the hyperparameters, the classifier yields an F1 score of 64% and 61% for English and Dutch respectively, and considerably outperforms baseline systems.

231 citations


Cites methods from "Latent dirichlet allocation"

  • ...• Topic model features: by making use of the Gensim topic modelling library [80], several LDA [81] and LSI [82] topic models with varying granularity (k = 20, 50, 100 and 200) were trained on data corresponding to each fine-grained category of a cyberbullying event (e.g. threats, defamations, insults, defenses)....

    [...]

  • ...• Topic model features: by making use of the Gensim topic modelling library (Rehurek & Sojka, 2010), several LDA (Blei et al., 2003) and LSI (Deerwester et al....

    [...]

Journal Article
TL;DR: A meta-analysis across nearly 10,000 fMRI studies to comprehensively map psychological states to discrete subregions in medial frontal cortex using relatively unbiased data-driven methods provides hypotheses about the functional organization of medial prefrontal cortex that can be tested explicitly in future studies.
Abstract: The functional organization of human medial frontal cortex (MFC) is a subject of intense study. Using fMRI, the MFC has been associated with diverse psychological processes, including motor function, cognitive control, affect, and social cognition. However, there have been few large-scale efforts to comprehensively map specific psychological functions to subregions of medial frontal anatomy. Here we applied a meta-analytic data-driven approach to nearly 10,000 fMRI studies to identify putatively separable regions of MFC and determine which psychological states preferentially recruit their activation. We identified regions at several spatial scales on the basis of meta-analytic coactivation, revealing three broad functional zones along a rostrocaudal axis composed of 2–4 smaller subregions each. Multivariate classification analyses aimed at identifying the psychological functions most strongly predictive of activity in each region revealed a tripartite division within MFC, with each zone displaying a relatively distinct functional signature. The posterior zone was associated preferentially with motor function, the middle zone with cognitive control, pain, and affect, and the anterior with reward, social processing, and episodic memory. Within each zone, the more fine-grained subregions showed distinct, but subtler, variations in psychological function. These results provide hypotheses about the functional organization of medial prefrontal cortex that can be tested explicitly in future studies. SIGNIFICANCE STATEMENT Activation of medial frontal cortex in fMRI studies is associated with a wide range of psychological states ranging from cognitive control to pain. However, this high rate of activation makes it challenging to determine how these various processes are topologically organized across medial frontal anatomy. 
We conducted a meta-analysis across nearly 10,000 studies to comprehensively map psychological states to discrete subregions in medial frontal cortex using relatively unbiased data-driven methods. This approach revealed three distinct zones that differed substantially in function, each of which were further subdivided into 2–4 smaller subregions that showed additional functional variation. Each individual region was recruited by multiple psychological states, suggesting subregions of medial frontal cortex are functionally heterogeneous.

231 citations


Cites methods from "Latent dirichlet allocation"

  • ...To remedy this problem, we used a reduced semantic representation of the latent conceptual structure underlying the neuroimaging literature: a set of 60 topics derived using latent dirichlet allocation topic modeling (Blei et al., 2003)....

    [...]

References
Book
01 Jan 1995
TL;DR: This book covers the fundamentals of Bayesian inference and data analysis, advanced computation including Markov chain simulation, regression models, and nonlinear and nonparametric models.
Abstract: FUNDAMENTALS OF BAYESIAN INFERENCE: Probability and Inference; Single-Parameter Models; Introduction to Multiparameter Models; Asymptotics and Connections to Non-Bayesian Approaches; Hierarchical Models. FUNDAMENTALS OF BAYESIAN DATA ANALYSIS: Model Checking; Evaluating, Comparing, and Expanding Models; Modeling Accounting for Data Collection; Decision Analysis. ADVANCED COMPUTATION: Introduction to Bayesian Computation; Basics of Markov Chain Simulation; Computationally Efficient Markov Chain Simulation; Modal and Distributional Approximations. REGRESSION MODELS: Introduction to Regression Models; Hierarchical Linear Models; Generalized Linear Models; Models for Robust Inference; Models for Missing Data. NONLINEAR AND NONPARAMETRIC MODELS: Parametric Nonlinear Models; Basis Function Models; Gaussian Process Models; Finite Mixture Models; Dirichlet Process Models. APPENDICES: A: Standard Probability Distributions; B: Outline of Proofs of Asymptotic Theorems; C: Computation in R and Stan. Bibliographic Notes and Exercises appear at the end of each chapter.

16,079 citations


"Latent dirichlet allocation" refers background in this paper

  • ...Finally, Griffiths and Steyvers (2002) have presented a Markov chain Monte Carlo algorithm for LDA....

    [...]

  • ...Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as conditionally independent hierarchical models (Kass and Steffey, 1989)....

    [...]

Journal Article
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
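The decomposition described in this abstract can be sketched on a toy term-by-document matrix. The code below is a rough illustration under stated assumptions: it keeps 2 factors rather than ca. 100, the matrix is invented for the example, and the singular triplets are approximated by power iteration with deflation as a dependency-free stand-in for a proper SVD routine. The helper names are hypothetical.

```python
import math
import random


def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def top_singular_triplets(A, k, iters=500, seed=0):
    """Approximate the k largest singular triplets (sigma, u, v) of A by
    power iteration on A^T A with deflation. Assumes rank(A) >= k."""
    rng = random.Random(seed)
    A = [row[:] for row in A]
    triplets = []
    for _ in range(k):
        At = transpose(A)
        v = [rng.gauss(0, 1) for _ in range(len(A[0]))]
        for _ in range(iters):
            v = matvec(At, matvec(A, v))
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v]
        Av = matvec(A, v)
        sigma = math.sqrt(sum(x * x for x in Av))
        u = [x / sigma for x in Av]
        triplets.append((sigma, u, v))
        # Deflate: subtract the rank-one component just found.
        A = [[A[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
             for i in range(len(u))]
    return triplets


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy term-by-document counts (rows: car, auto, engine, flower, petal).
A = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
factors = top_singular_triplets(A, k=2)
# Each document becomes a short vector of factor weights (sigma_i * v_i[j]).
doc_vectors = [[s * v[j] for s, _, v in factors] for j in range(len(A[0]))]
sim_car_auto = cosine(doc_vectors[0], doc_vectors[1])    # docs 0 and 1 share only "engine"
sim_car_flower = cosine(doc_vectors[0], doc_vectors[2])  # docs 0 and 2 share no terms
```

In the raw term space documents 0 and 1 have cosine 0.5, while in the reduced factor space they end up nearly parallel; this is the kind of higher-order association the abstract refers to.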

12,443 citations


"Latent dirichlet allocation" refers methods in this paper

  • ...To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990)....

    [...]

Book
01 Jan 1983

12,059 citations


"Latent dirichlet allocation" refers background or methods in this paper

  • ...In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word....

    [...]

  • ...We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model....

    [...]
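The tf-idf excerpt above describes the counting step only. A minimal sketch of one common tf-idf variant follows, assuming raw term counts multiplied by log(N / document frequency); the excerpt is truncated before any weighting is specified, so this weighting, the function name, and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Compute tf-idf weights for a list of tokenized documents.

    Returns one {term: weight} dict per document, where
    weight = (count of term in document) * log(N / number of documents containing term).
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus.
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["the", "bird"],
]
toy_weights = tfidf(toy_corpus)
```

A term that appears in every document ("the" here) receives weight zero under this variant, which is the discounting of common words that motivates the scheme.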

Book
01 Jan 1939
TL;DR: This book covers fundamental notions of probability, direct probabilities, estimation problems, approximate methods and simplifications, significance tests, and frequency definitions and direct methods.
Abstract: 1. Fundamental notions 2. Direct probabilities 3. Estimation problems 4. Approximate methods and simplifications 5. Significance tests: one new parameter 6. Significance tests: various complications 7. Frequency definitions and direct methods 8. General questions

7,086 citations