Journal ArticleDOI

Latent dirichlet allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
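The generative view in the abstract can be illustrated with a small sketch. It assumes a toy vocabulary, two hand-written topics, and a fixed document length; the names `sample_dirichlet`, `sample_categorical`, and `generate_document` are hypothetical helpers, not code from the paper. Each document draws its own topic proportions from a Dirichlet prior, then every word is generated by first picking a topic and then picking a word from that topic.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, topics, vocab, n_words, rng):
    """Generate one document: per-document topic proportions, then one topic per word."""
    theta = sample_dirichlet(alpha, rng)        # topic proportions for this document
    assignments, words = [], []
    for _ in range(n_words):                    # fixed length is a simplifying assumption
        z = sample_categorical(theta, rng)      # latent topic for this word
        w = sample_categorical(topics[z], rng)  # word drawn from the chosen topic
        assignments.append(z)
        words.append(vocab[w])
    return theta, assignments, words


# Toy setup (illustrative values only).
rng = random.Random(0)
vocab = ["gene", "dna", "cell", "ball", "team", "score"]
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "biology" topic
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # a "sports" topic
]
alpha = [0.5, 0.5]
theta, assignments, words = generate_document(alpha, topics, vocab, 8, rng)
```

Here `theta` plays the role of the document's explicit low-dimensional representation mentioned in the abstract; inference in a real system runs in the opposite direction, recovering `theta` and the topics from observed words.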

Citations
Journal ArticleDOI
TL;DR: A novel multiscale and hierarchical framework is introduced, which describes the classification of TLS point clouds of cluttered urban scenes, and novel features of point clusters are constructed by employing the latent Dirichlet allocation (LDA).
Abstract: The effective extraction of shape features is an important requirement for the accurate and efficient classification of terrestrial laser scanning (TLS) point clouds. However, the challenge of how to obtain robust and discriminative features from noisy and varying density TLS point clouds remains. This paper introduces a novel multiscale and hierarchical framework, which describes the classification of TLS point clouds of cluttered urban scenes. In this framework, we propose multiscale and hierarchical point clusters (MHPCs). In MHPCs, point clouds are first resampled into different scales. Then, the resampled data set of each scale is aggregated into several hierarchical point clusters, where the point cloud of all scales in each level is termed a point-cluster set. This representation not only accounts for the multiscale properties of point clouds but also well captures their hierarchical structures. Based on the MHPCs, novel features of point clusters are constructed by employing the latent Dirichlet allocation (LDA). An LDA model is trained according to a training set. The LDA model then extracts a set of latent topics, i.e., a feature of topics, for a point cluster. Finally, to apply the introduced features for point-cluster classification, we train an AdaBoost classifier in each point-cluster set and obtain the corresponding classifiers to separate the TLS point clouds with varying point density and data missing into semantic regions. Compared with other methods, our features achieve the best classification results for buildings, trees, people, and cars from TLS point clouds, particularly for small and moving objects, such as people and cars.

157 citations


Cites methods from "Latent dirichlet allocation"

  • ...In contrast, we classify the point clouds on the point clusters with the help of our novel latent Dirichlet allocation (LDA)-based features, instead of relying on explicit object separation....

    [...]

Journal ArticleDOI
TL;DR: This study presents a new large-scale sentiment data set COVIDSENTI, which consists of 90 000 COVID-19-related tweets collected in the early stages of the pandemic, from February to March 2020 and supports the view that there is a need to develop a proactive and agile public health presence to combat the spread of negative sentiment on social media following a pandemic.
Abstract: Social media (and the world at large) have been awash with news of the COVID-19 pandemic. With the passage of time, news and awareness about COVID-19 spread like the pandemic itself, with an explosion of messages, updates, videos, and posts. Mass hysteria manifested as another concern in addition to the health risk that COVID-19 presented. Predictably, public panic soon followed, mostly due to misconceptions, a lack of information, or sometimes outright misinformation about COVID-19 and its impacts. It is thus timely and important to conduct an ex post facto assessment of the early information flows during the pandemic on social media, as well as a case study of evolving public opinion on social media which is of general interest. This study aims to inform policy that can be applied to social media platforms; for example, determining what degree of moderation is necessary to curtail misinformation on social media. This study also analyzes views concerning COVID-19 by focusing on people who interact and share social media on Twitter. As a platform for our experiments, we present a new large-scale sentiment data set COVIDSENTI, which consists of 90 000 COVID-19-related tweets collected in the early stages of the pandemic, from February to March 2020. The tweets have been labeled into positive, negative, and neutral sentiment classes. We analyzed the collected tweets for sentiment classification using different sets of features and classifiers. Negative opinion played an important role in conditioning public sentiment; for instance, we observed that people favored lockdown earlier in the pandemic; however, as expected, sentiment shifted by mid-March. Our study supports the view that there is a need to develop a proactive and agile public health presence to combat the spread of negative sentiment on social media following a pandemic.

157 citations

Proceedings ArticleDOI
27 Feb 2016
TL;DR: The first study quantifying levels of mental illness severity (MIS) in social media is presented; it examines a set of users on Instagram who post content on pro-eating disorder tags and finds that the proportion of users whose content expresses high MIS has been on the rise since 2012.
Abstract: Social media sites have struggled with the presence of emotional and physical self-injury content. Individuals who share such content are often challenged with severe mental illnesses like eating disorders. We present the first study quantifying levels of mental illness severity (MIS) in social media. We examine a set of users on Instagram who post content on pro-eating disorder tags (26M posts from 100K users). Our novel statistical methodology combines topic modeling and novice/clinician annotations to infer MIS in a user's content. Alarmingly, we find that proportion of users whose content expresses high MIS have been on the rise since 2012 (13%/year increase). Previous MIS in a user's content over seven months can predict future risk with 81% accuracy. Our model can also forecast MIS levels up to eight months in the future with performance better than baseline. We discuss the health outcomes and design implications as well as ethical considerations of this line of research.

157 citations

Journal ArticleDOI
TL;DR: A unified probabilistic generative model, the Topic-Region Model (TRM), is proposed to simultaneously discover the semantic, temporal, and spatial patterns of users’ check-in activities, and to model their joint effect on users’ decision making for selection of POIs to visit.
Abstract: Point-of-Interest (POI) recommendation has become an important means to help people discover attractive and interesting places, especially when users travel out of town. However, the extreme sparsity of a user-POI matrix creates a severe challenge. To cope with this challenge, we propose a unified probabilistic generative model, the Topic-Region Model (TRM), to simultaneously discover the semantic, temporal, and spatial patterns of users’ check-in activities, and to model their joint effect on users’ decision making for selection of POIs to visit. To demonstrate the applicability and flexibility of TRM, we investigate how it supports two recommendation scenarios in a unified way, that is, hometown recommendation and out-of-town recommendation. TRM effectively overcomes data sparsity by the complementarity and mutual enhancement of the diverse information associated with users’ check-in activities (e.g., check-in content, time, and location) in the processes of discovering heterogeneous patterns and producing recommendations. To support real-time POI recommendations, we further extend the TRM model to an online learning model, TRM-Online, to track changing user interests and speed up the model training. In addition, based on the learned model, we propose a clustering-based branch and bound algorithm (CBB) to prune the POI search space and facilitate fast retrieval of the top-k recommendations. We conduct extensive experiments to evaluate the performance of our proposals on two real-world datasets, including recommendation effectiveness, overcoming the cold-start problem, recommendation efficiency, and model-training efficiency. The experimental results demonstrate the superiority of our TRM models, especially TRM-Online, compared with state-of-the-art competitive methods, by making more effective and efficient mobile recommendations. 
In addition, we study the importance of each type of pattern in the two recommendation scenarios, respectively, and find that exploiting temporal patterns is most important for the hometown recommendation scenario, while the semantic patterns play a dominant role in improving the recommendation effectiveness for out-of-town users.

157 citations


Cites background or methods from "Latent dirichlet allocation"

  • ...In the standard topic models [Blei et al. 2003; Wallach et al. 2009], a document (i.e., a bag of words) contains a mixture of topics, represented by a topic distribution, and each word has a hidden topic label....

    [...]

  • ...Most existing researches on LDA-like model utilize various inference algorithms, such as variational Bayesian [Blei et al. 2003; Foulds et al. 2013], Gibbs sampling [Griffiths and Steyvers 2004], expectation propagation [Minka and Lafferty 2002] and belief propagation [Zeng et al....

    [...]

  • ...To avoid overfitting, we place a Dirichlet prior [Blei et al. 2003; Wallach et al. 2009] over each multinomial distribution....

    [...]

Proceedings ArticleDOI
TL;DR: This work extends the DeepCoNN model by introducing an additional latent layer representing the target user-target item pair, and shows that TransNets and extensions of it improve substantially over the previous state-of-the-art performance on recommendation tasks.
Abstract: Recently, deep learning methods have been shown to improve the performance of recommender systems over traditional methods, especially when review text is available. For example, a recent model, DeepCoNN, uses neural nets to learn one latent representation for the text of all reviews written by a target user, and a second latent representation for the text of all reviews for a target item, and then combines these latent representations to obtain state-of-the-art performance on recommendation tasks. We show that (unsurprisingly) much of the predictive value of review text comes from reviews of the target user for the target item. We then introduce a way in which this information can be used in recommendation, even when the target user's review for the target item is not available. Our model, called TransNets, extends the DeepCoNN model by introducing an additional latent layer representing the target user-target item pair. We then regularize this layer, at training time, to be similar to another latent representation of the target user's review of the target item. We show that TransNets and extensions of it improve substantially over the previous state-of-the-art.

157 citations


Cites methods from "Latent dirichlet allocation"

  • ...the rating is sampled from a Gaussian mixture. The Collaborative Topic Regression (CTR) model proposed in [39] is a content based approach, as opposed to a context / review based approach. It uses LDA [5] to model the text of documents (scientific articles), and a combination of MF and content based model for recommendation. The Rating-boosted Latent Topics (RBLT) model of [37] uses a simple technique o...

    [...]

References
Book
01 Jan 1995
TL;DR: This book covers the fundamentals of Bayesian inference and Bayesian data analysis, advanced computation including Markov chain simulation, regression models, and nonlinear and nonparametric models.
Abstract: FUNDAMENTALS OF BAYESIAN INFERENCE: Probability and Inference; Single-Parameter Models; Introduction to Multiparameter Models; Asymptotics and Connections to Non-Bayesian Approaches; Hierarchical Models. FUNDAMENTALS OF BAYESIAN DATA ANALYSIS: Model Checking; Evaluating, Comparing, and Expanding Models; Modeling Accounting for Data Collection; Decision Analysis. ADVANCED COMPUTATION: Introduction to Bayesian Computation; Basics of Markov Chain Simulation; Computationally Efficient Markov Chain Simulation; Modal and Distributional Approximations. REGRESSION MODELS: Introduction to Regression Models; Hierarchical Linear Models; Generalized Linear Models; Models for Robust Inference; Models for Missing Data. NONLINEAR AND NONPARAMETRIC MODELS: Parametric Nonlinear Models; Basis Function Models; Gaussian Process Models; Finite Mixture Models; Dirichlet Process Models. APPENDICES: A: Standard Probability Distributions; B: Outline of Proofs of Asymptotic Theorems; C: Computation in R and Stan. Bibliographic Notes and Exercises appear at the end of each chapter.

16,079 citations


"Latent dirichlet allocation" refers background in this paper

  • ...Finally, Griffiths and Steyvers (2002) have presented a Markov chain Monte Carlo algorithm for LDA....

    [...]

  • ...Structures similar to that shown in Figure 1 are often studied in Bayesian statistical modeling, where they are referred to as hierarchical models (Gelman et al., 1995), or more precisely as conditionally independent hierarchical models (Kass and Steffey, 1989)....

    [...]

Journal ArticleDOI
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
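The retrieval pipeline described in this abstract can be sketched in a few lines, assuming NumPy is available and using a tiny hand-made term-by-document matrix with 2 factors instead of ca. 100. The function names (`lsi_fit`, `cosine`, `retrieve`) and the particular scaling of the query vector are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def lsi_fit(term_doc, k):
    """Truncated SVD of a term-by-document matrix, keeping k factors."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    u_k, s_k = u[:, :k], s[:k]
    doc_vecs = vt[:k, :].T * s_k  # one row of factor weights per document
    return u_k, doc_vecs


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def retrieve(query_counts, u_k, doc_vecs, threshold):
    """Fold the query in as a pseudo-document and return supra-threshold documents."""
    q = query_counts @ u_k  # project query term counts onto the k factors
    sims = [cosine(q, d) for d in doc_vecs]
    hits = [j for j, sim in enumerate(sims) if sim > threshold]
    return hits, sims


# Rows are terms, columns are documents (toy data for illustration).
term_doc = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
u_k, doc_vecs = lsi_fit(term_doc, k=2)
query = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # a query containing only term 0
hits, sims = retrieve(query, u_k, doc_vecs, threshold=0.5)
```

In this toy corpus the first two documents share vocabulary with the query and the last two do not, so only the first two should clear the threshold. Note that the second document is matched through the shared factor rather than through exact term overlap alone.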

12,443 citations


"Latent dirichlet allocation" refers methods in this paper

  • ...To address these shortcomings, IR researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester et al., 1990)....

    [...]

Book
01 Jan 1983
TL;DR: Reading is a need and a hobby at once and this condition is the on that will make you feel that you must read.
Abstract: Some people may be laughing when looking at you reading in your spare time. Some may be admired of you. And some may want be like you who have reading hobby. What about your own feel? Have you felt right? Reading is a need and a hobby at once. This condition is the on that will make you feel that you must read. If you know are looking for the book enPDFd introduction to modern information retrieval as the choice of reading, you can find here.

12,059 citations


"Latent dirichlet allocation" refers background or methods in this paper

  • ...In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word....

    [...]

  • ...We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model....

    [...]
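The per-document word counts mentioned in the tf-idf excerpt above can be sketched as follows. The excerpt only describes the counting step; the weighting used here (raw count times the log of inverse document frequency) is one common variant and is an assumption, as is the helper name `tf_idf`.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Weight each term count by log(N / document frequency) for a tokenized corpus."""
    n_docs = len(corpus)
    counts = [Counter(doc) for doc in corpus]  # occurrences of each word per document
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())              # number of documents containing each term
    return [
        {term: n * math.log(n_docs / doc_freq[term]) for term, n in c.items()}
        for c in counts
    ]


# Toy corpus of pre-tokenized documents (illustrative only).
corpus = [
    ["topic", "model", "topic"],
    ["model", "text"],
    ["text", "corpus", "text"],
]
weights = tf_idf(corpus)
```

Terms that are frequent in one document but rare across the corpus receive the largest weights, and a term that appears in every document would receive weight zero under this variant.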

Book
01 Jan 1939
TL;DR: This book covers fundamental notions of probability, direct probabilities, estimation problems, approximate methods and simplifications, significance tests (for one new parameter and for various complications), and frequency definitions and direct methods.
Abstract: 1. Fundamental notions 2. Direct probabilities 3. Estimation problems 4. Approximate methods and simplifications 5. Significance tests: one new parameter 6. Significance tests: various complications 7. Frequency definitions and direct methods 8. General questions

7,086 citations