Abstract:
Twitter, or the world of 140 characters, poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, they are often less coherent when applied to microblog content like Twitter. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a variety of pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.
TL;DR: A simple, fast, and effective topic model for short texts, named GPU-DMM, based on the Dirichlet Multinomial Mixture model, which achieves comparable or better topic representations than state-of-the-art models, measured by topic coherence.
TL;DR: This article extended two Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus.
TL;DR: Two different Dirichlet multinomial topic models are extended by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus.
TL;DR: For example, this article used topic modeling to reveal phenomenon-based constructs and grounded conceptual relationships in textual documents. However, it did not consider the relationships between concepts in the documents.
TL;DR: A novel model integrating topic modeling with short text aggregation during topic inference is presented, founded on general topical affinity of texts rather than particular heuristics, making the model readily applicable to various short texts.
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
TL;DR: In this article, the authors present an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections.
TL;DR: Experimental results show that TwitterRank outperforms the one Twitter currently uses and other related algorithms, including the original PageRank and Topic-sensitive PageRank, which is proposed to measure the influence of users in Twitter.
Q1. What are the contributions in "Improving LDA topic models for microblogs via tweet pooling and automatic labeling"?
In this paper, the authors investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; they achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. The authors empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets, in comparison to an unmodified LDA baseline and a variety of pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics.
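The hashtag pooling idea can be sketched as a small preprocessing step. The function name and the handling of multi-hashtag tweets below are our assumptions for illustration, not details taken from the paper:

```python
import re
from collections import defaultdict

def pool_by_hashtag(tweets):
    """Pool tweets sharing a hashtag into one document per hashtag.

    Tweets without any hashtag are kept as singleton documents; a tweet
    with several hashtags is assumed (for this sketch) to join each of
    its hashtag pools.
    """
    pools = defaultdict(list)
    singletons = []
    for tweet in tweets:
        tags = re.findall(r"#(\w+)", tweet.lower())
        if tags:
            for tag in tags:
                pools[tag].append(tweet)
        else:
            singletons.append(tweet)
    # Each pooled document is the concatenation of its member tweets.
    return [" ".join(group) for group in pools.values()] + singletons

tweets = [
    "Great game tonight #nba",
    "Lakers win again #nba #basketball",
    "No hashtag here at all",
]
docs = pool_by_hashtag(tweets)  # pooled documents ready for LDA training
```

The pooled documents, rather than the raw tweets, would then be handed to a standard LDA implementation unchanged.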
Q2. Why did the authors remove tweets retrieved by more than one query?
The authors have removed tweets retrieved by more than one query in order to preserve uniqueness of tweet labels for later analysis with clustering metrics.
Q3. How many tweets were retrieved by more than one query?
Less than one percent of tweets were retrieved by more than one query, with the highest case of 4.6% overlap occurring in the generic dataset for the two queries “family” and “fun”.
Q4. Why do the authors use the same class label for some tweets?
This is because tweets with the same class label have a higher average TF-IDF similarity than tweets with a different class label, so pooling these tweets together makes more topic-aligned hashtag pools that aid cluster reconstruction.
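The TF-IDF similarity intuition can be illustrated with a stdlib-only sketch; the tokenization and the exact weighting scheme here are our assumptions, not the paper's:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a TF-IDF vector (term -> weight) for each tokenized document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse TF-IDF vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Two tweets about the same topic vs. one about a different topic.
docs = [["nba", "game", "win"],
        ["nba", "game", "score"],
        ["election", "vote", "poll"]]
vecs = tfidf_vectors(docs)
```

Here `cosine(vecs[0], vecs[1])` exceeds `cosine(vecs[0], vecs[2])`, mirroring the observation that same-topic tweets are more similar under TF-IDF.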
Q5. What is the motivation behind tweet pooling?
The motivation behind tweet pooling is that individual tweets are very short (≤ 140 characters) and hence treating each tweet as an individual document does not present adequate term co-occurrence data within documents.
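The co-occurrence argument can be made concrete with a toy count (the example tweets are illustrative only): pooling two tweets into one document adds cross-tweet term pairs that separate short documents never produce.

```python
from itertools import combinations

def cooccurrence_pairs(doc):
    """Distinct unordered term pairs co-occurring within one document."""
    return set(combinations(sorted(set(doc)), 2))

tweet_a = ["lakers", "win", "nba"]
tweet_b = ["celtics", "lose", "nba"]

# Treated as separate documents: only within-tweet pairs are observed.
separate = cooccurrence_pairs(tweet_a) | cooccurrence_pairs(tweet_b)
# Pooled into one document: cross-tweet pairs also appear.
pooled = cooccurrence_pairs(tweet_a + tweet_b)
```

With three terms per tweet, the separate documents yield 6 distinct pairs, while the pooled document yields 10, a strict superset.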
Q6. What is the goal of this paper?
The goal of this paper is to obtain better LDA topics from Twitter content without modifying the basic machinery of standard LDA.
Q7. What is the method of aggregating tweets?
Their first novel aggregation method, Burst Score-wise Pooling, aggregates the tweets for each burst-term into a single document for training LDA; the authors found τ = 5 to provide the best results.
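Assuming a precomputed burst score per term (the paper's actual burst score definition is not reproduced here), the aggregation step itself could look like this sketch:

```python
from collections import defaultdict

def pool_by_burst_terms(tweets, burst_score, tau=5):
    """Aggregate all tweets containing a bursting term (score > tau)
    into one training document per burst-term.

    `burst_score` is a hypothetical precomputed map from term to score;
    how that score is computed is outside this sketch.
    """
    pools = defaultdict(list)
    for tweet in tweets:
        for term in tweet.lower().split():
            if burst_score.get(term, 0) > tau:
                pools[term].append(tweet)
    return {term: " ".join(group) for term, group in pools.items()}

scores = {"earthquake": 9.0, "coffee": 1.2}
tweets = ["Earthquake hits the city",
          "Felt the earthquake too",
          "morning coffee"]
docs = pool_by_burst_terms(tweets, scores)
```

Only the bursting term "earthquake" clears the τ = 5 threshold here, so its two tweets form one pooled document while the "coffee" tweet forms none.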
Q8. How does the paper improve topic modeling?
This paper presented a way of aggregating tweets in order to improve the quality of LDA-based topic modeling in microblogs as measured by the ability of topics to reconstruct clusters and topic coherence.
Q9. What are the main metrics used in the evaluation of topic models?
Because there is no single method for evaluating topic models, the authors evaluate a range of metrics, including clustering metrics (purity and NMI) and a measure of semantic topic coherence and interpretability (PMI).
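The two clustering metrics can be computed from scratch; this is a minimal stdlib sketch of standard purity and NMI, not the authors' evaluation code (PMI coherence is omitted since it requires an external reference corpus):

```python
import math
from collections import Counter

def purity(clusters, labels):
    """Fraction of items assigned to the majority class of their cluster."""
    total = len(labels)
    correct = 0
    for c in set(clusters):
        members = [labels[i] for i in range(total) if clusters[i] == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / total

def nmi(clusters, labels):
    """Normalized mutual information between cluster and label assignments."""
    n = len(labels)
    pc = Counter(clusters)
    pl = Counter(labels)
    joint = Counter(zip(clusters, labels))
    mi = sum((m / n) * math.log((m * n) / (pc[c] * pl[l]))
             for (c, l), m in joint.items())
    hc = -sum((m / n) * math.log(m / n) for m in pc.values())
    hl = -sum((m / n) * math.log(m / n) for m in pl.values())
    return mi / math.sqrt(hc * hl) if hc and hl else 0.0

# A perfect cluster reconstruction scores 1.0 on both metrics.
clusters = [0, 0, 1, 1]
labels = ["a", "a", "b", "b"]
```

Because each tweet keeps its query (or hashtag) as a class label, reconstructing those classes from the learned topics can be scored exactly this way.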