
Improving LDA Topic Models for Microblogs
via Tweet Pooling and Automatic Labeling
Rishabh Mehrotra
BITS Pilani
Pilani, India
erishabh@gmail.com
Scott Sanner
NICTA & ANU
Canberra, Australia
Scott.Sanner@nicta.com.au
Wray Buntine
NICTA & ANU
Canberra, Australia
Wray.Buntine@nicta.com.au
Lexing Xie
ANU & NICTA
Canberra, Australia
lexing.xie@anu.edu.au
SIGIR’13, July 28–August 1, 2013, Dublin, Ireland.
Copyright 2013 ACM 978-1-4503-2034-4/13/07.
ABSTRACT
Twitter, or the world of 140 characters, poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, they are often less coherent when applied to microblog content like Twitter. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a variety of pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.
Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing—Text analysis; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering
Keywords
Topic modeling, LDA, Microblogs
1. INTRODUCTION
The “undirected informational” search task, where people seek to better understand the information available in document corpora, often uses techniques such as multi-document summarisation and topic modeling. Topic models uncover the salient patterns of a collection under the mixed-membership assumption: each document can exhibit multiple patterns to different extents. When analysing text, these patterns are represented as distributions over words, called topics. Probabilistic topic models such as Latent Dirichlet Allocation (LDA) [1] are a class of Bayesian latent variable models that have been adapted to model a diverse range of document genres.
We consider the application of LDA to Twitter content, which poses unique challenges different from much of standard NLP content. Posts are (1) short (140 characters or less), (2) mixed with contextual clues such as URLs, tags, and Twitter names, and (3) written in informal language with misspellings, acronyms and nonstandard abbreviations (e.g., O o haha wow). Hence, effectively modeling content on Twitter requires techniques that can readily adapt to this unwieldy data while requiring little supervision.
Unfortunately, it has been found that topic modeling techniques like LDA do not work well with the messy form of Twitter content [15]. Topics learned from LDA are formally a multinomial distribution over words, and by convention the top-10 words are used to identify the subject area or give an interpretation of a topic. The naïve application of LDA to Twitter content produces mostly incoherent topics. Table 1 shows examples of poor topic words obtained by directly applying LDA alongside topic words that are much more coherent and interpretable.
Table 1: Example Topic Words
Unpooled LDA (poor): barack cool apple health iphone / los barackobama video uma gop
Proposed approach (coherent): flu swine news pandemic health / death flight h1n1 vaccine confirmed
How can we extract better topics in microblogging environments with standard LDA? While linguistic “cleaning” of text could help somewhat, for instance [3], a complementary approach using LDA is also needed because there are so few words in a tweet. An intuitive solution to this problem is tweet pooling [13, 5]: merging related tweets together and presenting them as a single document to the LDA model. In this paper we examine existing tweet-pooling schemes to improve LDA topic quality and propose a few novel schemes. We compare the performance of these methods across three datasets, constructed to be representative of the diverse content collections in the microblog environment. We examine a variety of topic coherence evaluation metrics, including the ability of the learned LDA topics to reconstruct known clusters and the interpretability of these topics via statistical information measures.
Overall, we find that the novel method of pooling tweets by hashtags yields superior performance for all metrics on all datasets, and an automatic hashtag assignment scheme further improves the hashtag pooling results on a subset of metrics. Hence this work provides two novel methods for significantly improving LDA topic modeling on microblog content.
2. TWEET POOLING FOR TOPIC MODELS
The goal of this paper is to obtain better LDA topics from Twitter content without modifying the basic machinery of standard LDA. As noted in Section 1, microblog messages differ from conventional text: message quality varies greatly, from newswire-like utterances to babble. To address these challenges, we present various pooling schemes to aggregate tweets into “macro-documents” for use as training data to build better LDA models. The motivation behind tweet pooling is that individual tweets are very short (≤ 140 characters), and hence treating each tweet as an individual document does not provide adequate term co-occurrence data within documents. Aggregating tweets which are similar in some sense (semantically, temporally, etc.) enriches the content present in a single document from which LDA can learn a better topic model. We next describe the various tweet pooling schemes.
Basic scheme (Unpooled): The default scheme treats each tweet as a single document and trains LDA on all tweets. This serves as our baseline for comparison to the pooled schemes.
Author-wise Pooling: Pooling tweets according to author is a standard way of aggregating Twitter data [13, 5] and has been shown to be superior to unpooled tweets. In this method, we build a document for each author which combines all tweets they have posted.
Burst-score wise Pooling: A trend on Twitter [7] (sometimes referred to as a trending topic) consists of one or more terms and a time period, such that the volume of messages posted for the terms in the time period exceeds some expected level of activity. In order to identify trends in Twitter posts, unusual “bursts” of term frequency can be detected in the data. We run a simple burst detection algorithm to detect such trending terms and aggregate tweets containing those terms having high burst scores. To identify terms that appear more frequently than expected, we assign a score to terms according to their deviation from an expected frequency. Assume that M is the set of all messages in our tweets dataset, R is a set of one or more terms (a potential trending topic) to which we wish to assign a score, and d ∈ D represents one day in a set D of days. We then define M(R, d) as the subset of Twitter messages in M such that (1) the message contains all the terms in R and (2) the message was posted during day d. With this information, we can compare the volume on a specific day to the other days. Let

$$\mathrm{Mean}(R) = \frac{1}{|D|} \sum_{d \in D} |M(R, d)|.$$

Correspondingly, SD(R) is the standard deviation of |M(R, d)| over the days d ∈ D. The burst-score is then defined as:

$$\text{burst-score}(R, d) = \frac{|M(R, d)| - \mathrm{Mean}(R)}{\mathrm{SD}(R)}$$

Let us denote an individual term having burst-score greater than some threshold τ on some day d ∈ D as a burst-term. Then our first novel aggregation method, Burst Score-wise Pooling, aggregates the tweets for each burst-term into a single document for training LDA, where we found τ = 5 to provide the best results. If any tweet has more than one burst-term, the tweet is added to the tweet-pool of each of those burst-terms.
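The paper does not include an implementation of this detector; the following Python sketch is one plausible reading for single-term trends R = {term}, using per-day tweet counts for |M(R, d)| (the function name and the (day, tokens) input format are our own assumptions).

from collections import Counter, defaultdict
from statistics import mean, stdev

def burst_terms(tweets, days, tau=5.0):
    """Return (term, day) pairs whose burst-score exceeds tau.

    `tweets` is an iterable of (day, tokens) pairs; |M(R, d)| is the
    number of tweets on day d containing the term (single-term R).
    """
    counts = defaultdict(Counter)              # day -> term -> tweet count
    for day, tokens in tweets:
        counts[day].update(set(tokens))        # a tweet counts once per term
    vocabulary = {t for c in counts.values() for t in c}
    bursts = []
    for term in vocabulary:
        volumes = [counts[d][term] for d in days]
        mu, sd = mean(volumes), stdev(volumes)
        if sd == 0:
            continue                           # constant volume: no burst
        bursts.extend((term, d) for d, v in zip(days, volumes)
                      if (v - mu) / sd > tau)  # burst-score(R, d) > tau
    return bursts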
Temporal Pooling: When a major event occurs, a large number of users often start tweeting about the event within a short period of time. To capture such temporal coherence of tweets, the fourth scheme and our second novel pooling proposal is Temporal Pooling, where we pool all tweets posted within the same hour.
Hashtag-based Pooling: A Twitter hashtag is a string of characters preceded by the hash (#) character. In many cases hashtags can be viewed as topical markers, an indication of the context of the tweet or of the core idea expressed in the tweet; hashtags are therefore adopted by other users who contribute similar content or express a related idea. One example of the use of hashtags is “ask GAGA anything using the tag #GoogleGoesGaga for her interview! RT so every monster learns about it!!”, referring to an exclusive interview for Google by Lady Gaga (singer). For the hashtag-based pooling scheme, we create pooled documents for each hashtag. If any tweet has more than one hashtag, the tweet is added to the tweet-pool of each of those hashtags.
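As a minimal illustration of the pooling step (not code from the paper), the following Python sketch builds one macro-document per hashtag, assuming each tweet is represented as a (text, hashtags) pair — a representation of our own choosing. A tweet with several hashtags joins every corresponding pool, as described above.

from collections import defaultdict

def pool_by_hashtag(tweets):
    """Build one macro-document per hashtag from (text, hashtags) pairs."""
    pools = defaultdict(list)
    for text, hashtags in tweets:
        for tag in hashtags:
            pools[tag].append(text)    # multi-tag tweets join each pool
    # Each pooled document is the concatenation of its tweets' texts.
    return {tag: " ".join(texts) for tag, texts in pools.items()}

# Toy usage: two tweets, one carrying two hashtags.
docs = pool_by_hashtag([
    ("h1n1 vaccine confirmed", ["swineflu"]),
    ("flu pandemic news", ["swineflu", "health"]),
])
# docs == {"swineflu": "h1n1 vaccine confirmed flu pandemic news",
#          "health": "flu pandemic news"}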
Other Pooling: While a few other combinations of pooling schemes (e.g., author-time, hashtag-time) are possible, the initial results obtained were not as good as those for the pooling schemes outlined above.
3. TWITTER DATASET CONSTRUCTION
We construct three diverse datasets from a 20-30% sample of public Twitter posts in 2009 (http://snap.stanford.edu/data/twitter7.html). We chose one or two term queries (often with similar pairs of queries to encourage a non-strongly diagonal confusion matrix) to search a tweet collection, and each resulting set of tweets was labeled by the query that retrieved it. Since the number of queries (equivalently, the number of clusters) is known beforehand, we could use this knowledge to evaluate how well the topics output by LDA match the known clusters. A brief description of the three datasets is as follows:
Generic Dataset: 359,478 tweets from 11 Jan ’09 to 30 Jan ’09. A general dataset with tweets containing generic terms.
Specific Dataset: 214,580 tweets from 11 Jan ’09 to 30 Jan ’09. A dataset composed of tweets that refer to named entities.
Event Dataset: 207,128 tweets from 1 Jun ’09 to 30 Jun ’09. A dataset composed of tweets pertaining to specific events. The query terms represent these events, and the time period was chosen specifically due to the high number of co-occurring events being discussed in this month relative to other months.
Table 2 provides the exact query terms and the percentage of tweets in the datasets retrieved by each query. Typically, less than one percent of tweets were retrieved by more than one query, with the highest case of 4.6% overlap occurring in the Generic dataset for the two queries “family” and “fun”. We have removed tweets retrieved by more than one query in order to preserve the uniqueness of tweet labels for later analysis with clustering metrics.
Table 2: Datasets (query term / % of tweets)
Generic: music/17.9 business/15.8 movie/14.5 design/10.8 food/9.6 fun/9.1 health/6.9 family/6.4 sport/4.9 space/3.2
Specific: Obama/23.2 Sarkozy/0.4 baseball/3.5 cricket/1.8 Mcdonalds/1.5 Burgerking/0.5 Apple/16.3 Microsoft/6.8 United-states/40.7 France/4.9
Events: Flight-447/0.9 Jackson/13.9 Lakers/13.8 attack/13.8 scandal/4.1 swine-flu/13.8 recession/12.3 conference/14.1 T20/4.4 Iran-election/8.6
4. EVALUATION METRICS
Because there is no single method for evaluating topic models, we evaluate a range of metrics, including those used in clustering (Purity and NMI) and semantic topic coherence and interpretability (PMI), as discussed below.
In order to cluster with LDA, we let a topic represent each cluster and assign each tweet to its corresponding mixture topic of highest probability (an inferred quantity via LDA). Then, by analysing clustering-based metrics, we wish to understand how well the different tweet pooling schemes are able to reproduce clusters representing the original queries used to produce the datasets.
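As a concrete illustration of this assignment step, the sketch below assumes a trained gensim LdaModel and bag-of-words vectors for the individual tweets; the paper does not name its tooling, so both the model type and the helper name are our own assumptions.

# Illustrative sketch only: `lda` is a trained gensim LdaModel and
# `tweet_bows` holds one bag-of-words vector per individual tweet.
def cluster_assignments(lda, tweet_bows):
    """Assign each tweet to its highest-probability LDA topic."""
    assignments = []
    for bow in tweet_bows:
        topics = lda.get_document_topics(bow)   # [(topic_id, prob), ...]
        best_topic, _ = max(topics, key=lambda pair: pair[1])
        assignments.append(best_topic)
    return assignments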

Formally, let $T_i$ be the set of tweets in LDA topic cluster $i$ and $Q_j$ be the set of tweets with query label $j$. Then let $T = \{T_1, \ldots, T_{|T|}\}$ be the set of all $|T|$ clusters and $Q = \{Q_1, \ldots, Q_{|Q|}\}$ be the set of all $|Q|$ query labels. Now we define our clustering-based metrics as follows.
Purity: To compute purity [6], each LDA topic cluster is assigned the query label most frequent in the cluster. Purity then simply measures the average “purity” of each cluster, i.e., the fraction of tweets in a cluster having the assigned cluster query label. Formally:

$$\mathrm{Purity}(T, Q) = \frac{1}{|T|} \sum_{i \in \{1 \ldots |T|\}} \max_{j \in \{1 \ldots |Q|\}} |T_i \cap Q_j|$$

Obviously, high purity scores reflect better original cluster reconstruction.
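A direct computation over parallel lists of cluster assignments and query labels might look as follows; this sketch follows the textbook definition in [6], which normalises the sum of majority-label counts by the total number of tweets.

from collections import Counter, defaultdict

def purity(clusters, labels):
    """Purity of cluster assignments against gold query labels.

    `clusters` and `labels` are parallel lists, one entry per tweet;
    each cluster contributes the count of its most frequent label.
    """
    label_counts = defaultdict(Counter)
    for c, q in zip(clusters, labels):
        label_counts[c][q] += 1
    majority = sum(max(counts.values()) for counts in label_counts.values())
    return majority / len(labels)      # normalise by total tweets, per [6]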
Normalized Mutual Information (NMI): As a more information-theoretic measure of cluster quality, we also evaluate normalized mutual information (NMI), defined as follows:

$$\mathrm{NMI}(T, Q) = \frac{2\, I(T; Q)}{H(T) + H(Q)},$$

where $I(\cdot\,;\cdot)$ is mutual information, $H(\cdot)$ is entropy as defined in [6], and T and Q are as defined previously. NMI is always a number between 0 and 1; it is 1 if the clustering results exactly match the category labels and 0 if the two sets are independent.
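The same parallel lists give NMI directly. The sketch below computes I and H from empirical frequencies; scikit-learn's normalized_mutual_info_score, with its default arithmetic-mean normalisation, computes the same quantity.

import math
from collections import Counter

def nmi(clusters, labels):
    """NMI(T, Q) = 2 I(T; Q) / (H(T) + H(Q)) over parallel label lists."""
    n = len(labels)
    p_t = Counter(clusters)                     # cluster marginal counts
    p_q = Counter(labels)                       # query-label marginal counts
    p_tq = Counter(zip(clusters, labels))       # joint counts
    h_t = -sum(c / n * math.log(c / n) for c in p_t.values())
    h_q = -sum(c / n * math.log(c / n) for c in p_q.values())
    i_tq = sum(c / n * math.log(c * n / (p_t[t] * p_q[q]))
               for (t, q), c in p_tq.items())
    return 2 * i_tq / (h_t + h_q)               # assumes h_t + h_q > 0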
As an alternative to clustering quality, we can instead measure topic interpretability or coherence, which is a human-judged quality that depends on the semantics of the words and cannot be measured by model-based statistical measures that treat the words as exchangeable tokens. It is possible to automatically measure topic coherence with near-human accuracy [10] using a score based on pointwise mutual information (PMI). We use this to measure the coherence of the topics from the different tweet-pooling schemes.
Pointwise Mutual Information (PMI): PMI is a measure of the statistical independence of observing two words in close proximity, where the PMI of a given pair of words (u, v) is

$$\mathrm{PMI}(u, v) = \log \frac{p(u, v)}{p(u)\, p(v)}.$$

The probabilities are determined from the empirical statistics of the full collection, and we treat two words as co-occurring if both words occur in the same tweet. For a topic $T_k$ ($k \in \{1 \ldots |T|\}$), we measure topic coherence as the average of PMI over the pairs of its ten highest-probability words $\{w_1, \ldots, w_{10}\}$:

$$\mathrm{PMI\text{-}Score}(T_k) = \frac{1}{100} \sum_{i=1}^{10} \sum_{j=1}^{10} \mathrm{PMI}(w_i, w_j).$$

The average of the PMI-Score over all topics is used as the final measure.
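A sketch of the PMI-Score for one topic follows, estimating p(u), p(v), and p(u, v) from per-tweet occurrence counts over the collection. Self-pairs (i = j) are skipped and unseen pairs contribute zero; the paper does not state how either case is handled, so those choices are ours.

import math
from collections import Counter
from itertools import combinations

def pmi_score(top_words, tweets, n_top=10):
    """Average pairwise PMI of a topic's top words over tweet co-occurrence."""
    top = set(top_words[:n_top])
    n = len(tweets)
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in tweets:                      # tweets as token lists
        present = top & set(tokens)
        word_counts.update(present)
        pair_counts.update(frozenset(p) for p in combinations(present, 2))
    total = 0.0
    for u, v in combinations(top, 2):
        c_uv = pair_counts[frozenset((u, v))]
        if c_uv:                               # PMI = log(p(u,v) / (p(u)p(v)))
            total += math.log(c_uv * n / (word_counts[u] * word_counts[v]))
    # Each unordered pair appears twice in the paper's 10x10 double sum.
    return 2 * total / (n_top * n_top)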
5. RESULTS FOR POOLING SCHEMES
In this section we discuss the results of the experimental evaluation of the tweet pooling schemes introduced in Section 2. The datasets were described in Section 3 and the evaluation metrics in Section 4.
5.1 Document Characteristics
We first look at the characteristics of the documents produced by the different pooling schemes for the three datasets, as they may affect LDA. Table 3 presents these statistics.
Table 3: Document characteristics for different pooling schemes (values given as Generic / Specific / Events).
Pooling Scheme   # of docs                  Avg # of words/doc       Max # of words/doc
Author-wise      208300 / 118133 / 67387    17.6 / 20.4 / 15.4       4893 / 3586 / 2775
Unpooled         359478 / 214580 / 207128   10.2 / 10.9 / 9.7        35 / 49 / 32
Burst-score      7658 / 7436 / 5434         76.5 / 154.2 / 71.6      61918 / 420249 / 57794
Hourly           465 / 464 / 463            8493.4 / 5387.5 / 2422   20144 / 18869 / 38893
Hashtag          8535 / 7029 / 4099         70.4 / 187.2 / 78.4      61918 / 420249 / 57794
These statistics highlight the differences in the characteristics of the documents on which the LDA models are trained. The number of documents is largest for the Unpooled and Author-wise schemes and smallest for the Hourly scheme, while the corresponding document size displays an inverse relationship. On average, the document size increases by a factor of seven in Hashtag-based pooling compared with the Unpooled or Author-wise pooling schemes. At the other extreme lies the Hourly pooling scheme, with many fewer documents of a much larger average size.
5.2 Comparison of Pooling Schemes
For the three datasets (viz. Generic, Specific and Events) and the four pooling schemes (plus Unpooled), we evaluate the Purity, NMI and PMI scores obtained by training LDA with 10 topics in each setup. As seen in Table 4, the Hashtag-based pooling scheme clearly performs better than the Unpooled scheme as well as the other pooling schemes.
Table 4: Results of different pooling schemes (values given as Generic / Specific / Events).
Scheme     Purity                                NMI Score                             PMI Score
Unpooled   0.49±0.08 / 0.64±0.07 / 0.69±0.09     0.28±0.04 / 0.22±0.05 / 0.39±0.07     1.27±0.11 / 0.47±0.12 / 0.47±0.13
Author     0.54±0.04 / 0.62±0.05 / 0.60±0.06     0.24±0.04 / 0.17±0.04 / 0.41±0.06     0.21±0.09 / 0.79±0.15 / 0.51±0.13
Hourly     0.45±0.05 / 0.61±0.06 / 0.61±0.07     0.07±0.04 / 0.09±0.04 / 0.32±0.05     1.31±0.12 / 0.87±0.16 / 0.22±0.14
Burstwise  0.42±0.07 / 0.60±0.04 / 0.64±0.06     0.18±0.05 / 0.16±0.04 / 0.33±0.04     0.48±0.16 / 0.74±0.14 / 0.58±0.16
Hashtag    0.54±0.04 / 0.68±0.03 / 0.71±0.04     0.28±0.04 / 0.23±0.03 / 0.42±0.05     0.78±0.15 / 1.43±0.14 / 1.07±0.17
6. AUTOMATIC HASHTAG LABELING
Hashtag-based pooling is clearly the best pooling scheme based on the results of Table 4, yet if we examine how many tweets have hashtags, we find the following distribution: Generic 22.3%, Specific 9.4%, Events 19.5%. In short, the majority of our data is not being used effectively by Hashtag-based pooling. Consequently, we conjecture that automatically assigning hashtags to some tweets should improve the overall evaluation metrics. Next we present an algorithm for performing this automated hashtag assignment with a tunable confidence threshold.
Hashtag Labeling Algorithm: First we pool all tweets by existing hashtags. Then, if the similarity score between an unlabeled and a labeled tweet exceeds a certain confidence threshold C, we assign the hashtag of the labeled tweet to the unlabeled tweet (which hence joins the pool for that hashtag). For our similarity metric between tweets, two obvious candidates are cosine similarity using TF and TF-IDF vector space representations [12].
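A minimal scikit-learn sketch of the TF-IDF variant follows; the paper does not specify an implementation, so the function name and tweet representation are our own assumptions. For the TF variant, CountVectorizer would replace TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def assign_hashtags(unlabeled, labeled, tags, C=0.5):
    """Give each unlabeled tweet the hashtag of its most similar
    labeled tweet when cosine similarity exceeds the threshold C."""
    vectorizer = TfidfVectorizer()
    labeled_vecs = vectorizer.fit_transform(labeled)    # labeled tweet texts
    unlabeled_vecs = vectorizer.transform(unlabeled)
    similarities = cosine_similarity(unlabeled_vecs, labeled_vecs)
    assignments = {}
    for i, row in enumerate(similarities):
        j = row.argmax()                   # nearest labeled tweet
        if row[j] > C:
            assignments[i] = tags[j]       # tweet i joins this hashtag pool
    return assignments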
In an initial exploratory analysis, we found a medium-range confidence threshold of C = 0.5 to give the best results for both Purity and NMI. This is because tweets with the same class label have a higher average TF-IDF similarity than tweets with a different class label, so pooling these tweets together makes more topic-aligned hashtag pools that aid cluster reconstruction. On the other hand, PMI improved most for a rather high confidence threshold of C = 0.9, since otherwise tweets that were only marginally relevant to a hashtag reduced the overall topical coherence of hashtag-pooled documents and led to a noisier LDA model.
The overall results obtained via hashtag assignment are shown in Table 5 and demonstrate that Hashtag-based pooling with TF-similarity tag assignment results in the best cluster reproductions, with the highest Purity and NMI scores of all methods examined, while simple Hashtag-based pooling without tag assignment generally provides the best results for topic coherence measured via PMI.
Table 5: Overall percentage improvement of Hashtag-based pooling variants over the Unpooled scheme (values given as Generic / Specific / Events).
Pooling Scheme                      Purity                     NMI Scores                 PMI Scores
Hashtag                             +8.16% / +5.88% / +2.89%   -3.44% / +4.54% / +7.69%   +161% / +204% / +127%
Hashtag + Tag-Assignment (TF)       +21% / +12.5% / +8.69%     +20.6% / +9.1% / +12.8%    +164% / +155% / +124%
Hashtag + Tag-Assignment (TF-IDF)   +12.2% / +9.4% / +4.34%    +10.3% / +4.5% / +10.25%   +155% / +159% / +100%
7. RELATED WORK
Our work differs from many pioneering studies on topic modeling of tweets in that we focus on how to improve clustering metrics and topic coherence with existing algorithms. Prior work on topic modeling for tweets includes the following: [11] provide a scalable implementation of a partially supervised learning model (Labeled LDA). [5] provide a topical classification of Twitter users and messages. [4] propose a novel method for normalising ill-formed out-of-vocabulary words in short microblog messages. The TwitterRank system [13] and [5] used Author-wise pooling to apply LDA to tweets. [15] compared topic characteristics between Twitter and a traditional news medium, the New York Times; they propose to use one topic per tweet (similar to PLSA), and argue that this is better than the Unpooled LDA scheme and the author-topic model. [9] proposed two methods to regularize the resulting topics towards better coherence. [8] used retweets as an indicator of “interestingness” to improve retrieval quality. This related research suggests a number of orthogonal methods that could be used to complement our Hashtag-based tweet pooling scheme.

For the automatic hashtag labeling that proved crucial to improving topics in our Hashtag-based pooling model, additional features for hashtag assignment can be found in the comprehensive study of [14]; these can be leveraged in future extensions, along with novel social-media based similarity metrics such as those incorporating inverse author frequency [2] and other social network properties.
8. CONCLUSION
This paper presented a way of aggregating tweets in order to improve the quality of LDA-based topic modeling in microblogs, as measured by the ability of topics to reconstruct clusters and by topic coherence. The results presented in Table 4 on three diverse selections of Twitter data show that the novel scheme of Hashtag-based pooling leads to drastically improved topic modeling over Unpooled and other schemes. The further addition of automatic TF similarity-based hashtag assignment to Hashtag-based pooling outperforms all other pooling strategies and the Unpooled scheme on cluster reconstruction metrics, as shown in Table 5. In conclusion, these two schemes present LDA users with novel methods for significantly improving LDA topic modeling on Twitter without requiring any modification of the underlying LDA machinery.
Acknowledgments
NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
9. REFERENCES
[1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[2] S. E. Chan, R. K. Pon, and A. F. Cárdenas. Visualization and clustering of author social networks. Pages 30–31, Arizona, USA, 2006.
[3] B. Han, P. Cook, and T. Baldwin. Automatically constructing a normalisation dictionary for microblogs. In Proc. of EMNLP-CoNLL 2012, pages 421–432, Korea, 2012.
[4] B. Han, P. Cook, and T. Baldwin. Lexical normalisation of social media text. ACM Transactions on Intelligent Systems and Technology, 4(1), Feb. 2013.
[5] L. Hong and B. Davison. Empirical study of topic modeling in Twitter. In 1st ACM Workshop on Social Media Analytics, 2010.
[6] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[7] M. Naaman, H. Becker, and L. Gravano. Hip and trendy: Characterizing emerging trends on Twitter. J. Am. Soc. Inf. Sci. Technol., 62(5):902–918, May 2011.
[8] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi. Searching microblogs: coping with sparsity and document quality. In CIKM ’11, pages 183–188, 2011.
[9] D. Newman, E. Bonilla, and W. Buntine. Improving topic coherence with regularized topic models. In NIPS, 2011.
[10] D. Newman, J. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In NAACL, 2010.
[11] D. Ramage, S. Dumais, and D. Liebling. Characterizing microblogs with topic models. In AAAI Conference on Weblogs and Social Media, 2010.
[12] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[13] J. Weng, E.-P. Lim, J. Jiang, and Q. He. TwitterRank: finding topic-sensitive influential twitterers. In WSDM ’10, pages 261–270, 2010.
[14] L. Yang, T. Sun, M. Zhang, and Q. Mei. We know what @you #tag: does the dual role affect hashtag adoption? In WWW ’12, pages 261–270, 2012.
[15] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing Twitter and traditional media using topic models. In ECIR ’11, pages 338–349, 2011.