
Comparing Twitter Summarization Algorithms
for Multiple Post Summaries
David Inouye* and Jugal K. Kalita+
*School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, Georgia 30332 USA
+Department of Computer Science
University of Colorado
Colorado Springs, CO 80918 USA
dinouye3@gatech.edu, jkalita@uccs.edu
Abstract—Due to the sheer volume of text generated by a microblog site like Twitter, it is often difficult to fully understand what is being said about various topics. In an attempt to understand microblogs better, this paper compares algorithms for extractive summarization of microblog posts. We present two algorithms that produce summaries by selecting several posts from a given set. We evaluate the generated summaries by comparing them to both manually produced summaries and summaries produced by several leading traditional summarization systems. In order to shed light on the special nature of Twitter posts, we include extensive analysis of our results, some of which are unexpected.
I. INTRODUCTION

Twitter¹, the microblogging site started in 2006, has become a social phenomenon. In February 2011, Twitter had 200 million registered users². There were a total of 25 billion tweets in all of 2010³. While a majority of posts are conversational or not particularly meaningful, about 3.6% of the posts concern topics of mainstream news⁴.

¹http://www.twitter.com
²http://www.bbc.co.uk/news/business-12889048
³http://blog.twitter.com/2010/12/hindsight2010-top-trends-on-twitter.html
⁴http://www.pearanalytics.com/blog/wp-content/uploads/2010/05/Twitter-Study-August-2009.pdf
To help people who read Twitter posts or tweets, Twitter
provides two interesting features: an API that allows users
to search for posts that contain a topic phrase and a short
list of popular topics called Trending Topics. A user can
perform a search for a topic and retrieve a list of the most
recent posts that contain the topic phrase. The difficulty in
interpreting the results is that the returned posts are only
sorted by recency, not relevancy. Therefore, the user is forced
to manually read through the posts in order to understand
what users are primarily saying about a particular topic. The
motivation of the summarizer is to automate this process.
In this paper, we discuss an ongoing effort to create automatic
summaries of Twitter trending topics. In our recent prior work
[1]–[4], we have discussed algorithms that can be used to pick
the single post that is representative of or is the summary of a
number of Twitter posts. Since the posts returned by the Twit-
ter API for a specified topic likely represent several sub-topics
or themes, it may be more appropriate to produce summaries
that encompass the multiple themes rather than just having
one post describe the whole topic. For this reason, this paper
extends the work significantly to create summaries that contain
multiple posts. We compare our multiple post summaries with
ones produced by leading traditional summarizers.
II. RELATED WORK
Summarizing microblogs can be viewed as an instance of
the more general problem of automated text summarization,
which is the problem of automatically generating a condensed
version of the most important content from one or more
documents. A number of algorithms have been developed
for various aspects of document summarization during recent
years. Notable algorithms include SumBasic [5] and the cen-
troid algorithm [6]. SumBasic’s underlying premise is that
words that occur more frequently across documents have a
higher probability of being selected for human created multi-
document summaries than words that occur less frequently.
The centroid algorithm takes into consideration a centrality
measure of a sentence in relation to the overall topic of the
document cluster or in relation to a document in the case of
single document summarization. The LexRank algorithm [7]
for computing the relative importance of sentences or other
textual units in a document (or a set of documents) creates an
adjacency matrix among the textual units and then computes
the stationary distribution considering it to be a Markov chain.
The TextRank algorithm [8] is also a graph-based approach
that finds the most highly ranked sentences (or keywords) in
a document using the PageRank algorithm [9].
In most cases, text summarization is performed for the pur-
poses of saving users time by reducing the amount of content
to read. However, text summarization has also been performed
for purposes such as reducing the number of features required
for classifying (e.g. [10]) or clustering (e.g. [11]) documents.
In another line of approach, early work by Kalita et al.
generated textual summaries of database query results [12]–
[14]. Instead of presenting a table of data rows as the response
to a database query, they generated textual summaries from
predominant patterns found within the data table.
In the context of the Web, multi-document summarization
is useful in combining information from multiple sources.
Information may have to be extracted from many different
articles and pieced together to form a comprehensive and
coherent summary. One major difference between single docu-
ment summarization and multi-document summarization is the
potential redundancy that comes from using many source texts.
One solution may involve clustering the important sentences
picked out from the various source texts and using only
a representative sentence from each cluster. For example,
McKeown et al. first cluster the text units and then choose
the representative units from the clusters to include in the
final summary [15]. Dhillon models a document collection as
a bipartite graph consisting of words and documents and uses
a spectral co-clustering algorithm to obtain excellent results
[16].
Finally, in the context of multi-document summarization, it
is appropriate to mention MEAD [17], a publicly available,
flexible platform for multi-document, multilingual summarization.
MEAD implements multiple summarization algorithms and
provides metrics for evaluating multi-document summaries.
III. PROBLEM DESCRIPTION
A Twitter post or tweet is at most 140 characters long
and in this study we only consider English posts. Because
a post is informal, it often has colloquial syntax, non-standard
orthography or non-standard spelling, and it frequently lacks
any punctuation.
The problem considered in this paper can be defined as
follows:

    Given a topic keyword or phrase T and the desired length k
    for the summary, output a set of representative posts S with
    a cardinality of k such that 1) ∀s ∈ S, T is in the text of s,
    and 2) ∀s_i, s_j ∈ S, s_i ≁ s_j. Here, s_i ≁ s_j means that the
    two posts provide sufficiently different information, in order
    to keep the summaries from being redundant.
IV. SELECTED APPROACHES FOR TWITTER SUMMARIES
Among the many algorithms we discussed in prior papers [1]–[4]
for creating single-post summaries of tweets, an algorithm we
developed, called the Hybrid TF-IDF algorithm, worked best.
Thus, in this paper, we extend this algorithm to obtain
multi-post summaries. The contributions of this paper are the
introduction of a hybrid TF-IDF based algorithm and a clustering
based algorithm for obtaining multi-post summaries of Twitter
posts, along with a detailed analysis of the Twitter post domain
for text processing, obtained by comparing these algorithms
with several other summarization algorithms. We find some
unexpected results when we apply multiple document
summarization algorithms to short informal documents.
A. Hybrid TF-IDF with Similarity Threshold
Term Frequency Inverse Document Frequency (TF-IDF) is a
statistical weighting technique that assigns each term within a
document a weight that reflects the term’s saliency within
the document. The weight of a post is the summation of
the individual term weights within the post. To determine the
weight of a term, we use the formula:
TFIDF = tf_ij · log₂(N / df_j)    (1)

where tf_ij is the frequency of the term T_j within the document D_i, N is the total number of documents, and df_j is the number of documents within the set that contain the term T_j. We assume that a term corresponds to a word and select the most weighted post as the summary.
The TF-IDF value is composed of two primary parts. The
term frequency component (TF) assigns more weight to words
that occur frequently within a document because important
words are often repeated. The inverse document frequency
component (IDF) compensates for the fact that some words,
such as common stop words, are frequent. Since these words
do not help discriminate between one sentence or document
and another, they are penalized in proportion to their
inverse document frequency. The logarithm is taken to balance
the effect of the IDF component in the formula.
Equation (1) defines the weight of a term in the context of
a document. However, a microblog post is not a traditional
document. Therefore, one question we must first answer is
how we define a document. One option is to define a single
document that encompasses all the posts. In this case, the
TF component’s definition is straightforward since we can
compute the frequencies of the terms across all the posts.
However, doing so causes us to lose the IDF component
since we only have a single document. On the other extreme,
we could define each post as a document making the IDF
component’s definition clear. But, the TF component now has
a problem: because each post contains only a handful of words,
most term frequencies will be a small constant for a given post.
To handle this situation, we redefine TF-IDF in terms of
a hybrid document. We primarily define a document as a
single post. However, when computing the term frequencies,
we assume the document is the entire collection of posts.
Therefore, the TF component of the TF-IDF formula uses the
entire collection of posts while the IDF component treats each
post as a separate document. This way, we have differentiated
term frequencies but also do not lose the IDF component.
We next choose a normalization method since otherwise
the TF-IDF algorithm will always bias towards longer posts.
We normalize the weight of a post by dividing it by a
normalization factor. Since common stop words do not help
discriminate the saliency of sentences, we give stop words—
as defined by a prebuilt list—a weight of zero. Given this,
our definition of the TF-IDF summarization algorithm is now
complete for microblogs. We summarize this algorithm below
in Equations (2)-(6).
W(s) = ( Σ_{i=0}^{#WordsInPost} W(w_i) ) / nf(s)    (2)

W(w_i) = tf(w_i) · log₂(idf(w_i))    (3)

tf(w_i) = #OccurrencesOfWordInAllPosts / #WordsInAllPosts    (4)

idf(w_i) = #Posts / #PostsInWhichWordOccurs    (5)

nf(s) = max[MinimumThreshold, #WordsInPost]    (6)

where W is the weight assigned to a post or a word, nf is a normalization factor, w_i is the ith word, and s is a post.
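To make Equations (2)-(6) concrete, the sketch below computes the Hybrid TF-IDF weight of a post, with term frequencies taken over the entire collection and document frequencies treating each post as its own document. The whitespace tokenizer, the small stop-word list, and the value of the normalization floor are illustrative assumptions, since the paper does not specify them.

```python
import math
from collections import Counter

# Illustrative stop words and normalization floor; the paper uses a prebuilt
# stop-word list and does not specify the MinimumThreshold value.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "on"}
MIN_THRESHOLD = 11

def hybrid_tfidf_weight(post, posts):
    """Weight of `post` per Equations (2)-(6): TF is computed over all
    posts, while IDF treats each post as a separate document."""
    all_words = [w for p in posts for w in p.lower().split()]
    tf_counts = Counter(all_words)
    n_posts = len(posts)
    total = 0.0
    for w in post.lower().split():
        if w in STOP_WORDS:
            continue                                    # stop words weigh zero
        tf = tf_counts[w] / len(all_words)              # Equation (4)
        df = sum(1 for p in posts if w in p.lower().split())
        total += tf * math.log2(n_posts / df)           # Equations (3) and (5)
    nf = max(MIN_THRESHOLD, len(post.split()))          # Equation (6)
    return total / nf                                   # Equation (2)
```

For a single-post summary, max(posts, key=lambda p: hybrid_tfidf_weight(p, posts)) then recovers the most weighted post.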
We select the top k most weighted posts. Because the most
weighted posts may be very similar or discuss the same subtopic,
the algorithm, in order to avoid redundancy, selects each next
top post only after checking that it does not have a similarity
above a given threshold t to any of the previously selected
posts. This similarity threshold filters out a candidate summary
post s_i if it satisfies the following condition:

sim(s_i, s_j) > t,  ∀s_j ∈ R

where R is the set of posts already chosen for the final summary
and t is the similarity threshold. We use the cosine similarity
measure. The threshold was varied from 0 to 0.99 in increments
of 0.01, for a total of 100 tests, in order to find the best
threshold to be used.
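The selection step can be sketched as follows, assuming the hybrid_tfidf_weight() function from the previous sketch and a plain bag-of-words cosine similarity, which is one reasonable reading of the paper's cosine measure.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two posts as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_top_k(posts, k, t, weight_fn):
    """Greedily take the most weighted posts, skipping any post whose
    cosine similarity to an already selected post exceeds threshold t."""
    ranked = sorted(posts, key=lambda p: weight_fn(p, posts), reverse=True)
    summary = []
    for post in ranked:
        if all(cosine(post, s) <= t for s in summary):
            summary.append(post)
        if len(summary) == k:
            break
    return summary
```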
B. Cluster Summarizer
We develop another method for summarizing a set of Twitter
posts. Similar to [15] and [16], we first cluster the tweets into
k clusters based on a similarity measure and then summarize
each cluster by picking the most weighted post as determined
by the Hybrid TF-IDF weighting described in Section IV-A.
During preliminary tests, we evaluated how well different
clustering algorithms would work on Twitter posts using the
weights computed by the Hybrid TF-IDF algorithm and the
cosine similarity measure. We implemented two variations of
the k-means algorithm: bisecting k-means [18] and k-means++
[19]. The bisecting k-means algorithm initially divides the
input into two clusters and then divides the largest cluster
into two smaller clusters. This splitting is repeated until the
kth cluster is formed. The k-means++ algorithm is similar to
the regular k-means algorithm except that it chooses the initial
centroids differently. It picks an initial centroid c_1 from the
set of vertices V at random. It then chooses each next centroid
c_i, selecting c_i = v′ ∈ V with probability

D(v′)² / Σ_{v∈V} D(v)²

where D(v) is the shortest Euclidean distance from v to the
closest center that has already been chosen. It repeats this
selection process until k initial centroids have been chosen.
After trying these methods, we found that the bisecting
k-means++ algorithm, a combination of the two algorithms,
performed the best, even though the performance gain over
standard k-means was not very high according to our evaluation
methods.
Thus, the cluster summarizer attempts to create k subtopics
by clustering the posts. It then feeds each subtopic cluster to
the Hybrid TF-IDF algorithm discussed in Section IV-A, which
selects the most weighted post for each subtopic.
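A condensed sketch of this cluster-then-summarize pipeline appears below. The seeding follows the D(v)² distribution described above, and weight_fn can be the hybrid_tfidf_weight() sketched earlier; the bag-of-words vectors, the fixed iteration count, and the handling of degenerate splits are simplifying assumptions rather than the authors' exact implementation.

```python
import math
import random
from collections import Counter

def vectorize(post):
    """Bag-of-words vector for a post."""
    return Counter(post.lower().split())

def dist(a, b):
    """Euclidean distance between two sparse word-count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def mean(vectors):
    """Component-wise mean (centroid) of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({k: c / len(vectors) for k, c in total.items()})

def split_in_two(posts, iters=10):
    """One bisecting step: 2-means with k-means++ style seeding."""
    vecs = [vectorize(p) for p in posts]
    c1 = random.choice(vecs)                         # first centroid: uniform
    w = [dist(v, c1) ** 2 for v in vecs]             # D(v)^2 seeding weights
    c2 = random.choices(vecs, weights=w, k=1)[0] if sum(w) else c1
    halves = ([], [])
    for _ in range(iters):
        halves = ([], [])
        for p, v in zip(posts, vecs):
            halves[0 if dist(v, c1) <= dist(v, c2) else 1].append(p)
        if halves[0] and halves[1]:
            c1 = mean([vectorize(p) for p in halves[0]])
            c2 = mean([vectorize(p) for p in halves[1]])
    return [h for h in halves if h]

def cluster_summarize(posts, k, weight_fn):
    """Bisect the largest cluster until k clusters exist, then pick the
    most weighted post (e.g. by Hybrid TF-IDF) from each cluster."""
    clusters = [list(posts)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        parts = split_in_two(clusters.pop(0))
        clusters.extend(parts)
        if len(parts) < 2:                           # degenerate split: stop early
            break
    return [max(c, key=lambda p: weight_fn(p, posts)) for c in clusters]
```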
C. Additional Summarization Algorithms to Compare Results
We compare the results of summarization of the two
newly introduced algorithms with baseline algorithms and
well-known multi-document summarization algorithms. The
baseline algorithms include a Random summarizer and a Most
Recent summarizer. The other algorithms we compare our
results with are SumBasic, MEAD, LexRank and TextRank.
1) Random Summarizer: This summarizer randomly
chooses k posts for each topic as the summary. This method was
chosen in order to provide worst-case performance and set a
lower bound on performance.
2) Most Recent Summarizer: This summarizer chooses the
most recent k posts from the selection pool as a summary.
It is analogous to choosing the first part of a news article
as summary. It was implemented because often intelligent
summarizers cannot perform better than simple summarizers
that just use the first part of the document as summary.
3) SumBasic: SumBasic [5] uses simple word probabilities
with an update function to compute the best k posts. It was
chosen because it depends solely on the frequency of words
in the original text and is conceptually very simple.
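For reference, the SumBasic idea can be sketched as follows; this uses the common simplification of scoring each post by the average probability of its words and squaring the probabilities of a chosen post's words after each pick, rather than reproducing the exact formulation of [5].

```python
from collections import Counter

def sumbasic(posts, k):
    """Pick k posts by average word probability, downweighting covered words."""
    words = [w for p in posts for w in p.lower().split()]
    prob = {w: c / len(words) for w, c in Counter(words).items()}
    summary, pool = [], list(posts)
    while pool and len(summary) < k:
        best = max(pool, key=lambda p: sum(prob[w] for w in p.lower().split())
                                       / max(len(p.split()), 1))
        summary.append(best)
        pool.remove(best)
        for w in set(best.lower().split()):
            prob[w] **= 2   # update: make already covered words less attractive
    return summary
```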
4) MEAD: This summarizer⁵ [17] is a well-known flexible
and extensible multi-document summarization system and was
chosen to provide a comparison between the more structured
document domain—in which MEAD works fairly well—and
the domain of Twitter posts being studied. In addition, the
default MEAD program is a cluster based summarizer so it
will provide some comparison to our cluster summarizer.
5) LexRank: This summarizer [7] uses a graph based
method that computes pairwise similarity between two
sentences—in our case two posts—and makes the similarity
score the weight of the edge between the two sentences. The
final score of a sentence is computed based on the weights of
the edges that are connected to it. This summarizer was chosen
to provide a baseline for graph based summarization instead
of direct frequency summarization. Though it does depend on
frequency, this system uses the relationships among sentences
to add more information and is therefore a more complex
algorithm than the frequency based ones.
6) TextRank: This summarizer [8] is another graph based
method that uses the PageRank [9] algorithm. This provided
another graph based summarizer that incorporates potentially
more information than LexRank since it recursively changes
the weights of posts. Therefore, the final score of each post
is not only dependent on how it is related to immediately
connected posts but also how those posts are related to other
posts. TextRank incorporates the whole complexity of the
graph rather than just pairwise similarities.
⁵http://www.summarization.com/mead/
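For intuition about the graph-based summarizers above, here is a toy version of such ranking: posts are nodes, edges are weighted by cosine similarity, and scores are computed by power iteration with a PageRank-style damping factor. It is a didactic sketch in the spirit of LexRank and TextRank, not the exact models of [7] or [8].

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two posts as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_posts(posts, d=0.85, iters=50):
    """Score posts on a similarity graph via PageRank-style power iteration."""
    n = len(posts)
    sim = [[cosine(posts[i], posts[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out = [sum(row) or 1.0 for row in sim]           # out-weight of each node
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) / n + d * sum(scores[j] * sim[j][i] / out[j]
                                        for j in range(n)) for i in range(n)]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```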

V. EXPERIMENTAL SETUP
A. Data Collection
For five consecutive days, we collected the top ten currently
trending topics from Twitter’s home page at roughly the
same time every evening. For each topic, we downloaded the
maximum number (approximately 1500) of posts. Therefore,
we had 50 trending topics with a set of 1500 posts for each.
B. Preprocessing the Posts
Pre-processing steps included converting any Unicode characters
into their ASCII equivalents, filtering out any embedded
URLs, discarding spam using a Naïve Bayes classifier, etc.
These pre-processing steps and their rationale are described
more fully in [1].
C. Evaluation Methods
Summary evaluation is performed using one of two meth-
ods: intrinsic or extrinsic. In intrinsic evaluation, the quality
of the summary is judged based on direct analysis using prede-
fined metrics such as grammaticality, fluency, or content [20].
Extrinsic evaluations measure how well a summary enables
a user to perform a task. To perform intrinsic evaluation, a
common approach is to create one or more manual summaries
and to compare the automated summaries against the man-
ual summaries. One popular automatic evaluation metric is
ROUGE, which is a suite of metrics [21]. Both precision and
recall of the automated summaries can be computed using
related formulations of the metric. Given that MS is the set of
manual summaries and u is the set of unigrams in a particular
manual summary, precision can be defined as

p = ( Σ_{m∈MS} Σ_{u∈m} match(u) ) / ( Σ_{m∈MS} Σ_{u∈m} count(u) ) = matched / retrieved,    (7)
where count(u) is the number of unigrams in the automated
summary and match(u) is the number of co-occurring un-
igrams between the manual and automated summaries. The
ROUGE metric can be slightly altered so that it measures the
recall of the auto summaries such that
r = ( Σ_{m∈MS} Σ_{u∈m} match(u) ) / ( |MS| · Σ_{u∈a} count(u) ) = matched / relevant,    (8)

where |MS| is the number of manual summaries and a is the auto summary. We also report the F-measure, which is the harmonic mean of precision and recall.
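A simplified sketch of this unigram-overlap computation is given below, assuming whitespace tokenization and clipped counts; Equations (7) and (8) aggregate over the manual summaries slightly differently, so this should be read as an approximation of ROUGE-1, not a reference implementation.

```python
from collections import Counter

def rouge1(auto, manuals):
    """ROUGE-1-style precision, recall, and F-measure of an automated
    summary against a list of manual summaries."""
    a = Counter(auto.lower().split())
    matched = relevant = 0
    for m in manuals:
        mc = Counter(m.lower().split())
        matched += sum(min(a[w], mc[w]) for w in a)  # clipped co-occurring unigrams
        relevant += sum(mc.values())                 # unigrams in manual summaries
    retrieved = sum(a.values()) * len(manuals)       # auto unigrams, per comparison
    p = matched / retrieved if retrieved else 0.0
    r = matched / relevant if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```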
Lin’s use of ROUGE with the very short (around 10 words)
summary task of DUC 2003 shows that ROUGE-1 and the other
ROUGE variants correlate highly with human judgments [21].
Since this task is very similar to creating microblog summaries,
we implement ROUGE-1 as a metric. However, since we wanted
certainty that ROUGE-1 correlates with a human evaluation, we
also implemented a human evaluation using Amazon Mechanical
Turk⁶, a paid system that pays human workers small amounts of
money for completing a short Human Intelligence Task, or HIT.
The HITs used for summary evaluation displayed the summaries
to be compared side by side with the topic specified. Then, we
asked the user: “The auto-generated summary expresses ____ of
the meaning of the human-produced summary.” The possible
answers were “All”, “Most”, “Some”, “Hardly Any” and “None”,
which correspond to scores of 5 through 1, respectively.

⁶http://www.mturk.com

TABLE I
ANSWERS TO THE SURVEY ABOUT HOW MANY CLUSTERS SEEMED
APPROPRIATE FOR EACH TWITTER TOPIC.

Answer   “3 (Less)”   “4 (About Right)”   “5 (More)”
Count    13           28                  9
D. Manual Summarization
1) Choice of k: An initial question that we must answer
before using any multi-post extractive summarizer on a set of
Twitter posts is how many posts are appropriate in a summary.
Though it is possible to choose k automatically
for clustering [22], we decided to focus our experiments on
summaries with a predefined value of k for several reasons.
First, we wanted to explore other summarization algorithms
for which automatically choosing k is not as straightforward
as in the cluster summarization algorithm. For example, the
SumBasic summarization does not have any mechanism for
choosing the right number of posts in the summary. Second,
we thought it would be difficult to perform evaluation where
the manual summaries were two or three posts in length and
the automatic summaries were five or six posts in length—or
vice versa—because the ROUGE evaluation metric is sensitive
to length even with some normalization.
To get a subjective idea of what people thought about the
value of k = 4 after being immersed in manual clustering
for a while, we took a survey of the volunteers after they
performed clustering of 50 topics—2 people for each of the
25 topics—with 100 posts in each topic. We asked them “How
many clusters do you think this should have had?” with the
choices “3 (Less)”, “4 (About Right)” or “5 (More)”. The
results are in Table I. This survey is probably biased towards
“4 (About Right)” because the question does not allow for
numbers other than 3, 4 or 5. Therefore, these results must be
taken tentatively but they at least suggest that there is some
significant variability about the best value for k. Our bias is
also based on the fact that our initial 1500 Twitter posts on
each topic were obtained within a small interval of 15 minutes,
so we thought a small number of posts would be appropriate.
Since the volunteers had already clustered the posts into
four clusters, the manual summaries were four posts long as
well. This kept the already onerous manual summary creation
process somewhat simple. However, it also means that depending
on a single length for the summaries may impact the evaluation
process, described next, in an unknown way.
2) Manual Summarization Method: Our manual multi-post
summaries were created by volunteers who were undergraduates
from around the US, gathered together in an NSF-supported REU
program. Each of the first 25 topics was manually summarized
by two different volunteers⁷ by performing steps parallel to
the steps of the cluster summarizer. First, the volunteers
clustered the posts into 4 clusters (k = 4). Second, they chose
the most representative post from each cluster. And finally,
they ordered the representative posts in a way that they thought
was most logical or coherent. These steps were chosen because
it was initially thought that a clustering based solution would
be the best way to summarize the Twitter posts, and it seemed
simpler for the volunteers to cluster first rather than simply
looking at all the posts at once. These procedures probably
biased the manual summaries, and consequently the results,
towards clustering based solutions; but since the cluster
summarizer itself did not perform particularly well in the
evaluations, it seems that this bias was not particularly strong.

⁷A total of 16 volunteers produced manual summaries in such a combination that no volunteer would be compared against another specified volunteer more than once.

Fig. 1. F-measures of the Hybrid TF-IDF algorithm over different similarity thresholds. [Plot omitted: F-measure (roughly 0.29 to 0.36) versus similarity threshold (0 to 0.96).]
E. Setup of the Summarizers
Like the manual summaries, the automated summaries were
restricted to four posts each. For MEAD, each
post was formatted to be one document. For LexRank—which
is implemented in the standard MEAD distribution—the posts
for each topic were concatenated into one document. Because
the exact implementation of TextRank [8] was unavailable, the
TextRank summarizer was implemented internally.
For the Hybrid TF-IDF summarizer, in order to keep the
posts from being too similar in content, a preliminary test to
determine the best cosine similarity threshold was conducted.
The F-measure scores when varying the similarity threshold t
of the Hybrid TF-IDF summarizer from 0 to 0.99 are shown in
Figure 1. The best performing threshold of t = 0.77 seems to
be reasonable because it allows for some similarity between
final summary posts but does not allow them to be nearly
identical.
VI. RESULTS AND ANALYSIS
For the summarizers that involve random seeding (the random
summarizer and the cluster summarizer), 100 summaries were
produced for each topic, and the average F-measure over all of
these iterations was computed, in order to smooth out the
effects of random seeding. These numbers can be seen more
clearly in Table II. Also, because we realized that the overlap
of the topic keywords in the summary is trivial, since every
post contains the keywords, we ignored keyword overlap in our
ROUGE calculations.

TABLE II
EVALUATION NUMBERS FOR ROUGE AND MTURK EVALUATIONS.

Number of summaries                    Randomly seeded*   Others
Number of topics                       25                 25
Summaries per topic                    100                1
Total summaries computed               2500               25
ROUGE evaluation:
  ROUGE scores computed                2500               25
MTurk evaluation:
  Number of summaries evaluated        25+                25
  Number of manual summaries per topic 2                  2
  Evaluators per manual summary        2                  2
  Total MTurk evaluations              100                100

* The randomly seeded summarizers were the Random Summarizer and the Cluster Summarizer.
+ An average-scoring summary (by F-measure) for each topic was chosen for the MTurk evaluations because evaluating 2500 summaries would have been impractical.

TABLE III
AVERAGE VALUES OF F-MEASURE, RECALL AND PRECISION, ORDERED BY
F-MEASURE.

                F-measure   Recall    Precision
LexRank         0.2027      0.1894    0.2333
Random          0.2071      0.2283    0.1967
MEAD            0.2204      0.3050    0.1771
Manual          0.2252      0.2320    0.2320
Cluster         0.2310      0.2554    0.2180
TextRank        0.2328      0.3053    0.1954
MostRecent      0.2329      0.2463    0.2253
Hybrid TF-IDF   0.2524      0.2666    0.2499
SumBasic        0.2544      0.3274    0.2127
For the human evaluations using Amazon Mechanical Turk,
each automatic summary was compared to both manual sum-
maries by two different evaluators. This leads to 100 evalua-
tions per summarizer as can be seen in Table II. The manual
summaries were evaluated against each other by pretending
that one of them was the automatic summary.
A. Results
Our experiments evaluated eight different summarizers:
random, most recent, MEAD, TextRank, LexRank, cluster,
Hybrid TF-IDF and SumBasic. Both the automatic ROUGE
based evaluation and the MTurk human evaluation are reported
for all eight summarizers in Figures 2 and 3, respectively. The
values of average F-measure, recall and precision can be seen
in Table III. The values of average MTurk scores can be seen
at the top of Table V.
B. Analysis of Results
1) General Observations: We see that neither the ROUGE
scores nor the human evaluation scores obviously differentiate
among the summarizers, as seen in Figures 2 and 3. Therefore,
we performed a paired two-sided
T-test for each summarizer compared to each other summarizer
for both the ROUGE scores and the human evaluation scores.
For the ROUGE scores, the twenty-five average F-measure
scores corresponding to each topic were used for the paired T-test.
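Assuming SciPy as the statistics dependency, the paired test over per-topic scores can be sketched as:

```python
# Sketch of the paired two-sided t-test over per-topic F-measure scores;
# scipy is an assumed dependency, not something the paper mandates.
from scipy import stats

def differs(f_a, f_b, alpha=0.05):
    """f_a, f_b: the 25 per-topic average F-measures of two summarizers."""
    t, p = stats.ttest_rel(f_a, f_b)   # paired two-sided t-test
    return p < alpha, t, p
```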


REFERENCES

[8] R. Mihalcea and P. Tarau, “TextRank: Bringing Order into Text,” in Proc. EMNLP, 2004.
[9] S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 107–117, 1998.
[19] D. Arthur and S. Vassilvitskii, “k-means++: The Advantages of Careful Seeding,” in Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007.
[22] R. Tibshirani, G. Walther, and T. Hastie, “Estimating the Number of Clusters in a Data Set via the Gap Statistic,” Journal of the Royal Statistical Society: Series B, vol. 63, no. 2, pp. 411–423, 2001.