
Comparing Twitter Summarization Algorithms
for Multiple Post Summaries
David Inouye* and Jugal K. Kalita+
*School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, Georgia 30332 USA
+Department of Computer Science
University of Colorado
Colorado Springs, CO 80918 USA
dinouye3@gatech.edu, jkalita@uccs.edu
Abstract—Due to the sheer volume of text generated by a microblog site like Twitter, it is often difficult to fully understand what is being said about various topics. In an attempt to understand microblogs better, this paper compares algorithms for extractive summarization of microblog posts. We present two algorithms that produce summaries by selecting several posts from a given set. We evaluate the generated summaries by comparing them to both manually produced summaries and summaries produced by several leading traditional summarization systems. In order to shed light on the special nature of Twitter posts, we include extensive analysis of our results, some of which are unexpected.
I. INTRODUCTION

Twitter¹, the microblogging site started in 2006, has become a social phenomenon. In February 2011, Twitter had 200 million registered users². There were a total of 25 billion tweets in all of 2010³. While a majority of posts are conversational or not particularly meaningful, about 3.6% of the posts concern topics of mainstream news⁴.

¹http://www.twitter.com
²http://www.bbc.co.uk/news/business-12889048
³http://blog.twitter.com/2010/12/hindsight2010-top-trends-on-twitter.html
⁴http://www.pearanalytics.com/blog/wp-content/uploads/2010/05/Twitter-Study-August-2009.pdf
To help people who read Twitter posts or tweets, Twitter
provides two interesting features: an API that allows users
to search for posts that contain a topic phrase and a short
list of popular topics called Trending Topics. A user can
perform a search for a topic and retrieve a list of the most
recent posts that contain the topic phrase. The difficulty in
interpreting the results is that the returned posts are only
sorted by recency, not relevancy. Therefore, the user is forced
to manually read through the posts in order to understand
what users are primarily saying about a particular topic. The
motivation of the summarizer is to automate this process.
In this paper, we discuss an ongoing effort to create automatic
summaries of Twitter trending topics. In our recent prior work
[1]–[4], we have discussed algorithms that can be used to pick
the single post that is representative of or is the summary of a
number of Twitter posts. Since the posts returned by the Twit-
ter API for a specified topic likely represent several sub-topics
or themes, it may be more appropriate to produce summaries
that encompass the multiple themes rather than just having
one post describe the whole topic. For this reason, this paper
extends the work significantly to create summaries that contain
multiple posts. We compare our multiple post summaries with
ones produced by leading traditional summarizers.
II. RELATED WORK
Summarizing microblogs can be viewed as an instance of
the more general problem of automated text summarization,
which is the problem of automatically generating a condensed
version of the most important content from one or more
documents. A number of algorithms have been developed
for various aspects of document summarization during recent
years. Notable algorithms include SumBasic [5] and the cen-
troid algorithm [6]. SumBasic’s underlying premise is that
words that occur more frequently across documents have a
higher probability of being selected for human created multi-
document summaries than words that occur less frequently.
The centroid algorithm takes into consideration a centrality
measure of a sentence in relation to the overall topic of the
document cluster or in relation to a document in the case of
single document summarization. The LexRank algorithm [7]
for computing the relative importance of sentences or other
textual units in a document (or a set of documents) creates an
adjacency matrix among the textual units and then computes
the stationary distribution considering it to be a Markov chain.
The TextRank algorithm [8] is also a graph-based approach
that finds the most highly ranked sentences (or keywords) in
a document using the PageRank algorithm [9].
In most cases, text summarization is performed for the pur-
poses of saving users time by reducing the amount of content
to read. However, text summarization has also been performed
for purposes such as reducing the number of features required
for classifying (e.g. [10]) or clustering (e.g. [11]) documents.
In another line of approach, early work by Kalita et al.
generated textual summaries of database query results [12]–
[14]. Instead of presenting a table of data rows as the response
to a database query, they generated textual summaries from
predominant patterns found within the data table.
In the context of the Web, multi-document summarization
is useful in combining information from multiple sources.
Information may have to be extracted from many different
articles and pieced together to form a comprehensive and
coherent summary. One major difference between single docu-
ment summarization and multi-document summarization is the
potential redundancy that comes from using many source texts.
One solution may involve clustering the important sentences
picked out from the various source texts and using only
a representative sentence from each cluster. For example,
McKeown et al. first cluster the text units and then choose
the representative units from the clusters to include in the
final summary [15]. Dhillon models a document collection as
a bipartite graph consisting of words and documents and uses
a spectral co-clustering algorithm to obtain excellent results
[16].
Finally, in the context of multi-document summarization, it
is appropriate to mention MEAD [17], a publicly available,
flexible platform for multi-document, multilingual summarization.
MEAD implements multiple summarization algorithms and
provides metrics for evaluating multi-document summaries.
III. PROBLEM DESCRIPTION
A Twitter post or tweet is at most 140 characters long
and in this study we only consider English posts. Because
a post is informal, it often has colloquial syntax, non-standard
orthography or non-standard spelling, and it frequently lacks
any punctuation.
The problem considered in this paper can be defined as
follows:

    Given a topic keyword or phrase T and the desired length k
    for the summary, output a set of representative posts S with
    a cardinality of k such that 1) ∀s ∈ S, T is in the text of s,
    and 2) ∀s_i, s_j ∈ S, s_i ≁ s_j. Here, s_i ≁ s_j means that the
    two posts provide sufficiently different information, in order
    to keep the summaries from being redundant.
IV. SELECTED APPROACHES FOR TWITTER SUMMARIES
Among the many algorithms we discussed in prior papers [1]–[4]
for creating single-post summaries of tweets, an algorithm we
developed, called the Hybrid TF-IDF algorithm, worked best.
Thus, in this paper, we extend this algorithm to obtain
multi-post summaries. The contributions of this paper are the
introduction of a hybrid TF-IDF based algorithm and a clustering
based algorithm for obtaining multi-post summaries of Twitter
posts, along with a detailed analysis of the Twitter post domain
for text processing, obtained by comparing these algorithms
with several other summarization algorithms. We find some
unexpected results when we apply multiple document
summarization algorithms to short informal documents.
A. Hybrid TF-IDF with Similarity Threshold
Term Frequency Inverse Document Frequency (TF-IDF) is a
statistical weighting technique that assigns each term within a
document a weight that reflects the term’s saliency within
the document. The weight of a post is the summation of
the individual term weights within the post. To determine the
weight of a term, we use the formula:
TFIDF = tf_ij · log₂(N / df_j)    (1)

where tf_ij is the frequency of the term T_j within the document D_i, N is the total number of documents, and df_j is the number of documents within the set that contain the term T_j. We assume that a term corresponds to a word and select the most weighted post as the summary.
The TF-IDF value is composed of two primary parts. The
term frequency component (TF) assigns more weight to words
that occur frequently within a document because important
words are often repeated. The inverse document frequency
component (IDF) compensates for the fact that some words,
such as common stop words, are frequent. Since these words
do not help discriminate between one sentence or document
and another, they are penalized in proportion to their
inverse document frequency. The logarithm is taken to balance
the effect of the IDF component in the formula.
Equation (1) defines the weight of a term in the context of
a document. However, a microblog post is not a traditional
document. Therefore, one question we must first answer is
how we define a document. One option is to define a single
document that encompasses all the posts. In this case, the
TF component’s definition is straightforward since we can
compute the frequencies of the terms across all the posts.
However, doing so causes us to lose the IDF component
since we only have a single document. On the other extreme,
we could define each post as a document making the IDF
component’s definition clear. But, the TF component now has
a problem: because each post contains only a handful of words,
most term frequencies will be a small constant for a given post.
To handle this situation, we redefine TF-IDF in terms of
a hybrid document. We primarily define a document as a
single post. However, when computing the term frequencies,
we assume the document is the entire collection of posts.
Therefore, the TF component of the TF-IDF formula uses the
entire collection of posts while the IDF component treats each
post as a separate document. This way, we have differentiated
term frequencies but also do not lose the IDF component.
We next choose a normalization method since otherwise
the TF-IDF algorithm will always bias towards longer posts.
We normalize the weight of a post by dividing it by a
normalization factor. Since common stop words do not help
discriminate the saliency of sentences, we give stop words—
as defined by a prebuilt list—a weight of zero. Given this,
our definition of the TF-IDF summarization algorithm is now
complete for microblogs. We summarize this algorithm below
in Equations (2)-(6).
W(s) = ( Σ_{i=0}^{#WordsInPost} W(w_i) ) / nf(s)    (2)

W(w_i) = tf(w_i) · log₂(idf(w_i))    (3)

tf(w_i) = #OccurrencesOfWordInAllPosts / #WordsInAllPosts    (4)

idf(w_i) = #Posts / #PostsInWhichWordOccurs    (5)

nf(s) = max[MinimumThreshold, #WordsInPost]    (6)

where W is the weight assigned to a post or a word, nf is a normalization factor, w_i is the ith word, and s is a post.
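To make Equations (2)-(6) concrete, the sketch below computes the Hybrid TF-IDF weight of a post, with term frequencies taken over the entire collection and document frequencies treating each post as its own document. The whitespace tokenizer, the small stop-word list, and the value of the normalization floor are illustrative assumptions, since the paper does not specify them.

```python
import math
from collections import Counter

# Illustrative stop words and normalization floor; the paper uses a prebuilt
# stop-word list and does not specify the MinimumThreshold value.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "on"}
MIN_THRESHOLD = 11

def hybrid_tfidf_weight(post, posts):
    """Weight of `post` per Equations (2)-(6): TF is computed over all
    posts, while IDF treats each post as a separate document."""
    all_words = [w for p in posts for w in p.lower().split()]
    tf_counts = Counter(all_words)
    n_posts = len(posts)
    total = 0.0
    for w in post.lower().split():
        if w in STOP_WORDS:
            continue                                    # stop words weigh zero
        tf = tf_counts[w] / len(all_words)              # Equation (4)
        df = sum(1 for p in posts if w in p.lower().split())
        total += tf * math.log2(n_posts / df)           # Equations (3) and (5)
    nf = max(MIN_THRESHOLD, len(post.split()))          # Equation (6)
    return total / nf                                   # Equation (2)
```

For a single-post summary, max(posts, key=lambda p: hybrid_tfidf_weight(p, posts)) then recovers the most weighted post.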
We select the top k most weighted posts. Because the most
weighted posts may be very similar or discuss the same subtopic,
the algorithm, in order to avoid redundancy, selects each next
top post only after checking that it does not have a similarity
above a given threshold t to any of the previously selected
posts. This similarity threshold filters out a candidate summary
post s_i if it satisfies the following condition:

sim(s_i, s_j) > t,  ∀s_j ∈ R

where R is the set of posts already chosen for the final summary
and t is the similarity threshold. We use the cosine similarity
measure. The threshold was varied from 0 to 0.99 in increments
of 0.01, for a total of 100 tests, in order to find the best
threshold to be used.
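The selection step can be sketched as follows, assuming the hybrid_tfidf_weight() function from the previous sketch and a plain bag-of-words cosine similarity, which is one reasonable reading of the paper's cosine measure.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two posts as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_top_k(posts, k, t, weight_fn):
    """Greedily take the most weighted posts, skipping any post whose
    cosine similarity to an already selected post exceeds threshold t."""
    ranked = sorted(posts, key=lambda p: weight_fn(p, posts), reverse=True)
    summary = []
    for post in ranked:
        if all(cosine(post, s) <= t for s in summary):
            summary.append(post)
        if len(summary) == k:
            break
    return summary
```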
B. Cluster Summarizer
We develop another method for summarizing a set of Twitter
posts. Similar to [15] and [16], we first cluster the tweets into
k clusters based on a similarity measure and then summarize
each cluster by picking the most weighted post as determined
by the Hybrid TF-IDF weighting described in Section IV-A.
During preliminary tests, we evaluated how well different
clustering algorithms would work on Twitter posts using the
weights computed by the Hybrid TF-IDF algorithm and the
cosine similarity measure. We implemented two variations of
the k-means algorithm: bisecting k-means [18] and k-means++
[19]. The bisecting k-means algorithm initially divides the
input into two clusters and then divides the largest cluster
into two smaller clusters. This splitting is repeated until the
kth cluster is formed. The k-means++ algorithm is similar to
the regular k-means algorithm except that it chooses the initial
centroids differently. It picks an initial centroid c_1 from the
set of vertices V at random. It then chooses each next centroid
c_i, selecting c_i = v′ ∈ V with probability

D(v′)² / Σ_{v∈V} D(v)²

where D(v) is the shortest Euclidean distance from v to the
closest center that has already been chosen. It repeats this
selection process until k initial centroids have been chosen.
After trying these methods, we found that the bisecting
k-means++ algorithm, a combination of the two algorithms,
performed the best, even though the performance gain over
standard k-means was not very high according to our evaluation
methods.
Thus, the cluster summarizer attempts to create k subtopics
by clustering the posts. It then feeds each subtopic cluster to
the Hybrid TF-IDF algorithm discussed in Section IV-A, which
selects the most weighted post for each subtopic.
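A condensed sketch of this cluster-then-summarize pipeline appears below. The seeding follows the D(v)² distribution described above, and weight_fn can be the hybrid_tfidf_weight() sketched earlier; the bag-of-words vectors, the fixed iteration count, and the handling of degenerate splits are simplifying assumptions rather than the authors' exact implementation.

```python
import math
import random
from collections import Counter

def vectorize(post):
    """Bag-of-words vector for a post."""
    return Counter(post.lower().split())

def dist(a, b):
    """Euclidean distance between two sparse word-count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def mean(vectors):
    """Component-wise mean (centroid) of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({k: c / len(vectors) for k, c in total.items()})

def split_in_two(posts, iters=10):
    """One bisecting step: 2-means with k-means++ style seeding."""
    vecs = [vectorize(p) for p in posts]
    c1 = random.choice(vecs)                         # first centroid: uniform
    w = [dist(v, c1) ** 2 for v in vecs]             # D(v)^2 seeding weights
    c2 = random.choices(vecs, weights=w, k=1)[0] if sum(w) else c1
    halves = ([], [])
    for _ in range(iters):
        halves = ([], [])
        for p, v in zip(posts, vecs):
            halves[0 if dist(v, c1) <= dist(v, c2) else 1].append(p)
        if halves[0] and halves[1]:
            c1 = mean([vectorize(p) for p in halves[0]])
            c2 = mean([vectorize(p) for p in halves[1]])
    return [h for h in halves if h]

def cluster_summarize(posts, k, weight_fn):
    """Bisect the largest cluster until k clusters exist, then pick the
    most weighted post (e.g. by Hybrid TF-IDF) from each cluster."""
    clusters = [list(posts)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        parts = split_in_two(clusters.pop(0))
        clusters.extend(parts)
        if len(parts) < 2:                           # degenerate split: stop early
            break
    return [max(c, key=lambda p: weight_fn(p, posts)) for c in clusters]
```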
C. Additional Summarization Algorithms to Compare Results
We compare the results of summarization of the two
newly introduced algorithms with baseline algorithms and
well-known multi-document summarization algorithms. The
baseline algorithms include a Random summarizer and a Most
Recent summarizer. The other algorithms we compare our
results with are SumBasic, MEAD, LexRank and TextRank.
1) Random Summarizer: This summarizer randomly
chooses k posts for each topic as the summary. This method was
chosen in order to provide worst-case performance and set a
lower bound on performance.
2) Most Recent Summarizer: This summarizer chooses the
most recent k posts from the selection pool as a summary.
It is analogous to choosing the first part of a news article
as summary. It was implemented because often intelligent
summarizers cannot perform better than simple summarizers
that just use the first part of the document as summary.
3) SumBasic: SumBasic [5] uses simple word probabilities
with an update function to compute the best k posts. It was
chosen because it depends solely on the frequency of words
in the original text and is conceptually very simple.
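For reference, the SumBasic idea can be sketched as follows; this uses the common simplification of scoring each post by the average probability of its words and squaring the probabilities of a chosen post's words after each pick, rather than reproducing the exact formulation of [5].

```python
from collections import Counter

def sumbasic(posts, k):
    """Pick k posts by average word probability, downweighting covered words."""
    words = [w for p in posts for w in p.lower().split()]
    prob = {w: c / len(words) for w, c in Counter(words).items()}
    summary, pool = [], list(posts)
    while pool and len(summary) < k:
        best = max(pool, key=lambda p: sum(prob[w] for w in p.lower().split())
                                       / max(len(p.split()), 1))
        summary.append(best)
        pool.remove(best)
        for w in set(best.lower().split()):
            prob[w] **= 2   # update: make already covered words less attractive
    return summary
```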
4) MEAD: This summarizer⁵ [17] is a well-known flexible
and extensible multi-document summarization system and was
chosen to provide a comparison between the more structured
document domain—in which MEAD works fairly well—and
the domain of Twitter posts being studied. In addition, the
default MEAD program is a cluster based summarizer so it
will provide some comparison to our cluster summarizer.
5) LexRank: This summarizer [7] uses a graph based
method that computes pairwise similarity between two
sentences—in our case two posts—and makes the similarity
score the weight of the edge between the two sentences. The
final score of a sentence is computed based on the weights of
the edges that are connected to it. This summarizer was chosen
to provide a baseline for graph based summarization instead
of direct frequency summarization. Though it does depend on
frequency, this system uses the relationships among sentences
to add more information and is therefore a more complex
algorithm than the frequency based ones.
6) TextRank: This summarizer [8] is another graph based
method that uses the PageRank [9] algorithm. This provided
another graph based summarizer that incorporates potentially
more information than LexRank since it recursively changes
the weights of posts. Therefore, the final score of each post
is not only dependent on how it is related to immediately
connected posts but also how those posts are related to other
posts. TextRank incorporates the whole complexity of the
graph rather than just pairwise similarities.
⁵http://www.summarization.com/mead/
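For intuition about the graph-based summarizers above, here is a toy version of such ranking: posts are nodes, edges are weighted by cosine similarity, and scores are computed by power iteration with a PageRank-style damping factor. It is a didactic sketch in the spirit of LexRank and TextRank, not the exact models of [7] or [8].

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two posts as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_posts(posts, d=0.85, iters=50):
    """Score posts on a similarity graph via PageRank-style power iteration."""
    n = len(posts)
    sim = [[cosine(posts[i], posts[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out = [sum(row) or 1.0 for row in sim]           # out-weight of each node
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) / n + d * sum(scores[j] * sim[j][i] / out[j]
                                        for j in range(n)) for i in range(n)]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```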

V. EXPERIMENTAL SETUP
A. Data Collection
For five consecutive days, we collected the top ten currently
trending topics from Twitter’s home page at roughly the
same time every evening. For each topic, we downloaded the
maximum number (approximately 1500) of posts. Therefore,
we had 50 trending topics with a set of 1500 posts for each.
B. Preprocessing the Posts
Pre-processing steps included converting any Unicode characters
into their ASCII equivalents, filtering out any embedded
URLs, discarding spam using a Naïve Bayes classifier, etc.
These pre-processing steps and their rationale are described
more fully in [1].
C. Evaluation Methods
Summary evaluation is performed using one of two meth-
ods: intrinsic or extrinsic. In intrinsic evaluation, the quality
of the summary is judged based on direct analysis using prede-
fined metrics such as grammaticality, fluency, or content [20].
Extrinsic evaluations measure how well a summary enables
a user to perform a task. To perform intrinsic evaluation, a
common approach is to create one or more manual summaries
and to compare the automated summaries against the man-
ual summaries. One popular automatic evaluation metric is
ROUGE, which is a suite of metrics [21]. Both precision and
recall of the automated summaries can be computed using
related formulations of the metric. Given that MS is the set of
manual summaries and u is the set of unigrams in a particular
manual summary, precision can be defined as

p = ( Σ_{m∈MS} Σ_{u∈m} match(u) ) / ( Σ_{m∈MS} Σ_{u∈m} count(u) ) = matched / retrieved,    (7)
where count(u) is the number of unigrams in the automated
summary and match(u) is the number of co-occurring un-
igrams between the manual and automated summaries. The
ROUGE metric can be slightly altered so that it measures the
recall of the auto summaries such that
r = ( Σ_{m∈MS} Σ_{u∈m} match(u) ) / ( |MS| · Σ_{u∈a} count(u) ) = matched / relevant,    (8)

where |MS| is the number of manual summaries and a is the auto summary. We also report the F-measure, which is the harmonic mean of precision and recall.
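A simplified sketch of this unigram-overlap computation is given below, assuming whitespace tokenization and clipped counts; Equations (7) and (8) aggregate over the manual summaries slightly differently, so this should be read as an approximation of ROUGE-1, not a reference implementation.

```python
from collections import Counter

def rouge1(auto, manuals):
    """ROUGE-1-style precision, recall, and F-measure of an automated
    summary against a list of manual summaries."""
    a = Counter(auto.lower().split())
    matched = relevant = 0
    for m in manuals:
        mc = Counter(m.lower().split())
        matched += sum(min(a[w], mc[w]) for w in a)  # clipped co-occurring unigrams
        relevant += sum(mc.values())                 # unigrams in manual summaries
    retrieved = sum(a.values()) * len(manuals)       # auto unigrams, per comparison
    p = matched / retrieved if retrieved else 0.0
    r = matched / relevant if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```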
Lin’s use of ROUGE with the very short (around 10 words)
summary task of DUC 2003 shows that ROUGE-1 and the other
ROUGE variants correlate highly with human judgments [21].
Since this task is very similar to creating microblog summaries,
we implement ROUGE-1 as a metric. However, since we wanted
certainty that ROUGE-1 correlates with a human evaluation, we
also implemented a human evaluation using Amazon Mechanical
Turk⁶, a paid system that pays human workers small amounts of
money for completing a short Human Intelligence Task, or HIT.
The HITs used for summary evaluation displayed the summaries
to be compared side by side with the topic specified. Then, we
asked the user: “The auto-generated summary expresses ____ of
the meaning of the human-produced summary.” The possible
answers were “All”, “Most”, “Some”, “Hardly Any” and “None”,
which correspond to scores of 5 through 1, respectively.

⁶http://www.mturk.com

TABLE I
ANSWERS TO THE SURVEY ABOUT HOW MANY CLUSTERS SEEMED
APPROPRIATE FOR EACH TWITTER TOPIC.

Answer   “3 (Less)”   “4 (About Right)”   “5 (More)”
Count    13           28                  9
D. Manual Summarization
1) Choice of k: An initial question that we must answer
before using any multi-post extractive summarizer on a set of
Twitter posts is how many posts are appropriate in a summary.
Though it is possible to choose k automatically
for clustering [22], we decided to focus our experiments on
summaries with a predefined value of k for several reasons.
First, we wanted to explore other summarization algorithms
for which automatically choosing k is not as straightforward
as in the cluster summarization algorithm. For example, the
SumBasic summarization does not have any mechanism for
choosing the right number of posts in the summary. Second,
we thought it would be difficult to perform evaluation where
the manual summaries were two or three posts in length and
the automatic summaries were five or six posts in length—or
vice versa—because the ROUGE evaluation metric is sensitive
to length even with some normalization.
To get a subjective idea of what people thought about the
value of k = 4 after being immersed in manual clustering
for a while, we took a survey of the volunteers after they
performed clustering of 50 topics—2 people for each of the
25 topics—with 100 posts in each topic. We asked them “How
many clusters do you think this should have had?” with the
choices “3 (Less)”, “4 (About Right)” or “5 (More)”. The
results are in Table I. This survey is probably biased towards
“4 (About Right)” because the question does not allow for
numbers other than 3, 4 or 5. Therefore, these results must be
taken tentatively but they at least suggest that there is some
significant variability about the best value for k. Our bias is
also based on the fact that our initial 1500 Twitter posts on
each topic were obtained within a small interval of 15 minutes,
so we thought a small number of posts would be appropriate.
Since the volunteers had already clustered the posts into
four clusters, the manual summaries were four posts long as
well. This kept the already onerous manual summary creation
process somewhat simple. However, it also means that depending
on a single length for the summaries may impact the evaluation
process, described next, in an unknown way.
2) Manual Summarization Method: Our manual multi-post
summaries were created by volunteers who were undergraduates
from around the US, gathered together in an NSF-supported REU
program. Each of the first 25 topics was manually summarized
by two different volunteers⁷ by performing steps parallel to
the steps of the cluster summarizer. First, the volunteers
clustered the posts into 4 clusters (k = 4). Second, they chose
the most representative post from each cluster. And finally,
they ordered the representative posts in a way that they thought
was most logical or coherent. These steps were chosen because
it was initially thought that a clustering based solution would
be the best way to summarize the Twitter posts, and it seemed
simpler for the volunteers to cluster first rather than simply
looking at all the posts at once. These procedures probably
biased the manual summaries, and consequently the results,
towards clustering based solutions; but since the cluster
summarizer itself did not perform particularly well in the
evaluations, it seems that this bias was not particularly strong.

⁷A total of 16 volunteers produced manual summaries in such a combination that no volunteer would be compared against another specified volunteer more than once.

Fig. 1. F-measures of the Hybrid TF-IDF algorithm over different similarity thresholds. [Plot omitted: F-measure (roughly 0.29 to 0.36) versus similarity threshold (0 to 0.96).]
E. Setup of the Summarizers
Like the manual summaries, the automated summaries were
restricted to four posts each. For MEAD, each
post was formatted to be one document. For LexRank—which
is implemented in the standard MEAD distribution—the posts
for each topic were concatenated into one document. Because
the exact implementation of TextRank [8] was unavailable, the
TextRank summarizer was implemented internally.
For the Hybrid TF-IDF summarizer, in order to keep the
posts from being too similar in content, a preliminary test to
determine the best cosine similarity threshold was conducted.
The F-measure scores when varying the similarity threshold t
of the Hybrid TF-IDF summarizer from 0 to 0.99 are shown in
Figure 1. The best performing threshold of t = 0.77 seems to
be reasonable because it allows for some similarity between
final summary posts but does not allow them to be nearly
identical.
VI. RESULTS AND ANALYSIS
For the summarizers that involve random seeding (the random
summarizer and the cluster summarizer), 100 summaries were
produced for each topic, and the average F-measure over all of
these iterations was computed, in order to smooth out the
effects of random seeding. These numbers can be seen more
clearly in Table II. Also, because we realized that the overlap
of the topic keywords in the summary is trivial, since every
post contains the keywords, we ignored keyword overlap in our
ROUGE calculations.

TABLE II
EVALUATION NUMBERS FOR ROUGE AND MTURK EVALUATIONS.

Number of summaries                    Randomly seeded*   Others
Number of topics                       25                 25
Summaries per topic                    100                1
Total summaries computed               2500               25
ROUGE evaluation:
  ROUGE scores computed                2500               25
MTurk evaluation:
  Number of summaries evaluated        25+                25
  Number of manual summaries per topic 2                  2
  Evaluators per manual summary        2                  2
  Total MTurk evaluations              100                100

* The randomly seeded summarizers were the Random Summarizer and the Cluster Summarizer.
+ An average-scoring summary (by F-measure) for each topic was chosen for the MTurk evaluations because evaluating 2500 summaries would have been impractical.

TABLE III
AVERAGE VALUES OF F-MEASURE, RECALL AND PRECISION, ORDERED BY
F-MEASURE.

                F-measure   Recall    Precision
LexRank         0.2027      0.1894    0.2333
Random          0.2071      0.2283    0.1967
MEAD            0.2204      0.3050    0.1771
Manual          0.2252      0.2320    0.2320
Cluster         0.2310      0.2554    0.2180
TextRank        0.2328      0.3053    0.1954
MostRecent      0.2329      0.2463    0.2253
Hybrid TF-IDF   0.2524      0.2666    0.2499
SumBasic        0.2544      0.3274    0.2127
For the human evaluations using Amazon Mechanical Turk,
each automatic summary was compared to both manual sum-
maries by two different evaluators. This leads to 100 evalua-
tions per summarizer as can be seen in Table II. The manual
summaries were evaluated against each other by pretending
that one of them was the automatic summary.
A. Results
Our experiments evaluated eight different summarizers:
random, most recent, MEAD, TextRank, LexRank, cluster,
Hybrid TF-IDF and SumBasic. Both the automatic ROUGE
based evaluation and the MTurk human evaluation are reported
for all eight summarizers in Figures 2 and 3, respectively. The
values of average F-measure, recall and precision can be seen
in Table III. The values of average MTurk scores can be seen
at the top of Table V.
B. Analysis of Results
1) General Observations: We see that neither the ROUGE
scores nor the human evaluation scores obviously differentiate
among the summarizers, as seen in Figures 2 and 3. Therefore,
we performed a paired two-sided
T-test for each summarizer compared to each other summarizer
for both the ROUGE scores and the human evaluation scores.
For the ROUGE scores, the twenty-five average F-measure
scores corresponding to each topic were used for the paired T-test.
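Assuming SciPy as the statistics dependency, the paired test over per-topic scores can be sketched as:

```python
# Sketch of the paired two-sided t-test over per-topic F-measure scores;
# scipy is an assumed dependency, not something the paper mandates.
from scipy import stats

def differs(f_a, f_b, alpha=0.05):
    """f_a, f_b: the 25 per-topic average F-measures of two summarizers."""
    t, p = stats.ttest_rel(f_a, f_b)   # paired two-sided t-test
    return p < alpha, t, p
```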


REFERENCES

[8] R. Mihalcea and P. Tarau, “TextRank: Bringing Order into Text,” in Proc. EMNLP, 2004.
[9] S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 107–117, 1998.
[19] D. Arthur and S. Vassilvitskii, “k-means++: The Advantages of Careful Seeding,” in Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007.
[22] R. Tibshirani, G. Walther, and T. Hastie, “Estimating the Number of Clusters in a Data Set via the Gap Statistic,” Journal of the Royal Statistical Society: Series B, vol. 63, no. 2, pp. 411–423, 2001.