
Experiments in Microblog Summarization

TL;DR: The goal is to produce summaries similar to what a human would produce for the same collection of posts on a specific topic; the summaries produced by the summarizing algorithms are evaluated against human-produced summaries, with excellent results.
Abstract: This paper presents algorithms for summarizing microblog posts. In particular, our algorithms process collections of short posts on specific topics on the well-known site called Twitter and create short summaries from these collections of posts on a specific topic. The goal is to produce summaries that are similar to what a human would produce for the same collection of posts on a specific topic. We evaluate the summaries produced by the summarizing algorithms, compare them with human-produced summaries and obtain excellent results.

Summary (4 min read)

Introduction

  • Twitter, the microblogging site started in 2006, has become a social phenomenon, with more than 20 million visitors each month.
  • While the majority of posts are conversational or not very meaningful, about 3.6% of the posts concern topics of mainstream news.
  • To help people who read Twitter posts or tweets, Twitter provides a short list of popular topics called Trending Topics.
  • The authors create summaries in various ways and evaluate them using metrics for automatic summary evaluation.

III. PROBLEM DESCRIPTION

  • The difficulty in interpreting the results is that the returned posts are only sorted by recency.
  • The motivation of the summarizer is to automate this process and generate a more representative summary with less time and effort.
  • Most search engines built into microblogging services only return a limited number of results when querying for a particular topic or phrase.
  • Twitter only returns a maximum of 1500 posts for a single search phrase.
  • Given a set of posts that are all related by containing a common search phrase (e.g. a topic), generate a summary that best describes the primary gist of what users are saying about that search phrase.

IV. SELECTED APPROACHES

  • The authors choose an extractive approach since its methodologies more closely relate to the structure and diversity of microblogs.
  • Microblogs are the antithesis to long documents.
  • Extractive techniques are also known to better scale with more diverse domains [22].
  • First, the authors create the novel Phrase Reinforcement algorithm that uses a graph to represent overlapping phrases in a set of related microblog sentences.
  • The authors develop another primary algorithm based on a well established statistical methodology known as TF-IDF.

A. The Phrase Reinforcement Algorithm

  • The Phrase Reinforcement (PR) algorithm generates summaries by looking for the most commonly occurring phrases.
  • The second observation is that microbloggers often repeat the most relevant posts for a trending topic by quoting others.
  • Subsequently, the algorithm isolates the set of words that occur immediately before the current node’s phrase.
  • First, the authors create the left partial summary by searching for all paths (using a depth-first search algorithm) that begin at the root node and end at any other node, going backwards.
  • It produces the following final summary: RIP Comedian Soupy Sales dies at age 83.

B. Hybrid TF-IDF Summarization

  • After analyzing the results obtained by the Phrase Reinforcement approach, the authors notice that it significantly improves upon their earlier results, but it still leaves room for improvement as its performance only halves the difference between the random and manual summarization methods (see Section VII-E).
  • For straightforward automated summarization, the application of TF-IDF is simple.
  • The sentences are ordered by their weights from which the top m sentences with the most weight are chosen as the summary.
  • Therefore, TF-IDF gives the most weight to words that occur most frequently within a small number of documents and the least weight to terms that occur infrequently or occur within the majority of the documents.
  • When generating a summary from multiple documents this becomes an issue because the terms within the longer documents have more weight.

C. Algorithm

  • The authors don’t have a traditional document.
  • On the other extreme, the authors could define each post as a document making the IDF component’s definition clear.
  • When computing the term frequencies, the authors assume the document is the entire collection of posts.
  • The authors next choose a normalization method since otherwise the TF-IDF algorithm always biases towards longer sentences.
  • The authors summarize this algorithm below in Equations (3)-(7).

A. Data Collection and Pre-processing

  • For five consecutive days, the authors collected the top ten currently trending topics from Twitter’s home page at roughly the same time every evening.
  • For each topic, the authors downloaded the maximum number (approximately 1500) of posts.
  • Therefore, the authors had 50 trending topics with a set of 1500 posts for each.

B. Evaluation Methods

  • There is no definitive standard against which one can compare the results from an automated summarization system.
  • In intrinsic evaluation, the quality of the summary is judged based on direct analysis using a number of predefined metrics such as grammaticality, fluency, or content [1].
  • ROUGE is a suite of metrics that automatically measures the similarity between an automated summary and a set of manual summaries [30].
  • In Equation (8), n is the length of the n-grams, Count(n-gram) is the number of n-grams in the manual summary, and Match(n-gram) is the number of co-occurring n-grams between the manual and automated summaries.
  • Lin [30] performed evaluations to understand how well different forms of ROUGE’s results correlate with human judgments.

C. Manual Summaries

  • Two volunteers each generated a complete set of 50 manual “best” summaries, one per topic, in 140 characters or less, using only the information contained within the posts (see Table I).
  • The manual summaries generated by their two volunteers are semantically very similar to one another but have different lengths and word choices.
  • The authors use ROUGE-1 to compare the manual summaries against one another.
  • By evaluating their two manual summaries against one another, the authors help establish practical upper limits of performance for automated summaries.
  • These results in addition to the results of the preliminary algorithms collectively establish a range of expected performance for their primary algorithms.

D. Performance of Naïve Algorithms

  • These results are higher than the authors originally anticipated, given that their average manual F-measure is only 0.34.
  • Some of the overlap is explained by the fact that ROUGE-1 counts common words.
  • Overall, the random sentence approach produces a summary that is more balanced in terms of recall and precision and a higher F-measure as well (0.23 vs. 0.20).
  • The shortest sentence and shortest post approaches generate summaries that are far too short, averaging only two or three words in length.
  • Because of their short length, these two approaches achieve very high precision but fail badly at recall, scoring less than either of the random approaches.

E. Performance of Phrase Reinforcement Algorithm

  • This is a significant improvement over the random sentence approach.
  • This score is an improvement over the random summaries.
  • By assigning less weight to nodes farther from the root phrase, the algorithm prefers more common shorter phrases over less common longer phrases.
  • There appears to be a threshold (when b ≈ 100) for which smaller values of b begin reducing the average summary length.
  • In Figure 4, the label “PR Phrase (NULL)” indicates the absence of the weighting parameter altogether.

F. Performance of the Hybrid TF-IDF Algorithm

  • In Figure 7, the authors present the results of the TF-IDF algorithm for the ROUGE-1 metric.
  • The TF-IDF results are denoted as TF-IDF Sentence (11) to indicate that the TF-IDF algorithm produces sentences instead of phrases for summaries and that the authors use a threshold of 11 words as their normalization factor.
  • This score is also higher than the average Content score of the Phrase Reinforcement algorithm which was 3.66.
  • Interestingly, the TF-IDF summaries are one word shorter, on average, than the manual summaries with an average length of 9 words.
  • As seen in Figure 10, by varying the normalization threshold, the authors are able to control the average summary length and resulting ROUGE-1 precision and recall.

VIII. CONCLUSION

  • The authors have presented two primary approaches to microblog summarization.
  • The authors find, after exhaustive experimentation, that the Hybrid TF-IDF algorithm produces summaries as good as, or better than, those of the PR algorithm.
  • One challenge will be to produce a coherent multi-sentence summary because of issues such as presence of anaphora and other coherence issues.
  • The authors also want to cluster the posts on a specific topic into k clusters to find various themes and sub-themes that are present in the posts and then find a summary that may be one post or multiple posts.
  • Of course, one of the biggest problems the authors have observed with microblogs is redundancy; in other words, quite frequently, a large number of posts are similar to one another.


Experiments in Microblog Summarization
Beaux Sharifi, Mark-Anthony Hutton and Jugal K. Kalita
Department of Computer Science
University of Colorado
Colorado Springs, CO 80918 USA
{bsharifi,mhutton86}@gmail.com, jkalita@uccs.edu
Abstract—This paper presents algorithms for summarizing
microblog posts. In particular, our algorithms process collections
of short posts on specific topics on the well-known site called
Twitter and create short summaries from these collections of
posts on a specific topic. The goal is to produce summaries
that are similar to what a human would produce for the same
collection of posts on a specific topic. We evaluate the summaries
produced by the summarizing algorithms, compare them with
human-produced summaries and obtain excellent results.
I. INTRODUCTION
Twitter, the microblogging site started in 2006, has become
a social phenomenon, with more than 20 million visitors each
month. While the majority of posts are conversational or not
very meaningful, about 3.6% of the posts concern topics of
mainstream news¹. At the end of 2009, Twitter had 75 million
account holders, of which about 20% are active². There are
approximately 2.5 million Twitter posts per day³. To help
people who read Twitter posts or tweets, Twitter provides a
short list of popular topics called Trending Topics. Without
the availability of such topics on which one can search to find
related posts, for the general user, there is no way to get an
overall understanding of what is being written about in Twitter.
Twitter works with a website called WhatTheTrend⁴ to
provide definitions of trending topics. WhatTheTrend allows
users to manually enter descriptions of why a topic is trending.
However, WhatTheTrend suffers from spam and irrelevant
posts that reduce its utility to some extent.
In this paper, we discuss an effort to automatically create
“summary” posts of Twitter trending topics. In short, we
perform searches for Twitter on trending topics, get a large
number of posts on a topic and then automatically create a
short post that is representative of all the posts on the topic.
We create summaries in various ways and evaluate them using
metrics for automatic summary evaluation.
II. RELATED WORK
Automatically summarizing microblog topics is a new area
of research and to the authors’ best knowledge, approaches
have not been published. However, summarizing microblogs
can be viewed as an instance of automated text summarization,
¹ http://www.pearanalytics.com/blog/tag/twitter/
² http://themetricsystem.rjmetrics.com/2010/01/26/new-data-on-twitters-users-and-engagement/
³ http://blog.twitter.com/2010/02/measuring-tweets.html
⁴ http://www.whatthetrend.com/
which is the problem of automatically generating a condensed
version of the most important content from one or more
documents for a particular set of users or tasks [1].
As early as the 1950s, Luhn was experimenting with meth-
ods for automatically generating extracts of technical articles
[2]. Edmundson [3] developed techniques for summarizing a
diverse corpora of documents to help users evaluate documents
for further reading. Early research in text summarization
focused primarily on simple statistical techniques that relied
upon lexical features such as word frequencies (e.g. [2]) or
formatting clues such as titles and headings (e.g. [3]).
Later work integrated more sophisticated approaches such
as machine learning, and natural language processing. In most
cases, text summarization is performed for the purposes of
saving users time by reducing the amount of content to read.
However, text summarization has also been performed for
purposes such as reducing the number of features required
for classifying (e.g. [4]) or clustering (e.g. [5]) documents.
Following another line of approach, [6] and [7] generated
textual summaries of results of database queries. With the
growth of the Web, interest has grown in improving summarization
and in summarizing new forms of documents such as
web pages (e.g. [8]), emails (e.g. [9]–[11]), newsgroups (e.g.
[12]), discussion forums (e.g. [13]–[15]), and blogs (e.g. [16],
[17]). Recently, interest has emerged in multiple document
summarization as well (e.g. [18]–[21]) thanks in part to
annual conferences that aim to further the state of the art in
summarization by providing large test collections and common
evaluation of summarizing systems⁵.
III. PROBLEM DESCRIPTION
On a microblogging service such as Twitter, a user can
perform a search for a topic and retrieve a list of posts that
contain the phrase. The difficulty in interpreting the results
is that the returned posts are only sorted by recency. The
motivation of the summarizer is to automate this process and
generate a more representative summary with less time and effort.
Most search engines built into microblogging services only
return a limited number of results when querying for a
particular topic or phrase. For example, Twitter only returns
a maximum of 1500 posts for a single search phrase. Our
problem description is as follows:
⁵ http://www.nist.gov/tac/

Given a set of posts that are all related by containing
a common search phrase (e.g. a topic), generate a
summary that best describes the primary gist of what
users are saying about that search phrase.
IV. SELECTED APPROACHES
The first decision in developing a microblog summarization
algorithm is to choose whether to use an abstractive or
extractive approach. We choose an extractive approach since
its methodologies more closely relate to the structure and
diversity of microblogs. Abstractive approaches are beneficial
in situations where high rates of compression are required.
However, microblogs are the antithesis to long documents. Ab-
stractive systems usually perform best in limited domains since
they require outside knowledge sources. These approaches
might not work so well with microblogs since they are un-
structured and diverse in subject matter. Extractive techniques
are also known to better scale with more diverse domains [22].
We implement several extractive algorithms. We start by
implementing two preliminary algorithms based on very sim-
ple techniques. We also develop and implement two primary
algorithms. First, we create the novel Phrase Reinforcement
algorithm that uses a graph to represent overlapping phrases
in a set of related microblog sentences. This graph allows
the generation of one or more summaries and is discussed in
detail in Section VI-A. We develop another primary algorithm
based on a well established statistical methodology known as
TF-IDF. The Hybrid TF-IDF algorithm is discussed at length
in Section VI-B.
V. NAÏVE ALGORITHMS
While the two preliminary approaches are simplistic, they
serve a critical role in allowing us to evaluate the results
of our primary algorithms since no prior results yet exist.
A. Random Approach
Given a filtered collection of relevant posts that are each
related to a single topic, we generate a summary by simply
choosing at random either a post or a sentence.
B. Length Approach
Our second preliminary approach serves as an indicator
of how easy or difficult it is to improve upon the random
approach to summarization. Given a collection of posts and
sentences that are each related to a single topic, we generate
four independent summaries. For two of the summaries, we
choose both the shortest and longest post from the collection.
For the remaining two, we choose both the shortest and longest
topic sentence from the collection.
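To make the two baselines concrete, here is a minimal sketch in Python. This is our illustration, not the authors' code; the helper names and the assumption that posts have already been filtered and sentence-split are ours.

```python
import random

def random_summary(units):
    """Random baseline: pick one post or one sentence at random."""
    return random.choice(units)

def length_summaries(posts, sentences):
    """Length baseline: the four summaries described above."""
    return {
        "shortest_post": min(posts, key=len),
        "longest_post": max(posts, key=len),
        "shortest_sentence": min(sentences, key=len),
        "longest_sentence": max(sentences, key=len),
    }
```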
VI. PRIMARY ALGORITHMS
We discuss two primary algorithms: one is called the Phrase
Reinforcement Algorithm that we develop. The other is an
adaptation of the well-known TF-IDF approach.
A. The Phrase Reinforcement Algorithm
The Phrase Reinforcement (PR) algorithm generates sum-
maries by looking for the most commonly occurring phrases.
The algorithm has been discussed briefly in [23], [24]. More
details can be found in [25].
1) Motivation: The algorithm is inspired by two observa-
tions. The first is that users tend to use similar words when
describing a particular topic, especially immediately adjacent
to the topic phrase. For example, consider the following posts
collected on the day of the comedian Soupy Sales’ death:
1) Aw, Comedian Soupy Sales died.
2) RIP Comedian Soupy Sales dies at age 83.
3) My favorite comedian Soupy Sales died.
4) RT @NY: RIP Comedian Soupy Sales dies at age 83.
5) RIP: Soupy Sales Died Today.
6) Soupy Sales meant silliness and laughs.
Notice that all posts contain words immediately after the
phrase Soupy Sales that in some way refer to his death.
Furthermore, posts 2 and 4 share the word RIP and posts 1,
3 and 5 share the word died. Therefore, there exists some
overlap in word usage adjacent to the phrase Soupy Sales.
The second observation is that microbloggers often repeat
the most relevant posts for a trending topic by quoting oth-
ers. Quoting uses the following form: RT @[TwitterAccountName]:
Quoted Message. RT refers to Re-Tweet and indicates
one is copying a post from the indicated Twitter account. For
example, the following is a quoted post: RT @dcagle: Our first
Soupy Sales RIP cartoon. While users writing their own posts
occasionally use the same or similar words, retweeting causes
entire sentences to perfectly overlap with one another. This,
in turn, greatly increases the average length of an overlapping
phrase for a given topic. The main idea of the algorithm is to
determine the most heavily overlapping phrase centered about
the topic phrase. The justification is that repeated information
is often a good indicator of its relative importance [2].
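For illustration, one plausible way to normalize quoted posts before graph construction is to strip the re-tweet prefix so that a quoted post overlaps exactly with the original. The paper's actual preprocessing is described in [25], so treat the sketch below as an assumption:

```python
import re

# Quoted posts have the form "RT @AccountName: Quoted Message".
RT_PREFIX = re.compile(r"^\s*RT\s+@\w+:\s*", re.IGNORECASE)

def strip_retweet_prefix(post):
    """Remove a leading re-tweet marker, keeping the quoted message."""
    return RT_PREFIX.sub("", post)

# strip_retweet_prefix("RT @dcagle: Our first Soupy Sales RIP cartoon.")
# returns "Our first Soupy Sales RIP cartoon."
```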
2) The Algorithm: The algorithm begins with the topic
phrase. These phrases are typically trending topics, but can be
other non-trending topics as well. Assume our starting phrase
is the trending topic Soupy Sales. The input to the algorithm
is a set of posts that each contains the starting phrase. The
algorithm can take as input any number of posts returned by
Twitter. In the running example, we pick the first 100 posts
returned by the Twitter search engine.
a) Building a Word Graph: The root node contains the
topic phrase, in this case soupy sales. We build a graph
showing how words occur before and after the phrase in the
root node, considering all the posts on the topic. We think of
the graph as containing two halves: a sub-graph to the left of
the root node containing words occurring in specific positions
to the left of the root node’s phrase, and a similar sub-graph
to the right of the root node.
To construct the left-hand side, the algorithm starts with the
root node. It reduces the set of input sentences to the set of
sentences that contain the current node’s phrase. The current
node and the root node are initially the same. Since every

input sentence is guaranteed to contain the root phrase, our
list of sentences does not change initially. Subsequently, the
algorithm isolates the set of words that occur immediately
before the current node’s phrase. From this set, duplicate
words are combined and assigned a count that represents how
many instances of those words are detected. For each of these
unique words, the algorithm adds them to the graph as nodes
with their associated counts to the left of the current node.
In the graph, all the nodes are in lower-case and stripped of
any non-alpha-numeric characters. This increases the amount
of overlap among words. Each node has an associated count:
the node’s phrase occurs exactly count times within the set
of input sentences at the same position and with the same
word sequence relative to the root node. Nodes
with a count less than two are not added to the graph since the
algorithm is looking for overlapping phrases. The algorithm
continues this process recursively for each node added to the
graph until all the potential words have been added to the
left-hand side of the graph.
The algorithm repeats these steps symmetrically for the
right-hand side of the graph. At the completion of the graph-
building process, the graph looks like the one in Figure 1.
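A compact sketch of this recursive construction for one side of the graph, assuming each input sentence is already lower-cased and tokenized; the nested-dictionary representation and function names are ours, not the authors':

```python
from collections import Counter

def build_side(sentences, phrase, direction):
    """Grow one half of the PR graph around `phrase` (a tuple of words).

    direction=-1 collects words immediately before the phrase,
    direction=+1 words immediately after. Returns a nested dict
    word -> (count, subtree); nodes seen fewer than twice are dropped,
    since the algorithm only keeps overlapping phrases.
    """
    counts = Counter()
    for words in sentences:
        n, m = len(words), len(phrase)
        for i in range(n - m + 1):
            if tuple(words[i:i + m]) == phrase:
                j = i - 1 if direction < 0 else i + m
                if 0 <= j < n:
                    counts[words[j]] += 1
    tree = {}
    for word, count in counts.items():
        if count < 2:
            continue
        # Extend the phrase so deeper counts respect both position
        # and word sequence relative to the root.
        longer = (word,) + phrase if direction < 0 else phrase + (word,)
        tree[word] = (count, build_side(sentences, longer, direction))
    return tree

# left = build_side(token_lists, ("soupy", "sales"), direction=-1)
```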
b) Weighting Individual Word Nodes: The algorithm pre-
pares for the generation of summaries by weighting individual
nodes. Node weighting is performed to account for the fact that
some words have more informational content than others. For
example, the root node soupy sales contains no information
since it is common to every input sentence. We give it a weight
of zero. Common stop words are noisy features that do not
help discriminate between phrases. We give them a weight
of zero as well. Finally, for the remaining words, we first
initialize their weights to the same values as their counts. Then,
to account for the fact that some phrases are naturally longer
than others, we penalize nodes that occur farther from the root
node by an amount that is proportional to their distance:
Weight(Node) = Count(Node) − Distance(Node) · log_b Count(Node).    (1)
The logarithm base b can be used to tune the algorithm
towards longer or shorter summaries. For aggressive sum-
marization (higher precision), the base can be set to small
values (e.g. 2, or e for the natural logarithm). For longer
summaries (higher recall), the base can be set to larger values
(e.g. 100). Weighting our example graph gives the graph in
Figure 2. We assume the logarithm base b is set to 10 for
helping generate longer summaries.
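The per-node weighting of Equation (1) is simple to state in code. A sketch, with an abbreviated stop-word list and names of our choosing:

```python
import math

STOP_WORDS = {"a", "an", "and", "at", "is", "of", "the", "to"}  # abbreviated

def node_weight(word, count, distance, b=10):
    """Equation (1); the root (distance 0) and stop words get weight 0."""
    if distance == 0 or word in STOP_WORDS:
        return 0.0
    return count - distance * math.log(count, b)
```

Smaller bases b make the distance penalty grow faster, favoring shorter, more common phrases (higher precision); larger bases favor longer phrases (higher recall), matching the discussion above.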
c) Generating Summaries: The algorithm looks for the
most overlapping phrase within the graph. First, we create
the left partial summary by searching for all paths (using a
depth-first search algorithm) that begin at the root node and
end at any other node, going backwards. The path with the
most weight is chosen. Assume that in Figure 2, the path with
most weight on the left-hand side of the root node (here 5.1,
including the root node) contains
the nodes rip, comedian and soupy sales. Thus, the best left
partial summary is rip comedian soupy sales.
We repeat the partial summary creation process for the right-
hand side of the current root node soupy sales. Since we want
to generate phrases that are actually found within the input
sentences, we reorganize the tree by placing the entire left
partial summary, rip comedian soupy sales, in the root node.
Assume we get the path shown in Figure 3 as the most heavily
weighted on the right hand side. The full summary generated
by the algorithm for our example is: rip comedian soupy sales
dies at age 83.
Strictly speaking, we can build either the left or right partial
summary first, and the other one next. The full summary
obtained by the algorithm is the same in both cases.
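A sketch of the path search over the weighted graph built above (ours; node weights are assumed to have been substituted for the raw counts):

```python
def best_path(tree, acc=0.0):
    """Depth-first search for the heaviest root-to-node path.

    `tree` maps word -> (weight, subtree). Returns (total_weight, words);
    for the left-hand side the word list must be reversed before being
    prepended to the root phrase, since it grows outward from the root.
    """
    best = (acc, [])
    for word, (weight, subtree) in tree.items():
        total, path = best_path(subtree, acc + weight)
        if total > best[0]:
            best = (total, [word] + path)
    return best
```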
d) Post-processing: This summary has lost its case-
sensitivity and formatting since these features were removed
to increase the amount of overlap among phrases. To recover
these features, we perform a simple best-fit algorithm between
the summary and the set of input sentences to find a matching
phrase that contains the summary. We know such a match-
ing phrase exists within at least two of the input sentences
since the algorithm only generates summaries from common
phrases. Once, we find the first matching phrase, this phrase is
our final summary. It produces the following final summary:
RIP Comedian Soupy Sales dies at age 83.
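The paper does not spell out the best-fit matching, so the following is only one plausible implementation: scan the original sentences for the first span that matches the normalized summary, ignoring case and punctuation:

```python
import re

def recover_formatting(summary, original_sentences):
    """Return the first original span matching `summary`,
    restoring its case, punctuation and formatting."""
    words = map(re.escape, summary.split())
    pattern = re.compile(r"\W+".join(words), re.IGNORECASE)
    for sentence in original_sentences:
        match = pattern.search(sentence)
        if match:
            return match.group(0)
    return summary  # per the algorithm, a match always exists
```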
B. Hybrid TF-IDF Summarization
After analyzing the results obtained by the Phrase Rein-
forcement approach, we notice that it significantly improves
upon our earlier results, but it still leaves room for improve-
ment as its performance only halves the difference between
the random and manual summarization methods (see Section
VII-E). Therefore, we use another approach based upon a
technique dating back to early summarization work [2].
1) TF-IDF: Term Frequency-Inverse Document Frequency,
is a statistical weighting technique that has been applied to
many information retrieval problems. It has been used for
automatic indexing [26], query matching of documents [27],
and automated summarization [28]. Generally, TF-IDF is not
known as a leading algorithm in automated summarization.
For straightforward automated summarization, the applica-
tion of TF-IDF is simple.
tence within a document a weight that reflects the sentence’s
saliency within the document. The sentences are ordered by
their weights from which the top m sentences with the most
weight are chosen as the summary. The weight of a sentence
is the summation of the individual term weights within the
sentence. Terms can be words, phrases, or any other type of
lexical feature [29]. To determine the weight of a term, we
use the formula:
TF-IDF = tf_ij · log_2(N / df_j)    (2)

where tf_ij is the frequency of the term T_j within the document
D_i, N is the total number of documents, and df_j is the number
of documents within the set that contain the term T_j [26].
The TF-IDF value is composed of two primary parts. The
term frequency component (TF) assigns more weight to words

Fig. 1. Fully constructed PR graph (allowing non-overlapping words/phrases).
Fig. 2. Fully weighted PR graph (requiring overlapping words/phrases).
Fig. 3. Fully constructed PR graph for right half of summary. This also
demonstrates the best complete summary.
that occur frequently within a document because important
words are often repeated [2]. The inverse document frequency
component (IDF) compensates for the fact that some words
such as common stop words are frequent. Since these words
do not help discriminate between one sentence or document
over another, these words are penalized proportionally to their
inverse document frequency (the logarithm is taken to balance
the effect of the IDF component in the formula). Therefore,
TF-IDF gives the most weight to words that occur most
frequently within a small number of documents and the least
weight to terms that occur infrequently or occur within the
majority of the documents.
One noted problem with TF-IDF is that the formula is
sensitive to document length. Singhal et al. note that longer
documents have higher term frequencies since they often
repeat terms while also having a larger number of terms
[29]. This doesn’t have any ill effects on single document
summarization. However, when generating a summary from
multiple documents this becomes an issue because the terms
within the longer documents have more weight.
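As a point of reference before the hybrid variant, a straightforward implementation of the standard per-document weighting of Equation (2); this is our sketch, with tokens assumed pre-normalized:

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Equation (2): weight = tf_ij * log2(N / df_j).

    `documents` is a list of token lists; returns one dict of
    term weights per document."""
    N = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))           # df_j: documents containing term j
    return [
        {term: tf * math.log2(N / df[term])
         for term, tf in Counter(doc).items()}
        for doc in documents
    ]
```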
C. Algorithm
Equation 2 defines the weight of a term in the context of
a document. However, we don’t have a traditional document.
Instead, we have a set of microblogging posts that are each
related to a topic. So, one question we must first answer is
how we define a document. An option is to define a single
document that encompasses all the posts together. In this
case, the TF component’s definition is straightforward since
we can compute the frequencies of the terms across all the
posts. However, doing so causes us to lose the IDF component
since we only have a single document. On the other extreme,
we could define each post as a document making the IDF
component’s definition clear. But, the TF component now has a
problem: Because each post contains only a handful of words,
most term frequencies will be a small constant for a given
post.
To handle this situation, we redefine TF-IDF in terms of a
hybrid document. We primarily define a document as a single
sentence. However, when computing the term frequencies, we
assume the document is the entire collection of posts. This
way, we have differentiated term frequencies but also don’t
lose the IDF component. A term is a single word in a sentence.
We next choose a normalization method since otherwise
the TF-IDF algorithm always biases towards longer sentences.
We normalize the weight of a sentence by dividing it by a
normalization factor. Since stop words do not help discriminate
the saliency of sentences, we give each of these types of words
a weight of zero by comparing them with a prebuilt list. We
summarize this algorithm below in Equations (3)-(7).
W(S) = ( Σ_{i=0}^{#WordsInSentence} W(w_i) ) / nf(S)    (3)

W(w_i) = tf(w_i) · log_2(idf(w_i))    (4)

tf(w_i) = #OccurrencesOfWordInAllPosts / #WordsInAllPosts    (5)

idf(w_i) = #SentencesInAllPosts / #SentencesInWhichWordOccurs    (6)

nf(S) = max[MinimumThreshold, #WordsInSentence]    (7)

where W is the weight assigned to a sentence or a word, nf is
a normalization factor, w_i is the i-th word, and S is a sentence.
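Putting Equations (3)-(7) together, a self-contained sketch of the Hybrid TF-IDF summarizer. For brevity each post is treated as a single sentence, and the tokenizer and stop-word handling are simplified relative to the paper's preprocessing:

```python
import math
from collections import Counter

def hybrid_tfidf_summary(posts, min_threshold=11, stop_words=frozenset()):
    """Return the highest-weighted post under Equations (3)-(7)."""
    sentences = [p.lower().split() for p in posts]
    all_words = [w for s in sentences for w in s]
    tf = Counter(all_words)                    # term counts over ALL posts
    total_words = len(all_words)
    n_sentences = len(sentences)
    containing = Counter()                     # sentences containing a word
    for s in sentences:
        containing.update(set(s))

    def word_weight(w):                        # Eq. (4), via Eqs. (5)-(6)
        if w in stop_words:
            return 0.0
        idf = n_sentences / containing[w]
        return (tf[w] / total_words) * math.log2(idf)

    def sentence_weight(s):                    # Eq. (3), via Eq. (7)
        nf = max(min_threshold, len(s))        # normalization factor
        return sum(word_weight(w) for w in s) / nf

    return max(posts, key=lambda p: sentence_weight(p.lower().split()))
```

With min_threshold near the average manual summary length (the paper settles on a threshold of 11 words; see Section VII-F), the normalization stops the algorithm from simply preferring the longest sentence.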
VII. EXPERIMENTAL SETUP AND EVALUATION
A. Data Collection and Pre-processing
For five consecutive days, we collected the top ten currently
trending topics from Twitter’s home page at roughly the
same time every evening. For each topic, we downloaded the
maximum number (approximately 1500) of posts. Therefore,
we had 50 trending topics with a set of 1500 posts for each.

We performed several forms of preprocessing in order to filter
the posts into a usable form. Details can be found in [25].
B. Evaluation Methods
There is no definitive standard against which one can
compare the results from an automated summarization system.
Summary evaluation is generally performed using one of two
methods: intrinsic, or extrinsic. In intrinsic evaluation, the
quality of the summary is judged based on direct analysis
using a number of predefined metrics such as grammaticality,
fluency, or content [1]. Extrinsic evaluations measure how well
a summary enables a user to perform some form of task [22].
To perform intrinsic evaluation, a common approach is to
create one or more manual summaries and to compare the
automated summaries against the manual summaries. One
popular automatic evaluation metric that has been adopted
by the Document Understanding Conference since 2004 is
ROUGE. ROUGE is a suite of metrics that automatically
measures the similarity between an automated summary and
a set of manual summaries [30]. One of the simplest ROUGE
metrics is the ROUGE-N metric:
ROUGE-N = ( Σ_{s ∈ MS} Σ_{n-gram ∈ s} Match(n-gram) ) / ( Σ_{s ∈ MS} Σ_{n-gram ∈ s} Count(n-gram) )    (8)

Here, n is the length of the n-grams, MS is the set of manual
summaries, Count(n-gram) is the number of n-grams in the manual
summary, and Match(n-gram) is the number of co-occurring n-grams
between the manual and automated summaries.
Lin [30] performed evaluations to understand how well
different forms of ROUGE’s results correlate with human
judgments. One result of particular consequence for our work
is his comparison of ROUGE with the very short (around
10 words) summary task of DUC 2003. Lin found ROUGE-
1 and other ROUGEs to correlate highly with human judg-
ments. Since this task is very similar to creating microblog
summaries, we implement ROUGE-1 as a metric.
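A minimal unigram (ROUGE-1) implementation of Equation (8), with match counts clipped to their frequency in the automated summary as in Lin's definition. This is our sketch; Lin's official toolkit also reports precision and F-measure:

```python
from collections import Counter

def rouge_1(automated, manual_summaries):
    """ROUGE-1 per Equation (8): clipped unigram matches divided by
    the total number of unigrams in the manual summaries."""
    auto = Counter(automated.lower().split())
    match = total = 0
    for manual in manual_summaries:
        ref = Counter(manual.lower().split())
        total += sum(ref.values())
        match += sum(min(count, auto[gram]) for gram, count in ref.items())
    return match / total if total else 0.0
```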
Since we want certainty that ROUGE-1 correlates with a
human evaluation of automated summaries, we also imple-
ment a manual metric used during DUC 2002: the Content
metric which asks a human judge to measure how completely
an automated summary expresses the meaning of a human
summary on a 1–5 scale, 1 being worst and 5 being best.
C. Manual Summaries
Two volunteers each generated a complete set of 50 manual
“best” summaries, one per topic, in 140 characters or
less, using only the information contained within the
posts (see Table I).
1) Manual Summary Evaluation: The manual summaries
generated by our two volunteers are semantically very similar
to one another but have different lengths and word choices. We
use ROUGE-1 to compare the manual summaries against one
another. We do the same using the Content metric in order to
understand how semantically similar the two summaries are.
By evaluating our two manual summaries against one another,
we help establish practical upper limits of performance for
TABLE I
EXAMPLES OF MANUAL SUMMARIES

Topic           Manual Summary 1                              Manual Summary 2
#BeatCancer     Every retweet of #BeatCancer will result      Tweet #beatcancer to help
                in 1 cent being donated towards Cancer        fund cancer research
                Research.
Kelly Clarkson  Between Taylor Swift and Kelly Clarkson,      Taylor Swift v. Kelly Clarkson
                which one do you prefer.....?
TABLE II
ROUGE-1, CONTENT, AND LENGTH FOR MANUAL SUMMARIES

             ROUGE-1                        Content   Length
             F       Precision   Recall
Manual 1     0.34    0.31        0.37       4.4       11
Manual 2     0.34    0.37        0.31       4.1       9
Manual Avg   0.34    0.34        0.34       4.2       10
TABLE III
CONTENT PERFORMANCE FOR NAÏVE SUMMARIES

                    Content Performance
Manual Average      4.2
Random Sentence     3.0
Shortest Sentence   2.3
automated summaries. These results in addition to the results
of the preliminary algorithms collectively establish a range of
expected performance for our primary algorithms. We compare
the manual summaries against one another bi-directionally by
assuming either set was the set of automated summaries.
To generate the Content performance, we asked a volunteer
to evaluate how well one summary expressed the meaning of
the corresponding manual summary. The average results for
computing the ROUGE-1 and Content metrics on the manual
summaries are shown in Table II.
D. Performance of Naïve Algorithms
The generation of random sentences produces an average
recall of 0.23, an average precision of 0.22, and an F-measure
of 0.23. These results are higher than we originally anticipated,
given that our average manual F-measure is only 0.34. First, some
of the overlap is explained by the fact that ROUGE-1 counts common
words. Second, while we call our first preliminary approach
“random”, we introduce some bias into this approach by our
preprocessing steps discussed earlier.
To understand how random sentences compare to manual
summaries, we present Content performance in Table III.
The random sentence approach generated a Content score of
3.0 on a scale of 5. In addition to choosing random sentences,
we also chose random posts. This slightly improves the recall
scores over the random sentence approach (0.24 vs. 0.23), but
worsens the precision (0.17 vs. 0.22).
The random post approach produces an average length of
15 words while the random sentence averaged 12 words.
Since the random sentence approach is closer to the average
manual summary length, it scores higher precision. Overall,
the random sentence approach produces a summary that is
more balanced in terms of recall and precision and a higher
F-measure as well (0.23 vs. 0.20).


References

Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, 2008.
TL;DR: A class-tested textbook on web-era information retrieval, text classification, and text clustering, covering the design, implementation, and evaluation of systems for gathering, indexing, and searching documents, with an introduction to machine learning methods on text collections.

Mihalcea, R. and Tarau, P. TextRank: Bringing Order into Text. Proceedings of EMNLP, 2004.
TL;DR: Introduces TextRank, a graph-based ranking model for text processing, and shows how this model can be successfully used in natural language applications.

Salton, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.

Luhn, H. P. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2):159–165, 1958.
TL;DR: Excerpts of technical papers and magazine articles that serve the purposes of conventional abstracts are created entirely by automatic means, using statistical information derived from word frequency and distribution to compute a relative measure of significance for words and then sentences.

Lang, K. NewsWeeder: Learning to Filter Netnews. Proceedings of the 12th International Conference on Machine Learning (ICML), 1995.
TL;DR: A learning algorithm based on the Minimum Description Length (MDL) principle raised the percentage of interesting articles shown to users from 14% to 52% on average, significantly outperforming TF-IDF weighting.
Frequently Asked Questions

Q1. What have the authors contributed in "Experiments in Microblog Summarization"?

This paper presents algorithms for summarizing microblog posts. The authors want to extend their work in various ways: multi-post summaries can be produced using the PR algorithm or the Hybrid TF-IDF algorithm by picking the posts with the top n weights. One challenge will be to produce a coherent multi-sentence summary because of issues such as the presence of anaphora and other coherence problems. The authors also want to cluster the posts on a specific topic into k clusters, to find the various themes and sub-themes present in the posts, and then find a summary that may be one post or multiple posts.
