
Experiments in Microblog Summarization

TL;DR: The goal is to produce summaries similar to what a human would produce for the same collection of posts on a specific topic; the summaries produced by the summarizing algorithms are evaluated against human-produced summaries, with excellent results.
Abstract: This paper presents algorithms for summarizing microblog posts. In particular, our algorithms process collections of short posts on specific topics on the well-known site called Twitter and create short summaries from these collections of posts on a specific topic. The goal is to produce summaries that are similar to what a human would produce for the same collection of posts on a specific topic. We evaluate the summaries produced by the summarizing algorithms, compare them with human-produced summaries and obtain excellent results.

Summary (4 min read)

Introduction

  • Twitter, the microblogging site started in 2006, has become a social phenomenon, with more than 20 million visitors each month.
  • While the majority of posts are conversational or not very meaningful, about 3.6% of the posts concern topics of mainstream news.
  • To help people who read Twitter posts or tweets, Twitter provides a short list of popular topics called Trending Topics.
  • The authors create summaries in various ways and evaluate them using metrics for automatic summary evaluation.

III. PROBLEM DESCRIPTION

  • The difficulty in interpreting the results is that the returned posts are only sorted by recency.
  • The motivation of the summarizer is to automate this process and generate a more representative summary with less time and effort.
  • Most search engines built into microblogging services only return a limited number of results when querying for a particular topic or phrase.
  • Twitter only returns a maximum of 1500 posts for a single search phrase.
  • Given a set of posts that are all related by containing a common search phrase (e.g. a topic), generate a summary that best describes the primary gist of what users are saying about that search phrase.

IV. SELECTED APPROACHES

  • The authors choose an extractive approach since its methodologies more closely relate to the structure and diversity of microblogs.
  • Microblogs are the antithesis to long documents.
  • Extractive techniques are also known to better scale with more diverse domains [22].
  • First, the authors create the novel Phrase Reinforcement algorithm that uses a graph to represent overlapping phrases in a set of related microblog sentences.
  • The authors develop another primary algorithm based on a well established statistical methodology known as TF-IDF.

A. The Phrase Reinforcement Algorithm

  • The Phrase Reinforcement (PR) algorithm generates summaries by looking for the most commonly occurring phrases.
  • The second observation is that microbloggers often repeat the most relevant posts for a trending topic by quoting others.
  • Subsequently, the algorithm isolates the set of words that occur immediately before the current node’s phrase.
  • First, the authors create the left partial summary by searching for all paths (using a depth-first search algorithm) that begin at the root node and end at any other node, going backwards.
  • It produces the following final summary: RIP Comedian Soupy Sales dies at age 83.

B. Hybrid TF-IDF Summarization

  • After analyzing the results obtained by the Phrase Reinforcement approach, the authors notice that it significantly improves upon their earlier results, but it still leaves room for improvement as its performance only halves the difference between the random and manual summarization methods (see Section VII-E).
  • For straightforward automated summarization, the application of TF-IDF is simple.
  • The sentences are ordered by their weights from which the top m sentences with the most weight are chosen as the summary.
  • Therefore, TF-IDF gives the most weight to words that occur most frequently within a small number of documents and the least weight to terms that occur infrequently or occur within the majority of the documents.
  • When generating a summary from multiple documents this becomes an issue because the terms within the longer documents have more weight.

C. Algorithm

  • The authors don’t have a traditional document.
  • On the other extreme, the authors could define each post as a document making the IDF component’s definition clear.
  • When computing the term frequencies, the authors assume the document is the entire collection of posts.
  • The authors next choose a normalization method since otherwise the TF-IDF algorithm always biases towards longer sentences.
  • The authors summarize this algorithm below in Equations (3)-(7).

A. Data Collection and Pre-processing

  • For five consecutive days, the authors collected the top ten currently trending topics from Twitter’s home page at roughly the same time every evening.
  • For each topic, the authors downloaded the maximum number (approximately 1500) of posts.
  • Therefore, the authors had 50 trending topics with a set of 1500 posts for each.

B. Evaluation Methods

  • There is no definitive standard against which one can compare the results from an automated summarization system.
  • In intrinsic evaluation, the quality of the summary is judged based on direct analysis using a number of predefined metrics such as grammaticality, fluency, or content [1].
  • ROUGE is a suite of metrics that automatically measures the similarity between an automated summary and a set of manual summaries [30].
  • In Equation (8), n is the length of the n-grams, Count(n-gram) is the number of n-grams in the manual summary, and Match(n-gram) is the number of co-occurring n-grams between the manual and automated summaries.
  • Lin [30] performed evaluations to understand how well different forms of ROUGE’s results correlate with human judgments.

C. Manual Summaries

  • Two volunteers each generated a complete set of 50 manual “best” summaries, one per topic, in 140 characters or less, using only the information contained within the posts (see Table I).
  • The manual summaries generated by their two volunteers are semantically very similar to one another but have different lengths and word choices.
  • The authors use ROUGE-1 to compare the manual summaries against one another.
  • By evaluating their two manual summaries against one another, the authors help establish practical upper limits of performance for automated summaries.
  • These results in addition to the results of the preliminary algorithms collectively establish a range of expected performance for their primary algorithms.

D. Performance of Naïve Algorithms

  • These results are higher than the authors originally anticipated, given that their average manual F-measure is only 0.34.
  • Some of the overlap is explained by the fact that ROUGE-1 counts common words.
  • Overall, the random sentence approach produces a summary that is more balanced in terms of recall and precision and a higher F-measure as well (0.23 vs. 0.20).
  • The shortest sentence and shortest post approaches generate summaries that are far too short, averaging only two or three words in length.
  • Because of their short length, these two approaches achieve very high precision but fail badly at recall, scoring less than either of the random approaches.

E. Performance of Phrase Reinforcement Algorithm

  • This is a significant improvement over the random sentence approach.
  • This score is an improvement over the random summaries.
  • By assigning less weight to nodes farther from the root phrase, the algorithm prefers more common shorter phrases over less common longer phrases.
  • There appears to be a threshold (when b ≈ 100) for which smaller values of b begin reducing the average summary length.
  • In Figure 4, the label “PR Phrase (NULL)” indicates the absence of the weighting parameter altogether.

F. Performance of the Hybrid TF-IDF Algorithm

  • In Figure 7, the authors present the results of the TF-IDF algorithm for the ROUGE-1 metric.
  • The TF-IDF results are denoted as TF-IDF Sentence (11) to indicate that the TF-IDF algorithm produces sentences instead of phrases for summaries and that the authors use a threshold of 11 words as their normalization factor.
  • This score is also higher than the average Content score of the Phrase Reinforcement algorithm which was 3.66.
  • Interestingly, the TF-IDF summaries are one word shorter, on average, than the manual summaries with an average length of 9 words.
  • As seen in Figure 10, by varying the normalization threshold, the authors are able to control the average summary length and resulting ROUGE-1 precision and recall.

VIII. CONCLUSION

  • The authors have presented two primary approaches to microblog summarization.
  • The authors find, after exhaustive experimentation, that the Hybrid TF-IDF algorithm produces summaries as good as, or better than, those of the PR algorithm.
  • One challenge will be to produce a coherent multi-sentence summary because of issues such as presence of anaphora and other coherence issues.
  • The authors also want to cluster the posts on a specific topic into k clusters to find various themes and sub-themes that are present in the posts and then find a summary that may be one post or multiple posts.
  • Of course, one of the biggest problems the authors have observed with microblogs is redundancy; in other words, quite frequently, a large number of posts are similar to one another.


Experiments in Microblog Summarization
Beaux Sharifi, Mark-Anthony Hutton and Jugal K. Kalita
Department of Computer Science
University of Colorado
Colorado Springs, CO 80918 USA
{bsharifi,mhutton86}@gmail.com, jkalita@uccs.edu
Abstract—This paper presents algorithms for summarizing
microblog posts. In particular, our algorithms process collections
of short posts on specific topics on the well-known site called
Twitter and create short summaries from these collections of
posts on a specific topic. The goal is to produce summaries
that are similar to what a human would produce for the same
collection of posts on a specific topic. We evaluate the summaries
produced by the summarizing algorithms, compare them with
human-produced summaries and obtain excellent results.
I. INTRODUCTION
Twitter, the microblogging site started in 2006, has become
a social phenomenon, with more than 20 million visitors each
month. While the majority of posts are conversational or not
very meaningful, about 3.6% of the posts concern topics of
mainstream news¹. At the end of 2009, Twitter had 75 million
account holders, of which about 20% are active². There are
approximately 2.5 million Twitter posts per day³. To help
people who read Twitter posts or tweets, Twitter provides a
short list of popular topics called Trending Topics. Without
the availability of such topics on which one can search to find
related posts, for the general user, there is no way to get an
overall understanding of what is being written about in Twitter.
Twitter works with a website called WhatTheTrend⁴ to
provide definitions of trending topics. WhatTheTrend allows
users to manually enter descriptions of why a topic is trending.
However, WhatTheTrend suffers from spam and irrelevant
posts that reduce its utility to some extent.
In this paper, we discuss an effort to automatically create
“summary” posts of Twitter trending topics. In short, we
perform searches for Twitter on trending topics, get a large
number of posts on a topic and then automatically create a
short post that is representative of all the posts on the topic.
We create summaries in various ways and evaluate them using
metrics for automatic summary evaluation.
II. RELATED WORK
Automatically summarizing microblog topics is a new area
of research and to the authors’ best knowledge, approaches
have not been published. However, summarizing microblogs
can be viewed as an instance of automated text summarization,
¹ http://www.pearanalytics.com/blog/tag/twitter/
² http://themetricsystem.rjmetrics.com/2010/01/26/new-data-on-twitters-users-and-engagement/
³ http://blog.twitter.com/2010/02/measuring-tweets.html
⁴ http://www.whatthetrend.com/
which is the problem of automatically generating a condensed
version of the most important content from one or more
documents for a particular set of users or tasks [1].
As early as the 1950s, Luhn was experimenting with meth-
ods for automatically generating extracts of technical articles
[2]. Edmundson [3] developed techniques for summarizing a
diverse corpora of documents to help users evaluate documents
for further reading. Early research in text summarization
focused primarily on simple statistical techniques that relied
upon lexical features such as word frequencies (e.g. [2]) or
formatting clues such as titles and headings (e.g. [3]).
Later work integrated more sophisticated approaches such
as machine learning, and natural language processing. In most
cases, text summarization is performed for the purposes of
saving users time by reducing the amount of content to read.
However, text summarization has also been performed for
purposes such as reducing the number of features required
for classifying (e.g. [4]) or clustering (e.g. [5]) documents.
Following another line of approach, [6] and [7] generated
textual summaries of results of database queries. With the
growth of the Web, interest has grown in improving summarization
and in summarizing new forms of documents such as
web pages (e.g. [8]), emails (e.g. [9]–[11]), newsgroups (e.g.
[12]), discussion forums (e.g. [13]–[15]), and blogs (e.g. [16],
[17]). Recently, interest has emerged in multiple document
summarization as well (e.g. [18]–[21]) thanks in part to
annual conferences that aim to further the state of the art in
summarization by providing large test collections and common
evaluation of summarizing systems⁵.
III. PROBLEM DESCRIPTION
On a microblogging service such as Twitter, a user can
perform a search for a topic and retrieve a list of posts that
contain the phrase. The difficulty in interpreting the results
is that the returned posts are only sorted by recency. The
motivation of the summarizer is to automate this process and
generate a more representative summary with less time and effort.
Most search engines built into microblogging services only
return a limited number of results when querying for a
particular topic or phrase. For example, Twitter only returns
a maximum of 1500 posts for a single search phrase. Our
problem description is as follows:
⁵ http://www.nist.gov/tac/

Given a set of posts that are all related by containing
a common search phrase (e.g. a topic), generate a
summary that best describes the primary gist of what
users are saying about that search phrase.
IV. SELECTED APPROACHES
The first decision in developing a microblog summarization
algorithm is to choose whether to use an abstractive or
extractive approach. We choose an extractive approach since
its methodologies more closely relate to the structure and
diversity of microblogs. Abstractive approaches are beneficial
in situations where high rates of compression are required.
However, microblogs are the antithesis to long documents. Ab-
stractive systems usually perform best in limited domains since
they require outside knowledge sources. These approaches
might not work so well with microblogs since they are un-
structured and diverse in subject matter. Extractive techniques
are also known to better scale with more diverse domains [22].
We implement several extractive algorithms. We start by
implementing two preliminary algorithms based on very sim-
ple techniques. We also develop and implement two primary
algorithms. First, we create the novel Phrase Reinforcement
algorithm that uses a graph to represent overlapping phrases
in a set of related microblog sentences. This graph allows
the generation of one or more summaries and is discussed in
detail in Section VI-A. We develop another primary algorithm
based on a well established statistical methodology known as
TF-IDF. The Hybrid TF-IDF algorithm is discussed at length
in Section VI-B.
V. NAÏVE ALGORITHMS
While the two preliminary approaches are simplistic, they
serve a critical role in allowing us to evaluate the results
of our primary algorithms since no prior results yet exist.
A. Random Approach
Given a filtered collection of relevant posts that are each
related to a single topic, we generate a summary by simply
choosing at random either a post or a sentence.
B. Length Approach
Our second preliminary approach serves as an indicator
of how easy or difficult it is to improve upon the random
approach to summarization. Given a collection of posts and
sentences that are each related to a single topic, we generate
four independent summaries. For two of the summaries, we
choose both the shortest and longest post from the collection.
For the remaining two, we choose both the shortest and longest
topic sentence from the collection.
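To make the two baselines concrete, here is a minimal sketch in Python. This is our illustration, not the authors' code; the helper names and the assumption that posts have already been filtered and sentence-split are ours.

```python
import random

def random_summary(units):
    """Random baseline: pick one post or one sentence at random."""
    return random.choice(units)

def length_summaries(posts, sentences):
    """Length baseline: the four summaries described above."""
    return {
        "shortest_post": min(posts, key=len),
        "longest_post": max(posts, key=len),
        "shortest_sentence": min(sentences, key=len),
        "longest_sentence": max(sentences, key=len),
    }
```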
VI. PRIMARY ALGORITHMS
We discuss two primary algorithms: one is called the Phrase
Reinforcement Algorithm that we develop. The other is an
adaptation of the well-known TF-IDF approach.
A. The Phrase Reinforcement Algorithm
The Phrase Reinforcement (PR) algorithm generates sum-
maries by looking for the most commonly occurring phrases.
The algorithm has been discussed briefly in [23], [24]. More
details can be found in [25].
1) Motivation: The algorithm is inspired by two observa-
tions. The first is that users tend to use similar words when
describing a particular topic, especially immediately adjacent
to the topic phrase. For example, consider the following posts
collected on the day of the comedian Soupy Sales’ death:
1) Aw, Comedian Soupy Sales died.
2) RIP Comedian Soupy Sales dies at age 83.
3) My favorite comedian Soupy Sales died.
4) RT @NY: RIP Comedian Soupy Sales dies at age 83.
5) RIP: Soupy Sales Died Today.
6) Soupy Sales meant silliness and laughs.
Notice that all posts contain words immediately after the
phrase Soupy Sales that in some way refer to his death.
Furthermore, posts 2 and 4 share the word RIP and posts 1,
3 and 5 share the word died. Therefore, there exists some
overlap in word usage adjacent to the phrase Soupy Sales.
The second observation is that microbloggers often repeat
the most relevant posts for a trending topic by quoting oth-
ers. Quoting uses the following form: RT @[TwitterAccountName]:
Quoted Message. RT refers to Re-Tweet and indicates
one is copying a post from the indicated Twitter account. For
example, the following is a quoted post: RT @dcagle: Our first
Soupy Sales RIP cartoon. While users writing their own posts
occasionally use the same or similar words, retweeting causes
entire sentences to perfectly overlap with one another. This,
in turn, greatly increases the average length of an overlapping
phrase for a given topic. The main idea of the algorithm is to
determine the most heavily overlapping phrase centered about
the topic phrase. The justification is that repeated information
is often a good indicator of its relative importance [2].
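For illustration, one plausible way to normalize quoted posts before graph construction is to strip the re-tweet prefix so that a quoted post overlaps exactly with the original. The paper's actual preprocessing is described in [25], so treat the sketch below as an assumption:

```python
import re

# Quoted posts have the form "RT @AccountName: Quoted Message".
RT_PREFIX = re.compile(r"^\s*RT\s+@\w+:\s*", re.IGNORECASE)

def strip_retweet_prefix(post):
    """Remove a leading re-tweet marker, keeping the quoted message."""
    return RT_PREFIX.sub("", post)

# strip_retweet_prefix("RT @dcagle: Our first Soupy Sales RIP cartoon.")
# returns "Our first Soupy Sales RIP cartoon."
```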
2) The Algorithm: The algorithm begins with the topic
phrase. These phrases are typically trending topics, but can be
other non-trending topics as well. Assume our starting phrase
is the trending topic Soupy Sales. The input to the algorithm
is a set of posts that each contains the starting phrase. The
algorithm can take as input any number of posts returned by
Twitter. In the running example, we pick the first 100 posts
returned by the Twitter search engine.
a) Building a Word Graph: The root node contains the
topic phrase, in this case soupy sales. We build a graph
showing how words occur before and after the phrase in the
root node, considering all the posts on the topic. We think of
the graph as containing two halves: a sub-graph to the left of
the root node containing words occurring in specific positions
to the left of the root node’s phrase, and a similar sub-graph
to the right of the root node.
To construct the left-hand side, the algorithm starts with the
root node. It reduces the set of input sentences to the set of
sentences that contain the current node’s phrase. The current
node and the root node are initially the same. Since every

input sentence is guaranteed to contain the root phrase, our
list of sentences does not change initially. Subsequently, the
algorithm isolates the set of words that occur immediately
before the current node’s phrase. From this set, duplicate
words are combined and assigned a count that represents how
many instances of those words are detected. For each of these
unique words, the algorithm adds them to the graph as nodes
with their associated counts to the left of the current node.
In the graph, all the nodes are in lower-case and stripped of
any non-alpha-numeric characters. This increases the amount
of overlap among words. Each node has an associated count:
the node’s phrase occurs exactly count times within the set
of input sentences at the same position and with the same
word sequence relative to the root node. Nodes
with a count less than two are not added to the graph since the
algorithm is looking for overlapping phrases. The algorithm
continues this process recursively for each node added to the
graph until all the potential words have been added to the
left-hand side of the graph.
The algorithm repeats these steps symmetrically for the
right-hand side of the graph. At the completion of the graph-
building process, the graph looks like the one in Figure 1.
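A compact sketch of this recursive construction for one side of the graph, assuming each input sentence is already lower-cased and tokenized; the nested-dictionary representation and function names are ours, not the authors':

```python
from collections import Counter

def build_side(sentences, phrase, direction):
    """Grow one half of the PR graph around `phrase` (a tuple of words).

    direction=-1 collects words immediately before the phrase,
    direction=+1 words immediately after. Returns a nested dict
    word -> (count, subtree); nodes seen fewer than twice are dropped,
    since the algorithm only keeps overlapping phrases.
    """
    counts = Counter()
    for words in sentences:
        n, m = len(words), len(phrase)
        for i in range(n - m + 1):
            if tuple(words[i:i + m]) == phrase:
                j = i - 1 if direction < 0 else i + m
                if 0 <= j < n:
                    counts[words[j]] += 1
    tree = {}
    for word, count in counts.items():
        if count < 2:
            continue
        # Extend the phrase so deeper counts respect both position
        # and word sequence relative to the root.
        longer = (word,) + phrase if direction < 0 else phrase + (word,)
        tree[word] = (count, build_side(sentences, longer, direction))
    return tree

# left = build_side(token_lists, ("soupy", "sales"), direction=-1)
```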
b) Weighting Individual Word Nodes: The algorithm pre-
pares for the generation of summaries by weighting individual
nodes. Node weighting is performed to account for the fact that
some words have more informational content than others. For
example, the root node soupy sales contains no information
since it is common to every input sentence. We give it a weight
of zero. Common stop words are noisy features that do not
help discriminate between phrases. We give them a weight
of zero as well. Finally, for the remaining words, we first
initialize their weights to the same values as their counts. Then,
to account for the fact that some phrases are naturally longer
than others, we penalize nodes that occur farther from the root
node by an amount that is proportional to their distance:
Weight(Node) = Count(Node) − Distance(Node) · log_b Count(Node).    (1)
The logarithm base b can be used to tune the algorithm
towards longer or shorter summaries. For aggressive sum-
marization (higher precision), the base can be set to small
values (e.g. 2, or e for the natural logarithm). For longer
summaries (higher recall), the base can be set to larger values
(e.g. 100). Weighting our example graph gives the graph in
Figure 2. We assume the logarithm base b is set to 10 for
helping generate longer summaries.
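The per-node weighting of Equation (1) is simple to state in code. A sketch, with an abbreviated stop-word list and names of our choosing:

```python
import math

STOP_WORDS = {"a", "an", "and", "at", "is", "of", "the", "to"}  # abbreviated

def node_weight(word, count, distance, b=10):
    """Equation (1); the root (distance 0) and stop words get weight 0."""
    if distance == 0 or word in STOP_WORDS:
        return 0.0
    return count - distance * math.log(count, b)
```

Smaller bases b make the distance penalty grow faster, favoring shorter, more common phrases (higher precision); larger bases favor longer phrases (higher recall), matching the discussion above.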
c) Generating Summaries: The algorithm looks for the
most overlapping phrase within the graph. First, we create
the left partial summary by searching for all paths (using a
depth-first search algorithm) that begin at the root node and
end at any other node, going backwards. The path with the
most weight is chosen. Assume that in Figure 2, the path with
most weight on the left-hand side of the root node (here 5.1,
including the root node) contains
the nodes rip, comedian and soupy sales. Thus, the best left
partial summary is rip comedian soupy sales.
We repeat the partial summary creation process for the right-
hand side of the current root node soupy sales. Since we want
to generate phrases that are actually found within the input
sentences, we reorganize the tree by placing the entire left
partial summary, rip comedian soupy sales, in the root node.
Assume we get the path shown in Figure 3 as the most heavily
weighted on the right hand side. The full summary generated
by the algorithm for our example is: rip comedian soupy sales
dies at age 83.
Strictly speaking, we can build either the left or right partial
summary first, and the other one next. The full summary
obtained by the algorithm is the same in both cases.
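A sketch of the path search over the weighted graph built above (ours; node weights are assumed to have been substituted for the raw counts):

```python
def best_path(tree, acc=0.0):
    """Depth-first search for the heaviest root-to-node path.

    `tree` maps word -> (weight, subtree). Returns (total_weight, words);
    for the left-hand side the word list must be reversed before being
    prepended to the root phrase, since it grows outward from the root.
    """
    best = (acc, [])
    for word, (weight, subtree) in tree.items():
        total, path = best_path(subtree, acc + weight)
        if total > best[0]:
            best = (total, [word] + path)
    return best
```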
d) Post-processing: This summary has lost its case-
sensitivity and formatting since these features were removed
to increase the amount of overlap among phrases. To recover
these features, we perform a simple best-fit algorithm between
the summary and the set of input sentences to find a matching
phrase that contains the summary. We know such a match-
ing phrase exists within at least two of the input sentences
since the algorithm only generates summaries from common
phrases. Once, we find the first matching phrase, this phrase is
our final summary. It produces the following final summary:
RIP Comedian Soupy Sales dies at age 83.
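The paper does not spell out the best-fit matching, so the following is only one plausible implementation: scan the original sentences for the first span that matches the normalized summary, ignoring case and punctuation:

```python
import re

def recover_formatting(summary, original_sentences):
    """Return the first original span matching `summary`,
    restoring its case, punctuation and formatting."""
    words = map(re.escape, summary.split())
    pattern = re.compile(r"\W+".join(words), re.IGNORECASE)
    for sentence in original_sentences:
        match = pattern.search(sentence)
        if match:
            return match.group(0)
    return summary  # per the algorithm, a match always exists
```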
B. Hybrid TF-IDF Summarization
After analyzing the results obtained by the Phrase Rein-
forcement approach, we notice that it significantly improves
upon our earlier results, but it still leaves room for improve-
ment as its performance only halves the difference between
the random and manual summarization methods (see Section
VII-E). Therefore, we use another approach based upon a
technique dating back to early summarization work [2].
1) TF-IDF: Term Frequency-Inverse Document Frequency,
is a statistical weighting technique that has been applied to
many information retrieval problems. It has been used for
automatic indexing [26], query matching of documents [27],
and automated summarization [28]. Generally, TF-IDF is not
known as a leading algorithm in automated summarization.
For straightforward automated summarization, the applica-
tion of TF-IDF is simple.
tence within a document a weight that reflects the sentence’s
saliency within the document. The sentences are ordered by
their weights from which the top m sentences with the most
weight are chosen as the summary. The weight of a sentence
is the summation of the individual term weights within the
sentence. Terms can be words, phrases, or any other type of
lexical feature [29]. To determine the weight of a term, we
use the formula:
TF-IDF = tf_ij · log_2(N / df_j)    (2)

where tf_ij is the frequency of the term T_j within the document
D_i, N is the total number of documents, and df_j is the number
of documents within the set that contain the term T_j [26].
The TF-IDF value is composed of two primary parts. The
term frequency component (TF) assigns more weight to words

Fig. 1. Fully constructed PR graph (allowing non-overlapping words/phrases).
Fig. 2. Fully weighted PR graph (requiring overlapping words/phrases).
Fig. 3. Fully constructed PR graph for right half of summary. This also
demonstrates the best complete summary.
that occur frequently within a document because important
words are often repeated [2]. The inverse document frequency
component (IDF) compensates for the fact that some words
such as common stop words are frequent. Since these words
do not help discriminate between one sentence or document
over another, these words are penalized proportionally to their
inverse document frequency (the logarithm is taken to balance
the effect of the IDF component in the formula). Therefore,
TF-IDF gives the most weight to words that occur most
frequently within a small number of documents and the least
weight to terms that occur infrequently or occur within the
majority of the documents.
One noted problem with TF-IDF is that the formula is
sensitive to document length. Singhal et al. note that longer
documents have higher term frequencies since they often
repeat terms while also having a larger number of terms
[29]. This doesn’t have any ill effects on single document
summarization. However, when generating a summary from
multiple documents this becomes an issue because the terms
within the longer documents have more weight.
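As a point of reference before the hybrid variant, a straightforward implementation of the standard per-document weighting of Equation (2); this is our sketch, with tokens assumed pre-normalized:

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Equation (2): weight = tf_ij * log2(N / df_j).

    `documents` is a list of token lists; returns one dict of
    term weights per document."""
    N = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))           # df_j: documents containing term j
    return [
        {term: tf * math.log2(N / df[term])
         for term, tf in Counter(doc).items()}
        for doc in documents
    ]
```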
C. Algorithm
Equation 2 defines the weight of a term in the context of
a document. However, we don’t have a traditional document.
Instead, we have a set of microblogging posts that are each
related to a topic. So, one question we must first answer is
how we define a document. An option is to define a single
document that encompasses all the posts together. In this
case, the TF component’s definition is straightforward since
we can compute the frequencies of the terms across all the
posts. However, doing so causes us to lose the IDF component
since we only have a single document. On the other extreme,
we could define each post as a document making the IDF
component’s definition clear. But, the TF component now has a
problem: Because each post contains only a handful of words,
most term frequencies will be a small constant for a given
post.
To handle this situation, we redefine TF-IDF in terms of a
hybrid document. We primarily define a document as a single
sentence. However, when computing the term frequencies, we
assume the document is the entire collection of posts. This
way, we have differentiated term frequencies but also don’t
lose the IDF component. A term is a single word in a sentence.
We next choose a normalization method since otherwise
the TF-IDF algorithm always biases towards longer sentences.
We normalize the weight of a sentence by dividing it by a
normalization factor. Since stop words do not help discriminate
the saliency of sentences, we give each of these types of words
a weight of zero by comparing them with a prebuilt list. We
summarize this algorithm below in Equations (3)-(7).
W(S) = ( Σ_{i=0}^{#WordsInSentence} W(w_i) ) / nf(S)    (3)

W(w_i) = tf(w_i) · log_2(idf(w_i))    (4)

tf(w_i) = #OccurrencesOfWordInAllPosts / #WordsInAllPosts    (5)

idf(w_i) = #SentencesInAllPosts / #SentencesInWhichWordOccurs    (6)

nf(S) = max[MinimumThreshold, #WordsInSentence]    (7)

where W is the weight assigned to a sentence or a word, nf is
a normalization factor, w_i is the i-th word, and S is a sentence.
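Putting Equations (3)-(7) together, a self-contained sketch of the Hybrid TF-IDF summarizer. For brevity each post is treated as a single sentence, and the tokenizer and stop-word handling are simplified relative to the paper's preprocessing:

```python
import math
from collections import Counter

def hybrid_tfidf_summary(posts, min_threshold=11, stop_words=frozenset()):
    """Return the highest-weighted post under Equations (3)-(7)."""
    sentences = [p.lower().split() for p in posts]
    all_words = [w for s in sentences for w in s]
    tf = Counter(all_words)                    # term counts over ALL posts
    total_words = len(all_words)
    n_sentences = len(sentences)
    containing = Counter()                     # sentences containing a word
    for s in sentences:
        containing.update(set(s))

    def word_weight(w):                        # Eq. (4), via Eqs. (5)-(6)
        if w in stop_words:
            return 0.0
        idf = n_sentences / containing[w]
        return (tf[w] / total_words) * math.log2(idf)

    def sentence_weight(s):                    # Eq. (3), via Eq. (7)
        nf = max(min_threshold, len(s))        # normalization factor
        return sum(word_weight(w) for w in s) / nf

    return max(posts, key=lambda p: sentence_weight(p.lower().split()))
```

With min_threshold near the average manual summary length (the paper settles on a threshold of 11 words; see Section VII-F), the normalization stops the algorithm from simply preferring the longest sentence.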
VII. EXPERIMENTAL SETUP AND EVALUATION
A. Data Collection and Pre-processing
For five consecutive days, we collected the top ten currently
trending topics from Twitter’s home page at roughly the
same time every evening. For each topic, we downloaded the
maximum number (approximately 1500) of posts. Therefore,
we had 50 trending topics with a set of 1500 posts for each.

We performed several forms of preprocessing in order to filter
the posts into a usable form. Details can be found in [25].
B. Evaluation Methods
There is no definitive standard against which one can
compare the results from an automated summarization system.
Summary evaluation is generally performed using one of two
methods: intrinsic, or extrinsic. In intrinsic evaluation, the
quality of the summary is judged based on direct analysis
using a number of predefined metrics such as grammaticality,
fluency, or content [1]. Extrinsic evaluations measure how well
a summary enables a user to perform some form of task [22].
To perform intrinsic evaluation, a common approach is to
create one or more manual summaries and to compare the
automated summaries against the manual summaries. One
popular automatic evaluation metric that has been adopted
by the Document Understanding Conference since 2004 is
ROUGE. ROUGE is a suite of metrics that automatically
measures the similarity between an automated summary and
a set of manual summaries [30]. One of the simplest ROUGE
metrics is the ROUGE-N metric:
ROUGE-N = ( Σ_{s ∈ MS} Σ_{n-gram ∈ s} Match(n-gram) ) / ( Σ_{s ∈ MS} Σ_{n-gram ∈ s} Count(n-gram) )    (8)

Here, n is the length of the n-grams, MS is the set of manual
summaries, Count(n-gram) is the number of n-grams in the manual
summary, and Match(n-gram) is the number of co-occurring n-grams
between the manual and automated summaries.
Lin [30] performed evaluations to understand how well
different forms of ROUGE’s results correlate with human
judgments. One result of particular consequence for our work
is his comparison of ROUGE with the very short (around
10 words) summary task of DUC 2003. Lin found ROUGE-
1 and other ROUGEs to correlate highly with human judg-
ments. Since this task is very similar to creating microblog
summaries, we implement ROUGE-1 as a metric.
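A minimal unigram (ROUGE-1) implementation of Equation (8), with match counts clipped to their frequency in the automated summary as in Lin's definition. This is our sketch; Lin's official toolkit also reports precision and F-measure:

```python
from collections import Counter

def rouge_1(automated, manual_summaries):
    """ROUGE-1 per Equation (8): clipped unigram matches divided by
    the total number of unigrams in the manual summaries."""
    auto = Counter(automated.lower().split())
    match = total = 0
    for manual in manual_summaries:
        ref = Counter(manual.lower().split())
        total += sum(ref.values())
        match += sum(min(count, auto[gram]) for gram, count in ref.items())
    return match / total if total else 0.0
```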
Since we want certainty that ROUGE-1 correlates with a
human evaluation of automated summaries, we also imple-
ment a manual metric used during DUC 2002: the Content
metric which asks a human judge to measure how completely
an automated summary expresses the meaning of a human
summary on a 1–5 scale, 1 being worst and 5 being best.
C. Manual Summaries
Two volunteers each generated a complete set of 50 manual
“best” summaries, one per topic, in 140 characters or
less, using only the information contained within the
posts (see Table I).
1) Manual Summary Evaluation: The manual summaries
generated by our two volunteers are semantically very similar
to one another but have different lengths and word choices. We
use ROUGE-1 to compare the manual summaries against one
another. We do the same using the Content metric in order to
understand how semantically similar the two summaries are.
By evaluating our two manual summaries against one another,
we help establish practical upper limits of performance for
TABLE I
EXAMPLES OF MANUAL SUMMARIES

Topic           Manual Summary 1                              Manual Summary 2
#BeatCancer     Every retweet of #BeatCancer will result      Tweet #beatcancer to help
                in 1 cent being donated towards Cancer        fund cancer research
                Research.
Kelly Clarkson  Between Taylor Swift and Kelly Clarkson,      Taylor Swift v. Kelly Clarkson
                which one do you prefer.....?
TABLE II
ROUGE-1, CONTENT, AND LENGTH FOR MANUAL SUMMARIES

             ROUGE-1                        Content   Length
             F       Precision   Recall
Manual 1     0.34    0.31        0.37       4.4       11
Manual 2     0.34    0.37        0.31       4.1       9
Manual Avg   0.34    0.34        0.34       4.2       10
TABLE III
CONTENT PERFORMANCE FOR NAÏVE SUMMARIES

                    Content Performance
Manual Average      4.2
Random Sentence     3.0
Shortest Sentence   2.3
automated summaries. These results in addition to the results
of the preliminary algorithms collectively establish a range of
expected performance for our primary algorithms. We compare
the manual summaries against one another bi-directionally by
assuming either set was the set of automated summaries.
To generate the Content performance, we asked a volunteer
to evaluate how well one summary expressed the meaning of
the corresponding manual summary. The average results for
computing the ROUGE-1 and Content metrics on the manual
summaries are shown in Table II.
D. Performance of Naïve Algorithms
The generation of random sentences produces an average
recall of 0.23, an average precision of 0.22, and an F-measure
of 0.23. These results are higher than we originally anticipated,
given that our average manual F-measure is only 0.34. First, some
of the overlap is explained by the fact that ROUGE-1 counts common
words. Second, while we call our first preliminary approach
“random”, we introduce some bias into this approach by our
preprocessing steps discussed earlier.
To understand how random sentences compare to manual
summaries, we present Content performance in Table III.
The random sentence approach generated a Content score of
3.0 on a scale of 5. In addition to choosing random sentences,
we also chose random posts. This slightly improves the recall
scores over the random sentence approach (0.24 vs. 0.23), but
worsens the precision (0.17 vs. 0.22).
The random post approach produces an average length of
15 words while the random sentence averaged 12 words.
Since the random sentence approach is closer to the average
manual summary length, it scores higher precision. Overall,
the random sentence approach produces a summary that is
more balanced in terms of recall and precision and a higher
F-measure as well (0.23 vs. 0.20).


References

Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, 2008.
TL;DR: A class-tested textbook on web-era information retrieval, text classification, and text clustering, covering the design, implementation, and evaluation of systems for gathering, indexing, and searching documents, with an introduction to machine learning methods on text collections.

Mihalcea, R. and Tarau, P. TextRank: Bringing Order into Text. Proceedings of EMNLP, 2004.
TL;DR: Introduces TextRank, a graph-based ranking model for text processing, and shows how this model can be successfully used in natural language applications.

Salton, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.

Luhn, H. P. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2):159–165, 1958.
TL;DR: Excerpts of technical papers and magazine articles that serve the purposes of conventional abstracts are created entirely by automatic means, using statistical information derived from word frequency and distribution to compute a relative measure of significance for words and then sentences.

Lang, K. NewsWeeder: Learning to Filter Netnews. Proceedings of the 12th International Conference on Machine Learning (ICML), 1995.
TL;DR: A learning algorithm based on the Minimum Description Length (MDL) principle raised the percentage of interesting articles shown to users from 14% to 52% on average, significantly outperforming TF-IDF weighting.
Frequently Asked Questions

Q1. What have the authors contributed in "Experiments in Microblog Summarization"?

This paper presents algorithms for summarizing microblog posts. The authors want to extend their work in various ways: multi-post summaries can be produced using the PR algorithm or the Hybrid TF-IDF algorithm by picking the posts with the top n weights. One challenge will be to produce a coherent multi-sentence summary because of issues such as the presence of anaphora and other coherence problems. The authors also want to cluster the posts on a specific topic into k clusters, to find the various themes and sub-themes present in the posts, and then find a summary that may be one post or multiple posts.
