Experiments in Microblog Summarization
Summary
Introduction
- Twitter, the microblogging site started in 2006, has become a social phenomenon, with more than 20 million visitors each month.
- While the majority of posts are conversational or not very meaningful, about 3.6% of the posts concern topics of mainstream news.
- To help people who read Twitter posts or tweets, Twitter provides a short list of popular topics called Trending Topics.
- The authors create summaries in various ways and evaluate them using metrics for automatic summary evaluation.
III. PROBLEM DESCRIPTION
- The difficulty in interpreting the results is that the returned posts are only sorted by recency.
- The motivation of the summarizer is to automate this process and generate a more representative summary in less time and effort.
- Most search engines built into microblogging services only return a limited number of results when querying for a particular topic or phrase.
- Twitter only returns a maximum of 1500 posts for a single search phrase.
- Given a set of posts that are all related by containing a common search phrase (e.g. a topic), generate a summary that best describes the primary gist of what users are saying about that search phrase.
IV. SELECTED APPROACHES
- The authors choose an extractive approach since its methodologies more closely relate to the structure and diversity of microblogs.
- Microblogs are the antithesis of long documents.
- Extractive techniques are also known to better scale with more diverse domains [22].
- First, the authors create the novel Phrase Reinforcement algorithm that uses a graph to represent overlapping phrases in a set of related microblog sentences.
- The authors develop another primary algorithm based on a well established statistical methodology known as TF-IDF.
A. The Phrase Reinforcement Algorithm
- The Phrase Reinforcement (PR) algorithm generates summaries by looking for the most commonly occurring phrases.
- The second observation is that microbloggers often repeat the most relevant posts for a trending topic by quoting others.
- Subsequently, the algorithm isolates the set of words that occur immediately before the current node’s phrase.
- First, the authors create the left partial summary by searching for all paths (using a depth-first search algorithm) that begin at the root node and end at any other node, going backwards.
- It produces the following final summary: "RIP Comedian Soupy Sales dies at age 83."
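The PR construction described above can be approximated with a short greedy sketch: starting from the root phrase, repeatedly adopt the word that most often appears adjacent to the current phrase, discounting candidates by their distance from the root. The function name and the exact distance-based weight below are illustrative assumptions, not the authors' precise formulation.

```python
from math import log

def pr_summary(posts, root, b=100, max_len=14):
    """Greedy sketch of the Phrase Reinforcement idea: extend the root
    phrase left, then right, with the most common adjacent word, giving
    less weight to words farther from the root phrase."""
    def weight(count, dist):
        # hypothetical weighting: occurrence count minus a distance
        # penalty scaled by log base b (words seen only once get no weight)
        return count - dist * log(count, b) if count > 1 else 0

    phrase = root.lower().split()
    texts = [p.lower().split() for p in posts]
    for side in ("left", "right"):
        dist = 1
        while len(phrase) < max_len:
            counts = {}
            for words in texts:
                # find occurrences of the current phrase; record the
                # word immediately before (or after) each occurrence
                for i in range(len(words) - len(phrase) + 1):
                    if words[i:i + len(phrase)] == phrase:
                        j = i - 1 if side == "left" else i + len(phrase)
                        if 0 <= j < len(words):
                            counts[words[j]] = counts.get(words[j], 0) + 1
            if not counts:
                break
            best = max(counts, key=lambda w: weight(counts[w], dist))
            if weight(counts[best], dist) <= 0:
                break
            phrase = [best] + phrase if side == "left" else phrase + [best]
            dist += 1
    return " ".join(phrase)
```

Because repeated quoting makes the true headline the most reinforced path, running this sketch on posts that quote "rip comedian soupy sales dies at age 83" recovers that phrase from the root "soupy sales".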
B. Hybrid TF-IDF Summarization
- After analyzing the results obtained by the Phrase Reinforcement approach, the authors notice that it significantly improves upon their earlier results, but it still leaves room for improvement, as it closes only about half of the gap between the random and manual summarization methods (see Section VII-E).
- Applying TF-IDF to automated summarization is straightforward.
- The sentences are ordered by their weights from which the top m sentences with the most weight are chosen as the summary.
- Therefore, TF-IDF gives the most weight to words that occur most frequently within a small number of documents and the least weight to terms that occur infrequently or occur within the majority of the documents.
- When generating a summary from multiple documents this becomes an issue because the terms within the longer documents have more weight.
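The length bias described above can be demonstrated with a toy example (the per-word weights below are made up for illustration): summing per-word TF-IDF weights always lets a longer sentence accumulate more weight, regardless of content quality, which is why the normalization step discussed next is needed.

```python
# Illustrative per-word TF-IDF weights (not from the paper); unknown
# words get a small default weight.
weights = {"soupy": 2.0, "sales": 2.0, "dies": 1.5, "today": 0.5, "sadly": 0.4}

def raw_weight(sentence):
    """Unnormalized sentence weight: sum of per-word weights."""
    return sum(weights.get(w, 0.3) for w in sentence.split())

short = "soupy sales dies"
long_ = "soupy sales dies today sadly and so on and on"
# The longer sentence wins even though its extra words carry little weight.
assert raw_weight(long_) > raw_weight(short)
```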
C. Algorithm
- The authors don’t have a traditional document.
- On the other extreme, the authors could define each post as a document making the IDF component’s definition clear.
- When computing the term frequencies, the authors assume the document is the entire collection of posts.
- The authors next choose a normalization method since otherwise the TF-IDF algorithm always biases towards longer sentences.
- The authors summarize this algorithm below in Equations (3)-(7).
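A minimal sketch of this hybrid scheme, under the interpretation above: term frequency is computed over the whole collection of posts treated as one document, IDF treats each post as a document, and a sentence's total weight is divided by max(word count, threshold). Function and parameter names are our own, not the paper's notation.

```python
from math import log2

def hybrid_tfidf_summary(posts, norm_threshold=11):
    """Sketch of Hybrid TF-IDF sentence scoring: collection-wide term
    frequency, per-post document frequency, and length normalization
    by max(word count, norm_threshold)."""
    N = len(posts)
    tokenized = [p.lower().split() for p in posts]
    tf, df = {}, {}
    for words in tokenized:
        for w in words:
            tf[w] = tf.get(w, 0) + 1      # frequency across all posts
        for w in set(words):
            df[w] = df.get(w, 0) + 1      # number of posts containing w

    def score(words):
        s = sum(tf[w] * log2(N / df[w]) for w in words)
        # normalization prevents a bias toward longer sentences while
        # not rewarding very short ones
        return s / max(len(words), norm_threshold)

    return " ".join(max(tokenized, key=score))
```

Raising or lowering `norm_threshold` trades off summary length against density, which matches the threshold experiments reported in Section F below.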
A. Data Collection and Pre-processing
- For five consecutive days, the authors collected the top ten currently trending topics from Twitter’s home page at roughly the same time every evening.
- For each topic, the authors downloaded the maximum number (approximately 1500) of posts.
- Therefore, the authors had 50 trending topics with a set of 1500 posts for each.
B. Evaluation Methods
- There is no definitive standard against which one can compare the results from an automated summarization system.
- In intrinsic evaluation, the quality of the summary is judged based on direct analysis using a number of predefined metrics such as grammaticality, fluency, or content [1].
- ROUGE is a suite of metrics that automatically measures the similarity between an automated summary and a set of manual summaries [30].
- ROUGE-N = Σ Match(n-gram) / Σ Count(n-gram) (8). Here, n is the length of the n-grams, Count(n-gram) is the number of n-grams in the manual summary, and Match(n-gram) is the number of co-occurring n-grams between the manual and automated summaries.
- Lin [30] performed evaluations to understand how well different forms of ROUGE’s results correlate with human judgments.
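A unigram (ROUGE-1) recall computation consistent with these definitions can be sketched as follows; this is a simplified single-reference version, not Lin's full toolkit.

```python
from collections import Counter

def rouge_1_recall(manual, automated):
    """ROUGE-1 recall: co-occurring unigrams between the manual and
    automated summaries, divided by the unigram count of the manual
    summary."""
    ref = Counter(manual.lower().split())
    cand = Counter(automated.lower().split())
    # clip each word's match count at its frequency in the reference
    match = sum(min(ref[w], cand[w]) for w in ref)
    return match / sum(ref.values())
```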
C. Manual Summaries
- Two volunteers each generated a complete set of 50 manual "best" summaries, one per topic, in 140 characters or less, using only the information contained within the posts (see Table 1).
- The manual summaries generated by their two volunteers are semantically very similar to one another but have different lengths and word choices.
- The authors use ROUGE-1 to compare the manual summaries against one another.
- By evaluating their two manual summaries against one another, the authors help establish practical upper limits of performance for automated summaries.
- These results in addition to the results of the preliminary algorithms collectively establish a range of expected performance for their primary algorithms.
E. Performance of Phrase Reinforcement Algorithm
- This is a significant improvement over the random sentence approach.
- This score is an improvement over the random summaries.
- By assigning less weight to nodes farther from the root phrase, the algorithm prefers more common shorter phrases over less common longer phrases.
- There appears to be a threshold (b ≈ 100) below which smaller values of b begin reducing the average summary length.
- In Figure 4, the label “PR Phrase (NULL)” indicates the absence of the weighting parameter altogether.
F. Performance of the Hybrid TF-IDF Algorithm
- In Figure 7, the authors present the results of the TF-IDF algorithm for the ROUGE-1 metric.
- The TF-IDF results are denoted as TF-IDF Sentence (11) to distinguish the fact that the TF-IDF algorithm produces sentences instead of phrases for summaries and that the authors are using a threshold of 11 words as their normalization factor.
- This score is also higher than the average Content score of the Phrase Reinforcement algorithm which was 3.66.
- Interestingly, the TF-IDF summaries, at an average length of 9 words, are one word shorter, on average, than the manual summaries.
- As seen in Figure 10, by varying the normalization threshold, the authors are able to control the average summary length and resulting ROUGE-1 precision and recall.
VIII. CONCLUSION
- The authors have presented two primary approaches to microblog summarization.
- The authors find, after exhaustive experimentation, that the Hybrid TF-IDF algorithm produces summaries as good as, or better than, those of the PR algorithm.
- One challenge will be to produce a coherent multi-sentence summary because of issues such as presence of anaphora and other coherence issues.
- The authors also want to cluster the posts on a specific topic into k clusters to find the various themes and sub-themes present in the posts and then find a summary that may be one-post or multi-post.
- Of course, one of the biggest problems the authors have observed with microblogs is redundancy; in other words, quite frequently, a large number of posts are similar to one another.
Frequently Asked Questions (8)
Q2. What are the future works mentioned in the paper "Experiments in microblog summarization" ?
The authors want to extend their work in various ways. The authors can do so using the PR algorithm or the Hybrid TF-IDF algorithm, by picking posts with the top n weights. One challenge will be to produce a coherent multi-sentence summary because of issues such as presence of anaphora and other coherence issues. The authors also want to cluster the posts on a specific topic into k clusters to find the various themes and sub-themes present in the posts and then find a summary that may be one-post or multi-post.
Q3. What is the way to control the average summary length?
Since ROUGE-1 measures unigram overlap between the manual and automated summaries, their initial guess of an optimal threshold is one that produces an average summary length equal to the average manual summary length of 10 words.
Q4. What is the common way to evaluate a summary?
Since the authors want certainty that ROUGE-1 correlates with a human evaluation of automated summaries, the authors also implement a manual metric used during DUC 2002: the Content metric, which asks a human judge to measure how completely an automated summary expresses the meaning of a human summary on a scale of 1 (worst) to 5 (best).
Q5. What is the way to generate a partial summary?
Since the authors want to generate phrases that are actually found within the input sentences, the authors reorganize the tree by placing the entire left partial summary, "rip comedian soupy sales", in the root node.
Q6. What is the challenge of a multi-sentence summary?
One challenge will be to produce a coherent multi-sentence summary because of issues such as presence of anaphora and other coherence issues.
Q7. Why is the inverse document frequency component penalized?
Since these words do not help discriminate between one sentence or document over another, these words are penalized proportionally to their inverse document frequency (the logarithm is taken to balance the effect of the IDF component in the formula).
Q8. How many words are used as the normalization factor?
The TF-IDF results are denoted as TF-IDF Sentence (11) to distinguish the fact that the TF-IDF algorithm produces sentences instead of phrases for summaries and that the authors are using a threshold of 11 words as their normalization factor.