
Showing papers on "Multi-document summarization" published in 2010


Journal ArticleDOI
TL;DR: A survey of extractive text summarization techniques is presented, covering methods that select important sentences, paragraphs, etc. from the source text and concatenate them into a shorter form that conveys the most important information of the original document.
Abstract: Text summarization is the task of condensing a source text into a shorter version that preserves its information content and overall meaning. It is very difficult for human beings to manually summarize large text documents. Text summarization methods can be classified into extractive and abstractive summarization. An extractive method selects important sentences, paragraphs, etc. from the original document and concatenates them into a shorter form; the importance of sentences is decided based on statistical and linguistic features. An abstractive method involves understanding the original text and re-telling it in fewer words: it uses linguistic methods to examine and interpret the text, and then finds new concepts and expressions to best describe it, generating a new, shorter text that conveys the most important information from the original document. In this paper, a survey of extractive text summarization techniques is presented.
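To make the extractive approach concrete, here is a minimal sketch (not any specific system from the survey): sentences are scored by the average document-level frequency of their words, and the top-scoring ones are concatenated in document order. The function name and the scoring choice are illustrative.

```python
from collections import Counter
import re

def extractive_summary(text, num_sentences=2):
    """Minimal extractive summarizer: score each sentence by the
    average document-level frequency of its words, then return the
    top-scoring sentences in their original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return ' '.join(sentences[i] for i in sorted(ranked[:num_sentences]))
```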

559 citations


Proceedings Article
23 Aug 2010
TL;DR: A novel graph-based summarization framework (Opinosis) generates concise abstractive summaries of highly redundant opinions; its summaries show better agreement with human summaries than a baseline extractive method.
Abstract: We present a novel graph-based summarization framework (Opinosis) that generates concise abstractive summaries of highly redundant opinions. Evaluation results on summarizing user reviews show that Opinosis summaries have better agreement with human summaries compared to the baseline extractive method. The summaries are readable, reasonably well-formed and are informative enough to convey the major opinions.
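A toy sketch of the word-graph intuition behind this style of abstractive summarization, under simplifying assumptions: nodes are surface words, edges count adjacency across redundant sentences, and heavily traversed paths yield candidate summary phrases. The real Opinosis graph attaches POS tags to nodes and validates paths, which is omitted here.

```python
from collections import defaultdict

def build_word_graph(sentences):
    """Directed word-adjacency graph: edge weights count how often one
    word directly follows another across the input sentences."""
    graph = defaultdict(lambda: defaultdict(int))
    starts = defaultdict(int)
    for s in sentences:
        tokens = s.lower().split()
        if tokens:
            starts[tokens[0]] += 1
        for a, b in zip(tokens, tokens[1:]):
            graph[a][b] += 1
    return graph, starts

def greedy_path(graph, start, max_len=8):
    """Follow the heaviest outgoing edge from a start word; heavily
    traversed paths correspond to highly redundant phrasings."""
    path, node = [start], start
    for _ in range(max_len - 1):
        if not graph[node]:
            break
        node = max(graph[node], key=graph[node].get)
        if node in path:  # avoid cycles
            break
        path.append(node)
    return ' '.join(path)

reviews = ["the battery life is very good",
           "the battery life is great",
           "very good battery life overall"]
graph, starts = build_word_graph(reviews)
print(greedy_path(graph, max(starts, key=starts.get)))
# -> "the battery life is very good"
```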

500 citations


Proceedings Article
02 Jun 2010
TL;DR: It is shown, both theoretically and empirically, that a modified greedy algorithm can efficiently solve the budgeted submodular maximization problem near-optimally, and new approximation bounds are derived in doing so.
Abstract: We treat the text summarization problem as maximizing a submodular function under a budget constraint. We show, both theoretically and empirically, that a modified greedy algorithm can efficiently solve the budgeted submodular maximization problem near-optimally, and we derive new approximation bounds in doing so. Experiments on the DUC'04 task show that our approach is superior to the best-performing method from the DUC'04 evaluation on ROUGE-1 scores.
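The modified greedy selection rule described in the abstract can be sketched as follows: at each step, pick the sentence with the highest marginal gain per (scaled) cost that still fits the budget. The word-coverage objective and the scaling exponent r below are illustrative stand-ins for the paper's submodular objective.

```python
def budgeted_greedy(items, costs, marginal_gain, budget, r=1.0):
    """Cost-scaled greedy for budgeted submodular maximization: at each
    step add the item maximizing marginal_gain / cost**r among those
    that still fit in the budget. (The full algorithm also compares the
    result against the best single item within the budget.)"""
    selected, spent = [], 0
    remaining = set(range(len(items)))
    while remaining:
        best, best_ratio = None, 0.0
        for i in sorted(remaining):
            if spent + costs[i] > budget:
                continue
            ratio = marginal_gain(selected, i) / (costs[i] ** r)
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break
        selected.append(best)
        spent += costs[best]
        remaining.remove(best)
    return selected

sentences = ["the cat sat on the mat",
             "a dog chased the cat",
             "dogs and cats make good pets"]
costs = [len(s.split()) for s in sentences]  # cost = sentence length

def coverage_gain(selected, i):
    """Submodular word-coverage objective: the gain of sentence i is
    the number of words it covers that selected sentences do not."""
    covered = {w for j in selected for w in sentences[j].split()}
    return len(set(sentences[i].split()) - covered)

print(budgeted_greedy(sentences, costs, coverage_gain, budget=12))
```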

432 citations


Proceedings ArticleDOI
13 Oct 2010
TL;DR: The paper studies a solution that mediates between the two approaches: short and accurate textual descriptions that characterize software entities without requiring developers to read the implementation details.
Abstract: During maintenance, developers cannot read the entire code of large systems. They need a way to get a quick understanding of source code entities (such as classes, methods, packages, etc.) so they can efficiently identify and then focus on the ones related to the task at hand. Sometimes reading just a method header or a class name does not tell enough about its purpose and meaning, while reading the entire implementation takes too long. We study a solution that mediates between the two approaches: short and accurate textual descriptions that characterize software entities without requiring developers to read the implementation details. We create such descriptions using techniques from automatic text summarization. The paper presents a study that investigates the suitability of various such techniques for generating source code summaries. The results indicate that a combination of text summarization techniques is most appropriate for source code summarization and that developers generally agree with the summaries produced.

356 citations


Proceedings Article
11 Jul 2010
TL;DR: This paper formulates extractive summarization as a two-step learning problem: a generative model for pattern discovery and a regression model for inference based on the lexical and structural characteristics of the sentences.
Abstract: Scoring sentences in documents given abstract summaries created by humans is important in extractive multi-document summarization. In this paper, we formulate extractive summarization as a two-step learning problem, building a generative model for pattern discovery and a regression model for inference. We calculate scores for sentences in document clusters based on their latent characteristics using a hierarchical topic model. Then, using these scores, we train a regression model based on the lexical and structural characteristics of the sentences, and use the model to score sentences of new documents to form a summary. Our system advances the current state of the art, improving ROUGE scores by ~7%. Generated summaries are less redundant and more coherent according to manual quality evaluations.
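The second (regression) step can be sketched generically: given feature vectors for sentences and target scores from the generative model, fit a regressor and use it to score sentences of unseen documents. The least-squares fit below is a stand-in; the paper's exact regression model and features are not reproduced.

```python
import numpy as np

def fit_sentence_scorer(train_features, train_scores):
    """Least-squares regression from sentence feature vectors (e.g.
    lexical and structural features) to target scores (e.g. scores
    produced by a generative topic model); returns a weight vector."""
    x = np.hstack([train_features, np.ones((train_features.shape[0], 1))])
    w, *_ = np.linalg.lstsq(x, train_scores, rcond=None)
    return w

def score_new_sentences(features, w):
    """Apply the trained regression weights to unseen sentences."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    return x @ w

# Toy usage: four training sentences with two features each.
train_x = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]])
train_y = np.array([0.8, 0.3, 0.5, 0.7])
w = fit_sentence_scorer(train_x, train_y)
print(score_new_sentences(np.array([[0.6, 0.4]]), w))
```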

156 citations


Proceedings Article
24 Sep 2010
TL;DR: The results establish the usefulness of discourse features and show that lexical overlap provides a simple and cheap alternative to discourse for computing text structure, with comparable performance for the task of content selection.
Abstract: We present analyses aimed at eliciting which specific aspects of discourse provide the strongest indication for text importance. In the context of content selection for single document summarization of news, we examine the benefits of both the graph structure of text provided by discourse relations and the semantic sense of these relations. We find that structure information is the most robust indicator of importance. Semantic sense only provides constraints on content selection but is not indicative of important content by itself. However, sense features complement structure information and lead to improved performance. Further, both types of discourse information prove complementary to non-discourse features. While our results establish the usefulness of discourse features, we also find that lexical overlap provides a simple and cheap alternative to discourse for computing text structure with comparable performance for the task of content selection.

155 citations


Proceedings Article
23 Aug 2010
TL;DR: It is shown that four well-known summarization tasks including generic, query-focused, update, and comparative summarization can be modeled as different variations derived from the proposed framework.
Abstract: Multi-document summarization has been an important problem in information retrieval. It aims to distill the most important information from a set of documents to generate a compressed summary. Given a sentence graph generated from a set of documents where vertices represent sentences and edges indicate that the corresponding vertices are similar, the extracted summary can be described using the idea of graph domination. In this paper, we propose a new principled and versatile framework for multi-document summarization using the minimum dominating set. We show that four well-known summarization tasks including generic, query-focused, update, and comparative summarization can be modeled as different variations derived from the proposed framework. Approximation algorithms for performing summarization are also proposed and empirical experiments are conducted to demonstrate the effectiveness of our proposed framework.
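A minimal sketch of the dominating-set view of summarization: build a sentence similarity graph, then apply the standard greedy approximation, repeatedly choosing the sentence that dominates the most not-yet-dominated sentences. The Jaccard similarity and the threshold are illustrative choices, not the paper's exact settings.

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def dominating_set_summary(sentences, threshold=0.2):
    """Greedy approximation of a minimum dominating set on the sentence
    similarity graph: each chosen sentence dominates itself and every
    sentence whose similarity to it reaches the threshold."""
    n = len(sentences)
    adj = {i: {i} for i in range(n)}
    for i, j in combinations(range(n), 2):
        if jaccard(sentences[i], sentences[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    undominated, summary = set(range(n)), []
    while undominated:
        # pick the vertex covering the most not-yet-dominated vertices
        best = max(range(n), key=lambda v: len(adj[v] & undominated))
        summary.append(sentences[best])
        undominated -= adj[best]
    return summary
```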

150 citations


Proceedings Article
11 Jul 2010
TL;DR: This work presents the first systematic assessment of several diverse classes of metrics designed to capture various aspects of well-written text, and trains and tests linguistic quality models on consecutive years of NIST evaluation data to show the generality of results.
Abstract: To date, few attempts have been made to develop and validate methods for automatic evaluation of linguistic quality in text summarization. We present the first systematic assessment of several diverse classes of metrics designed to capture various aspects of well-written text. We train and test linguistic quality models on consecutive years of NIST evaluation data in order to show the generality of results. For grammaticality, the best results come from a set of syntactic features. Focus, coherence and referential clarity are best evaluated by a class of features measuring local coherence on the basis of cosine similarity between sentences, coreference information, and summarization specific features. Our best results are 90% accuracy for pairwise comparisons of competing systems over a test set of several inputs and 70% for ranking summaries of a specific input.

105 citations


Proceedings Article
23 Aug 2010
TL;DR: Comparisons show how this methodology excels at the task of single-paper summarization, and how it outperforms other multi-document summarization methods.
Abstract: This paper presents an approach to summarizing a single scientific paper by extracting its contributions from the set of citation sentences written in other papers. Our methodology is based on extracting significant keyphrases from the set of citation sentences and using these keyphrases to build the summary. Comparisons show how this methodology excels at the task of single-paper summarization, and how it outperforms other multi-document summarization methods.

103 citations


Proceedings Article
11 Jul 2010
TL;DR: This paper proposes to consider the translation quality of each sentence in the English-to-Chinese cross-language summarization process: English sentences with high translation quality and high informativeness are selected and translated to form the Chinese summary.
Abstract: Cross-language document summarization is the task of producing a summary in one language for a document set in a different language. Existing methods simply use machine translation for document translation or summary translation. However, current machine translation services are far from satisfactory, with the result that the quality of the cross-language summary is usually very poor in both readability and content. In this paper, we propose to consider the translation quality of each sentence in the English-to-Chinese cross-language summarization process. First, the translation quality of each English sentence in the document set is predicted with the SVM regression method, and then the quality score of each sentence is incorporated into the summarization process. Finally, the English sentences with high translation quality and high informativeness are selected and translated to form the Chinese summary. Experimental results demonstrate the effectiveness and usefulness of the proposed approach.
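The core selection idea, combining each English sentence's predicted translation quality with its informativeness before choosing what to translate, can be sketched as below. The linear interpolation and its weight alpha are assumptions; the paper predicts quality with SVM regression, replaced here by precomputed score lists.

```python
def select_for_cross_language_summary(sentences, informativeness,
                                      translation_quality, k=3, alpha=0.5):
    """Rank English sentences by interpolating informativeness with
    predicted translation quality, keeping the top k in document order;
    only these would then be machine-translated into Chinese."""
    def combined(i):
        return (alpha * informativeness[i]
                + (1 - alpha) * translation_quality[i])

    ranked = sorted(range(len(sentences)), key=combined, reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```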

98 citations


Journal ArticleDOI
TL;DR: This article proposes using a small number of nearest neighbor documents to improve document summarization and keyphrase extraction for the specified document, under the assumption that the neighbor documents could provide additional knowledge and more clues.
Abstract: Document summarization and keyphrase extraction are two related tasks in the IR and NLP fields, and both of them aim at extracting condensed representations from a single text document. Existing methods for single document summarization and keyphrase extraction usually make use of only the information contained in the specified document. This article proposes using a small number of nearest neighbor documents to improve document summarization and keyphrase extraction for the specified document, under the assumption that the neighbor documents could provide additional knowledge and more clues. The specified document is expanded to a small document set by adding a few neighbor documents close to the document, and the graph-based ranking algorithm is then applied on the expanded document set to make use of both the local information in the specified document and the global information in the neighbor documents. Experimental results on the Document Understanding Conference (DUC) benchmark datasets demonstrate the effectiveness and robustness of our proposed approaches. The cross-document sentence relationships in the expanded document set are validated to be beneficial to single document summarization, and the word cooccurrence relationships in the neighbor documents are validated to be very helpful to single document keyphrase extraction.
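A sketch of the expansion idea under common graph-ranking assumptions: pool the target document's sentences with those of its nearest neighbor documents, run a PageRank-style iteration over the cosine-similarity sentence graph, and keep only the target document's sentences. The damping factor and similarity measure follow standard practice, not necessarily the paper's exact configuration.

```python
from collections import Counter
import math

def cosine(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_with_neighbors(target_sents, neighbor_sents, d=0.85, iters=50):
    """PageRank-style power iteration over the sentence graph of the
    target document expanded with neighbor documents' sentences; the
    cross-document edges contribute to target-sentence scores."""
    sents = target_sents + neighbor_sents
    n = len(sents)
    w = [[cosine(sents[i], sents[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out = [sum(row) or 1.0 for row in w]  # guard against dangling nodes
    score = [1.0 / n] * n
    for _ in range(iters):
        score = [(1 - d) / n +
                 d * sum(w[j][i] / out[j] * score[j] for j in range(n))
                 for i in range(n)]
    # return target-document sentence indices, best first
    return sorted(range(len(target_sents)),
                  key=lambda i: score[i], reverse=True)
```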

Proceedings Article
23 Aug 2010
TL;DR: This work applies a new content-based evaluation framework called Fresa to compute a variety of divergences among probability distributions in text summarization tasks, including generic and focus-based multi-document summarization in English and generic single-document summarization in French and Spanish.
Abstract: We study correlation of rankings of text summarization systems using evaluation methods with and without human models. We apply our comparison framework to various well-established content-based evaluation measures in text summarization such as coverage, Responsiveness, Pyramids and Rouge studying their associations in various text summarization tasks including generic and focus-based multi-document summarization in English and generic single-document summarization in French and Spanish. The research is carried out using a new content-based evaluation framework called Fresa to compute a variety of divergences among probability distributions.
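Fresa-style evaluation compares a summary's word distribution against its source's. A common divergence for this purpose is Jensen-Shannon, sketched below; the add-one smoothing is our simplification, not necessarily what Fresa implements.

```python
from collections import Counter
import math

def js_divergence(source_text, summary_text):
    """Jensen-Shannon divergence between the unigram distributions of a
    source and a summary; lower values mean the summary's word
    distribution is closer to the source's."""
    src = Counter(source_text.lower().split())
    summ = Counter(summary_text.lower().split())
    vocab = set(src) | set(summ)

    def dist(counts):
        total = sum(counts.values()) + len(vocab)  # add-one smoothing
        return {w: (counts[w] + 1) / total for w in vocab}

    p, q = dist(src), dist(summ)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}

    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in vocab)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```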

Proceedings Article
06 Jun 2010
TL;DR: Evidence is provided that intrinsic evaluation of summaries using Amazon's Mechanical Turk is quite difficult: non-expert judges are not able to recover system rankings derived from experts.
Abstract: We provide evidence that intrinsic evaluation of summaries using Amazon's Mechanical Turk is quite difficult. Experiments mirroring evaluation at the Text Analysis Conference's summarization track show that non-expert judges are not able to recover system rankings derived from experts.

Journal ArticleDOI
TL;DR: A novel document-sensitive graph model that emphasizes the influence of global document set information on local sentence evaluation and develops an iterative sentence ranking algorithm, namely DsR (Document-Sensitive Ranking), which outperforms previous graph-based models in both generic and query-oriented summarization tasks.
Abstract: In recent years, graph-based models and ranking algorithms have drawn considerable attention from the extractive document summarization community. Most existing approaches take into account sentence-level relations (e.g. sentence similarity) but neglect the difference among documents and the influence of documents on sentences. In this paper, we present a novel document-sensitive graph model that emphasizes the influence of global document set information on local sentence evaluation. By exploiting document–document and document–sentence relations, we distinguish intra-document sentence relations from inter-document sentence relations. In such a way, we move towards the goal of truly summarizing multiple documents rather than a single combined document. Based on this model, we develop an iterative sentence ranking algorithm, namely DsR (Document-Sensitive Ranking). Automatic ROUGE evaluations on the DUC data sets show that DsR outperforms previous graph-based models in both generic and query-oriented summarization tasks.

Journal ArticleDOI
TL;DR: The average continuity, an automatic evaluation measure of sentence ordering in a summary, is introduced and its appropriateness for this task is investigated.
Abstract: Ordering information is a difficult but important task for applications generating natural language texts such as multi-document summarization, question answering, and concept-to-text generation. In multi-document summarization, information is selected from a set of source documents. However, improper ordering of information in a summary can confuse the reader and deteriorate the readability of the summary. Therefore, it is vital to properly order the information in multi-document summarization. We present a bottom-up approach to arrange sentences extracted for multi-document summarization. To capture the association and order of two textual segments (e.g. sentences), we define four criteria: chronology, topical-closeness, precedence, and succession. These criteria are integrated into a criterion by a supervised learning approach. We repeatedly concatenate two textual segments into one segment based on the criterion, until we obtain the overall segment with all sentences arranged. We evaluate the sentence orderings produced by the proposed method and numerous baselines using subjective gradings as well as automatic evaluation measures. We introduce the average continuity, an automatic evaluation measure of sentence ordering in a summary, and investigate its appropriateness for this task.
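A sketch of the average continuity measure as commonly formulated: take the geometric mean, over n = 2..max_n, of the precision of length-n contiguous sentence runs in the predicted ordering that also appear contiguously and in order in the reference, with a small smoothing constant. The exact n range and smoothing value here are assumptions; consult the article for the published definition.

```python
import math

def average_continuity(predicted, reference, max_n=5, alpha=0.01):
    """Average continuity for sentence ordering: geometric mean of the
    precisions P_n of length-n contiguous runs of the predicted order
    that also appear contiguously (and in order) in the reference.
    Assumes both orderings contain the same sentences."""
    ref_pos = {s: i for i, s in enumerate(reference)}
    log_sum, count = 0.0, 0
    for n in range(2, max_n + 1):
        windows = [predicted[i:i + n]
                   for i in range(len(predicted) - n + 1)]
        if not windows:
            continue
        m = sum(1 for w in windows
                if all(ref_pos[w[k + 1]] == ref_pos[w[k]] + 1
                       for k in range(n - 1)))
        log_sum += math.log(m / len(windows) + alpha)
        count += 1
    return math.exp(log_sum / count) if count else 0.0
```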

Journal ArticleDOI
31 Dec 2010
TL;DR: A new content-based method for evaluating text summarization systems without human models is studied; it is used to produce system rankings and computes a variety of divergences among probability distributions.
Abstract: We study a new content-based method for the evaluation of text summarization systems without human models which is used to produce system rankings. The research is carried out using a new content-based evaluation framework called Fresa to compute a variety of divergences among probability distributions. We apply our comparison framework to various well-established content-based evaluation measures in text summarization such as COVERAGE, RESPONSIVENESS, PYRAMIDS and ROUGE, studying their associations in various text summarization tasks including generic multi-document summarization in English and French, focus-based multi-document summarization in English, and generic single-document summarization in French and Spanish.

Proceedings Article
23 Aug 2010
TL;DR: Two new LSA-based summarization algorithms are proposed, and their performance is compared with existing LSA-based algorithms using ROUGE-L scores.
Abstract: Text summarization addresses the problem of extracting important information from huge amounts of text data. There are various methods in the literature that aim to produce well-formed summaries. One of the most commonly used methods is Latent Semantic Analysis (LSA). In this paper, different LSA-based summarization algorithms are explained and two new LSA-based summarization algorithms are proposed. The algorithms are evaluated on Turkish documents, and their performances are compared using their ROUGE-L scores. One of our algorithms produces the best scores.
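The generic LSA pipeline the paper builds on can be sketched as follows: build a term-by-sentence matrix, take its SVD, and select for each leading latent topic the sentence with the largest weight in the corresponding row of V^T (the classic Gong and Liu-style selection). The paper's two new algorithms differ in the selection step and are not reproduced here.

```python
import numpy as np

def lsa_summary(sentences, k=2):
    """Gong & Liu-style LSA selection: SVD of the term-by-sentence
    count matrix; for each of the k strongest latent topics, pick the
    sentence with the largest weight in that row of V^T."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    a = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.lower().split():
            a[index[w], j] += 1
    _, _, vt = np.linalg.svd(a, full_matrices=False)
    chosen = []
    for topic in range(min(k, vt.shape[0])):
        j = int(np.argmax(vt[topic]))
        if j not in chosen:
            chosen.append(j)
    return [sentences[j] for j in sorted(chosen)]
```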

Proceedings ArticleDOI
26 Oct 2010
TL;DR: A new summarization method based on an incremental hierarchical clustering framework updates summaries as soon as a new document arrives; extensive experiments demonstrate its effectiveness and efficiency.
Abstract: Document summarization has become a hot topic in recent years. However, most existing summarization methods work on a batch of documents and do not consider that documents may arrive in a sequence and the corresponding summaries need to be updated in real time. In this paper, we propose a new summarization method based on an incremental hierarchical clustering framework to update summaries as soon as a new document arrives. Extensive experimental results demonstrate the effectiveness and efficiency of our proposed method.

Proceedings ArticleDOI
19 Nov 2010
TL;DR: Innovative unsupervised methods for automatic sentence extraction using graph-based ranking algorithms and a shortest-path algorithm are presented.
Abstract: A summary is a brief and accurate representation of an input text such that the output covers the most important concepts of the source in a condensed manner. Text summarization is an emerging technique for understanding the main purpose of any kind of document. To visualize a large text document within a short duration and a small visible area like a PDA screen, summarization provides greater flexibility and convenience. This paper presents innovative unsupervised methods for automatic sentence extraction using graph-based ranking algorithms and a shortest-path algorithm.

Journal ArticleDOI
01 Nov 2010
TL;DR: The experimental results on open benchmark data sets from DUC2005 and DUC2007 show that the proposed generic multi-document summarization method significantly outperforms the baseline methods for multi-document summarization.
Abstract: Multi-document summarization is a process of automatic creation of a compressed version of a given collection of documents that provides useful information to users. In this article we propose a generic multi-document summarization method based on sentence clustering. We introduce five clustering methods, which optimize various aspects of intra-cluster similarity, inter-cluster dissimilarity and their combinations. To solve the clustering problem, a modification of the discrete particle swarm optimization algorithm is proposed. The experimental results on open benchmark data sets from DUC2005 and DUC2007 show that our method significantly outperforms the baseline methods for multi-document summarization.

Proceedings Article
23 Aug 2010
TL;DR: The prototype Related Work Summarization system, ReWoS, takes in a set of keywords, arranged in a hierarchical fashion, that describes a target paper's topics, and drives the creation of an extractive summary using two different strategies for locating appropriate sentences for general topics as well as detailed ones.
Abstract: We introduce the novel problem of automatic related work summarization. Given multiple articles (e.g., conference/journal papers) as input, a related work summarization system creates a topic-biased summary of related work specific to the target paper. Our prototype Related Work Summarization system, ReWoS, takes in a set of keywords, arranged in a hierarchical fashion, that describes a target paper's topics, to drive the creation of an extractive summary using two different strategies for locating appropriate sentences for general topics as well as detailed ones. Our initial results show an improvement over generic multi-document summarization baselines in a human evaluation.

Proceedings Article
23 Aug 2010
TL;DR: An extractive summarization model is proposed to provide an evaluation framework for position information, and results show that word position information is more effective and adaptive than sentence position information.
Abstract: Position information has proved very effective in document summarization, especially in generic summarization. Existing approaches mostly consider the information of sentence positions in a document, based on a sentence position hypothesis that the importance of a sentence decreases with its distance from the beginning of the document. In this paper, we consider another kind of position information, i.e., word position information, which is based on the ordinal positions of word appearances instead of sentence positions. An extractive summarization model is proposed to provide an evaluation framework for the position information. The resulting systems are evaluated on various data sets to demonstrate the effectiveness of the position information in different summarization tasks. Experimental results show that word position information is more effective and adaptive than sentence position information.
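The word-position hypothesis can be illustrated with a small sketch: weight each word by a function that decays with the ordinal position at which the word first appears, and score a sentence by the mean weight of its words. The 1/ordinal decay is an illustrative choice, not the paper's exact model.

```python
def word_position_scores(document_sentences):
    """Score sentences by word position: each word's weight decays with
    the ordinal position of its first appearance in the document (an
    illustrative 1/ordinal decay); a sentence's score is the mean
    weight of its words."""
    weight, ordinal = {}, 0
    for s in document_sentences:
        for w in s.lower().split():
            ordinal += 1
            if w not in weight:
                weight[w] = 1.0 / ordinal

    def score(sentence):
        tokens = sentence.lower().split()
        return sum(weight[t] for t in tokens) / (len(tokens) or 1)

    return [score(s) for s in document_sentences]
```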

Proceedings ArticleDOI
09 Jan 2010
TL;DR: This paper proposes a knowledge discovery approach for the Web that provides an overview of the information on a Website using an integration of summarization and visualization techniques, capable of reducing the time required to identify and search for information or knowledge on the Web.
Abstract: The number of Web sites has noticeably increased to roughly 225 million in the last ten years. This means there is a rapid growth of knowledge and information on the Internet. Although search engines can help users filter their desired information based on keywords, the search result is normally presented in the form of a list, and users have to visit each Web page in order to determine the appropriateness of the result. A considerable amount of time therefore has to be spent on finding the required information. To address this issue, this paper proposes a knowledge discovery approach for the Web that provides an overview of the information on a Website using an integration of summarization and visualization techniques. This includes text summarization, a tag cloud, a Document Type View, and interactive features such as drill-down and thumbnails. This approach is capable of reducing the time required to identify and search for information or knowledge on the Web.

Journal ArticleDOI
TL;DR: This paper reframes the extractive summarization task using a regression scheme instead of binary classification, evaluates the approaches using the ICSI meeting corpus on both human transcripts and speech recognition output, and shows performance improvements using different sampling methods and regression models.

Proceedings ArticleDOI
02 Apr 2010
TL;DR: This paper considers document summarization as a multi-objective optimization problem involving four objective functions, namely information coverage, significance, redundancy and text coherence, and chooses the DUC 2005 and 2006 query-oriented summarization tasks to examine the proposed model.
Abstract: In this paper, we consider document summarization as a multi-objective optimization problem involving four objective functions, namely information coverage, significance, redundancy and text coherence. These functions measure the possible summaries based on the identified core terms and main topics (i.e. clusters of semantically or statistically related core terms). We choose the DUC 2005 and 2006 query-oriented summarization tasks to examine the proposed model. The encouraging results indicate that the multi-objective optimization based framework for document summarization is truly a promising research direction.

Journal ArticleDOI
TL;DR: This article presents eight different methods of generating multidocument summaries and evaluates each of these methods on a large set of topics used in past DUC workshops, showing a significant improvement in the quality of summaries based on topic themes over MDS methods that use other alternative topic representations.
Abstract: The problem of using topic representations for multidocument summarization (MDS) has received considerable attention recently. Several topic representations have been employed for producing informative and coherent summaries. In this article, we describe five previously known topic representations and introduce two novel representations of topics based on topic themes. We present eight different methods of generating multidocument summaries and evaluate each of these methods on a large set of topics used in past DUC workshops. Our evaluation results show a significant improvement in the quality of summaries based on topic themes over MDS methods that use other alternative topic representations.

Proceedings Article
16 Jul 2010
TL;DR: This work evaluates deep content selection methods for multidocument summarization based on the CST model (Cross-document Structure Theory) and shows that the use of the CST model helps to improve the informativeness and quality of automatic summaries.
Abstract: With the huge and growing amount of information on the web and the little time available to read and process all this information, automatic summaries have become very important resources. In this work, we evaluate deep content selection methods for multidocument summarization based on the CST model (Cross-document Structure Theory). Our methods consider summarization preferences and focus on the main overall problems of multidocument treatment: redundancy, complementarity, and contradiction among different information sources. We also evaluate the impact of the CST model on superficial summarization systems. Our results show that the use of the CST model helps to improve the informativeness and quality of automatic summaries.

Proceedings ArticleDOI
19 Jul 2010
TL;DR: Evaluation results on the collection of press releases by Miami-Dade County Department of Emergency Management during Hurricane Wilma in 2005 demonstrate the efficacy of Ontology-enriched Multi-Document Summarization.
Abstract: In this poster, we propose a novel document summarization approach named Ontology-enriched Multi-Document Summarization (OMS) for utilizing background knowledge to improve summarization results. OMS first maps the sentences of input documents onto an ontology, then links the given query to a specific node in the ontology, and finally extracts the summary from the sentences in the subtree rooted at the query node. By using the domain-related ontology, OMS can better capture the semantic relevance between the query and the sentences, and thus lead to better summarization results. As a byproduct, the final summary generated by OMS can be represented as a tree showing the hierarchical relationships of the extracted sentences. Evaluation results on the collection of press releases by the Miami-Dade County Department of Emergency Management during Hurricane Wilma in 2005 demonstrate the efficacy of OMS.

Proceedings Article
09 Oct 2010
TL;DR: This paper proposes an A* search algorithm to find the best extractive summary up to a given length, which is both optimal and efficient to run, and proposes a discriminative training algorithm which directly maximises the quality of the best summary.
Abstract: In this paper we address two key challenges for extractive multi-document summarization: the search problem of finding the best scoring summary and the training problem of learning the best model parameters. We propose an A* search algorithm to find the best extractive summary up to a given length, which is both optimal and efficient to run. Further, we propose a discriminative training algorithm which directly maximises the quality of the best summary, rather than assuming a sentence-level decomposition as in earlier work. Our approach leads to significantly better results than earlier techniques across a number of evaluation metrics.
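A sketch of the A* formulation for extractive selection: states fix an include/skip decision per sentence, g is the score of the chosen sentences, and the heuristic optimistically fills the remaining budget at the best score-per-word among unseen sentences, which keeps it admissible. Sentence scores are assumed precomputed; the paper's exact scoring model and heuristic are not reproduced.

```python
import heapq

def astar_summary(sentences, scores, budget):
    """A* search over include/skip decisions per sentence, maximizing
    total score under a word-length budget. h is an optimistic bound:
    remaining budget times the best score-per-word among unseen
    sentences, so the first goal state popped is optimal."""
    lengths = [len(s.split()) for s in sentences]
    n = len(sentences)

    def h(i, remaining):
        densities = [scores[j] / lengths[j] for j in range(i, n)
                     if lengths[j]]
        return remaining * max(densities, default=0.0)

    # priority queue of (-(g + h), g, next index, used length, chosen)
    heap = [(-h(0, budget), 0.0, 0, 0, ())]
    while heap:
        _, g, i, used, chosen = heapq.heappop(heap)
        if i == n:  # all decisions made: optimal summary found
            return [sentences[j] for j in chosen]
        # option 1: skip sentence i
        heapq.heappush(heap, (-(g + h(i + 1, budget - used)),
                              g, i + 1, used, chosen))
        # option 2: include sentence i if it fits the budget
        if used + lengths[i] <= budget:
            g2 = g + scores[i]
            heapq.heappush(heap,
                           (-(g2 + h(i + 1, budget - used - lengths[i])),
                            g2, i + 1, used + lengths[i], chosen + (i,)))
    return []

sents = ["summarization is useful", "dogs bark",
         "useful systems compress text"]
print(astar_summary(sents, scores=[2.0, 0.5, 1.8], budget=7))
```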

Proceedings Article
23 Aug 2010
TL;DR: The development of an opinion summarization system that works on a Bengali news corpus is described, along with the building of an annotated gold-standard corpus and the acquisition of linguistic tools for lexico-syntactic, syntactic and discourse-level feature extraction.
Abstract: In this paper, the development of an opinion summarization system that works on a Bengali news corpus is described. The system identifies the sentiment information in each document, aggregates it, and presents the summary information as text. The present system follows a topic-sentiment model: sentiment identification is designed as discourse-level theme identification, and topic-sentiment aggregation is achieved by theme clustering (k-means) and a Document-Level Theme Relational Graph representation. The Document-Level Theme Relational Graph is finally used for candidate summary sentence selection by the standard PageRank algorithms used in Information Retrieval (IR). As Bengali is a resource-constrained language, the building of an annotated gold-standard corpus and the acquisition of linguistic tools for lexico-syntactic, syntactic and discourse-level feature extraction are described in this paper. The reported accuracy of the theme detection technique is 83.60% precision, 76.44% recall and 79.85% F-measure. The summarization system has been evaluated at 72.15% precision, 67.32% recall and 69.65% F-measure.