
Showing papers on "Multi-document summarization published in 2005"


Proceedings Article
01 Oct 2005
TL;DR: This paper presents a language-independent algorithm for both single- and multi-document summarization.
Abstract: This paper discusses a language independent algorithm for single and multiple document summarization.

209 citations


Proceedings ArticleDOI
02 Oct 2005
TL;DR: An improved method for feature extraction, drawing on an existing unsupervised method, is introduced; it turns the task of feature extraction into one of term similarity by mapping crude (learned) features into a user-defined taxonomy of the entity's features.
Abstract: Capturing knowledge from free-form evaluative texts about an entity is a challenging task. New techniques of feature extraction, polarity determination and strength evaluation have been proposed. Feature extraction is particularly important to the task as it provides the underpinnings of the extracted knowledge. The work in this paper introduces an improved method for feature extraction that draws on an existing unsupervised method. By including user-specific prior knowledge of the evaluated entity, we turn the task of feature extraction into one of term similarity by mapping crude (learned) features into a user-defined taxonomy of the entity's features. Results show promise both in terms of the accuracy of the mapping as well as the reduction in the semantic redundancy of crude features.

209 citations


Journal ArticleDOI
TL;DR: The paper thoroughly discusses promising paths for future research in medical document summarization, including scaling to large collections of documents in various languages and from different media, personalization, portability to new sub-domains, and the integration of summarization technology in practical applications.

201 citations


Proceedings ArticleDOI
15 Aug 2005
TL;DR: This paper presents eight different methods of generating MDS and evaluates each of them on a large set of topics used in past DUC workshops, showing a significant improvement in the quality of summaries based on topic themes over MDS methods that use alternative topic representations.
Abstract: The problem of using topic representations for multi-document summarization (MDS) has received considerable attention recently. In this paper, we describe five different topic representations and introduce a novel representation of topics based on topic themes. We present eight different methods of generating MDS and evaluate each of these methods on a large set of topics used in past DUC workshops. Our evaluation results show a significant improvement in the quality of summaries based on topic themes over MDS methods that use other alternative topic representations.

174 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: It is shown that a summarization system using a combination of lexical, prosodic, structural and discourse features produces the most accurate summaries, and that a combination of acoustic/prosodic and structural features is enough to build a ‘good’ summarizer when speech transcription is not available.
Abstract: We present results of an empirical study of the usefulness of different types of features in selecting extractive summaries of news broadcasts for our Broadcast News Summarization System. We evaluate lexical, prosodic, structural and discourse features as predictors of those news segments which should be included in a summary. We show that a summarization system that uses a combination of these feature sets produces the most accurate summaries, and that a combination of acoustic/prosodic and structural features is enough to build a ‘good’ summarizer when speech transcription is not available.

169 citations


Proceedings ArticleDOI
Ani Nenkova1
09 Jul 2005
TL;DR: An overview of the results achieved in the different types of summarization tasks, comparing both the broader classes of baselines, systems and humans, and individual pairs of summarizers (both human and automatic).
Abstract: Since 2001, the Document Understanding Conferences have been the forum for researchers in automatic text summarization to compare methods and results on common test sets. Over the years, several types of summarization tasks have been addressed--single document summarization, multi-document summarization, summarization focused by question, and headline generation. This paper is an overview of the achieved results in the different types of summarization tasks. We compare both the broader classes of baselines, systems and humans, as well as individual pairs of summarizers (both human and automatic). An analysis of variance model is fitted, with summarizer and input set as independent variables, and the coverage score as the dependent variable, and simulation-based multiple comparisons are performed. The results document the progress in the field as a whole, rather than focusing on a single system, and thus can serve as a future reference on the work done to date, as well as a starting point in the formulation of future tasks. Results also indicate that most progress in the field has been achieved in generic multi-document summarization and that the most challenging task is that of producing a focused summary in answer to a question/topic.

167 citations


Journal ArticleDOI
TL;DR: A news delivery and summarization system, acting as a user's agent, gathers and recaps news items based on specifications and interests.
Abstract: A news delivery and summarization system, acting as a user's agent, gathers and recaps news items based on specifications and interests.

135 citations


Journal ArticleDOI
27 Nov 2005
TL;DR: This paper formulates the problem of summarizing a dataset of transactions with categorical attributes as an optimization problem involving two objective functions, compaction gain and information loss, and proposes metrics to characterize the output of any summarization algorithm.
Abstract: In this paper, we formulate the problem of summarization of a dataset of transactions with categorical attributes as an optimization problem involving two objective functions - compaction gain and information loss. We propose metrics to characterize the output of any summarization algorithm. We investigate two approaches to address this problem. The first approach is an adaptation of clustering and the second approach makes use of frequent item sets from the association analysis domain. We illustrate one application of summarization in the field of network data where we show how our technique can be effectively used to summarize network traffic into a compact but meaningful representation. Specifically, we evaluate our proposed algorithms on the 1998 DARPA Off-line Intrusion Detection Evaluation data and network data generated by SKAION Corp for the ARDA information assurance program.
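The compaction-gain/information-loss trade-off described in the abstract can be illustrated with a toy sketch. This is an illustrative simplification, not the paper's exact metrics or algorithms: a set of categorical transactions is covered by a single generalized pattern, and we measure how much the data shrinks versus how many attribute values the pattern fails to preserve.

```python
# Toy model of transaction summarization: cover all transactions with one
# generalized pattern and measure the two competing objectives.

def summarize(transactions):
    """Return a covering pattern plus compaction gain and information loss.

    A position keeps its value if all transactions agree on it; otherwise
    it is generalized to "*", and every generalized value in every covered
    transaction counts as one unit of information loss.
    """
    pattern = tuple(
        vals[0] if len(set(vals)) == 1 else "*"
        for vals in zip(*transactions)
    )
    compaction_gain = len(transactions) / 1  # input size / summary size (one pattern)
    information_loss = sum(1 for _ in transactions for v in pattern if v == "*")
    return pattern, compaction_gain, information_loss

txns = [("tcp", "http", "allow"),
        ("tcp", "http", "deny"),
        ("tcp", "ftp", "allow")]
pattern, gain, loss = summarize(txns)
# The pattern keeps the shared protocol field and generalizes the rest.
```

Here the single pattern compacts three transactions into one record at the cost of the generalized service and action fields; the paper's algorithms search for covers that balance these two objectives over many patterns.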

117 citations


Patent
04 Mar 2005
TL;DR: In this paper, a system for generating a summary of a plurality of documents and presenting the summary information to a user is provided, which includes a computer-readable document collection containing a plurality of related documents stored in electronic form.
Abstract: A system for generating a summary of a plurality of documents and presenting the summary information to a user is provided which includes a computer readable document collection containing a plurality of related documents stored in electronic form. Documents can be pre-processed to group documents into document clusters. The document clusters can also be assigned to predetermined document categories for presentation to a user. A number of multiple document summarization engines are provided which generate summaries for specific classes of multiple document clusters. A summarizer router is employed to determine a relationship of the documents in a cluster and select one of the document summarization engines for use in generating a summary of the cluster. A single event engine is provided to generate summaries of documents which are closely related temporally and to a specific event. A dissimilarity engine for multiple document summary generation is provided which generates summaries of document clusters having documents with varying degrees of relatedness. A user interface is provided to display categories, cluster titles, summaries, and related images.

91 citations


Journal ArticleDOI
TL;DR: The research shows that customization is feasible in a medical digital library and employs a unified user model to create a tailored summary of relevant documents for either a physician or lay person.

83 citations



Proceedings Article
26 Jan 2005
TL;DR: The authors use sentence classification and ideas from information retrieval to generate multi-document biographies, placing among the top performers in task 5 (short summaries focused by person questions) at DUC 2004.
Abstract: In this paper we describe a biography summarization system using sentence classification and ideas from information retrieval. Although the individual techniques are not new, assembling and applying them to generate multi-document biographies is new. Our system was evaluated in DUC 2004 and is among the top performers in task 5, short summaries focused by person questions.

Proceedings ArticleDOI
19 Sep 2005
TL;DR: A text summarization method is proposed that creates a summary by defining a relevance score for each sentence and extracting sentences from the original documents, with feature weights determined using genetic algorithms.
Abstract: In this paper, we propose a text summarization method that creates a summary by defining a relevance score for each sentence and extracting sentences from the original documents. The method takes into account the weight of each sentence in the document. Its essence lies in first identifying every sentence in the document with a characteristic vector of words appearing in the document, and then calculating a relevance score for each sentence. The relevance score of a sentence is determined by comparing it, using the cosine measure, with all the other sentences in the document and with the document title. Before the method is applied, a set of features is defined, and the weight of each word in a sentence is calculated with respect to those features. The feature weights, which influence the relevance of words, are determined using genetic algorithms.
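A minimal sketch of the relevance-scoring idea the abstract describes (plain term-frequency vectors stand in for the feature-weighted word vectors, and the genetic-algorithm weight tuning is omitted): each sentence's score is its average cosine similarity to the other sentences plus its similarity to the title.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def relevance_scores(sentences, title):
    """Score each sentence by its average similarity to the other
    sentences plus its similarity to the document title."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    tvec = Counter(title.lower().split())
    scores = []
    for i, v in enumerate(vecs):
        others = [cosine(v, w) for j, w in enumerate(vecs) if j != i]
        avg_sim = sum(others) / len(others) if others else 0.0
        scores.append(avg_sim + cosine(v, tvec))
    return scores
```

Sentences with the highest scores would then be extracted as the summary.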

Journal ArticleDOI
TL;DR: A novel approach to multi-document summarization, which explicitly addresses the problem of detecting, and retaining for the summary, multiple themes in document collections, and applies Iterative Residual Rescaling (IRR).
Abstract: This paper describes a novel approach to multi-document summarization, which explicitly addresses the problem of detecting, and retaining for the summary, multiple themes in document collections. We place equal emphasis on the processes of theme identification and theme presentation. For the former, we apply Iterative Residual Rescaling (IRR); for the latter, we argue for graphical display elements. IRR is an algorithm designed to account for correlations between words and to construct multi-dimensional topical space indicative of relationships among linguistic objects (documents, phrases, and sentences). Summaries are composed of objects with certain properties, derived by exploiting the many-to-many relationships in such a space. Given their inherent complexity, our multi-faceted summaries benefit from a visualization environment. We discuss some essential features of such an environment.

01 Jan 2005
TL;DR: The Embra system is presented, a first-time entry to DUC for 2005 which performed at or above median for the manual assessment of responsiveness and on 4 out of 5 linguistic quality questions.
Abstract: We present the Embra system, a first-time entry to DUC for 2005 which performed at or above median for the manual assessment of responsiveness and on 4 out of 5 linguistic quality questions. The system takes a novel approach to relevance and redundancy, modeling sentence similarity using a latent semantic space constructed over a very large corpus. We present a simple approach to modeling specificity based on named entities which shows a small improvement over baseline. Finally, we discuss coherence and present a sentence reordering algorithm with a component-level evaluation demonstrating a positive effect.

Proceedings ArticleDOI
10 May 2005
TL;DR: The goal of this project is to make the Web more accessible to users with visual impairments by providing some of the features naturally available to sighted users, which can emerge from simplification and summarization.
Abstract: The goal of this project is to make the Web more accessible by providing some of the features naturally available to sighted users to users with visual impairments. These features are direct access and gestalt understanding, which can emerge from simplification and summarization. Simplification is achieved by retaining sections of the web page that are considered important while removing the clutter. The purpose of summarization is to provide the users with a preview of the web page. Simplification and summarization are implemented as a "guide dog" that helps users navigate the entire web site.

Journal ArticleDOI
TL;DR: In this article, an extension of the standard hidden Markov model is proposed to generate word-to-word and phrase-tophrase alignments between documents and their human-written abstracts.
Abstract: Current research in automatic single-document summarization is dominated by two effective, yet naive approaches: summarization by sentence extraction and headline generation via bag-of-words models. While successful in some tasks, neither of these models is able to adequately capture the large set of linguistic devices utilized by humans when they produce summaries. One possible explanation for the widespread use of these models is that good techniques have been developed to extract appropriate training data for them from existing document/abstract and document/headline corpora. We believe that future progress in automatic summarization will be driven both by the development of more sophisticated, linguistically informed models, as well as a more effective leveraging of document/abstract corpora. In order to open the doors to simultaneously achieving both of these goals, we have developed techniques for automatically producing word-to-word and phrase-to-phrase alignments between documents and their human-written abstracts. These alignments make explicit the correspondences that exist in such document/abstract pairs and create a potentially rich data source from which complex summarization algorithms may learn. This paper describes experiments we have carried out to analyze the ability of humans to perform such alignments, and based on these analyses, we describe experiments for creating them automatically. Our model for the alignment task is based on an extension of the standard hidden Markov model and learns to create alignments in a completely unsupervised fashion. We describe our model in detail and present experimental results that show that our model is able to learn to reliably identify word- and phrase-level alignments in a corpus of document/abstract pairs.

Journal ArticleDOI
TL;DR: The results show that relying on generic linguistic resources and statistical techniques offers a basis for text summarization.
Abstract: The technologies for single- and multi-document summarization that are described and evaluated in this article can be used on heterogeneous texts for different summarization tasks. They refer to the extraction of important sentences from the documents, compressing the sentences to their essential or relevant content, and detecting redundant content across sentences. The technologies were tested at the Document Understanding Conferences, organized by the National Institute of Standards and Technology, USA, in 2002 and 2003. The system obtained good to very good results in these competitions. We also tested our summarization system on a variety of English encyclopedia texts and on Dutch magazine articles. The results show that relying on generic linguistic resources and statistical techniques offers a basis for text summarization.

Journal ArticleDOI
TL;DR: This paper proposes a method to calculate sentence importance using scores generated by a question-answering engine for responses to multiple questions, and describes the integration of this method with a generic multi-document summarization system.
Abstract: In recent years, answer-focused summarization has gained attention as a technology complementary to information retrieval and question answering. In order to realize multi-document summarization focused by multiple questions, we propose a method to calculate sentence importance using scores, for responses to multiple questions, generated by a Question-Answering engine. Further, we describe the integration of this method with a generic multi-document summarization system. The evaluation results demonstrate that the performance of the proposed method is better than not only several baselines but also other participants' systems at the evaluation workshop NTCIR4 TSC3 Formal Run. However, it should be noted that some of the other systems do not use the information of questions.

Proceedings ArticleDOI
30 Oct 2005
TL;DR: A new approach under the hub-authority framework is proposed that combines the text content with cues and explores the sub-topics in the document set by bringing the features of these sub-topics into graph-based sentence ranking algorithms.
Abstract: Multi-document extractive summarization relies on the concept of sentence centrality to identify the most important sentences in a document. Although some research has introduced the graph-based ranking algorithms such as PageRank and HITS into the text summarization, we propose a new approach under the hub-authority framework in this paper. Our approach combines the text content with some cues such as "cue phrase", "sentence length" and "first sentence" and explores the sub-topics in the multi-documents by bringing the features of these sub-topics into graph-based sentence ranking algorithms. We provide an evaluation of our method on DUC 2004 data. The results show that our approach is an effective graph-ranking schema in multi-document generic text summarization.
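A hedged sketch of graph-based sentence ranking in the spirit of the PageRank/HITS methods the abstract mentions. This is a generic centrality computation over a cosine-similarity sentence graph, not the paper's hub-authority schema with cue-phrase, sentence-length, and sub-topic features.

```python
import math
from collections import Counter

def _cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_sentences(sentences, damping=0.85, iters=50):
    """PageRank-style power iteration over a sentence-similarity graph:
    central sentences (similar to many others) accumulate high scores."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(vecs)
    sim = [[_cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out = [sum(row) or 1.0 for row in sim]  # row sums; guard against zero
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(scores[j] * sim[j][i] / out[j] for j in range(n))
                  for i in range(n)]
    return scores
```

The top-scoring sentences would then be selected for the extractive summary, typically with a redundancy check before inclusion.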

Proceedings ArticleDOI
31 Oct 2005
TL;DR: This work presents a method to create query-specific summaries by adding structure to documents by extracting associations between their fragments.
Abstract: Summarization of text documents is increasingly important with the amount of data available on the Internet. The large majority of current approaches view documents as linear sequences of words and create query-independent summaries. However, ignoring the structure of the document degrades the quality of summaries. Furthermore, the popularity of web search engines requires query-specific summaries. We present a method to create query-specific summaries by adding structure to documents by extracting associations between their fragments.

01 Jan 2005
TL;DR: The MuST (Multimodal Summarization for Trend Information) workshop was designed to encourage cooperative and competitive studies on summarization and visualization for trend information and is expected to encourage studies in a wide variety of research fields.
Abstract: The MuST (Multimodal Summarization for Trend Information) workshop was designed to encourage cooperative and competitive studies on summarization and visualization for trend information. The main objective of the workshop is to develop technologies using multimedia presentation that will allow intelligent systems to provide users with appropriate answers to their queries on trend information. These technologies not only involve multimedia presentation generation and multimodal dialogue processing, but also rely largely on information access technologies such as automatic summarization, information extraction, and information visualization. Therefore, the workshop is expected to encourage studies in a wide variety of research fields. A noteworthy feature of the workshop is that the participants share the same research resources, whereby they address common or related themes, with the expectation of encouraging active research and discussion, forming communities, and constructing and accumulating resources such as tools and corpora.

01 Jan 2005
TL;DR: This paper presents the team TUT/NII results at DUC 2005 and additional experiments on improving multi-document summarization, investigating improvements of ROUGE and BE scores with an approach based on sentence extraction, weighted by sentence type annotation and combined with polarity term frequencies.
Abstract: In this paper, we present our team TUT/NII results at DUC 2005 and additional experiments on improving multi-document summarization. Summarization systems have typically focused on the factual aspect of information needs. Subjectivity analysis is another essential aspect for better understanding of information needs. Our approach is based on sentence extraction, weighted by sentence type annotation, and combined with polarity term frequencies. We selected 10 topics related to subjectivity with analysis of “narratives”, and investigated improvements of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BE (Basic Elements) scores with our approach. In addition, the factual aspect of information needs was also investigated.

DOI
01 Jan 2005
TL;DR: This work presents a new approach to multilingual multi-document summarization that uses text similarity to choose sentences from English documents based on the content of the machine translated documents.
Abstract: We present a new approach for summarizing clusters of documents on the same event, some of which are machine translations of foreign-language documents and some of which are English. Our approach to multilingual multi-document summarization uses text similarity to choose sentences from English documents based on the content of the machine translated documents. A manual evaluation shows that 68% of the sentence replacements improve the summary, and the overall summarization approach outperforms first-sentence extraction baselines in automatic ROUGE-based evaluations.

Patent
Benyu Zhang1, Dou Shen1, Hua-Jun Zeng1, Wei-Ying Ma1, Zheng Chen1 
10 Aug 2005
TL;DR: A method and system for calculating the significance of a sentence within a document is provided; it identifies significant sentences based on the important words each sentence contains and selects them as a summary of the document.
Abstract: A method and system for calculating the significance of a sentence within a document is provided. The summarization system calculates the significance of the sentences of a document and selects the most significant sentences as the summary of the document. The summarization system calculates the significance of a sentence based on the "important" words of the document that are contained within the sentence. The summarization system calculates the importance of words of the document using various scoring techniques and then combines the scores to classify a word as important or not important. The summarization system can then be used to identify significant sentences of the document based on the important words that a sentence contains and select significant sentences as a summary of the document.
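A minimal sketch of the general scheme the patent describes. Raw word frequency stands in for the patent's combination of scoring techniques, and the 30% cutoff is an arbitrary illustrative threshold: words are scored and classified as important, then sentences are ranked by the important words they contain.

```python
from collections import Counter

def significant_sentences(sentences, top_word_fraction=0.3, num_sentences=1):
    """Classify the most frequent words as important, then return the
    sentences containing the most important words."""
    freq = Counter(w for s in sentences for w in s.lower().split())
    cutoff = max(1, int(len(freq) * top_word_fraction))
    important = {w for w, _ in freq.most_common(cutoff)}
    ranked = sorted(
        sentences,
        key=lambda s: sum(w in important for w in s.lower().split()),
        reverse=True,
    )
    return ranked[:num_sentences]
```

A real system would combine several word-importance scores (as the patent suggests) and normalize for sentence length, but the extract-by-important-words pipeline stays the same.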

Journal ArticleDOI
TL;DR: An intelligent system for automatic document summarization, the event indexing and summarization (EIS) system, is introduced; it is based on a cognitive psychology model (the event-indexing model) and on the roles and importance of sentences and their syntax in document understanding.

01 Jan 2005
TL;DR: CATS is a multi-document summarization system developed at the Université de Montréal for DUC 2005; from a set of topic-related documents, it produces an integrated summary answering the need for information at a given level of granularity.
Abstract: CATS is a multi-document summarization system developed at the Université de Montréal for DUC 2005. From a set of topic-related documents, it produces an integrated summary answering the need for information at a given level of granularity. It starts from a thematic analysis of the documents to identify a list of text segments containing interesting aspects related to the subject. It then matches these themes with the ones detected in the question. The very good results obtained at the DUC competition are described and discussed.

Proceedings ArticleDOI
07 Nov 2005
TL;DR: This paper for the first time investigates using lexical chains as a model of multiple documents written in Chinese to generate an indicative, moderately fluent summary, and finds that lexical chains are effective for multi-document summarization.
Abstract: This paper for the first time investigates using lexical chains as a model of multiple documents written in Chinese to generate an indicative, moderately fluent summary. The algorithm that computes lexical chains based on the HowNet knowledge database is modified to improve performance and suit Chinese summarization. Based on an analysis of semantemes, the algorithm can remove redundant similarities and retain differences in information content among multiple documents. The method pre-processes the text first, then constructs lexical chains and identifies strong chains. Significant sentences are then extracted from each document and ordered, and redundant information is recognized and removed. Finally, the summary is generated in chronological order, and anaphora resolution is applied to improve the fluency of the summary. Evaluation results show that the performance of the presented system is clearly better than that of the baseline system, and that lexical chains are effective for multi-document summarization.

Proceedings ArticleDOI
06 Oct 2005
TL;DR: This paper presents an approach to automatically acquiring distinctions in cognitive status using machine learning over the forms of referring expressions appearing in the input, and examines two specific distinctions---whether a person in the news can be assumed to be known to a target audience (hearer-old vs hearer-new)
Abstract: Machine summaries can be improved by using knowledge about the cognitive status of news article referents. In this paper, we present an approach to automatically acquiring distinctions in cognitive status using machine learning over the forms of referring expressions appearing in the input. We focus on modeling references to people, both because news often revolve around people and because existing natural language tools for named entity identification are reliable. We examine two specific distinctions---whether a person in the news can be assumed to be known to a target audience (hearer-old vs hearer-new) and whether a person is a major character in the news story. We report on machine learning experiments that show that these distinctions can be learned with high accuracy, and validate our approach using human subjects.

01 Jan 2005
TL;DR: An initial application of a sentence-trimming approach (Trimmer) to multi-document summarization in the MSE2005 and DUC2005 tasks shows that Trimmer ports easily to the new problem, although the direct impact of sentence trimming was minimal compared to other features used in the system.
Abstract: We implemented an initial application of a sentence-trimming approach (Trimmer) to the problem of multi-document summarization in the MSE2005 and DUC2005 tasks. Sentence trimming was incorporated into a feature-based summarization system, called Multi-Document Trimmer (MDT), by using sentence trimming as both a preprocessing stage and a feature for sentence ranking. We demonstrate that we were able to port Trimmer easily to this new problem. Although the direct impact of sentence trimming was minimal compared to other features used in the system, the interaction of the other features resulted in trimmed sentences accounting for nearly half of the selected summary sentences.