Showing papers on "Multi-document summarization published in 2002"


Proceedings ArticleDOI
01 Dec 2002
TL;DR: A generic framework for video summarization based on modeling the viewer's attention is presented; it takes advantage of computational attention models and eliminates the need for complex heuristic rules in video summarization.
Abstract: Automatic generation of video summaries is one of the key techniques in video management and browsing. In this paper, we present a generic framework for video summarization based on modeling the viewer's attention. Without full semantic understanding of video content, this framework takes advantage of computational attention models and eliminates the need for complex heuristic rules in video summarization. A set of audio-visual attention model features is proposed. The experimental evaluations indicate that the computational-attention-based approach is an effective alternative to video semantic analysis for video summarization.

602 citations
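
Since the abstract describes the attention-fusion pipeline only at a high level, here is a minimal sketch of the general idea: normalize several per-frame attention scores, fuse them linearly into a saliency curve, and pick well-separated peaks as key frames. The three feature streams and the fusion weights are illustrative assumptions, not the authors' model.

```python
def fuse_attention(motion, faces, audio, w=(0.5, 0.3, 0.2)):
    """Linearly combine normalized per-frame attention scores."""
    def normalize(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    m, f, a = map(normalize, (motion, faces, audio))
    return [w[0] * mi + w[1] * fi + w[2] * ai for mi, fi, ai in zip(m, f, a)]

def key_frames(saliency, min_gap=2):
    """Greedily pick saliency peaks at least `min_gap` frames apart."""
    order = sorted(range(len(saliency)), key=saliency.__getitem__, reverse=True)
    chosen = []
    for i in order:
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
    return sorted(chosen)

# Synthetic per-frame attention scores for a six-frame clip:
curve = fuse_attention(motion=[0.1, 0.9, 0.4, 0.2, 0.8, 0.3],
                       faces=[0.0, 0.5, 0.9, 0.1, 0.2, 0.1],
                       audio=[0.2, 0.7, 0.3, 0.1, 0.9, 0.2])
print(key_frames(curve))
```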


Journal ArticleDOI
TL;DR: This article proposes a methodology for studying the properties of ordering information in the news genre and describes experiments on a corpus of multiple acceptable orderings developed for the task. Based on these experiments, the authors implemented an ordering strategy that combines constraints from the chronological order of events and topical relatedness.
Abstract: The problem of organizing information for multidocument summarization so that the generated summary is coherent has received relatively little attention. While sentence ordering for single document summarization can be determined from the ordering of sentences in the input article, this is not the case for multidocument summarization where summary sentences may be drawn from different input articles. In this paper, we propose a methodology for studying the properties of ordering information in the news genre and describe experiments done on a corpus of multiple acceptable orderings we developed for the task. Based on these experiments, we implemented a strategy for ordering information that combines constraints from chronological order of events and topical relatedness. Evaluation of our augmented algorithm shows a significant improvement of the ordering over two baseline strategies.

355 citations
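
As a toy sketch of combining chronology with topical relatedness, one can group sentences by topic, order the groups by their earliest event, and keep each group internally chronological. The (text, topic, timestamp) input format is an assumption for illustration, not the paper's representation.

```python
# Sketch: order summary sentences by grouping topically related ones,
# then sorting the groups (and the sentences inside each group) by the
# chronology of the events they report.
def order_sentences(sentences):
    """sentences: iterable of (text, topic_id, timestamp) triples."""
    by_topic = {}
    for text, topic, ts in sentences:
        by_topic.setdefault(topic, []).append((ts, text))
    # Topic blocks appear in order of their earliest event; within a
    # block, sentences stay chronological.
    blocks = sorted(by_topic.values(), key=lambda b: min(ts for ts, _ in b))
    return [text for block in blocks for _, text in sorted(block)]

print(order_sentences([
    ("Rescue teams arrived Tuesday.",    "rescue", 2),
    ("The quake struck Monday night.",   "quake",  1),
    ("Aftershocks continued Wednesday.", "quake",  3),
]))
```

Note how the two quake sentences stay together even though the rescue sentence falls between them chronologically; that is the topical-relatedness constraint overriding strict chronology.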


Proceedings Article
24 Mar 2002
TL;DR: Columbia's Newsblaster system for online news summarization is presented: it crawls the web for news articles, clusters them by topic, and produces a multidocument summary for each cluster.
Abstract: Recently, there have been significant advances in several areas of language technology, including clustering, text categorization, and summarization. However, efforts to combine technology from these areas in a practical system for information access have been limited. In this paper, we present Columbia's Newsblaster system for online news summarization. Many of the tools developed at Columbia over the years are combined to produce a system that crawls the web for news articles, clusters them on specific topics, and produces multidocument summaries for each cluster.

294 citations
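
The crawl-cluster-summarize pipeline can be caricatured in a few lines with scikit-learn; the fixed cluster count and the lead-article "summary" below are stand-ins for Newsblaster's actual clustering and summarization components.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "The central bank raised interest rates by a quarter point.",
    "Rates climbed again as the bank moved to curb inflation.",
    "The home team won the championship in overtime.",
    "Fans celebrated the overtime victory downtown.",
]

# Group articles by TF-IDF similarity.
X = TfidfVectorizer(stop_words="english").fit_transform(articles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

clusters = {}
for doc, label in zip(articles, labels):
    clusters.setdefault(label, []).append(doc)

for label, docs in sorted(clusters.items()):
    # Stand-in "summary": the lead article of the cluster.
    print(f"Cluster {label}: {docs[0]}")
```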


Proceedings ArticleDOI
06 Jul 2002
TL;DR: NeATS is a multi-document summarization system that attempts to extract relevant or interesting portions from a set of documents about some topic and present them in coherent order.
Abstract: NeATS is a multi-document summarization system that attempts to extract relevant or interesting portions from a set of documents about some topic and present them in coherent order. NeATS is among the best performers in the large scale summarization evaluation DUC 2001.

233 citations


Book ChapterDOI
11 Nov 2002
TL;DR: This paper presents a summarization procedure based on trainable machine learning algorithms that employs a set of features extracted directly from the original text: statistical features, based on the frequency of certain elements in the text, and linguistic features, extracted from a simplified argumentative structure of the text.
Abstract: In this paper we address the automatic summarization task. Recent research on extractive-summary generation employs various heuristics, but few works indicate how to select the relevant features. We present a summarization procedure based on the application of trainable machine learning algorithms that employs a set of features extracted directly from the original text. These features are of two kinds: statistical, based on the frequency of some elements in the text; and linguistic, extracted from a simplified argumentative structure of the text. We also present computational results obtained by applying our summarizer to some well-known text databases, and we compare these results to baseline summarization procedures.

220 citations
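
As a rough illustration of the trainable approach, the sketch below extracts a few per-sentence features (position, length, mean term frequency, a cue-phrase flag) and fits a Naive Bayes classifier. The feature set and the tiny training data are invented; the paper's linguistic features come from an argumentative structure not modeled here.

```python
from collections import Counter
from sklearn.naive_bayes import GaussianNB

CUES = ("in conclusion", "we propose", "this paper")

def features(sentences):
    tf = Counter(w for s in sentences for w in s.lower().split())
    rows = []
    for i, s in enumerate(sentences):
        words = s.lower().split()
        rows.append([
            i / max(1, len(sentences) - 1),                  # relative position
            len(words),                                      # length in words
            sum(tf[w] for w in words) / max(1, len(words)),  # mean term frequency
            float(any(c in s.lower() for c in CUES)),        # cue-phrase flag
        ])
    return rows

# Tiny labeled example: 1 = sentence belongs in the summary.
sents = ["This paper proposes a summarizer.",
         "The weather was nice.",
         "We evaluate on two corpora.",
         "Lunch was served at noon."]
labels = [1, 0, 1, 0]

clf = GaussianNB().fit(features(sents), labels)
print(clf.predict(features(["This paper describes our results."])))
```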


Journal Article
TL;DR: A linear-time algorithm for lexical chain computation is presented that makes lexical chains a computationally feasible candidate as an intermediate representation for automatic text summarization.
Abstract: While automatic text summarization is an area that has received a great deal of attention in recent research, the problem of efficiency in this task has not been frequently addressed. When the size and quantity of documents available on the Internet and from other sources are considered, the need for a highly efficient tool that produces usable summaries is clear. We present a linear-time algorithm for lexical chain computation. The algorithm makes lexical chains a computationally feasible candidate as an intermediate representation for automatic text summarization. A method for evaluating lexical chains as an intermediate step in summarization is also presented and carried out. Such an evaluation was heretofore not possible because of the computational complexity of previous lexical chains algorithms.

181 citations
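
The key to a linear-time chainer is a single greedy pass that buckets each word by sense. The sketch below uses a tiny hand-made sense table where the real algorithm would consult WordNet relations, so it shows only the shape of the computation, not the paper's algorithm.

```python
# A tiny hand-made sense table stands in for WordNet.
SENSE = {"car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
         "driver": "person", "passenger": "person"}

def lexical_chains(words):
    chains = {}                       # sense key -> words in the chain
    for w in words:                   # single pass over the text: O(n)
        key = SENSE.get(w.lower())
        if key is not None:
            chains.setdefault(key, []).append(w)
    return chains

text = "The car hit a truck . The driver and a passenger were unhurt .".split()
for key, chain in lexical_chains(text).items():
    print(key, "->", chain)
```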


Journal ArticleDOI
TL;DR: SumUM is a text summarization system that takes a raw technical text as input and produces an indicative informative summary that motivates the topics, describes entities, and defines concepts.
Abstract: We present and evaluate SumUM, a text summarization system that takes a raw technical text as input and produces an indicative informative summary. The indicative part of the summary identifies the topics of the document, and the informative part elaborates on some of these topics according to the reader's interest. SumUM motivates the topics, describes entities, and defines concepts. It is a first step for exploring the issue of dynamic summarization. This is accomplished through a process of shallow syntactic and semantic analysis, concept identification, and text regeneration. Our method was developed through the study of a corpus of abstracts written by professional abstractors. Relying on human judgment, we have evaluated indicativeness, informativeness, and text acceptability of the automatic summaries. The results thus far indicate good performance when compared with other summarization technologies.

149 citations


Journal ArticleDOI
TL;DR: Analysis of feedback forms filled in after each decision indicated that the intelligibility of present-day machine-generated summaries is high. The evaluation methods used in SUMMAC are of interest both for summarization evaluation and for the evaluation of other ‘output-related’ NLP technologies, where there may be many potentially acceptable outputs.
Abstract: The TIPSTER Text Summarization Evaluation (SUMMAC) has developed several new extrinsic and intrinsic methods for evaluating summaries. It has established definitively that automatic text summarization is very effective in relevance assessment tasks on news articles. Summaries as short as 17% of full text length sped up decision-making by almost a factor of 2 with no statistically significant degradation in accuracy. Analysis of feedback forms filled in after each decision indicated that the intelligibility of present-day machine-generated summaries is high. Systems that performed most accurately in the production of indicative and informative topic-related summaries used term frequency and co-occurrence statistics, and vocabulary overlap comparisons between text passages. However, in the absence of a topic, these statistical methods do not appear to provide any additional leverage: in the case of generic summaries, the systems were indistinguishable in accuracy. The paper discusses some of the tradeoffs and challenges faced by the evaluation, and also lists some of the lessons learned, impacts, and possible future directions. The evaluation methods used in the SUMMAC evaluation are of interest both to summarization evaluation and to the evaluation of other ‘output-related’ NLP technologies, where there may be many potentially acceptable outputs, with no automatic way to compare them.

145 citations
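
One of the simple statistics the abstract mentions is vocabulary overlap between text passages. A plain unigram Jaccard measure, sketched below, is an illustrative choice rather than SUMMAC's exact metric.

```python
def vocab_overlap(a, b):
    """Unigram Jaccard similarity between two passages."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

summary = "rates rose as the bank fought inflation"
source  = "the central bank raised rates to fight inflation"
print(round(vocab_overlap(summary, source), 3))   # 4 shared / 11 total words
```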


Journal Article
TL;DR: This article introduces the task and the challenges involved, and motivates and presents an approach for obtaining automatic-extract summaries for human transcripts of multiparty dialogues of four different genres, without any restriction on domain.
Abstract: Automatic summarization of open-domain spoken dialogues is a relatively new research area. This article introduces the task and the challenges involved and motivates and presents an approach for obtaining automatic-extract summaries for human transcripts of multiparty dialogues of four different genres, without any restriction on domain. We address the following issues, which are intrinsic to spoken-dialogue summarization and typically can be ignored when summarizing written text such as news wire data: (1) detection and removal of speech disfluencies; (2) detection and insertion of sentence boundaries; and (3) detection and linking of cross-speaker information units (question-answer pairs). A system evaluation is performed using a corpus of 23 dialogue excerpts with an average duration of about 10 minutes, comprising 80 topical segments and about 47,000 words total. The corpus was manually annotated for relevant text spans by six human annotators. The global evaluation shows that for the two more informal genres, our summarization system using dialogue-specific components significantly outperforms two baselines: (1) a maximum-marginal-relevance ranking algorithm using TF*IDF term weighting, and (2) a LEAD baseline that extracts the first n words from a text.

121 citations
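
The first baseline named above, maximum-marginal-relevance ranking, is easy to sketch: repeatedly pick the sentence that best trades off similarity to the query against similarity to what is already selected. Plain bag-of-words cosine stands in here for the TF*IDF weighting the paper uses.

```python
import math
from collections import Counter

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query, sentences, k=2, lam=0.5):
    """Pick k sentences trading off query relevance against redundancy."""
    selected, pool = [], list(sentences)
    while pool and len(selected) < k:
        def score(s):
            redundancy = max((cosine(s, t) for t in selected), default=0.0)
            return lam * cosine(s, query) - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

sents = ["the bank raised rates",
         "rates were raised by the bank",
         "the game ended in overtime"]
print(mmr("bank rates", sents))
```

The redundancy penalty is what keeps the near-duplicate second sentence out of the two-sentence selection.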


Proceedings ArticleDOI
11 Aug 2002
TL;DR: This work proposes new semi-supervised algorithms for training classification models for text summarization and analyzes their performance on two data sets: the Reuters news-wire corpus and the Computation and Language collection of TIPSTER SUMMAC.
Abstract: With the huge amount of information available electronically, there is an increasing demand for automatic text summarization systems. The use of machine learning techniques for this task allows one to adapt summaries to user needs and to the corpus characteristics. These desirable properties have motivated an increasing amount of work in this field over the last few years. Most approaches attempt to generate summaries by extracting sentence segments and adopt the supervised learning paradigm, which requires labeling documents at the text-span level. This is a costly process, which puts strong limitations on the applicability of these methods. We investigate here the use of semi-supervised algorithms for summarization. These techniques make use of a few labeled examples together with a larger amount of unlabeled data. We propose new semi-supervised algorithms for training classification models for text summarization. We analyze their performance on two data sets: the Reuters news-wire corpus and the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC. We perform comparisons with a baseline (non-learning) system and a reference trainable summarizer system.
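
The paper proposes its own semi-supervised algorithms; as a generic illustration of the paradigm only, the self-training loop below labels the unlabeled examples the current model is most confident about and retrains. scikit-learn, the threshold, and the toy feature rows are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, rounds=3, threshold=0.6):
    """Generic self-training: absorb confident unlabeled examples, retrain."""
    X_lab = np.asarray(X_lab, dtype=float)
    y_lab = np.asarray(y_lab)
    X_unlab = np.asarray(X_unlab, dtype=float)
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        clf = LogisticRegression().fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab,
                                clf.classes_[proba[confident].argmax(axis=1)]])
        X_unlab = X_unlab[~confident]
    return LogisticRegression().fit(X_lab, y_lab)

# Feature rows might be per-sentence summarization features.
clf = self_train([[0.9, 1.0], [0.1, 0.0]], [1, 0],
                 [[0.85, 0.9], [0.15, 0.1], [0.5, 0.5]])
print(clf.predict([[0.8, 0.95]]))
```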

Proceedings ArticleDOI
11 Aug 2002
TL;DR: Describes XDoX, a cross-document summarizer designed specifically to summarize large document sets (50-500 documents and more), with examples of summaries obtained in tests and from the first Document Understanding Conference (DUC).
Abstract: In this paper we describe a Cross Document Summarizer XDoX designed specifically to summarize large document sets (50-500 documents and more). Such sets of documents are typically obtained from routing or filtering systems run against a continuous stream of data, such as a newswire. XDoX works by identifying the most salient themes within the set (at the granularity level that is regulated by the user) and composing an extraction summary, which reflects these main themes. In the current version, XDoX is not optimized to produce a summary based on a few unrelated documents; indeed, such summaries are best obtained simply by concatenating summaries of individual documents. We show examples of summaries obtained in our tests as well as from our participation in the first Document Understanding Conference (DUC).

Proceedings ArticleDOI
24 Aug 2002
TL;DR: This paper proposes a method of sentence extraction based on Support Vector Machines (SVMs), confirms the method's performance experimentally, and clarifies which features are effective for extracting sentences from different document genres.
Abstract: Extracting sentences that contain important information from a document is a form of text summarization. The technique is the key to the automatic generation of summaries similar to those written by humans. To achieve such extraction, it is important to be able to integrate heterogeneous pieces of information. One approach, parameter tuning by machine learning, has been attracting a lot of attention. This paper proposes a method of sentence extraction based on Support Vector Machines (SVMs). To confirm the method's performance, we conduct experiments that compare our method to three existing methods. Results on the Text Summarization Challenge (TSC) corpus show that our method offers the highest accuracy. Moreover, we clarify the different features effective for extracting different document genres.
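
In the spirit of the paper's setup, sentence extraction can be cast as SVM classification over per-sentence feature vectors. The two toy features and the invented training data below only illustrate the mechanics, not the paper's much richer feature set.

```python
from sklearn.svm import LinearSVC

# Each row: [relative position in document, sentence length in words]
X_train = [[0.0, 12], [0.1, 9], [0.5, 4], [0.9, 5], [0.2, 11], [0.8, 3]]
y_train = [1, 1, 0, 0, 1, 0]   # 1 = sentence was extracted by the human

clf = LinearSVC(C=1.0).fit(X_train, y_train)

# Rank unseen sentences by signed distance from the separating hyperplane.
X_new = [[0.05, 10], [0.7, 4]]
for score, feats in sorted(zip(clf.decision_function(X_new), X_new),
                           reverse=True):
    print(round(score, 2), feats)
```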

Patent
22 Jul 2002
TL;DR: A real-time interactive document summarization system that allows the user to continuously control the amount of detail to be included in a document summary.
Abstract: A real-time interactive document summarization system which allows the user to continuously control the amount of detail to be included in a document summary.
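
The core of such a system can be sketched as scoring sentences once and letting a user-moved slider re-threshold the ranked list in real time. The position-plus-length score below is a placeholder, not the patented method.

```python
def summarize(sentences, detail):
    """detail in (0, 1]: fraction of sentences to keep, in original order."""
    k = max(1, round(detail * len(sentences)))
    # Placeholder score: prefer earlier, longer sentences.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -i + 0.1 * len(sentences[i].split()),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

doc = ["A storm hit the coast.",
       "Power was lost in three towns.",
       "Crews worked overnight.",
       "Schools reopened by Friday."]
for detail in (0.25, 0.5, 1.0):   # positions of the imagined slider
    print(detail, "->", summarize(doc, detail))
```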

Proceedings ArticleDOI
24 Mar 2002
TL;DR: This paper describes the multi-document text summarization system NeATS, which was among the top two performers of the DUC-01 evaluation.
Abstract: This paper describes the multi-document text summarization system NeATS. Using a simple algorithm, NeATS was among the top two performers of the DUC-01 evaluation.

Proceedings ArticleDOI
24 Aug 2002
TL;DR: A framework for the evaluation of summaries in English and Chinese using similarity measures that can be used to evaluate extractive, non-extractive, single and multi-document summarization is described.
Abstract: We describe a framework for the evaluation of summaries in English and Chinese using similarity measures. The framework can be used to evaluate extractive, non-extractive, single and multi-document summarization. We focus on the resources developed that are made available for the research community.

Proceedings Article
01 May 2002
TL;DR: This work introduced information extraction techniques such as named entity tagging and pattern discovery into a summarization system based on sentence extraction, and achieved one of the best performances in the subjective evaluation of summarization results.
Abstract: We have introduced information extraction techniques such as named entity tagging and pattern discovery into a summarization system based on a sentence extraction technique, and evaluated its performance in the Document Understanding Conference 2001 (DUC-2001). We participated in the Single Document Summarization task in DUC-2001 and achieved one of the best performances in the subjective evaluation of summarization results.

Proceedings Article
01 May 2002
TL;DR: This work describes the development of Language and Evaluation Resources for the evaluation of summaries in English and Chinese, focusing on the resources that are made available to the research community.
Abstract: We describe our work on the development of Language and Evaluation Resources for the evaluation of summaries in English and Chinese. The language resources include a parallel corpus of English and Chinese texts which are translations of each other, a set of queries in both languages, clusters of documents relevant to each query, sentence relevance measures for each sentence in the document clusters, and manual multi-document summaries at different compression rates. The evaluation resources consist of metrics for measuring the content of automatic summaries against reference summaries. The framework can be used in the evaluation of extractive, non-extractive, single and multi-document summarization. We focus on the resources developed that are made available for the research community.

01 Jan 2002
TL;DR: A summarization system based on important-sentence extraction that employs the machine learning algorithm Support Vector Machines to classify each sentence as important or unimportant; results on the Single-Document Summarization task of DUC-2002 show high coverage scores.
Abstract: We participated in the Document Understanding Conference 2002 (DUC-2002) in order to confirm the effectiveness of our summarization system based on an important sentence extraction technique. Our system employs the machine learning algorithm Support Vector Machines to classify a sentence as an important or an unimportant sentence. The result of the Single-Document Summarization task shows that our system's performance achieved a high grade in coverage metrics.

01 Jan 2002
TL;DR: This thesis presents a cut-and-paste approach to the text generation problem in domain-independent, single-document summarization and describes a large-scale, reusable lexicon built by combining multiple, heterogeneous resources.
Abstract: Automatic text summarization provides a concise summary for a document. In this thesis, we present a cut-and-paste approach to addressing the text generation problem in domain-independent, single-document summarization. We found that professional abstractors often reuse the text in an original document for producing the text in a summary. But rather than simply extracting the original text, as in most existing automatic summarizers, humans often edit the extracted sentences. We call such editing operations “revision operations”. Our summarizer simulates two revision operations that are frequently used by humans: sentence reduction and sentence combination. Sentence reduction removes inessential phrases from sentences and sentence combination merges sentences and phrases together. The sentence reduction algorithm we propose relies on multiple sources of knowledge to decide when it is appropriate to delete a phrase from a sentence, including linguistic knowledge, probabilities trained from corpus examples, and context information. The sentence combination module relies on a set of rules to decide how to combine sentences and phrases and when to combine them. Sentence reduction aims to improve the conciseness of generated summaries and sentence combination aims to improve the coherence of generated summaries. We call this approach “cut-and-paste” since it produces summaries by excerpting and combining sentences and phrases from original documents, unlike the extraction technique which produces summaries by simply extracting sentences or passages. Our work also includes a Hidden Markov Model based sentence decomposition program which analyzes human-written summaries. The decomposition program identifies where the phrases of a summary originate in the original document, producing an aligned corpus of summaries and articles that we use to train and evaluate the summarizer. We also built a large-scale, reusable lexicon by combining multiple, heterogeneous resources. The lexicon contains lexical, syntactic, and semantic knowledge. It can be used in many applications.
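
The thesis's sentence-reduction operation weighs linguistic knowledge, corpus probabilities, and context; the crude rule-based sketch below (drop parentheticals and non-restrictive relative clauses) only illustrates what the operation does to a sentence, not how the thesis decides when to apply it.

```python
import re

def reduce_sentence(s):
    s = re.sub(r"\s*\([^)]*\)", "", s)              # drop parentheticals
    s = re.sub(r",\s*(?:which|who)[^,]*,", "", s)   # drop non-restrictive clauses
    return re.sub(r"\s+", " ", s).strip()

print(reduce_sentence(
    "The senator, who arrived late, proposed a bill (the third this year)."))
# -> "The senator proposed a bill."
```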

Proceedings ArticleDOI
01 May 2002
TL;DR: This paper presents a corpus for text summarization research consisting of 2000 annotated bibliography entries collected from various Internet websites using search engines, and uses it to find the distribution of the types of information included in indicative summaries.
Abstract: We report on a language resource consisting of 2000 annotated bibliography entries, which is being analyzed as part of our research on indicative document summarization. We show how annotated bibliographies cover certain aspects of summarization that have not been well covered by other summary corpora, and motivate why they constitute an important form to study for information retrieval. We detail our methodology for collecting the corpus and overview the document feature markup that we introduced to facilitate summary analysis. We present the characteristics of the corpus and methods of collection, and show its use in finding the distribution of types of information included in indicative summaries and their relative ordering within the summaries.

Automatic text summarization has largely been synonymous with domain-independent, sentence extraction techniques (for an overview, see Paice (1990)). These approaches have used a battery of indicators such as cue phrases, term frequency, and sentence position to choose sentences to extract and form into a summary. An alternative approach is to collect sample summaries and apply machine learning techniques to identify what types of information are included in a summary, to identify their stylistic, grammatical, and lexical choice characteristics, and to generate or regenerate a summary based on these characteristics. In this paper, we examine the first step towards this goal: the collection of an appropriate summary corpus. We focus on annotated bibliography entries because they are written without reliance on sentence extraction. Furthermore, these entries contain both informative (i.e., details and topics of the resource) and indicative (e.g., metadata such as author or purpose) information. We believe that summary texts similar in form to annotated bibliography entries, such as the one shown in Figure 1, can better serve users and replace the standard top-sentence or query-word-in-context summaries commonly found in current-generation search engines.

Our corpus of summaries consists of 2000 annotated bibliography entries collected from various Internet websites using search engines. We first review aspects and dimensions of text summaries and detail reasons for collecting a corpus of annotated bibliography entries. We follow with details on the collection methodology and a description of our annotation of the entries. We conclude with some current applications of the corpus to automatic text summarization research.

Proceedings ArticleDOI
14 Jul 2002
TL;DR: This paper examines how Centrifuser, one such summarization system, was designed with respect to methods used in the library community, and details how six librarian strategies were operationalized by computing an informative extract, indicative differences between documents, and navigational links that narrow or broaden a user's query.
Abstract: A current application of automatic text summarization is to provide an overview of relevant documents coming from an information retrieval (IR) system. This paper examines how Centrifuser, one such summarization system, was designed with respect to methods used in the library community. We have reviewed these librarian expert techniques to assist information seekers and codified them into eight distinct strategies. We detail how we have operationalized six of these strategies in Centrifuser by computing an informative extract, indicative differences between documents, as well as navigational links to narrow or broaden a user's query. We conclude the paper with results from a preliminary evaluation.

Patent
23 Dec 2002
TL;DR: In this paper, a pre-processing summarization technique that makes use of knowledge specific to the electronic mail domain is proposed to pre-process an electronic mail message so that commercially available document summarization software can subsequently generate a more useful summary from the message.
Abstract: The present invention discloses a pre-processing summarization technique that makes use of knowledge specific to the electronic mail domain to pre-process an electronic mail message so that commercially-available document summarization software can subsequently generate a more useful summary from the message. The summarization technique removes extraneous headers, quoted text, forward information, and electronic signatures, leaving more useful text to be summarized. If an enclosing electronic mail thread exists, the summarization technique uses the electronic mail message's ancestors to provide additional context for summarizing the electronic mail message. The disclosed system can be used with IBM Lotus Notes and Domino infrastructure, along with existing single-document summarizer software, to generate a summary of the discourse activity in an electronic mail thread dynamically. The summary may be further augmented to list any names, dates, and names of companies that are present in the electronic mail message being summarized.
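
A minimal sketch of such a pre-processing step, assuming conventional quoting and signature markers: strip quoted lines, cut at the signature delimiter, and delete attribution lines. The regexes are simple illustrative heuristics, not the patented technique.

```python
import re

def preprocess_email(body):
    kept = []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):   # drop quoted text
            continue
        if line.strip() == "--":            # stop at the signature delimiter
            break
        kept.append(line)
    text = "\n".join(kept)
    # Drop attribution lines such as "On Mon, Jan 7, Al wrote:".
    text = re.sub(r"^On .* wrote:\s*$", "", text, flags=re.MULTILINE)
    return text.strip()

raw = """Sounds good, let's ship Friday.
On Mon, Jan 7, Al wrote:
> Can we move the release?
--
Bo Smith | Acme Corp"""
print(preprocess_email(raw))
```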

Eric Chang1
01 Jan 2002
TL;DR: A system that combines spoken queries with automatic title summarization to ease searching for information on a mobile device.
Abstract: Ease of browsing and searching for information on mobile devices has been an area of increasing interest in the World Wide Web research community [1, 2, 3, 6, 7]. While some work has been done to enhance the usability of handwriting recognition to input queries through techniques such as automatic word suggestion [2], the use of speech as an input mechanism has not been extensively studied. This paper presents a system that combines spoken queries with automatic title summarization to ease searching for information on a mobile device. A preliminary usability study with 10 subjects indicates that spoken queries are preferred over other input methods.

Journal ArticleDOI
TL;DR: Describes text summarization techniques that enable users to focus on the key content of a document; the techniques do not require additional knowledge sources and should be applicable to any set of text documents.
Abstract: Medical information is available from a variety of new online resources. Given the number and diversity of sources, methods must be found that will enable users to quickly assimilate and determine the content of a document. Summarization is one such tool that can help users to quickly determine the main points of a document. Previous methods to automatically summarize text documents typically do not attempt to infer or define the content of a document; rather, these systems rely on secondary features or clues that may point to content. This paper describes text summarization techniques that enable users to focus on the key content of a document. The techniques presented here analyze groups of similar documents in order to form a content model. The content model is used to select the sentences forming the summary. The technique does not require additional knowledge sources; thus the method should be applicable to any set of text documents.
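
A bag-of-words caricature of the content-model idea: treat the centroid of a group of similar documents as the model and select the sentences closest to it. The toy documents and the centroid construction are illustrative assumptions, not the paper's model.

```python
import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())) or 1.0)

# A "content model" built from a group of similar documents.
docs = ["aspirin reduces fever and pain",
        "the trial found aspirin lowered pain scores",
        "patients on aspirin reported less pain"]
centroid = sum((bow(d) for d in docs), Counter())

# Select the candidate sentence closest to the model.
sentences = ["aspirin lowered reported pain",
             "the weather on trial day was mild"]
print(max(sentences, key=lambda s: cosine(bow(s), centroid)))
```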

Journal ArticleDOI
Robert G. Farrell1
TL;DR: This article examines how to create summaries of on-line asynchronous communication, in particular discussion groups, and describes a hierarchical discourse summarization algorithm and its implementation in a system called the Interactive Discussion Summarizer (IDS).
Abstract: The explosion of available information on the Internet has fueled the demand for automatic methods of text summarization. Existing approaches have primarily focused on abstracting documents such as news articles or technical papers. In this article, we examine how to create summaries of on-line asynchronous communication, in particular, discussion groups. First, we provide background on the nature of discussions as informal communication, and then we give a short history of computer conferencing and discussion systems. We then explain our approach to the problem and a set of observations and experiments we have done, putting our work in the context of research on automatic text summarization. We then describe a hierarchical discourse summarization algorithm and its implementation in a system called the Interactive Discussion Summarizer (IDS). We close with discussion and conclusions. Copyright © 2002 John Wiley & Sons, Ltd.

Book ChapterDOI
08 Apr 2002
TL;DR: It is argued that for summaries to be truly useful within data mining, they must include concepts abstracted from the text in addition to sentences extracted from the text.
Abstract: Text summarizers automatically construct summaries of a natural-language document. This paper examines the use of text summarization within data mining, identifying the potential summarizers have for uncovering interesting and unexpected information. It describes the current state of the art in commercial summarization and current approaches to the evaluation of summarizers. The paper then proposes a new model for text summarization and suggests a new form of evaluation. It argues that for summaries to be truly useful within data mining, they must include concepts abstracted from the text in addition to sentences extracted from the text. The paper uses two news articles to illustrate its points.

Proceedings ArticleDOI
24 Aug 2002
TL;DR: This paper focuses on subject shift and presents a method for extracting key paragraphs from documents that discuss the same event, using the results of event tracking, which starts from a few sample documents and finds all subsequent documents that discuss the same event.
Abstract: For multi-document summarization where documents are collected over an extended period of time, the subject in a document changes over time. This paper focuses on subject shift and presents a method for extracting key paragraphs from documents that discuss the same event. Our extraction method uses the results of event tracking which starts from a few sample documents and finds all subsequent documents that discuss the same event. The method was tested on the TDT1 corpus, and the result shows the effectiveness of the method.


Patent
16 Dec 2002
TL;DR: In a document information processing apparatus, intermediate information that contains the same character information as the document information created by a document creation application is generated and used to reduce the amount of document information, and summary information is generated from it.
Abstract: In a document information processing apparatus, intermediate information, which contains the same character information as the document information created by a document creation application and is used to reduce the amount of document information, is generated from the document information; word information contained in the document information or in the intermediate information is extracted; and summary information is generated by adding the extracted word information to the intermediate information, which has been reduced in volume as needed. The generated summary information not only has a small data volume but also contains all the word information, and is therefore usable for searching processes based on character information, such as full-text search.