scispace - formally typeset
Topic

Multi-document summarization

About: Multi-document summarization is a research topic concerned with producing a single summary from a set of related documents. Over its lifetime, 2,270 publications have been published within this topic, receiving 71,850 citations.


Papers
27 Oct 2010
TL;DR: This paper introduces a novel approach for automatic summarization, presents its evaluation at TAC 2009 on a newswire article summarization task, and shows that the genetic-algorithm optimization strongly improves both human and automatic evaluation scores.
Abstract: In this paper, we present the combination of a multi-document summarization system with a genetic algorithm. We first introduce a novel approach for automatic summarization. CBSEAS, the system which implements this approach, integrates a new redundancy-detection method at its very core in order to produce summaries with good informational diversity. However, the evaluation of our system at TAC 2008 (Text Analysis Conference) revealed that adapting the system to a specific domain is fundamental to obtaining summaries of acceptable quality. The second part of this paper is dedicated to a genetic algorithm that aims to adapt our system to specific domains. We present its evaluation at TAC 2009 on a newswire article summarization task and show that this optimization strongly improves both human and automatic evaluation scores.
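The domain-adaptation step described above can be sketched generically: a genetic algorithm searches for the feature weights that maximize a summary-quality score. This is a minimal illustration under assumptions, not the CBSEAS implementation — the feature names and the fitness function below are hypothetical stand-ins (a real run would score candidate summaries, e.g. with an automatic metric against reference summaries).

```python
import random

# Hypothetical feature names; the actual CBSEAS features are not given here.
FEATURES = ["centrality", "position", "length", "redundancy_penalty"]

def fitness(weights):
    # Stand-in for an automatic evaluation of the summaries produced with
    # these weights; an arbitrary smooth function with a known optimum so
    # the sketch runs end to end.
    target = [0.5, 0.2, 0.1, 0.2]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

def mutate(weights, rate=0.2):
    # Gaussian perturbation, clamped to stay non-negative.
    return [max(0.0, w + random.gauss(0, rate)) for w in weights]

def crossover(a, b):
    # Single-point crossover of two weight vectors.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=30, generations=40, seed=0):
    random.seed(seed)
    pop = [[random.random() for _ in FEATURES] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitist truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

Because the top half of each generation is carried over unchanged, the best fitness never degrades across generations.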

1 citation

Book Chapter
05 Dec 2017
TL;DR: A semantic-relatedness-based, query-focused text summarization technique is introduced to find relevant information in a single text document; it produces better summaries with more query-related sentences included.
Abstract: In this paper, a semantic-relatedness-based, query-focused text summarization technique is introduced to find relevant information in a single text document. The semantic relatedness measure extracts the sentences related to the query. The approach can work even with a short query that does not contain enough information on its own. The method produces better summaries, with more query-related sentences included. Experiments and evaluation on the DUC 2005 and 2006 datasets show significant performance gains.
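The extraction step above can be sketched as follows. The paper's actual semantic relatedness measure is not specified here, so plain bag-of-words cosine similarity stands in for it: sentences are ranked by similarity to the query and the top-scoring ones are kept.

```python
import math
from collections import Counter

def tokens(text):
    # Crude tokenizer: lowercase and strip trailing punctuation.
    return [w.strip(".,").lower() for w in text.split()]

def cosine(a, b):
    # Cosine similarity between two bags of words.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(sentences, query, k=2):
    # Keep the k sentences most related to the query.
    scored = sorted(sentences,
                    key=lambda s: cosine(tokens(s), tokens(query)),
                    reverse=True)
    return scored[:k]

doc = [
    "Bacterial chemotaxis moves cells toward attractants.",
    "The weather was mild that year.",
    "Chemotaxis requires receptors that sense attractants.",
]
summary = summarize(doc, "chemotaxis attractants", k=2)
```

The off-topic sentence gets a similarity of zero and is excluded; a real system would replace the cosine stand-in with the semantic relatedness measure.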

1 citation

Posted Content
TL;DR: This article proposed a submodular function-based framework for query-focused opinion summarization, where the relevance ordering produced by a statistical ranker and information coverage with respect to topic distribution and diverse viewpoints are both encoded as submodular functions.
Abstract: We present a submodular function-based framework for query-focused opinion summarization. Within our framework, the relevance ordering produced by a statistical ranker, and information coverage with respect to topic distribution and diverse viewpoints, are both encoded as submodular functions. Dispersion functions are utilized to minimize redundancy. We are the first to evaluate different metrics of text similarity for submodularity-based summarization methods. By experimenting on community QA and blog summarization, we show that our system outperforms state-of-the-art approaches in both automatic and human evaluation. A human evaluation task conducted at scale on Amazon Mechanical Turk shows that our system is able to generate summaries of high overall quality and information diversity.
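The standard machinery behind frameworks like the one above is greedy maximization of a monotone submodular objective, which carries a (1 - 1/e) approximation guarantee. The sketch below uses plain word coverage as the objective — an assumption for illustration; the paper's combination of relevance, topic coverage, and dispersion terms is not reproduced.

```python
def coverage(selected, sentences):
    # Monotone submodular objective: number of distinct words covered.
    covered = set()
    for i in selected:
        covered |= sentences[i]
    return len(covered)

def greedy_summary(sentences, k):
    # Repeatedly add the sentence with the largest marginal coverage gain.
    selected = []
    for _ in range(k):
        best, best_gain = None, 0
        for i in range(len(sentences)):
            if i in selected:
                continue
            gain = coverage(selected + [i], sentences) - coverage(selected, sentences)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:       # no remaining sentence adds new words
            break
        selected.append(best)
    return selected

sents = [set(s.split()) for s in [
    "submodular functions model diminishing returns",
    "greedy selection gives a near optimal guarantee",
    "submodular functions model returns",        # mostly redundant with the first
]]
picked = greedy_summary(sents, 2)
```

Because marginal gains shrink as words get covered, the redundant third sentence is never picked: its words are already covered by the first.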

1 citation

Journal Article
TL;DR: A design framework for a data summarization engine that uses an ontology as the fundamental structure of the interactive user-computer interface; a prototype demonstrates the effectiveness of the ontology-based design.
Abstract: Data summarization has recently received considerable attention in the information technology community. This design research paper discusses the methodology and principles of the design of a data summarization engine. Based on an initial analysis of requirement representations in data summarization, it proposes a generic organization of an ontology for a data summarization engine. Furthermore, this paper proposes a design framework for a data summarization engine that uses the ontology as the fundamental structure of the interactive user-computer interface. A prototype of an ontology-based data summarization engine for summarizing data incompleteness demonstrates the effectiveness of the design.

1 citation

Susan Gauch
01 Jan 2003
TL;DR: In this paper, the authors proposed a clustering algorithm for placing words in similarity classes, where each target word is described by a composite vector that records the occurrence of words positioned near any occurrence of the target.
Abstract: In future digital libraries, even "perfect" retrieval will typically return too much material for a user to cope with. One way to deal with this problem is to produce automated summaries tailored to the user's requirements. One of the prime purposes of a summary of a collection of documents is to collapse together all of the important information elements that are common to the collection. This requires some method of discovering classes of similar items, e.g., word classes. This paper describes automated techniques for placing words in similarity classes. To do this, each target word is described by a composite vector that records the occurrence of words positioned near any occurrence of the target. Target words with similar contexts are grouped together by a clustering algorithm. We describe how such classifications can be used in information retrieval and for the summarization of biological literature.

The dilemma of "perfect" retrieval
In retrieving documents or portions of full-text documents, recall is the percentage of the desired documents that are retrieved, and precision is the percentage of all retrieved documents that are of the desired type. No matter how good future systems become, even if they achieved 100% recall and precision, the amount of information online will be so large that the user will still be overwhelmed. It will rarely be the case that one returned paragraph, or even one entire document, will answer the user's questions. The information the user wants is typically scattered throughout the documents, simply because none of the documents were written (nor could they have been written) to satisfy the interests that one particular user would have at some later time. The user could look for a review of the topic, but again, there would probably not be a review focused on the user's interests, much less one as up-to-date as the literature itself.
Because the information desired is scattered across many documents, ranking the documents in order of relevance does not solve the problem.

One solution: summarizing document sets
Retrieval systems could help to avoid the dilemma above if they could automatically produce a summary of the relevant documents tailored to the user's interests, particular query, and level of expertise, adjusted to some particular length (from a paragraph to many pages). There has been work on extracting information from single sentences and from paragraphs (Zadrozny & Jensen, 1991), work on summarizing the arguments in whole documents (Alvarado, 1990), and work on automatic abstracting (Paice, 1990). Extensions of these techniques can be applied to summarizing the contents of sets of documents. Manual analysis of reviews in the biological and computer science literature reveals the strategies authors use to summarize large collections of literature. One of the primary devices is to generate listings that describe items of a given class, citing the appropriate sources. The listings could be sets of genes or enzymes in biological articles, or sorting algorithms or network protocols in computer science. Discovering the set of items in a given class in a document collection would need to be automated for this strategy to succeed. It is not appropriate to say that the system should refer to some standard listing of the items of a given class, because new terms are constantly being introduced in rapidly moving fields such as computer science or biology. Furthermore, terminology and usage are often specific to a given subfield. As an example, we would like an automated summary system to produce tables such as the following for a biological topic:

Term           Context and source
λ repressor    "OL and OR each contain a series of nonidentical binding sites for the λ repressor..." [Stryer, 1975]; pages 35-39 of Genetic Switch [Ptashne, 1992]
lac repressor  The repressor of the lactose operon [Stryer, 1975]

In a navigation (hypertext) environment the user could select any of the items in the table for expansion. In order to select the terms that should be grouped together in tabular summaries as in the example above (λ and lac), word classification must be done. This is described in the next section.

Describing and quantifying word contexts
To discover word classes, we describe the context of a word (the target word) by the preceding two context words and the following two context words. Each context position is represented by a vector containing the joint frequencies of the 150 highest-frequency words in the corpus, giving a 600-dimensional context vector. The entries in the context vectors are converted to mutual information measures, with smoothing. The similarities of the 1,000 highest-frequency words are computed from the normalized inner products of their context vectors (cosine rule). The resulting set of 500,000 similarities is used as the basis of a hierarchical clustering algorithm, a bottom-up approach producing binary trees with a similarity value between -1.0 and 1.0 at each node. The method was inspired by (Finch & Chater, 1992) and is described in more detail in (Futrelle & Gauch, 1993). Near the leaves, the words were found to be grouped by both semantic and syntactic similarity. Further up the tree, the larger classes retained only syntactic similarity.

(Footnote 1: Prof. Gauch's current address: Dept. of Computer Science, U. Kansas, Lawrence, KS.)

Some examples from the biological literature
The corpus used for this analysis was the 220,000 words of text in 1,700 abstracts that completely cover the field of bacterial chemotaxis since its inception in 1965. Bacterial chemotaxis is a phenomenon in which single bacteria move toward higher concentrations of chemical attractants such as sugars (and away from repellents).
One of the classes of terms that is constantly being added to by biologists is genetic mutant designators. One such class discovered by the system consists of ten items:

motB, tar, tsr, cheB, cheZ, cheY, cheA, flaA, flaE, double

There are two apparent anomalies in this list, "tar" and "double", both common words in other contexts. The utility of the classification method is that it is sensitive to the particular use of these words in this specialized field: "tar" means "taxis towards aspartate" in this field, and "double" is used to describe mutants which have two lesions in the same or different genes. Thus, if a table of mutants were constructed to summarize this set of papers, it should include all ten items. The following class contains compounds that are attractants used in chemotaxis studies:

aspartate, maltose, galactose, ribose, serine

These could usefully be placed in a list summarizing the major compounds of interest. The word classes also include terms for processes fundamental to the understanding of living systems:

chemotaxis, taxis, sensing, motility, rotation, behavior, movement, transport, uptake

Again, a tabulation of these along with excerpts describing them, or references to articles devoted to them, would be useful as part of a summary. Note that the word classes shown above are both syntactically and semantically homogeneous; the examples above contain only nouns. The homogeneity is easily seen from some other classes generated by the system:

adjectives: higher, lower, greater, less; other, several, many; molecular, structural
nouns (physical units): degrees, min, s, mM, microM, nm
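The context-vector idea above can be sketched at small scale: each target word is represented by the frequencies of words appearing near it, and targets whose context vectors have high cosine similarity are grouped together. This is a simplified stand-in — the paper uses position-specific vectors over the 150 most frequent words with mutual-information weighting, while this sketch uses a single windowed co-occurrence count; the toy corpus is invented.

```python
import math
from collections import Counter, defaultdict

def context_vectors(tokens, window=2):
    # For each word, count the words occurring within `window` positions.
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    # Normalized inner product of two context vectors (cosine rule).
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented toy corpus: two gene designators and two attractants, each used
# in parallel contexts.
corpus = ("cheA mutants swim poorly . cheB mutants swim poorly . "
          "sugar attracts cells . ribose attracts cells .").split()
vecs = context_vectors(corpus)

# Words used in similar contexts get similar vectors:
sim_gene = cosine(vecs["cheA"], vecs["cheB"])    # high: parallel contexts
sim_mixed = cosine(vecs["cheA"], vecs["ribose"]) # zero: no shared context
```

Feeding the full pairwise similarity matrix to a bottom-up hierarchical clustering routine would then produce the binary similarity trees described above.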

1 citation


Network Information
Related Topics (5)
Natural language
31.1K papers, 806.8K citations
85% related
Ontology (information science)
57K papers, 869.1K citations
84% related
Web page
50.3K papers, 975.1K citations
83% related
Recurrent neural network
29.2K papers, 890K citations
83% related
Graph (abstract data type)
69.9K papers, 1.2M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    74
2022    160
2021    52
2020    61
2019    47
2018    52