Multi-document summarization

About: Multi-document summarization is a research topic. Over its lifetime, 2,270 publications have been published within this topic, receiving 71,850 citations.


Papers
Proceedings ArticleDOI
12 Oct 2005
TL;DR: The approach obtains word concepts from HowNet and uses concepts, rather than words, as features; a conceptual vector space model produces a rough summary, and the semantic similarity between sentences is then computed to reduce redundancy.
Abstract: In this paper, we propose a practical approach for extracting the most relevant sentences from the original document to form a summary. The idea of our approach is to obtain concepts of words from HowNet and to use concepts, rather than words, as features. We use a conceptual vector space model to form a rough summary, and then calculate the degree of semantic similarity between sentences to reduce redundancy. Experimental results show that our approach is effective and efficient, and that the system's performance is reliable.
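The redundancy-reduction step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tiny `CONCEPTS` dictionary is a hypothetical stand-in for a real HowNet lookup, and the similarity threshold is arbitrary.

```python
from collections import Counter
from math import sqrt

# Toy stand-in for a HowNet lookup, mapping words to concept IDs.
# (Hypothetical data; the actual system queries HowNet.)
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "fast": "speed", "quick": "speed",
    "road": "way", "highway": "way",
}

def concept_vector(sentence):
    """Map each word to its concept (falling back to the word itself)."""
    return Counter(CONCEPTS.get(w, w) for w in sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def deduplicate(sentences, threshold=0.7):
    """Keep a sentence only if it is not too similar to any kept sentence."""
    kept, vectors = [], []
    for s in sentences:
        v = concept_vector(s)
        if all(cosine(v, u) < threshold for u in vectors):
            kept.append(s)
            vectors.append(v)
    return kept
```

With concept features, "the car is fast" and "the automobile is quick" map to identical concept vectors, so the second is dropped as redundant even though the two sentences share few surface words.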

8 citations

Dissertation
17 Apr 2014
TL;DR: This thesis proposes a novel AA-based summarization method, weighted Hierarchical Archetypal Analysis, and investigates the impact of using the content-graph and multi-element graph models for language- and domain-independent extractive multi-document generic and query-focused summarization.
Abstract: This thesis is about automatic document summarization, with experimental results on generic, query, update and comparative multi-document summarization (MDS). We describe prior work and our own improvements on some important aspects of a summarization system, including text modeling by means of a graph and sentence selection via archetypal analysis. The centerpiece of this work is a novel method for summarization that we call "Archetypal Analysis Summarization". Archetypal Analysis (AA) is a promising unsupervised learning tool that combines the advantages of clustering with the flexibility of matrix factorization. We propose a novel AA-based summarization method based on the following observations. In generic document summarization, given a graph representation of a set of documents, positively and/or negatively salient sentences are values on the data set boundary. To compute these extreme values, general or weighted archetypes, we use archetypal analysis and weighted archetypal analysis, respectively. While each sentence in a data set is estimated as a mixture of archetypal sentences, the archetypes themselves are restricted to being sparse mixtures, i.e. convex combinations of the original sentences. Since AA in this way readily offers soft clustering and probabilistic ranking, we suggest considering it as a method for simultaneous sentence clustering and ranking. Another important argument in favour of using AA in MDS is that, in contrast to other factorization methods which extract prototypical, characteristic, even basic sentences, AA selects distinct (archetypal) sentences and thus induces variability and diversity in the produced summaries. Our research also contributes some new graph-based modeling approaches that facilitate the text summarization task.
We investigate the impact of using the content-graph and multi-element graph models for language- and domain-independent extractive multi-document generic and query-focused summarization. We also propose a novel version of AA, the weighted Hierarchical Archetypal Analysis, and apply it to four well-known summarization tasks: generic, query-focused, update, and comparative summarization. Experiments on summarization data sets (DUC04-07, TAC08) demonstrate the efficiency and effectiveness of our framework on all four kinds of multi-document summarization task.
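The key AA idea above is that archetypes sit on the boundary of the data and every point is a convex combination of them. The toy below illustrates just that mixture property in 2-D, where the convex weights (barycentric coordinates) have a closed form; the function name and data are mine, and real AA works in high-dimensional sentence space with the archetypes learned, not given.

```python
def mixture_weights(p, a, b, c):
    """Express point p as a convex combination of archetypes a, b, c
    (2-D barycentric coordinates, solved in closed form)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    w1 = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    w2 = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    w3 = 1.0 - w1 - w2
    return w1, w2, w3

# Archetypes are extreme points of the data; interior points are mixtures.
w = mixture_weights((1.0, 1.0), (0.0, 0.0), (3.0, 0.0), (0.0, 3.0))
```

The weights are non-negative and sum to one, which is what lets AA read them as soft cluster memberships and probabilistic ranks for sentences.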

8 citations

Journal ArticleDOI
28 Feb 2015
TL;DR: The idea is to propose a system that will accept single document as input in English and processes the input by building a rich semantic graph and then reducing this graph for generating the final summary.
Abstract: As the volume of information available on the Internet increases, there is a growing need for tools that help users find, filter and manage these resources. While more and more textual information is available online, effective retrieval is difficult without proper indexing and summarization of the content. One possible solution to this problem is abstractive text summarization. The idea is to propose a system that accepts a single English document as input, processes it by building a rich semantic graph, and then reduces this graph to generate the final summary. Text summarization is one of the most popular research areas today because information overload on the Web has increased the need for stronger and more powerful text summarizers. The required condensation of information can be achieved by summarization, which reduces the length of the original text. Text summarization is commonly classified into two types: extractive and abstractive. Extractive summarization extracts a few sentences from the original document based on statistical features rather than on semantic relations between sentences (2) and adds them to the summary. It is easier to implement, but it tends toward sentence extraction rather than true summarization, so the generated summary tends to be inconsistent. Abstractive summarization is more powerful because it requires understanding the original text and generates sentences based on their semantic meaning, which leads to a more meaningful and accurate summary than extractive methods produce.
Abstractive summaries are, however, difficult to compute, because they require complex natural language processing. Extractive summarization also has several issues. Extracted sentences tend to be longer than average, so parts of segments that are not essential to the summary get included and consume space. Important or relevant information is usually spread across sentences, and extractive summaries cannot capture it unless the summary is long enough to hold all those sentences. Conflicting information may not be presented accurately, and pure extraction often harms the overall coherence of the summary. These problems become more severe in the multi-document case, since extracts are drawn from different sources. Therefore, abstractive approaches are preferable.
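The "statistical features" flavor of extractive summarization criticized above can be sketched in a few lines. This is a deliberately naive illustration (score each sentence by the document-wide frequency of its words), not the paper's semantic-graph method; the splitting on "." is a simplification.

```python
from collections import Counter

def extract_summary(text, n=1):
    """Toy extractive summarizer: score each sentence by the summed
    document-wide frequency of its words, then keep the top-n sentences
    in their original document order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w for s in sentences for w in s.lower().split())
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in sentences[i].lower().split()),
        reverse=True,
    )
    top = sorted(ranked[:n])  # restore document order
    return ". ".join(sentences[i] for i in top) + "."
```

Note how the score is purely statistical: the method never looks at meaning, which is exactly why such extracts can be incoherent and redundant in the ways the abstract describes.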

8 citations

Journal ArticleDOI
TL;DR: A novel methodology is proposed that identifies events in text corpora and creates a summary for each event, using a probabilistic topic model to learn potential topics from the massive documents and to discover events from the documents' topic distributions.
Abstract: We have witnessed the proliferation of the Internet over the past few decades, and a large amount of textual information is generated on the Web. It is impossible for individuals to locate and digest all the latest updates available on the Web. Text summarization provides an efficient way to generate short, concise abstracts from massive documents. These documents involve many events that are hard for the summarization procedure to identify directly. We propose a novel methodology that identifies events in these text corpora and creates a summary for each event. We employ a probabilistic topic model to learn potential topics from the documents and further discover events in terms of the documents' topic distributions. For the summarization itself, we define the word set coverage problem (WSCP) to capture the most representative sentences summarizing an event, and we propose an approximate algorithm to solve this optimization problem. We conduct a set of experiments to evaluate our approach on two real datasets: Sina news and Johnson & Johnson medical news. On both datasets, our proposed method outperforms competitive baselines on the harmonic mean of coverage and conciseness.
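The abstract does not spell out the WSCP formulation or the approximation, but coverage problems of this kind are standardly approximated greedily: repeatedly pick the sentence that covers the most not-yet-covered words. The sketch below shows that standard heuristic under that assumption; the paper's exact objective may differ.

```python
def greedy_cover(sentences, k):
    """Greedy max-coverage heuristic: up to k times, pick the sentence
    that covers the most not-yet-covered words. (A standard approximation
    for coverage objectives; the paper's exact WSCP may differ.)"""
    word_sets = [set(s.lower().split()) for s in sentences]
    covered, chosen = set(), []
    for _ in range(k):
        best = max(
            (i for i in range(len(sentences)) if i not in chosen),
            key=lambda i: len(word_sets[i] - covered),
            default=None,
        )
        if best is None or not (word_sets[best] - covered):
            break  # nothing new can be covered
        chosen.append(best)
        covered |= word_sets[best]
    return [sentences[i] for i in chosen]
```

The greedy choice trades off coverage against conciseness implicitly: it stops early once no remaining sentence adds new words, so redundant sentences are never selected.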

7 citations


Network Information
Related Topics (5)
- Natural language: 31.1K papers, 806.8K citations, 85% related
- Ontology (information science): 57K papers, 869.1K citations, 84% related
- Web page: 50.3K papers, 975.1K citations, 83% related
- Recurrent neural network: 29.2K papers, 890K citations, 83% related
- Graph (abstract data type): 69.9K papers, 1.2M citations, 83% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  74
2022  160
2021  52
2020  61
2019  47
2018  52