scispace - formally typeset
Topic

Multi-document summarization

About: Multi-document summarization is a research topic. Over its lifetime, 2,270 publications have been published within this topic, receiving 71,850 citations.


Papers
Proceedings ArticleDOI
04 Jun 2009
TL;DR: This work proposes a one-step approach for document summarization that jointly performs sentence extraction and compression by solving an integer linear program.
Abstract: Text summarization is one of the oldest problems in natural language processing. Popular approaches rely on extracting relevant sentences from the original documents. As a side effect, sentences that are too long but partly relevant are doomed to either not appear in the final summary, or prevent inclusion of other relevant sentences. Sentence compression is a recent framework that aims to select the shortest subsequence of words that yields an informative and grammatical sentence. This work proposes a one-step approach for document summarization that jointly performs sentence extraction and compression by solving an integer linear program. We report favorable experimental results on newswire data.
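The joint formulation can be illustrated with a toy version of the objective: for each sentence, choose either nothing or one of its compressed variants, maximizing total relevance under a word budget. The sketch below brute-forces a tiny instance for clarity; the paper solves the real problem as an integer linear program, and every sentence, variant, and score here is invented for illustration.

```python
from itertools import product

# Each sentence offers several candidate compressions: (text, relevance score).
# The empty string means "drop this sentence". Scores are illustrative only.
candidates = [
    [("", 0.0), ("the court ruled on the appeal", 3.0),
     ("the court, after long deliberation, ruled on the appeal", 3.5)],
    [("", 0.0), ("markets rose sharply", 2.0)],
    [("", 0.0), ("officials declined to comment", 1.0)],
]

BUDGET = 10  # maximum summary length in words

def words(text):
    return len(text.split())

best_score, best_choice = -1.0, None
# Enumerate one choice per sentence; an ILP solver explores this space
# implicitly instead of exhaustively.
for choice in product(*candidates):
    length = sum(words(text) for text, _ in choice)
    score = sum(s for _, s in choice)
    if length <= BUDGET and score > best_score:
        best_score, best_choice = score, choice

summary = [text for text, _ in best_choice if text]
print(summary, best_score)
```

Note how the shorter compression of the first sentence wins: it frees enough of the budget to also include the second sentence, which is exactly the trade-off a pipeline of extraction followed by compression cannot make.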

132 citations

Posted Content
TL;DR: This work argues that faithfulness is also a vital prerequisite for a practical abstractive summarization system and proposes a dual-attention sequence-to-sequence framework to force the generation conditioned on both the source text and the extracted fact descriptions.
Abstract: Unlike extractive summarization, abstractive summarization has to fuse different parts of the source text, which tends to produce fake facts. Our preliminary study reveals nearly 30% of the outputs from a state-of-the-art neural summarization system suffer from this problem. While previous abstractive summarization approaches usually focus on improving informativeness, we argue that faithfulness is also a vital prerequisite for a practical abstractive summarization system. To avoid generating fake facts in a summary, we leverage open information extraction and dependency parse technologies to extract actual fact descriptions from the source text. A dual-attention sequence-to-sequence framework is then proposed to force the generation to be conditioned on both the source text and the extracted fact descriptions. Experiments on the Gigaword benchmark dataset demonstrate that our model can reduce fake summaries by 80%. Notably, the fact descriptions also bring significant improvement in informativeness since they often condense the meaning of the source text.
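The core mechanism can be pictured as two attention contexts that are mixed before generation: one computed over the source-text states, one over the fact-description states. The sketch below is a deliberately simplified numeric illustration with dot-product attention, a fixed mixing gate, and made-up two-dimensional vectors; the paper's model is a trained sequence-to-sequence network, not this.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys):
    # Dot-product attention: softmax over query-key similarities,
    # then a weighted sum of the key vectors.
    scores = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(scores, keys)) for i in range(dim)]

def dual_attention(query, source_states, fact_states, gate=0.5):
    # Condition generation on both the source text and the extracted fact
    # descriptions by mixing the two context vectors (fixed gate here;
    # a real model would learn it).
    c_src = attend(query, source_states)
    c_fact = attend(query, fact_states)
    return [gate * a + (1 - gate) * b for a, b in zip(c_src, c_fact)]

query = [1.0, 0.0]
source_states = [[1.0, 0.0], [0.0, 1.0]]
fact_states = [[0.5, 0.5]]
print(dual_attention(query, source_states, fact_states))
```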

132 citations

Proceedings ArticleDOI
20 Jul 2008
TL;DR: The proposed summarization methods utilizing comments showed significant improvement over those not using comments, and the methods using the feature-biased sentence extraction approach were observed to outperform those using the uniform-document approach.
Abstract: Comments left by readers on Web documents contain valuable information that can be utilized in different information retrieval tasks including document search, visualization, and summarization. In this paper, we study the problem of comments-oriented document summarization and aim to summarize a Web document (e.g., a blog post) by considering not only its content, but also the comments left by its readers. We identify three relations (namely, topic, quotation, and mention) by which comments can be linked to one another, and model the relations in three graphs. The importance of each comment is then scored by: (i) a graph-based method, where the three graphs are merged into a multi-relation graph; (ii) a tensor-based method, where the three graphs are used to construct a 3rd-order tensor. To generate a comments-oriented summary, we extract sentences from the given Web document using either a feature-biased approach or a uniform-document approach. The former scores sentences to bias keywords derived from comments, while the latter scores sentences uniformly with comments. In our experiments using a set of blog posts with manually labeled sentences, our proposed summarization methods utilizing comments showed significant improvement over those not using comments. The methods using the feature-biased sentence extraction approach were observed to outperform those using the uniform-document approach.
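The graph-based scoring step can be sketched as follows: merge the three relation graphs into one weighted multi-relation graph, then rank comments with a PageRank-style iteration. All comment ids and edge weights below are invented, and the paper's exact scoring (and its tensor-based alternative) may differ from this minimal version.

```python
# Three comment-to-comment relation graphs, as edge -> weight maps.
# Ids and weights are made up for the example.
topic = {("c1", "c2"): 1.0, ("c2", "c3"): 1.0}
quotation = {("c3", "c1"): 1.0}
mention = {("c2", "c1"): 1.0}

# Merge into one multi-relation graph by summing edge weights.
merged = {}
for graph in (topic, quotation, mention):
    for edge, w in graph.items():
        merged[edge] = merged.get(edge, 0.0) + w

nodes = sorted({n for edge in merged for n in edge})
score = {n: 1.0 / len(nodes) for n in nodes}
DAMP = 0.85
for _ in range(50):
    new = {n: (1 - DAMP) / len(nodes) for n in nodes}
    for (u, v), w in merged.items():
        # Distribute u's score along its outgoing edges, proportional to weight.
        out = sum(w2 for (a, _), w2 in merged.items() if a == u)
        new[v] += DAMP * score[u] * (w / out)
    score = new

ranked = sorted(score, key=score.get, reverse=True)
print(ranked)
```

In this toy graph c1 comes out on top: it is quoted by c3 and mentioned by c2, so two of the three relations funnel importance into it.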

130 citations

Proceedings ArticleDOI
Xinfan Meng, Furu Wei, Xiaohua Liu, Ming Zhou, Sujian Li, Houfeng Wang
12 Aug 2012
TL;DR: This paper proposes an entity-centric topic-based opinion summarization framework, which aims to produce opinion summaries organized by topic while emphasizing the insight behind the opinions in Twitter.
Abstract: Microblogging services, such as Twitter, have become popular channels for people to express their opinions towards a broad range of topics. Twitter generates a huge volume of instant messages (i.e. tweets) carrying users' sentiments and attitudes every minute, which both necessitates automatic opinion summarization and poses great challenges to the summarization system. In this paper, we study the problem of opinion summarization for entities, such as celebrities and brands, in Twitter. We propose an entity-centric topic-based opinion summarization framework, which aims to produce opinion summaries organized by topic while emphasizing the insight behind the opinions. To this end, we first mine topics from #hashtags, the human-annotated semantic tags in tweets. We integrate the #hashtags as weakly supervised information into topic modeling algorithms to obtain better interpretation and representation for calculating the similarity among them, and adopt the Affinity Propagation algorithm to group #hashtags into coherent topics. Subsequently, we use templates generalized from paraphrasing to identify tweets with deep insights, which reveal reasons, express demands or reflect viewpoints. Afterwards, we develop a target-dependent (i.e. entity-dependent) sentiment classification approach to identify the opinion of tweets towards a given target. Finally, the opinion summary is generated by integrating information from the dimensions of topic, opinion and insight, as well as other factors (e.g. topic relevancy, redundancy and language styles), in a unified optimization framework. We conduct extensive experiments on a real-life data set to evaluate the performance of the individual opinion summarization modules as well as the quality of the produced summary. The promising experimental results show the effectiveness of the proposed framework and algorithms.
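The idea of target-dependent sentiment, that the same tweet can carry different sentiment for different entities, can be illustrated with a minimal distance-weighted lexicon sketch. The lexicon, weighting scheme, and example tweet below are all invented; the paper's classifier is a trained model, not this heuristic.

```python
# Tiny opinion lexicon (illustrative, not from the paper).
LEXICON = {"love": 1.0, "great": 1.0, "hate": -1.0, "awful": -1.0}

def target_sentiment(tweet, target):
    """Score sentiment towards `target`: opinion words count more the
    closer they sit to the target mention, so sentiment aimed at other
    entities in the same tweet is discounted."""
    tokens = tweet.lower().split()
    if target not in tokens:
        return 0.0
    t_pos = tokens.index(target)
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            score += LEXICON[tok] / (1 + abs(i - t_pos))  # decay with distance
    return score

tweet = "i love pepsi but hate the new coke ad"
print(target_sentiment(tweet, "pepsi"))  # positive: "love" is adjacent
print(target_sentiment(tweet, "coke"))   # negative: "hate" is closer
```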

128 citations

Journal ArticleDOI
TL;DR: Two independent methods for identifying salient sentences in biomedical texts using concepts derived from domain-specific resources are presented and it is shown that the best performance is achieved when the two methods are combined.
Abstract: Text summarization is a method for data reduction. The use of text summarization enables users to reduce the amount of text that must be read while still assimilating the core information. The data reduction offered by text summarization is particularly useful in the biomedical domain, where physicians must continuously find clinical trial study information to incorporate into their patient treatment efforts. Such efforts are often hampered by the high volume of publications. This paper presents two independent methods (BioChain and FreqDist) for identifying salient sentences in biomedical texts using concepts derived from domain-specific resources. Our semantic-based method (BioChain) is effective at identifying thematic sentences, while our frequency-distribution method (FreqDist) removes information redundancy. The two methods are then combined to form a hybrid method (ChainFreq). An evaluation of each method is performed using the ROUGE system to compare system-generated summaries against a set of manually generated summaries. The BioChain and FreqDist methods outperform some common summarization systems, while the ChainFreq method improves upon the base approaches. Our work shows that the best performance is achieved when the two methods are combined. The paper also presents a brief physician's evaluation of three randomly selected papers from an evaluation corpus to show that the author's abstract does not always reflect the entire contents of the full text.
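Since several of these papers evaluate with ROUGE, the core of that measure is worth making concrete. The sketch below computes ROUGE-1 recall, the fraction of reference unigrams that also appear in the system summary, with clipped counts; the real ROUGE toolkit adds stemming, stopword handling, higher-order n-grams, and precision/F variants. The example sentences are invented.

```python
from collections import Counter

def rouge1_recall(system, reference):
    # Count each reference unigram at most as often as it occurs in the
    # system summary ("clipped" counts), then normalize by reference length.
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "the trial enrolled two hundred patients"
system = "two hundred patients joined the trial"
print(rouge1_recall(system, reference))  # 5 of 6 reference words covered
```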

128 citations


Network Information
Related Topics (5)
Natural language
31.1K papers, 806.8K citations
85% related
Ontology (information science)
57K papers, 869.1K citations
84% related
Web page
50.3K papers, 975.1K citations
83% related
Recurrent neural network
29.2K papers, 890K citations
83% related
Graph (abstract data type)
69.9K papers, 1.2M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    74
2022    160
2021    52
2020    61
2019    47
2018    52