scispace - formally typeset
Topic

Multi-document summarization

About: Multi-document summarization is a research topic concerned with automatically producing a single summary from a group of related documents. Over the lifetime of the topic, 2,270 publications have been published, receiving 71,850 citations.


Papers
Posted Content
TL;DR: This paper proposes a topic-centric unsupervised multi-document summarization framework to generate extractive and abstractive summaries for groups of scientific articles across 20 Fields of Study in Microsoft Academic Graph (MAG) and news articles from DUC-2004 Task 2.
Abstract: Recent advances in natural language processing have enabled automation of a wide range of tasks, including machine translation, named entity recognition, and sentiment analysis. Automated summarization of documents, or groups of documents, however, has remained elusive, with many efforts limited to extraction of keywords, key phrases, or key sentences. Accurate abstractive summarization has yet to be achieved due to the inherent difficulty of the problem, and limited availability of training data. In this paper, we propose a topic-centric unsupervised multi-document summarization framework to generate extractive and abstractive summaries for groups of scientific articles across 20 Fields of Study (FoS) in Microsoft Academic Graph (MAG) and news articles from DUC-2004 Task 2. The proposed algorithm generates an abstractive summary by developing salient language unit selection and text generation techniques. Our approach matches the state-of-the-art when evaluated on automated extractive evaluation metrics and performs better for abstractive summarization on five human evaluation metrics (entailment, coherence, conciseness, readability, and grammar). We achieve a kappa score of 0.68 between two co-author linguists who evaluated our results. We plan to publicly share MAG-20, a human-validated gold standard dataset of topic-clustered research articles and their summaries to promote research in abstractive summarization.
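The salient-language-unit selection step described above can be approximated with a simple frequency-based extractive baseline. This is only a sketch: the paper's actual algorithm is topic-centric and unsupervised, and the function below is a naive stand-in.

```python
import re
from collections import Counter

def extractive_summary(documents, num_sentences=3):
    """Score sentences by the collection-wide frequency of their words,
    then keep the top scorers. A naive stand-in for the paper's
    salient-language-unit selection."""
    sentences = []
    for doc in documents:
        sentences.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip())
    counts = Counter(w for s in sentences for w in re.findall(r"[a-z]+", s.lower()))
    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(counts[t] for t in tokens) / (len(tokens) or 1)
    ranked = sorted(sentences, key=score, reverse=True)
    chosen = set(ranked[:num_sentences])
    # Emit selected sentences in their original order for readability
    return [s for s in sentences if s in chosen]
```

A real system would add topic clustering and a text-generation stage for the abstractive summary; this only illustrates the extractive selection idea.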

3 citations

Proceedings ArticleDOI
01 Jan 2022
TL;DR: The authors propose a simple approach to reorder the documents according to their relative importance before concatenating and summarizing them, which makes the salient content easier to learn by the summarization model.
Abstract: A common method for extractive multi-document news summarization is to re-formulate it as a single-document summarization problem by concatenating all documents as a single meta-document. However, this method neglects the relative importance of documents. We propose a simple approach to reorder the documents according to their relative importance before concatenating and summarizing them. The reordering makes the salient content easier to learn by the summarization model. Experiments show that our approach outperforms previous state-of-the-art methods with more complex architectures.
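The reordering idea can be sketched as follows. The importance heuristic here (vocabulary overlap with the rest of the collection) is an assumption for illustration; the paper's actual importance estimate may differ.

```python
from collections import Counter

def overlap_importance(documents):
    """Toy importance score: a document sharing more vocabulary with the
    rest of the collection is treated as more important (assumed heuristic)."""
    vocab = [Counter(d.lower().split()) for d in documents]
    total = Counter()
    for c in vocab:
        total.update(c)
    scores = []
    for c in vocab:
        rest = total - c  # word counts from all the other documents
        scores.append(sum(min(c[w], rest[w]) for w in c))
    return scores

def reorder_and_concatenate(documents):
    """Reorder documents by importance (highest first) before concatenating
    them into a single meta-document, so salient content appears early."""
    scores = overlap_importance(documents)
    ranked = sorted(zip(scores, range(len(documents))), key=lambda p: p[0], reverse=True)
    return "\n\n".join(documents[i] for _, i in ranked)
```

The concatenated meta-document can then be fed to any single-document summarizer, as the abstract describes.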

3 citations

Proceedings ArticleDOI
23 Aug 2017
TL;DR: This paper presents a graph-based algorithm that is capable of producing extractive summaries that are both diversified from a sentiment point of view and topically well-covered and shows improvements in ROUGE metrics.
Abstract: With the abundance of reviews published on the Web about a given product, consumers are looking for ways to view major opinions in a quick and succinct way. Reviews contain many different opinions, making a diversified review summary that covers both breadth and diversity of opinion a major goal. Most review summarization work focuses on showing salient reviews as a summary, which can ignore diversity. In this paper, we present a graph-based algorithm that is capable of producing extractive summaries that are both diversified from a sentiment point of view and topically well-covered. First, we use statistical measures to find topical words. Then we split the dataset based on the sentiment class of the reviews and perform the ranking on each sentiment graph. When compared with different baselines, our approach scores best in most ROUGE metrics. Specifically, our approach shows improvements of 3.9% in ROUGE-1 and 1.8% in ROUGE-L in comparison with the best competing baseline.
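The split-then-rank step can be sketched as below. Degree centrality over a Jaccard-similarity graph stands in for the paper's graph ranking, which is an assumption; the sentiment labels are taken as given.

```python
from collections import defaultdict

def similarity(a, b):
    """Jaccard word overlap between two reviews."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / (len(wa | wb) or 1)

def diversified_summary(reviews, k_per_class=1):
    """Split reviews by sentiment class, rank each class's reviews by
    centrality in its own similarity graph, and take the top of each class,
    so the summary covers both positive and negative opinions.
    (Degree centrality is a stand-in for the paper's graph ranking.)"""
    by_class = defaultdict(list)
    for text, sentiment in reviews:
        by_class[sentiment].append(text)
    summary = []
    for sentiment, texts in by_class.items():
        centrality = [sum(similarity(t, u) for u in texts if u is not t) for t in texts]
        ranked = sorted(zip(centrality, texts), key=lambda p: p[0], reverse=True)
        summary.extend(t for _, t in ranked[:k_per_class])
    return summary
```

Taking the top reviews from each sentiment graph, rather than one global ranking, is what enforces the sentiment diversity the abstract describes.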

3 citations

Dissertation
01 Jan 2014
TL;DR: This work extends a few existing unsupervised topic models such as Latent Dirichlet Allocation (LDA) to model documents which are annotated from two different perspectives and shows that using topic models it is possible to outperform keyword summaries generated by annotating videos through state-of-the-art object recognition techniques from computer vision.
Abstract: Probabilistic topic models have recently become the cornerstone of unsupervised exploratory analysis of text documents using Bayesian statistics. The strength of the models lies in their modularity—random variables can be introduced or modified to suit the requirements of different applications. Many of these models, however, consider only one particular view of the observations, such as treating documents as a flat collection of words, ignoring the nuances of the different classes of annotations which may be present in an implicit and/or explicit form. We extend a few existing unsupervised topic models such as Latent Dirichlet Allocation (LDA) to model documents which are annotated from two different perspectives. The perspectives consist of both word-level tag annotation (e.g., part-of-speech, affect, position) and document-level highlighting (e.g., crowd-sourced document labels, captions of embedded multimedia). The new models are dubbed the Tag2LDA class of models, whose primary goal is to combine the best aspects of supervised and unsupervised learning under one framework. Additionally, the correspondence class of Tag2LDA models explored in this context is state-of-the-art among the family of parametric tag-topic models in terms of predictive log likelihoods. These models are presented in Chapter 4. The field of automatic summary generation is increasingly gaining traction, and there is a steady rise in demand for summarization algorithms that are applicable to a wide variety of genres of text and other kinds of data as well (e.g., video). Producing short summaries in a human-readable form is very attractive, particularly for very large datasets. However, the problem is NP-hard even for smaller domains such as summarizing small sets of newswire documents. We use the Tag2LDA class of models in conjunction with local models (e.g., extracting syntactic and semantic roles of words, Rhetorical Structure trees, etc.) to do multi-document summarization of text documents based on information needs that are guided by a common information model. The guided summarization task, as laid out in recent text summarization competitions, aims to cover information needs by asking questions like "who did what, when, and where?" We have also successfully applied multi-modal topic models to summarize domain-specific videos into natural language text directly from low-level features extracted from the videos. The experiments performed for this task are described in detail in Chapter 5. Finally, in Chapter 6, we show that using topic models it is possible to outperform keyword summaries generated by annotating videos through state-of-the-art object recognition techniques from computer vision. Summarizing a video in terms of natural language generated from such keywords in context removes the laborious frame-by-frame drawing of bounding boxes around objects of interest—a scheme which is required for annotating videos to train a large number of object detectors. The topic models that we develop for this purpose instead use easily available short lingual descriptions of entire videos to predict text for a given domain-specific test video. The models are also novel in handling both text and video features, particularly with regard to multimedia topic discovery from captioned videos whose features can belong to both discrete and real-valued domains.
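The base model the thesis extends is vanilla LDA, which can be inferred with a minimal collapsed Gibbs sampler like the one below. This is a sketch of plain LDA only; Tag2LDA adds tag and label variables on top of this structure.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for vanilla LDA.
    docs: list of token lists. Returns per-document topic counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})            # vocabulary size
    z = [[rng.randrange(num_topics) for _ in d] for d in docs]  # token topics
    ndk = [[0] * num_topics for _ in docs]           # doc-topic counts
    nkw = [defaultdict(int) for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                            # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove this token's current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional p(z_i = t | rest)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk
```

The modularity the abstract mentions shows up here concretely: adding a tag or label view means adding further count tables and factors to the full conditional.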

3 citations

Journal ArticleDOI
TL;DR: This paper presents how the emotional dimensions issued from real viewers can be used as an important input for computing which part is the most interesting in the total time of a film.
Abstract: Many viewers would rather watch a summary of a film than spend time on the whole film. Traditionally, a video film was analyzed manually to produce a summary, but this requires a significant amount of work. It has therefore become important to provide a tool for automatic video summarization, which aims to extract all of the important moments in which viewers might be interested. The summarization criteria can differ from one video to another. This paper presents how emotional dimensions obtained from real viewers can be used as an important input for computing which part of a film is the most interesting. Our results, based on lab experiments, are significant and promising.
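Given per-second emotion intensities collected from viewers, selecting the "most interesting" part can be sketched as a sliding-window maximum. The fixed window length and the simple sum are assumptions for illustration; the paper derives its criteria from real-viewer emotional dimensions.

```python
def most_interesting_segment(emotion_scores, window=5):
    """Return the start index of the window with the highest total
    emotional response, given one intensity score per time unit."""
    best_start = 0
    best_sum = sum(emotion_scores[:window])
    current = best_sum
    for start in range(1, len(emotion_scores) - window + 1):
        # Slide the window: add the entering score, drop the leaving one
        current += emotion_scores[start + window - 1] - emotion_scores[start - 1]
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start
```

A full summarizer would select several non-overlapping segments; this shows only the core scoring step.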

3 citations


Network Information
Related Topics (5)
Natural language: 31.1K papers, 806.8K citations (85% related)
Ontology (information science): 57K papers, 869.1K citations (84% related)
Web page: 50.3K papers, 975.1K citations (83% related)
Recurrent neural network: 29.2K papers, 890K citations (83% related)
Graph (abstract data type): 69.9K papers, 1.2M citations (83% related)
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2023    74
2022    160
2021    52
2020    61
2019    47
2018    52