
Showing papers on "Multi-document summarization published in 2023"


Proceedings ArticleDOI
10 Feb 2023
TL;DR: In this paper, PDSum, a prototype-driven continuous summarization algorithm, is proposed for summarizing evolving multi-document set streams; it builds a lightweight prototype of each document set and exploits it to adapt to new documents while preserving knowledge accumulated from previous documents.
Abstract: Summarizing text-rich documents has been long studied in the literature, but most of the existing efforts have been made to summarize a static and predefined multi-document set. With the rapid development of online platforms for generating and distributing text-rich documents, there arises an urgent need for continuously summarizing dynamically evolving multi-document sets where the composition of documents and sets is changing over time. This is especially challenging as the summarization should be not only effective in incorporating relevant, novel, and distinctive information from each concurrent multi-document set, but also efficient in serving online applications. In this work, we propose a new summarization problem, Evolving Multi-Document sets stream Summarization (EMDS), and introduce a novel unsupervised algorithm PDSum with the idea of prototype-driven continuous summarization. PDSum builds a lightweight prototype of each multi-document set and exploits it to adapt to new documents while preserving accumulated knowledge from previous documents. To update new summaries, the most representative sentences for each multi-document set are extracted by measuring their similarities to the prototypes. A thorough evaluation with real multi-document sets streams demonstrates that PDSum outperforms state-of-the-art unsupervised multi-document summarization algorithms in EMDS in terms of relevance, novelty, and distinctiveness and is also robust to various evaluation settings.
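The prototype idea can be illustrated with a small sketch. This is not the authors' implementation (PDSum operates over learned representations); it is a minimal illustration assuming bag-of-words vectors as a stand-in for embeddings, with hypothetical names such as `PrototypeSummarizer`:

```python
import math
from collections import Counter

def bow(text):
    # bag-of-words vector; a stand-in for the learned sentence
    # embeddings a real system would use
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

class PrototypeSummarizer:
    """Keeps one lightweight prototype per document set; new documents
    update the prototype instead of reprocessing the full history."""

    def __init__(self):
        self.prototype = Counter()

    def add_document(self, sentences):
        for s in sentences:
            self.prototype.update(bow(s))  # accumulate knowledge

    def summarize(self, sentences, k=1):
        # extract the k sentences most similar to the prototype
        ranked = sorted(sentences,
                        key=lambda s: cosine(bow(s), self.prototype),
                        reverse=True)
        return ranked[:k]
```

The key property mirrored here is that updating a summary requires only the prototype and the new sentences, which is what makes continuous summarization of a stream affordable.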

2 citations



Journal ArticleDOI
TL;DR: This paper presents an effective way to summarize a single Hindi text document at a time using the TextRank algorithm, based on natural language processing (NLP).
Abstract: The availability of information accessible in digital form has accelerated. Retrieving useful documents from such a large pool of information is difficult, so summarizing these text documents is crucial. Text summarization is the process of condensing an original source document to capture its essential information. It eliminates redundant, less important content and provides the vital information in a shorter version, usually about half the length of the original text. Creating a manual summary is a very time-consuming task; automatic summarization helps in getting the gist of the information present in a particular document in a very short time. Compared with other Indian regional languages, relatively little work has been done on summarization of Hindi documents. This paper presents an effective way to summarize using the TextRank algorithm. It focuses on summarizing a single Hindi text document at a time based on natural language processing (NLP).
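The TextRank approach the paper applies to Hindi can be sketched language-independently: sentences become graph vertices, edge weights come from word overlap, and a PageRank-style iteration scores the vertices. A minimal sketch (using the standard Mihalcea-and-Tarau overlap similarity; function names are illustrative):

```python
import math

def overlap(s1, s2):
    # TextRank-style similarity: shared words, normalized by
    # sentence lengths to avoid favoring long sentences
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    shared = len(w1 & w2)
    if shared == 0 or len(w1) < 2 or len(w2) < 2:
        return 0.0
    return shared / (math.log(len(w1)) + math.log(len(w2)))

def textrank(sentences, d=0.85, iters=50):
    n = len(sentences)
    sim = [[0.0 if i == j else overlap(a, b)
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    scores = [1.0] * n
    for _ in range(iters):  # power iteration, as in PageRank
        scores = [(1 - d) + d * sum(sim[j][i] / sum(sim[j]) * scores[j]
                                    for j in range(n) if sim[j][i] > 0)
                  for i in range(n)]
    return scores
```

The highest-scoring sentences form the extract. Nothing here is language-specific, which is why the same pipeline applies to Hindi once tokenization is handled.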

1 citation


Journal ArticleDOI
TL;DR: In this article, the authors present a classification and analysis of video summarization approaches, with a focus on real-time video summarization (RVS) techniques that can be used to summarize videos.
Abstract: With the massive expansion of videos on the internet, searching through millions of them has become quite challenging. Smartphones, recording devices, and file sharing are all examples of ways to capture massive amounts of real time video. In smart cities, there are many surveillance cameras, which has created a massive volume of video data whose indexing, retrieval, and administration is a difficult problem. Exploring such results takes time and degrades the user experience. In this case, video summarization is extremely useful. Video summarization allows for the efficient storing, retrieval, and browsing of huge amounts of information from video without sacrificing key features. This article presents a classification and analysis of video summarization approaches, with a focus on real-time video summarization (RVS) domain techniques that can be used to summarize videos. The current study will be useful in integrating essential research findings and data for quick reference, laying the preliminaries, and investigating prospective research directions. A variety of practical uses, including aberrant detection in a video surveillance system, have made successful use of video summarization in smart cities.

1 citation



Proceedings ArticleDOI
25 Jan 2023
TL;DR: In this paper, the authors investigate the performance of two unsupervised methods, Latent Semantic Analysis (LSA) and Maximal Marginal Relevance (MMR), in summarizing transcriptions of Persian broadcast news.
Abstract: The methods of automatic speech summarization are classified into two groups: supervised and unsupervised methods. Supervised methods are based on a set of features, while unsupervised methods perform summarization based on a set of rules. Latent Semantic Analysis (LSA) and Maximal Marginal Relevance (MMR) are considered the most important and well-known unsupervised methods in automatic speech summarization. This study set out to investigate the performance of two aforementioned unsupervised methods in transcriptions of Persian broadcast news summarization. The results show that in generic summarization, LSA outperforms MMR, and in query-based summarization, MMR outperforms LSA in broadcast news summarization.
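MMR, one of the two methods compared above, greedily balances relevance to a query against redundancy with already-selected sentences. A minimal sketch of the standard MMR selection rule (the Jaccard similarity here is a toy stand-in for whatever sentence similarity a real system uses):

```python
def jaccard(a, b):
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0

def mmr(candidates, query, sim, lam=0.7, k=2):
    """Greedy Maximal Marginal Relevance: each pick maximizes
    lam * relevance-to-query minus (1 - lam) * redundancy with
    the sentences already selected."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda s: lam * sim(s, query)
                   - (1 - lam) * max((sim(s, t) for t in selected),
                                     default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```

In generic summarization the query can be replaced by a centroid of the whole document, which is one way the same rule serves both the generic and query-based settings the paper evaluates.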

1 citation


Book ChapterDOI
01 Jan 2023
TL;DR: In this paper, the authors propose a Fuzzy Bi-GRU model for automatic text summarization (ATS) that scores sentences with a fuzzy inference system, removes redundancy with a Bi-GRU, and then generates an abstractive summary.
Abstract: As a massive amount of information is produced on the internet nowadays, extracting the most useful and relevant information from that mass is an attractive research problem, and it is possible through a mechanism called automatic text summarization (ATS). Summarization is classified into single-document and multi-document based on the number of source documents. When multiple source documents communicate similar information, they are called multi-documents, and summarizing them is the biggest challenge in the field of ATS. This motivates us to work on long multi-documents by calculating sentence scores using a fuzzy inference system. From the extracted sentences, similarity or redundancy is removed using a Bi-GRU, and then an abstractive summary is generated from the identified sentences. The proposed system is validated and tested using standard datasets, namely DUC, BBC News, and CNN/Daily Mail. The proposed Fuzzy Bi-GRU is compared with other cutting-edge models, and empirical results indicate that it outperforms all of them in terms of ROUGE-N and ROUGE-L scores.

1 citation


Journal ArticleDOI
TL;DR: This paper proposes an extractive summarization model based on a graph neural network (GNN) that learns cross-sentence relationships using a graph-structured document representation.
Abstract: Extractive text summarization selects the most important sentences from a document, preserves their original meaning, and produces an objective and fact-based summary. It is faster and less computationally intensive than abstractive summarization techniques. Learning cross-sentence relationships is crucial for extractive text summarization. However, most of the language models currently in use process text data sequentially, which makes it difficult to capture such inter-sentence relations, especially in long documents. This paper proposes an extractive summarization model based on the graph neural network (GNN) to address this problem. The model effectively represents cross-sentence relationships using a graph-structured document representation. In addition to sentence nodes, we introduce two node types of different granularity in the graph structure, words and topics, which bring different levels of semantic information. The node representations are updated by a graph attention network (GAT). The final summary is obtained by binary classification of the sentence nodes. Our text summarization method was demonstrated to be highly effective, as supported by the results of our experiments on the CNN/DM and NYT datasets. Specifically, our approach outperformed baseline models of the same type in terms of ROUGE scores on both datasets, indicating the potential of our proposed model for enhancing text summarization tasks.

Book ChapterDOI
TL;DR: This paper presents SmartEDU, a platform for drafting slides for a textual document, and the research that led to its development; evaluations found a Distillbart model preferable to unsupervised summarization.
Abstract: Slide decks are a common medium for presenting a topic. To reduce the time required for their preparation, we present SmartEDU, a platform for drafting slides for a textual document, and the research that led to its development. Drafts are Powerpoint files generated in three steps: pre-processing, for acquiring or discovering section titles; summarization, for compressing the contents of each section; and slide composition, for organizing the summaries into slides. The resulting file may be further edited by the user. Several summarization methods were tested on public datasets of presentations and on Wikipedia articles. Based on automatic evaluation measures and collected human opinions, we conclude that a Distillbart model is preferred to unsupervised summarization, especially when it comes to overall draft quality.

Journal ArticleDOI
TL;DR: This article presents a method for summarizing multi-document news web pages based on similarity models and sentence ranking, in which relevant sentences are extracted from English-language articles collected from five news websites covering the same topic and event.
Abstract: In the area of text summarization, there have been significant advances recently. In the meantime, the current trend in text summarization is focused more on news summarization. Therefore, developing a synthesis approach capable of extracting, comparing, and ranking sentences is vital to create a summary of various news articles in the context of erroneous online data. It is necessary, however, for the news summarization system to be able to deal with multi-document summaries due to content redundancy. This paper presents a method for summarizing multi-document news web pages based on similarity models and sentence ranking, where relevant sentences are extracted from the original article. English-language articles are collected from five news websites that cover the same topic and event. According to our experimental results, our approach provides better results than other recent methods for summarizing news.

Posted ContentDOI
07 Feb 2023
TL;DR: FINDSum is the first large-scale dataset for long text and multi-table summarization; built on 21,125 annual reports from 3,794 companies, it has two subsets for summarizing each company's results of operations and liquidity.
Abstract: Automatic document summarization aims to produce a concise summary covering the input document's salient information. Within a report document, the salient information can be scattered in the textual and non-textual content. However, existing document summarization datasets and methods usually focus on the text and filter out the non-textual content. Missing tabular data can limit produced summaries' informativeness, especially when summaries require covering quantitative descriptions of critical metrics in tables. Existing datasets and methods cannot meet the requirements of summarizing long text and multiple tables in each report. To deal with the scarcity of available data, we propose FINDSum, the first large-scale dataset for long text and multi-table summarization. Built on 21,125 annual reports from 3,794 companies, it has two subsets for summarizing each company's results of operations and liquidity. To summarize the long text and dozens of tables in each report, we present three types of summarization methods. Besides, we propose a set of evaluation metrics to assess the usage of numerical information in produced summaries. Dataset analyses and experimental results indicate the importance of jointly considering input textual and tabular data when summarizing report documents.


Journal ArticleDOI
09 May 2023 - PLOS ONE
TL;DR: This paper proposes a graph-based extractive single-document summarization method for Hausa text that modifies the existing PageRank algorithm by using the normalized common bigram count between adjacent sentences as the initial vertex score.
Abstract: Automatic text summarization is one of the most promising solutions to the ever-growing challenges of textual data as it produces a shorter version of the original document with fewer bytes, but the same information as the original document. Despite the advancements in automatic text summarization research, research involving the development of automatic text summarization methods for documents written in Hausa, a Chadic language widely spoken in West Africa by approximately 150,000,000 people as either their first or second language, is still in the early stages of development. This study proposes a novel graph-based extractive single-document summarization method for Hausa text by modifying the existing PageRank algorithm using the normalized common bigram count between adjacent sentences as the initial vertex score. The proposed method is evaluated using a primarily collected Hausa summarization evaluation dataset comprising 113 Hausa news articles, using the ROUGE evaluation toolkit. The proposed approach outperformed the standard methods on the same dataset: it outperformed the TextRank method by 2.1%, LexRank by 12.3%, the centroid-based method by 19.5%, and the BM25 method by 17.4%.
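The paper's modification can be sketched concretely: instead of seeding every vertex with the uniform 1/N score, each sentence starts with its normalized count of bigrams shared with adjacent sentences, and a PageRank-style iteration then refines the scores. This is a sketch reconstructed from the abstract's description; the paper's exact normalization and edge weighting may differ:

```python
def bigrams(sentence):
    toks = sentence.lower().split()
    return set(zip(toks, toks[1:]))

def initial_scores(sentences):
    # initial vertex score: bigrams shared with adjacent sentences,
    # normalized by the sentence's own bigram count
    out = []
    for i, s in enumerate(sentences):
        bg = bigrams(s)
        shared = sum(len(bg & bigrams(sentences[j]))
                     for j in (i - 1, i + 1) if 0 <= j < len(sentences))
        out.append(shared / max(len(bg), 1))
    return out

def ranked(sentences, d=0.85, iters=30):
    n = len(sentences)
    w = [[0 if i == j else len(bigrams(a) & bigrams(b))
          for j, b in enumerate(sentences)]
         for i, a in enumerate(sentences)]
    scores = initial_scores(sentences)  # replaces the uniform 1/N seed
    for _ in range(iters):
        scores = [(1 - d) + d * sum(w[j][i] / max(sum(w[j]), 1) * scores[j]
                                    for j in range(n)) for i in range(n)]
    return scores
```

Because bigram overlap only requires tokenization, the approach carries over to low-resource languages like Hausa without any trained components.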

Book ChapterDOI
01 Jan 2023
TL;DR: In this paper, the authors investigate extractive and abstractive methods for text summarization, in which the importance of sentences is calculated using linguistic and statistical characteristics, and survey many efforts in automatic summarization, particularly recent ones.
Abstract: The volume of data on the Internet has increased at an exponential rate during the previous decade. Consequently, the need emerges for a method of converting this massive amount of raw data into meaningful information that a human brain can comprehend. Text summarization is a common research technique that aids in dealing with a massive quantity of data. Automatic summarization is a well-known approach for distilling the important ideas in a document; it works by creating a shortened form of the text while preserving important information. Techniques for text summarization are classified as extractive or abstractive. Extractive summarization methods reduce the burden of summarization by choosing a few relevant sentences from the original text, where the importance of sentences is calculated using linguistic and statistical characteristics. This paper investigates extractive and abstractive methods for text summarization. We also explore many efforts in automatic summarization, particularly recent ones.

Journal ArticleDOI
TL;DR: Automatic summarization is the act of computationally condensing a set of data to produce a subset (a summary) that captures the key ideas or information within the original text.
Abstract: Automatic summarization is the act of computationally condensing a set of data to produce a subset (a summary) that captures the key ideas or information within the original text. To do this, artificial intelligence algorithms tailored to diverse sorts of data are frequently created and used. Ten research articles from databases such as IEEE, Scopus, and Springer Nature are considered, and the paradigm shift that AI has created in the field of automatic text summarization is discussed in detail.

Proceedings ArticleDOI
01 May 2023
TL;DR: In this article, the authors leverage the power of two popular natural language processing techniques, Bidirectional Encoder Representations from Transformers (BERT) and Gated Recurrent Unit (GRU), for multi-document summarization.
Abstract: Multi-document summarization has been a challenging task due to the difficulties in capturing essential information from multiple sources and generating coherent and non-redundant summaries. In this proposed model, we address these challenges by leveraging the power of two popular natural language processing techniques, Bidirectional Encoder Representations from Transformers (BERT) and Gated Recurrent Unit (GRU). The Document Understanding Conference (DUC) dataset, a widely recognized benchmark dataset for multi-document summarization, was used to train and evaluate the model. By using BERT to generate contextual embeddings and GRU to capture sequence information, the proposed method outperforms previous methods in terms of summarization quality metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation). The proposed model has significant potential for use in various applications, such as news summarization, document summarization, and automated content creation. This study demonstrates that combining BERT and GRU models can effectively capture the contextual and sequential information in multi-document summarization, leading to high-quality summaries that overcome the limitations of previous methods.

Proceedings ArticleDOI
03 Mar 2023
TL;DR: In this article, the authors explore the possible approaches and techniques available for generating a user-adaptive video summary and present a comparative analysis of the techniques to provide insight to researchers working in this area.
Abstract: Video summarization shortens video content by extracting its most significant parts and presenting them in a summarized form, which may be a collection of keyframes or key shots in temporal sequence. In the recent past, various techniques have been suggested for automatic summarization of videos. It has been observed that video summarization is a subjective task, and traditional approaches, though capable of generating generic summaries, are often incapable of generating the most appropriate and customized summary desired by the user. A user-intuitive and adaptive approach makes it possible to summarize the video as per the user's preference. In this paper, we discuss various frameworks for generating a user-preference-based summary from a video. We explore the possible approaches and techniques available for generating a user-adaptive video summary and present a comparative analysis of the techniques to provide insight to researchers working in this area.

Proceedings ArticleDOI
30 Apr 2023
TL;DR: This paper proposes HISum, a new hyperbolic interaction model for extractive multi-document summarization, which first learns document and candidate summary representations in the same hyperbolic space to capture latent hierarchical structures, and then estimates the importance scores of candidates by jointly modeling interactions between each candidate and the document from global and local views.
Abstract: Extractive summarization helps provide a short description or a digest of news or other web texts. It enhances the reading experience of users, especially when they are reading on small displays (e.g., mobile phones). Matching-based methods have recently been proposed for the extractive summarization task; they extract a summary from a global view via a document-summary matching framework. However, these methods only calculate similarities between candidate summaries and the entire document embeddings, insufficiently capturing interactions between different contextual information in the document to accurately estimate the importance of candidates. In this paper, we propose a new hyperbolic interaction model for extractive multi-document summarization (HISum). Specifically, HISum first learns document and candidate summary representations in the same hyperbolic space to capture latent hierarchical structures and then estimates the importance scores of candidates by jointly modeling interactions between each candidate and the document from global and local views. Finally, the importance scores are used to rank and extract the best candidate as the extracted summary. Experimental results on several benchmarks show that HISum outperforms the state-of-the-art extractive baselines.

Proceedings ArticleDOI
30 Apr 2023
TL;DR: In this paper, the authors propose a multi-step process for aspect-based summarization of legal case files related to regulating bodies, which allows different stakeholders to efficiently consume the information of interest therein.
Abstract: Aspect-based summarization of a legal case file related to regulating bodies allows different stakeholders to consume information of interest therein efficiently. In this paper, we propose a multi-step process to achieve the same. First, we explore the semantic sentence segmentation of SEBI case files via classification. We also propose a dataset of Indian legal adjudicating orders which contain tags from carefully crafted domain-specific sentence categories with the help of legal experts. We experiment with various machine learning and deep learning methods for this multi-class classification. Then, we examine the performance of numerous summarization methods on the segmented document to generate persona-specific summaries. Finally, we develop a pipeline making use of the best methods in both sub-tasks to achieve high recall.

Book ChapterDOI
01 Jan 2023
TL;DR: In this article, a summarization and paraphrasing technique using the LexRank algorithm and the PEGASUS transformer for abstractive summarization of legal documents is proposed and compared on six different documents.
Abstract: Legal documents are generally verbose and contain lots of dense legal text. Lawyers often must study prior cases, and reading such documents can be time-consuming. Such documents may also be incomprehensible to the ordinary public, who lack legal understanding. Enormous amounts of legal data online have made access to case files and documents simple. Hence, automatic summarization and paraphrasing of such documents using machine learning (ML) has become an important area of research, making the documents comprehensible for lawyers and the ordinary public. In this paper, we review previous approaches to automatic summarization and compare their output on six different documents. We also provide a summarization and paraphrasing technique using the LexRank algorithm and the PEGASUS transformer for abstractive summarization. The summary produced by the LexRank model outperforms the other models tested by obtaining a higher ROUGE-F1 score. The final summary retains the important information from the source document and is paraphrased into simpler language.

Journal ArticleDOI
TL;DR: In this article, an extractive video summarizer is presented that utilizes state-of-the-art pre-trained ML models and open-source libraries at its core.
Abstract: Video summarization aims to produce a high-quality text-based summary of videos so that it can convey all the important information, or the gist, of the videos to users. The process involves the conversion of video files to audio files, which are then converted to text files, accompanied by the use of the transformer architecture from natural language processing. Although many studies have been carried out on text summarization, we present our model, an extractive video summarizer, which utilizes state-of-the-art pre-trained ML models and open-source libraries at its core. It uses the following regime: (I) preparation of a multidisciplinary dataset of videos, (II) extraction of audio from video files, (III) text generation from audio files, (IV) text summarization using extractive summarizers, and (V) entity extraction. We conducted our research primarily on two widely used languages in India, Hindi and English. To conclude, our model performs significantly well and generates appropriate tags for videos.

Journal ArticleDOI
TL;DR: In this paper, text recognized from scanned documents via Optical Character Recognition (OCR) is summarized extractively using the TextRank algorithm, an unsupervised summarization technique.
Abstract: With the explosion of unstructured textual data circulating the digital space in present times, there is an increased need for tools that perform automatic text summarization, allowing people to easily gain insights and extract significant and essential information. Text summarization tools improve the readability of documents and reduce the time spent researching for information. In this project, extractive summarization is performed on text recognized from scanned documents via Optical Character Recognition (OCR), using the TextRank algorithm, an unsupervised technique for extractive text summarization.

Journal ArticleDOI
TL;DR: This paper presents a centroid-based clustering algorithm for email summarization that combines word embeddings with a clustering approach; results show the approach is close to existing methods in summary quality while also being computationally efficient.
Abstract: Extractive text summarization is the process of identifying the most important information from a large text and presenting it in a condensed form. One popular approach to this problem is the use of centroid-based clustering algorithms, which group together similar sentences based on their content and then select representative sentences from each cluster to form a summary. In this research, we present a centroid-based clustering algorithm for email summarization that combines the use of word embeddings with a clustering algorithm. We compare our algorithm to existing summarization techniques. Our results show that our approach stands close to existing methods in terms of summary quality, while also being computationally efficient. Overall, our work demonstrates the potential of centroid-based clustering algorithms for extractive text summarization and suggests avenues for further research in this area.
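The centroid-based pipeline described above can be sketched end to end: embed sentences, cluster them, then pick the sentence nearest each cluster centroid. This sketch substitutes simple count vectors for the paper's word embeddings and uses a deterministic k-means seed; all names are illustrative:

```python
import math

def vectorize(sentences):
    # count vectors over a shared vocabulary; a stand-in for the
    # word embeddings the paper combines with clustering
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    idx = {w: i for i, w in enumerate(vocab)}
    vecs = []
    for s in sentences:
        v = [0.0] * len(vocab)
        for w in s.lower().split():
            v[idx[w]] += 1.0
        vecs.append(v)
    return vecs

def kmeans(vecs, k, iters=20):
    # deterministic farthest-point initialization, then Lloyd updates
    centroids = [vecs[0]]
    while len(centroids) < k:
        centroids.append(max(vecs, key=lambda v: min(math.dist(v, c)
                                                     for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vecs:
            nearest = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            clusters[nearest].append(v)
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] if cl
                     else centroids[c] for c, cl in enumerate(clusters)]
    return centroids

def summarize(sentences, k=2):
    # one representative sentence per cluster: the one nearest its centroid
    vecs = vectorize(sentences)
    picks = {min(range(len(vecs)), key=lambda i: math.dist(vecs[i], c))
             for c in kmeans(vecs, k)}
    return [sentences[i] for i in sorted(picks)]
```

The computational-efficiency claim follows from the shape of the algorithm: clustering plus nearest-centroid lookup avoids the pairwise sentence graph that ranking-based extractors build.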

Journal ArticleDOI
TL;DR: The authors compare and assess the effectiveness of two optimizers widely employed in text summarization, Adam and RMSprop, on a variety of datasets.

Book ChapterDOI
TL;DR: This paper proposes UGraphNet, a topic-aware graph-based neural interest summarization method that enhances user semantic mining by unearthing potential user relations and jointly learning latent topic representations of posts to facilitate user interest learning.
Abstract: User-generated content is produced daily in social media, so user interest summarization is critical to distill salient information from massive streams. Since the interested messages (e.g., tags or posts) from a single user are usually sparse, which becomes a bottleneck for existing methods, we propose UGraphNet, a topic-aware graph-based neural interest summarization method that enhances user semantic mining by unearthing potential user relations and jointly learning latent topic representations of posts, which facilitates user interest learning. Experiments on two datasets collected from well-known social media platforms demonstrate the superior performance of our model on the tasks of user interest summarization and item recommendation. Further discussion also shows that exploiting latent topic representations and user relations is conducive to automatic user language understanding.

Journal ArticleDOI
TL;DR: The authors propose a set of new requirements for Opinion-Topic-Sentence, which are essential for performing opinion summarization, along with four submodular functions and two optimization algorithms with proven performance bounds.
Abstract: This paper focuses on opinion summarization for constructing subjective and concise summaries representing essential opinions of online text reviews. As previous works rarely focus on the relationship between opinions, topics, and sentences, we propose a set of new requirements for Opinion-Topic-Sentence, which are essential for performing opinion summarization. We prove that Opinion-Topic-Sentence can be theoretically analyzed by submodular information measures. Thus, our proposed method can reduce redundant information, strengthen the relevance to given topics, and informatively represent the underlying emotional variations. While conventional methods require human-labeled topics for extractive summarization, we use unsupervised topic modeling methods to generate topic features. We propose four submodular functions and two optimization algorithms with proven performance bounds that can maximize opinion summarization's utility. An automatic evaluation metric, Topic-based Opinion Variance, is also derived to compensate for ROUGE-based metrics of opinion summarization evaluation. Four large, diversified, and representative corpora, OPOSUM, Opinosis, Yelp, and Amazon reviews, are used in our study. The results on these online review texts corroborate the efficacy of our proposed metric and framework.
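The optimization machinery behind such methods can be sketched generically: repeatedly pick the sentence with the largest marginal gain of a monotone submodular objective. The toy coverage objective below is not one of the paper's four functions, only an illustration of the pattern; for monotone submodular objectives this greedy rule carries the classic (1 - 1/e) approximation guarantee:

```python
def coverage(selected):
    # toy monotone submodular objective: distinct words covered
    return len({w for s in selected for w in s.lower().split()})

def greedy_submodular(sentences, f, k):
    """Pick k sentences, each maximizing the marginal gain of f."""
    selected, pool = [], list(sentences)
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda s: f(selected + [s]) - f(selected))
        selected.append(best)
        pool.remove(best)
    return selected
```

Diminishing marginal gains are what make redundancy reduction fall out naturally: once a word (or, in the paper's setting, an opinion facet) is covered, adding a sentence that repeats it contributes nothing.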

Posted ContentDOI
19 May 2023
TL;DR: The authors propose a unified topic encoder, which jointly discovers latent topics from the document and various kinds of side information and guides the information flow in a graph encoder through a topic-aware interaction, together with a triplet contrastive learning mechanism that aligns single- and multi-modal information into a unified semantic space.
Abstract: Automatic summarization plays an important role in the exponential document growth on the Web. On content websites such as CNN.com and WikiHow.com, there often exist various kinds of side information along with the main document for attention attraction and easier understanding, such as videos, images, and queries. Such information can be used for better summarization, as they often explicitly or implicitly mention the essence of the article. However, most of the existing side-aware summarization methods are designed to incorporate either single-modal or multi-modal side information, and cannot effectively adapt to each other. In this paper, we propose a general summarization framework, which can flexibly incorporate various modalities of side information. The main challenges in designing a flexible summarization model with side information include: (1) the side information can be in textual or visual format, and the model needs to align and unify it with the document into the same semantic space, (2) the side inputs can contain information from various aspects, and the model should recognize the aspects useful for summarization. To address these two challenges, we first propose a unified topic encoder, which jointly discovers latent topics from the document and various kinds of side information. The learned topics flexibly bridge and guide the information flow between multiple inputs in a graph encoder through a topic-aware interaction. We secondly propose a triplet contrastive learning mechanism to align the single-modal or multi-modal information into a unified semantic space, where the summary quality is enhanced by better understanding the document and side information. Results show that our model significantly surpasses strong baselines on three public single-modal or multi-modal benchmark summarization datasets.

Book ChapterDOI
01 Jan 2023
TL;DR: The main idea of this paper is to summarize text and to show how transformers work for text summarization, the process of creating a condensed form of a text document that maintains significant information and the general meaning of the source text.
Abstract: Text summarization is the process of creating a condensed form of a text document that maintains significant information and the general meaning of the source text. Automatic text summarization has become an important way of finding relevant information precisely in large texts in a short time with little effort. There are two main strategies: abstractive and extractive. In the extractive method, the algorithm generates the summary by picking up words and lines from the corpus; in the abstractive method, the algorithm generates the summary by rewriting the sentences. The main idea of this paper is to summarize text and to show how transformers work for text summarization.

Journal ArticleDOI
TL;DR: In this article, the authors address the generic and update text summarization tasks for a set of documents as a combinatorial optimization problem, using a genetic algorithm and unsupervised textual features.
Abstract: In this paper, we address the generic and update text summarization tasks for a set of documents as a combinatorial optimization problem, using a genetic algorithm and unsupervised textual features. In the news domain in particular, the input documents are a set of articles of varying sizes covering the same event. The main advantage of the proposed method is that it is language-independent. The experimental results demonstrate that the method performs well for both kinds of summarization. Moreover, we calculate heuristics for update text summarization as a benchmark for comparison with state-of-the-art methods.
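The combinatorial framing can be sketched as follows: a chromosome is a binary mask over sentences, fitness is an unsupervised objective under a length budget, and standard selection, crossover, and mutation search the space. This is a minimal illustration, not the paper's method; its fitness is a toy word-coverage score, whereas the paper combines richer unsupervised textual features:

```python
import random

def fitness(mask, sentences, budget):
    # toy unsupervised fitness: word coverage, invalid if over budget
    chosen = [s for m, s in zip(mask, sentences) if m]
    if len(chosen) > budget:
        return -1.0
    return float(len({w for s in chosen for w in s.lower().split()}))

def evolve(sentences, budget=2, pop_size=20, gens=30, seed=1):
    random.seed(seed)
    n = len(sentences)
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda m: fitness(m, sentences, budget), reverse=True)
        parents = pop[:pop_size // 2]          # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n)       # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(n)            # point mutation
            child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, sentences, budget))
```

Language independence comes for free here: nothing in the chromosome encoding or the search operators depends on the language of the sentences, only the fitness features do.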

Journal ArticleDOI
TL;DR: This paper proposes an audiovisual neural network that takes advantage of spatio-temporal visual and auditory information to better simulate human attention and to exploit richer information.