Showing papers on "Multi-document summarization" published in 2018


Proceedings ArticleDOI
01 Aug 2018
TL;DR: The authors exploited the maximal marginal relevance method to select representative sentences from multi-document input, and leveraged an abstractive encoder-decoder model to fuse disparate sentences into an abstractive summary.
Abstract: Generating a text abstract from a set of documents remains a challenging task. The neural encoder-decoder framework has recently been exploited to summarize single documents, but its success can in part be attributed to the availability of large parallel data automatically acquired from the Web. In contrast, parallel data for multi-document summarization are scarce and costly to obtain. There is a pressing need to adapt an encoder-decoder model trained on single-document summarization data to work with multiple-document input. In this paper, we present an initial investigation into a novel adaptation method. It exploits the maximal marginal relevance method to select representative sentences from multi-document input, and leverages an abstractive encoder-decoder model to fuse disparate sentences to an abstractive summary. The adaptation method is robust and itself requires no training data. Our system compares favorably to state-of-the-art extractive and abstractive approaches judged by automatic metrics and human assessors.

119 citations
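
The sentence-selection step named in the abstract above, maximal marginal relevance (MMR), can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words cosine similarity, the use of the whole input as the "query", and the names `mmr_select` and `lam` are all assumptions made for illustration. Each step picks the sentence that best trades relevance to the input against redundancy with sentences already chosen.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(sentences, k, lam=0.7):
    """Greedy MMR: score = lam * relevance - (1 - lam) * redundancy.

    Assumption: relevance is measured against the bag of words of the
    whole input, and redundancy is the highest similarity to any
    sentence that was already selected.
    """
    vecs = [Counter(s.lower().split()) for s in sentences]
    query = Counter()
    for v in vecs:
        query.update(v)

    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], query)
            redundancy = max(
                (cosine(vecs[i], vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

With a lower `lam`, the redundancy penalty dominates and an exact duplicate of an already selected sentence is skipped in favor of a less relevant but novel one; with a higher `lam`, relevance wins.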


Posted Content
TL;DR: An initial investigation into a novel adaptation method that exploits the maximal marginal relevance method to select representative sentences from multi-document input, and leverages an abstractive encoder-decoder model to fuse disparate sentences into an abstractive summary.
Abstract: Generating a text abstract from a set of documents remains a challenging task. The neural encoder-decoder framework has recently been exploited to summarize single documents, but its success can in part be attributed to the availability of large parallel data automatically acquired from the Web. In contrast, parallel data for multi-document summarization are scarce and costly to obtain. There is a pressing need to adapt an encoder-decoder model trained on single-document summarization data to work with multiple-document input. In this paper, we present an initial investigation into a novel adaptation method. It exploits the maximal marginal relevance method to select representative sentences from multi-document input, and leverages an abstractive encoder-decoder model to fuse disparate sentences to an abstractive summary. The adaptation method is robust and itself requires no training data. Our system compares favorably to state-of-the-art extractive and abstractive approaches judged by automatic metrics and human assessors.

88 citations


Proceedings Article
01 Aug 2018
TL;DR: In this article, the authors study the feasibility of using Abstract Meaning Representation (AMR), a semantic representation of natural language grounded in linguistic theory, as a form of content representation.
Abstract: Generating an abstract from a collection of documents is a desirable capability for many real-world applications. However, abstractive approaches to multi-document summarization have not been thoroughly investigated. This paper studies the feasibility of using Abstract Meaning Representation (AMR), a semantic representation of natural language grounded in linguistic theory, as a form of content representation. Our approach condenses source documents to a set of summary graphs following the AMR formalism. The summary graphs are then transformed to a set of summary sentences in a surface realization step. The framework is fully data-driven and flexible. Each component can be optimized independently using small-scale, in-domain training data. We perform experiments on benchmark summarization datasets and report promising results. We also describe opportunities and challenges for advancing this line of research.

72 citations


Journal ArticleDOI
01 Mar 2018
TL;DR: A clear up-to-date overview of the evolution and progress of summarization evaluation is provided, giving the reader useful insights into the past, present and latest trends in the automatic evaluation of summaries.
Abstract: Evaluation is crucial in the research and development of automatic summarization applications, in order to determine the appropriateness of a summary based on different criteria, such as the content it contains, and the way it is presented. To perform an adequate evaluation is of great relevance to ensure that automatic summaries can be useful for the context and/or application they are generated for. To this end, researchers must be aware of the evaluation metrics, approaches, and datasets that are available, in order to decide which of them would be the most suitable to use, or to be able to propose new ones, overcoming the possible limitations that existing methods may present. In this article, a critical and historical analysis of evaluation metrics, methods, and datasets for automatic summarization systems is presented, where the strengths and weaknesses of evaluation efforts are discussed and the major challenges to solve are identified. Therefore, a clear up-to-date overview of the evolution and progress of summarization evaluation is provided, giving the reader useful insights into the past, present and latest trends in the automatic evaluation of summaries.

72 citations


Journal ArticleDOI
TL;DR: A novel cuckoo-search-based multi-document summarizer (MDSCSA) is proposed to address the problem of multi-document summarization; the reported results indicate that it outperforms the other summarizers included in the study.

61 citations


Journal ArticleDOI
TL;DR: This work evaluates the performance of a multilayer-based method to select the most relevant sentences in the context of an extractive multi document summarization (MDS) task and makes a distinction between edges linking sentences from different documents (inter-layer) and those connecting sentences from the same document (intra-layer).
Abstract: Huge volumes of textual information are produced every single day. In order to organize and understand such large datasets, summarization techniques have become popular in recent years. These techniques aim at finding relevant, concise and non-redundant content in such big data. While network methods have been adopted to model texts in some scenarios, a systematic evaluation of multilayer network models in the multi-document summarization task has been limited to a few studies. Here, we evaluate the performance of a multilayer-based method to select the most relevant sentences in the context of an extractive multi-document summarization (MDS) task. In the adopted model, nodes represent sentences and edges are created based on the number of shared words between sentences. Differently from previous studies in multi-document summarization, we make a distinction between edges linking sentences from different documents (inter-layer) and those connecting sentences from the same document (intra-layer). As a proof of principle, our results reveal that such a discrimination between intra- and inter-layer edges in a multilayered representation is able to improve the quality of the generated summaries. This piece of information could be used to improve current statistical methods and related textual models.

59 citations
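
The multilayer representation described in the abstract above (sentences as nodes, edges weighted by shared words, intra-layer versus inter-layer edges) can be sketched as follows. The graph construction follows the abstract; the ranking rule is an assumption, since the abstract does not say how the two edge types are used. Here sentences are ranked by weighted degree with a hypothetical multiplier `inter_weight` on inter-layer edges.

```python
from itertools import combinations


def build_multilayer_edges(docs):
    """docs: list of documents, each a list of sentence strings.

    Returns edges as (node_u, node_v, shared_word_count, kind), where a
    node is (doc_index, sentence_index) and kind is "intra" when both
    sentences come from the same document, "inter" otherwise.
    """
    nodes = [
        (d, s, set(sentence.lower().split()))
        for d, doc in enumerate(docs)
        for s, sentence in enumerate(doc)
    ]
    edges = []
    for (d1, s1, w1), (d2, s2, w2) in combinations(nodes, 2):
        shared = len(w1 & w2)
        if shared:
            kind = "intra" if d1 == d2 else "inter"
            edges.append(((d1, s1), (d2, s2), shared, kind))
    return edges


def rank_sentences(docs, inter_weight=2.0):
    """Rank nodes by weighted degree (an illustrative choice).

    Assumption: inter-layer edges are scaled by `inter_weight`, so that
    content repeated across documents counts for more.
    """
    strength = {
        (d, s): 0.0 for d, doc in enumerate(docs) for s in range(len(doc))
    }
    for u, v, shared, kind in build_multilayer_edges(docs):
        weight = shared * (inter_weight if kind == "inter" else 1.0)
        strength[u] += weight
        strength[v] += weight
    return sorted(strength, key=strength.get, reverse=True)
```

Setting `inter_weight=1.0` collapses the model to a single-layer graph, which gives a simple baseline for checking whether the layer distinction changes the ranking.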


Proceedings Article
01 Aug 2018
TL;DR: A paraphrastic sentence fusion model is designed that jointly performs sentence fusion and paraphrasing using a skip-gram word embedding model at the sentence level, improving both the information coverage and the abstractiveness of the generated sentences.
Abstract: In this work, we aim at developing an unsupervised abstractive summarization system in the multi-document setting. We design a paraphrastic sentence fusion model which jointly performs sentence fusion and paraphrasing using skip-gram word embedding model at the sentence level. Our model improves the information coverage and at the same time abstractiveness of the generated sentences. We conduct our experiments on the human-generated multi-sentence compression datasets and evaluate our system on several newly proposed Machine Translation (MT) evaluation metrics. Furthermore, we apply our sentence level model to implement an abstractive multi-document summarization system where documents usually contain a related set of sentences. We also propose an optimal solution for the classical summary length limit problem which was not addressed in the past research. For the document level summary, we conduct experiments on the datasets of two different domains (e.g., news article and user reviews) which are well suited for multi-document abstractive summarization. Our experiments demonstrate that the methods bring significant improvements over the state-of-the-art methods.

58 citations


Journal ArticleDOI
TL;DR: The participation and the official results of the 2nd Computational Linguistics Scientific Summarization Shared Task (CL-SciSumm), held as a part of the BIRNDL workshop at the Joint Conference for Digital Libraries 2016 in Newark, New Jersey, are described.
Abstract: We describe the participation and the official results of the 2nd Computational Linguistics Scientific Summarization Shared Task (CL-SciSumm), held as a part of the BIRNDL workshop at the Joint Conference for Digital Libraries 2016 in Newark, New Jersey. CL-SciSumm is the first medium-scale Shared Task on scientific document summarization in the computational linguistics (CL) domain. Participants were provided a training corpus of 30 topics, each comprising a reference paper (RP) and 10 or more citing papers, all of which cite the RP. For each citation, the text spans (i.e., citances) that pertain to the RP have been identified. Participants solved three sub-tasks in automatic research paper summarization using this text corpus. Fifteen teams from six countries registered for the Shared Task, of which ten teams ultimately submitted and presented their results. The annotated corpus comprised 30 target papers, currently the largest available corpus of its kind. The corpus is available for free download and use at https://github.com/WING-NUS/scisumm-corpus.

45 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: This paper proposes an approach to extend the neural abstractive model trained on large scale SDS data to the MDS task, which makes use of a small number of multi-document summaries for fine tuning.
Abstract: Till now, neural abstractive summarization methods have achieved great success for single document summarization (SDS). However, due to the lack of large scale multi-document summaries, such methods can be hardly applied to multi-document summarization (MDS). In this paper, we investigate neural abstractive methods for MDS by adapting a state-of-the-art neural abstractive summarization model for SDS. We propose an approach to extend the neural abstractive model trained on large scale SDS data to the MDS task. Our approach only makes use of a small number of multi-document summaries for fine tuning. Experimental results on two benchmark DUC datasets demonstrate that our approach can outperform a variety of baseline neural models.

39 citations


Journal ArticleDOI
TL;DR: The results show that the CIBS method can improve the performance of single- and multi-document biomedical text summarization, and that the topic-based sentence clustering approach can be effectively used to increase the informative content of summaries while decreasing redundant information.

32 citations


Posted Content
TL;DR: This paper proposes an approach to extend the neural abstractive model trained on large scale SDS data to the MDS task, which makes use of a small number of multi-document summaries for fine tuning.
Abstract: Till now, neural abstractive summarization methods have achieved great success for single document summarization (SDS). However, due to the lack of large scale multi-document summaries, such methods can be hardly applied to multi-document summarization (MDS). In this paper, we investigate neural abstractive methods for MDS by adapting a state-of-the-art neural abstractive summarization model for SDS. We propose an approach to extend the neural abstractive model trained on large scale SDS data to the MDS task. Our approach only makes use of a small number of multi-document summaries for fine tuning. Experimental results on two benchmark DUC datasets demonstrate that our approach can outperform a variety of baseline neural models.

Journal ArticleDOI
TL;DR: To make the system applicable in real data, an online clustering approach is developed for participant detection and an online temporal-content mixture model is proposed to conduct sub-event detection.
Abstract: Given a textual data stream related to an event, social event summarization aims to generate an informative textual description that can capture all the important moments, and it plays a critical role in mining and analyzing social media streams. In this paper, we present a general social event summarization framework using Twitter streams. The proposed framework consists of three key components: participant detection, sub-event detection, and summary tweet extraction. To make the system applicable in real data, an online clustering approach is developed for participant detection and an online temporal-content mixture model is proposed to conduct sub-event detection. Experiments show that the proposed framework can achieve similar performance with its batch counterpart.

Proceedings Article
01 May 2018
TL;DR: A large heterogeneous multilingual multi-document summarization corpus with 7,316 topics in English and German is created, which has varying summary lengths and varying numbers of source documents.
Abstract: Automatic text summarization is a challenging natural language processing (NLP) task which has been researched for several decades. The available datasets for multi-document summarization (MDS) are, however, rather small and usually focused on the newswire genre. Nowadays, machine learning methods are applied to more and more NLP problems such as machine translation, question answering, and single-document summarization. Modern machine learning methods such as neural networks require large training datasets which are available for the three tasks but not yet for MDS. This lack of training data limits the development of machine learning methods for MDS. In this work, we automatically generate a large heterogeneous multilingual multi-document summarization corpus. The key idea is to use Wikipedia articles as summaries and to automatically search for appropriate source documents. We created a corpus with 7,316 topics in English and German, which has varying summary lengths and varying numbers of source documents. More information about the corpus can be found at the corpus GitHub page at https://github.com/AIPHES/auto-hMDS.

Journal ArticleDOI
TL;DR: A system for summarization of scientific and structured documents is presented with three components: section mixture models that estimate term weights, a hypothesis test that selects a subset of these terms, and a sentence extractor based on combinatorial optimization.
Abstract: In this paper, we present a system for summarization of scientific and structured documents that has three components: section mixture models are used for estimation of the weights of terms; a hypothesis test to select a subset of these terms; and a sentence extractor based on techniques for combinatorial optimization. The section mixture models approach is an adaptation of a bigram mixture model based on the main sections of a scientific document and a collection of citing sentences (citances) from papers that reference the document. The model was adapted from earlier work done on Biomedical documents used in the summarization task of the 2014 Text Analysis Conference (TAC 2014). The mixture model trained on the Biomedical data was used also on the data for the Computational Linguistics scientific summarization task of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (CL-SciSumm 2016). This model gives rise to machine-generated summaries with ROUGE scores that are nearly as strong as those seen on the Biomedical data and was also the highest scoring submission to the task of generating a human summary. For sentence extraction, we use the OCCAMS algorithm (Davis et al., in: Vreeken, Ling, Zaki, Siebes, Yu, Goethals, Webb, Wu (eds.) ICDM workshops, IEEE Computer Society, pp. 454–463, 2012), which takes the sentences from the original document and the assignment of weights of the terms computed by the language models and outputs a set of minimally overlapping sentences whose combined term coverage is maximized. Finally, we explore the importance of an appropriate background model for the hypothesis test to select terms to achieve the best quality summaries.
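
The extraction goal stated in the abstract above, a set of minimally overlapping sentences whose combined term coverage is maximized, can be illustrated with a simple greedy heuristic. This is not the OCCAMS algorithm; it is a generic budgeted-coverage sketch, and the function name `greedy_coverage`, the word budget `max_words`, and the gain-per-word rule are assumptions made for illustration.

```python
def greedy_coverage(sentences, term_weights, max_words):
    """Greedily pick sentences that add the most uncovered term weight.

    Assumptions: terms are lowercased whitespace tokens, each term's
    weight counts once no matter how often it is covered, and the
    summary length is limited to `max_words` words.
    """
    covered = set()
    chosen = []
    used = 0
    remaining = list(sentences)

    def gain(sentence):
        words = sentence.lower().split()
        new_terms = set(words) - covered
        # Weight of newly covered terms, normalized by sentence length.
        return sum(term_weights.get(t, 0.0) for t in new_terms) / len(words)

    while remaining:
        fitting = [
            s for s in remaining
            if s.split() and used + len(s.split()) <= max_words
        ]
        if not fitting:
            break
        best = max(fitting, key=gain)
        if gain(best) <= 0:
            break  # nothing left that adds weighted coverage
        chosen.append(best)
        covered |= set(best.lower().split())
        used += len(best.split())
        remaining.remove(best)
    return chosen
```

Because a term's weight is credited only the first time it is covered, sentences that repeat already covered terms score low, which is what keeps the selected set from overlapping.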

Journal ArticleDOI
01 Jun 2018
TL;DR: An elaborate user evaluation study to determine human preferences in forum summarization and to create a reference data set is presented and shows that even for a summarization task with low inter-rater agreement, a model can be trained that generates sensible summaries.
Abstract: In this paper we address extractive summarization of long threads in online discussion fora. We present an elaborate user evaluation study to determine human preferences in forum summarization and to create a reference data set. We showed long threads to ten different raters and asked them to create a summary by selecting the posts that they considered to be the most important for the thread. We study the agreement between human raters on the summarization task, and we show how multiple reference summaries can be combined to develop a successful model for automatic summarization. We found that although the inter-rater agreement for the summarization task was slight to fair, the automatic summarizer obtained reasonable results in terms of precision, recall, and ROUGE. Moreover, when human raters were asked to choose between the summary created by another human and the summary created by our model in a blind side-by-side comparison, they judged the model’s summary equal to or better than the human summary in over half of the cases. This shows that even for a summarization task with low inter-rater agreement, a model can be trained that generates sensible summaries. In addition, we investigated the potential for personalized summarization. However, the results for the three raters involved in this experiment were inconclusive. We release the reference summaries as a publicly available dataset.

Journal ArticleDOI
01 Jun 2018
TL;DR: An improved extractive text summarization method for documents is proposed that enhances the conventional lexical chain method to produce more relevant information from the text using three distinct features or characteristics of keywords in a text.
Abstract: Much research has converged on automatic text summarization as the number of text documents grows with the constant expansion of information diffusion. The objective of this proposal is to obtain the most reliable and substantial context, or the most relevant brief summary, of a text in an extractive manner. Extractive text summarization produces a short summary of a text that contains the most important information of the original by extracting a set of sentences from the original document. This paper proposes an improved extractive text summarization method for documents by enhancing the conventional lexical chain method to produce more relevant information from the text using three distinct features or characteristics of keywords in a text. The keywords of the document are labeled using our previous work, a transition probability distribution generator model, which can learn the characteristics of the keywords in a document and generates their probability distribution upon each feature.

Journal ArticleDOI
TL;DR: A novel extractive graph-based approach to solve the multi-document summarization (MDS) problem is proposed and it is shown that MDS-OP achieved the best F-measure scores on both tasks in terms of ROUGE-1 and ROUGE-L (DUC 2004), ROUGE-SU4, and three other evaluation methods (MultiLing 2015).
Abstract: With advances in information technology, people face the problem of dealing with tremendous amounts of information and need ways to save time and effort by summarizing the most important and relevant information. Thus, automatic text summarization has become necessary to reduce the information overload. This article proposes a novel extractive graph-based approach to solve the multi-document summarization (MDS) problem. To optimize the coverage of information in the output summary, the problem is formulated as an orienteering problem and heuristically solved by an ant colony system algorithm. The performance of the implemented system (MDS-OP) was evaluated on DUC 2004 (Task 2) and MultiLing 2015 (MMS task) benchmark corpora using several ROUGE metrics, as well as other methods. Its comparison with the performances of 26 systems shows that MDS-OP achieved the best F-measure scores on both tasks in terms of ROUGE-1 and ROUGE-L (DUC 2004), ROUGE-SU4, and three other evaluation methods (MultiLing 2015). Overall, MDS-OP ranked among the best 3 systems.
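
Several entries on this page report ROUGE-1 F-measure scores, so a minimal sketch of that metric may help when reading them. It computes clipped unigram overlap between a candidate and a single reference summary. The whitespace tokenization and lowercasing are simplifying assumptions; the official ROUGE toolkit offers further options such as stemming and stopword removal, so scores from this sketch are not directly comparable with published numbers.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str):
    """Return (precision, recall, f_measure) from unigram overlap.

    Overlap is clipped: a word counts at most as many times as it
    appears in the reference.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For the candidate "the cat sat" against the reference "the cat sat on the mat", all three candidate words match, so precision is 1.0, recall is 3/6 = 0.5, and the F-measure is 2/3.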

Proceedings ArticleDOI
01 Dec 2018
TL;DR: This paper presents an unsupervised centroid-based document-level reconstruction framework using distributed bag of words model that selects summary sentences in order to minimize the reconstruction error between the summary and the documents.
Abstract: As the number of documents on the web is growing exponentially, multi-document summarization is becoming more and more important since it can provide the main ideas in a document set in short time. In this paper, we present an unsupervised centroid-based document-level reconstruction framework using distributed bag of words model. Specifically, our approach selects summary sentences in order to minimize the reconstruction error between the summary and the documents. We apply sentence selection and beam search, to further improve the performance of our model. Experimental results on two different datasets show significant performance gains compared with the state-of-the-art baselines.
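
The selection principle in the abstract above, choosing sentences so that the summary reconstructs the document set with minimal error, can be sketched in a much simplified form. This sketch swaps the paper's distributed bag of words embeddings for raw term-count vectors and omits beam search; the name `centroid_summary` and the Euclidean error measure are assumptions made for illustration. At each step it greedily adds the sentence that brings the mean summary vector closest to the centroid of all sentences.

```python
import math
from collections import Counter


def _mean(vectors):
    """Component-wise mean of a non-empty list of count vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: count / len(vectors) for term, count in total.items()}


def _distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def centroid_summary(sentences, k):
    """Greedily select up to k sentences minimizing reconstruction error.

    Assumption: the error is the distance between the mean vector of
    the selected sentences and the centroid of all sentences.
    """
    if not sentences:
        return []
    vecs = [Counter(s.lower().split()) for s in sentences]
    centroid = _mean(vecs)
    chosen = []
    while len(chosen) < min(k, len(sentences)):
        rest = [i for i in range(len(sentences)) if i not in chosen]
        best = min(
            rest,
            key=lambda i: _distance(
                _mean([vecs[j] for j in chosen + [i]]), centroid
            ),
        )
        chosen.append(best)
    return [sentences[i] for i in chosen]
```

Because the error is measured on the summary as a whole, the second pick tends to complement the first rather than repeat it.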

Journal ArticleDOI
TL;DR: A method based on Genetic Algorithms is proposed for calculating the best sentence combinations of the DUC01 and DUC02 datasets in MDS through a meta-document representation, and three heuristics mentioned in several state-of-the-art works are calculated to rank the most recent MDS methods.
Abstract: Over the last years, several Multi-Document Summarization (MDS) methods have been presented in Document Understanding Conference (DUC) workshops. Since DUC01, methods presented in approximately 268 state-of-the-art publications have allowed the continuous improvement of MDS; however, in most works the upper bounds were unknown. Recently, some works have focused on calculating the best sentence combinations of a set of documents, and in previous work we calculated the significance for the single-document summarization task on the DUC01 and DUC02 datasets. However, no analysis of significance has been performed for the MDS task to rank the best multi-document summarization methods. In this paper, we propose a method based on Genetic Algorithms for calculating the best sentence combinations of the DUC01 and DUC02 datasets in MDS through a meta-document representation. Moreover, we have calculated three heuristics mentioned in several state-of-the-art works to rank the most recent MDS methods through the calculation of upper bounds and lower bounds.

Proceedings ArticleDOI
Pin Wu, Quan Zhou, Zhidan Lei, Wei Qiu, Xiaoqiang Li
01 Jul 2018
TL;DR: This paper proposes a method based on knowledge graph technology to automatically extract abstract texts that can not only obtain higher-level extraction from the text, but also can select template and question and answer to obtain a personalized abstract.
Abstract: People are flooded with massive semi-structured and unstructured texts in their daily work life. The fast-paced lifestyle has forced us to get more focused information from these large amounts of text more quickly. So people urgently need a technology that can automatically extract abstracts from text. The traditional extractive automatic abstract method can only extract keywords or key sentences. Although the current popular sequence-to-sequence extraction methods have greatly improved compared with the traditional methods, they cannot be combined with the background information to obtain higher level abstraction. Therefore, we propose a method based on knowledge graph technology to automatically extract abstract texts. This method can not only obtain higher-level extraction from the text, but also can select template and question and answer to obtain a personalized abstract. We experimented on the CNN DAILYMAIL dataset. The results show that the abstract obtained by this method reflects more textual information, is more in line with human reading habits, achieves personalized extraction, and obtains ROUGE results close to the best.

Posted Content
TL;DR: OpinioSumm, a new multi-document summarizer, is introduced; it outperforms the best baseline by 4.6% w.r.t. ROUGE-1 score. Popular MDS techniques are also compared and their performance is evaluated on the CQA corpora.
Abstract: Community Question Answering forums such as Quora, Stackoverflow are rich knowledge resources, often catering to information on topics overlooked by major search engines. Answers submitted to these forums are often elaborated, contain spam, are marred by slurs and business promotions. It is difficult for a reader to go through numerous such answers to gauge community opinion. As a result summarization becomes a prioritized task for CQA forums. While a number of efforts have been made to summarize factoid CQA, little work exists in summarizing non-factoid CQA. We believe this is due to the lack of a considerably large, annotated dataset for CQA summarization. We create CQASUMM, the first huge annotated CQA summarization dataset by filtering the 4.4 million Yahoo! Answers L6 dataset. We sample threads where the best answer can double up as a reference summary and build hundred word summaries from them. We treat other answers as candidates documents for summarization. We provide a script to generate the dataset and introduce the new task of Community Question Answering Summarization. Multi document summarization has been widely studied with news article datasets, especially in the DUC and TAC challenges using news corpora. However documents in CQA have higher variance, contradicting opinion and lesser amount of overlap. We compare the popular multi document summarization techniques and evaluate their performance on our CQA corpora. We look into the state-of-the-art and understand the cases where existing multi document summarizers (MDS) fail. We find that most MDS workflows are built for the entirely factual news corpora, whereas our corpus has a fair share of opinion based instances too. We therefore introduce OpinioSumm, a new MDS which outperforms the best baseline by 4.6% w.r.t ROUGE-1 score.

Posted Content
TL;DR: In this paper, the authors study the feasibility of using Abstract Meaning Representation (AMR), a semantic representation of natural language grounded in linguistic theory, as a form of content representation.
Abstract: Generating an abstract from a collection of documents is a desirable capability for many real-world applications. However, abstractive approaches to multi-document summarization have not been thoroughly investigated. This paper studies the feasibility of using Abstract Meaning Representation (AMR), a semantic representation of natural language grounded in linguistic theory, as a form of content representation. Our approach condenses source documents to a set of summary graphs following the AMR formalism. The summary graphs are then transformed to a set of summary sentences in a surface realization step. The framework is fully data-driven and flexible. Each component can be optimized independently using small-scale, in-domain training data. We perform experiments on benchmark summarization datasets and report promising results. We also describe opportunities and challenges for advancing this line of research.

Journal ArticleDOI
TL;DR: The Feature Maximization based approach performs very well in the SciSumm 2016 context for the considered task, providing better results than the known results so far, and obtaining high recall.
Abstract: Feature Maximization is a feature selection method that deals efficiently with textual data and makes it possible to design systems that are language-agnostic, parameter-free, and do not require additional corpora to function. We propose to evaluate its use in text summarization, in particular in cases where documents are structured. We first experiment with this approach in a single-document summarization context. We evaluate it on the DUC AQUAINT corpus and show that despite the unstructured nature of the corpus, our system is above the baseline and produces encouraging results. We also observe that the produced summaries seem robust to redundancy. Next, we evaluate our method in the more appropriate context of the SciSumm challenge, which is dedicated to research publications summarization. These publications are structured in sections and our class-based approach is thus relevant. We more specifically focus on the task that aims to summarize papers using those that refer to them. We consider and evaluate several systems using our approach dealing with specific bag of words. Furthermore, in these systems, we also evaluate cosine and graph-based distance for sentence weighting and comparison. We show that our Feature Maximization based approach performs very well in the SciSumm 2016 context for the considered task, providing better results than the known results so far, and obtaining high recall. We thus demonstrate the flexibility and the relevance of Feature Maximization in this context.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: According to experimental results, learning-to-rank methods achieve promising ROUGE-scores in many cases and one of them surpasses the state-of-the-art unsupervised learning method.
Abstract: Text summarization is a challenging but interesting task of natural language processing. While this task has been widely studied in English, it is still at an early stage for Vietnamese. This paper introduces an investigation of extractive summarization methods in Vietnamese. To do that, we implement and compare several well-known summarization methods in three directions: unsupervised, supervised, and deep learning. We validate the performance of the methods on two Vietnamese datasets. According to experimental results, we find two interesting points. Firstly, learning-to-rank methods achieve promising ROUGE-scores in many cases. Particularly, one of them surpasses the state-of-the-art unsupervised learning method. Secondly, formulating the scoring step in the form of learning-to-rank benefits the selection step.

Journal ArticleDOI
TL;DR: This paper proposes a new sentence weighting method by incorporating sentence distribution and POS tagging for multi-document summarization and achieves better results with an increasing rate of 5.41% on ROUGE-1 and 0.62% on ROUGE-2.
Abstract: Automatic multi-document summarization needs to find representative sentences not only by sentence distribution to select the most important sentence but also by how informative a term is in a sentence. Sentence distribution is suitable for obtaining important sentences by determining frequent and well-spread words in the corpus but ignores the grammatical information that indicates instructive content. The presence or absence of informative content in a sentence can be indicated by grammatical information which is carried by part of speech (POS) labels. In this paper, we propose a new sentence weighting method by incorporating sentence distribution and POS tagging for multi-document summarization. Similarity-based Histogram Clustering (SHC) is used to cluster sentences in the data set. Cluster ordering is based on cluster importance to determine the important clusters. Sentence extraction based on sentence distribution and POS tagging is introduced to extract the representative sentences from the ordered clusters. The results of the experiment on the Document Understanding Conferences (DUC) 2004 are compared with those of the Sentence Distribution Method. Our proposed method achieved better results with an increasing rate of 5.41% on ROUGE-1 and 0.62% on ROUGE-2.

Journal ArticleDOI
TL;DR: This research proposes a novel summarization method which combines K-Means Clustering and LDA - Significance Sentences, so it can generate document summaries based on the topic and has good performance when the K-means method can cluster the document according to the topic correctly.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: A model for improving the quality of the scoring step is presented; it benefits sentence selection to extract high-quality summaries and achieves sufficient improvements over traditional methods and competitive results with state-of-the-art deep learning models.
Abstract: Sentence scoring is a vital step in an extractive summarization system. This paper presents a model for improving the quality of the scoring step, which benefits sentence selection to extract high-quality summaries. Different from previous methods, our model takes advantage of local information (inside a single document) and global information (on the whole corpus). The combination allows defining a rich set of features used for learning. Under a learning-to-rank formulation, the model learns to estimate the importance of sentences. After ranking, summaries are finally extracted by selecting top-ranked sentences with the consideration of diversity. Experiments on three benchmark datasets (DUC 2001, 2002, and 2004) indicate that our model achieves sufficient improvements over traditional methods and competitive results with state-of-the-art deep learning models.
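
The final step described here, selecting top-ranked sentences while considering diversity, can be sketched as a greedy filter. The abstract does not say how diversity is enforced, so the word-overlap (Jaccard) test, the `max_sim` threshold, and the function names below are assumptions for illustration; the scores stand in for the output of any learned ranker.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two sentences."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_diverse(sentences, scores, k=2, max_sim=0.5):
    """Greedily pick up to k sentences in descending score order,
    skipping any sentence too similar to one already chosen."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if all(jaccard(sentences[i], sentences[j]) < max_sim for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    # Restore original document order for readability.
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stocks fell sharply today",
]
scores = [0.9, 0.8, 0.5]  # placeholder ranker outputs (assumption)
summary = select_diverse(sentences, scores, k=2, max_sim=0.5)
```

The second sentence is skipped despite its high score because it overlaps heavily with the first, so the summary pairs the top-ranked sentence with the dissimilar third one.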

Patent
02 Nov 2018
TL;DR: In this article, a method and a device for generating a multi-document summarization are presented; the work relates to the field of data processing and addresses the poor performance of summarizations generated by existing automatic multi-document summarization technology.
Abstract: The embodiment of the invention discloses a method and a device for generating a multi-document summarization, relates to the field of data processing, and solves the problem of poor performance of a summarization generated by an existing automatic multi-document summarization technology. A specific scheme of the method comprises the steps of dividing multiple documents into n sentences; generating an input word bag vector; performing unsupervised training on each sentence represented by the input word bag vector to obtain an encoding hidden layer vector of each sentence and a potential semantic vector of each sentence; collecting m potential semantic vectors; obtaining m decoding hidden layer vectors and m output word bag vectors according to the m potential semantic vectors; updating the m decoding hidden layer vectors and the m output word bag vectors; estimating an importance degree of each sentence; acquiring the importance degree and a redundancy degree of a verb phrase of each sentence and the importance degree and the redundancy degree of a noun phrase of each sentence; and generating the summarization of multiple documents according to the importance degree and the redundancy degree of all noun phrases and the importance degree and the redundancy degree of all verb phrases. The embodiment of the invention is used for a process for generating the multi-document summarization.

Journal ArticleDOI
TL;DR: Experimental results show the degree of effectiveness in text summarization over different clustering algorithms, along with an analysis of treating a query sentence as a common one, segmented from documents for text summarization.
Abstract: The availability of various digital sources has created a demand for text mining mechanisms. Effective summary generation mechanisms are needed in order to utilize relevant information from often overwhelming digital data sources. In this view, this paper conducts a survey of various single as well as multi-document text summarization techniques. It also provides analysis of treating a query sentence as a common one, segmented from documents for text summarization. Experimental results show the degree of effectiveness in text summarization over different clustering algorithms.