scispace - formally typeset
Topic

Multi-document summarization

About: Multi-document summarization is a research topic. Over its lifetime, 2,270 publications have been published within this topic, receiving 71,850 citations.


Papers
Proceedings ArticleDOI
06 Nov 2006
TL;DR: This work proposes the frequency of domain concepts as a method to identify important sentences within a full text, together with a novel frequency distribution model and algorithm for identifying important sentences based on term or concept frequency distribution.
Abstract: Text summarization is a data reduction process. It enables users to reduce the amount of text that must be read while still assimilating the core information. This data reduction is particularly useful in the biomedical domain, where physicians must continuously find clinical trial study information to incorporate into their patient treatment efforts, and where such efforts are often hampered by the high volume of publications. Our contribution is twofold: 1) we propose the frequency of domain concepts as a method to identify important sentences within a full text; and 2) we propose a novel frequency distribution model and algorithm for identifying important sentences based on term or concept frequency distribution. An evaluation of several existing summarization systems on biomedical texts is presented in order to establish a performance baseline. For domain concept comparison, a recent high-performing frequency-based algorithm using terms is adapted to use concepts and evaluated using both terms and concepts. We show that using concepts performs comparably to using terms for sentence selection. Our proposed frequency distribution model and algorithm outperforms a state-of-the-art approach.
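The frequency-based idea in this abstract can be sketched roughly as follows. The tokenization, stop-word handling, and length normalization here are illustrative assumptions, not the authors' exact algorithm; with a concept mapper in place of `str.split`, the same scheme would score by concept frequency rather than term frequency.

```python
from collections import Counter

def score_sentences(sentences, stopwords=frozenset()):
    """Score each sentence by the summed corpus frequency of its
    content terms, normalized by sentence length (a common
    frequency-based heuristic, not the paper's exact algorithm)."""
    # Corpus-wide term frequencies over all sentences
    freq = Counter(
        w for s in sentences
        for w in s.lower().split() if w not in stopwords
    )
    scores = []
    for s in sentences:
        terms = [w for w in s.lower().split() if w not in stopwords]
        score = sum(freq[w] for w in terms) / len(terms) if terms else 0.0
        scores.append(score)
    return scores

def top_k_summary(sentences, k=2, stopwords=frozenset()):
    """Return the k highest-scoring sentences, kept in document order."""
    scores = score_sentences(sentences, stopwords)
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(ranked)]
```

Sentences whose terms recur across the document accumulate higher scores, which is the intuition behind using domain-concept frequency for sentence selection.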

30 citations

01 Jan 2009
TL;DR: A summarization method is discussed that combines several domain-specific features with other known features, such as term frequency, title, and position, to improve summarization performance in the medical domain.
Abstract: Medical literature on the web is an important source to help clinicians in patient care. Initially, clinicians go through the author-written abstracts or summaries available with medical articles to decide whether the articles are suitable for in-depth study. Since not all medical articles come with author-written abstracts or summaries, automatic summarization of medical articles will help clinicians or medical students find relevant information on the web rapidly. In this paper we discuss a summarization method which combines several domain-specific features with other known features, such as term frequency, title, and position, to improve summarization performance in the medical domain. Our experiments show that the incorporation of domain-specific features improves summarization performance.
Index Terms — Text summarization, Domain-specific features, Novel medical term detection
I. INTRODUCTION
These days, people are overwhelmed by the huge amount of information on the Web. The number of pages available on the Internet almost doubles every year. This is also the case for medical information [1], which is now available from a variety of sources. Medical literature on the web, such as medical news, research articles, and clinical trial reports, is an important source to help clinicians in patient treatment. Initially, clinicians go through the author-written abstracts or summaries available with medical articles to decide whether the articles are relevant to them for in-depth study. Since not all medical articles come with author-written abstracts or summaries, automatic summarization of medical articles will help clinicians or medical students find relevant information on the web rapidly. Moreover, monitoring infectious disease outbreaks or other biological threats demands rapid information gathering and summarization.
Text summarization is the process of producing a condensed representation of the content of its input for human consumption [2]. Input to a summarization process can be one or more text documents. When only one document is the input, it is called single-document text summarization; when the input is a cluster of related text documents, it is multi-document summarization. We can also categorize text summarization by the type of users the summary is intended for: user-focused (query-focused) summaries are tailored to the requirements of a particular user or group of users, while generic summaries are aimed at a broad readership community [2]. Depending on the nature of text representation in the summary, a summary can be categorized as an abstract or an extract. An extract is a summary consisting of a number of salient text units selected from the input. An abstract is a summary that represents the subject matter of the article with text units generated by reformulating the salient units selected from the input; an abstract may contain some text units that are not present in the input text. Based on its information content, a summary can be categorized as informative or indicative. An indicative summary gives the user an indication of an article's purpose and approach, to help in selecting the article for in-depth reading; an informative summary covers all salient information in the document at some level of detail, i.e., it will contain information about all the different aspects such as the article's purpose, scope, approach, results, and conclusions. For example, an abstract of a medical research article is more informative than its headline. According to the above-mentioned types and sub-types of automatic text summarization, the summarization technique presented in this paper can be called sentence-extraction-based single-document informative summarization in the medical domain.
Some previous works on extractive summarization used some or all of features such as term frequency, positional information, and cue phrases to compute sentence scores [3][4][5]. Machine learning approaches to extractive summarization have also been investigated; in [6], sentence extraction is viewed as a Bayesian classification task. Compared to creating an extract, generating an abstract is harder, since the latter requires: (1) semantic representation of text units (sentences or paragraphs) in the text, (2) reformulation of two or more text units, and (3) rendering the new representation in natural language. Abstractive approaches have used template-based information extraction, information fusion, and compression. In the information-extraction-based approach, predefined template slots are filled with the desired pieces of information extracted by the summarization engine [7]. Jing and McKeown [8] pointed out that human summaries are often constructed from the source document by a process of cutting and pasting document fragments that are then combined and regenerated as new text.
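The classic feature-combination scoring that these extractive systems describe (term frequency, title-word overlap, sentence position) might look like the sketch below. The three features and the weights are illustrative assumptions, not tuned values from any of the cited papers.

```python
def feature_score(sentence_index, sentence, title, num_sentences, tf):
    """Linear combination of three classic extractive features.
    `tf` maps each word to its frequency in the whole document.
    The 0.4/0.4/0.2 weights are placeholders for illustration."""
    words = set(sentence.lower().split())
    title_words = set(title.lower().split())
    # Feature 1: average document frequency of the sentence's words
    f_tf = sum(tf.get(w, 0) for w in words) / max(len(words), 1)
    # Feature 2: fraction of title words appearing in the sentence
    f_title = len(words & title_words) / max(len(title_words), 1)
    # Feature 3: earlier sentences receive higher position scores
    f_pos = 1.0 - sentence_index / max(num_sentences, 1)
    return 0.4 * f_tf + 0.4 * f_title + 0.2 * f_pos
```

A domain-aware variant of this scheme, as the paper above suggests, would add features such as novel medical term occurrence alongside these generic ones.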

30 citations

Journal ArticleDOI
01 Jan 2008
TL;DR: A combination of the algorithms, which achieves cooperation between the categorization and summarization mechanisms, is introduced in order to enhance text labeling through the personalized summaries that are constructed.
Abstract: In this manuscript we present the summarization and categorization subsystems of a complete mechanism that begins with web-page fetching and concludes with presentation of the collected data to end users through a personalized portal. The system collects articles from major news portals and, following an algorithmic procedure, creates a more user-friendly and personalized "view" of the articles. Before presenting the information back to the end user, the core of our mechanism automatically categorizes the data and then extracts personalized summaries. We focus on the core of the mechanism and, more specifically, present the algorithms used for the summarization and categorization of texts. The algorithms are not used only for producing isolated data targeted at a specific subsystem; rather, a combination of the algorithms, which achieves cooperation between the categorization and summarization mechanisms, is introduced in order to enhance text labeling through the personalized summaries that are constructed.

30 citations

Proceedings Article
23 Aug 2010
TL;DR: A systematic study comparing different learning-to-rank algorithms and different selection strategies is presented for two key problems in extractive summarization: the ranking problem and the selection problem.
Abstract: This paper presents a comparative study of two key problems in extractive summarization: the ranking problem and the selection problem. To this end, we present a systematic study comparing different learning-to-rank algorithms and different selection strategies. This is the first work to provide a systematic analysis of these problems. Experimental results on two benchmark datasets demonstrate three findings: (1) pairwise and listwise learning-to-rank algorithms outperform the baselines significantly; (2) there is no significant difference among the learning-to-rank algorithms; and (3) the integer linear programming selection strategy generally outperforms the Maximum Marginal Relevance and Diversity Penalty strategies.
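The Maximum Marginal Relevance baseline compared in this study can be sketched as a greedy loop that trades off relevance against redundancy. The scores and the similarity function here are placeholder inputs; any relevance model and pairwise similarity in [0, 1] can be plugged in.

```python
def mmr_select(candidates, scores, similarity, k, lam=0.7):
    """Maximal Marginal Relevance: greedily pick sentences that are
    relevant (high score) but not redundant with those already chosen.
    `similarity(a, b)` is any pairwise similarity in [0, 1];
    `lam` trades relevance against diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr(i):
            redundancy = max((similarity(candidates[i], candidates[j])
                              for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```

Unlike this greedy procedure, the integer linear programming strategy the paper favors optimizes the selection jointly over the whole summary, which is what its third finding compares.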

30 citations

Journal ArticleDOI
TL;DR: A novel Probabilistic-modeling Relevance, Coverage, and Novelty (PRCN) framework is proposed, which exploits a reference topic model incorporating the user query for dependent relevance measurement; topic coverage is also modeled under this framework.
Abstract: Summarization plays an increasingly important role given the exponential growth of documents on the Web. For query-focused summarization in particular, there are three challenges: (1) how to retrieve query-relevant sentences; (2) how to concisely cover the main aspects (i.e., topics) in the document; and (3) how to balance these two requirements. Regarding relevance in particular, many traditional summarization techniques assume that relevance between sentences is independent, which may not hold in reality. In this paper, we go beyond this assumption and propose a novel Probabilistic-modeling Relevance, Coverage, and Novelty (PRCN) framework, which exploits a reference topic model incorporating the user query for dependent relevance measurement. Along this line, topic coverage is also modeled under our framework. To further address the issues above, various sentence features regarding relevance and novelty are constructed, while moderate topic coverage is maintained through a greedy algorithm for topic balance. Finally, experiments on the DUC2005 and DUC2006 datasets validate the effectiveness of the proposed method.
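A greedy coverage-based selection of the kind this abstract mentions can be sketched as follows. The topic assignments are assumed to be given (e.g., by a topic model over the document), and this is a generic coverage heuristic for illustration, not the PRCN algorithm itself.

```python
def greedy_topic_cover(sentences, sentence_topics, k):
    """Greedily pick up to k sentences, each time choosing the one
    covering the most not-yet-covered topics; ties go to the earlier
    sentence. `sentence_topics[i]` is the set of topic ids that
    sentence i touches. Returns the picks in document order."""
    covered, chosen = set(), []
    for _ in range(k):
        best, gain = None, 0
        for i, topics in enumerate(sentence_topics):
            if i in chosen:
                continue
            g = len(topics - covered)
            if g > gain:
                best, gain = i, g
        if best is None:  # no sentence adds new topics; stop early
            break
        chosen.append(best)
        covered |= sentence_topics[best]
    return [sentences[i] for i in sorted(chosen)]
```

Balancing such a coverage objective against query relevance and novelty is exactly the trade-off the PRCN framework models probabilistically.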

30 citations


Network Information
Related Topics (5)
Natural language: 31.1K papers, 806.8K citations, 85% related
Ontology (information science): 57K papers, 869.1K citations, 84% related
Web page: 50.3K papers, 975.1K citations, 83% related
Recurrent neural network: 29.2K papers, 890K citations, 83% related
Graph (abstract data type): 69.9K papers, 1.2M citations, 83% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    74
2022    160
2021    52
2020    61
2019    47
2018    52