
Showing papers on "Multi-document summarization" published in 2009


Proceedings ArticleDOI
31 May 2009
TL;DR: The final model, HierSum, utilizes a hierarchical LDA-style model (Blei et al., 2004) to represent content specificity as a hierarchy of topic vocabulary distributions; it yields state-of-the-art ROUGE performance and, in pairwise user evaluation, strongly outperforms Toutanova et al. (2007)'s state-of-the-art discriminative system.
Abstract: We present an exploration of generative probabilistic models for multi-document summarization. Beginning with a simple word frequency based model (Nenkova and Vanderwende, 2005), we construct a sequence of models each injecting more structure into the representation of document set content and exhibiting ROUGE gains along the way. Our final model, HierSum, utilizes a hierarchical LDA-style model (Blei et al., 2004) to represent content specificity as a hierarchy of topic vocabulary distributions. At the task of producing generic DUC-style summaries, HierSum yields state-of-the-art ROUGE performance and in pairwise user evaluation strongly outperforms Toutanova et al. (2007)'s state-of-the-art discriminative system. We also explore HierSum's capacity to produce multiple 'topical summaries' in order to facilitate content discovery and navigation.
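As a point of reference for the model sequence described above, the following is a minimal Python sketch of a SumBasic-style word-frequency summarizer in the spirit of Nenkova and Vanderwende (2005), the starting point of the paper; it is an illustrative simplification, not HierSum, and all function names are invented for the example.

# Minimal sketch of a SumBasic-style frequency summarizer (illustrative only;
# this is the kind of baseline the paper starts from, not HierSum itself).
from collections import Counter
import re

def _tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def sumbasic_summary(sentences, max_words=100):
    counts = Counter(w for s in sentences for w in _tokens(s))
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}
    summary, length = [], 0
    candidates = list(sentences)
    while candidates and length < max_words:
        # Pick the candidate whose words have the highest average probability.
        best = max(candidates,
                   key=lambda s: sum(prob[w] for w in _tokens(s)) / max(len(_tokens(s)), 1))
        summary.append(best)
        length += len(_tokens(best))
        candidates.remove(best)
        for w in set(_tokens(best)):
            prob[w] *= prob[w]  # discount covered words to curb redundancy
    return summary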

521 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: The proposed methods are quite general and can be used to automatically generate a rated aspect summary given any collection of short comments, each associated with an overall rating.
Abstract: Web 2.0 technologies have enabled more and more people to freely comment on different kinds of entities (e.g. sellers, products, services). The large scale of information poses the need and challenge of automatic summarization. In many cases, each of the user-generated short comments comes with an overall rating. In this paper, we study the problem of generating a "rated aspect summary" of short comments, which is a decomposed view of the overall ratings for the major aspects so that a user could gain different perspectives towards the target entity. We formally define the problem and decompose the solution into three steps. We demonstrate the effectiveness of our methods by using eBay sellers' feedback comments. We also quantitatively evaluate each step of our methods and study how well humans agree on such a summarization task. The proposed methods are quite general and can be used to automatically generate a rated aspect summary given any collection of short comments, each associated with an overall rating.

381 citations


Journal ArticleDOI
TL;DR: The purpose of the present paper is to show that the summarization result depends not only on the optimized function but also on the similarity measure.
Abstract: The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is a process of automatically creating a compressed version of a given document that provides useful information to users, and multi-document summarization is to produce a summary delivering the majority of information content from a set of documents about an explicit or implicit main topic. In our study we focus on sentence-based extractive document summarization. We propose a generic document summarization method which is based on sentence clustering. The proposed approach continues the sentence-clustering based extractive summarization methods proposed in Alguliev [Alguliev, R. M., Aliguliyev, R. M., Bagirov, A. M. (2005). Global optimization in the summarization of text documents. Automatic Control and Computer Sciences 39, 42-47], Aliguliyev [Aliguliyev, R. M. (2006). A novel partitioning-based clustering method and generic document summarization. In Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology (WI-IAT 2006 Workshops) (WI-IATW'06), 18-22 December (pp. 626-629) Hong Kong, China], Alguliev and Alyguliev [Alguliev, R. M., Alyguliev, R. M. (2007). Summarization of text-based documents with a determination of latent topical sections and information-rich sentences. Automatic Control and Computer Sciences 41, 132-140] and Aliguliyev [Aliguliyev, R. M. (2007). Automatic document summarization by sentence extraction. Journal of Computational Technologies 12, 5-15]. The purpose of the present paper is to show that the summarization result depends not only on the optimized function but also on the similarity measure. The experimental results on the open benchmark datasets DUC01 and DUC02 show that our proposed approach can improve the performance compared to state-of-the-art summarization approaches.
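To make the paper's point concrete, the sketch below clusters sentences and extracts one representative per cluster with a pluggable similarity measure; it is a simplified illustration under assumed choices (TF-IDF vectors, k-means), not the authors' optimization-based method.

# Illustrative sketch (not the authors' algorithm): cluster sentences and pick one
# representative per cluster, with the similarity measure left pluggable to show
# how it affects the extracted summary.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def jaccard_sim(a, b):
    sa, sb = set(np.nonzero(a)[0]), set(np.nonzero(b)[0])
    return len(sa & sb) / max(len(sa | sb), 1)

def cluster_summary(sentences, n_clusters=3, sim=cosine_sim):
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences).toarray()
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    summary = []
    for k in range(n_clusters):
        idx = np.where(labels == k)[0]
        centroid = X[idx].mean(axis=0)
        # Which sentence "represents" the cluster depends on the similarity used here.
        best = max(idx, key=lambda i: sim(X[i], centroid))
        summary.append(sentences[best])
    return summary

Passing sim=jaccard_sim instead of the default cosine_sim typically changes which sentences are extracted, which is exactly the dependence on the similarity measure that the paper studies.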

271 citations


Journal ArticleDOI
TL;DR: This work proposes an automatic summarization approach based on the analysis of review articles' internal topic structure to assemble customer concerns and shows that the proposed approach outperforms the peer approaches, i.e. opinion mining and clustering-summarization, in terms of users' responsiveness and its ability to discover the most important topics.
Abstract: Product reviews possess critical information regarding customers' concerns and their experience with the product. Such information is considered essential to firms' business intelligence which can be utilized for the purpose of conceptual design, personalization, product recommendation, better customer understanding, and finally attracting more loyal customers. Previous studies of deriving useful information from customer reviews focused mainly on numerical and categorical data. Textual data have been somewhat ignored although they are deemed valuable. Existing methods of opinion mining in processing customer reviews concentrate on counting positive and negative comments of review writers, which is not enough to cover all important topics and concerns across different review articles. Instead, we propose an automatic summarization approach based on the analysis of review articles' internal topic structure to assemble customer concerns. Different from the existing summarization approaches centered on sentence ranking and clustering, our approach discovers and extracts salient topics from a set of online reviews and further ranks these topics. The final summary is then generated based on the ranked topics. The experimental study and evaluation show that the proposed approach outperforms the peer approaches, i.e. opinion mining and clustering-summarization, in terms of users' responsiveness and its ability to discover the most important topics.

184 citations


Proceedings ArticleDOI
04 Aug 2009
TL;DR: A new Bayesian sentence-based topic model for summarization by making use of both the term-document and term-sentence associations is proposed and an efficient variational Bayesian algorithm is derived for model parameter estimation.
Abstract: Most of the existing multi-document summarization methods decompose the documents into sentences and work directly in the sentence space using a term-sentence matrix. However, the knowledge on the document side, i.e. the topics embedded in the documents, can help the context understanding and guide the sentence selection in the summarization procedure. In this paper, we propose a new Bayesian sentence-based topic model for summarization by making use of both the term-document and term-sentence associations. An efficient variational Bayesian algorithm is derived for model parameter estimation. Experimental results on benchmark data sets show the effectiveness of the proposed model for the multi-document summarization task.
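The following is only a rough analogue of the idea, not the paper's Bayesian sentence-based topic model or its variational algorithm: it fits an off-the-shelf LDA model over a term-sentence matrix and favours sentences aligned with the document set's dominant topics.

# Rough analogue (assumption-laden, not the paper's model): fit LDA over sentences
# and favour sentences that concentrate on the dominant topics of the document set.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_rank_sentences(sentences, n_topics=5, n_pick=3):
    X = CountVectorizer(stop_words="english").fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(X)        # sentence-topic distributions
    doc_topics = theta.mean(axis=0)     # aggregate topic weights for the set
    scores = theta @ doc_topics         # sentences aligned with dominant topics
    order = np.argsort(-scores)[:n_pick]
    return [sentences[i] for i in sorted(order)]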

175 citations


Journal ArticleDOI
TL;DR: A set of 14 summarizers, generically referred to as CN-Summ, is developed, employing network concepts such as node degree, length of shortest paths, d-rings and k-cores to select sentences for an extractive summary of texts.

139 citations


Proceedings ArticleDOI
04 Jun 2009
TL;DR: This work proposes a one-step approach for document summarization that jointly performs sentence extraction and compression by solving an integer linear program.
Abstract: Text summarization is one of the oldest problems in natural language processing. Popular approaches rely on extracting relevant sentences from the original documents. As a side effect, sentences that are too long but partly relevant are doomed to either not appear in the final summary, or prevent inclusion of other relevant sentences. Sentence compression is a recent framework that aims to select the shortest subsequence of words that yields an informative and grammatical sentence. This work proposes a one-step approach for document summarization that jointly performs sentence extraction and compression by solving an integer linear program. We report favorable experimental results on newswire data.
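A stripped-down sketch of the extraction side of such an integer linear program is shown below using the PuLP modelling library; the paper's actual program also encodes compression decisions, which are omitted here, and the relevance scores are assumed to be given.

# Sketch of ILP-based sentence selection under a length budget (extraction only;
# the paper's full program also handles compression). Requires: pip install pulp
import pulp

def ilp_select(sentences, relevance, budget_words=100):
    lengths = [len(s.split()) for s in sentences]
    prob = pulp.LpProblem("summary", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(sentences))]
    # Objective: total relevance of the selected sentences.
    prob += pulp.lpSum(relevance[i] * x[i] for i in range(len(sentences)))
    # Budget constraint: keep the summary under the word limit.
    prob += pulp.lpSum(lengths[i] * x[i] for i in range(len(sentences))) <= budget_words
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [s for s, xi in zip(sentences, x) if xi.value() == 1]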

132 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: This paper considers extract-based summarization as learning a mapping from a set of sentences of a given document to a subset of the sentences that satisfies the following three requirements: diversity in summarization, sufficient coverage, and balance.
Abstract: Document summarization plays an increasingly important role with the exponential growth of documents on the Web. Many supervised and unsupervised approaches have been proposed to generate summaries from documents. However, these approaches seldom simultaneously consider summary diversity, coverage, and balance issues, which to a large extent determine the quality of summaries. In this paper, we consider extract-based summarization emphasizing the following three requirements: 1) diversity in summarization, which seeks to reduce redundancy among sentences in the summary; 2) sufficient coverage, which focuses on avoiding the loss of the document's main information when generating the summary; and 3) balance, which demands that different aspects of the document have about the same relative importance in the summary. We formulate the extract-based summarization problem as learning a mapping from a set of sentences of a given document to a subset of the sentences that satisfies the above three requirements. The mapping is learned by incorporating several constraints in a structure learning framework, and we explore the graph structure of the output variables and employ structural SVM for solving the resulting optimization problem. Experiments on the DUC2001 data sets demonstrate significant performance improvements in terms of F1 and ROUGE metrics.
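The paper learns the sentence-subset mapping with a structural SVM; as a far simpler illustration of the coverage/diversity trade-off itself (not of the learning method), an MMR-style greedy selector can be sketched as follows.

# MMR-style greedy selector, much simpler than the paper's structural-SVM method:
# trade coverage (similarity to the document centroid) against diversity
# (dissimilarity to already selected sentences).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_select(sentences, k=3, lam=0.7):
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    doc = np.asarray(X.mean(axis=0))            # document centroid
    cover = cosine_similarity(X, doc).ravel()    # coverage term
    sim = cosine_similarity(X)                   # sentence-sentence similarity
    selected = []
    while len(selected) < min(k, len(sentences)):
        def mmr(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * cover[i] - (1 - lam) * redundancy
        best = max((i for i in range(len(sentences)) if i not in selected), key=mmr)
        selected.append(best)
    return [sentences[i] for i in sorted(selected)]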

126 citations


Proceedings Article
01 Sep 2009
TL;DR: This work proposes a new method based on probabilistic latent semantic analysis, which allows for sentences and queries to be represented as probability distributions over latent topics, to estimate the summary relevance of sentences.
Abstract: We consider the problem of query-focused multi-document summarization, where a summary containing the information most relevant to a user’s information need is produced from a set of topic-related documents. We propose a new method based on probabilistic latent semantic analysis, which allows us to represent sentences and queries as probability distributions over latent topics. Our approach combines query-focused and thematic features computed in the latent topic space to estimate the summary relevance of sentences. In addition, we evaluate several different similarity measures for computing sentence-level feature scores. Experimental results show that our approach outperforms the best reported results on DUC 2006 data, and also compares well on DUC 2007 data.
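Scikit-learn has no PLSA implementation, so the sketch below substitutes NMF (a close relative of PLSA) to embed sentences and the query in a latent topic space and rank sentences by similarity there; it is an assumption-laden approximation of the approach, not the authors' system.

# Rough sketch (not the authors' PLSA system): use NMF, a close relative of PLSA,
# to place sentences and the query in a latent topic space and rank sentences by
# their similarity to the query in that space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

def query_focused_rank(sentences, query, n_topics=10, n_pick=5):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(sentences)
    nmf = NMF(n_components=n_topics, random_state=0, max_iter=500)
    S = nmf.fit_transform(X)                   # sentence-topic weights
    Q = nmf.transform(vec.transform([query]))  # query in the same topic space
    scores = cosine_similarity(S, Q).ravel()
    order = np.argsort(-scores)[:n_pick]
    return [sentences[i] for i in sorted(order)]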

95 citations


Posted Content
TL;DR: This paper proposes text summarization based on fuzzy logic to improve the quality of the summary created by the general statistic method, and compares the results with the baseline summarizer and Microsoft Word 2007 summarizers.
Abstract: Text summarization can be classified into two approaches: extraction and abstraction. This paper focuses on the extraction approach. The goal of text summarization based on the extraction approach is sentence selection. One of the methods to obtain the suitable sentences is to assign some numerical measure to a sentence, called sentence weighting, and then select the best ones. The first step in summarization by extraction is the identification of important features. In our experiment, we used 125 test documents in the DUC2002 data set. Each document is prepared by a preprocessing process: sentence segmentation, tokenization, removing stop words, and word stemming. Then, we used 8 important features and calculated their score for each sentence. We proposed text summarization based on fuzzy logic to improve the quality of the summary created by the general statistic method. We compared our results with the baseline summarizer and Microsoft Word 2007 summarizers. The results show that the best average precision, recall, and f-measure for the summaries were obtained by the fuzzy method.
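For flavour only, the sketch below turns crisp feature values into fuzzy low/medium/high memberships with triangular functions and combines them into a sentence score; the authors' actual eight-feature rule base and defuzzification scheme are not reproduced, and the feature names in the usage line are hypothetical.

# Flavour-only sketch of fuzzy sentence scoring (not the authors' rule base):
# triangular membership functions turn crisp feature values into low/medium/high
# degrees, and a simple weighted rule combines them into a sentence score.
def tri(x, a, b, c):
    """Triangular membership function peaking at b on support [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_sentence_score(features):
    """features: dict of feature values in [0, 1], e.g. title similarity, position."""
    score = 0.0
    for value in features.values():
        low = tri(value, -0.5, 0.0, 0.5)
        medium = tri(value, 0.0, 0.5, 1.0)
        high = tri(value, 0.5, 1.0, 1.5)
        # Simple rule: high membership pushes the score up, low pulls it down.
        score += high - low + 0.5 * medium
    return score / max(len(features), 1)

# Hypothetical usage:
# fuzzy_sentence_score({"title_similarity": 0.8, "position": 0.9, "term_freq": 0.4})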

84 citations


Proceedings ArticleDOI
31 May 2009
TL;DR: This paper presents an investigation into contrastive summarization through an implementation and evaluation of a contrastive opinion summarizer in the consumer reviews domain.
Abstract: Contrastive summarization is the problem of jointly generating summaries for two entities in order to highlight their differences. In this paper we present an investigation into contrastive summarization through an implementation and evaluation of a contrastive opinion summarizer in the consumer reviews domain.

Proceedings ArticleDOI
11 Sep 2009
TL;DR: The purpose of the present paper is to show that the summarization result depends not only on the sentence features but also on the sentence similarity measure, and that the proposed approach can improve the performance compared to other summarization methods.
Abstract: The technology of automatic text summarization plays an important role in information retrieval and text classification, and may provide a solution to the information overload problem. Text summarization is a process of reducing the size of a text while preserving its information content. This paper proposes a sentence-clustering based summarization approach. The proposed approach consists of three steps: it first clusters the sentences based on the semantic distance among sentences in the document, then for each cluster calculates the accumulative sentence similarity based on the multi-feature combination method, and at last chooses the topic sentences by some extraction rules. The purpose of the present paper is to show that the summarization result depends not only on the sentence features but also on the sentence similarity measure. The experimental results on the DUC 2003 dataset show that our proposed approach can improve the performance compared to other summarization methods.

Proceedings ArticleDOI
08 Feb 2009
TL;DR: A new personalized document summarization algorithm that relies on the attention (reading) time individual users spend on single words in a document as the essential clue, and summarizes the document according to the user's attention on every individual word.
Abstract: We propose a new document summarization algorithm which is personalized. The key idea is to rely on the attention (reading) time individual users spend on single words in a document as the essential clue. The prediction of user attention over every word in a document is based on the user's attention during his previous reads, which is acquired via a vision-based commodity eye-tracking mechanism. Once the user's attention over a small collection of words is known, our algorithm can predict the user's attention over every word in the document through word semantics analysis. Our algorithm then summarizes the document according to user attention on every individual word in the document. With our algorithm, we have developed a document summarization prototype system. Summaries produced by our algorithm are compared with ones manually created by users as well as with those of commercial summarization software, which clearly demonstrates the advantages of our new algorithm for user-oriented document summarization.
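A toy sketch of the final scoring step only is given below: assuming per-word attention weights are already available (measured by the eye tracker or predicted upstream), sentences are ranked by the attention mass they carry; the gaze acquisition and semantic propagation components are out of scope here.

# Toy sketch of the last step only: given per-word attention weights (measured or
# predicted upstream), score sentences by the attention mass they carry.
import re

def attention_rank(sentences, word_attention, n_pick=3):
    """word_attention: dict mapping word -> attention weight (e.g. gaze duration)."""
    default = sum(word_attention.values()) / max(len(word_attention), 1)
    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_attention.get(w, default) for w in words) / max(len(words), 1)
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_pick])]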

Proceedings Article
11 Jul 2009
TL;DR: This paper proposes to use the multi-modality manifold-ranking algorithm for extracting a topic-focused summary from multiple documents by considering the within-document sentence relationships and the cross-document sentence relationships as two separate modalities (graphs).
Abstract: Graph-based manifold-ranking methods have been successfully applied to topic-focused multi-document summarization. This paper further proposes to use the multi-modality manifold-ranking algorithm for extracting a topic-focused summary from multiple documents by considering the within-document sentence relationships and the cross-document sentence relationships as two separate modalities (graphs). Three different fusion schemes, namely the linear form, the sequential form and the score combination form, are exploited in the algorithm. Experimental results on the DUC benchmark datasets demonstrate the effectiveness of the proposed multi-modality learning algorithms with all three fusion schemes.
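For orientation, the sketch below implements plain single-modality manifold ranking over a cosine-similarity sentence graph (the classic f <- alpha*S*f + (1-alpha)*y iteration); the paper's contribution, fusing two such graphs under three schemes, is not reproduced.

# Compact sketch of plain (single-modality) manifold ranking over a sentence graph;
# the paper fuses two such graphs, which is omitted here.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def manifold_rank(sentences, topic, alpha=0.85, iters=100):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(sentences + [topic])
    W = cosine_similarity(X[:-1])                 # sentence-sentence affinities
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    d[d == 0] = 1e-12
    S = W / np.sqrt(np.outer(d, d))               # symmetric normalisation
    y = cosine_similarity(X[:-1], X[-1]).ravel()  # prior relevance to the topic
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y
    return f                                      # ranking scores, higher is more relevant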

Proceedings ArticleDOI
01 Dec 2009
TL;DR: This paper presents a taxonomy of summarization systems, defines the most important criteria for a summary which can be generated by a system, and goes through the main criteria for evaluating a text summary.
Abstract: Text summarization systems are among the most attractive research areas nowadays. Summarization systems offer the possibility of finding the main points of texts, so the user spends less time on reading the whole document. Different types of summary might be useful in various applications, and summarization systems can be categorized based on these types. This paper presents a taxonomy of summarization systems and defines the most important criteria for a summary which can be generated by a system. Additionally, different methods of text summarization as well as the main steps of the summarization process are discussed. We also go through the main criteria for evaluating a text summary.

Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper proposes an approach for multi-document video summarization by exploring the redundancy between different videos, and shows that multi-document video summarization presents more elegant and informative summaries compared with the single-document approach.
Abstract: Most previous works on video summarization target a single video document. With the popularity of video corpora (e.g. news video archives) and web videos, video articles that consist of a set of relevant videos are frequently confronted by users. With traditional single-document summarization, these videos are treated independently and the results are usually redundant due to the lack of inter-video analysis. To efficiently manage video articles, in this paper we propose an approach for multi-document video summarization by exploring the redundancy between different videos. The importance of keyframes is first measured by the content inclusion based on intra- and inter-video similarities. We then propose a Minimum Description Length (MDL) criterion for automatically determining the appropriate length of the summary. Finally a video summary is generated for users to browse the content of the whole video article. We show that multi-document video summarization presents more elegant and informative summaries compared with the single-document approach.

Proceedings ArticleDOI
01 Feb 2009
TL;DR: The so-called n-grams and maximal frequent word sequences are employed as features in a vector space model in order to determine the advantages and disadvantages for extractive text summarization.
Abstract: The main problem for generating an extractive automatic text summary is to detect the most relevant information in the source document. For such purpose, recently some approaches have successfully employed the word sequence information from the self-text for detecting the candidate text fragments for composing the summary. In this paper, we employ the so-called n-grams and maximal frequent word sequences as features in a vector space model in order to determine the advantages and disadvantages for extractive text summarization.
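As a hedged illustration of the n-gram vector space side only (maximal frequent word sequence mining is omitted), sentences can be represented with word n-grams and scored against the document centroid:

# Hedged sketch: represent sentences with word n-grams in a vector space and score
# them by similarity to the document centroid; maximal frequent word sequences,
# which the paper also evaluates, are not covered here.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ngram_rank(sentences, ngram_range=(1, 2), n_pick=3):
    X = TfidfVectorizer(ngram_range=ngram_range, stop_words="english").fit_transform(sentences)
    centroid = np.asarray(X.mean(axis=0))
    scores = cosine_similarity(X, centroid).ravel()
    order = np.argsort(-scores)[:n_pick]
    return [sentences[i] for i in sorted(order)]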

Journal ArticleDOI
TL;DR: This paper proposes an unsupervised document summarization method that creates the summary by clustering and extracting sentences from the original document, and develops a discrete differential evolution algorithm to optimize the criterion functions.
Abstract: Text summarization is the process of automatically creating a compressed version of a given document preserving its information content. There are two types of summarization: extractive and abstractive. Extractive summarization methods simplify the problem of summarization into the problem of selecting a representative subset of the sentences in the original documents. Abstractive summarization may compose novel sentences, unseen in the original sources. In our study we focus on sentence-based extractive document summarization. The extractive summarization systems are typically based on techniques for sentence extraction and aim to cover the set of sentences that are most important for the overall understanding of a given document. In this paper, we propose an unsupervised document summarization method that creates the summary by clustering and extracting sentences from the original document. For this purpose new criterion functions for sentence clustering have been proposed. Similarity measures play an increasingly important role in document clustering. Here we've also developed a discrete differential evolution algorithm to optimize the criterion functions. The experimental results show that our suggested approach can improve the performance compared to state-of-the-art summarization approaches.

Journal ArticleDOI
TL;DR: This paper provides detailed descriptions of a proposed new algorithm for video summarization that does not require high-level semantics such as object detection and speech/audio analysis, which provides a more flexible and general solution for this topic.
Abstract: In this paper, we provide detailed descriptions of a proposed new algorithm for video summarization, which are also included in our submission to TRECVID'08 on BBC rush summarization. Firstly, rush videos are hierarchically modeled using the formal language technique. Secondly, shot detections are applied to introduce a new concept of V-unit for structuring videos in line with the hierarchical model, and thus junk frames within the model are effectively removed. Thirdly, adaptive clustering is employed to group shots into clusters to determine retakes for redundancy removal. Finally, each most representative shot selected from every cluster is ranked according to its length and sum of activity level for summarization. Competitive results have been achieved to prove the effectiveness and efficiency of our techniques, which are fully implemented in the compressed domain. Our work does not require high-level semantics such as object detection and speech/audio analysis which provides a more flexible and general solution for this topic.

Journal ArticleDOI
23 Oct 2009
TL;DR: A generic text summarization method that generates summaries of Turkish texts by ranking sentences according to their scores calculated using their surface level features and extracting the highest ranked ones from the original documents.
Abstract: In this paper, we propose a generic text summarization method that generates summaries of Turkish texts by ranking sentences according to their scores calculated using their surface level features and extracting the highest ranked ones from the original documents. In order to extract sentences which form a summary with an extensive coverage of main content of the text and less redundancy, we use the features such as term frequency, key phrase, centrality, title similarity and position of the sentence in the original text. Sentence rank is computed using a score function that uses its feature values and the weights of the features. The best feature weights are learned using machine learning techniques with the help of human constructed summaries. Performance evaluation is conducted by comparing summarization outputs with manual summaries generated by 25 independent human evaluators. This paper presents one of the first Turkish summarization systems, and its results are promising.

Proceedings ArticleDOI
02 Nov 2009
TL;DR: A multi-document generic summarization model based on the budgeted median problem that covers the entire relevant part of the document cluster through sentence assignment and can incorporate asymmetric relations between sentences such as textual entailment.
Abstract: We propose a multi-document generic summarization model based on the budgeted median problem. Our model selects sentences to generate a summary so that every sentence in the document cluster can be assigned to and be represented by a sentence in the summary as much as possible. The advantage of this model is that it covers the entire relevant part of the document cluster through sentence assignment and can incorporate asymmetric relations between sentences such as textual entailment.
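One way to read the budgeted median formulation is sketched below with PuLP: every sentence must be assigned to some selected summary sentence, assignments are rewarded by similarity, and the selected sentences must fit a length budget. The paper's exact objective and constraints may differ; the similarity matrix and sentence lengths are assumed to be given.

# One reading of the budgeted median formulation, sketched with PuLP (the paper's
# exact objective and constraints may differ): pick summary sentences within a length
# budget so that every sentence can be assigned to a similar selected sentence.
import pulp

def budgeted_median_summary(sentences, sim, lengths, budget=100):
    n = len(sentences)
    prob = pulp.LpProblem("budgeted_median", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x{j}", cat="Binary") for j in range(n)]  # sentence j selected
    z = [[pulp.LpVariable(f"z{i}_{j}", cat="Binary") for j in range(n)] for i in range(n)]
    # Reward assigning each sentence i to a similar selected sentence j.
    prob += pulp.lpSum(sim[i][j] * z[i][j] for i in range(n) for j in range(n))
    for i in range(n):
        prob += pulp.lpSum(z[i][j] for j in range(n)) == 1   # every sentence assigned
        for j in range(n):
            prob += z[i][j] <= x[j]                          # only to selected sentences
    prob += pulp.lpSum(lengths[j] * x[j] for j in range(n)) <= budget
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [sentences[j] for j in range(n) if x[j].value() == 1]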

Journal Article
Delia Rusu (Technical University of Cluj-Napoca, Faculty of Automation and Computer Science, G. Bariţiu 26-28, 400027 Cluj-Napoca, Romania; delia.rusu@gmail.com); Blaž Fortuna, Marko Grobelnik and Dunja Mladenic (Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia; {blaz.fortuna, marko.grobelnik, dunja.mladenic}@ijs.si)

Proceedings ArticleDOI
Junyan Zhu, Can Wang, Xiaofei He, Jiajun Bu, Chun Chen, Shujie Shang, Mingcheng Qu, Gang Lu
20 Apr 2009
TL;DR: Experimental results show the tag-oriented summarization yields a significant improvement over approaches that do not use tags, and a new tag ranking algorithm named EigenTag is proposed in this paper to reduce noise in tags.
Abstract: Social annotations on a Web document are a highly generalized description of the topics contained in that page. Their tagging frequency indicates user attention to various degrees. This makes annotations a good resource for summarizing multiple topics in a Web page. In this paper, we present a tag-oriented Web document summarization approach that uses both the document content and the tags annotated on that document. To improve summarization performance, a new tag ranking algorithm named EigenTag is proposed in this paper to reduce noise in tags. Meanwhile, an association mining technique is employed to expand the tag set to tackle the sparsity problem. Experimental results show our tag-oriented summarization yields a significant improvement over approaches that do not use tags.
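The EigenTag algorithm itself is not reproduced here; as a rough eigenvector-style stand-in, tags can be ranked by eigenvector centrality on a tag co-occurrence graph:

# Rough stand-in for an eigenvector-style tag ranking (not the paper's EigenTag
# algorithm): build a tag co-occurrence graph and rank tags by eigenvector centrality.
from itertools import combinations
import networkx as nx

def rank_tags(tagged_documents):
    """tagged_documents: list of tag sets, one per annotated document/bookmark."""
    g = nx.Graph()
    for tags in tagged_documents:
        for a, b in combinations(sorted(set(tags)), 2):
            w = g[a][b]["weight"] + 1 if g.has_edge(a, b) else 1
            g.add_edge(a, b, weight=w)
    centrality = nx.eigenvector_centrality(g, weight="weight", max_iter=1000)
    return sorted(centrality, key=centrality.get, reverse=True)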

Proceedings ArticleDOI
04 Jun 2009
TL;DR: It is shown that the summarizer built is able to outperform most systems participating in task focused summarization evaluations at Text Analysis Conferences (TAC) 2008 and would perform better at producing short summaries than longer summaries.
Abstract: In this paper, we describe a sentence position based summarizer that is built based on a sentence position policy, created from the evaluation testbed of recent summarization tasks at the Document Understanding Conferences (DUC). We show that the summarizer thus built is able to outperform most systems participating in task focused summarization evaluations at the Text Analysis Conference (TAC) 2008. Our experiments also show that such a method would perform better at producing short summaries (up to 100 words) than longer summaries. Further, we discuss the baselines traditionally used for summarization evaluation and suggest the revival of an old baseline to suit the current summarization task at TAC: the Update Summarization task.
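A minimal sketch of a position-policy summarizer in this family is given below: it takes leading sentences from each document in round-robin order until a word budget is met. The actual policy in the paper is derived from DUC evaluation data rather than hard-coded like this.

# Minimal sketch of a position-policy summarizer: take leading sentences from each
# document in round-robin order until the word budget is reached (the paper's policy
# is learned from DUC data, not fixed like this).
def position_summary(documents, budget_words=100):
    """documents: list of documents, each a list of sentences in original order."""
    summary, words, rank = [], 0, 0
    while words < budget_words and any(rank < len(d) for d in documents):
        for doc in documents:
            if rank < len(doc) and words < budget_words:
                summary.append(doc[rank])
                words += len(doc[rank].split())
        rank += 1
    return summary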

Proceedings ArticleDOI
16 Sep 2009
TL;DR: The development of the summarizer is described which is based on Iterative Residual Rescaling (IRR) that creates the latent semantic space of a set of documents under consideration that enables to control the influence of major and minor topics in the latent space.
Abstract: This paper deals with our recent research in text summarization. The field has moved from multi-document summarization to update summarization. When producing an update summary of a set of topic-related documents, the summarizer assumes prior knowledge of the reader determined by a set of older documents on the same topic. The update summarizer thus must solve a novelty vs. redundancy problem. We describe the development of our summarizer, which is based on Iterative Residual Rescaling (IRR), a technique that creates the latent semantic space of a set of documents under consideration. IRR generalizes Singular Value Decomposition (SVD) and enables control of the influence of major and minor topics in the latent space. Our sentence-extractive summarization method computes the redundancy, novelty and significance of each topic. These values are finally used in the sentence selection process. The sentence selection component prevents inner summary redundancy. The results of our participation in the TAC evaluation seem promising.

Book ChapterDOI
19 May 2009
TL;DR: This paper presents NEO-CORTEX, a multi-document summarization system based on the existing CORTEX system that achieves good performance on the topic-oriented multi-document summarization task.
Abstract: This paper discusses an approach to topic-oriented multi-document summarization. It investigates the effectiveness of using additional information about the document set as a whole, as well as about individual documents. We present NEO-CORTEX, a multi-document summarization system based on the existing CORTEX system. Results are reported for experiments with a document base formed from the NIST DUC-2005 and DUC-2006 data. Our experiments have shown that NEO-CORTEX is an effective system and achieves good performance on the topic-oriented multi-document summarization task.

Journal ArticleDOI
TL;DR: A novel training data selection approach that leverages the relevance information of spoken sentences to select reliable document-summary pairs derived by the probabilistic generative summarizers is explored for training the classification-based summarizers.
Abstract: Extractive document summarization automatically selects a number of indicative sentences, passages, or paragraphs from an original document according to a target summarization ratio, and sequences them to form a concise summary. In this article, we present a comparative study of various probabilistic ranking models for spoken document summarization, including supervised classification-based summarizers and unsupervised probabilistic generative summarizers. We also investigate the use of unsupervised summarizers to improve the performance of supervised summarizers when manual labels are not available for training the latter. A novel training data selection approach that leverages the relevance information of spoken sentences to select reliable document-summary pairs derived by the probabilistic generative summarizers is explored for training the classification-based summarizers. Encouraging initial results on Mandarin Chinese broadcast news data are demonstrated.

01 Jan 2009
TL;DR: A summarization method, which combines several domain specific features with some other known features such as term frequency, title and position to improve the summarization performance in the medical domain is discussed.
Abstract: Medical literature on the web is an important source to help clinicians in patient care. Initially, the clinicians go through the author-written abstracts or summaries available with the medical articles to decide whether articles are suitable for in-depth study. Since all medical articles do not come with author-written abstracts or summaries, automatic summarization of medical articles will help clinicians or medical students to find the relevant information on the web rapidly. In this paper we discuss a summarization method, which combines several domain-specific features with some other known features such as term frequency, title and position to improve the summarization performance in the medical domain. Our experiments show that the incorporation of domain-specific features improves the summarization performance.
Index Terms — Text summarization, Domain-specific features, Novel medical term detection

Book ChapterDOI
15 May 2009
TL;DR: A Support Vector Machine (SVM) based ensemble approach to combat the extractive multi-document summarization problem using a committee of several SVMs to form an ensemble of classifiers where the strategy is to improve the performance by correcting errors of one classifier using the accurate output of others.
Abstract: In this paper, we present a Support Vector Machine (SVM) based ensemble approach to combat the extractive multi-document summarization problem. Although SVM can have a good generalization ability, it may experience a performance degradation through wrong classifications. We use a committee of several SVMs, i.e. Cross-Validation Committees (CVC), to form an ensemble of classifiers where the strategy is to improve the performance by correcting errors of one classifier using the accurate output of others. The practicality and effectiveness of this technique is demonstrated using the experimental results.
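A compact sketch of the cross-validation-committee idea using scikit-learn follows: one SVM is trained per fold on sentence feature vectors labelled summary/non-summary, and the committee's decision scores are averaged at prediction time; the paper's feature design and error-correction strategy are not reproduced.

# Compact sketch of a cross-validation committee of SVMs for sentence classification
# (summary vs. non-summary); feature engineering and the paper's error-correction
# strategy are not reproduced here.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def train_svm_committee(X, y, n_folds=5):
    """X: numpy array of sentence features; y: 1 if the sentence is in a reference summary."""
    committee = []
    for train_idx, _ in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        clf = SVC(kernel="rbf", C=1.0)
        clf.fit(X[train_idx], y[train_idx])
        committee.append(clf)
    return committee

def committee_scores(committee, X_new):
    # Average the signed margin distances across committee members.
    return np.mean([clf.decision_function(X_new) for clf in committee], axis=0)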

Proceedings ArticleDOI
Xiaojun Wan
02 Nov 2009
TL;DR: The related subtopics are discovered from the topic's narrative text or document set through topic analysis techniques and the multi-modality manifold-ranking method is proposed to evaluate and rank sentences by fusing the multiple modalities.
Abstract: Topic-focused multi-document summarization has been a challenging task because the created summary is required to be biased to the given topic or query. Existing methods consider the given topic as a single coarse unit and then directly incorporate the relevance between each sentence and the single topic into the sentence evaluation process. However, the given topic is usually not well-defined and it consists of a few explicit or implicit subtopics. In this study, the related subtopics are discovered from the topic's narrative text or document set through topic analysis techniques. Then, the sentence relationships against each subtopic are considered as an individual modality and the multi-modality manifold-ranking method is proposed to evaluate and rank sentences by fusing the multiple modalities. Experimental results on the DUC benchmark datasets show the promising results of our proposed methods.