
Showing papers on "Multi-document summarization" published in 2009


Proceedings ArticleDOI
31 May 2009
TL;DR: The final model, HierSum, utilizes a hierarchical LDA-style model (Blei et al., 2004) to represent content specificity as a hierarchy of topic vocabulary distributions; it yields state-of-the-art ROUGE performance and, in pairwise user evaluation, strongly outperforms Toutanova et al. (2007)'s state-of-the-art discriminative system.
Abstract: We present an exploration of generative probabilistic models for multi-document summarization. Beginning with a simple word frequency based model (Nenkova and Vanderwende, 2005), we construct a sequence of models each injecting more structure into the representation of document set content and exhibiting ROUGE gains along the way. Our final model, HierSum, utilizes a hierarchical LDA-style model (Blei et al., 2004) to represent content specificity as a hierarchy of topic vocabulary distributions. At the task of producing generic DUC-style summaries, HierSum yields state-of-the-art ROUGE performance and in pairwise user evaluation strongly outperforms Toutanova et al. (2007)'s state-of-the-art discriminative system. We also explore HierSum's capacity to produce multiple 'topical summaries' in order to facilitate content discovery and navigation.
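As a point of reference for the model sequence described above, the following is a minimal Python sketch of a SumBasic-style word-frequency summarizer in the spirit of Nenkova and Vanderwende (2005), the starting point of the paper; it is an illustrative simplification, not HierSum, and all function names are invented for the example.

# Minimal sketch of a SumBasic-style frequency summarizer (illustrative only;
# this is the kind of baseline the paper starts from, not HierSum itself).
from collections import Counter
import re

def _tokens(text):
    return re.findall(r"[a-z']+", text.lower())

def sumbasic_summary(sentences, max_words=100):
    counts = Counter(w for s in sentences for w in _tokens(s))
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}
    summary, length = [], 0
    candidates = list(sentences)
    while candidates and length < max_words:
        # Pick the candidate whose words have the highest average probability.
        best = max(candidates,
                   key=lambda s: sum(prob[w] for w in _tokens(s)) / max(len(_tokens(s)), 1))
        summary.append(best)
        length += len(_tokens(best))
        candidates.remove(best)
        for w in set(_tokens(best)):
            prob[w] *= prob[w]  # discount covered words to curb redundancy
    return summary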

521 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: The proposed methods are quite general and can be used to automatically generate a rated aspect summary given any collection of short comments, each associated with an overall rating.
Abstract: Web 2.0 technologies have enabled more and more people to freely comment on different kinds of entities (e.g. sellers, products, services). The large scale of information poses the need and challenge of automatic summarization. In many cases, each of the user-generated short comments comes with an overall rating. In this paper, we study the problem of generating a "rated aspect summary" of short comments, which is a decomposed view of the overall ratings for the major aspects so that a user could gain different perspectives towards the target entity. We formally define the problem and decompose the solution into three steps. We demonstrate the effectiveness of our methods by using eBay sellers' feedback comments. We also quantitatively evaluate each step of our methods and study how well humans agree on such a summarization task. The proposed methods are quite general and can be used to automatically generate a rated aspect summary given any collection of short comments, each associated with an overall rating.

381 citations


Journal ArticleDOI
TL;DR: The purpose of the present paper is to show that the summarization result depends not only on the optimized function but also on the similarity measure.
Abstract: The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is a process of automatically creating a compressed version of a given document that provides useful information to users, and multi-document summarization is to produce a summary delivering the majority of information content from a set of documents about an explicit or implicit main topic. In our study we focus on sentence-based extractive document summarization. We propose a generic document summarization method which is based on sentence clustering. The proposed approach continues the sentence-clustering based extractive summarization methods proposed in Alguliev [Alguliev, R. M., Aliguliyev, R. M., Bagirov, A. M. (2005). Global optimization in the summarization of text documents. Automatic Control and Computer Sciences 39, 42-47], Aliguliyev [Aliguliyev, R. M. (2006). A novel partitioning-based clustering method and generic document summarization. In Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology (WI-IAT 2006 Workshops) (WI-IATW'06), 18-22 December (pp. 626-629) Hong Kong, China], Alguliev and Alyguliev [Alguliev, R. M., Alyguliev, R. M. (2007). Summarization of text-based documents with a determination of latent topical sections and information-rich sentences. Automatic Control and Computer Sciences 41, 132-140] and Aliguliyev [Aliguliyev, R. M. (2007). Automatic document summarization by sentence extraction. Journal of Computational Technologies 12, 5-15]. The purpose of the present paper is to show that the summarization result depends not only on the optimized function but also on the similarity measure. The experimental results on the open benchmark datasets DUC01 and DUC02 show that our proposed approach can improve the performance compared to state-of-the-art summarization approaches.
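To make the paper's point concrete, the sketch below clusters sentences and extracts one representative per cluster with a pluggable similarity measure; it is a simplified illustration under assumed choices (TF-IDF vectors, k-means), not the authors' optimization-based method.

# Illustrative sketch (not the authors' algorithm): cluster sentences and pick one
# representative per cluster, with the similarity measure left pluggable to show
# how it affects the extracted summary.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def jaccard_sim(a, b):
    sa, sb = set(np.nonzero(a)[0]), set(np.nonzero(b)[0])
    return len(sa & sb) / max(len(sa | sb), 1)

def cluster_summary(sentences, n_clusters=3, sim=cosine_sim):
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences).toarray()
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    summary = []
    for k in range(n_clusters):
        idx = np.where(labels == k)[0]
        centroid = X[idx].mean(axis=0)
        # Which sentence "represents" the cluster depends on the similarity used here.
        best = max(idx, key=lambda i: sim(X[i], centroid))
        summary.append(sentences[best])
    return summary

Passing sim=jaccard_sim instead of the default cosine_sim typically changes which sentences are extracted, which is exactly the dependence on the similarity measure that the paper studies.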

271 citations


Journal ArticleDOI
TL;DR: This work proposes an automatic summarization approach based on the analysis of review articles' internal topic structure to assemble customer concerns and shows that the proposed approach outperforms the peer approaches, i.e. opinion mining and clustering-summarization, in terms of users' responsiveness and its ability to discover the most important topics.
Abstract: Product reviews possess critical information regarding customers' concerns and their experience with the product. Such information is considered essential to firms' business intelligence which can be utilized for the purpose of conceptual design, personalization, product recommendation, better customer understanding, and finally attracting more loyal customers. Previous studies of deriving useful information from customer reviews focused mainly on numerical and categorical data. Textual data have been somewhat ignored although they are deemed valuable. Existing methods of opinion mining in processing customer reviews concentrate on counting positive and negative comments of review writers, which is not enough to cover all important topics and concerns across different review articles. Instead, we propose an automatic summarization approach based on the analysis of review articles' internal topic structure to assemble customer concerns. Different from the existing summarization approaches centered on sentence ranking and clustering, our approach discovers and extracts salient topics from a set of online reviews and further ranks these topics. The final summary is then generated based on the ranked topics. The experimental study and evaluation show that the proposed approach outperforms the peer approaches, i.e. opinion mining and clustering-summarization, in terms of users' responsiveness and its ability to discover the most important topics.

184 citations


Proceedings ArticleDOI
04 Aug 2009
TL;DR: A new Bayesian sentence-based topic model for summarization by making use of both the term-document and term-sentence associations is proposed and an efficient variational Bayesian algorithm is derived for model parameter estimation.
Abstract: Most of the existing multi-document summarization methods decompose the documents into sentences and work directly in the sentence space using a term-sentence matrix. However, the knowledge on the document side, i.e. the topics embedded in the documents, can help the context understanding and guide the sentence selection in the summarization procedure. In this paper, we propose a new Bayesian sentence-based topic model for summarization by making use of both the term-document and term-sentence associations. An efficient variational Bayesian algorithm is derived for model parameter estimation. Experimental results on benchmark data sets show the effectiveness of the proposed model for the multi-document summarization task.
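The following is only a rough analogue of the idea, not the paper's Bayesian sentence-based topic model or its variational algorithm: it fits an off-the-shelf LDA model over a term-sentence matrix and favours sentences aligned with the document set's dominant topics.

# Rough analogue (assumption-laden, not the paper's model): fit LDA over sentences
# and favour sentences that concentrate on the dominant topics of the document set.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_rank_sentences(sentences, n_topics=5, n_pick=3):
    X = CountVectorizer(stop_words="english").fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(X)        # sentence-topic distributions
    doc_topics = theta.mean(axis=0)     # aggregate topic weights for the set
    scores = theta @ doc_topics         # sentences aligned with dominant topics
    order = np.argsort(-scores)[:n_pick]
    return [sentences[i] for i in sorted(order)]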

175 citations


Journal ArticleDOI
TL;DR: A set of 14 summarizers, generically referred to as CN-Summ, is developed, employing network concepts such as node degree, length of shortest paths, d-rings and k-cores to select sentences for an extractive summary of texts.

139 citations


Proceedings ArticleDOI
04 Jun 2009
TL;DR: This work proposes a one-step approach for document summarization that jointly performs sentence extraction and compression by solving an integer linear program.
Abstract: Text summarization is one of the oldest problems in natural language processing. Popular approaches rely on extracting relevant sentences from the original documents. As a side effect, sentences that are too long but partly relevant are doomed to either not appear in the final summary, or prevent inclusion of other relevant sentences. Sentence compression is a recent framework that aims to select the shortest subsequence of words that yields an informative and grammatical sentence. This work proposes a one-step approach for document summarization that jointly performs sentence extraction and compression by solving an integer linear program. We report favorable experimental results on newswire data.
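A stripped-down sketch of the extraction side of such an integer linear program is shown below using the PuLP modelling library; the paper's actual program also encodes compression decisions, which are omitted here, and the relevance scores are assumed to be given.

# Sketch of ILP-based sentence selection under a length budget (extraction only;
# the paper's full program also handles compression). Requires: pip install pulp
import pulp

def ilp_select(sentences, relevance, budget_words=100):
    lengths = [len(s.split()) for s in sentences]
    prob = pulp.LpProblem("summary", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(sentences))]
    # Objective: total relevance of the selected sentences.
    prob += pulp.lpSum(relevance[i] * x[i] for i in range(len(sentences)))
    # Budget constraint: keep the summary under the word limit.
    prob += pulp.lpSum(lengths[i] * x[i] for i in range(len(sentences))) <= budget_words
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [s for s, xi in zip(sentences, x) if xi.value() == 1]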

132 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: This paper considers extract-based summarization as learning a mapping from a set of sentences of a given document to a subset of the sentences that satisfies the following three requirements: diversity in summarization, sufficient coverage, and balance.
Abstract: Document summarization plays an increasingly important role with the exponential growth of documents on the Web. Many supervised and unsupervised approaches have been proposed to generate summaries from documents. However, these approaches seldom simultaneously consider summary diversity, coverage, and balance issues, which to a large extent determine the quality of summaries. In this paper, we consider extract-based summarization emphasizing the following three requirements: 1) diversity in summarization, which seeks to reduce redundancy among sentences in the summary; 2) sufficient coverage, which focuses on avoiding the loss of the document's main information when generating the summary; and 3) balance, which demands that different aspects of the document have about the same relative importance in the summary. We formulate the extract-based summarization problem as learning a mapping from a set of sentences of a given document to a subset of the sentences that satisfies the above three requirements. The mapping is learned by incorporating several constraints in a structure learning framework, and we explore the graph structure of the output variables and employ structural SVM for solving the resulting optimization problem. Experiments on the DUC2001 data sets demonstrate significant performance improvements in terms of F1 and ROUGE metrics.
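The paper learns the sentence-subset mapping with a structural SVM; as a far simpler illustration of the coverage/diversity trade-off itself (not of the learning method), an MMR-style greedy selector can be sketched as follows.

# MMR-style greedy selector, much simpler than the paper's structural-SVM method:
# trade coverage (similarity to the document centroid) against diversity
# (dissimilarity to already selected sentences).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_select(sentences, k=3, lam=0.7):
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    doc = np.asarray(X.mean(axis=0))            # document centroid
    cover = cosine_similarity(X, doc).ravel()    # coverage term
    sim = cosine_similarity(X)                   # sentence-sentence similarity
    selected = []
    while len(selected) < min(k, len(sentences)):
        def mmr(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * cover[i] - (1 - lam) * redundancy
        best = max((i for i in range(len(sentences)) if i not in selected), key=mmr)
        selected.append(best)
    return [sentences[i] for i in sorted(selected)]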

126 citations


Proceedings Article
01 Sep 2009
TL;DR: This work proposes a new method based on probabilistic latent semantic analysis, which allows for sentences and queries to be represented as probability distributions over latent topics, to estimate the summary relevance of sentences.
Abstract: We consider the problem of query-focused multi-document summarization, where a summary containing the information most relevant to a user’s information need is produced from a set of topic-related documents. We propose a new method based on probabilistic latent semantic analysis, which allows us to represent sentences and queries as probability distributions over latent topics. Our approach combines query-focused and thematic features computed in the latent topic space to estimate the summary relevance of sentences. In addition, we evaluate several different similarity measures for computing sentence-level feature scores. Experimental results show that our approach outperforms the best reported results on DUC 2006 data, and also compares well on DUC 2007 data.
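Scikit-learn has no PLSA implementation, so the sketch below substitutes NMF (a close relative of PLSA) to embed sentences and the query in a latent topic space and rank sentences by similarity there; it is an assumption-laden approximation of the approach, not the authors' system.

# Rough sketch (not the authors' PLSA system): use NMF, a close relative of PLSA,
# to place sentences and the query in a latent topic space and rank sentences by
# their similarity to the query in that space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

def query_focused_rank(sentences, query, n_topics=10, n_pick=5):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(sentences)
    nmf = NMF(n_components=n_topics, random_state=0, max_iter=500)
    S = nmf.fit_transform(X)                   # sentence-topic weights
    Q = nmf.transform(vec.transform([query]))  # query in the same topic space
    scores = cosine_similarity(S, Q).ravel()
    order = np.argsort(-scores)[:n_pick]
    return [sentences[i] for i in sorted(order)]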

95 citations


Posted Content
TL;DR: This paper proposes text summarization based on fuzzy logic to improve the quality of the summary created by the general statistic method, and compares the results with the baseline summarizer and Microsoft Word 2007 summarizers.
Abstract: Text summarization can be classified into two approaches: extraction and abstraction. This paper focuses on the extraction approach. The goal of text summarization based on the extraction approach is sentence selection. One of the methods to obtain the suitable sentences is to assign some numerical measure to a sentence, called sentence weighting, and then select the best ones. The first step in summarization by extraction is the identification of important features. In our experiment, we used 125 test documents in the DUC2002 data set. Each document is prepared by a preprocessing process: sentence segmentation, tokenization, removing stop words, and word stemming. Then, we used 8 important features and calculated their score for each sentence. We proposed text summarization based on fuzzy logic to improve the quality of the summary created by the general statistic method. We compared our results with the baseline summarizer and Microsoft Word 2007 summarizers. The results show that the best average precision, recall, and f-measure for the summaries were obtained by the fuzzy method.
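For flavour only, the sketch below turns crisp feature values into fuzzy low/medium/high memberships with triangular functions and combines them into a sentence score; the authors' actual eight-feature rule base and defuzzification scheme are not reproduced, and the feature names in the usage line are hypothetical.

# Flavour-only sketch of fuzzy sentence scoring (not the authors' rule base):
# triangular membership functions turn crisp feature values into low/medium/high
# degrees, and a simple weighted rule combines them into a sentence score.
def tri(x, a, b, c):
    """Triangular membership function peaking at b on support [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_sentence_score(features):
    """features: dict of feature values in [0, 1], e.g. title similarity, position."""
    score = 0.0
    for value in features.values():
        low = tri(value, -0.5, 0.0, 0.5)
        medium = tri(value, 0.0, 0.5, 1.0)
        high = tri(value, 0.5, 1.0, 1.5)
        # Simple rule: high membership pushes the score up, low pulls it down.
        score += high - low + 0.5 * medium
    return score / max(len(features), 1)

# Hypothetical usage:
# fuzzy_sentence_score({"title_similarity": 0.8, "position": 0.9, "term_freq": 0.4})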

84 citations


Proceedings ArticleDOI
31 May 2009
TL;DR: This paper presents an investigation into contrastive summarization through an implementation and evaluation of a contrastive opinion summarizer in the consumer reviews domain.
Abstract: Contrastive summarization is the problem of jointly generating summaries for two entities in order to highlight their differences. In this paper we present an investigation into contrastive summarization through an implementation and evaluation of a contrastive opinion summarizer in the consumer reviews domain.

Proceedings ArticleDOI
11 Sep 2009
TL;DR: The purpose of the present paper is to show that the summarization result depends not only on the sentence features but also on the sentence similarity measure, and that the proposed approach can improve the performance compared to other summarization methods.
Abstract: The technology of automatic text summarization plays an important role in information retrieval and text classification, and may provide a solution to the information overload problem. Text summarization is a process of reducing the size of a text while preserving its information content. This paper proposes a sentence-clustering based summarization approach. The proposed approach consists of three steps: it first clusters the sentences based on the semantic distance among sentences in the document, then for each cluster calculates the accumulative sentence similarity based on the multi-feature combination method, and at last chooses the topic sentences by some extraction rules. The purpose of the present paper is to show that the summarization result depends not only on the sentence features but also on the sentence similarity measure. The experimental results on the DUC 2003 dataset show that our proposed approach can improve the performance compared to other summarization methods.

Proceedings ArticleDOI
08 Feb 2009
TL;DR: A new personalized document summarization algorithm that relies on the attention (reading) time individual users spend on single words in a document as the essential clue, and summarizes the document according to the user's attention on every individual word.
Abstract: We propose a new document summarization algorithm which is personalized. The key idea is to rely on the attention (reading) time individual users spend on single words in a document as the essential clue. The prediction of user attention over every word in a document is based on the user's attention during his previous reads, which is acquired via a vision-based commodity eye-tracking mechanism. Once the user's attention over a small collection of words is known, our algorithm can predict the user's attention over every word in the document through word semantics analysis. Our algorithm then summarizes the document according to user attention on every individual word in the document. With our algorithm, we have developed a document summarization prototype system. Summaries produced by our algorithm are compared with ones manually created by users as well as with those of commercial summarization software, which clearly demonstrates the advantages of our new algorithm for user-oriented document summarization.
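A toy sketch of the final scoring step only is given below: assuming per-word attention weights are already available (measured by the eye tracker or predicted upstream), sentences are ranked by the attention mass they carry; the gaze acquisition and semantic propagation components are out of scope here.

# Toy sketch of the last step only: given per-word attention weights (measured or
# predicted upstream), score sentences by the attention mass they carry.
import re

def attention_rank(sentences, word_attention, n_pick=3):
    """word_attention: dict mapping word -> attention weight (e.g. gaze duration)."""
    default = sum(word_attention.values()) / max(len(word_attention), 1)
    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_attention.get(w, default) for w in words) / max(len(words), 1)
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_pick])]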

Proceedings Article
11 Jul 2009
TL;DR: This paper proposes to use the multi-modality manifold-ranking algorithm for extracting a topic-focused summary from multiple documents by considering the within-document sentence relationships and the cross-document sentence relationships as two separate modalities (graphs).
Abstract: Graph-based manifold-ranking methods have been successfully applied to topic-focused multi-document summarization. This paper further proposes to use the multi-modality manifold-ranking algorithm for extracting a topic-focused summary from multiple documents by considering the within-document sentence relationships and the cross-document sentence relationships as two separate modalities (graphs). Three different fusion schemes, namely the linear form, the sequential form and the score combination form, are exploited in the algorithm. Experimental results on the DUC benchmark datasets demonstrate the effectiveness of the proposed multi-modality learning algorithms with all three fusion schemes.
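For orientation, the sketch below implements plain single-modality manifold ranking over a cosine-similarity sentence graph (the classic f <- alpha*S*f + (1-alpha)*y iteration); the paper's contribution, fusing two such graphs under three schemes, is not reproduced.

# Compact sketch of plain (single-modality) manifold ranking over a sentence graph;
# the paper fuses two such graphs, which is omitted here.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def manifold_rank(sentences, topic, alpha=0.85, iters=100):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(sentences + [topic])
    W = cosine_similarity(X[:-1])                 # sentence-sentence affinities
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    d[d == 0] = 1e-12
    S = W / np.sqrt(np.outer(d, d))               # symmetric normalisation
    y = cosine_similarity(X[:-1], X[-1]).ravel()  # prior relevance to the topic
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y
    return f                                      # ranking scores, higher is more relevant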

Proceedings ArticleDOI
01 Dec 2009
TL;DR: This paper presents a taxonomy of summarization systems, defines the most important criteria for a summary which can be generated by a system, and goes through the main criteria for evaluating a text summary.
Abstract: Text summarization systems are among the most attractive research areas nowadays. Summarization systems offer the possibility of finding the main points of texts, so the user spends less time on reading the whole document. Different types of summary might be useful in various applications, and summarization systems can be categorized based on these types. This paper presents a taxonomy of summarization systems and defines the most important criteria for a summary which can be generated by a system. Additionally, different methods of text summarization as well as the main steps of the summarization process are discussed. We also go through the main criteria for evaluating a text summary.

Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper proposes an approach for multi-document video summarization by exploring the redundancy between different videos, and shows that multi-document video summarization presents more elegant and informative summaries compared with the single-document approach.
Abstract: Most previous works on video summarization target a single video document. With the popularity of video corpora (e.g. news video archives) and web videos, video articles that consist of a set of relevant videos are frequently confronted by users. With traditional single-document summarization, these videos are treated independently and the results are usually redundant due to the lack of inter-video analysis. To efficiently manage video articles, in this paper we propose an approach for multi-document video summarization by exploring the redundancy between different videos. The importance of keyframes is first measured by the content inclusion based on intra- and inter-video similarities. We then propose a Minimum Description Length (MDL) criterion for automatically determining the appropriate length of the summary. Finally a video summary is generated for users to browse the content of the whole video article. We show that multi-document video summarization presents more elegant and informative summaries compared with the single-document approach.

Proceedings ArticleDOI
01 Feb 2009
TL;DR: The so-called n-grams and maximal frequent word sequences are employed as features in a vector space model in order to determine the advantages and disadvantages for extractive text summarization.
Abstract: The main problem for generating an extractive automatic text summary is to detect the most relevant information in the source document. For such purpose, recently some approaches have successfully employed the word sequence information from the self-text for detecting the candidate text fragments for composing the summary. In this paper, we employ the so-called n-grams and maximal frequent word sequences as features in a vector space model in order to determine the advantages and disadvantages for extractive text summarization.
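As a hedged illustration of the n-gram vector space side only (maximal frequent word sequence mining is omitted), sentences can be represented with word n-grams and scored against the document centroid:

# Hedged sketch: represent sentences with word n-grams in a vector space and score
# them by similarity to the document centroid; maximal frequent word sequences,
# which the paper also evaluates, are not covered here.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ngram_rank(sentences, ngram_range=(1, 2), n_pick=3):
    X = TfidfVectorizer(ngram_range=ngram_range, stop_words="english").fit_transform(sentences)
    centroid = np.asarray(X.mean(axis=0))
    scores = cosine_similarity(X, centroid).ravel()
    order = np.argsort(-scores)[:n_pick]
    return [sentences[i] for i in sorted(order)]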

Journal ArticleDOI
TL;DR: This paper proposes an unsupervised document summarization method that creates the summary by clustering and extracting sentences from the original document, and develops a discrete differential evolution algorithm to optimize the criterion functions.
Abstract: Text summarization is the process of automatically creating a compressed version of a given document preserving its information content. There are two types of summarization: extractive and abstractive. Extractive summarization methods simplify the problem of summarization into the problem of selecting a representative subset of the sentences in the original documents. Abstractive summarization may compose novel sentences, unseen in the original sources. In our study we focus on sentence-based extractive document summarization. The extractive summarization systems are typically based on techniques for sentence extraction and aim to cover the set of sentences that are most important for the overall understanding of a given document. In this paper, we propose an unsupervised document summarization method that creates the summary by clustering and extracting sentences from the original document. For this purpose new criterion functions for sentence clustering have been proposed. Similarity measures play an increasingly important role in document clustering. Here we've also developed a discrete differential evolution algorithm to optimize the criterion functions. The experimental results show that our suggested approach can improve the performance compared to state-of-the-art summarization approaches.

Journal ArticleDOI
TL;DR: This paper provides detailed descriptions of a proposed new algorithm for video summarization that does not require high-level semantics such as object detection and speech/audio analysis, which provides a more flexible and general solution for this topic.
Abstract: In this paper, we provide detailed descriptions of a proposed new algorithm for video summarization, which are also included in our submission to TRECVID'08 on BBC rush summarization. Firstly, rush videos are hierarchically modeled using the formal language technique. Secondly, shot detections are applied to introduce a new concept of V-unit for structuring videos in line with the hierarchical model, and thus junk frames within the model are effectively removed. Thirdly, adaptive clustering is employed to group shots into clusters to determine retakes for redundancy removal. Finally, each most representative shot selected from every cluster is ranked according to its length and sum of activity level for summarization. Competitive results have been achieved to prove the effectiveness and efficiency of our techniques, which are fully implemented in the compressed domain. Our work does not require high-level semantics such as object detection and speech/audio analysis which provides a more flexible and general solution for this topic.

Journal ArticleDOI
23 Oct 2009
TL;DR: A generic text summarization method that generates summaries of Turkish texts by ranking sentences according to their scores calculated using their surface level features and extracting the highest ranked ones from the original documents.
Abstract: In this paper, we propose a generic text summarization method that generates summaries of Turkish texts by ranking sentences according to their scores calculated using their surface level features and extracting the highest ranked ones from the original documents. In order to extract sentences which form a summary with an extensive coverage of main content of the text and less redundancy, we use the features such as term frequency, key phrase, centrality, title similarity and position of the sentence in the original text. Sentence rank is computed using a score function that uses its feature values and the weights of the features. The best feature weights are learned using machine learning techniques with the help of human constructed summaries. Performance evaluation is conducted by comparing summarization outputs with manual summaries generated by 25 independent human evaluators. This paper presents one of the first Turkish summarization systems, and its results are promising.

Proceedings ArticleDOI
02 Nov 2009
TL;DR: A multi-document generic summarization model based on the budgeted median problem that covers the entire relevant part of the document cluster through sentence assignment and can incorporate asymmetric relations between sentences such as textual entailment.
Abstract: We propose a multi-document generic summarization model based on the budgeted median problem. Our model selects sentences to generate a summary so that every sentence in the document cluster can be assigned to and be represented by a sentence in the summary as much as possible. The advantage of this model is that it covers the entire relevant part of the document cluster through sentence assignment and can incorporate asymmetric relations between sentences such as textual entailment.
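One way to read the budgeted median formulation is sketched below with PuLP: every sentence must be assigned to some selected summary sentence, assignments are rewarded by similarity, and the selected sentences must fit a length budget. The paper's exact objective and constraints may differ; the similarity matrix and sentence lengths are assumed to be given.

# One reading of the budgeted median formulation, sketched with PuLP (the paper's
# exact objective and constraints may differ): pick summary sentences within a length
# budget so that every sentence can be assigned to a similar selected sentence.
import pulp

def budgeted_median_summary(sentences, sim, lengths, budget=100):
    n = len(sentences)
    prob = pulp.LpProblem("budgeted_median", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x{j}", cat="Binary") for j in range(n)]  # sentence j selected
    z = [[pulp.LpVariable(f"z{i}_{j}", cat="Binary") for j in range(n)] for i in range(n)]
    # Reward assigning each sentence i to a similar selected sentence j.
    prob += pulp.lpSum(sim[i][j] * z[i][j] for i in range(n) for j in range(n))
    for i in range(n):
        prob += pulp.lpSum(z[i][j] for j in range(n)) == 1   # every sentence assigned
        for j in range(n):
            prob += z[i][j] <= x[j]                          # only to selected sentences
    prob += pulp.lpSum(lengths[j] * x[j] for j in range(n)) <= budget
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [sentences[j] for j in range(n) if x[j].value() == 1]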

Journal Article
Delia Rusu (Technical University of Cluj-Napoca, Faculty of Automation and Computer Science, G. Bariţiu 26-28, 400027 Cluj-Napoca, Romania; delia.rusu@gmail.com); Blaž Fortuna, Marko Grobelnik and Dunja Mladenic (Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia; {blaz.fortuna, marko.grobelnik, dunja.mladenic}@ijs.si)

Proceedings ArticleDOI
Junyan Zhu, Can Wang, Xiaofei He, Jiajun Bu, Chun Chen, Shujie Shang, Mingcheng Qu, Gang Lu
20 Apr 2009
TL;DR: Experimental results show the tag-oriented summarization yields a significant improvement over approaches that do not use tags, and a new tag ranking algorithm named EigenTag is proposed in this paper to reduce noise in tags.
Abstract: Social annotations on a Web document are a highly generalized description of the topics contained in that page. Their tagging frequency indicates user attention to various degrees. This makes annotations a good resource for summarizing multiple topics in a Web page. In this paper, we present a tag-oriented Web document summarization approach that uses both the document content and the tags annotated on that document. To improve summarization performance, a new tag ranking algorithm named EigenTag is proposed in this paper to reduce noise in tags. Meanwhile, an association mining technique is employed to expand the tag set to tackle the sparsity problem. Experimental results show our tag-oriented summarization yields a significant improvement over approaches that do not use tags.
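The EigenTag algorithm itself is not reproduced here; as a rough eigenvector-style stand-in, tags can be ranked by eigenvector centrality on a tag co-occurrence graph:

# Rough stand-in for an eigenvector-style tag ranking (not the paper's EigenTag
# algorithm): build a tag co-occurrence graph and rank tags by eigenvector centrality.
from itertools import combinations
import networkx as nx

def rank_tags(tagged_documents):
    """tagged_documents: list of tag sets, one per annotated document/bookmark."""
    g = nx.Graph()
    for tags in tagged_documents:
        for a, b in combinations(sorted(set(tags)), 2):
            w = g[a][b]["weight"] + 1 if g.has_edge(a, b) else 1
            g.add_edge(a, b, weight=w)
    centrality = nx.eigenvector_centrality(g, weight="weight", max_iter=1000)
    return sorted(centrality, key=centrality.get, reverse=True)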

Proceedings ArticleDOI
04 Jun 2009
TL;DR: It is shown that the summarizer built is able to outperform most systems participating in task focused summarization evaluations at Text Analysis Conferences (TAC) 2008 and would perform better at producing short summaries than longer summaries.
Abstract: In this paper, we describe a sentence position based summarizer that is built based on a sentence position policy, created from the evaluation testbed of recent summarization tasks at the Document Understanding Conferences (DUC). We show that the summarizer thus built is able to outperform most systems participating in task focused summarization evaluations at the Text Analysis Conference (TAC) 2008. Our experiments also show that such a method would perform better at producing short summaries (up to 100 words) than longer summaries. Further, we discuss the baselines traditionally used for summarization evaluation and suggest the revival of an old baseline to suit the current summarization task at TAC: the Update Summarization task.
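A minimal sketch of a position-policy summarizer in this family is given below: it takes leading sentences from each document in round-robin order until a word budget is met. The actual policy in the paper is derived from DUC evaluation data rather than hard-coded like this.

# Minimal sketch of a position-policy summarizer: take leading sentences from each
# document in round-robin order until the word budget is reached (the paper's policy
# is learned from DUC data, not fixed like this).
def position_summary(documents, budget_words=100):
    """documents: list of documents, each a list of sentences in original order."""
    summary, words, rank = [], 0, 0
    while words < budget_words and any(rank < len(d) for d in documents):
        for doc in documents:
            if rank < len(doc) and words < budget_words:
                summary.append(doc[rank])
                words += len(doc[rank].split())
        rank += 1
    return summary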

Proceedings ArticleDOI
16 Sep 2009
TL;DR: The development of the summarizer is described which is based on Iterative Residual Rescaling (IRR) that creates the latent semantic space of a set of documents under consideration that enables to control the influence of major and minor topics in the latent space.
Abstract: This paper deals with our recent research in text summarization. The field has moved from multi-document summarization to update summarization. When producing an update summary of a set of topic-related documents, the summarizer assumes prior knowledge of the reader determined by a set of older documents on the same topic. The update summarizer thus must solve a novelty vs. redundancy problem. We describe the development of our summarizer, which is based on Iterative Residual Rescaling (IRR), a technique that creates the latent semantic space of a set of documents under consideration. IRR generalizes Singular Value Decomposition (SVD) and enables control of the influence of major and minor topics in the latent space. Our sentence-extractive summarization method computes the redundancy, novelty and significance of each topic. These values are finally used in the sentence selection process. The sentence selection component prevents inner summary redundancy. The results of our participation in the TAC evaluation seem promising.

Book ChapterDOI
19 May 2009
TL;DR: This paper presents NEO-CORTEX, a multi-document summarization system based on the existing CORTEX system that achieves good performance on the topic-oriented multi-document summarization task.
Abstract: This paper discusses an approach to topic-oriented multi-document summarization. It investigates the effectiveness of using additional information about the document set as a whole, as well as about individual documents. We present NEO-CORTEX, a multi-document summarization system based on the existing CORTEX system. Results are reported for experiments with a document base formed from the NIST DUC-2005 and DUC-2006 data. Our experiments have shown that NEO-CORTEX is an effective system and achieves good performance on the topic-oriented multi-document summarization task.

Journal ArticleDOI
TL;DR: A novel training data selection approach that leverages the relevance information of spoken sentences to select reliable document-summary pairs derived by the probabilistic generative summarizers is explored for training the classification-based summarizers.
Abstract: Extractive document summarization automatically selects a number of indicative sentences, passages, or paragraphs from an original document according to a target summarization ratio, and sequences them to form a concise summary. In this article, we present a comparative study of various probabilistic ranking models for spoken document summarization, including supervised classification-based summarizers and unsupervised probabilistic generative summarizers. We also investigate the use of unsupervised summarizers to improve the performance of supervised summarizers when manual labels are not available for training the latter. A novel training data selection approach that leverages the relevance information of spoken sentences to select reliable document-summary pairs derived by the probabilistic generative summarizers is explored for training the classification-based summarizers. Encouraging initial results on Mandarin Chinese broadcast news data are demonstrated.

01 Jan 2009
TL;DR: A summarization method, which combines several domain specific features with some other known features such as term frequency, title and position to improve the summarization performance in the medical domain is discussed.
Abstract: Medical literature on the web is an important source to help clinicians in patient care. Initially, the clinicians go through the author-written abstracts or summaries available with the medical articles to decide whether articles are suitable for in-depth study. Since all medical articles do not come with author-written abstracts or summaries, automatic summarization of medical articles will help clinicians or medical students to find the relevant information on the web rapidly. In this paper we discuss a summarization method, which combines several domain-specific features with some other known features such as term frequency, title and position to improve the summarization performance in the medical domain. Our experiments show that the incorporation of domain-specific features improves the summarization performance.
Index Terms — Text summarization, Domain-specific features, Novel medical term detection

Book ChapterDOI
15 May 2009
TL;DR: A Support Vector Machine (SVM) based ensemble approach to combat the extractive multi-document summarization problem using a committee of several SVMs to form an ensemble of classifiers where the strategy is to improve the performance by correcting errors of one classifier using the accurate output of others.
Abstract: In this paper, we present a Support Vector Machine (SVM) based ensemble approach to combat the extractive multi-document summarization problem. Although SVM can have a good generalization ability, it may experience a performance degradation through wrong classifications. We use a committee of several SVMs, i.e. Cross-Validation Committees (CVC), to form an ensemble of classifiers where the strategy is to improve the performance by correcting errors of one classifier using the accurate output of others. The practicality and effectiveness of this technique is demonstrated using the experimental results.
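A compact sketch of the cross-validation-committee idea using scikit-learn follows: one SVM is trained per fold on sentence feature vectors labelled summary/non-summary, and the committee's decision scores are averaged at prediction time; the paper's feature design and error-correction strategy are not reproduced.

# Compact sketch of a cross-validation committee of SVMs for sentence classification
# (summary vs. non-summary); feature engineering and the paper's error-correction
# strategy are not reproduced here.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def train_svm_committee(X, y, n_folds=5):
    """X: numpy array of sentence features; y: 1 if the sentence is in a reference summary."""
    committee = []
    for train_idx, _ in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        clf = SVC(kernel="rbf", C=1.0)
        clf.fit(X[train_idx], y[train_idx])
        committee.append(clf)
    return committee

def committee_scores(committee, X_new):
    # Average the signed margin distances across committee members.
    return np.mean([clf.decision_function(X_new) for clf in committee], axis=0)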

Proceedings ArticleDOI
Xiaojun Wan
02 Nov 2009
TL;DR: The related subtopics are discovered from the topic's narrative text or document set through topic analysis techniques and the multi-modality manifold-ranking method is proposed to evaluate and rank sentences by fusing the multiple modalities.
Abstract: Topic-focused multi-document summarization has been a challenging task because the created summary is required to be biased to the given topic or query. Existing methods consider the given topic as a single coarse unit and then directly incorporate the relevance between each sentence and the single topic into the sentence evaluation process. However, the given topic is usually not well-defined and it consists of a few explicit or implicit subtopics. In this study, the related subtopics are discovered from the topic's narrative text or document set through topic analysis techniques. Then, the sentence relationships against each subtopic are considered as an individual modality and the multi-modality manifold-ranking method is proposed to evaluate and rank sentences by fusing the multiple modalities. Experimental results on the DUC benchmark datasets show the promising results of our proposed methods.