
Showing papers on "Multi-document summarization published in 2011"


Journal ArticleDOI
TL;DR: This paper applies a different kind of learning model, namely regression models, to query-focused multi-document summarization, choosing Support Vector Regression to estimate the importance of a sentence in a document set to be summarized through a set of pre-defined features.
Abstract: Most existing research on applying machine learning techniques to document summarization explores either classification models or learning-to-rank models. This paper presents our recent study on how to apply a different kind of learning model, namely regression models, to query-focused multi-document summarization. We choose to use Support Vector Regression (SVR) to estimate the importance of a sentence in a document set to be summarized through a set of pre-defined features. In order to learn the regression models, we propose several methods to construct the "pseudo" training data by assigning each sentence a "nearly true" importance score calculated from the human summaries provided for the corresponding document set. A series of evaluations on the DUC data sets is conducted to examine the efficiency and robustness of the proposed approaches. When compared with classification models and ranking models, regression models are consistently preferable.
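A minimal sketch of the regression setup described above, using scikit-learn's SVR. The feature set and the "nearly true" importance scores below are hypothetical placeholders, not the paper's actual features or pseudo-training construction:

```python
# Score sentences with Support Vector Regression and rank by the
# predicted importance; features and labels are illustrative only.
import numpy as np
from sklearn.svm import SVR

# Each row: hypothetical sentence features, e.g. [position in document,
# length in words, query-term overlap, centroid similarity].
X_train = np.array([
    [0.0, 25, 0.6, 0.8],
    [0.5, 12, 0.1, 0.3],
    [0.9, 30, 0.4, 0.5],
    [0.2, 18, 0.7, 0.9],
])
# "Nearly true" importance scores, in the paper computed against the
# human summaries; hard-coded here for illustration.
y_train = np.array([0.9, 0.2, 0.4, 0.8])

model = SVR(kernel="rbf", C=1.0, epsilon=0.05)
model.fit(X_train, y_train)

# Rank unseen sentences by predicted importance.
X_test = np.array([[0.1, 22, 0.5, 0.7], [0.8, 10, 0.2, 0.2]])
scores = model.predict(X_test)
print("ranking (best first):", np.argsort(-scores), "scores:", scores)
```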

172 citations


Journal ArticleDOI
TL;DR: Different LSA-based summarization algorithms are explained, two of which are proposed by the authors of this paper, and their performances are compared using ROUGE scores.
Abstract: Text summarization solves the problem of presenting the information needed by a user in a compact form. There are different approaches to creating well-formed summaries. One of the newest methods is Latent Semantic Analysis (LSA). In this paper, different LSA-based summarization algorithms are explained, two of which are proposed by the authors of this paper. The algorithms are evaluated on Turkish and English documents, and their performances are compared using their ROUGE scores. One of our algorithms produces the best scores, and both algorithms perform equally well on Turkish and English document sets.
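For concreteness, here is a small sketch in the spirit of the LSA-based scoring family the paper surveys (a Steinberger-style sentence score over a truncated SVD of the sentence-term matrix); it is not either of the authors' exact algorithms:

```python
# LSA-based sentence scoring: build a sentence-term matrix, project it
# into a latent topic space with SVD, and score sentences by the length
# of their (singular-value-weighted) topic vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

sentences = [
    "Text summarization condenses documents into a shorter form.",
    "Latent semantic analysis uncovers hidden topical structure.",
    "ROUGE compares system summaries against human references.",
    "LSA-based scoring selects sentences that dominate key topics.",
]

X = TfidfVectorizer().fit_transform(sentences)   # sentence-term matrix
svd = TruncatedSVD(n_components=2, random_state=0)
V = svd.fit_transform(X)   # rows: sentences in topic space, scaled by
                           # the singular values

scores = np.linalg.norm(V, axis=1)   # Steinberger-style vector length
order = np.argsort(-scores)
print("top sentence:", sentences[order[0]])
```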

133 citations


Proceedings Article
24 Jun 2011
TL;DR: A new, ambitious framework for abstractive summarization, which aims at selecting the content of a summary not from sentences, but from an abstract representation of the source documents, based on the concept of Information Items (InIt).
Abstract: We propose a new, ambitious framework for abstractive summarization, which aims at selecting the content of a summary not from sentences, but from an abstract representation of the source documents. This abstract representation relies on the concept of Information Items (InIt), which we define as the smallest element of coherent information in a text or a sentence. Our framework differs from previous abstractive summarization models in requiring a semantic analysis of the text. We present a first attempt made at developing a system from this framework, along with evaluation results for it from TAC 2010. We also present related work, both from within and outside of the automatic summarization domain.

122 citations


Proceedings ArticleDOI
Zi Yang, Keke Cai, Jie Tang, Li Zhang, Zhong Su, Juanzi Li
24 Jul 2011
TL;DR: A dual wing factor graph (DWFG) model is proposed, which utilizes the mutual reinforcement between Web documents and their associated social contexts to generate summaries, and an efficient algorithm is designed to learn the proposed factor graph model.
Abstract: We study a novel problem of social context summarization for Web documents. Traditional summarization research has focused on extracting informative sentences from standard documents. With the rapid growth of online social networks, abundant user-generated content (e.g., comments) associated with the standard documents is available. Which parts of a document do social users really care about? How can we generate summaries for standard documents by considering both the informativeness of sentences and the interests of social users? This paper explores such an approach by modeling Web documents and social contexts in a unified framework. We propose a dual wing factor graph (DWFG) model, which utilizes the mutual reinforcement between Web documents and their associated social contexts to generate summaries. An efficient algorithm is designed to learn the proposed factor graph model. Experimental results on a Twitter data set validate the effectiveness of the proposed model. By leveraging the social context information, our approach obtains significant improvement (on average +5.0% to +17.3%) over several alternative methods (CRF, SVM, LR, PR, and DocLead) on summarization performance.

110 citations


Journal ArticleDOI
TL;DR: A conceptual model of clinical summarization is developed that describes the creation of complex, task-specific clinical summaries and provides a framework for clinical workflow analysis and directed research on test results review, clinical documentation and medical decision-making.

100 citations


01 Jan 2011
TL;DR: CSTNews, a discourse-annotated corpus for fostering research on single and multi-document summarization, is introduced within the context of the SUCINTO Project, which aims at investigating summarization strategies and developing tools and resources for that purpose.
Abstract: This paper introduces CSTNews, a discourse-annotated corpus for fostering research on single and multi-document summarization. The corpus comprises 50 clusters of news texts in Brazilian Portuguese and some related material, which includes a set of single-document manual summaries and a set of multi-document manual and automatic summaries. The texts are annotated in different ways for discourse organization, following both the Rhetorical Structure Theory and Cross-document Structure Theory. The corpus is a result delivered within the context of the SUCINTO Project, which aims at investigating summarization strategies and developing tools and resources for that purpose. The design of the discourse annotation tasks and the decisions that have been taken during the annotation process are detailed in this paper.

90 citations


Journal ArticleDOI
TL;DR: This article proposes a new language model to simultaneously cluster and summarize documents by making use of both the document-term and sentence-term matrices and shows the effectiveness of the proposed method and the high interpretability of the generated summaries.
Abstract: Document understanding techniques such as document clustering and multi-document summarization have been receiving much attention recently. Current document clustering methods usually represent the given collection of documents as a document-term matrix and then conduct the clustering process. Although many of these clustering methods can group the documents effectively, it is still hard for people to capture the meaning of the documents since there is no satisfactory interpretation for each document cluster. A straightforward solution is to first cluster the documents and then summarize each document cluster using summarization methods. However, most of the current summarization methods are solely based on the sentence-term matrix and ignore the context dependence of the sentences. As a result, the generated summaries lack guidance from the document clusters. In this article, we propose a new language model to simultaneously cluster and summarize documents by making use of both the document-term and sentence-term matrices. By utilizing the mutual influence of document clustering and summarization, our method achieves: (1) better document clustering with more meaningful interpretation; and (2) effective document summarization with guidance from document clustering. Experimental results on various document datasets show the effectiveness of our proposed method and the high interpretability of the generated summaries.
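As a point of reference, the two-stage pipeline the article argues against can be sketched in a few lines: cluster documents first, then summarize each cluster by picking the item closest to its centroid. The joint language model proposed in the article couples these steps instead; the data below is made up:

```python
# Naive cluster-then-summarize baseline (not the article's joint model):
# K-means over a document-term matrix, then the nearest item to each
# cluster centroid serves as that cluster's one-line summary.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "stock markets fell sharply on inflation fears",
    "central bank raises interest rates to curb inflation",
    "new vaccine shows strong results in clinical trials",
    "health officials expand the vaccination campaign",
]
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for c in range(km.n_clusters):
    idx = np.where(km.labels_ == c)[0]
    sims = cosine_similarity(X[idx], km.cluster_centers_[c].reshape(1, -1))
    best = idx[int(np.argmax(sims))]
    print(f"cluster {c} summary: {docs[best]}")
```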

81 citations


Journal ArticleDOI
TL;DR: A document summarization model which extracts key sentences from given documents while reducing redundant information in the summaries is presented, and an innovative aspect of the model lies in its ability to remove redundancy while selecting representative sentences.
Abstract: For effective multi-document summarization, it is important to reduce redundant information in the summaries and extract sentences that are common to the given documents. This paper presents a document summarization model which extracts key sentences from given documents while reducing redundant information in the summaries. An innovative aspect of our model lies in its ability to remove redundancy while selecting representative sentences. The model is formulated as a discrete optimization problem, and an adaptive differential evolution (DE) algorithm is created to solve it. We implemented our model on the multi-document summarization task. Experiments have shown that the proposed model is preferable to existing summarization systems, and that the resulting summarization system based on the proposed optimization approach is competitive on the DUC2002 and DUC2004 datasets.
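A toy sketch of the idea of casting extractive summarization as discrete optimization solved with differential evolution (DE). The objective below (centroid relevance minus pairwise redundancy, under a sentence budget) and the plain DE scheme are simplified stand-ins for the paper's model and adaptive DE algorithm:

```python
# Binary sentence selection via a basic DE loop: evolve continuous
# vectors, threshold at 0.5 to get a selection mask, keep the fitter
# of parent and trial. Data and parameters are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The flood forced thousands of residents to evacuate.",
    "Thousands of residents were evacuated because of the flood.",
    "Relief agencies began distributing food and water.",
    "Officials promised an inquiry into the failed levees.",
]
X = TfidfVectorizer().fit_transform(sentences)
centroid = np.asarray(X.mean(axis=0))
rel = cosine_similarity(X, centroid).ravel()   # relevance per sentence
red = cosine_similarity(X)                     # pairwise redundancy

def fitness(mask, budget=2):
    sel = np.where(mask)[0]
    if len(sel) == 0 or len(sel) > budget:
        return -1e9
    overlap = sum(red[i, j] for i in sel for j in sel if i < j)
    return rel[sel].sum() - overlap

rng = np.random.default_rng(0)
n, pop_size, F, CR = len(sentences), 20, 0.6, 0.9
pop = rng.random((pop_size, n))
for _ in range(100):
    for i in range(pop_size):
        a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
        trial = np.where(rng.random(n) < CR, a + F * (b - c), pop[i])
        if fitness(trial > 0.5) >= fitness(pop[i] > 0.5):
            pop[i] = trial

best = max(pop, key=lambda v: fitness(v > 0.5))
print([s for s, keep in zip(sentences, best > 0.5) if keep])
```

Note how the redundancy term makes the two near-duplicate flood sentences unlikely to be selected together.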

76 citations


Proceedings Article
05 Jul 2011
TL;DR: This paper proposes a clustering-based approach for identifying correlated groups of comments and a precedence-based ranking framework for automatically selecting informative user-contributed comments, and finds that in combination these two salient features yield promising results.
Abstract: User-contributed comments are one of the hallmarks of the Social Web, widely adopted across social media sites and mainstream news providers alike. While comments encourage higher levels of user engagement with online media, their wide success places new burdens on users, who must process and assimilate a huge number of user-contributed perspectives. Toward overcoming this problem, we study in this paper the comment summarization problem: for a set of n user-contributed comments associated with an online resource, select the best top-k comments for summarization. In this paper we propose (i) a clustering-based approach for identifying correlated groups of comments; and (ii) a precedence-based ranking framework for automatically selecting informative user-contributed comments. We find that in combination, these two salient features yield promising results. Concretely, we evaluate the proposed comment summarization algorithm over a collection of YouTube videos and their associated comments, and we find good performance in comparison with traditional document summarization approaches (e.g., LexRank, MEAD).

58 citations


Journal Article
TL;DR: The Text Analysis Conference MultiLing Pilot of 2011 posed a multi-lingual summarization task to the summarization community, aiming to quantify and measure the performance of multi-lingual, multi-document summarization systems.
Abstract: The Text Analysis Conference MultiLing Pilot of 2011 posed a multi-lingual summarization task to the summarization community, aiming to quantify and measure the performance of multi-lingual, multi-document summarization systems. The task was to create a 240-250 word summary from 10 news texts describing a given topic. The texts of each topic were provided in seven languages (Arabic, Czech, English, French, Greek, Hebrew, Hindi) and each participant generated summaries for at least 2 languages. The evaluation of the summaries was performed using automatic (AutoSummENG, ROUGE) and manual (Overall Responsiveness score) processes. Eight systems participated, some of which provided summaries across all languages. This paper provides a brief description of the collection of the data, the evaluation methodology, the problems and challenges faced, and an overview of participation and corresponding results.
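The automatic side of such an evaluation can be illustrated with the rouge-score package (`pip install rouge-score`); the actual pilot used its own ROUGE and AutoSummENG pipelines, and rouge-score's stemmer is English-only, so treat this purely as a usage sketch:

```python
# Compute ROUGE-1/2/L F1 between a system summary and one reference.
from rouge_score import rouge_scorer

reference = "The storm displaced thousands and damaged coastal towns."
system = "Thousands were displaced as the storm hit coastal towns."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
for name, score in scorer.score(reference, system).items():
    print(f"{name}: F1 = {score.fmeasure:.3f}")
```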

56 citations


Proceedings Article
Xiaojun Wan
19 Jun 2011
TL;DR: Two summarization methods (SimFusion and CoRank) are proposed to leverage bilingual information in a graph-based ranking framework for cross-language summary extraction, and their effectiveness is demonstrated on the DUC2001 dataset.
Abstract: Cross-language document summarization is defined as the task of producing a summary in a target language (e.g. Chinese) for a set of documents in a source language (e.g. English). Existing methods for addressing this task make use of either the information from the original documents in the source language or the information from the translated documents in the target language. In this study, we propose to use the bilingual information from both the source and translated documents for this task. Two summarization methods (SimFusion and CoRank) are proposed to leverage the bilingual information in the graph-based ranking framework for cross-language summary extraction. Experimental results on the DUC2001 dataset with manually translated reference Chinese summaries show the effectiveness of the proposed methods.
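A rough sketch of the SimFusion flavour of this idea: fuse sentence similarities computed on the source-language side with those computed on the translated side, then run a PageRank-style walk over the fused graph. The fusion weights and similarity values below are made up, and this is not the paper's exact formulation (CoRank, in particular, couples two graphs rather than fusing them):

```python
# Graph-based sentence salience over a fused bilingual similarity graph.
import numpy as np

def pagerank(W, d=0.85, iters=100):
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic transition
    n = len(W)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * P.T @ r      # damped random walk update
    return r

# Hypothetical pairwise sentence similarities (self-similarity zeroed).
sim_src = np.array([[0, .7, .1], [.7, 0, .3], [.1, .3, 0]])  # English side
sim_tgt = np.array([[0, .5, .2], [.5, 0, .6], [.2, .6, 0]])  # Chinese side

fused = 0.5 * sim_src + 0.5 * sim_tgt
print("sentence salience:", pagerank(fused))
```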

Proceedings ArticleDOI
13 Feb 2011
TL;DR: The PASSEV system is introduced, and two basic summarization approaches are described and evaluated.
Abstract: Social services like Twitter are increasingly used to provide a conversational backdrop to real-world events in real-time. Sporting events are a good example of this and this year, millions of users tweeted their comments as they watched the World Cup matches from around the world. In this paper, we look at using these time-stamped opinions as the basis for generating video highlights for these soccer matches. We introduce the PASSEV system and describe and evaluate two basic summarization approaches.

Proceedings ArticleDOI
13 Jun 2011
TL;DR: A system that automatically summarizes a large collection of product reviews to generate a concise summary is built, which not only extracts the review sentiments but also the underlying justification for their opinion.
Abstract: With product reviews growing in depth and becoming more numerous, it is a growing challenge for both customers and product manufacturers to acquire a comprehensive understanding of their contents. We built a system that automatically summarizes a large collection of product reviews to generate a concise summary. Importantly, our system not only extracts the review sentiments but also the underlying justifications for those opinions. We solve this problem through a novel application of clustering and validate our approach through an empirical study, obtaining good performance as judged by F-measure (the harmonic mean of purity and inverse purity).

Proceedings Article
07 May 2011
TL;DR: This paper introduces a new type of summarization task, known as microblog summarization, which aims to synthesize content from multiple microblog posts on the same topic into a human-readable prose description of fixed length using a generative model and a user behavior model.
Abstract: This paper introduces a new type of summarization task, known as microblog summarization, which aims to synthesize content from multiple microblog posts on the same topic into a human-readable prose description of fixed length. Our approach leverages (1) a generative model which induces event structures from text and (2) a user behavior model which captures how users convey relevant content.

Proceedings Article
27 Jul 2011
TL;DR: Key features of this method include automatic grouping of semantically related sentences, sentence ranking based on an extension of the random walk model, and a new sentence compression algorithm which uses dependency trees instead of parse trees.
Abstract: In this paper, we propose a novel approach to the automatic generation of aspect-oriented summaries from multiple documents. We first develop an event-aspect LDA model to cluster sentences into aspects. We then use an extended LexRank algorithm to rank the sentences in each cluster, and use Integer Linear Programming for sentence selection. Key features of our method include automatic grouping of semantically related sentences and sentence ranking based on an extension of the random walk model. Also, we implement a new sentence compression algorithm which uses dependency trees instead of parse trees. We compare our method with four baseline methods. Quantitative evaluation based on the ROUGE metric demonstrates the effectiveness and advantages of our method.
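The ILP selection step lends itself to a compact sketch: maximize the total score of selected sentences under a length budget, with binary selection variables. This uses the PuLP library (`pip install pulp`); the scores and lengths are made up, and the paper's actual ILP may include redundancy constraints not shown here:

```python
# Budgeted sentence selection as a 0/1 integer linear program.
import pulp

scores  = [0.9, 0.7, 0.4, 0.6]   # e.g. from the LexRank-style ranker
lengths = [20, 15, 10, 25]       # sentence lengths in words
budget  = 40                     # summary length limit in words

prob = pulp.LpProblem("summary_selection", pulp.LpMaximize)
x = [pulp.LpVariable(f"s{i}", cat="Binary") for i in range(len(scores))]
prob += pulp.lpSum(scores[i] * x[i] for i in range(len(scores)))
prob += pulp.lpSum(lengths[i] * x[i] for i in range(len(scores))) <= budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("selected:", [i for i in range(len(scores)) if x[i].value() == 1])
```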

Proceedings ArticleDOI
11 Jul 2011
TL;DR: This paper studies automatic video summarization in the consumer domain where most previous methods cannot be easily applied due to the challenging issues for content analysis, i.e., consumer videos are captured with uncontrolled conditions such as uneven illumination, clutter, and large camera motion.
Abstract: Video summarization provides a condensed version of a video stream by analyzing the video content. Automatic summarization of consumer videos is an important tool that facilitates efficient browsing, searching, and album creation in large consumer video collections. This paper studies automatic video summarization in the consumer domain where most previous methods cannot be easily applied due to the challenging issues for content analysis, i.e., consumer videos are captured with uncontrolled conditions such as uneven illumination, clutter, and large camera motion, and with poor-quality soundtrack as a mix of multiple sound sources under severe noise. To pursue reliable summarization, a case study with actual consumer users is conducted, from which a set of consumer-oriented guidelines is obtained. The guidelines reflect the high-level semantic rules, in both visual and audio aspects, which are recognized by consumers as important to produce good video summaries. Following these guidelines, an automatic video summarization algorithm is developed where both visual and audio information are used to generate improved summaries. To the best of our knowledge, this is a first systematic study on automatic summarization of consumer-quality videos. Experimental evaluations from consumer subjects show the effectiveness of our approach.

Proceedings ArticleDOI
28 Mar 2011
TL;DR: A transfer learning framework for video summarization is proposed: in the training process both the video features and textual features are exploited to train a summarization algorithm while for summarizing a new video only its video features are utilized.
Abstract: It is well-known that textual information such as video transcripts and video reviews can significantly enhance the performance of video summarization algorithms. Unfortunately, many videos on the Web, such as those from the popular video sharing site YouTube, do not have useful textual information. The goal of this paper is to propose a transfer learning framework for video summarization: in the training process both the video features and textual features are exploited to train a summarization algorithm, while for summarizing a new video only its video features are utilized. The basic idea is to explore the transferability between videos and their corresponding textual information. Based on the assumption that video features and textual features are highly correlated with each other, we can transfer textual information into knowledge on summarization using video information only. In particular, we formulate the video summarization problem as that of learning a mapping from a set of shots of a video to a subset of the shots, using the general framework of SVM-based structured learning. Textual information is transferred by encoding it into a set of constraints used in the structured learning process, which tend to provide a more detailed and accurate characterization of the different subsets of shots. Experimental results show significant performance improvement of our approach and demonstrate the utility of textual information for enhancing video summarization.

Book
31 Oct 2011
TL;DR: This chapter sets the focus on automatic summarization of text using as few direct human resources as possible, resulting in what can be perceived as an intermediary system, and presents the notion of taking a holistic view of the generation of summaries.
Abstract: Today, with digitally stored information available in abundance, even for many less commonly spoken languages, this information must by some means be filtered and extracted in order to avoid drowning in it. Automatic summarization is one such technique, where a computer summarizes a longer text into a shorter non-redundant form. The development of advanced summarization systems also for smaller languages may unfortunately prove too costly. Nevertheless, there will still be a need for summarization tools for these languages in order to curb the immense flow of digital information. This chapter sets the focus on automatic summarization of text using as few direct human resources as possible, resulting in what can be perceived as an intermediary system. Furthermore, it presents the notion of taking a holistic view of the generation of summaries.

Journal ArticleDOI
TL;DR: This paper proposes a novel approach based on spectral analysis to simultaneously cluster and rank sentences, and demonstrates the improvement of the proposed approach over other existing clustering-based approaches.

Proceedings ArticleDOI
Chao Shen, Tao Li
11 Dec 2011
TL;DR: This paper makes use of sentence-to-sentence relationships to better estimate the probability that a sentence in the document set is a summary sentence, and adopts a cost-sensitive loss in the ranking SVM's objective function.
Abstract: In this paper, we explore how to use ranking SVM to train the feature weights for query-focused multi-document summarization. To apply a supervised learning method to sentence extraction in multi-document summarization, we need to derive the sentence labels for the training corpus from the existing human labeling data, which take the form of abstractive summaries. However, this process is not trivial, because the human summaries are abstractive and do not necessarily match the sentences in the documents well. In this paper, we address the above problem from the following two aspects. First, we make use of sentence-to-sentence relationships to better estimate the probability that a sentence in the document set is a summary sentence. Second, to make the training less sensitive to errors in the derived data, we adopt a cost-sensitive loss in the ranking SVM's objective function. The experimental results demonstrate the effectiveness of our proposed method.
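A sketch of the standard pairwise reduction behind ranking SVM: each "sentence i should outrank sentence j" pair becomes a classification example on the feature difference. The per-pair weights below are a crude stand-in for the paper's cost-sensitive loss, and all data is hypothetical:

```python
# Ranking SVM via the pairwise transform, with pair weights emulating
# a cost-sensitive objective (uncertain pairs cost less).
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0.9, 0.2], [0.4, 0.7], [0.1, 0.1], [0.6, 0.5]])
p = np.array([0.9, 0.6, 0.1, 0.5])   # estimated summary-sentence probability

pairs, targets, weights = [], [], []
for i in range(len(X)):
    for j in range(len(X)):
        if p[i] > p[j]:
            pairs += [X[i] - X[j], X[j] - X[i]]   # both directions
            targets += [1, -1]
            weights += [p[i] - p[j]] * 2          # label gap as pair cost

clf = LinearSVC(C=1.0)
clf.fit(np.array(pairs), np.array(targets), sample_weight=np.array(weights))

# Rank sentences by projection onto the learned weight vector.
w = clf.coef_.ravel()
print("ranking (best first):", np.argsort(-(X @ w)))
```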

Journal ArticleDOI
TL;DR: A frequent-term-based text summarization algorithm is designed and implemented using open-source technologies such as Java, DISCO, and the Porter stemmer, and verified over a standard text mining corpus.
Abstract: Text summarization is an important activity in the analysis of high-volume text documents and has many applications; recently, a number of applications have used text summarization to improve text analysis and knowledge representation. In this paper a frequent-term-based text summarization algorithm is designed and implemented in Java. The designed algorithm works in three steps. In the first step, the document to be summarized is processed by eliminating stop words and applying stemmers. In the second step, term-frequency data is calculated from the document and frequent terms are selected; for these selected words, semantically equivalent terms are also generated. Finally, in the third step, all the sentences in the document which contain the frequent and semantically equivalent terms are filtered for summarization. The designed algorithm is implemented using open-source technologies such as Java, DISCO, and the Porter stemmer, and verified over a standard text mining corpus.
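The three-step pipeline translates almost directly into code. The sketch below uses NLTK's Porter stemmer and a stub stop-word list; the DISCO-based generation of semantically equivalent terms is omitted:

```python
# Frequent-term summarization: (1) stop-word removal and stemming,
# (2) frequent-term selection, (3) keep sentences containing them.
import re
from collections import Counter
from nltk.stem import PorterStemmer   # pip install nltk

STOPWORDS = {"the", "a", "is", "of", "in", "and", "to", "for"}
stem = PorterStemmer().stem

document = (
    "Text summarization reduces long documents. "
    "Summarization of documents helps text analysis. "
    "Frequent terms guide the summary."
)
sentences = re.split(r"(?<=[.!?])\s+", document.strip())

def tokens(text):
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS]

freq = Counter(t for s in sentences for t in tokens(s))
frequent = {t for t, _ in freq.most_common(3)}   # top frequent stems

summary = [s for s in sentences if frequent & set(tokens(s))]
print(summary)
```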

Journal ArticleDOI
26 Aug 2011 - PLOS ONE
TL;DR: The experimental results on summarization for a set of diseases show that the introduction of semantic knowledge improves performance, with results better than those of the MEAD system, a well-known tool for text summarization.
Abstract: Automatic text summarization for a biomedical concept can help researchers to get the key points of a certain topic from a large amount of biomedical literature efficiently. In this paper, we present a method for generating a text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) We extract semantic relations in each sentence using the semantic knowledge representation tool SemRep. 2) We develop a relation-level retrieval method to select the relations most relevant to each query concept and visualize them in a graphic representation. 3) For relations in the relevant set, we extract informative sentences that can interpret them from the document collection to generate a text summary using an information retrieval based method. Our major focus in this work is to investigate the contribution of semantic relation extraction to the task of biomedical text summarization. The experimental results on summarization for a set of diseases show that the introduction of semantic knowledge improves the performance, and our results are better than those of the MEAD system, a well-known tool for text summarization.

Proceedings Article
07 Aug 2011
TL;DR: This paper proposes Bi-mixture PLSA, a new formulation of PLSA that allows the number of latent word classes to differ from the number of latent document classes, and extends it to incorporate sentence information.
Abstract: Probabilistic Latent Semantic Analysis (PLSA) has been popularly used in document analysis. However, as it is currently formulated, PLSA strictly requires the number of word latent classes to be equal to the number of document latent classes. In this paper, we propose Bi-mixture PLSA, a new formulation of PLSA that allows the number of latent word classes to be different from the number of latent document classes. We further extend Bi-mixture PLSA to incorporate the sentence information, and propose Bi-mixture PLSA with sentence bases (Bi-PLSAS) to simultaneously cluster and summarize the documents utilizing the mutual influence of the document clustering and summarization procedures. Experiments on real-world datasets demonstrate the effectiveness of our proposed methods.
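For background, plain PLSA with EM can be written in a few lines of numpy; note how one set of K latent classes serves both the word and document sides, which is exactly the restriction Bi-mixture PLSA removes. This is textbook PLSA on toy counts, not the paper's Bi-mixture or Bi-PLSAS models:

```python
# Standard PLSA: P(w|d) = sum_z P(w|z) P(z|d), fitted with EM.
import numpy as np

rng = np.random.default_rng(0)
N = np.array([[4, 2, 0, 0],      # toy document-term counts
              [3, 1, 0, 1],
              [0, 0, 5, 2],
              [0, 1, 4, 3]], dtype=float)
D, W, K = N.shape[0], N.shape[1], 2

p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: responsibilities P(z|d,w), shape (D, W, K).
    joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate P(w|z) and P(z|d) from expected counts.
    nz = N[:, :, None] * post
    p_w_z = nz.sum(axis=0).T
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = nz.sum(axis=1)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

print("P(z|d):\n", np.round(p_z_d, 2))
```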

01 Jan 2011
TL;DR: A toolkit for evaluating single-document and multi-document summarization, and for evaluating summarization in the framework of cross-lingual information retrieval, is developed; the measurement of relevance correlation is introduced and systematically examined in this workshop.
Abstract: We report on research in multi-document summarization and on evaluation of summarization in the framework of cross-lingual information retrieval. This work was carried out during a summer workshop on Language Engineering held at Johns Hopkins University by a team of nine researchers from seven universities. The goals of the research were as follows: (1) to develop a toolkit for evaluation of single-document and multi-document summarizers, (2) to develop a modular multi-document summarizer, called MEAD, that works in both English and Chinese, and (3) to perform a meta-evaluation of four automatic summarizers, including MEAD, using several types of evaluation measures: some currently used by summarization researchers and a couple of novel techniques. Central to the experiments in this workshop was the cross-lingual experimental setup based on a large-scale Chinese and English parallel corpus. An extensive set of human judgments was specifically prepared by the Linguistic Data Consortium for our research. These human judgments include a) which documents are relevant to a certain query and b) which sentences in the relevant documents are most relevant to the query and which therefore constitute a good summary of the cluster. These judgments were used to construct variable-length multi- and single-document summaries as model summaries. Since one of the novel evaluation metrics that we used, Relevance Correlation, is based on the premise that good summaries preserve query relevance both within a language and across languages, we made use of a cross-lingual Information Retrieval (IR) engine. We evaluated the quality of the automatic summaries using co-selection and content-based evaluation, two established techniques. A relatively new metric, relative utility, was also extensively tested. Part of the new scientific contribution is the measurement of relevance correlation, which we introduced and systematically examined in this workshop. Relevance correlation measures the quality of summaries in comparison to the entire documents as a function of how much document relevance drops if summaries are indexed instead of documents. Our results show that this measure is sensible, in that it correlates with more established evaluation measures. Another contribution is the cross-lingual setup which allows us to automatically translate English queries into Chinese and perform Chinese IR with or without summarization. This allows us to calculate relevance correlation for English and for Chinese in parallel (i.e., for the same queries) and to make direct cross-lingual comparisons of evaluations. Additionally, an alternative way of constructing Chinese model summaries from English ones was implemented which relies on the sentence alignment of English and Chinese documents. The results of our large-scale meta-evaluation are numerous, but some of the highlights are the following: (1) All evaluation measures rank human summaries first, which is an appropriate and expected property of such measures, (2) Both relevance correlation and the content-based measures place leading sentence extracts ahead of the more sophisticated summarizers, (3) Relative utility ranks our system, MEAD, as the best summarizer for shorter summaries, although for longer summaries, lead-based summaries outperform MEAD, (4) Co-selection measurements show overall low agreement amongst humans (above chance), whereas relative utility reports higher numbers on the same data (but does not normalize for chance).
The deliverable resources and software include: (1) a turn-key extractive multi-document summarizer, MEAD, which allows users to add their own features based on single sentences or pairs of sentences, (2) a large corpus of summaries produced by several automatic methods, including baseline and random summaries, (3) a collection of manual summaries produced by the Linguistic Data Consortium (LDC), (4) a battery of evaluation routines, (5) a collection of IR queries in English and Chinese and the corresponding relevance judgments from the Hong Kong news collection, (6) SMART relevance outputs for both full documents and summaries, (7) XML tools for processing of documents and summaries.
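The Relevance Correlation idea can be made concrete with a small sketch: score full documents and their summaries against the same query with a simple TF-IDF retriever and correlate the two score lists. The workshop itself used the SMART engine and a cross-lingual setup; everything below is an illustrative stand-in:

```python
# Relevance correlation: do summaries preserve query relevance?
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import pearsonr

documents = [
    "the central bank raised interest rates to fight inflation this year",
    "the football team won the championship after a dramatic final match",
    "inflation eased slightly but prices for food remain high",
]
summaries = [
    "bank raises rates against inflation",
    "team wins championship final",
    "inflation eases, food prices high",
]
query = ["inflation and interest rates"]

vec = TfidfVectorizer().fit(documents + summaries + query)
q = vec.transform(query)
doc_rel = cosine_similarity(vec.transform(documents), q).ravel()
sum_rel = cosine_similarity(vec.transform(summaries), q).ravel()

r, _ = pearsonr(doc_rel, sum_rel)   # high r: summaries preserve relevance
print(f"relevance correlation for this query: {r:.3f}")
```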

23 Jun 2011
TL;DR: A novel unsupervised approach to the problem of multi-document summarization of scientific articles, in which the document collection is a list of papers cited together within the same source article, otherwise known as a co-citation, is presented.
Abstract: We present a novel unsupervised approach to the problem of multi-document summarization of scientific articles, in which the document collection is a list of papers cited together within the same source article, otherwise known as a co-citation. At the heart of the approach is a topic based clustering of fragments extracted from each co-cited article and relevance ranking using a query generated from the context surrounding the co-cited list of papers. This analysis enables the generation of an overview of common themes from the co-cited papers that relate to the context in which the co-citation was found. We present a system called SciSumm that embodies this approach and apply it to the 2008 ACL Anthology. We evaluate this summarization system for relevant content selection using gold standard summaries prepared on principle based guidelines. Evaluation with gold standard summaries demonstrates that our system performs better in content selection than an existing summarization system (MEAD). We present a detailed summary of our findings and discuss possible directions for future research.

Journal ArticleDOI
TL;DR: A novel search result summarization approach is introduced and an interactive browsing scheme, based on a tree structure for organizing the images obtained from the summarization approaches, is proposed to enable users to intuitively and conveniently browse the image search results.
Abstract: Presenting and browsing image search results play key roles in helping users to find desired images from search results. Most existing commercial image search engines present them in a ranked list. However, such a scheme suffers from at least two drawbacks: inconvenience for consumers trying to get an overview of the whole result set, and high computation cost for finding desired images in the list. In this paper, we introduce a novel search result summarization approach and exploit this approach to further propose an interactive browsing scheme. The main contributions of this paper include: (1) a dynamic absorbing random walk to find diversified representatives for image search result summarization; (2) a local scaled visual similarity evaluation scheme between two images obtained by inspecting the relation between each image and the other images; and (3) an interactive browsing scheme, based on a tree structure for organizing the images obtained from the summarization approach, to enable users to intuitively and conveniently browse the image search results. Quantitative experimental results and a user study demonstrate the effectiveness of the proposed summarization and browsing approaches.

Proceedings ArticleDOI
03 Jun 2011
TL;DR: This research develops a statistical automatic text summarization approach, the K-mixture semantic relationship significance (KSRS) approach, which employs the K-mixture probabilistic model to enhance the quality of document summaries.
Abstract: Automatic document summarization is a highly interdisciplinary research area related to computer science as well as cognitive psychology. Summarization compresses an original document into a summarized version by extracting almost all of the essential concepts with text mining techniques. This research focuses on developing a statistical automatic text summarization approach, the K-mixture semantic relationship significance (KSRS) approach, to enhance the quality of document summaries. KSRS employs the K-mixture probabilistic model to establish term weights in a statistical sense, and further identifies term relationships to derive the semantic relationship significance (SRS) of nouns. Sentences with significant semantic-relationship nouns are ranked and extracted to form the summary accordingly.
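For reference, the K-mixture (Katz) distribution underlying this kind of term weighting has a closed form that is easy to code. The sketch below implements that distribution from collection statistics; it assumes the standard parameterization (beta as the extra occurrences per containing document) and does not reproduce the paper's KSRS noun-relationship scoring:

```python
# Katz K-mixture: P(k) = (1-alpha)*[k=0] + alpha/(beta+1)*(beta/(beta+1))**k,
# parameterized from collection frequency cf, document frequency df,
# and the number of documents n_docs.
def k_mixture(cf, df, n_docs):
    """Return P(term occurs k times in a document) as a function of k."""
    beta = (cf - df) / df              # extra occurrences per containing doc
    alpha = (cf / n_docs) / beta if beta > 0 else 0.0
    def prob(k):
        p = alpha / (beta + 1) * (beta / (beta + 1)) ** k
        return p + (1 - alpha) if k == 0 else p
    return prob

# Toy numbers: a term occurring 50 times across 20 of 1000 documents.
p = k_mixture(cf=50, df=20, n_docs=1000)
print([round(p(k), 4) for k in range(4)])   # P(0), P(1), P(2), P(3)
```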

Proceedings ArticleDOI
01 Nov 2011
TL;DR: This work builds upon past extractive summarization methods to create abstractive summaries containing newly generated sentences, considering Telugu, a south Indian regional language, as the language of study.
Abstract: The Internet provides many sources of different opinions, expressed through user reviews of products, blogs, and forum discussions. Systems which could automatically summarize these opinions would be immensely useful for those who wish to use this information to make decisions. Previous work in automatic summarization has focused entirely on extractive summarization, in which key sentences are identified in the source text and extracted to form the output. An alternative solution is abstractive summarization, in which the information from the source text is first extracted into the form of abstract data, which is then post-processed to infer the most important message of the original text. This work builds upon past extractive summarization methods to create abstractive summaries that contain newly generated sentences. This paper conveys the methodology for the abstractive summarization process and its evaluation, considering Telugu, a south Indian regional language, as the language of study.

Journal ArticleDOI
30 Jun 2011
TL;DR: A graph-based approach to multi-document summarization that integrates machine translation quality scores in the sentence extraction process is proposed and results indicate that the approach improves the readability of the generated summaries without degrading their informativity.
Abstract: Cross-language summarization is the task of generating a summary in a language different from the language of the source documents. In this paper, we propose a graph-based approach to multi-document summarization that integrates machine translation quality scores in the sentence extraction process. We evaluate our method on a manually translated subset of the DUC 2004 evaluation campaign. Results indicate that our approach improves the readability of the generated summaries without degrading their informativity.

Proceedings ArticleDOI
12 Dec 2011
TL;DR: Experimental results showed that the summaries produced by the proposed approaches are better than those produced by the Microsoft Word 2007, Copernic Summarizer, and MANYASPECTS summarizers.
Abstract: Automatic text summarization is a data reduction process that excludes unnecessary details and presents important information in a shorter version. One way to summarize a document is by extracting its important sentences. To select suitable sentences, a numerical rank is assigned to each sentence based on a sentence scoring approach, and highly ranked sentences are used for the summary. This paper proposes an automatic text summarization approach based on sentence extraction using fuzzy logic, a genetic algorithm, semantic role labeling, and their combinations to generate high-quality summaries. The study explores the benefits of the genetic algorithm for feature selection during the training phase and for adjusting feature weights during the testing phase. Fuzzy IF-THEN rules are used to balance the weights between important and unimportant features. Conventional extraction methods cannot capture semantic relations between concepts in a text; therefore, this research investigates the use of semantic role labeling to capture the semantic content of sentences and incorporates it into the summarization method. Performance is evaluated using the ROUGE toolkit. Experimental results showed that the summaries produced by the proposed approaches are better than those produced by the Microsoft Word 2007, Copernic Summarizer, and MANYASPECTS summarizers.
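A toy sketch of the fuzzy IF-THEN scoring idea: triangular membership functions over two illustrative features, min for rule firing (fuzzy AND), and a weighted average for defuzzification. The paper's actual rule base, feature set, and GA-tuned weights are not reproduced here:

```python
# Fuzzy-rule sentence scoring with hand-picked memberships and rules.
def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def score_sentence(title_overlap, rel_length):
    # Feature memberships (supports chosen arbitrarily for illustration).
    overlap_high = tri(title_overlap, 0.3, 1.0, 1.7)
    overlap_low  = tri(title_overlap, -0.7, 0.0, 0.7)
    length_good  = tri(rel_length, 0.2, 0.5, 0.8)
    # Rules: (firing strength via min, crisp output level).
    rules = [
        (min(overlap_high, length_good), 1.0),        # -> important
        (min(overlap_high, 1 - length_good), 0.6),
        (overlap_low, 0.1),                           # -> unimportant
    ]
    total = sum(s for s, _ in rules)
    return sum(s * out for s, out in rules) / total if total else 0.0

for feats in [(0.8, 0.5), (0.1, 0.9)]:
    print(feats, "->", round(score_sentence(*feats), 3))
```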