
Showing papers on "Multi-document summarization published in 2006"


Proceedings ArticleDOI
06 Nov 2006
TL;DR: A multi-knowledge based approach is proposed, which integrates WordNet, statistical analysis and movie knowledge, and the experimental results show the effectiveness of the proposed approach in movie review mining and summarization.
Abstract: With the flourishing of the Web, online reviews are becoming a more and more useful and important information resource for people. As a result, automatic review mining and summarization has become a hot research topic recently. Different from traditional text summarization, review mining and summarization aims at extracting the features on which the reviewers express their opinions and determining whether the opinions are positive or negative. In this paper, we focus on a specific domain: movie reviews. A multi-knowledge based approach is proposed, which integrates WordNet, statistical analysis and movie knowledge. The experimental results show the effectiveness of the proposed approach in movie review mining and summarization.

931 citations


Proceedings Article
01 Jan 2006
TL;DR: Both news and web blog articles are investigated, algorithms for opinion extraction at the word, sentence and document levels are proposed, the issue of relevant sentence selection is discussed, and topical and opinionated information is then summarized.
Abstract: Humans like to express their opinions and are eager to know others' opinions. Automatically mining and organizing opinions from heterogeneous information sources is very useful for individuals, organizations and even governments. Opinion extraction, opinion summarization and opinion tracking are three important techniques for understanding opinions. Opinion extraction mines opinions at the word, sentence and document levels from articles. Opinion summarization summarizes the opinions of articles by reporting sentiment polarities, degrees and the correlated events. In this paper, both news and web blog articles are investigated. TREC, NTCIR and articles collected from web blogs serve as the information sources for opinion extraction. Documents related to the issue of animal cloning are selected as the experimental materials. Algorithms for opinion extraction at the word, sentence and document levels are proposed. The issue of relevant sentence selection is discussed, and then topical and opinionated information is summarized. Opinion summarizations are visualized by representative sentences. Text-based summaries in different languages, and from different sources, are compared. Finally, an opinionated curve showing supportive and non-supportive degree along the timeline is illustrated by an opinion tracking system.

502 citations


Journal ArticleDOI
TL;DR: A novel video summarization technique using Delaunay clusters is proposed that generates good-quality summaries with fewer frames and less redundancy compared to other schemes.
Abstract: Recent advances in technology have made tremendous amounts of multimedia information available to the general population. An efficient way of dealing with this new development is to develop browsing tools that distill multimedia data into information-oriented summaries. Such an approach will not only suit resource-poor environments such as wireless and mobile, but also enhance browsing on the wired side for applications like digital libraries and repositories. Automatic summarization and indexing techniques will give users an opportunity to browse and select multimedia documents of their choice for complete viewing later. In this paper, we present a technique by which we can automatically gather the frames of interest in a video for purposes of summarization. Our proposed technique is based on using Delaunay triangulation for clustering the frames in videos. We represent the frame contents as multi-dimensional point data and use Delaunay triangulation for clustering them. We propose a novel video summarization technique using Delaunay clusters that generates good-quality summaries with fewer frames and less redundancy when compared to other schemes. In contrast to many other clustering techniques, the Delaunay clustering algorithm is fully automatic, with no user-specified parameters, and is well suited for batch processing. We demonstrate these and other desirable properties of the proposed algorithm by testing it on a collection of videos from the Open Video Project. We provide a meaningful comparison between the results of the proposed summarization technique, the Open Video storyboard, and K-means clustering. We evaluate the results in terms of metrics that measure the content-representational value of the proposed technique.

330 citations
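
To make the clustering step concrete, here is a minimal sketch of Delaunay-based clustering under stated assumptions: frames have already been reduced to 2-D feature points, and edges much longer than typical are cut so that connected components become clusters. The mean-plus-one-standard-deviation cutoff is an illustrative choice, not the paper's rule.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_clusters(points):
    """points: (n, 2) array of per-frame feature vectors reduced to 2-D."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:
        for i in range(3):
            for j in range(i + 1, 3):
                edges.add(tuple(sorted((int(simplex[i]), int(simplex[j])))))
    lengths = {e: np.linalg.norm(points[e[0]] - points[e[1]]) for e in edges}
    vals = list(lengths.values())
    cutoff = np.mean(vals) + np.std(vals)  # assumed separating-edge threshold
    # Union-find over the short ("intra-cluster") edges.
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (a, b), length in lengths.items():
        if length <= cutoff:
            parent[find(a)] = find(b)
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    # One keyframe per cluster (e.g., the medoid) would form the storyboard.
    return list(clusters.values())
```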


Proceedings ArticleDOI
17 Jul 2006
TL;DR: It is shown that approximate inference in BAYESUM is possible on large data sets and results in a state-of-the-art summarization system, and how BAYESUM can be understood as a justified query expansion technique in the language modeling for IR framework.
Abstract: We present BAYESUM (for "Bayesian summarization"), a model for sentence extraction in query-focused summarization. BAYESUM leverages the common case in which multiple documents are relevant to a single query. Using these documents as reinforcement for query terms, BAYESUM is not afflicted by the paucity of information in short queries. We show that approximate inference in BAYESUM is possible on large data sets and results in a state-of-the-art summarization system. Furthermore, we show how BAYESUM can be understood as a justified query expansion technique in the language modeling for IR framework.

265 citations
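
A loose sketch of the intuition BAYESUM formalizes, not the paper's Bayesian inference: a short query's unigram model is reinforced with one estimated from all documents relevant to that query, and sentences are scored against the expanded model. The mixing weight alpha and all names here are assumptions.

```python
import math
from collections import Counter

def expanded_query_model(query, relevant_docs, alpha=0.5):
    """Mix the query's unigram distribution with one estimated from all
    documents relevant to that query; alpha is an assumed mixing weight."""
    def dist(text):
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}
    q, d = dist(query), dist(" ".join(relevant_docs))
    return {w: alpha * q.get(w, 0.0) + (1 - alpha) * d.get(w, 0.0)
            for w in set(q) | set(d)}

def sentence_score(sentence, model, floor=1e-9):
    # Log-likelihood of the sentence under the expanded query model.
    return sum(math.log(model.get(w.lower(), floor)) for w in sentence.split())
```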


Proceedings ArticleDOI
06 Aug 2006
TL;DR: The research shows that a frequency-based summarizer can achieve performance comparable to that of state-of-the-art systems, but only with a good composition function; context sensitivity improves performance and significantly reduces repetition.
Abstract: The usual approach to automatic summarization is sentence extraction, where key sentences from the input documents are selected based on a suite of features. While word frequency is often used as a feature in summarization, its impact on system performance has not been isolated. In this paper, we study the contribution to summarization of three factors related to frequency: content word frequency, composition functions for estimating sentence importance from word frequency, and adjustment of frequency weights based on context. We carry out our analysis using datasets from the Document Understanding Conferences, studying not only the impact of these features on automatic summarizers, but also their role in human summarization. Our research shows that a frequency-based summarizer can achieve performance comparable to that of state-of-the-art systems, but only with a good composition function; context sensitivity improves performance and significantly reduces repetition.

246 citations
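
A minimal sketch of such a frequency-based summarizer. The average-probability composition function and the squared-probability context adjustment are one plausible instantiation (SumBasic-style), not necessarily the exact functions studied in the paper.

```python
from collections import Counter

def frequency_summary(sentences, max_words=100):
    words = [w.lower() for s in sentences for w in s.split()]
    prob = {w: c / len(words) for w, c in Counter(words).items()}

    def score(sent):
        # Composition function: average word probability in the sentence.
        toks = [w.lower() for w in sent.split()]
        return sum(prob.get(w, 0.0) for w in toks) / max(len(toks), 1)

    summary, length, pool = [], 0, list(sentences)
    while pool and length < max_words:
        best = max(pool, key=score)
        summary.append(best)
        length += len(best.split())
        pool.remove(best)
        for w in set(w.lower() for w in best.split()):
            # Context adjustment: squaring down-weights already-used words,
            # which is what cuts repetition across selected sentences.
            prob[w] = prob[w] ** 2
    return summary
```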


Proceedings Article
01 Apr 2006
TL;DR: A framework for summarizing a corpus of evaluative documents about a single entity with a natural language summary is presented, and it is indicated that for evaluative text, abstraction tends to be more effective than extraction, particularly when the corpus is controversial.
Abstract: In many decision-making scenarios, people can benefit from knowing what other people's opinions are. As more and more evaluative documents are posted on the Web, summarizing these useful resources becomes a critical task for many organizations and individuals. This paper presents a framework for summarizing a corpus of evaluative documents about a single entity with a natural language summary. We propose two summarizers: an extractive summarizer and an abstractive one. As an additional contribution, we show how our abstractive summarizer can be modified to generate summaries tailored to a model of the user's preferences that is solidly grounded in decision theory and can be effectively elicited from users. We have tested our framework in three user studies. In the first one, we compared the two summarizers. They performed equally well relative to each other quantitatively, while significantly outperforming a baseline standard approach to multidocument summarization. Trends in the results as well as qualitative comments from participants suggest that the summarizers have different strengths and weaknesses. After this initial user study, we realized that the diversity of opinions expressed in the corpus (i.e., its controversiality) might play a critical role in comparing abstraction versus extraction. To clearly pinpoint the role of controversiality, we ran a second user study in which we controlled for the degree of controversiality of the corpora that were summarized for the participants. The outcome of this study indicates that for evaluative text, abstraction tends to be more effective than extraction, particularly when the corpus is controversial. In the third user study we assessed the effectiveness of our user-tailoring strategy. The results of this experiment confirm that user-tailored summaries are more informative than untailored ones.

176 citations


Patent
21 Mar 2006
TL;DR: A list of hot topics is provided to a user to indicate information that is currently popular, and a topic may be deemed popular when a large number of search queries related to the topic are entered by users.
Abstract: A list of "hot topics" may be provided to a user to indicate information that is currently popular. A topic may be deemed popular when a large number of search queries related to the topic are entered by users. A search system may receive and analyze an electronic source of published information to determine why a particular popular topic is popular. If content related to why a particular popular topic is popular exists in multiple electronic sources of published information, text summarization techniques may be used to determine, from among the multiple electronic sources of published information, a reason why the popular topic is popular.

144 citations


Proceedings ArticleDOI
17 Jul 2006
TL;DR: An "oracle" score based on the probability distribution of unigrams in human summaries is introduced, and it is demonstrated that with the oracle score, extracts can be generated which score, on average, better than the human summaries when evaluated with ROUGE.
Abstract: We consider the problem of producing a multi-document summary given a collection of documents. Since most successful methods of multi-document summarization are still largely extractive, in this paper, we explore just how well an extractive method can perform. We introduce an "oracle" score, based on the probability distribution of unigrams in human summaries. We then demonstrate that with the oracle score, we can generate extracts which score, on average, better than the human summaries, when evaluated with ROUGE. In addition, we introduce an approximation to the oracle score which produces a system with the best known performance for the 2005 Document Understanding Conference (DUC) evaluation.

123 citations
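
The oracle idea reduces to scoring candidate sentences against the empirical unigram distribution of the human summaries. A hedged sketch, with illustrative names (the averaging composition is an assumption):

```python
from collections import Counter

def oracle_scores(candidates, human_summaries):
    ref = [w.lower() for s in human_summaries for w in s.split()]
    ref_prob = {w: c / len(ref) for w, c in Counter(ref).items()}
    scores = {}
    for sent in candidates:
        toks = [w.lower() for w in sent.split()]
        # Average probability of the sentence's words under the model of
        # human-summary unigrams; extraction greedily takes the top scorers.
        scores[sent] = sum(ref_prob.get(w, 0.0) for w in toks) / max(len(toks), 1)
    return scores
```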


Proceedings ArticleDOI
04 Jun 2006
TL;DR: An information-theoretic approach to automatic evaluation of summaries, based on the Jensen-Shannon divergence between the distributions of an automatic summary and a set of reference summaries, is introduced; results indicate that the JS divergence-based method performs comparably to the common automatic evaluation method ROUGE on single-document summarization and better than ROUGE on multi-document summarization.
Abstract: Until recently there have been no common, convenient, and repeatable evaluation methods that could be easily applied to support fast turn-around development of automatic text summarization systems. In this paper, we introduce an information-theoretic approach to automatic evaluation of summaries based on the Jensen-Shannon divergence between the distributions of an automatic summary and a set of reference summaries. Several variants of the approach are also considered and compared. The results indicate that the JS divergence-based evaluation method achieves performance comparable to the common automatic evaluation method ROUGE on the single-document summarization task, while achieving better performance than ROUGE on the multi-document summarization task.

110 citations
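
The core computation is small enough to sketch directly; the Laplace smoothing used here to keep the divergence finite is an assumption, not necessarily the paper's variant.

```python
import math
from collections import Counter

def smoothed_dist(text, vocab):
    counts = Counter(w.lower() for w in text.split())
    total = sum(counts.values())
    # Laplace smoothing keeps every probability strictly positive.
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def js_divergence(summary, references):
    vocab = set(w.lower()
                for t in [summary] + references for w in t.split())
    p = smoothed_dist(summary, vocab)
    q = smoothed_dist(" ".join(references), vocab)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}
    kl = lambda a, b: sum(a[w] * math.log2(a[w] / b[w]) for w in vocab)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)  # lower = closer to references
```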


Proceedings ArticleDOI
04 Jun 2006
TL;DR: An affinity graph based approach to multi-document summarization that incorporates a diffusion process to acquire semantic relationships between sentences, and then compute information richness of sentences by a graph rank algorithm on differentiated intra- document links and inter-document links between sentences.
Abstract: This paper describes an affinity graph based approach to multi-document summarization. We incorporate a diffusion process to acquire semantic relationships between sentences, and then compute information richness of sentences by a graph rank algorithm on differentiated intra-document links and inter-document links between sentences. A greedy algorithm is employed to impose diversity penalty on sentences and the sentences with both high information richness and high information novelty are chosen into the summary. Experimental results on task 2 of DUC 2002 and task 2 of DUC 2004 demonstrate that the proposed approach outperforms existing state-of-the-art systems.

105 citations
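
A simplified sketch of graph ranking with a greedy diversity penalty, in the spirit of the approach above: a plain similarity matrix (computed upstream) stands in for the paper's diffusion-based affinities, and the damping factor and penalty weight are assumed values.

```python
import numpy as np

def rank_with_diversity(sim, n_select=5, d=0.85, penalty=0.7, iters=100):
    """sim: symmetric (n, n) sentence-similarity matrix with zero diagonal."""
    n = sim.shape[0]
    trans = sim / np.maximum(sim.sum(axis=1, keepdims=True), 1e-12)
    rich = np.ones(n) / n
    for _ in range(iters):  # power iteration: information richness
        rich = (1 - d) / n + d * trans.T @ rich
    scores, picked = rich.copy(), []
    for _ in range(min(n_select, n)):
        best = int(np.argmax(scores))
        picked.append(best)
        scores[best] = -np.inf
        # Diversity penalty: demote sentences similar to the one just picked.
        scores -= penalty * rich[best] * sim[:, best]
    return picked
```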


Proceedings ArticleDOI
06 Nov 2006
TL;DR: A method is presented that creates query-specific summaries by identifying the most query-relevant fragments and combining them using the semantic associations within the document, computed via top spanning trees on document graphs.
Abstract: There has been a great amount of work on query-independent summarization of documents. However, due to the success of Web search engines query-specific document summarization (query result snippets) has become an important problem, which has received little attention. We present a method to create query-specific summaries by identifying the most query-relevant fragments and combining them using the semantic associations within the document. In particular, we first add structure to the documents in the preprocessing stage and convert them to document graphs. Then, the best summaries are computed by calculating the top spanning trees on the document graphs. We present and experimentally evaluate efficient algorithms that support computing summaries in interactive time. Furthermore, the quality of our summarization method is compared to current approaches using a user survey.
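
A rough sketch of the spanning-tree idea, assuming a prebuilt document graph whose nodes carry a "text" attribute and whose weighted edges encode semantic association; networkx's maximum_spanning_tree and the substring relevance test stand in for the paper's top-spanning-tree computation.

```python
import networkx as nx

def query_snippet(doc_graph, query_terms, k=4):
    # Fragments mentioning query terms, plus their neighbors, form the
    # candidate set; a maximum-weight spanning tree then links them through
    # their strongest semantic associations.
    hits = [n for n in doc_graph
            if any(t in doc_graph.nodes[n]["text"].lower() for t in query_terms)]
    nodes = set(hits)
    for n in hits:
        nodes.update(doc_graph.neighbors(n))
    tree = nx.maximum_spanning_tree(doc_graph.subgraph(nodes), weight="weight")
    # Keep the k best-connected fragments of the tree as the snippet.
    ranked = sorted(tree.nodes, key=lambda n: -tree.degree(n, weight="weight"))
    return [doc_graph.nodes[n]["text"] for n in ranked[:k]]
```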

Proceedings Article
01 Jan 2006
TL;DR: Computational approaches to summarizing dynamically introduced information, namely online discussions and blogs, and their evaluations are described; when branching into these newly emerged data types, a number of difficulties arise, which are discussed here.
Abstract: In this paper we describe computational approaches to summarizing dynamically introduced information, namely online discussions and blogs, and their evaluations. Research in the past has mainly focused on text-based summarization where the input data is predominantly newswire data. When branching into these newly emerged data types, we face a number of difficulties that are discussed here.

Proceedings Article
01 Jan 2006
TL;DR: This paper summarizes spontaneous conversations using a wide variety of features that have not been explored before, and examines the role of disfluencies in summarization, which in all previous work were either not explicitly handled or removed as noise.
Abstract: Most speech summarization research is conducted on broadcast news. In our view, spontaneous conversations are a more "typical" speech source, one that distinguishes speech summarization from text summarization, and hence a more appropriate domain for studying speech summarization. For example, spontaneous conversations contain more spoken-language characteristics, e.g. disfluencies and false starts. They are also more vulnerable to ASR errors. Previous research has studied some aspects of this type of data, but this paper addresses the problem further in several important respects. First, we summarize spontaneous conversations using a wide variety of features that have not been explored before. Second, we examine the role of disfluencies in summarization, which in all previous work were either not explicitly handled or removed as noise. Third, we break down and analyze the impact of WER on the individual features for summarization.

Proceedings ArticleDOI
23 Jul 2006
TL;DR: The evaluation shows that the best summarization systems have difficulty extracting relevant sentences in response to complex questions (as opposed to representative sentences that might be appropriate to a generic summary).
Abstract: The Document Understanding Conference (DUC) 2005 evaluation had a single user-oriented, question-focused summarization task, which was to synthesize from a set of 25-50 documents a well-organized, fluent answer to a complex question. The evaluation shows that the best summarization systems have difficulty extracting relevant sentences in response to complex questions (as opposed to representative sentences that might be appropriate to a generic summary). The relatively generous allowance of 250 words for each answer also reveals how difficult it is for current summarization systems to produce fluent text from multiple documents.

Proceedings Article
Ani Nenkova
01 Jan 2006
TL;DR: Precision/recall schemes, as well as summary accuracy measures which incorporate weightings based on multiple human decisions, are suggested as particularly suitable in evaluating speech summaries.
Abstract: This paper surveys current text and speech summarization evaluation approaches. It discusses advantages and disadvantages of these, with the goal of identifying summarization techniques most suitable to speech summarization. Precision/recall schemes, as well as summary accuracy measures which incorporate weightings based on multiple human decisions, are suggested as particularly suitable in evaluating speech summaries.

Proceedings ArticleDOI
22 Jun 2006
TL;DR: A new user-query-based text summarization technique that makes use of the Unified Medical Language System (UMLS), an ontology knowledge source from the National Library of Medicine, is proposed; it performs clearly better than a keyword-only approach and shows potential for use in other information retrieval areas.
Abstract: As huge amounts of knowledge are created rapidly, effective information access becomes an important issue. Especially for critical domains, such as the medical and financial areas, efficient retrieval of concise and relevant information is highly desired. In this paper we propose a new user-query-based text summarization technique that makes use of the Unified Medical Language System (UMLS), an ontology knowledge source from the National Library of Medicine. We compare our method with a keyword-only approach, and our ontology-based method performs clearly better. Our method also shows potential to be used in other information retrieval areas.

Proceedings ArticleDOI
06 Aug 2006
TL;DR: An evaluation of a novel hierarchical text summarization method that allows users to view summaries of Web documents on small, mobile devices is presented; in comparing the new method to three other summarization methods, subjects achieved significantly better accuracy on the tasks when using hierarchical summaries.
Abstract: We present an evaluation of a novel hierarchical text summarization method that allows users to view summaries of Web documents from small, mobile devices. Unlike previous approaches, ours does not require the documents to be in HTML since it infers a hierarchical structure automatically. Currently, the method is used to summarize news articles sent to a Web mail account in plain text format. Subjects used a Web-enabled mobile phone emulator to access the account's inbox and view the summarized news articles. They then used the summaries to complete several information-seeking tasks, which involved answering factual questions about the stories. In comparing the hierarchical text summary setting to that in which subjects were given the full text articles, there was no significant difference in task accuracy or the time taken to complete the task. However, in the hierarchical summarization setting, the number of bytes transferred per user request is less than half that of the full text case. Finally, in comparing the new method to three other summarization methods, subjects achieved significantly better accuracy on the tasks when using hierarchical summaries.

Proceedings ArticleDOI
18 Dec 2006
TL;DR: A generic summarization method is proposed that extracts the most relevant sentences from the source document to form a summary covering the main content of as many different topics as possible while reducing redundancy.
Abstract: In this paper we propose a generic summarization method that extracts the most relevant sentences from the source document to form a summary. The method is based on clustering of sentences. The distinguishing feature of this approach is that the generated summary covers the main content of as many different topics as possible while reducing redundancy. The clustering method seeks as much homogeneity within each cluster, and as much separability between clusters, as possible.
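
A minimal sketch of the idea, with k-means over TF-IDF sentence vectors standing in for the paper's own clustering method; picking the sentence nearest each centroid keeps topic coverage high and redundancy low.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_summary(sentences, n_topics=5):
    vecs = TfidfVectorizer().fit_transform(sentences)
    km = KMeans(n_clusters=n_topics, n_init=10).fit(vecs)
    summary = []
    for c in range(n_topics):
        members = np.where(km.labels_ == c)[0]
        # Representative sentence: the one closest to the cluster centroid.
        dists = np.linalg.norm(vecs[members].toarray() - km.cluster_centers_[c],
                               axis=1)
        summary.append(sentences[members[np.argmin(dists)]])
    return summary
```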

01 Jan 2006
TL;DR: A general modular automatic summarizer that achieves state-of-the-art performance is described, techniques for automatic modification of the original author's wording of sentences that are included in a summary are explored, and an algorithm for a context-sensitive frequency-based summarizer is proposed.
Abstract: Recent years have seen unprecedented interest in news aggregation and browsing, with dedicated corporate and research websites becoming increasingly popular. Generic multidocument summarization can enhance users' experiences with such sites, and thus the development and evaluation of automatic summarization systems has become not only a research challenge but a very practical one. In this thesis, we describe a general modular automatic summarizer that achieves state-of-the-art performance, present our experiments with rewriting of generic noun phrases and of references to people, and demonstrate how distinctions such as familiarity and salience of entities mentioned in the input can be automatically determined. We also propose an intrinsic evaluation method for summarization that incorporates the use of multiple models and allows a better study of human agreement in content selection. Our investigations and experiments have helped us to understand better the process of summarization and to formulate tasks that we believe will lead to future improvements in automatic summarization. It is well known that humans do not fully agree on what content should be included in a summary. Traditionally, this phenomenon has been studied on the level of sentences, but sentences are a rather coarse level of granularity for content analysis. Here, we introduce an annotation method for semantically driven comparison of several texts for similarities and differences on the subsentential level. When applied to human summaries for the same input, the method allows for a better examination of human agreement, and also provides the basis for an evaluation method that incorporates the notion of importance of a content unit in a summary. Given the variability of human choices, we next address the question of which features of the input are predictive of content being included in the summary. We use a large collection of human-written summaries and the respective inputs to study the predictive effect of one feature that has been widely used in summarization: frequency of occurrence. We show that content units that are repeated frequently in the input tend to be included in at least some human summaries and that human summarizers tend to agree more on the inclusion of frequent content units. In addition, human summaries tend to have higher likelihood under a multinomial model estimated from the input than automatic summaries do. This empirical investigation leads us to propose an algorithm for a context-sensitive frequency-based summarizer. We show that context sensitivity and a good choice of composition function for estimating the weight of a sentence lead to a summarizer that performs as well as the best supervised automatic summarizer. We then turn to exploring methods for summary rewrite; that is, techniques for automatic modification of the original author's wording of sentences that are included in a summary. The added flexibility of subsentential changes has potential benefits for improving content selection as well as summary readability. We show that human readers prefer summaries in which references to people have been rewritten to restore the fluency of the text. We further develop our work on references to people by presenting an approach to automatic classification of entity salience and familiarity, based on robustly derivable lexical, syntactic and frequency features. Such information is necessary for the generation of appropriate referring expressions.

Patent
30 May 2006
TL;DR: In this paper, a method for extraction and summarization of sentiment information related to a particular research subject is described, which includes accessing sources of information that contain sentiment information that is related to the research subject and extracting the sentiment information from the sources as opinions related to research subject.
Abstract: Methods and systems for extraction and summarization of sentiment information related to a particular research subject are disclosed. A method includes accessing sources of information that contain sentiment information that is related to the research subject and extracting the sentiment information from the sources of information as opinions related to the research subject. Opinion categories related to features of the research subject are identified. From this information a summarization of the sentiment information that is related to the particular research subject that includes the identified opinion categories is generated. Subsequently, access is provided to the summarization for graphical presentation.

01 Jan 2006
TL;DR: The Multi-Document Trimmer (MDT) uses Trimmer to generate multiple trimmed candidates for each sentence, and sentence selection is used to determine which trimmed candidates provide the best combination of topic coverage and brevity.
Abstract: We applied a single-document sentence-trimming approach (Trimmer) to the problem of multi-document summarization. Trimmer was designed with the intention of compressing a lead sentence into a space consisting of tens of characters. In our Multi-Document Trimmer (MDT), we use Trimmer to generate multiple trimmed candidates for each sentence. Sentence selection is used to determine which trimmed candidates provide the best combination of topic coverage and brevity. We demonstrate that we were able to port Trimmer easily to this new problem. We also show that MDT generally ranked higher for recall than for precision, suggesting that MDT is currently more successful at finding relevant content than it is at weeding out irrelevant content. Finally, we present an error analysis showing that, while sentence compression is making space for additional sentences, more work is needed in the area of generating and selecting the right candidates.

Proceedings ArticleDOI
17 Jul 2006
TL;DR: This work presents a bottom-up approach to arranging sentences extracted for multi-document summarization using four criteria, chronology, topical-closeness, precedence, and succession, which are integrated into a single criterion by a supervised learning approach.
Abstract: Ordering information is a difficult but important task for applications generating natural-language text. We present a bottom-up approach to arranging sentences extracted for multi-document summarization. To capture the association and order of two textual segments (e.g., sentences), we define four criteria: chronology, topical-closeness, precedence, and succession. These criteria are integrated into a single criterion by a supervised learning approach. We repeatedly concatenate two textual segments into one segment based on the criterion until we obtain the overall segment with all sentences arranged. Our experimental results show a significant improvement over existing sentence-ordering strategies.
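
A skeletal sketch of the bottom-up agglomeration, with only the chronology criterion shown and a hand-set weight in place of the paper's supervised combination; the data layout (sentences as dicts with a "date" field) is an assumption.

```python
def chronology(seg_a, seg_b):
    # Prefer a-before-b when a's earliest date precedes b's.
    a_first = min(s["date"] for s in seg_a)
    b_first = min(s["date"] for s in seg_b)
    return 1.0 if a_first <= b_first else 0.0

def association(seg_a, seg_b):
    # The paper combines four criteria (chronology, topical-closeness,
    # precedence, succession) by supervised learning; only one is shown here.
    return 1.0 * chronology(seg_a, seg_b)

def order_sentences(sentences):
    segments = [[s] for s in sentences]
    while len(segments) > 1:
        # Repeatedly concatenate the pair with the strongest a-then-b score.
        a, b = max(((x, y) for x in segments for y in segments if x is not y),
                   key=lambda p: association(*p))
        segments.remove(a)
        segments.remove(b)
        segments.append(a + b)
    return segments[0]
```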

01 Jan 2006
TL;DR: This approach is based on sentence extraction, where sentence-type annotation is used for weighting, and frequencies of terms with sentiment polarities are taken into account when the question types call for it.
Abstract: In this paper, we present our approach to opinion-focused summarization, its results on the DUC 2006 data, and additional analysis. We extend the approach we proposed for DUC 2005 to produce summaries that respond to multiple questions, assuming that the given narrative consists of multiple questions, by segmenting the "narrative" into questions. Our new approach is based on sentence extraction, where sentence-type annotation is used for weighting, and frequencies of terms with sentiment polarities are taken into account when the question types call for it. In addition, we selected 15 topics related to opinion-focused summarization and analyzed the sentences in the original source documents that correspond to the model summaries.

01 Jan 2006
TL;DR: It is shown that by considering entailment relationships between sentences extracted for a summary, Language Computer Corporation’s GIST EXTER can automatically create semantic “Pyramids” that can be used to identify answer passages that are both relevant and responsive.
Abstract: In this paper, we describe how Language Computer Corporation’s GIST EXTER question-directed summarization system combines multiple strategies for question decomposition and summary generation in order to produce summary-length answers to complex questions. In addition, we introduce a novel framework for question-directed summarization that uses a state-of-the-art textual entailment system (Hickl et al., 2006) in order to select a single responsive summary answer from amongst a number of candidate summaries. We show that by considering entailment relationships between sentences extracted for a summary, we can automatically create semantic “Pyramids” that can be used to identify answer passages that are both relevant and responsive.

Proceedings ArticleDOI
06 Nov 2006
TL;DR: The frequency of domain concepts is proposed as a method to identify important sentences within a full text, along with a novel frequency distribution model and algorithm for identifying important sentences based on term or concept frequency distribution.
Abstract: Text summarization is a data reduction process. The use of text summarization enables users to reduce the amount of text that must be read while still assimilating the core information. The data reduction offered by text summarization is particularly useful in the biomedical domain, where physicians must continuously find clinical trial study information to incorporate into their patient treatment efforts. Such efforts are often hampered by the high volume of publications. Our contribution is two-fold: 1) to propose the frequency of domain concepts as a method to identify important sentences within a full text; and 2) to propose a novel frequency distribution model and algorithm for identifying important sentences based on term or concept frequency distribution. An evaluation of several existing summarization systems using biomedical texts is presented in order to determine a performance baseline. For domain concept comparison, a recent high-performing frequency-based algorithm using terms is adapted to use concepts and evaluated using both terms and concepts. It is shown that the use of concepts performs closely to the use of terms for sentence selection. Our proposed frequency distribution model and algorithm outperform a state-of-the-art approach.

Proceedings ArticleDOI
07 Jun 2006
TL;DR: An automated technique for single-document summarization is proposed that combines content-based and graph-based approaches and introduces the Hopfield network algorithm as a technique for ranking text segments.
Abstract: The continuing growth of the World Wide Web and of online text collections makes a large volume of information available to users. Automatic text summarization allows users to quickly understand documents. In this paper, we propose an automated technique for single-document summarization which combines content-based and graph-based approaches, and we introduce the Hopfield network algorithm as a technique for ranking text segments. A series of experiments is performed using the DUC collection and a Thai-document collection. The results show the superiority of the proposed technique over reference systems; in addition, the Hopfield network algorithm on an undirected graph is shown to be the best text-segment ranking algorithm in the study.

Patent
16 Nov 2006
TL;DR: In this article, different levels of summarization may be applied to different segments of text within a document, depending upon the user's interaction with an application in conjunction with a version of the document or with a document structure including the document.
Abstract: Summarization of text in a document may be requested in dependence upon the position of the text in relation to other text within the document or the position of the document containing the text within a plurality of documents in a document structure. Summarization of text in a document may also be requested in dependence upon a user's interaction with an application in conjunction with a version of the document or with a document structure including the document. Different levels of summarization may be applied to different segments of text within a document.

Proceedings ArticleDOI
22 Jul 2006
TL;DR: A sentence ordering algorithm using a semi-supervised sentence classification and historical ordering strategy that helps to ensure topic continuity and avoid topic bias is proposed.
Abstract: In this paper, we propose a sentence ordering algorithm using a semi-supervised sentence classification and historical ordering strategy. The classification is based on the manifold structure underlying sentences, addressing the problem of limited labeled data. The historical ordering helps to ensure topic continuity and avoid topic bias. Experiments demonstrate that the method is effective.

Proceedings ArticleDOI
14 May 2006
TL;DR: The use of probabilistic latent topical information for extractive summarization of spoken documents is proposed, and its summarization capabilities are verified by comparison with the conventional vector space model and latent semantic indexing model, as well as the HMM model.
Abstract: The purpose of extractive summarization is to automatically select a number of indicative sentences, passages, or paragraphs from the original document according to a target summarization ratio and then sequence them to form a concise summary. In this paper, we propose the use of probabilistic latent topical information for extractive summarization of spoken documents. Various kinds of modeling structures and learning approaches are extensively investigated. In addition, the summarization capabilities are verified by comparison with the conventional vector space model and latent semantic indexing model, as well as the HMM model. The experiments were performed on Chinese broadcast news collected in Taiwan. Noticeable performance gains were obtained.

Proceedings ArticleDOI
18 Dec 2006
TL;DR: The relevance of the sentences to the specified topic is integrated into the graph-ranking based method for topic-focused multi-document summarization and the great importance of the cross-document relationships between sentences is demonstrated.
Abstract: Graph-ranking based methods have been developed for generic multi-document summarization in recent years; they make uniform use of the relationships between sentences to extract salient sentences. This paper proposes to integrate the relevance of the sentences to the specified topic into the graph-ranking based method for topic-focused multi-document summarization. The cross-document and within-document relationships between sentences are differentiated; we apply the graph-ranking based method using each individual kind of sentence relationship and explore their relative importance for topic-focused multi-document summarization. Experimental results on DUC 2003 and DUC 2005 demonstrate the great importance of the cross-document relationships between sentences for topic-focused multi-document summarization. Even the approach based only on the cross-document sentence relationships can perform better than, or at least as well as, the approaches based on both kinds of sentence relationships.
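
A small follow-on sketch of the link construction this paper varies: cross-document and within-document similarities receive different weights, and topic relevance biases each sentence's incoming strength. The specific weights are assumptions; a ranker like the one sketched under the affinity-graph entry above could then be run on the result.

```python
import numpy as np

def build_links(sim, doc_ids, topic_rel, cross_w=1.0, within_w=0.3):
    """sim: (n, n) sentence similarities; doc_ids: source document of each
    sentence; topic_rel: relevance of each sentence to the topic."""
    n = sim.shape[0]
    links = np.zeros_like(sim)
    for i in range(n):
        for j in range(n):
            if i != j:
                w = cross_w if doc_ids[i] != doc_ids[j] else within_w
                links[i, j] = sim[i, j] * w
    # Bias each sentence's incoming strength by its topic relevance.
    return links * np.asarray(topic_rel)[np.newaxis, :]
```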