Open Access

Document Clustering and Text Summarization

TLDR
A text mining tool that performs two tasks: document clustering and text summarization. Summarization is based on computing the value of a TF-ISF measure for each word, an adaptation of the conventional TF-IDF measure from information retrieval.
Abstract
This paper describes a text mining tool that performs two tasks, namely document clustering and text summarization. These tasks have, of course, corresponding counterparts in “conventional” data mining. However, the textual, unstructured nature of documents makes these two text mining tasks considerably more difficult than their data mining counterparts. In our system, document clustering is performed using the Autoclass data mining algorithm. Our text summarization algorithm is based on computing the value of a TF-ISF (term frequency – inverse sentence frequency) measure for each word, which is an adaptation of the conventional TF-IDF (term frequency – inverse document frequency) measure of information retrieval. Sentences with high values of TF-ISF are selected to produce a summary of the source text. The system has been evaluated on real-world documents, and the results are satisfactory.
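The TF-ISF idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract does not give the formula, so the sketch assumes, by analogy with TF-IDF, that ISF(w) = log(N / SF(w)), where N is the number of sentences and SF(w) is the number of sentences containing w, and that a sentence's score is the average TF-ISF over its distinct words. The function names, the regex-based sentence splitter and tokenizer, and the parameter `k` are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphanumeric tokens; no stemming or stop-word removal."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tf_isf_scores(sentences):
    """Return one score per sentence: the mean TF-ISF of its distinct words.

    Assumed formula: TF-ISF(w, s) = TF(w, s) * log(N / SF(w)).
    """
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)

    # SF(w): number of sentences in which word w occurs at least once.
    sf = Counter()
    for words in tokenized:
        sf.update(set(words))

    scores = []
    for words in tokenized:
        if not words:
            scores.append(0.0)
            continue
        tf = Counter(words)
        total = sum(tf[w] * math.log(n / sf[w]) for w in tf)
        scores.append(total / len(tf))
    return scores


def summarize(text, k=2):
    """Select the k highest-scoring sentences, returned in source order."""
    sentences = split_sentences(text)
    scores = tf_isf_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Under these assumptions, a word that occurs in every sentence gets an ISF of zero and contributes nothing to any score, so sentences made up of words that are rare across the text rank highest. Calling `summarize(text, k=3)` would return the three top-ranked sentences in their original order.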


Citations
Journal Article

Recent automatic text summarization techniques: a survey

TL;DR: A comprehensive survey of extractive text summarization approaches developed in the last decade is presented, together with a discussion of future directions that can help researchers identify areas where further research is needed.
Journal Article

Ensemble of keyword extraction methods and classifiers in text classification

TL;DR: The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
Book Chapter

Automatic Text Summarization Using a Machine Learning Approach

TL;DR: This paper presents a summarization procedure based on trainable machine learning algorithms. The procedure employs a set of statistical and linguistic features extracted directly from the original text, including features drawn from a simplified argumentative structure of the text.
Proceedings Article

Document clustering with prior knowledge

Xiang Ji et al.
TL;DR: This work proposes to incorporate prior knowledge of cluster membership for document cluster analysis and develops a novel semi-supervised document clustering model that reveals remarkable performance improvements with very limited training samples, and is a very effective semi-supervised classification tool.
Journal Article

A complex network approach to text summarization

TL;DR: A set of 14 summarizers is developed, generically referred to as CN-Summ, employing network concepts such as node degree, length of shortest paths, d-rings and k-cores to select sentences for an extractive summary of texts.
References
Journal Article

Term Weighting Approaches in Automatic Text Retrieval

TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
Journal Article

An algorithm for suffix stripping

TL;DR: An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL, and performs slightly better than a much more elaborate system with which it has been compared.
Proceedings Article

A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization

TL;DR: A probabilistic analysis of the Rocchio relevance feedback algorithm, one of the most popular learning methods from information retrieval, is presented in a text categorization framework and suggests that the probabilistic algorithms are preferable to the heuristic Rocchio classifier.
Proceedings Article

Bayesian classification (AutoClass): theory and results

TL;DR: It is emphasized that no current unsupervised classification system can produce maximally useful results when operated alone and that it is the interaction between domain experts and the machine searching over the model space that generates new knowledge.
Book

Managing gigabytes

Ian H. Witten