Document Clustering and Text Summarization

TLDR

This paper presents a text mining tool that performs two tasks: document clustering, done with the AutoClass algorithm, and text summarization, which is based on a TF-ISF value computed for each word, an adaptation of the conventional TF-IDF measure from information retrieval.

Abstract:

This paper describes a text mining tool that performs two tasks: document clustering and text summarization. Both tasks have counterparts in “conventional” data mining, but the textual, unstructured nature of documents makes them considerably more difficult than those counterparts. In our system, document clustering is performed with the AutoClass data mining algorithm. Our text summarization algorithm computes a TF-ISF (term frequency – inverse sentence frequency) value for each word, an adaptation of the conventional TF-IDF (term frequency – inverse document frequency) measure from information retrieval in which sentences play the role of documents. Sentences with high TF-ISF values are selected to produce a summary of the source text. The system has been evaluated on real-world documents, and the results are satisfactory.
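The sentence-selection idea in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formula, so the sketch assumes TF-ISF(w, s) = TF(w, s) * log(N / SF(w)) by analogy with TF-IDF, where N is the number of sentences and SF(w) is the number of sentences containing word w, and it scores a sentence by the average TF-ISF of its distinct words. The function names (`split_sentences`, `tokenize`, `tf_isf_scores`, `summarize`), the naive sentence splitter, and the absence of stemming and stop-word removal are likewise assumptions made for brevity.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase alphanumeric tokens; no stemming or stop-word removal."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tf_isf_scores(sentences):
    """Return one score per sentence: the average TF-ISF of its distinct words.

    Assumed formula: TF-ISF(w, s) = TF(w, s) * log(N / SF(w)).
    """
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)

    # Sentence frequency: in how many sentences does each word occur?
    sf = Counter()
    for words in tokenized:
        sf.update(set(words))

    scores = []
    for words in tokenized:
        if not words:
            scores.append(0.0)
            continue
        tf = Counter(words)
        total = sum(tf[w] * math.log(n / sf[w]) for w in tf)
        scores.append(total / len(tf))
    return scores


def summarize(text, k=2):
    """Pick the k highest-scoring sentences and return them in source order."""
    sentences = split_sentences(text)
    scores = tf_isf_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    sample = (
        "Apples are red. Apples are sweet. "
        "Quantum tunnelling surprises physicists."
    )
    print(summarize(sample, k=1))
```

Under this assumed weighting, a word that occurs in every sentence gets an ISF of log(1) = 0 and so contributes nothing to any sentence's score, which mirrors how TF-IDF discounts terms that occur in every document.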

Citations

Journal Article

Recent automatic text summarization techniques: a survey

Mahak Gambhir, +1 more

- 01 Jan 2017 -

Artificial Intelligence Review

TL;DR: This paper presents a comprehensive survey of extractive text summarization approaches developed in the last decade and discusses future directions that can help researchers identify areas where further research is needed.

Journal Article

Ensemble of keyword extraction methods and classifiers in text classification

Aytuğ Onan, +2 more

- 15 Sep 2016 -

Expert Systems With Applications

TL;DR: The empirical analysis indicates that using a keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.

Book Chapter

Automatic Text Summarization Using a Machine Learning Approach

Joel Larocca Neto, +2 more

TL;DR: This paper presents a summarization procedure based on trainable machine learning algorithms, which employs a set of statistical and linguistic features extracted directly from the original text, including features derived from a simplified argumentative structure of the text.

Proceedings Article

Document clustering with prior knowledge

Xiang Ji, +1 more

TL;DR: This work proposes to incorporate prior knowledge of cluster membership into document cluster analysis and develops a novel semi-supervised document clustering model that shows remarkable performance improvements with very limited training samples and is a very effective semi-supervised classification tool.

Journal Article

A complex network approach to text summarization

Lucas Antiqueira, +3 more

- 01 Feb 2009 -

Information Sciences

TL;DR: A set of 14 summarizers, generically referred to as CN-Summ, is developed; they use network concepts such as node degree, length of shortest paths, d-rings and k-cores to select sentences for an extractive summary of texts.

References

Journal Article

Term Weighting Approaches in Automatic Text Retrieval

Gerard Salton, +1 more

- 01 Aug 1988 -

Information Processing and Management

TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.

Journal Article

An algorithm for suffix stripping

M. F. Porter

- 01 Dec 1997 -

Program: Electronic Library and Informat...

TL;DR: An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL, and performs slightly better than a much more elaborate system with which it has been compared.

Proceedings Article

A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization

Thorsten Joachims

TL;DR: A probabilistic analysis of the Rocchio relevance feedback algorithm, one of the most popular learning methods from information retrieval, is presented in a text categorization framework and suggests that the probabilistic algorithms are preferable to the heuristic Rocchio classifier.

Proceedings Article

Bayesian classification (AutoClass): theory and results

Peter Cheeseman, +1 more

TL;DR: It is emphasized that no current unsupervised classification system can produce maximally useful results when operated alone and that it is the interaction between domain experts and the machine searching over the model space that generates new knowledge.

Book

Managing gigabytes

Ian H. Witten

Related Papers (5)

The automatic creation of literature abstracts

Term Weighting Approaches in Automatic Text Retrieval

An algorithm for suffix stripping

ROUGE: A Package for Automatic Evaluation of Summaries

A Comparison of Document Clustering Techniques