An Evaluation of Statistical Approaches to Text Categorization

doi:10.1023/A:1009982220290

Journal Article

Yiming Yang

15 May 1999

Information Retrieval

Vol. 1, Iss. 1, pp. 69-90

TLDR

Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature.

Abstract:

This paper focuses on a comparative evaluation of a wide range of text categorization methods, including previously published results on the Reuters corpus and new results of additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine the impact of configuration variations in five versions of Reuters on the observed performance of classifiers. Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature. Using the results evaluated on the other versions of Reuters, which exclude the unlabelled documents, the performance of twelve methods is compared directly or indirectly. For indirect comparisons, kNN, LLSF and WORD were used as baselines, since they were evaluated on all versions of Reuters that exclude the unlabelled documents. As a global observation, kNN, LLSF and a neural network method had the best performance; except for a Naive Bayes approach, the other learning algorithms also performed relatively well.
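
The effect of unlabelled documents on evaluation scores can be illustrated with a minimal sketch. It assumes micro-averaged precision and recall as the metric (the abstract does not name one) and treats an unlabelled test document as having no correct category, so every category assigned to it counts as a false positive. The function name, document IDs, category names and predictions are hypothetical toy values, not data from the paper.

```python
from typing import Dict, Set, Tuple


def micro_precision_recall(
    gold: Dict[str, Set[str]], predicted: Dict[str, Set[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over the documents listed in `gold`."""
    tp = fp = fn = 0
    for doc_id, true_cats in gold.items():
        pred_cats = predicted.get(doc_id, set())
        tp += len(true_cats & pred_cats)   # correct assignments
        fp += len(pred_cats - true_cats)   # assigned but not in the gold labels
        fn += len(true_cats - pred_cats)   # gold labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy test set: d3 and d4 carry no labels at all.
gold = {
    "d1": {"grain"},
    "d2": {"earn"},
    "d3": set(),
    "d4": set(),
}

# Hypothetical classifier output: it also assigns categories to d3 and d4.
predicted = {
    "d1": {"grain"},
    "d2": {"earn"},
    "d3": {"earn"},
    "d4": {"grain"},
}

# Score once on the full set, once on the labelled documents only.
with_unlabelled = micro_precision_recall(gold, predicted)
labelled_only = {doc_id: cats for doc_id, cats in gold.items() if cats}
without_unlabelled = micro_precision_recall(labelled_only, predicted)

print("with unlabelled documents:   ", with_unlabelled)
print("without unlabelled documents:", without_unlabelled)
```

In this toy case the same predictions score a precision of 0.5 with the unlabelled documents included and 1.0 without them. If some of those documents really belong to a category but were never labelled, the lower figure reflects the test set rather than the classifier.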

Citations

Machine learning

Foundations of Statistical Natural Language Processing

Text Categorization with Support Vector Machines: Learning with Many Relevant Features

Machine learning in automated text categorization

Support vector machines

References

Induction of Decision Trees

Machine learning

Introduction to Modern Information Retrieval

A Comparative Study on Feature Selection in Text Categorization

Automatic text processing: the transformation, analysis, and retrieval of information by computer

Related Papers (5)

Text Categorization with Support Vector Machines: Learning with Many Relevant Features

Machine learning in automated text categorization

A Comparative Study on Feature Selection in Text Categorization

A comparison of event models for naive bayes text classification

Term Weighting Approaches in Automatic Text Retrieval