Open Access · Posted Content

A Sequential Algorithm for Training Text Classifiers

Abstract
The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task. This method, which we call uncertainty sampling, reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.
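The selection loop behind uncertainty sampling can be illustrated with a minimal sketch. This page gives only the abstract, so everything below is an assumption made for illustration rather than the paper's actual procedure: labels are binary, the classifier returns a probability of class membership, and the examples the current classifier is least sure about (probability closest to 0.5) are sent for manual labeling. The names `select_most_uncertain`, `fit_threshold`, `uncertainty_sampling`, and `oracle` are hypothetical, and the one-dimensional threshold model is a toy stand-in for a real text classifier.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[float, int]        # (feature value, label in {0, 1})
Model = Callable[[float], float]   # maps a feature value to P(label = 1)


def select_most_uncertain(probs: Sequence[float], k: int) -> List[int]:
    """Return positions of the k probabilities closest to 0.5."""
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return order[:k]


def fit_threshold(examples: Sequence[Example]) -> Model:
    """Toy stand-in classifier (an assumption, not the paper's model):
    a sigmoid centred midway between the two class means.
    Requires at least one example of each class."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1.0 / (1.0 + math.exp(-(x - t)))


def uncertainty_sampling(
    pool: Sequence[float],
    oracle: Callable[[float], int],
    seed: Sequence[int],
    batch_size: int,
    rounds: int,
) -> Tuple[Model, Dict[int, int]]:
    """Repeatedly train, then ask the oracle to label the most uncertain items."""
    labeled: Dict[int, int] = {i: oracle(pool[i]) for i in seed}
    for _ in range(rounds):
        model = fit_threshold([(pool[i], y) for i, y in labeled.items()])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        probs = [model(pool[i]) for i in unlabeled]
        for position in select_most_uncertain(probs, batch_size):
            i = unlabeled[position]
            labeled[i] = oracle(pool[i])  # the manual labeling step
    final = fit_threshold([(pool[i], y) for i, y in labeled.items()])
    return final, labeled


# Toy usage: the "human" oracle labels a value positive when it exceeds 6.3.
pool = [i / 10 for i in range(101)]
oracle = lambda x: int(x > 6.3)
model, labeled = uncertainty_sampling(pool, oracle, seed=[0, 100], batch_size=1, rounds=5)
print(sorted(pool[i] for i in labeled))
```

In this toy run only seven of the 101 pool items are ever labeled. How much labeling is saved in practice depends on the classifier and the data, so the sketch should not be read as reproducing the reported 500-fold reduction.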


Citations

Active Learning Literature Survey

Burr Settles
TL;DR: This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date.
Proceedings Article

A comparison of event models for naive bayes text classification

TL;DR: It is found that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
Journal Article

Support vector machine active learning with applications to text classification

TL;DR: Experimental results showing that employing the active learning method can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings are presented.
Journal Article

Text Classification from Labeled and Unlabeled Documents using EM

TL;DR: This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents, and presents two extensions to the algorithm that improve classification accuracy under these conditions.
Book Chapter

Naive (Bayes) at forty: the independence assumption in information retrieval

TL;DR: The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval, and some of the variations used for text retrieval and classification are reviewed.
References
Proceedings Article

Query by committee

TL;DR: An algorithm called query by committee is proposed, in which a committee of students is trained on the same data set, and it is suggested that asymptotically finite information gain may be an important characteristic of good query algorithms.
Journal Article

Generalization as search

TL;DR: The problem of concept learning, or forming a general description of a class of objects given a set of examples and non-examples, is viewed here as a search problem.