Open Access Proceedings Article

A sequential algorithm for training text classifiers

David D. Lewis, +1 more
pp. 3-12
TL;DR: In this article, an algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task; it reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.
Abstract
The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task. This method, which we call uncertainty sampling, reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.
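The paper is described here only in prose, but the selection rule behind uncertainty sampling is compact enough to sketch. The Python below is an illustrative reconstruction rather than the authors' code: `train`, `predict_proba`, and `oracle_label` are placeholder callables standing in for the statistical classifier and the human annotator, and the loop simply keeps spending annotation effort on the documents whose predicted class probability is closest to 0.5.

```python
import random

def uncertainty_sampling(pool, oracle_label, train, predict_proba,
                         seed_size=10, batch_size=4, rounds=20):
    """Generic uncertainty-sampling loop (illustrative sketch).

    pool          -- list of unlabeled documents
    oracle_label  -- callable standing in for the human annotator
    train         -- callable: list of (doc, label) pairs -> model
    predict_proba -- callable: (model, doc) -> P(relevant | doc)
    """
    pool = list(pool)
    random.shuffle(pool)

    # Start from a small randomly labeled seed set.
    labeled = [(d, oracle_label(d)) for d in pool[:seed_size]]
    unlabeled = pool[seed_size:]

    for _ in range(rounds):
        model = train(labeled)
        # Rank the pool by uncertainty: probability closest to 0.5 first.
        unlabeled.sort(key=lambda d: abs(predict_proba(model, d) - 0.5))
        # Only the most uncertain documents are sent to the annotator.
        queried, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend((d, oracle_label(d)) for d in queried)

    return train(labeled)
```

The savings reported in the abstract come from this concentration of labeling effort: each round is spent on the documents the current classifier finds hardest, rather than on a random sample.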


Citations
Journal Article

Machine learning in automated text categorization

TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and examines in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.

Active Learning Literature Survey

Burr Settles
TL;DR: This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date.
Proceedings Article

A comparison of event models for naive bayes text classification

TL;DR: It is found that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
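To make the distinction in that summary concrete, here is a rough sketch of how a document is scored under the two event models, with add-one smoothing and class priors omitted; the vocabulary and counts are invented for the example and do not come from the cited paper.

```python
import math
from collections import Counter

def multinomial_log_score(doc_tokens, class_token_counts, vocab):
    """Multinomial model: each token occurrence is a draw from the
    class's word distribution (add-one smoothing, prior omitted)."""
    total = sum(class_token_counts.values())
    return sum(
        math.log((class_token_counts.get(w, 0) + 1) / (total + len(vocab)))
        for w in doc_tokens)

def bernoulli_log_score(doc_tokens, class_doc_freq, n_class_docs, vocab):
    """Multi-variate Bernoulli model: every vocabulary word is a binary
    present/absent event, so absent words also contribute to the score."""
    present = set(doc_tokens)
    score = 0.0
    for w in vocab:
        p = (class_doc_freq.get(w, 0) + 1) / (n_class_docs + 2)
        score += math.log(p if w in present else 1.0 - p)
    return score

# Invented toy counts for a hypothetical "sports" class.
vocab = {"ball", "team", "vote", "election"}
doc = ["team", "team", "ball"]
print(multinomial_log_score(doc, Counter({"team": 30, "ball": 20}), vocab))
print(bernoulli_log_score(doc, {"team": 8, "ball": 6, "vote": 1}, 10, vocab))
```

Note how the multinomial score depends on how often each token appears, while the Bernoulli score iterates over the whole vocabulary, which is one reason their behavior diverges as vocabulary size grows.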
Journal Article

Support vector machine active learning with applications to text classification

TL;DR: Presents experimental results showing that employing the active learning method can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
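The selection heuristic usually associated with that line of work queries the unlabeled documents lying closest to the current SVM decision boundary. A minimal sketch under that assumption, where `decision_value` is a placeholder for a trained model's signed decision function:

```python
def closest_to_margin(unlabeled_docs, decision_value, k=5):
    """Pick the k unlabeled documents nearest the SVM hyperplane.

    decision_value(doc) is assumed to return the signed score f(x) of a
    trained SVM; |f(x)| near zero marks the least certain documents.
    """
    return sorted(unlabeled_docs, key=lambda d: abs(decision_value(d)))[:k]
```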
Journal Article

Text Classification from Labeled and Unlabeled Documents using EM

TL;DR: This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents, and presents two extensions to the basic EM algorithm that improve classification accuracy under these conditions.
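A generic version of the labeled-plus-unlabeled EM loop that summary refers to can be sketched as follows; `train_weighted` and `predict_proba` are placeholder callables, and the fractional (soft) labels in the E-step are one common way to use the unlabeled pool, not necessarily the cited paper's exact formulation.

```python
def em_with_unlabeled(labeled, unlabeled, train_weighted, predict_proba,
                      iterations=10):
    """Semi-supervised EM sketch for a binary text classifier.

    labeled        -- list of (doc, label) pairs with label in {0, 1}
    unlabeled      -- list of docs without labels
    train_weighted -- callable: list of (doc, label, weight) -> model
    predict_proba  -- callable: (model, doc) -> P(label = 1 | doc)
    """
    hard = [(d, y, 1.0) for d, y in labeled]
    model = train_weighted(hard)          # initial model: labeled data only

    for _ in range(iterations):
        # E-step: split each unlabeled document fractionally between classes.
        soft = []
        for d in unlabeled:
            p = predict_proba(model, d)
            soft.append((d, 1, p))
            soft.append((d, 0, 1.0 - p))
        # M-step: retrain on the labeled plus fractionally labeled documents.
        model = train_weighted(hard + soft)

    return model
```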
References
Book

Generalized Linear Models

TL;DR: In this paper, a generalization of the analysis of variance is given for these models using log-likelihoods, illustrated by examples relating to four distributions: the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables), and gamma (variance components).
Book

Pattern classification and scene analysis

TL;DR: In this article, a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition is provided, including Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.
Journal Article

Queries and Concept Learning

TL;DR: This work considers the problem of using queries to learn an unknown concept, and several types of queries are described and studied: membership, equivalence, subset, superset, disjointness, and exhaustiveness queries.
Proceedings Article

Query by committee

TL;DR: It is suggested that asymptotically finite information gain may be an important characteristic of good query algorithms, studied here in a setting in which a committee of students is trained on the same data set.
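A small sketch of the committee idea follows, using bootstrap resampling to build the committee and vote entropy as the disagreement measure; both choices are illustrative assumptions rather than details taken from the cited paper.

```python
import math
import random
from collections import Counter

def vote_entropy(votes):
    """Disagreement measure: entropy of the committee's label votes."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def query_by_committee(labeled, unlabeled, train, predict,
                       committee_size=5, k=4):
    """Query-by-committee sketch.

    train   -- callable: list of (doc, label) pairs -> model
    predict -- callable: (model, doc) -> predicted label
    """
    # Build the committee from bootstrap resamples of the labeled data.
    committee = [train(random.choices(labeled, k=len(labeled)))
                 for _ in range(committee_size)]
    # Score each unlabeled document by how much the members disagree.
    scored = [(vote_entropy([predict(m, d) for m in committee]), d)
              for d in unlabeled]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]
```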
Journal Article

Improving Generalization with Active Learning

TL;DR: A formalism for active concept learning called selective sampling is described and it is shown how it may be approximately implemented by a neural network.