Journal Article

A survey of document image classification: problem statement, classifier architecture and performance evaluation

TLDR
This work focuses on techniques that classify single-page typeset document images without using OCR results, and brings to light important issues in designing a document classifier, including the definition of document classes, the choices of document features and feature representation, and the choice of classification algorithm and learning mechanism.
Abstract
Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.
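The three components named in the abstract can be made concrete with a toy example. The sketch below is not taken from the survey: the zone-density feature, the nearest-centroid classifier, the two class names ("letter", "form"), and the 4x4 binary pages are all illustrative assumptions, chosen only to show an OCR-free classifier whose problem statement, architecture, and evaluation are each visible in a few lines.

```python
from collections import defaultdict
from math import dist  # Python 3.8+


def zone_densities(image, rows=2, cols=2):
    """Feature representation: fraction of ink pixels (value 1) in each grid zone.

    Assumes a non-empty binary image at least `rows` high and `cols` wide.
    """
    height, width = len(image), len(image[0])
    features = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            ink = sum(image[y][x] for y in range(top, bottom) for x in range(left, right))
            features.append(ink / ((bottom - top) * (right - left)))
    return features


def train_centroids(images, labels):
    """Learning mechanism: one mean feature vector (class model) per class."""
    grouped = defaultdict(list)
    for image, label in zip(images, labels):
        grouped[label].append(zone_densities(image))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(image, centroids):
    """Classification algorithm: pick the class with the nearest centroid."""
    features = zone_densities(image)
    return min(centroids, key=lambda label: dist(features, centroids[label]))


def accuracy(images, labels, centroids):
    """Performance evaluation: fraction of pages assigned their true class."""
    correct = sum(classify(img, centroids) == lab for img, lab in zip(images, labels))
    return correct / len(labels)


# Hypothetical problem statement: "letter" pages carry ink in the top half,
# "form" pages in the bottom half. Each page is a 4x4 binary image.
TRAIN_IMAGES = [
    [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]],
    [[1, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]],
    [[0, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]],
]
TRAIN_LABELS = ["letter", "letter", "form", "form"]
TEST_IMAGES = [
    [[1, 1, 0, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 1, 0]],
    [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 0, 1, 1]],
]
TEST_LABELS = ["letter", "form"]

CENTROIDS = train_centroids(TRAIN_IMAGES, TRAIN_LABELS)

if __name__ == "__main__":
    for page, truth in zip(TEST_IMAGES, TEST_LABELS):
        print(truth, "->", classify(page, CENTROIDS))
    print("accuracy:", accuracy(TEST_IMAGES, TEST_LABELS, CENTROIDS))
```

Each design issue the abstract lists corresponds to one replaceable piece here: the class definitions live in the labels, the feature choice in `zone_densities`, and the algorithm and learning mechanism in `classify` and `train_centroids`. Real systems would face the harder cases the abstract mentions, such as fuzzy class boundaries, which this toy setup deliberately avoids.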
