A survey of document image classification: problem statement, classifier architecture and performance evaluation

doi:10.1007/S10032-006-0020-2

Journal Article

Nawei Chen, +1 more

22 May 2007

International Journal on Document Analys...

Vol. 10, Iss. 1, pp. 1-16

Abstract:

Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.
