Journal Article
A survey of document image classification: problem statement, classifier architecture and performance evaluation
Nawei Chen, Dorothea Blostein +1 more
TL;DR: This work focuses on techniques that classify single-page typeset document images without using OCR results, and brings to light important issues in designing a document classifier, including the definition of document classes, the choices of document features and feature representation, and the choice of classification algorithm and learning mechanism.
Abstract:
Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.
Citations
Proceedings Article
Random graphs
TL;DR: Some of the major results in random graphs and some of the more challenging open problems are reviewed, including those related to the WWW.
Proceedings Article
Evaluation of deep convolutional nets for document image classification and retrieval
TL;DR: In this paper, a new state-of-the-art for document image classification and retrieval, using features learned by deep convolutional neural networks (CNNs), is presented.
Proceedings Article
Convolutional Neural Networks for Document Image Classification
TL;DR: Equipped with rectified linear units and trained with dropout, this CNN performs well even when document layouts present large inner-class variations, and experiments on public challenging datasets demonstrate the effectiveness of the proposed approach.
Journal Article
Integrating image data into biomedical text categorization
TL;DR: A method is presented for obtaining features from images and for using them, both alone and in combination with text, to perform the triage task introduced in the TREC Genomics track 2004.
Patent
Automatic suggestions and other content for messaging applications
Gershony Ori,Sergey Nazarov,De Castro Rodrigo,Erika Palmer,Daniel Ramage,Adam Rodriguez,Andrei Pascovici +6 more
TL;DR: A messaging application may automatically analyze content of one or more messages and/or user information to automatically provide suggestions to a user within a messaging application, such as a reply to a message as discussed by the authors.