Groundtruth generation and document image degradation

doi:10.21236/ADA447997

TLDR

A system that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The resulting images are used for training and evaluating Optical Character Recognition (OCR) systems.

Abstract:

The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed in recent years. With the increased interest in processing multilingual sources, however, there is a tremendous need to rapidly generate data in new languages and scripts without developing specialized systems. We have developed a system that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The metafile information is parsed to generate zone, line, word, and character ground truth, including location, font information, and content, in any language supported by Windows. The resulting images can be physically or synthetically degraded by our degradation modules and used for training and evaluating Optical Character Recognition (OCR) systems. Our document image degradation methodology incorporates several often-encountered types of noise at the page and pixel levels. Examples of OCR evaluation and synthetically degraded document images are given to demonstrate the effectiveness of the approach.
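
To make the idea of pixel-level synthetic degradation concrete, the following is a minimal sketch, not the report's actual degradation module. It assumes a binary image stored as a list of rows (1 = ink, 0 = background) and applies independent pixel flips, one simple form of pixel-level noise. The function name `flip_pixels` and the parameters `flip_prob` and `seed` are hypothetical names chosen for illustration.

```python
import random


def flip_pixels(image, flip_prob, seed=0):
    """Return a degraded copy of a binary image.

    Each pixel is flipped (0 <-> 1) independently with probability
    flip_prob. This is an assumed, simplified noise model, not the
    one described in the report.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in image
    ]


# A tiny 5x5 binary "glyph": 1 = ink, 0 = background.
clean = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]

noisy = flip_pixels(clean, flip_prob=0.1)

# Count how many pixels differ between the clean and degraded images.
changed = sum(
    a != b
    for clean_row, noisy_row in zip(clean, noisy)
    for a, b in zip(clean_row, noisy_row)
)
print(f"{changed} of 25 pixels flipped")
```

Because the degraded image keeps the same dimensions as the clean one, any character-level ground truth (bounding boxes and content) generated for the clean image would remain valid for the degraded copy under this kind of noise. Page-level distortions such as skew would require transforming the ground-truth coordinates as well.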
