Groundtruth generation and document image degradation

Gang Zi
TLDR
A system that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The resulting images are used for training and evaluating Optical Character Recognition (OCR) systems.
Abstract
The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed in recent years. With the increased interest in processing multilingual sources, however, there is a pressing need to rapidly generate data in new languages and scripts without developing specialized systems. We have developed a system that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The metafile information is parsed to generate zone, line, word, and character ground truth, including location, font information, and content, in any language supported by Windows. The resulting images can be physically or synthetically degraded by our degradation modules and used for training and evaluating Optical Character Recognition (OCR) systems. Our document image degradation methodology incorporates several often-encountered types of noise at the page and pixel levels. Examples of OCR evaluation and synthetically degraded document images are given to demonstrate the effectiveness of the approach.
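To make the idea of pixel-level degradation concrete, the following is a minimal sketch, not the report's actual degradation module. It assumes a binary page image stored as a list of rows of 0/1 values and applies independent pixel flips (salt-and-pepper noise), which is only one simple instance of the pixel-level noise the abstract mentions. The names `flip_pixels`, `pixel_error_rate`, and `flip_prob` are hypothetical and chosen for illustration.

```python
import random


def flip_pixels(image, flip_prob, seed=None):
    """Return a degraded copy of a binary image.

    Assumption: `image` is a list of rows, each a list of 0/1 pixels.
    Every pixel is flipped independently with probability `flip_prob`.
    This is a generic salt-and-pepper model, not the report's own model.
    """
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must be in [0, 1]")
    rng = random.Random(seed)  # a seeded generator keeps runs reproducible
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in image
    ]


def pixel_error_rate(clean, degraded):
    """Fraction of pixels that differ between two same-sized images."""
    total = sum(len(row) for row in clean)
    differing = sum(
        a != b
        for clean_row, degraded_row in zip(clean, degraded)
        for a, b in zip(clean_row, degraded_row)
    )
    return differing / total if total else 0.0


if __name__ == "__main__":
    page = [[0, 1, 0], [1, 1, 0]]
    noisy = flip_pixels(page, flip_prob=0.1, seed=42)
    print(pixel_error_rate(page, noisy))
```

Because the clean image and its ground truth are left untouched, the same ground truth can be paired with any number of degraded copies when training or scoring an OCR system.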

