Proceedings Article

The hOCR Microformat for OCR Workflow and Results

23 Sep 2007 - Vol. 2, pp. 1063-1067
TL;DR: This paper describes hOCR, a new format for representing both intermediate and final OCR results, developed in response to the needs of a newly developed OCR system and ground truth data release. The format embeds OCR information invisibly inside standard HTML and CSS.
Abstract: Large scale scanning and document conversion efforts have led to a renewed interest in OCR systems and workflows. This paper describes a new format for representing both intermediate and final OCR results, developed in response to the needs of a newly developed OCR system and ground truth data release. The format embeds OCR information invisibly inside the HTML and CSS standards and therefore can represent a wide range of linguistic and typographic phenomena with already well-defined, widely understood markup and can be processed using widely available and known tools. The format is based on a new, multi-level abstraction of OCR results based on logical markup, common typesetting models, and OCR engine-specific markup, making it suitable both for the support of existing workflows and the development of future model-based OCR engines.
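
As a rough illustration of what "embedding OCR information invisibly inside HTML" can look like, the sketch below reads word text and bounding boxes out of a small hOCR-style fragment using only the Python standard library. The class names (`ocr_page`, `ocr_line`, `ocrx_word`) and the `bbox` property in the `title` attribute follow the published hOCR specification rather than anything stated in this abstract; the sample text, the coordinates, and the helper names (`parse_bbox`, `HocrWordParser`, `extract_words`) are invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment: text and coordinates are made up for illustration.
SAMPLE_HOCR = """
<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 20 300 50">
    <span class="ocrx_word" title="bbox 10 20 120 50">Hello</span>
    <span class="ocrx_word" title="bbox 130 20 300 50">world</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title such as 'bbox 10 20 120 50; ...'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every element with class ocrx_word.

    Simplification: assumes no void elements (such as <br>) inside a word.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0      # nesting depth inside the current word, 0 if outside
        self._chunks = []
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        if self._depth:      # already inside a word: only track nesting
            self._depth += 1
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            self._bbox = parse_bbox(attr.get("title") or "")
            self._chunks = []
            self._depth = 1

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._chunks).strip(), self._bbox))


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

Because the OCR data rides along in ordinary `class` and `title` attributes, the same file still renders as plain text in a browser, which is the property the abstract describes.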
Citations
Proceedings Article
27 Jan 2008
TL;DR: The current status of the OCR system, its general architecture, as well as the major algorithms currently being used for layout analysis and text line recognition are described.
Abstract: OCRopus is a new, open source OCR system emphasizing modularity, easy extensibility, and reuse, aimed at both the research community and large scale commercial document conversions. This paper describes the current status of the system, its general architecture, as well as the major algorithms currently being used for layout analysis and text line recognition.

239 citations

Proceedings Article
23 Aug 2010
TL;DR: PAGE is described, a new XML-based page image representation framework that records information on image characteristics (image borders, geometric distortions and corresponding corrections, binarisation etc.) in addition to layout structure and page content.
Abstract: There is a plethora of established and proposed document representation formats but none that can adequately support individual stages within an entire sequence of document image analysis methods (from document image enhancement to layout analysis to OCR) and their evaluation. This paper describes PAGE, a new XML-based page image representation framework that records information on image characteristics (image borders, geometric distortions and corresponding corrections, binarisation etc.) in addition to layout structure and page content. The suitability of the framework to the evaluation of entire workflows as well as individual stages has been extensively validated by using it in high-profile applications such as in public contemporary and historical ground-truthed datasets and in the ICDAR Page Segmentation competition series.

145 citations

Proceedings Article
L. Vincent
23 Sep 2007
TL;DR: This paper goes over some of the ways that Google is working with the document analysis research community to help push the state of the art in Google Book Search.
Abstract: Unveiled in late 2004, Google Book Search is an ambitious program to make all the world's books discoverable online. The sheer scale of the problem brings a number of unique document analysis and understanding challenges that are outlined in this paper. We also go over some of the ways that Google is working with the document analysis research community to help push the state of the art.

84 citations


Cites methods from "The hOCR Microformat for OCR Workfl..."

  • ...As part of his work on OCRopus, Breuel also developed the very interesting hOCR microformat, designed to describe OCR workflow and results in a flexible and open manner [4]....


Proceedings Article
24 Aug 2017
TL;DR: CloudScan is an invoice analysis system that requires no templates or upfront annotation. Its recurrent neural network model, which can capture long range context, is compared to a baseline logistic regression model corresponding to the current CloudScan production system; the two achieve 0.891 and 0.887 average F1 scores respectively on seen invoice layouts.
Abstract: We present CloudScan; an invoice analysis system that requires zero configuration or upfront annotation. In contrast to previous work, CloudScan does not rely on templates of invoice layout, instead it learns a single global model of invoices that naturally generalizes to unseen invoice layouts. The model is trained using data automatically extracted from end-user provided feedback. This automatic training data extraction removes the requirement for users to annotate the data precisely. We describe a recurrent neural network model that can capture long range context and compare it to a baseline logistic regression model corresponding to the current CloudScan production system. We train and evaluate the system on 8 important fields using a dataset of 326,471 invoices. The recurrent neural network and baseline model achieve 0.891 and 0.887 average F1 scores respectively on seen invoice layouts. For the harder task of unseen invoice layouts, the recurrent neural network model outperforms the baseline with 0.840 average F1 compared to 0.788.

75 citations

Journal Article
TL;DR: OCR4all is an open-source OCR software that combines state-of-the-art OCR components and continuous model training into a comprehensive workflow for historical printings.
Abstract: Optical Character Recognition (OCR) on historical printings is a challenging task mainly due to the complexity of the layout and the highly variant typography. Nevertheless, in the last few years great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character recognition and post-processing. The drawback of these tools often is their limited applicability by non-technical users like humanist scholars and in particular the combined use of several tools in a workflow. In this paper we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. A comfortable GUI allows error corrections not only in the final output, but already in early stages to minimize error propagations. Further on, extensive configuration capabilities are provided to set the degree of automation of the workflow and to make adaptations to the carefully selected default parameters for specific printings, if necessary. Experiments showed that users with minimal or no experience were able to capture the text of even the earliest printed books with manageable effort and great quality, achieving excellent character error rates (CERs) below 0.5%. The fully automated application on 19th century novels showed that OCR4all can considerably outperform the commercial state-of-the-art tool ABBYY Finereader on moderate layouts if suitably pretrained mixed OCR models are available. The architecture of OCR4all allows the easy integration (or substitution) of newly developed tools for its main components by standardized interfaces like PageXML, thus aiming at continual higher automation for historical printings.
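
The abstract above reports results as character error rates (CERs) without defining the metric. The sketch below assumes the common definition, character-level edit distance divided by the length of the reference text; the exact normalization used by the OCR4all authors is not stated here, and the function names and sample strings are invented for the example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_char != hyp_char),   # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER under the assumed definition: edit distance / reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong character in a ten-character reference.
    print(character_error_rate("historical", "histor1cal"))
```

Under this definition, a CER below 0.5% corresponds to fewer than one character error per 200 reference characters.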

22 citations

References
Book
01 Jan 1993
TL;DR: This article presents bootstrap methods for estimation, using simple arguments, and gives Minitab macros for implementing them.
Abstract: This article presents bootstrap methods for estimation, using simple arguments. Minitab macros for implementing these methods are given.

37,183 citations

Journal Article
Gary E. Kopec, Philip A. Chou
TL;DR: Document image decoding treats document recognition as a communication theory problem. The approach is illustrated on scanned telephone yellow pages: a finite-state model of yellow page columns is constructed and decoded with a Viterbi-like dynamic programming algorithm to extract names and numbers from the listings.
Abstract: Document image decoding (DID) is a communication theory approach to document image recognition. In DID, a document recognition problem is viewed as consisting of three elements: an image generator, a noisy channel and an image decoder. A document image generator is a Markov source (stochastic finite-state automaton) that combines a message source with an imager. The message source produces a string of symbols, or text, that contains the information to be transmitted. The imager is modeled as a finite-state transducer that converts the 1D message string into an ideal 2D bitmap. The channel transforms the ideal image into a noisy observed image. The decoder estimates the message, given the observed image, by finding the a posteriori most probable path through the combined source and channel models using a Viterbi-like dynamic programming algorithm. The proposed approach is illustrated on the problem of decoding scanned telephone yellow pages to extract names and numbers from the listings. A finite-state model for yellow page columns was constructed and used to decode a database of scanned column images containing about 1100 individual listings.
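
To make the "most probable path" idea in the abstract above concrete, here is a minimal sketch of textbook Viterbi decoding on a one-dimensional hidden Markov model. It is not Kopec and Chou's decoder, which works over 2D image models; the two hidden symbols, the three observation classes, and all probabilities are invented for illustration, and every probability is assumed to be nonzero so that logarithms are defined.

```python
import math

# Toy model (all values invented): hidden message symbols "A" and "B",
# observed noisy glyph classes "x", "y" and "z".
STATES = ["A", "B"]
START = {"A": 0.6, "B": 0.4}
TRANS = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT = {"A": {"x": 0.5, "y": 0.4, "z": 0.1}, "B": {"x": 0.1, "y": 0.3, "z": 0.6}}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state to recover the whole path.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi(["x", "y", "z"], STATES, START, TRANS, EMIT))
```

Working in log-probabilities avoids numerical underflow on long sequences, which matters once the model covers a whole text line or column rather than three symbols.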

238 citations


"The hOCR Microformat for OCR Workfl..." refers methods in this paper

  • ...The combination of logical markup and typesetting markup permits us to use hOCR as an intermediate format for performing OCR as model-based reverse typesetting, an approach advocated, for example by Kopec and Chou [7]....


01 Apr 2008
TL;DR: This specification defines Cascading Style Sheets, level 2 (CSS2), a style sheet language that allows authors and users to attach style to structured documents and simplifies Web authoring and site maintenance.
Abstract: This specification defines Cascading Style Sheets, level 2 (CSS2). CSS2 is a style sheet language that allows authors and users to attach style (e.g., fonts, spacing, and aural cues) to structured documents (e.g., HTML documents and XML applications). By separating the presentation style of documents from the content of documents, CSS2 simplifies Web authoring and site maintenance. CSS2 builds on CSS1 (see [CSS1]) and, with very few exceptions, all valid CSS1 style sheets are valid CSS2 style sheets. CSS2 supports media-specific style sheets so that authors may tailor the presentation of their documents to visual browsers, aural devices, printers, braille devices, handheld devices, etc. This specification also supports content positioning, downloadable fonts, table layout, features for internationalization, automatic counters and numbering, and some properties related to user interface.

154 citations


"The hOCR Microformat for OCR Workfl..." refers methods in this paper

  • ...the basis format, together with CSS (cascading style sheets) [4, 8] for representing typographic markup, and to enhance this format by embedding additional information using facilities of standard HTML....


Journal Article
Berrin Yanikoglu, Luc Vincent
TL;DR: A new approach for the automatic evaluation of document page segmentation algorithms that is region-based: segmentation quality is assessed by comparing the segmentation output, described as a set of regions, to the corresponding ground-truth.

105 citations


"The hOCR Microformat for OCR Workfl..." refers background in this paper

  • ...We can distinguish three major classes of OCR output formats: logical formats, suitable for direct use of OCR results by end users (RTF, HTML, LaTeX, and Microsoft Word), OCR engine-specific formats [5, 9], and benchmarking formats [11, 10] proposed for benchmarking various aspects of OCR systems....
