Transforming paper documents into XML format with WISDOM++

doi:10.1007/PL00013569

O. Altamura, +2 more

- 01 Aug 2001 -

International Journal on Document Analys...

- Vol. 4, Iss: 1, pp 2-17

Abstract:

The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. Applying OCR to some parts of the document image is only one of them. In fact, generating documents in HTML format is easier once the layout structure of a page has been extracted by means of a document analysis process. Adopting an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmark of the system components implementing these innovative aspects is reported.
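
The five steps named in the abstract can be pictured as a simple sequential pipeline. The following Python sketch is only an illustration of that ordering, not the WISDOM++ implementation: every function name, the dictionary-based page record, and the stub outputs (fixed block names, the "unknown" class, the "body" label) are hypothetical placeholders introduced here.

```python
from typing import Any, Callable, Dict, List
from xml.sax.saxutils import escape

# Hypothetical page record passed from stage to stage (an assumption, not WISDOM++'s data model).
Page = Dict[str, Any]


def analyze(page: Page) -> Page:
    """Step 1, document analysis: a stub that pretends two layout blocks were found."""
    page["blocks"] = ["block-0", "block-1"]
    return page


def classify(page: Page) -> Page:
    """Step 2, document classification: a stub that assigns a placeholder class."""
    page["doc_class"] = "unknown"
    return page


def understand(page: Page) -> Page:
    """Step 3, document understanding: a stub that gives every block a logical label."""
    page["labels"] = {block: "body" for block in page["blocks"]}
    return page


def recognize(page: Page) -> Page:
    """Step 4, text recognition: a stub standing in for an OCR call on each block."""
    page["text"] = {block: f"text of {block}" for block in page["blocks"]}
    return page


def to_xml(page: Page) -> Page:
    """Step 5, transformation: emit one XML element per block, named by its label."""
    parts = []
    for block in page["blocks"]:
        tag = page["labels"][block]
        parts.append(f"<{tag}>{escape(page['text'][block])}</{tag}>")
    doc_class = page["doc_class"]
    page["xml"] = f'<document class="{doc_class}">' + "".join(parts) + "</document>"
    return page


# The order mirrors the five steps listed in the abstract.
STAGES: List[Callable[[Page], Page]] = [analyze, classify, understand, recognize, to_xml]


def run_pipeline(image_path: str) -> Page:
    """Run every stage in order on a fresh page record."""
    page: Page = {"image": image_path}
    for stage in STAGES:
        page = stage(page)
    return page
```

Calling `run_pipeline("page.png")` returns the page record with an `xml` entry built from the stub values. The point of the sketch is that the final conversion step consumes the layout blocks, class, and labels produced by the earlier steps, which is why the abstract treats classification and understanding as prerequisites for XML output.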
