Open Access Journal Article

Transforming paper documents into XML format with WISDOM++

Abstract
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. Applying OCR to some parts of the document image is only one of them. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. Adopting an XML format is better still, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmark of the system components implementing these innovative aspects is reported.
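To make the last step more concrete, the following is a minimal sketch of how layout information could drive conversion to XML. It is not the WISDOM++ implementation: the `Block` structure, the `blocks_to_xml` function, the element names, and the `bbox` and `class` attributes are all assumptions introduced for illustration. The sketch assumes that the earlier steps have already produced labeled blocks with a reading order and recognized text.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical result of analysis, understanding and OCR for one layout block."""
    label: str    # logical label assigned during document understanding
    bbox: tuple   # (x0, y0, x1, y1) position of the block on the page, in pixels
    order: int    # position of the block in the reading order
    text: str     # text recognized by OCR inside the block


def blocks_to_xml(doc_class: str, blocks: list) -> str:
    """Serialize labeled blocks as XML, one element per block, in reading order."""
    # The document class from the classification step becomes a root attribute.
    root = ET.Element("document", {"class": doc_class})
    for block in sorted(blocks, key=lambda b: b.order):
        # The logical label becomes the element name; the geometry is kept as an attribute.
        bbox = ",".join(str(v) for v in block.bbox)
        elem = ET.SubElement(root, block.label, {"bbox": bbox})
        elem.text = block.text
    return ET.tostring(root, encoding="unicode")


# Illustrative input only; these blocks are made up for the example.
example = [
    Block("body", (40, 300, 560, 700), 2, "Scanned pages are segmented into blocks."),
    Block("title", (40, 40, 560, 90), 0, "Transforming paper documents"),
    Block("abstract", (40, 120, 560, 260), 1, "The system operates in five steps."),
]
xml_text = blocks_to_xml("scientific_paper", example)
print(xml_text)
```

Using the logical label as the element name is one possible design choice; it lets a consumer query the output by logical component (for example, all `title` elements), which is the kind of retrieval benefit the abstract attributes to XML.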


Citations
Journal Article

Document understanding for a broad class of documents

TL;DR: In this article, the authors present a document analysis system able to assign logical labels and extract the reading order in a broad set of documents. All information sources, from geometric features and spatial relations to textual features and content, are employed in the analysis.
Patent

System and method for transforming legacy documents into XML documents

TL;DR: In this paper, a method for converting a legacy document (10) into an XML document (90), including decomposing the conversion process into a plurality of individual conversion tasks, is presented.
Patent

Content Profiling to Dynamically Configure Content Processing

TL;DR: In this article, the authors identify a default set of document reconstruction operations for reconstructing the unstructured document to define a structured document and then modify the set of reconstruction operations according to the identified profile.
Proceedings Article

Machine learning methods for automatically processing historical documents: from paper acquisition to XML transformation

TL;DR: This work proposes the use of a document processing system, WISDOM++, which makes heavy use of machine learning techniques to perform such a task, and reports promising results obtained in preliminary experiments.
Patent

System and method of specifying image document layout definition

TL;DR: In this article, a system and method of processing an image comprises receiving a definition of at least one region in the image, where the region definition has a location specification and a type specification.
References
Proceedings Article

WISDOM++: an interactive and adaptive document analysis system

TL;DR: This paper presents the two-phased skew estimation algorithm and the adaptive document block segmentation and classification techniques for WISDOM++.
Journal Article

Content based internet access to paper documents

TL;DR: This paper studies the different hypertext structures one encounters in a document and presents methods for analyzing paper documents to find these structures. The structures also form the basis for presenting the content of the document to the user.
Proceedings Article

On benchmarking of document analysis systems

TL;DR: The proposed method describes how to build a database with specific ground truth for the output of each module of interest, especially of the image processing modules of a DAS.
Book Chapter

Handling Continuous Data in Top-Down Induction of First-Order Rules

TL;DR: A specialization operator that discretizes continuous data during the learning process is proposed and the heuristic function used to choose among different discretizations satisfies a property that can be profitably exploited to improve the efficiency of the specialization operator.
Proceedings Article

Local skew angle estimation from background space in text regions

TL;DR: A novel local skew estimation method is presented that takes advantage of the information available after flexible and efficient page segmentation and classification methods have been applied to the document image.