Open Access Journal Article

Transforming paper documents into XML format with WISDOM++

Abstract
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. Applying OCR to some parts of the document image is only one of them. In fact, generating documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. Adopting an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmark of the system components implementing these innovative aspects is reported.
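To make the last step concrete, here is a minimal Python sketch of how layout information might be carried into an XML document. It is not the WISDOM++ implementation or its output format: the `Block` class, the `blocks_to_xml` function, and the `document`/`block` element and attribute names are all hypothetical, and the top-to-bottom, left-to-right reading order is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Sequence
import xml.etree.ElementTree as ET


@dataclass
class Block:
    """Hypothetical layout block produced by the earlier pipeline steps."""
    label: str   # logical label from document understanding, e.g. "title"
    x: int       # left edge of the bounding box, in pixels
    y: int       # top edge of the bounding box, in pixels
    width: int
    height: int
    text: str    # OCR output for the block (empty for non-text blocks)


def blocks_to_xml(doc_class: str, blocks: Sequence[Block]) -> str:
    """Serialize labeled blocks as XML, keeping their layout coordinates.

    Assumption: reading order is approximated by sorting blocks
    top-to-bottom, then left-to-right.
    """
    root = ET.Element("document", {"class": doc_class})
    for b in sorted(blocks, key=lambda b: (b.y, b.x)):
        attrs = {
            "label": b.label,
            "x": str(b.x),
            "y": str(b.y),
            "width": str(b.width),
            "height": str(b.height),
        }
        element = ET.SubElement(root, "block", attrs)
        element.text = b.text  # ElementTree escapes &, <, > on output
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = [
        Block("abstract", 40, 300, 500, 200, "The transformation of scanned ..."),
        Block("title", 40, 50, 500, 60, "Transforming paper documents"),
    ]
    print(blocks_to_xml("scientific_paper", page))
```

Keeping the logical label and the bounding box as attributes means a later query can select blocks by role (for example, all titles) without re-running the analysis, which is the kind of retrieval benefit the abstract attributes to XML.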

