Transforming paper documents into XML format with WISDOM++
TLDR
The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats.
Abstract:
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.
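As an illustration of the last step named in the abstract, the following is a minimal sketch of how layout blocks that already carry logical labels and recognized text might be serialized to XML. It is not the WISDOM++ implementation: the `Block` type, the `blocks_to_xml` function, the element and attribute names, and the top-to-bottom reading-order heuristic are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical layout block produced by earlier analysis steps."""
    label: str    # logical label, e.g. "title" or "abstract" (assumed names)
    bbox: tuple   # (x0, y0, x1, y1) in pixels
    text: str     # text recognized for this block


def blocks_to_xml(doc_class, blocks):
    """Serialize labeled blocks to an XML string.

    Reading order is approximated by sorting on the top edge, then the
    left edge, of each bounding box (an assumption for this sketch).
    """
    root = ET.Element("document", {"class": doc_class})
    for b in sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0])):
        el = ET.SubElement(root, b.label, {"bbox": ",".join(map(str, b.bbox))})
        el.text = b.text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = [
        Block("abstract", (10, 100, 500, 300), "Some abstract text"),
        Block("title", (10, 10, 500, 60), "A Title"),
    ]
    print(blocks_to_xml("scientific_paper", page))
```

Using the logical label as the element name keeps the output queryable by structure (for example, retrieving only titles), which is the retrieval benefit the abstract attributes to XML over HTML.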
Citations
Journal Article
Document understanding for a broad class of documents
TL;DR: In this article, the authors present a document analysis system able to assign logical labels and extract the reading order in a broad set of documents; the analysis employs information ranging from geometric features and spatial relations to textual features and content.
Patent
System and method for transforming legacy documents into XML documents
TL;DR: In this paper, a method for converting a legacy document (10) into an XML document (90), including decomposing the conversion process into a plurality of individual conversion tasks, is presented.
Patent
Content Profiling to Dynamically Configure Content Processing
TL;DR: In this article, the authors identify a default set of document reconstruction operations for reconstructing the unstructured document to define a structured document and then modify the set of reconstruction operations according to the identified profile.
Proceedings Article
Machine learning methods for automatically processing historical documents: from paper acquisition to XML transformation
Floriana Esposito, Donato Malerba, Giovanni Semeraro, Stefano Ferilli, O. Altamura, Teresa Maria Altomare Basile, Margherita Berardi, Michelangelo Ceci, N. Di Mauro
TL;DR: This work proposes the use of a document processing system, WISDOM++, which makes heavy use of machine learning techniques to perform such a task, and reports promising results obtained in preliminary experiments.
Patent
System and method of specifying image document layout definition
TL;DR: In this article, a system and method of processing an image comprises receiving a definition of at least one region in the image, where the region definition has a location specification and a type specification.
References
Proceedings Article
WISDOM++: an interactive and adaptive document analysis system
TL;DR: This paper presents the two-phased skew estimation algorithm and the adaptive document block segmentation and classification techniques for WISDOM++.
Journal Article
Content based internet access to paper documents
TL;DR: In this paper, different hypertext structures one encounters in a document are studied and methods for analyzing paper documents to find these structures are presented, and the structures also form the basis for the presentation of the content of the document to the user.
Proceedings Article
On benchmarking of document analysis systems
TL;DR: The proposed method describes how to build a database with specific ground truth for the output of each module of interest, especially of the image processing modules of a DAS.
Book Chapter
Handling Continuous Data in Top-Down Induction of First-Order Rules
TL;DR: A specialization operator that discretizes continuous data during the learning process is proposed and the heuristic function used to choose among different discretizations satisfies a property that can be profitably exploited to improve the efficiency of the specialization operator.
Proceedings Article
Local skew angle estimation from background space in text regions
TL;DR: A novel local skew estimation method is presented that takes advantage of the information available after flexible and efficient page segmentation and classification methods have been applied to the document image.