Journal ArticleDOI

Transforming paper documents into XML format with WISDOM++

TL;DR: The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats.
Abstract: The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. Applying OCR to some parts of the document image is only one of them. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.
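
To illustrate the final step named in the abstract, the sketch below shows one way labelled layout blocks could be serialized to XML. It is a minimal sketch and not the WISDOM++ implementation: the `Block` fields, the element and attribute names, the `doc_class` parameter, and the simple top-to-bottom reading order are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Block:
    """A layout block after classification and OCR (hypothetical structure)."""
    label: str   # logical label, e.g. "title" or "paragraph" (assumed names)
    x: int       # left edge of the bounding box, in pixels
    y: int       # top edge of the bounding box, in pixels
    width: int
    height: int
    text: str    # OCR output for this block


def blocks_to_xml(blocks, doc_class="article"):
    """Serialize labelled blocks into a flat XML document.

    Assumption: reading order is approximated by sorting blocks
    top to bottom, then left to right.
    """
    root = ET.Element("document", {"class": doc_class})
    for b in sorted(blocks, key=lambda b: (b.y, b.x)):
        geometry = {
            "x": str(b.x),
            "y": str(b.y),
            "width": str(b.width),
            "height": str(b.height),
        }
        element = ET.SubElement(root, b.label, geometry)
        element.text = b.text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = [
        Block("paragraph", 10, 100, 500, 300, "Body text."),
        Block("title", 10, 10, 500, 40, "A Title"),
    ]
    print(blocks_to_xml(page))
```

Keeping the block geometry as attributes means the layout information survives the conversion, so a stylesheet or a later processing step can still use it.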


Citations
Journal ArticleDOI
TL;DR: In this article, the authors present a document analysis system that assigns logical labels and extracts the reading order in a broad set of documents, using all information sources, from geometric features and spatial relations to textual features and content.
Abstract: We present a document analysis system able to assign logical labels and extract the reading order in a broad set of documents. All information sources, from geometric features and spatial relations to the textual features and content, are employed in the analysis. To deal effectively with these information sources, we define a document representation general and flexible enough to represent complex documents. To handle such a broad document class, it uses generic document knowledge only, which is identified explicitly. The proposed system integrates components based on computer vision, artificial intelligence, and natural language processing techniques. The system is fully implemented and experimental results on heterogeneous collections of documents for each component and for the entire system are presented.

140 citations

Patent
Boris Chidlovskii1
08 Nov 2005
TL;DR: A method for converting a legacy document (10) into an XML document (90) that decomposes the conversion process into a plurality of individual conversion tasks.
Abstract: A method for converting a legacy document (10) into an XML document (90), includes decomposing the conversion process into a plurality of individual conversion tasks. A legacy document (10) is decomposed (40) into a plurality of document portions. A target XML schema (20) including a plurality of schema components is provided. Local schema are generated from the target XML schema, wherein each local schema includes at least one of the schema components in the target XML schema. A plurality of conversion tasks (60) is generated by associating a local schema and an applicable document portion, wherein each conversion task associates data from the applicable document portion with the applicable schema component in the local schema. For each conversion task, a conversion method is selected and the conversion method is performed on the applicable document portion and local schema. Finally, the results of all the individual conversion tasks are assembled into a target XML document.

95 citations

Patent
07 Jun 2009
TL;DR: A method identifies a default set of document reconstruction operations for turning an unstructured document into a structured document, performs at least one of them, identifies a profile for the document from the results, and modifies the set of operations according to that profile.
Abstract: Some embodiments provide a method that receives an unstructured document including a number of primitive elements. The method identifies a default set of document reconstruction operations for reconstructing the unstructured document to define a structured document. The method performs at least one of the document reconstruction operations from the default set. Based on results of the performed document reconstruction operations, the method identifies a profile for the unstructured document. The method modifies the set of document reconstruction operations for reconstructing the unstructured document according to the identified profile.

87 citations

Proceedings ArticleDOI
23 Jan 2004
TL;DR: This work proposes WISDOM++, a document processing system that relies heavily on machine learning techniques, for automatically transforming printed documents into a Web-accessible form such as XML, and reports promising results obtained in preliminary experiments.
Abstract: One of the aims of the EU project COLLATE is to design and implement a Web-based collaboratory for archives, scientists and end-users working with digitized cultural material. Since the originals of such material are often unique and scattered across various archives, making them widely accessible poses severe problems. A solution would be to develop intelligent document processing tools that automatically transform printed documents into a Web-accessible form such as XML. Here, we propose the use of a document processing system, WISDOM++, which makes heavy use of machine learning techniques to perform this task, and report promising results obtained in preliminary experiments.

84 citations


Cites background from "Transforming paper documents into X..."

  • ...A straightforward application of OCR technology produces poor results because of the variability of the layout structure of printed documents....


Patent
03 Oct 2003
TL;DR: A system and method of processing an image that receives a definition of at least one region in the image, where the region definition has a location specification and a type specification.
Abstract: A system and method of processing an image comprises receiving a definition of at least one region in the image, where the region definition has a location specification and a type specification. The method further comprises displaying the boundaries of the at least one defined region according to its type specification, receiving a definition of a visible area in the image, the visible area definition having a specification of margins around the image, generating an image layout definition comprising the region definition and the visible area definition, and saving the image layout definition. The image layout definition may also be used as a template to conform image documents to a specified layout.

75 citations

References
Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations


Additional excerpts

  • ...5 by Quinlan (1993). It is a...


01 Jan 1994
TL;DR: In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.
Abstract: Algorithms for constructing decision trees are among the most well known and widely used of all machine learning methods. Among decision tree algorithms, J. Ross Quinlan's ID3 and its successor, C4.5, are probably the most popular in the machine learning community. These algorithms and variations on them have been the subject of numerous research papers since Quinlan introduced ID3. Until recently, most researchers looking for an introduction to decision trees turned to Quinlan's seminal 1986 Machine Learning journal article [Quinlan, 1986]. In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments. As such, this book will be a welcome addition to the library of many researchers and students.

8,046 citations

Book
21 May 1993
TL;DR: The book covers ensuring usability in human-computer interaction, including iterative, evaluation-centered user interaction development and techniques for representing user interaction designs.
Abstract: Ensuring Usability in Human-Computer Interaction. THE PRODUCT. User Interaction Design Guidance: Standards, Guidelines, and Style Guides. Interaction Styles. THE PROCESS. Iterative, Evaluation-Centered User Interaction Development. An Overview of Systems Analysis and Design. Techniques for Representing User Interaction Designs. More on Using the User Action Notation. Usability Specification Techniques. Rapid Prototyping of Interaction Design. Formative Evaluation. User Interface Development Tools. Making It Work: Ensuring Usability in Your Development Environment. Index.

861 citations


Additional excerpts

  • ...The behavioral design of the interface partially follows the sequential interaction style, in which the user action is controlled by the system itself (Hix and Hartson 1993), and the glass-box model, in which some system mechanisms are revealed to the user (Wenger 1988)....


Journal ArticleDOI
TL;DR: The requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing, are outlined; several critical functions are investigated and the technical approaches are discussed.
Abstract: This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing. Several critical functions have been investigated and the technical approaches are discussed. The first is the segmentation and classification of digitized printed documents into regions of text and images. A nonlinear, run-length smoothing algorithm has been used for this purpose. By using the regular features of text lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A preclassifier is constructed during the input process and used to speed up a well-known pattern-matching method for clustering characters from an arbitrary print source into a small sample of prototypes. Experimental results are included.
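
The run-length smoothing step mentioned in the abstract can be sketched as follows. This is a minimal sketch of one common formulation, not the authors' code: it assumes a binary image stored as lists of 0 (white) and 1 (black), fills only white runs that lie between two black pixels, and combines the horizontal and vertical passes with a logical AND. The function names and the thresholds `c_h` and `c_v` are hypothetical.

```python
def smooth_row(row, c):
    """Fill runs of 0s of length <= c that lie between two 1s."""
    out = list(row)
    last_black = None  # index of the most recent black pixel seen
    for i, value in enumerate(row):
        if value == 1:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= c:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def rlsa(image, c_h, c_v):
    """Smooth rows with threshold c_h and columns with c_v, then AND them."""
    horizontal = [smooth_row(row, c_h) for row in image]
    # zip(*image) iterates over columns; transpose back after smoothing.
    smoothed_columns = [smooth_row(col, c_v) for col in zip(*image)]
    vertical = [list(row) for row in zip(*smoothed_columns)]
    return [
        [h & v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]


if __name__ == "__main__":
    line = [1, 0, 0, 1, 0, 0, 0, 1]
    print(smooth_row(line, 2))
```

With a threshold of 2, the two-pixel gap in `line` is filled while the three-pixel gap is left open, which is how short gaps between characters merge into blocks while wider gaps between regions survive.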

718 citations

Journal ArticleDOI
17 May 1999
TL;DR: This work presents a query language for XML, called XML-QL, which is argued to be suitable for data extraction, conversion, transformation, and integration, and which can extract data from existing XML documents and construct new XML documents.
Abstract: An important application of XML is the interchange of electronic data (EDI) between multiple data sources on the Web. As XML data proliferates on the Web, applications will need to integrate and aggregate data from multiple sources and clean and transform data to facilitate exchange. Data extraction, conversion, transformation, and integration are all well-understood database problems, and their solutions rely on a query language. We present a query language for XML, called XML-QL, which we argue is suitable for performing the above tasks. XML-QL is a declarative, 'relational complete' query language and is simple enough that it can be optimized. XML-QL can extract data from existing XML documents and construct new XML documents.

649 citations


"Transforming paper documents into X..." refers background in this paper

  • ...…of XML, optional but powerful, is the concept of DTD (Document Type Definition), which specifies the logical hierarchy of documents and can make information retrieval on the Web easier (e.g., XML-QL is a language designed to express database-style queries in XML documents (Deutsch et al. 1999))....
