Transforming paper documents into XML format with WISDOM

doi:10.1007/PL00013569

Journal Article•DOI•

O. Altamura, Floriana Esposito, Donato Malerba

01 Aug 2001-International Journal on Document Analysis and Recognition (Springer-Verlag)-Vol. 4, Iss: 1, pp 2-17

TL;DR: The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats.

Abstract: The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.

Citations

Journal Article•DOI•

Document understanding for a broad class of documents

Marco Aiello¹, Christof Monz¹, Leon Todoran¹, Marcel Worring¹•Institutions (1)

University of Amsterdam¹

01 Nov 2002-International Journal on Document Analysis and Recognition

TL;DR: The authors present a document analysis system that assigns logical labels and extracts the reading order in a broad set of documents, drawing on all information sources, from geometric features and spatial relations to textual features and content.

Abstract: We present a document analysis system able to assign logical labels and extract the reading order in a broad set of documents. All information sources, from geometric features and spatial relations to the textual features and content are employed in the analysis. To deal effectively with these information sources, we define a document representation general and flexible enough to represent complex documents. To handle such a broad document class, it uses generic document knowledge only, which is identified explicitly. The proposed system integrates components based on computer vision, artificial intelligence, and natural language processing techniques. The system is fully implemented and experimental results on heterogeneous collections of documents for each component and for the entire system are presented.

140 citations

Patent•

System and method for transforming legacy documents into XML documents

Boris Chidlovskii¹•Institutions (1)

Xerox¹

08 Nov 2005

TL;DR: A method for converting a legacy document (10) into an XML document (90) that decomposes the conversion process into a plurality of individual conversion tasks.

Abstract: A method for converting a legacy document (10) into an XML document (90), includes decomposing the conversion process into a plurality of individual conversion tasks. A legacy document (10) is decomposed (40) into a plurality of document portions. A target XML schema (20) including a plurality of schema components is provided. Local schema are generated from the target XML schema, wherein each local schema includes at least one of the schema components in the target XML schema. A plurality of conversion tasks (60) is generated by associating a local schema and an applicable document portion, wherein each conversion task associates data from the applicable document portion with the applicable schema component in the local schema. For each conversion task, a conversion method is selected and the conversion method is performed on the applicable document portion and local schema. Finally, the results of all the individual conversion tasks are assembled into a target XML document.

95 citations

Patent•

Content Profiling to Dynamically Configure Content Processing

Michael Robert Levy¹, Philip Andrew Mansfield¹•Institutions (1)

Apple Inc.¹

07 Jun 2009

TL;DR: A method identifies a default set of document reconstruction operations for reconstructing an unstructured document into a structured document, then modifies that set according to a profile identified for the document.

Abstract: Some embodiments provide a method that receives an unstructured document including a number of primitive elements. The method identifies a default set of document reconstruction operations for reconstructing the unstructured document to define a structured document. The method performs at least one of the document reconstruction operations from the default set. Based on results of the performed document reconstruction operations, the method identifies a profile for the unstructured document. The method modifies the set of document reconstruction operations for reconstructing the unstructured document according to the identified profile.

87 citations

Proceedings Article•DOI•

Machine learning methods for automatically processing historical documents: from paper acquisition to XML transformation

Floriana Esposito, Donato Malerba, Giovanni Semeraro, Stefano Ferilli, O. Altamura, Teresa Maria Altomare Basile, Margherita Berardi, Michelangelo Ceci, N. Di Mauro

23 Jan 2004

TL;DR: This work proposes using a document processing system, WISDOM++, which relies heavily on machine learning techniques, to automatically transform printed documents into a Web-accessible form such as XML, and reports promising results from preliminary experiments.

Abstract: One of the aims of the EU project COLLATE is to design and implement a Web-based collaboratory for archives, scientists and end-users working with digitized cultural material. Since the originals of such a material are often unique and scattered in various archives, severe problems arise for their wide fruition. A solution would be to develop intelligent document processing tools that automatically transform printed documents into a Web-accessible form such as XML. Here, we propose the use of a document processing system, WISDOM++, which uses heavily machine learning techniques in order to perform such a task, and report promising results obtained in preliminary experiments.

84 citations

Cites background from "Transforming paper documents into X..."

...A straightforward application of OCR technology produces poor results because of the variability of the layout structure of printed documents....

Patent•

System and method of specifying image document layout definition

Steven J. Simske¹, Malgorzata M. Sturgill¹•Institutions (1)

Hewlett-Packard¹

03 Oct 2003

TL;DR: A system and method of processing an image comprises receiving a definition of at least one region in the image, where the region definition has a location specification and a type specification.

Abstract: A system and method of processing an image comprises receiving a definition of at least one region in the image, where the region definition has a location specification and a type specification. The method further comprises displaying the boundaries of the at least one defined region according to its type specification, receiving a definition of a visible area in the image, the visible area definition having a specification of margins around the image, generating an image layout definition comprising the region definition and the visible area definition, and saving the image layout definition. The image layout definition may also be used as a template to conform image documents to a specified layout.

75 citations

References

Book•

C4.5: Programs for Machine Learning

J. Ross Quinlan¹•Institutions (1)

University of Sydney¹

15 Oct 1992

TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.

Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations

Additional excerpts

...5 by Quinlan (1993). It is a...

Programs for Machine Learning

Steven L. Salzberg¹, Alberto Segre•Institutions (1)

Johns Hopkins University¹

01 Jan 1994

TL;DR: In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.

Abstract: Algorithms for constructing decision trees are among the most well known and widely used of all machine learning methods. Among decision tree algorithms, J. Ross Quinlan's ID3 and its successor, C4.5, are probably the most popular in the machine learning community. These algorithms and variations on them have been the subject of numerous research papers since Quinlan introduced ID3. Until recently, most researchers looking for an introduction to decision trees turned to Quinlan's seminal 1986 Machine Learning journal article [Quinlan, 1986]. In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments. As such, this book will be a welcome addition to the library of many researchers and students.

8,046 citations

Book•

Developing user interfaces: ensuring usability through product & process

Deborah Hix¹, H. Rex Hartson¹•Institutions (1)

Virginia Tech¹

21 May 1993

TL;DR: The book covers ensuring usability in human-computer interaction, iterative and evaluation-centered user interaction development, and techniques for representing user interaction designs.

Abstract: Ensuring Usability in Human-Computer Interaction. THE PRODUCT. User Interaction Design Guidance: Standards, Guidelines, and Style Guides. Interaction Styles. THE PROCESS. Iterative, Evaluation-Centered User Interaction Development. An Overview of Systems Analysis and Design. Techniques for Representing User Interaction Designs. More on Using the User Action Notation. Usability Specification Techniques. Rapid Prototyping of Interaction Design. Formative Evaluation. User Interface Development Tools. Making It Work: Ensuring Usability in Your Development Environment. Index.

861 citations

Additional excerpts

...The behavioral design of the interface partially follows the sequential interaction style, in which the user action is controlled by the system itself (Hix and Hartson 1993), and the glass-box model, in which some system mechanisms are revealed to the user (Wenger 1988)....

Journal Article•DOI•

Document analysis system

Kwan Y. Wong¹, Richard G. Casey¹, Friedrich M. Wahl•Institutions (1)

IBM¹

01 Nov 1982-IBM Journal of Research and Development

TL;DR: The paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing, and discusses the technical approaches to several critical functions that have been investigated.

Abstract: This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing. Several critical functions have been investigated and the technical approaches are discussed. The first is the segmentation and classification of digitized printed documents into regions of text and images. A nonlinear, run-length smoothing algorithm has been used for this purpose. By using the regular features of text lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A preclassifier is constructed during the input process and used to speed up a well-known pattern-matching method for clustering characters from an arbitrary print source into a small sample of prototypes. Experimental results are included.
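
The run-length smoothing idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a binary image stored as nested lists with 1 for black and 0 for white, fills only white runs that lie between two black pixels, and combines a horizontal and a vertical pass with a logical AND. The function names and the threshold parameters are hypothetical and chosen for illustration only.

```python
from typing import List

Image = List[List[int]]  # assumed convention: 1 = black pixel, 0 = white pixel


def rlsa_1d(row: List[int], threshold: int) -> List[int]:
    """Turn white runs of length <= threshold that lie between two black pixels to black."""
    out = list(row)
    last_black = None  # index of the most recent black pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap <= threshold:
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def rlsa_horizontal(image: Image, threshold: int) -> Image:
    """Smooth every row independently."""
    return [rlsa_1d(row, threshold) for row in image]


def rlsa_vertical(image: Image, threshold: int) -> Image:
    """Smooth every column by transposing, smoothing rows, and transposing back."""
    columns = [list(col) for col in zip(*image)]
    smoothed = [rlsa_1d(col, threshold) for col in columns]
    return [list(row) for row in zip(*smoothed)]


def rlsa(image: Image, h_threshold: int, v_threshold: int) -> Image:
    """Combine the horizontal and vertical passes with a pixel-wise AND."""
    horizontal = rlsa_horizontal(image, h_threshold)
    vertical = rlsa_vertical(image, v_threshold)
    return [[a & b for a, b in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]
```

With a threshold of 2, `rlsa_1d([1, 0, 0, 1, 0, 0, 0, 0, 1], 2)` fills the first gap (two white pixels) and leaves the second gap (four white pixels) untouched. Smoothing of this kind merges nearby characters into blocks, which can then be segmented and classified as text or image regions.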

718 citations

Journal Article•DOI•

A query language for XML

Alin Deutsch¹, Mary Fernández², Daniela Florescu³, Alon Y. Levy⁴, Dan Suciu²•Institutions (4)

University of Pennsylvania¹, AT&T Labs², French Institute for Research in Computer Science and Automation³, University of Washington⁴

17 May 1999

TL;DR: This work presents a query language for XML, called XML-QL, which is argued to be suitable for data extraction, conversion, transformation, and integration, and which can extract data from existing XML documents and construct new XML documents.

Abstract: An important application of XML is the interchange of electronic data (EDI) between multiple data sources on the Web. As XML data proliferates on the Web, applications will need to integrate and aggregate data from multiple sources and clean and transform data to facilitate exchange. Data extraction, conversion, transformation, and integration are all well-understood database problems, and their solutions rely on a query language. We present a query language for XML, called XML-QL, which we argue is suitable for performing the above tasks. XML-QL is a declarative, "relational complete" query language and is simple enough that it can be optimized. XML-QL can extract data from existing XML documents and construct new XML documents.

649 citations

"Transforming paper documents into X..." refers background in this paper

...…of XML, optional but powerful, is the concept of DTD (Document Type Definition), which specifies the logical hierarchy of documents and can make information retrieval on the Web easier (e.g., XML-QL is a language designed to express database-style queries in XML documents (Deutsch et al. 1999))....