Proceedings Article

A document classification and extraction system with learning ability

20 Sep 1999 - pp 197-200
TL;DR: This paper describes a domain-independent automatic document image understanding system with learning ability that applies two learning methodologies: learning from experience and an enhanced perceptron learning algorithm.
Abstract: Document image processing begins at the OCR phase, with the difficulty lying in automatic document analysis and understanding. Most existing systems only do well in their specific application domains. In this paper, we describe a domain-independent automatic document image understanding system with learning ability. A segmentation method based on "logical closeness" is proposed. A novel and natural representation of document layout structure, a directed weight graph (DWG), is described. To classify a given document, a string representation matching algorithm is applied first, instead of comparing all the sample graphs. A frame template and a document type hierarchy (DTH) are used to represent the document's logical structure and the hierarchical relationships among these frame templates, respectively. Two learning methodologies are applied: learning from experience and an enhanced perceptron learning algorithm.
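The abstract does not say how the perceptron algorithm is enhanced, so the sketch below shows only the classic perceptron update rule as background. The two-dimensional feature vectors and the function names `train_perceptron` and `predict` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label in {+1, -1})


def train_perceptron(samples: Sequence[Sample], epochs: int = 20,
                     lr: float = 1.0) -> Tuple[List[float], float]:
    """Classic perceptron rule (not the paper's enhanced variant)."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified or on the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # every sample is on the correct side: stop early
            break
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on the side of the learned hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical, linearly separable layout features for two document types.
samples = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], -1), ([0.1, 0.8], -1)]
weights, bias = train_perceptron(samples)
```

The classic rule only converges when the classes are linearly separable, which is one reason enhanced variants exist.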
Citations
Journal Article
TL;DR: In this article, the authors present a document analysis system able to assign logical labels and extract the reading order in a broad set of documents, employing all information sources, from geometric features and spatial relations to textual features and content.
Abstract: We present a document analysis system able to assign logical labels and extract the reading order in a broad set of documents. All information sources, from geometric features and spatial relations to the textual features and content, are employed in the analysis. To deal effectively with these information sources, we define a document representation general and flexible enough to represent complex documents. To handle such a broad document class, it uses generic document knowledge only, which is identified explicitly. The proposed system integrates components based on computer vision, artificial intelligence, and natural language processing techniques. The system is fully implemented and experimental results on heterogeneous collections of documents for each component and for the entire system are presented.

140 citations

01 Jan 2000
TL;DR: This paper presents a hybrid and comprehensive approach to document structure analysis that makes use of layout as well as textual features of a given document to express fuzzy matched rules of an underlying rule base.
Abstract: Document image processing is a crucial process in office automation and begins at the ‘OCR’ phase with difficulties in document ‘analysis’ and ‘understanding’. This paper presents a hybrid and comprehensive approach to document structure analysis. It is hybrid in the sense that it makes use of layout (geometrical) as well as textual features of a given document. These features are the base for potential conditions, which in turn are used to express fuzzy matched rules of an underlying rule base.

99 citations


Cites methods from "A document classification and extra..."

  • ...This file including its text, font and layout attributes can be used to conduct further analysis on the text and to find the "structured" parts of the document [5]-[11]....

    [...]

Journal Article
TL;DR: This paper presents a hybrid and comprehensive approach to document structure analysis that makes use of layout as well as textual features of a given document, which allows an easy adaptation to specific domains with their specific logical objects.
Abstract: Document image processing is a crucial process in office automation and begins at the ‘OCR’ phase with difficulties in document ‘analysis’ and ‘understanding’. This paper presents a hybrid and comprehensive approach to document structure analysis. It is hybrid in the sense that it makes use of layout (geometrical) as well as textual features of a given document. These features are the base for potential conditions, which in turn are used to express fuzzy matched rules of an underlying rule base. Rules can be formulated based on features which might be observed within one specific layout object. However, rules can also express dependencies between different layout objects. In addition to its rule-driven analysis, which allows an easy adaptation to specific domains with their specific logical objects, the system contains domain-independent markup algorithms for common objects (e.g., lists).

41 citations

01 Jan 2002
TL;DR: This chapter defines an abstract propositional formal language that expresses qualitative spatial relations among document objects, so that document encoding rules can be stated formally.
Abstract: Formal languages can also serve as document encoding languages, for instance, first-order logic. The syntax and semantics are the usual ones for first-order logic, taking special care in giving adequate semantics to spatial relations and predicates. A final example of a general document encoding rule stated informally in natural language is the following: “in the Western culture, documents are usually read top-bottom and left-right.” (7.1) A problem of stating rules in natural language is ambiguity. In fact, we do not know if one should interpret the “and” as commutative or not. Should one first go top-bottom and then left-right? Or should one apply any of the two interchangeably? It is not possible to say from the rule merely stated in natural language. In the next section, we define an abstract propositional formal language to express qualitative spatial relations among document objects to formally express document encoding rules.

7.3.2 Relations adequate for documents

Considering relations adequate for documents and their components requires a preliminary formalization step. This consists of regarding a document as a formal model. At this level of abstraction a document is a tuple 〈D, R, l〉 of document objects D, a binary relation R, and a labeling function l. Each document object d ∈ D consists of the coordinates of its bounding box (defined as the smallest rectangle containing all elements of that object): D = {d | d = 〈id, x1, y1, x2, y2〉}, where id is an identifier of the document object and (x1, y1), (x2, y2) represent the upper-left corner and the lower-right corner of the bounding box of the document object. In addition, we consider the logical labeling information. Given a set of labels L, logical labeling is a function l, typically injective, from document objects to labels: l : D → L. In the following, we consider an instance of such a model where the set of relations R is the set of bidimensional Allen relations and where the set of labels L is {title, body of text, figure, caption, footer, header, page number, graphics}. We shall refer to this model as a spatial [bidimensional Allen] model. Bidimensional Allen relations consist of 13 × 13 relations: the product of Allen’s 13 interval relations [Allen, 1983, van Benthem, 1983b] on two orthogonal axes. (Consider an inverted coordinate system for each document with origin (0,0) in the left-upper corner. The x axis spans horizontally increasing to the right, while the y axis spans vertically towards the bottom.) Each relation r ∈ A is a tuple of Allen interval relations of the form: precedes, meets, overlaps, starts, during, finishes, equals, and precedes_i, meets_i, overlaps_i, starts_i, during_i, finishes_i. We shall refer to the set of Allen bidimensional relations simply as A and to the propositional language over bidimensional Allen relations as L in the remainder of the chapter. Since Allen relations are jointly exhaustive and pairwise disjoint, so is A. This implies that given any two document objects there is one and only one A relation holding among them.
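The bidimensional Allen relation between two bounding boxes can be illustrated with a short sketch. It assumes closed intervals with start strictly less than end and the inverted coordinate system described above. The function names `allen` and `relation_2d` and the sample boxes are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): upper-left, lower-right


def allen(a1: float, a2: float, b1: float, b2: float) -> str:
    """Allen relation of interval [a1, a2] with respect to [b1, b2]."""
    if a2 < b1:
        return "precedes"
    if b2 < a1:
        return "precedes_i"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "meets_i"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "starts_i"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finishes_i"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "during_i"
    # Only proper partial overlaps remain at this point.
    return "overlaps" if a1 < b1 else "overlaps_i"


def relation_2d(a: Box, b: Box) -> Tuple[str, str]:
    """Pair of Allen relations on the x axis and the y axis."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return allen(ax1, ax2, bx1, bx2), allen(ay1, ay2, by1, by2)


# Hypothetical page: a title directly above a body block of the same width.
title: Box = (10, 10, 90, 20)
body: Box = (10, 30, 90, 100)
print(relation_2d(title, body))  # ('equals', 'precedes')
```

Because the thirteen cases are checked in a fixed order and each returns immediately, exactly one relation is produced per axis, matching the "jointly exhaustive and pairwise disjoint" property stated in the text.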

31 citations

Proceedings Article
03 Aug 2003
TL;DR: This paper describes the characteristics of data, knowledge, and information and their synergetic interweaving, and structures the inherent complexity of the sub-problems of document understanding.
Abstract: In this paper I will try to explain the nature of document understanding in all of its dimensions. Therefore I will first describe the characteristics of data, knowledge, and information in order to describe their synergetic inter-weaving. After that I will try to structure the inherent complexity of sub-problems of document understanding which may not be solved serially, but rather are attributes of individual documents. Thus, this paper focuses on system engineering challenges. However, I will show some recent work done on the different topics and give some insights into the individual techniques we chose at DFKI.

25 citations


Cites methods from "A document classification and extra..."

  • ...In general, they may be grouped into statistical methods [14, 18, 26], Neural Nets [16, 31, 51], decisions trees [25, 32], and rule learning techniques [10, 20, 33]....

    [...]

  • ...[21] [13] [40] [3] [5] [26] [37] [43] [44] [47] [9]...

    [...]

References
Journal Article
TL;DR: The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described, along with the process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools.
Abstract: Gobbledoc, a system providing remote access to stored documents, which is based on syntactic document analysis and optical character recognition (OCR), is discussed. In Gobbledoc, image processing, document analysis, and OCR operations take place in batch mode when the documents are acquired. The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described. The process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools is also described. Syntactic analysis is used in Gobbledoc to divide each page into labeled rectangular blocks. Blocks labeled text are converted by OCR to obtain a secondary (ASCII) document representation. Since such symbolic files are better suited for computerized search than for human access to the document content and because too many visual layout clues are lost in the OCR process (including some special characters), Gobbledoc preserves the original block images for human browsing. Storage, networking, and display issues specific to document images are also discussed.
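The idea of reducing 2-D segmentation to a sequence of 1-D problems can be illustrated with a recursive X-Y cut. The sketch below is a simplified illustration rather than Gobbledoc's implementation: it assumes the input is already a list of block bounding boxes rather than pixel projection profiles, and it omits the string-parsing step. The names `split_by_gaps` and `xy_tree` are hypothetical.

```python
from typing import List, Tuple, Union

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)
Tree = Union[Box, List["Tree"]]


def split_by_gaps(boxes: List[Box], axis: int) -> List[List[Box]]:
    """Group boxes whose 1-D projections on `axis` (0 = x, 1 = y) touch or overlap."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # furthest coordinate covered so far
    for box in ordered[1:]:
        if box[axis] > reach:  # a blank strip separates this box: new group
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_tree(boxes: List[Box], axis: int = 1, tried_other: bool = False) -> Tree:
    """Cut recursively, alternating axes; a leaf is one box or an uncuttable group."""
    if len(boxes) == 1:
        return boxes[0]
    groups = split_by_gaps(boxes, axis)
    if len(groups) == 1:
        if tried_other:  # no blank strip on either axis
            return list(boxes)
        return xy_tree(boxes, 1 - axis, True)
    return [xy_tree(group, 1 - axis) for group in groups]


# Hypothetical page: a full-width header above two text columns.
header, left, right = (0, 0, 100, 10), (0, 20, 45, 100), (55, 20, 100, 100)
print(xy_tree([header, left, right]))
# [(0, 0, 100, 10), [(0, 20, 45, 100), (55, 20, 100, 100)]]
```

Each call to `split_by_gaps` looks at one axis only, which is what makes every level of the tree a 1-D problem.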

466 citations


"A document classification and extra..." refers methods in this paper

  • ...The rich text information including the text with its font and attributes can be used to conduct further analysis on the text and to find the "structured" parts of the document [5-9]....

    [...]

Proceedings Article
21 Jun 1994
TL;DR: This survey covers document image understanding, the technology required to make paper documents equivalent to other computer exchange media like floppies, tapes, and CDROMs, and is restricted to documents such as business letters, forms, and scientific and technical articles such as those found in archival journals and technical conferences.
Abstract: Document image understanding encompasses the technology required to make paper documents equivalent to other computer exchange media like floppies, tapes, and CDROMs. The physical reader of the paper document is the scanner, just as the physical reader of the floppy is the floppy drive, the physical reader of the tape cartridge is the tape cartridge drive, and the physical reader of the CDROM is the CDROM drive. In the survey presented, we restrict ourselves to documents such as business letters, forms, and scientific and technical articles such as those found in archival journals and technical conferences. Understanding such documents involves estimating the rotation skew of each document page, determining the geometric page layout, labeling blocks as text or non-text, determining the read order for text blocks, recognizing the text of text blocks through an OCR system, determining the logical page layout, and formatting the data and information of the document in a suitable way for use by a word processing system or by an information retrieval system.

152 citations


"A document classification and extra..." refers methods in this paper

  • ...The rich text information including the text with its font and attributes can be used to conduct further analysis on the text and to find the "structured" parts of the document [5-9]....

    [...]

Journal Article
TL;DR: A survey on the techniques and problems involved in automatic knowledge acquisition through document processing is presented, and the basic concept of document structure and its measurement based on entropy analysis is introduced.
Abstract: The knowledge acquisition bottleneck has become the major impediment to the development and application of effective information systems. To remove this bottleneck, new document processing techniques must be introduced to automatically acquire knowledge from various types of documents. By presenting a survey on the techniques and problems involved, this paper aims at serving as a catalyst to stimulate research in automatic knowledge acquisition through document processing. In this study, a document is considered to have two structures: geometric structure and logical structure. These play a key role in the process of the knowledge acquisition, which can be viewed as a process of acquiring the above structures. Extracting the geometric structure from a document refers to document analysis; mapping the geometric structure into logical structure is regarded as document understanding. Both areas are described in this paper, and the basic concept of document structure and its measurement based on entropy analysis is introduced. Logical structure and geometric models are proposed. Both top-down and bottom-up approaches and their entropy analyses are presented. Different techniques are discussed with practical examples. Mapping methods, such as tree transformation, document formatting knowledge and document format description language, are described.

106 citations


"A document classification and extra..." refers methods in this paper

  • ...The rich text information including the text with its font and attributes can be used to conduct further analysis on the text and to find the "structured" parts of the document [5-9]....

    [...]

Proceedings Article
16 Jun 1990
TL;DR: A novel approach to automatic classification of digitized office documents, based on the inductive generalization of their layout style, is presented; it is supported by the observation that for a number of printed documents it is possible to find a set of relevant and invariant layout features.
Abstract: A novel approach to automatic classification of digitized office documents, based on the inductive generalization of their layout style, is presented. It is supported by the observation that for a number of printed documents it is possible to find a set of relevant and invariant layout features. These are geometrical characteristics automatically detected through a segmentation and layout analysis process. The learning step, in which significant examples of document classes are used to train the classification system, involves the novel idea of integrating parametric (numerical) and conceptual (symbolic) learning methods.

75 citations


"A document classification and extra..." refers methods in this paper

  • ...The rich text information including the text with its font and attributes can be used to conduct further analysis on the text and to find the "structured" parts of the document [5-9]....

    [...]

Proceedings Article
20 Oct 1993
TL;DR: A new approach to the layout analysis, called nested segmentation, is introduced and an ordered labeled tree structure (L-S-Tree) is introduced to represent the segmented document for document classification.
Abstract: Office information systems (OISs) are employed to support office workers in their management of information and to assist them in their daily work. In OISs, document classification is one of the major functional capabilities. Classifying a document can be facilitated through the layout analysis of the document. A new approach to the layout analysis, called nested segmentation, is introduced. The layout relationships of components of a document are defined in terms of the adjacency of blocks. Given the adjacency of blocks, an adjacent block graph is introduced where the problem of the nested segmentation is transformed to a classic minimal cut problem for the graph. Also, an ordered labeled tree structure (L-S-Tree) is introduced to represent the segmented document for document classification.
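As a rough illustration of the first step only, the sketch below builds an adjacency graph over block bounding boxes. The abstract does not define adjacency precisely, so the rule used here is an assumption: two blocks are adjacent when they overlap on one axis and are separated by at most `max_gap` on the other. The minimal cut step is not shown, and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def spans_overlap(a1: int, a2: int, b1: int, b2: int) -> bool:
    """True when the open intervals (a1, a2) and (b1, b2) intersect."""
    return a1 < b2 and b1 < a2


def gap(a1: int, a2: int, b1: int, b2: int) -> int:
    """Distance between two intervals; negative when they overlap."""
    return max(b1 - a2, a1 - b2)


def are_adjacent(a: Box, b: Box, max_gap: int) -> bool:
    """Assumed adjacency rule: aligned on one axis and close on the other."""
    side_by_side = (spans_overlap(a[1], a[3], b[1], b[3])
                    and 0 <= gap(a[0], a[2], b[0], b[2]) <= max_gap)
    stacked = (spans_overlap(a[0], a[2], b[0], b[2])
               and 0 <= gap(a[1], a[3], b[1], b[3]) <= max_gap)
    return side_by_side or stacked


def adjacent_block_graph(blocks: List[Box], max_gap: int) -> Dict[int, Set[int]]:
    """Undirected graph mapping each block index to its adjacent block indices."""
    graph: Dict[int, Set[int]] = {i: set() for i in range(len(blocks))}
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            if are_adjacent(blocks[i], blocks[j], max_gap):
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Hypothetical page: header, two columns, and a distant footer.
blocks = [(0, 0, 100, 10), (0, 20, 45, 100), (55, 20, 100, 100), (0, 200, 100, 220)]
graph = adjacent_block_graph(blocks, max_gap=15)
```

With these sample blocks the header and both columns are mutually adjacent, while the footer is isolated because its vertical gap exceeds `max_gap`.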

30 citations


"A document classification and extra..." refers background or methods in this paper

  • ...The usefulness and efficiency of a text processing system can be improved greatly by converting normal text representations into a new form adapted better to computer manipulation [1,2,4]....

    [...]

  • ...Definition 4.2: Document Type Hierarchy (DTH) A Frame Template is used to keep a record of logical meanings for a document....

    [...]

  • ...We organize frame templates of document types as a hierarchical structure, which is called a document type hierarchy (DTH) [1,2,4], based upon the generalization and specialization relations among the frame templates and their inheritance properties among them....

    [...]

  • ...The task of segmentation [1] is to separate the original document image file into several rectangular areas, also called blocks....

    [...]
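The excerpts above describe frame templates arranged in a document type hierarchy through generalization, specialization, and inheritance. A minimal sketch of such a structure follows. The class name `FrameTemplate`, its methods, and the document types and slot names are hypothetical illustrations, not taken from the paper.

```python
from typing import List, Optional


class FrameTemplate:
    """One node of a document type hierarchy (hypothetical structure)."""

    def __init__(self, name: str, slots: List[str],
                 parent: Optional["FrameTemplate"] = None) -> None:
        self.name = name
        self.slots = slots          # attributes introduced by this document type
        self.parent = parent        # more general type, if any
        self.children: List["FrameTemplate"] = []

    def specialize(self, name: str, slots: List[str]) -> "FrameTemplate":
        """Create a more specific document type below this one."""
        child = FrameTemplate(name, slots, parent=self)
        self.children.append(child)
        return child

    def all_slots(self) -> List[str]:
        """Slots inherited from ancestors first, then this type's own slots."""
        inherited = self.parent.all_slots() if self.parent else []
        return inherited + [s for s in self.slots if s not in inherited]


# Hypothetical hierarchy: Document -> Letter -> BusinessLetter.
document = FrameTemplate("Document", ["title", "date"])
letter = document.specialize("Letter", ["sender", "receiver"])
business = letter.specialize("BusinessLetter", ["company", "date"])
print(business.all_slots())
# ['title', 'date', 'sender', 'receiver', 'company']
```

Storing only the slots a type introduces and resolving the rest through the parent chain keeps each template small, at the cost of walking the hierarchy on every lookup.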