A document classification and extraction system with learning ability

doi:10.1109/ICDAR.1999.791758


Proceedings Article•DOI•

Xuhong Li¹, P.A. Ng•Institutions (1)

New Jersey Institute of Technology¹

20 Sep 1999-pp 197-200

TL;DR: Two learning methodologies, learning from experience and an enhanced perceptron learning algorithm, are applied in a domain-independent automatic document image understanding system.

Abstract: Document image processing begins at the OCR phase with the difficulty of automatic document analysis and understanding. Most existing systems only do well in their specific application domains. In this paper, we describe a domain-independent automatic document image understanding system with learning ability. A segmentation method based on "logical closeness" is proposed. A novel and natural representation of document layout structure, a directed weight graph (DWG), is described. To classify a given document, a string representation matching algorithm is applied first, instead of comparing all the sample graphs. A frame template and a document type hierarchy (DTH) are used to represent the document's logical structure and the hierarchical relationships among these frame templates, respectively. In this paper, two learning methodologies are applied: learning from experience and an enhanced perceptron learning algorithm.
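
The abstract does not say how the perceptron algorithm is enhanced, so the sketch below shows only the classic perceptron update rule for reference. It is not the authors' method. The feature vectors (a block count and a logo flag) and the function names `train_perceptron` and `predict` are hypothetical, chosen for illustration.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron training; labels must be +1 or -1.

    This is the textbook rule, NOT the enhanced variant from the paper.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:  # every sample classified correctly: stop early
            break
    return w, b


def predict(w, b, x):
    """Return +1 or -1 for feature vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical features: [number of layout blocks, has_logo flag].
X = [[2, 1], [3, 1], [8, 0], [9, 0]]
y = [1, 1, -1, -1]  # +1 = "letter", -1 = "article" (made-up classes)
w, b = train_perceptron(X, y)
```

On linearly separable data such as this toy set, the loop stops once an epoch produces no misclassification.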


Citations


Journal Article•DOI•

Document understanding for a broad class of documents

[...]

Marco Aiello¹, Christof Monz¹, Leon Todoran¹, Marcel Worring¹•Institutions (1)

University of Amsterdam¹

01 Nov 2002-International Journal on Document Analysis and Recognition

TL;DR: The authors present a document analysis system able to assign logical labels and extract the reading order in a broad set of documents; all information sources, from geometric features and spatial relations to textual features and content, are employed in the analysis.

Abstract: We present a document analysis system able to assign logical labels and extract the reading order in a broad set of documents. All information sources, from geometric features and spatial relations to the textual features and content are employed in the analysis. To deal effectively with these information sources, we define a document representation general and flexible enough to represent complex documents. To handle such a broad document class, it uses generic document knowledge only, which is identified explicitly. The proposed system integrates components based on computer vision, artificial intelligence, and natural language processing techniques. The system is fully implemented and experimental results on heterogeneous collections of documents for each component and for the entire system are presented.


140 citations

Document Structure Analysis Based on Layout and Textual Features

[...]

Stefan Klink, Andreas Dengel, Thomas Kieninger¹•Institutions (1)

German Research Centre for Artificial Intelligence¹

01 Jan 2000

TL;DR: This paper presents a hybrid and comprehensive approach to document structure analysis that makes use of layout as well as textual features of a given document to express fuzzy matched rules of an underlying rule base.


Abstract: Document image processing is a crucial process in office automation and begins at the ‘OCR’ phase with difficulties in document ‘analysis’ and ‘understanding’. This paper presents a hybrid and comprehensive approach to document structure analysis: hybrid in the sense that it makes use of layout (geometrical) as well as textual features of a given document. These features are the base for potential conditions which in turn are used to express fuzzy matched rules of an underlying rule base.

99 citations

Cites methods from "A document classification and extra..."

...This file including its text, font and layout attributes can be used to conduct further analysis on the text and to find the "structured" parts of the document [5]-[11]....
[...]

Journal Article•DOI•

Rule-based document structure understanding with a fuzzy combination of layout and textual features

[...]

Stefan Klink, Thomas Kieninger¹•Institutions (1)

German Research Centre for Artificial Intelligence¹

01 Aug 2001-International Journal on Document Analysis and Recognition

TL;DR: This paper presents a hybrid and comprehensive approach to document structure analysis that makes use of layout as well as textual features of a given document, which allows an easy adaptation to specific domains with their specific logical objects.


Abstract: Document image processing is a crucial process in office automation and begins at the ‘OCR’ phase with difficulties in document ‘analysis’ and ‘understanding’. This paper presents a hybrid and comprehensive approach to document structure analysis: hybrid in the sense that it makes use of layout (geometrical) as well as textual features of a given document. These features are the base for potential conditions which in turn are used to express fuzzy matched rules of an underlying rule base. Rules can be formulated based on features which might be observed within one specific layout object. However, rules can also express dependencies between different layout objects. In addition to its rule-driven analysis, which allows an easy adaptation to specific domains with their specific logical objects, the system contains domain-independent markup algorithms for common objects (e.g., lists).

41 citations

Spatial reasoning : theory and practice

[...]

Marco Aiello

01 Jan 2002

TL;DR: This chapter defines an abstract propositional formal language for expressing qualitative spatial relations among document objects, so that document encoding rules can be stated formally.

Abstract: formal languages can also serve as document encoding languages, for instance, first-order logic. The syntax and semantics are the usual ones for first-order logic, taking special care in giving adequate semantics to spatial relations and predicates. A final example of a general document encoding rule stated informally in natural language is the following:

(7.1) “in the Western culture, documents are usually read top-bottom and left-right.”

A problem of stating rules in natural language is ambiguity. In fact, we do not know if one should interpret the “and” as commutative or not. Should one first go top-bottom and then left-right? Or, should one apply any of the two interchangeably? It is not possible to say from the rule merely stated in natural language. In the next section, we define an abstract propositional formal language to express qualitative spatial relations among document objects to formally express document encoding rules.

7.3.2 Relations adequate for documents

Considering relations adequate for documents and their components requires a preliminary formalization step. This consists of regarding a document as a formal model. At this level of abstraction a document is a tuple 〈D, R, l〉 of document objects D, a binary relation R, and a labeling function l. Each document object d ∈ D consists of the coordinates of its bounding box (defined as the smallest rectangle containing all elements of that object):

D = {d | d = 〈id, x1, y1, x2, y2〉}

where id is an identifier of the document object and (x1, y1) and (x2, y2) represent the upper-left corner and the lower-right corner of the bounding box of the document object. In addition, we consider the logical labeling information. Given a set of labels L, logical labeling is a function l, typically injective, from document objects to labels:

l : D → L

In the following, we consider an instance of such a model where the set of relations R is the set of bidimensional Allen relations and where the set of labels L is {title, body of text, figure, caption, footer, header, page number, graphics}. We shall refer to this model as a spatial [bidimensional Allen] model. Bidimensional Allen relations consist of 13 × 13 relations: the product of Allen’s 13 interval relations [Allen, 1983, van Benthem, 1983b] on two orthogonal axes. (Consider an inverted coordinate system for each document with origin (0,0) in the left-upper corner. The x axis spans horizontally increasing to the right, while the y axis spans vertically towards the bottom.) Each relation r ∈ A is a tuple of Allen interval relations of the form: precedes, meets, overlaps, starts, during, finishes, equals, and precedes_i, meets_i, overlaps_i, starts_i, during_i, finishes_i. We shall refer to the set of Allen bidimensional relations simply as A and to the propositional language over bidimensional Allen relations as L in the remainder of the chapter. Since Allen relations are jointly exhaustive and pairwise disjoint, so is A. This implies that given any two document objects there is one and only one A relation holding among them.
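
The bounding-box model and the bidimensional Allen relations described above can be illustrated with a short sketch. It is a minimal reading of the definitions, not code from the chapter. The names `DocObject`, `allen`, and `relation_2d`, the `_i` suffix for inverse relations, and the sample coordinates are assumptions made for the example.

```python
from typing import NamedTuple, Tuple


class DocObject(NamedTuple):
    """A document object d = <id, x1, y1, x2, y2> (upper-left, lower-right)."""
    id: str
    x1: float
    y1: float
    x2: float
    y2: float


def allen(a_start, a_end, b_start, b_end) -> str:
    """Allen interval relation of interval a with respect to interval b.

    Assumes start < end for both intervals. Inverses carry the suffix "_i".
    """
    if a_end < b_start:
        return "precedes"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "precedes_i"
    if b_end == a_start:
        return "meets_i"
    # From here on the two intervals share interior points.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "starts_i"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finishes_i"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "during_i"
    return "overlaps" if a_start < b_start else "overlaps_i"


def relation_2d(a: DocObject, b: DocObject) -> Tuple[str, str]:
    """Bidimensional Allen relation: (relation on x axis, relation on y axis)."""
    return (allen(a.x1, a.x2, b.x1, b.x2), allen(a.y1, a.y2, b.y1, b.y2))


# Origin is the upper-left corner; y grows towards the bottom of the page.
title = DocObject("title", 10, 10, 90, 20)
body = DocObject("body", 10, 25, 90, 80)
print(relation_2d(title, body))  # ('equals', 'precedes')
```

Because each branch of `allen` returns exactly one of the 13 names, every pair of boxes gets exactly one of the 13 × 13 pairs, matching the "one and only one" property stated above.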


31 citations

Proceedings Article•DOI•

Making documents work: challenges for document understanding

[...]

Andreas Dengel¹•Institutions (1)

German Research Centre for Artificial Intelligence¹

03 Aug 2003

TL;DR: The characteristics of data, knowledge, and information are described to explain their synergetic interweaving, and the inherent complexity of the sub-problems of document understanding is structured.

Abstract: In this paper I will try to explain the nature of document understanding in all of its dimensions. Therefore I will first describe the characteristics of data, knowledge, and information in order to describe their synergetic interweaving. After that I will try to structure the inherent complexity of sub-problems of document understanding which may not be solved serially, but rather are attributes of individual documents. Thus, this paper focuses on system engineering challenges. However, I will show some recent work done on the different topics and give some insights in the individual techniques we chose at DFKI.

25 citations

Cites methods from "A document classification and extra..."

...In general, they may be grouped into statistical methods [14, 18, 26], Neural Nets [16, 31, 51], decisions trees [25, 32], and rule learning techniques [10, 20, 33]....
[...]
...[21] [13] [40] [3] [5] [26] [37] [43] [44] [47] [9]...
[...]


References


Journal Article•DOI•

A prototype document image analysis system for technical journals

[...]

George Nagy¹, Sharad C. Seth², Mahesh Viswanathan³•Institutions (3)

Rensselaer Polytechnic Institute¹, University of Nebraska–Lincoln², IBM³

01 Jul 1992-IEEE Computer

TL;DR: The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described, as is the process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools.

Abstract: Gobbledoc, a system providing remote access to stored documents, which is based on syntactic document analysis and optical character recognition (OCR), is discussed. In Gobbledoc, image processing, document analysis, and OCR operations take place in batch mode when the documents are acquired. The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described. The process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools is also described. Syntactic analysis is used in Gobbledoc to divide each page into labeled rectangular blocks. Blocks labeled text are converted by OCR to obtain a secondary (ASCII) document representation. Since such symbolic files are better suited for computerized search than for human access to the document content and because too many visual layout clues are lost in the OCR process (including some special characters), Gobbledoc preserves the original block images for human browsing. Storage, networking, and display issues specific to document images are also discussed.
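
To make the idea of reducing 2-D segmentation to 1-D problems concrete, the sketch below shows a recursive X-Y cut over projection profiles, a commonly described way to build an X-Y tree. It is not Gobbledoc's implementation. The page is assumed to be a small binary grid (1 = ink), and `ink_runs`, `xy_cut`, and `leaves` are hypothetical names. The syntactic parsing of each profile is omitted; a cut is simply made at every blank gap.

```python
def ink_runs(profile):
    """Return (start, end) pairs of maximal non-blank runs in a 1-D profile."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def xy_cut(page, top, left, bottom, right, horizontal=True, tried_other=False):
    """Split a region at blank gaps, alternating cut direction.

    Returns a nested list (the tree) whose leaves are
    (top, left, bottom, right) boxes, or None for a blank region.
    """
    if horizontal:  # project onto the y axis: one value per row
        profile = [sum(page[r][left:right]) for r in range(top, bottom)]
    else:           # project onto the x axis: one value per column
        profile = [sum(page[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
    runs = ink_runs(profile)
    if not runs:
        return None
    if len(runs) == 1:
        start, end = runs[0]
        if horizontal:
            top, bottom = top + start, top + end
        else:
            left, right = left + start, left + end
        if tried_other:  # no cut possible in either direction: leaf block
            return (top, left, bottom, right)
        return xy_cut(page, top, left, bottom, right, not horizontal, True)
    if horizontal:
        return [xy_cut(page, top + s, left, top + e, right, False) for s, e in runs]
    return [xy_cut(page, top, left + s, bottom, left + e, True) for s, e in runs]


def leaves(tree):
    """Flatten the tree into a list of leaf boxes."""
    if tree is None:
        return []
    if isinstance(tree, tuple):
        return [tree]
    return [box for child in tree for box in leaves(child)]


page = [
    [1, 1, 1, 1, 1, 1, 1],  # header line
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1],  # two columns
    [1, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],  # footer line
]
tree = xy_cut(page, 0, 0, len(page), len(page[0]))
```

Each recursive call inspects only a 1-D profile, which is the point where a syntactic system could parse the profile as a string rather than cutting at every gap.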


466 citations

"A document classification and extra..." refers methods in this paper

...The rich text information including the text with its font and attributes can be used to conduct further analysis on the text and to find the "structured" parts of the document [5-9]....
[...]

Proceedings Article•DOI•

Document image understanding: geometric and logical layout

[...]

Haralick¹•Institutions (1)

University of Washington¹

21 Jun 1994

TL;DR: Document image understanding encompasses the technology required to make paper documents equivalent to other computer exchange media like floppies, tapes, and CDROMs; the survey is restricted to documents such as business letters, forms, and scientific and technical articles such as those found in archival journals and technical conferences.

Abstract: Document image understanding encompasses the technology required to make paper documents equivalent to other computer exchange media like floppies, tapes, and CDROMs. The physical reader of the paper document is the scanner just like the physical reader of the floppy is the floppy drive and the physical reader of the tape cartridge is the tape cartridge drive, and the physical reader of the CDROM is the CDROM drive. In the survey presented, we restrict ourselves to documents such as business letters, forms, and scientific and technical articles such as those found in archival journals and technical conferences. Understanding such documents involves estimating the rotation skew of each document page, determining the geometric page layout, labeling blocks as text or non-text, determining the read order for text blocks, recognizing the text of text blocks through an OCR system, determining the logical page layout, and formatting the data and information of the document in a suitable way for use by a word processing system or by an information retrieval system.

152 citations

"A document classification and extra..." refers methods in this paper

...The rich text information including the text with its font and attributes can be used to conduct further analysis on the text and to find the "structured" parts of the document [5-9]....
[...]

Journal Article•DOI•

Document processing for automatic knowledge acquisition

[...]

Yuan Yan Tang¹, Chang De Yan¹, Ching Y. Suen¹•Institutions (1)

Concordia University¹

01 Feb 1994-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A survey on the techniques and problems involved in automatic knowledge acquisition through document processing is presented, and the basic concept of document structure and its measurement based on entropy analysis is introduced.


Abstract: The knowledge acquisition bottleneck has become the major impediment to the development and application of effective information systems. To remove this bottleneck, new document processing techniques must be introduced to automatically acquire knowledge from various types of documents. By presenting a survey on the techniques and problems involved, this paper aims at serving as a catalyst to stimulate research in automatic knowledge acquisition through document processing. In this study, a document is considered to have two structures: geometric structure and logical structure. These play a key role in the process of the knowledge acquisition, which can be viewed as a process of acquiring the above structures. Extracting the geometric structure from a document refers to document analysis; mapping the geometric structure into logical structure is regarded as document understanding. Both areas are described in this paper, and the basic concept of document structure and its measurement based on entropy analysis is introduced. Logical structure and geometric models are proposed. Both top-down and bottom-up approaches and their entropy analyses are presented. Different techniques are discussed with practical examples. Mapping methods, such as tree transformation, document formatting knowledge and document format description language, are described.

106 citations

"A document classification and extra..." refers methods in this paper

...The rich text information including the text with its font and attributes can be used to conduct further analysis on the text and to find the "structured" parts of the document [5-9]....
[...]

Proceedings Article•DOI•

An experimental page layout recognition system for office document automatic classification: an integrated approach for inductive generalization

[...]

Floriana Esposito, Donato Malerba, Giovanni Semeraro, E. Annese, G. Scafuro

16 Jun 1990

TL;DR: A novel approach to automatic classification of digitized office documents, based on the inductive generalization of their layout style, is presented; it is supported by the observation that for a number of printed documents it is possible to find a set of relevant and invariant layout features.

Abstract: A novel approach to automatic classification of digitized office documents, based on the inductive generalization of their layout style, is presented. It is supported by the observation that for a number of printed documents it is possible to find a set of relevant and invariant layout features. These are geometrical characteristics automatically detected through a segmentation and layout analysis process. The learning step, in which significant examples of document classes are used to train the classification system, involves the novel idea of integrating parametric (numerical) and conceptual (symbolic) learning methods.

75 citations

"A document classification and extra..." refers methods in this paper

...The rich text information including the text with its font and attributes can be used to conduct further analysis on the text and to find the "structured" parts of the document [5-9]....
[...]

Proceedings Article•DOI•

Nested segmentation: an approach for layout analysis in document classification

[...]

Xiaolong Hao¹, J.T.L. Wang, P.A. Ng•Institutions (1)

New Jersey Institute of Technology¹

20 Oct 1993

TL;DR: A new approach to the layout analysis, called nested segmentation, is introduced and an ordered labeled tree structure (L-S-Tree) is introduced to represent the segmented document for document classification.


Abstract: Office information systems (OISs) are employed to support office workers in their management of information and to assist them in their daily work. In the OISs, document classification is one of the major functional capabilities. Classifying a document can be facilitated through the layout analysis of the document. A new approach to the layout analysis, called nested segmentation, is introduced. The layout relationships of components of a document are defined in terms of the adjacency of blocks. Given the adjacency of blocks, an adjacent block graph is introduced where the problem of the nested segmentation is transformed to a classic minimal cut problem for the graph. Also, an ordered labeled tree structure (L-S-Tree) is introduced to represent the segmented document for document classification.

30 citations

"A document classification and extra..." refers background or methods in this paper

...The usefulness and efficiency of a text processing system can be improved greatly by converting normal text representations into a new form adapted better to computer manipulation [1,2,4]....
[...]
...Definition 4.2: Document Type Hierarchy (DTH) A Frame Template is used to keep a record of logical meanings for a document....
[...]
...We organize frame templates of document types as a hierarchical structure, which is called a document type hierarchy (DTH) [1,2,4], based upon the generalization and specialization relations among the frame templates and their inheritance properties among them....
[...]
...The task of segmentation [1] is to separate the original document image file into several rectangular areas, also called blocks....
[...]