
Showing papers on "Document layout analysis published in 1995"


Book
01 Jan 1995
TL;DR: The document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components, yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.
Abstract: Page layout analysis is a document processing technique used to determine the format of a page. This paper describes the document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components. The method yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks. It is advantageous over many other methods in three main ways: independence from skew angle, independence from different text spacings, and the ability to process local regions of different text orientations within the same image. Results of the method shown for several different page formats and for randomly oriented subpages on the same image illustrate the versatility of the method. We also discuss the differences, advantages, and disadvantages of the docstrum with respect to other layout methods.

628 citations
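
The nearest-neighbor idea in the docstrum abstract above can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the choice of k = 5 neighbors, the 15-degree tolerance, and the 1-unit histogram bins are all assumptions made for illustration. It takes component centroids, collects distance and angle to each component's nearest neighbors, reads the skew off the angle histogram peak, and reads the two spacings off distance histogram peaks along and across that direction.

```python
import math
from collections import Counter
from statistics import median

def fold(deg):
    """Fold an angle in degrees into (-90, 90] so opposite directions coincide."""
    deg %= 180.0
    return deg - 180.0 if deg > 90.0 else deg

def angle_gap(a, b):
    """Smallest difference between two undirected angles, in degrees."""
    return abs((a - b + 90.0) % 180.0 - 90.0)

def knn_pairs(centroids, k=5):
    """Return (distance, folded angle) for each point's k nearest neighbors."""
    pairs = []
    for i, (xi, yi) in enumerate(centroids):
        cands = []
        for j, (xj, yj) in enumerate(centroids):
            if i != j:
                dx, dy = xj - xi, yj - yi
                angle = fold(math.degrees(math.atan2(dy, dx)))
                cands.append((math.hypot(dx, dy), angle))
        cands.sort()
        pairs.extend(cands[:k])
    return pairs

def histogram_peak(values):
    """Most common value after rounding to the nearest integer, or None."""
    counts = Counter(round(v) for v in values)
    return counts.most_common(1)[0][0] if counts else None

def docstrum_estimates(centroids, k=5, tol=15.0):
    """Estimate (skew in degrees, within-line spacing, between-line spacing).

    Assumes at least two centroids and a skew that is not close to +/-90
    degrees (the 1-degree angle bins are not wrapped in this sketch).
    """
    pairs = knn_pairs(centroids, k)
    peak = histogram_peak(a for _, a in pairs)
    skew = median(a for _, a in pairs if angle_gap(a, peak) <= 1.0)
    within = histogram_peak(d for d, a in pairs if angle_gap(a, skew) <= tol)
    between = histogram_peak(
        d for d, a in pairs if angle_gap(a, skew + 90.0) <= tol
    )
    return skew, within, between

def synthetic_page(rows=4, cols=20, char_gap=10.0, line_gap=25.0, skew_deg=5.0):
    """Hypothetical test data: a regular grid of centroids rotated by skew_deg."""
    t = math.radians(skew_deg)
    points = []
    for r in range(rows):
        for c in range(cols):
            x, y = c * char_gap, r * line_gap
            points.append((x * math.cos(t) - y * math.sin(t),
                           x * math.sin(t) + y * math.cos(t)))
    return points

if __name__ == "__main__":
    print(docstrum_estimates(synthetic_page()))
```

The brute-force neighbor search is quadratic in the number of components, which is acceptable for a sketch; a spatial index would be the usual choice for full pages.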


Journal ArticleDOI
TL;DR: This paper discusses some layout adjustment methods and the preservation of the 'mental map' of the diagram, and two kinds of layout adjustments are described, an algorithm for rearranging a diagram to avoid overlapping nodes and a method aimed at changing the focus of interest of the user without destroying the mental map.
Abstract: Many models in software and information engineering use graph representations; examples are data flow diagrams, state transition diagrams, flow charts, PERT charts, organization charts, Petri nets and entity-relationship diagrams. The usefulness of these graph representations depends on the quality of the layout of the graphs. Automatic graph layout, which can release humans from graph drawing, is now available in several visualization systems. Most automatic layout facilities take a purely combinatorial description of a graph and produce a layout of the graph; these methods are called 'layout creation' methods. For interactive systems, another kind of layout is needed: a facility which can adjust a layout after a change is made by the user or by the application. Although layout adjustment is essential in interactive systems, most existing layout algorithms are designed for layout creation. The use of a layout creation method for layout adjustment may totally rearrange the layout and thus destroy the user's 'mental map' of the diagram; thus a set of layout adjustment methods, separate from layout creation methods, is needed. This paper discusses some layout adjustment methods and the preservation of the 'mental map' of the diagram. First, several models are proposed to make the concept of 'mental map' more precise. Then two kinds of layout adjustments are described. One is an algorithm for rearranging a diagram to avoid overlapping nodes, and the other is a method aimed at changing the focus of interest of the user without destroying the mental map. Next, some experience with visualization systems in which the techniques have been employed is also described.

613 citations


Patent
30 Jun 1995
TL;DR: In this paper, the geometric and logical structure of a document page from its image is determined by partitioning the document image into text and non-text regions, which are then organized into related groups in the correct reading order.
Abstract: Apparatus and method are provided which determine the geometric and logical structure of a document page from its image. The document image is partitioned into regions (both text and non-text) which are then organized into related "article" groups in the correct reading order. The invention uses image-based features, text-based features, and assumptions based on knowledge of expected layout, to find the correct reading order of the text blocks on a document page. It can handle complex layouts which have multiple configurations of columns on a page and inset material (such as figures and inset text blocks). The apparatus comprises two main components, a geometric page segmentor and a logical page organizer. The geometric page segmentor partitions a binary image of a document page into fundamental units of text or non-text, and produces a list of rectangular blocks, their locations on the page in points (1/72 inch), and the locations of horizontal rule lines on the page. The logical page organizer receives a list of text region locations, rule line locations, associated ASCII text (as found from an OCR) for the text blocks, and a list of text attributes (such as face style and point size). The logical page organizer groups appropriately the components (both text and non-text) which comprise a document page, sequences them in a correct reading order and establishes the dominance pattern (e.g., find the lead article on a newspaper page).

158 citations


Journal ArticleDOI
TL;DR: The authors introduce a classification tree to manage the relationships among different classes of layout structures and propose a method to recognize the layout structures of multiple kinds of table-form document images.
Abstract: Many approaches have reported that knowledge-based layout recognition methods are very successful in classifying the meaningful data from document images automatically. However, these approaches are applicable only to a single kind of document because they are based on the paradigm of specifying the structure definition information in advance so that a particular class of documents can be analyzed intelligently. In this paper, the authors propose a method to recognize the layout structures of multiple kinds of table-form document images. For this purpose, the authors introduce a classification tree to manage the relationships among different classes of layout structures. The authors' recognition system has two modes: layout knowledge acquisition and layout structure recognition. In the layout knowledge acquisition mode, table-form document images are distinguished according to this classification tree, and then the structure description trees which specify the logical structures of table-form documents are generated automatically. In the layout structure recognition mode, individual item fields in the table-form document images are extracted and classified successfully by searching the classification tree and interpreting the structure description tree.

151 citations


Book
01 Jan 1995
TL;DR: In this article, the authors describe a text reading system consisting of three major components: document analysis, document understanding, and character segmentation/recognition, which is used for multicolumned and multi-article documents.
Abstract: The document image processes used in a recently developed text reading system are described. The system consists of three major components: document analysis, document understanding, and character segmentation/recognition. The document analysis component extracts lines of text from a page for recognition. The document understanding component extracts logical relationships between the document constituents. The character segmentation/recognition component extracts characters from a text line and recognizes them. Experiments on more than a hundred documents have proved that the proposed approaches to document analysis and document understanding are robust even for multicolumned and multiarticle documents containing graphics and photographs, and that the proposed character segmentation/recognition method is robust enough to cope with omnifont characters which frequently touch each other.

128 citations


Patent
Akio Yamashita, Kazuharu Toyokawa
26 Jul 1995
TL;DR: In this article, a tree structure and layout model are automatically generated by automatically extracting the tree structure in accordance with document image analysis before a user executes graphical correction, and then the area segmentation is displayed on a display unit together with a document image and interactively corrected by the user to define a desired tree structure.
Abstract: The present invention provides a method for extracting a tree structure by using image analysis results of an actual document and generating a flexible layout model. A tree structure and layout model are newly generated by automatically extracting the tree structure in accordance with document image analysis before a user executes graphical correction. That is, an inputted document image is physically analyzed to extract a separator with a high possibility to separate the objects of the document and segment the above document image into a plurality of areas in accordance with the information for the separator. Then, the area segmentation is displayed on a display unit together with a document image and interactively corrected by the user to define a desired tree structure and complete a flexible layout model by setting a parameter to each node of the tree structure.

121 citations


Patent
15 Dec 1995
TL;DR: In this article, a method of displaying information in a computer system is described, which consists of a plurality of document images, text files, and positions files, where the first text file of the plurality of text files represents optical character recognized text of a corresponding first document image.
Abstract: A method of displaying information in a computer system is described. The computer system includes a plurality of document images, a plurality of text files, and a plurality of positions files. A first text file of the plurality of text files represents optical character recognized text of a corresponding first document image of the plurality of document images. A first positions file of the plurality of positions files relates character information in the first text file to a position in the first document image. The computer system searches the plurality of text files using a search term to generate a set of found text files. Each found text file of the set of found text files includes at least a first matching text string to the search term; the set of found text files includes the first text file. The system accesses the first positions file to determine a first region in the first document image corresponding to the first matching text string. The system displays the first document image including displaying a first enhanced view of the first region, the first enhanced view being enhanced relative to a display of the first document image, the first enhanced view being determined from a previously stored visual enhancement definition.

111 citations


Book
01 Jan 1995
TL;DR: In this article, a conceptual framework for solving the task of document analysis, which consists in the conversion of the document's pixel representation into an equivalent knowledge network representation holding the document content and layout, is presented.
Abstract: The authors present a conceptual framework for solving the task of document analysis, which, in essence, consists in the conversion of the document's pixel representation into an equivalent knowledge network representation holding the document's content and layout. Starting on the pixel level, the formation of elementary geometric objects on which layout analysis as well as the definition of character objects is based is described. Character recognition accomplishes the mapping from geometric object to character meaning in ASCII representation. On the next level of abstraction words are formed and verified by contextual processing. Modeled knowledge about complete documents and about how their constituents are related to the application forms the highest level of abstraction. The various problems arising at each stage are discussed. The dependencies between the different levels are exemplified and technical solutions put forward.

89 citations


Patent
19 Jun 1995
TL;DR: In this article, an image-based dual path, document processing system including an imaging unit, a character recognition unit, dual path module, and an encoder is presented, where a document received by the system is sequentially processed through the imaging unit.
Abstract: An image based, dual path, document processing system including an imaging unit, a character recognition unit, a dual path module and an encoder. A document received by the system is sequentially processed through the imaging unit, the character recognition unit, the dual path module and the encoder. The imaging unit images the front face of the document and attempts to identify character data appearing on the face of said document, such as a hand-written courtesy amount appearing on a bank check, while the character recognition unit is utilized to read machine-readable data, such as MICR data, printed on the face of the document. The dual path module includes a first document path for directly delivering the document through the dual path module to the encoder upon successful processing of the document by the imaging and character recognition units, and a second document path including an action window wherein an operator may perform corrective action on the document upon unsuccessful processing of the document by either the imaging unit or the character recognition unit.

81 citations


Patent
Makoto Murata
03 Apr 1995
TL;DR: In this article, a document processing system is described in which a content layout unit calls a document structure layout unit to embed in the content portion the document structure laid out by that unit, whereby layout processing of embedding a medium into another already embedded medium can be realized (e.g., a mathematical formula into a text in a graphic frame).
Abstract: A document processing system in which, when it is desired to lay out a document structure embedded in a content portion to be laid out, a content layout unit calls a document structure layout unit to embed in the content portion the document structure laid out by the called document structure layout unit, whereby layout processing of embedding a medium into another already embedded medium can be realized (for example, embedding a mathematical formula into a text in a graphic frame). Further, such data (layout of a boxed item) that is difficult to express in terms of content portion can be expressed in terms of document structures appearing in the middle of the processing.

80 citations


Proceedings ArticleDOI
14 Aug 1995
TL;DR: A computational model for document logical structure derivation is developed, in which a rule-based control strategy utilizes the data obtained from analyzing a digitized document image, and makes inferences using a multi-level knowledge base of document layout rules.
Abstract: The analysis of a document image to derive a symbolic description of its structure and contents involves using spatial domain knowledge to classify the different printed blocks (e.g., text paragraphs), group them into logical units (e.g., newspaper stories), and determine the reading order of the text blocks within each unit. These steps describe the conversion of the physical structure of a document into its logical structure. We have developed a computational model for document logical structure derivation, in which a rule-based control strategy utilizes the data obtained from analyzing a digitized document image, and makes inferences using a multi-level knowledge base of document layout rules. The knowledge-based document logical structure derivation system (DeLoS) based on this model consists of a hierarchical rule-based control system to guide the block classification, grouping and read-ordering operations; a global data structure to store the document image data and incremental inferences; and a domain knowledge base to encode the rules governing document layout.

Patent
Takashi Saito
18 Aug 1995
TL;DR: In this article, an extracting step extracts text regions from an input document image and a classifying step classifies the text regions into in-order reading regions to be successively read in a predetermined order and different-attribute regions.
Abstract: An extracting step extracts text regions from an input document image. A classifying step classifies the text regions into in-order reading regions to be successively read in the predetermined order and different-attribute regions. A detecting step detects the construction of the in-order reading regions. A determining step determines the reading order, in which the in-order reading regions are to be read, using the construction. The detecting step detects the construction in a manner that is the same whether the input document image comprises a vertically typeset document or a horizontally typeset document. The detecting step further includes a tree graph formation step of forming a tree graph representing the construction, including nodes respectively representing the in-order reading regions.

Journal Article
TL;DR: In this paper, a strategy for document analysis is presented which uses Portable Document Format (PDF) as its starting point, examining the appearance and geometric position of text and image blocks distributed over an entire document.
Abstract: A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships existing between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues in bridging the ‘semantic gap’ between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding.

Patent
07 Jun 1995
TL;DR: In this paper, a data processing system and method are presented for generating a representation of an electronic document, indexing the electronic document, navigating the electronic document using its representation, and displaying the electronic document on an output device.
Abstract: A data processing system and method for generating a representation of an electronic document, for indexing the electronic document, for navigating the electronic document using its representation and for displaying the electronic document on an output device. The system and method are used with electronic documents having descriptive markup which describes the content or meaning of the document rather than its appearance. Such documents may be represented by a tree. Each markup element defines a node or element in a tree. The tree is represented by providing a unique identifier for each element and for accessing a descriptor of the element. An element descriptor preferably includes indications of the parent, first child, last child, left sibling, right sibling, type name and text location for the element. The document representation is used to facilitate navigation of the text for constructing navigational aids such as table of contents and full text indexing. A document is also provided with a style sheet for specifying desired formatting characteristics for each type of element in the document. To display the document, a suitable starting point is found on the basis of a selected starting point. The document is displayed beginning with the suitable starting point and the format characteristics for each element displayed are retrieved from the style sheet and applied to the text of the displayed element.
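
The element descriptor listed in the abstract above (parent, first child, last child, left sibling, right sibling, type name, text location) maps naturally onto a small table of records. The following Python sketch shows one way such a table might look. The class names, the use of list indices as the unique identifiers, and the character-offset meaning of `text_location` are assumptions for illustration, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional

@dataclass
class ElementDescriptor:
    type_name: str
    text_location: int                   # assumed: character offset into the text
    parent: Optional[int] = None
    first_child: Optional[int] = None
    last_child: Optional[int] = None
    left_sibling: Optional[int] = None
    right_sibling: Optional[int] = None

class DocumentTree:
    """Table of descriptors; an element's identifier is its index in the table."""

    def __init__(self) -> None:
        self.elements: List[ElementDescriptor] = []

    def add(self, type_name: str, text_location: int,
            parent: Optional[int] = None) -> int:
        """Append a new element as the last child of `parent` and return its id."""
        eid = len(self.elements)
        desc = ElementDescriptor(type_name, text_location, parent=parent)
        self.elements.append(desc)
        if parent is not None:
            p = self.elements[parent]
            if p.last_child is None:
                p.first_child = eid
            else:
                desc.left_sibling = p.last_child
                self.elements[p.last_child].right_sibling = eid
            p.last_child = eid
        return eid

    def children(self, eid: int) -> Iterator[int]:
        """Walk the child list by following right-sibling links."""
        child = self.elements[eid].first_child
        while child is not None:
            yield child
            child = self.elements[child].right_sibling

    def preorder(self, eid: int = 0) -> Iterator[int]:
        """Visit elements in document order, starting at `eid`."""
        yield eid
        for child in self.children(eid):
            yield from self.preorder(child)

    def ids_of_type(self, type_name: str) -> List[int]:
        """Collect ids of one element type, e.g. titles for a table of contents."""
        return [e for e in self.preorder() if self.elements[e].type_name == type_name]
```

A table of contents, as mentioned in the abstract, could then be approximated by collecting the `text_location` of every element whose type is a title, under the assumption that the markup uses such a type name.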


Patent
14 Dec 1995
TL;DR: In this paper, a method of automatically generating a thematic summary from a document image without performing character recognition to generate an ASCII representation of the document text is presented, which is based on word image equivalence classes and sentence boundaries.
Abstract: A method of automatically generating a thematic summary from a document image without performing character recognition to generate an ASCII representation of the document text. The method begins with decomposition of the document image into text blocks and text lines. Using the median x-height of text blocks, the main body of text is identified. Afterward, word image equivalence classes and sentence boundaries within the blocks of the main body of text are determined. The word image equivalence classes are used to identify thematic words. These, in turn, are used to score the sentences within the main body of text, and the highest scoring sentences are selected for extraction.

Proceedings ArticleDOI
14 Aug 1995
TL;DR: A hybrid approach to the problem of document analysis in which the document image is segmented by means of a top-down technique and then basic blocks are grouped bottom-up in order to form complex layout components.
Abstract: In this paper, we present a hybrid approach to the problem of document analysis in which the document image is segmented by means of a top-down technique and then basic blocks are grouped bottom-up in order to form complex layout components. In this latter process, called layout analysis, only generic knowledge on typesetting conventions is exploited. Such knowledge is independent of the particular class of processed documents and turns out to be valuable for a wide range of documents. Preliminary results of the layout analysis system LEX (Layout EXpert) show the methodological validity of this approach.

Patent
Nelson Hon Ng
13 Dec 1995
TL;DR: In this article, a test system for testing layout services for supporting complex text languages in a graphical user interface (GUI) system is presented, which allows users to modify input and output text layout attributes.
Abstract: A test system for testing layout services for supporting complex text languages in a graphical user interface (GUI) system. A GUI application is used to test the implementation of locale-specific layout services modules via the Layout Services interface, with little or no modification of the GUI toolkit. The application allows users to modify input and output text layout attributes. The input processing layout is processed and input characters are transformed into presentation layout, and the output data may be analyzed to determine if the correct transformations are being made. The application tool allows users to test a specific locale interface, and returns the result data so that the user may interpret the data. Errors may be detected and isolated to each locale-specific language module.

Patent
Masaharu Ozaki, Mudita Jain
07 Jun 1995
TL;DR: In this paper, a document white region extraction system identifies the major white regions that segment and define the document elements of the input document image; a one-dimensional data string corresponding to the major white regions is assembled, at least one optimum path that matches the data string is generated, and the column layout of the input document image is identified based on the optimum path.
Abstract: A system for logically segmenting document elements from a document includes an input port for inputting a signal representing the document image, a computer having a document structural model, a document white region extraction system that extracts major white regions separating document elements in the input document image, a string translation device that generates a matching one-dimensional data string that corresponds to the extracted major white regions in a document image, a comparison device that selects the optimum path through a finite state machine representing acceptable column layouts for the source document, and a columnar layout identification device that identifies the column layout defined by the optimum path. Then, the identified column of document elements may be processed to logically tag or extract document elements. The method for logically segmenting document element columns includes providing at least one structural model of a corresponding source document, each structural model including at least one finite state machine defining relationships between document elements of the source document. The method further includes identifying major white regions in the input document image that segment and define the document elements of the document image, assembling a one-dimensional data string corresponding to the major white regions, generating at least one optimum path that matches the data string, and identifying the column layout of the input document image based on the optimum path.

Journal ArticleDOI
TL;DR: These methods are very efficient in extracting character/graphic pixels from the images with a variety of background types and are compared with Lloyd's method, the logical level technique, Otsu's method, dynamic thresholding, the nonlinear adaptive techniques and the integrated function technique.

Proceedings ArticleDOI
Yuan Yan Tang, Hong Ma, Xiaogang Mao, Dan Liu, Ching Y. Suen
14 Aug 1995
TL;DR: This paper presents a new approach to document analysis based on modified fractal signature that can divide a document into blocks in only one step and be used to process documents with high geometrical complexity.
Abstract: This paper presents a new approach to document analysis. The proposed approach is based on modified fractal signature. Instead of the time-consuming traditional approaches (top-down and bottom-up approaches) where iterative operations are necessary to break a document into blocks to extract its geometric (layout) structure, this new approach can divide a document into blocks in only one step. This approach can be used to process documents with high geometrical complexity. Experiments have been conducted to prove the proposed new approach for document processing.

Patent
11 Dec 1995
TL;DR: In this article, a document image processor reads data on the image of an original document and automatically sets a format for the document in a document processor such as a word processor or personal computer.
Abstract: A document image processor reads data on the image of an original document and automatically sets a format for the document in a document processor such as a word processor or personal computer. First, an image scanner reads a document as image data. A character recognizer recognizes and encodes a character from the read image data, and a document processor stores the encoded document data into a document memory. When the character recognizer encodes the image data, it analyzes the structure of the document. The document processor produces format information on the basis of the result of the analysis, and sets the produced format information in the document format memory in correspondence to document data in the document memory. Thus, the document image processor reads a document image from the document, encodes the image, and sets a format for a document conforming exactly to the document image in correspondence to the encoded document data.

Patent
Masaharu Ozaki
07 Jun 1995
TL;DR: In this article, a system for logically identifying document elements from a document includes an input port for inputting a signal representing the document image, a computer having a document structural model, a document white region extraction system, and a major white region selecting device and a column string selection device that generate a column string of document elements matching the extracted major white regions in a column.
Abstract: A system for logically identifying document elements from a document includes an input port for inputting a signal representing the document image, a computer having a document structural model, a document white region extraction system that extracts major white regions separating and within document elements in the input document image, a major white region selecting device and a column string selection device that generate a matching column string of document elements that match the extracted major white regions in a column, a column expression comparison device that selects the best matching column string, and a logical tagging device that logically tags and then extracts the document elements in the document image using the best matching column string. The method for logically identifying document elements includes providing at least one structural model of a corresponding source document, each structural model including at least one column expression defining relationships between document elements of the source document. The method further includes identifying major white regions in the input document image that segment and define the document elements of the document image, assembling a major white region pattern, and generating at least one column string that matches the major white region pattern for each column of the input document. The method then determines the column string that most closely matches the column expression and logically identifies each document element of the document image based on the closest matching column string.

Proceedings ArticleDOI
14 Aug 1995
TL;DR: This paper proposes an approach to the classification of logical document structures, according to their distance from predefined prototypes, thus relying minimally on the accuracy of OCR and decreasing language-dependence.
Abstract: Automatic derivation of logical document structure from generic layout would enable the development of many highly flexible electronic document manipulation tools. This problem can be divided into the segmentation of text into pieces and the classification of these pieces as particular logical structures. This paper proposes an approach to the classification of logical document structures, according to their distance from predefined prototypes. The prototypes consider linguistic information minimally, thus relying minimally on the accuracy of OCR and decreasing language-dependence. Different classes of logical structures and the differences in the requisite information for classifying them are discussed. A prototype format is proposed, existing prototypes and a distance measurement are described, and performance results are provided.
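
Classification by distance from predefined prototypes, as described in the abstract above, can be sketched as a nearest-prototype rule. The abstract does not give the prototype format or the distance measurement, so everything concrete here is a placeholder: the three features (relative font size, relative indentation, line count), the prototype values, and the use of Euclidean distance are all assumptions for illustration only.

```python
import math

# Hypothetical prototypes: (relative font size, relative indentation, line count).
# These values are invented for illustration and do not come from the paper.
PROTOTYPES = {
    "heading": (1.6, 0.0, 1.0),
    "paragraph": (1.0, 0.0, 6.0),
    "list_item": (1.0, 0.3, 2.0),
}

def classify(features, prototypes=PROTOTYPES):
    """Return the label of the prototype closest to `features`.

    Euclidean distance is used as a stand-in for the paper's distance measure.
    """
    return min(prototypes, key=lambda label: math.dist(features, prototypes[label]))

def ranked_labels(features, prototypes=PROTOTYPES):
    """Return (label, distance) pairs sorted from closest to farthest."""
    scored = [(label, math.dist(features, proto)) for label, proto in prototypes.items()]
    return sorted(scored, key=lambda item: item[1])
```

Because the features in this sketch are purely geometric and typographic, the rule does not depend on recognized characters, which matches the abstract's stated aim of relying minimally on OCR accuracy.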

Proceedings ArticleDOI
Philip A. Chou, Gary E. Kopec
30 Mar 1995
TL;DR: This paper generalizes the source and encoder models using context-free attribute grammars and employs these models in a document image decoder that uses a dynamic programming algorithm to minimize the probability of error between original and reconstructed structures.
Abstract: Document image decoding (DID) refers to the process of document recognition within a communication theory framework. In this framework, a logical document structure is a message communicated by encoding the structure as an ideal image, transmitting the ideal image through a noisy channel, and decoding the degraded image into a logical structure as close to the original message as possible, on average. Thus document image decoding is document image recognition where the recognizer performs optimal reconstruction by explicitly modeling the source of logical structures, the encoding procedure, and the channel noise. In previous work, we modeled the source and encoder using probabilistic finite-state automata and transducers. In this paper, we generalize the source and encoder models using context-free attribute grammars. We employ these models in a document image decoder that uses a dynamic programming algorithm to minimize the probability of error between original and reconstructed structures. The dynamic programming algorithm is a generalization of the Cocke-Younger-Kasami parsing algorithm.
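
For readers unfamiliar with the Cocke-Younger-Kasami algorithm named in the abstract above, the following Python sketch shows the plain textbook recognizer for a grammar in Chomsky normal form. It is not the authors' decoder, which generalizes CYK to attribute grammars and image data; the toy page grammar and all names below are hypothetical.

```python
def cyk_recognize(tokens, unary_rules, binary_rules, start):
    """Textbook CYK recognizer for a grammar in Chomsky normal form.

    unary_rules:  iterable of (lhs, terminal)
    binary_rules: iterable of (lhs, (left_nonterminal, right_nonterminal))
    Returns True if `start` derives the whole token sequence.
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][length] holds the nonterminals deriving tokens[i:i + length].
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][1] = {lhs for lhs, term in unary_rules if term == tok}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for lhs, (b, c) in binary_rules:
                    if b in left and c in right:
                        table[i][length].add(lhs)
    return start in table[0][n]

# Toy grammar (hypothetical): a page is a title followed by one or more paragraphs.
UNARY = [("TITLE", "title"), ("BODY", "para"), ("PARA", "para")]
BINARY = [("PAGE", ("TITLE", "BODY")), ("BODY", ("PARA", "BODY"))]

if __name__ == "__main__":
    print(cyk_recognize(["title", "para", "para"], UNARY, BINARY, "PAGE"))
    print(cyk_recognize(["para", "title"], UNARY, BINARY, "PAGE"))
```

The table is filled from short spans to long ones, so each cell depends only on cells already computed; that bottom-up ordering is the dynamic programming structure the abstract refers to.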

Patent
16 Oct 1995
TL;DR: In this article, a document editing apparatus edits document generic logical information in accordance with a set of restrictions, and a judging device is used to judge whether the document generic logical information meets the restrictions.
Abstract: A document editing apparatus edits document generic logical information in accordance with a set of restrictions. The document generic logical information defines a document generic logical structure. The restrictions are applied to the document generic logical information to ensure that any logical unit of a document having a logical structure consistent with the document generic logical structure can be extracted as a tree structure. The apparatus includes an editing device for editing the document generic logical information, and a judging device for judging whether the document generic logical information meets the restrictions.

Patent
Joji Ikeo1, Tsuyoshi Tanaka1
24 Aug 1995
TL;DR: In this paper, the authors present a document style design support system which enables a system user to easily prepare a format design for a document meeting his purpose and application while eliminating the need to have any editorial design knowledge.
Abstract: A document style design support system which enables a system user to easily prepare a format design for a document meeting his purpose and application while eliminating the need to have any editorial design knowledge and which also enables the system user to easily design a format for a document satisfying his intention and application even when the system user has no prior document style design experience. In the document style design support system, an inferring operation is carried out using document style design knowledge data, including document evaluation words, document attributes, and document style design elements, to decide an optimum style format. Further, an inferring operation is carried out using document style design knowledge data, including document evaluation words and document style formatting methods, to extract the optimum document style formatting method.

Proceedings ArticleDOI
23 Oct 1995
TL;DR: This paper introduces scoring methods developed to automatically assess the performance of document recognition systems, specifically, to evaluate the spatial correspondence of zones produced by a document segmentor.
Abstract: This paper introduces scoring methods developed to automatically assess the performance of document recognition systems, specifically, to evaluate the spatial correspondence of zones produced by a document segmentor. Two different approaches are discussed. The first approach (based on zone overlap and nearest-neighbors) is better applied to merged zones, whereas the second approach (based on zone alignments) is better applied to nested zones (such as those found in tables and graphs). Definitions of coverage and efficiency error are presented, and scoring results on real system output are provided that validate the usefulness of these methods for comparing different document recognition algorithms. Currently, no standard testing procedures exist for measuring and comparing algorithms within a complex document recognition system. Scoring methods, like the ones introduced in this paper, serve as design and validation tools, expediting the development and deployment of document analysis technology for system developers and end users.
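
A zone-overlap comparison of the kind the first approach builds on can be sketched as follows. The rectangle format `(x0, y0, x1, y1)` and the `best_overlap` measure are assumptions made for illustration; they are not the coverage or efficiency error definitions from the paper, which the abstract does not state.

```python
def area(rect):
    """Area of a rectangle given as (x0, y0, x1, y1) with x0 <= x1, y0 <= y1."""
    return (rect[2] - rect[0]) * (rect[3] - rect[1])


def intersection_area(a, b):
    """Area shared by rectangles a and b (0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def best_overlap(truth_zone, detected_zones):
    """Largest fraction of `truth_zone` covered by any single detected zone.

    Hypothetical measure: 1.0 means some detected zone covers the whole
    ground-truth zone, 0.0 means no detected zone touches it.
    """
    if not detected_zones or area(truth_zone) == 0:
        return 0.0
    best = max(intersection_area(truth_zone, d) for d in detected_zones)
    return best / area(truth_zone)
```

A low `best_overlap` for a ground-truth zone would suggest that the segmentor split or missed that zone, although a full scoring scheme would also need to penalize detected zones that spill over several ground-truth zones.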

01 Jan 1995
TL;DR: This thesis investigates document classification and information extraction in TEXPROS, a document processing system to support and assist office workers in their daily work in dealing with information and document management.
Abstract: TEXPROS (TEXt PROcessing System) is a document processing system (DPS) to support and assist office workers in their daily work in dealing with information and document management. In this thesis, document classification and information extraction, which are two of the major functional capabilities in TEXPROS, are investigated. Based on the nature of its content, a document is divided into structured and unstructured (i.e., free-text) parts. The conceptual and content structures are introduced to capture the semantics of the structured and unstructured parts of the document, respectively. The document is classified and information is extracted based on the analyses of conceptual and content structures. In our approach, the layout structure of a document is used to assist the analyses of the conceptual and content structures of the document. By nested segmentation of a document, the layout structure of the document is represented by an ordered labeled tree structure, called the Layout Structure Tree (L-S-Tree). A sample-based classification mechanism is adopted in our approach for classifying the documents. A set of pre-classified documents is stored in a document sample base in the form of sample trees. In the layout analysis, approximate tree matching is used to match the L-S-Tree of a document to be classified against the sample trees. The layout similarities between the document and the sample documents are evaluated based on the "edit distance" between the L-S-Tree of the document and the sample trees. The document samples whose layout structure is similar to that of the document are chosen for the conceptual analysis of the document. In the conceptual analysis of the document, based on the mapping between the document and document samples found during the layout analysis, the conceptual similarities between the document and the sample documents are evaluated based on the "conceptual closeness degree".
The document sample whose conceptual structure is similar to that of the document is chosen for extracting information. Extracting the information of the structured part of the document is based on the layout locations of key terms appearing in the document and on string pattern matching. Based on the information extracted from the structured part of the document, the type of the document is identified. In the content analysis of the document, the bottom-up and top-down analyses of the free text are combined to extract information from the unstructured part of the document. In the bottom-up analysis, the sentences of the free text are classified into those which are relevant or irrelevant to the extraction. The sentence classification is based on the semantic relationship between the phrases in the sentences and the attribute names in the corresponding content structure, determined by consulting the thesaurus. Then the thematic roles of the phrases in each relevant sentence are identified based on the syntactic analysis and heuristic thematic analysis. In the top-down analysis, the appropriate content structure is identified based on the document type identified in the conceptual analysis. Then the information is extracted from the unstructured part of the document by evaluating the restrictions specified in the corresponding content structure based on the result of the bottom-up analysis. The information extracted from the structured and unstructured parts of the document is stored in the form of a frame-like structure (frame instance) in the database for information retrieval in TEXPROS.
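
To make the notion of an edit distance between ordered labeled trees concrete, here is a minimal sketch of a simplified top-down variant, in which nodes may be relabeled and whole subtrees may be inserted or deleted. This is an assumption chosen for brevity and is not necessarily the approximate tree matching used in the thesis. Trees are written as `(label, [children])` tuples, and the unit costs are also assumed.

```python
def size(tree):
    """Number of nodes in a tree given as (label, [children])."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


def tree_distance(a, b):
    """Simplified edit distance between two ordered labeled trees.

    Relabeling a node costs 1; inserting or deleting a whole subtree costs
    its node count. The child sequences are aligned with the usual
    sequence-edit dynamic program.
    """
    relabel = 0 if a[0] == b[0] else 1
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    # d[i][j]: cost of turning the first i children of a into the first j of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(kids_a[i - 1]),  # delete a subtree
                d[i][j - 1] + size(kids_b[j - 1]),  # insert a subtree
                d[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    return relabel + d[m][n]
```

Under this sketch, a document tree would be compared against every sample tree and the samples with the smallest distances kept as layout-similar candidates.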

20 Nov 1995
TL;DR: A knowledge-based document logical structure derivation system (DeLoS) has been developed based on this model, which consists of a hierarchical rule-based control system that guides the block classification, block grouping and read-ordering operations, and a domain knowledge base that encodes the rules governing document layout.
Abstract: An important application of artificial intelligence is document image understanding, specifically the analysis of document images to derive a symbolic description of the document structure and contents. This requires the segmentation of the different blocks of printed matter using standard image processing techniques, and the use of spatial domain knowledge to first classify these blocks (e.g., text paragraphs, photographs, etc.), then group these blocks into logical units (e.g., newspaper stories, magazine articles, etc.), and finally determine the reading order of the text blocks within each logical unit. The above steps describe the problem of converting the physical structure of the document into its logical structure with the use of domain knowledge about document layout. The objective of this work is to develop a computational model for the derivation of the logical structure of documents using certain formalisms designed for this task. In this model, a simple yet powerful rule-based control strategy utilizes the data obtained from the invocation of different types of operations on a digitized document image, and makes inferences about the document using a knowledge base of document layout rules. The domain knowledge about document structure is represented in a unified multi-level hierarchical form, and is used by reasoning processes to make inferences. The main issues investigated in this research are: the kinds of domain and control knowledge that are required, how to represent this knowledge in a globally integrated form, and how to devise a control strategy that efficiently utilizes this knowledge. A knowledge-based document logical structure derivation system (DeLoS) has been developed based on this model. 
The system consists of a hierarchical rule-based control system that guides the block classification, block grouping and read-ordering operations; a global data structure that stores the document image data and incremental inferences; and a domain knowledge base that encodes the rules governing document layout. Applications of this approach include use in digital libraries for the retrieval of relevant logical document information, as well as in comprehensive document understanding systems that can read document text and allow interactive querying of the syntactic and semantic information in a document.
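
The classify-then-order pipeline described above can be sketched with a small ordered rule list and a naive reading-order heuristic. Everything in this example is hypothetical: the block fields (`kind`, `font_size`, `x`, `y`), the rules, the thresholds, and the column-first ordering are illustrative assumptions and do not reproduce the DeLoS knowledge base or control strategy.

```python
# Ordered rule list: the first rule whose predicate holds decides the label.
RULES = [
    ("photograph", lambda b: b["kind"] == "image"),
    ("headline", lambda b: b["kind"] == "text" and b["font_size"] >= 18),
    ("paragraph", lambda b: b["kind"] == "text"),
]


def classify_block(block):
    """Return the label of the first matching rule, or "unknown"."""
    for label, predicate in RULES:
        if predicate(block):
            return label
    return "unknown"


def reading_order(blocks):
    """Naive reading order: left-to-right by column, then top-to-bottom."""
    return sorted(blocks, key=lambda b: (b["x"], b["y"]))
```

A real system would replace the flat rule list with hierarchical knowledge and add a grouping step between classification and ordering, as the abstract describes.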