Showing papers on "Document layout analysis" published in 1997


Patent
14 Nov 1997
TL;DR: In this article, a document search system provides a user with a programming interface for dynamically specifying features of documents recorded in a corpus of documents, which is suitable for interactive user specification of layout components and structures of documents.
Abstract: A document search system provides a user with a programming interface for dynamically specifying features of documents recorded in a corpus of documents. The programming interface operates at a high level that is suitable for interactive user specification of layout components and structures of documents. In operation, a bitmap image of a document is analyzed by the document search system to identify layout objects such as text blocks or graphics. Subsequently, the document search system computes a set of attributes for each of the identified layout objects. The computed attributes are used to describe the layout structure of a page image of a document in terms of the spatial relations that layout objects have to frames of reference that are defined by other layout objects. After attributes have been computed for each layout object, a user can operate the programming interface to define unique document features. Each document feature is a routine defined by a sequence of selection operations which consume a first set of layout objects and produce a second set of layout objects. The second set of layout objects constitutes the feature in a page image of a document. Using the programming interface, a user flexibly defines a genre of document using the user-specified document features.

298 citations
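
The abstract above describes a document feature as a sequence of selection operations that consume one set of layout objects and produce another. A minimal sketch of that idea follows. The `LayoutObject` fields, the `select` and `define_feature` helpers, and the example `top_text` feature are all hypothetical names chosen for illustration; the patent's actual interface is not given in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutObject:
    """Hypothetical layout object: a typed bounding box on a page image."""
    kind: str      # e.g. "text" or "graphics"
    x: int
    y: int
    width: int
    height: int


def select(predicate):
    """Build a selection operation: layout objects in, filtered objects out."""
    return lambda objects: [o for o in objects if predicate(o)]


def define_feature(*operations):
    """Compose selection operations, applied in order, into one feature routine."""
    def feature(objects):
        for operation in operations:
            objects = operation(objects)
        return objects
    return feature


# Hypothetical feature: text blocks that end within the top 200 units of the page.
top_text = define_feature(
    select(lambda o: o.kind == "text"),
    select(lambda o: o.y + o.height <= 200),
)
```

Applying `top_text` to the list of layout objects found on a page returns the subset that constitutes the feature on that page; a genre could then be described, under the same assumptions, as a combination of such features.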


Proceedings ArticleDOI
18 Aug 1997
TL;DR: A new method is presented for adaptive document image binarization, where the page is considered as a collection of subcomponents such as text, background and picture, using document characteristics to determine (surface) attributes, often used in document segmentation.
Abstract: A new method is presented for adaptive document image binarization, where the page is considered as a collection of subcomponents such as text, background and picture. The problems caused by noise, illumination and many source type related degradations are addressed. The algorithm uses document characteristics to determine (surface) attributes, often used in document segmentation. Using characteristic analysis, two new algorithms are applied to determine a local threshold for each pixel. An algorithm based on soft decision control is used for thresholding the background and picture regions. An approach utilizing local mean and variance of gray values is applied to textual regions. Tests were performed with images including different types of document components and degradations. The results show that the method adapts and performs well in each case.

257 citations
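
The abstract above mentions that a local threshold for textual regions is derived from the local mean and variance of gray values. The sketch below illustrates that general idea only. The threshold formula `t = m * (1 + k * (s / r - 1))`, the parameters `k` and `r`, and the window size are assumptions made for this example; the abstract does not state them.

```python
from math import sqrt


def local_threshold_binarize(gray, window=3, k=0.5, r=128.0):
    """Binarize a grayscale image given as a list of rows of 0-255 values.

    For each pixel, the mean m and standard deviation s of its window give
    a threshold t = m * (1 + k * (s / r - 1)). The formula and the default
    values of k and r are assumptions, not taken from the abstract.
    Returns rows of 1 (darker than threshold) and 0 (otherwise).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            values = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            threshold = mean * (1 + k * (sqrt(variance) / r - 1))
            row.append(1 if gray[y][x] < threshold else 0)
        result.append(row)
    return result
```

With this form, a flat bright region gets a threshold well below its gray level and stays background, while a dark pixel surrounded by bright ones falls under the raised local threshold and is marked as ink.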


Patent
23 Jun 1997
TL;DR: In a document processing apparatus, a similar-character classifying element classifies characters in a document image into similar-character categories in advance and stores the classified categories together with their representative image features.
Abstract: The present invention provides a document processing apparatus, a document processing method, and a storage medium storing the method, with the aim of offering document filing in which documents can be registered at low computation cost and high speed, and retrieval can be performed with few omissions. In the document processing apparatus, a similar-character classifying element classifies characters in a document image into similar-character categories in advance and stores the classified categories together with their representative image features. When the document image is registered, a pseudo character recognizing element classifies each character in the text region into a character category, without identifying the character itself, using fewer image features than ordinary character recognition, and stores the resulting category strings together with the input image. In retrieval, a retrieval executing element converts each character in the retrieval keyword into its nearest category and returns the documents that contain the converted category string.

175 citations


Journal ArticleDOI
TL;DR: A new bottom-up method for document layout analysis that is based on Kruskal's algorithm and uses a special distance metric between the components to construct the physical page structure.
Abstract: This paper describes a new bottom-up method for document layout analysis. The algorithm was implemented in the CLIDE (Chemical Literature Data Extraction) system, but the method described here is suitable for a broader range of documents. It is based on Kruskal's algorithm and uses a special distance metric between the components to construct the physical page structure. The method has all the major advantages of bottom-up systems: independence from different text spacing and independence from different block alignments. The algorithm's computational complexity is reduced to linear by using heuristics and path compression.

158 citations
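
The abstract above builds the page structure with Kruskal's algorithm over a distance metric between components, using path compression. A minimal sketch of Kruskal-style merging with a union-find structure is shown below. The `box_gap` metric and the `max_gap` stopping threshold are assumptions standing in for the paper's special metric, and this all-pairs version is quadratic; it does not reproduce the heuristics that make the original linear.

```python
from itertools import combinations


def box_gap(a, b):
    """Gap between two boxes (x0, y0, x1, y1); 0 if they touch or overlap.

    An assumed stand-in for the paper's distance metric.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def find(parent, i):
    """Return the representative of i's set, shortening the path on the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving, a form of path compression
        i = parent[i]
    return i


def cluster_components(boxes, max_gap):
    """Process edges in increasing distance order, as Kruskal's algorithm does,
    joining components from different sets until an edge exceeds max_gap."""
    parent = list(range(len(boxes)))
    edges = sorted(
        (box_gap(boxes[i], boxes[j]), i, j)
        for i, j in combinations(range(len(boxes)), 2)
    )
    for distance, i, j in edges:
        if distance > max_gap:
            break
        root_i, root_j = find(parent, i), find(parent, j)
        if root_i != root_j:
            parent[root_i] = root_j
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())
```

Each returned group lists the indices of components merged into one block; stopping at `max_gap` yields a forest rather than a full spanning tree, which is what makes the result a segmentation.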


Patent
25 Feb 1997
TL;DR: In this article, a document condensation method and apparatus for producing a document synopsis are provided, in which automatic indexing techniques are used to analyze an input document to determine a list of words and phrases characteristic of the subject matter of the document.
Abstract: A document condensation method and apparatus for producing a document synopsis are provided, in which automatic indexing techniques are used to analyze an input document to determine a list of words and phrases characteristic of the subject matter of the document. Sections of the document are compared to the list of characteristic words and phrases to determine which sections of the document are most like the overall document in view of subject matter. A predetermined number of sections determined to be most similar to the overall document in content are provided as a condensed version of the whole document.

134 citations
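
The condensation scheme in the abstract above (derive characteristic terms, compare each section against them, keep the most similar sections) can be sketched as follows. The abstract does not say which indexing technique or similarity measure is used, so plain word frequency, the small `STOP_WORDS` set, and the overlap-fraction score are all assumptions for illustration.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on"}


def tokenize(text):
    """Lower-case the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def condense(sections, num_terms=5, num_sections=1):
    """Return the num_sections sections most similar to the whole document."""
    # Characteristic terms: the most frequent non-stop words overall.
    counts = Counter(w for section in sections for w in tokenize(section))
    terms = {word for word, _ in counts.most_common(num_terms)}

    def score(section):
        # Fraction of the section's words that are characteristic terms.
        words = tokenize(section)
        return sum(w in terms for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sections)), key=lambda i: score(sections[i]), reverse=True)
    # Emit the selected sections in their original document order.
    return [sections[i] for i in sorted(ranked[:num_sections])]
```

Returning the chosen sections in document order keeps the synopsis readable; `num_sections` plays the role of the predetermined number of sections mentioned in the abstract.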


Patent
14 Nov 1997
TL;DR: In this article, a document search system automatically segments document images into one or more layout objects and then computes a set of attributes for each of the identified layout objects, which are used to describe the layout structure of a page image of a document in terms of the spatial relations that layout objects have to frames of reference.
Abstract: A programming interface of a document search system enables a user to dynamically specify features of documents recorded in a corpus of documents. The programming interface provides category and format flexibility for defining different genres of documents. The document search system initially segments document images into one or more layout objects. Each layout object identifies a structural element in a document such as text blocks, graphics, or halftones. Subsequently, the document search system computes a set of attributes for each of the identified layout objects. The set of attributes is used to describe the layout structure of a page image of a document in terms of the spatial relations that layout objects have to frames of reference that are defined by other layout objects. Using the set of attributes, a user defines features of a document with the programming interface. After receiving a feature or attribute and a set of document images selected by a user, the system forms a set of image segments by identifying those layout objects in the set of document images that make up the selected feature or attribute. The system then sorts the set of image segments into meaningful groupings of objects which have similarities and/or recurring patterns. In operation, the system sorts images in the image domain based on segments (or portions) of a document image which have been automatically extracted by the system. As a result, searching becomes more efficient because it is performed on limited portions of a document. Subsequently, document images in the set of document images are ordered and displayed to a user in accordance with the meaningful groupings.

113 citations


Patent
14 Nov 1997
TL;DR: In this article, a programming interface of a document search system enables a user to dynamically specify features of documents recorded in a corpus of documents and provides category and format flexibility for defining different genres of documents.
Abstract: A programming interface of a document search system enables a user to dynamically specify features of documents recorded in a corpus of documents. The programming interface provides category and format flexibility for defining different genres of documents. The document search system initially segments document images into one or more layout objects. Each layout object identifies a structural element in a document such as text blocks, graphics, or halftones. Subsequently, the document search system computes a set of attributes for each of the identified layout objects. The set of attributes is used to describe the layout structure of a page image of a document in terms of the spatial relations that layout objects have to frames of reference that are defined by other layout objects. Using the set of attributes, a user defines features of a document with the programming interface. After receiving a feature or attribute and a set of document images selected by a user, the system forms a set of image segments by identifying those layout objects in the set of document images that make up the selected feature or attribute. The system then sorts the set of image segments into meaningful groupings of objects which have similarities and/or recurring patterns. Subsequently, document images in the set of document images are ordered and displayed to a user in accordance with the meaningful groupings.

89 citations


Patent
22 Dec 1997
TL;DR: In this article, a management information extraction apparatus learns the structure of the ruled lines of a document and the position of user-specified management information such as a title, etc., during a form learning process, and stores them in a layout dictionary.
Abstract: A management information extraction apparatus learns the structure of the ruled lines of a document and the position of user-specified management information such as a title, etc. during a form learning process, and stores them in a layout dictionary. During the operation, the structure of the ruled lines extracted from an image of an input document is matched with that of the document in the layout dictionary. Then, position information in the layout dictionary is referred to, and the management information is extracted from the input document.

84 citations


Patent
02 May 1997
TL;DR: In this article, the author of a document uses a menu-driven utility to define a first layout, which a program module known as a Wizard renders; after the author adds content and makes changes, the author can select a second layout, and a program module known as a Page Manager automatically applies the author's changes to the document rendered in the second layout.
Abstract: A publisher program for automatically changing the layout of content-filled desktop publishing documents. The publisher program allows the author of a document to use a menu-driven utility to define a first layout for the document. A program module known as a Wizard then renders a document in the first layout. The author makes changes to the document while the document is in the first layout, typically by adding content and making author-defined changes to the document rendered by the Wizard. The author may then return to the menu-driven utility to select a second layout for the document. The Wizard renders a document in the second layout and a program module known as a Page Manager automatically applies the author's changes to the document and renders a content-filled document in the second layout.

83 citations


Journal ArticleDOI
TL;DR: A system for segmenting and understanding text and mathematical expressions in a document can be divided into six stages: page segmentation and labeling, character segmentation, feature extraction, character recognition, expression formation, and error correction and expression extraction.

66 citations


Patent
14 Nov 1997
TL;DR: In this article, a document image is segmented into layout objects, and the system computes attributes and features for each segmented layout object before any document images are transmitted between a client and a server.
Abstract: In a document search and retrieval system, document images are segmented into layout objects. Each layout object identifies different structural elements in a document image. In addition, the system computes attributes and features for each segmented layout object. Before any document images are transmitted between a client and a server, users specify which document image attributes and features are most relevant to their browsing or searching tasks. Transmission (and/or display) of document images is then divided into two stages. During the first stage, those layout objects which are identified as having the specified features or attributes are transmitted at a first or high resolution; the remaining layout objects in an image are transmitted at a second or lower resolution (or in the form of bounding polygons). If the second stage is invoked, those remaining layout objects are re-transmitted at the first or high resolution. The second stage of transmission may be invoked when either a user request is received or when there is a system timeout.

Proceedings ArticleDOI
18 Aug 1997
TL;DR: A new approach for document logical structure analysis to convert document images and contents information into an electronic document by analyzing consecutive pages of a portion of the book.
Abstract: Numerous studies have so far been carried out extensively for the analysis of document image structure, with particular emphasis placed on media conversion and layout analysis. For the conversion of a collection of books in a library into the form of hypertext documents, a logical structure extraction technology is indispensable, in addition to document layout analysis. The table of contents of a book generally involves very concise and faithful information to represent the logical structure of the entire book. That is to say, we can efficiently analyze the logical structure of a book by making full use of its contents pages. This paper proposes a new approach for document logical structure analysis to convert document images and contents information into an electronic document. First, the contents pages of a book are analyzed to acquire the overall document logical structure. Thereafter, we are able to use this information to acquire the logical structure of all the pages of the book by analyzing consecutive pages of a portion of the book. Test results demonstrate very high discrimination rates: up to 97.6% for the headline structure, 99.4% for the text structure, 97.8% for the page-number structure and almost 100% for the head-foot structure.

Journal ArticleDOI
TL;DR: A modular software system, which classifies a large variety of office documents according to layout form and textual content, consists of the following components: layout analysis, pre-classification, OCR interface, fuzzy string matching, text categorization, lexical, syntactical and semantic analysis.

Proceedings ArticleDOI
TL;DR: In this article, a performance evaluation protocol for the layout analysis of document images is discussed, which is intended to serve as a model for using the UW-III database to evaluate the document analysis algorithms.
Abstract: A performance evaluation protocol for the layout analysis is discussed in this paper. In the University of Washington English Document Image Database-III, there are 1600 English document images that come with manually edited ground truth of entity bounding boxes. These bounding boxes enclose text and non-text zones, text-lines, and words. We describe a performance metric for the comparison of the detected entities and the ground truth in terms of their bounding boxes. The Document Attribute Format Specification is used as the standard data representation. The protocol is intended to serve as a model for using the UW-III database to evaluate the document analysis algorithms. A set of layout analysis algorithms which detect different entities have been tested based on the data set and the performance metric. The evaluation results are presented in this paper.
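
The abstract above compares detected entities with ground truth in terms of their bounding boxes. One common way to do such a comparison is sketched below: boxes are matched greedily by intersection-over-union, and precision and recall are reported. The IoU criterion, the `min_iou` threshold, and the greedy one-to-one matching are assumptions for illustration; the abstract does not specify the protocol's actual metric.

```python
def area(box):
    """Area of a box (x0, y0, x1, y1); degenerate boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def iou(a, b):
    """Intersection over union of two boxes."""
    intersection = area(
        (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    )
    union = area(a) + area(b) - intersection
    return intersection / union if union else 0.0


def evaluate(detected, truth, min_iou=0.5):
    """Greedy one-to-one matching of detections to ground truth.

    Returns (precision, recall). The matching rule is an assumption.
    """
    unmatched = list(truth)
    hits = 0
    for d in detected:
        best = max(unmatched, key=lambda t: iou(d, t), default=None)
        if best is not None and iou(d, best) >= min_iou:
            unmatched.remove(best)
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall
```

The same function could be applied separately to zones, text lines, and words, since each entity type in the database comes with its own set of ground-truth boxes.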

Journal ArticleDOI
TL;DR: The proposed segmentation method belongs to the bottom-up category, is more robust than other techniques, and can identify text regions in difficult cases such as skewed documents, non-rectangular text regions, or text included in drawings or halftone regions.

Proceedings ArticleDOI
Francine Chen, D.S. Bloomberg
18 Aug 1997
TL;DR: A system for selecting sentences from an imaged document for presentation as part of a document summary is presented, and evaluation against a set of abstracts created by a professional abstracting company is given.
Abstract: A system for selecting sentences from an imaged document for presentation as part of a document summary is presented. The extracts are identified without the use of optical character recognition. The sentences are selected based on a set of discrete features characterizing the words within a sentence and the location of the sentence within the imaged document. Each sentence is scored based on the values of the discrete features using a statistically based classifier. The imaged document is processed to identify the word locations, the reading order of words, and the location of sentence and paragraph boundaries in the text. The words are grouped into equivalence classes to mimic the terms in a text document. A sample extract for a technical document is shown, and evaluation against a set of abstracts created by a professional abstracting company is given. These results are compared with text-based abstracts.

Patent
29 Jul 1997
TL;DR: In this paper, a preference document vector reflecting the user's preference (such as the user's interest, degree of attention, or purpose) is obtained based on the frequency of appearance of the processing important phrases included in the user's processing document.
Abstract: PROBLEM TO BE SOLVED: To perform the document processing in response to the user's preference by shifting the document vector acquired via a document vector acquisition means to a preference vector. SOLUTION: A document (object document) which is the object of acquiring a preference document vector is acquired and stored in a RAM 113. A CPU 111 extracts the processing important phrases of a matrix from the object document and then decides the importance based on the frequency of appearance, the evaluation function, etc., included in the object document that is contained in a processing important document. Then a reference document vector is shifted and a preference document vector including the user's preference is obtained based on the frequency of appearance of the processing important phrases included in the user's processing document. The similarity of another document to the preference document vector is calculated as an index of the user's preference, such as the interest, the degree of attention, the purpose, etc., of the user. Thereby, it is possible to sort and retrieve the documents in response to the user's preference by sorting and retrieving the documents based on the similarity.

Proceedings ArticleDOI
18 Aug 1997
TL;DR: A taxonomy of functional document components is defined and it is shown how functional descriptions can be used to reverse-engineer the intentions of the author, to navigate in document space, and to provide important contextual information to aid in interpretation.
Abstract: The purpose of a document is to facilitate the transfer of information from its author to its readers. It is the author's job to design the document so that the information it contains can be interpreted accurately and efficiently. To do this, the author can make use of a set of stylistic tools. In this paper, we introduce the concept of document functionality, which attempts to describe the roles of documents and their components in the process of transferring information. A functional description of a document provides insight into the type of the document, into its intended uses, and into strategies for automatic document interpretation and retrieval. To demonstrate these ideas, we define a taxonomy of functional document components and show how functional descriptions can be used to reverse-engineer the intentions of the author, to navigate in document space, and to provide important contextual information to aid in interpretation.

01 Jan 1997
TL;DR: This paper presents the layout strategy developed for GraphVisualizer3D, which combines manual layout techniques and automatic algorithms in a synergistic manner, and a grid system is provided that can be nested to any arbitrary depth.
Abstract: There is increasing evidence that 3D visualization of complex structures has advantages over 2D visualization. While nested directed graphs are an important method of representing information in 2D or 3D, they must be effectively organized in order to be understood. Most work on graph layout has assumed that fully automatic layout is desirable. Through our work with graphs representing large software structures, we have found that, due to the importance of the semantic content, it is necessary to combine automatic layout with manual layout. This paper describes a system called GraphVisualizer3D, which was designed to help people understand large nested graph structures by displaying them in 3D. This system is currently being applied to the problem of understanding large bodies of software. In this paper we present the layout strategy developed for GraphVisualizer3D, which combines manual layout techniques and automatic algorithms in a synergistic manner. In order to facilitate manual layout, a grid system is provided that can be nested to any arbitrary depth. The automatic layout is accomplished by layering followed by a node migration algorithm, whereby nodes migrate to their final position under the influence of a variety of different forces. Options are provided to allow users to switch back and forth between manual layout and automatic layout. GV3D has been tested with large examples containing more than 35,000 nodes and 40,000 relationships.

Journal ArticleDOI
TL;DR: This paper describes a generic document segmentation and geometric relation labeling method with applications to Chinese document analysis that begins with a hierarchy of partitioned image layers where inhomogeneous higher-level regions are recursively partitioned into lower-level rectangular subregions.

Proceedings ArticleDOI
TL;DR: A 'document browser' application is being developed that allows a user to interactively specify queries on the documents in the digital library using a graphical user interface, provides feedback about the candidate documents at each stage of the retrieval process, and allows refinements of the query based on the intermediate results of the search.
Abstract: This paper describes an approach to retrieving information from document images stored in a digital library by means of knowledge-based layout analysis and logical structure derivation techniques. Queries on document image content are categorized in terms of the type of information that is desired, and are parsed to determine the type of document from which information is desired, the syntactic level of the information desired, and the level of analysis required to extract the information. Using these clauses in the query, a set of salient documents are retrieved, layout analysis and logical structure derivation are performed on the retrieved documents, and the documents are then analyzed in detail to extract the relevant logical components. A 'document browser' application, being developed based on this approach, allows a user to interactively specify queries on the documents in the digital library using a graphical user interface, provides feedback about the candidate documents at each stage of the retrieval process, and allows refinements of the query based on the intermediate results of the search. Results of a query are displayed either as an image or as formatted text.

Proceedings ArticleDOI
18 Aug 1997
TL;DR: The idea is to generate the layout knowledge of business cards from a predefined logical structure, which is used as a kind of meta-knowledge to interpretatively generate the layout knowledge of given business cards.
Abstract: Document knowledge plays a very important role in many document image understanding methods. In these methods, the document knowledge is utilized to classify/extract individual item data interpretatively from paper-based sheets as a kind of document model: these days, this knowledge is specified into the document image understanding system in advance. In this paper, we propose an experimental method to acquire the layout knowledge automatically from sample document images. In particular, we focus on the acquisition of the layout of business cards. Our idea is to generate the layout knowledge of business cards from a predefined logical structure, which is used as a kind of meta-knowledge to interpretatively generate the layout knowledge of given business cards.

Patent
06 Nov 1997
TL;DR: In this article, a digital compound machine which can specify document image data for a specific purpose by using a document ID or the document ID mark corresponding to the document ID is equipped with a property display means which displays a table on a property storage means 6 where property information of the document image data corresponding to the document ID is registered and the contents registered in the property table when the document image data is specified.
Abstract: PROBLEM TO BE SOLVED: To obtain a digital compound machine which can store and take document image data out by accurately specifying a document image file. SOLUTION: This digital compound machine which can specify document image data for a specific purpose by using a document ID or the document ID mark corresponding to the document ID is equipped with a property display means which displays a table on a property storage means 6 where property information of the document image data corresponding to the document ID is registered and the contents registered in the property table when the document image data is specified. Not only is a cover document with a document ID mark specified for a document image file, but property information can also be confirmed with the property table when needed, so the document image file can be specified accurately. COPYRIGHT: (C)1999,JPO

Proceedings ArticleDOI
18 Aug 1997
TL;DR: A new document model which preserves top-down generation information is proposed based on which a document is logically represented for interactive editing, storage, retrieval, transfer and logical analysis.
Abstract: Transforming a paper document to its electronic version in a form suitable for efficient storage, retrieval and interpretation continues to be a challenging problem. An efficient document model is necessary to solve this problem. Document modeling involves techniques of thresholding, skew detection, geometric layout analysis and logical layout analysis. The derived model can then be used in document storage and retrieval. We use the traditional bottom-up approach based on the connected component extraction to efficiently implement page segmentation and region identification. A new document model which preserves top-down generation information is proposed based on which a document is logically represented for interactive editing, storage, retrieval, transfer and logical analysis.

Journal ArticleDOI
TL;DR: The experimental results demonstrate that many office documents can be classified correctly using the proposed approach.

Proceedings ArticleDOI
18 Aug 1997
TL;DR: In this article, a pattern matcher that incorporates geometric positions detects phrases in a scanned document, and a Levenshtein distance makes the matching tolerant of OCR errors, so that predefined information can be detected and transformed into a semantic representation.
Abstract: Document analysis and understanding (DAU) systems aim not only at the recognition of text and document structures but also at the extraction of relevant information out of a scanned document. Depending on the class of a document, information to be extracted may be defined in advance in syntactic structures as well as in semantic structures. In this paper we present a system for detecting such information and transforming it into a semantic representation. The basic component is a pattern matcher which incorporates geometric positions to detect phrases in the document. By defining a Levenshtein distance, the component matches more leniently and is therefore tolerant of OCR errors.
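
The abstract above relies on a Levenshtein distance so that phrase matching tolerates OCR errors. The distance itself is standard and is sketched below with a two-row dynamic program. The `find_phrase` helper and its `max_distance` threshold are hypothetical, added only to show how such a distance might be used on OCR output; the system's real matcher also uses geometric positions, which this sketch omits.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def find_phrase(words, phrase, max_distance=1):
    """Hypothetical helper: indices of OCR words within max_distance edits
    of the target phrase, compared case-insensitively."""
    target = phrase.lower()
    return [
        i for i, word in enumerate(words)
        if levenshtein(word.lower(), target) <= max_distance
    ]
```

With `max_distance=1`, a word such as "Inv0ice" (a digit zero misread for the letter o) still matches "invoice", while unrelated words are rejected.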

Proceedings ArticleDOI
Y. Ishitani
18 Aug 1997
TL;DR: A new method of document layout analysis, adaptable to various layout structures, is proposed for a document reader intended to read a wide variety of documents.
Abstract: A new method of document layout analysis is proposed for a document reader, to be used for reading a wide variety of documents. Emergent computation, which is a key concept of artificial life, is adopted to analyze various complex document structures. The proposed method uses a multilayer architecture consisting of four subsystems: region extraction, region analysis, region recognition, and region modification. Emergent computation is used for the interactions between subsystems to produce effective and flexible behavior of the entire system. The global layout structure of a document is extracted from these interactions. Experimental results obtained for 150 documents show the method is adaptable to various layout structures in documents.


Proceedings ArticleDOI
18 Aug 1997
TL;DR: A novel local skew estimation method is presented that takes advantage of the information available after flexible and efficient page segmentation and classification methods have been applied to the document image.
Abstract: Almost all document analysis approaches need to perform a global analysis of the page orientation as a separate process at an early stage. It would be preferable to estimate the orientation locally after page segmentation and classification, when more knowledge about the different regions is available. A novel local skew estimation method is presented that takes advantage of the information available after flexible and efficient page segmentation and classification methods have been applied to the document image. The proposed method accurately estimates the orientation of individual text regions by efficiently analysing the arrangement of background space contained in them. No assumption is made about the existence of a uniform or dominant orientation in the document. The whole process is very efficient, as only the regions of text are considered and the points used for the angle estimation are already available as by-products of previous document analysis stages.

Proceedings ArticleDOI
06 Oct 1997
TL;DR: A new approach to automate document image layout extraction for an object-oriented database feature population using rapid low level feature analysis, preclassification and predictive coding is proposed.
Abstract: We propose a new approach to automate document image layout extraction for an object-oriented database feature population using rapid low level feature analysis, preclassification and predictive coding. The layout information comprised of region location and classification data is transformed into 'feature object(s)'. The information is then fed into an intelligent document image retrieval system (IDIR) to be utilized in document retrieval schemes. The IDIR system consists of user interface, object-oriented database and a variety of document image analysis algorithms. In this paper the object-oriented storage model and the database system are presented in formal and functional domains. Moreover, the graphical user interface and a visual document image browser are described. The document analysis techniques used at document characterization are also presented. In this context the documents consist of text, picture and other media (possibly embedded) data. Documents are stored in the database as document, page and region objects. Our test system has been implemented and tested using a document database of 10 000 documents. Keywords: Document layout analysis, predictive coding, document database, retrieval, document content characterization, object-oriented database.