Showing papers on "Document layout analysis" published in 2002


Patent
30 May 2002
TL;DR: In this paper, a system and method for improving document layout on arbitrary devices of different resolutions and size using manifold representations of content is presented, where multiple versions of anything that might appear in a document, from text, to images, to even such things as stylistic conventions are selected and formatted dynamically, on the fly, by a layout engine.
Abstract: A system and method for improving document layout on arbitrary devices of different resolutions and size using manifold representations of content. Manifold representations of content are: multiple versions of anything that might appear in a document, from text, to images, to even such things as stylistic conventions. The specific content is selected and formatted dynamically, on the fly, by a layout engine in order to best adapt to a given viewing situation. A user interface for authoring and editing such manifold content is disclosed.

176 citations


Journal ArticleDOI
TL;DR: This paper briefly describes various components of a document analysis system and provides the background necessary to understand the detailed descriptions of specific techniques presented in other papers in this issue.
Abstract: Document image analysis refers to algorithms and techniques that are applied to images of documents to obtain a computer-readable description from pixel data. A well-known document image analysis product is the Optical Character Recognition (OCR) software that recognizes characters in a scanned document. OCR makes it possible for the user to edit or search the document’s contents. In this paper we briefly describe various components of a document analysis system. Many of these basic building blocks are found in most document analysis systems, irrespective of the particular domain or language to which they are applied. We hope that this paper will help the reader by providing the background necessary to understand the detailed descriptions of specific techniques presented in other papers in this issue.

143 citations


Patent
31 Jul 2002
TL;DR: In this article, a system and method for summarizing the contents of a natural language document provided in electronic or digital form includes preformatting the document, performing linguistic analysis, weighting each sentence in the document as a function of quantitative importance, and generating one or more document summaries, from a plurality of selectable document summary types, as a function of the sentence weights.
Abstract: A system and method for summarizing the contents of a natural language document provided in electronic or digital form includes preformatting the document, performing linguistic analysis, weighting each sentence in the document as a function of quantitative importance, and generating one or more document summaries, from a plurality of selectable document summary types, as a function of the sentence weights.

120 citations
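
The sentence-weighting step in the summarization entry above can be illustrated with a minimal sketch. The patent abstract does not say how "quantitative importance" is computed, so the average word frequency used here is an assumption chosen for illustration, and the names `summarize` and `max_sentences` are hypothetical.

```python
import re
from collections import Counter


def summarize(text, max_sentences=2):
    """Return the highest-weighted sentences, kept in document order.

    Assumption: a sentence's weight is the mean document-wide frequency
    of its words. The source abstract does not specify the weighting.
    """
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def weight(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by weight, then restore original order.
    ranked = sorted(range(len(sentences)), key=lambda i: weight(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]
```

Different summary types could be produced by swapping the weighting function or changing `max_sentences`; that mapping is likewise an assumption, not something the abstract states.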


Patent
19 Dec 2002
TL;DR: In this paper, a set of features of the document is generated using information contained in the assist channel appended to the document, and is then compared to the set of features appended to the document.
Abstract: Document authentication is accomplished by acquiring document image data, generating a set of features of the document, and generating an assist channel that includes information on how to generate the set of features. The set of features and the assist channel are digitally signed and then appended to the document. Document verification is accomplished by acquiring document image data and verifying the signature. If the signature is valid, a set of features of the document is generated using information contained in the assist channel appended to the document. The generated set of features is then compared to the set of features appended to the document. If the sets do not match, the document is determined to have been altered sometime after the assist channel was appended to the document, i.e., the document is not genuine. Otherwise, the document can be considered to be genuine.

102 citations
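
The sign-then-verify flow in the authentication entry above can be sketched as follows. Everything concrete here is an assumption: the features are block-wise mean intensities, the assist channel is just a block size, and an HMAC stands in for the digital signature (a real system would more likely use an asymmetric signature). The function names are hypothetical.

```python
import hashlib
import hmac
import json


def extract_features(pixels, assist):
    """Mean intensity per block; `assist` tells the verifier how to cut blocks."""
    size = assist["block_size"]
    feats = []
    for r in range(0, len(pixels), size):
        for c in range(0, len(pixels[0]), size):
            block = [v for row in pixels[r:r + size] for v in row[c:c + size]]
            feats.append(round(sum(block) / len(block)))
    return feats


def sign(pixels, assist, key):
    """Bundle features and assist channel, then tag the bundle (HMAC stand-in)."""
    record = {"features": extract_features(pixels, assist), "assist": assist}
    payload = json.dumps(record, sort_keys=True).encode()
    return payload, hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(pixels, payload, tag, key):
    """Check the tag first, then regenerate features and compare."""
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag):
        return False
    record = json.loads(payload)
    return extract_features(pixels, record["assist"]) == record["features"]
```

The order of checks mirrors the abstract: the signature is validated before the assist channel is trusted to regenerate the features.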


Journal ArticleDOI
TL;DR: Testing with seven corpora of imaged textual documents in English and Chinese as well as images from the UW1 (University of Washington 1) database confirms the validity of the proposed method for text retrieval from document images without the use of OCR.
Abstract: We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely the vertical traverse density (VTD) and horizontal traverse density (HTD), are extracted. An n-gram-based document vector is constructed for each document based on these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese as well as images from the UW1 (University of Washington 1) database confirms the validity of the proposed method.

98 citations
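
The n-gram document vector and dot-product similarity described in the OCR-free retrieval entry above can be sketched as below. The sketch assumes each character object has already been mapped to a discrete code from its VTD/HTD features (that mapping is not shown and is not specified in the abstract), and the L2 normalisation is an added assumption so that scores fall between 0 and 1.

```python
from collections import Counter
from math import sqrt


def ngram_vector(codes, n=2):
    """Sparse, L2-normalised n-gram count vector over character-object codes."""
    grams = Counter(tuple(codes[i:i + n]) for i in range(len(codes) - n + 1))
    norm = sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()} if norm else {}


def similarity(vec_a, vec_b):
    """Dot product of two sparse document vectors."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a  # iterate over the smaller vector
    return sum(w * vec_b.get(g, 0.0) for g, w in vec_a.items())
```

Because only code sequences are compared, no character identities are needed, which is consistent with the abstract's point of avoiding OCR.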


Patent
Brian John Cragun1, Paul Reuben Day1
29 Oct 2002
TL;DR: In this paper, an apparatus and method helps a user to determine parts of an electronic document that are of interest by allowing the user to define preferences for processing electronic documents, and by automatically highlighting one or more portions of the document according to the user preferences.
Abstract: An apparatus and method helps a user to determine parts of an electronic document that are of interest by allowing the user to define preferences for processing an electronic document, and by automatically highlighting one or more portions of the document according to the user preferences. Highlighting includes any way to enhance or alter the appearance of text, including bold, italics, underlining, change in font style, change in font size, change in color, change in background color, etc. The automatic highlighting of portions of the document attracts the user's eyes to that portion of the document, which helps the user to discern whether or not the highlighted portion is relevant or interesting. The preferred embodiments also include a document generator that takes an input document and generates therefrom an output document that has one or more highlighted portions that are hard-coded into the document according to the user preferences.

79 citations


Patent
13 Nov 2002
TL;DR: In this paper, a document processing system for use in identifying a segmented document includes a data store of layout graph models that are classified and/or labeled, and a matching module makes a determination of a match between a layout graph sample for the segmented document and a particular layout graph model.
Abstract: A document processing system for use in identifying a segmented document includes a data store of layout graph models that are classified and/or labeled. A matching module makes a determination of a match between a layout graph sample for the segmented document and a particular layout graph model. The matching module uses a correlator to generate an identified, segmented document that is classified and/or labeled based on the segmented document, the layout graph model, and the determination of a match.

76 citations


Book ChapterDOI
TL;DR: The proposed approach uses the texture of a document image so as to infer the document structure distortion and a two-pass image warping algorithm is then used to correct the images.
Abstract: Document analysis and graphics recognition algorithms are normally applied to the processing of images of 2D documents scanned when flattened against a planar surface. Technological advancements in recent years have led to a situation in which digital cameras with high resolution are widely available. Consequently, traditional graphics recognition tasks may be updated to accommodate document images captured through a hand-held camera in an uncontrolled environment. In this paper the problem of perspective and geometric deformations correction in document images is discussed. The proposed approach uses the texture of a document image so as to infer the document structure distortion. A two-pass image warping algorithm is then used to correct the images. In addition to being language independent, the proposed approach may handle document images that include multiple fonts, math notations, and graphics. The de-warped images contain fewer distortions and so are better suited for existing text/graphics recognition techniques.

55 citations


Journal ArticleDOI
TL;DR: A structured link vector model (SLVM) is presented in this paper, where a vector represents a document, and vectors' elements are determined by terms, document structure and neighboring documents.
Abstract: A semi-structured document has more structured information compared to an ordinary document, and the relation among semi-structured documents can be fully utilized. In order to take advantage of the structure and link information in a semi-structured document for better mining, a structured link vector model (SLVM) is presented in this paper, where a vector represents a document, and vectors' elements are determined by terms, document structure and neighboring documents. Text mining based on SLVM is described in the procedure of K-means for briefness and clarity: calculating document similarity and calculating cluster center. The clustering based on SLVM performs significantly better than that based on a conventional vector space model in the experiments, and its F value increases from 0.65-0.73 to 0.82-0.86.

49 citations


Patent
Lisa S. Purvis1
23 Jul 2002
TL;DR: In this paper, a system and method specify a custom document as a constraint satisfaction problem and create the specified document using existing constraint solving algorithms, wherein the document, its content components, and its layout requirements are modeled as elements of a constraint satisfaction problem which, when solved, results in an automated document layout for the set of content components.
Abstract: A system and method specify a custom document as a constraint satisfaction problem to create the specified document using existing constraint solving algorithms, wherein the document, its content components, and its layout requirements are modeled as elements of a constraint satisfaction problem which, when solved, results in an automated document layout for the set of content components. The system and method enable an automated custom document creation process, providing a wider array of output documents.

42 citations
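
Treating layout as a constraint satisfaction problem, as in the entry above, can be illustrated with a toy sketch. Variables are content components, domains are page slots, and constraints are predicates over a candidate assignment. The exhaustive search below is a stand-in for the "existing constraint solving algorithms" the abstract mentions, and all names (`solve_layout`, the slot and component names) are hypothetical.

```python
from itertools import permutations


def solve_layout(components, slots, constraints):
    """Assign each component a distinct slot so that every constraint holds.

    Brute-force enumeration; a real solver would prune with backtracking
    or constraint propagation. Returns None if no layout satisfies all
    constraints.
    """
    for order in permutations(slots, len(components)):
        assignment = dict(zip(components, order))
        if all(check(assignment) for check in constraints):
            return assignment
    return None


# Hypothetical example data: slot widths and minimum component widths.
SLOT_WIDTH = {"top": 600, "left": 200, "right": 400}
MIN_WIDTH = {"title": 500, "image": 150, "body": 350}

CONSTRAINTS = [
    # Every component must fit in the slot it is assigned to.
    lambda a: all(SLOT_WIDTH[s] >= MIN_WIDTH[c] for c, s in a.items()),
    # The title must sit at the top of the page.
    lambda a: a["title"] == "top",
]
```

Calling `solve_layout(list(MIN_WIDTH), list(SLOT_WIDTH), CONSTRAINTS)` returns one satisfying assignment of components to slots, which plays the role of the automated layout.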


Patent
27 Nov 2002
TL;DR: In this paper, a plurality of document definition information for identifying documents, and format control information for recognizing a character recorded on a document corresponding to each of the plurality of definition information are held beforehand.
Abstract: A plurality of document definition information for identifying documents, and format control information for recognizing a character recorded on a document corresponding to each of the plurality of document definition information are held beforehand, documents targeted for character recognition are identified as specific documents based on document images of the entered documents targeted for character recognition and the document definition information and, based on a result of the identification, character recognition is executed by using corresponding format control information. A document definition device adds a plane area of each of documents to be identified to the document definition. An OCR device checks the plane area on the document by using the document definition before check of a preprint accompanied by character recognition.

Book ChapterDOI
19 Aug 2002
TL;DR: This system is able to learn a model for a document class, use this model to label document images through graph matching, and adaptively improve the model with error feedback.
Abstract: Logical structure analysis of document images is an important problem in document image understanding. In this paper, we propose a graph matching approach to label logical components on a document page. Our system is able to learn a model for a document class, use this model to label document images through graph matching, and adaptively improve the model with error feedback. We tested our method on journal/proceeding article title pages. The experimental results show promising accuracy, and confirm the ability of adaptive learning.

Patent
15 Feb 2002
TL;DR: In this paper, a digital document browsing system includes a layout engine for determining the layout of a digital document based on previously obtained historical data for a display form of the digital document, a summarization engine for preparing a summary for the sentences of the digital document, and a view generator for arranging the summary obtained by the summarization engine in accordance with the layout.
Abstract: A digital document browsing system includes: a layout engine for determining the layout of a digital document based on previously obtained historical data for a display form of the digital document, a summarization engine for preparing a summary for the sentences of the digital document based on the historical data for the digital document. Further included is a view generator for arranging the summary obtained by the summarization engine in accordance with the layout, and for generating data relating to the display form of the digital document. A user interface for displaying the digital document on a display device based on the data related to the display form is still further included.

Journal ArticleDOI
Jie Xi1, Jianming Hu1, Lide Wu1
TL;DR: This paper describes a new bottom-up method for page segmentation of Chinese document images based on run-length smoothing algorithm and minimal spanning tree clustering that can resolve the problems of segmenting Chinese documents that differ from English documents.
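
The run-length smoothing step named in the entry above can be sketched in its classic form: background runs no longer than a threshold are filled when they lie between foreground pixels, horizontally and vertically, and the two results are combined with a logical AND. This is a generic sketch of run-length smoothing only, not the paper's method; the minimal spanning tree clustering is not shown, and leaving runs that touch the image border unfilled is an assumption.

```python
def smooth_row(row, threshold):
    """Fill runs of 0s of length <= threshold that are bounded by 1s on both sides."""
    out = list(row)
    run_start = None
    for i, v in enumerate(row):
        if v == 0:
            if run_start is None:
                run_start = i
        else:
            # A run starting at index 0 has no foreground pixel to its left.
            if run_start is not None and run_start > 0 and i - run_start <= threshold:
                for j in range(run_start, i):
                    out[j] = 1
            run_start = None
    return out


def rlsa(image, h_threshold, v_threshold):
    """Horizontal and vertical smoothing of a binary image, combined with AND."""
    horiz = [smooth_row(r, h_threshold) for r in image]
    cols = [smooth_row(list(c), v_threshold) for c in zip(*image)]
    vert = [list(r) for r in zip(*cols)]
    return [[h & v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]
```

Connected regions in the smoothed image can then serve as candidate blocks for a later grouping stage.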

Proceedings ArticleDOI
11 Aug 2002
TL;DR: The use of the Bayesian network model is discussed for a classification problem related to the document image understanding field, focused on logical labeling in documents, which consists in assigning logical labels to text blocks.
Abstract: This paper discusses the use of the Bayesian network model for a classification problem related to the document image understanding field. Our application is focused on logical labeling in documents, which consists in assigning logical labels to text blocks. The objective is to map a set of logical tags, composing the document logical structure, to the physical text components. We build a Bayesian network model that allows this mapping using supervised learning, and without imposing a priori constraints on the document structure. The learning strategy is based partly on genetic programming tools. A prototype has been implemented, and tested on tables of contents found in periodicals and magazines.

Journal ArticleDOI
TL;DR: A new method for information extraction from document images is proposed in this paper as the basis for a document reader which can extract the required keywords and their logical relationship from various printed documents and is robust and effective for various document structures.
Abstract: A new method for information extraction from document images is proposed in this paper as the basis for a document reader which can extract the required keywords and their logical relationship from various printed documents. The proposed method consists of robust keyword matching, global document matching, and postprocessing for matching errors. First, robust keyword matching between two-dimensional OCR results consisting of a set of possible character candidate lists and a set of keywords defined in the keyword dictionary is carried out. This keyword dictionary includes incorrect words with typical OCR errors and segments of words in order to deal with OCR errors. Next, document matching is invoked between keyword matching results in an input document and document models. Each document model consists of a set of word models with their logical relationship described in terms of a tree structure. This model matching extracts the required keywords and their logical relationship from the input document and determines the most suitable model for the input document. Finally, postprocessing using heuristic rules defined in the model is applied to document matching results to recover keyword matching errors and to modify keyword matching results. This comprehensive approach solves word segmentation problems accurately even if a document has unknown words, compound words, or incorrect words due to OCR errors. Experimental results obtained for 100 documents show that the method is robust and effective for various document structures.

Patent
14 Mar 2002
TL;DR: In this article, an object layout device is provided with an image feature information extraction part 120 for extracting image feature information representing the features of each of a plurality of candidate images, an evaluation value calculation part 140 for calculating evaluation values of images on the basis of the extracted image feature information, an image selection part 150 for selecting an image from the candidate images, and an image layout part 170 for determining the layout of the selected image on the basis of the evaluation value calculated by the evaluation value calculation part 140.
Abstract: PROBLEM TO BE SOLVED: To provide an object layout device which reduces the time and labor required for processing and is suitable to realize a well-balanced layout in accordance with the contents of images. SOLUTION: The object layout device is provided with an image feature information extraction part 120 for extracting image feature information representing the features of each of a plurality of candidate images, an evaluation value calculation part 140 for calculating evaluation values of images on the basis of image feature information extracted by the image feature information extraction part 120, an image selection part 150 for selecting an image from a plurality of candidate images, and an image layout part 170 for determining the layout of the image selected by the image selection part 150 on the basis of the evaluation value calculated by the evaluation value calculation part 140. COPYRIGHT: (C)2003,JPO

Patent
12 Apr 2002
TL;DR: In this article, a document reading system scans a profile preparation sheet under each scanning environment, and extracts scanning characteristics by analyzing a read image, then records a profile of the scanning environment defining the characteristic or link information to the characteristic in each document definition, compares the profiles for definition preparation and for the book reading apparatus while reading the document, and recognizes and verifies a character string based on the comparison result.
Abstract: PROBLEM TO BE SOLVED: To provide a method for automatically extracting characteristics of each scanner provided on a document processing system, and a method for preparing and using a single document definition database even when scanning environments at defining a document and at reading a document are different from each other or a plurality of scanners are used for defining or reading the document, thereby facilitating a document definition preparation operation and preventing degradation in accuracy of a document reading. SOLUTION: The document reading system scans a profile preparation sheet under each scanning environment, and extracts scanning characteristics by analyzing a read image. The document reading system then records a profile of the scanning environment defining the characteristic or link information to the characteristic in each document definition, compares the profiles for definition preparation and for the book reading apparatus while reading the document, and recognizes and verifies a character string based on the comparison result. COPYRIGHT: (C)2004,JPO

Patent
14 Aug 2002
TL;DR: A document mapping system includes a set of element classes, each element class having an associated set of document elements and an associated set of format and mapping rules, identifying means for identifying one or more document elements within an original document, and mapping means for creating and displaying a map of document sections linked by labels representing the respective document elements associated with those document sections.
Abstract: A document mapping system including a set of element classes, each element class having an associated set of document elements and an associated set of format and mapping rules, identifying means for identifying one or more document elements within an original document, and mapping means for creating and displaying a map of document sections linked by labels representing the respective document elements associated with those document sections.

Patent
09 Sep 2002
TL;DR: In this paper, a Document Constraint Analyzer (DCA) takes as input a set of document files together with a document constraint specification file, extracts and examines the contents, attributes, and relationships associated with the document objects, and evaluates the logical expressions specified in the document constraints.
Abstract: A Product Document Constraint Specification Language (PDCSL) is provided for a document author to represent various types of documentation guidelines that must be enforced within documents or across different documents. A Document Constraint Analyzer (DCA) takes as input a set of document files together with a document constraint specification file, extracts and examines the contents, attributes, and relationships associated with the document objects, and evaluates the logical expressions specified in the document constraints. If a document constraint is not satisfied, an action can be taken to correct the documents or provide an explanation to the document author.

Proceedings ArticleDOI
11 Aug 2002
TL;DR: An enhanced background-thinning based page segmentation algorithm to process document images rapidly and eliminate some small regions embedded in other regions and a hierarchical approach, which combines the cross correlation measure, Kolmogorov complexity measure, and a neural network, to classify sub-images into halftones and texts.
Abstract: Page segmentation and image content classification play an important role in automatic document image processing with applications to mixed-type document image compression, form and check reading, and automatic mail sorting. We propose an enhanced background-thinning based page segmentation algorithm to process document images rapidly and eliminate some small regions embedded in other regions. We then present a hierarchical approach, which combines the cross correlation measure, Kolmogorov complexity measure, and a neural network, to classify sub-images into halftones and texts. The approach also achieves high accuracy in text determination using a three-layer feed-forward network where the text region can be classified into Chinese or alphabetic characters. Experimental results on a number of mixed-type document images show the efficiency and effectiveness of our approach.

Patent
Robert S. Murata1
31 May 2002
TL;DR: In this article, a user performs a selection defining an area of interest based on the layout of an electronic document; the selection can be applied across multiple electronic documents and is performed visually on a rendering of the electronic document, thereby providing the user with visual feedback as to what content has been selected.
Abstract: Techniques for layout-based page capture. A user performs a selection defining an area of interest based on the layout of an electronic document. The program retrieves electronic documents based on the defined area of interest. The selection can be performed visually on a rendering of the electronic document, thereby providing the user with visual feedback as to what content has been selected. The selection can be applied across multiple electronic documents.

Journal ArticleDOI
TL;DR: A new document model in terms of a bounding box representation of its constituent parts is described and an empirical measure of performance of a segmentation algorithm based on this new graph-like model of the document is suggested.
Abstract: Document image segmentation is the first step in document image analysis and understanding. One major problem centres on the performance analysis of the evolving segmentation algorithms. The use of a standard document database maintained at the Universities/Research Laboratories helps to solve the problem of getting authentic data sources and other information, but some methodologies have to be used for performance analysis of the segmentation. We describe a new document model in terms of a bounding box representation of its constituent parts and suggest an empirical measure of performance of a segmentation algorithm based on this new graph-like model of the document. Besides the global error measures, the proposed method also produces segment-wise details of common segmentation problems such as horizontal and vertical split and merge as well as invalid and mismatched regions.
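
Detecting split, merge, and missed regions from bounding boxes, as mentioned in the abstract above, can be illustrated with a simplified sketch. It is not the paper's graph-like model or its performance measure: boxes are assumed to be `(x0, y0, x1, y1)` tuples, the `min_frac` overlap threshold is a hypothetical parameter, and horizontal and vertical splits are not distinguished.

```python
def overlap_area(a, b):
    """Area of intersection of two axis-aligned boxes (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def matches(box, others, min_frac):
    """Indices of boxes in `others` that overlap `box` significantly."""
    found = []
    for j, other in enumerate(others):
        ov = overlap_area(box, other)
        if ov > 0 and ov >= min_frac * min(area(box), area(other)):
            found.append(j)
    return found


def classify(truth, detected, min_frac=0.1):
    """Report ground-truth regions that were split or missed, and detections that merge regions."""
    return {
        "split": [i for i, t in enumerate(truth) if len(matches(t, detected, min_frac)) > 1],
        "merge": [j for j, d in enumerate(detected) if len(matches(d, truth, min_frac)) > 1],
        "missed": [i for i, t in enumerate(truth) if not matches(t, detected, min_frac)],
    }
```

A per-region report of this kind complements a single global error score by showing where a segmentation algorithm goes wrong.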

Patent
23 Dec 2002
TL;DR: In this paper, a content and layout coordination system for market-based document content selection is presented, which takes simple criteria from a user and automatically constructs virtual documents from a much larger underlying database of content and layouts.
Abstract: A method and corresponding apparatus for market-based document content and layout selection use an automated auction or bartering system, i.e., an automated content and layout coordination system, to automatically coordinate content and layout selection for document presentation. The system takes simple criteria from a user and automatically constructs virtual documents from a much larger underlying database of content and layout. By trading among the virtual documents, the automated content and layout coordination system affords a flexible and scalable method for coordinating the selection of high-value content and layout and generating a high value document.

Proceedings ArticleDOI
11 Aug 2002
TL;DR: A new method for text detection using a binary image alone is proposed, which includes detection of both normal and inverted text and robustness to various font types, styles and sizes and small skew angles, combined with a moderate number of free parameters.
Abstract: Many document images are rich in color and have complex background. To detect text from them, a standard approach utilizes both color and binary information. This often leads to time-consuming processing and requires a lot of parameters to be tuned. In contrast, we propose a new method for text detection using a binary image alone. The main virtues of our method include detection of both normal and inverted text and robustness to various font types, styles and sizes and small skew angles, combined with a moderate number of free parameters.

Patent
16 Apr 2002
TL;DR: In this paper, the authors present a method for separating and processing layout information and data of a document by using style sheet language transformations, which can be rendered by a conventional browser.
Abstract: Computer-implemented method, computer system and computer program product for separating and processing layout information and data of a document. The computer system provides a predefined document description (120, 130). The document description (120, 130) is decomposed (420) into a layout template (140-1) and a data description (140-2). In a preferred embodiment of the invention decomposition (420) is achieved by using style sheet language transformations. Optionally, the computer system instantiates (460) a data instance (150) from the data description (140-2) and merges (470) the data instance (150) with the layout template (140-1) into an individual document description (160). The individual document description (160) can be rendered by a conventional browser.

Patent
21 Nov 2002
TL;DR: In this paper, a structured document mapping apparatus that correlates items included in text document data inputted to items of hierarchical document having attributes and a structure is presented, and automatically maps the items of the input text document according to the rules stored in a search database and creates an XBRL document that corresponds to attributes and structure of the standard taxonomy or the corporation unique taxonomy.
Abstract: A structured document mapping apparatus that correlates items included in text document data inputted to items of hierarchical document having attributes and a structure. The structured document mapping apparatus refers to at least one of a standard taxonomy and a corporation-unique taxonomy, and automatically maps the items of the input text document according to the rules stored in a search database and creates an XBRL document that corresponds to attributes and structure of the standard taxonomy or the corporation-unique taxonomy.

Patent
Hui Chao1, Henry Sang1
28 Feb 2002
TL;DR: In this paper, a computer-implemented document composition device includes a processor and a memory communicating with the processor, which includes a document storage area storing one or more electronic documents and a distance modifier routine.
Abstract: A computer-implemented document composition device includes a processor and a memory communicating with the processor. The memory includes a document storage area storing one or more electronic documents and a distance modifier routine. The processor uses the distance modifier routine to modify a separation distance between two particular text clusters in the electronic document. A document thus differentiated may be easily discriminated from other, similar documents. The differentiation may result in a difference to the document layout that is insignificant to the human eye but may be computer recognizable.

Book ChapterDOI
27 Jun 2002
TL;DR: This paper investigates the possibility of supporting the user during the correction of the results of the global analysis of the document processing system WISDOM++ by allowing the user to correct the results and then by learning rules for layout correction from the sequence of user actions.
Abstract: Layout analysis is the process of extracting a hierarchical structure describing the layout of a page. In the document processing system WISDOM++ the layout analysis is performed in two steps: firstly, the global analysis determines possible areas containing paragraphs, sections, columns, figures and tables, and secondly, the local analysis groups together blocks that possibly fall within the same area. The result of the local analysis process strongly depends on the quality of the results of the first step. In this paper we investigate the possibility of supporting the user during the correction of the results of the global analysis. This is done by allowing the user to correct the results of the global analysis and then by learning rules for layout correction from the sequence of user actions. Experimental results on a set of multipage documents are reported.

Patent
Ecran E. Kuruoglu1, Alex S. Taylor1
11 Oct 2002
TL;DR: In this article, a target document in a document processing system is annotated on the basis of annotations made previously to a source document, and the target document is searched to locate any of the keywords of interest to the user.
Abstract: A target document in a document processing system is annotated on the basis of annotations made previously to a source document. A source document (either a scanned image of a paper document or an electronic document) is annotated by a user to identify words or phrases of interest. The annotated words are extracted for use as keywords or phrases to search for in future documents. When a target document is processed, the target document is searched to locate any of the keywords of interest to the user. If any of the keywords are located, electronic annotations are applied to these in the target document for display or printing out and/or registered as keywords to the project. The automatically annotated words or phrases enable the user to locate regions of interest more quickly. A summary of a captured document image is produced on the basis of detected annotations made to a document prior to image capture. The scanned (or otherwise captured) image is processed to detect annotations made to the document prior to scanning. The detected annotations can be used to identify features, or text, for use to summarize that document. Additionally, or alternatively, the detected annotations in one document can be used to identify features, or text, for use to summarize a different document. The summary may be displayed in expandable detail levels.