
Showing papers on "Document layout analysis" published in 2000


Patent
17 May 2000
TL;DR: A text mining program is provided that allows a user to perform text mining operations such as information retrieval, term and document visualization, term and document clustering, term and document classification, summarization of individual documents and groups of documents, and document cross-referencing.
Abstract: A text mining program is provided that allows a user to perform text mining operations, such as: information retrieval, term and document visualization, term and document clustering, term and document classification, summarization of individual documents and groups of documents, and document cross-referencing. This is accomplished by representing the text of a document collection using subspace transformations. This subspace transformation representation is performed by: constructing a term frequency matrix of the term frequencies for each of the documents, transforming the term frequencies for statistical purposes, and projecting the documents or the terms into a lower dimensional subspace. As the document collection is updated, the subspace is dynamically updated to reflect the new document collection.

280 citations


Patent
02 Aug 2000
TL;DR: Layout processing is applied to an integrated circuit (IC) layout using a shape-based system in which a shape is defined by a set of associated edges in a specified configuration, a catalog of shapes is defined, and layout processing actions are associated with the various shapes.
Abstract: Layout processing can be applied to an integrated circuit (IC) layout using a shape-based system. A shape can be defined by a set of associated edges in a specified configuration. A catalog of shapes is defined and layout processing actions are associated with the various shapes. Each layout processing action applies a specified layout modification to its associated shape. A shape-based rule system advantageously enables efficient formulation and precise application of layout modifications. Shapes/actions can be provided as defaults, can be retrieved from a remote source, or can be defined by the user. The layout processing actions can be compiled in a bias table. The bias table can include both rule-based and model-based actions, and can also include single-edge shapes for completeness. The scanning of the IC layout can be performed in order of increasing or decreasing complexity, or can be specified by the user. The appropriate layout processing actions are applied to matching portions of the IC layout to form the corrected photomask layout. This process can be sequential or batch mode. Shape and action conflicts can be resolved by marking identified/modified elements or by designing rules for orderly resolution of any inconsistencies or overlaps.

224 citations


Patent
28 Jul 2000
TL;DR: A system and method for text-based document retrieval that uses information contained in the document collection about the statistics of word relationships (context) to facilitate the specification of search queries and document comparison.
Abstract: A system and method for document retrieval is disclosed. The invention addresses a major problem in text-based document retrieval: rapidly finding a small subset of documents in a large document collection (e.g. Web pages on the Internet) that are relevant to a limited set of query terms supplied by the user. The invention is based on utilizing information contained in the document collection about the statistics of word relationships ("context") to facilitate the specification of search queries and document comparison. The method consists of first compiling word relationships into a context database that captures the statistics of word proximity and occurrence throughout the document collection. At retrieval time, a search matrix is computed from a set of user-supplied keywords and the context database. For each document in the collection, a similar matrix is computed using the contents of the document and the context database. Document relevance is determined by comparing the similarity of the search and document matrices. The disclosed system therefore retrieves documents with contextual similarity rather than word frequency similarity, simplifying search specification while allowing greater search precision.

221 citations


Patent
24 Jul 2000
TL;DR: Bindings are used to describe a document (and a different document) by associating content elements with layout elements; the layout elements define layout features or placement information to be applied to the associated content elements, and the bindings are stored separately from both the content and layout elements.
Abstract: Bindings are used to describe a document (and a different document) by associating content elements with layout elements, the layout elements defining layout features or placement information to be applied to the associated content elements in the document, the bindings being stored separately from both the content and layout elements.

175 citations


Journal ArticleDOI
TL;DR: A new algorithm for skew detection is described, and its performance and results are compared to other published methods from O'Gorman, Hinds, Le, Baird, Posel and Akuyama.
Abstract: Document image processing has become an increasingly important technology in the automation of office documentation tasks. Automatic document scanners such as text readers and OCR (Optical Character Recognition) systems are an essential component of systems capable of those tasks. One of the problems in this field is that the document to be read is not always placed correctly on a flatbed scanner. This means that the document may be skewed on the scanner bed, resulting in a skewed image. This skew has a detrimental effect on document analysis, document understanding, and character segmentation and recognition. Consequently, detecting the skew of a document image and correcting it are important issues in realising a practical document reader. In this paper we describe a new algorithm for skew detection. We then compare the performance and results of this skew detection algorithm to other published methods from O'Gorman, Hinds, Le, Baird, Posel and Akuyama. Finally, we discuss the theory of skew detection and the different approaches taken to solve the problem of skew in documents. The skew correction algorithm we propose has been shown to be extremely fast, with run times averaging under 0.25 CPU seconds to calculate the angle on the DEC 5000/20 workstation.

127 citations


01 Jan 2000
TL;DR: This paper presents a hybrid and comprehensive approach to document structure analysis that makes use of layout as well as textual features of a given document to express fuzzy matched rules of an underlying rule base.
Abstract: Document image processing is a crucial process in office automation. It begins with the 'OCR' phase and continues with the more difficult document 'analysis' and 'understanding'. This paper presents a hybrid and comprehensive approach to document structure analysis. It is hybrid in the sense that it makes use of layout (geometrical) as well as textual features of a given document. These features are the basis for potential conditions, which in turn are used to express fuzzy matched rules of an underlying rule base.

99 citations


Patent
Thomas Lange
30 Nov 2000
TL;DR: A text portion containing instruction symbols is selected, and the instruction symbols contained in the selected text portion are converted into the data object they represent, allowing rapid input of data objects into the text document.
Abstract: For inserting a data object as for example a mathematical formula or special characters like Greek characters into a text document, instruction symbols representing the data object are inputted in the form of text characters into the text document. A text portion containing instruction symbols is selected, and the instruction symbols contained in the selected text portion are converted into a data object represented by the instruction symbols. The invention allows rapid input of data objects into the text document, in particular simple mathematical formulae or single special characters without entering a formula editor or the like.
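
The conversion step described above can be illustrated with a minimal sketch. The abstract does not specify a syntax for instruction symbols, so the backslash-prefixed names, the `SYMBOLS` table, and the `convert_selection` function are all assumptions made for illustration only.

```python
import re

# Hypothetical instruction symbols; the abstract does not define a syntax.
SYMBOLS = {"alpha": "α", "beta": "β", "pi": "π", "sum": "∑"}


def convert_selection(text: str, start: int, end: int) -> str:
    """Replace instruction symbols inside text[start:end] with their data objects.

    Text outside the selected portion is left untouched.
    """
    selected = text[start:end]
    converted = re.sub(
        r"\\([A-Za-z]+)",
        # Unknown symbols are left exactly as typed.
        lambda m: SYMBOLS.get(m.group(1), m.group(0)),
        selected,
    )
    return text[:start] + converted + text[end:]
```

Only the selected span is converted, which mirrors the select-then-convert order in the abstract; a real implementation would produce formula objects rather than single characters.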

68 citations


Proceedings ArticleDOI
21 Dec 2000
TL;DR: A new paradigm, 'random graph probing,' is described for comparing the results returned by the recognition system and the representation created during ground-truthing, which could be applied to other document recognition tasks and perhaps even other computer vision problems as well.
Abstract: Tables are an important means for communicating information in written media, and understanding such tables is a challenging problem in document layout analysis. In this paper we describe a general solution to the problem of recognizing the structure of a detected table region. First, hierarchical clustering is used to identify columns, and then spatial and lexical criteria are used to classify headers. We also address the problem of evaluating table structure recognition. Our model is based on a directed acyclic attribute graph, or table DAG. We describe a new paradigm, 'random graph probing,' for comparing the results returned by the recognition system and the representation created during ground-truthing. Probing is in fact a general concept that could be applied to other document recognition tasks and perhaps even other computer vision problems as well.

65 citations


Journal ArticleDOI
TL;DR: The usefulness of the features derived from interval coding is demonstrated in a hidden Markov model based page layout classification system that is trainable and extendible.
Abstract: This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as fast algorithms for initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of the image. These fixed-length vectors are then compared to each other through a Manhattan distance computation for fast page layout comparison. The paper describes experiments and results to rank-order a set of document pages in terms of their layout similarity to a test document. We also demonstrate the usefulness of the features derived from interval coding in a hidden Markov model based page layout classification system that is trainable and extendible. The methods described in the paper can be used in various document retrieval tasks including visual similarity based retrieval, categorization and information extraction.
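
The comparison step, ranking pages by Manhattan distance between fixed-length layout vectors, can be sketched as follows. The interval encoding itself is not reproduced here; the vectors, page names, and function names are hypothetical placeholders.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("layout vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_layout(query, pages):
    """Return page names ordered from most to least similar to the query vector.

    `pages` maps a page name to its fixed-length layout vector (assumed to
    have been produced by some layout encoding step not shown here).
    """
    return sorted(pages, key=lambda name: manhattan(query, pages[name]))
```

Because the distance is a single pass over two short vectors, ranking a large collection stays cheap, which is consistent with the abstract's emphasis on fast comparison.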

59 citations


Patent
31 Aug 2000
TL;DR: A method of quickly and automatically comparing a new document to a large number of previously seen documents and identifying the document type by calculating the distances between the new document signature and each of a plurality of document type distributions, using an algorithm based on a Bayesian framework for a Gaussian distribution.
Abstract: A method of quickly and automatically comparing a new document to a large number of previously seen documents and identifying the document type. First, a plurality of document type distributions is provided; each document type distribution describes layout characteristics of an independent document type and may include a plurality of data points. Each document type distribution includes data derived from at least one basis document signature, which may include data defining pixels of a low-resolution image of the independent basis document resolved to between 1 and 75 dots per inch or may include document segmentation data derived from the independent basis document. Next, a new electronic document is provided, and a new document signature is created from it. Distances between the new document signature and each of the plurality of document type distributions are then calculated using an algorithm based on a Bayesian framework for a Gaussian distribution. The distances calculated may be Euclidean distances or Mahalanobis distances. Additionally, calculating the distances may include weighting the value given each of a plurality of data points in the document signatures based on the usefulness of each data point in distinguishing between the document signatures. Finally, at least one candidate document type for the new electronic document is selected from among the independent document types described by the plurality of document type distributions. The selection may include selecting a preselected fixed number of the independent document types, or selecting the independent document types described by those document type distributions whose calculated distances are within a preselected threshold distance of the smallest of the distances calculated.
In addition, the invention provides for a program storage medium readable by computer, tangibly embodying a program of instructions executable by the computer to perform the method steps described above.
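
The distance and candidate-selection steps can be sketched as below. This is a simplified illustration, not the patented method: it assumes each document type distribution is summarized by a per-dimension mean and variance (a diagonal covariance), and the names `mahalanobis_diag` and `candidate_types` are hypothetical.

```python
import math


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under an assumed diagonal covariance.

    Dividing by the variance down-weights data points that vary a lot within
    a document type, so they contribute less to the distance.
    """
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def candidate_types(signature, distributions, threshold):
    """Select types whose distance is within `threshold` of the smallest one.

    `distributions` maps a type name to a (mean, variance) pair of vectors.
    The result is ordered from nearest to farthest.
    """
    distances = {
        name: mahalanobis_diag(signature, mean, var)
        for name, (mean, var) in distributions.items()
    }
    best = min(distances.values())
    selected = [name for name, d in distances.items() if d - best <= threshold]
    return sorted(selected, key=distances.get)
```

Setting every variance to 1.0 reduces the computation to a plain Euclidean distance, the other option the abstract mentions.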

44 citations


Patent
Chinmoy Panda
23 Mar 2000
TL;DR: A computer-implemented method and system for identifying key images in a document, which includes extracting one or more document keywords considered important in describing the document, collecting one or more images associated with the document including information describing each image, generating a proximity factor for each image and document keyword pair that reflects the degree of correlation between the image and the keyword, and determining the importance of each image according to an image metric that combines the proximity factors for each document keyword and image pair.
Abstract: A computer-implemented method and system for identifying key images in a document is provided. The operations used include extracting one or more document keywords from the document considered important in describing the document, collecting one or more images associated with the document including information describing each image, generating a proximity factor for each image collected from the document and each document keyword that reflects the degree of correlation between the image and the document keyword, and determining the importance of each image according to an image metric that combines the proximity factors for each document keyword and image pair. In addition, the operations may also include ordering the document keywords according to an ordering criterion and weighting the proximity factor associated with each document keyword and image pair based on the order of the document keyword.

Patent
24 Jan 2000
TL;DR: The style of an example document is determined by examining the example file for syntax patterns that are required in a document of this type; each pattern is used to create a section template (a sub-template for a larger template).
Abstract: A system and method of using an example document to create another document with the same style. The style is determined by examining the example file for syntax patterns that are required in a document of this type. Each pattern is used to create a section template (a sub-template for a larger template). After all the required sub-templates have been defined, by examining the example, we have a document template that may be used to format new documents. Along with user-specific content, a document generator uses the captured document template to generate sections of a new document. When a section of a document is generated, the sub-template that corresponds to that section of a document is inserted with user-specific content. The generated file ends up with the same kind of text spacing and positioning, ordering of sections, presence of annotations and other nonfunctional attributes as the example.

Book ChapterDOI
TL;DR: Recognition techniques discussed for diagram recognition include blackboard systems, stochastic grammars, Hidden Markov Models, and graph grammars.
Abstract: Document image analysis is the study of converting documents from paper form to an electronic form that captures the information content of the document. Necessary processing includes recognition of document layout (to determine reading order, and to distinguish text from diagrams), recognition of text (called Optical Character Recognition, OCR), and processing of diagrams and photographs. The processing of diagrams has been an active research area for several decades. A selection of existing diagram recognition techniques is presented in this paper. Challenging problems in diagram recognition include (1) the great diversity of diagram types, (2) the difficulty of adequately describing the syntax and semantics of diagram notations, and (3) the need to handle imaging noise. Recognition techniques that are discussed include blackboard systems, stochastic grammars, Hidden Markov Models, and graph grammars.

Journal ArticleDOI
TL;DR: This paper proposes a new method, called the document multithresholding technique, based on a page layout analysis (PLA) technique and on a neural-network multilevel threshold-selection approach, which is applicable to any mixed-type document and achieves document multithresholding by taking advantage of the types of the document blocks.

Journal ArticleDOI
TL;DR: The system is divided into three main components: detection of MEs in a document; recognition of the symbols present in each ME; and arrangement of the recognised symbols.
Abstract: In this paper, we propose an approach for understanding Mathematical Expressions (MEs) in a printed document. The system is divided into three main components: (i) detection of MEs in a document; (ii) recognition of the symbols present in each ME; and (iii) arrangement of the recognised symbols. The MEs printed in separate lines are detected without any character recognition whereas the embedded expressions (mixed with normal text) are detected by recognising the mathematical symbols in text. Some structural features of the MEs are used for both cases. The mathematical symbols are grouped into two classes for convenience. At first, the frequently occurring symbols are recognised by a stroke-feature analysis technique. Recognition of less frequent symbols involves a hybrid of feature-based and template-based technique. The bounding-box coordinates and the size information of the symbols help to determine the spatial relationships among the symbols. A set of predefined rules is used to form the meaningful symbol groups so that a logical arrangement of the mathematical expression can be obtained. Experiments conducted using this approach on a large number of documents show high accuracy.

Proceedings ArticleDOI
27 Mar 2000
TL;DR: A new method that forms large connected components by a smoothing algorithm and calculates the document skew by finding the orientation of the minimum-area bounding rectangle of one of several connected components is presented.
Abstract: Detection of document skew is an important step in document image analysis. The paper presents a new method for calculation of document skew. The method forms large connected components by a smoothing algorithm and calculates the document skew by finding the orientation of the minimum-area bounding rectangle of one of several connected components. Connection of text to non-text in the smoothing step does not degrade the performance of the method. The smoothing parameters are determined automatically and no manual adjustment is necessary. The method is not limited in the range of detectable skew angles and the achievable accuracy. Experimental results show the high performance of the algorithm in detecting document skew for a variety of documents with different levels of complexity.
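
The final step, reading the skew from the orientation of a minimum-area bounding rectangle, can be sketched as follows. This is a minimal illustration under stated assumptions: the smoothing and connected-component steps are omitted, the component is given as a list of (x, y) points, and the rectangle is found by testing each convex-hull edge direction. The function names are hypothetical.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def skew_angle(points):
    """Angle in degrees, in [-90, 90), of the long side of the min-area rectangle.

    Uses mathematical axes (y pointing up); negate the result for image
    coordinates where y points down.
    """
    hull = convex_hull(points)
    if len(hull) < 3:
        raise ValueError("need at least three non-collinear points")
    best_area, best_angle = math.inf, 0.0
    for i in range(len(hull)):
        (x0, y0), (x1, y1) = hull[i], hull[(i + 1) % len(hull)]
        theta = math.atan2(y1 - y0, x1 - x0)
        c, s = math.cos(theta), math.sin(theta)
        # Project the hull onto axes aligned with this edge.
        us = [x * c + y * s for x, y in hull]
        vs = [-x * s + y * c for x, y in hull]
        width, height = max(us) - min(us), max(vs) - min(vs)
        if width * height < best_area:
            best_area = width * height
            # Report the direction of the longer side.
            best_angle = theta if width >= height else theta + math.pi / 2
    return (math.degrees(best_angle) + 90.0) % 180.0 - 90.0
```

The search relies on the standard result that a minimum-area enclosing rectangle has one side collinear with a convex-hull edge, so only hull edge directions need to be tried.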

Proceedings ArticleDOI
21 Dec 2000
TL;DR: This paper describes an algorithm and its implementation that segment text block by block from a given document, e.g. a newspaper image, based on a pyramid structure, a multi-resolution image representation that is amenable to parallel processing on output.
Abstract: Text block segmentation is necessary in document layout analysis. This paper describes an algorithm and its implementation that segment text block by block (a block is either a title or a paragraph) from a given document, e.g. a newspaper image, based on a pyramid structure. The pyramid structure is a multi-resolution image representation that is amenable to parallel processing on output. It also simulates how the human eye sees a document from afar, perceiving the block structure of the document. The block segmentation can identify titles and distinguish different paragraphs based on the indentation between them. Our implementation will be used in a news article retrieval project.
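
The idea of a multi-resolution pyramid in which fine detail merges into blocks can be sketched as below. The abstract does not give the reduction rule, so the 2x2 OR-reduction, the 4-connected block counting, and all function names here are assumptions chosen for illustration.

```python
def reduce_level(img):
    """Halve the resolution: an output pixel is 1 if any pixel in its 2x2 cell is 1."""
    h, w = len(img), len(img[0])
    return [
        [
            int(any(img[y][x]
                    for y in range(r, min(r + 2, h))
                    for x in range(c, min(c + 2, w))))
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


def build_pyramid(img, levels):
    """Return [img, reduced once, reduced twice, ...] with `levels` reductions."""
    pyramid = [img]
    for _ in range(levels):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid


def count_blocks(img):
    """Count 4-connected foreground regions using an iterative flood fill."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    blocks = 0
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                blocks += 1
                seen[r][c] = True
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return blocks
```

On a small page with two closely spaced dotted lines and a third line farther down, the isolated pixels at full resolution merge into two blocks one level up and into a single block at the next level, so picking the level matters for separating paragraphs.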


Proceedings ArticleDOI
01 Jan 2000
TL;DR: A bottom-up algorithm of layout analysis based on nearest neighbor connect-strength and line confidence is proposed and a rule-based growing algorithm used for layout understanding is proposed that can be used for automatic electronic publishing.
Abstract: Layout analysis, understanding and representation are important problems when transforming a paper document into its electronic version. For a Chinese newspaper with a complex layout, a bottom-up algorithm of layout analysis based on nearest neighbor connect-strength and line confidence is proposed. We also propose a rule-based growing algorithm used for layout understanding. The implementation of layout representation is discussed at the same time. Using these algorithms with a Chinese OCR engine, we developed a complete system that can be used for automatic electronic publishing. Experimental results and a system running in practice show the algorithms to be efficient and practical.

Proceedings ArticleDOI
01 Jan 2000
TL;DR: A new matching algorithm based on the document component block list and component block tree is proposed that can effectively make use of the local information of each page block and the global information of page layout, while it is also robust to image distortion, filled-in text, and noise.
Abstract: Document image matching is the key technique for document registration and retrieval. A new matching algorithm based on the document component block list and component block tree is proposed. Our method can effectively make use of the local information of each page block and the global information of page layout, while it is also robust to image distortion, filled-in text, and noise. This algorithm is then refined and applied to automatic data extraction of column forms. A demonstration software package has been developed.

Patent
Yoshiaki Kurosawa, Katsumi Kato
20 Mar 2000
TL;DR: A document image is inputted as image data from an image inputting section, and the layout of the document image is analyzed on the basis of the image data to obtain layout constituents.
Abstract: In a document image processing apparatus, a document image is inputted as image data from an image inputting section. In a layout analyzing section, the layout of the document image is analyzed on the basis of the image data to obtain layout constituents. In an image processing section, the document image is processed. The image processing section includes an editor for specifying a position to be edited in the document image and editing the document image, on the basis of position/size data on the layout constituents, in accordance with the operator's instructions from the operation data inputting section. In an image displaying section, the document image is displayed in cooperation with the image processing section.

Patent
25 Apr 2000
TL;DR: A document image recognition method that accurately and efficiently identifies the areas of color document images and black-and-white/gray images, and that enables accurate OCR even for color documents with problems peculiar to color.
Abstract: PROBLEM TO BE SOLVED: To provide a document image recognition method that accurately and efficiently identifies the areas of color document images and black-and-white/gray images, and that enables accurate OCR even for color documents with problems peculiar to color. SOLUTION: In the document image recognition method, the document image is input as a digital image, the background color of the document image is specified, the image is reduced as needed, pixels outside the background area are extracted from the document image using this background color, connected components are generated by merging these pixels, the connected components are classified into prescribed areas using at least shape features, and the area identification result of the document image is provided. In addition, area identification is performed on a binary image, the result is collated with the color area identification result, feedback processing is performed as needed, and a binary image and an area identification result suitable for OCR can be provided.

Patent
06 Apr 2000
Abstract: A parsing system and method are provided in which the break characters in the document are used to rapidly parse the document and extract one or more key phrases from the document which characterize the document (44). The break characters in the document may include explicit break characters (46), such as punctuation, soft stop words and hard stop words. The determination of which phrases in the document are extracted depends upon the type of break character appearing after the phrase in the document (52).
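
A minimal sketch of break-based phrase extraction follows. The stop-word lists and the policy of discarding phrases that end at a soft stop word are hypothetical, since the abstract only says that extraction depends on the type of break that follows a phrase; `split_on_breaks` and `key_phrases` are illustrative names.

```python
import re

# Hypothetical word lists; the abstract does not enumerate them.
HARD_STOPS = {"the", "a", "an", "is", "are", "was", "and", "or"}
SOFT_STOPS = {"of", "in", "on", "for", "to", "with", "from"}


def split_on_breaks(text):
    """Return (phrase, break_type) pairs, where break_type names what ended the phrase."""
    phrases, current = [], []
    for token in re.findall(r"[A-Za-z0-9'-]+|[^\sA-Za-z0-9]", text):
        word = token.lower()
        if word in HARD_STOPS:
            kind = "hard"
        elif word in SOFT_STOPS:
            kind = "soft"
        elif not token[0].isalnum():
            kind = "punct"
        else:
            current.append(token)  # ordinary word: extend the current phrase
            continue
        if current:
            phrases.append((" ".join(current), kind))
            current = []
    if current:
        phrases.append((" ".join(current), "end"))
    return phrases


def key_phrases(text, keep=("punct", "hard", "end"), min_words=2):
    """Keep multi-word phrases whose terminating break type is in `keep` (assumed policy)."""
    return [
        phrase
        for phrase, kind in split_on_breaks(text)
        if kind in keep and len(phrase.split()) >= min_words
    ]
```

The single left-to-right pass with set lookups is what makes this style of parsing fast: no grammar or part-of-speech tagging is needed.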

Proceedings ArticleDOI
21 Dec 2000
TL;DR: A fully implemented system based on generic document knowledge for detecting the logical structure of documents for which only general layout information is assumed, which focuses on detecting the reading order.
Abstract: We present a fully implemented system based on generic document knowledge for detecting the logical structure of documents for which only general layout information is assumed. In particular, we focus on detecting the reading order. Our system integrates components based on computer vision, artificial intelligence, and natural language processing techniques. The prominent feature of our framework is its ability to handle documents from heterogeneous collections. The system has been evaluated on a standard collection of documents to measure the quality of the reading order detection. Experimental results for each component and the system as a whole are presented and discussed in detail. The performance of the system is promising, especially when considering the diversity of the document collection.

Proceedings Article
12 Apr 2000
TL;DR: A framework for analyzing color documents with complex layouts combines two sources of information, textual and spatial, in a content-driven bottom-up approach, and successfully extracts the intended reading order from the pages of a color journal.
Abstract: We present a framework to analyze color documents of complex layout. In addition, no assumption is made on the layout. Our framework combines in a content-driven bottom-up approach two different sources of information: textual and spatial. To analyze the text, shallow natural language processing tools, such as taggers and partial parsers, are used. To infer relations of the logical layout we resort to a qualitative spatial calculus closely related to Allen's calculus. We evaluate the system against documents from a color journal and present the results of extracting the reading order from the journal's pages. In this case, our analysis is successful as it extracts the intended reading order from the document.

Proceedings ArticleDOI
01 Sep 2000
TL;DR: A pyramidal quadtree structure is constructed for multiscale analysis and top-down approach, and a periodicity measure is suggested to find a periodical attribute of text regions.
Abstract: We propose a new method independent of parameters for segmenting the document images into maximal homogeneous regions and identifying them as texts, images, tables and lines. A pyramidal quadtree structure is constructed for multiscale analysis and top-down approach, and a periodicity measure is suggested to find a periodical attribute of text regions. To obtain robust page segmentation results, a confirmation procedure using texture analysis is applied to only ambiguous regions. Experimental results with the document database from the University of Washington show that the proposed method works better than the previous ones.

Patent
25 Jan 2000
TL;DR: Document synthesis is performed by extracting document components from a structured document and inserting them into, or replacing components of, a model document without using a script that describes the procedure.
Abstract: PROBLEM TO BE SOLVED: To perform document synthesis by extracting document components from a structured document and inserting them into, or replacing components of, a model document without using a script that describes the procedure. SOLUTION: A structured document is annotated with extraction instructions for taking out document components and with repetitive-copy and insertion/replacement instructions. The instructions taken out of the multiple input structured documents, which specify the components to extract, the repetitive copying, and the document components (parts) to be inserted or replaced, are synthesized dynamically into a document processing description, so a separate document processing script is not needed. This saves the time and labor of managing the script separately from the original documents.

Proceedings ArticleDOI
03 Sep 2000
TL;DR: This work proposes a methodology for automatic generation of ground-truths for skewed images by using the ground-truths available for upright images, which is simple and quite fast because processing is done at the level of small square blocks rather than at pixel level.
Abstract: Generation of ground-truths is of great importance for unbiased performance evaluation of document layout analysis methods. This is especially necessary because many methods are claimed to be skew-tolerant. However, experimental evaluation of this fact is often based only on human subjective judgement and restricted to a few experiments. The main obstacle for obtaining human-independent and more automated performance evaluation is that usually there are only ground-truths for upright images, i.e., images with no skew of text lines, because currently available ground-truthing techniques are too time-consuming. We propose a methodology of automatic generation of ground-truths for skewed images by using the ground-truths available for upright images. This methodology is simple and quite fast because processing is done at the level of small square blocks, but not at pixel level.

Proceedings ArticleDOI
28 May 2000
TL;DR: This paper proposes a new method called the document multithresholding technique, based on a Page Layout Analysis technique and on a neural network multilevel threshold selection approach, that can reduce the number of gray-levels in accordance with the type of each document region.
Abstract: Mixed type documents include text, drawings and graphics regions. A technique that can reduce the number of gray-levels in accordance with the type of each document region could be important for many document applications, such as storage, transmission and recognition. To solve this problem this paper proposes a new method called the document multithresholding technique. The method is based on a Page Layout Analysis (PLA) technique and on a neural network multilevel threshold selection approach. In the final document the different block types are stored with the appropriate and limited number of gray-level values. In text and line-drawing blocks, only one threshold is determined, whereas in the graphics blocks the optimal number of thresholds is first determined. The performance of the method was extensively tested on a variety of documents.

Patent
26 Apr 2000
TL;DR: The document issuer calculates a feature quantity from document image data acquired by reading a paper document, pastes an image of that feature quantity onto the document image to generate a transmission-purpose document image, and prints out a transmission-purpose paper document.
Abstract: PROBLEM TO BE SOLVED: To provide a document authentication method or the like that can properly authenticate an exchanged document. SOLUTION: The document issuer calculates a feature quantity D3 from document image data D2 acquired by reading a paper document D1, pastes a feature quantity image D4 derived from the feature quantity D3 onto the document image data D2 to generate transmission-purpose document image data D5, and prints out a transmission-purpose paper document D6. The document receiver separates a feature quantity image D8 from document image data D7 obtained by reading the received paper document D6 and reproduces the original feature quantity D9. The receiver then calculates a feature quantity D11 from the original document image D10, from which the feature quantity image D8 has been separated, and compares D11 with the original feature quantity D9 to output the result of the falsification check.