
Showing papers on "Document layout analysis" published in 2001


Patent
Fernando Incertis Carro
06 Aug 2001
TL;DR: A method and system for creating hyperlinks from items (e.g., words, pictures, footnotes, symbols, icons) on a first physical document to particular points on a second physical document (manuscript or printed document), for activating these hyperlinks simply by touching the first document, and for highlighting, by means of a light-emitting source, the position of the items on the second document.
Abstract: The present invention generally relates to interactive hypermedia systems and more particularly to a method and system for locating on a physical document items referenced in another physical document. The present invention discloses a method and system for creating hyperlinks from items (e.g. words, pictures, foot notes, symbols, icons) on a first physical document to particular points on a second physical document (manuscript or printed document), for activating these hyperlinks simply by touching the first document, and for highlighting by means of a light emitting source, the position of the items on the second document. In a preferred embodiment, the present invention discloses a method and system for highlighting on a hard-copy map the geographic positions of places referenced in a hard-copy document.

150 citations


Patent
24 Feb 2001
TL;DR: A document drafting system generates a document having a required sequence, including one or more screens to receive text input; an electronic agent to review the text input, to apply one or more diagnostic rules to it, and to generate zero or more diagnostic messages; a visualization system coupled to one screen to graphically display one or more relationships between one or more text portions in the document; and a document generator to assemble the text input into the document according to the sequence.
Abstract: A document drafting system generates a document having a required sequence, including one or more screens to receive text input; an electronic agent to review the text input, to apply one or more diagnostic rules to the text input and to generate zero or more diagnostic messages; a visualization system coupled to one screen to graphically display one or more relationships between one or more text portions in the document; and a document generator to assemble the text input into the document according to the sequence.

147 citations


Journal ArticleDOI
TL;DR: The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats.
Abstract: The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.

129 citations


Journal ArticleDOI
TL;DR: It is demonstrated that layout offers a rich resource for achieving presentational coherence, alongside more traditional resources such as text-formatting and the text-internal marking of discourse connections, and an integrated approach to layout, text, and diagram generation is introduced.
Abstract: Combining elements appropriately within a coherent page layout is a well-recognized and crucial aspect of sophisticated information presentation. The precise function and nature of layout has not, however, been sufficiently addressed within computational approaches; attention is often restricted to relatively local issues of typography and text-formatting, leaving broader issues of layout unaddressed. In this paper we focus on the selection and function of layout in pages that appropriately combine textual and graphical representation styles to yield coherent presentation designs. We demonstrate that layout offers a rich resource for achieving presentational coherence, alongside more traditional resources such as text-formatting and the text-internal marking of discourse connections. We also introduce an integrated approach to layout, text, and diagram generation. Our approach is developed on the basis of a preliminary empirical investigation of professionally produced layouts, followed by implementation within a prototype information system in the area of art history.

119 citations


Journal ArticleDOI
TL;DR: The approach to classification is based on “visual similarity” of layout structure and is implemented by building a supervised classifier, given examples of each class, using decision tree classifiers and self-organizing maps.
Abstract: Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify it by type in the absence of domain-specific models. Our approach to classification is based on “visual similarity” of layout structure and is implemented by building a supervised classifier, given examples of each class. We use image features such as percentages of text and non-text (graphics, images, tables, and rulings) content regions, column structures, relative point sizes of fonts, density of content area, and statistics of features of connected components which can be derived without class knowledge. In order to obtain class labels for training samples, we conducted a study where subjects ranked document pages with respect to their resemblance to representative page images. Class labels can also be assigned based on known document types, or can be defined by the user. We implemented our classification scheme using decision tree classifiers and self-organizing maps.

97 citations


Proceedings ArticleDOI
10 Sep 2001
TL;DR: A hierarchical approach for extracting homogeneous regions in on-line documents is presented and the problem of identifying and processing ruled and unruled tables, text and drawings is addressed.
Abstract: We present a hierarchical approach for extracting homogeneous regions in on-line documents. The problem of identifying and processing ruled and unruled tables, text and drawings is addressed. The on-line document is first segmented into regions with only text strokes and regions with both text and non-text strokes. The text region is further classified as unruled table or plain text. Stroke clustering is used to segment the non-text regions. Each nontext segment is then classified as drawing, ruled table or underlined keyword using stroke properties. The individual regions are processed and the results are assembled to identify the structure of the on-line document.

86 citations


Patent
Takuya Okamoto, Toru Takahashi, Yuki Aoyama, Noriyuki Yamasaki, Eiko Murata
28 Sep 2001
TL;DR: In this article, a method and an apparatus for searching and displaying a structured document are disclosed, where an analyzed structured document and information for document search are generated and stored in data bases, respectively.
Abstract: A method and an apparatus for searching and displaying a structured document are disclosed. The process for document registration is executed with a structured document of a file as an input. An analyzed structured document and information for document search are generated, and are stored in data bases, respectively. A query input from an input/output unit is analyzed, a document search index is read and a search process is executed. Matching document identifier information and matching strings position information are output as the result of search. In the display process, a corresponding analyzed structured document is read from the data base based on the document identifier information matched in a document read process. In processing a document display, the matching information are embedded in the structured document based on the matching strings position information, and a structured document for display with highlight information added thereto is generated and displayed. A document is searched from which the element information constituting a stumbling block to the search is removed, and the result of search is displayed with highlight information added to the original structured document.

83 citations


Patent
Yue Ma, Jinhong Katherine Guo
31 Jan 2001
TL;DR: A handwriting detection method discriminates printed text lines from handwritten annotations in a scanned document image by comparing merged text lines, obtained by connected component analysis, with the regular pattern found in projection histograms.
Abstract: A scanned document image, including add-on information such as handwritten annotations in addition to printed text lines, is processed by a handwriting detection method. First, at least one projection histogram is generated from the scanned document image. A regular pattern that correlates to the printed text lines is determined from the projection histogram. Second, connected component analysis is applied to the scanned document image to generate at least one merged text line. Each merged text line relates to at least one of the handwritten annotation and the printed text line. By comparing the merged text lines to the regular pattern of the projection histograms, the printed text lines are discriminated from the handwritten annotations.

79 citations
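
The first step described in this abstract, building a projection histogram and reading line positions from it, can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows with 1 for ink; the function names and the `min_ink` threshold are hypothetical and are not taken from the patent, and the connected component analysis and the comparison step are not covered.

```python
from typing import List, Tuple


def horizontal_projection(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in every row of a binary image."""
    return [sum(row) for row in image]


def text_line_bands(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each run of rows holding at least min_ink pixels."""
    bands = []
    start = None
    for y, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = y                      # a new band of inked rows begins
        elif count < min_ink and start is not None:
            bands.append((start, y - 1))   # the band ended on the previous row
            start = None
    if start is not None:                  # band that reaches the bottom edge
        bands.append((start, len(profile) - 1))
    return bands


# Toy page: two "text lines", each two rows tall, separated by blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
profile = horizontal_projection(page)
bands = text_line_bands(profile)
print(profile)  # [0, 5, 5, 0, 4, 4, 0]
print(bands)    # [(1, 2), (4, 5)]
```

Evenly spaced bands of similar height are the kind of regular pattern that printed lines would be expected to produce; how the patent actually measures that regularity is not stated in the abstract.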


Journal ArticleDOI
TL;DR: A parameter-free method is proposed for segmenting document images into maximal homogeneous regions and identifying them as texts, images, tables, and ruling lines; tests show that it provides more accurate results than previous methods.
Abstract: Automatic transformation of paper documents into electronic documents requires geometric document layout analysis at the first stage. However, variations in character font sizes, text line spacing, and document layout structures have made it difficult to design a general-purpose document layout analysis algorithm for many years. The use of some parameters has therefore been unavoidable in previous methods. The authors propose a parameter-free method for segmenting the document images into maximal homogeneous regions and identifying them as texts, images, tables, and ruling lines. A pyramidal quadtree structure is constructed for multiscale analysis and a periodicity measure is suggested to find a periodical attribute of text regions for page segmentation. To obtain robust page segmentation results, a confirmation procedure using texture analysis is applied to only ambiguous regions. Based on the proposed periodicity measure, multiscale analysis, and confirmation procedure, we could develop a robust method for geometric document layout analysis independent of character font sizes, text line spacing, and document layout structures. The proposed method was experimented with the document database from the University of Washington and the MediaTeam Document Database. The results of these tests have shown that the proposed method provides more accurate results than previous ones.

77 citations


Proceedings ArticleDOI
01 Jan 2001
TL;DR: In this paper, three efficient techniques that can be used to discriminate between text written in Arabic script and text written in English script are presented and evaluated.
Abstract: Because of the different characteristics of Arabic language and Romance and Anglo Saxon languages, recognition of documents written in hybrids of these languages requires that the language of the text is to be identified prior to the recognition phase. In this paper, three efficient techniques that can be used to discriminate between text written in Arabic script and text written in English script are presented and evaluated. These techniques address the language identification problem on the word level and on text level. The characteristics of horizontal projection profiles as well as runlength histograms for text written in both languages are the basic features underlying these techniques. Solving this problem is very important in building bilingual document image analysis systems which are capable of processing documents containing hybrid Arabic/Romance and Anglo Saxon languages.

72 citations
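
One of the two feature families named in this abstract, the run-length histogram, can be sketched as follows. This is an illustrative computation only, assuming a binary word image with 1 for ink; the function name is hypothetical, and the paper's actual decision rule for Arabic versus English script is not given in the abstract and is not reproduced here.

```python
from collections import Counter
from typing import List


def horizontal_run_lengths(image: List[List[int]]) -> Counter:
    """Histogram of the lengths of horizontal runs of ink pixels (value 1)."""
    hist = Counter()
    for row in image:
        run = 0
        for pixel in row:
            if pixel:
                run += 1            # extend the current run of ink
            elif run:
                hist[run] += 1      # a run just ended on a background pixel
                run = 0
        if run:                     # run that touches the right edge
            hist[run] += 1
    return hist


word = [
    [1, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 1],
]
hist = horizontal_run_lengths(word)
print(sorted(hist.items()))  # [(1, 2), (2, 1), (3, 1), (5, 1)]
```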


Proceedings ArticleDOI
01 Sep 2001
TL;DR: A method is presented that uses edge information to extract textual blocks from gray scale document images, detecting textual regions in heavily noise-infected newspaper images and separating them from graphical regions.
Abstract: In this paper we present a well designed method that makes use of edge information to extract textual blocks from gray scale document images. It aims at detecting textual regions on heavy noise infected newspaper images and separate them from graphical regions. The algorithm traces the feature points in different entities and then groups those edge points of textual regions. From using the technology of line approximation and layout categorization, it can successfully retrieve directional placed text blocks. Finally feature based connected component merging was introduced to gather homogeneous textual regions together within the scope of its bounding rectangles. We can obtain correct page decomposition with efficient computation and reduced memory size by handling line segments instead of small pixels. The proposed method has been tested on a large group of newspaper images with multiple page layouts, promising results approved the effectiveness of our method.

Patent
28 Dec 2001
TL;DR: A system and method for automatically generating a hierarchical table of contents or outline for indexing a document and identifying clusters of related information in it. The invention combines latent semantic indexing, to identify related blocks and major topic changes, with scale space segmentation, to find topic changes of various sizes at block edges.
Abstract: A system and method for automatically generating a hierarchical table of contents or outline for indexing a document and identifying clusters of related information in the document. The document may comprise text, audio, video, or a multimedia presentation. The invention employs a unique and novel combination of latent semantic indexing techniques to identify related blocks and major topic changes within the document with scale space segmentation techniques to respectively identify self-similar blocks within the document and to thus find topic changes of various sizes at block edges. The invention then produces a visual presentation of the semantic structure of the document.

Proceedings ArticleDOI
09 Nov 2001
TL;DR: A document analysis system which is expected to extract regions of interest in greyscale document images using geometric and texture features and some entropic heuristic is presented.
Abstract: In this paper, we present a document analysis system which is expected to extract regions of interest in greyscale document images. Collected areas are then clustered in text zones and non-text areas using geometric and texture features. The system works in two steps. Regions of interest are retrieved via cumulative gradient considerations. In classification module, we introduced some entropic heuristic. Experiments are done on the MediaTeam Document Database to show the relevance of this criteria.

Proceedings ArticleDOI
10 Sep 2001
TL;DR: This paper develops a FORG-based genre classification method for machine-printed documents and presents a comparative evaluation between the technique and a variety of statistical pattern classifiers.
Abstract: We approach the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, as document content features are similar. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our method uses the attributed relational graphs (ARGs) to represent the layout structure of document instances, and the first order random graphs (FORGs) to represent document genres. In this paper we develop our FORG-based genre classification method and present a comparative evaluation between our technique and a variety of statistical pattern classifiers. FORGs are capable of modeling common layout structure within a document genre and are shown to significantly outperform traditional pattern classification techniques when fine-grained genre distinctions must be drawn.

Patent
05 Feb 2001
TL;DR: A dynamic document and a method for representing a dynamic document are provided, where the dynamic document template includes a logic section and a layout section, and the layout section has at least one layout object.
Abstract: A dynamic document and a method for representing a dynamic document are provided. The dynamic document includes a dynamic document template and an instances set bound to the dynamic document template (32). The instances set (34) includes a plurality of pointers to a plurality of data sources. The dynamic document template includes a logic section (36) and a layout section (38). The layout section has at least one layout object.

Journal ArticleDOI
TL;DR: This paper presents a hybrid and comprehensive approach to document structure analysis that makes use of layout as well as textual features of a given document, which allows an easy adaptation to specific domains with their specific logical objects.
Abstract: Document image processing is a crucial process in office automation and begins at the ‘OCR’ phase with difficulties in document ‘analysis’ and ‘understanding’. This paper presents a hybrid and comprehensive approach to document structure analysis. Hybrid in the sense that it makes use of layout (geometrical) as well as textual features of a given document. These features are the base for potential conditions which in turn are used to express fuzzy matched rules of an underlying rule base. Rules can be formulated based on features which might be observed within one specific layout object. However, rules can also express dependencies between different layout objects. In addition to its rule driven analysis, which allows an easy adaptation to specific domains with their specific logical objects, the system contains domain-independent markup algorithms for common objects (e.g., lists).

Journal ArticleDOI
TL;DR: A new matching method based on document component block list (CBL) is proposed that can effectively make use of the local information of each page component block and the global information of document page layout.

Patent
24 Apr 2001
TL;DR: A technology for recognizing the content of text documents determines one or more hash values for a document (or, alternatively, generates a “sifted text” version of it). Comparing these values reveals whether one document is copied from another, and the same recognition can be used to group documents by category.
Abstract: Described herein is a technology for recognizing the content of text documents. The technology determines one or more hash values for the content of a text document. Alternatively, the technology may generate a “sifted text” version of a document. In one implementation described herein, document recognition is used to determine whether the content of one document is copied (i.e., plagiarized) from another document. This is done by comparing hash values of documents (or alternatively their sifted text). In another implementation described herein, document recognition is used to categorize the content of a document so that it may be grouped with other documents in the same category. This abstract itself is not intended to limit the scope of this patent. The scope of the present invention is pointed out in the appending claims.
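
The hash comparison mentioned in this abstract can be illustrated with a deliberately simple sketch. The abstract does not define "sifted text" or name a hash function, so both choices below are assumptions made for illustration: `sift` is a hypothetical normalization that keeps only lowercase letters and digits, and SHA-256 stands in for whatever hash the patent uses.

```python
import hashlib
import re


def sift(text: str) -> str:
    """Hypothetical 'sifting': lowercase the text and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


def content_hash(text: str) -> str:
    """Hash of the sifted text; SHA-256 is an assumed choice, not the patent's."""
    return hashlib.sha256(sift(text).encode("utf-8")).hexdigest()


def same_content(a: str, b: str) -> bool:
    """True when two documents are identical after sifting."""
    return content_hash(a) == content_hash(b)


print(same_content("Hello, World!", "hello   world"))  # True
print(same_content("Hello", "Help"))                    # False
```

A single whole-document hash like this only detects copies that are identical after normalization; detecting partial copying would need hashes over smaller units, which this sketch does not attempt.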

Patent
01 Jun 2001
TL;DR: Systems and methods for defining a form with a plurality of layout items for data presentation by a business application, using a tree view, a property view, and a layout view; the compatibility of layout items and processing order with a predefined data interface of the business application is also verified.
Abstract: Systems and methods for defining a form with a plurality of layout items for data presentation by a business application provide a tree view with tree nodes to represent the layout items, wherein the view visualizes structure information, a processing order, a selected tree node to represent a selected layout item; provide a property view to display properties of the selected layout item; provide a layout view to display items, wherein the selected layout item is highlighted; modify the selected layout item and the processing order through interaction with a user; and create a form definition document. The compatibility of layout items and processing order with a predefined data interface of the business application is verified as well.

Patent
18 Apr 2001
TL;DR: In this article, the authors proposed a document automated dividing device capable of omitting a partition sheet and automatically determining a break of a document even when various documents having unclear formats are inputted.
Abstract: PROBLEM TO BE SOLVED: To provide a document automated dividing device capable of omitting a partition sheet and automatically determining a break of a document even when various documents having unclear formats are inputted. SOLUTION: This document automated dividing device is provided with an image reading means 101 reading a plurality of documents and forming a document image, a document image storing buffer 102 storing the read document image, a letter identifying means 103 identifying a letter in the document image, a document dividing information extracting means 104 extracting document dividing information for determining a break of the document from an analysis result and a letter identification result of the document image, a document break determination means 105 determining the break of the document on the basis of the document dividing information, a document break possible selection means 106 displaying the document break determination result by the document break determination means to an operation for correction and confirmation, and a document management system registering means 107 dividing the document image into document units for registration in a document management system.

Patent
04 Jan 2001
TL;DR: A text document is marked for authentication by inserting extra inter-word blank characters at positions computed from a canonical form of the text and a secret key, which merges the information needed for authentication into the body of the document itself.
Abstract: The invention discloses how a text document can be marked through the insertion of inter-word blank characters for the purpose of becoming authenticateable. First, text to be marked is edited so as to obtain a canonical form of it conforming to a model. Then, from this canonical form of the text and a secret-key used as inputs, a unique combination of inter-word blank characters positions is computed in which extra blanks are inserted thus, obtaining a marked text document. Authentication of a received marked text document is performed by a recipient, sharing the secret-key, further comparing the received text document to the marked text document so that if they are matching exactly the received text document is accepted as authentic or rejected as fake if not. The invention allows to merge the information necessary to authenticate a text document into the body of the document itself which works as well on soft-copy and hard-copy text documents.
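
The marking and authentication steps in this abstract can be sketched roughly as follows. The abstract does not say how the blank positions are computed from the canonical text and the secret key, so this sketch assumes one possible scheme: an HMAC-SHA256 digest whose bits decide which inter-word gaps receive an extra blank. The function names and the canonical form (single blanks between words) are likewise assumptions.

```python
import hashlib
import hmac


def canonical(text: str) -> str:
    """Canonical form assumed here: words separated by exactly one blank."""
    return " ".join(text.split())


def mark(text: str, key: bytes) -> str:
    """Insert an extra blank into the gaps selected by a keyed digest of the canonical text."""
    base = canonical(text)
    words = base.split(" ")
    digest = hmac.new(key, base.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i < len(words) - 1:
            # Bit i of the digest (cycling when the text has many gaps) selects the gap width.
            bit = (digest[(i // 8) % len(digest)] >> (i % 8)) & 1
            out.append("  " if bit else " ")
    return "".join(out)


def authenticate(received: str, key: bytes) -> bool:
    """Re-mark the received text and accept it only if it matches exactly."""
    expected = mark(received, key)
    return hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8"))


key = b"shared-secret"  # hypothetical key shared by sender and recipient
marked = mark("the quick brown fox jumps over the lazy dog", key)
print(authenticate(marked, key))                   # True
print(authenticate("the   quick brown fox", key))  # False: three blanks never occur in a marked text
```

With one bit per gap, a short text carries very little authentication information, so a sketch like this gives meaningful assurance only for texts with many words.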

Journal ArticleDOI
TL;DR: Analysis of the connected components extracted from the binary image of a document page is used to perform skew correction, segmentation, and classification of the document.
Abstract: Document image processing has become an increasingly important technology in the automation of office documentation tasks. Automatic document scanners such as text readers and OCR (Optical Character Recognition) systems are an essential component of systems capable of those tasks. One of the problems in this field is that the document to be read is not always placed correctly on a flat-bed scanner. This means that the document may be skewed on the scanner bed, resulting in a skewed image. This skew has a detrimental effect on document analysis, document understanding, and character segmentation and recognition. Consequently, detecting the skew of a document image and correcting it are important issues in realizing a practical document reader. This paper presents the use of analyzing the connected components extracted from the binary image of a document page. Such an analysis provides a lot of useful information, and will be used to perform skew correction, segmentation and classification of the document. Moreover, we describe two new algorithms — one for skew detection and one for skew correction. The new skew correction algorithm we propose has been shown to be fast and accurate, with run times averaging under 1.5 CPU seconds and 30 seconds real time to calculate the angle on a 5000/20 DEC workstation. Experiments on over 100 pages show that the method works well on a wide variety of layouts, including sparse textual regions, mixed fonts, multiple columns, and even for documents with a high graphical content.

Patent
16 Aug 2001
TL;DR: A transformation document generation mechanism (TDGM) automatically generates a transformation document, given a source document and a target document, by building a pattern dictionary for each document and generating pattern creation templates and copy templates from those dictionaries.
Abstract: A transformation document generation mechanism (TDGM) for automatically generating a transformation document given a source document and a target document is disclosed. The TDGM analyzes each document and builds a pattern dictionary for each that records the patterns found in that document. Thereafter, the TDGM processes the pattern dictionaries to automatically generate the transformation document. In doing so, the TDGM automatically generates pattern creation templates in the transformation document. These templates (when invoked by a transformation processor at a later time while processing a source document with the transformation document) will cause particular patterns to be created in a result document. In addition, the TDGM generates zero or more copy templates in the transformation document to copy identical elements, if any, from the source document to the result document. Once that is done, the transformation document is created and may be refined by a user. By performing much of the underlying document analysis for the user, and by generating an initial transformation document, the TDGM simplifies the transformation document creation process.

Journal ArticleDOI
TL;DR: A method for determining an algorithm's optimal tuning parameters and the correspondences between detected entities and ground truth is described, and a group of document layout analysis algorithms are evaluated.

Proceedings ArticleDOI
10 Sep 2001
TL;DR: A statistical model for a document understanding system, which uses both text attributes and document layouts, is described; probabilistic relaxation is used as the recognition scheme to find the hierarchical structure of the logical layout.
Abstract: This paper describes a statistical model for a document understanding system, which uses both text attributes and document layouts. Probabilistic relaxation is used as a recognition scheme to find the hierarchical structure of the logical layout. This approach, commonly used for pixels classification in image analysis, can be applied to classify text blocks into logical classes according to local compatibility with other neighboring blocks at different hierarchical levels. It provides a logical layout that is globally compatible with the training model. We have tested this approach on reading tables of contents of periodicals for documents indexing. Probabilistic relaxation has interesting properties like high-speed training and the 'a priori' recognition rate, which provides the consistency of the model according to the features used, and the samples chosen among the training set.

Proceedings ArticleDOI
10 Sep 2001
TL;DR: A new component based bottom-up algorithm is proposed with a novel homogeneity related definition of distance that maintains a dynamic minimal distance mechanism to decide the components merging sequence.
Abstract: The aim of the layout analysis is to extract the geometric structure from a document image. It is a progress of labeling homogenous regions of a document image. In order to present a complex newspaper layout analysis, this paper proposes a new component based bottom-up algorithm. With a novel homogeneity related definition of distance, it maintains a dynamic minimal distance mechanism to decide the components merging sequence. Under the restricting rules generated from the newspaper layout heuristically, we derive the preferred analysis result. Experimental results reveal the proposed approach is effective.

Patent
Matthew Hurst, Tetsuya Nasukawa
25 Jun 2001
TL;DR: In this paper, a document is input which is laid out using blanks or the like, then a symbol is acquired which is associated with a spatial coordinate of the document, and successive characters of the same type are extracted from the symbol to generate a token and a space.
Abstract: A technique for extracting a meaningful text block from a document where a table, an itemized list, a multiple column, etc., are arbitrarily laid out. A document is input which is laid out using blanks or the like, then a symbol is acquired which is associated with a spatial coordinate of the document. Consecutive characters of the same type are extracted from the symbol to generate a token and a space. A stream is generated from consecutive spaces in the column direction, while a text block is generated from streams and tokens. A link is generated between the text blocks to form a document graph. Validity of a connection (link) between the text blocks in the document graph is evaluated using a language model, then the text blocks are merged if the connection is valid.
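The token and space generation step described in this abstract can be sketched with `itertools.groupby`. The character classes used here (space, digit, alpha, other) and the tuple layout are assumptions for illustration; the abstract only says that consecutive characters of the same type are grouped and that symbols carry spatial coordinates. Stream, text block, and document graph construction are not covered.

```python
from itertools import groupby
from typing import List, Tuple


def char_type(ch: str) -> str:
    """Hypothetical character classes; the patent's actual types are not stated."""
    if ch == " ":
        return "space"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "other"


def tokenize_line(line: str, row: int) -> List[Tuple[str, str, int, int]]:
    """Return (type, text, row, start_column) for each run of same-type characters."""
    items = []
    col = 0
    for kind, group in groupby(line, key=char_type):
        text = "".join(group)
        items.append((kind, text, row, col))
        col += len(text)  # the next run starts where this one ended
    return items


items = tokenize_line("Item   42 kg", row=0)
for item in items:
    print(item)
# ('alpha', 'Item', 0, 0)
# ('space', '   ', 0, 4)
# ('digit', '42', 0, 7)
# ('space', ' ', 0, 9)
# ('alpha', 'kg', 0, 10)
```

Keeping the start column with each space run is what would later allow spaces on consecutive rows to be aligned in the column direction.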

Proceedings ArticleDOI
10 Sep 2001
TL;DR: The paper describes problems concerning the networking of digital document images, provides solutions for image compression, and presents a file format that enables editing and annotation; the work has been applied to 16th century books for the project DEBORA.
Abstract: Digital libraries create new services and open rare collections to a larger and wider audience. The development of online digital libraries in image mode is today limited by the narrow bandwidth of the network and the heavy storage requirements. Moreover, efficient networking of text content images requires specific compression schemes and particular file formats that describe the contents of the images and provide additional information such as document layout descriptions. The paper describes problems concerning the networking of digital document images and provides solutions for image compression. A file format, which enables editing and annotation, is presented. Our proposal mainly uses automatic document layout analysis algorithms, which constitutes a new field of application for this research area. This work, granted by the European community, has been applied to 16th century books for the project DEBORA.

Patent
Alex S. Taylor, Ercan E. Kuruoglu
19 Oct 2001
TL;DR: In this paper, a target document in a document processing system is annotated on the basis of annotations made previously to a source document, and the target document is searched to locate any of the keywords of interest to the user.
Abstract: A target document in a document processing system is annotated on the basis of annotations made previously to a source document. A source document (either a scanned image of a paper document or an electronic document) is annotated by a user to identify words or phrases of interest. The annotated words are extracted for use as keywords or phrases to search in future document. When a target document is processed, the target document is searched to locate any of the keywords of interest to the user. If any of the keywords are located, electronic annotations are applied to these in the target document for display or printing out and/or registered as keywords to the project. The automatically annotated words or phrases enable the user to locate regions of interest more quickly.

Patent
02 Nov 2001
TL;DR: A computer-based method of analyzing an electronic document in which common reference symbols designate text components and the respective graphics components: the document is processed into an index of the text and graphic locations of the reference symbols, and the symbols are linked so that selecting a text or graphic reference symbol displays the respective graphic or text segment that includes it.
Abstract: A computer system based method of analyzing an electronic document which document includes text and graphics and common reference symbols designate text components and respective graphics components the method comprising processing the document text and graphics into an index that identifies the text locations of reference symbols and graphic locations of reference symbols, and displaying (70) the text that includes at least some of the text reference symbols and/or displaying (68) at least some of the graphic reference symbols, and linking the common text and common graphic reference symbols such that user selection of a particular text reference symbol or graphic reference symbol causes display of a respective graphic segment or text segment that includes the selected common reference symbol. Other features include displaying a component list, selecting component identities to display graphic segments, using voice recognition for user control, and synthesized speech for audio text response.