Showing papers on "Document layout analysis" published in 2001

Patent•

System and method for locating on a physical document items referenced in another physical document

Fernando Incertis Carro¹•Institutions (1)

06 Aug 2001

TL;DR: A method and system are disclosed for creating hyperlinks from items (e.g., words, pictures, footnotes, symbols, icons) on a first physical document to particular points on a second physical document (manuscript or printed document), for activating these hyperlinks simply by touching the first document, and for highlighting, by means of a light-emitting source, the position of the items on the second document.

Abstract: The present invention generally relates to interactive hypermedia systems and more particularly to a method and system for locating on a physical document items referenced in another physical document. The present invention discloses a method and system for creating hyperlinks from items (e.g. words, pictures, foot notes, symbols, icons) on a first physical document to particular points on a second physical document (manuscript or printed document), for activating these hyperlinks simply by touching the first document, and for highlighting by means of a light emitting source, the position of the items on the second document. In a preferred embodiment, the present invention discloses a method and system for highlighting on a hard-copy map the geographic positions of places referenced in a hard-copy document.

150 citations

Patent•

Systems and methods for generating intellectual property

Bao Tran

24 Feb 2001

TL;DR: A document drafting system generates a document having a required sequence. It includes one or more screens to receive text input; an electronic agent that reviews the text input, applies one or more diagnostic rules to it, and generates zero or more diagnostic messages; a visualization system coupled to one screen that graphically displays one or more relationships between text portions in the document; and a document generator that assembles the text input into the document according to the sequence.

Abstract: A document drafting system generates a document having a required sequence, including one or more screens to receive text input; an electronic agent to review the text input, to apply one or more diagnostic rules to the text input and to generate zero or more diagnostic messages; a visualization system coupled to one screen to graphically display one or more relationships between one or more text portions in the document; and a document generator to assemble the text input into the document according to the sequence.

147 citations

Journal Article•DOI•

Transforming paper documents into XML format with WISDOM++

O. Altamura, Floriana Esposito, Donato Malerba

01 Aug 2001-International Journal on Document Analysis and Recognition

TL;DR: The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats.

Abstract: The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.

129 citations

Journal Article•DOI•

Towards constructive text, diagram, and layout generation for information presentation

John A. Bateman¹, Jörg Kleinz, Thomas Kamps, Klaus Reichenberger•Institutions (1)

University of Bremen¹

01 Sep 2001-Computational Linguistics

TL;DR: It is demonstrated that layout offers a rich resource for achieving presentational coherence, alongside more traditional resources such as text-formatting and the text-internal marking of discourse connections, and an integrated approach to layout, text, and diagram generation is introduced.

Abstract: Combining elements appropriately within a coherent page layout is a well-recognized and crucial aspect of sophisticated information presentation. The precise function and nature of layout has not, however, been sufficiently addressed within computational approaches; attention is often restricted to relatively local issues of typography and text-formatting, leaving broader issues of layout unaddressed. In this paper we focus on the selection and function of layout in pages that appropriately combine textual and graphical representation styles to yield coherent presentation designs. We demonstrate that layout offers a rich resource for achieving presentational coherence, alongside more traditional resources such as text-formatting and the text-internal marking of discourse connections. We also introduce an integrated approach to layout, text, and diagram generation. Our approach is developed on the basis of a preliminary empirical investigation of professionally produced layouts, followed by implementation within a prototype information system in the area of art history.

119 citations

Journal Article•DOI•

Classification of document pages using structure-based features

Christian K. Shin¹, David Doermann¹, Azriel Rosenfeld¹•Institutions (1)

University of Maryland, College Park¹

01 May 2001-International Journal on Document Analysis and Recognition

TL;DR: The approach to classification is based on “visual similarity” of layout structure and is implemented by building a supervised classifier, given examples of each class, using decision tree classifiers and self-organizing maps.

Abstract: Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify it by type in the absence of domain-specific models. Our approach to classification is based on “visual similarity” of layout structure and is implemented by building a supervised classifier, given examples of each class. We use image features such as percentages of text and non-text (graphics, images, tables, and rulings) content regions, column structures, relative point sizes of fonts, density of content area, and statistics of features of connected components which can be derived without class knowledge. In order to obtain class labels for training samples, we conducted a study where subjects ranked document pages with respect to their resemblance to representative page images. Class labels can also be assigned based on known document types, or can be defined by the user. We implemented our classification scheme using decision tree classifiers and self-organizing maps.

97 citations

Proceedings Article•DOI•

Structure in on-line documents

K. Jain¹, Anoop M. Namboodiri, J. Subrahmonia•Institutions (1)

Michigan State University¹

10 Sep 2001

TL;DR: A hierarchical approach for extracting homogeneous regions in on-line documents is presented and the problem of identifying and processing ruled and unruled tables, text and drawings is addressed.

Abstract: We present a hierarchical approach for extracting homogeneous regions in on-line documents. The problem of identifying and processing ruled and unruled tables, text and drawings is addressed. The on-line document is first segmented into regions with only text strokes and regions with both text and non-text strokes. The text region is further classified as unruled table or plain text. Stroke clustering is used to segment the non-text regions. Each nontext segment is then classified as drawing, ruled table or underlined keyword using stroke properties. The individual regions are processed and the results are assembled to identify the structure of the on-line document.

86 citations

Patent•

Method and apparatus for searching and displaying structured document

Takuya Okamoto¹, Toru Takahashi¹, Yuki Aoyama¹, Noriyuki Yamasaki¹, Eiko Murata¹•Institutions (1)

Hitachi¹

28 Sep 2001

TL;DR: A method and an apparatus for searching and displaying a structured document are disclosed: registration generates an analyzed structured document and information for document search, each stored in its own database, and search results are displayed with highlight information added to the original structured document.

Abstract: A method and an apparatus for searching and displaying a structured document are disclosed. The process for document registration is executed with a structured document of a file as an input. An analyzed structured document and information for document search are generated, and are stored in databases, respectively. A query input from an input/output unit is analyzed, a document search index is read and a search process is executed. Matching document identifier information and matching string position information are output as the result of the search. In the display process, a corresponding analyzed structured document is read from the database based on the document identifier information matched in a document read process. In processing a document display, the matching information is embedded in the structured document based on the matching string position information, and a structured document for display with highlight information added thereto is generated and displayed. The search is run on a version of the document from which element information that would obstruct the search has been removed, and the result of the search is displayed with highlight information added to the original structured document.

83 citations

Patent•

Detecting and utilizing add-on information from a scanned document image

Yue Ma¹, Jinhong Katherine Guo¹•Institutions (1)

Panasonic¹

31 Jan 2001

TL;DR: A handwriting detection method discriminates printed text lines from handwritten annotations in a scanned document image by comparing merged text lines, obtained through connected component analysis, with the regular pattern found in a projection histogram.

Abstract: A scanned document image, including add-on information such as handwritten annotations in addition to printed text lines, is processed by a handwriting detection method. First, at least one projection histogram is generated from the scanned document image. A regular pattern that correlates to the printed text lines is determined from the projection histogram. Second, connected component analysis is applied to the scanned document image to generate at least one merged text line. Each merged text line relates to at least one of the handwritten annotation and the printed text line. By comparing the merged text lines to the regular pattern of the projection histograms, the printed text lines are discriminated from the handwritten annotations.
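
The projection-histogram step described in this abstract can be illustrated with a minimal Python sketch. It assumes a binary image stored as a list of rows (1 = ink), and it uses band height relative to the median as a stand-in for the "regular pattern"; the function names, the `tolerance` parameter, and that height heuristic are illustrative assumptions, not the patent's actual procedure, which compares connected-component text lines against the histogram pattern.

```python
def horizontal_projection(image):
    """Count foreground (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]


def line_bands(histogram, threshold=0):
    """Return (first_row, last_row) pairs where the histogram exceeds threshold."""
    bands, start = [], None
    for y, count in enumerate(histogram):
        if count > threshold and start is None:
            start = y
        elif count <= threshold and start is not None:
            bands.append((start, y - 1))
            start = None
    if start is not None:
        bands.append((start, len(histogram) - 1))
    return bands


def split_by_regular_height(bands, tolerance=1):
    """Assumed heuristic: bands close to the median height count as 'regular'."""
    if not bands:
        return [], []
    heights = sorted(last - first + 1 for first, last in bands)
    median = heights[len(heights) // 2]
    regular = [b for b in bands if abs((b[1] - b[0] + 1) - median) <= tolerance]
    irregular = [b for b in bands if b not in regular]
    return regular, irregular


# Toy page: two printed lines (2 rows tall) and one taller, irregular scrawl.
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
]
bands = line_bands(horizontal_projection(page))
printed, other = split_by_regular_height(bands)
```

On this toy page the two short bands are kept as regular and the tall band is set aside as a candidate annotation.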

79 citations

Journal Article•DOI•

Parameter-free geometric document layout analysis

Seong-Whan Lee¹, Dae-Seok Ryu•Institutions (1)

Korea University¹

01 Nov 2001-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A parameter-free method for segmenting document images into maximal homogeneous regions and identifying them as texts, images, tables, and ruling lines is proposed, and tests show that it provides more accurate results than previous methods.

Abstract: Automatic transformation of paper documents into electronic documents requires geometric document layout analysis at the first stage. However, variations in character font sizes, text line spacing, and document layout structures have made it difficult to design a general-purpose document layout analysis algorithm for many years. The use of some parameters has therefore been unavoidable in previous methods. The authors propose a parameter-free method for segmenting the document images into maximal homogeneous regions and identifying them as texts, images, tables, and ruling lines. A pyramidal quadtree structure is constructed for multiscale analysis and a periodicity measure is suggested to find a periodical attribute of text regions for page segmentation. To obtain robust page segmentation results, a confirmation procedure using texture analysis is applied to only ambiguous regions. Based on the proposed periodicity measure, multiscale analysis, and confirmation procedure, we could develop a robust method for geometric document layout analysis independent of character font sizes, text line spacing, and document layout structures. The proposed method was tested on the document database from the University of Washington and the MediaTeam Document Database. The results of these tests have shown that the proposed method provides more accurate results than previous ones.

77 citations

Proceedings Article•DOI•

Techniques for language identification for hybrid Arabic-English document images

Ahmed Elgammal¹, M.A. Ismail²•Institutions (2)

University of Maryland, College Park¹, Alexandria University²

01 Jan 2001

TL;DR: In this paper, three efficient techniques that can be used to discriminate between text written in Arabic script and text written in English script are presented and evaluated.

Abstract: Because of the different characteristics of Arabic language and Romance and Anglo Saxon languages, recognition of documents written in hybrids of these languages requires that the language of the text is to be identified prior to the recognition phase. In this paper, three efficient techniques that can be used to discriminate between text written in Arabic script and text written in English script are presented and evaluated. These techniques address the language identification problem on the word level and on text level. The characteristics of horizontal projection profiles as well as runlength histograms for text written in both languages are the basic features underlying these techniques. Solving this problem is very important in building bilingual document image analysis systems which are capable of processing documents containing hybrid Arabic/Romance and Anglo Saxon languages.
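
The abstract names run-length histograms as one of the basic features. As a rough sketch of how such a feature might be computed, the Python below counts horizontal runs of foreground pixels in a binary image stored as a list of rows. The function names and the mean-run-length summary are illustrative assumptions; the paper's actual decision rules are not given in the abstract and are not reproduced here.

```python
from collections import Counter


def horizontal_run_lengths(image):
    """Histogram mapping run length -> number of horizontal foreground runs."""
    histogram = Counter()
    for row in image:
        run = 0
        for pixel in row:
            if pixel:
                run += 1
            elif run:
                histogram[run] += 1
                run = 0
        if run:  # a run that touches the right edge of the image
            histogram[run] += 1
    return histogram


def mean_run_length(histogram):
    """Average run length, a simple scalar summary of the histogram."""
    total_runs = sum(histogram.values())
    if total_runs == 0:
        return 0.0
    return sum(length * count for length, count in histogram.items()) / total_runs


word = [
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
runs = horizontal_run_lengths(word)
```

A classifier would compare such histograms (and the row-wise projection profile) between word images of the two scripts; which thresholds separate them is left open here.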

72 citations

Proceedings Article•DOI•

Text extraction from gray scale document images using edge information

Q. Yuan¹, Chew Lim Tan¹•Institutions (1)

National University of Singapore¹

01 Sep 2001

TL;DR: A method is presented that uses edge information to extract textual blocks from gray scale document images, detecting textual regions in heavily noise-infected newspaper images and separating them from graphical regions.

Abstract: In this paper we present a well designed method that makes use of edge information to extract textual blocks from gray scale document images. It aims at detecting textual regions on heavily noise-infected newspaper images and separating them from graphical regions. The algorithm traces the feature points in different entities and then groups those edge points of textual regions. Using line approximation and layout categorization, it can successfully retrieve directionally placed text blocks. Finally, feature based connected component merging is introduced to gather homogeneous textual regions together within the scope of their bounding rectangles. We can obtain correct page decomposition with efficient computation and reduced memory size by handling line segments instead of small pixels. The proposed method has been tested on a large group of newspaper images with multiple page layouts, and promising results confirmed the effectiveness of our method.

Patent•

System and method for hierarchical segmentation with latent semantic indexing in scale space

James H. Kaufman¹, Dulce B. Ponceleon¹, Malcolm Slaney¹•Institutions (1)

IBM¹

28 Dec 2001

TL;DR: A system and method automatically generate a hierarchical table of contents or outline for indexing a document and identify clusters of related information in it, combining latent semantic indexing with scale space segmentation to find related blocks and topic changes of various sizes.

Abstract: A system and method for automatically generating a hierarchical table of contents or outline for indexing a document and identifying clusters of related information in the document. The document may comprise text, audio, video, or a multimedia presentation. The invention employs a unique and novel combination of latent semantic indexing techniques to identify related blocks and major topic changes within the document with scale space segmentation techniques to respectively identify self-similar blocks within the document and to thus find topic changes of various sizes at block edges. The invention then produces a visual presentation of the semantic structure of the document.

Proceedings Article•DOI•

Extraction of text areas in printed document images

Jean Duong¹, Myriam Côte¹, Hubert Emptoz², Ching Y. Suen³•Institutions (3)

École de technologie supérieure¹, Institut national des sciences appliquées², Concordia University³

09 Nov 2001

TL;DR: A document analysis system which is expected to extract regions of interest in greyscale document images using geometric and texture features and some entropic heuristic is presented.

Abstract: In this paper, we present a document analysis system which is expected to extract regions of interest in greyscale document images. Collected areas are then clustered into text zones and non-text areas using geometric and texture features. The system works in two steps. Regions of interest are retrieved via cumulative gradient considerations. In the classification module, we introduce an entropic heuristic. Experiments are done on the MediaTeam Document Database to show the relevance of this criterion.

Proceedings Article•DOI•

Fine-grained document genre classification using first order random graphs

Andrew D. Bagdanov¹, Marcel Worring¹•Institutions (1)

University of Amsterdam¹

10 Sep 2001

TL;DR: This paper develops a FORG-based genre classification method for machine-printed documents and presents a comparative evaluation between the technique and a variety of statistical pattern classifiers.

Abstract: We approach the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, as document content features are similar. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our method uses the attributed relational graphs (ARGs) to represent the layout structure of document instances, and the first order random graphs (FORGs) to represent document genres. In this paper we develop our FORG-based genre classification method and present a comparative evaluation between our technique and a variety of statistical pattern classifiers. FORGs are capable of modeling common layout structure within a document genre and are shown to significantly outperform traditional pattern classification techniques when fine-grained genre distinctions must be drawn.

Patent•

A system and method for creating customized documents for cross media publishing

Jacob Aizikowitz, Israel Roth, Reuven Sherwin

05 Feb 2001

TL;DR: A dynamic document and a method for representing it are provided; the dynamic document includes a template with a logic section and a layout section, and the layout section has at least one layout object.

Abstract: A dynamic document and a method for representing a dynamic document are provided. The dynamic document includes a dynamic document template and an instances set bound to the dynamic document template (32). The instances set (34) includes a plurality of pointers to a plurality of data sources. The dynamic document template includes a logic section (36) and a layout section (38). The layout section has at least one layout object.

Journal Article•DOI•

Rule-based document structure understanding with a fuzzy combination of layout and textual features

Stefan Klink, Thomas Kieninger¹•Institutions (1)

German Research Centre for Artificial Intelligence¹

01 Aug 2001-International Journal on Document Analysis and Recognition

TL;DR: This paper presents a hybrid and comprehensive approach to document structure analysis that makes use of layout as well as textual features of a given document, which allows an easy adaptation to specific domains with their specific logical objects.

Abstract: Document image processing is a crucial process in office automation and begins at the ‘OCR’ phase with difficulties in document ‘analysis’ and ‘understanding’. This paper presents a hybrid and comprehensive approach to document structure analysis. It is hybrid in the sense that it makes use of layout (geometrical) as well as textual features of a given document. These features are the base for potential conditions, which in turn are used to express fuzzy matched rules of an underlying rule base. Rules can be formulated based on features which might be observed within one specific layout object. However, rules can also express dependencies between different layout objects. In addition to its rule driven analysis, which allows an easy adaptation to specific domains with their specific logical objects, the system contains domain-independent markup algorithms for common objects (e.g., lists).

Journal Article•DOI•

Document image template matching based on component block list

Hanchuan Peng¹,²,³, Fuhui Long¹, Zheru Chi¹, Wan-Chi Siu¹•Institutions (3)

Hong Kong Polytechnic University¹, Southeast University², Johns Hopkins University³

01 Jul 2001-Pattern Recognition Letters

TL;DR: A new matching method based on document component block list (CBL) is proposed that can effectively make use of the local information of each page component block and the global information of document page layout.

Patent•

Recognizer of text-based work

Ramarathnam Venkatesan¹, Michael T. Malkin¹•Institutions (1)

Microsoft¹

24 Apr 2001

TL;DR: A technology for recognizing the content of text documents computes one or more hash values for a document (or, alternatively, a “sifted text” version of it) and compares them across documents, either to detect copied content or to group documents into categories.

Abstract: Described herein is a technology for recognizing the content of text documents. The technology determines one or more hash values for the content of a text document. Alternatively, the technology may generate a “sifted text” version of a document. In one implementation described herein, document recognition is used to determine whether the content of one document is copied (i.e., plagiarized) from another document. This is done by comparing hash values of documents (or alternatively their sifted text). In another implementation described herein, document recognition is used to categorize the content of a document so that it may be grouped with other documents in the same category. This abstract itself is not intended to limit the scope of this patent. The scope of the present invention is pointed out in the appending claims.
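
The idea of comparing documents through hash values can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the abstract does not say how "sifted text" is produced or which hash is used, so `sift` simply keeps lowercase alphanumeric words, hashes are SHA-256 digests of overlapping `k`-word windows, and similarity is the Jaccard ratio of the two hash sets.

```python
import hashlib
import re


def sift(text):
    """Hypothetical 'sifted text': lowercase alphanumeric words only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def window_hashes(text, k=3):
    """Set of SHA-256 hex digests, one per overlapping k-word window."""
    words = sift(text)
    if not words:
        return set()
    if len(words) < k:
        windows = [" ".join(words)]
    else:
        windows = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {hashlib.sha256(w.encode("utf-8")).hexdigest() for w in windows}


def overlap(doc_a, doc_b, k=3):
    """Jaccard similarity of the two documents' hash sets (0.0 to 1.0)."""
    hashes_a, hashes_b = window_hashes(doc_a, k), window_hashes(doc_b, k)
    if not hashes_a or not hashes_b:
        return 0.0
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)


same = overlap("The quick brown fox.", "the QUICK, brown fox!")
partial = overlap("the quick brown fox jumps", "the quick brown fox sleeps")
```

A high overlap would flag possible copying, while grouping documents by shared hashes is one conceivable route to the categorization use mentioned in the abstract.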

Patent•

Defining form formats with layout items that present data of business application

Wolfgang Otter, Wolfgang Weiss, Adrian Alexander, Vladislav Bezrukov, Claudia Binder, Andreas Deutesfeld, Thomas Goring, Rainer Hoch, Christoph Wachter

01 Jun 2001

TL;DR: Systems and methods define a form from a plurality of layout items using tree, property, and layout views, and verify that the layout items and processing order are compatible with a predefined data interface of the business application.

Abstract: Systems and methods for defining a form with a plurality of layout items for data presentation by a business application provide a tree view with tree nodes to represent the layout items, wherein the view visualizes structure information, a processing order, a selected tree node to represent a selected layout item; provide a property view to display properties of the selected layout item; provide a layout view to display items, wherein the selected layout item is highlighted; modify the selected layout item and the processing order through interaction with a user; and create a form definition document. The compatibility of layout items and processing order with a predefined data interface of the business application is verified as well.

Patent•

Document automated dividing device

Hirano Takashi, Yasuhiro Okada

18 Apr 2001

TL;DR: A document automated dividing device is proposed that can omit partition sheets and automatically determine document breaks even when various documents with unclear formats are input.

Abstract: PROBLEM TO BE SOLVED: To provide a document automated dividing device capable of omitting a partition sheet and automatically determining a break of a document even when various documents having unclear formats are input. SOLUTION: This document automated dividing device is provided with an image reading means 101 reading a plurality of documents and forming a document image, a document image storing buffer 102 storing the read document image, a letter identifying means 103 identifying a letter in the document image, a document dividing information extracting means 104 extracting document dividing information for determining a break of the document from an analysis result and a letter identification result of the document image, a document break determination means 105 determining the break of the document on the basis of the document dividing information, a document break possible selection means 106 displaying the document break determination result by the document break determination means to an operator for correction and confirmation, and a document management system registering means 107 dividing the document image into document units for registration in a document management system.

Patent•

Method and system of marking a text document with a pattern of extra blanks for authentication

Fernando Incertis Carro¹, Stephen M. Matyas¹•Institutions (1)

IBM¹

04 Jan 2001

TL;DR: A text document is marked for authentication by inserting extra inter-word blank characters at positions computed from a canonical form of the text and a secret key, so that the information needed to authenticate the document is merged into the body of the document itself.

Abstract: The invention discloses how a text document can be marked through the insertion of inter-word blank characters for the purpose of becoming authenticatable. First, text to be marked is edited so as to obtain a canonical form of it conforming to a model. Then, from this canonical form of the text and a secret key used as inputs, a unique combination of inter-word blank character positions is computed, in which extra blanks are inserted, thus obtaining a marked text document. Authentication of a received marked text document is performed by a recipient sharing the secret key, who further compares the received text document to the marked text document: if they match exactly, the received text document is accepted as authentic; if not, it is rejected as fake. The invention allows the information necessary to authenticate a text document to be merged into the body of the document itself, which works on soft-copy as well as hard-copy text documents.
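
The marking and checking steps in this abstract can be sketched in Python under some loudly stated assumptions: the canonical form is taken to be "words separated by exactly one space", the keyed computation is an HMAC-SHA256 over that canonical text, and each inter-word gap gets one extra blank when the corresponding digest bit is 1. None of these choices comes from the patent; `mark` and `verify` are hypothetical names.

```python
import hashlib
import hmac


def canonical(text):
    """Assumed canonical form: words separated by exactly one space."""
    return " ".join(text.split())


def mark(text, key):
    """Insert one extra blank in each gap whose keyed digest bit is 1."""
    words = text.split()
    message = " ".join(words).encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    pieces = []
    for gap, word in enumerate(words[:-1]):
        # Digest bits are reused cyclically once the text has more than 256 gaps.
        bit = (digest[(gap // 8) % len(digest)] >> (gap % 8)) & 1
        pieces.append(word + " " * (1 + bit))
    pieces.extend(words[-1:])
    return "".join(pieces)


def verify(received, key):
    """Recompute the marking from the received text and compare exactly."""
    return received == mark(received, key)


key = b"shared secret"
original = "The  invention discloses how a text document can be marked"
marked = mark(original, key)
```

Because the recipient re-derives the blank pattern from the canonical text and the shared key, any change to the words or to the spacing is expected to break the exact match, although with very few gaps a chance collision of patterns remains possible.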

Journal Article•DOI•

Page segmentation and classification utilizing bottom-up approach

Adnan Amin¹, Ricky Shiu¹•Institutions (1)

University of New South Wales¹

01 Apr 2001-International Journal of Image and Graphics

TL;DR: Analysis of the connected components extracted from the binary image of a document page is used to perform skew correction, segmentation, and classification of the document.

Abstract: Document image processing has become an increasingly important technology in the automation of office documentation tasks. Automatic document scanners such as text readers and OCR (Optical Character Recognition) systems are an essential component of systems capable of those tasks. One of the problems in this field is that the document to be read is not always placed correctly on a flat-bed scanner. This means that the document may be skewed on the scanner bed, resulting in a skewed image. This skew has a detrimental effect on document analysis, document understanding, and character segmentation and recognition. Consequently, detecting the skew of a document image and correcting it are important issues in realizing a practical document reader. This paper presents the use of analyzing the connected components extracted from the binary image of a document page. Such an analysis provides a lot of useful information, and will be used to perform skew correction, segmentation and classification of the document. Moreover, we describe two new algorithms — one for skew detection and one for skew correction. The new skew correction algorithm we propose has been shown to be fast and accurate, with run times averaging under 1.5 CPU seconds and 30 seconds real time to calculate the angle on a 5000/20 DEC workstation. Experiments on over 100 pages show that the method works well on a wide variety of layouts, including sparse textual regions, mixed fonts, multiple columns, and even for documents with a high graphical content.
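
Connected component extraction, the starting point named in this abstract, can be sketched in Python as a breadth-first flood fill over a binary image stored as a list of rows. The choice of 8-connectivity and the bounding-box output format are assumptions for illustration; the paper's skew detection and correction algorithms are not described in the abstract and are not reproduced here.

```python
from collections import deque


def connected_components(image):
    """Bounding boxes (top, left, bottom, right) of 8-connected foreground blobs."""
    height = len(image)
    width = len(image[0]) if image else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                # Visit all eight neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


page = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
boxes = connected_components(page)
```

Statistics over such boxes (sizes, positions, alignment of neighbours) are the kind of information later stages could use for skew estimation and block classification.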

Patent•

Enhanced mechanism for automatically generating a transformation document

Matthew D. Birder¹•Institutions (1)

Sun Microsystems¹

16 Aug 2001

TL;DR: In this article, a transformation document generation mechanism (TDGM) is proposed to automatically generate transformation documents given a source document and a target document, where the transformation document is generated by generating templates from the source document to the target document.

Abstract: A transformation document generation mechanism (TDGM) for automatically generating a transformation document given a source document and a target document is disclosed. The TDGM analyzes each document and builds a pattern dictionary for each that records the patterns found in that document. Thereafter, the TDGM processes the pattern dictionaries to automatically generate the transformation document. In doing so, the TDGM automatically generates pattern creation templates in the transformation document. These templates (when invoked by a transformation processor at a later time while processing a source document with the transformation document) will cause particular patterns to be created in a result document. In addition, the TDGM generates zero or more copy templates in the transformation document to copy identical elements, if any, from the source document to the result document. Once that is done, the transformation document is created and may be refined by a user. By performing much of the underlying document analysis for the user, and by generating an initial transformation document, the TDGM simplifies the transformation document creation process.

Journal Article•DOI•

Performance Evaluation of Document Structure Extraction Algorithms

Jisheng Liang, I.T. Phillips¹, Robert M. Haralick²•Institutions (2)

Queens College¹, City University of New York²

01 Oct 2001-Computer Vision and Image Understanding

TL;DR: A method for determining an algorithm's optimal tuning parameters and the correspondences between detected entities and ground truth is described, and a group of document layout analysis algorithms are evaluated.

Proceedings Article•DOI•

Document understanding using probabilistic relaxation: application on tables of contents of periodicals

F. Le Bourgeois¹, Hubert Emptoz, S.S. Bensafi•Institutions (1)

Institut national des sciences Appliquées de Lyon¹

10 Sep 2001

TL;DR: A statistical model for a document understanding system that uses both text attributes and document layouts is described; probabilistic relaxation classifies text blocks into logical classes, and the approach is tested on tables of contents of periodicals.

Abstract: This paper describes a statistical model for a document understanding system, which uses both text attributes and document layouts. Probabilistic relaxation is used as a recognition scheme to find the hierarchical structure of the logical layout. This approach, commonly used for pixel classification in image analysis, can be applied to classify text blocks into logical classes according to local compatibility with other neighboring blocks at different hierarchical levels. It provides a logical layout that is globally compatible with the training model. We have tested this approach on reading tables of contents of periodicals for document indexing. Probabilistic relaxation has interesting properties such as high-speed training and an 'a priori' recognition rate, which gives the consistency of the model according to the features used and the samples chosen from the training set.
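
To make the idea of probabilistic relaxation concrete, here is a minimal sketch of a classic relaxation labeling update applied to text blocks. It is not the paper's model: the update rule, the function name `relax`, the label names and the compatibility values are all illustrative assumptions, and the hierarchical levels mentioned in the abstract are not modeled.

```python
def relax(probs, neighbors, compat, iterations=10):
    """Generic relaxation labeling (illustrative, not the paper's model).

    probs:     one dict per block, mapping label -> probability
    neighbors: neighbors[i] lists the indices of blocks adjacent to block i
    compat:    (label, neighbor_label) -> compatibility in [-1, 1]
    """
    for _ in range(iterations):
        updated = []
        for i, p_i in enumerate(probs):
            raw = {}
            for label, p in p_i.items():
                # Average support that the neighbors lend to this label.
                support = sum(
                    compat.get((label, other), 0.0) * p_other
                    for j in neighbors[i]
                    for other, p_other in probs[j].items()
                ) / max(len(neighbors[i]), 1)
                raw[label] = p * (1.0 + support)
            norm = sum(raw.values())
            # Keep the old distribution if every label was fully vetoed.
            updated.append(
                {k: v / norm for k, v in raw.items()} if norm > 0 else dict(p_i)
            )
        probs = updated  # synchronous update: all blocks change together
    return probs


# Hypothetical example: a confident "title" block next to an ambiguous one,
# with compatibilities saying that titles and authors tend to alternate.
compat = {
    ("title", "author"): 1.0, ("author", "title"): 1.0,
    ("title", "title"): -1.0, ("author", "author"): -1.0,
}
start = [{"title": 0.9, "author": 0.1}, {"title": 0.5, "author": 0.5}]
result = relax(start, neighbors=[[1], [0]], compat=compat, iterations=5)
```

In this toy setup the ambiguous second block is pulled toward "author" because that label is the one most compatible with its confident neighbor.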

Proceedings Article•DOI•

A new component based algorithm for newspaper layout analysis

Fei Liu¹, Yupin Luo¹, M. Yoshikawa², Dongcheng Hu¹•Institutions (2)

Tsinghua University¹, Brother Industries²

10 Sep 2001

TL;DR: A new component-based bottom-up algorithm is proposed with a novel homogeneity-related definition of distance; it maintains a dynamic minimal-distance mechanism to decide the component merging sequence.

Abstract: The aim of layout analysis is to extract the geometric structure from a document image. It is a process of labeling homogeneous regions of a document image. To perform layout analysis on complex newspapers, this paper proposes a new component-based bottom-up algorithm. With a novel homogeneity-related definition of distance, it maintains a dynamic minimal-distance mechanism to decide the component merging sequence. Under restricting rules derived heuristically from newspaper layouts, we derive the preferred analysis result. Experimental results reveal that the proposed approach is effective.
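
The bottom-up merging described above can be illustrated with a small sketch that repeatedly merges the two closest components. The paper's homogeneity-related distance and its newspaper-specific restricting rules are not given in the abstract, so this sketch substitutes a plain bounding-box gap and a single `max_gap` threshold; both, like the function names, are assumptions.

```python
def box_gap(a, b):
    """Gap between two boxes (x0, y0, x1, y1); 0 if they touch or overlap.
    Stand-in for the paper's homogeneity-related distance (an assumption)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def merge_components(boxes, max_gap):
    """Greedy bottom-up merging: always merge the currently closest pair,
    and stop once the minimal distance exceeds max_gap."""
    boxes = list(boxes)
    while len(boxes) > 1:
        gap, i, j = min(
            (box_gap(boxes[i], boxes[j]), i, j)
            for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
        )
        if gap > max_gap:
            break
        a, b = boxes[i], boxes[j]
        merged = (min(a[0], b[0]), min(a[1], b[1]),
                  max(a[2], b[2]), max(a[3], b[3]))
        # Replace the pair with its union; distances are recomputed next pass.
        boxes = [box for k, box in enumerate(boxes) if k not in (i, j)]
        boxes.append(merged)
    return boxes


# Two nearby word boxes merge into one region; the distant box stays separate.
regions = merge_components(
    [(0, 0, 10, 10), (12, 0, 20, 10), (100, 0, 110, 10)], max_gap=5
)
```

Recomputing all pairwise distances on every pass keeps the sketch short; a dynamic minimal-distance structure such as a priority queue would avoid that cost on real pages.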

Patent•

Document processing method, system and medium

Matthew Hurst¹, Tetsuya Nasukawa¹•Institutions (1)

IBM¹

25 Jun 2001

TL;DR: In this paper, a document is input which is laid out using blanks or the like, then a symbol is acquired which is associated with a spatial coordinate of the document, and successive characters of the same type are extracted from the symbol to generate a token and a space.

Abstract: A technique for extracting a meaningful text block from a document where a table, an itemized list, multiple columns, etc., are arbitrarily laid out. A document is input which is laid out using blanks or the like, then a symbol is acquired which is associated with a spatial coordinate of the document. Consecutive characters of the same type are extracted from the symbol to generate a token and a space. A stream is generated from consecutive spaces in the column direction, while a text block is generated from streams and tokens. A link is generated between the text blocks to form a document graph. Validity of a connection (link) between the text blocks in the document graph is evaluated using a language model, then the text blocks are merged if the connection is valid.
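
The first step described above, grouping consecutive characters of the same type into tokens and spaces that keep their coordinates, might look roughly like the sketch below. It assumes that "type" simply means blank versus non-blank and that coordinates are (row, column) character positions; the function name `tokens_and_spaces` is hypothetical, and the later stream, graph and language-model steps are not shown.

```python
from itertools import groupby


def tokens_and_spaces(line, row=0):
    """Split one line of a blank-aligned document into runs of blank and
    non-blank characters, keeping each run's (row, column) start position.
    Treating 'type' as blank vs. non-blank is an assumption of this sketch."""
    items = []
    column = 0
    for is_blank, run in groupby(line, key=str.isspace):
        text = "".join(run)
        items.append(("space" if is_blank else "token", text, row, column))
        column += len(text)
    return items


# A two-column row laid out with blanks.
parsed = tokens_and_spaces("Name   Qty", row=3)
```

Keeping the column of every space run is what would later allow vertically aligned spaces on neighboring rows to be joined into a column-direction stream.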

Proceedings Article•DOI•

Networking digital document images

F. Le Bourgeois¹, Hubert Emptoz, Eric Trinh, Jean Duong•Institutions (1)

Vision Institute¹

10 Sep 2001

TL;DR: The paper describes problems concerning the networking of digital document images, provides solutions for image compression, and presents a file format that enables editing and annotation; the work has been applied to 16th century books for the project DEBORA.

Abstract: Digital libraries create new services and open rare collections to a larger and wider audience. The development of online digital libraries in image mode is today limited by the narrow bandwidth of the network and the heavy storage requirements. Moreover, efficient networking of text content images requires specific compression schemes and particular file formats that describe the contents of the images and provide additional information such as document layout descriptions. The paper describes problems concerning the networking of digital document images and provides solutions for image compression. A file format, which enables editing and annotation, is presented. Our proposal mainly uses automatic document layout analysis algorithms, which constitutes a new field of application for this research area. This work, granted by the European community, has been applied to 16th century books for the project DEBORA.

Patent•

Method and apparatus for forward annotating documents

Alex S. Taylor¹, Ercan E. Kuruoglu¹•Institutions (1)

Xerox¹

19 Oct 2001

TL;DR: In this paper, a target document in a document processing system is annotated on the basis of annotations made previously to a source document, and the target document is searched to locate any of the keywords of interest to the user.

Abstract: A target document in a document processing system is annotated on the basis of annotations made previously to a source document. A source document (either a scanned image of a paper document or an electronic document) is annotated by a user to identify words or phrases of interest. The annotated words are extracted for use as keywords or phrases to search for in future documents. When a target document is processed, the target document is searched to locate any of the keywords of interest to the user. If any of the keywords are located, electronic annotations are applied to these in the target document for display or printing out and/or registered as keywords to the project. The automatically annotated words or phrases enable the user to locate regions of interest more quickly.
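
As a rough illustration of the search step, the sketch below locates previously extracted keywords in the text of a target document and returns the spans that an annotator could then highlight. Case-insensitive whole-word matching on plain text is an assumption of this sketch (the abstract also covers scanned images), and the function name `forward_annotate` is hypothetical.

```python
import re


def forward_annotate(target_text, keywords):
    """Return sorted (start, end, keyword) spans where a keyword occurs in
    the target text. Matching is case-insensitive and whole-word, which is
    an assumption of this sketch rather than something the abstract states."""
    spans = []
    for keyword in keywords:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        spans.extend(
            (m.start(), m.end(), keyword) for m in pattern.finditer(target_text)
        )
    return sorted(spans)


# Keywords taken from annotations on an earlier (source) document.
text = "Skew detection precedes layout analysis. Layout matters."
spans = forward_annotate(text, ["layout", "skew detection"])
```

Each returned span can be mapped back to the target document for display or printing, for example by wrapping `text[start:end]` in a highlight.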

Patent•

Computer based integrated text and graphic document analysis

Leonid Batchilo, Valery Tsourikov¹, Edward Dreyfus•Institutions (1)

IHS Inc.¹

02 Nov 2001

TL;DR: A computer-based method analyzes an electronic document containing text and graphics that share common reference symbols: it indexes the text and graphic locations of those symbols and links them, so that selecting a reference symbol in the text or in the graphics displays the corresponding graphic or text segment.

Abstract: A computer-system-based method of analyzing an electronic document that includes text and graphics, in which common reference symbols designate text components and their respective graphics components. The method comprises processing the document text and graphics into an index that identifies the text locations of reference symbols and the graphic locations of reference symbols, displaying (70) the text that includes at least some of the text reference symbols and/or displaying (68) at least some of the graphic reference symbols, and linking the common text and common graphic reference symbols such that user selection of a particular text reference symbol or graphic reference symbol causes display of a respective graphic segment or text segment that includes the selected common reference symbol. Other features include displaying a component list, selecting component identities to display graphic segments, using voice recognition for user control, and synthesized speech for audio text response.
