Showing papers on "Document layout analysis" published in 1988


Journal ArticleDOI
TL;DR: This article proposes an approach to identify the layout of a document page by dividing it recursively into nested rectangular areas and uses it as a basis for a document layout model, which is able to control an automatic interpretation mechanism for deriving a high level representation of the contents of a document.
Abstract: The realization of the paper-free office seems to be more difficult than expected. Therefore, good paper-computer interfaces are necessary to transform paper documents into an electronic form, which allows the use of a filing and retrieval system. An electronic document page is an optically scanned and digitized representation of a printed page. Document analysis is the problem of interpreting and labeling the constituents of the document. Although there are very reliable optical character recognition (OCR) methods, the process could be very inefficient. To prune the search space and to become more efficient, some search supporting methods have to be developed. This article proposes an approach to identify the layout of a document page by dividing it recursively into nested rectangular areas. The procedure is used as a basis for a document layout model, which is able to control an automatic interpretation mechanism for deriving a high level representation of the contents of a document. We have implemented our method in Common Lisp on a Symbolics 3640 Workstation and have run it for a large population of office documents. The results obtained have been very encouraging and have convincingly confirmed the soundness of our approach.

43 citations
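
The recursive division into nested rectangles described in this abstract can be illustrated with a minimal sketch. It is not the authors' Common Lisp implementation: the abstract does not say how cut positions are chosen, so the sketch assumes a binary raster (1 = ink) and cuts at the widest blank band, trying horizontal bands first. The names `find_gap`, `xy_cut`, `leaves` and the `min_gap` parameter are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def find_gap(profile: List[int], min_gap: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the widest interior run of zeros of length >= min_gap."""
    best = None
    i, n = 0, len(profile)
    while i < n:
        if profile[i] == 0:
            j = i
            while j < n and profile[j] == 0:
                j += 1
            # Only accept gaps that have ink on both sides (not page margins).
            if i > 0 and j < n and j - i >= min_gap:
                if best is None or j - i > best[1] - best[0]:
                    best = (i, j)
            i = j
        else:
            i += 1
    return best


def xy_cut(page: List[List[int]], box: Box, min_gap: int = 2) -> Dict:
    """Recursively divide `box` into nested rectangles (assumed cut rule)."""
    top, left, bottom, right = box
    rows = [sum(page[y][left:right]) for y in range(top, bottom)]
    gap = find_gap(rows, min_gap)
    if gap:
        a, b = gap  # horizontal blank band: split into upper and lower parts
        parts = [(top, left, top + a, right), (top + b, left, bottom, right)]
    else:
        cols = [sum(page[y][x] for y in range(top, bottom)) for x in range(left, right)]
        gap = find_gap(cols, min_gap)
        if not gap:
            return {"box": box, "children": []}  # no further division possible
        a, b = gap  # vertical blank band: split into left and right parts
        parts = [(top, left, bottom, left + a), (top, left + b, bottom, right)]
    return {"box": box, "children": [xy_cut(page, p, min_gap) for p in parts]}


def leaves(node: Dict) -> List[Box]:
    """Collect the innermost rectangles of the nested structure."""
    if not node["children"]:
        return [node["box"]]
    return [b for child in node["children"] for b in leaves(child)]


# Toy page: a full-width header, a blank band, then two text columns.
page = [[1] * 8 for _ in range(2)]
page += [[0] * 8 for _ in range(2)]
page += [[1, 1, 1, 0, 0, 1, 1, 1] for _ in range(3)]
tree = xy_cut(page, (0, 0, 7, 8))
print(leaves(tree))
```

The returned tree nests each rectangle inside its parent, which is the kind of structure a layout model could then label (header, column, and so on).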


Journal ArticleDOI
TL;DR: An experimental office system currently being developed at Olivetti research integrates two major requirements of office work: content based document retrieval and mail distribution that closes the gap between electronic document entry systems and processing of (semi-) structured document content.
Abstract: An experimental office system currently being developed at Olivetti research integrates two major requirements of office work: content based document retrieval and mail distribution. In this system documents are described and classified by their semantic structure that provides access to abstract concepts contained in the document. The derivation of the semantic structure of a document supports both an efficient retrieval by content and an intelligent mail filtering through document semantics. A knowledge based classification system automatically generates the conceptual description of a document to be inserted into the system by means of content analysis, and associates the document to an appropriate predefined type. The classification system closes the gap between electronic document entry systems and processing of (semi-) structured document content.

34 citations


Proceedings ArticleDOI
Y. Tsuji
14 Nov 1988
TL;DR: Experimental results showed that this proposed method can be appropriately used to automatically describe an input image as a layout structure, and both the elements and their relations in the generated tree were finally determined by the bottom-up strategy, based on the general document layout property.
Abstract: A document image analysis is described which automatically converts an input image into a syntactic document tree structure, while simultaneously representing the elements and their relative relations. Top-down image segmentation, using projection profiles, was greatly improved by systematically using a feedback process. As a result, the tree structure, including the blocks and their relative relations, was generated. Both the elements and their relations in the generated tree were finally determined by the bottom-up strategy, based on the general document layout property. Experimental results showed that this proposed method can be appropriately used to automatically describe an input image as a layout structure.

19 citations
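
Projection-profile segmentation with feedback, as mentioned in this abstract, might look roughly like the sketch below. The abstract does not specify the feedback rule, so the one used here is an assumption made purely for illustration: if any block comes out thinner than `min_height`, the gap threshold is raised and the profile is segmented again. All function and parameter names are hypothetical.

```python
from typing import List, Tuple


def horizontal_profile(page: List[List[int]]) -> List[int]:
    """Projection profile: number of ink pixels (1s) in each row."""
    return [sum(row) for row in page]


def profile_runs(profile: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Split a profile into ink runs separated by at least `min_gap` empty entries."""
    runs = []
    start, end, gap = None, 0, 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            end = i + 1  # one past the last ink entry seen so far
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, end))
                start = None
    if start is not None:
        runs.append((start, end))
    return runs


def segment_with_feedback(
    profile: List[int], min_height: int, max_gap: int
) -> Tuple[List[Tuple[int, int]], int]:
    """Assumed feedback rule: widen the gap threshold until no block is too thin."""
    min_gap = 1
    runs = profile_runs(profile, min_gap)
    while min_gap < max_gap and any(e - s < min_height for s, e in runs):
        min_gap += 1  # feedback: the previous segmentation was too fragmented
        runs = profile_runs(profile, min_gap)
    return runs, min_gap


# A one-row gap splits the first block too finely; feedback merges it back.
profile = [4, 4, 0, 4, 4, 4, 0, 0, 0, 6, 6, 6]
blocks, used_gap = segment_with_feedback(profile, min_height=3, max_gap=5)
print(blocks, used_gap)
```

Applying the same procedure to column profiles inside each block would yield the nested blocks of a tree; the bottom-up relabeling step from the abstract is not covered here.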


Proceedings ArticleDOI
11 Apr 1988
TL;DR: The concept of model driven segmentation allows quick focussing of the analysis on important regions of a document without necessarily requiring CPU-intensive preprocessing steps for the whole document.
Abstract: The task of document recognition requires the scanning of a paper document and the analysis of its content and structure. The resulting electronic representation has to capture the content as well as the logic and layout structure of the document. The first step in the recognition process is scanning, filtering and binarization of the paper document. Based on the preprocessing results we delineate key areas like address or signature for a letter, or the abstract for a report. This segmentation procedure uses a specific document layout model. The validity of this segmentation can be verified in a second step by using the results of more time-consuming procedures like text/graphic classification, optical character recognition (OCR) and the comparison with more elaborate models for specific document parts. Thus our concept of model driven segmentation allows quick focussing of the analysis on important regions. The segmentation is able to operate directly on the raster image of a document without necessarily requiring CPU-intensive preprocessing steps for the whole document. A test version for the analysis of simple business letters has been implemented.

9 citations
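
The idea of model driven segmentation in this abstract, looking only where a layout model expects key areas instead of processing the whole raster, can be sketched as follows. The model contents (`LETTER_MODEL`, with regions given as percentages of page height and width), the ink-density check, and the `min_density` threshold are all assumptions for illustration; the paper's actual model and verification steps are not described in the abstract.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive

# Hypothetical layout model for a simple business letter:
# each key area is (top, left, bottom, right) in percent of the page size.
LETTER_MODEL: Dict[str, Tuple[int, int, int, int]] = {
    "address": (10, 5, 30, 50),
    "signature": (80, 50, 95, 95),
}


def region_box(percent_box: Tuple[int, int, int, int], height: int, width: int) -> Box:
    """Convert a percent-based model entry into pixel coordinates."""
    t, l, b, r = percent_box
    return (t * height // 100, l * width // 100, b * height // 100, r * width // 100)


def ink_density(page: List[List[int]], box: Box) -> float:
    """Fraction of ink pixels (1s) inside `box`; touches only that region."""
    top, left, bottom, right = box
    area = (bottom - top) * (right - left)
    if area == 0:
        return 0.0
    ink = sum(sum(page[y][left:right]) for y in range(top, bottom))
    return ink / area


def locate_key_areas(
    page: List[List[int]],
    model: Dict[str, Tuple[int, int, int, int]],
    min_density: float = 0.05,
) -> Dict[str, Tuple[Box, bool]]:
    """For each modelled area, return its pixel box and whether it contains ink."""
    height, width = len(page), len(page[0])
    found = {}
    for name, percent_box in model.items():
        box = region_box(percent_box, height, width)
        found[name] = (box, ink_density(page, box) >= min_density)
    return found


# Toy 20x20 page with ink only where the address is expected.
page = [[0] * 20 for _ in range(20)]
for y in range(2, 6):
    for x in range(1, 10):
        page[y][x] = 1
areas = locate_key_areas(page, LETTER_MODEL)
print(areas)
```

In the workflow the abstract describes, candidate regions found this way would then be verified by slower procedures such as text/graphic classification or OCR.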