Topic
Document layout analysis
About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have appeared within this topic, receiving 34021 citations.
Papers
09 Sep 2002
TL;DR: In this paper, a Document Constraint Analyzer (DCA) takes as input a set of document files together with a document constraint specification file, extracts and examines the contents, attributes, and relationships associated with the document objects, and evaluates the logical expressions specified in the document constraints.
Abstract: A Product Document Constraint Specification Language (PDCSL) is provided for a document author to represent various types of documentation guidelines that must be enforced within documents or across different documents. A Document Constraint Analyzer (DCA) takes as input a set of document files together with a document constraint specification file, extracts and examines the contents, attributes, and relationships associated with the document objects, and evaluates the logical expressions specified in the document constraints. If a document constraint is not satisfied, an action can be taken to correct the documents or provide an explanation to the document author.
16 citations
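Illustrative sketch (not taken from the publication above): the abstract describes evaluating logical expressions over extracted document objects and reporting an explanation when a constraint fails. The minimal Python example below shows that general idea only. The names DocObject, Constraint, and check, the attribute layout, and the sample file names are all hypothetical; the actual PDCSL syntax and DCA behavior are not given in the abstract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DocObject:
    doc: str     # file the object was extracted from (hypothetical)
    kind: str    # object type, e.g. "figure" or "heading"
    attrs: dict  # extracted attributes


@dataclass
class Constraint:
    name: str
    applies_to: str                     # object kind the rule targets
    holds: Callable[[DocObject], bool]  # logical expression over one object
    explanation: str                    # message shown when the rule fails


def check(objects, constraints):
    """Return (constraint name, document, explanation) for every violation."""
    violations = []
    for c in constraints:
        for obj in objects:
            if obj.kind == c.applies_to and not c.holds(obj):
                violations.append((c.name, obj.doc, c.explanation))
    return violations


# Toy input: two figures (one without a caption) and one heading.
objects = [
    DocObject("guide.xml", "figure", {"caption": "Install flow"}),
    DocObject("guide.xml", "figure", {"caption": ""}),
    DocObject("ref.xml", "heading", {"text": "Overview", "level": 1}),
]
constraints = [
    Constraint(
        "figure-has-caption",
        "figure",
        lambda o: bool(o.attrs.get("caption")),
        "every figure needs a non-empty caption",
    ),
]
violations = check(objects, constraints)
for name, doc, why in violations:
    print(f"{doc}: {name} violated ({why})")
```

Constraints that span several documents could be modeled the same way by letting the predicate receive the whole object list instead of a single object.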
25 Aug 2013
TL;DR: A novel approach to evaluating the reading order results generated by layout analysis methods is proposed, which incorporates region correspondence analysis; a reading order representation scheme that groups objects with ordered and/or unordered relations is also presented and used by the system.
Abstract: Reading order detection and representation is an important task in many digitisation scenarios involving the preservation of the logical structure of a document. The corresponding need for the evaluation of reading order results generated by layout analysis methods poses a particular challenge due to potential deviations between ground truth and actually detected segmentation of the page. To this end a novel evaluation approach that responds to this problem by incorporating region correspondence analysis is proposed. Furthermore, a sophisticated reading order representation scheme is presented and used by the system allowing the grouping of objects with ordered and/or unordered relations. This is a typical requirement for documents with complex layouts such as magazines and newspapers. The evaluation method has been validated using the results of two state-of-the-art OCR / layout analysis systems and a basic top-to-bottom reading order detection algorithm applied on representative samples from the PRImA contemporary and the IMPACT historical document datasets.
16 citations
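Illustrative sketch (not taken from the publication above): the abstract mentions a basic top-to-bottom reading order detection algorithm as one of the evaluated methods. One simple way to realize such a baseline, assuming each region is an axis-aligned bounding box, is to sort regions by their top edge and break ties by their left edge. The function name, the tuple layout, and the sample coordinates are assumptions made for this example.

```python
def top_to_bottom_order(regions):
    """Order regions by top edge, then by left edge.

    Each region is a tuple (region_id, left, top, right, bottom)
    in page coordinates with the origin at the top-left corner.
    """
    return [r[0] for r in sorted(regions, key=lambda r: (r[2], r[1]))]


# Toy page: a title, two text columns, and a caption at the bottom.
regions = [
    ("caption", 50, 400, 300, 430),
    ("title", 50, 20, 500, 60),
    ("col2", 320, 100, 560, 380),
    ("col1", 50, 100, 290, 380),
]
order = top_to_bottom_order(regions)
print(order)
```

A pure geometric sort of this kind tends to interleave columns on multi-column pages whose regions start at different heights, which is one reason richer representations with ordered and unordered groups are useful for magazines and newspapers.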
11 Aug 2002
TL;DR: An enhanced background-thinning based page segmentation algorithm to process document images rapidly and eliminate some small regions embedded in other regions and a hierarchical approach, which combines the cross correlation measure, Kolmogorov complexity measure, and a neural network, to classify sub-images into halftones and texts.
Abstract: Page segmentation and image content classification play an important role in automatic document image processing with applications to mixed-type document image compression, form and check reading, and automatic mail sorting. We propose an enhanced background-thinning based page segmentation algorithm to process document images rapidly and eliminate some small regions embedded in other regions. We then present a hierarchical approach, which combines the cross correlation measure, Kolmogorov complexity measure, and a neural network, to classify sub-images into halftones and texts. The approach also achieves high accuracy in text determination using a three-layer feed-forward network where the text region can be classified into Chinese or alphabetic characters. Experimental results on a number of mixed-type document images show the efficiency and effectiveness of our approach.
16 citations
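Illustrative sketch (not taken from the publication above): Kolmogorov complexity is not computable, so in practice it is usually approximated. A common proxy, assumed here, is the ratio of compressed size to raw size under a general-purpose compressor. The abstract does not say which estimator the authors used, and the function name and sample data below are hypothetical.

```python
import random
import zlib


def complexity_ratio(pixels: bytes) -> float:
    """Approximate complexity as compressed size divided by raw size."""
    if not pixels:
        return 0.0
    return len(zlib.compress(pixels, 9)) / len(pixels)


# A highly regular sub-image: alternating dark and light pixels.
regular = bytes([0, 255]) * 2048

# An irregular sub-image: pseudo-random pixel values (fixed seed).
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(4096))

print(f"regular: {complexity_ratio(regular):.3f}")
print(f"noisy:   {complexity_ratio(noisy):.3f}")
```

A score like this could serve as one input feature, alongside a cross correlation measure, to a classifier that separates region types; the thresholds and the mapping from score to class would have to be learned from labeled data.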
IBM
TL;DR: A method of stochastic syntactic analysis is applied to extracting the logical structure of a printed document from its physical layout and keywords indicating logical components, and 87% of the logical components of manuals and 82% of those of technical papers are correctly marked up.
Abstract: A method of stochastic syntactic analysis is applied to extracting the logical structure of a printed document from its physical layout and keywords indicating logical components. The document is parsed as a sentence consisting of text lines and graphic objects according to a stochastic regular grammar with attributes. By using stochastic analysis, the parser can retain possible results in order of their probability, and thus, if ambiguity occurs, it selects an optimal result more appropriately than deterministic systems. A mark up system applying the method was constructed, and 87% of the logical components of manuals and 82% of those of technical papers are correctly marked up. The rate improved to 89% when the second candidates were considered, showing the advantage of the authors' approach over the deterministic approach.
16 citations
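Illustrative sketch (not taken from the publication above): a stochastic regular grammar over a sequence of text lines can be viewed, roughly, as a probabilistic finite-state model, and the most probable labeling can then be found with the Viterbi algorithm. The example below assumes that view. The label set, the line features ("large", "bold", "plain"), and every probability are invented for illustration, and the attributes and keyword cues described in the abstract are omitted.

```python
import math

STATES = ["title", "heading", "body"]
START = {"title": 0.8, "heading": 0.1, "body": 0.1}
TRANS = {  # probability of the next label given the current one
    "title": {"title": 0.1, "heading": 0.6, "body": 0.3},
    "heading": {"title": 0.0, "heading": 0.1, "body": 0.9},
    "body": {"title": 0.0, "heading": 0.3, "body": 0.7},
}
EMIT = {  # probability of a line feature given its label
    "title": {"large": 0.8, "bold": 0.15, "plain": 0.05},
    "heading": {"large": 0.1, "bold": 0.8, "plain": 0.1},
    "body": {"large": 0.02, "bold": 0.08, "plain": 0.9},
}


def log(p):
    """Log-probability that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(lines):
    """Return the most probable label sequence for the observed features."""
    if not lines:
        return []
    # Each column maps a state to (best log score, previous state).
    best = [{s: (log(START[s]) + log(EMIT[s][lines[0]]), None) for s in STATES}]
    for obs in lines[1:]:
        prev = best[-1]
        col = {}
        for s in STATES:
            score, back = max((prev[p][0] + log(TRANS[p][s]), p) for p in STATES)
            col[s] = (score + log(EMIT[s][obs]), back)
        best.append(col)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for col in reversed(best[1:]):
        state = col[state][1]
        path.append(state)
    return path[::-1]


labels = viterbi(["large", "bold", "plain", "plain", "bold", "plain"])
print(labels)
```

Keeping the per-state scores in each column is what would allow a second-best candidate to be recovered, which is in the spirit of the ranked results the abstract describes.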
TL;DR: The proposed method detects the layout of a scanned document even when the image and text regions have irregular shapes, and it also works when the document contains multiple images.
16 citations