
Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1,462 publications have appeared on this topic, receiving 34,021 citations.


Papers
Patent
27 Jul 2005
TL;DR: A method for finding text reading order in a document such as a scanned newspaper or magazine: prune unnecessary text zones using semantic analysis, cluster the remaining zones using text correlation measures, and then find a reading order within each cluster.
Abstract: A method for finding text reading order in a document such as a scanned newspaper or magazine includes the steps of pruning unnecessary text zones using semantic analysis (40), using text correlation measures to cluster zones (41), and then finding a reading order within each of the clusters (42).
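Illustration: the three steps named in this abstract (prune, cluster, order) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented method: the `Zone` type, the minimum-word rule standing in for semantic analysis, Jaccard word overlap standing in for the text correlation measure, and both thresholds are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Hypothetical text zone: top-left corner plus its recognized text."""
    x: int
    y: int
    text: str


def jaccard(a, b):
    """Word-overlap similarity, an assumed stand-in for text correlation."""
    wa = set(a.text.lower().split())
    wb = set(b.text.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def reading_order(zones, min_words=3, threshold=0.2):
    # Step 1: prune zones (assumed rule: too few words to be article text).
    kept = [z for z in zones if len(z.text.split()) >= min_words]

    # Step 2: single-link clustering on text similarity.
    clusters = []
    for zone in kept:
        matching, rest = [], []
        for cluster in clusters:
            if any(jaccard(zone, member) >= threshold for member in cluster):
                matching.append(cluster)
            else:
                rest.append(cluster)
        merged = [member for cluster in matching for member in cluster] + [zone]
        clusters = rest + [merged]

    # Step 3: order each cluster top-to-bottom, then left-to-right.
    return [sorted(cluster, key=lambda z: (z.y, z.x)) for cluster in clusters]
```

Sorting by position inside each cluster is the simplest possible ordering rule; a real multi-column page would need column-aware ordering.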

13 citations

Journal Article
TL;DR: A system that performs decomposition and structural analysis (including logical grouping and read-order determination) on complex multiarticled documents is presented, along with experimental results showing the efficiency of the approach.
Abstract: A document image is a visual representation of a paper document, such as a journal article page, a cover page of facsimile transmission, office correspondence, an application form, etc. Document image understanding as a research endeavor consists of developing processes for taking a document through various representations, from scanned image to semantic representation. This article describes document decomposition and structural analysis, which constitutes one of the major processes involved in document image understanding. The current state of the art and future directions in the areas of document segmentation, layout analysis, and logical block grouping are indicated. A system that performs decomposition and structural analysis (including logical grouping and read‐order determination) on complex multiarticled documents is presented. This system uses bottom‐up segmentation techniques to identify the block structure of a document, and layout rules to classify and group these blocks into logical units that represent meaningful subdivisions of the document. Experimental results showing the efficiency of this approach are presented and discussed. © 1996 John Wiley & Sons, Inc.
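Illustration: bottom-up segmentation builds larger blocks by merging small elements that lie close together. The sketch below shows that general idea only; it is not the system described in this abstract. The box format `(left, top, right, bottom)` and the `gap` threshold are assumptions, and the layout rules that classify blocks into logical units are not modeled.

```python
def merge_boxes(boxes, gap=5):
    """Merge boxes (left, top, right, bottom) until no two are within `gap`.

    Two boxes are merged when both their horizontal and vertical
    separations are at most `gap` pixels (overlap counts as zero).
    """
    blocks = [tuple(b) for b in boxes]
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                a, b = blocks[i], blocks[j]
                dx = max(a[0] - b[2], b[0] - a[2], 0)  # horizontal separation
                dy = max(a[1] - b[3], b[1] - a[3], 0)  # vertical separation
                if dx <= gap and dy <= gap:
                    # Replace the pair with their common bounding box.
                    blocks[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan with the updated block list
    return blocks
```

For example, two word boxes 2 pixels apart on the same line merge into one block, while a box far away stays separate.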

13 citations

Patent
06 Nov 2008
TL;DR: A method is provided for enhancing electronic documents that a document analysis system receives from a plurality of users, in order to improve automatic recognition and classification of those documents.
Abstract: A method of enhancing electronic documents received from a plurality of users by a document analysis system, for improving automatic recognition and classification of the received electronic documents, is provided. For each page of a received electronic document, the method filters the page to infer binarized-background artifacts that result from the binarization of the original grayscale or color image source document and that reside in the vicinity of binarized text and binarized image features in the page, so that the binarized text and binarized images may be distinguished from the binarized-background artifacts and extracted from the document. The method then uses the extracted features from the filtered document to automatically recognize and classify the document into a document category.

13 citations

Patent
Ray Smith
21 Jan 2009
TL;DR: Physical page layout analysis for optical character recognition is performed: tab-stops are detected from groups of edge-aligned connected components and are then used to deduce the column layout of the page by finding column partitions.
Abstract: Physical page layout analysis for optical character recognition is performed. A physical page layout analysis method finds constituent parts of an image and gives an initial data-type label, such as text or non-text. Within the text data, connected components are identified and analyzed. Tab-stops are detected from groups of edge-aligned connected components. The detected tab-stops are used to deduce the column layout of the page by finding column partitions. The column layout is then applied to find the polygonal boundaries of and a reading order of regions containing flowing text, headings, and pull-outs.
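Illustration: the tab-stop step in this abstract, finding groups of edge-aligned connected components, can be sketched as a one-dimensional grouping of left edges. This is a simplified sketch, not the patented procedure: the box format `(left, top, right, bottom)`, the `tolerance` and `min_support` parameters, and the use of the group median as the tab-stop position are all assumptions.

```python
def find_left_tab_stops(boxes, tolerance=3, min_support=3):
    """Return x positions where at least `min_support` boxes share a left edge.

    `boxes` is a list of (left, top, right, bottom) tuples, one per
    connected component. A left edge joins the current group when it is
    within `tolerance` pixels of the previous edge in sorted order.
    """
    lefts = sorted(box[0] for box in boxes)
    groups = []
    for x in lefts:
        if groups and x - groups[-1][-1] <= tolerance:
            groups[-1].append(x)
        else:
            groups.append([x])
    # Report the middle element of each sufficiently large group.
    return [g[len(g) // 2] for g in groups if len(g) >= min_support]
```

A fuller treatment would also collect right-aligned edges and check that the aligned components are stacked vertically before treating a group as a column boundary candidate.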

13 citations

Proceedings Article
08 Jan 2015
TL;DR: This paper presents a hybrid method of page segmentation based on the combination of connected component analysis and classification on multilevel homogeneous regions, and reports higher accuracy than other methods.
Abstract: This paper presents a hybrid method of page segmentation based on the combination of connected component analysis and classification on multilevel homogeneous regions. The method is iterative: connected component analysis is used to classify the non-text elements at each level of homogeneous region, and the multilevel homogeneity structure ensures that this classification can identify all non-text elements. The result of the iteration is two documents, a text document and a non-text document. On the text document, adaptive mathematical morphology applied in each homogeneous text region yields the corresponding text region. On the non-text document, the non-text components are classified in more detail into separators, tables, images, etc. For evaluation, we test our method on datasets from the ICDAR2009 page segmentation competition. According to the results, our proposed method achieves higher accuracy than other methods, which demonstrates its effectiveness.
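Illustration: the connected component analysis this abstract relies on can be sketched in plain Python. The code labels 4-connected foreground regions in a binary grid and then separates components with a single height rule. The height rule and its `ratio` parameter are assumptions chosen for illustration; the paper's multilevel homogeneity structure and its actual classification criteria are not reproduced here.

```python
from collections import deque
from statistics import median


def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of 4-connected
    foreground regions in a binary grid (a list of rows of 0/1 values)."""
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def split_text_non_text(boxes, ratio=3.0):
    """Assumed heuristic: a component much taller than the median height
    of its region is treated as non-text (for example, a separator)."""
    if not boxes:
        return [], []
    heights = [b[2] - b[0] + 1 for b in boxes]
    typical = median(heights)
    text = [b for b, h in zip(boxes, heights) if h <= ratio * typical]
    non_text = [b for b, h in zip(boxes, heights) if h > ratio * typical]
    return text, non_text
```

Applying the split repeatedly inside progressively smaller regions would be one way to approximate the iterative, multilevel flavor of the method, though that loop is left out of this sketch.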

13 citations


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 82% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Object detection: 46.1K papers, 1.3M citations, 81% related
Image segmentation: 79.6K papers, 1.8M citations, 80% related
Convolutional neural network: 74.7K papers, 2M citations, 79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9