Topic
Document layout analysis
About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have appeared within this topic, receiving 34021 citations.
Papers
•
30 Oct 1969
TL;DR: A font of editing symbols is provided that is handwritable yet recognizable by a character recognition system. Each symbol represents an editing instruction, and the appropriate symbol is inserted adjacent to each portion of the textual material that is in error.
Abstract: A method and apparatus for editing a document having textual material thereon. A unique font of editing symbols is provided that is handwritable yet recognizable by a character recognition system. Each of the symbols is representative of an editing instruction. An appropriate symbol is inserted adjacent to each portion of the textual material that is in error. The document is then inserted into a character recognition system without requiring reproduction of the document with the alterations incorporated.
17 citations
••
TL;DR: Experiments show that the proposed algorithm reduces both under- and over-segmentation errors and significantly improves performance compared with popular page segmentation algorithms.
17 citations
••
18 Sep 2012
TL;DR: The results suggested that, on practical data from palm leaf manuscripts, the proposed approach performs better at solving the line segmentation problem.
Abstract: Text line extraction is one of the critical steps in document analysis and optical character recognition (OCR) systems. The purpose of this study is to address the problem of text line extraction from ancient Thai manuscripts written on palm leaves, using an Adaptive Partial Projection (APP) technique that integrates a modified partial projection and a smoothed histogram with recursion. The proposed approach was compared with a Modified Partial Projection (MPP) technique with respect to vowel analysis and touching components of two consecutive lines. The results suggested that, on practical data from palm leaf manuscripts, the proposed approach performs better at solving the line segmentation problem.
16 citations
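The abstract above builds on partial projection, where the page is cut into vertical strips and a horizontal ink histogram is computed per strip so that skewed or wavy lines can still be separated. The following is a minimal sketch of plain partial projection with a smoothed histogram only; it is not the APP or MPP algorithm from the paper, and the function names, the `strip_width` and `radius` parameters, and the zero threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def horizontal_profile(img: Image, x0: int, x1: int) -> List[int]:
    """Count ink pixels per row inside the column range [x0, x1)."""
    return [sum(row[x0:x1]) for row in img]


def smooth(profile: List[int], radius: int = 1) -> List[float]:
    """Moving-average smoothing of the histogram (radius 0 leaves it unchanged)."""
    out = []
    for i in range(len(profile)):
        lo, hi = max(0, i - radius), min(len(profile), i + radius + 1)
        out.append(sum(profile[lo:hi]) / (hi - lo))
    return out


def line_bands(profile: List[float], threshold: float = 0.0) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, where the profile exceeds the threshold."""
    bands, start = [], None
    for y, value in enumerate(profile):
        if value > threshold and start is None:
            start = y
        elif value <= threshold and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


def partial_projection(img: Image, strip_width: int, radius: int = 1) -> List[List[Tuple[int, int]]]:
    """Find candidate text-line bands independently in each vertical strip."""
    width = len(img[0])
    result = []
    for x0 in range(0, width, strip_width):
        x1 = min(width, x0 + strip_width)
        result.append(line_bands(smooth(horizontal_profile(img, x0, x1), radius)))
    return result
```

Calling `partial_projection(img, strip_width, radius)` returns one list of bands per strip. A full line extractor would still have to link bands across neighbouring strips and resolve touching components, which is the part this sketch leaves out.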
•
05 Jul 2005
TL;DR: A plurality of different extraction conditions are stored in an extraction condition memory for use in extracting text blocks from a given document image. In accordance with those extraction conditions, a text block extractor extracts a plurality of sets of text blocks.
Abstract: A document layout analysis program capable of extracting an appropriate set of text blocks from a given document image even in the case where the document layout is so complicated that conventional extraction methods with a single extraction condition would not work well. A plurality of different extraction conditions are stored in an extraction condition memory for use in extracting text blocks from a given document image. In accordance with those extraction conditions, a text block extractor extracts a plurality of sets of text blocks from the document image. A text block consolidator produces a consolidated set of text blocks by performing character recognition on each extracted text block, evaluating validity of each text block based on a result of the character recognition, and selecting most valid text blocks from among the plurality of sets of text blocks.
16 citations
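The consolidation step in the abstract above (extract several candidate sets of text blocks, score each block by its character recognition result, keep the most valid ones) can be sketched as follows. This is one possible reading, not the patented method: the `Block` type, the `validity` callback standing in for an OCR-based score, and the greedy non-overlap selection rule are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Block:
    """Axis-aligned text block; (x1, y1) is exclusive. Hypothetical type."""
    x0: int
    y0: int
    x1: int
    y1: int

    def overlaps(self, other: "Block") -> bool:
        return (self.x0 < other.x1 and other.x0 < self.x1 and
                self.y0 < other.y1 and other.y0 < self.y1)


def consolidate(block_sets: Iterable[List[Block]],
                validity: Callable[[Block], float]) -> List[Block]:
    """Pool blocks from every extraction condition and greedily keep the
    highest-scoring blocks that do not overlap an already chosen block."""
    # Deduplicate while preserving first-seen order.
    candidates = list(dict.fromkeys(b for blocks in block_sets for b in blocks))
    # Highest validity first; coordinates break ties deterministically.
    ranked = sorted(candidates, key=lambda b: (-validity(b), b.y0, b.x0, b.y1, b.x1))
    chosen: List[Block] = []
    for block in ranked:
        if not any(block.overlaps(kept) for kept in chosen):
            chosen.append(block)
    # Return in top-to-bottom, left-to-right order.
    return sorted(chosen, key=lambda b: (b.y0, b.x0))
```

In practice `validity` would wrap a recognizer, for example a mean character confidence over the block. With one coarse block spanning two columns scored low and two per-column blocks scored high, the greedy pass keeps the two column blocks and drops the coarse one.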
••
TL;DR: The proposed system was evaluated against two other systems that represent the best available tools for Arabic document analysis. The evaluation results show that it works well on multi-font and multi-size documents with a variety of layouts, even on some historical documents.
Abstract: Document layout analysis is a key step in the process of converting document images into text. Arabic script is cursive and written in different styles, which poses challenges for the analysis of Arabic text documents. In this paper, we introduce an approach for Arabic document layout analysis. In that approach, the document is segmented into a set of zones using morphological operations. The segmented zones are classified as text or non-text using a support vector machine classifier. The features used in zone classification are a combination of texture-based features and connected component-based features. The size of the texture-based feature vector is reduced using a genetic algorithm. Classified text zones are clustered into lines using an adaptive sample set clustering algorithm. Each line is then segmented into words by clustering inter- and intra-word spaces. The proposed system was evaluated against two other systems that represent the best available tools for Arabic document analysis, and the evaluation results show that the proposed system works well on multi-font and multi-size documents with a variety of layouts, even on some historical documents.
16 citations
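The last step in the abstract above, splitting a line into words by clustering the gaps between components into inter-word and intra-word spaces, can be illustrated with a small sketch. The abstract does not say which clustering method is used for this step; one-dimensional 2-means is assumed here, and the `Span` type and function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (left, right) x-extent of one connected component


def two_means_threshold(gaps: List[float], iterations: int = 20) -> float:
    """Cluster 1-D gap widths into two groups; return the midpoint of the two centers."""
    small, large = min(gaps), max(gaps)
    for _ in range(iterations):
        near_small = [g for g in gaps if abs(g - small) <= abs(g - large)]
        near_large = [g for g in gaps if abs(g - small) > abs(g - large)]
        if not near_large:
            break
        new_small = sum(near_small) / len(near_small)
        new_large = sum(near_large) / len(near_large)
        if (new_small, new_large) == (small, large):
            break  # centers stopped moving
        small, large = new_small, new_large
    return (small + large) / 2


def split_words(components: List[Span]) -> List[List[Span]]:
    """Group components of one text line into words using the gap threshold."""
    comps = sorted(components)
    if len(comps) < 2:
        return [comps] if comps else []
    # Gap between each component and its right neighbour (overlaps count as 0).
    gaps = [max(0, b[0] - a[1]) for a, b in zip(comps, comps[1:])]
    if max(gaps) == min(gaps):
        return [comps]  # identical gaps give no evidence of a word break
    cut = two_means_threshold(gaps)
    words, current = [], [comps[0]]
    for gap, comp in zip(gaps, comps[1:]):
        if gap > cut:
            words.append(current)
            current = [comp]
        else:
            current.append(comp)
    words.append(current)
    return words
```

The grouping works on x-extents only, so it does not depend on writing direction; for Arabic the resulting word list would be read from the last group to the first.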