Topic
Document layout analysis
About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have appeared within this topic, receiving 34021 citations.
Papers
11 Jun 2002
TL;DR: In a computer system for displaying a content page (250) on first and second computers (901, 902), layouts are defined in a first layout file (251) and a second layout file (252), each provided by a parser that reads a markup language document.
Abstract: In a computer system for displaying a content page (250) on first and second computers (901, 902), layouts are defined in a first layout file (251) and a second layout file (252). The first and second layout files (251, 252) are each provided by a parser that reads a first markup language document and a second markup language document, respectively. Each markup language document has a first entity and a second entity in which instructions define portions of the respective layout file (251/252) by predefined information added to the respective layout file (251/252) and by parameterized values also added to the respective layout file (251/252). The parameterized values are valid for both the first entity and the second entity.
7 citations
18 Sep 2011
TL;DR: This work proposes a new local homogeneity measure based on line space and incorporates this new feature into a region growing algorithm that achieved robust performance on PDF magazines with wide-ranging layouts and styles.
Abstract: Text segmentation is usually the first step taken towards the reuse and repurposing of PDF documents. Through experimental evaluation, we found that the leading text segmentation algorithms have limitations for contemporary consumer magazines. We propose a new local homogeneity measure based on line space, and incorporate this new feature into a region growing algorithm. Using a fixed set of parameters, our algorithm achieved robust performance on PDF magazines with wide-ranging layouts and styles.
7 citations
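The abstract above combines a line-spacing homogeneity measure with region growing. The paper's actual measure and parameters are not given in this listing, so the following is only a minimal one-dimensional sketch of the general idea: a text block keeps growing downward while the gap to the next line stays close to the block's mean gap. The function name `grow_regions` and the parameters `tol` and `first_gap_ratio` are hypothetical choices for illustration.

```python
from statistics import mean

def grow_regions(lines, tol=0.25, first_gap_ratio=1.0):
    """Group text lines into blocks by line-spacing homogeneity.

    lines: list of (top, bottom) y-coordinates, one per text line.
    tol: allowed relative deviation from the block's mean gap
         (hypothetical parameter, not taken from the paper).
    first_gap_ratio: a block's first gap is accepted only if it is at
         most this multiple of the previous line's height (also assumed).
    """
    if not lines:
        return []
    lines = sorted(lines)
    regions = [[lines[0]]]
    gaps = []  # gaps observed inside the region currently being grown
    for prev, cur in zip(lines, lines[1:]):
        gap = cur[0] - prev[1]  # white space between consecutive lines
        if gaps:
            ref = mean(gaps)
            homogeneous = abs(gap - ref) <= tol * abs(ref)
        else:
            height = prev[1] - prev[0]
            homogeneous = gap <= first_gap_ratio * height
        if homogeneous:
            regions[-1].append(cur)
            gaps.append(gap)
        else:
            regions.append([cur])  # spacing changed: start a new block
            gaps = []
    return regions

# Three evenly spaced lines, a large gap, then two more lines.
page = [(0, 10), (12, 22), (24, 34), (60, 70), (72, 82)]
blocks = grow_regions(page)
```

A real implementation would grow regions in two dimensions over PDF text objects; this sketch only shows how a spacing-based homogeneity test can drive the growing step.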
TL;DR: This paper releases BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books, and presents a page layout segmentation and text extraction baseline model based on a fine-tuned Faster R-CNN structure (FFRA).
Abstract: Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a page layout segmentation and text extraction baseline model based on fine-tuned Faster R-CNN structure (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and F1 score of 99.1% for text versus non-text block classification on 1500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.
7 citations
13 Feb 2014
TL;DR: A document processing apparatus performs image processing on document image data to extract character information and assign a document name using that information; it shapes the determined document name character string based on the character codes displayable on its display unit.
Abstract: A document processing apparatus (1) performs image processing on document image data to thereby extract character information and assign a document name using the character information. The document processing apparatus includes: an acquiring unit (551) that acquires a character code of characters displayable on a display unit (15); a determination unit (53) that determines a document name character string that serves as a basis for the document name, from the character information; and a shaping unit (55) that shapes the determined document name character string based on the acquired character code.
7 citations
TL;DR: A novel noise reduction method that applies machine learning to classify and reduce noise in document images, using a semi-supervised cluster-and-label approach with an enhanced labeling method that significantly improves the accuracy of labeling examples and the performance of classification.
Abstract: We propose a novel noise reduction method for document images. Semi-supervised learning is applied to distinguish noise from character components. The proposed method is suitable for non-Latin scripts, e.g. Thai document images. We propose an enhanced labeling method for the semi-supervised cluster-and-label approach. The proposed methods perform significantly better than the comparison methods.
Noise components are a major cause of poor performance in document analysis. To reduce undesired components, most recent research has applied image processing techniques. However, these techniques are effective only for Latin-script documents, not for non-Latin-script documents. Non-Latin-script documents, such as Thai, are considerably more complicated than Latin-script documents: they include many levels of character alignment, no word or sentence separators, and variability in character size. When applying an image processing technique to a Thai document, we usually remove the characters that are relatively close to noise. Hence, in this paper, we propose a novel noise reduction method that applies a machine learning technique to classify and reduce noise in document images. The proposed method uses a semi-supervised cluster-and-label approach with an improved labeling method, namely feature selected sub-cluster labeling. Feature selected sub-cluster labeling focuses on the clusters that are incorrectly labeled by conventional labeling methods. These clusters are re-clustered into small groups with a new feature set that is selected according to class labels. The experimental results show that this method can significantly improve the accuracy of labeling examples and the performance of classification. We compared the noise reduction and character preservation performance of the proposed method with two related noise reduction approaches: a two-phased stroke-like pattern noise (SPN) removal method and a commercial noise reduction software package called ScanFix Xpress 6.0. The results show that semi-supervised noise reduction is significantly better than the compared methods, with F-measures of 86.01 and 97.82 for character and noise, respectively.
7 citations
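For orientation, the conventional cluster-and-label baseline that the abstract above improves on can be sketched as follows. This is not the paper's feature selected sub-cluster labeling; it is a minimal illustration that assumes a single numeric feature per connected component (for example, its pixel area) and uses a tiny one-dimensional k-means. All names (`kmeans_1d`, `cluster_and_label`) and the sample values are hypothetical.

```python
from collections import Counter

def kmeans_1d(values, k, iters=20):
    """Very small 1-D k-means with a deterministic initialisation."""
    srt = sorted(values)
    # Spread the starting centers across the sorted values.
    centers = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assign

def cluster_and_label(values, labels, k=2):
    """Propagate the majority label of each cluster to its unlabeled members.

    labels: a class name for labeled samples, None for unlabeled ones.
    """
    assign = kmeans_1d(values, k)
    result = list(labels)
    for c in range(k):
        votes = Counter(l for l, a in zip(labels, assign)
                        if a == c and l is not None)
        if not votes:
            continue  # no labeled sample in this cluster: leave as None
        majority = votes.most_common(1)[0][0]
        for i, a in enumerate(assign):
            if a == c and result[i] is None:
                result[i] = majority
    return result

# Hypothetical component areas: small specks versus character-sized blobs.
areas = [2, 3, 4, 3, 50, 55, 60, 52]
seed = ["noise", None, None, None, "char", None, None, None]
predicted = cluster_and_label(areas, seed)
```

Majority voting of this kind mislabels clusters that mix noise and characters, which is the failure case the abstract's sub-cluster re-labeling step is described as targeting.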