Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1,462 publications have appeared within this topic, receiving 34,021 citations.


Papers
Patent
11 Jun 2002
TL;DR: In a computer system for displaying a content page (250) on first and second computers (901, 902), layouts are defined in a first layout file (251) and a second layout file (252).
Abstract: In a computer system for displaying a content page (250) on first and second computers (901, 902), layouts are defined in a first layout file (251) and a second layout file (252). The first and second layout files (251, 252) each are provided by a parser that reads a first markup language document and a second markup language document, respectively. Each markup language document has a first entity and a second entity in that instructions define portions of the respective layout file (251/252) by predefined information added to the respective layout file (251/252) and by parameterized values also added to the respective layout file (251/252). The parameterized values are valid for both the first entity and the second entity.

7 citations

Proceedings ArticleDOI
Jian Fan
18 Sep 2011
TL;DR: This work proposes a new local homogeneity measure based on line space, and incorporates this new feature into a region growing algorithm that achieved robust performance on PDF magazines with wide-ranging layouts and styles.
Abstract: Text segmentation is usually the first step taken towards the reuse and repurposing of PDF documents. Through experimental evaluation, we found that the leading text segmentation algorithms have limitations for contemporary consumer magazines. We propose a new local homogeneity measure based on line space, and incorporate this new feature into a region growing algorithm. Using a fixed set of parameters, our algorithm achieved robust performance on PDF magazines with wide-ranging layouts and styles.
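The abstract does not define its homogeneity measure, so the following is only a minimal sketch of the general idea of growing a text region while line spacing stays homogeneous. The function name `grow_regions`, the `rel_tol` parameter, and the use of the median gap are all assumptions made for illustration, not the paper's method.

```python
from statistics import median

def grow_regions(line_tops, rel_tol=0.25):
    """Group sorted line y-positions into regions with similar line spacing.

    line_tops: y-coordinates of text lines, sorted top to bottom (hypothetical input).
    rel_tol:   allowed relative deviation from the region's median gap (assumed value).
    """
    if not line_tops:
        return []
    regions = []
    current = [line_tops[0]]
    gaps = []  # line-to-line gaps seen so far in the current region
    for prev, cur in zip(line_tops, line_tops[1:]):
        gap = cur - prev
        # The first gap of a region is always accepted; later gaps must
        # stay close to the median gap of the region grown so far.
        if not gaps or abs(gap - median(gaps)) <= rel_tol * median(gaps):
            current.append(cur)
            gaps.append(gap)
        else:
            regions.append(current)
            current = [cur]
            gaps = []
    regions.append(current)
    return regions

blocks = grow_regions([0, 12, 24, 36, 80, 94, 108])
print(blocks)
```

In this toy input the jump from 36 to 80 breaks the 12-unit rhythm, so the lines split into two regions. A real segmenter would also need horizontal extent, font information, and column handling.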

7 citations

Journal ArticleDOI
TL;DR: This paper releases BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books, and presents a page layout segmentation and text extraction baseline model based on a fine-tuned Faster R-CNN structure (FFRA).
Abstract: Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a page layout segmentation and text extraction baseline model based on fine-tuned Faster R-CNN structure (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and F1 score of 99.1% for text versus non-text block classification on 1500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.
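For readers unfamiliar with the two reported metrics, here is a small sketch of how accuracy and F1 score are computed for a binary text versus non-text block classification. The labels and predictions are made-up toy data, and `accuracy_and_f1` is a hypothetical helper, not code from the paper.

```python
def accuracy_and_f1(y_true, y_pred, positive="text"):
    """Return (accuracy, F1) treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Toy example (invented data): five blocks, three classified correctly.
truth = ["text", "text", "text", "non-text", "non-text"]
preds = ["text", "text", "non-text", "non-text", "text"]
acc, f1 = accuracy_and_f1(truth, preds)
print(acc, f1)
```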

7 citations

Patent
13 Feb 2014
TL;DR: A document processing apparatus performs image processing on document image data to extract character information and assign a document name using that information; it shapes the determined document name character string based on an acquired character code of the characters its display unit can show.
Abstract: A document processing apparatus (1) performs image processing on document image data to thereby extract character information and assign a document name using the character information. The document processing apparatus includes: an acquiring unit (551) that acquires a character code of characters displayable on a display unit (15); a determination unit (53) that determines a document name character string that serves as a basis for the document name, from the character information; and a shaping unit (55) that shapes the determined document name character string based on the acquired character code.

7 citations

Journal ArticleDOI
TL;DR: A novel noise reduction method that applies a machine learning technique to classify and reduce noise in document images, together with an enhanced labeling method for the semi-supervised cluster-and-label approach that can significantly improve the accuracy of labeling examples and the performance of classification.
Abstract: We proposed a novel noise reduction method for document images. Semi-supervised learning is applied to classify noise from character components. The proposed method is suitable for non-Latin scripts, e.g., Thai document images. We proposed an enhanced labeling method for the semi-supervised cluster-and-label approach. The performance of the proposed methods is significantly better than that of the comparison methods. Noise components are a major cause of poor performance in document analysis. To reduce undesired components, most recent research works have applied an image processing technique. However, these techniques are effective only for Latin-script documents, not for non-Latin-script documents. The characteristics of a non-Latin-script document, such as Thai, are considerably more complicated than those of a Latin-script document and include many levels of character alignment, no word or sentence separator, and variability in character size. When applying an image processing technique to a Thai document, we usually remove the characters that are relatively close to noise. Hence, in this paper, we propose a novel noise reduction method by applying a machine learning technique to classify and reduce noise in document images. The proposed method uses a semi-supervised cluster-and-label approach with an improved labeling method, namely, feature selected sub-cluster labeling. Feature selected sub-cluster labeling focuses on the clusters that are incorrectly labeled by conventional labeling methods. These clusters are re-clustered into small groups with a new feature set that is selected according to class labels. The experimental results show that this method can significantly improve the accuracy of labeling examples and the performance of classification. We compared the performance of noise reduction and character preservation between the proposed method and two related noise reduction approaches, i.e., a two-phased stroke-like pattern noise (SPN) removal and a commercial noise reduction software called ScanFix Xpress 6.0. The results show that semi-supervised noise reduction is significantly better than the compared methods, with F-measures for character and noise of 86.01 and 97.82, respectively.
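The basic cluster-and-label idea mentioned in the abstract can be sketched as follows: cluster all components, then give every member of a cluster the majority label of its few labeled members. This is a minimal sketch under stated assumptions, not the paper's feature selected sub-cluster labeling. The single feature (a component size), the simple 1-D k-means, the initial centers, and all names are hypothetical.

```python
from collections import Counter

def kmeans_1d(values, centers, iters=10):
    """Plain Lloyd iterations on 1-D values; returns a cluster index per value."""
    centers = list(centers)

    def assign():
        return [min(range(len(centers)), key=lambda k: abs(v - centers[k]))
                for v in values]

    for _ in range(iters):
        labels = assign()
        for k in range(len(centers)):
            members = [v for v, a in zip(values, labels) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = sum(members) / len(members)
    return assign()

def cluster_and_label(values, labels, centers):
    """Propagate the majority known label of each cluster to all its members.

    labels uses None for unlabeled examples; clusters without any
    labeled member stay unlabeled (None).
    """
    clusters = kmeans_1d(values, centers)
    result = list(labels)
    for k in set(clusters):
        known = [l for l, c in zip(labels, clusters) if c == k and l is not None]
        if known:
            majority = Counter(known).most_common(1)[0][0]
            result = [majority if c == k else r for r, c in zip(result, clusters)]
    return result

# Toy data (invented): component sizes, with one labeled example per group.
sizes = [2, 3, 4, 40, 45, 50]
partial = ["noise", None, None, "char", None, None]
print(cluster_and_label(sizes, partial, centers=(0.0, 50.0)))
```

The abstract's refinement targets clusters where this majority vote mislabels members, by re-clustering them with a feature set chosen according to class labels; that step is not shown here.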

7 citations


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
82% related
Feature (computer vision)
128.2K papers, 1.7M citations
82% related
Object detection
46.1K papers, 1.3M citations
81% related
Image segmentation
79.6K papers, 1.8M citations
80% related
Convolutional neural network
74.7K papers, 2M citations
79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9