Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have appeared within this topic, receiving 34021 citations.


Papers


Patent

Defining layout files by markup language documents

Heinz Willumeit, Joerg Bischof

11 Jun 2002

TL;DR: In a computer system for displaying a content page (250) on first and second computers (901, 902), layouts are defined in a first layout file (251) and a second layout file (252).

Abstract: In a computer system for displaying a content page (250) on first and second computers (901, 902), layouts are defined in a first layout file (251) and a second layout file (252). The first and second layout files (251, 252) each are provided by a parser that reads a first markup language document and a second markup language document, respectively. Each markup language document has a first entity and a second entity in that instructions define portions of the respective layout file (251/252) by predefined information added to the respective layout file (251/252) and by parameterized values also added to the respective layout file (251/252). The parameterized values are valid for both the first entity and the second entity.

7 citations

Proceedings Article

Text Segmentation of Consumer Magazines in PDF Format

Jian Fan¹

Hewlett-Packard¹

18 Sep 2011

TL;DR: This work proposes a new local homogeneity measure based on line space, and incorporates this new feature into a region growing algorithm that achieved robust performance on PDF magazines with wide-ranging layouts and styles.

Abstract: Text segmentation is usually the first step taken towards the reuse and repurposing of PDF documents. Through experimental evaluation, we found that the leading text segmentation algorithms have limitations for contemporary consumer magazines. We propose a new local homogeneity measure based on line space, and incorporate this new feature into a region growing algorithm. Using a fixed set of parameters, our algorithm achieved robust performance on PDF magazines with wide-ranging layouts and styles.
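The idea of growing regions under a line-spacing homogeneity test can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the abstract does not define the measure, so the `Line` and `Region` types, the `tol` and `max_first_pitch` parameters, and the greedy attachment rule are all assumptions made here for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Line:
    """Bounding box of one text line; y grows downward (assumed convention)."""
    x0: float
    top: float
    x1: float
    bottom: float


@dataclass
class Region:
    lines: List[Line] = field(default_factory=list)
    pitch: Optional[float] = None  # running mean of top-to-top line distance


def h_overlap(a: Line, b: Line) -> bool:
    """True when the two lines share some horizontal extent."""
    return min(a.x1, b.x1) - max(a.x0, b.x0) > 0


def grow_regions(lines, tol=0.15, max_first_pitch=2.0):
    """Greedily attach each line to a region with matching line spacing.

    tol: allowed relative deviation from the region's mean pitch (hypothetical).
    max_first_pitch: for a one-line region, the largest accepted pitch as a
    multiple of the line height (hypothetical).
    """
    regions: List[Region] = []
    for line in sorted(lines, key=lambda l: l.top):
        for region in regions:
            last = region.lines[-1]
            if not h_overlap(last, line):
                continue
            pitch = line.top - last.top
            if region.pitch is None:
                ok = 0 < pitch <= max_first_pitch * (last.bottom - last.top)
            else:
                ok = abs(pitch - region.pitch) <= tol * region.pitch
            if ok:
                region.lines.append(line)
                n = len(region.lines) - 1  # number of pitches seen so far
                if region.pitch is None:
                    region.pitch = pitch
                else:
                    region.pitch += (pitch - region.pitch) / n
                break
        else:
            # No existing region accepted the line, so it seeds a new one.
            regions.append(Region(lines=[line]))
    return regions
```

With a fixed `tol`, a column of evenly spaced body lines stays in one region, while a line that follows a larger gap starts a new region.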

7 citations

Journal Article

Extracting text from scanned Arabic books: a large-scale benchmark dataset and a fine-tuned Faster-R-CNN model

Randa I. Elanwar, Wenda Qin¹, Margrit Betke¹, Derry Tanti Wijaya¹

Boston University¹

30 Jun 2021 - International Journal on Document Analysis and Recognition

TL;DR: This paper releases BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books, and presents a page layout segmentation and text extraction baseline model based on a fine-tuned Faster R-CNN structure (FFRA) that outperforms prior systems on a competition benchmark dataset.

Abstract: Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a page layout segmentation and text extraction baseline model based on fine-tuned Faster R-CNN structure (FFRA). This baseline model yields cross-validation results with an average accuracy of 99.4% and F1 score of 99.1% for text versus non-text block classification on 1500 annotated images of BE-Arabic-9K. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.
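For readers unfamiliar with the two reported figures, the sketch below shows how accuracy and F1 are computed for a binary text versus non-text decision. It is only an illustration of the metric definitions: the label strings and the `binary_metrics` helper are hypothetical, and the paper's block-level matching and cross-validation protocol are not reproduced here.

```python
def binary_metrics(y_true, y_pred, positive="text"):
    """Return (accuracy, precision, recall, f1) for a two-class problem.

    y_true and y_pred are equal-length sequences of labels; `positive`
    names the class treated as the positive one (assumed to be "text").
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1
```

Accuracy counts both classes equally, while F1 ignores true negatives, which is why the two numbers can differ even on the same predictions.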

7 citations

Patent

Document processing apparatus, document processing method, and document processing computer program product

Yoshihisa Ohguro

13 Feb 2014

TL;DR: A document processing apparatus performs image processing on document image data to extract character information, determines a document name character string from that information, and shapes the string based on the acquired character code of characters displayable on a display unit.

Abstract: A document processing apparatus (1) performs image processing on document image data to thereby extract character information and assign a document name using the character information. The document processing apparatus includes: an acquiring unit (551) that acquires a character code of characters displayable on a display unit (15); a determination unit (53) that determines a document name character string that serves as a basis for the document name, from the character information; and a shaping unit (55) that shapes the determined document name character string based on the acquired character code.

7 citations

Journal Article

Semi-supervised cluster-and-label with feature based re-clustering to reduce noise in Thai document images

N. Piroonsup¹, Sukree Sinthupinyo¹

Chulalongkorn University¹

01 Dec 2015 - Knowledge-Based Systems

TL;DR: A novel noise reduction method that applies machine learning to classify and reduce noise in document images, together with an enhanced labeling method for the semi-supervised cluster-and-label approach that significantly improves the accuracy of labeling examples and the performance of classification.

Abstract: We propose a novel noise reduction method for document images. Semi-supervised learning is applied to separate noise from character components. The proposed method is suitable for non-Latin scripts, e.g. Thai document images. We propose an enhanced labeling method for the semi-supervised cluster-and-label approach. The proposed methods perform significantly better than the comparison methods. Noise components are a major cause of poor performance in document analysis. To reduce undesired components, most recent research has applied image processing techniques. However, these techniques are effective only for Latin-script documents, not for non-Latin-script documents. The characteristics of non-Latin scripts, such as Thai, are considerably more complicated than those of Latin scripts and include many levels of character alignment, no word or sentence separators, and variability in character size. When an image processing technique is applied to a Thai document, characters that closely resemble noise are usually removed along with it. Hence, in this paper, we propose a novel noise reduction method that applies a machine learning technique to classify and reduce noise in document images. The proposed method uses a semi-supervised cluster-and-label approach with an improved labeling method, namely feature selected sub-cluster labeling. Feature selected sub-cluster labeling focuses on the clusters that are incorrectly labeled by conventional labeling methods. These clusters are re-clustered into small groups with a new feature set that is selected according to class labels. The experimental results show that this method can significantly improve the accuracy of labeling examples and the performance of classification. We compared the performance of noise reduction and character preservation between the proposed method and two related noise reduction approaches, i.e., a two-phased stroke-like pattern noise (SPN) removal and a commercial noise reduction software package called ScanFix Xpress 6.0. The results show that semi-supervised noise reduction is significantly better than the compared methods, achieving F-measures of 86.01 for characters and 97.82 for noise.
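The cluster-and-label idea with re-clustering of mixed clusters can be sketched roughly as below. This is a simplified illustration under several assumptions, not the authors' feature selected sub-cluster labeling: the tiny deterministic `kmeans`, the `purity` threshold, and the caller-supplied `sub_features` list (standing in for the paper's label-driven feature selection) are all hypothetical. The input is assumed to be non-empty.

```python
from collections import Counter


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means; returns one cluster index per point."""
    # Seed centroids with points spread evenly through the input order.
    step = max(1, len(points) // k)
    centroids = [points[min(i * step, len(points) - 1)] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assign


def majority(labels):
    """Return (most common label, its share), or (None, 0.0) if empty."""
    if not labels:
        return None, 0.0
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def cluster_and_label(points, labels, k=2, purity=0.9, sub_k=2,
                      sub_features=None):
    """Propagate labels to unlabeled points (labels[i] is None) by cluster.

    Clusters whose labeled members are mixed (majority share below
    `purity`) are re-clustered on `sub_features` and labeled per sub-cluster.
    """
    out = list(labels)
    assign = kmeans(points, k)
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        label, share = majority([labels[i] for i in idx
                                 if labels[i] is not None])
        if label is None:
            continue  # no labeled member, nothing to propagate
        if share >= purity:
            groups = [idx]
        else:
            feats = sub_features or range(len(points[0]))
            sub_points = [tuple(points[i][f] for f in feats) for i in idx]
            sub_assign = kmeans(sub_points, sub_k)
            groups = [[i for i, s in zip(idx, sub_assign) if s == g]
                      for g in range(sub_k)]
        for group in groups:
            g_label, _ = majority([labels[i] for i in group
                                   if labels[i] is not None])
            for i in group:
                if out[i] is None and g_label is not None:
                    out[i] = g_label
    return out
```

The design point mirrored here is that a mixed cluster is not labeled by a single majority vote; it is split again on a different feature subset so that each smaller group can receive its own label.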

7 citations

Metrics

Papers: 1,488

Citations: 35,779

No. of papers in the topic in previous years
Year	Papers
2023	5
2022	19
2021	34
2020	19
2019	14
2018	9
