
Showing papers on "Document layout analysis published in 2018"


Proceedings ArticleDOI
01 Jan 2018
TL;DR: This paper proposes an open-source implementation of a CNN-based pixel-wise predictor coupled with task dependent post-processing blocks and shows that a single CNN-architecture can be used across tasks with competitive results.
Abstract: In recent years there have been multiple successful attempts to tackle document processing problems separately by designing task-specific hand-tuned strategies. We argue that the diversity of historical document processing tasks makes it impractical to solve them one at a time and shows the need for generic approaches that can handle the variability of historical series. In this paper, we address multiple tasks simultaneously, such as page extraction, baseline extraction, layout analysis, and the extraction of multiple typologies of illustrations and photographs. We propose an open-source implementation of a CNN-based pixel-wise predictor coupled with task-dependent post-processing blocks. We show that a single CNN architecture can be used across tasks with competitive results. Moreover, most of the task-specific post-processing steps can be decomposed into a small number of simple, standard, reusable operations, adding to the flexibility of our approach.
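The post-processing the abstract describes, turning a pixel-wise probability map into discrete regions, typically reduces to a few reusable operations such as thresholding, connected-component labeling, and small-component filtering. A minimal pure-Python sketch of that pipeline (function names are illustrative, not from the paper's implementation):

```python
from collections import deque

def binarize(prob_map, threshold=0.5):
    """Threshold a 2-D probability map (list of lists) into a 0/1 mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]

def connected_components(mask):
    """Label 4-connected foreground components; return a list of pixel sets."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                comp, queue = set(), deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    comp.add((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(comp)
    return components

def filter_small(components, min_size=2):
    """Discard components below a minimum pixel count (noise removal)."""
    return [c for c in components if len(c) >= min_size]
```

Chaining these three steps over a predictor's output map already covers the common case; task-specific blocks would add operations such as polygon fitting or baseline tracing on top.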

92 citations


Posted Content
TL;DR: A system based on artificial neural networks which is able to determine not only the baselines of text lines present in the document, but also performs geometric and logic layout analysis of the document.
Abstract: Document layout analysis is a fundamental step in handwritten text processing systems, from the extraction of text lines to the determination of the type of zone each line belongs to. We present a system based on artificial neural networks that determines not only the baselines of the text lines present in the document but also performs geometric and logical layout analysis of the document. Experiments on three different datasets demonstrate the potential of the method and show competitive results with respect to state-of-the-art methods.

38 citations


Proceedings ArticleDOI
01 Aug 2018
TL;DR: The results show the impact caused by an optical modelling technological transition: from classical HMM-based methods to new technology based on recurrent neural networks.
Abstract: We present the process being followed for the transcription of a large 18th-century manuscript collection with the help of Handwritten Text Recognition (HTR) technology. The documents are processed in batches of 50 pages each. For each batch we perform two semi-supervised processes: one to analyze the layout and detect the text lines, and another to produce the full transcripts of the text. At the users' request, both diplomatic and modernized transcripts, as well as semantically tagged versions, are being produced. Layout-analysis supervision is performed with a conventional layout editing tool, while transcripts, including automatic modernization and tagging, are produced with a web-based computer-assisted interactive-predictive tool (CATTI). We report the performance of this process on the 12 image batches processed so far. The results show the impact of an optical-modelling technological transition: from classical HMM-based methods to new technology based on recurrent neural networks.

11 citations


Proceedings ArticleDOI
01 Aug 2018
TL;DR: A binarization-free dynamic programming approach that generates an equidistant text line extraction polygon is presented and compared with other solutions, ranging from human-reviewed ground-truth polygons to simpler automatically generated rectangular areas.
Abstract: Text line segmentation is a basic document layout task that consists of detecting and extracting the text lines present in a document page image. Although considered basic, it is generally a necessary step for higher-level Handwritten Text Recognition (HTR) tasks: most state-of-the-art automatic text recognition, text-to-line image alignment, and keyword spotting systems require isolated text line images as input. Traditionally, most text line segmentation approaches cover both the detection and extraction sub-steps. Recently, however, the community has shifted its focus to tackling baseline detection in document images independently, which creates a need for extraction methods that take these detected baselines as input. In this paper, a binarization-free dynamic programming approach that generates an equidistant text line extraction polygon is presented. The approach performs this calculation based on previously detected text baselines and automatically generated foreground-pixel distance maps. We evaluate our approach on both a synthetic competition corpus and a challenging real handwritten text recognition corpus, measuring not only the graphical error but also the impact on an HTR task trained with the line images it yields. We compare our solution with alternatives ranging from human-reviewed ground-truth polygons to simpler automatically generated rectangular areas.
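The dynamic programming idea behind such extraction methods can be illustrated with a small sketch: given a cost grid between two baselines (for instance, low cost where the distance to foreground ink is large), find the cheapest left-to-right separating path. This is a generic minimal-cost seam, not the paper's exact algorithm:

```python
def min_cost_seam(cost):
    """Dynamic programming: cheapest left-to-right path through a cost grid,
    moving one column at a time to the same or an adjacent row."""
    h, w = len(cost), len(cost[0])
    # dp[y] = cheapest cost to reach row y in the current column
    dp = [cost[y][0] for y in range(h)]
    back = []  # back[x-1][y] = row chosen in the previous column
    for x in range(1, w):
        prev, col_back = dp[:], []
        for y in range(h):
            candidates = [(prev[py], py) for py in (y - 1, y, y + 1)
                          if 0 <= py < h]
            best_cost, best_row = min(candidates)
            dp[y] = cost[y][x] + best_cost
            col_back.append(best_row)
        back.append(col_back)
    # Trace the cheapest seam back from the last column
    y = min(range(h), key=lambda r: dp[r])
    seam = [y]
    for col_back in reversed(back):
        y = col_back[y]
        seam.append(y)
    seam.reverse()
    return seam
```

With the cost derived from a foreground-pixel distance map, the seam naturally threads through the whitespace between two text lines, yielding one side of an extraction polygon.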

5 citations


Proceedings ArticleDOI
06 Jun 2018
TL;DR: A language-independent document segmentation system that segments a heterogeneous printed document into homogeneous components like halftones and graphics, texts and tables including its individual cells and Text Extraction from Table Cells is presented.
Abstract: Research on document layout analysis has become widespread recently and demands ever greater efficiency. Document segmentation is an important preprocessing step before analyzing layouts. This paper presents a language-independent document segmentation system that segments a heterogeneous printed document into homogeneous components such as halftones and graphics, texts, and tables, including their individual cells. From an input document page, homogeneous components are segmented in three steps by three separate modules: extraction of halftone images, extraction of tables, and segmentation of text blocks. Together these modules build the whole page-segmentation system, which takes an image of a heterogeneous document page and produces an output with homogeneous segments explicitly indicated by colored bounding boxes. The modules use morphological operations to detect the components. To improve the performance of image segmentation, Residual Image Fragments Retrieval (RIFR) is proposed; the paper also proposes Text Extraction from Table Cells (TETC). Combining RIFR and TETC yields an overall accuracy of 93%: table and cell detection reach 96%, whereas image and text segmentation are around 90%.

5 citations


Proceedings ArticleDOI
02 Feb 2018
TL;DR: A hybrid method consisting of three fundamental steps to detect table zones is presented: classification of the regions, detection of tables formed by intersecting horizontal and vertical lines, and identification of tables made up of only parallel lines.
Abstract: Table detection is a crucial step in many document analysis applications, as tables present essential information to readers in a structured manner. It remains a challenging problem due to the variety of table structures and the complexity of document layouts. This paper presents a hybrid method consisting of three fundamental steps to detect table zones: classification of the regions, detection of tables formed by intersecting horizontal and vertical lines, and identification of tables made up of only parallel lines. Experiments on the UW-III dataset show very promising results.
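The "intersecting horizontal and vertical lines" cue can be sketched as follows: given binary masks of detected horizontal and vertical rules, the crossings mark the table grid, and their bounding box gives a crude table-zone hypothesis. A minimal illustration under those assumptions (not the paper's actual procedure):

```python
def intersection_points(h_lines, v_lines):
    """Pixels where a detected horizontal and vertical rule cross,
    given two same-sized binary masks (lists of 0/1 rows)."""
    return [(y, x)
            for y, (hr, vr) in enumerate(zip(h_lines, v_lines))
            for x, (hp, vp) in enumerate(zip(hr, vr))
            if hp and vp]

def bounding_box(points):
    """Axis-aligned bounding box (top, left, bottom, right) of the crossings:
    with enough intersections, a candidate table zone."""
    ys = [p[0] for p in points]
    xs = [p[1] for p in points]
    return (min(ys), min(xs), max(ys), max(xs))
```

Tables drawn with only parallel lines produce no such crossings, which is why the method needs its separate third step for that case.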

4 citations


Proceedings ArticleDOI
01 Apr 2018
TL;DR: This paper presents a method for document layout analysis that applies whitespace analysis to maximal homogeneous regions, focusing on the balance between processing time and performance.
Abstract: This paper presents a method for document layout analysis based on whitespace analysis in maximal homogeneous regions, focusing on the balance between processing time and performance. It consists of two main stages: classification and segmentation. First, text and non-text elements are classified by analyzing whitespace in maximal multi-layer horizontal homogeneous regions. Then, text regions are extracted using mathematical morphology, while non-text elements are classified into separators, tables, and images via a machine learning approach. The effectiveness of the proposed method is demonstrated by tests on the UW-III (A1) dataset.
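A simple form of whitespace analysis is to project the foreground pixels of a binary region onto one axis and look for wide empty runs, which act as column or block separators. A minimal sketch of that step (illustrative only; the paper's multi-layer region scheme is more elaborate):

```python
def vertical_projection(mask):
    """Count foreground pixels per column of a binary region (0/1 rows)."""
    return [sum(col) for col in zip(*mask)]

def whitespace_gaps(profile, min_width):
    """Maximal runs of empty columns at least min_width wide:
    candidate separators for splitting the region."""
    gaps, start = [], None
    for x, v in enumerate(profile + [1]):  # sentinel closes a trailing gap
        if v == 0 and start is None:
            start = x
        elif v != 0 and start is not None:
            if x - start >= min_width:
                gaps.append((start, x - 1))
            start = None
    return gaps
```

Applying the same idea horizontally, and recursing into the sub-regions each gap produces, yields the classic recursive whitespace-cut segmentation.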

4 citations


Proceedings ArticleDOI
01 Oct 2018
TL;DR: This paper presents a new hybrid approach, founded on morphological operators and connected components, to analyze the structure of documents; experiments were conducted on a dataset of ancient historical newspapers.
Abstract: During the last decades, interest in digitally preserving historical documents has gained considerable attention. To exploit all the advantages and opportunities offered by digitized documents, it is necessary to understand their contents. The first step toward that understanding is to locate the entities of the document, such as figures, titles, captions, and text. This paper presents a new hybrid approach to analyzing document structure founded on morphological operators and connected components. The proposed method is divided into two stages: preprocessing, in which the quality of the document images is enhanced, and layout analysis, in which we identify three types of layout. We also include a fragmentation process that divides the page image into sections. Finally, we conducted experiments on a dataset containing ancient historical newspapers.

2 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: The proposed method combines printed-form analysis with colored handwritten-symbol detection and is primarily designed to support commercial activities and assist users in document layout analysis.
Abstract: One of the difficulties in automating market trading and data crawling in commercial activities, such as banking operations, is the identification of handwriting-marked regions. This is the first and most important step in mixed-document image understanding. In this paper, we propose a method to extract handwriting-marked regions in business forms. Our method combines printed-form analysis with colored handwritten-symbol detection. It is primarily designed to support commercial activities and assist users in document layout analysis. Evaluated on a real commercial dataset, the method achieves high performance.

2 citations