
Showing papers on "Document layout analysis published in 2018"


Proceedings ArticleDOI
01 Jan 2018
TL;DR: This paper proposes an open-source implementation of a CNN-based pixel-wise predictor coupled with task dependent post-processing blocks and shows that a single CNN-architecture can be used across tasks with competitive results.
Abstract: In recent years there have been multiple successful attempts to tackle document processing problems separately by designing task-specific hand-tuned strategies. We argue that the diversity of historical document processing tasks makes it impractical to solve them one at a time and shows the need for generic approaches that can handle the variability of historical series. In this paper, we address multiple tasks simultaneously, such as page extraction, baseline extraction, layout analysis, and the extraction of multiple typologies of illustrations and photographs. We propose an open-source implementation of a CNN-based pixel-wise predictor coupled with task-dependent post-processing blocks. We show that a single CNN architecture can be used across tasks with competitive results. Moreover, most of the task-specific post-processing steps can be decomposed into a small number of simple, standard, reusable operations, adding to the flexibility of our approach.
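The post-processing the abstract describes, turning a pixel-wise probability map into discrete regions, typically reduces to a few reusable operations such as thresholding, connected-component labeling, and small-component filtering. A minimal pure-Python sketch of that pipeline (function names are illustrative, not from the paper's implementation):

```python
from collections import deque

def binarize(prob_map, threshold=0.5):
    """Threshold a 2-D probability map (list of lists) into a 0/1 mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]

def connected_components(mask):
    """Label 4-connected foreground components; return a list of pixel sets."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                comp, queue = set(), deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    comp.add((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(comp)
    return components

def filter_small(components, min_size=2):
    """Discard components below a minimum pixel count (noise removal)."""
    return [c for c in components if len(c) >= min_size]
```

Chaining these three steps over a predictor's output map already covers the common case; task-specific blocks would add operations such as polygon fitting or baseline tracing on top.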

92 citations


Posted Content
TL;DR: A system based on artificial neural networks which is able to determine not only the baselines of text lines present in the document, but also performs geometric and logic layout analysis of the document.
Abstract: Document layout analysis is a fundamental step in handwritten text processing systems, from the extraction of text lines to the determination of the type of zone each line belongs to. We present a system based on artificial neural networks that determines not only the baselines of the text lines present in the document but also performs geometric and logical layout analysis of the document. Experiments on three different datasets demonstrate the potential of the method and show competitive results with respect to state-of-the-art methods.

38 citations


Proceedings ArticleDOI
01 Aug 2018
TL;DR: The results show the impact caused by an optical modelling technological transition: from classical HMM-based methods to new technology based on recurrent neural networks.
Abstract: We present the process being followed for the transcription of a large 18th-century manuscript collection with the help of Handwritten Text Recognition (HTR) technology. The documents are processed in batches of 50 pages each. For each batch we perform two semi-supervised processes: one to analyze the layout and detect the text lines, and another to produce the full transcripts of the text. At the users' request, both diplomatic and modernized transcripts, as well as semantically tagged versions, are being produced. Layout-analysis supervision is performed with a conventional layout editing tool, while transcripts, including automatic modernization and tagging, are produced with a web-based computer-assisted interactive-predictive tool (CATTI). We report the performance of this process on the 12 image batches processed so far. The results show the impact of an optical-modelling technological transition: from classical HMM-based methods to new technology based on recurrent neural networks.

11 citations


Proceedings ArticleDOI
01 Aug 2018
TL;DR: A binarization-free dynamic programming approach that generates an equidistant text line extraction polygon is presented and compared with other solutions, ranging from human-reviewed ground-truth polygons to simpler automatically generated rectangular areas.
Abstract: Text line segmentation is a basic document layout task that consists of detecting and extracting the text lines present in a document page image. Although considered basic, it is generally a necessary step for higher-level Handwritten Text Recognition (HTR) tasks: most state-of-the-art automatic text recognition, text-to-line image alignment, and keyword spotting systems require isolated text line images as input. Traditionally, most text line segmentation approaches cover both the detection and extraction sub-steps. Recently, however, the community has shifted its focus to tackling baseline detection in document images independently, which creates a need for extraction methods that take these detected baselines as input. In this paper, a binarization-free dynamic programming approach that generates an equidistant text line extraction polygon is presented. The approach performs this calculation based on previously detected text baselines and automatically generated foreground-pixel distance maps. We evaluate our approach on both a synthetic competition corpus and a challenging real handwritten text recognition corpus, measuring not only the graphical error but also the impact on an HTR task trained with the line images it yields. We compare our solution with alternatives ranging from human-reviewed ground-truth polygons to simpler automatically generated rectangular areas.
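The dynamic programming idea behind such extraction methods can be illustrated with a small sketch: given a cost grid between two baselines (for instance, low cost where the distance to foreground ink is large), find the cheapest left-to-right separating path. This is a generic minimal-cost seam, not the paper's exact algorithm:

```python
def min_cost_seam(cost):
    """Dynamic programming: cheapest left-to-right path through a cost grid,
    moving one column at a time to the same or an adjacent row."""
    h, w = len(cost), len(cost[0])
    # dp[y] = cheapest cost to reach row y in the current column
    dp = [cost[y][0] for y in range(h)]
    back = []  # back[x-1][y] = row chosen in the previous column
    for x in range(1, w):
        prev, col_back = dp[:], []
        for y in range(h):
            candidates = [(prev[py], py) for py in (y - 1, y, y + 1)
                          if 0 <= py < h]
            best_cost, best_row = min(candidates)
            dp[y] = cost[y][x] + best_cost
            col_back.append(best_row)
        back.append(col_back)
    # Trace the cheapest seam back from the last column
    y = min(range(h), key=lambda r: dp[r])
    seam = [y]
    for col_back in reversed(back):
        y = col_back[y]
        seam.append(y)
    seam.reverse()
    return seam
```

With the cost derived from a foreground-pixel distance map, the seam naturally threads through the whitespace between two text lines, yielding one side of an extraction polygon.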

5 citations


Proceedings ArticleDOI
06 Jun 2018
TL;DR: A language-independent document segmentation system that segments a heterogeneous printed document into homogeneous components like halftones and graphics, texts and tables including its individual cells and Text Extraction from Table Cells is presented.
Abstract: Research on document layout analysis has become widespread recently and demands ever greater efficiency. Document segmentation is an important preprocessing step before analyzing layouts. This paper presents a language-independent document segmentation system that segments a heterogeneous printed document into homogeneous components such as halftones and graphics, texts, and tables, including their individual cells. From an input document page, homogeneous components are segmented in three steps by three separate modules: extraction of halftone images, extraction of tables, and segmentation of text blocks. Together these modules build the whole page-segmentation system, which takes an image of a heterogeneous document page and produces an output with homogeneous segments explicitly indicated by colored bounding boxes. The modules use morphological operations to detect the components. To improve the performance of image segmentation, Residual Image Fragments Retrieval (RIFR) is proposed; the paper also proposes Text Extraction from Table Cells (TETC). Combining RIFR and TETC yields an overall accuracy of 93%: table and cell detection reach 96%, whereas image and text segmentation are around 90%.

5 citations


Proceedings ArticleDOI
02 Feb 2018
TL;DR: A hybrid method consisting of three fundamental steps to detect table zones is presented: classification of the regions, detection of tables formed by intersecting horizontal and vertical lines, and identification of tables made up of only parallel lines.
Abstract: Table detection is a crucial step in many document analysis applications, as tables present essential information to readers in a structured manner. It remains a challenging problem due to the variety of table structures and the complexity of document layouts. This paper presents a hybrid method consisting of three fundamental steps to detect table zones: classification of the regions, detection of tables formed by intersecting horizontal and vertical lines, and identification of tables made up of only parallel lines. Experiments on the UW-III dataset show very promising results.
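The "intersecting horizontal and vertical lines" cue can be sketched as follows: given binary masks of detected horizontal and vertical rules, the crossings mark the table grid, and their bounding box gives a crude table-zone hypothesis. A minimal illustration under those assumptions (not the paper's actual procedure):

```python
def intersection_points(h_lines, v_lines):
    """Pixels where a detected horizontal and vertical rule cross,
    given two same-sized binary masks (lists of 0/1 rows)."""
    return [(y, x)
            for y, (hr, vr) in enumerate(zip(h_lines, v_lines))
            for x, (hp, vp) in enumerate(zip(hr, vr))
            if hp and vp]

def bounding_box(points):
    """Axis-aligned bounding box (top, left, bottom, right) of the crossings:
    with enough intersections, a candidate table zone."""
    ys = [p[0] for p in points]
    xs = [p[1] for p in points]
    return (min(ys), min(xs), max(ys), max(xs))
```

Tables drawn with only parallel lines produce no such crossings, which is why the method needs its separate third step for that case.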

4 citations


Proceedings ArticleDOI
01 Apr 2018
TL;DR: This paper presents a method for document layout analysis that applies whitespace analysis to maximal homogeneous regions, focusing on the balance between processing time and performance.
Abstract: This paper presents a method for document layout analysis based on whitespace analysis in maximal homogeneous regions, focusing on the balance between processing time and performance. It consists of two main stages: classification and segmentation. First, text and non-text elements are classified by analyzing whitespace in maximal multi-layer horizontal homogeneous regions. Then, text regions are extracted using mathematical morphology, while non-text elements are classified into separators, tables, and images via a machine learning approach. The effectiveness of the proposed method is demonstrated by tests on the UW-III (A1) dataset.
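A simple form of whitespace analysis is to project the foreground pixels of a binary region onto one axis and look for wide empty runs, which act as column or block separators. A minimal sketch of that step (illustrative only; the paper's multi-layer region scheme is more elaborate):

```python
def vertical_projection(mask):
    """Count foreground pixels per column of a binary region (0/1 rows)."""
    return [sum(col) for col in zip(*mask)]

def whitespace_gaps(profile, min_width):
    """Maximal runs of empty columns at least min_width wide:
    candidate separators for splitting the region."""
    gaps, start = [], None
    for x, v in enumerate(profile + [1]):  # sentinel closes a trailing gap
        if v == 0 and start is None:
            start = x
        elif v != 0 and start is not None:
            if x - start >= min_width:
                gaps.append((start, x - 1))
            start = None
    return gaps
```

Applying the same idea horizontally, and recursing into the sub-regions each gap produces, yields the classic recursive whitespace-cut segmentation.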

4 citations


Proceedings ArticleDOI
01 Oct 2018
TL;DR: This paper presents a new hybrid approach, founded on morphological operators and connected components, to analyze the structure of documents; experiments were conducted on a dataset of ancient historical newspapers.
Abstract: During the last decades, interest in digitally preserving historical documents has gained considerable attention. To exploit all the advantages and opportunities offered by digitized documents, it is necessary to understand their contents. The first step toward that understanding is to locate the entities of the document, such as figures, titles, captions, and text. This paper presents a new hybrid approach to analyzing document structure founded on morphological operators and connected components. The proposed method is divided into two stages: preprocessing, in which the quality of the document images is enhanced, and layout analysis, in which we identify three types of layout. We also include a fragmentation process that divides the page image into sections. Finally, we conducted experiments on a dataset containing ancient historical newspapers.

2 citations


Proceedings ArticleDOI
01 Nov 2018
TL;DR: The proposed method combines printed-form analysis with colored handwritten-symbol detection and is primarily designed to support commercial activities and assist users in document layout analysis.
Abstract: One of the difficulties in automating market trading and data crawling in commercial activities, such as banking operations, is the identification of handwriting-marked regions. This is the first and most important step in mixed-document image understanding. In this paper, we propose a method to extract handwriting-marked regions in business forms. Our method combines printed-form analysis with colored handwritten-symbol detection. It is primarily designed to support commercial activities and assist users in document layout analysis. Evaluated on a real commercial dataset, the method achieves high performance.

2 citations