Proceedings Article

Document Layout Analysis Using Multigaussian Fitting

TLDR
A novel technique for layout analysis of documents with complex Manhattan layouts that requires only one parameter, the number of Gaussians to fit to the height histogram data, and is therefore easy to automate and adapt to many documents.
Abstract
This paper proposes a novel technique for layout analysis of documents with complex Manhattan layouts. The technique is designed for Indic script newspapers but also works on many other types of documents with Manhattan layouts, not necessarily in Indic scripts. The main idea behind the algorithm is to categorise the physical elements of a document into noise, text, titles and graphics based on their heights. A histogram of heights is computed from the bounding boxes of connected components, and a multigaussian fit is used to discover optimal split points between the categories. The Gaussian with the highest peak is assumed to correspond to running text. Running text regions are grouped into blocks using nearest neighbour analysis. These initial regions are further refined using a second-level classification of the other elements into graphics, light-coloured text on a dark background, and graphical separators. The resulting layouts show accuracies comparable to some of the best and most popular algorithms, such as MHS (winner of the ICDAR-RDCL2015 competition) and Aletheia (a tool developed by the PRImA Research Lab). Its performance is demonstrated by tests on many Indic script newspapers and other documents, and by comparison with Aletheia and MHS on the ICDAR dataset. Initial results on an Indic document dataset show high performance in identifying running text (> 98%), with an accuracy of 82% in identifying the other elements. Ground truth data for the Indic script newspaper documents is being generated for more extensive quantitative testing. The strength of the algorithm is that it requires only one parameter, the number of Gaussians to fit to the height histogram data, and is therefore easy to automate and adapt to many documents.
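
The height-based categorisation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the multigaussian fit is computed or how split points are derived from it, so the sketch makes two assumptions. It fits a 1-D Gaussian mixture to the raw heights with expectation-maximisation (started from a few k-means steps) instead of curve-fitting the histogram, and it takes each split point to be the height at which two adjacent weighted Gaussians cross. All function names (`fit_height_mixture`, `running_text_index`, `split_points`) and the synthetic heights are hypothetical.

```python
import math
import random


def weighted_pdf(x, mean, std, weight):
    """Value at x of a Gaussian scaled by its mixture weight."""
    z = (x - mean) / std
    return weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_height_mixture(heights, k, em_iters=50, min_std=0.5):
    """Fit k Gaussians to 1-D heights: k-means start, then EM refinement.

    Returns a list of (mean, std, weight) tuples sorted by mean.
    """
    data = sorted(heights)
    n = len(data)
    # Start the centres at evenly spaced quantiles, then run a few k-means steps.
    centres = [data[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(20):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centres[j]))
            groups[nearest].append(x)
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    means = list(centres)
    stds = []
    for g, m in zip(groups, means):
        var = sum((x - m) ** 2 for x in g) / len(g) if g else 0.0
        stds.append(max(math.sqrt(var), min_std))
    weights = [max(len(g), 1) / n for g in groups]
    for _ in range(em_iters):
        # E-step: responsibility of each component for each height.
        resp = []
        for x in data:
            p = [weighted_pdf(x, m, s, w) for m, s, w in zip(means, stds, weights)]
            total = sum(p)
            resp.append([v / total for v in p] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate mean, spread and weight from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = max(math.sqrt(var), min_std)
            weights[j] = nj / n
    return sorted(zip(means, stds, weights))


def running_text_index(components):
    """Index of the component with the highest peak (weight / std, up to a constant)."""
    return max(range(len(components)), key=lambda j: components[j][2] / components[j][1])


def split_points(components):
    """Heights where adjacent weighted Gaussians cross (assumed split rule), by bisection."""
    splits = []
    for (m1, s1, w1), (m2, s2, w2) in zip(components, components[1:]):
        diff = lambda x: weighted_pdf(x, m1, s1, w1) - weighted_pdf(x, m2, s2, w2)
        lo, hi = m1, m2
        if diff(lo) <= 0 or diff(hi) >= 0:
            splits.append((m1 + m2) / 2.0)  # no clean crossing: fall back to the midpoint
            continue
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if diff(mid) > 0:
                lo = mid
            else:
                hi = mid
        splits.append((lo + hi) / 2.0)
    return splits


# Synthetic demo: made-up component heights (pixels) for noise, running text and titles.
rng = random.Random(0)
heights = ([rng.gauss(3, 0.7) for _ in range(150)]
           + [rng.gauss(12, 1.2) for _ in range(600)]
           + [rng.gauss(30, 3.0) for _ in range(60)])
components = fit_height_mixture(heights, k=3)
text_index = running_text_index(components)
splits = split_points(components)
print(components[text_index], splits)
```

In this sketch the only value the caller chooses for the fit is `k`, the number of Gaussians, which mirrors the single parameter the abstract highlights. Components whose heights fall between two consecutive split points would then be assigned to the corresponding category.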

