Document Layout Analysis Using Multigaussian Fitting

doi:10.1109/ICDAR.2017.127

Proceedings Article

- pp 747-752

TLDR

A novel technique for layout analysis of documents with complex Manhattan layouts. It requires only one parameter, the number of Gaussians fitted to the height histogram data, and is therefore easy to automate and adapt to many documents.

Abstract:

This paper proposes a novel technique for layout analysis of documents with complex Manhattan layouts. The technique is designed for Indic script newspapers, but it also works on many other types of documents with Manhattan layouts, not necessarily in Indic scripts. The main idea behind the algorithm is to categorise the physical elements of a document into noise, text, titles and graphics based on their heights. A histogram of heights is computed from the bounding boxes of connected components, and a multigaussian fit is used to discover optimal split points between the categories. The Gaussian with the highest peak is assumed to correspond to running text. Running text regions are grouped into blocks using nearest neighbour analysis. These initial regions are further refined using a second-level classification of the other elements into graphics, light-coloured text on a dark background, and graphical separators. The resulting layouts show accuracies comparable to some of the best and most popular algorithms, such as MHS (winner of the ICDAR-RDCL2015 competition) and Aletheia (a tool developed by the PRImA Research Lab). The technique is tested on many Indic script newspapers and other documents, and compared with Aletheia and MHS on the ICDAR dataset. Initial results on an Indic document dataset show high performance in identifying running text (> 98%), with an accuracy of 82% in identifying the other elements. Ground truth data for the Indic script newspaper documents is being generated for more extensive quantitative testing. The strength of the algorithm is that it requires only one parameter, the number of Gaussians fitted to the height histogram data, and is therefore easy to automate and adapt to many documents.
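
The height-based categorisation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the connected-component heights are already available as a list of numbers, and it swaps in a plain expectation-maximisation fit of a 1-D Gaussian mixture on the raw heights in place of whatever histogram-fitting procedure the paper uses. All function names (`fit_mixture`, `running_text_index`, `split_points`) and defaults are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_mixture(heights, n_gaussians, n_iter=200, min_std=0.5):
    """Fit a 1-D Gaussian mixture to component heights with EM.

    Returns a list of (weight, mean, std) tuples sorted by mean.
    n_gaussians is the single user-supplied parameter; n_iter and
    min_std are assumptions of this sketch, not values from the paper.
    """
    data = list(heights)
    n = len(data)
    lo, hi = min(data), max(data)
    # Deterministic start: means evenly spaced over the height range.
    means = [lo + (k + 0.5) * (hi - lo) / n_gaussians for k in range(n_gaussians)]
    stds = [max((hi - lo) / (2.0 * n_gaussians), min_std)] * n_gaussians
    weights = [1.0 / n_gaussians] * n_gaussians
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each height.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(p) or 1e-300
            resp.append([v / total for v in p])
        # M-step: re-estimate weight, mean and spread of each Gaussian.
        for k in range(n_gaussians):
            rk = sum(r[k] for r in resp)
            if rk < 1e-9:
                continue  # leave an empty component unchanged
            weights[k] = rk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / rk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / rk
            stds[k] = max(math.sqrt(var), min_std)
    return sorted(zip(weights, means, stds), key=lambda c: c[1])


def running_text_index(components):
    """Index of the Gaussian with the highest peak (assumed running text)."""
    peaks = [w * gaussian_pdf(m, m, s) for w, m, s in components]
    return peaks.index(max(peaks))


def split_points(components, step=0.1):
    """Heights at which the dominant Gaussian changes between neighbours."""
    splits = []
    for (w1, m1, s1), (w2, m2, s2) in zip(components, components[1:]):
        x = m1
        # Walk from one mean towards the next until the taller neighbour wins.
        while x < m2 and w1 * gaussian_pdf(x, m1, s1) >= w2 * gaussian_pdf(x, m2, s2):
            x += step
        splits.append(x)
    return splits


# Synthetic heights in pixels: small specks, running text, and titles.
heights = [2, 3, 3, 4] * 10 + [20, 21, 22, 22, 23, 24] * 50 + [40, 42, 44] * 8
components = fit_mixture(heights, n_gaussians=3)
print("components (weight, mean, std):", components)
print("running text component:", running_text_index(components))
print("split points:", split_points(components))
```

In this sketch a component is then categorised by comparing its height against the split points. Choosing the Gaussian with the highest peak rather than the largest weight follows the assumption stated in the abstract. A real pipeline would take the heights from connected-component bounding boxes of a binarised page rather than from a hand-written list.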
