High Performance Layout Analysis of Arabic and Urdu Document Images

doi:10.1109/ICDAR.2011.257

Proceedings Article

- pp 1275-1279

TLDR

Evaluated on Arabic and Urdu document image datasets consisting of a variety of complex single- and multi-column layouts, the presented system achieves high accuracies for text and non-text segmentation, text-line extraction, and reading order determination.

Abstract:

Text-line extraction and reading order determination are important steps in optical character recognition (OCR) systems. Research in OCR of Arabic script documents has primarily focused on character recognition, and therefore most researchers use primitive methods such as projection profile analysis for text-line extraction. Although projection methods achieve good accuracy on clean, skew-corrected documents, their performance drops under challenging conditions (border noise, skew, complex layouts). This paper presents a robust layout analysis system for extracting text-lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian) and styles (Naskh, Nastaliq). The presented system is based on a suitable combination of well-established techniques for analyzing Latin script documents that have proven to be robust against different types of document image degradation. Evaluated on Arabic and Urdu document image datasets consisting of a variety of complex single- and multi-column layouts, the presented system achieves high accuracies for text and non-text segmentation, text-line extraction, and reading order determination.
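
For context, the projection profile baseline that the abstract contrasts against can be sketched roughly as follows. This is a minimal illustration and not the system presented in the paper: it assumes a binarized, skew-corrected, single-column page stored as a NumPy array with 1 for ink and 0 for background, and the function name `projection_profile_lines` and the `min_ink` threshold are hypothetical.

```python
import numpy as np

def projection_profile_lines(binary_page, min_ink=1):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive.

    Assumes binary_page is a 2-D array with 1 for ink and 0 for background.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = np.asarray(binary_page).sum(axis=1)
    # A row belongs to a text line if it holds at least min_ink ink pixels.
    is_text = profile >= min_ink

    lines = []
    start = None
    for row, flag in enumerate(is_text):
        if flag and start is None:
            start = row                  # a run of text rows begins
        elif not flag and start is not None:
            lines.append((start, row))   # the run ends at a blank row
            start = None
    if start is not None:
        lines.append((start, len(is_text)))  # run reaches the page bottom
    return lines

# Toy page with two "text lines" separated by blank rows.
page = np.zeros((10, 8), dtype=np.uint8)
page[1:3, 1:7] = 1
page[5:8, 2:6] = 1
print(projection_profile_lines(page))  # [(1, 3), (5, 8)]
```

The sketch relies on blank rows separating consecutive lines. On a skewed or multi-column page, neighbouring lines overlap in the profile, which is the kind of situation in which the abstract reports that projection methods lose accuracy.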

Citations

Journal Article

The optical character recognition of Urdu-like cursive scripts

Saeeda Naz, +5 more

TL;DR: The Urdu, Pushto, and Sindhi languages are discussed, with emphasis on the Nasta'liq and Naskh scripts and on the preprocessing, segmentation, feature extraction, classification, and recognition stages of OCR.

Journal Article

A comprehensive survey of mostly textual document segmentation algorithms since 2008

Sebastien Eskenazi, +2 more

- 01 Apr 2017 -

Pattern Recognition

TL;DR: This survey highlights the variety of approaches that have been proposed for document image segmentation since 2008 and provides a clear typology of documents and of document image segmentation algorithms.

Journal Article

Line and Ligature Segmentation of Urdu Nastaleeq Text

Ibrar Ahmad, +4 more

- 01 Jan 2017 -

IEEE Access

TL;DR: Both of the proposed segmentation algorithms outperform the existing algorithms employed for Urdu Nastaleeq text segmentation; they were also tested on Arabic, for which they extracted lines correctly.

Proceedings Article

End-to-End Information Extraction by Character-Level Embedding and Multi-Stage Attentional U-Net.

Tuan Anh Nguyen Dang, +1 more

TL;DR: A novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of the document, namely the Multi-Stage Attentional U-Net, which leverages a specialized multi-stage encoder-decoder design in conjunction with efficient use of the self-attention mechanism and the box convolution.

Proceedings Article

BCE-Arabic-v1 dataset: Towards interpreting Arabic document images for people with visual impairments

Rana S.M. Saad, +4 more

TL;DR: This paper shows that the inaccessibility of scanned PDF documents is in large part due to the failure of the OCR engine to understand the layout of an Arabic document, and investigates the performance of state-of-the-art document annotation tools.

References

Journal Article

Document analysis system

Kwan Y. Wong, +2 more

- 01 Nov 1982 -

IBM Journal of Research and Development

TL;DR: The requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing, are outlined; several critical functions are investigated and the technical approaches are discussed.

Journal Article

The document spectrum for page layout analysis

Lawrence O'Gorman

- 01 Nov 1993 -

IEEE Transactions on Pattern Analysis an...

TL;DR: The document spectrum (or docstrum) as discussed by the authors is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components, which yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.

Book

The document spectrum for page layout analysis

Lawrence O'Gorman

TL;DR: The document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components, yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.

Journal Article

Twenty years of document image analysis in PAMI

George Nagy

- 01 Jan 2000 -

IEEE Transactions on Pattern Analysis an...

TL;DR: The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and evaluated.

Journal Article

A prototype document image analysis system for technical journals

George Nagy, +2 more

- 01 Jul 1992 -

IEEE Computer

TL;DR: The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described, along with the process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools.

Related Papers (5)

A segmentation-free approach to Arabic and Urdu OCR

Extraction of Arabic text from multilingual documents

Implementation of a statistical based Arabic character recognition system

Automatic language identification of bilingual English and Farsi scripts

A robust page segmentation method for Persian/Arabic documents