Proceedings Article

High Performance Layout Analysis of Arabic and Urdu Document Images

Abstract
Extracting text-lines and determining their reading order is an important step in optical character recognition (OCR) systems. Research on OCR of Arabic script documents has focused primarily on character recognition, so most researchers rely on primitive methods such as projection profile analysis for text-line extraction. Although projection methods achieve good accuracy on clean, skew-corrected documents, their performance drops under challenging conditions such as border noise, skew, and complex layouts. This paper presents a robust layout analysis system for extracting text-lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian) and styles (Naskh, Nastaliq). The system combines several well-established techniques for analyzing Latin script documents that have proven robust against different types of document image degradation. Evaluated on Arabic and Urdu document image datasets covering a variety of complex single- and multi-column layouts, the system achieves high accuracy for text and non-text segmentation, text-line extraction, and reading order determination.
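
The projection profile baseline mentioned in the abstract can be illustrated with a minimal sketch. This is not the system presented in the paper; it only shows the primitive technique the abstract contrasts against. The function names, the `min_ink` threshold, and the toy binary images are assumptions made for illustration (1 = ink, 0 = background).

```python
def horizontal_projection(image):
    """Count ink pixels in each row of a binary image (1 = ink, 0 = background)."""
    return [sum(row) for row in image]


def extract_text_lines(image, min_ink=1):
    """Return inclusive (top, bottom) row ranges where the profile is >= min_ink.

    Hypothetical helper: each maximal run of "inked" rows is treated as one
    text-line, which is the basic idea behind projection profile analysis.
    """
    lines = []
    start = None
    for y, ink in enumerate(horizontal_projection(image)):
        if ink >= min_ink and start is None:
            start = y                      # a new run of inked rows begins
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))   # the run ended on the previous row
            start = None
    if start is not None:                  # run reaches the bottom of the page
        lines.append((start, len(image) - 1))
    return lines


# Clean toy page: two text-lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Skewed toy page: two lines drift downward, so no row is free of ink.
skewed = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

print(extract_text_lines(page))    # [(1, 2), (4, 4)]
print(extract_text_lines(skewed))  # [(0, 3)]
```

On the clean page the blank row separates the two lines. On the skewed page every row contains ink, so the two lines merge into one region, which is the kind of failure under skew that the abstract describes.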
