Proceedings Article

iDocChip - A Configurable Hardware Architecture for Historical Document Image Processing: Text Line Extraction

TLDR
iDocChip is a low-power, energy-efficient accelerator with real-time capabilities, built as a hybrid hardware-software programmable System-on-Chip (SoC) for digitizing historical documents. The resulting custom hardware accelerator outperforms the existing anyOCR software implementation by 120x while achieving 1700x higher energy efficiency, without affecting the high accuracy of the system.
Abstract
Digitizing historical archives is a great challenge because of the quality degradation found in these documents. Even well-established Optical Character Recognition (OCR) systems, such as ABBYY, OCRopus, and Tesseract, fail to deliver sufficient recognition accuracy on historical archives, since they are optimized for transcribing contemporary documents. In contrast, the open-source anyOCR system is designed specifically for digitizing historical documents and uses state-of-the-art image processing techniques to achieve high accuracy. Today, capturing historical document images for subsequent OCR requires special scanning devices that are bulky and stationary. A portable device that combines scanning and OCR capabilities would therefore make it possible to transcribe documents without removing them from where they are archived. For example, smart goggles equipped with an embedded OCR device could be used for instant word spotting. However, the available anyOCR software implementation has a long runtime and high power consumption. As a solution, we propose iDocChip, a low-power, energy-efficient accelerator with real-time capabilities, realized as a hybrid hardware-software programmable System-on-Chip (SoC) for digitizing historical documents. This chip can be easily integrated into a portable device. This paper focuses on one of the most crucial processing steps in anyOCR: text line extraction. We propose, to the best of our knowledge, the first hybrid hardware-software architecture of the text line extraction technique implemented on an FPGA-based programmable SoC. The resulting custom hardware accelerator outperforms the existing anyOCR software implementation by 120x while achieving 1700x higher energy efficiency, without affecting the high accuracy of the system.
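The two headline figures in the abstract can be related with simple arithmetic. The following is a minimal sketch, not taken from the paper: it assumes that "energy efficiency" means the same workload completed per joule and that energy equals average power multiplied by runtime. The function name `implied_power_reduction` is hypothetical and used only for illustration.

```python
def implied_power_reduction(speedup: float, energy_efficiency_gain: float) -> float:
    """Return how many times lower the accelerator's average power would be.

    Assumption (not stated in the abstract): for one fixed workload,
    energy = average power * runtime, and "energy efficiency" is the
    inverse of the energy spent on that workload.
    """
    if speedup <= 0 or energy_efficiency_gain <= 0:
        raise ValueError("both ratios must be positive")
    # E_sw / E_hw = (P_sw * t_sw) / (P_hw * t_hw) = (P_sw / P_hw) * speedup,
    # so P_sw / P_hw = energy_efficiency_gain / speedup.
    return energy_efficiency_gain / speedup


# Figures quoted in the abstract: 120x speedup, 1700x energy efficiency.
ratio = implied_power_reduction(120.0, 1700.0)
print(f"implied average power reduction: about {ratio:.1f}x")
```

Under these assumptions, the reported numbers would correspond to an average power draw roughly 14 times lower than that of the software implementation; the paper's own measurement methodology may differ.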


Citations
Journal Article

iDocChip: A Configurable Hardware Architecture for Historical Document Image Processing

TL;DR: In this paper, the authors proposed iDocChip, a low-power, energy-efficient accelerator with real-time capabilities: a configurable hybrid hardware-software programmable system-on-chip (SoC) based on anyOCR for digitizing historical documents.
References
Journal Article

Text line segmentation of historical documents: a survey

TL;DR: The objective of this paper is to present a survey of existing methods, developed during the last decade and dedicated to documents of historical interest.
Proceedings Article

A Steerable Directional Local Profile Technique for Extraction of Handwritten Arabic Text Lines

TL;DR: Analysis of experimental results on the DARPA MADCAT Arabic handwritten document data indicates that the method is robust and capable of correctly isolating handwritten text lines even on challenging document images.
Proceedings Article

Handwritten Text Line Segmentation by Shredding Text into its Lines

TL;DR: A novel technique is proposed that segments handwritten document images into text lines by shredding their surface with local minima tracers; it achieves promising results comparable to state-of-the-art text line segmentation techniques.
Journal Article

A Two-Stage Method for Text Line Detection in Historical Documents

TL;DR: In this paper, a two-stage text line detection method for historical documents is presented: the first stage labels each pixel as one of three classes (baseline, separator, or other), and the second stage performs bottom-up clustering to build baselines.
Journal Article

A comprehensive survey of mostly textual document segmentation algorithms since 2008

TL;DR: This survey highlights the variety of approaches that have been proposed for document image segmentation since 2008 and provides a clear typology of documents and of document image segmentation algorithms.