Proceedings Article
iDocChip - A Configurable Hardware Architecture for Historical Document Image Processing: Text Line Extraction
Menbere Kina Tekleyohannes, Vladimir Rybalkin, Muhammad Mohsin Ghaffar, Norbert Wehn, Andreas Dengel +4 more
- pp 1-8
TLDR
iDocChip is a low-power, energy-efficient accelerator with real-time capabilities, a hybrid hardware-software programmable System-on-Chip (SoC) for digitizing historical documents. The resulting custom hardware accelerator outperforms the existing anyOCR software implementation by 120x while achieving 1700x higher energy efficiency, without affecting the high accuracy of the system.
Abstract:
Digitizing historical archives poses a great challenge because of the degraded quality of these documents. Even well-established Optical Character Recognition (OCR) systems, such as ABBYY, OCRopus, and Tesseract, fail to give sufficient recognition accuracy for historical archives, since they are optimized for transcribing contemporary documents. In contrast, the open-source anyOCR system is designed specifically for digitizing historical documents and uses state-of-the-art image processing techniques to achieve high accuracy. Currently, capturing historical document images for further OCR requires special scanning devices that are bulky and stationary. A portable device that combines scanning and OCR capabilities would therefore make it possible to transcribe documents without removing them from where they are archived. For example, smart goggles equipped with an embedded OCR device could be used for instant word spotting. However, the available anyOCR software implementation has a long runtime and high power consumption. As a solution, we propose a low-power, energy-efficient accelerator with real-time capabilities called iDocChip, a hybrid hardware-software programmable System-on-Chip (SoC) for digitizing historical documents. This chip can be easily integrated into a portable device. This paper focuses on one of the most crucial processing steps in anyOCR: text line extraction. We propose, to the best of our knowledge, the first hybrid hardware-software architecture of the text line extraction technique implemented on an FPGA-based programmable SoC. The resulting custom hardware accelerator outperforms the existing anyOCR software implementation by 120x while achieving 1700x higher energy efficiency, without affecting the high accuracy of the system.
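The two headline figures can be related with a short calculation. The sketch below is not taken from the paper: it assumes that energy equals average power times runtime for the same workload, so the energy-efficiency gain is the product of the speedup and the reduction in average power. The function name `implied_power_ratio` is hypothetical and used only for illustration.

```python
def implied_power_ratio(speedup: float, energy_gain: float) -> float:
    """Return how many times lower the accelerator's average power would be.

    Assumption (not stated in the abstract): energy = average power * runtime
    for the same workload, so energy_gain = speedup * power_ratio.
    """
    if speedup <= 0 or energy_gain <= 0:
        raise ValueError("both factors must be positive")
    return energy_gain / speedup


# Figures quoted in the abstract: 120x faster, 1700x more energy efficient.
ratio = implied_power_ratio(speedup=120.0, energy_gain=1700.0)
print(f"implied average power reduction: about {ratio:.1f}x")
```

Under that assumption, the quoted figures would correspond to roughly a 14x lower average power draw than the software implementation. The abstract does not report this number directly.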
Citations
Journal Article
iDocChip: A Configurable Hardware Architecture for Historical Document Image Processing
Menbere Kina Tekleyohannes, Vladimir Rybalkin, Muhammad Mohsin Ghaffar, Javier Alejandro Varela, Norbert Wehn, Andreas Dengel +5 more
TL;DR: In this paper, the authors propose a low-power, energy-efficient accelerator with real-time capabilities called iDocChip, a configurable hybrid hardware-software programmable system-on-chip (SoC) based on anyOCR for digitizing historical documents.
References
Journal Article
Text line segmentation of historical documents: a survey
TL;DR: The objective of this paper is to present a survey of existing methods, developed during the last decade and dedicated to documents of historical interest.
Proceedings Article
A Steerable Directional Local Profile Technique for Extraction of Handwritten Arabic Text Lines
TL;DR: Analysis of experimental results on the DARPA MADCAT Arabic handwritten document data indicates that the method is robust and capable of correctly isolating handwritten text lines even on challenging document images.
Proceedings Article
Handwritten Text Line Segmentation by Shredding Text into its Lines
Anguelos Nicolaou, Basilis Gatos +1 more
TL;DR: A novel technique is proposed that segments handwritten document images into text lines by shredding their surface with local minima tracers; it achieves promising results comparable to state-of-the-art text line segmentation techniques.
Journal Article
A Two-Stage Method for Text Line Detection in Historical Documents
TL;DR: In this paper, a two-stage text line detection method for historical documents is presented: the first stage labels each pixel as one of three classes (baseline, separator, or other), and the second stage performs bottom-up clustering to build baselines.
Journal Article
A comprehensive survey of mostly textual document segmentation algorithms since 2008
TL;DR: This survey highlights the variety of approaches that have been proposed for document image segmentation since 2008 and provides a clear typology of documents and of document image segmentation algorithms.