Proceedings ArticleDOI

Experiences of integration and performance testing of multilingual OCR for printed Indian scripts

TL;DR: The project is an attempt to implement an integrated platform for OCR of different Indian languages; it is currently being enhanced to handle space and time constraints, achieve higher recognition accuracies, and add new functionalities.
Abstract: This paper presents an integration and testing scheme for managing a large multilingual OCR project. The project is an attempt to implement an integrated platform for OCR of different Indian languages. Software engineering, workflow management, and testing processes are discussed. The OCR has now been experimentally deployed for some specific applications and is currently being enhanced to handle space and time constraints, achieve higher recognition accuracies, and add new functionalities.
Citations
Journal ArticleDOI
TL;DR: A benchmark image database of isolated handwritten Bangla compound characters, used in the standard Bangla literature, is presented, which may facilitate research on handwritten character recognition, especially related to Bangla form document processing systems.
Abstract: In this work, we present a benchmark image database of isolated handwritten Bangla compound characters, used in the standard Bangla literature. A thorough survey of more than 2 million Bangla words has revealed that there are around 334 compound characters in Bangla script. Of these, only around 171 character classes form unique pattern shapes, and some of these classes are often written in multiple styles. Altogether, 55,278 isolated character images, belonging to 199 different pattern shapes, were collected using three different data collection modalities. The database is divided into training and test sets in a 4:1 ratio for each pattern class, with a balanced distribution of shapes from the different modalities. A convex hull and quadtree-based feature set has been designed, and the test set recognition performance is reported with a support vector machine classifier. We have achieved a recognition accuracy of 79.35% on the test database consisting of 171 character classes. The complete compound character image database is freely available as CMATERdb 3.1.3.3 from the website http://code.google.com/p/cmaterdb/ , which may facilitate research on handwritten character recognition, especially related to Bangla form document processing systems.

81 citations

Proceedings Article
01 Nov 2012
TL;DR: This paper proposes a recognition scheme for the Indian script of Devanagari using a Recurrent Neural Network known as Bidirectional Long Short-Term Memory (BLSTM) and reports a reduction of more than 20% in word error rate and of over 9% in character error rate compared with the best available OCR system.
Abstract: In this paper, we propose a recognition scheme for the Indian script of Devanagari. Recognition accuracy for Devanagari script is not yet comparable to that of its Roman counterparts. This is mainly due to the complexity of the script, variation in writing style, and similar factors. Our solution uses a Recurrent Neural Network known as Bidirectional Long Short-Term Memory (BLSTM). Our approach does not require word-to-character segmentation, which is one of the most common reasons for a high word error rate. We report a reduction of more than 20% in word error rate and of over 9% in character error rate compared with the best available OCR system.

64 citations


Cites background or methods from "Experiences of integration and perf..."

  • ...Keywords: BLSTM, Word recognition, Devanagari, OCR...

    [...]

  • ...In this section, we quantitatively compare our method against a state-of-the-art Indian language OCR [1]....

    [...]

  • ...It results in more than 20% improvement in word accuracy while comparing traditional OCR system....

    [...]

Proceedings ArticleDOI
11 Apr 2016
TL;DR: An end-to-end RNN based architecture which can detect the script and recognize the text in a segmentation-free manner is proposed for this purpose and demonstrated for 12 Indian languages and English.
Abstract: In the Indian scenario, a document analysis system has to support multiple languages at the same time. With emerging multilingualism in urban India, often two, three, or even more languages need to be supported. This demands the development of a multilingual OCR system which can work seamlessly across Indic scripts. In our approach the script is identified at word level, prior to the recognition of the word. An end-to-end RNN based architecture which can detect the script and recognize the text in a segmentation-free manner is proposed for this purpose. We demonstrate the approach for 12 Indian languages and English. It is observed that, even with a similar architecture, performance on Indian languages is poorer than on English. We investigate this further. Our approach is evaluated on a large corpus comprising thousands of pages. The Hindi OCR is compared with other popular OCRs for the language, as further evidence of the efficacy of our method.

40 citations


Cites background from "Experiences of integration and perf..."

  • ...Though there have been many attempts in developing OCRs for Indian scripts from the 1970s to the beginning of this decade [2, 3, 4], methods that can scale across languages and yield reasonable results over a wide variety of documents are not yet devised....

    [...]

Proceedings ArticleDOI
27 Mar 2012
TL;DR: A novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images by exploiting the additional context present in the character n-gram images, which enables better disambiguation between confusing characters in the recognition phase.
Abstract: In this paper we present a novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images. OCRs perform well on good-quality documents, but fail easily in the presence of degradations. Also, classical OCR approaches perform poorly on complex scripts such as those of Indian languages. We address these issues by proposing to recognize character n-gram images, which are groupings of consecutive character/component segments. Our approach is unique in that we use the character n-grams as a primitive for recognition rather than for post-processing. By exploiting the additional context present in the character n-gram images, we enable better disambiguation between confusing characters in the recognition phase. The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them. Our method is inherently robust to degradations such as cuts and merges, which are common in digital libraries of scanned documents. We also present a reliable and scalable scheme for recognizing character n-gram images. Tests on English and Malayalam document images show considerable improvement in recognition in the case of heavily degraded documents.

40 citations


Cites background from "Experiences of integration and perf..."

  • ...The major contributions of our work are: • A novel re-posing of the OCR problem to one of recognizing character n-grams....

    [...]

Proceedings ArticleDOI
07 Apr 2014
TL;DR: A web based OCR system is proposed which follows a unified architecture for seven Indian languages, is robust against popular degradations, follows a segmentation-free approach, addresses the UNICODE re-ordering issues, and can enable continuous learning with user inputs and feedback.
Abstract: The current Optical Character Recognition (OCR) systems for Indic scripts are not robust enough to recognize arbitrary collections of printed documents. Reasons for this limitation include the lack of resources (e.g. not enough examples with natural variations, lack of documentation about the possible font/style variations) and an architecture which necessitates hard segmentation of word images followed by isolated symbol recognition. Variations among scripts, latent symbol to UNICODE conversion rules, non-standard fonts/styles and large degradations are some of the major reasons for the unavailability of robust solutions. In this paper, we propose a web based OCR system which (i) follows a unified architecture for seven Indian languages, (ii) is robust against popular degradations, (iii) follows a segmentation-free approach, (iv) addresses the UNICODE re-ordering issues, and (v) can enable continuous learning with user inputs and feedback. Our system is designed to aid continuous learning while remaining usable, i.e., we capture the user inputs (say, example images) to further improve the OCRs. We use the popular BLSTM based transcription scheme to achieve our target. This also enables incremental training and refinement in a seamless manner. We report superior accuracy rates in comparison with the available OCRs for the seven Indian languages.

24 citations


Cites background from "Experiences of integration and perf..."

  • ...There has been significant progress in the recent past on developing robust solutions [1], [2], [3]....

    [...]

  • ...Traditionally this module has been formulated as an adhoc composition of a set of isolated character (or symbol) classifiers [1], [9]....

    [...]

  • ...OCR [1] Tesseract [17] Our Method Char....

    [...]

  • ...Word to symbol/character separation, is required for classifiers that recognize isolated characters [1]....

    [...]

References
Journal ArticleDOI
TL;DR: A review of the OCR work done on Indian language scripts and the scope of future work and further steps needed for Indian script OCR development is presented.

592 citations

Journal ArticleDOI
TL;DR: A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented, and extension of the work to Devnagari, the third most popular script in the world, is discussed.

381 citations


"Experiences of integration and perf..." refers background in this paper

  • ...There have been many attempts in development of OCRs for Indian Scripts like Devanagari, Malayalam[10], Telugu, Tamil[7], Bangla[14], Gurumukhi[15] and Kannada[8]....

    [...]

  • ...Bangla and Devanagari are among the most symbol-rich since originally about 2000 shapes need to be recognized for a complete OCR system[14]....

    [...]

BookDOI
01 Jan 2010
TL;DR: This book provides an overview of the current state of the art in the OCR of the different Indic scripts as well as other issues in the creation of accessible digital libraries for Indic scripts.
Abstract: Research on OCR of Indian scripts has been gaining momentum in recent times. Many projects funded by government and industry are currently underway to scan hundreds of thousands of Indic-script documents and manuscripts to create large digital library archives that preserve these treasures for posterity. OCR is a key enabling technology for making these archives practically accessible to researchers and lay users alike by creating searchable indexes and machine-readable text repositories of these documents. This book provides an overview of the current state of the art in the OCR of the different Indic scripts as well as other issues in the creation of accessible digital libraries for Indic scripts. It provides a good technical overview of the latest research in the field.

68 citations

Book
27 Oct 2009
TL;DR: This unique guide/reference is the very first comprehensive book on the subject of OCR (Optical Character Recognition) for Indic scripts and provides a section on the enhancement of text and images obtained from historical Indic palm leaf manuscripts.
Abstract: This unique guide/reference is the very first comprehensive book on the subject of OCR (Optical Character Recognition) for Indic scripts. Features: contains contributions from the leading researchers in the field; discusses data set creation for OCR development; describes OCR systems that cover 8 different scripts, namely Bangla, Devanagari, Gurmukhi, Gujarati, Kannada, Malayalam, Tamil, and Urdu (Perso-Arabic); explores the challenges of Indic script handwriting recognition in the online domain; examines the development of handwriting-based text input systems; describes ongoing work to increase access to Indian cultural heritage materials; provides a section on the enhancement of text and images obtained from historical Indic palm leaf manuscripts; investigates different techniques for word spotting in Indic scripts; reviews mono-lingual and cross-lingual information retrieval in Indic languages. This is an excellent reference for researchers and graduate students studying OCR technology and methodologies.

46 citations

Book ChapterDOI
19 Aug 2002
TL;DR: Recognition of Indian language characters has been a topic of interest for quite some time and the need for efficient and robust algorithms and systems for recognition is being felt in India, especially in the post and telegraph department where OCR can assist the staff in sorting mail.
Abstract: Document Image processing and Optical Character Recognition (OCR) have been a frontline research area in the field of human-machine interface for the last few decades. Recognition of Indian language characters has been a topic of interest for quite some time. The earlier contributions were reported in [1] and [2]. A more recent work is reported in [3] and [9]. The need for efficient and robust algorithms and systems for recognition is being felt in India, especially in the post and telegraph department where OCR can assist the staff in sorting mail. Character recognition can also form a part in applications like intelligent scanning machines, text to speech converters, and automatic language-to-language translators.

46 citations


"Experiences of integration and perf..." refers background in this paper

  • ...There have been many attempts in development of OCRs for Indian Scripts like Devanagari, Malayalam[10], Telugu, Tamil[7], Bangla[14], Gurumukhi[15] and Kannada[8]....

    [...]

  • ...Tamil has 18 consonants and 12 vowels[7]....

    [...]