Experiences of integration and performance testing of multilingual OCR for printed Indian scripts

doi:10.1145/2034617.2034628

Proceedings Article•DOI•

Deepak Arya¹, C. V. Jawahar², Chakravorty Bhagvati³, Tushar Patnaik¹, Bidyut B. Chaudhuri⁴, Gurpreet Singh Lehal⁵, Santanu Chaudhury⁶, A. G. Ramakrishnan⁷

Centre for Development of Advanced Computing¹, International Institute of Information Technology, Hyderabad², University of Hyderabad³, Indian Statistical Institute⁴, Punjabi University⁵, Indian Institute of Technology Delhi⁶, Indian Institute of Science⁷

17 Sep 2011-pp 9

TL;DR: The project is an attempt to implement an integrated platform for OCR of different Indian languages; it is currently being enhanced to handle space and time constraints, achieve higher recognition accuracy and add new functionality.

Abstract: This paper presents an integration and testing scheme for managing a large multilingual OCR project. The project is an attempt to implement an integrated platform for OCR of different Indian languages. Software engineering, workflow management and testing processes are discussed. The OCR has been experimentally deployed for some specific applications and is currently being enhanced to handle space and time constraints, achieve higher recognition accuracy and add new functionality.

Citations

Journal Article•DOI•

A benchmark image database of isolated Bangla handwritten compound characters

Nibaran Das¹, Kallol Acharya¹, Ram Sarkar¹, Subhadip Basu¹, Mahantapas Kundu¹, Mita Nasipuri¹

Jadavpur University¹

01 Dec 2014-International Journal on Document Analysis and Recognition

TL;DR: A benchmark image database of isolated handwritten Bangla compound characters, used in the standard Bangla literature, is presented, which may facilitate research on handwritten character recognition, especially related to Bangla form document processing systems.

Abstract: In the present work, we present a benchmark image database of isolated handwritten Bangla compound characters, used in the standard Bangla literature. A thorough survey of more than 2 million Bangla words has revealed that there exist around 334 compound characters in Bangla script. Of these, only around 171 character classes form unique pattern shapes, and some of these classes are often written in multiple styles. Altogether, 55,278 isolated character images, belonging to 199 different pattern shapes, are collected using three different data collection modalities. The database is divided into training and test sets in a 4:1 ratio for each pattern class, by considering a balanced distribution of shapes from different modalities. A convex hull and quadtree-based feature set has been designed, and the test set recognition performance is reported with the support vector machine classifier. We have achieved a recognition accuracy of 79.35% on the test database consisting of 171 character classes. The complete compound character image database is freely available as CMATERdb 3.1.3.3 from the website http://code.google.com/p/cmaterdb/ , which may facilitate research on handwritten character recognition, especially related to Bangla form document processing systems.
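
The per-class 4:1 division described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `split_per_class`, its parameters, and the `(item, label)` sample format are assumptions made for illustration, and the sketch balances only by class, not by data collection modality.

```python
import random
from collections import defaultdict


def split_per_class(samples, train_parts=4, test_parts=1, seed=0):
    """Split (item, label) pairs into train and test sets, class by class.

    Hypothetical helper: each class is shuffled independently and divided
    in the ratio train_parts:test_parts (4:1 by default).
    """
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)
        # Test share for this class, rounded down.
        n_test = len(items) * test_parts // (train_parts + test_parts)
        test.extend((item, label) for item in items[:n_test])
        train.extend((item, label) for item in items[n_test:])
    return train, test


if __name__ == "__main__":
    data = [(i, "ka") for i in range(10)] + [(i, "kha") for i in range(5)]
    train_set, test_set = split_per_class(data)
    print(len(train_set), len(test_set))
```

To also balance modalities, one option would be to group by a `(label, modality)` pair instead of by label alone; the abstract does not say how the authors did this.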

81 citations

Proceedings Article•

Recognition of printed Devanagari text using BLSTM Neural Network

Naveen Sankaran¹, C. V. Jawahar¹

International Institute of Information Technology, Hyderabad¹

01 Nov 2012

TL;DR: This paper proposes a recognition scheme for the Indian script of Devanagari using a Recurrent Neural Network known as Bidirectional Long Short-Term Memory (BLSTM) and reports a reduction of more than 20% in word error rate and over 9% in character error rate compared with the best available OCR system.

Abstract: In this paper, we propose a recognition scheme for the Indian script of Devanagari. Recognition accuracy for Devanagari script is not yet comparable to that of its Roman counterparts. This is mainly due to the complexity of the script, writing style, etc. Our solution uses a Recurrent Neural Network known as Bidirectional Long Short-Term Memory (BLSTM). Our approach does not require word-to-character segmentation, which is one of the most common reasons for high word error rates. We report a reduction of more than 20% in word error rate and over 9% in character error rate compared with the best available OCR system.
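
The two metrics named in the abstract, word error rate and character error rate, are commonly computed as an edit distance normalised by the reference length. The sketch below shows that common definition; the abstract does not state the exact formula the authors used, so the function names and the normalisation are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref_text, hyp_text):
    """Edit distance over Unicode code points, divided by reference length."""
    if not ref_text:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref_text, hyp_text) / len(ref_text)


def word_error_rate(ref_text, hyp_text):
    """Edit distance over whitespace-separated words, divided by reference word count."""
    ref_words = ref_text.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hyp_text.split()) / len(ref_words)


if __name__ == "__main__":
    print(character_error_rate("kitten", "sitting"))
    print(word_error_rate("the cat sat", "the cat sit"))
```

Note that a Python `str` is indexed by Unicode code point, so for Devanagari this counts individual consonants, vowel signs and viramas rather than visual syllables; an evaluation at the syllable level would need a different tokenisation.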

64 citations

Cites background or methods from "Experiences of integration and perf..."

...Keywords: BLSTM, Word recognition, Devanagari, OCR...
...In this section, we quantitatively compare our method against a state-of-the-art Indian language OCR [1]....
...It results in more than 20% improvement in word accuracy while comparing traditional OCR system....

Proceedings Article•DOI•

Multilingual OCR for Indic Scripts

Minesh Mathew¹, Ajeet Kumar Singh¹, C. V. Jawahar¹

International Institute of Information Technology, Hyderabad¹

11 Apr 2016

TL;DR: An end-to-end RNN based architecture which can detect the script and recognize the text in a segmentation-free manner is proposed for this purpose and demonstrated for 12 Indian languages and English.

Abstract: In the Indian scenario, a document analysis system has to support multiple languages at the same time. With emerging multilingualism in urban India, often two, three or even more languages need to be supported. This demands development of a multilingual OCR system which can work seamlessly across Indic scripts. In our approach the script is identified at word level, prior to the recognition of the word. An end-to-end RNN based architecture which can detect the script and recognize the text in a segmentation-free manner is proposed for this purpose. We demonstrate the approach for 12 Indian languages and English. It is observed that, even with a similar architecture, performance on Indian languages is poorer than on English. We investigate this further. Our approach is evaluated on a large corpus comprising thousands of pages. The Hindi OCR is compared with other popular OCRs for the language, as further testimony to the efficacy of our method.

40 citations

Cites background from "Experiences of integration and perf..."

...Though there have been many attempts in developing OCRs for Indian scripts from the 1970s to the beginning of this decade [2, 3, 4], methods that can scale across languages and yield reasonable results over a wide variety of documents are not yet devised....

Proceedings Article•DOI•

Robust Recognition of Degraded Documents Using Character N-Grams

Shrey Dutta¹, Naveen Sankaran¹, K. Pramod Sankar², C. V. Jawahar¹

International Institute of Information Technology, Hyderabad¹, Xerox²

27 Mar 2012

TL;DR: A novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images by exploiting the additional context present in the character n-gram images, which enables better disambiguation between confusing characters in the recognition phase.

Abstract: In this paper we present a novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images. OCRs perform well on good quality documents, but fail easily in the presence of degradations. Also, classical OCR approaches perform poorly on complex scripts such as those for Indian languages. We address these issues by proposing to recognize character n-gram images, which are groupings of consecutive character/component segments. Our approach is unique, since we use the character n-grams as a primitive for recognition rather than for post-processing. By exploiting the additional context present in the character n-gram images, we enable better disambiguation between confusing characters in the recognition phase. The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them. Our method is inherently robust to degradations such as cuts and merges, which are common in digital libraries of scanned documents. We also present a reliable and scalable scheme for recognizing character n-gram images. Tests on English and Malayalam document images show considerable improvement in recognition in the case of heavily degraded documents.
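
The abstract describes character n-grams as groupings of consecutive character/component segments. A minimal sketch of that grouping step follows, assuming a word has already been cut into an ordered list of segments; the function name `consecutive_ngrams` and the use of strings as stand-ins for segment images are illustrative assumptions. The recognition and label-fusion stages are not specified in the abstract and are not shown.

```python
def consecutive_ngrams(segments, n):
    """Return every run of n consecutive segments, in left-to-right order.

    Hypothetical helper: `segments` is an ordered list of the
    character/component segments of one word.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # A word with fewer than n segments yields no n-grams.
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]


if __name__ == "__main__":
    word_segments = ["s1", "s2", "s3", "s4"]  # stand-ins for segment images
    for n in (1, 2, 3):
        print(n, consecutive_ngrams(word_segments, n))
```

Each interior segment appears in several overlapping n-grams, which is where the additional context mentioned in the abstract comes from.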

40 citations

Cites background from "Experiences of integration and perf..."

...The major contributions of our work are: • A novel re-posing of the OCR problem to one of recognizing character n-grams....

Proceedings Article•DOI•

Towards a Robust OCR System for Indic Scripts

Praveen Krishnan¹, Naveen Sankaran¹, Ajeet Kumar Singh¹, C. V. Jawahar¹

International Institute of Information Technology, Hyderabad¹

07 Apr 2014

TL;DR: A web based OCR system is proposed which follows a unified architecture for seven Indian languages, is robust against popular degradations, follows a segmentation-free approach, addresses the UNICODE re-ordering issues, and can enable continuous learning with user inputs and feedback.

Abstract: The current Optical Character Recognition (OCR) systems for Indic scripts are not robust enough for recognizing arbitrary collections of printed documents. Reasons for this limitation include the lack of resources (e.g. not enough examples with natural variations, lack of documentation about the possible font/style variations) and an architecture which necessitates hard segmentation of word images followed by isolated symbol recognition. Variations among scripts, latent symbol to UNICODE conversion rules, non-standard fonts/styles and large degradations are some of the major reasons for the unavailability of robust solutions. In this paper, we propose a web based OCR system which (i) follows a unified architecture for seven Indian languages, (ii) is robust against popular degradations, (iii) follows a segmentation-free approach, (iv) addresses the UNICODE re-ordering issues, and (v) can enable continuous learning with user inputs and feedback. Our system is designed to aid continuous learning while being usable, i.e., we capture the user inputs (say, example images) for further improving the OCRs. We use the popular BLSTM based transcription scheme to achieve our target. This also enables incremental training and refinement in a seamless manner. We report superior accuracy rates in comparison with the available OCRs for the seven Indian languages.

24 citations

Cites background from "Experiences of integration and perf..."

...There has been significant progress in the recent past on developing robust solutions [1], [2], [3]....
...Traditionally this module has been formulated as an adhoc composition of a set of isolated character (or symbol) classifiers [1], [9]....
...OCR [1] Tesseract [17] Our Method Char....
...Word to symbol/character separation, is required for classifiers that recognize isolated characters [1]....

References

Journal Article•DOI•

Indian script character recognition: a survey

Umapada Pal¹, Bidyut B. Chaudhuri¹

Indian Statistical Institute¹

01 Sep 2004-Pattern Recognition

TL;DR: A review of the OCR work done on Indian language scripts and the scope of future work and further steps needed for Indian script OCR development is presented.

592 citations

Journal Article•DOI•

A complete printed Bangla OCR system

Bidyut B. Chaudhuri¹, Umapada Pal¹

Indian Statistical Institute¹

01 Mar 1998-Pattern Recognition

TL;DR: A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented, and extension of the work to Devnagari, the third most popular script in the world, is discussed.

381 citations

"Experiences of integration and perf..." refers background in this paper

...There have been many attempts in development of OCRs for Indian Scripts like Devanagari, Malayalam[10], Telugu, Tamil[7], Bangla[14], Gurumukhi[15] and Kannada[8]....
...Bangla and Devanagari are among the most symbol-rich since originally about 2000 shapes need to be recognized for a complete OCR system[14]....

Book•DOI•

Guide to OCR for Indic Scripts

Venu Govindaraju, Srirangaraj Setlur

01 Jan 2010

TL;DR: This book provides an overview of the current state-of-the-art in the OCR of the different Indic scripts as well as other issues in the creation of accessible digital libraries for Indic scripts.

Abstract: Research on OCR of Indian scripts has been gaining momentum in recent times. Many projects funded by government and industry are currently underway to scan hundreds of thousands of Indic-script documents and manuscripts to create large digital library archives to preserve these treasures for posterity. OCR is a key enabling technology for making these archives practically accessible to researchers and lay users alike by creating searchable indexes and machine-readable text repositories of these documents. This book provides an overview of the current state-of-the-art in the OCR of the different Indic scripts as well as other issues in the creation of accessible digital libraries for Indic scripts. It provides a good technical overview of the latest research in the field.

68 citations

Book•

Guide to OCR for Indic Scripts: Document Recognition and Retrieval

Venu Govindaraju, Srirangaraj Setlur

27 Oct 2009

TL;DR: This unique guide/reference is the very first comprehensive book on the subject of OCR (Optical Character Recognition) for Indic scripts and provides a section on the enhancement of text and images obtained from historical Indic palm leaf manuscripts.

Abstract: This unique guide/reference is the very first comprehensive book on the subject of OCR (Optical Character Recognition) for Indic scripts. Features: contains contributions from the leading researchers in the field; discusses data set creation for OCR development; describes OCR systems that cover 8 different scripts: Bangla, Devanagari, Gurmukhi, Gujarati, Kannada, Malayalam, Tamil, and Urdu (Perso-Arabic); explores the challenges of Indic script handwriting recognition in the online domain; examines the development of handwriting-based text input systems; describes ongoing work to increase access to Indian cultural heritage materials; provides a section on the enhancement of text and images obtained from historical Indic palm leaf manuscripts; investigates different techniques for word spotting in Indic scripts; reviews mono-lingual and cross-lingual information retrieval in Indic languages. This is an excellent reference for researchers and graduate students studying OCR technology and methodologies.

46 citations

Book Chapter•DOI•

A Complete Tamil Optical Character Recognition System

K. G. Aparna, A. G. Ramakrishnan

19 Aug 2002

TL;DR: Recognition of Indian language characters has been a topic of interest for quite some time and the need for efficient and robust algorithms and systems for recognition is being felt in India, especially in the post and telegraph department where OCR can assist the staff in sorting mail.

Abstract: Document Image processing and Optical Character Recognition (OCR) have been a frontline research area in the field of human-machine interface for the last few decades. Recognition of Indian language characters has been a topic of interest for quite some time. The earlier contributions were reported in [1] and [2]. A more recent work is reported in [3] and [9]. The need for efficient and robust algorithms and systems for recognition is being felt in India, especially in the post and telegraph department where OCR can assist the staff in sorting mail. Character recognition can also form a part in applications like intelligent scanning machines, text to speech converters, and automatic language-to-language translators.

46 citations

"Experiences of integration and perf..." refers background in this paper

...There have been many attempts in development of OCRs for Indian Scripts like Devanagari, Malayalam[10], Telugu, Tamil[7], Bangla[14], Gurumukhi[15] and Kannada[8]....
...Tamil has 18 consonants and 12 vowels[7]....