Proceedings Article

A complete OCR for printed Hindi text in Devanagari script

TL;DR: A complete OCR for printed Hindi text in Devanagari script is presented and a performance of 93% at character level is obtained.
Abstract: In this paper, we present a complete OCR for printed Hindi text in Devanagari script. A performance of 93% at character level is obtained.
Citations
Journal Article
01 Nov 2011
TL;DR: This paper surveys the state of the art, from the 1970s onward, in machine-printed and handwritten Devanagari optical character recognition (OCR).
Abstract: In India, more than 300 million people use the Devanagari script for documentation. Research on the recognition of printed and handwritten Devanagari text has improved significantly in the past few years. This paper discusses the state of the art, from the 1970s onward, in machine-printed and handwritten Devanagari optical character recognition (OCR). Its sections cover the feature-extraction techniques as well as the training, classification and matching techniques useful for recognition. The paper addresses the most important results reported so far and highlights promising directions for research. It also contains a comprehensive bibliography of selected papers that appeared in reputed journals and conference proceedings, as an aid for researchers working on Devanagari OCR.

159 citations


Cites methods from "A complete OCR for printed Hindi te..."

  • ...In [29], the classification of printed Devanagari characters is done through five filters: 1) coverage of the region of the core...


Proceedings Article
25 Jul 2009
TL;DR: Effort has been concentrated on enabling generic multi-lingual operation such that negligible customization is required for a new language beyond providing a corpus of text.
Abstract: We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multi-lingual operation such that negligible customization is required for a new language beyond providing a corpus of text. Although change was required to various modules, including physical layout analysis, and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.
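The word and character error rates quoted above can be illustrated with a minimal sketch. The abstract does not say how the rates were computed, so this assumes the common edit-distance definition (insertions, deletions and substitutions divided by the reference length). The function names are hypothetical and are not part of Tesseract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Assumed definition: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Assumed definition: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under this definition, `character_error_rate("hindi", "hindu")` gives 0.2: one substitution over five reference characters.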

117 citations

Journal Article
01 Jan 2007
TL;DR: This paper addresses multi-font recognition of basic Bangla characters and proposes a novel feature extraction scheme based on the digital curvelet transform.
Abstract: This paper addresses the problem of Bangla basic character recognition. Multi-font Bangla character recognition has not been attempted previously. Twenty popular Bangla fonts have been used for the purpose of character recognition. A novel feature extraction scheme based on the digital curvelet transform is proposed. The curvelet transform, although heavily utilized in various areas of image processing, has not been used as the feature extraction scheme for character recognition. The curvelet coefficients of an original image as well as its morphologically altered versions are used to train separate k-nearest neighbor classifiers. The output values of these classifiers are fused using a simple majority voting scheme to arrive at a final decision.
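The fusion step described in this abstract (separate k-nearest neighbor classifiers whose outputs are combined by majority vote) might look roughly like the sketch below. It does not compute curvelet coefficients; the feature vectors are plain tuples standing in for them, and all function names, the Euclidean distance, and the tie-breaking rule are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label of `query` by majority among its k nearest training vectors.

    `train` is a list of (feature_vector, label) pairs. Euclidean distance
    is an assumption; the abstract does not name the metric.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def majority_vote(predictions):
    """Most frequent label; ties go to the label seen first (assumed rule)."""
    return Counter(predictions).most_common(1)[0][0]


def fused_predict(training_sets, queries, k=3):
    """One k-NN classifier per image version, fused by simple majority vote.

    training_sets[i] and queries[i] hold the features of the i-th version
    (for example the original image and its morphologically altered forms).
    """
    predictions = [
        knn_predict(train, query, k)
        for train, query in zip(training_sets, queries)
    ]
    return majority_vote(predictions)
```

Keeping one classifier per image version means a feature set that misleads one classifier can be outvoted by the others.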

93 citations


Cites background from "A complete OCR for printed Hindi te..."

  • ...There has been limited research on recognition of Oriya [4], Tamil [5], Devanagari [6] and Bengali [7]....


01 Jan 2010
TL;DR: A technique is proposed for OCR of printed Devnagari script in five different fonts and sizes using an Artificial Neural Network; the recognition rate on Devnagari document images is reported to be quite high.
Abstract: There are about 300 million people in India who speak Hindi and write Devnagari script. Research in Optical Character Recognition (OCR) is popular for its application potential in banks, post offices, defense organizations, library automation, etc. However, most available OCR systems target European texts. In this paper, we propose a technique for an OCR system for five different fonts and sizes of printed Devnagari script using an Artificial Neural Network. The recognition rate of the proposed OCR system on document images in Devnagari script has been found to be quite high.

71 citations

Journal Article
TL;DR: A review of OCR work on Indian scripts, mainly on Bangla and Devanagari—the two most popular scripts in India, and the various methodologies and their reported results are presented.
Abstract: The past few decades have witnessed intensive research on optical character recognition (OCR) for Roman, Chinese, and Japanese scripts. A lot of work has also been reported on OCR efforts for various Indian scripts, like Devanagari, Bangla, Oriya, Tamil, Telugu, Malayalam, Kannada, Gurmukhi, Gujarati, etc. In this paper, we present a review of OCR work on Indian scripts, mainly on Bangla and Devanagari, the two most popular scripts in India. We have summarized most of the published papers on this topic and have also analysed the various methodologies and their reported results. Future directions of research in OCR for Indian scripts are also given.

70 citations


Cites background from "A complete OCR for printed Hindi te..."

  • ...Bangla; Devanagari; Indian script; optical character recognition; survey on OCR....


References
Journal Article
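TL;DR: A review of character segmentation methods under four headings: classical dissection, which partitions the input image into subimages that are then classified; methods that avoid dissection and segment by classification; hybrid strategies that combine dissection with recombination rules; and holistic approaches that avoid segmentation by recognizing entire character strings as units.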
Abstract: Character segmentation has long been a critical area of the OCR process. The higher recognition rates for isolated characters vs. those obtained for words and connected character strings well illustrate this fact. A good part of recent progress in reading unconstrained printed and written text may be ascribed to more insightful handling of segmentation. This paper provides a review of these advances. The aim is to provide an appreciation for the range of techniques that have been developed, rather than to simply list sources. Segmentation methods are listed under four main headings. What may be termed the "classical" approach consists of methods that partition the input image into subimages, which are then classified. The operation of attempting to decompose the image into classifiable units is called "dissection." The second class of methods avoids dissection, and segments the image either explicitly, by classification of prespecified windows, or implicitly by classification of subsets of spatial features collected from the image as a whole. The third strategy is a hybrid of the first two, employing dissection together with recombination rules to define potential segments, but using classification to select from the range of admissible segmentation possibilities offered by these subimages. Finally, holistic approaches that avoid segmentation by recognizing entire character strings as units are described.

880 citations

Journal Article
TL;DR: A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented and extension of the work to Devnagari, the third most popular script in the world, is discussed.

381 citations

Journal Article
TL;DR: A method is presented for the machine recognition of constrained, hand printed Devanagari characters, where each stage of decision making narrows down the choice regarding the class membership of the input token.

158 citations

Journal Article
TL;DR: The selection of a set of moments that provide good discrimination between characters, the comparison of three classification schemes, the choice of a weighting vector that improves the classification performance, and a series of experiments to determine how the recognition rate is affected by the number of library feature vector sets are presented.
Abstract: An investigation of the use of two-dimensional moments as features for recognition has resulted in the development of a systematic method of character recognition. The method has been applied to six machine-printed fonts. Documents used to test the method contained 24 lines of alphanumeric characters. Before scanning a document to be processed, a training document having the same font must be scanned and stored in memory. Characters on the training document are isolated by contour tracing, and then the 2D moments of each character are computed and stored in a library of feature vectors. The document to be recognized is then scanned, and the 2D moments of its characters are compared with those in the library for classification. In this paper we present the selection of a set of moments that provide good discrimination between characters, the comparison of three classification schemes, the selection of a weighting vector that improves the classification performance, and a series of experiments to determine how the recognition rate is affected by the number of library feature vector sets. Recognition rates between 98.5% and 99.7% have been achieved for all fonts tested.
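The pipeline in this abstract (compute 2D moments per character, store them in a library of feature vectors, then compare a new character against the library) can be sketched as follows. The abstract does not state which moments, normalization, or distance were used, so the scale-normalized central moments up to order 3 and the weighted squared distance below are assumptions, and the function names are hypothetical.

```python
def raw_moment(img, p, q):
    """Raw moment m_pq of a binary image given as a list of pixel rows."""
    return sum(
        (x ** p) * (y ** q) * v
        for y, row in enumerate(img)
        for x, v in enumerate(row)
    )


def central_moments(img, max_order=3):
    """Scale-normalized central moments of order 2..max_order (assumed set)."""
    m00 = raw_moment(img, 0, 0)
    if m00 == 0:
        raise ValueError("image contains no foreground pixels")
    cx = raw_moment(img, 1, 0) / m00  # centroid x
    cy = raw_moment(img, 0, 1) / m00  # centroid y
    features = []
    for p in range(max_order + 1):
        for q in range(max_order + 1 - p):
            if p + q < 2:
                continue  # orders 0 and 1 carry no shape information here
            mu = sum(
                ((x - cx) ** p) * ((y - cy) ** q) * v
                for y, row in enumerate(img)
                for x, v in enumerate(row)
            )
            features.append(mu / m00 ** (1 + (p + q) / 2))
    return features


def classify(features, library, weights=None):
    """Label of the library entry closest in weighted squared distance.

    `library` is a list of (label, feature_vector) pairs built from a
    training document; `weights` plays the role of the weighting vector.
    """
    if weights is None:
        weights = [1.0] * len(features)

    def distance(vec):
        return sum(w * (a - b) ** 2 for w, a, b in zip(weights, features, vec))

    return min(library, key=lambda entry: distance(entry[1]))[0]
```

Because the moments are taken about the centroid, a character shifted within its bounding window produces the same feature vector, which is one reason moments are attractive as features.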

146 citations


"A complete OCR for printed Hindi te..." refers background in this paper

  • ...Two dimensional moments have been widely studied as a feature for character classification [6]....


Proceedings Article
01 Sep 2000
TL;DR: A system for recognition of machine printed Gurmukhi script operates at sub-character level; a recognition rate of 96.6% at a processing speed of 175 characters per second was achieved on clean images of text without employing any post-processing technique.
Abstract: A system for recognition of machine printed Gurmukhi script is presented. The recognition system presented operates at sub-character level. The segmentation process breaks a word into sub-characters and the recognition phase consists of classifying these sub-characters and combining them to form Gurmukhi characters. A set of very simple and easy-to-compute features is used and a hybrid classification scheme consisting of binary decision trees and nearest neighbours is employed. A recognition rate of 96.6% at a processing speed of 175 characters per second was achieved on clean images of text without employing any post-processing technique.

114 citations