Proceedings ArticleDOI

An OCR system to read two Indian language scripts: Bangla and Devnagari (Hindi)

18 Aug 1997 - Vol. 2, pp 1011-1015
TL;DR: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent, and shows a good performance for single font scripts printed on clear documents.
Abstract: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent. These scripts, having the same origin in ancient Brahmi script, have many features in common and hence a single system can be modeled to recognize them. In the proposed model, document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, and character grouping into basic, modifier and compound character categories are done for both scripts by the same set of algorithms. The feature sets and classification tree as well as the knowledge base required for error correction (such as lexicon) differ for Bangla and Devnagari. The system shows a good performance for single font scripts printed on clear documents.
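Several of the shared pipeline stages (text line segmentation, zone separation) are classically done with projection profiles on the binarized page. The following is a minimal, hypothetical sketch of line segmentation by horizontal projection; it illustrates the general technique, not the authors' actual implementation.

```python
# Hypothetical sketch of text-line segmentation by horizontal projection,
# a standard step in pipelines like the shared Bangla/Devnagari one
# (not the authors' actual code).

def horizontal_profile(image):
    """Count black pixels (1s) in each row of a binary image."""
    return [sum(row) for row in image]

def segment_lines(image, threshold=0):
    """Return (start, end) row ranges whose profile exceeds threshold."""
    profile = horizontal_profile(image)
    lines, start = [], None
    for i, count in enumerate(profile):
        if count > threshold and start is None:
            start = i                      # a text line begins
        elif count <= threshold and start is not None:
            lines.append((start, i - 1))   # the line ends at a blank row
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines

page = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],   # text line 1
    [1, 0, 1, 1],
    [0, 0, 0, 0],   # blank gap between lines
    [0, 1, 1, 1],   # text line 2
    [0, 0, 0, 0],
]
print(segment_lines(page))  # → [(1, 2), (4, 4)]
```

The same idea applied within a line (finding the deepest profile rows) drives zone separation for these headline-bearing scripts.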
Citations
Journal ArticleDOI
TL;DR: A review of the OCR work done on Indian language scripts is presented, along with the scope of future work and the further steps needed for Indian-script OCR development.

592 citations

01 Jan 2014
TL;DR: Investigation of the phonological mean length of utterance in native Kannada-speaking children of 3 to 7 years of age revealed an increase in PMLU score as age increased, suggesting a developmental trend in PMLU acquisition.
Abstract: Phonological mean length of utterance (PMLU) is a whole word measure for measuring phonological proficiency. It measures the length of a child’s word and the number of correct consonants. The present study investigated the phonological mean length of utterance in native Kannada-speaking children of 3 to 7 years of age. A total of 400 subjects in the age range of 3-7 years participated in the study. Spontaneous speech samples were elicited from each child and analyzed for PMLU as per the rules suggested by Ingram. The Mann-Whitney U test and Kruskal-Wallis test were employed to compare the differences between the means of PMLU scores across gender and age, respectively. The results revealed an increase in PMLU score as age increased, suggesting a developmental trend in PMLU acquisition. No statistically significant differences were observed between the means of PMLU scores across gender.
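Under Ingram's rules, each segment in the child's production scores one point and each consonant matching the adult target scores one additional point; the PMLU is the mean of these word scores over the sample. A simplified, hypothetical sketch (phonemes given as lists, consonants matched by position):

```python
# Hypothetical sketch of Ingram's PMLU scoring: 1 point per segment in
# the child's production, plus 1 point per consonant that matches the
# adult target. Simplified: consonants are compared position by position.

VOWELS = set("aeiou")

def pmlu_word(child, target):
    """Score one word: segment count + correctly produced consonants."""
    score = len(child)
    for c, t in zip(child, target):
        if c == t and t not in VOWELS:
            score += 1                 # correct consonant earns a bonus
    return score

def pmlu(sample):
    """Mean PMLU over (child, target) word pairs in a speech sample."""
    return sum(pmlu_word(c, t) for c, t in sample) / len(sample)

sample = [
    (list("dag"), list("dag")),   # 3 segments + 2 correct consonants = 5
    (list("ta"),  list("ka")),    # 2 segments + 0 correct consonants = 2
]
print(pmlu(sample))  # → 3.5
```

Real PMLU analysis works on phonetic transcriptions with alignment rules; this sketch only shows the arithmetic of the measure.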

230 citations

Journal ArticleDOI
01 Nov 2011
TL;DR: In this paper, the state of the art of machine-printed and handwritten Devanagari optical character recognition (OCR), from the 1970s onward, is discussed.
Abstract: In India, more than 300 million people use the Devanagari script for documentation. There has been a significant improvement in the research related to the recognition of printed as well as handwritten Devanagari text in the past few years. The state of the art of machine-printed and handwritten Devanagari optical character recognition (OCR), from the 1970s onward, is discussed in this paper. All feature-extraction techniques as well as training, classification and matching techniques useful for the recognition are discussed in various sections of the paper. An attempt is made to address the most important results reported so far and to highlight the beneficial directions of the research till date. Moreover, the paper also contains a comprehensive bibliography of many selected papers that appeared in reputed journals and conference proceedings, as an aid for researchers working in the field of Devanagari OCR.

159 citations


Cites background or methods from "An OCR system to read two Indian la..."

  • ...Another OCR system development of printed Devanagari is by Palit and Chaudhuri [11] as well as Pal and Chaudhuri [12]....

  • ...U. Pal is with Indian Statistical Institute, Kolkata 700108, India (e-mail: umapada@isical.ac.in)....

  • ...Even for recognizing Devanagari handwritten characters, the method proposed by Pal et al. [35] has the highest accuracy, as shown in Table IV....

  • ...A modified quadratic classifier is applied by Pal et al. [51] on the features of handwritten characters for recognition....

  • ...Pal and Chaudhuri [12] and [99] also proposed a suffix- and prefix-based error correction technique, which can take care of different inflectional languages....

Journal ArticleDOI
TL;DR: A two-pass algorithm is presented for the segmentation and decomposition of Devanagari composite characters/symbols into their constituent symbols, and a recognition rate is reported on the segmented conjuncts.

143 citations


Cites background from "An OCR system to read two Indian la..."

  • ...hary et al. (12; 13; 14), no attempt has been made so far in isolating the touching and fused...

Proceedings ArticleDOI
20 Sep 1999
TL;DR: In this paper, an automatic technique for separating text lines using script characteristics and shape-based features is presented; it has an overall accuracy of about 98.5%.
Abstract: In a multi-lingual country like India, a document page may contain more than one script form. Under the three-language formula, the document may be printed in English, Devnagari and one of the other official Indian languages. For OCR of such a document page, it is necessary to separate these three script forms before feeding them to the OCRs of the individual scripts. In this paper, an automatic technique for separating the text lines using script characteristics and shape-based features is presented. At present, the system has an overall accuracy of about 98.5%.
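One well-known shape cue separating these scripts is the headline (shirorekha/matra): Devnagari and Bangla text lines contain a nearly solid horizontal black run near the top of the line, while Roman script does not. The sketch below illustrates this cue only; the thresholds and names are hypothetical, and the paper's actual feature set is richer.

```python
# Hypothetical illustration of one shape-based cue for script separation:
# Devnagari/Bangla text lines have a headline (a row near the top with a
# very high fraction of black pixels); Roman lines do not. A sketch of
# the idea, not the paper's actual algorithm.

def has_headline(line_image, min_fill=0.8):
    """True if some row in the upper half is at least min_fill black."""
    width = len(line_image[0])
    upper = line_image[: max(1, len(line_image) // 2)]
    return any(sum(row) / width >= min_fill for row in upper)

def classify_line(line_image):
    return "Devnagari/Bangla" if has_headline(line_image) else "Roman"

devnagari_like = [
    [1, 1, 1, 1, 1, 1],   # solid headline row
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
]
roman_like = [
    [0, 1, 0, 0, 1, 0],   # no row comes close to fully black
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
print(classify_line(devnagari_like))  # → Devnagari/Bangla
print(classify_line(roman_like))      # → Roman
```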

134 citations


Cites background from "An OCR system to read two Indian la..."

  • ...A bilingual (Bangla and Devnagari) OCR system is already working in our lab[2]....

References
Journal ArticleDOI
TL;DR: Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. This article surveys documented findings on spelling error patterns.
Abstract: Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. In response to the first problem, efficient pattern-matching and n-gram analysis techniques have been developed for detecting strings that do not appear in a given word list. In response to the second problem, a variety of general and application-specific spelling correction techniques have been developed. Some of them were based on detailed studies of spelling error patterns. In response to the third problem, a few experiments using natural-language-processing tools or statistical-language models have been carried out. This article surveys documented findings on spelling error patterns, provides descriptions of various nonword detection and isolated-word error correction techniques, reviews the state of the art of context-dependent word correction techniques, and discusses research issues related to all three areas of automatic error correction in text.
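The first two problem classes combine naturally: a string absent from the word list is flagged as a nonword, then replaced by the nearest lexicon entry under an edit-distance measure. A minimal, hypothetical sketch using the classic Levenshtein distance (not any specific system from the survey):

```python
# Hypothetical sketch of nonword detection + isolated-word error
# correction: flag strings absent from the lexicon, then rank lexicon
# entries by Levenshtein edit distance.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, lexicon):
    """Return word if it is in the lexicon, else the nearest entry."""
    if word in lexicon:
        return word                    # not a nonword; nothing to do
    return min(lexicon, key=lambda w: edit_distance(word, w))

lexicon = {"script", "scribe", "strip"}
print(correct("scrpt", lexicon))  # → script
```

Real correctors weight edits by confusion probabilities and use context; this sketch shows only the core distance-based ranking.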

1,417 citations

Journal ArticleDOI
01 Jul 1992
TL;DR: Both template matching and structure analysis approaches to R&D are considered and it is noted that the two approaches are coming closer and tending to merge.
Abstract: Research and development of OCR systems are considered from a historical point of view. The historical development of commercial systems is included. Both template matching and structure analysis approaches to R&D are considered. It is noted that the two approaches are coming closer and tending to merge. Commercial products are divided into three generations, for each of which some representative OCR systems are chosen and described in some detail. Some comments are made on recent techniques applied to OCR, such as expert systems and neural networks, and some open problems are indicated. The authors' views and hopes regarding future trends are presented.

892 citations

Journal ArticleDOI
TL;DR: The current state of a system that recognizes printed text of various fonts and sizes for the Roman alphabet is described, which combines several techniques in order to improve the overall recognition rate.
Abstract: We describe the current state of a system that recognizes printed text of various fonts and sizes for the Roman alphabet. The system combines several techniques in order to improve the overall recognition rate. Thinning and shape extraction are performed directly on a graph of the run-length encoding of a binary image. The resulting strokes and other shapes are mapped, using a shape-clustering approach, into binary features which are then fed into a statistical Bayesian classifier. Large-scale trials have shown better than 97 percent top choice correct performance on mixtures of six dissimilar fonts, and over 99 percent on most single fonts, over a range of point sizes. Certain remaining confusion classes are disambiguated through contour analysis, and characters suspected of being merged are broken and reclassified. Finally, layout and linguistic context are applied. The results are illustrated by sample pages.
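The system's thinning and shape extraction operate on a graph built from the run-length encoding of the binary image. A minimal, hypothetical sketch of that encoding step (each row becomes a list of black runs; the graph construction on top of it is not shown):

```python
# Hypothetical sketch of the run-length encoding underlying the system's
# thinning/shape-extraction graph: each binary row is reduced to a list
# of (start_column, length) black runs.

def encode_row(row):
    """Encode one binary row as (start, length) runs of 1s."""
    runs, start = [], None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x                          # a black run begins
        elif not pixel and start is not None:
            runs.append((start, x - start))    # the run just ended
            start = None
    if start is not None:
        runs.append((start, len(row) - start)) # run touches the edge
    return runs

def encode_image(image):
    return [encode_row(row) for row in image]

image = [
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
]
print(encode_image(image))  # → [[(1, 2), (4, 1)], [(0, 2)]]
```

Linking runs that overlap between adjacent rows yields the stroke graph on which thinning can then be performed.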

381 citations

Journal ArticleDOI
TL;DR: It is shown that a digital arc S is the digitization of a straight line segment if and only if it has the "chord property:" the line segment joining any two points of S lies everywhere within distance 1 of S.
Abstract: It is shown that a digital arc S is the digitization of a straight line segment if and only if it has the "chord property:" the line segment joining any two points of S lies everywhere within distance 1 of S. This result is used to derive several regularity properties of digitizations of straight line segments.
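The chord property can be checked numerically: for every pair of points p, q in the digital arc S, each point of the segment pq must lie within distance 1 of some point of S. A hypothetical brute-force sketch (sampling points along each chord rather than testing them exactly):

```python
# Hypothetical numeric check of the chord property: for every pair of
# points p, q in a digital arc S, sampled points of segment pq must lie
# within distance 1 of some point of S.

from itertools import combinations
import math

def point_to_set_distance(p, S):
    """Distance from point p to the nearest point of the set S."""
    return min(math.dist(p, s) for s in S)

def has_chord_property(S, samples=50):
    for p, q in combinations(S, 2):
        for k in range(samples + 1):
            t = k / samples            # sample the chord from p to q
            point = (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))
            if point_to_set_distance(point, S) >= 1:
                return False
    return True

# A digitization of a straight segment passes; an L-shaped arc fails
# (the chord cuts across the corner, far from the arc's points).
straight = [(0, 0), (1, 0), (2, 1), (3, 1)]
bent = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
print(has_chord_property(straight))  # → True
print(has_chord_property(bent))      # → False
```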

335 citations

Book
01 Jan 1995
TL;DR: In this article, a class of techniques based on smeared run-length codes that divide a page into gray and nearly white parts is described; segmentation is then performed by finding connected components either of the gray elements or of the white, the latter forming white streams that partition a page into blocks of printed material.
Abstract: Page segmentation is the process by which a scanned page is divided into columns and blocks which are then classified as halftones, graphics, or text. Past techniques have used the fact that such parts form right rectangles for most printed material. This property does not hold when the page is tilted, and the heuristics based on it fail in such cases unless a rather expensive tilt-angle estimation is performed. We describe a class of techniques based on smeared run-length codes that divide a page into gray and nearly white parts. Segmentation is then performed by finding connected components either of the gray elements or of the white, the latter forming white streams that partition a page into blocks of printed material. Such techniques appear quite robust in the presence of severe tilt (even greater than 10°) and are also quite fast (about a second per page on a SPARCstation for gray element aggregation). Further classification into text or halftones is based mostly on properties of the across-scanlines correlation. For text, the correlation of adjacent scanlines tends to be quite high, but then it drops rapidly. For halftones, the correlation of adjacent scanlines is usually well below that for text, but it does not change much with distance.
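The smearing step is the heart of such techniques: white runs shorter than a threshold are flipped to black, so nearby characters merge into solid gray blocks whose connected components can then be extracted. A minimal one-dimensional, hypothetical sketch of that step (thresholds and names are illustrative, not the paper's):

```python
# Hypothetical one-dimensional sketch of run-length smearing: 0-runs
# shorter than a threshold that sit between 1s are filled in, merging
# nearby marks into a single solid block.

def smear_row(row, threshold):
    """Fill short white gaps (runs of 0s between 1s) in a binary row."""
    out = row[:]
    run_start = None
    for x, pixel in enumerate(row):
        if pixel == 0 and run_start is None:
            run_start = x                      # a white gap begins
        elif pixel == 1 and run_start is not None:
            if run_start > 0 and (x - run_start) < threshold:
                for i in range(run_start, x):
                    out[i] = 1                 # smear the short gap
            run_start = None                   # long gaps stay white
    return out

row = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
print(smear_row(row, 4))  # → [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
```

In two dimensions the same operation is applied along rows and columns and the results are combined before connected-component labeling.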

249 citations