Proceedings ArticleDOI

An OCR system to read two Indian language scripts: Bangla and Devnagari (Hindi)

18 Aug 1997 - Vol. 2, pp 1011-1015
TL;DR: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent, and shows a good performance for single font scripts printed on clear documents.
Abstract: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent. These scripts, having the same origin in ancient Brahmi script, have many features in common and hence a single system can be modeled to recognize them. In the proposed model, document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, and character grouping into basic, modifier and compound character categories are done for both scripts by the same set of algorithms. The feature sets and classification tree as well as the knowledge base required for error correction (such as lexicon) differ for Bangla and Devnagari. The system shows a good performance for single font scripts printed on clear documents.
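Several of the shared pipeline stages (text line segmentation, zone separation) are classically done with projection profiles on the binarized page. The following is a minimal, hypothetical sketch of line segmentation by horizontal projection; it illustrates the general technique, not the authors' actual implementation.

```python
# Hypothetical sketch of text-line segmentation by horizontal projection,
# a standard step in pipelines like the shared Bangla/Devnagari one
# (not the authors' actual code).

def horizontal_profile(image):
    """Count black pixels (1s) in each row of a binary image."""
    return [sum(row) for row in image]

def segment_lines(image, threshold=0):
    """Return (start, end) row ranges whose profile exceeds threshold."""
    profile = horizontal_profile(image)
    lines, start = [], None
    for i, count in enumerate(profile):
        if count > threshold and start is None:
            start = i                      # a text line begins
        elif count <= threshold and start is not None:
            lines.append((start, i - 1))   # the line ends at a blank row
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines

page = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],   # text line 1
    [1, 0, 1, 1],
    [0, 0, 0, 0],   # blank gap between lines
    [0, 1, 1, 1],   # text line 2
    [0, 0, 0, 0],
]
print(segment_lines(page))  # → [(1, 2), (4, 4)]
```

The same idea applied within a line (finding the deepest profile rows) drives zone separation for these headline-bearing scripts.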
Citations
Journal ArticleDOI
TL;DR: A review of the OCR work done on Indian language scripts is presented, along with the scope of future work and the further steps needed for Indian-script OCR development.

592 citations

01 Jan 2014
TL;DR: Investigation of the phonological mean length of utterance in native Kannada-speaking children of 3 to 7 years of age revealed an increase in PMLU score as age increased, suggesting a developmental trend in PMLU acquisition.
Abstract: Phonological mean length of utterance (PMLU) is a whole word measure for measuring phonological proficiency. It measures the length of a child’s word and the number of correct consonants. The present study investigated the phonological mean length of utterance in native Kannada-speaking children of 3 to 7 years of age. A total of 400 subjects in the age range of 3-7 years participated in the study. Spontaneous speech samples were elicited from each child and analyzed for PMLU as per the rules suggested by Ingram. The Mann-Whitney U test and Kruskal-Wallis test were employed to compare the differences between the means of PMLU scores across gender and age, respectively. The results revealed an increase in PMLU score as age increased, suggesting a developmental trend in PMLU acquisition. No statistically significant differences were observed between the means of PMLU scores across gender.
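Under Ingram's rules, each segment in the child's production scores one point and each consonant matching the adult target scores one additional point; the PMLU is the mean of these word scores over the sample. A simplified, hypothetical sketch (phonemes given as lists, consonants matched by position):

```python
# Hypothetical sketch of Ingram's PMLU scoring: 1 point per segment in
# the child's production, plus 1 point per consonant that matches the
# adult target. Simplified: consonants are compared position by position.

VOWELS = set("aeiou")

def pmlu_word(child, target):
    """Score one word: segment count + correctly produced consonants."""
    score = len(child)
    for c, t in zip(child, target):
        if c == t and t not in VOWELS:
            score += 1                 # correct consonant earns a bonus
    return score

def pmlu(sample):
    """Mean PMLU over (child, target) word pairs in a speech sample."""
    return sum(pmlu_word(c, t) for c, t in sample) / len(sample)

sample = [
    (list("dag"), list("dag")),   # 3 segments + 2 correct consonants = 5
    (list("ta"),  list("ka")),    # 2 segments + 0 correct consonants = 2
]
print(pmlu(sample))  # → 3.5
```

Real PMLU analysis works on phonetic transcriptions with alignment rules; this sketch only shows the arithmetic of the measure.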

230 citations

Journal ArticleDOI
01 Nov 2011
TL;DR: In this paper, the state of the art of machine-printed and handwritten Devanagari optical character recognition (OCR), from the 1970s onward, is discussed.
Abstract: In India, more than 300 million people use the Devanagari script for documentation. There has been a significant improvement in the research related to the recognition of printed as well as handwritten Devanagari text in the past few years. The state of the art of machine-printed and handwritten Devanagari optical character recognition (OCR), from the 1970s onward, is discussed in this paper. All feature-extraction techniques as well as training, classification and matching techniques useful for the recognition are discussed in various sections of the paper. An attempt is made to address the most important results reported so far and to highlight the beneficial directions of the research till date. Moreover, the paper also contains a comprehensive bibliography of many selected papers that appeared in reputed journals and conference proceedings, as an aid for researchers working in the field of Devanagari OCR.

159 citations


Cites background or methods from "An OCR system to read two Indian la..."

  • ...Another OCR system development of printed Devanagari is by Palit and Chaudhuri [11] as well as Pal and Chaudhuri [12]....

  • ...U. Pal is with Indian Statistical Institute, Kolkata 700108, India (e-mail: umapada@isical.ac.in)....

  • ...Even for recognizing Devanagari handwritten characters, the method proposed by Pal et al. [35] has the highest accuracy, as shown in Table IV....

  • ...A modified quadratic classifier is applied by Pal et al. [51] on the features of handwritten characters for recognition....

  • ...Pal and Chaudhuri [12] and [99] also proposed a suffix- and prefix-based error correction technique, which can take care of different inflectional languages....

Journal ArticleDOI
TL;DR: A two-pass algorithm is presented for the segmentation and decomposition of Devanagari composite characters/symbols into their constituent symbols, and a recognition rate is reported on the segmented conjuncts.

143 citations


Cites background from "An OCR system to read two Indian la..."

  • ...hary et al. (12; 13; 14), no attempt has been made so far in isolating the touching and fused...

Proceedings ArticleDOI
20 Sep 1999
TL;DR: In this paper, an automatic technique for separating text lines using script characteristics and shape-based features is presented; it has an overall accuracy of about 98.5%.
Abstract: In a multi-lingual country like India, a document page may contain more than one script form. Under the three-language formula, the document may be printed in English, Devnagari and one of the other official Indian languages. For OCR of such a document page, it is necessary to separate these three script forms before feeding them to the OCRs of the individual scripts. In this paper, an automatic technique for separating the text lines using script characteristics and shape-based features is presented. At present, the system has an overall accuracy of about 98.5%.
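One well-known shape cue separating these scripts is the headline (shirorekha/matra): Devnagari and Bangla text lines contain a nearly solid horizontal black run near the top of the line, while Roman script does not. The sketch below illustrates this cue only; the thresholds and names are hypothetical, and the paper's actual feature set is richer.

```python
# Hypothetical illustration of one shape-based cue for script separation:
# Devnagari/Bangla text lines have a headline (a row near the top with a
# very high fraction of black pixels); Roman lines do not. A sketch of
# the idea, not the paper's actual algorithm.

def has_headline(line_image, min_fill=0.8):
    """True if some row in the upper half is at least min_fill black."""
    width = len(line_image[0])
    upper = line_image[: max(1, len(line_image) // 2)]
    return any(sum(row) / width >= min_fill for row in upper)

def classify_line(line_image):
    return "Devnagari/Bangla" if has_headline(line_image) else "Roman"

devnagari_like = [
    [1, 1, 1, 1, 1, 1],   # solid headline row
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
]
roman_like = [
    [0, 1, 0, 0, 1, 0],   # no row comes close to fully black
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
print(classify_line(devnagari_like))  # → Devnagari/Bangla
print(classify_line(roman_like))      # → Roman
```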

134 citations


Cites background from "An OCR system to read two Indian la..."

  • ...A bilingual (Bangla and Devnagari) OCR system is already working in our lab[2]....

References
Journal ArticleDOI
TL;DR: Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. This article surveys documented findings on spelling error patterns.
Abstract: Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. In response to the first problem, efficient pattern-matching and n-gram analysis techniques have been developed for detecting strings that do not appear in a given word list. In response to the second problem, a variety of general and application-specific spelling correction techniques have been developed. Some of them were based on detailed studies of spelling error patterns. In response to the third problem, a few experiments using natural-language-processing tools or statistical-language models have been carried out. This article surveys documented findings on spelling error patterns, provides descriptions of various nonword detection and isolated-word error correction techniques, reviews the state of the art of context-dependent word correction techniques, and discusses research issues related to all three areas of automatic error correction in text.
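The first two problem classes combine naturally: a string absent from the word list is flagged as a nonword, then replaced by the nearest lexicon entry under an edit-distance measure. A minimal, hypothetical sketch using the classic Levenshtein distance (not any specific system from the survey):

```python
# Hypothetical sketch of nonword detection + isolated-word error
# correction: flag strings absent from the lexicon, then rank lexicon
# entries by Levenshtein edit distance.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, lexicon):
    """Return word if it is in the lexicon, else the nearest entry."""
    if word in lexicon:
        return word                    # not a nonword; nothing to do
    return min(lexicon, key=lambda w: edit_distance(word, w))

lexicon = {"script", "scribe", "strip"}
print(correct("scrpt", lexicon))  # → script
```

Real correctors weight edits by confusion probabilities and use context; this sketch shows only the core distance-based ranking.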

1,417 citations

Journal ArticleDOI
01 Jul 1992
TL;DR: Both template matching and structure analysis approaches to R&D are considered and it is noted that the two approaches are coming closer and tending to merge.
Abstract: Research and development of OCR systems are considered from a historical point of view. The historical development of commercial systems is included. Both template matching and structure analysis approaches to R&D are considered. It is noted that the two approaches are coming closer and tending to merge. Commercial products are divided into three generations, for each of which some representative OCR systems are chosen and described in some detail. Some comments are made on recent techniques applied to OCR, such as expert systems and neural networks, and some open problems are indicated. The authors' views and hopes regarding future trends are presented.

892 citations

Journal ArticleDOI
TL;DR: The current state of a system that recognizes printed text of various fonts and sizes for the Roman alphabet is described, which combines several techniques in order to improve the overall recognition rate.
Abstract: We describe the current state of a system that recognizes printed text of various fonts and sizes for the Roman alphabet. The system combines several techniques in order to improve the overall recognition rate. Thinning and shape extraction are performed directly on a graph of the run-length encoding of a binary image. The resulting strokes and other shapes are mapped, using a shape-clustering approach, into binary features which are then fed into a statistical Bayesian classifier. Large-scale trials have shown better than 97 percent top choice correct performance on mixtures of six dissimilar fonts, and over 99 percent on most single fonts, over a range of point sizes. Certain remaining confusion classes are disambiguated through contour analysis, and characters suspected of being merged are broken and reclassified. Finally, layout and linguistic context are applied. The results are illustrated by sample pages.
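The system's thinning and shape extraction operate on a graph built from the run-length encoding of the binary image. A minimal, hypothetical sketch of that encoding step (each row becomes a list of black runs; the graph construction on top of it is not shown):

```python
# Hypothetical sketch of the run-length encoding underlying the system's
# thinning/shape-extraction graph: each binary row is reduced to a list
# of (start_column, length) black runs.

def encode_row(row):
    """Encode one binary row as (start, length) runs of 1s."""
    runs, start = [], None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x                          # a black run begins
        elif not pixel and start is not None:
            runs.append((start, x - start))    # the run just ended
            start = None
    if start is not None:
        runs.append((start, len(row) - start)) # run touches the edge
    return runs

def encode_image(image):
    return [encode_row(row) for row in image]

image = [
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
]
print(encode_image(image))  # → [[(1, 2), (4, 1)], [(0, 2)]]
```

Linking runs that overlap between adjacent rows yields the stroke graph on which thinning can then be performed.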

381 citations

Journal ArticleDOI
TL;DR: It is shown that a digital arc S is the digitization of a straight line segment if and only if it has the "chord property:" the line segment joining any two points of S lies everywhere within distance 1 of S.
Abstract: It is shown that a digital arc S is the digitization of a straight line segment if and only if it has the "chord property:" the line segment joining any two points of S lies everywhere within distance 1 of S. This result is used to derive several regularity properties of digitizations of straight line segments.
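The chord property can be checked numerically: for every pair of points p, q in the digital arc S, each point of the segment pq must lie within distance 1 of some point of S. A hypothetical brute-force sketch (sampling points along each chord rather than testing them exactly):

```python
# Hypothetical numeric check of the chord property: for every pair of
# points p, q in a digital arc S, sampled points of segment pq must lie
# within distance 1 of some point of S.

from itertools import combinations
import math

def point_to_set_distance(p, S):
    """Distance from point p to the nearest point of the set S."""
    return min(math.dist(p, s) for s in S)

def has_chord_property(S, samples=50):
    for p, q in combinations(S, 2):
        for k in range(samples + 1):
            t = k / samples            # sample the chord from p to q
            point = (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))
            if point_to_set_distance(point, S) >= 1:
                return False
    return True

# A digitization of a straight segment passes; an L-shaped arc fails
# (the chord cuts across the corner, far from the arc's points).
straight = [(0, 0), (1, 0), (2, 1), (3, 1)]
bent = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
print(has_chord_property(straight))  # → True
print(has_chord_property(bent))      # → False
```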

335 citations

Book
01 Jan 1995
TL;DR: In this article, a class of techniques based on smeared run-length codes that divide a page into gray and nearly white parts is described; segmentation is then performed by finding connected components either of the gray elements or of the white, the latter forming white streams that partition a page into blocks of printed material.
Abstract: Page segmentation is the process by which a scanned page is divided into columns and blocks which are then classified as halftones, graphics, or text. Past techniques have used the fact that such parts form right rectangles for most printed material. This property does not hold when the page is tilted, and the heuristics based on it fail in such cases unless a rather expensive tilt-angle estimation is performed. We describe a class of techniques based on smeared run-length codes that divide a page into gray and nearly white parts. Segmentation is then performed by finding connected components either of the gray elements or of the white, the latter forming white streams that partition a page into blocks of printed material. Such techniques appear quite robust in the presence of severe tilt (even greater than 10°) and are also quite fast (about a second per page on a SPARCstation for gray element aggregation). Further classification into text or halftones is based mostly on properties of the across-scanlines correlation. For text, the correlation of adjacent scanlines tends to be quite high, but then it drops rapidly. For halftones, the correlation of adjacent scanlines is usually well below that for text, but it does not change much with distance.
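The smearing step is the heart of such techniques: white runs shorter than a threshold are flipped to black, so nearby characters merge into solid gray blocks whose connected components can then be extracted. A minimal one-dimensional, hypothetical sketch of that step (thresholds and names are illustrative, not the paper's):

```python
# Hypothetical one-dimensional sketch of run-length smearing: 0-runs
# shorter than a threshold that sit between 1s are filled in, merging
# nearby marks into a single solid block.

def smear_row(row, threshold):
    """Fill short white gaps (runs of 0s between 1s) in a binary row."""
    out = row[:]
    run_start = None
    for x, pixel in enumerate(row):
        if pixel == 0 and run_start is None:
            run_start = x                      # a white gap begins
        elif pixel == 1 and run_start is not None:
            if run_start > 0 and (x - run_start) < threshold:
                for i in range(run_start, x):
                    out[i] = 1                 # smear the short gap
            run_start = None                   # long gaps stay white
    return out

row = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
print(smear_row(row, 4))  # → [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
```

In two dimensions the same operation is applied along rows and columns and the results are combined before connected-component labeling.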

249 citations