Journal ArticleDOI

Segmentation of touching and fused Devanagari characters

01 Apr 2002-Pattern Recognition (Pergamon)-Vol. 35, Iss: 4, pp 875-893
TL;DR: A two-pass algorithm segments and decomposes Devanagari composite characters/symbols into their constituent symbols; a recognition rate of 85 percent is reported on the segmented conjuncts.
Abstract: Devanagari script is a two dimensional composition of symbols. It is highly cumbersome to treat each composite character as a separate atomic symbol because such combinations are very large in number. This paper presents a two pass algorithm for the segmentation and decomposition of Devanagari composite characters/symbols into their constituent symbols. The proposed algorithm extensively uses structural properties of the script. In the first pass, words are segmented into easily separable characters/composite characters. Statistical information about the height and width of each separated box is used to hypothesize whether a character box is composite. In the second pass, the hypothesized composite characters are further segmented. A recognition rate of 85 percent has been achieved on the segmented conjuncts. The algorithm is designed to segment a pair of touching characters.
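The first pass described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binarized word image with any connecting header line already removed, splits the word at empty columns of the vertical projection, and flags a box as possibly composite when it is much wider than the median box. The function names, the use of the median, and the `width_factor` threshold of 1.5 are all hypothetical choices made for illustration; the abstract does not specify the statistics actually used.

```python
from statistics import median

def column_projection(image):
    """Count ink pixels in each column of a binary image (list of rows of 0/1)."""
    return [sum(col) for col in zip(*image)]

def split_on_gaps(projection):
    """Return (start, end) column spans, end exclusive, separated by empty columns."""
    spans, start = [], None
    for x, count in enumerate(projection):
        if count > 0 and start is None:
            start = x                      # a new box begins at the first inked column
        elif count == 0 and start is not None:
            spans.append((start, x))       # an empty column closes the current box
            start = None
    if start is not None:
        spans.append((start, len(projection)))
    return spans

def flag_composites(spans, width_factor=1.5):
    """Flag boxes wider than width_factor * median width (assumed heuristic)."""
    widths = [end - start for start, end in spans]
    typical = median(widths)
    return [w > width_factor * typical for w in widths]

# Toy word image: two narrow boxes and one wide box, separated by empty columns.
image = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
]
spans = split_on_gaps(column_projection(image))
print(spans)                   # [(0, 2), (3, 5), (6, 11)]
print(flag_composites(spans))  # [False, False, True]
```

Only the flagged boxes would be passed on to a second pass for further segmentation; how that second pass locates the cut inside a touching pair is not described in the abstract and is not sketched here.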
Citations
Journal ArticleDOI
TL;DR: A review of the OCR work done on Indian language scripts and the scope of future work and further steps needed for Indian script OCR development is presented.
Abstract: Intensive research has been done on optical character recognition (OCR) and a large number of articles have been published on this topic during the last few decades. Many commercial OCR systems are now available in the market. But most of these systems work for Roman, Chinese, Japanese and Arabic characters. There is not a sufficient amount of work on Indian language character recognition, although there are 12 major scripts in India. In this paper, we present a review of the OCR work done on Indian language scripts. The review is organized into 5 sections. Sections 1 and 2 cover the introduction and properties of Indian scripts. In Section 3, we discuss different methodologies in OCR development as well as research work done on Indian script recognition. In Section 4, we discuss the scope of future work and further steps needed for Indian script OCR development. In Section 5 we conclude the paper.

592 citations

Journal ArticleDOI
01 Nov 2011
TL;DR: In this paper, the state of the art since the 1970s in machine-printed and handwritten Devanagari optical character recognition (OCR) is discussed.
Abstract: In India, more than 300 million people use Devanagari script for documentation. There has been a significant improvement in the research related to the recognition of printed as well as handwritten Devanagari text in the past few years. State of the art from 1970s of machine printed and handwritten Devanagari optical character recognition (OCR) is discussed in this paper. All feature-extraction techniques as well as training, classification and matching techniques useful for the recognition are discussed in various sections of the paper. An attempt is made to address the most important results reported so far and it is also tried to highlight the beneficial directions of the research till date. Moreover, the paper also contains a comprehensive bibliography of many selected papers appeared in reputed journals and conference proceedings as an aid for the researchers working in the field of Devanagari OCR.

159 citations


Cites background from "Segmentation of touching and fused ..."

  • ...The system described by Sinha and Mahabala [10] for printed Devanagari characters stores structural descriptions for each symbol of the script in terms of primitives and their relationships....

    [...]

  • ...A syntactic pattern analysis system for Devanagari script recognition is presented in Sinha’s Ph.D. thesis [9]....

    [...]

  • ...Bansal and Sinha [20] considered several statistical classifying features like horizontal zero crossings, moments, vertex points, and pixel density in different zones for Devanagari characters....

    [...]

  • ...Sinha [24] also demonstrated how the spatial association among the constituent symbols of Devanagari script plays an important role in understanding Devanagari words....

    [...]

  • ...Bansal and Sinha [18] presented a two-pass algorithm for the segmentation of machine-printed composite characters into their constituent symbols....

    [...]

Journal ArticleDOI
TL;DR: A review of OCR work on Indian scripts, mainly on Bangla and Devanagari—the two most popular scripts in India, and the various methodologies and their reported results are presented.
Abstract: The past few decades have witnessed an intensive research on optical character recognition (OCR) for Roman, Chinese, and Japanese scripts. A lot of work has been also reported on OCR efforts for various Indian scripts, like Devanagari, Bangla, Oriya, Tamil, Telugu, Malayalam, Kannada, Gurmukhi, Gujarati, etc. In this paper, we present a review of OCR work on Indian scripts, mainly on Bangla and Devanagari—the two most popular scripts in India. We have summarized most of the published papers on this topic and have also analysed the various methodologies and their reported results. Future directions of research in OCR for Indian scripts have been also given.

70 citations


Cites background from "Segmentation of touching and fused ..."

  • ...OCR systems have to segment the word into individual characters (Bansal & Sinha 2002; Chowdhury et al 2008; Ma & Doermann 2003; Pal & Datta 2003)....

    [...]

  • ...The comparison is done with respect to feature set, classifier, and reported accuracy rate....

    [...]

Journal ArticleDOI
TL;DR: This paper is the first survey that focuses on touched character segmentation; it provides segmentation rates and descriptions of the test data for the approaches discussed, and examines the main trends in the field of touched character segmentation.
Abstract: Character segmentation is a challenging problem in the field of optical character recognition. The presence of touched characters makes this dilemma more crucial. The goal of this paper is to provide major concepts and progress in the domain of off-line cursive touched character segmentation. Accordingly, two broad classes of techniques are identified. These include methods that perform explicit or implicit character segmentation. The basic methods used by each class of technique are presented and the contributions of individual algorithms within each class are discussed. It is the first survey that focuses on touched character segmentation and provides segmentation rates and descriptions of the test data for the approaches discussed. Finally, the main trends in the field of touched character segmentation are examined, important contributions are presented and future directions are also suggested.

69 citations


Cites methods from "Segmentation of touching and fused ..."

  • ...Such problems are investigated using recognition-based segmentation in Bansal and Sinha (2002)....

    [...]

Journal ArticleDOI
TL;DR: In this paper, robust algorithms for character segmentation and recognition are presented for multilingual Indian document images of Latin and Devanagari scripts, where primary segmentation paths are obtained using structural property of characters, whereas overlapped and joined characters are separated using graph distance theory.
Abstract: In this paper, robust algorithms for character segmentation and recognition are presented for multilingual Indian document images of Latin and Devanagari scripts. These documents generally suffer from their layout organizations, local skews, and low print quality and contain intermixed texts (machine-printed and handwritten). In the proposed character segmentation algorithm, primary segmentation paths are obtained using structural property of characters, whereas overlapped and joined characters are separated using graph distance theory. Finally, segmentation results are validated using highly accurate support vector machine classifier. For the proposed character recognition algorithm, three new geometrical shape-based features are computed. First and second features are formed with respect to the center pixel of character, whereas neighborhood information of text pixels is used for the calculation of third feature. For recognizing the input character, $k$-Nearest Neighbor classifier is used, as it has intrinsically zero training time. Comprehensive experiments are carried out on different databases containing printed as well as handwritten texts. Benchmarking results illustrate that proposed algorithms have better performances compared to other contemporary approaches, where highest segmentation and recognition rates of 98.86% and 99.84%, respectively, are obtained.

69 citations


Cites methods from "Segmentation of touching and fused ..."

  • ...Bansal and Sinha [9] worked on Devanagari characters segmentation using projections and statistical dimensional information of the characters....

    [...]

  • ...[9] V. Bansal and R. M. K. Sinha, ‘‘Segmentation of touching and fused Devanagari characters,’’ Pattern Recognit., vol. 35, no. 4, pp. 875–893, Apr. 2002....

    [...]

References
Journal ArticleDOI
TL;DR: A review of character segmentation methods under four headings, ranging from classical dissection, which partitions the input image into subimages that are then classified, to holistic approaches that avoid segmentation by recognizing entire character strings as units.
Abstract: Character segmentation has long been a critical area of the OCR process. The higher recognition rates for isolated characters vs. those obtained for words and connected character strings well illustrate this fact. A good part of recent progress in reading unconstrained printed and written text may be ascribed to more insightful handling of segmentation. This paper provides a review of these advances. The aim is to provide an appreciation for the range of techniques that have been developed, rather than to simply list sources. Segmentation methods are listed under four main headings. What may be termed the "classical" approach consists of methods that partition the input image into subimages, which are then classified. The operation of attempting to decompose the image into classifiable units is called "dissection." The second class of methods avoids dissection, and segments the image either explicitly, by classification of prespecified windows, or implicitly by classification of subsets of spatial features collected from the image as a whole. The third strategy is a hybrid of the first two, employing dissection together with recombination rules to define potential segments, but using classification to select from the range of admissible segmentation possibilities offered by these subimages. Finally, holistic approaches that avoid segmentation by recognizing entire character strings as units are described.

880 citations

Journal ArticleDOI
TL;DR: A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented and extension of the work to Devnagari, the third most popular script in the world, is discussed.
Abstract: A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented. This is the first OCR system among all script forms used in the Indian sub-continent. The problem is difficult because (i) there are about 300 basic, modified and compound character shapes in the script, (ii) the characters in a word are topologically connected and (iii) Bangla is an inflectional language. In our system the document image captured by Flat-bed scanner is subject to skew correction, text graphics separation, line segmentation, zone detection, word and character segmentation using some conventional and some newly developed techniques. From zonal information and shape characteristics, the basic, modified and compound characters are separated for the convenience of classification. The basic and modified characters, which are about 75 in number and which occupy about 96% of the text corpus, are recognized by a structural-feature-based tree classifier. The compound characters are recognized by a tree classifier followed by template-matching approach. The feature detection is simple and robust where preprocessing like thinning and pruning are avoided. The character unigram statistics is used to make the tree classifier efficient. Several heuristics are also used to speed up the template matching approach. A dictionary-based error-correction scheme has been used where separate dictionaries are compiled for root word and suffixes that contain morpho-syntactic information as well. For single font clear documents 95.50% word level (which is equivalent to 99.10% character level) recognition accuracy has been obtained. Extension of the work to Devnagari, the third most popular script in the world, is also discussed.

381 citations

Journal ArticleDOI
TL;DR: The current state of a system that recognizes printed text of various fonts and sizes for the Roman alphabet is described, which combines several techniques in order to improve the overall recognition rate.
Abstract: We describe the current state of a system that recognizes printed text of various fonts and sizes for the Roman alphabet. The system combines several techniques in order to improve the overall recognition rate. Thinning and shape extraction are performed directly on a graph of the run-length encoding of a binary image. The resulting strokes and other shapes are mapped, using a shape-clustering approach, into binary features which are then fed into a statistical Bayesian classifier. Large-scale trials have shown better than 97 percent top choice correct performance on mixtures of six dissimilar fonts, and over 99 percent on most single fonts, over a range of point sizes. Certain remaining confusion classes are disambiguated through contour analysis, and characters suspected of being merged are broken and reclassified. Finally, layout and linguistic context are applied. The results are illustrated by sample pages.

381 citations

Proceedings ArticleDOI
18 Aug 1997
TL;DR: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent, and shows a good performance for single font scripts printed on clear documents.
Abstract: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent. These scripts, having the same origin in ancient Brahmi script, have many features in common and hence a single system can be modeled to recognize them. In the proposed model, document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, character grouping into basic, modifier and compound character category are done for both scripts by the same set of algorithms. The feature sets and classification tree as well as the knowledge base required for error correction (such as lexicon) differ for Bangla and Devnagari. The system shows a good performance for single font scripts printed on clear documents.

198 citations


"Segmentation of touching and fused ..." refers background in this paper

  • ...hary et al. (12; 13; 14 ) , no attempt has been made so far in isolating the touching and fused...

    [...]

Book
01 Nov 1992
TL;DR: This is the first book to offer a broad selection of state-of-the-art research papers, including authoritative critical surveys of the literature, and parallel studies of the architecture of complete high-performance printed-document reading systems.
Abstract: Document image analysis is the automatic computer interpretation of images of printed and handwritten documents, including text, drawings, maps, music scores, etc. Research in this field supports a rapidly growing international industry. This is the first book to offer a broad selection of state-of-the-art research papers, including authoritative critical surveys of the literature, and parallel studies of the architecture of complete high-performance printed-document reading systems. A unique feature is the extended section on music notation, an ideal vehicle for international sharing of basic research. Also, the collection includes important new work on line drawings, handwriting, character and symbol recognition, and basic methodological issues. The IAPR 1990 Workshop on Syntactic and Structural Pattern Recognition is summarized, including the reports of its expert working groups, whose debates provide a fascinating perspective on the field. The book is an excellent text for a first-year graduate seminar in document image analysis, and is likely to remain a standard reference in the field for years.

185 citations