Showing papers on "Devanagari" published in 2003


Journal Article
TL;DR: An adaptive Hindi OCR system was applied to a complete Hindi--English bilingual dictionary and a set of ideal images extracted from Hindi documents in PDF format and results show the recognition accuracy can reach 88% for noisy images and 95% for ideal images.
Abstract: We present an adaptive Hindi OCR implemented as part of a rapidly retargetable language tool effort. The system includes: script identification, character segmentation, training sample creation, and character recognition. In script identification, Hindi words are identified from bilingual or multilingual documents based on features of the Devanagari script or using Support Vector Machines. Identified words are then segmented into individual characters in the next step, where the composite characters are identified and further segmented based on the structural properties of the script and statistical information. Segmented characters are recognized using generalized Hausdorff image comparison (GHIC) and postprocessing is applied to improve the performance. The OCR system, which was designed and implemented in one month, was applied to a complete Hindi-English bilingual dictionary and a set of ideal images extracted from Hindi documents in PDF format. Experimental results show the recognition accuracy can reach 88% for noisy images and 95% for ideal images. The presented method can also be extended to design OCR systems for different scripts.

55 citations
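
The abstract above names generalized Hausdorff image comparison (GHIC) as the recognition step but does not give its formulation. As a rough illustration only, the sketch below matches a binary glyph against templates with a ranked (partial) Hausdorff distance, a common outlier-tolerant variant; it should not be read as the authors' GHIC. The function names, the `fraction` parameter, and the toy 3x3 glyphs are all assumptions made for this example.

```python
from math import hypot


def foreground_points(image):
    """Return (row, col) coordinates of the nonzero pixels in a 2D list."""
    return [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v]


def directed_hausdorff(a, b, fraction=1.0):
    """Ranked directed Hausdorff distance from point set a to point set b.

    fraction=1.0 gives the classical maximum of nearest-neighbour distances;
    a smaller fraction takes a lower-ranked distance, which tolerates a few
    stray (noisy) pixels. The ranking rule here is an illustrative assumption.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    nearest = sorted(
        min(hypot(p[0] - q[0], p[1] - q[1]) for q in b) for p in a
    )
    index = max(0, min(len(nearest) - 1, int(fraction * len(nearest)) - 1))
    return nearest[index]


def hausdorff(a, b, fraction=1.0):
    """Symmetric distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b, fraction),
               directed_hausdorff(b, a, fraction))


def classify(glyph, templates, fraction=1.0):
    """Return the label of the template closest to the glyph."""
    points = foreground_points(glyph)
    return min(
        templates,
        key=lambda label: hausdorff(
            points, foreground_points(templates[label]), fraction
        ),
    )


# Toy templates (hypothetical shapes, not Devanagari characters).
TEMPLATES = {
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}
NOISY_T = [[1, 1, 1], [0, 1, 0], [0, 1, 1]]  # "T" with one extra pixel

print(classify(NOISY_T, TEMPLATES))
```

This brute-force comparison is quadratic in the number of foreground pixels, which is acceptable for small character images; a practical system would more likely precompute a distance transform for each template.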


Proceedings Article
03 Aug 2003
TL;DR: This paper presents a top-down, projection-profile-based algorithm to separate text blocks from image blocks in a Devanagari document, and uses a distinctive feature of Devanagari text, called Shirorekha (Header Line), to analyze the pattern produced by Devanagari text in the horizontal profile.
Abstract: In this paper we present a top-down, projection-profile-based algorithm to separate text blocks from image blocks in a Devanagari document. We use a distinctive feature of Devanagari text, called Shirorekha (Header Line), to analyze the pattern produced by Devanagari text in the horizontal profile. The horizontal profile corresponding to a text block possesses certain regularity in frequency, orientation and shows spatial cohesion. The algorithm uses these features to identify text blocks in a document image containing both text and graphics.

52 citations


Proceedings Article
10 Mar 2003
TL;DR: A National Science Foundation sponsored project under the International Digital Libraries program is described to create data resources that will facilitate development of Devanagari OCR technology and provide a standardized test bed and evaluation tools for Devanagari script recognition.
Abstract: The Indian subcontinent has a large number of languages, dialects, and scripts with the Devanagari script being the primary and most widely used of all the scripts. To date, much of the Devanagari optical character recognition (OCR) research has been restricted to a handful of groups. So, techniques have not yet been widely disseminated or evaluated independently and automated evaluation tools are currently not available for lack of a standard representation of ground-truth and result data. A key reason for the absence of sustained research efforts in off-line Devanagari OCR appears to be the paucity of data resources. Ground truthed data for words and characters, on-line dictionaries, corpora of text documents and reliable, standardized statistical analyses and evaluation tools are currently lacking. So, the creation of such data resources will undoubtedly provide a much needed fillip to researchers working on Devanagari OCR. This paper describes a National Science Foundation sponsored project under the International Digital Libraries program to create data resources that will facilitate development of Devanagari OCR technology and provide a standardized test bed and evaluation tools for Devanagari script recognition.

26 citations