
Showing papers on "Devanagari" published in 2004


Journal ArticleDOI
TL;DR: A review of OCR work on Indian language scripts is presented, together with the scope of future work and the further steps needed for Indian script OCR development.

592 citations


Proceedings ArticleDOI
23 Jan 2004
TL;DR: A versatile platform is described that facilitates automatic segmentation of document images in multiple Indian languages and provides an interface for capturing the ground truth corresponding to the text. Text and annotations are rendered in the appropriate scripts as they are entered, giving users prompt and natural feedback.
Abstract: We present methodologies for three important tasks that will eventually enable digital access of multilingual Indian document images. First, we describe several document image analysis techniques necessary to prepare Devanagari document images for OCR. The second task is OCR for machine printed Devanagari words without the help of a lexicon. We describe the OCR methodology and show how it is being extended to other Indian languages. Finally, we describe a versatile platform that facilitates automatic segmentation of document images in multiple Indian languages and an interface to capture the ground truth corresponding to the text. We use transliterated English text and virtual keyboards in a range of Indian languages for this purpose. The multilingual data entry capabilities of the tool and its underlying UNICODE data representation within a structured XML document also allow users to annotate passages of text in one language in other languages using a markup scheme to switch between scripts. Text and annotations are rendered in the appropriate scripts as the text is being annotated, thus providing users prompt and natural feedback. The XML back-end allows meta-data to be recorded describing the annotated document.

16 citations


Proceedings Article
01 Jan 2004
TL;DR: A scheme is presented for transcoding document images for presentation on handheld devices such as PDAs and e-books, along with the use of document-model knowledge, represented in a standard ontology language, to generate document summaries.
Abstract: In this paper we present a scheme for transcoding document images for presentation on handheld devices such as PDAs and e-books. We propose techniques suitable, in particular, for images of documents in Indian languages having Devanagari-based scripts (viz. Hindi, Marathi, Bengali, Assamese, etc.). An appropriate compression scheme for the textual component of document images, exploiting script-specific characteristics, is suggested. We also explore the use of knowledge of the document model, represented through a standard ontology language, for generating document summaries. An experimental system has been developed to validate these schemes.

7 citations



01 Jan 2004
TL;DR: This article traces the gradual evolution of the Indo-Aryan languages and, more specifically, of the Gujarati language, which is spoken mainly in Gujarat, a state in western India, where it is a regional language officially recognized by the Constitution of India.
Abstract: This article traces the gradual evolution of the Indo-Aryan languages and, more specifically, of the Gujarati language. This language is spoken mainly in Gujarat, a state in western India, where it is a regional language officially recognized by the Constitution of India. It is written in the Gujarati script, which is very similar to Devanagari (the script used for Sanskrit) but without the continuous line at the top of the letters. The origin of the Gujarati language lies in Sanskrit, the oldest known form of the Indo-Aryan languages. The Indo-Aryan languages are a subset of the Indo-European language family (Chatterji 1978:476).

2 citations