Showing papers by "Gaurav Harit" published in 2009


Proceedings Article
25 Jul 2009
TL;DR: This paper describes how a new XML-based tagging scheme was used to manage a collaborative project developing OCR for 11 scripts of Indian origin for which mature OCR technology was not available.
Abstract: This paper presents an XML-based scheme for managing a large multilingual OCR project. In particular, we describe how a new XML-based tagging scheme has been exploited to achieve the objectives of the project. Managing a large multilingual OCR project in which multiple research groups collaboratively develop script-specific and script-independent technologies is a challenging problem. We present some of the software and data management strategies designed for the project, which aimed to develop OCR for 11 scripts of Indian origin for which mature OCR technology was not available.

3 citations


Book Chapter
01 Jan 2009
TL;DR: An interactive access scheme for Indian language document collections is presented, based on word-image search and retrieval; the compression and retrieval paradigm is applicable even to Indian scripts for which reliable OCR technology is not available.
Abstract: Indexing and retrieval of Indian language documents is an important problem. We present an interactive access scheme for Indian language document collections using techniques for word-image-based search. The compression and retrieval paradigm we propose is applicable even to those Indian scripts for which reliable OCR technology is not available. Our technique for word spotting exploits the geometrical features of the word image. These features are represented as a graph called the geometric feature graph (GFG). The GFG is encoded as a string, which serves as a compressed representation of the word image skeleton. We have also augmented GFG-based word image spotting with latent semantic analysis for more effective retrieval. The query is specified as a set of word images, and the documents that best match the query representation in the latent semantic space are retrieved. The retrieval paradigm is further raised to the conceptual level by using document image content-domain knowledge specified in the form of an ontology.

2 citations
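
The abstract above says that each word image is reduced to a string (the encoded GFG) but does not give the encoding or the matching function. As a rough sketch only, assuming the strings already exist, candidate words could be ranked by Levenshtein edit distance. The distance measure, the `spot_words` helper, and the symbol strings below are all assumptions made for illustration, not the authors' method.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def spot_words(query, index, max_dist=2):
    """Return (word_id, distance) pairs within max_dist of query, closest first.

    `index` maps a word-image id to its encoded string (hypothetical format).
    """
    hits = [(word_id, edit_distance(query, code)) for word_id, code in index.items()]
    return sorted((h for h in hits if h[1] <= max_dist), key=lambda h: (h[1], h[0]))


# Made-up encoded strings standing in for GFG encodings.
index = {"w1": "JEEJBE", "w2": "JEEJBB", "w3": "EBJJEEBJ"}
results = spot_words("JEEJBE", index)
```

A distance threshold such as `max_dist` trades recall for precision; a real system would tune it against labelled word images.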


Proceedings Article
04 Feb 2009
TL;DR: Describes a document image analysis system that performs segmentation, content characterization, and semantic labeling of components, and reports promising results for semantic segmentation of over 30 categories of documents in Indian scripts.
Abstract: In this paper, we describe our document image analysis system, which performs segmentation, content characterization, and semantic labeling of components. Segmentation is done using white spaces and yields the segmented components arranged in a hierarchy. Semantic labeling is done using domain knowledge, which is specified, where possible, in the form of a document model applicable to a class of documents. The novelty of the system lies in the suite of methods it employs, which are capable of handling documents in Indian scripts. We have obtained promising results for semantic segmentation of over 30 categories of documents in Indian scripts.

2 citations
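
The abstract above mentions segmentation by white spaces that produces a hierarchy of components, without naming the algorithm. One classic technique of that kind is the recursive X-Y cut, sketched here on a tiny binary image (1 = ink, 0 = background). This is a stand-in chosen for illustration; the function names, the `min_gap` parameter, and the toy page are assumptions, and the authors' system may work differently.

```python
def content_runs(profile, min_gap):
    """Split a projection profile into (start, end) content runs.

    Runs are separated by at least `min_gap` consecutive empty entries;
    leading and trailing whitespace is trimmed.
    """
    runs, start, last_ink = [], None, None
    for i, value in enumerate(profile):
        if value:
            if start is None:
                start = i
            elif i - last_ink - 1 >= min_gap:
                runs.append((start, last_ink + 1))
                start = i
            last_ink = i
    if start is not None:
        runs.append((start, last_ink + 1))
    return runs


def xy_cut(img, top, left, bottom, right, min_gap=2, horizontal=True, retried=False):
    """Recursively cut the box [top:bottom, left:right] along whitespace gaps.

    Returns a tree of {"box": (top, left, bottom, right), "children": [...]}.
    """
    if horizontal:  # project ink onto rows, cut between row bands
        profile = [sum(img[r][left:right]) for r in range(top, bottom)]
    else:           # project ink onto columns, cut between column bands
        profile = [sum(img[r][c] for r in range(top, bottom)) for c in range(left, right)]
    runs = content_runs(profile, min_gap)
    if not runs:
        return {"box": (top, left, bottom, right), "children": []}
    if len(runs) == 1:
        # No cut possible here: tighten the box, then try the other direction once.
        s, e = runs[0]
        if horizontal:
            top, bottom = top + s, top + e
        else:
            left, right = left + s, left + e
        if retried:
            return {"box": (top, left, bottom, right), "children": []}
        return xy_cut(img, top, left, bottom, right, min_gap, not horizontal, True)
    children = []
    for s, e in runs:
        if horizontal:
            children.append(xy_cut(img, top + s, left, top + e, right, min_gap, False))
        else:
            children.append(xy_cut(img, top, left + s, bottom, left + e, min_gap, True))
    return {"box": (top, left, bottom, right), "children": children}


# Toy page: a full-width header row, two blank rows, then two text columns.
page = [[1] * 9, [0] * 9, [0] * 9] + [[1, 1, 1, 0, 0, 0, 1, 1, 1] for _ in range(4)]
tree = xy_cut(page, 0, 0, len(page), len(page[0]))
```

On this toy page the root splits into a header block and a body block, and the body splits again into two columns, which mirrors the kind of hierarchy the abstract describes.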