Showing papers by "Gaurav Harit" published in 2009

Open Access

Proceedings Article

Managing multilingual OCR project using XML

Gaurav Harit¹, K. J. Jinesh², Ritu Garg³, C. V. Jawahar², Santanu Chaudhury³•Institutions (3)

Indian Institute of Technology Kharagpur¹, International Institute of Information Technology, Hyderabad², Indian Institute of Technology Delhi³

25 Jul 2009

TL;DR: This paper describes how a new XML-based tagging scheme was used to meet the objectives of a project developing OCR for 11 scripts of Indian origin for which mature OCR technology was not available.

Abstract: This paper presents an XML-based scheme for managing a large multilingual OCR project. In particular, we describe how a new XML-based tagging scheme has been exploited to achieve the objectives of the project. Managing a large multilingual OCR project in which multiple research groups collaboratively develop script-specific and script-independent technologies is a challenging problem. In this paper, we present some of the software and data management strategies designed for the project, which aims to develop OCR for 11 scripts of Indian origin for which mature OCR technology was not available.

3 citations

Book Chapter

GFG-Based Compression and Retrieval of Document Images in Indian Scripts

Gaurav Harit¹, Santanu Chaudhury¹, Ritu Garg¹•Institutions (1)

Indian Institute of Technology Delhi¹

01 Jan 2009

TL;DR: An interactive access scheme for Indian language document collections is presented, using techniques for word-image-based search and retrieval. The compression and retrieval paradigm is applicable even to those Indian scripts for which reliable OCR technology is not available.

Abstract: Indexing and retrieval of Indian language documents is an important problem. We present an interactive access scheme for Indian language document collections using techniques for word-image-based search. The compression and retrieval paradigm we propose is applicable even to those Indian scripts for which reliable OCR technology is not available. Our technique for word spotting exploits the geometrical features of the word image. The word image features are represented in the form of a graph called a geometric feature graph (GFG). The GFG is encoded as a string, which serves as a compressed representation of the word image skeleton. We have also augmented the GFG-based word image spotting with latent semantic analysis for more effective retrieval. The query is specified as a set of word images, and the documents that best match the query representation in the latent semantic space are retrieved. The retrieval paradigm is further extended to the conceptual level with the use of document image content-domain knowledge specified in the form of an ontology.

2 citations

Proceedings Article

Syntactic and Semantic Labeling of Hierarchically Organized Document Image Components of Indian Scripts

Gaurav Harit¹, Ritu Garg², Santanu Chaudhury²•Institutions (2)

Indian Institute of Technology Kharagpur¹, Indian Institute of Technology Delhi²

04 Feb 2009

TL;DR: This paper describes a document image analysis system that performs segmentation, content characterization, and semantic labeling of components, and reports promising results for semantic segmentation of over 30 categories of documents in Indian scripts.

Abstract: In this paper, we describe our document image analysis system, which performs segmentation, content characterization, and semantic labeling of components. Segmentation is done using white spaces and yields the segmented components arranged in a hierarchy. Semantic labeling is done using domain knowledge, which is specified where possible in the form of a document model applicable to a class of documents. The novelty of the system lies in the suite of methods it employs, which are capable of handling documents in Indian scripts. We have obtained promising results for semantic segmentation of over 30 categories of documents in Indian scripts.

2 citations