Proceedings Article

Managing multilingual OCR project using XML

25 Jul 2009-pp 18

TL;DR: This paper describes how a new XML-based tagging scheme was used to meet the objectives of a project aimed at developing OCR for 11 scripts of Indian origin for which mature OCR technology was not available.

Abstract: This paper presents an XML-based scheme for managing a large multilingual OCR project. In particular, we describe how a new XML-based tagging scheme has been exploited to achieve the objectives of the project. Managing a large multilingual OCR project in which multiple research groups collaboratively develop script-specific and script-independent technologies is a challenging problem. We present some of the software and data management strategies designed for the project, which aimed at developing OCR for 11 scripts of Indian origin for which mature OCR technology was not available.


Citations
Proceedings Article
17 Sep 2011
TL;DR: The project is an attempt to implement an integrated platform for OCR of different Indian languages and currently is being enhanced for handling the space and time constraints, achieving higher recognition accuracies and adding new functionalities.
Abstract: This paper presents an integration and testing scheme for managing a large multilingual OCR project. The project is an attempt to implement an integrated platform for OCR of different Indian languages. Software engineering, workflow management and testing processes are discussed. The OCR has now been experimentally deployed for some specific applications and is currently being enhanced to handle space and time constraints, achieve higher recognition accuracies and add new functionalities.

25 citations


Cites methods from "Managing multilingual OCR project u..."

  • ...It should follow the Input/Output XML scheme specified for the project [1]....

  • ...XML has been used as architecture specification language and enables handling huge amount of data in such large projects [1]....

01 Jan 2014
TL;DR: A symbol vocabulary and a system of logic are combined to enable inferences about elements in the knowledge representation to create new knowledge representation sentences by using various techniques.
Abstract: A knowledge representation (KR) is an idea to enable an individual to determine consequences by thinking rather than acting, i.e., by reasoning about the world rather than taking action in it. The knowledge acquired from experts or induced from a set of data must be represented in a format that is both understandable by humans and executable on computers. Knowledge representation research involves analysis of how to reason accurately and effectively and how best to use a set of symbols to represent a set of facts within a knowledge domain. A symbol vocabulary and a system of logic are combined to enable inferences about elements in the knowledge representation and to create new knowledge representation sentences by using various techniques.

2 citations

Proceedings Article
01 Oct 2016
TL;DR: Over the years, the volume of information available through the World Wide Web has been increasing continuously, and never has so much information been readily available and shared among so many people.
Abstract: Over the years, the volume of information available through the World Wide Web has been increasing continuously, and never has so much information been readily available and shared among so many people. Unfortunately, the unstructured nature and huge volume of information accessible over the network have made it difficult for users to sift through and find relevant information. Commonly used information retrieval techniques are based on keywords. These techniques use keyword lists to describe the content of information, but one problem with such lists is that they say nothing about the semantic relationships between keywords, nor do they take into account the meaning of words or phrases.

1 citation


References
Proceedings Article
09 Oct 1994
TL;DR: This paper reports the status of the UNIPEN project of data exchange and recognizer benchmarks, started two years earlier to propose and implement solutions to the growing need for handwriting samples for online handwriting recognizers used by pen-based computers.
Abstract: We report the status of the UNIPEN project of data exchange and recognizer benchmarks started two years ago at the initiative of the International Association of Pattern Recognition (Technical Committee 11). The purpose of the project is to propose and implement solutions to the growing need for handwriting samples for online handwriting recognizers used by pen-based computers. Researchers from several companies and universities have agreed on a data format, a platform of data exchange and a protocol for recognizer benchmarks. The online handwriting data of concern may include handprint and cursive from various alphabets (including Latin and Chinese), signatures and pen gestures. These data will be compiled and distributed by the Linguistic Data Consortium. The benchmarks will be arbitrated by the US National Institute of Standards and Technology. We give a brief introduction to the UNIPEN format. We explain the protocol of data exchange and benchmarks.

428 citations

Journal Article
TL;DR: This paper surveys current topics in document image understanding from a technical point of view, covering methods and approaches for the recognition of various kinds of documents.
Abstract: The aim of document image understanding is to extract and classify individual data meaningfully from paper-based documents. To date, many methods and approaches have been proposed with regard to the recognition of various kinds of documents, various technical problems in extending OCR, and requirements for practical use. Although the technical research issues of the early stage were regarded as complementary to traditional OCR, which depends on character recognition techniques, the application ranges and related issues are now widely investigated or should be established progressively. This paper addresses current topics in document image understanding from a technical point of view as a survey.
Key words: document model, top-down, bottom-up, layout structure, logical structure, document types, layout recognition

221 citations

Proceedings Article
23 Aug 2004
TL;DR: A general XML-based solution to distributed processing and memory structures in cognitive vision systems is described, and practical experiences are reported to underline its suitability.
Abstract: Distributed processing and memory structures are very important aspects of cognitive vision systems. Both issues not only require sophisticated conceptual designs but also pose problems of software and systems engineering. In this paper, we describe a general XML-based solution to these problems. Practical experiences are reported to underline its suitability.

43 citations

Proceedings Article
23 Sep 2007
TL;DR: A new format for representing both intermediate and final OCR results is described, developed in response to the needs of a newly developed OCR system and ground truth data release, which embeds OCR information invisibly inside the HTML and CSS standards.
Abstract: Large scale scanning and document conversion efforts have led to a renewed interest in OCR systems and workflows. This paper describes a new format for representing both intermediate and final OCR results, developed in response to the needs of a newly developed OCR system and ground truth data release. The format embeds OCR information invisibly inside the HTML and CSS standards and therefore can represent a wide range of linguistic and typographic phenomena with already well-defined, widely understood markup and can be processed using widely available and known tools. The format is based on a new, multi-level abstraction of OCR results based on logical markup, common typesetting models, and OCR engine-specific markup, making it suitable both for the support of existing workflows and the development of future model-based OCR engines.

36 citations

Journal Article
TL;DR: Analysts and some early adopters say XML will help simplify this complex, labor-intensive process because it lets developers precisely identify pieces of data on the basis of content.
Abstract: Many companies have a wealth of information in their databases and applications, and want to leverage these assets. They are beginning to see that XML could help them do this by providing a standard data format for cross-platform information exchange. XML thus offers a powerful new way to integrate new and existing applications within companies. Today much of that linking must be done using enterprise application integration software, which is often complex and quite costly. Developers may have to translate legacy APIs to new component APIs or change how data and workflow processes are structured to permit data exchange between incompatible applications. Analysts and some early adopters say XML will help simplify this complex, labor-intensive process because it lets developers precisely identify pieces of data on the basis of content.

35 citations