Book Chapter

A Heuristic Approach for Designing Regional Language Based Raw–Text Extractor and Unicode Font–Mapping Tool

13 Dec 2008, Vol. 28, pp. 1-12
TL;DR: This work presents a heuristic approach to interactive information extraction for content written in Indian regional languages, enabling a naive user to efficiently extract regional-language text from a web document.
Abstract: Information Extraction (IE) is a type of information retrieval aimed at extracting structured information. In general, information on the web is structured in HTML or XML format, and IE structures these documents further by using learning techniques for pattern matching in the content. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the extracted information. In this paper, we present a heuristic approach to interactive information extraction where the information is in an Indian regional language. This enables any naive user to efficiently extract a regional-language (Indian) document from a web document. The approach is similar to a pre-programmed information extraction engine.
Topics: Information extraction (69%), XML (58%), Unicode (56%), Natural language (54%)
References

Journal Article
Abstract: Many different languages are spoken in India, each language being the mother tongue of tens of millions of people. While the languages and scripts are distinct from each other, the grammar and the alphabet are similar to a large extent. One common feature is that all the Indian languages are phonetic in nature. In this paper we describe the development of a transliteration scheme, Om, which exploits this phonetic nature of the alphabet. Om uses ASCII characters to represent Indian language alphabets, and thus can be read directly in English by a large number of users who cannot read the script of Indian languages other than their mother tongue. It is also useful in computer applications where local language tools such as email and chat are not yet available. Another significant contribution presented in this paper is the development of a text editor for Indian languages that integrates Om input for many Indian languages into a word processor such as Microsoft WinWord®. The text editor is also developed on the Java® platform, so it can run on Unix machines as well. We propose this transliteration scheme as a possible standard for Indian language transliteration and keyboard entry.

35 citations


Ganapathiraju, Madhavi, Balakrishnan, Reddy, Raj 
01 Jan 2005
TL;DR: This paper describes the development of Om, a transliteration scheme that uses ASCII characters to represent Indian language alphabets and can therefore be read directly in English by the many users who cannot read the script of Indian languages other than their mother tongue.
Abstract: Many different languages are spoken in India, each language being the mother tongue of tens of millions of people. While the languages and scripts are distinct from each other, the grammar and the alphabet are similar to a large extent. One common feature is that all the Indian languages are phonetic in nature. In this paper we describe the development of a transliteration scheme, Om, which exploits this phonetic nature of the alphabet. Om uses ASCII characters to represent Indian language alphabets, and thus can be read directly in English by a large number of users who cannot read the script of Indian languages other than their mother tongue. It is also useful in computer applications where local language tools such as email and chat are not yet available. Another significant contribution presented in this paper is the development of a text editor for Indian languages that integrates Om input for many Indian languages into a word processor such as Microsoft WinWord(R). The text editor is also developed on the Java(R) platform, so it can run on Unix machines as well. We propose this transliteration scheme as a possible standard for Indian language transliteration and keyboard entry.

20 citations


"A Heuristic Approach for Designing ..." refers to this paper for background

  • ...Not only that, a detailed study of the Hindi fonts is highly required as the algorithm is based on this language [4, 5]. The approach for raw text extraction is the same for any other language(s), as it follows a generic heuristic that uses a regular expression to search for a matching pattern of tags associated with the keyword specifying the proprietary font name as. But, here Hindi is used for our test set....

    [...]
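
The excerpt above describes a regular-expression heuristic that looks for tags naming a proprietary font and then maps the extracted text to Unicode. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the font is declared through a quoted `face` attribute on a `<font>` tag. The font name `ExampleHindiFont`, the helper names, and the three-entry glyph table are hypothetical and chosen only for illustration; a real legacy font would need its own complete mapping.

```python
import re
from html import unescape

# Hypothetical proprietary font name; a real page would name its own legacy font.
FONT_NAME = "ExampleHindiFont"

# Hypothetical glyph table (illustrative only, not a real font's mapping).
GLYPH_TO_UNICODE = {
    "d": "\u0915",  # DEVANAGARI LETTER KA
    "e": "\u092E",  # DEVANAGARI LETTER MA
    "y": "\u0932",  # DEVANAGARI LETTER LA
}


def extract_font_runs(html_text, font_name):
    """Return the raw text found inside <font face="font_name">...</font>."""
    pattern = re.compile(
        r'<font\b[^>]*\bface\s*=\s*(["\'])'   # opening tag up to the quoted face value
        + re.escape(font_name)
        + r'\1[^>]*>(.*?)</font>',            # same closing quote, then the content
        re.IGNORECASE | re.DOTALL,
    )
    runs = []
    for match in pattern.finditer(html_text):
        inner = re.sub(r"<[^>]+>", "", match.group(2))  # drop any nested tags
        runs.append(unescape(inner).strip())
    return runs


def map_to_unicode(raw_text, table):
    """Replace each legacy glyph code with its Unicode character, if known."""
    return "".join(table.get(ch, ch) for ch in raw_text)


sample = (
    '<p>Title</p>'
    '<font face="ExampleHindiFont" size="3">dey</font>'
    '<font face="Arial">skip</font>'
)
runs = extract_font_runs(sample, FONT_NAME)
converted = [map_to_unicode(run, GLYPH_TO_UNICODE) for run in runs]
```

A character-by-character table is the simplest possible mapping. Legacy Indic fonts often store vowel signs and conjuncts in visual order, so a practical tool would likely also need reordering rules after the lookup, which this sketch omits.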