Topic

Telugu

About: Telugu is a research topic. Over its lifetime, 548 publications have been published within this topic, receiving 4279 citations.


Papers
Proceedings Article
07 Jun 2012
TL;DR: Parallel corpora are built between English and six languages from the Indian subcontinent, which are low-resource, under-studied, and exhibit linguistic phenomena that are difficult for machine translation.
Abstract: Recent work has established the efficacy of Amazon's Mechanical Turk for constructing parallel corpora for machine translation research. We apply this to building a collection of parallel corpora between English and six languages from the Indian subcontinent: Bengali, Hindi, Malayalam, Tamil, Telugu, and Urdu. These languages are low-resource, under-studied, and exhibit linguistic phenomena that are difficult for machine translation. We conduct a variety of baseline experiments and analysis, and release the data to the community.

134 citations

Journal Article
TL;DR: This survey discusses feature extraction and classification techniques for offline handwriting recognition of Indian regional scripts, and is intended as a compendium for researchers, policymakers, and practitioners in India.
Abstract: Offline handwriting recognition in Indian regional scripts is an interesting area of research, as almost 460 million people in India use regional scripts. The nine major Indian regional scripts are Bangla (for Bengali and Assamese languages), Gujarati, Kannada, Malayalam, Oriya, Gurumukhi (for Punjabi language), Tamil, Telugu, and Nastaliq (for Urdu language). A state-of-the-art survey of the techniques available for offline handwriting recognition (OHR) in Indian regional scripts will be of great aid to researchers in the subcontinent, and hence a sincere attempt is made in this article to discuss the advancements reported in this regard during the last few decades. The survey is organized into different sections. A brief introduction is given initially about automatic recognition of handwriting and official regional scripts in India. The nine regional scripts are then categorized into four subgroups based on their similarity and evolution information. The first group contains Bangla, Oriya, Gujarati and Gurumukhi scripts. The second group contains Kannada and Telugu scripts and the third group contains Tamil and Malayalam scripts. The fourth group contains only Nastaliq script (Perso-Arabic script for Urdu), which is not an Indo-Aryan script. Various feature extraction and classification techniques associated with the offline handwriting recognition of the regional scripts are discussed in this survey. As it is important to identify the script before the recognition step, a section is dedicated to handwritten script identification techniques. A benchmarking database is very important for any pattern recognition related research, so the details of the datasets available in different Indian regional scripts are also mentioned in the article. A separate section is dedicated to the observations made, future scope, and existing difficulties related to handwriting recognition in Indian regional scripts. We hope that this survey will serve as a compendium not only for researchers in India, but also for policymakers and practitioners in India. It will also help to bring together the researchers working on different Indian scripts. Looking at the recent developments in OHR of Indian regional scripts, this article will provide a better platform for future research activities.

133 citations

Proceedings Article
10 Sep 2001
TL;DR: This work presents an efficient and practical approach to Telugu OCR which limits the number of templates to be recognized to just 370, avoiding issues of classifier design for thousands of shapes or very complex glyph segmentation.
Abstract: Telugu is the language spoken by more than 100 million people of South India. Telugu has a complex orthography with a large number of distinct character shapes (estimated to be of the order of 10,000) composed of simple and compound characters formed from 16 vowels (called achchus) and 36 consonants (called hallus). We present an efficient and practical approach to Telugu OCR which limits the number of templates to be recognized to just 370, avoiding issues of classifier design for thousands of shapes or very complex glyph segmentation. A compositional approach using connected components and fringe distance template matching was tested to give a raw OCR accuracy of about 92%. Several experiments across varying fonts and resolutions showed the approach to be satisfactory.
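The template-matching step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes one common formulation of fringe distance: each template is turned into a "fringe map" holding the city-block distance from every cell to the nearest black pixel, and an input glyph is scored by summing the map values under its own black pixels. The function names, the 3x3 toy bitmaps, and the choice of city-block distance are illustrative assumptions, not details taken from the paper.

```python
from collections import deque


def fringe_map(template):
    """City-block distance from every cell to the nearest black (1) pixel."""
    rows, cols = len(template), len(template[0])
    dist = [[None] * cols for _ in range(rows)]
    queue = deque()
    # Seed the breadth-first search with every black pixel at distance 0.
    for r in range(rows):
        for c in range(cols):
            if template[r][c]:
                dist[r][c] = 0
                queue.append((r, c))
    if not queue:
        raise ValueError("template must contain at least one black pixel")
    # Multi-source BFS over 4-connected neighbours.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist


def fringe_distance(glyph, fmap):
    """Sum of fringe-map values under the glyph's black pixels (lower is closer)."""
    return sum(
        fmap[r][c]
        for r in range(len(glyph))
        for c in range(len(glyph[0]))
        if glyph[r][c]
    )


def best_template(glyph, templates):
    """Return the name of the template with the smallest fringe distance."""
    maps = {name: fringe_map(bitmap) for name, bitmap in templates.items()}
    return min(maps, key=lambda name: fringe_distance(glyph, maps[name]))


# Hypothetical toy bitmaps (same size as the glyph), for illustration only.
VERTICAL = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
HORIZONTAL = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
GLYPH = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]

match = best_template(GLYPH, {"vertical": VERTICAL, "horizontal": HORIZONTAL})
```

This score is one-directional: it penalizes glyph pixels that lie far from the template but not template pixels the glyph is missing. A practical system would also need size normalization and a symmetric or weighted variant, which this sketch leaves out.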

122 citations

Journal Article
TL;DR: The contrast between the two forms of language, speech versus script, is all the more striking given that written language is invariably based on spoken language.
Abstract: tures of the language spoken around him, whether it be English, Chinese, or Telugu. Learning the written language, however, is frequently quite an arduous process. Millions of people in the world are illiterate for lack of adequate educational opportunity. A significant number of American children have problems with reading and writing, even with the help of the best facilities. This contrast between the two forms of language, speech versus script, is all the more striking given that written language is invariably based on spoken language.

120 citations

Proceedings Article
03 Aug 2003
TL;DR: This paper describes the character recognition process from printed documents containing Hindi and Telugu text using a bilingual recognizer based on Principal Component Analysis followed by support vector classification.
Abstract: This paper describes the character recognition process from printed documents containing Hindi and Telugu text. Hindi and Telugu are among the most popular languages in India. The bilingual recognizer is based on Principal Component Analysis followed by support vector classification, and attains an overall accuracy of approximately 96.7%. Extensive experimentation is carried out on an independent test set of approximately 200,000 characters. Applications based on this OCR are sketched.

111 citations


Network Information
Related Topics (5)
Vocabulary: 44.6K papers, 941.5K citations, 72% related
Natural language: 31.1K papers, 806.8K citations, 70% related
Grammar: 33.8K papers, 767.6K citations, 70% related
Sentence: 41.2K papers, 929.6K citations, 69% related
Language model: 17.5K papers, 545K citations, 68% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  60
2022  149
2021  41
2020  44
2019  29
2018  36