
Showing papers on "Devanagari published in 2009"


Journal ArticleDOI
TL;DR: Pioneering development of two databases for handwritten numerals of the two most popular Indian scripts, a multistage cascaded recognition scheme using wavelet-based multiresolution representations and multilayer perceptron classifiers, and its application to the recognition of mixed handwritten numerals of three Indian scripts: Devanagari, Bangla and English.
Abstract: This article primarily concerns the problem of isolated handwritten numeral recognition for major Indian scripts. The principal contributions are (a) the pioneering development of two databases for handwritten numerals of the two most popular Indian scripts, (b) a multistage cascaded recognition scheme using wavelet-based multiresolution representations and multilayer perceptron classifiers, and (c) the application of (b) to the recognition of mixed handwritten numerals of three Indian scripts: Devanagari, Bangla and English. The databases include 22,556 and 23,392 handwritten isolated numeral samples of Devanagari and Bangla respectively, collected from real-life situations, and can be made available free of cost to researchers at other academic institutions. In the proposed scheme, a numeral is subjected to three multilayer perceptron classifiers corresponding to three coarse-to-fine resolution levels in a cascaded manner. If rejection occurs even at the highest resolution, another multilayer perceptron is used as a final attempt to recognize the input numeral by combining the outputs of the three classifiers of the previous stages. The scheme has been extended to the situation where the script of a document is not known a priori or the numerals on a document belong to different scripts. Handwritten numerals in mixed scripts are frequently found in Indian postal mail and table-form documents.
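The cascade-with-rejection control flow described above can be sketched in a few lines of Python (a minimal illustration only: the stub classifiers, threshold value, and sample keys are invented, and the paper's wavelet features and trained MLPs are not reproduced):

```python
# Illustrative sketch of a coarse-to-fine cascade with rejection.
# The stub classifiers are invented stand-ins; only the control flow
# mirrors the scheme in the abstract.

def make_stub_classifier(table):
    """Return a toy classifier mapping a sample to a (label, confidence) pair."""
    def classify(sample):
        return table[sample]
    return classify

def cascade_classify(sample, classifiers, threshold=0.9, combiner=None):
    """Run classifiers coarse-to-fine, accepting the first confident answer.

    If every stage rejects (confidence below threshold), fall back to a
    combiner that sees all per-stage outputs, mirroring the final MLP of
    the abstract; lacking one, take the most confident stage's label.
    """
    outputs = []
    for clf in classifiers:
        label, confidence = clf(sample)
        outputs.append((label, confidence))
        if confidence >= threshold:
            return label
    if combiner is not None:
        return combiner(outputs)
    return max(outputs, key=lambda lc: lc[1])[0]
```

A coarse stage that misreads a numeral can thus be overruled by a finer, more confident stage, which is the point of the cascade.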

328 citations


Proceedings ArticleDOI
26 Jul 2009
TL;DR: A novel and effective connected-component-based scheme for extracting Devanagari and Bangla text from camera-captured scene images; the authors also observe that there are situations in which repeated binarization by a well-known global thresholding approach is effective.
Abstract: With the increasing popularity of digital cameras attached to various handheld devices, many new computational challenges have gained significance. One such problem is the extraction of text from natural scene images captured by such devices. The extracted text can be sent to an OCR or a text-to-speech engine for recognition. In this article, we propose a novel and effective scheme based on analysis of connected components for extraction of Devanagari and Bangla text from camera-captured scene images. A common distinctive feature of these two scripts is the presence of the headline, and the proposed scheme uses mathematical morphology operations for its extraction. Additionally, we consider a few criteria for robust filtering of text components from such scene images. Moreover, we studied the problem of binarization of such scene images and observed that there are situations in which repeated binarization by a well-known global thresholding approach is effective. We tested our algorithm on a repository of 100 scene images containing Devanagari and/or Bangla text.
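The headline cue the scheme relies on can be illustrated with a toy row-profile check (a hedged sketch: the paper uses mathematical morphology, not this heuristic, and the bitmaps and density threshold below are invented):

```python
# Toy illustration of why the headline (the horizontal bar running along
# the top of Devanagari/Bangla words) is a strong cue: one row of a text
# component is almost solid ink. Bitmaps are lists of 0/1 rows.

def headline_row(bitmap):
    """Index of the row with the highest ink density."""
    densities = [sum(row) / len(row) for row in bitmap]
    return max(range(len(densities)), key=densities.__getitem__)

def has_headline(bitmap, min_density=0.8):
    """A component is headline-like if some row is almost solid ink."""
    row = bitmap[headline_row(bitmap)]
    return sum(row) / len(row) >= min_density
```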

63 citations


Book
27 Oct 2009
TL;DR: This unique guide/reference is the very first comprehensive book on the subject of OCR (Optical Character Recognition) for Indic scripts and provides a section on the enhancement of text and images obtained from historical Indic palm leaf manuscripts.
Abstract: This unique guide/reference is the very first comprehensive book on the subject of OCR (Optical Character Recognition) for Indic scripts. Features: contains contributions from the leading researchers in the field; discusses data set creation for OCR development; describes OCR systems that cover 8 different scripts Bangla, Devanagari, Gurmukhi, Gujarati, Kannada, Malayalam, Tamil, and Urdu (Perso-Arabic); explores the challenges of Indic script handwriting recognition in the online domain; examines the development of handwriting-based text input systems; describes ongoing work to increase access to Indian cultural heritage materials; provides a section on the enhancement of text and images obtained from historical Indic palm leaf manuscripts; investigates different techniques for word spotting in Indic scripts; reviews mono-lingual and cross-lingual information retrieval in Indic languages. This is an excellent reference for researchers and graduate students studying OCR technology and methodologies.

46 citations


Journal ArticleDOI
TL;DR: This overview examines the historical development of mechanizing Indian scripts and the computer processing of Indian languages, and describes the challenges involved in their design and in exploiting their structural similarity to arrive at a unified solution.
Abstract: This overview examines the historical development of mechanizing Indian scripts and the computer processing of Indian languages. While examining possible solutions, the author describes the challenges involved in their design and in exploiting their structural similarity to arrive at a unified solution. The focus is on the Devanagari script and the Hindi language, and on the technological solutions for processing them.

42 citations


01 Jan 2009
TL;DR: This work tests the recognition performance of five feature extraction methods from the literature on Devanagari handwritten characters, using two classifiers, MLP and SVM.
Abstract: The Devanagari script is used by various languages of the South Asian subcontinent, such as Sanskrit, Rajasthani, Marathi and Nepali, and it is also the script of Hindi, the mother tongue of a majority of Indians. Recognition of handwritten characters of the Devanagari alphabet set is an important area of research. The work reported in the literature on recognition of handwritten Devanagari script is negligible, despite the script being used by millions of people in India and abroad and having numerous applications. The feature extraction method(s) used to recognize hand-printed characters play an important role in ICR applications. Many feature extraction methods are available in the literature. We have tested the recognition performance of five such methods on Devanagari handwritten characters. A database of more than 25,000 handwritten Devanagari characters, covering 43 Devanagari alphabet characters, was developed by collecting samples from hundreds of writers. The performance comparisons were made using two classifiers, MLP and SVM.

42 citations


Journal ArticleDOI
TL;DR: A novel feature of this approach is the use of sub-character primitive components in the classification stage to reduce the number of classes, while an n-gram language model based on linguistic character units is used for word recognition.
Abstract: This paper describes a novel recognition driven segmentation methodology for Devanagari Optical Character Recognition. Prior approaches have used sequential rules to segment characters followed by template matching for classification. Our method uses a graph representation to segment characters. This method allows us to segment horizontally or vertically overlapping characters as well as those connected along non-linear boundaries into finer primitive components. The components are then processed by a classifier and the classifier score is used to determine if the components need to be further segmented. Multiple hypotheses are obtained for each composite character by considering all possible combinations of the classifier results for the primitive components. Word recognition is performed by designing a stochastic finite state automaton (SFSA) that takes into account both classifier scores as well as character frequencies. A novel feature of our approach is that we use sub-character primitive components in the classification stage in order to reduce the number of classes whereas we use an n-gram language model based on the linguistic character units for word recognition.

35 citations


Book ChapterDOI
15 Dec 2009
TL;DR: The objective of the current work is to recognize postal codes written in Roman, Devanagari, Bangla and Arabic scripts, using a script-independent unified pattern classifier to classify any digit pattern of these scripts into one of 25 classes.
Abstract: The objective of the current work is to recognize postal codes written in Roman, Devanagari, Bangla and Arabic scripts. In the first stage, 25 unique digit patterns are identified from the handwritten numeral patterns of the said four scripts. A script-independent unified pattern classifier is then designed to classify any digit pattern of these scripts into one of the 25 classes. In the next stage, a rule-based script inference engine infers the script of the numeric string and invokes one of the four script-specific classifiers. The average script-inference accuracy over a six-digit numeric string is observed to be 95.1%, and the best recognition rates for the four script-specific digit classifiers are 96.10%, 94.40%, 96.45% and 95.60% respectively.
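The second-stage script inference can be illustrated with a simple voting sketch (an assumption-laden stand-in: the paper uses a rule-based engine, and the shape-class names and script mappings below are invented):

```python
from collections import Counter

def infer_script(shape_classes, shape_scripts):
    """Infer the script of a numeric string from its unified shape classes.

    Each digit's shape class votes for every script that can produce that
    shape; the script gathering the most votes wins. This majority vote is
    a simplified stand-in for the paper's rule-based inference engine.
    """
    votes = Counter()
    for cls in shape_classes:
        for script in shape_scripts[cls]:
            votes[script] += 1
    return votes.most_common(1)[0][0]
```

The intuition is that shapes shared across scripts are uninformative on their own, but a full six-digit string almost always contains at least one script-discriminating shape.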

32 citations


Book ChapterDOI
01 Jan 2009
TL;DR: A multi-font Gurmukhi OCR for printed text with an accuracy rate exceeding 96% at the character level and a multi-stage classification scheme in which the binary tree and k-nearest neighbor classifiers have been used in a hierarchical fashion are presented.
Abstract: Recognition of Indian language scripts is a challenging problem and work towards the development of a complete OCR system for Indian language scripts is still in its infancy. Complete OCR systems have recently been developed for Devanagari and Bangla scripts. However, research in the field of recognition of Gurmukhi script faces major problems mainly due to the unique characteristics of the script such as connectivity of characters on a headline, characters pointing in both horizontal and vertical directions, two or more characters in a word having intersecting minimum bounding rectangles along horizontal direction, existence of a large set of visually similar character pairs, multi-component characters, touching and broken characters, and horizontally overlapping text segments. This chapter addresses the problems in the various stages of the development of a complete OCR system for Gurmukhi script and discusses potential solutions. A multi-font Gurmukhi OCR for printed text with an accuracy rate exceeding 96% at the character level is presented. A combination of local and global structural features is used for the feature extraction process, aimed at capturing the geometrical and topological features of the characters. For classification, we have implemented a multi-stage classification scheme in which the binary tree and k-nearest neighbor classifiers have been used in a hierarchical fashion.

31 citations


Book ChapterDOI
01 Jan 2009
TL;DR: This chapter describes the challenges in recognizing online handwriting in Indic scripts and provides an overview of the state of the art for isolated character and word recognition, starting with handwriting-based text input systems (IMEs) that have been built for entering Indic Scripts.
Abstract: Online handwriting recognition refers to the problem of machine recognition of handwriting captured in the form of pen trajectories. The recognition technology holds significant promise for Indic scripts, given that the Indic languages are used by a sixth of the world’s population, and the greater ease of use of handwriting-based text input for these scripts compared to keyboard-based methods. Even though the recognition of handwritten Devanagari, Bangla, and Tamil has received significant attention in recent times, one may say that research efforts directed at Indic script recognition in general are in their early stages. The structure of the scripts and the variety of shapes and writing styles pose challenges that are different from other scripts and hence require customized techniques for feature representation and recognition. In this chapter, we describe the challenges in recognizing online handwriting in Indic scripts and provide an overview of the state of the art for isolated character and word recognition. We then present in brief some of the promising applications, starting with handwriting-based text input systems (IMEs) that have been built for entering Indic scripts. In the last section, we provide a few pointers to resources such as tools and data sets that are currently available for online Indic script recognition research.

29 citations


Book ChapterDOI
01 Jan 2009
TL;DR: An adaptive Indic OCR system implemented as part of a rapidly retargetable language tool effort; the work is a step toward the recognition of scripts of low-density languages, which typically do not warrant the development of commercial OCR yet often have complete TrueType font descriptions.
Abstract: In this chapter, we describe an adaptive Indic OCR system implemented as part of a rapidly retargetable language tool effort and extend work found in [20, 2]. The system includes script identification, character segmentation, training sample creation, and character recognition. For script identification, Hindi words are identified in bilingual or multilingual document images using features of the Devanagari script and support vector machine (SVM). Identified words are then segmented into individual characters, using a font-model-based intelligent character segmentation and recognition system. Using characteristics of structurally similar TrueType fonts, our system automatically builds a model to be used for the segmentation and recognition of the new script, independent of glyph composition. The key is a reliance on known font attributes. In our recognition system three feature extraction methods are used to demonstrate the importance of appropriate features for classification. The methods are tested on both Latin and non-Latin scripts. Results show that the character-level recognition accuracy exceeds 92% for non-Latin and 96% for Latin text on degraded documents. This work is a step toward the recognition of scripts of low-density languages which typically do not warrant the development of commercial OCR, yet often have complete TrueType font descriptions.

18 citations


01 Jan 2009
TL;DR: An automatic recognition system for isolated handwritten numerals of three popular south Indian scripts, using Kannada, Devanagari, and Telugu numeral sets.
Abstract: This paper presents an automatic recognition system for isolated handwritten numerals of three popular south Indian scripts; Kannada, Devanagari, and Telugu numeral sets are used for the experiments. The proposed method is thinning-free and requires no size normalization. Structural features, viz. directional density of pixels, water reservoirs, maximum profile distances, and fill-hole density, are used for handwritten numeral recognition. A Euclidean distance criterion with a k-nearest neighbor classifier is used to classify the handwritten numerals. A total of 5,250 numeral images were considered for experimentation, and overall accuracies of 95.40%, 90.20%, and 98.40% were achieved for Kannada, Devanagari and Telugu numerals, respectively. The novelty of the proposed method is that it is thinning-free, fast, and requires no size normalization. Keywords - OCR, Handwritten Numeral Recognition, k-NN, Structural feature, Indian script
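The classification step, a k-nearest neighbor vote under Euclidean distance, can be sketched as follows (the structural features named in the abstract are not reimplemented; the feature vectors in the usage below are toy values):

```python
import math
from collections import Counter

def knn_classify(features, train, k=3):
    """Majority vote among the k nearest training samples under
    Euclidean distance. `train` is a list of (feature_vector, label)
    pairs; the structural features named in the abstract (reservoirs,
    profile distances, fill-hole density) are not reimplemented here."""
    nearest = sorted(train, key=lambda item: math.dist(features, item[0]))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]
```

In practice each numeral image would first be reduced to such a fixed-length feature vector, and the training list would hold one entry per labelled sample.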

01 Jan 2009
TL;DR: The proposed model to identify and separate text lines of Telugu, Devanagari and English scripts from a printed trilingual document uses the distinct features extracted from the top and bottom profiles of the printed text lines.
Abstract: In a multi-script, multi-lingual environment, a document may contain text lines in more than one script/language. It is necessary to identify the different script regions of the document in order to feed each to the OCR of the corresponding language. In this context, this paper proposes a model to identify and separate text lines of Telugu, Devanagari and English scripts from a printed trilingual document. The proposed method uses distinct features extracted from the top and bottom profiles of the printed text lines. Experiments involved 1500 text lines for learning and 900 text lines for testing. The performance has turned out to be 99.67%.

Journal Article
TL;DR: In this paper, the authors used functional brain imaging to study brain activation patterns when 16 native speakers read phrases in Devanagari, a writing system with alphabetic and syllabic properties.
Abstract: We used functional brain imaging to study brain activation patterns when 16 native speakers read phrases in Devanagari, a writing system with alphabetic and syllabic properties. We found activation in the left insula, fusiform gyrus and inferior frontal gyrus, as seen for reading alphabetic scripts and in the right superior parietal lobule as associated with reading syllabic scripts. Additionally, we found bilateral activation in the middle frontal gyrus (Lt. BA 46, Rt. BA 6/44) which we attribute to complex visuo-spatial processing required for reading Devanagari, wherein consonants are placed linearly from left to right and vowels positioned non-linearly around them.

01 Jan 2009
TL;DR: A two-stage system for handling the extra-space insertion problem in Urdu is presented; this is the first time such a system has been developed for the Urdu script.
Abstract: Hindi and Urdu are variants of the same language, but while Hindi is written in the Devanagari script from left to right, Urdu is written in a script derived from a Persian modification of the Arabic script, written from right to left. To break the script barrier, an Urdu-Devnagri transliteration system has been developed. The transliteration system faced many problems related to word segmentation of Urdu script, as in many cases spaces are not properly placed between Urdu words: sometimes a space is deleted, jumbling several Urdu words together, and at other times an extra space is inserted within a word, over-segmenting it. In this paper, a two-stage system for handling the extra-space insertion problem in Urdu is presented. In the first stage, Urdu grammar rules are applied, while a statistics-based approach is employed in the second stage. For the statistical analysis, lexical resources from both the Urdu and Hindi languages, including Urdu and Hindi unigram and bigram probabilities, have been used. In addition, the Urdu-Devnagri transliteration module is executed in parallel to help in decision making. The system was tested on a 1.84-million-word Urdu corpus and the success rate was 98.57%. This is the first time such a system has been developed for the Urdu script.
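The statistical stage's core decision, whether two adjacent tokens should be rejoined, can be sketched with a toy probability comparison (the probability tables and romanized words below are invented stand-ins for the Urdu/Hindi lexical resources the paper uses):

```python
# Hedged sketch of spurious-space repair: join two tokens when the merged
# word is more probable than the token pair. Falls back from bigram to
# unigram probabilities; eps smooths unseen words.

def should_join(left, right, unigram, bigram, eps=1e-9):
    """Return True if `left + right` looks like one over-segmented word."""
    joined = left + right
    p_pair = bigram.get(
        (left, right),
        unigram.get(left, eps) * unigram.get(right, eps),
    )
    p_joined = unigram.get(joined, eps)
    return p_joined > p_pair
```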

Book ChapterDOI
01 Jan 2009
TL;DR: This chapter presents (perhaps) the first system for recognizing handwritten Urdu words, which achieved an accuracy of 70% for the top choice and 82% for the top three choices.
Abstract: Urdu is a language spoken in the Indian subcontinent by an estimated 130–270 million speakers. At the spoken level, Urdu and Hindi are considered dialects of a single language because of shared vocabulary and the similarity in grammar. At the written level, however, Urdu is much closer to Arabic because it is written in Nastaliq, the calligraphic style of the Persian–Arabic script. Therefore, a speaker of Hindi can understand spoken Urdu but may not be able to read written Urdu because Hindi is written in Devanagari script, whereas an Arabic writer can read the written words but may not understand the spoken Urdu. In this chapter we present an overview of written Urdu. Prior research in handwritten Urdu OCR is very limited. We present (perhaps) the first system for recognizing handwritten Urdu words. On a data set of about 1300 handwritten words, we achieved an accuracy of 70% for the top choice, and 82% for the top three choices.

Proceedings ArticleDOI
01 Dec 2009
TL;DR: A three tier strategy is suggested to recognize the hand-printed characters of Devanagari script and the recognition rate achieved is 94.2% on the authors' database consisting of more than 25000 characters belonging to 43 alphabets.
Abstract: In this paper, a three-tier strategy is suggested to recognize the hand-printed characters of the Devanagari script. In the primary- and secondary-stage classification, the structural properties of the script are exploited to avoid classification errors. The results of all three stages are reported for two classifiers, MLP and SVM, and the results achieved with the latter are very good. The performance of the proposed scheme is reported in terms of recognition accuracy and time. The recognition rate achieved with the proposed scheme is 94.2% on our database consisting of more than 25000 characters belonging to 43 alphabets.

31 Dec 2009
TL;DR: A proposal to encode the Takri script in the international character encoding standard Unicode, which was published in Unicode Standard version 6.1 in January 2012.
Abstract: Author(s): Pandey, Anshuman | Abstract: This is a proposal to encode the Takri script in the international character encoding standard Unicode. The script was published in Unicode Standard version 6.1 in January 2012. Takri was used in northern India and surrounding countries in South Asia. It was the writing system for the Chambeali and Dogri languages, as well as Jaunsari, Kulvi, and Mandeali. It was the official script in a number of states of north and northwestern India from the 17th century until the mid-20th century, when it was gradually replaced by Devanagari.

Proceedings ArticleDOI
02 Nov 2009
TL;DR: Though VKB starts with a higher user error rate compared to InScript, the error rate drops by 55% by the end of the experiment, and the input speed of VKB is found to be 81% higher than InScript; the results point to interesting research directions for the use of multiple natural modalities for Indic text input.
Abstract: Multimodal systems, incorporating more natural input modalities like speech, hand gesture, facial expression etc., can make human-computer-interaction more intuitive by drawing inspiration from spontaneous human-human-interaction. We present here a multimodal input device for Indic scripts called the Voice Key Board (VKB) which offers a simpler and more intuitive method for input of Indic scripts. VKB exploits the syllabic nature of Indic language scripts and exploits the user's mental model of Indic scripts wherein a base consonant character is modified by different vowel ligatures to represent the actual syllabic character. We also present a user evaluation result for VKB comparing it with the most common input method for the Devanagari script, the InScript keyboard. The results indicate a strong user preference for VKB in terms of input speed and learnability. Though VKB starts with a higher user error rate compared to InScript, the error rate drops by 55% by the end of the experiment, and the input speed of VKB is found to be 81% higher than InScript. Our user study results point to interesting research directions for the use of multiple natural modalities for Indic text input.

Patent
04 Nov 2009
TL;DR: In this article, a mechanism for identifying invalid syllables in Devanagari script is described, in which a character type is determined for each character of the text and a new state is associated with the character by referencing a state machine with the determined character type and the current state of the text.
Abstract: A mechanism for identifying invalid syllables in Devanagari script is disclosed. A method of embodiments of the invention includes receiving Devanagari text from an application of a computing device for parsing, determining a character type for a character of the Devanagari text, determining a new state associated with the character by referencing a Devanagari state machine with the determined character type and a current state of the Devanagari text, and transmitting an invalid syllable signal to the application for display on a display device to an end user of the application if the determined new state is invalid.
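The parsing idea can be illustrated with a toy transition table (a simplified sketch in the spirit of the claim: the character classes and transitions below are invented and omit real Devanagari details such as nukta, anusvara, and candrabindu):

```python
# Toy state machine for Devanagari syllable validity. A missing entry in
# the transition table means the character is invalid in that state,
# which is when a real system would emit the invalid-syllable signal.

TRANSITIONS = {
    ("start", "consonant"): "after_consonant",
    ("start", "indep_vowel"): "after_vowel",
    ("after_consonant", "virama"): "after_virama",       # start a conjunct
    ("after_consonant", "vowel_sign"): "after_vowel",
    ("after_consonant", "consonant"): "after_consonant",  # implicit /a/, new syllable
    ("after_virama", "consonant"): "after_consonant",
    ("after_vowel", "consonant"): "after_consonant",      # new syllable
    ("after_vowel", "indep_vowel"): "after_vowel",
}

def first_invalid(char_types):
    """Index of the first invalid character, or None if the text parses."""
    state = "start"
    for i, ctype in enumerate(char_types):
        state = TRANSITIONS.get((state, ctype))
        if state is None:
            return i
    return None
```

For example, a dependent vowel sign with no preceding consonant, or two vowel signs in a row, falls outside the table and is flagged.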

Journal ArticleDOI
TL;DR: Saraswati, a cross‐lingual Sanskrit Digital Library hosted at Banaras Hindu University, is described, which uses the UTF‐8 character representation system and generates on‐the‐fly transliteration from one Indic language script to another.
Abstract: Purpose – The purpose of this paper is to describe Saraswati, a cross‐lingual Sanskrit Digital Library hosted at Banaras Hindu University. The system aims to assist those who know Sanskrit and at least one Indic script out of Devanagari, Kannada, Telugu and Bengali.Design/methodology/approach – The system is developed with the Unicode standard using PHP as the programming language. The system follows three levels of architecture for search, display, and storage of Sanskrit documents. The system uses the UTF‐8 character representation system and generates on‐the‐fly transliteration from one Indic language script to another.Findings – The system successfully demonstrates transliteration of Sanskrit text from one language to another. Saraswati is also capable of searching a given keyword across different languages and produces the result in the desired language script.Research limitations/implications – Some languages such as Tamil (not chosen for study) use context dependent consonants, and with the present...
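On-the-fly transliteration between these scripts is made tractable by the shared ISCII-derived layout of the Unicode Indic blocks, which a naive sketch can exploit by shifting code points between block bases (real systems need exception handling for characters missing from a target block, and whether Saraswati works exactly this way is an assumption):

```python
# The Unicode Indic blocks inherited a common layout from ISCII, so the
# core of script-to-script transliteration can be a fixed code-point
# shift between block bases. Characters outside the source block
# (spaces, digits, punctuation) pass through unchanged.

BLOCK_BASE = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
}
BLOCK_SIZE = 0x80

def transliterate(text, src, dst):
    """Shift each source-block code point into the destination block."""
    shift = BLOCK_BASE[dst] - BLOCK_BASE[src]
    out = []
    for ch in text:
        cp = ord(ch)
        if BLOCK_BASE[src] <= cp < BLOCK_BASE[src] + BLOCK_SIZE:
            out.append(chr(cp + shift))
        else:
            out.append(ch)
    return "".join(out)
```

For instance, Devanagari KA (U+0915) maps to Kannada KA (U+0C95) under this shift.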

Proceedings Article
01 Jan 2009
TL;DR: This paper describes a system for script identification of handwritten word images, divided into two main phases, training and testing; results show significant strength in the approach.
Abstract: This paper describes a system for script identification of handwritten word images. The system is divided into two main phases, training and testing. The training phase performs a moment based feature extraction on the training word images and generates their corresponding feature vectors. The testing phase extracts moment features from a test word image and classifies it into one of the candidate script classes using information from the trained feature vectors. Experiments are reported on handwritten word images from three scripts: Latin, Devanagari and Arabic. Three different classifiers are evaluated over a dataset consisting of 12000 word images in training set and 7942 word images in testing set. Results show significant strength in the approach with all the classifiers having a consistent accuracy of over 97%.

Book
13 Jan 2009
TL;DR: This 1805 grammar of Marathi provides a detailed description of the Devanagari alphabet, its word and sentence formation, and its complex tense, voice, gender, agreement, inflection, and case systems.
Abstract: Marathi, an official language of Maharashtra and Goa, is among the twenty most widely spoken languages in the world. The southernmost Indo-Aryan language, it is also spoken in Gujarat, Madhya Pradesh, Karnataka, and Daman and Diu, and is believed to be over 1,300 years old, with its origins in Sanskrit. First published in 1805, this grammar of Marathi (then known as Mahratta) was compiled by the Baptist missionary William Carey (1761–1834) during his time in India. Its purpose was to assist Carey's European students at Fort William College in their learning of the language, and it is comprehensive in its coverage, providing numerous examples. Containing detailed descriptions of Marathi's Devanagari alphabet, its word and sentence formation, and its complex tense, voice, gender, agreement, inflection, and case systems, the work remains an invaluable resource for linguists today. Carey's 1810 dictionary of Marathi is also reissued in this series.

Dissertation
13 Jul 2009
TL;DR: The focus is on the recognition of offline handwritten Hindi characters, which can be used in common applications like bank cheques, commercial forms, government records, bill processing systems, postcode recognition, signature verification, passport readers, and offline document recognition generated by the expanding technological society.
Abstract: Development of a character recognition system for Devnagri is difficult because (i) there are about 350 basic, modified (“matra”) and compound character shapes in the script and (ii) the characters in a word are topologically connected. The focus here is on the recognition of offline handwritten Hindi characters, which can be used in common applications like bank cheques, commercial forms, government records, bill processing systems, postcode recognition, signature verification, passport readers, and offline document recognition generated by the expanding technological society. Handwriting has continued to persist as a means of communication and of recording information in day-to-day life even with the introduction of new technologies. The challenge in handwritten character recognition lies in the variation and distortion of offline handwritten Hindi characters, since different people may use different styles of handwriting and directions to draw the same shape of a Hindi character. This overview describes the nature of handwritten language, how it is translated into electronic data, and the basic concepts behind written language recognition algorithms. Handwritten Hindi characters are imprecise in nature, as their corners are not always sharp, lines are not perfectly straight, and curves are not necessarily smooth, unlike printed characters. Furthermore, handwritten Hindi characters can be drawn in different sizes and orientations, unlike print, which is assumed to sit upright on a baseline. Therefore, a robust offline Hindi handwriting recognition system has to account for all of these factors. An approach using an Artificial Neural Network is considered for recognition of handwritten Hindi characters.

Book ChapterDOI
01 Jan 2009
TL;DR: This chapter describes a script-specific keyword spotting approach for Devanagari documents that makes use of domain knowledge of the script, and addresses the needs of a digital library to provide access to a collection of documents from multiple scripts.
Abstract: With advances in the field of digitization of printed documents and several mass digitization projects underway, information retrieval and document search have emerged as key research areas. However, most of the current work in these areas is limited to English and a few oriental languages. The lack of efficient solutions for Indic scripts has hampered information extraction from a large body of documents of cultural and historical importance. This chapter presents two relevant topics in this area. First, we describe the use of a script-specific keyword spotting for Devanagari documents that makes use of domain knowledge of the script. Second, we address the needs of a digital library to provide access to a collection of documents from multiple scripts. This requires intelligent solutions which scale across different scripts. We present a script-independent keyword spotting approach for this purpose. Experimental results illustrate the efficacy of our methods.

Book ChapterDOI
01 Jan 2009
TL;DR: A post-recognition error detection approach based on spell-checker principles has been proposed mainly to correct an error in a single position in a recognized word string.
Abstract: This chapter describes our work in the OCR of Bangla and Devanagari, two of the most widely used scripts of the Indian subcontinent. Due to their strong structural similarities, these two scripts can be tackled under a single framework. The proposed approach starts with character and symbol segmentation and employs three recognizers for symbols of different zones. For the middle zone, a two-stage approach with group and individual symbol recognizers is used. The main recognizer is a covariance-based quadratic classifier. The problem of error evaluation and creating ground truth for Indic scripts has also been addressed. A post-recognition error detection approach based on spell-checker principles has been proposed mainly to correct an error in a single position in a recognized word string. Encouraging results have been obtained on multi-font Bangla and Devanagari documents.

Book ChapterDOI
18 Feb 2009
TL;DR: Professor Joshi proposes that a phonemic encoding scheme be adopted as the standard for machine processing of Sanskrit text, in which each phoneme is represented by a single character code that represents a single Sanskrit sound.
Abstract: Professor Joshi proposes that a phonemic encoding scheme be adopted as the standard for machine processing of Sanskrit text. In the scheme he details, each phoneme is represented by a single character code that represents a single Sanskrit sound. Graphic units in Devanagari corresponding to syllabic units, including consonant plus /a/, are represented as sequences. Glyphs corresponding to initial vowels versus dependent vowels are not given distinct encodings; rather, they are selected based upon context. (P. Scharf)

01 Jan 2009
TL;DR: A dictionary-based solution with word bigrams, combined with a remapped keypad layout, gave the desired results, increasing disambiguation accuracy and improving KSPC from 1.0856 to 1.0154.
Abstract: Languages with many letters pose a problem for text entry on reduced keyboards. Using multitap is time consuming, as there can be 6-9 characters per key on a mobile phone. For singletap methods, more letters per key means more words per key sequence, i.e. greater ambiguity when selecting which word to present to the user. Today's singletap methods for mobile phones mostly rely on a dictionary and word frequencies; this works remarkably well with the Latin alphabet, but it is not enough when the number of letters per key increases. In this master thesis we investigated different methods to improve word disambiguation, including word bigrams, part-of-speech n-grams and keypad remappings. We chose the Devanagari script for our implementation, as it is one of the scripts with this problem, and worked with Hindi for the language-specific data. We found that a dictionary-based solution with word bigrams, combined with a remapped keypad layout, gave the desired results. The use of these techniques increased disambiguation accuracy from 77% to 94%. We also saw an improvement in KSPC, from 1.0856 to 1.0154.
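The dictionary-plus-frequency disambiguation the thesis builds on can be sketched as follows (a toy Latin-alphabet stand-in: the keymap, lexicon, and bigram table are invented, and the thesis's actual Devanagari keypad layout is not reproduced):

```python
# Hedged sketch of singletap word disambiguation: given a digit
# sequence, rank all dictionary words whose letters map to that
# sequence, preferring a previous-word bigram match over raw frequency.

def disambiguate(key_seq, prev_word, lexicon, keymap, bigram=None):
    """Return candidate words for `key_seq`, best-ranked first.

    `lexicon` maps word -> frequency, `keymap` maps letter -> key digit,
    and `bigram` optionally maps (prev, word) -> count, which outranks
    frequency when available.
    """
    def keys(word):
        return "".join(keymap[ch] for ch in word)

    matches = [w for w in lexicon if keys(w) == key_seq]

    def score(word):
        bigram_count = bigram.get((prev_word, word), 0) if bigram else 0
        return (bigram_count, lexicon[word])

    return sorted(matches, key=score, reverse=True)
```

With more letters packed onto each key, the candidate list for a given sequence grows, which is exactly why the thesis adds bigram context and keypad remapping on top of plain frequencies.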