Showing papers on "Intelligent word recognition" published in 2009


Proceedings Article
01 Feb 2009
TL;DR: It is demonstrated that the performance of the proposed method can be far superior to that of commercial OCR systems, and can benefit from synthetically generated training data obviating the need for expensive data collection and annotation.
Abstract: This paper tackles the problem of recognizing characters in images of natural scenes. In particular, we focus on recognizing characters in situations that would traditionally not be handled well by OCR techniques. We present an annotated database of images containing English and Kannada characters. The database comprises images of street scenes taken in Bangalore, India using a standard camera. The problem is addressed in an object categorization framework based on a bag-of-visual-words representation. We assess the performance of various features based on nearest neighbour and SVM classification. It is demonstrated that the performance of the proposed method, using as few as 15 training images, can be far superior to that of commercial OCR systems. Furthermore, the method can benefit from synthetically generated training data obviating the need for expensive data collection and annotation.

520 citations


Journal ArticleDOI
TL;DR: A novel hierarchical approach is presented here for optical character recognition (OCR) of handwritten Bangla words that segments a word image on Matra hierarchy, then recognizes the individual word segments and finally identifies the constituent characters of the word image through intelligent combination of recognition decisions of the associated word segments.

112 citations


Proceedings ArticleDOI
26 Jul 2009
TL;DR: A system for handwritten Chinese text recognition integrating language model is described, which generates character segmentation and word segmentation candidates, and the candidate paths are evaluated by character recognition scores and language model.
Abstract: This paper describes a system for handwritten Chinese text recognition integrating language model. On a text line image, the system generates character segmentation and word segmentation candidates, and the candidate paths are evaluated by character recognition scores and language model. The optimal path, giving segmentation and recognition result, is found using a pruned dynamic programming search method. We evaluate various language models, including the character-based n-gram, word-based n-gram, and hybrid n-gram models. Experimental results on the HIT-HW database show that the language models improve the recognition performance remarkably.

44 citations


Proceedings ArticleDOI
26 Jul 2009
TL;DR: A novel SIFT-based feature for offline handwritten Chinese character recognition is proposed: a modification of the SIFT descriptor that takes into account the characteristics of handwritten Chinese samples.
Abstract: The SIFT descriptor has been widely applied in computer vision and object recognition, but has not been explored in the field of handwritten Chinese character recognition. In this paper we propose a novel SIFT-based feature for offline handwritten Chinese character recognition. The presented feature is a modification of the SIFT descriptor that takes into account the characteristics of handwritten Chinese samples. In our approach, global elastic meshing is first constructed and then the related gradient code of each sub-region is accumulated dynamically. Experiments using an MQDF classifier show our feature’s effectiveness with a recognition rate of 97.868%, which outperforms the original SIFT feature and two traditional features, the Gabor feature and the gradient feature.

42 citations


Proceedings ArticleDOI
20 Aug 2009
TL;DR: This study proposes a novel solution for character recognition in Gujarati, the official language of Gujarat, using a method called Pattern Matching in which a character is identified by analyzing its shape and comparing the features that distinguish each character.
Abstract: During the last forty years, Handwritten Character Recognition (HCR) has most often been investigated under the framework of Character Recognition (OCR) and Pattern Recognition. HCR is more often considered a perceptual and interpretation task closely connected with research into Human Language. India is a country which uses many languages in different parts of the country, whether for personal or business use. In this study we propose a novel solution for performing character recognition in Gujarati, the official language of Gujarat. Following the preprocessing techniques, we suggest a method called Pattern Matching where a character is identified by analyzing its shape and comparing the features that distinguish each character. Various handwritten characters from forms, peripheral devices, etc. are recognized with the help of various pre-processing and image enhancement techniques. These characters are then recognized more specifically by pattern matching using a neural network.

39 citations


Proceedings ArticleDOI
26 Jul 2009
TL;DR: A publicly available database, CASIA-OLHWDB1, for research on online handwritten Chinese character recognition, which contains unconstrained handwritten characters of 4,037 categories produced by 420 persons, and 1,694,741 samples in total.
Abstract: This paper describes a publicly available database, CASIA-OLHWDB1, for research on online handwritten Chinese character recognition. This database is the first of our series of online/offline handwritten characters and texts, collected using Anoto pen on paper. It contains unconstrained handwritten characters of 4,037 categories (3,866 Chinese characters and 171 symbols) produced by 420 persons, and 1,694,741 samples in total. It can be used for design and evaluation of character recognition algorithms and classifier design for handwritten text recognition systems. We have partitioned the samples into three grades and into training and test sets. Preliminary experiments on the database using a state-of-the-art recognizer justify the challenge of recognition.

38 citations


Dissertation
19 Nov 2009
TL;DR: The main goal of this thesis is to develop an online handwritten Gurmukhi character recognition system that can be used for quick and natural way of communication between computer and human beings.
Abstract: Computers are greatly influencing the lives of human beings and their usage is increasing at a tremendous rate. The ease with which we can exchange information between user and computer is of immense importance today because input devices such as keyboard and mouse have limitations vis-a-vis input through natural handwriting. We can use the online handwriting recognition process for a quick and natural way of communication between computer and human beings. Handwriting recognition is in research for over four decades and has attracted many researchers across the world. Variations in handwriting are one prominent problem and achieving high degree of accuracy is a tedious task. The main goal of this thesis is to develop an online handwritten Gurmukhi character recognition system. Gurmukhi is the script of Punjabi language which is widely spoken across the globe. This thesis is divided into six chapters. A brief outline of each chapter is given in the following paragraphs. Chapter 1 includes three sections, namely, issues in online handwriting recognition system, literature review and overview of Gurmukhi script. Issues in online handwriting recognition system include: handwriting styles variations; constrained and unconstrained handwriting; personal, situational and material factors; writer dependent vs. writer independent recognition systems. In literature review, a detailed literature survey on each phase of established procedure of online handwriting recognition has been presented. The established procedure to recognize online handwriting includes data collection, preprocessing, feature extraction, segmentation, recognition and post-processing. We have also reviewed literature for different recognition methods. These recognition methods are statistical, syntactical and structural, neural network and elastic matching methods. In addition, we have also discussed some of the results reported in the literature of online handwriting recognition. 
This literature review covers different languages such as English, Chinese, Japanese, Urdu, Hindi, Bangali, Tamil and Telugu. In the overview of Gurmukhi script, we have included nature of handwriting in Gurmukhi script and different characters of Gurmukhi script. Chapter 2 contains the work carried out for three phases of online handwriting character recognition. These phases are data collection, preprocessing and feature extraction. These phases are discussed in three sections entitled data collection phase, preprocessing phase and computation of features phase. In data collection phase, input handwritten strokes are

36 citations


Journal Article
TL;DR: The proposed method has been successfully tested on IFN/ENIT database consisting of 26459 Arabic words handwritten by 411 different writers, and the results were promising and very encouraging in more accurate detection of the baseline and segmentation of words for further recognition.
Abstract: Efficient preprocessing is very essential for automatic recognition of handwritten documents. In this paper, techniques on segmenting words in handwritten Arabic text are presented. Firstly, connected components (ccs) are extracted, and distances among different components are analyzed. The statistical distribution of this distance is then obtained to determine an optimal threshold for words segmentation. Meanwhile, an improved projection based method is also employed for baseline detection. The proposed method has been successfully tested on IFN/ENIT database consisting of 26459 Arabic words handwritten by 411 different writers, and the results were promising and very encouraging in more accurate detection of the baseline and segmentation of words for further recognition. Keywords—Arabic OCR, off-line recognition, Baseline estimation, Word segmentation.

32 citations


Proceedings ArticleDOI
06 Mar 2009
TL;DR: A zone and distance metric based feature extraction system is presented, and a 97.75% recognition rate for Kannada numerals is obtained.
Abstract: Character recognition is the important area in image processing and pattern recognition fields. Handwritten character recognition has received extensive attention in academic and production fields. The recognition system can be either on-line or off-line. Off-line handwriting recognition is the subfield of optical character recognition. India is a multi-lingual and multi-script country, where eighteen official scripts are accepted and have over hundred regional languages. In this paper we present Zone and Distance metric based feature extraction system. The character centroid is computed and the image is further divided in to n equal zones. Average distance from the character centroid to the each pixel present in the zone is computed. This procedure is repeated for all the zones present in the numeral image. Finally n such features are extracted for classification and recognition. Support vector machine is used for subsequent classification and recognition purpose. We obtained 97.75% recognition rate for Kannada numerals.

30 citations
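
The zone-based feature described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract only says the image is divided into n equal zones, so the rectangular `rows` x `cols` grid, the function name `zone_distance_features`, and the list-of-lists binary image format are all assumptions made for the example.

```python
import math

def zone_distance_features(image, rows, cols):
    """Return one feature per zone: the average distance from the
    character centroid to the foreground pixels inside that zone.

    `image` is a binary image given as a list of rows of 0/1 values
    (assumed format). The rows x cols grid is an assumed zoning layout.
    """
    height, width = len(image), len(image[0])
    foreground = [(y, x) for y in range(height)
                  for x in range(width) if image[y][x]]
    if not foreground:
        return [0.0] * (rows * cols)

    # Centroid of all foreground (character) pixels.
    cy = sum(y for y, _ in foreground) / len(foreground)
    cx = sum(x for _, x in foreground) / len(foreground)

    sums = [0.0] * (rows * cols)
    counts = [0] * (rows * cols)
    for y, x in foreground:
        # Map the pixel to its zone index in row-major order.
        zone = (y * rows // height) * cols + (x * cols // width)
        sums[zone] += math.hypot(y - cy, x - cx)
        counts[zone] += 1

    # Zones without foreground pixels get a feature value of 0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

The resulting vector of n = rows * cols values would then be passed to a classifier; the abstract names a support vector machine for that step, which is not shown here.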


Proceedings Article
01 Jan 2009
TL;DR: A novel technique is presented here for recognition of handwritten compound characters of Bangla alphabet, which advocates for incrementally expanding the number of learned character classes from more frequently occurred to less frequently occurred ones.
Abstract: A novel technique is presented here for recognition of handwritten compound characters of Bangla alphabet. It advocates for incrementally expanding the number of learned character classes from more frequently occurred to less frequently occurred ones. The work is preceded by a survey for finding the frequencies of occurrences of all Bangla characters in the standard literature. One important finding of the survey is that only 4.27 percent of characters in a standard text piece are on average compound characters. Out of the 160 compound character classes, characters of 55 classes constitute 90 percent of the compound characters occurring on average in a standard text piece. For the time being, handwritten characters from these classes are considered here. The average recognition rate, as observed under this work, is 84.67 percent after 3 fold cross validation of results. It is more or less comparable with the performance reported in another related work[3]. The work presented here can be considered as an important step for the development of OCR for handwritten Bangla characters, including complex shaped compound characters.

24 citations


Proceedings ArticleDOI
23 Jan 2009
TL;DR: A new elastic image matching (EM) technique based on an eigen-deformation for recognition of offline isolated English uppercase handwritten characters and offline isolated handwritten characters of Devnagari, the most popular script in India is proposed.
Abstract: Recognition of alphabetic characters is a basic need in incorporating intelligence to computers. Machine intelligence involves several aspects among which optical recognition is a tool, which can be integrated to text recognition. To make these aspects effective character recognition with better accuracy is important. However, handwritten character recognition is still a difficult task because of the high variability in the character shapes written by individuals. While large amount of work has been done towards recognition of handwritten English characters relatively less work is reported for the recognition of Indian language scripts. So, we proposed a new elastic image matching (EM) technique based on an eigen-deformation for recognition of offline isolated English uppercase handwritten characters and offline isolated handwritten characters of Devnagari, the most popular script in India. Deformations in handwritten characters have category-dependent tendencies. The estimation and the utilization of such tendencies called eigen-deformations are investigated for the better performance of elastic matching based handwritten character recognition. The eigen-deformations are estimated by the principal component analysis of actual deformations automatically collected by the elastic matching. Typical deformations of each category can be extracted as the eigen-deformations. According to a similarity measure (e.g.: Euclidean, Mahalanobis similarity measures etc.), a prototype matching is done for recognition.

Journal ArticleDOI
TL;DR: A powerful segmentation-free letter detection method based upon joint boosting with histograms of gradients as features based on efficient inference on an ensemble of hidden Markov models to recognize complete words in ambiguous handwritten text.

Posted Content
TL;DR: A scheme for offline Handwritten Devnagari Character Recognition is proposed, which uses different feature extraction methodologies and recognition algorithms, which achieves recognition rates 98.03% for top 5 results and 89.46% for top 1 result.
Abstract: In this paper a scheme for offline Handwritten Devnagari Character Recognition is proposed, which uses different feature extraction methodologies and recognition algorithms. The proposed system assumes no constraints in writing style or size. First the character is preprocessed and features namely : Chain code histogram and moment invariant features are extracted and fed to Multilayer Perceptrons as a preliminary recognition step. Finally the results of both MLP's are combined using weighted majority scheme. The proposed system is tested on 1500 handwritten devnagari character database collected from different people. It is observed that the proposed system achieves recognition rates 98.03% for top 5 results and 89.46% for top 1 result.

Proceedings ArticleDOI
01 Dec 2009
TL;DR: An off-line Arabic handwritten word recognition system is proposed, in which technical details are presented in terms of three stages, i.e. preprocessing, feature extraction and classification.
Abstract: Due to similarities between Arabic letters, and the various writing styles employed, recognition of Arabic handwritten text can be difficult. In this paper, an off-line Arabic handwritten word recognition system is proposed, in which technical details are presented in terms of three stages, i.e. preprocessing, feature extraction and classification. Firstly, words are segmented from input scripts and also normalized in size. Secondly, each segmented word is divided into overlapping blocks. Absolute mean values computed for each block of segmented words constitutes a feature vector. Finally, the resulting feature vectors are used to classify the words using the K nearest Neighbour classifier (KNN). The proposed system has been successfully tested on the IFN/ENIT database consisting of 32492 Arabic handwritten words which are written by more than 1000 different writers. Experimental results show a good recognition rate when compared with other methods.
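
The feature extraction and classification stages described above can be illustrated with a short sketch. It is a simplified reading of the abstract, not the published system: the abstract does not give the block shape or overlap, so overlapping vertical strips of `width` columns advanced by `step` columns are assumed, and the names `block_means` and `knn_classify` are hypothetical.

```python
import math
from collections import Counter

def block_means(image, width, step):
    """Mean absolute pixel value of each overlapping vertical strip.

    `image` is a list of equal-length rows of numbers (assumed format).
    Strips overlap whenever step < width.
    """
    cols = len(image[0])
    features = []
    for start in range(0, cols - width + 1, step):
        values = [abs(row[x]) for row in image
                  for x in range(start, start + width)]
        features.append(sum(values) / len(values))
    return features

def knn_classify(query, training, k=3):
    """Label `query` by majority vote among its k nearest training vectors.

    `training` is a list of (feature_vector, label) pairs; Euclidean
    distance is an assumed choice of metric.
    """
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In use, every training word image would be converted with `block_means` using the same `width` and `step`, so that all feature vectors have equal length before `knn_classify` compares them.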

Book ChapterDOI
01 Jan 2009
TL;DR: An adaptive Indic OCR system implemented as part of a rapidly retargetable language tool effort and extended, a step toward the recognition of scripts of low-density languages which typically do not warrant the development of commercial OCR, yet often have complete TrueType font descriptions.
Abstract: In this chapter, we describe an adaptive Indic OCR system implemented as part of a rapidly retargetable language tool effort and extend work found in [20, 2]. The system includes script identification, character segmentation, training sample creation, and character recognition. For script identification, Hindi words are identified in bilingual or multilingual document images using features of the Devanagari script and support vector machine (SVM). Identified words are then segmented into individual characters, using a font-model-based intelligent character segmentation and recognition system. Using characteristics of structurally similar TrueType fonts, our system automatically builds a model to be used for the segmentation and recognition of the new script, independent of glyph composition. The key is a reliance on known font attributes. In our recognition system three feature extraction methods are used to demonstrate the importance of appropriate features for classification. The methods are tested on both Latin and non-Latin scripts. Results show that the character-level recognition accuracy exceeds 92% for non-Latin and 96% for Latin text on degraded documents. This work is a step toward the recognition of scripts of low-density languages which typically do not warrant the development of commercial OCR, yet often have complete TrueType font descriptions.

Proceedings ArticleDOI
04 Feb 2009
TL;DR: The character recognition process for printed documents containing English and Oriya texts is described, which needs more effort of the OCR (Optical Character Recognition) designers for improving the accuracy rate.
Abstract: Recognition of documents containing multiple scripts is a challenging task that needs more effort from OCR (Optical Character Recognition) designers to improve the accuracy rate. Previously, OCR was developed only for single-script documents, mainly for English and regional languages. Old documents, multi-script as well as single-script, need to be preserved for future use. This paper describes the character recognition process for printed documents containing English and Oriya texts. Though the languages in India are different, we can still find some common features among them. In this paper we need to distinguish between the Roman script and the Oriya script. Most English (that is, Roman script) characters are linear as well as circular in nature, whereas the Oriya characters are circular in nature. So we need to separate these scripts by considering their features paragraph-wise or line-wise.

Proceedings ArticleDOI
06 Mar 2009
TL;DR: This work proposes a method on offline isolated English character recognition using stroke distribution of a character and two feature extraction methods based on directional features to classify the character under consideration to a class if hit.
Abstract: Machine simulation of human functions like recognition of the text is a challenging task. The Off-line Handwritten Character Recognition requires more research to reach the ultimate goal of machine recognition of the text. An attempt is made towards English language by a large number of researchers since six decades. But for Indian languages it is still a dream. We propose a method on offline isolated English character. The method is also applied to Marathi vowels. The image acquired is preprocessed to remove all unwanted details from the image so that the image is suitable for feature extraction. Feature extraction plays an important role in handwritten recognition. The two feature extraction methods based on directional features are considered. The first method uses stroke distribution of a character. The second method uses contour extraction. The Two directional features are compared with two different correlation techniques separately to check the suitability of the recognition method. First correlation technique calculates the dissimilarity between reference pattern and test pattern, and the other calculates the similarity between reference pattern and test pattern. The result of the comparison is to classify the character under consideration to a class if hit. If miss, the confusion information is extracted for the analysis.

Proceedings ArticleDOI
01 Dec 2009
TL;DR: This work represents the development of an online handwriting recognition system for Bangla script, widely used in eastern India and Bangladesh, which is characterized by structure or shape based representation of a stroke in which a stroke is represented as a string of shape features.
Abstract: Developing efficient handwriting recognition systems that are fast and highly reliable is a challenging problem. This work represents the development of an online handwriting recognition system for Bangla script, widely used in eastern India and Bangladesh. In our approach, an online handwritten character/cluster is characterized by structure or shape based representation of a stroke in which a stroke is represented as a string of shape features. Using this string representation, an unknown stroke is identified by comparing it with a database of strokes using DTW (Dynamic Time Warping) technique. Identifying all the component strokes recognizes a full character. A recognition experiment has been conducted with a total of 495 classes on 20,873 data samples and 10 people as data contributors yielding 97.33% recognition rate with 2.18% misrecognition rate and 0.5% rejection rate.
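
The stroke matching step described above, comparing a string of shape features against a database of strokes with DTW, can be sketched as below. This is an illustrative sketch under stated assumptions: the abstract does not specify the shape-feature alphabet or the local cost, so single-character feature symbols and a 0/1 mismatch cost are assumed, and `dtw_distance` and `classify_stroke` are hypothetical names.

```python
def dtw_distance(a, b, cost=lambda p, q: 0 if p == q else 1):
    """Dynamic Time Warping distance between two feature sequences.

    The default local cost (0 for equal symbols, 1 otherwise) is an
    assumption; any non-negative symbol-to-symbol cost could be used.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # table[i][j] = cheapest alignment cost of a[:i] with b[:j]
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = cost(a[i - 1], b[j - 1]) + min(
                table[i - 1][j],      # a advances, b symbol is repeated
                table[i][j - 1],      # b advances, a symbol is repeated
                table[i - 1][j - 1],  # both advance
            )
    return table[n][m]

def classify_stroke(stroke, templates):
    """Return the label of the stored stroke closest to `stroke`.

    `templates` is a list of (label, feature_string) pairs.
    """
    label, _ = min(templates, key=lambda t: dtw_distance(stroke, t[1]))
    return label
```

Because DTW lets one symbol align with a run of repeated symbols, a stroke written slowly (more repeated shape features) can still match a shorter template exactly, which is the usual motivation for using it on online handwriting.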

01 Jan 2009
TL;DR: An attempt is made to develop English handwritten character and digit recognition system using the Multi Class SVM classifier and a novel feature set called structural micro feature set for handwritten data.
Abstract: In this paper an attempt is made to develop an English handwritten character and digit recognition system. The paper describes the process of character recognition using the multiclass SVM classifier and a novel feature set. The problem of recognition of English handwritten characters is still an active area of research. The support vector machine (SVM) is a new learning machine with very good generalization ability. Recent results in pattern recognition have shown that SVM (support vector) classifiers often have superior recognition rates in comparison to other classification methods. The input data is English handwritten characters and digits. Here a novel computational feature set, called the structural micro feature set, is proposed for handwritten data. Distinctive features for each character are extracted. Those features are passed to the multiclass SVM classifier, which generates the hyperplane. The multiclass hyperplane plots the values of test images in the classified class.

Proceedings ArticleDOI
21 Nov 2009
TL;DR: An optical character recognition system based on image preprocessing technologies combined with Least Square Support Vector Machine (LS-SVM) has been developed, which first uses dynamic thresholding operation and robust gray value normalization to segment characters and extract features respectively, and then uses LS-SVM to classify characters based on features.
Abstract: Optical character recognition (OCR) is a very active field for research and development, and has become one of the most successful applications of automatic pattern recognition. To avoid the curse of dimensionality and improve the recognition performance, an optical character recognition system based on image preprocessing technologies combined with Least Square Support Vector Machine (LS-SVM) has been developed, which first uses dynamic thresholding operation and robust gray value normalization to segment characters and extract features respectively, and then uses LS-SVM to classify characters based on features. The proposed method has been evaluated by carrying out recognition experiments on the optical characters of electronic components. The results show that the proposed method has a better recognition performance, and holds a lot of potential for developing robust recognition learning.

Journal ArticleDOI
TL;DR: A lexicon reduction-based method by topic categorization of handwritten documents which is used to generate smaller topic- specific lexicons for improving the recognition accuracy and a method which uses topic-specific language models and a maximum-entropy based topic categorized model to refine the recognition output.
Abstract: Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents which typically use a restricted vocabulary (lexicon). But in the case of unconstrained handwritten documents, state-of-the-art word recognition accuracy is still below the acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing or OCR correction technique to the word recognition output. In this paper, we present two different methods for this purpose. First, we describe a lexicon reduction-based method by topic categorization of handwritten documents which is used to generate smaller topic-specific lexicons for improving the recognition accuracy. Second, we describe a method which uses topic-specific language models and a maximum-entropy based topic categorization model to refine the recognition output. We present the relative merits of each of these methods and report results on the publicly available IAM database.

Patent
30 Jun 2009
TL;DR: In this article, a statistical system and method for generating patterns and performing online handwriting recognition based on those patterns was proposed, where a plurality of predetermined patterns may be generated by performing feature extraction operations on one or more character samples utilizing a Gabor filter.
Abstract: A statistical system and method for generating patterns and performing online handwriting recognition based on those patterns. A plurality of predetermined patterns may be generated by performing feature extraction operations on one or more character samples utilizing a Gabor filter. An online handwritten character may be acquired. The online handwritten character may be pre-processed. One or more feature extraction operations, utilizing a Gabor filter, may be performed on the online handwritten character to produce a feature vector. One or more patterns may be generated, using a statistical algorithm, for the online handwritten character, based on the feature vector. The online handwritten character may be statistically classified based on a comparison between the one or more patterns generated for the online handwritten character and the plurality of predetermined patterns.

Journal ArticleDOI
TL;DR: The algorithm of pre-processing such as line and character segmentation is studied and determined so that the design can give a good result and can be implemented in hardware.
Abstract: Research on character recognition is divided into two forms: on-line and off-line recognition. In on-line recognition, the user writes on an input surface called a tablet and the character is recognized immediately, while off-line recognition works on a written document that is scanned and saved to a computer for further processing. Pre-processing steps such as conversion of the image to binary, line segmentation and character segmentation need to be done as accurately as possible. After pre-processing, the recognition process recognizes the characters that have been segmented. This research focuses on off-line character recognition. The pre-processing algorithms, such as line and character segmentation, are studied and chosen so that the design gives a good result and can be implemented in hardware. The characters are transformed using the Discrete Wavelet Transform, since it shows the details of the pixels. After that, a binary sequence is generated using a threshold value (determined by experiment) so that it can be used in the recognition process. This binary sequence is classified using the Hamming distance, which traces the bit changes between two binary sequences; the differing bit values are used to recognize the character.
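
The final two steps of this pipeline, thresholding coefficients into a binary sequence and classifying by Hamming distance, can be sketched as follows. The wavelet transform itself is omitted: the input `values` stands in for whatever coefficients it produces, and the function names, the reference dictionary, and the `>=` threshold convention are assumptions made for the example.

```python
def binarize(values, threshold):
    """Turn a sequence of coefficients into bits using a fixed threshold."""
    return [1 if v >= threshold else 0 for v in values]

def hamming_distance(a, b):
    """Number of positions at which two equal-length bit sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def classify(sequence, references):
    """Return the label whose reference bit sequence is closest.

    `references` maps a character label to its stored bit sequence.
    """
    return min(references,
               key=lambda label: hamming_distance(sequence, references[label]))
```

Since the comparison reduces to counting differing bits, it maps naturally onto XOR and population-count logic, which fits the hardware implementation goal stated in the abstract.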

Proceedings ArticleDOI
26 Jul 2009
TL;DR: A document level OCR which incorporates information from the entire document to reduce word error rates and demonstrates a relative improvement of 28% for long words and 12% for all words which appear at least twice in the corpus for Telugu.
Abstract: The word error rate of any optical character recognition system (OCR) is usually substantially above its component or character error rate. This is especially true of Indic languages in which a word consists of many components. Current OCRs recognize each character or word separately and do not take advantage of document level constraints. We propose a document level OCR which incorporates information from the entire document to reduce word error rates. Word images are first clustered using a locality sensitive hashing technique. Individual words are then recognized using a (regular) OCR. The OCR outputs of word images in a cluster are then corrected probabilistically by comparing with the OCR outputs of other members of the same cluster. The approach may be applied to improve the accuracy of any OCR run on documents in any language. In particular, we demonstrate it for Telugu, where the use of language models for post-processing is not promising. We show a relative improvement of 28% for long words and 12% for all words which appear at least twice in the corpus.

Proceedings ArticleDOI
26 Jul 2009
TL;DR: This paper presents a unified approach for multi-lingual recognition of alphabetic scripts using the multi-stream paradigm and shows interesting recognition performances with only 1.5% of script confusion and an overall word recognition rate of 84.5%.
Abstract: Generally, handwritten word recognition systems use script specific methodologies. In this paper, we present a unified approach for multi-lingual recognition of alphabetic scripts. The proposed system operates independently of the nature of the script using the multi-stream paradigm. The experiments have been carried out on a multi-script database composed of Arabic and Latin handwritten words from the IFN/ENIT and the IRONOFF public databases and show interesting recognition performances with only 1.5% of script confusion and an overall word recognition rate of 84.5% using a multi-script lexicon of 1142 words.

Proceedings ArticleDOI
26 Jul 2009
TL;DR: A method is proposed for detecting and correcting errors in the recognition results of handwritten and machine-printed Gurmukhi OCR; suggestions are made based on the similarity of the source word to the dictionary words that share its code.
Abstract: A post-processor is an integral part of any OCR system. This paper proposes a method for detection and correction of errors in recognition results of handwritten and machine printed Gurmukhi OCR. Based on the shape similarity of characters, the consonants of Gurmukhi script are divided into different sets. Each set is given a unique number. In case of a recognition error, based on the shape of the consonants, corrections are made by taking each consonant of the subset into consideration. According to the proposed algorithm, each recognized word is first encoded based on its consonants. The corresponding code is then searched for in the dictionary. If it exists, the words listed under that code are matched against the source word. If a match is found, the word is treated as correct; otherwise, suggestions are made based on the similarity of the source word to the words with the same code in the dictionary. The method has been tested on the OCR output of a variety of machine printed and handwritten documents.
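
The encode, look up, and suggest procedure in this abstract can be sketched roughly as follows. The abstract does not list the actual Gurmukhi consonant sets, so the Latin-letter groups in `SHAPE_GROUPS` are purely hypothetical stand-ins, as are the function names; characters outside any group are left unchanged, which is also an assumption. Similarity ranking uses Python's `difflib.get_close_matches` rather than whatever measure the paper uses.

```python
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical groups of visually confusable consonants (stand-ins for
# the Gurmukhi sets, which the abstract does not enumerate).
SHAPE_GROUPS = ["bh", "mn", "vw", "gq"]
CODE_OF = {ch: str(i) for i, group in enumerate(SHAPE_GROUPS) for ch in group}


def encode(word):
    """Replace each grouped consonant by its group number."""
    return "".join(CODE_OF.get(ch, ch) for ch in word)


def build_index(dictionary):
    """Map each shape code to the dictionary words that share it."""
    index = defaultdict(list)
    for entry in dictionary:
        index[encode(entry)].append(entry)
    return index


def check_word(word, index):
    """Return (word, []) if the word is in the dictionary, else (None, suggestions)."""
    candidates = index.get(encode(word), [])
    if word in candidates:
        return word, []
    # Rank same-code dictionary words by string similarity to the source word.
    return None, get_close_matches(word, candidates, n=3, cutoff=0.0)


if __name__ == "__main__":
    index = build_index(["name", "home", "gate"])
    print(check_word("name", index))
    print(check_word("mame", index))
```

Because confusable consonants collapse to the same code, a misrecognized word still lands in the same dictionary bucket as the intended word, which keeps the candidate list short.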

Patent
18 Jun 2009
Abstract: A method and an apparatus for recognizing characters using an image are provided. A camera is activated according to a character recognition request and a preview mode is set for displaying an image photographed through the camera in real time. An auto focus of the camera is controlled and an image having a predetermined level of clarity is obtained for character recognition from the images obtained in the preview mode. The image for character recognition is character-recognition-processed so as to extract recognition result data. A final recognition character row is drawn that excludes non-character data from the recognition result data. A first word is combined including at least one character of the final recognition character row and a predetermined maximum number of characters. A dictionary database that stores dictionary information on various languages using the first word is searched, so as to provide the user with the corresponding word.

Book ChapterDOI
01 Jan 2009
TL;DR: This chapter presents (perhaps) the first system for recognizing handwritten Urdu words, which achieved an accuracy of 70% for the top choice, and 82% for the top three choices.
Abstract: Urdu is a language spoken in the Indian subcontinent by an estimated 130–270 million speakers. At the spoken level, Urdu and Hindi are considered dialects of a single language because of shared vocabulary and the similarity in grammar. At the written level, however, Urdu is much closer to Arabic because it is written in Nastaliq, the calligraphic style of the Persian–Arabic script. Therefore, a speaker of Hindi can understand spoken Urdu but may not be able to read written Urdu because Hindi is written in Devanagari script, whereas an Arabic writer can read the written words but may not understand the spoken Urdu. In this chapter we present an overview of written Urdu. Prior research in handwritten Urdu OCR is very limited. We present (perhaps) the first system for recognizing handwritten Urdu words. On a data set of about 1300 handwritten words, we achieved an accuracy of 70% for the top choice, and 82% for the top three choices.

01 Jan 2009
TL;DR: A proposed finite state model for the optical recognition of Nastalique printed text is discussed and it is shown that optical character recognition of the Latin script is relatively easier.
Abstract: Finite state technology has long been used to model NLP (Natural Language Processing) applications; in particular, it has been applied very successfully to machine translation and speech recognition systems. Character recognition in cursive scripts and in handwritten Latin script has also attracted researchers’ attention, and some research has been done in this area. Optical character recognition is the translation of optically scanned bitmaps of printed or written text into digitally editable data files. OCRs developed for many world languages are already in efficient use, but none exists for Nastalique, a calligraphic adaptation of the Arabic script, just as Jawi is for Malay. Urdu has 39 characters against the 28 of Arabic. Each character has 2-4 different shapes according to its position in the word: initial, medial, final and isolated. In Nastalique, word and character overlapping makes optical recognition more complex. Optical character recognition of the Latin script is relatively easier. This paper, based on research on Nastalique OCR, discusses a proposed finite state model for the optical recognition of printed Nastalique text.

Proceedings Article
10 Sep 2009
TL;DR: Combining the two systems using log-linear combination gives better results than either system separately, with consistent CER gains of 0.1-0.2% absolute over the word-based standard system.
Abstract: The Chinese language is based on characters which are syllabic in nature. Since languages have syllabotactic rules which govern the construction of syllables and their allowed sequences, Chinese character sequence models can be used as a first level approximation of allowed syllable sequences. N-gram character sequence models were trained on 4.3 billion characters. Characters are used as a first level recognition unit with multiple pronunciations per character. For comparison, the CU-HTK Mandarin word-based system was used to recognize words which were then converted to character sequences. The character-only system error rates for one-best recognition were slightly worse than word-based character recognition. However, combining the two systems using log-linear combination gives better results than either system separately. An equally weighted combination gave consistent CER gains of 0.1-0.2% absolute over the word-based standard system. Copyright © 2009 ISCA.
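
As a rough illustration of log-linear combination, the sketch below scores each hypothesis by a weighted sum of the log probabilities assigned by two systems and picks the best. It assumes both systems expose probabilities over a shared set of candidate transcriptions; the function name, the `floor` value for missing hypotheses, and the toy scores are all hypothetical and do not come from the paper.

```python
import math


def log_linear_combine(scores_a, scores_b, weight_a=0.5, weight_b=0.5, floor=1e-10):
    """Combine two systems' hypothesis probabilities log-linearly.

    Each hypothesis h is scored as
        weight_a * log p_a(h) + weight_b * log p_b(h).
    Hypotheses missing from one system receive the small `floor`
    probability (an assumption made for this sketch).
    """
    combined = {}
    for hyp in set(scores_a) | set(scores_b):
        combined[hyp] = (
            weight_a * math.log(scores_a.get(hyp, floor))
            + weight_b * math.log(scores_b.get(hyp, floor))
        )
    best = max(combined, key=combined.get)
    return best, combined


if __name__ == "__main__":
    # Toy probabilities from a character-based and a word-based system.
    char_system = {"X": 0.5, "Y": 0.4, "Z": 0.1}
    word_system = {"X": 0.2, "Y": 0.7, "Z": 0.1}
    best, scores = log_linear_combine(char_system, word_system)
    print(best, scores[best])
```

With equal weights this amounts to ranking hypotheses by the geometric mean of the two systems' probabilities, so a hypothesis must be plausible under both systems to win.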