
Showing papers on "Intelligent word recognition" published in 2008


Journal ArticleDOI
TL;DR: It is shown experimentally that explicit state duration modeling in the SCHMM framework can significantly improve the discriminating capacity of the SCHMMs to deal with very difficult pattern recognition tasks such as unconstrained handwritten Arabic recognition.

67 citations


Journal ArticleDOI
TL;DR: This paper presents a multilingual character recognition system for printed South Indian scripts (Kannada, Telugu, Tamil and Malayalam) and English documents based on Fourier transform and principal component analysis (PCA), which are two commonly used techniques of image processing and recognition.

60 citations


Proceedings ArticleDOI
31 Mar 2008
TL;DR: A new database of off-line Arabic handwriting text is built to be used for writer identification research and the performance of edge-based directional probability distributions as features and other features in Arabic writer identification is evaluated.
Abstract: A system for writer identification based on Arabic handwritten words was built. First, a database of words was gathered and used as a test base. Then, feature vectors were extracted from writers' word images. Prior to feature extraction, normalization operations were applied to a word or text line. In this research, we studied the effect of the feature extraction and recognition operations applied to Arabic text on the writer identification rate. Since there is no well-known database of Arabic handwritten words for researchers to test on, we built a new database of off-line Arabic handwriting text to be used for writer identification research. The proposed database is meant to provide training and testing sets for Arabic writer identification research. Arabic handwritten words were collected from 100 writers. We evaluated the performance of edge-based directional probability distributions as features and other features in Arabic writer identification.

55 citations


Proceedings ArticleDOI
13 Dec 2008
TL;DR: In this paper, a system for offline recognition of handwritten Tamil characters using Hidden Markov Models (HMM) is presented, which uses a combination of time-domain and frequency-domain features.
Abstract: Handwriting has persisted as a means of communication and recording information in day-to-day life even with the introduction of new technologies, and so remains relevant to optical character recognition. Hidden Markov Models (HMM) have long been a popular choice for Western cursive handwriting recognition following their success in speech recognition. However, when it comes to Indic script recognition, the published work employing HMMs is limited, and generally focused on isolated character recognition. A system for offline recognition of cursive handwritten Tamil characters is presented. The proposed system is based on HMMs and uses a combination of time-domain and frequency-domain features. The tolerance of the system is evident, as it can overcome the complexities arising from font variations, and it proves to be flexible and robust. A higher degree of accuracy has been obtained by applying this approach to a comprehensive database. These initial results are promising and warrant further research in this direction. The results also encourage exploring the adoption of the approach for other Indic scripts.

45 citations


Patent
Asaf Tzadok1, Eugeniusz Walach1
16 Apr 2008
TL;DR: In this article, a document-specific database is created from an OCR scan of a document of interest, which contains an exhaustive listing of words in the document and images of each word, taken from all the fonts encountered, are entered into the database and mapped to a corresponding textual representation.
Abstract: Disclosed embodiments of the invention provide automated global optimization methods and systems of OCR, tailored to each document being digitized. A document-specific database is created from an OCR scan of a document of interest, which contains an exhaustive listing of words in the document. Images of each word, taken from all the fonts encountered, are entered into the database and mapped to a corresponding textual representation. After entry of a first instance of an image of a word written in a particular font, each new occurrence of the word in that font can be quickly recognized by image processing techniques. The disclosed methods and systems may be used in conjunction with adaptive character recognition training and word recognition training of the OCR engines.

41 citations


Proceedings ArticleDOI
17 Dec 2008
TL;DR: A segmentation-based approach to handwritten Devanagari word recognition is proposed in which, on the basis of the head line, a word image is segmented into pseudo characters.
Abstract: The present paper proposes a segmentation-based approach to handwritten Devanagari word recognition. On the basis of the head line, a word image is segmented into pseudo characters. Hidden Markov models are proposed to recognize the pseudo characters. The word level recognition is done on the basis of a string edit distance.
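
The word-level step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the pseudo-character recognizer emits one label per segment, that each lexicon word is stored as a label sequence, and that all edit operations have unit cost (the abstract does not specify costs). The names `edit_distance` and `best_lexicon_match` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences (unit costs assumed)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def best_lexicon_match(recognized, lexicon):
    """Return the lexicon entry closest to the recognized label sequence."""
    return min(lexicon, key=lambda word: edit_distance(recognized, word))
```

A recognized sequence with one missing pseudo character would then still map to the intended lexicon word as long as no other entry is closer.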

38 citations


Proceedings ArticleDOI
01 Nov 2008
TL;DR: A zone and distance metric based feature extraction system is presented, and recognition rates of 98% and 96% are obtained for Kannada and Telugu numerals respectively.
Abstract: Character recognition is an important area in the image processing and pattern recognition fields. Handwritten character recognition has received extensive attention in academic and production fields. The recognition system can be either on-line or off-line. Off-line handwriting recognition is a subfield of optical character recognition. India is a multi-lingual and multi-script country, where eighteen official scripts are accepted and there are over a hundred regional languages. In this paper we present a zone and distance metric based feature extraction system. The character centroid is computed and the image is further divided into n equal zones. The average distance from the character centroid to each pixel present in the zone is computed. This procedure is repeated for all the zones present in the numeral image. Finally, n such features are extracted for classification and recognition. A feed-forward back-propagation neural network is designed for subsequent classification and recognition. We obtained 98% and 96% recognition rates for Kannada and Telugu numerals respectively.
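
The feature extraction steps in this abstract can be sketched as below. This is an illustrative reading, not the paper's code: it assumes a binary image given as a list of rows (nonzero means foreground), a rectangular grid of `zone_rows` x `zone_cols` zones (the abstract only says "n equal zones"), Euclidean distance, and a feature value of 0.0 for zones with no foreground pixels. The function name is hypothetical.

```python
import math


def zone_centroid_features(image, zone_rows, zone_cols):
    """Average distance from the character centroid to the foreground
    pixels of each zone; returns zone_rows * zone_cols features."""
    height, width = len(image), len(image[0])
    fg = [(r, c) for r in range(height) for c in range(width) if image[r][c]]
    n_zones = zone_rows * zone_cols
    if not fg:
        return [0.0] * n_zones  # blank image: assumed all-zero features

    # Centroid of the foreground (character) pixels.
    cr = sum(r for r, _ in fg) / len(fg)
    cc = sum(c for _, c in fg) / len(fg)

    sums = [0.0] * n_zones
    counts = [0] * n_zones
    for r, c in fg:
        # Map the pixel to its zone in the grid.
        zone = (r * zone_rows // height) * zone_cols + (c * zone_cols // width)
        sums[zone] += math.hypot(r - cr, c - cc)
        counts[zone] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]
```

The resulting n-element vector is what would be fed to the classifier; the network itself is outside the scope of this sketch.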

37 citations



Proceedings ArticleDOI
16 Jul 2008
TL;DR: A projection distance metric and zoning based scheme for numeral recognition is proposed, with a nearest neighbor classifier used for classification; it gives around 93% and 90% recognition accuracy for Kannada and Tamil numerals respectively.
Abstract: Handwritten character recognition has received extensive attention in academic and production fields. The recognition system can be either on-line or off-line. There is a large demand for optical character recognition on handwritten documents. India is a multi-lingual and multi-script country, where eighteen official scripts are accepted and there are over a hundred regional languages. In this paper we propose a projection distance metric and zoning based scheme for numeral recognition. We tested our proposed method on Kannada and Tamil numerals. A nearest neighbor classifier is used for the subsequent classification. The proposed method gives around 93% and 90% recognition accuracy for Kannada and Tamil numerals respectively.

35 citations


Proceedings ArticleDOI
27 Jan 2008
TL;DR: A gap metrics based machine learning approach to separate a line of unconstrained handwritten text into words and proposes a combined distance measure computed using three different methods to overcome the disadvantage of different distance computation methods.
Abstract: Word segmentation is the most critical pre-processing step for any handwritten document recognition and/or retrieval system. When the writing style is unconstrained (written in a natural manner), recognition of individual components may be unreliable, so they must be grouped together into word hypotheses before recognition algorithms can be used. This paper describes a gap metrics based machine learning approach to separate a line of unconstrained handwritten text into words. Our approach uses a set of both local and global features, which is motivated by the ways in which human beings perform this kind of task. In addition, in order to overcome the disadvantages of the different distance computation methods, we propose a combined distance measure computed using three different methods. The classification is done using a three-layer neural network. The algorithm is evaluated using an unconstrained handwriting database that contains 50 pages (1026 lines, 7562 word images) of handwritten documents. The overall accuracy is 90.8%, which is better than that of a previous method.

34 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: A novel segmentation based approach is proposed for recognition of offline handwritten Devanagari words and a hidden Markov model is used for recognition at pseudocharacter level.
Abstract: A novel segmentation based approach is proposed for recognition of offline handwritten Devanagari words. Stroke based features are used as feature vectors. A hidden Markov model is used for recognition at pseudocharacter level. The word level recognition is done on the basis of a string edit distance.

Book ChapterDOI
11 Apr 2008
TL;DR: Research on Urdu Nastaliq OCR is reported, challenges are discussed, and a new solution for its implementation is suggested.
Abstract: Character recognition in cursive scripts or handwritten Latin script has attracted researchers’ attention recently and some research has been done in this area. Optical character recognition is the translation of optically-scanned bitmaps of printed or written text into digitally editable data files. OCRs developed for many world languages are already in use but none exists for Urdu Nastaliq – a calligraphic adaptation of the Arabic script, just as Jawi is for Malay. Urdu Nastaliq has 39 characters against Arabic's 28. Each character then has 2-4 different shapes according to its position in the word: initial, medial, final and isolated. In Nastaliq, inter-word and intra-word overlapping makes optical recognition more complex. Character recognition of the Latin script is relatively easier. This paper reports research on Urdu Nastaliq OCR, discusses challenges and suggests a new solution for its implementation.

Book ChapterDOI
09 Dec 2008
TL;DR: A novel skeletonization algorithm called MFITS (morphology-fused index table skeletonization) is proposed and a skeleton-based Chinese calligraphic character recognition method is proposed too.
Abstract: The large amount of digitized Chinese calligraphic works in existence is a valuable part of the Chinese cultural heritage. But they can hardly be recognized by optical character recognition (OCR), which performs well on machine-printed characters against a clean background, because the characters vary so widely in style and shape complexity. Approaches to automatic Chinese calligraphic character recognition are therefore becoming more and more important. A novel skeletonization algorithm called MFITS (morphology-fused index table skeletonization) is proposed, along with a skeleton-based Chinese calligraphic character recognition method. The experiments show that MFITS can extract skeletons with only a few deformations and that the skeleton-based recognition method performs well.

Journal ArticleDOI
TL;DR: This study proposes a novel solution for performing character recognition in Tamil using octal graph conversion for recognizing off-line handwritten Tamil characters, which improves the slant correction, and indicates that the approach can be used for character recognition in other Indic scripts as well.
Abstract: Problem Statement: Handwriting recognition has attracted voluminous research in recent times. The segmentation and recognition of characters from handwritten scripts incurs considerable overhead. Almost all the existing handwritten character recognition techniques use a neural network approach, which requires a lot of preprocessing, and hence solving these problems using neural networks is a tedious task. Approach: In this study we propose a novel solution for performing character recognition in Tamil, the official language of the south Indian state of Tamil Nadu. Following the preprocessing steps of segmentation, normalization and feature extraction, the approach utilizes octal graph conversion for recognizing off-line handwritten Tamil characters, which improves the slant correction. The graph tries to represent the basic form of a letter independent of the style of writing. Using the weights of the graphs and appropriate feature matching with the predefined characters, the written characters are recognized. Results: The performance evaluation of off-line handwritten Tamil character recognition using octal graph conversion, with metrics based on the ranks of the letters, shows good recognition efficiency. Conclusion: We show that, in practice, the proposed approach produces near optimal results besides outperforming the other methodologies in existence. Results indicate that the approach can be used for character recognition in other Indic scripts as well.

Proceedings ArticleDOI
07 Apr 2008
TL;DR: A novel algorithm is proposed for image smoothing and segmentation of Arabic characters using the writing width estimated from the character skeleton, with Principal Component Analysis (PCA) applied to the feature vectors to reduce their dimension.
Abstract: This paper describes new methods for handwritten Arabic character recognition. We propose a novel algorithm for image smoothing and segmentation of Arabic characters using the writing width estimated from the character skeleton. The moments and Fourier descriptors of the profile projection and centroid distance are used as features of each character; these features are invariant to translation, rotation and scale. We apply Principal Component Analysis (PCA) to the feature vectors in order to reduce their dimension. The classifier proposed in this work is based on Support Vector Machines (SVM), which are considered among the best recent classifiers. The results show that these methods are very powerful for isolated handwritten Arabic characters.

Proceedings ArticleDOI
10 Jun 2008
TL;DR: The main challenges (difficulties) researchers are facing and the up-to-date solutions (the common methods) used for Arabic text recognition are presented.
Abstract: Optical Character Recognition (OCR) has been an active subject of research since the early days of computer science. Even though Arabic characters are used by more than half a billion people, Arabic character recognition has not received enough interest from researchers. Little research progress has been achieved compared to what has been done with Latin and Chinese. The cursive nature of Arabic characters makes it more difficult to achieve high accuracy in character recognition, since even printed Arabic characters are in cursive form. This paper presents the main challenges (difficulties) researchers are facing and the up-to-date solutions (the common methods) used for Arabic text recognition.

Proceedings ArticleDOI
07 Jul 2008
TL;DR: A novel character recognition method for license plate numbers based on parallel BP neural networks is presented, which will enhance the accuracy of a recognition system that aims to read Chinese license plates automatically.
Abstract: In automated license plate recognition systems, many reading errors are caused by inadequate character recognition methods. This paper presents a novel character recognition method for license plate numbers based on parallel BP neural networks. This will enhance the accuracy of a recognition system that aims to read Chinese license plates automatically. In the proposed methodology, the character is binarized and the noise is eliminated in the preprocessing stage, then the character feature is extracted using the skeleton and the character is normalized to a size of 8*16 pixels. Finally, the character feature is fed into the parallel neural networks and the character is recognized. The proposed character recognition method is effective, and promising results have been obtained in experiments on Chinese license plates.

Proceedings ArticleDOI
27 Jan 2008
TL;DR: A system for the off-line recognition of cursive Arabic handwritten words based on Hidden Markov Models (HMMs) and uses a sliding window approach, which shows that using contextual character models improves recognition.
Abstract: In this paper we present a system for the off-line recognition of cursive Arabic handwritten words. This system is an enhanced version of our reference system presented in [El-Hajj et al., 05] which is based on Hidden Markov Models (HMMs) and uses a sliding window approach. The enhanced version proposed here uses contextual character models. This approach is motivated by the fact that the set of Arabic characters includes a lot of ascending and descending strokes which overlap with one or two neighboring characters. Additional character models are constructed according to characters in their left or right neighborhood. Our experiments on images of the benchmark IFN/ENIT database of handwritten village/town names show that using contextual character models improves recognition. For a lexicon of 306 name classes, accuracy is increased by 0.6% in absolute value, which corresponds to a 7.8% reduction in error rate.
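
The relationship between the two closing figures can be checked with a short calculation. The sketch below back-computes the error rates implied by an absolute accuracy gain and a relative error-rate reduction; the baseline error rate is not stated in the abstract, so the result is only an inference from rounded numbers, and the function name is hypothetical.

```python
def implied_error_rates(abs_gain_pts, rel_error_reduction_pct):
    """Return (error_before, error_after) in percent, back-computed from an
    absolute accuracy gain and a relative error-rate reduction."""
    # relative reduction = absolute gain / error_before
    error_before = abs_gain_pts / (rel_error_reduction_pct / 100.0)
    error_after = error_before - abs_gain_pts
    return error_before, error_after


before, after = implied_error_rates(0.6, 7.8)
print(f"implied error rate: {before:.2f}% -> {after:.2f}%")
```

Under these assumptions the reference system's error rate would have been roughly 7.7%, which is consistent with a small absolute gain translating into a larger relative reduction.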

Proceedings ArticleDOI
24 Jul 2008
TL;DR: This research constructs a topic based language model for every document using manually categorized training data, and trains a topic categorization sub-system based on a Maximum Entropy model which is used to generate the topic distribution of a test document.
Abstract: Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers produce reasonably clean output when used with a restricted lexicon. But in the absence of such a restricted lexicon, the output of an unconstrained handwritten word recognizer is noisy. The objective of this research is to process noisy recognizer output and eliminate spurious recognition choices using a topic based language model. We construct a topic based language model for every document using manually categorized training data. A topic categorization sub-system based on a Maximum Entropy model is also trained, which is used to generate the topic distribution of a test document. A given test word image is processed by the recognizer and its word recognition likelihood is refined by incorporating the topic distribution of the document and the topic based language model probability. The proposed method is evaluated on the publicly available IAM dataset, and experimental results show a significant improvement in word recognition accuracy from 32% to 40% over a test set consisting of 4033 word images extracted from 70 handwritten document images.
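
One plausible way to refine recognizer likelihoods with a topic distribution is sketched below. The abstract does not give the combination rule, so this sketch assumes a simple product of the recognizer likelihood and a topic-mixture word prior, P(w | doc) = sum over topics t of P(t | doc) * P(w | t). All names and the toy probabilities are hypothetical.

```python
def rescore(recognizer_likelihoods, topic_dist, topic_lm):
    """Rescore word hypotheses for one word image.

    recognizer_likelihoods: {word: likelihood from the word recognizer}
    topic_dist:             {topic: P(topic | document)}
    topic_lm:               {topic: {word: P(word | topic)}}
    """
    scores = {}
    for word, likelihood in recognizer_likelihoods.items():
        # Mixture prior over topics; unseen words get probability 0.0 here
        # (a real system would smooth instead).
        prior = sum(p_topic * topic_lm[topic].get(word, 0.0)
                    for topic, p_topic in topic_dist.items())
        scores[word] = likelihood * prior  # assumed combination rule
    return scores


hypotheses = {"cat": 0.5, "cot": 0.6}
doc_topics = {"pets": 0.9, "furniture": 0.1}
lm = {"pets": {"cat": 0.2, "cot": 0.01},
      "furniture": {"cat": 0.01, "cot": 0.2}}
scores = rescore(hypotheses, doc_topics, lm)
best = max(scores, key=scores.get)
```

In this toy case the recognizer alone prefers "cot", while the topic prior for a pet-related document shifts the decision to "cat".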

Patent
08 Feb 2008
TL;DR: In this article, a text input system and method involving finger-based handwriting recognition and word prediction is presented. The text input device (300) comprises a text prediction component (310) for predicting a plurality of follow-up words based on a text context, the text prediction component (310) outputting a set of candidate words; a character handwriting recognition component (330) for recognizing a handwritten character candidate, the handwritten character candidate being determined based upon handwriting input received from a touch sensitive input field (340); a candidate word filtering component (350) for filtering the set of candidate words received from the text
Abstract: The present invention relates to a text input system and method involving finger-based handwriting recognition and word prediction. A text input device (300) comprises: a text prediction component (310) for predicting a plurality of follow-up words based on a text context, the text prediction component (310) outputting a set of candidate words; a character handwriting recognition component (330) for recognizing a handwritten character candidate, the handwritten character candidate being determined based upon handwriting input received from a touch sensitive input field (340); a candidate word filtering component (350) for filtering the set of candidate words received from the text prediction component (310) based on the recognized handwritten character candidate; a word presentation component (360) for presenting candidate words from the filtered set of candidate words to a user of the device; and a word selection component (380) for receiving a user selection of a presented candidate word from the user.
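
The filtering step performed by the candidate word filtering component can be illustrated with a small sketch. The abstract does not say how the recognized character is matched against the predicted words, so this example assumes case-insensitive matching on the first letter and allows several character candidates; the function name and sample words are hypothetical.

```python
def filter_candidates(predicted_words, recognized_chars):
    """Keep predicted words whose first letter matches one of the
    recognized handwritten character candidates (assumed matching rule)."""
    allowed = {ch.lower() for ch in recognized_chars}
    return [w for w in predicted_words if w and w[0].lower() in allowed]


predictions = ["hello", "help", "world", "Hi"]
shortlist = filter_candidates(predictions, ["h"])
```

The shortlist preserves the order produced by the prediction component, so the presentation component could show the most likely remaining words first.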

Proceedings ArticleDOI
01 Dec 2008
TL;DR: Promising experimental results demonstrate that the method is orientation-free and stroke-order-free for unconstrained cursive handwritten Chinese word recognition.
Abstract: In this paper, we propose an orientation-free method for unconstrained cursive handwritten Chinese word recognition. Using a novel gravity center balancing method, the orientation of a handwritten word can be detected. Through the process of stroke extraction, stroke breaking, heuristic over-segmentation and path searching using recognition and lexicon information, a handwritten word can be recognized even when its characters are connected or partially overlapped. Experiments were performed on 173,660 unconstrained handwritten Chinese word samples collected on a Pocket PC. The promising experimental results demonstrate that our method is orientation-free and stroke-order-free for unconstrained cursive handwritten Chinese word recognition.

Proceedings Article
01 Jan 2008
TL;DR: This paper presents a character-based Conditional Random Fields (CRFs) model for Chinese word segmentation and named entity recognition, and it turns out to perform well.
Abstract: Chinese word segmentation and named entity recognition (NER) are both important tasks in Chinese information processing. This paper presents a character-based Conditional Random Fields (CRFs) model for these two tasks. In the SIGHAN Bakeoff 2007, this model participated in all closed tracks for both the Chinese NER and word segmentation tasks, and turned out to perform well. Our system ranks 2nd in the closed track on NER of MSRA, and 4th in the closed track on word segmentation of SXU.

Proceedings ArticleDOI
11 Dec 2008
TL;DR: The approach to gain performance in online character recognition is to design more representative features for handwritten character representation in order to tackle the huge inter-class variability problem and increase recognition accuracy.
Abstract: Transforming handwriting into digital text and recognizing handwritten patterns opens a vast scope of application opportunities, from searching handwritten notes and document management to triggering actions by writing symbols. Despite receiving great attention, a massive number of applications, and a huge research effort, recognition of handwritten text has still not reached the desired efficiency and is an active area of research. One of the most important factors that makes handwriting recognition a challenging task is the huge variety of writing styles, which cannot be captured efficiently by available classification methods using current feature descriptors. Our approach to gain performance in online character recognition is to design more representative features for handwritten character representation in order to tackle the huge inter-class variability problem and increase recognition accuracy. The representation can also be used in recognition of other online planar patterns. The experimental results show that the proposed representation with an SVM classifier outperforms the best reported recognition rates for Arabic characters in a writer-independent system.

01 Jan 2008
TL;DR: An OCR system is constructed that automatically saves extracted characters to a DB after extracting only the relevant and necessary characters from a large number of documents, using the BP algorithm, a type of artificial neural network.
Abstract: Most government agencies and companies keep evidentiary data and documents for a set period of time and, under office-management regulations, convert them to electronic form. Documents have been digitized by scanning them or entering them manually on a computer, and agencies and companies are now trying to reduce this inconvenience. They use OCR (Optical Character Recognition) to extract text and save the relevant documents to a DB. However, general OCR has its own inconvenience: after character recognition has been run on the whole document, the text must be classified segment by segment before it is entered into the DB. In this paper, to solve this problem, we constructed an OCR system that automatically saves the extracted characters to a DB after extracting only the relevant and necessary characters from a large number of documents, using the BP algorithm, a type of artificial neural network.

Proceedings ArticleDOI
Jufeng Yang1, Guangshun Shi1, Kai Wang1, Qian Geng1, Qing-Ren Wang1 
01 Dec 2008
TL;DR: A novel two-level algorithm to recognize expressions is proposed that segments expressions further and recognizes isolated symbols, and an XML-based system to help users save, modify and search the recognition result is designed.
Abstract: In this paper, we study the major modules of on-line handwritten chemical expression recognition. We propose a novel two-level algorithm to recognize expressions. In the first level, structural information is used to distinguish different parts and recognize substances. Then the algorithm segments expressions further and recognizes isolated symbols. To meet the demands of actual applications, the paper also designs an XML-based system to help users save, modify and search the recognition result. The experiment shows that the presented algorithm is reliable.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: Experimental results show that the robustness of the algorithm benefits from the facts that both digit transition detection and digit-sequence recognition are more reliable than direct character recognition.
Abstract: This paper presents an algorithm for robust time recognition of video clocks. Existing OCR algorithms cannot recognize the time properly because the time digits are very low-resolution and blurred. To confront the challenges of time recognition, our algorithm employs three techniques. The first is digit transition detection, which identifies the frames in which the SECOND digit changes. The second is digit-sequence recognition, which uses the property that digits in a clock appear in a cycle of 0 to 9 to form a digit sequence. The third is on-the-fly template creation. Informally, the robustness of our algorithm benefits from the fact that both digit transition detection and digit-sequence recognition are more reliable than direct character recognition. Experimental results show that our algorithm can achieve high accuracy in recognizing time.

Journal ArticleDOI
TL;DR: A holistic lexicon reduction technique for offline handwritten Arabic word recognition is proposed in this paper and involves the extraction of dots and subwords from the cursive Arabic word image to describe its shape.
Abstract: Given a large number of words to be recognized, a two-stage strategy for eliminating unlikely candidates before recognition can be a reasonable and powerful approach for increasing the recognition speed. A holistic lexicon reduction technique for offline handwritten Arabic word recognition is proposed in this paper. The principle of this technique involves the extraction of dots and subwords from the cursive Arabic word image to describe its shape. In the first stage of reduction, the number of subwords in the input word is estimated. Then in the second stage, the word descriptor, based on the dot information, is used while taking into account only the candidates selected in the first stage. Experimental results on the IFN/ENIT database, consisting of 26,459 cursive Arabic word images, show a lexicon reduction of 92.5% with an accuracy of 74%.
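
The two-stage reduction described above can be sketched as a pair of filters. The abstract does not define the dot-based word descriptor or any tolerances, so this sketch makes loud assumptions: each lexicon entry is a tuple `(word, subword_count, (dots_above, dots_below))`, the descriptor distance is a sum of absolute differences, and `subword_tol` and `dot_tol` are hypothetical parameters.

```python
def reduce_lexicon(lexicon, n_subwords, dots, subword_tol=0, dot_tol=1):
    """Two-stage lexicon reduction: subword count first, then dots.

    lexicon: list of (word, subword_count, (dots_above, dots_below)).
    Returns the words that survive both stages.
    """
    # Stage 1: keep entries whose subword count is close to the estimate.
    stage1 = [e for e in lexicon if abs(e[1] - n_subwords) <= subword_tol]

    # Stage 2: compare dot descriptors, but only for stage-1 survivors.
    def dot_distance(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    return [e[0] for e in stage1 if dot_distance(e[2], dots) <= dot_tol]


toy_lexicon = [("w1", 2, (1, 0)), ("w2", 2, (3, 2)), ("w3", 4, (1, 0))]
survivors = reduce_lexicon(toy_lexicon, n_subwords=2, dots=(1, 1))
```

Running the cheaper subword test first means the descriptor comparison is only computed for a fraction of the lexicon, which is the point of the two-stage design.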

Proceedings ArticleDOI
17 Nov 2008
TL;DR: A bio-inspired unified model is presented to improve the recognition accuracy of character recognition problems for CAPTCHA (completely automated public Turing test to tell computers and humans apart); it can be generalized to cope with broader domains.
Abstract: In this paper, we present a bio-inspired unified model to improve the recognition accuracy of character recognition problems for CAPTCHA (completely automated public Turing test to tell computers and humans apart). Our study focused on segmenting different CAPTCHA characters to show the importance of visual preprocessing in recognition. Traditional character recognition systems show a low recognition rate for CAPTCHA characters due to their noisy backgrounds and distorted characters. We imitated the human visual attention system to let a recognition system know where to focus despite the noise. The preprocessed characters were then recognized by an OCR system. For the CAPTCHA characters we tested, the overall recognition rate increased from 16.63% to 70.74% after preprocessing. Our experimental results show the importance of preprocessing for character recognition. Also, by imitating the human visual system, a more unified model can be built. The model presented is an instance for a certain type of visual recognition problem and can be generalized to cope with broader domains.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: An adaptation framework to recognize characters in a book with a learning framework is proposed, in which the post processor verifies the output of the recognition module, which is further used for learning and thus improves the performance over iterations.
Abstract: The problem of character recognition in a book should be formulated significantly differently from that of a single page or word. An ideal approach to designing such a recognizer is to adapt the classifier to the font and style of the collection. In this paper, we propose an adaptation framework to recognize characters in a book with a learning framework. In the proposed system, the post processor verifies the output of the recognition module, which is further used for learning and thus improves the performance over iterations. Experiments are conducted on about 500,000 annotated symbols from five books in Malayalam (an Indian language). We achieve an average improvement of 14% in classification accuracy.

Journal ArticleDOI
TL;DR: A system for offline recognition of cursive handwritten Tamil characters is presented that uses a combination of time-domain and frequency-domain features and proves to be flexible and robust.
Abstract: In spite of several advancements in technologies pertaining to optical character recognition, handwriting continues to persist as a means of documenting information in day-to-day life. The processes of segmentation and recognition pose quite a lot of challenges, especially in recognizing cursive handwritten scripts of different languages. The concept proposed is a solution crafted to perform character recognition of handwritten scripts in Tamil, a language having official status in India, Sri Lanka, and Singapore. The approach utilizes discrete Hidden Markov Models (HMMs) for recognizing off-line cursive handwritten Tamil characters. The tolerance of the system is evident, as it can overcome the complexities arising from font variations, and it proves to be flexible and robust. A higher degree of accuracy has been obtained by applying this approach to a comprehensive database, and the precision of the results demonstrates its suitability for commercial use. The methodology promises to provide a simple and fast scaffold for constructing a full OCR system when extended with suitable pre-processing.