
Showing papers on "Optical character recognition published in 2001"


Journal ArticleDOI
10 Sep 2001
TL;DR: This work proposes a variant of the Turing test using pessimal print: that is, low-quality images of machine-printed text synthesized pseudo-randomly over certain ranges of words, typefaces, and image degradations and shows experimentally that judicious choice of these ranges can ensure that the images are legible to human readers but illegible to several of the best present-day optical character recognition (OCR) machines.
Abstract: We exploit the gap in ability between human and machine vision systems to craft a family of automatic challenges that tell human and machine users apart via graphical interfaces including Internet browsers. Turing proposed (1950) a method whereby human judges might validate "artificial intelligence" by failing to distinguish between human and machine interlocutors. Stimulated by the "chat room problem", and influenced by the CAPTCHA project of Blum et al. (2000), we propose a variant of the Turing test using pessimal print: that is, low-quality images of machine-printed text synthesized pseudo-randomly over certain ranges of words, typefaces, and image degradations. We show experimentally that judicious choice of these ranges can ensure that the images are legible to human readers but illegible to several of the best present-day optical character recognition (OCR) machines. Our approach is motivated by a decade of research on performance evaluation of OCR machines and on quantitative stochastic models of document image quality. The slow pace of evolution of OCR and other species of machine vision over many decades suggests that pessimal print will defy automated attack for many years. Applications include 'bot' barriers and database rationing.
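The paper's degradation models are quantitative and stochastic; as a toy stand-in (not the authors' model), the core idea of pseudo-random image degradation can be sketched as independent pixel flips on a binary glyph. All names and the flip-probability value here are illustrative assumptions.

```python
import numpy as np

def degrade(glyph: np.ndarray, flip_prob: float = 0.15, seed: int = 0) -> np.ndarray:
    """Pseudo-randomly flip pixels of a binary glyph image.

    A toy stand-in for the stochastic document-degradation models the
    paper draws on: each pixel is inverted independently with
    probability `flip_prob`.
    """
    rng = np.random.default_rng(seed)
    flips = rng.random(glyph.shape) < flip_prob
    return np.where(flips, 1 - glyph, glyph)

# A tiny 5x5 "glyph": a single vertical stroke.
glyph = np.zeros((5, 5), dtype=int)
glyph[:, 2] = 1
noisy = degrade(glyph, flip_prob=0.2)
```

At low flip probabilities a human still reads the stroke easily, while template-based recognizers degrade quickly, which is the gap the paper exploits.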

196 citations


Journal ArticleDOI
TL;DR: A novel texture analysis-based approach toward font recognition that takes the document as an image containing some specific textures and regards font recognition as texture identification; the method is content-independent and involves no detailed local feature analysis.
Abstract: We describe a novel texture analysis-based approach toward font recognition. Existing methods are typically based on local typographical features that often require connected components analysis. In our method, we take the document as an image containing some specific textures and regard font recognition as texture identification. The method is content-independent and involves no detailed local feature analysis. Experiments are carried out by using 14000 samples of 24 frequently used Chinese fonts (six typefaces combined with four styles), as well as 32 frequently used English fonts (eight typefaces combined with four styles). An average recognition rate of 99.1 percent is achieved. Experimental results are also included on the robustness of the method against image degradation (e.g., pepper and salt noise) and on the comparison with existing methods.

185 citations


Patent
16 Nov 2001
TL;DR: In this article, a system and method of managing documents in which, after document preparation, documents may be scanned to form a digital document image; after optical character recognition, a compressed digital image file with a text layer is created so that a separate text file may be extracted from the document image, with the two tethered together by a unique identifier.
Abstract: A system and method of managing documents wherein, after document preparation, documents may be scanned to form a digital document image. After optical character recognition, a compressed digital image file with a text layer is created so that a separate text file may be extracted from the document image, with the two tethered together by a unique identifier. The compressed digital image file and its corresponding extracted text file may be sent to a server, where an inventory of each word of every document is created. The images and text inventory are then inserted into a database such that users of the system may run Boolean searches and/or activate hyperlinks tethered to document images for navigation or for the creation of index entries that may contain additional information about the documents. In the preferred method, the system allows the management of a plurality of documents over a wide area network such as the Internet.

150 citations


Proceedings ArticleDOI
01 Dec 2001
TL;DR: A fast and robust algorithm to identify text in image or video frames with complex backgrounds and compression effects with advantages compared to conventional methods in both identification quality and computation time is presented.
Abstract: The paper presents a fast and robust algorithm to identify text in image or video frames with complex backgrounds and compression effects. The algorithm first extracts the candidate text line on the basis of edge analysis, baseline location and heuristic constraints. Support Vector Machine (SVM) is then used to identify text line from the candidates in edge-based distance map feature space. Experiments based on a large amount of images and video frames from different sources showed the advantages of this algorithm compared to conventional methods in both identification quality and computation time.
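As a rough illustration of the edge-analysis step (not the authors' exact pipeline, which adds baseline location, heuristic constraints, and an SVM verifier), rows with high edge density can be flagged as candidate text lines; the threshold value is an assumption for the example.

```python
import numpy as np

def candidate_text_rows(img: np.ndarray, density_thresh: float = 0.2) -> np.ndarray:
    """Return a boolean mask of rows whose edge density suggests text.

    Edges are approximated by horizontal intensity differences; text rows
    tend to contain many strong transitions. Thresholds are illustrative.
    """
    edges = np.abs(np.diff(img.astype(float), axis=1)) > 0.5
    density = edges.mean(axis=1)
    return density > density_thresh

# Synthetic image: rows 2-3 alternate 0/1 (text-like), other rows are flat.
img = np.zeros((6, 10))
img[2:4, ::2] = 1.0
mask = candidate_text_rows(img)
```

In the paper, such candidates are then passed to the SVM operating in an edge-based distance-map feature space for final verification.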

144 citations


Patent
01 May 2001
TL;DR: In this paper, a document image that is the source of Optical Character Recognition (OCR) output is displayed so that a user can select a region of the displayed document image, and text of the OCR output corresponding to the selected region is submitted as an input to a search engine.
Abstract: A document image that is the source of Optical Character Recognition (OCR) output is displayed so that a user can select a region of the displayed document image. When the region is selected, text of the OCR output corresponding to the selected region is submitted as an input to a search engine.

132 citations


Journal ArticleDOI
TL;DR: An Arabic OCR system is proposed, which uses a recognition-based segmentation technique to overcome the classical segmentation problems and shows a 90% recognition accuracy with a 20 chars/s recognition rate.

124 citations


Proceedings ArticleDOI
10 Sep 2001
TL;DR: This work presents an efficient and practical approach to Telugu OCR which limits the number of templates to be recognized to just 370, avoiding issues of classifier design for thousands of shapes or very complex glyph segmentation.
Abstract: Telugu is the language spoken by more than 100 million people of South India. Telugu has a complex orthography with a large number of distinct character shapes (estimated to be of the order of 10,000) composed of simple and compound characters formed from 16 vowels (called achchus) and 36 consonants (called hallus). We present an efficient and practical approach to Telugu OCR which limits the number of templates to be recognized to just 370, avoiding issues of classifier design for thousands of shapes or very complex glyph segmentation. A compositional approach using connected components and fringe distance template matching was tested to give a raw OCR accuracy of about 92%. Several experiments across varying fonts and resolutions showed the approach to be satisfactory.
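Fringe-distance template matching scores a template against a candidate glyph via a distance transform of the glyph image. A minimal sketch, assuming a Manhattan-distance transform and a simple sum-of-distances score (the paper's exact formulation may differ):

```python
import numpy as np

def distance_map(img: np.ndarray) -> np.ndarray:
    """Two-pass Manhattan distance transform: for each pixel, the distance
    to the nearest foreground (1) pixel."""
    h, w = img.shape
    big = h * w  # larger than any possible Manhattan distance in the image
    d = np.where(img > 0, 0, big).astype(float)
    for y in range(h):          # forward pass
        for x in range(w):
            if y > 0:
                d[y, x] = min(d[y, x], d[y - 1, x] + 1)
            if x > 0:
                d[y, x] = min(d[y, x], d[y, x - 1] + 1)
    for y in range(h - 1, -1, -1):  # backward pass
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                d[y, x] = min(d[y, x], d[y + 1, x] + 1)
            if x < w - 1:
                d[y, x] = min(d[y, x], d[y, x + 1] + 1)
    return d

def fringe_score(template: np.ndarray, glyph: np.ndarray) -> float:
    """Sum of the glyph's distance map over template foreground pixels:
    0 for a perfect match, growing as the shapes diverge."""
    return float(distance_map(glyph)[template > 0].sum())

template = np.zeros((4, 4)); template[:, 1] = 1
print(fringe_score(template, template))  # 0.0 -- perfect match
```

Classification then amounts to picking, among the 370 templates, the one with the lowest score for a given connected component.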

122 citations


Proceedings ArticleDOI
10 Sep 2001
TL;DR: The paper deals with an optical character recognition system for printed Oriya, a popular Indian script, that achieves 96.3% character level accuracy on average.
Abstract: The paper deals with an optical character recognition system for printed Oriya, a popular Indian script. The development of OCR for this script is difficult because a large number of characters have to be recognized. In the proposed system, the digitized document image is first passed through preprocessing modules like skew correction, line segmentation, zone detection, word and character segmentation, etc. These modules have been developed by combining some conventional techniques with some newly proposed ones. Next, individual characters are recognized using a combination of stroke and run-number based features, along with features obtained from the concept of a water reservoir. The feature detection methods are simple and robust. A prototype of the system has been tested on a variety of printed Oriya material, and currently achieves 96.3% character level accuracy on average.

105 citations


Journal ArticleDOI
TL;DR: A spelling correction system designed specifically for OCR-generated text that selects candidate words through the use of information gathered from multiple knowledge sources is described, based on static and dynamic device mappings, approximate string matching, and n-gram analysis.
Abstract: In this paper, we describe a spelling correction system designed specifically for OCR-generated text that selects candidate words through the use of information gathered from multiple knowledge sources. This system for text correction is based on static and dynamic device mappings, approximate string matching, and n-gram analysis. Our statistically based, Bayesian system incorporates a learning feature that collects confusion information at the collection and document levels. An evaluation of the new system is presented as well.
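The n-gram analysis component can be illustrated with character-bigram overlap scoring against a lexicon. This is a minimal sketch of candidate selection only; the paper's full system also uses device mappings, approximate string matching, and Bayesian confusion statistics, and the example words are assumptions.

```python
def bigrams(word: str) -> set:
    """Character bigrams of a word, with boundary markers."""
    w = f"^{word}$"
    return {w[i:i + 2] for i in range(len(w) - 1)}

def correct(ocr_word: str, lexicon: list) -> str:
    """Pick the lexicon word with the highest bigram overlap
    (Dice coefficient) with the OCR output."""
    def dice(a, b):
        ba, bb = bigrams(a), bigrams(b)
        return 2 * len(ba & bb) / (len(ba) + len(bb))
    return max(lexicon, key=lambda w: dice(ocr_word, w))

# 'm' misread as 'rn' is a classic OCR confusion:
print(correct("docurnent", ["document", "documents", "monument"]))  # prints "document"
```

A confusion-aware system would additionally boost candidates reachable through known device mappings such as rn→m, which is what the learning feature described above accumulates.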

97 citations


Proceedings ArticleDOI
01 Sep 2001
TL;DR: A system for recognizing unconstrained English handwritten text with a large vocabulary, based on hidden Markov models; the stability of the segmentation algorithm is investigated by varying the threshold that separates intra- and inter-word distances.
Abstract: We present a system for recognizing unconstrained English handwritten text based on a large vocabulary. We describe the three main components of the system, which are preprocessing, feature extraction and recognition. In the preprocessing phase the handwritten texts are first segmented into lines. Then each line of text is normalized with respect to skew, slant, vertical position and width. After these steps, text lines are segmented into single words. For this purpose distances between connected components are measured. Using a threshold, the distances are divided into distances within a word and distances between different words. A line of text is segmented at positions where the distances are larger than the chosen threshold. From each image representing a single word, a sequence of features is extracted. These features are input to a recognition procedure which is based on hidden Markov models. To investigate the stability of the segmentation algorithm the threshold that separates intra- and inter-word distances from each other is varied. If the threshold is small many errors are caused by over-segmentation, while for large thresholds under-segmentation errors occur. The best segmentation performance is 95.56% correctly segmented words, tested on 541 text lines containing 3899 words. Given a correct segmentation rate of 95.56%, a recognition rate of 73.45% on the word level is achieved.
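The gap-thresholding step described above can be sketched as follows; the component spans and the threshold value are illustrative, not taken from the paper.

```python
def segment_words(component_spans: list, gap_thresh: int) -> list:
    """Group connected components, given as sorted (start, end) x-spans,
    into words: a gap larger than `gap_thresh` starts a new word."""
    words = [[component_spans[0]]]
    for prev, cur in zip(component_spans, component_spans[1:]):
        gap = cur[0] - prev[1]  # horizontal distance between components
        if gap > gap_thresh:
            words.append([cur])   # inter-word gap: start a new word
        else:
            words[-1].append(cur)  # intra-word gap: extend current word
    return words

spans = [(0, 5), (7, 12), (20, 25), (27, 30)]
print(segment_words(spans, gap_thresh=4))
# [[(0, 5), (7, 12)], [(20, 25), (27, 30)]]
```

Lowering `gap_thresh` splits words apart (over-segmentation); raising it merges neighbors (under-segmentation), exactly the trade-off the paper measures.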

83 citations


Proceedings ArticleDOI
01 Sep 2001
TL;DR: This paper summarizes the core idea of the T-Recs table recognition system, an integrated system covering block-segmentation, table location and a model-free structural analysis of tables, and proposes a quality evaluation measure that reflects the bottom-up strategy of either T-Recs or T-Recs++.
Abstract: This paper summarizes the core idea of the T-Recs table recognition system, an integrated system covering block-segmentation, table location and a model-free structural analysis of tables. T-Recs works on the output of commercial OCR systems that provide the word bounding box geometry together with the text itself (e.g. Xerox ScanWorX). While T-Recs performs well on a number of document categories, business letters remained a challenging domain because the T-Recs location heuristics are misled by the letters' headers or footers, resulting in low recognition precision. Business letters such as invoices are a very interesting domain for industrial applications due to the large amount of documents to be analyzed and the importance of the data carried within their tables. Hence, we developed a more restrictive approach which is implemented in the T-Recs++ prototype. This paper describes the ideas behind T-Recs++ location and also proposes a quality evaluation measure that reflects the bottom-up strategy of either T-Recs or T-Recs++. Finally, some results comparing both systems on a collection of business letters are given.

Proceedings ArticleDOI
01 Sep 2001
TL;DR: A complete OCR for printed Hindi text in Devanagari script is presented and a performance of 93% at character level is obtained.
Abstract: In this paper, we present a complete OCR for printed Hindi text in Devanagari script. A performance of 93% at character level is obtained.

Proceedings ArticleDOI
26 Sep 2001
TL;DR: A new form of filter is derived from the Gabor filter, and it is shown that this filter can efficiently estimate the scales of these stripes and enhance the edges of only those stripes found to correspond to a suitable scale.
Abstract: Stripes are common sub-structures of text characters, and the scale of these stripes varies little within a word. This scale consistency thus provides us with a useful feature for text detection and segmentation. A new form of filter is derived from the Gabor filter, and it is shown that this filter can efficiently estimate the scales of these stripes. The contrast of text in video can then be increased by enhancing the edges of only those stripes found to correspond to a suitable scale. More specifically, the algorithm presented here enhances the stripes in three pre-selected scale ranges. The resulting enhancement yields much better performance from the binarization process, which is the step required before character recognition.

Proceedings ArticleDOI
07 May 2001
TL;DR: Three key techniques contributing to the high recognition accuracy are highlighted, namely the use of Gabor features, the use of discriminative feature extraction, and the use of minimum classification error as a criterion for model training.
Abstract: We have developed a Chinese OCR engine for machine printed documents. Currently, our OCR engine can support a vocabulary of 6921 characters which include 6707 simplified Chinese characters in GB2312-80, 12 frequently used GBK Chinese characters, 62 alphanumeric characters, 140 punctuation marks and symbols. The supported font styles include Song, Fang Song, Kat, He, Yuan, LiShu, WeiBei, XingKai, etc. The average character recognition accuracy is above 99% for newspaper quality documents, with a recognition speed of about 250 characters per second on a Pentium III-450 MHz PC while consuming less than 2 MB of memory. We describe the key technologies we used to construct the above recognizer. Among them, we highlight three key techniques contributing to the high recognition accuracy, namely the use of Gabor features, the use of discriminative feature extraction, and the use of minimum classification error as a criterion for model training.

Proceedings ArticleDOI
10 Sep 2001
TL;DR: Algorithms for transforming paper documents into a text representation suitable as input for an automatic text recognizer are described; line segmentation was found to be successful in 97% of all samples.
Abstract: To automatically acquire the information recorded in church registers and other historical scriptures, the writing on these documents must be recognized. This paper describes algorithms for transforming the paper documents into a representation of text apt to be used as input for an automatic text recognizer. The automatic recognition of old handwritten scriptures is difficult for two main reasons: lines of text are generally not straight, and ascenders and descenders of adjacent lines interfere. The algorithms described in this paper provide ways to reconstruct the path of the lines of text using an approach of gradually constructing line segments until a unique line of text is formed. In addition, the single lines are segmented and an output in the form of a raster image is provided. The method was applied to church registers written between the 17th and 19th centuries. Line segmentation was found to be successful in 97% of all samples.

Proceedings ArticleDOI
10 Sep 2001
TL;DR: Experimental results in the application of detecting the user-specified words from both English and Chinese document images show that weighted Hausdorff distance is a promising approach for word image matching.
Abstract: An approach to word image matching based on weighted Hausdorff distance (WHD) is proposed in this paper to facilitate the detection and location of the user-specified words in the document images. Preprocessing such as eliminating the space between adjacent characters in the word images and scale normalization is first done before the WHD is utilized to measure the distance between the template image and the word image extracted from the document image. Experimental results in the application of detecting the user-specified words from both English and Chinese document images show that it is a promising approach for word image matching.
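The core matching idea can be sketched with an unweighted, averaged Hausdorff distance between the foreground pixels of two binary word images; the paper's WHD additionally weights each pixel's contribution, which is omitted here for brevity.

```python
import numpy as np

def avg_hausdorff(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetrized average directed Hausdorff distance between the
    foreground (nonzero) pixels of two binary images.

    Unweighted sketch of the matching step; the paper's WHD weights
    each pixel's contribution to the distance.
    """
    pa = np.argwhere(a)  # foreground coordinates of image a
    pb = np.argwhere(b)
    # Pairwise Euclidean distances between the two point sets.
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=2)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

x = np.zeros((4, 4), dtype=int); x[1, 1:3] = 1
y = np.zeros((4, 4), dtype=int); y[1, 1:3] = 1
print(avg_hausdorff(x, y))  # 0.0 for identical images
```

A query word image is then matched against document word images by ranking these distances, after the spacing elimination and scale normalization described above.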

Proceedings ArticleDOI
10 Sep 2001
TL;DR: An automatic technique for the identification of printed Roman, Chinese, Arabic, Devnagari and Bangla text lines from a single document is proposed and has an accuracy of about 97.33%.
Abstract: In a general situation, a document page may contain several scriptforms. For optical character recognition (OCR) of such a document page, it is necessary to separate the scripts before feeding them to their individual OCR systems. An automatic technique for the identification of printed Roman, Chinese, Arabic, Devnagari and Bangla text lines from a single document is proposed. Shape based features, statistical features and some features obtained from the concept of a water reservoir are used for script identification. The proposed scheme has an accuracy of about 97.33%.

Journal ArticleDOI
TL;DR: This paper presents a machine-printed and hand-written text classification scheme for Bangla and Devnagari, the two most popular Indian scripts, which has an accuracy of 98.6%.

Patent
12 Apr 2001
TL;DR: In this paper, a check image is received, and check clearing processes are performed using the check image and the data, without the use of the physical check itself.
Abstract: A method, apparatus, and computer-implemented instructions for use in a network data processing system to process a check. A check image is received. Optical character recognition is performed on the check image to generate data. Check clearing processes are performed using the check image and the data. These processes are performed without using the physical check itself.

Patent
12 Apr 2001
TL;DR: In this article, a check received in an automatic teller machine is scanned to generate an image, data is extracted from the image using optical character recognition (OCR), and a markup-language representation of the check is created from the data.
Abstract: A method and apparatus for processing a check in an automatic teller machine in a data processing system. A check is received in the automatic teller machine. The check is scanned within the automatic teller machine to generate an image. Optical character recognition is performed on the image to generate data. A markup-language representation of the check is created using the data.

Journal ArticleDOI
Premkumar Natajan, Zhidong Lu, Richard Schwartz, Issam Bazzi, John Makhoul
TL;DR: The script independence of the system is demonstrated in three languages with different types of script: Arabic, English, and Chinese, and an unsupervised adaptation method is described to improve performance under degraded conditions.
Abstract: This paper presents a script-independent methodology for optical character recognition (OCR) based on the use of hidden Markov models (HMM). The feature extraction, training and recognition components of the system are all designed to be script independent. The training and recognition components were taken without modification from a continuous speech recognition system; the only component that is specific to OCR is the feature extraction component. To port the system to a new language, all that is needed is text image training data from the new language, along with ground truth which gives the identity of the sequences of characters along each line of each text image, without specifying the location of the characters on the image. The parameters of the character HMMs are estimated automatically from the training data, without the need for laborious handwritten rules. The system does not require presegmentation of the data, neither at the word level nor at the character level. Thus, the system is able to handle languages with connected characters in a straightforward manner. The script independence of the system is demonstrated in three languages with different types of script: Arabic, English, and Chinese. The robustness of the system is further demonstrated by testing the system on fax data. An unsupervised adaptation method is then described to improve performance under degraded conditions.

Proceedings ArticleDOI
Ismail Haritaoglu
08 Dec 2001
TL;DR: An automatic sign/text language translation system for foreign travelers, which users can invoke whenever they want to see, in their own language, text or signs that are written in a foreign language in the scene.
Abstract: We describe a scene text extraction system for handheld devices to provide enhanced information perception services to the user. It uses a color camera attached to a personal digital assistant as an input device to capture scene images from the real world, and it employs image enhancement and segmentation methods to extract written information from the scene, convert it to text, and show it to the user so that he/she can see both the real world and the information together. We implemented a prototype application: an automatic sign/text language translation system for foreign travelers, which users can invoke whenever they want to see, in their own language, text or signs that were originally written in a foreign language in the scene.

Journal ArticleDOI
TL;DR: A new method for automatic detection of skew in a document image using mathematical morphology is presented, which is extremely fast as well as independent of script forms.
Abstract: Any paper document, when converted to electronic form through standard digitizing devices such as scanners, is subject to a small tilt or skew. By contrast, a de-skewed document allows a more compact representation of its components, particularly text objects, such as words, lines, and paragraphs, where they can be represented by their rectilinear bounding boxes. This simplified representation leads to more efficient, robust, as well as simpler algorithms for document image analysis including optical character recognition (OCR). This paper presents a new method for automatic detection of skew in a document image using mathematical morphology. The proposed algorithm is extremely fast as well as independent of script forms.
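The abstract does not detail the morphological construction, so the sketch below uses a common alternative baseline, named plainly as such: searching for the rotation angle that maximizes the variance of the horizontal projection profile. The angle range and image are illustrative.

```python
import numpy as np

def estimate_skew(img: np.ndarray, angles=np.linspace(-5, 5, 21)) -> float:
    """Estimate document skew (degrees) by rotating foreground pixel
    coordinates and maximizing the variance of the row-projection profile.

    Note: this projection-profile search is a common baseline, not the
    morphological method of the paper, whose details the abstract omits.
    """
    ys, xs = np.nonzero(img)
    best_angle, best_var = 0.0, -1.0
    for ang in angles:
        t = np.deg2rad(ang)
        rot_y = ys * np.cos(t) - xs * np.sin(t)  # rotated row coordinates
        hist, _ = np.histogram(rot_y, bins=img.shape[0])
        v = hist.var()  # sharp peaks = well-aligned text lines
        if v > best_var:
            best_var, best_angle = v, ang
    return best_angle

# A perfectly horizontal "text line" should yield a skew near zero:
img = np.zeros((20, 40))
img[10, 5:35] = 1
```

Working on pixel coordinates rather than resampled images keeps the sketch dependency-free; a production system would rotate the full image once the angle is known.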

Journal ArticleDOI
TL;DR: A slant removal algorithm is presented based on the use of the vertical projection profile of word images and the Wigner-Ville distribution, which can be easily incorporated into any optical character recognition system.

Book ChapterDOI
TL;DR: Pattern Recognition is a fast growing field with applications in many diverse areas such as optical character recognition (OCR), computer-aided diagnosis and speech recognition, to name but a few.
Abstract: Pattern Recognition (PR) is a fast growing field with applications in many diverse areas such as optical character recognition (OCR), computer-aided diagnosis and speech recognition, to name but a few.

Journal ArticleDOI
TL;DR: The Fuzzy C-Means method, which determines each cluster location using maximum-membership defuzzification and neighborhood smoothing techniques, can be applied to classify text, image, and background areas in optical character recognition (OCR) applications for elaborated open document systems.
Abstract: Classification of text and image using statistical features (mean and standard deviation of pixel color values) is found to be a simple yet powerful method for text and image segmentation. The features form clusters that separate one region type from another. We identified this segregation in the form of class clustering by means of the Fuzzy C-Means method, which determines each cluster location using maximum-membership defuzzification and neighborhood smoothing techniques. The method can then be applied to classify text, image, and background areas in optical character recognition (OCR) applications for elaborated open document systems.
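The clustering step can be sketched with a generic fuzzy C-means implementation over two-dimensional block features (mean, standard deviation); the feature values below are invented for illustration, and the paper's neighborhood smoothing is omitted.

```python
import numpy as np

def fcm(X: np.ndarray, c: int = 2, m: float = 2.0, iters: int = 50, seed: int = 0):
    """Basic fuzzy C-means: returns (centers, membership matrix U).

    X is (n_samples, n_features); U is (n_samples, c) with rows summing
    to 1. A generic FCM sketch; the paper adds maximum-membership
    defuzzification and neighborhood smoothing on top.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)  # random fuzzy memberships
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))  # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# Two well-separated "block feature" clusters (mean, std of pixel values):
X = np.array([[0.10, 0.05], [0.12, 0.06], [0.90, 0.40], [0.88, 0.42]])
centers, U = fcm(X, c=2)
labels = U.argmax(axis=1)  # maximum-membership defuzzification
```

Each block is finally assigned the label of its highest-membership cluster, which is the defuzzification step named in the abstract.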

Proceedings ArticleDOI
23 Nov 2001
TL;DR: The segmentation module of the O³MR (Object Oriented Optical Music Recognition) system is presented; the proposed approach is based on the adoption of projections for the extraction of basic symbols that constitute a graphic element of music notation.
Abstract: The optical music recognition problem has been addressed in several ways, obtaining suitable results only when simple music constructs are processed. The most critical phase of the optical music recognition process is the first analysis of the image sheet. The first analysis consists of segmenting the acquired sheet into smaller parts which may be processed to recognize the basic symbols. The segmentation module of the O³MR (Object Oriented Optical Music Recognition) system is presented. The proposed approach is based on the adoption of projections for the extraction of basic symbols that constitute a graphic element of music notation. A set of examples is also included.

Proceedings ArticleDOI
01 Sep 2001
TL;DR: A shape-based post-processing system for an OCR of Gurmukhi script has been developed based on the size and shape of a word; an improvement of 3% in recognition rate is reported on machine-printed images using the post-processing techniques.
Abstract: A shape-based post-processing system for an OCR of Gurmukhi script has been developed. Based on the size and shape of a word, a Punjabi corpus has been split into different partitions. Statistical information on Punjabi syllable combinations, corpus lookup, and holistic recognition of the most commonly occurring words have been combined to design the post-processor. An improvement of 3% in recognition rate, from 94.35% to 97.34%, has been reported on machine-printed images using the post-processing techniques.

Journal ArticleDOI
TL;DR: A labelling approach for the automatic recognition of tables of contents (ToC) is described, used for the electronic consulting of scientific papers in a digital library system named Calliope, and operates by text labelling without using any a priori model.
Abstract: A labelling approach for the automatic recognition of tables of contents (ToC) is described in this paper. A prototype is used for the electronic consulting of scientific papers in a digital library system named Calliope. This method operates on a roughly structured ASCII file, produced by OCR. The recognition approach operates by text labelling without using any a priori model. Labelling is based on part-of-speech tagging (PoS) which is initiated by a primary labelling of text components using some specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammar categories and then reduced in canonical forms corresponding to article fields: “title” and “authors”. Non-labelled tokens are integrated in one or another field by either applying PoS correction rules or using a structure model generated from well-detected articles. The designed prototype operates very well on different ToC layouts and character recognition qualities. Without manual intervention, a 96.3% rate of correct segmentation was obtained on 38 journals, including 2,020 articles, accompanied by a 93.0% rate of correct field extraction.

Proceedings ArticleDOI
10 Sep 2001
TL;DR: A new connected component based segmentation algorithm which automatically extracts text regions from natural scene images is proposed in this paper, utilizing a multichannel decomposition method to locate text blocks in complex backgrounds.
Abstract: A new connected component based segmentation algorithm which automatically extracts text regions from natural scene images is proposed in this paper. This approach utilizes a multichannel decomposition method to locate text blocks in complex backgrounds. Block alignment analysis and recognition confidence values are used in the combination and identification of the connected components. The algorithm is applied to a test image database and shows promising results.