Showing papers on "Optical character recognition published in 1997"

PDF

Open Access

Journal Article•DOI•

Determination of the script and language content of document images

[...]

A.L. Spitz¹•Institutions (1)

01 Mar 1997-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This work has developed techniques for distinguishing which language is represented in an image of text using a technique based on character shape codes, a representation of Latin text that is inexpensive to compute.

...read moreread less

Abstract: Most document recognition work to date has been performed on English text. Because of the large overlap of the character sets found in English and major Western European languages such as French and German, some extensions of the basic English capability to those languages have taken place. However, automatic language identification prior to optical character recognition is not commonly available and adds utility to such systems. Languages and their scripts have attributes that make it possible to determine the language of a document automatically. Detection of the values of these attributes requires the recognition of particular features of the document image and, in the case of languages using Latin-based symbols, the character syntax of the underlying language. We have developed techniques for distinguishing which language is represented in an image of text. This work is restricted to a small but important subset of the world's languages. The method first classifies the script into two broad classes: Han-based and Latin-based. This classification is based on the spatial relationships of features related to the upward concavities in character structures. Language identification within the Han script class (Chinese, Japanese, Korean) is performed by analysis of the distribution of optical density in the text images. We handle 23 Latin-based languages using a technique based on character shape codes, a representation of Latin text that is inexpensive to compute.

...read moreread less

279 citations

Proceedings Article•DOI•

Adaptive document binarization

[...]

Jaakko Sauvola¹, Tapio Seppänen¹, S. Haapakoski¹, Matti Pietikäinen¹•Institutions (1)

University of Oulu¹

18 Aug 1997

TL;DR: A new method is presented for adaptive document image binarization, where the page is considered as a collection of subcomponents such as text, background and picture, using document characteristics to determine (surface) attributes, often used in document segmentation.

...read moreread less

Abstract: A new method is presented for adaptive document image binarization, where the page is considered as a collection of subcomponents such as text, background and picture. The problems caused by noise, illumination and many source type related degradations are addressed. The algorithm uses document characteristics to determine (surface) attributes, often used in document segmentation. Using characteristic analysis, two new algorithms are applied to determine a local threshold for each pixel. An algorithm based on soft decision control is used for thresholding the background and picture regions. An approach utilizing local mean and variance of gray values is applied to textual regions. Tests were performed with images including different types of document components and degradations. The results show that the method adapts and performs well in each case.

...read moreread less

257 citations

Journal Article•DOI•

Off-line, handwritten numeral recognition by perturbation method

[...]

T.M. Ha, Horst Bunke¹•Institutions (1)

University of Bern¹

01 May 1997-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper proposes a recognition method which is able to account for a variety of distortions due to eccentric handwriting, and tested its method on two worldwide standard databases of isolated numerals, namely CEDAR and NIST.

...read moreread less

Abstract: This paper presents a new approach to off-line, handwritten numeral recognition. From the concept of perturbation due to writing habits and instruments, we propose a recognition method which is able to account for a variety of distortions due to eccentric handwriting. We tested our method on two worldwide standard databases of isolated numerals, namely CEDAR and NIST, and obtained 99.09 percent and 99.54 percent correct recognition rates at no-rejection level respectively. The latter result was obtained by testing on more than 170000 numerals.

...read moreread less

219 citations

Journal Article•DOI•

Document image binarization based on texture features

[...]

Ying Liu¹, Sargur N. Srihari•Institutions (1)

State University of New York System¹

01 May 1997-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A texture feature based thresholding algorithm that is appreciably better than those obtained by typical existing thresholding techniques for document images with poor contrast, strong noise, complex patterns, and/or variable modalities in gray-scale histograms is developed.

...read moreread less

Abstract: Binarization has been difficult for document images with poor contrast, strong noise, complex patterns, and/or variable modalities in gray-scale histograms. We developed a texture feature based thresholding algorithm to address this problem. Our algorithm consists of three steps: 1) candidate thresholds are produced through iterative use of Otsu's algorithm (1978); 2) texture features associated with each candidate threshold are extracted from the run-length histogram of the accordingly binarized image; 3) the optimal threshold is selected so that desirable document texture features are preserved. Experiments with 9,000 machine printed address blocks from an unconstrained US mail stream demonstrated that over 99.6 percent of the images were successfully binarized by the new thresholding method, appreciably better than those obtained by typical existing thresholding techniques. Also, a system run with 500 troublesome mail address blocks showed that an 8.1 percent higher character recognition rate was achieved with our algorithm as compared with Otsu's algorithm.

...read moreread less

218 citations

Journal Article•DOI•

Representation and recognition of handwritten digits using deformable templates

[...]

Anil K. Jain¹, D. Zongker²•Institutions (2)

Michigan State University¹, University of Washington²

01 Dec 1997-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The application of deformable templates to recognition of handprinted digits shows that there does exist a good low-dimensional representation space and methods to reduce the computational requirements, the primary limiting factor, are discussed.

...read moreread less

Abstract: We investigate the application of deformable templates to recognition of handprinted digits. Two characters are matched by deforming the contour of one to fit the edge strengths of the other, and a dissimilarity measure is derived from the amount of deformation needed, the goodness of fit of the edges, and the interior overlap between the deformed shapes. Classification using the minimum dissimilarity results in recognition rates up to 99.25 percent on a 2,000 character subset of NIST Special Database 1. Additional experiments on an independent test data were done to demonstrate the robustness of this method. Multidimensional scaling is also applied to the 2,000/spl times/2,000 proximity matrix, using the dissimilarity measure as a distance, to embed the patterns as points in low-dimensional spaces. A nearest neighbor classifier is applied to the resulting pattern matrices. The classification accuracies obtained in the derived feature space demonstrate that there does exist a good low-dimensional representation space. Methods to reduce the computational requirements, the primary limiting factor of this method, are discussed.

...read moreread less

200 citations

Proceedings Article•DOI•

An OCR system to read two Indian language scripts: Bangla and Devnagari (Hindi)

[...]

Bidyut B. Chaudhuri, Umapada Pal¹•Institutions (1)

Indian Statistical Institute¹

18 Aug 1997

TL;DR: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent, and shows a good performance for single font scripts printed on clear documents.

...read moreread less

Abstract: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent. These scripts, having the same origin in ancient Brahmi script, have many features in common and hence a single system can be modeled to recognize them. In the proposed model, document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, character grouping into basic, modifier and compound character category are done for both scripts by the same set of algorithms. The feature sets and classification tree as well as the knowledge base required for error correction (such as lexicon) differ for Bangla and Devnagari. The system shows a good performance for single font scripts printed on clear documents.

...read moreread less

198 citations

Journal Article•DOI•

Neural-network classifiers for recognizing totally unconstrained handwritten numerals

[...]

Sung-Bae Cho¹•Institutions (1)

Yonsei University¹

01 Jan 1997-IEEE Transactions on Neural Networks

TL;DR: Three sophisticated neural-network classifiers to solve complex pattern recognition problems: multiple multilayer perceptron (MLP) classifiers, hidden Markov model (HMM)/MLP hybrid classifier, and structure-adaptive self-organizing map (SOM) classifier are presented.

...read moreread less

Abstract: Artificial neural networks have been recognized as a powerful tool for pattern classification problems, but a number of researchers have also suggested that straightforward neural-network approaches to pattern recognition are largely inadequate for difficult problems such as handwritten numeral recognition. In this paper, we present three sophisticated neural-network classifiers to solve complex pattern recognition problems: multiple multilayer perceptron (MLP) classifier, hidden Markov model (HMM)/MLP hybrid classifier, and structure-adaptive self-organizing map (SOM) classifier. In order to verify the superiority of the proposed classifiers, experiments were performed with the unconstrained handwritten numeral database of Concordia University, Montreal, Canada. The three methods have produced 97.35%, 96.55%, and 96.05% of the recognition rates, respectively, which are better than those of several previous methods reported in the literature on the same database.

...read moreread less

192 citations

Patent•

Portable data collection device having optical character recognition

[...]

Jianhua Xu

10 Oct 1997

TL;DR: In this paper, a portable data collection device providing for optical character recognition is described, which includes an internal region and includes a user handle that allows a user to position the housing relative to an indicia carrying target.

...read moreread less

Abstract: A portable data collection device providing for optical character recognition. A housing defines an internal region and includes a user handle that allows a user to position the housing relative to an indicia carrying target. An imaging assembly includes a two dimensional imaging array supported within the internal region of the housing. The imaging assembly includes a capture circuit that generates a video signal representative of an image of a target zone. An optics assembly supported by the housing focuses an image of the target area onto the photosensor array. A character recognition processing circuit receives the video signal and categorizes the indicia on the target into a set of predefined characters. The character recognition processing circuit includes a discriminator for identifying a text region of the target and identifying individual character regions within the text region and a categorizer for identifying a character from a set of possible characters for an individual character region. The categorizer performs one or more tests based on pixel data within the individual character region.

...read moreread less

192 citations

Proceedings Article•DOI•

A retargetable table reader

[...]

John H. Shamilian¹, Henry S. Baird, T.L. Wood•Institutions (1)

Bell Labs¹

18 Aug 1997

TL;DR: The architecture of a system for reading machine-printed documents in known predefined tabular-data layout styles, and algorithms for identifying and segmenting records with known layout, and integration of these algorithms with a graphical user interface (GUI) for defining new layouts are described.

...read moreread less

Abstract: We describe the architecture of a system for reading machine-printed documents in known predefined tabular-data layout styles. In these tables, textual data are presented in record lines made up of fixed-width fields. Tables often do not rely on line-art (ruled lines) to delimit fields, and in this way differ crucially from fixed forms. Our system performs these steps: copes with multiple tables per page; identifies records within tables; segments records into fields; and recognizes characters within fields, constrained by field-specific contextual knowledge. Obstacles to good performance on tables include small print, tight line-spacing, poor-quality text (such as photocopies), and line-art or background patterns that touch the text. Precise skew-correction and pitch-estimation, and high-performance OCR using neural nets proved crucial in overcoming these obstacles. The most significant technical advances in this work appear to be algorithms for identifying and segmenting records with known layout, and integration of these algorithms with a graphical user interface (GUI) for defining new layouts. This GUI has been ergonomically designed to make efficient and intuitive use of exemplary images, so that the skill and manual effort required to retarget the system to new table layouts are held to a minimum. The system has been applied in this way to more than 400 distinct tabular layouts. During the last three years the system has read over fifty million records with high accuracy.

...read moreread less

142 citations

Proceedings Article•DOI•

Combining multiple representations and classifiers for pen-based handwritten digit recognition

[...]

F. Alimoglu¹, Ethem Alpaydin¹•Institutions (1)

Boğaziçi University¹

18 Aug 1997

TL;DR: This work investigates techniques to combine multiple representations of a handwritten digit to increase classification accuracy without significantly increasing system complexity or recognition time and implements and compares voting, mixture of experts, stacking and cascading.

...read moreread less

Abstract: We investigate techniques to combine multiple representations of a handwritten digit to increase classification accuracy without significantly increasing system complexity or recognition time. We compare multiexpert and multistage combination techniques and discuss in detail in a comparative manner methods for combining multiple learners: voting, mixture of experts, stacking, boosting and cascading. In pen based handwritten character recognition, the input is the dynamic movement of the pentip over the pressure sensitive tablet. There is also the image formed as a result of this movement. On a real world database, we notice that the two multi layer perceptron (MLP) neural network based classifiers using these representations separately make errors on different patterns, implying that a suitable combination of the two would lead to higher accuracy. Thus we implement and compare voting, mixture of experts, stacking and cascading. Combined classifiers have an error percentage less than individual ones. The final combined system of two MLPs has less complexity and memory requirement than a single k nearest neighbor using one of the representations.

...read moreread less

122 citations

Journal Article•DOI•

Neural and fuzzy methods in handwriting recognition

[...]

Paul D. Gader¹, James M. Keller¹, Raghu Krishnapuram¹, Jung-Hsien Chiang, Magdi A. Mohamed¹ - Show less +1 more•Institutions (1)

University of Missouri¹

01 Feb 1997-IEEE Computer

TL;DR: This article describes the application of neural and fuzzy methods to three problems: recognition of handwritten words; recognition of numeric fields; and location of handwritten street numbers in address images.

...read moreread less

Abstract: Handwriting recognition requires tools and techniques that recognize complex character patterns and represent imprecise, common-sense knowledge about the general appearance of characters, words and phrases. Neural networks and fuzzy logic are complementary tools for solving such problems. Neural networks, which are highly nonlinear and highly interconnected for processing imprecise information, can finely approximate complicated decision boundaries. Fuzzy set methods can represent degrees of truth or belonging. Fuzzy logic encodes imprecise knowledge and naturally maintains multiple hypotheses that result from the uncertainty and vagueness inherent in real problems. By combining the complementary strengths of neural and fuzzy approaches into a hybrid system, we can attain an increased recognition capability for solving handwriting recognition problems. This article describes the application of neural and fuzzy methods to three problems: recognition of handwritten words; recognition of numeric fields; and location of handwritten street numbers in address images.

...read moreread less

Journal Article•DOI•

Skew detection and text line position determination in digitized documents

[...]

Basilios Gatos¹, Nikos Papamarkos¹, Christodoulos Chamzas¹•Institutions (1)

Democritus University of Thrace¹

01 Sep 1997-Pattern Recognition

TL;DR: A computationally efficient procedure for skew detection and text line position determination in digitized documents, which is based on the cross-correlation between the pixels of vertical lines in a document, which provides good and accurate results while it requires only a short computational time.

...read moreread less

Proceedings Article•DOI•

Number plate reading using computer vision

[...]

João Barroso, Erik L. Dagless¹, A. Rafael², José Bulas-Cruz³•Institutions (3)

University of Bristol¹, University of Aveiro², University of Trás-os-Montes and Alto Douro³

07 Jul 1997

TL;DR: For the location of the number plate area in the image, a new line-based method has been developed and the segmentation of the characters is derived from a technique first proposed by Lu (1995).

...read moreread less

Abstract: The main tasks of a number plate recognition system are the location of the number plate area in the image, the segmentation of the characters and their identification. These tasks are strongly inter-related, mainly because the way to check if the number plate has been correctly located is based on the result of the character identification process (it should correspond to a predefined syntax). Algorithmic improvements to previous versions of the system, based on the results of intensive testing, are described in this paper. For the location of the number plate area in the image, a new line-based method has been developed. The method, instead of looking for character like shapes in the image, takes advantage of the "signature" of the number plate area in a horizontal cross-section of the image. The method used for the segmentation of the characters is derived from a technique first proposed by Lu (1995). The identification of the characters uses the OCR engine developed by Barroso et al. (1995), based on the critical points method.

...read moreread less

Book Chapter•DOI•

Script and language identification from document images

[...]

G. S. Peake¹, Tieniu Tan¹•Institutions (1)

University of Reading¹

20 Jun 1997

TL;DR: In this paper, a uniform text block on which texture analysis can be performed is produced from a document image via simple processing using multiple channel (Gabor) filters and grey level co-occurrence matrices.

...read moreread less

Abstract: In this paper we present a detailed review of current script and language identification techniques. The main criticism of the existing techniques is that most of them rely on either connected component analysis or character segmentation. We go on to present a new method based on texture analysis for script identification which does not require character segmentation. A uniform text block on which texture analysis can be performed is produced from a document image via simple processing. Multiple channel (Gabor) filters and grey level co-occurrence matrices are used in independent experiments in order to extract texture features. Classification of test documents is made based on the features of training documents using the K-NN classifier. Initial results of over 95% accuracy on the classification of 105 rest decrements from 7 scripts are very promising. The method shows robustness with respect to noise, the presence of foreign characters or numerals, and can be applied to very small amounts of text.

...read moreread less

Patent•

Apparatus and method for OCR character and confidence determination using multiple OCR devices

[...]

Roger B. Bradford¹•Institutions (1)

Science Applications International Corporation¹

13 Jan 1997

TL;DR: In an optical character recognition (OCR) system an improved method and apparatus for recognizing the character and producing an indication of the confidence with which the character has been recognized as mentioned in this paper.

...read moreread less

Abstract: In an optical character recognition (OCR) system an improved method and apparatus for recognizing the character and producing an indication of the confidence with which the character has been recognized. The system employs a plurality of different OCR devices each of which outputs a indicated (or recognized) character along with the individual devices own determination of how confident it is in the indication. The OCR system uses that data output from each of the different OCR devices along with other attributes of the indicated character such as the relative accuracy of the particular OCR device indicating the character to choose the select character recognized by the system and to produce a combined confidence indication of how confident the system is in its recognition.

...read moreread less

Book Chapter•DOI•

Probabilistic Retrieval of OCR Degraded Text Using N-Grams

[...]

Stephen M. Harding¹, W. Bruce Croft¹, C. Weir•Institutions (1)

University of Massachusetts Amherst¹

01 Sep 1997

TL;DR: A web based retrieval application using n-gram retrieval of OCR text and display, with query term highlighting, of the source document image is described, which was less effective but can likely be improved with alternative query component weighting schemes and measures of term similarity.

...read moreread less

Abstract: The retrieval of OCR degraded text using n-gram formulations within a probabilistic retrieval system is examined in this paper. Direct retrieval of documents using n-gram databases of 2 and 3-grams or 2, 3, 4 and 5-grams resulted in improved retrieval performance over standard (word based) queries on the same data when a level of 10 percent degradation or worse was achieved. A second method of using n-grams to identify appropriate matching and near matching terms for query expansion which also performed better than using standard queries is also described. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web based retrieval application using n-gram retrieval of OCR text and display, with query term highlighting, of the source document image is described.

...read moreread less

Journal Article•DOI•

Large-scale simulation studies in image pattern recognition

[...]

Tin Kam Ho¹, Henry S. Baird•Institutions (1)

Bell Labs¹

01 Oct 1997-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Three closely related studies of machine-printed character recognition that rely on synthetic data generated pseudo-randomly in accordance with an explicit stochastic model of document image degradations are presented.

...read moreread less

Abstract: Many obstacles to progress in image pattern recognition result from the fact that per-class distributions are often too irregular to be well-approximated by simple analytical functions. Simulation studies offer one way to circumvent these obstacles. We present three closely related studies of machine-printed character recognition that rely on synthetic data generated pseudo-randomly in accordance with an explicit stochastic model of document image degradations. The unusually large scale of experiments - involving several million samples that makes this methodology possible have allowed us to compute sharp estimates of the intrinsic difficulty (Bayes risk) of concrete image recognition problems, as well as the asymptotic accuracy and domain of competency of classifiers.

...read moreread less

Patent•

Automatic language identification system for multilingual optical character recognition

[...]

Mindy R. Bokser, Chan Choy, Tapas Kanungo, Leonard K. Pon, Jun Yang - Show less +1 more

20 Nov 1997

TL;DR: In this article, a dictionary-based approach to identify languages within different zones in a multi-lingual document is presented. But the method is limited in the sense that it requires the dictionary to be associated with various candidate languages, and the language that exhibits the highest confidence factor is initially identified as the zone.

...read moreread less

Abstract: The disclosed invention utilizes a dictionary-based approach to identify languages within different zones in a multi-lingual document. As a first step, a document image is segmented into various zones, regions and word tokens, using suitable geometric propertis. Within each zone, the word tokens are compared to dictionaries associated with various candidate languages, and the language that exhibits the highest confidence factor is initially identified as the langage of the zone. Subsequently, each zone is further split into regions. The language for each region is then identified, using the confidence factors for the words of that region. For any language determination having a low confidence value, the previously determined language of the zone is employed to assist the identification process.

...read moreread less

Proceedings Article•DOI•

The performance evaluation of thresholding algorithms for optical character recognition

[...]

A.T. Abak¹, U. Baris, Bulent Sankur•Institutions (1)

Scientific and Technological Research Council of Turkey¹

18 Aug 1997

TL;DR: The paper presents performance evaluation of thresholding algorithms in the context of document analysis and character recognition systems using Hausdorff, Jaccard, and Yule measures to measure the similarity between thresholded bitmaps and original bitmaps of characters.

...read moreread less

Abstract: The paper presents performance evaluation of thresholding algorithms in the context of document analysis and character recognition systems. Several thresholding algorithms are comparatively evaluated on the basis of the original bitmaps of characters. Different distance measures such as Hausdorff, Jaccard, and Yule are used to measure the similarity between thresholded bitmaps and original bitmaps of characters.

...read moreread less

Patent•

Confusion matrix based method and system for correcting misrecognized words appearing in documents generated by an optical character recognition technique

[...]

Randy G. Goldberg¹•Institutions (1)

AT&T¹

11 Aug 1997

TL;DR: In this article, a method and apparatus for correcting misrecognized words appearing in electronic documents that have been generated by scanning an original document in accordance with an optical character recognition ("OCR") technique is presented.

...read moreread less

Abstract: A method and apparatus for correcting misrecognized words appearing in electronic documents that have been generated by scanning an original document in accordance with an optical character recognition ("OCR") technique. If an incorrect word is found in the electronic document, the present invention generates at least one reference word and selects the reference word that is the most likely correct replacement for the incorrect word. This selection is accomplished by performing a probabilistic determination that assigns to each reference word a replacement word recognition probability. The probabilistic determination is carried out on the basis of a pre-stored confusion matrix that stores a plurality of probability values. The confusion matrix is used to associate each character of recognized word in the electronic document with a corresponding character of a word in the original document on the basis of these probability values.

...read moreread less

Proceedings Article•DOI•

Robust multifont OCR system from gray level images

[...]

F. LeBourgeois¹•Institutions (1)

Vision Institute¹

18 Aug 1997

TL;DR: The paper presents a general robust OCR system designed for practical use and suited to unconstrained gray-level images grabbed from a CCD camera, with minimum assumptions on font, text location, size, color and the background scene.

...read moreread less

Abstract: The paper presents a general robust OCR system designed for practical use and suited to unconstrained gray-level images grabbed from a CCD camera. The system works with minimum assumptions on font, text location, size, color and the background scene. The text blocks localization in complex scenes using a specific filter which enhances any text from the background without binarization. A special stage is designed to separate characters, even touched by using gray-level information. The authors also extract gray-level features which make the algorithm more reliable, in particular under poor printing conditions or bad contrast digitization.

...read moreread less

Journal Article•DOI•

Hybrid fuzzy-neural systems in handwritten word recognition

[...]

Jung-Hsien Chiang¹, Paul D. Gader¹•Institutions (1)

University of Missouri¹

01 Nov 1997-IEEE Transactions on Fuzzy Systems

TL;DR: Two hybrid fuzzy neural systems are developed and applied to handwritten word recognition and the combination of the two outperforms the individual systems with a small increase in computational cost over the MLP system.

...read moreread less

Abstract: Two hybrid fuzzy neural systems are developed and applied to handwritten word recognition. The word recognition system requires a module that assigns character class membership values to segments of images of handwritten words. The module must accurately represent ambiguities between character classes and assign low membership values to a wide variety of noncharacter segments resulting from erroneous segmentations. Each hybrid is a cascaded system. The first stage of both is a self-organizing feature map (SOFM). The second stages map distances into membership values. The third stage of one system is a multilayer perceptron (MLP). The third stage of the other is a bank of Choquet fuzzy integrals (FI). The two systems are compared individually and as a combination to the baseline system. The new systems each perform better than the baseline system. The MLP system slightly outperforms the FI system, but the combination of the two outperforms the individual systems with a small increase in computational cost over the MLP system. Recognition rates of over 92% are achieved with a lexicon set having average size of 100. Experiments were performed on a standard test set from the SUNY/USPS CD-ROM database.

...read moreread less

Proceedings Article•DOI•

Using character shape coding for information retrieval

[...]

Alan F. Smeaton¹, A.L. Spitz•Institutions (1)

Dublin City University¹

18 Aug 1997

TL;DR: A technique for performing information retrieval on document images in such a manner that the accuracy has great utility is developed, and a surprisingly good result is obtained.

...read moreread less

Abstract: In conventional information retrieval the task of finding users' search terms in a document is simple. When the document is not available in machine readable format, optical character recognition (OCR) can usually be performed. We have developed a technique for performing information retrieval on document images in such a manner that the accuracy has great utility. The method makes generalisations about the images of characters, then performs classification of these and agglomerates the resulting character shape codes into word tokens based on character shape coding. These are sufficiently specific in their representation of the underlying words to allow reasonable performance of retrieval. Using a collection of over 250 Mbytes of document texts and queries with known relevance assessments, we present a series of experiments to determine how various parameters in the retrieval strategy affect retrieval performance and we obtain a surprisingly good result.

...read moreread less

Proceedings Article•DOI•

HMM word recognition engine

[...]

D. Guillevic¹, Ching Y. Suen•Institutions (1)

Concordia University¹

18 Aug 1997

TL;DR: A hidden Markov model (HMM) based word recognition engine being developed to be integrated with the CENPARMI bank cheque processing system is described and preliminary results are compared with the previous global feature recognition scheme.

...read moreread less

Abstract: We describe a hidden Markov model (HMM) based word recognition engine being developed to be integrated with the CENPARMI bank cheque processing system. The various modules are described in detail, and preliminary results are compared with our previous global feature recognition scheme. The engine is tested on words from a database of over 4,500 cheques of 1,400 writers.

...read moreread less

Proceedings Article•DOI•

Optical formula recognition

[...]

Stéphane Lavirotte, Loïc Pottier

18 Aug 1997

TL;DR: This approach clearly separate OCR step, geometrical treatments and syntactic analysis, and defines a class of context sensitive graph grammars for mathematical formulas, and shows how to remove their ambiguities to define efficient parsing.

...read moreread less

Abstract: The paper describes the design and the first steps of implementation of OFR (optical formula recognition), a system for extracting and understanding mathematical expressions in printed documents. Our approach clearly separate OCR step, geometrical treatments and syntactic analysis. We focus on the third part: we define a class of context sensitive graph grammars for mathematical formulas, study their properties and show how to remove their ambiguities (by adding contexts in rules) to define efficient parsing. This method is based on a "critical pairs" approach in the sense of Knuth-Bendix algorithm.

...read moreread less

Patent•

Method and apparatus for performing an automatic correction of misrecognized words produced by an optical character recognition technique by using a Hidden Markov Model based algorithm

[...]

Randy G. Goldberg¹•Institutions (1)

AT&T¹

11 Aug 1997

TL;DR: In this paper, a method and apparatus for correcting misrecognized words appearing in electronic documents that have been generated by scanning an original document in accordance with an optical character recognition (OCR) technique is presented.

...read moreread less

Abstract: A method and apparatus for correcting misrecognized words appearing in electronic documents that have been generated by scanning an original document in accordance with an optical character recognition (“OCR”) technique. Each recognized word is generated by first producing, for each character position of the corresponding word in the original document, the N-best characters for occupying that character position. If an incorrect word is found in the electronic document, the present invention generates a plurality of reference words from which one is selected for replacing the incorrect word. This selected reference word is determined by the present invention to be the reference word that is the most likely correct replacement for the incorrect recognized word. This selection is accomplished by computing for each reference word a replacement word value. The reference word that is selected to replace the incorrect recognized word corresponds to the highest replacement word value.

...read moreread less

Patent•

Method and apparatus for enhancing optical character recognition

[...]

Philip Silvano DiPiazza¹, Thomas C. Redman¹•Institutions (1)

AT&T¹

14 Oct 1997

TL;DR: In this paper, a method and apparatus for enhancing optical character recognition comprises a data processor and memory for maintaining an error detection and correction log, which is used for real-time learning.

...read moreread less

Abstract: A method and apparatus for enhancing optical character recognition comprises a data processor and memory for maintaining an error detection and correction log. The data processor maintains a memory table of a plurality of rules for generating a rule base determined by recognition of a particular context type of an electronic bit-map portion. The appropriate rule base comprises rules and combinations of rules for application to bit-map portion data. A rule, a rule base or data may be selected and obtained from an internal or external memory. Upon application of the rule base, the error detection and correction log maintains a record of clear errors, corrected data, failed rules of the rule base and the original bit map. Possible errors are flagged and clear errors are automatically corrected provided a confidence level in the correction is reached or exceeded. Through validation by a source or from operator intervention, real-time learning is obtained from detecting and/or correcting errors or updating the rules of the rule table and the data upon which they operate. Through recognition of patterns of errors in the error detection and correction log, new rules may be generated for storage in the rule memory table, erroneous data corrected or incomplete data of the received forms or data fields themselves completed through context based analysis.

...read moreread less

Proceedings Article•DOI•

Retrieval methods for English-text with missrecognized OCR characters

[...]

Manabu Ohta¹, Atsuhiro Takasu¹, J. Adachi¹•Institutions (1)

University of Tokyo¹

18 Aug 1997

TL;DR: Three probabilistic text retrieval methods designed to carry out a full-text search of English documents containing OCR errors can tolerate such errors, and therefore costly manual post-editing is not required after OCR recognition.

...read moreread less

Abstract: This paper presents three probabilistic text retrieval methods designed to carry out a full-text search of English documents containing OCR errors. By searching for any query term on the premise that there are errors in the recognized text, the methods presented can tolerate such errors, and therefore costly manual post-editing is not required after OCR recognition. In the applied approach, confusion matrices are used to store characters which are likely to be interchanged when a particular character is missrecognized, and the respective probability of each occurrence. Moreover, a 2-gram matrix is used to store probabilities of character connection, i.e., which letter is likely to come after another. Multiple search terms are generated for an input query term by making reference to confusion matrices, after which a full-text search is run for each search term. The validity of retrieved terms is determined based on error-occurrence and character connection probabilities. The performance of these methods is experimentally evaluated by determining retrieval effectiveness, i.e., by calculating recall and precision rates. Results indicate marked improvement in comparison with exact matching.

...read moreread less

Proceedings Article•DOI•

Perceptual model of handwriting drawing. Application to the handwriting segmentation problem

[...]

Eric Anquetil¹, G. Lorette•Institutions (1)

University of Rennes 1¹

18 Aug 1997

TL;DR: A new handwriting modeling and segmentation approach is introduced for cursive letter and word analysis based on the detection of a set of "perceptual anchorage points" to extract a priori pertinent strokes.

...read moreread less

Abstract: A new handwriting modeling and segmentation approach is introduced for cursive letter and word analysis. For the letter analysis, the proposed method is based on the detection of a set of "perceptual anchorage points" to extract a priori pertinent strokes. This physical segmentation of the handwritten drawing enables us to conduct a logical modeling of letters with respect to the most stable strokes of each letter class. For the handwritten word analysis, we present a constructive segmentation approach to overcome the word segmentation problem. The main idea is to locate "anchorage structures" in the word drawing based on the most robust strokes of the letters. This new approach of handwriting analysis has been implemented in a writer-independent online handwriting recognition system. Experimental results are reported using a lexicon context of 1128, 7000 and 25,000 words.

...read moreread less

Journal Article•DOI•

A heuristic algorithm for optical character recognition of Arabic script

[...]

A. Alper Atici¹, Fatos T. Yarman-Vural¹•Institutions (1)

Middle East Technical University¹

01 Oct 1997-Signal Processing

TL;DR: In this paper, a heuristic method is developed for segmentation, feature extraction and recognition of the Arabic script, which is part of a large project for transcription of the documents in Ottoman Archives.

...read moreread less

Collapse