Author

Raymond Wensley Smith

Bio: Raymond Wensley Smith is an academic researcher. The author has an h-index of 1 and has authored 1 publication, which has received 35 citations.

Papers
Dissertation
01 Jan 1987
TL;DR: It is shown that highly accurate and fast recognition can be achieved using a remarkably small number of carefully chosen features, and that after training on only seven quite similar fonts, the recognition algorithm provides greater than 95% accuracy on fonts different to the training set.
Abstract: Almost all the current commercial OCR machines employ matrix matching, resulting in high speed and accuracy, but a severely restrictive range of recognized fonts. Published algorithms, conversely, concentrate on feature extraction for font independence, yet they have previously been too slow for commercial use. Current algorithms also fail to distinguish between text and non-text images. This thesis presents a new approach to the automatic extraction of text from multimedia printed documents. An edge detection algorithm, which is capable of extracting the outlines of text from a grey level image, is used to obtain a high level of discrimination between text and non-text. An additional benefit is that text of any colour can be read from almost any background, provided that the contrast is reasonable. The outlines are approximated by polygons using a fast two-stage algorithm. A feature extraction approach to font independent character recognition is described, which uses these outline polygons. It is shown that highly accurate and fast recognition can be achieved using a remarkably small number of carefully chosen features. The results show that after training on only seven quite similar fonts, the recognition algorithm provides greater than 95% accuracy on fonts different to the training set. A more complex edge extraction algorithm is also described. This is capable of extracting text and line graphics from an arbitrary page. Although not essential for character recognition, this algorithm is useful for the interpretation of engineering drawings. As a further contribution to this problem, a thinning algorithm is defined, which is non-iterative and uses the polygonal approximated outlines from the edge extractor.
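The thesis's own edge detector and two-stage polygon fit are not reproduced in the abstract above, so the following is only a rough sketch of an analogous outline-and-polygon pipeline built on OpenCV; cv2.Canny, cv2.findContours and cv2.approxPolyDP are stand-ins chosen for illustration, not the author's algorithms.

```python
# Illustrative sketch only: an analogous outline-extraction and polygonal-
# approximation pipeline using OpenCV, NOT the thesis's own algorithms.
import cv2

def outline_polygons(grey_image_path, epsilon_frac=0.01):
    """Extract outlines from a grey-level image and approximate each by a polygon."""
    grey = cv2.imread(grey_image_path, cv2.IMREAD_GRAYSCALE)
    # Edge detection on the grey-level image (stand-in for the thesis's
    # grey-level edge detector).
    edges = cv2.Canny(grey, 50, 150)
    # Outlines of connected edge structures.
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    polygons = []
    for contour in contours:
        # Polygonal approximation; the thesis uses a fast two-stage algorithm,
        # approxPolyDP is used here purely for illustration.
        eps = epsilon_frac * cv2.arcLength(contour, True)
        polygons.append(cv2.approxPolyDP(contour, eps, True))
    return polygons
```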

35 citations


Cited by
Proceedings ArticleDOI
Ray Smith
23 Sep 2007
TL;DR: The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview.
Abstract: The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, including in particular the line finding, features/classification methods, and the adaptive classifier.
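As a usage note rather than anything from the paper itself: the open-source descendant of the engine described here can be exercised through the pytesseract wrapper. The snippet below assumes Tesseract and pytesseract are installed locally, and the input file name is hypothetical.

```python
# Usage sketch: running the open-source Tesseract engine through pytesseract.
from PIL import Image
import pytesseract

image = Image.open("page.png")  # hypothetical input page image
text = pytesseract.image_to_string(image, lang="eng")
print(text)
```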

1,530 citations

Proceedings ArticleDOI
Ray Smith
04 Feb 2013
TL;DR: The development history of the Tesseract OCR engine is described, and the methods used are compared to general changes in the field over a similar time period to provide a primer for those interested in OCR research.
Abstract: This paper describes the development history of the Tesseract OCR engine, and compares the methods to general changes in the field over a similar time period. Emphasis is placed on the lessons learned with the goal of providing a primer for those interested in OCR research.

42 citations

Proceedings ArticleDOI
02 Jun 2014
TL;DR: A new algorithm for symbol recognition is proposed, based on the AutoClass classifier, which is itself a version of the evolving fuzzy rule-based classifier eClass using AnYa-type fuzzy rules and data density.
Abstract: A new algorithm for symbol recognition is proposed in this paper. It is based on the AutoClass classifier [1], [2], which is itself a version of the evolving fuzzy rule-based classifier eClass [3] in which AnYa-type fuzzy rules [1] and data density are used. In this classifier, the symbol recognition task is divided into two stages: feature extraction, and recognition based on the feature vector. This approach gives flexibility, allowing various feature sets to be used with one classifier. The feature extraction is performed by means of gist image descriptors [4] augmented by several additional features. In this method, we map the symbol images into the feature space and then apply the AutoClass classifier in order to recognise them. Unlike many state-of-the-art algorithms, the proposed algorithm is evolving, i.e. it is capable of incremental learning and of changing its structure during the training phase. The classifier update is performed sample by sample, and the training set does not need to be memorised to provide recognition or further updates. This makes it possible to adapt the classifier to broadening and changing data sets, which is especially useful for improving large-scale systems during exploitation. Moreover, the classifier is computationally cheap, and it has shown stable recognition time as the training data set grows, which is extremely important for online applications.
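The paper's AutoClass/eClass machinery is not reproduced here; as a loose illustration of the sample-by-sample update idea (learning without memorising the training set), the sketch below uses a simple nearest-prototype learner with running-mean updates. The class and its methods are illustrative assumptions, not the authors' classifier.

```python
# Illustrative sketch of incremental, sample-by-sample learning in the spirit
# described above; a nearest-prototype learner, NOT the paper's AutoClass/eClass.
import numpy as np

class IncrementalPrototypeClassifier:
    def __init__(self):
        self.prototypes = {}  # label -> running mean feature vector
        self.counts = {}      # label -> number of samples seen

    def update(self, features, label):
        """Update the per-class prototype from one sample; the training set
        itself never needs to be stored."""
        x = np.asarray(features, dtype=float)
        if label not in self.prototypes:
            self.prototypes[label] = x.copy()
            self.counts[label] = 1
        else:
            self.counts[label] += 1
            # Running mean update: mean += (x - mean) / n
            self.prototypes[label] += (x - self.prototypes[label]) / self.counts[label]

    def predict(self, features):
        x = np.asarray(features, dtype=float)
        # Assign the class whose prototype is nearest in feature space.
        return min(self.prototypes, key=lambda c: np.linalg.norm(x - self.prototypes[c]))
```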

15 citations

Posted Content
TL;DR: The objective of the paper is to recognize handwritten samples of Roman numerals using Tesseract open source Optical Character Recognition (OCR) engine, trained with data samples of different persons to generate one user-independent language model, representing the handwritten Roman digit-set.
Abstract: The objective of the paper is to recognize handwritten samples of Roman numerals using the Tesseract open source Optical Character Recognition (OCR) engine. Tesseract is trained with data samples from different persons to generate one user-independent language model representing the handwritten Roman digit-set. The system is trained with 1226 digit samples collected from different users. The performance is tested on two different datasets, one consisting of samples collected from the known users (those who prepared the training data samples) and the other consisting of handwritten data samples of unknown users. The overall recognition accuracies obtained on these two test datasets are 92.1% and 86.59%, respectively.
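As an evaluation-side sketch only: once a Tesseract language model for the handwritten digit set has been trained, per-sample accuracy on isolated digit images could be measured roughly as below via pytesseract. The language code "rom", the accuracy helper, and the sample layout are assumptions; the paper does not specify its tooling.

```python
# Evaluation sketch only: scoring a trained Tesseract model on isolated digit
# images. Assumes a matching rom.traineddata is installed (hypothetical name).
from PIL import Image
import pytesseract

def accuracy(samples):
    """samples: list of (image_path, expected_label) pairs."""
    correct = 0
    for path, expected in samples:
        # --psm 10 tells Tesseract to treat the image as a single character.
        predicted = pytesseract.image_to_string(
            Image.open(path), lang="rom", config="--psm 10").strip()
        correct += int(predicted == expected)
    return correct / len(samples)
```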

15 citations

Proceedings ArticleDOI
06 Sep 2018
TL;DR: The authors propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit using OCR models trained for other languages written in the Roman script, and find that the use of a copying mechanism yields a percentage increase of 7.69 in Character Recognition Rate (CRR).
Abstract: We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources, our approach uses OCR models trained for other languages written in the Roman script. Currently, no dataset is available for Romanised Sanskrit OCR, so we bootstrap a dataset of 430 images, scanned in two different settings, along with their corresponding ground truth. For training, we synthetically generate training images for both settings. We find that the use of a copying mechanism (Gu et al., 2016) yields a percentage increase of 7.69 in Character Recognition Rate (CRR) over the current state-of-the-art model for monotone sequence-to-sequence tasks (Schnober et al., 2016). We find that our system is robust in combating OCR-prone errors, as it obtains a CRR of 87.01% from an OCR output with a CRR of 35.76% for one of the dataset settings. A human judgement survey performed on the models shows that our proposed model produces predictions that a human can comprehend and improve faster than those of the other systems.
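For reference, Character Recognition Rate is commonly computed from edit distance; the sketch below shows one plausible formulation (1 minus the Levenshtein distance divided by the ground-truth length), which may differ in detail from the exact formula used in the paper.

```python
# Sketch of a Character Recognition Rate (CRR) metric based on Levenshtein
# edit distance; the paper's exact definition may differ.
def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def crr(ground_truth, ocr_output):
    """CRR = 1 - edit_distance / number_of_ground_truth_characters."""
    return 1.0 - levenshtein(ground_truth, ocr_output) / max(len(ground_truth), 1)
```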

8 citations