Open Access Dissertation
The extraction and recognition of text from multimedia document images
TLDR
It is shown that highly accurate and fast recognition can be achieved using a remarkably small number of carefully chosen features, and that after training on only seven quite similar fonts, the recognition algorithm provides greater than 95% accuracy on fonts different from the training set.
Abstract:
Almost all current commercial OCR machines employ matrix matching, which gives high speed and accuracy but severely restricts the range of recognized fonts. Published algorithms, conversely, concentrate on feature extraction for font independence, yet they have previously been too slow for commercial use. Current algorithms also fail to distinguish between text and non-text images.
This thesis presents a new approach to the automatic extraction of text from multimedia printed documents. An edge detection algorithm, which is capable of extracting the outlines of text from a grey level image, is used to obtain a high level of discrimination between text and non-text. An additional benefit is that text of any colour can be read from almost any background, provided that the contrast is reasonable. The outlines are approximated by polygons using a fast two-stage algorithm.
A feature extraction approach to font-independent character recognition is described, which uses these outline polygons. It is shown that highly accurate and fast recognition can be achieved using a remarkably small number of carefully chosen features. The results show that after training on only seven quite similar fonts, the recognition algorithm provides greater than 95% accuracy on fonts different from the training set.
A more complex edge extraction algorithm is also described. This is capable of extracting text and line graphics from an arbitrary page. Although not essential for character recognition, this algorithm is useful for the interpretation of engineering drawings. As a further contribution to this problem, a thinning algorithm is defined, which is non-iterative and uses the polygonally approximated outlines from the edge extractor.
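The abstract does not spell out the thesis's fast two-stage polygonal approximation, so the following is only an illustrative sketch of the general idea of reducing a traced outline to a polygon. It uses the well-known Ramer-Douglas-Peucker method as a stand-in, not the author's algorithm. The function names and the `tolerance` parameter are assumptions introduced for this example.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def approximate_polyline(points, tolerance):
    """Ramer-Douglas-Peucker stand-in (NOT the thesis's two-stage method).

    Keeps the end points and recursively keeps the point farthest from
    the chord whenever it deviates by more than `tolerance` pixels.
    """
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    worst_index, worst_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > worst_dist:
            worst_index, worst_dist = i, d
    if worst_dist <= tolerance:
        return [first, last]
    left = approximate_polyline(points[:worst_index + 1], tolerance)
    right = approximate_polyline(points[worst_index:], tolerance)
    # The split point ends `left` and starts `right`; keep it once.
    return left[:-1] + right


# Hypothetical pixel chain tracing two sides of a character stroke.
outline = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
polygon = approximate_polyline(outline, tolerance=0.5)
print(polygon)
```

For a closed character outline, one plausible way to use such a routine is to cut the loop at two well-separated points and approximate each half separately. A feature extractor can then work on a handful of polygon vertices rather than on every edge pixel.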
Citations
Proceedings Article
An Overview of the Tesseract OCR Engine
TL;DR: A comprehensive overview of the Tesseract OCR engine, which was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy.
Proceedings Article
History of the Tesseract OCR engine: what worked and what didn't
TL;DR: The development history of the Tesseract OCR engine is described, and the methods used are compared to general changes in the field over a similar time period to provide a primer for those interested in OCR research.
Proceedings Article
Symbol recognition with a new autonomously evolving classifier autoclass
TL;DR: A new algorithm for symbol recognition based on the AutoClass classifier, which itself is a version of the evolving fuzzy rule-based classifier eClass in which AnYa type of fuzzy rules and data density are used.
Posted Content
Recognition of handwritten Roman Numerals using Tesseract open source OCR engine
TL;DR: The objective of the paper is to recognize handwritten samples of Roman numerals using Tesseract open source Optical Character Recognition (OCR) engine, trained with data samples of different persons to generate one user-independent language model, representing the handwritten Roman digit-set.
Proceedings Article
Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit
TL;DR: The authors propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit using OCR models trained for other languages written in Roman script, and find that the use of a copying mechanism yields a percentage increase of 7.69 in Character Recognition Rate (CRR).
References
Book
Algorithms for Graphics and Image Processing
TL;DR: This chapter discusses Graphics, Image Processing, and Pattern Recognition, and the Reconstruction techniques used in this program, as well as some of the problems faced in implementing this program.
Journal Article
On the Recognition of Printed Characters of Any Font and Size
TL;DR: The current state of a system that recognizes printed text of various fonts and sizes for the Roman alphabet is described, which combines several techniques in order to improve the overall recognition rate.
Journal Article
Fast polygonal approximation of digitized curves
Jack Sklansky, Víctor M. González +1 more
TL;DR: A new technique for fast “scan-along” computation of piecewise linear approximations of digital curves in 2-space is described and the application to the boundaries of the images of a lung and a rib in chest radiographs is illustrated.
Journal Article
Some Parallel Thinning Algorithms for Digital Pictures
R. Stefanelli, Azriel Rosenfeld +1 more
TL;DR: It is proved that several algorithms which perform a thinning transformation when applied to the picture in parallel do not change the connectivity properties of the picture.
Journal Article
A vectorizer and feature extractor for document recognition
TL;DR: A thinning algorithm that operates directly on the run-length encoding of a bilevel image and also determines other features that are useful for character recognition: arcs, holes, endpoints, etc.