Open Access Dissertation

The extraction and recognition of text from multimedia document images

TLDR
It is shown that highly accurate and fast recognition can be achieved using a remarkably small number of carefully chosen features, and that after training on only seven quite similar fonts, the recognition algorithm provides greater than 95% accuracy on fonts different to the training set.
Abstract
Almost all current commercial OCR machines employ matrix matching, resulting in high speed and accuracy, but a severely restricted range of recognized fonts. Published algorithms, conversely, concentrate on feature extraction for font independence, yet they have previously been too slow for commercial use. Current algorithms also fail to distinguish between text and non-text images. This thesis presents a new approach to the automatic extraction of text from multimedia printed documents. An edge detection algorithm, which is capable of extracting the outlines of text from a grey level image, is used to obtain a high level of discrimination between text and non-text. An additional benefit is that text of any colour can be read from almost any background, provided that the contrast is reasonable. The outlines are approximated by polygons using a fast two-stage algorithm. A feature extraction approach to font-independent character recognition is described, which uses these outline polygons. It is shown that highly accurate and fast recognition can be achieved using a remarkably small number of carefully chosen features. The results show that after training on only seven quite similar fonts, the recognition algorithm provides greater than 95% accuracy on fonts different to the training set. A more complex edge extraction algorithm is also described. This is capable of extracting text and line graphics from an arbitrary page. Although not essential for character recognition, this algorithm is useful for the interpretation of engineering drawings. As a further contribution to this problem, a thinning algorithm is defined, which is non-iterative and uses the polygonally approximated outlines from the edge extractor.
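To make the polygon approximation step concrete, the sketch below reduces a traced character outline to a few vertices. The abstract does not specify the thesis's fast two-stage algorithm, so this example substitutes the well-known Ramer-Douglas-Peucker method purely as an illustration. The function names, the tuple-based point format, and the tolerance value are assumptions and are not taken from the thesis.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment (a == b): fall back to point-to-point distance.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify_outline(points, tolerance):
    """Ramer-Douglas-Peucker simplification (a stand-in, not the thesis method).

    points: list of (x, y) tuples along an outline, in traversal order.
    tolerance: maximum allowed deviation, in the same units as the points.
    """
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    worst_index, worst_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > worst_dist:
            worst_index, worst_dist = i, d
    if worst_dist <= tolerance:
        # Every intermediate point is close enough to the chord: drop them.
        return [first, last]
    # Otherwise keep the farthest point and simplify both halves recursively.
    left = simplify_outline(points[: worst_index + 1], tolerance)
    right = simplify_outline(points[worst_index:], tolerance)
    return left[:-1] + right


# Hypothetical outline fragment: pixel-by-pixel trace of an "L"-shaped corner.
outline = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
polygon = simplify_outline(outline, tolerance=0.5)
```

On this hypothetical input the six traced points collapse to the two end points plus the corner vertex. A compact vertex list of this kind is the sort of representation from which outline-based features could then be computed.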


Citations
Proceedings Article

An Overview of the Tesseract OCR Engine

TL;DR: The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview.
Proceedings Article

History of the Tesseract OCR engine: what worked and what didn't

TL;DR: The development history of the Tesseract OCR engine is described, and the methods used are compared to general changes in the field over a similar time period to provide a primer for those interested in OCR research.
Proceedings Article

Symbol recognition with a new autonomously evolving classifier autoclass

TL;DR: A new algorithm for symbol recognition based on the AutoClass classifier, which itself is a version of the evolving fuzzy rule-based classifier eClass in which AnYa type of fuzzy rules and data density are used.
Posted Content

Recognition of handwritten Roman Numerals using Tesseract open source OCR engine

TL;DR: The objective of the paper is to recognize handwritten samples of Roman numerals using Tesseract open source Optical Character Recognition (OCR) engine, trained with data samples of different persons to generate one user-independent language model, representing the handwritten Roman digit-set.
Proceedings Article

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

TL;DR: The authors proposed a post-OCR text correction approach for digitising texts in Romanised Sanskrit using OCR models trained for other languages written in Roman and found that the use of copying mechanism yields a percentage increase of 7.69 in Character Recognition Rate (CRR).
References
Book

Algorithms for Graphics and Image Processing

TL;DR: This chapter discusses Graphics, Image Processing, and Pattern Recognition, and the Reconstruction techniques used in this program, as well as some of the problems faced in implementing this program.
Journal Article

On the Recognition of Printed Characters of Any Font and Size

TL;DR: The current state of a system that recognizes printed text of various fonts and sizes for the Roman alphabet is described, which combines several techniques in order to improve the overall recognition rate.
Journal Article

Fast polygonal approximation of digitized curves

TL;DR: A new technique for fast “scan-along” computation of piecewise linear approximations of digital curves in 2-space is described and the application to the boundaries of the images of a lung and a rib in chest radiographs is illustrated.
Journal Article

Some Parallel Thinning Algorithms for Digital Pictures

TL;DR: It is proved that several algorithms which perform a thinning transformation when applied to the picture in parallel do not change the connectivity properties of the picture.
Journal Article

A vectorizer and feature extractor for document recognition

TL;DR: A thinning algorithm that operates directly on the run-length encoding of a bilevel image and also determines other features that are useful for character recognition: arcs, holes, endpoints, etc.