The extraction and recognition of text from multimedia document images

Open Access Dissertation



Abstract:

Almost all current commercial OCR machines employ matrix matching, which gives high speed and accuracy but a severely restricted range of recognized fonts. Published algorithms, conversely, concentrate on feature extraction for font independence, yet they have previously been too slow for commercial use. Current algorithms also fail to distinguish between text and non-text images. This thesis presents a new approach to the automatic extraction of text from multimedia printed documents. An edge detection algorithm, capable of extracting the outlines of text from a grey-level image, is used to obtain a high level of discrimination between text and non-text. An additional benefit is that text of any colour can be read from almost any background, provided that the contrast is reasonable. The outlines are approximated by polygons using a fast two-stage algorithm. A feature extraction approach to font-independent character recognition is described, which uses these outline polygons. It is shown that highly accurate and fast recognition can be achieved using a remarkably small number of carefully chosen features. The results show that, after training on only seven quite similar fonts, the recognition algorithm provides greater than 95% accuracy on fonts different to the training set. A more complex edge extraction algorithm is also described, which is capable of extracting text and line graphics from an arbitrary page. Although not essential for character recognition, this algorithm is useful for the interpretation of engineering drawings. As a further contribution to this problem, a non-iterative thinning algorithm is defined, which uses the polygonally approximated outlines from the edge extractor.
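The abstract does not describe the thesis's two-stage polygonal approximation, so the following is only an illustrative sketch of the general idea of reducing a digitized outline to a polygon. It uses the well-known Ramer-Douglas-Peucker method as a stand-in, not the author's algorithm. The function names `point_segment_distance` and `approximate_outline`, and the `tolerance` parameter, are assumptions introduced for this example.

```python
import math


def point_segment_distance(p, a, b):
    """Return the Euclidean distance from point p to the segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def approximate_outline(points, tolerance):
    """Approximate an open chain of outline points by a polygon.

    Ramer-Douglas-Peucker: keep the end points, find the point farthest
    from the chord joining them, and recurse on both halves if that
    point lies farther away than `tolerance` (hypothetical parameter).
    """
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            max_dist, index = d, i
    if max_dist <= tolerance:
        return [first, last]
    left = approximate_outline(points[: index + 1], tolerance)
    right = approximate_outline(points[index:], tolerance)
    # Drop the split point from the left half; the right half repeats it.
    return left[:-1] + right


# Pixel-chain outline of a right-angled corner, one point per pixel.
corner = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
polygon = approximate_outline(corner, tolerance=0.5)
```

In this sketch the seven-point pixel chain collapses to the three vertices of the corner. A recognizer built on outline polygons would then compute its features from vertices like these rather than from every edge pixel.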

