Probabilistic Approach for Correction of Optically-Character-Recognized Strings Using Suffix Tree

doi:10.1109/NCVPRIPG.2011.24

Proceedings Article

- pp 74-77

TLDR

An approach for correcting the character recognition errors of an OCR system that recognises Indic scripts; it achieves a maximum error-rate reduction of 33% over a simple character recognition system.

Abstract:

In this paper we present an approach for correcting the character recognition errors of an OCR system that recognises Indic scripts. A suffix tree indexes the lexicon in lexicographical order to facilitate the probabilistic search. To obtain the most probable match for a misrecognised string, the string is compared with substrings (the edges of the suffix tree) using weighted Levenshtein distance as the similarity measure, with the confusion probabilities of characters (Unicode code points) used as substitution costs; the comparison continues until the cost exceeds the specified bound k. Retrieved candidates are sorted and selected on the basis of lowest edit cost. Exploiting this information, the system can correct non-word errors and achieves a maximum error-rate reduction of 33% over a simple character recognition system.
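The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it scans a plain word list instead of walking a suffix tree, and it assumes that a substitution costs `1 - P(confusion)` and that insertions and deletions cost 1, since the abstract does not say how the confusion probabilities map to costs. The names `substitution_cost`, `weighted_levenshtein`, `rank_candidates` and the `confusion` table are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical confusion table: (recognised char, intended char) -> probability.
Confusion = Dict[Tuple[str, str], float]


def substitution_cost(recognised: str, intended: str, confusion: Confusion) -> float:
    """Assumed mapping: likely confusions are cheap, unseen ones cost 1."""
    if recognised == intended:
        return 0.0
    return 1.0 - confusion.get((recognised, intended), 0.0)


def weighted_levenshtein(
    observed: str,
    candidate: str,
    confusion: Confusion,
    k: float,
    indel_cost: float = 1.0,
) -> Optional[float]:
    """Edit cost from observed to candidate, or None if it exceeds k."""
    prev = [j * indel_cost for j in range(len(candidate) + 1)]
    for i, o in enumerate(observed, start=1):
        curr = [i * indel_cost]
        for j, c in enumerate(candidate, start=1):
            curr.append(min(
                prev[j] + indel_cost,                            # delete o
                curr[j - 1] + indel_cost,                        # insert c
                prev[j - 1] + substitution_cost(o, c, confusion),  # substitute
            ))
        # Row minima never decrease, so the final cost cannot drop below
        # this value: stop as soon as every cell exceeds the bound k.
        if min(curr) > k:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= k else None


def rank_candidates(
    observed: str, lexicon: List[str], confusion: Confusion, k: float
) -> List[Tuple[float, str]]:
    """Return (cost, word) pairs within cost k, lowest edit cost first."""
    scored = []
    for word in lexicon:
        cost = weighted_levenshtein(observed, word, confusion, k)
        if cost is not None:
            scored.append((cost, word))
    return sorted(scored)


if __name__ == "__main__":
    confusion = {("l", "i"): 0.8}  # made-up probability for illustration
    print(rank_candidates("maln", ["main", "mail", "rain", "moon"], confusion, 1.0))
```

Because Python strings iterate over Unicode code points, the same sketch applies unchanged to Indic-script strings. A tree-based index such as the suffix tree named in the abstract would let shared prefixes reuse rows of the table and prune whole subtrees once the bound k is exceeded, rather than rescanning every word.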

Citations

Book Chapter

On-line construction of suffix trees

Maxime Crochemore, +1 more

Patent

System and method of character recognition using fully convolutional neural networks

Such Felipe Petroski, +4 more

TL;DR: The authors present a method for extracting symbols from a digitized object and comparing them with a word in a dictionary; the comparison provides a confidence factor, and the method outputs a prediction equal to the word when the confidence factor is greater than a predetermined threshold.

Journal Article

Automatic Text Correction for Devanagari OCR

Atul Kumar, +1 more

- 09 Dec 2016 -

Indian journal of science and technology

TL;DR: A new technique is proposed for correcting errors made by a Devanagari OCR (Optical Character Reader) system; it is based on a confusion matrix and provides suggestions with the correct words at the top position.

References

Unsupervised Post-Correction of OCR Errors

Fachgebiet Wissensbasierte

TL;DR: The approach combines several methods for retrieving the best correction proposal for a misspelled word: A general spelling correction (Anagram Hash), a new OCR adapted method based on the shape of characters (OCR-Key) and context information (bigrams).
