Proceedings ArticleDOI

Probabilistic Approach for Correction of Optically-Character-Recognized Strings Using Suffix Tree

15 Dec 2011, pp. 74-77
TL;DR: An approach for correcting the character recognition errors of an OCR system that recognises Indic scripts; it achieves a maximum error rate reduction of 33% over a simple character recognition system.
Abstract: In this paper we present an approach for correcting the character recognition errors of an OCR system that recognises Indic scripts. A suffix tree is used to index the lexicon in lexicographical order to facilitate the probabilistic search. To obtain the most probable match for a mis-recognised string, the string is compared with sub-strings (edges of the suffix tree) using weighted Levenshtein distance as the similarity measure, with the confusion probabilities of characters (Unicode code points) as substitution costs; the comparison continues until the accumulated cost exceeds a specified cost k. Retrieved candidates are sorted and selected on the basis of lowest edit cost. Exploiting this information, the system can correct non-word errors and achieves a maximum error rate reduction of 33% over a simple character recognition system.
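The search described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it indexes the lexicon in a plain prefix trie rather than a suffix tree, the `CONFUSION` table holds made-up probabilities, and the mapping from confusion probability to substitution cost (`1 - p`) is an assumption, since the abstract does not give the exact formula. Insertions and deletions are assumed to cost 1.

```python
from typing import Dict, List, Tuple

# Hypothetical confusion table: CONFUSION[(intended, recognised)] is the
# probability that the OCR emits `recognised` when the true character is
# `intended`. The values are invented for illustration.
CONFUSION: Dict[Tuple[str, str], float] = {
    ("l", "1"): 0.6,
    ("o", "0"): 0.5,
    ("m", "n"): 0.3,
}

END = ""  # end-of-word key; cannot collide with a real character


def sub_cost(intended: str, recognised: str) -> float:
    """Assumed mapping: likely confusions are cheap, unknown ones cost 1."""
    if intended == recognised:
        return 0.0
    return 1.0 - CONFUSION.get((intended, recognised), 0.0)


def build_trie(words: List[str]) -> dict:
    """Index the lexicon in a prefix trie (a stand-in for the suffix tree)."""
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def search(trie: dict, observed: str, k: float) -> List[Tuple[float, str]]:
    """Return (cost, word) pairs with weighted edit cost <= k, cheapest first."""
    results: List[Tuple[float, str]] = []
    first_row = [float(j) for j in range(len(observed) + 1)]

    def walk(node: dict, prev_row: List[float]) -> None:
        for ch, child in node.items():
            if ch == END:
                if prev_row[-1] <= k:
                    results.append((prev_row[-1], child))
                continue
            # One new row of the edit-distance table per trie edge.
            row = [prev_row[0] + 1.0]
            for j in range(1, len(observed) + 1):
                row.append(min(
                    row[j - 1] + 1.0,       # extra character in the OCR output
                    prev_row[j] + 1.0,      # character missing from the OCR output
                    prev_row[j - 1] + sub_cost(ch, observed[j - 1]),
                ))
            # Costs are non-negative, so no descendant can get back under k.
            if min(row) <= k:
                walk(child, row)

    walk(trie, first_row)
    return sorted(results)
```

For example, with the lexicon `["hello", "help", "moon", "noon"]`, calling `search(build_trie(lexicon), "he1lo", 1.0)` keeps only `hello`, because the `l`/`1` confusion makes that substitution cheap while every other word exceeds the cost limit.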
Citations
Book ChapterDOI
01 Sep 2002

451 citations

Patent
17 May 2018
TL;DR: A method for extracting symbols from a digitized object: a word block is compared against a word in a dictionary, the comparison provides a confidence factor, and the word is output as the prediction when the confidence factor is greater than a predetermined threshold.
Abstract: Embodiments of the present disclosure include a method for extracting symbols from a digitized object. The method includes processing the word block against a dictionary. The method includes comparing the word block against a word in the dictionary, the comparison providing a confidence factor. The method includes outputting a prediction equal to the word when the confidence factor is greater than a predetermined threshold. The method includes evaluating properties of the word block when the confidence factor is less than the predetermined threshold. The method includes predicting a value of the word block based on the properties of the word block. The method further includes determining an error rate for the predicted value of the word block. The method includes outputting a value for the word block, the output equal to a calculated value corresponding to a value of the word block having the lowest error rate.

13 citations

Journal ArticleDOI
TL;DR: A new technique, based on a confusion matrix, is proposed for correcting errors made by a Devanagari OCR (Optical Character Reader) system; it places the correct words at the top position of its suggestions.
Abstract: Objectives: This paper proposes a new technique, based on a confusion matrix, for correcting errors made by a Devanagari OCR (Optical Character Reader) system. Methods/Statistical Analysis: The confusion matrix is generated from a large corpus of Hindi. The system takes each word of the OCR output and generates a number of strings from the top five confused characters for each character of the input word, along with the probability of each string for ranking. Each string is validated against the character trigram dictionary, and the valid strings are used as the best suggestions. Findings: The top five words are taken as suggestions. The system has been tested on a variety of Devanagari OCR output documents. The system places the correct words at the top position of its suggestions. For more than 10,000 unique words in Devanagari OCR output, the system achieves an accuracy of 97%. Application/Improvements: This system is used in post-processing of Devanagari OCR. With some improvements, it can also be used for the Gurumukhi and Urdu scripts.
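The candidate generation step in this abstract might look roughly like the sketch below. It is an illustration under assumptions, not the paper's code: the `confusions` table maps each recognised character to hypothetical `(intended character, probability)` pairs and is assumed to include the character itself, a string's probability is taken to be the product of its per-character probabilities, and Latin letters stand in for Devanagari. The function name `suggest` and its parameters are invented here.

```python
from itertools import product
from math import prod
from typing import Dict, List, Set, Tuple


def trigrams(word: str) -> Set[str]:
    """All character trigrams of a word (empty for words shorter than 3)."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def suggest(
    ocr_word: str,
    confusions: Dict[str, List[Tuple[str, float]]],
    valid_trigrams: Set[str],
    top_chars: int = 5,
    top_words: int = 5,
) -> List[str]:
    """Rank candidate corrections for one OCR word."""
    # Keep the most probable alternatives for each recognised character.
    options = []
    for ch in ocr_word:
        ranked = sorted(confusions.get(ch, []), key=lambda a: -a[1])[:top_chars]
        options.append(ranked or [(ch, 1.0)])  # unknown character: keep as is

    scored = []
    # Note: the number of combinations grows exponentially with word length.
    for combo in product(*options):
        candidate = "".join(c for c, _ in combo)
        # Keep a string only if every one of its trigrams is in the dictionary.
        if trigrams(candidate) <= valid_trigrams:
            scored.append((prod(p for _, p in combo), candidate))

    scored.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in scored[:top_words]]
```

With a trigram dictionary built from the words `cat` and `eat`, and a table saying a recognised `e` is really `e` with probability 0.6 or `c` with probability 0.4, the input `eat` yields the suggestions `eat` then `cat`.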

4 citations

References
Journal ArticleDOI
TL;DR: An algorithm is presented which solves the string-to-string correction problem in time proportional to the product of the lengths of the two strings.
Abstract: The string-to-string correction problem is to determine the distance between two strings as measured by the minimum cost sequence of “edit operations” needed to change the one string into the other. The edit operations investigated allow changing one symbol of a string into another single symbol, deleting one symbol from a string, or inserting a single symbol into a string. An algorithm is presented which solves this problem in time proportional to the product of the lengths of the two strings. Possible applications are to the problems of automatic spelling correction and determining the longest subsequence of characters common to two strings.
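The dynamic-programming idea behind this result can be sketched as below. The function name and the default unit costs are assumptions made for the example; the table has one cell per pair of prefixes, which is where the running time proportional to the product of the two lengths comes from.

```python
from typing import Sequence


def edit_distance(
    source: Sequence,
    target: Sequence,
    sub_cost: float = 1,
    del_cost: float = 1,
    ins_cost: float = 1,
) -> float:
    """Minimum total cost of edits that turn `source` into `target`."""
    m, n = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * del_cost   # delete every symbol of the prefix
    for j in range(1, n + 1):
        dist[0][j] = j * ins_cost   # insert every symbol of the prefix

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            change = 0 if source[i - 1] == target[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j - 1] + change,   # keep or change one symbol
                dist[i - 1][j] + del_cost,     # delete from source
                dist[i][j - 1] + ins_cost,     # insert into source
            )
    return dist[m][n]
```

With unit costs, `edit_distance("kitten", "sitting")` evaluates to 3 (two changes and one insertion).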

3,252 citations

Journal ArticleDOI
TL;DR: An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string, developed as a linear-time version of a very simple algorithm for (quadratic size) suffix tries.
Abstract: An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It always has the suffix tree for the scanned part of the string ready. The method is developed as a linear-time version of a very simple algorithm for (quadratic size) suffix tries. Regardless of its quadratic worst case, this latter algorithm can be a good practical method when the string is not too long. Another variation of this method is shown to give, in a natural way, the well-known algorithms for constructing suffix automata (DAWGs).
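To make the "quadratic size suffix trie" concrete, here is a deliberately naive sketch that inserts every suffix into a nested-dictionary trie. It is not the linear-time on-line algorithm the abstract describes, only the simple structure that algorithm improves on; the function names are invented for the example.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of `text` into a trie of nested dicts.

    Takes quadratic time and space in the worst case.
    """
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie: dict, pattern: str) -> bool:
    """A pattern occurs in the text iff it spells a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

Once the trie for `banana` is built, `contains(trie, "nan")` is true and `contains(trie, "nab")` is false, and each lookup takes time proportional to the pattern length regardless of the text length.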

1,528 citations

Proceedings ArticleDOI
03 Oct 2000
TL;DR: A new channel model for spelling correction, based on generic string to string edits, is described, which gives significant performance improvements compared to previously proposed models.
Abstract: The noisy channel model has been applied to a wide range of problems, including spelling correction. These models consist of two components: a source model and a channel model. Very little research has gone into improving the channel model for spelling correction. This paper describes a new channel model for spelling correction, based on generic string to string edits. Using this model gives significant performance improvements compared to previously proposed models.
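The two components named in this abstract correspond to the two factors of the usual noisy channel decision rule. The symbols below are notation chosen for this note rather than taken from the paper: $s$ is the observed (possibly misspelled) string, $\mathcal{D}$ the dictionary of candidate words, and $\hat{w}$ the chosen correction.

```latex
\hat{w}
  \;=\; \arg\max_{w \in \mathcal{D}} P(w \mid s)
  \;=\; \arg\max_{w \in \mathcal{D}}
        \underbrace{P(w)}_{\text{source model}}
        \;\underbrace{P(s \mid w)}_{\text{channel model}}
```

The second equality follows from Bayes' rule, since $P(s)$ is the same for every candidate $w$ and can be dropped from the maximisation. The paper's contribution concerns the channel term $P(s \mid w)$.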

617 citations


"Probabilistic Approach for Correcti..." refers background in this paper

  • ...[3] E. Brill and R. C. Moore, “An improved error model for noisy channel spelling correction,” 2000, pp. 286–293....

    [...]

  • ...Brill and Moore [3] introduced an improved noisy channel model for spelling correction....

    [...]

  • ...Brill and Moore [3] reports 52% error reduction without language model and 74% with language model....

    [...]