Topic

Optical character recognition

About: Optical character recognition is a research topic. Over the lifetime, 7342 publications have been published within this topic receiving 158193 citations. The topic is also known as: OCR & optical character reader.


Papers
Proceedings ArticleDOI
11 Dec 2006
TL;DR: This article describes a simple OCR system that was implemented in Symbian C++ to be run on a stock Nokia 6630 cameraphone and is limited to recognizing English capital letters printed in black, against a white background.
Abstract: In optical character recognition (OCR), visible characters appearing as images (i.e. on paper) are recognized as symbolic characters and stored in a computer's memory or similar device. The purpose of this work is to find whether current mobile cameraphones are able to run OCR software without relying on dedicated hardware or facilities provided by the network. This article describes a simple OCR system that was implemented in Symbian C++ to be run on a stock Nokia 6630 cameraphone. The system is limited to recognizing English capital letters printed in black, against a white background. The opportunities and hardships related to bringing OCR to run on mobile platforms having image capturing capability are also discussed in more general terms.

40 citations

Proceedings ArticleDOI
24 Jul 2008
TL;DR: A new paradigm is proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging, which formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach.
Abstract: Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to down-stream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. The problem of identifying tabular structures that should not be parsed as sentential text is also discussed.

40 citations
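The abstract above frames error classification as a dynamic programming problem. As a rough illustration only, the sketch below aligns a ground-truth string with OCR output using a flat, character-level edit distance and labels each mismatch as a substitution, deletion, or insertion. This is a simplified stand-in, not the hierarchical method of the paper; the function name `align_errors` and the unit cost for every edit are assumptions made for the example.

```python
from typing import List, Tuple


def align_errors(truth: str, ocr: str) -> Tuple[int, List[Tuple[str, str, str]]]:
    """Return (edit distance, list of (error type, truth char, ocr char))."""
    n, m = len(truth), len(ocr)
    # dist[i][j] = edit distance between truth[:i] and ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # character lost by the OCR
                dist[i][j - 1] + 1,         # spurious character in OCR output
            )

    # Walk back from the bottom-right corner to recover the error list.
    errors: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    errors.append(("substitution", truth[i - 1], ocr[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            errors.append(("deletion", truth[i - 1], ""))
            i -= 1
        else:
            errors.append(("insertion", "", ocr[j - 1]))
            j -= 1
    errors.reverse()
    return dist[n][m], errors
```

For example, `align_errors("cat", "cot")` reports one substitution of "a" by "o". A pipeline study like the one described would additionally track how each such error propagates into sentence boundary detection, tokenization, and part-of-speech tagging, which this sketch does not attempt.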

Proceedings ArticleDOI
27 Mar 2012
TL;DR: A novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images by exploiting the additional context present in the character n-gram images, which enables better disambiguation between confusing characters in the recognition phase.
Abstract: In this paper we present a novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian language document images. OCRs have considerably good performance on good quality documents, but fail easily in presence of degradations. Also, classical OCR approaches perform poorly over complex scripts such as those for Indian languages. We address these issues by proposing to recognize character n-gram images, which are basically groupings of consecutive character/component segments. Our approach is unique, since we use the character n-grams as a primitive for recognition rather than for post processing. By exploiting the additional context present in the character n-gram images, we enable better disambiguation between confusing characters in the recognition phase. The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them. Our method is inherently robust to degradations such as cuts and merges which are common in digital libraries of scanned documents. We also present a reliable and scalable scheme for recognizing character n-gram images. Tests on English and Malayalam document images show considerable improvement in recognition in the case of heavily degraded documents.

40 citations

Journal ArticleDOI
TL;DR: The presented technique, which is writer independent, proved to be effective in the automatic recognition of Arabic (Indian) numerals in terms of the highest recognition rate possible.

40 citations

Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper proposed a new algorithm for printed script identification based on texture analysis that uses the histogram of the local patterns as description of the script stroke directions distribution which is the characteristic of every script.
Abstract: Script identification is an important step in multi-script document analysis. As the different textures present in the text portion of a script are its main distinguishing features, in this paper we propose a new algorithm for printed script identification based on texture analysis. Since local patterns are a unifying concept for traditional statistical and structural approaches to texture analysis, the basic idea here is to use the histogram of the local patterns as a description of the distribution of script stroke directions, which is characteristic of every script. As local patterns, the basic version of the Local Binary Patterns (LBP) and a modified version of the Orientation of the Local Binary Patterns (OLBP) are proposed. A Least Square Support Vector Machine (LS-SVM) is used as identifier. The scheme has been verified on two databases. The first, or training, database contains 200 sheets of 10 different scripts. The script fonts are provided by the Google translator. The second, or test, database has been obtained by scanning different newspapers and books. It contains 5 of the 10 scripts in the first database. From the experiment we obtained encouraging results.

40 citations
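To make the texture descriptor in the abstract above more concrete, here is a minimal sketch of the basic 3x3 Local Binary Pattern histogram. It assumes a grayscale image given as a list of rows of integers; the function names, the neighbour ordering, and the `>=` threshold convention are choices made for this example, and the paper's modified OLBP variant and LS-SVM classifier are not reproduced.

```python
from typing import List

# Offsets of the 8 neighbours, clockwise starting at the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img: List[List[int]], r: int, c: int) -> int:
    """8-bit LBP code of the interior pixel at (r, c)."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOURS):
        # A neighbour at least as bright as the centre sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img: List[List[int]]) -> List[float]:
    """Normalised 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0.0] * 256
    rows, cols = len(img), len(img[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            hist[lbp_code(img, r, c)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

The resulting 256-element vector could serve as the feature passed to a classifier; border pixels are skipped here because they lack a full 3x3 neighbourhood.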


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 87% related
Feature (computer vision): 128.2K papers, 1.7M citations, 85% related
Image segmentation: 79.6K papers, 1.8M citations, 85% related
Convolutional neural network: 74.7K papers, 2M citations, 84% related
Deep learning: 79.8K papers, 2.1M citations, 83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    186
2022    425
2021    333
2020    448
2019    430
2018    357