Topic

Optical character recognition

About: Optical character recognition is a research topic. Over its lifetime, 7342 publications have been published within this topic, receiving 158193 citations. The topic is also known as OCR and optical character reader.


Papers
Proceedings ArticleDOI
19 Apr 1994
TL;DR: The MS-TDNN integrates the high accuracy single character recognition capabilities of a TDNN with a non-linear time alignment procedure (dynamic time warping algorithm) for finding stroke and character boundaries in isolated, handwritten characters and words.
Abstract: The paper shows how the multi-state time delay neural network (MS-TDNN), which is already used successfully in continuous speech recognition tasks, can be applied both to online single character and cursive (continuous) handwriting recognition. The MS-TDNN integrates the high accuracy single character recognition capabilities of a TDNN with a non-linear time alignment procedure (dynamic time warping algorithm) for finding stroke and character boundaries in isolated, handwritten characters and words. In this approach each character is modelled by up to 3 different states, and words are represented as a sequence of these characters. The authors describe the basic MS-TDNN architecture and the input features used in the paper, and present results (up to 97.7% word recognition rate) both on writer dependent/independent single character recognition tasks and on writer dependent cursive handwriting tasks with varying vocabulary sizes of up to 20000 words.

37 citations
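
The abstract above names dynamic time warping as the non-linear alignment step. The following is a minimal sketch of textbook dynamic time warping on two numeric sequences, not the MS-TDNN alignment itself: the function name `dtw_distance` and the absolute-difference local cost are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Return the dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: absolute difference (an illustrative choice).
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because one element may be matched to several elements of the other sequence, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero even though the sequences differ in length.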

Proceedings ArticleDOI
01 Sep 2015
TL;DR: A new OCR correction strategy, customised for historical medical documents, which combines rule-based correction of regular errors with a medically-tuned spell-checking strategy, whose corrections are guided by information about subject-specific language usage from the publication period of the article to be corrected.
Abstract: Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, owing to large-scale digitisation efforts. Searchable access is typically provided by applying Optical Character Recognition (OCR) software to scanned page images. Often, however, the automatically recognised text contains a large number of errors, since OCR systems are typically optimised to deal with modern documents, and can struggle with historical document features, including variable print characteristics and archaic vocabulary usage. Low quality OCR text can reduce the efficiency of search systems over historical archives, particularly semantic systems that are based on the application of sophisticated text mining (TM) techniques. We report on a new OCR correction strategy, customised for historical medical documents. The method combines rule-based correction of regular errors with a medically-tuned spell-checking strategy, whose corrections are guided by information about subject-specific language usage from the publication period of the article to be corrected. The performance of our method compares favourably to other OCR post-correction strategies, in improving word-level accuracy of poor-quality documents by up to 16%.

37 citations
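
The two-stage idea in the abstract above (rule-based fixes for regular errors, then spell-checking guided by period-specific usage) can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the substitution rules in `RULES`, the tiny frequency table `LEXICON`, the `cutoff` value, and the use of `difflib` for candidate generation are not taken from the paper.

```python
import difflib

# Hypothetical substitution rules for regular OCR errors (illustrative only).
RULES = [("ſ", "s"), ("ﬁ", "fi")]

# Hypothetical period-specific vocabulary with usage counts (illustrative only).
LEXICON = {"fever": 120, "physician": 80, "disease": 150}


def correct_token(token, rules=RULES, lexicon=LEXICON, cutoff=0.8):
    """Apply the substitution rules, then fall back to a lexicon lookup."""
    # Stage 1: rule-based correction of regular errors.
    for wrong, right in rules:
        token = token.replace(wrong, right)
    if token in lexicon:
        return token
    # Stage 2: collect similar lexicon words and prefer the most frequent one.
    candidates = difflib.get_close_matches(token, list(lexicon), n=3, cutoff=cutoff)
    if not candidates:
        return token  # leave unknown tokens unchanged
    return max(candidates, key=lambda word: lexicon[word])
```

In this sketch the frequency table stands in for the "subject-specific language usage" mentioned in the abstract; a real system would derive it from a corpus of the relevant period.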

Proceedings ArticleDOI
01 Nov 2008
TL;DR: A zone and distance metric based feature extraction system is presented, and recognition rates of 98% and 96% are obtained for Kannada and Telugu numerals, respectively.
Abstract: Character recognition is an important area in the fields of image processing and pattern recognition. Handwritten character recognition has received extensive attention in academic and production fields. The recognition system can be either on-line or off-line. Off-line handwriting recognition is a subfield of optical character recognition. India is a multi-lingual and multi-script country, where eighteen official scripts are accepted and there are over a hundred regional languages. In this paper we present a zone and distance metric based feature extraction system. The character centroid is computed and the image is further divided into n equal zones. The average distance from the character centroid to each pixel present in the zone is computed. This procedure is repeated for all the zones present in the numeral image. Finally, n such features are extracted for classification and recognition. A feed forward back propagation neural network is designed for subsequent classification and recognition. We obtained recognition rates of 98% and 96% for Kannada and Telugu numerals, respectively.

37 citations
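
The feature extraction steps described in the abstract above (centroid, n zones, average centroid-to-pixel distance per zone) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the binary list-of-lists image format, the rectangular grid of `zone_rows` by `zone_cols` zones, the Euclidean distance, and the choice of 0.0 for empty zones are all assumptions.

```python
import math


def zone_distance_features(image, zone_rows, zone_cols):
    """Return one average centroid-to-pixel distance per zone of a binary image."""
    height, width = len(image), len(image[0])
    # Foreground pixels are the non-zero entries.
    pixels = [(r, c) for r in range(height) for c in range(width) if image[r][c]]
    n_zones = zone_rows * zone_cols
    if not pixels:
        return [0.0] * n_zones
    # Character centroid: mean row and mean column of the foreground pixels.
    cr = sum(r for r, _ in pixels) / len(pixels)
    cc = sum(c for _, c in pixels) / len(pixels)
    sums = [0.0] * n_zones
    counts = [0] * n_zones
    for r, c in pixels:
        # Map the pixel to its zone in the grid.
        zr = min(r * zone_rows // height, zone_rows - 1)
        zc = min(c * zone_cols // width, zone_cols - 1)
        zone = zr * zone_cols + zc
        sums[zone] += math.hypot(r - cr, c - cc)
        counts[zone] += 1
    # Empty zones get 0.0 (an assumption; the abstract does not say).
    return [s / k if k else 0.0 for s, k in zip(sums, counts)]
```

The returned list has n = `zone_rows * zone_cols` entries and could serve as the input vector of a classifier such as the feed forward network mentioned in the abstract.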

Proceedings ArticleDOI
05 Mar 2007
TL;DR: A fuzzy technique for segmentation of handwritten Bangla word images is presented and can be considered as a significant step towards the development of a full-fledged Bangla OCR system, especially for handwritten documents.
Abstract: A fuzzy technique for segmentation of handwritten Bangla word images is presented. It works in two steps. In the first step, the black pixels constituting the Matra (i.e., the longest horizontal line joining the tops of individual characters of a Bangla word) in the target word image are identified by using a fuzzy feature. In the second step, some of the black pixels on the Matra are identified as segment points (i.e., the points through which the word is to be segmented) by using three fuzzy features. On experimentation with a set of 210 samples of handwritten Bangla words, collected from different sources, the average success rate of the technique is shown to be 95.32%. Apart from certain limitations, the technique can be considered a significant step towards the development of a full-fledged Bangla OCR system, especially for handwritten documents.

37 citations
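
The abstract above does not define its fuzzy features, so the following is only a simplified stand-in for the first step: a horizontal projection profile that assigns each row a membership degree in [0, 1] for being part of the Matra. The function names, the binary list-of-lists image format, and the normalised row count used as membership are assumptions, not the paper's method.

```python
def matra_row_membership(image):
    """Return, per row, the black-pixel count normalised by the busiest row."""
    counts = [sum(1 for value in row if value) for row in image]
    peak = max(counts) if counts else 0
    if peak == 0:
        return [0.0] * len(counts)
    return [count / peak for count in counts]


def likely_matra_row(image):
    """Return the index of the row with the highest membership, or None."""
    membership = matra_row_membership(image)
    if not membership or max(membership) == 0.0:
        return None  # no black pixels at all
    return membership.index(max(membership))
```

A long headline stroke produces a row with many black pixels, so that row receives a membership close to 1 in this sketch.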

Journal ArticleDOI
TL;DR: This work formulates the word segmentation problem as a binary quadratic assignment problem that considers pairwise correlations between the gaps as well as the likelihoods of individual gaps, and estimates all parameters based on the Structured SVM framework so that the proposed method works well regardless of writing styles and written languages without user-defined parameters.
Abstract: Segmentation of handwritten document images into text-lines and words is an essential task for optical character recognition. However, since the features of handwritten documents are irregular and diverse depending on the person, it is considered a challenging problem. In order to address the problem, we formulate the word segmentation problem as a binary quadratic assignment problem that considers pairwise correlations between the gaps as well as the likelihoods of individual gaps. Even though many parameters are involved in our formulation, we estimate all parameters based on the Structured SVM framework so that the proposed method works well regardless of writing styles and written languages without user-defined parameters. Experimental results on the ICDAR 2009/2013 handwriting segmentation databases show that the proposed method achieves state-of-the-art performance on Latin-based and Indian languages.

37 citations
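
To make the formulation in the abstract above concrete, the sketch below labels each gap as a word boundary (1) or not (0) by maximising a score with per-gap terms and pairwise terms. It is a toy illustration only: the scoring form, the brute-force search, and the hand-picked weights in the usage note are assumptions, whereas the paper estimates its parameters with a Structured SVM.

```python
from itertools import product


def best_gap_labels(unary, pairwise):
    """Brute-force the binary labelling that maximises a quadratic score.

    score(x) = sum_i unary[i] * x[i] + sum_{(i, j)} pairwise[(i, j)] * x[i] * x[j]
    """
    best, best_score = None, float("-inf")
    # Enumerate all 2**n labellings; only feasible for a handful of gaps.
    for labels in product((0, 1), repeat=len(unary)):
        score = sum(u * x for u, x in zip(unary, labels))
        score += sum(w * labels[i] * labels[j] for (i, j), w in pairwise.items())
        if score > best_score:
            best, best_score = labels, score
    return best, best_score
```

With the hypothetical weights `unary = [2.0, -1.0, 0.5]` and `pairwise = {(0, 2): -1.0}`, the best labelling is `(1, 0, 0)`: the negative pairwise term outweighs the small positive score of the third gap once the first gap is chosen as a boundary.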


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
87% related
Feature (computer vision)
128.2K papers, 1.7M citations
85% related
Image segmentation
79.6K papers, 1.8M citations
85% related
Convolutional neural network
74.7K papers, 2M citations
84% related
Deep learning
79.8K papers, 2.1M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    186
2022    425
2021    333
2020    448
2019    430
2018    357