Topic

Optical character recognition

About: Optical character recognition is a research topic. Over its lifetime, 7342 publications have been published within this topic, receiving 158193 citations. The topic is also known as OCR or optical character reader.


Papers
Journal Article
TL;DR: A generalised, hierarchical framework for script identification is proposed and a set of energy and intensity space features for this task are presented to establish the utility of a global approach to the classification of scripts.
Abstract: Automatic identification of a script in a given document image facilitates many important applications such as automatic archiving of multilingual documents, searching online archives of document images and for the selection of script-specific OCR in a multi-lingual environment. In this paper, we model script identification as a texture classification problem and examine a global approach inspired by human visual perception. A generalised, hierarchical framework is proposed for script identification. A set of energy and intensity space features for this task is also presented. The framework serves to establish the utility of a global approach to the classification of scripts. The framework has been tested on two datasets: 10 Indian and 13 world scripts. The obtained accuracy of identification across the two datasets is above 94%. The results demonstrate that the framework can be used to develop solutions for script identification from document images across a large set of script classes.

46 citations

Proceedings Article
01 Nov 2017
TL;DR: MUSCIMA++, a new dataset for OMR, is designed and collected; its ground truth is a notation graph, which analysis shows to be a necessary and sufficient representation of music notation.
Abstract: Optical Music Recognition (OMR) promises to make accessible the content of large amounts of musical documents, an important component of cultural heritage. However, the field does not have an adequate dataset and ground truth for benchmarking OMR systems, which has been a major obstacle to measurable progress. Furthermore, machine learning methods for OMR require training data. We design and collect MUSCIMA++, a new dataset for OMR. Ground truth in MUSCIMA++ is a notation graph, which our analysis shows to be a necessary and sufficient representation of music notation. Building on the CVC-MUSCIMA dataset for staffline removal, the MUSCIMA++ dataset v1.0 consists of 140 pages of handwritten music, with 91254 manually annotated notation symbols and 82247 explicitly marked relationships between symbol pairs. The dataset allows training and directly evaluating models for symbol classification, symbol localization, and notation graph assembly, and indirectly musical content extraction, both in isolation and jointly. Open-source tools are provided for manipulating the dataset, visualizing the data and annotating further, and the data is made available under an open license.

46 citations
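
The notation graph described in the abstract above (annotated symbols plus explicitly marked relationships between symbol pairs) can be pictured as a small directed graph. The following is a minimal sketch only: the class names `Symbol` and `NotationGraph`, the field layout, and the example symbol labels are assumptions made for illustration and are not the MUSCIMA++ file format or its tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Symbol:
    """One annotated notation symbol (hypothetical layout)."""
    sym_id: int
    label: str                        # illustrative label, e.g. "notehead"
    bbox: Tuple[int, int, int, int]   # top, left, bottom, right in pixels


@dataclass
class NotationGraph:
    """Symbols as nodes, relationships between symbol pairs as directed edges."""
    symbols: Dict[int, Symbol] = field(default_factory=dict)
    edges: List[Tuple[int, int]] = field(default_factory=list)

    def add_symbol(self, symbol: Symbol) -> None:
        self.symbols[symbol.sym_id] = symbol

    def relate(self, src: int, dst: int) -> None:
        # Only allow relationships between symbols that were annotated.
        if src not in self.symbols or dst not in self.symbols:
            raise KeyError("both symbols must exist before relating them")
        self.edges.append((src, dst))

    def children(self, sym_id: int) -> List[Symbol]:
        # Symbols reachable by one outgoing edge from sym_id.
        return [self.symbols[d] for s, d in self.edges if s == sym_id]


# Illustrative data: one notehead attached to a stem and a beam.
g = NotationGraph()
g.add_symbol(Symbol(0, "notehead", (120, 40, 132, 54)))
g.add_symbol(Symbol(1, "stem", (80, 52, 126, 54)))
g.add_symbol(Symbol(2, "beam", (78, 52, 86, 110)))
g.relate(0, 1)
g.relate(0, 2)
print([s.label for s in g.children(0)])
```

Under this layout, symbol classification and localization would operate on the nodes, while notation graph assembly amounts to predicting the edge list.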

Journal Article
TL;DR: This article presents an OCR approach that combines a CNN with an Error Correcting Output Code (ECOC) classifier and reports that CNN-ECOC gives higher accuracy than the traditional CNN classifier.

46 citations
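
To make the ECOC idea in the entry above concrete, here is a minimal sketch of ECOC decoding. It assumes that some set of binary classifiers (for example, trained on CNN features) has already produced one bit per codeword position; the 4-class, 7-bit codebook and the class names below are illustrative assumptions, not the configuration used in the article.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical codebook: one 7-bit codeword per class.
# Every pair of codewords differs in 4 positions, so any single
# flipped bit can still be decoded to the right class.
CODEBOOK: Dict[str, Tuple[int, ...]] = {
    "A": (1, 1, 1, 1, 1, 1, 1),
    "B": (0, 0, 0, 0, 1, 1, 1),
    "C": (0, 0, 1, 1, 0, 0, 1),
    "D": (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(bits: Sequence[int],
                codebook: Dict[str, Tuple[int, ...]] = CODEBOOK) -> str:
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(codebook, key=lambda label: hamming(bits, codebook[label]))


# The last binary classifier is wrong here, yet decoding still yields "C".
print(ecoc_decode((0, 0, 1, 1, 0, 0, 0)))
```

The error-correcting behaviour comes entirely from the distance between codewords: with a minimum pairwise Hamming distance of 4, one erroneous binary prediction leaves the true class strictly closest.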

Book Chapter
12 Dec 2016
TL;DR: The present evaluation is expected to advance OCR research, providing new insights and consideration to the research area, and assist researchers to determine which service is ideal for optical character recognition in an accurate and efficient manner.
Abstract: Optical character recognition (OCR) as a classic machine learning challenge has been a longstanding topic in a variety of applications in healthcare, education, insurance, and legal industries to convert different types of electronic documents, such as scanned documents, digital images, and PDF files into fully editable and searchable text data. The rapid generation of digital images on a daily basis prioritizes OCR as an imperative and foundational tool for data analysis. With the help of OCR systems, we have been able to save a reasonable amount of effort in creating, processing, and saving electronic documents, adapting them to different purposes. A set of different OCR platforms is now available which, aside from lending theoretical contributions to other practical fields, have demonstrated successful applications in real-world problems. In this work, several qualitative and quantitative experimental evaluations have been performed using four well-known OCR services, including Google Docs OCR, Tesseract, ABBYY FineReader, and Transym. We analyze the accuracy and reliability of the OCR packages employing a dataset including 1227 images from 15 different categories. Furthermore, we review the state-of-the-art OCR applications in healthcare informatics. The present evaluation is expected to advance OCR research, providing new insights and consideration to the research area, and assist researchers to determine which service is ideal for optical character recognition in an accurate and efficient manner.

46 citations
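
The abstract above does not state which accuracy measure was used. One common way to score OCR output against a ground-truth transcription is character accuracy derived from edit distance; the sketch below shows that measure as an assumption, with hypothetical function names and made-up example strings, and should not be read as the chapter's evaluation protocol.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """1 minus the character error rate, clipped at zero."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Illustrative strings: the OCR output confuses "l" with "1".
truth = "optical character"
ocr_output = "optica1 character"
print(round(character_accuracy(truth, ocr_output), 3))
```

Averaging such a score per image, and then per category, would give one way to compare services on a dataset organised like the one described.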

Patent
15 Sep 2000
TL;DR: A system and method for indexing and searching textual archives using semantic units such as syllables and morphemes is presented, where the string of semantic units that results from a decoding process is stored in a semantic unit database and indexed with pointers to the corresponding textual data in the textual archive.
Abstract: A system and method for indexing and searching textual archives using semantic units such as syllables and morphemes. In one aspect, a system for indexing a textual archive comprises an AHR (automatic handwriting recognition) system and/or OCR (optical character recognition) system for transcribing (decoding) textual input data (handwritten or typed text) into a string of semantic units (e.g., syllables or morphemes) using a statistical language model and vocabulary based on semantic units (such as syllables or morphemes). The string of semantic units that result from a decoding process are stored in a semantic unit database and indexed with pointers to the corresponding textual data in the textual archive. In another aspect, a system for searching a textual archive is provided, wherein a word (or words) to be searched is rendered into a string of semantic units (e.g., syllables or morphemes) depending on the application. A search engine then compares the string of semantic units (resulting from the input query) against the decoded semantic unit database, and then identifies textual data stored in the textual archive using the indexes that were generated during a semantic unit-based indexing process.

46 citations
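
The indexing and search flow in the patent abstract above can be sketched as an inverted index over semantic units. This is a rough illustration under stated assumptions: the `to_units` function is a naive, hypothetical vowel-based splitter standing in for the statistical decoder, and the archive contents and document identifiers are invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def to_units(text: str) -> List[str]:
    """Naive stand-in for decoding text into syllable-like units."""
    units: List[str] = []
    for word in re.findall(r"[a-z]+", text.lower()):
        # Consonants, then a vowel run; trailing consonants join the last unit.
        parts = re.findall(r"[^aeiou]*[aeiou]+(?:[^aeiou]*$)?", word)
        units.extend(parts or [word])  # vowel-less words stay whole
    return units


def build_index(archive: Dict[str, str]) -> Dict[str, Set[Tuple[str, int]]]:
    """Map each unit to pointers (document id, unit position) into the archive."""
    index: Dict[str, Set[Tuple[str, int]]] = defaultdict(set)
    for doc_id, text in archive.items():
        for pos, unit in enumerate(to_units(text)):
            index[unit].add((doc_id, pos))
    return index


def search(index: Dict[str, Set[Tuple[str, int]]], query: str) -> Set[str]:
    """Return documents containing the query's units consecutively."""
    units = to_units(query)
    if not units:
        return set()
    hits: Set[str] = set()
    for doc_id, pos in index.get(units[0], set()):
        if all((doc_id, pos + i) in index.get(u, set())
               for i, u in enumerate(units)):
            hits.add(doc_id)
    return hits


# Invented two-document archive.
archive = {
    "page1": "optical character recognition",
    "page2": "handwriting recognition system",
}
index = build_index(archive)
print(sorted(search(index, "recognition")))
```

Because matching happens on unit sequences rather than whole words, a query such as "cognition" would also match inside "recognition" in this sketch, which hints at why a unit-based vocabulary can tolerate words outside a fixed word list.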


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
87% related
Feature (computer vision)
128.2K papers, 1.7M citations
85% related
Image segmentation
79.6K papers, 1.8M citations
85% related
Convolutional neural network
74.7K papers, 2M citations
84% related
Deep learning
79.8K papers, 2.1M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  186
2022  425
2021  333
2020  448
2019  430
2018  357