Topic

Optical character recognition

About: Optical character recognition is a research topic. Over its lifetime, 7342 publications have been published within this topic, receiving 158193 citations. The topic is also known as OCR and optical character reader.


Papers
01 Jan 2000
TL;DR: A method of text retrieval from document images using a similarity measure based on an N-Gram algorithm to directly extract image features instead of using optical character recognition.
Abstract: In this paper, we propose a method of text retrieval from document images using a similarity measure based on an N-Gram algorithm. We directly extract image features instead of using optical character recognition. Character image objects are first extracted from document images based on connected components, and then an unsupervised classifier is used to classify these objects. All objects are encoded according to one unified class set, and each document image is represented by one stream of object codes. Next, we retrieve N-Gram slices from these streams and build document vectors. Lastly, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images obtained with this method was compared with the result for the ASCII versions of those documents based on the N-Gram algorithm for text documents.

36 citations
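
The last steps of the abstract above (N-Gram slices over object-code streams, document vectors, scalar product) can be illustrated with a minimal sketch. It assumes the connected-component extraction and unsupervised classification have already produced one stream of integer class codes per document. The function names, the choice of n = 3, the sample streams, and the length normalization of the scalar product are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from math import sqrt


def ngram_vector(codes, n=3):
    """Count the n-gram slices of a stream of object class codes."""
    return Counter(tuple(codes[i:i + n]) for i in range(len(codes) - n + 1))


def similarity(vec_a, vec_b):
    """Scalar product of two n-gram count vectors.

    The product is divided by the vector lengths (an assumption) so that
    documents of different sizes give comparable scores in [0, 1].
    """
    dot = sum(count * vec_b[gram] for gram, count in vec_a.items())
    norm_a = sqrt(sum(c * c for c in vec_a.values()))
    norm_b = sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical object-code streams for three document images.
doc_a = [3, 1, 4, 1, 5, 9, 2, 6]
doc_b = [3, 1, 4, 1, 5, 9, 2, 7]  # differs from doc_a in the last object only
doc_c = [8, 8, 8, 8, 8, 8, 8, 8]  # shares no trigram with doc_a

sim_ab = similarity(ngram_vector(doc_a), ngram_vector(doc_b))
sim_ac = similarity(ngram_vector(doc_a), ngram_vector(doc_c))
```

Here `doc_a` and `doc_b` share five of their six trigrams, so `sim_ab` is high, while `sim_ac` is zero because no trigram is shared.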

Book Chapter
15 Jun 2009
TL;DR: An automated approach to classify sentences of scholarly work with respect to their rhetorical function is presented, which is robust to noise and can process raw text.
Abstract: We present an automated approach to classify sentences of scholarly work with respect to their rhetorical function. While previous work that achieves this task of argumentative zoning requires richly annotated input, our approach is robust to noise and can process raw text. Even in cases where the input has noise (as it is obtained from optical character recognition or text extraction from PDF files), our robust classifier is largely accurate. We perform an in-depth study of our system both with clean and noisy inputs. We also give preliminary results from in situ acceptability testing when the classifier is embedded within a digital library reading environment.

36 citations

Journal Article
TL;DR: This work introduces entropy-based thresholding with a metaheuristic approach to find the optimal threshold for gray images and finds that the Tsallis method offers better PSNR and SSIM values and is capable of effective image segmentation.
Abstract: Image segmentation is a necessity in many applications, such as brain tumor detection, optical character recognition, thermal energy leakage detection, face recognition, etc. Multilevel thresholding is the...

36 citations

Patent
01 Dec 2011
TL;DR: A server system receives a visual query from a client system and performs optical character recognition (OCR) on the visual query to produce text recognition data representing textual characters, including a plurality of textual characters in a contiguous region of the visual query.
Abstract: A server system receives a visual query from a client system, performs optical character recognition (OCR) on the visual query to produce text recognition data representing textual characters, including a plurality of textual characters in a contiguous region of the visual query. The server system also produces structural information associated with the textual characters in the visual query. Textual characters in the plurality of textual characters are scored. The method further includes identifying, in accordance with the scoring, one or more high quality textual strings, each comprising a plurality of high quality textual characters from among the plurality of textual characters in the contiguous region of the visual query. A canonical document that includes the one or more high quality textual strings and that is consistent with the structural information is retrieved. At least a portion of the canonical document is sent to the client system.

36 citations
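
The scoring step in the patent abstract above can be pictured with a small sketch: given per-character scores, keep only runs of consecutive high-scoring characters. The abstract does not say how characters are scored or how strings are selected, so the score values, the `threshold` and `min_len` parameters, and the function name below are hypothetical.

```python
def high_quality_strings(chars, threshold=0.9, min_len=3):
    """Return runs of consecutive characters whose score meets the threshold.

    `chars` is a list of (character, score) pairs in reading order.
    Runs shorter than `min_len` characters are discarded.
    """
    runs, current = [], []
    for ch, score in chars:
        if score >= threshold:
            current.append(ch)
        else:
            # A low-scoring character ends the current run.
            if len(current) >= min_len:
                runs.append("".join(current))
            current = []
    if len(current) >= min_len:
        runs.append("".join(current))
    return runs


# Hypothetical OCR output for one contiguous region: (character, score).
ocr_chars = [
    ("T", 0.98), ("h", 0.95), ("e", 0.97), (" ", 0.99),
    ("q", 0.40),  # poorly recognized character
    ("u", 0.93), ("i", 0.96), ("c", 0.94), ("k", 0.92),
]

strings = high_quality_strings(ocr_chars)
```

In a system like the one described, strings of this kind would then serve as search keys for retrieving the canonical document. That retrieval step is outside this sketch.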

Proceedings Article
20 Oct 1993
TL;DR: Preliminary results on a new approach to document image binarization, an algorithm based on gray scale histogram and run-length histogram analysis, show that over 99% of such address blocks can be correctly binarized.
Abstract: Document image binarization is not a completely solved problem for unconstrained document images. Binarization algorithms, whether global or local, can easily fail on images with noisy or complex background, or poor contrast. The authors report preliminary results on a new approach to document image binarization, an algorithm based on gray scale histogram and run-length histogram analysis. Experimental results on unconstrained machine printed address blocks from the US letter mail stream show that over 99% of such address blocks can be correctly binarized.

36 citations
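
For readers unfamiliar with global binarization, which the abstract above contrasts with local methods, the following sketch picks a single threshold from the gray-scale histogram using Otsu's method. This is a standard baseline substituted here for illustration. It is not the authors' algorithm, which also analyzes run-length histograms and is not specified in the abstract. The pixel values are made up.

```python
def otsu_threshold(pixels):
    """Pick a global threshold for 8-bit gray values with Otsu's method.

    Chooses the threshold t that maximizes the between-class variance of
    the two classes {p <= t} and {p > t}.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of gray values at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 1 (brighter than threshold) or 0 (otherwise)."""
    return [1 if p > threshold else 0 for p in pixels]


# Hypothetical flattened image: dark ink (about 50) on bright paper (about 200).
image = [50, 52, 48, 200, 198, 202, 51, 199]
t = otsu_threshold(image)
mask = binarize(image, t)
```

A single global threshold like this works when ink and background are well separated, and it is exactly the kind of method that can fail on the noisy, complex, or low-contrast backgrounds the abstract mentions.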


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
87% related
Feature (computer vision)
128.2K papers, 1.7M citations
85% related
Image segmentation
79.6K papers, 1.8M citations
85% related
Convolutional neural network
74.7K papers, 2M citations
84% related
Deep learning
79.8K papers, 2.1M citations
83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    186
2022    425
2021    333
2020    448
2019    430
2018    357