
Showing papers on "Optical character recognition" published in 2020


Journal ArticleDOI
12 Jun 2020-Sensors
TL;DR: A CNN architecture is proposed in order to achieve accuracy even better than that of ensemble architectures, along with reduced operational complexity and cost.
Abstract: Traditional systems of handwriting recognition have relied on handcrafted features and a large amount of prior knowledge. Training an Optical character recognition (OCR) system based on these prerequisites is a challenging task. Research in the handwriting recognition field is focused around deep learning techniques and has achieved breakthrough performance in the last few years. Still, the rapid growth in the amount of handwritten data and the availability of massive processing power demands improvement in recognition accuracy and deserves further investigation. Convolutional neural networks (CNNs) are very effective in perceiving the structure of handwritten characters/words in ways that help in automatic extraction of distinct features and make CNNs the most suitable approach for solving handwriting recognition problems. Our aim in the proposed work is to explore the various design options like number of layers, stride size, receptive field, kernel size, padding and dilation for CNN-based handwritten digit recognition. In addition, we aim to evaluate various SGD optimization algorithms in improving the performance of handwritten digit recognition. A network's recognition accuracy increases by incorporating ensemble architecture. Here, our objective is to achieve comparable accuracy by using a pure CNN architecture without ensemble architecture, as ensemble architectures introduce increased computational cost and high testing complexity. Thus, a CNN architecture is proposed in order to achieve accuracy even better than that of ensemble architectures, along with reduced operational complexity and cost. Moreover, we also present an appropriate combination of learning parameters in designing a CNN that leads us to reach a new absolute record in classifying MNIST handwritten digits. We carried out extensive experiments and achieved a recognition accuracy of 99.87% on the MNIST dataset.
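The design options above map directly onto standard deep-learning APIs. The following is a minimal sketch in Keras (not the authors' exact architecture or training setup) of a pure-CNN MNIST classifier exposing those knobs: kernel size, stride, padding, dilation and an SGD-family optimizer.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(kernel_size=3, stride=1, dilation=1, padding="same"):
    # Two conv blocks followed by a small classification head; all values are illustrative.
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, kernel_size, strides=stride, padding=padding,
                      dilation_rate=dilation, activation="relu"),
        layers.Conv2D(64, kernel_size, padding=padding, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(0.4),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0
model = build_cnn()
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))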

153 citations


Journal ArticleDOI
TL;DR: This review article presents state-of-the-art results and techniques on OCR and also provides research directions by highlighting research gaps.
Abstract: Given the ubiquity of handwritten documents in human transactions, Optical Character Recognition (OCR) of documents has invaluable practical worth. Optical character recognition is a science that enables the translation of various types of documents or images into analyzable, editable and searchable data. During the last decade, researchers have used artificial intelligence/machine learning tools to automatically analyze handwritten and printed documents in order to convert them into electronic format. The objective of this review paper is to summarize research that has been conducted on character recognition of handwritten documents and to provide research directions. In this Systematic Literature Review (SLR) we collected, synthesized and analyzed research articles on the topic of handwritten OCR (and closely related topics) published between the years 2000 and 2019. We searched widely used electronic databases by following a pre-defined review protocol. Articles were searched using keywords, forward reference searching and backward reference searching in order to find all the articles related to the topic. After carefully following the study selection process, 176 articles were selected for this SLR. This review article presents state-of-the-art results and techniques on OCR and also provides research directions by highlighting research gaps.

139 citations


Posted Content
TL;DR: In this paper, a systematic literature review (SLR) is presented to summarize research that has been conducted on character recognition of handwritten documents and to provide research directions, which serves the purpose of presenting state-of-the-art results and techniques on OCR.
Abstract: Given the ubiquity of handwritten documents in human transactions, Optical Character Recognition (OCR) of documents has invaluable practical worth. Optical character recognition is a science that enables the translation of various types of documents or images into analyzable, editable and searchable data. During the last decade, researchers have used artificial intelligence/machine learning tools to automatically analyze handwritten and printed documents in order to convert them into electronic format. The objective of this review paper is to summarize research that has been conducted on character recognition of handwritten documents and to provide research directions. In this Systematic Literature Review (SLR) we collected, synthesized and analyzed research articles on the topic of handwritten OCR (and closely related topics) published between the years 2000 and 2018. We searched widely used electronic databases by following a pre-defined review protocol. Articles were searched using keywords, forward reference searching and backward reference searching in order to find all the articles related to the topic. After carefully following the study selection process, 142 articles were selected for this SLR. This review article presents state-of-the-art results and techniques on OCR and also provides research directions by highlighting research gaps.

93 citations


Journal ArticleDOI
TL;DR: This paper proposes a fully convolutional network without any recurrent connections, trained with the CTC loss function, which achieves state-of-the-art results on seven public benchmark datasets covering a wide spectrum of text recognition tasks.
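The recurrence-free recipe in this TL;DR reduces, in practice, to training a convolutional encoder directly with the CTC loss. A minimal PyTorch sketch, assuming a hypothetical fully convolutional backbone whose output has been reshaped into per-timestep class scores:

import torch
import torch.nn as nn

T, N, C = 40, 8, 37            # timesteps, batch size, alphabet size (index 0 = CTC blank)
logits = torch.randn(T, N, C, requires_grad=True)   # stand-in for conv_net(images)
log_probs = logits.log_softmax(2)

targets = torch.randint(1, C, (N, 12), dtype=torch.long)       # label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                # gradients flow back into the purely convolutional encoder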

80 citations


Proceedings ArticleDOI
01 Jan 2020
TL;DR: A series of extrinsic assessment tasks are performed using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks, finding a consistent impact resulting from OCR errors on downstream tasks, with some tasks more irredeemably harmed by OCR errors.
Abstract: A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks — sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning — using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks with some tasks more irredeemably harmed by OCR errors. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.
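One way to reproduce the spirit of such an extrinsic assessment is to run the same off-the-shelf tool on clean and OCR'd versions of a passage and diff the outputs. The snippet below does this for named entity recognition with spaCy, using made-up example sentences rather than the paper's corpora:

# requires the small English model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

clean = "Charles Dickens visited Manchester in October 1843."
ocr   = "Charles Dlckens vis1ted Manchestcr in 0ctober 1843."   # artificial OCR noise

ents_clean = {(e.text, e.label_) for e in nlp(clean).ents}
ents_ocr   = {(e.text, e.label_) for e in nlp(ocr).ents}

print("lost because of OCR noise:", ents_clean - ents_ocr)
print("spurious because of OCR noise:", ents_ocr - ents_clean)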

77 citations


Proceedings ArticleDOI
21 Apr 2020
TL;DR: Twitter A11y increases access to social media platforms for people with visual impairments by providing high-quality automatic descriptions for user-posted images by increasing alt-text coverage from 7.6% to 78.5%, before crowdsourcing descriptions for the remaining images.
Abstract: Social media platforms are integral to public and private discourse, but are becoming less accessible to people with vision impairments due to an increase in user-posted images. Some platforms (i.e. Twitter) let users add image descriptions (alternative text), but only 0.1% of images include these. To address this accessibility barrier, we created Twitter A11y, a browser extension to add alternative text on Twitter using six methods. For example, screenshots of text are common, so we detect textual images, and create alternative text using optical character recognition. Twitter A11y also leverages services to automatically generate alternative text or reuse them from across the web. We compare the coverage and quality of Twitter A11y's six alt-text strategies by evaluating the timelines of 50 self-identified blind Twitter users. We find that Twitter A11y increases alt-text coverage from 7.6% to 78.5%, before crowdsourcing descriptions for the remaining images. We estimate that 57.5% of returned descriptions are high-quality. We then report on the experiences of 10 participants with visual impairments using the tool during a week-long deployment. Twitter A11y increases access to social media platforms for people with visual impairments by providing high-quality automatic descriptions for user-posted images.
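The OCR strategy mentioned above, generating alt text for screenshots of text, can be approximated with an off-the-shelf engine such as Tesseract. The sketch below is an illustration under that assumption, not the Twitter A11y implementation; the length threshold and helper name are arbitrary:

# requires the Tesseract binary plus: pip install pytesseract pillow
from PIL import Image
import pytesseract

def alt_text_from_screenshot(path, min_chars=20):
    image = Image.open(path)
    text = pytesseract.image_to_string(image).strip()
    # Heuristic: treat the image as a "textual image" only if OCR finds a reasonable
    # amount of text; otherwise fall back to other alt-text strategies.
    return text if len(text) >= min_chars else None

print(alt_text_from_screenshot("tweet_screenshot.png"))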

63 citations


Posted Content
TL;DR: A novel dataset, TextCaps, with 145k captions for 28k images, challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.
Abstract: Image descriptions can help visually impaired people to quickly understand the image content. While we made significant progress in automatically describing images and optical character recognition, current approaches are unable to include written text in their descriptions, although text is omnipresent in human environments and frequently critical to understand our surroundings. To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images. Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects. We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension. Our analysis with automatic and human studies shows that our new TextCaps dataset provides many new technical challenges over previous datasets.

60 citations


Posted Content
Wenwen Yu1, Ning Lu, Xianbiao Qi, Ping Gong1, Rong Xiao 
TL;DR: PICK is introduced, a framework that is effective and robust in handling complex document layouts for KIE by combining graph learning with graph convolution operations, yielding a richer semantic representation containing the textual and visual features and the global layout without ambiguity.
Abstract: Computer vision with state-of-the-art deep learning models has recently achieved huge success in the field of Optical Character Recognition (OCR), including text detection and recognition tasks. However, Key Information Extraction (KIE) from documents, as the downstream task of OCR with a large number of real-world use scenarios, remains a challenge, because documents not only have textual features extracted by OCR systems but also have semantic visual features that are not fully exploited and play a critical role in KIE. Little work has been devoted to efficiently making full use of both the textual and visual features of documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex document layouts for KIE by combining graph learning with graph convolution operations, yielding a richer semantic representation containing the textual and visual features and the global layout without ambiguity. Extensive experiments on real-world datasets have been conducted to show that our method outperforms baseline methods by significant margins. Our code is available at this https URL.
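The core operation referenced in the abstract, graph convolution over text segments whose features fuse textual and visual information, can be sketched generically as follows. This is an illustrative GCN layer, not the released PICK code (which is linked from the paper); shapes and the soft adjacency are placeholders.

import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # adj: (N, N) soft adjacency over text segments; row-normalise before propagation.
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.linear(adj @ node_feats))

N, D = 16, 256                       # 16 text segments, 256-d fused features
node_feats = torch.randn(N, D)       # e.g. concat of text-encoder and visual RoI features
adj = torch.rand(N, N)               # learned/soft graph over the document layout
out = GraphConv(D, D)(node_feats, adj)   # richer, layout-aware segment representations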

58 citations


Book ChapterDOI
23 Aug 2020
TL;DR: In this paper, the authors focus on tables that have complex structures, dense content, and varying layouts with no dependency on meta-features and/or optical character recognition (OCR) models.
Abstract: Tables are information-rich structured objects in document images. While significant work has been done in localizing tables as graphic objects in document images, only limited attempts exist on table structure recognition. Most existing literature on structure recognition depends on extraction of meta-features from the PDF document or on optical character recognition (OCR) models to extract low-level layout features from the image. However, these methods fail to generalize well because of the absence of meta-features or errors made by the OCR when there is a significant variance in table layouts and text organization. In our work, we focus on tables that have complex structures, dense content, and varying layouts with no dependency on meta-features and/or OCR.

53 citations


Posted Content
TL;DR: This paper proposes a practical ultra lightweight OCR system, i.e., PP-OCR, with an overall model size of only 3.5M, and introduces a bag of strategies to either enhance the model ability or reduce the model size.
Abstract: Optical Character Recognition (OCR) systems have been widely used in a variety of application scenarios, such as office automation (OA) systems, factory automation, online education, map production, etc. However, OCR is still a challenging task due to the variety of text appearances and the demand for computational efficiency. In this paper, we propose a practical ultra-lightweight OCR system, i.e., PP-OCR. The overall model size of PP-OCR is only 3.5M for recognizing 6622 Chinese characters and 2.8M for recognizing 63 alphanumeric symbols, respectively. We introduce a bag of strategies to either enhance the model ability or reduce the model size. The corresponding ablation experiments with real data are also provided. Meanwhile, several pre-trained models for Chinese and English recognition are released, including a text detector (97K images are used), a direction classifier (600K images are used) as well as a text recognizer (17.9M images are used). Besides, the proposed PP-OCR is also verified in several other language recognition tasks, including French, Korean, Japanese and German. All of the above-mentioned models are open-sourced and the code is available in the GitHub repository, i.e., this https URL.
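Since the PP-OCR models are open-sourced, they can be exercised through the PaddleOCR Python package. The usage sketch below follows the package's documented interface, but the exact arguments and result format may differ across releases, so treat it as an illustration and check the repository:

# pip install paddleocr paddlepaddle
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")   # loads detector, direction classifier, recognizer
result = ocr.ocr("invoice.jpg", cls=True)        # path to any document photo
for box, (text, confidence) in result[0]:
    print(text, round(confidence, 3))            # recognized line and its confidence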

52 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work presents ScrabbleGAN, a semi-supervised approach to synthesize handwritten text images that are versatile both in style and lexicon, and relies on a novel generative model which can generate images of words with an arbitrary length.
Abstract: The performance of optical character recognition (OCR) systems has improved significantly in the deep learning era. This is especially true for handwritten text recognition (HTR), where each author has a unique style, unlike printed text, where the variation is smaller by design. That said, deep-learning-based HTR is limited, as in every other task, by the number of training examples. Gathering data is a challenging and costly task, and even more so the labeling task that follows, on which we focus here. One possible approach to reduce the burden of data annotation is semi-supervised learning. Semi-supervised methods use, in addition to labeled data, some unlabeled samples to improve performance compared to fully supervised ones. Consequently, such methods may adapt to unseen images during test time. We present ScrabbleGAN, a semi-supervised approach to synthesize handwritten text images that are versatile both in style and lexicon. ScrabbleGAN relies on a novel generative model which can generate images of words with an arbitrary length. We show how to operate our approach in a semi-supervised manner, enjoying the aforementioned benefits such as a performance boost over state-of-the-art supervised HTR. Furthermore, our generator can manipulate the resulting text style. This allows us to change, for instance, whether the text is cursive, or how thin the pen stroke is.

Posted Content
TL;DR: ScrabbleGAN as discussed by the authors is a semi-supervised approach to synthesize handwritten text images that are versatile both in style and lexicon using a generative model which can generate images of words with an arbitrary length.
Abstract: The performance of optical character recognition (OCR) systems has improved significantly in the deep learning era. This is especially true for handwritten text recognition (HTR), where each author has a unique style, unlike printed text, where the variation is smaller by design. That said, deep-learning-based HTR is limited, as in every other task, by the number of training examples. Gathering data is a challenging and costly task, and even more so the labeling task that follows, on which we focus here. One possible approach to reduce the burden of data annotation is semi-supervised learning. Semi-supervised methods use, in addition to labeled data, some unlabeled samples to improve performance compared to fully supervised ones. Consequently, such methods may adapt to unseen images during test time. We present ScrabbleGAN, a semi-supervised approach to synthesize handwritten text images that are versatile both in style and lexicon. ScrabbleGAN relies on a novel generative model which can generate images of words with an arbitrary length. We show how to operate our approach in a semi-supervised manner, enjoying the aforementioned benefits such as a performance boost over state-of-the-art supervised HTR. Furthermore, our generator can manipulate the resulting text style. This allows us to change, for instance, whether the text is cursive, or how thin the pen stroke is.
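The arbitrary-length property comes from generating each character with its own filter bank and concatenating the resulting feature maps horizontally. The toy generator below only illustrates that idea; layer sizes, the embedding-as-filter-bank shortcut and all names are placeholders rather than the authors' model.

import torch
import torch.nn as nn

class CharBankGenerator(nn.Module):
    def __init__(self, n_chars=26, z_dim=32, feat=64):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, feat)   # one "filter bank" per character
        self.to_patch = nn.Sequential(
            nn.Linear(z_dim + feat, 16 * 4 * 4), nn.ReLU(),
            nn.Unflatten(1, (16, 4, 4)),
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, char_ids, z):
        # char_ids: (L,) word as character indices; z: (z_dim,) shared style noise
        cond = torch.cat([self.char_embed(char_ids),
                          z.expand(len(char_ids), -1)], dim=1)
        patches = self.to_patch(cond)             # (L, 1, 16, 16): one patch per character
        return torch.cat(list(patches), dim=-1)   # (1, 16, 16*L): arbitrary-width word image

word = torch.tensor([2, 0, 19])                   # e.g. "cat" with a=0 indexing
img = CharBankGenerator()(word, torch.randn(32))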

Journal ArticleDOI
TL;DR: This article presents OCR combining a CNN with an Error-Correcting Output Code (ECOC) classifier and shows that CNN-ECOC gives higher accuracy compared to a traditional CNN classifier.
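The CNN-ECOC combination can be emulated by training an error-correcting output code classifier on features taken from a CNN's penultimate layer. The sketch below uses scikit-learn's OutputCodeClassifier with random stand-in features, since the paper's network and data are not reproduced here:

import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 128))     # stand-in for CNN penultimate-layer features
y_train = rng.integers(0, 10, size=500)   # 10 character classes

# Each class is encoded as a binary codeword; one LinearSVC is trained per code bit,
# and prediction picks the class whose codeword is closest to the predicted bits.
ecoc = OutputCodeClassifier(LinearSVC(), code_size=1.5, random_state=0)
ecoc.fit(X_train, y_train)
print(ecoc.predict(X_train[:5]))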

Book ChapterDOI
23 Aug 2020
TL;DR: The TextCaps dataset as mentioned in this paper is a large dataset with 145k captions for 28k images, which is used to study how to comprehend text in the context of an image, requiring spatial, semantic and visual reasoning between multiple text tokens and visual entities such as objects.
Abstract: Image descriptions can help visually impaired people to quickly understand the image content. While we made significant progress in automatically describing images and optical character recognition, current approaches are unable to include written text in their descriptions, although text is omnipresent in human environments and frequently critical to understand our surroundings. To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images. Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects. We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension. Our analysis with automatic and human studies shows that our new TextCaps dataset provides many new technical challenges over previous datasets.

Journal ArticleDOI
TL;DR: This paper designs an image processing module for a mobile device based on the characteristics of a CNN, and proposes a lightweight network structure for optical character recognition (OCR) on specific data sets.
Abstract: Deep learning (DL) is a hot topic in current pattern recognition and machine learning. DL has unprecedented potential to solve many complex machine learning problems and is clearly attractive in the framework of mobile devices. The availability of powerful pattern recognition tools creates tremendous opportunities for next-generation smart applications. A convolutional neural network (CNN) enables data-driven learning and extraction of highly representative, hierarchical image features from appropriate training data. However, for some data sets, the CNN classification method needs adjustments in its structure and parameters. Mobile computing has certain requirements for running time and network weight of the neural network. In this paper, we first design an image processing module for a mobile device based on the characteristics of a CNN. Then, we describe how to use the mobile to collect data, process the data, and construct the data set. Finally, considering the computing environment and data characteristics of mobile devices, we propose a lightweight network structure for optical character recognition (OCR) on specific data sets. The proposed method using a CNN has been validated by comparison with the results of existing methods, used for optical character recognition.

Journal ArticleDOI
TL;DR: The training of a variety of OCR models with deep neural networks (DNN) is explored, finding an optimal DNN for the data and, with additional training data, successfully training high-quality mixed-language models.
Abstract: The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is rather low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy ( https://github.com/tmbdev/ocropy ) and Tesseract ( https://github.com/tesseract-ocr/tesseract ), but so far, none of the methods have managed to successfully train a mixed model that recognizes all of the data in the corpus, which would be essential for an efficient re-OCRing of the corpus. The difficulty lies in the fact that the corpus is printed in the two main languages of Finland (Finnish and Swedish) and in two font families (Blackletter and Antiqua). In this paper, we explore the training of a variety of OCR models with deep neural networks (DNN). First, we find an optimal DNN for our data and, with additional training data, successfully train high-quality mixed-language models. Furthermore, we revisit the effect of confidence voting on the OCR results with different model combinations. Finally, we perform post-correction on the new OCR results and perform error analysis. The results show a significant boost in accuracy, resulting in 1.7% CER on the Finnish and 2.7% CER on the Swedish test set. The greatest accomplishment of the study is the successful training of one mixed language model for the entire corpus and finding a voting setup that further improves the results.
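The character error rates quoted above are Levenshtein distances normalised by the reference length. A small self-contained helper for computing them (not the paper's evaluation code) looks like this:

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate = Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n] / max(m, 1)

print(cer("Helsingin Sanomat", "Helsinqin Sanomat"))  # one substitution -> ~0.059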

Journal ArticleDOI
TL;DR: This work introduces entropy-based thresholding with a metaheuristic approach to find the optimal threshold for gray images and finds that the Tsallis method offers better PSNR and SSIM values and is capable of effective segmentation of images.
Abstract: Image segmentation is a necessity for many applications like brain tumor detection, optical character recognition, thermal energy leakage detection, face recognition, etc. Multilevel thresholding is the...
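For reference, bi-level Tsallis-entropy thresholding can be written compactly with an exhaustive search over thresholds; the paper's contribution lies in the metaheuristic search and the multilevel extension, so the sketch below (with an arbitrary entropic index q) only shows the underlying criterion:

import numpy as np

def tsallis_threshold(gray, q=0.8):
    # Normalised gray-level histogram of an 8-bit image.
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    best_t, best_val = 0, -np.inf
    for t in range(1, 255):
        pa, pb = p[:t].sum(), p[t:].sum()
        if pa == 0 or pb == 0:
            continue
        # Tsallis entropies of background and foreground, then the pseudo-additive sum.
        sa = (1 - np.sum((p[:t] / pa) ** q)) / (q - 1)
        sb = (1 - np.sum((p[t:] / pb) ** q)) / (q - 1)
        val = sa + sb + (1 - q) * sa * sb
        if val > best_val:
            best_t, best_val = t, val
    return best_t

gray = (np.random.rand(64, 64) * 255).astype(np.uint8)  # stand-in grayscale image
print(tsallis_threshold(gray))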

Journal ArticleDOI
01 Nov 2020
TL;DR: Improved recognition results for Devanagari ancient characters have been presented using the scale-invariant feature transform (SIFT) and Gabor filter feature extraction techniques and poly-SVM classifier.
Abstract: Recognition of ancient handwritten Devanagari characters is an important task for the effective exploitation of the priceless information contained in them. There are numerous Devanagari ancient handwritten documents from the fifteenth to the nineteenth century. This paper presents an optical character recognition system for the recognition of Devanagari ancient manuscripts. In this paper, improved recognition results for Devanagari ancient characters have been presented using the scale-invariant feature transform (SIFT) and Gabor filter feature extraction techniques. A support vector machine (SVM) classifier is used for the classification task in this work. For experimental results, a database consisting of 5484 samples of Devanagari characters was collected from various ancient manuscripts held in libraries and museums. SIFT- and Gabor-filter-based features are used to extract the properties of the handwritten Devanagari ancient characters for recognition. Principal component analysis is used to reduce the length of the feature vector in order to reduce the training time of the model and to improve recognition accuracy. A recognition accuracy of 91.39% has been achieved using the proposed system based on a tenfold cross-validation technique and a poly-SVM classifier.
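The pipeline described above (SIFT and Gabor features, PCA, polynomial-kernel SVM) maps onto OpenCV and scikit-learn as sketched below; the feature pooling and all parameter values are illustrative assumptions, not the authors' settings:

import cv2
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

sift = cv2.SIFT_create()
gabors = [cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5)
          for theta in np.linspace(0, np.pi, 4, endpoint=False)]

def features(img):
    # img: uint8 grayscale character image.
    _, desc = sift.detectAndCompute(img, None)
    sift_vec = desc.mean(axis=0) if desc is not None else np.zeros(128)
    gabor_stats = []
    for g in gabors:
        resp = cv2.filter2D(img, cv2.CV_32F, g)
        gabor_stats += [resp.mean(), resp.std()]
    return np.concatenate([sift_vec, gabor_stats])

# X: list of grayscale character images, y: their labels (assumed available)
# X_feat = np.stack([features(img) for img in X])
clf = make_pipeline(PCA(n_components=50), SVC(kernel="poly", degree=3))
# clf.fit(X_feat, y)   # accuracy was estimated with 10-fold cross-validation in the paper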

Posted Content
TL;DR: This paper argues that a simple attention mechanism can do the same or even a better job without any bells and whistles of multi-modality encoder design, and finds that this simple baseline model consistently outperforms state-of-the-art (SOTA) models on two popular benchmarks, TextVQA and all three tasks of ST-VQA.
Abstract: Texts appearing in daily scenes that can be recognized by OCR (Optical Character Recognition) tools contain significant information, such as street names, product brands and prices. Two tasks -- text-based visual question answering and text-based image captioning, which extend existing vision-language applications with text -- are catching on rapidly. To address these problems, many sophisticated multi-modality encoding frameworks (such as heterogeneous graph structures) are being used. In this paper, we argue that a simple attention mechanism can do the same or even a better job without any bells and whistles. Under this mechanism, we simply split OCR token features into separate visual- and linguistic-attention branches, and send them to a popular Transformer decoder to generate answers or captions. Surprisingly, we find this simple baseline model is rather strong -- it consistently outperforms state-of-the-art (SOTA) models on two popular benchmarks, TextVQA and all three tasks of ST-VQA, although these SOTA models use far more complex encoding mechanisms. Transferring it to text-based image captioning, we also surpass the TextCaps Challenge 2020 winner. We hope this work sets a new baseline for these two OCR-text-related applications and inspires new thinking about multi-modality encoder design. Code is available at this https URL
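A heavily simplified sketch of the proposed baseline: OCR token features are split into visual and linguistic sets, and a standard Transformer decoder attends over them while generating the answer. Here the two branches are simply concatenated into one memory, which is a simplification of the paper's separate attention branches; all dimensions are placeholders.

import torch
import torch.nn as nn

d_model, n_answer_tokens, n_ocr = 256, 30, 20
visual_feats = torch.randn(n_ocr, 1, d_model)      # OCR token appearance/box features
linguistic_feats = torch.randn(n_ocr, 1, d_model)  # OCR token word-embedding features
memory = torch.cat([visual_feats, linguistic_feats], dim=0)  # (2*n_ocr, batch, d_model)

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)

answer_so_far = torch.randn(n_answer_tokens, 1, d_model)  # embedded partial answer/caption
out = decoder(tgt=answer_so_far, memory=memory)           # attends over both OCR feature sets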

Proceedings ArticleDOI
12 Oct 2020
TL;DR: A novel Cascade Reasoning Network (CRN) is proposed that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module that aims to explicitly model the connections and interactions between texts and visual concepts.
Abstract: We study the problem of text-based visual question answering (T-VQA) in this paper. Unlike general visual question answering (VQA) which only builds connections between questions and visual contents, T-VQA requires reading and reasoning over both texts and visual concepts that appear in images. Challenges in T-VQA mainly lie in three aspects: 1) It is difficult to understand the complex logic in questions and extract specific useful information from rich image contents to answer them; 2) The text-related questions are also related to visual concepts, but it is difficult to capture cross-modal relationships between the texts and the visual concepts; 3) If the OCR (optical character recognition) system fails to detect the target text, the training will be very difficult. To address these issues, we propose a novel Cascade Reasoning Network (CRN) that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module. Specifically, the PAM regards the multimodal information fusion operation as a stepwise encoding process and uses the previous attention results to guide the next fusion process. The MRG aims to explicitly model the connections and interactions between texts and visual concepts. To alleviate the dependence on the OCR system, we introduce an auxiliary task to train the model with accurate supervision signals, thereby enhancing the reasoning ability of the model in question answering. Extensive experiments on three popular T-VQA datasets demonstrate the effectiveness of our method compared with SOTA methods. The source code is available at https://github.com/guanghuixu/CRN_tvqa.

Posted Content
TL;DR: An end-to-end structured multimodal attention (SMA) neural network is proposed to mainly solve the first two issues above.
Abstract: Text based Visual Question Answering (TextVQA) is a recently raised challenge that requires a machine to read text in images and answer natural language questions by jointly reasoning over the question, Optical Character Recognition (OCR) tokens and visual content. Most of the state-of-the-art (SoTA) VQA methods fail to answer these questions because of i) poor text reading ability; ii) lack of text-visual reasoning capacity; and iii) adopting a discriminative answering mechanism instead of a generative one, which makes it hard to cover both OCR tokens and general text tokens in the final answer. In this paper, we propose a structured multimodal attention (SMA) neural network to solve the above issues. Our SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it. Finally, the outputs from the above module are processed by a global-local attentional answering module to produce an answer that covers tokens from both OCR and general text iteratively. Our proposed model outperforms the SoTA models on the TextVQA dataset and all three tasks of the ST-VQA dataset. To provide an upper bound for our method and a fair testing base for further works, we also provide human-annotated ground-truth OCR annotations for the TextVQA dataset, which were not given in the original release.

Journal ArticleDOI
TL;DR: This paper introduces a set of methods that allows performing OCR on historical document images using only a small amount of real, manually annotated training data; the obtained scores are comparable to or even better than those of several state-of-the-art systems.
Abstract: As the number of digitized historical documents has increased rapidly during the last few decades, it is necessary to provide efficient methods of information retrieval and knowledge extraction to make the data accessible. Such methods are dependent on optical character recognition (OCR), which converts the document images into textual representations. Nowadays, OCR methods are often not adapted to the historical domain; moreover, they usually need a significant amount of annotated documents. Therefore, this paper introduces a set of methods that allows performing OCR on historical document images using only a small amount of real, manually annotated training data. The presented complete OCR system includes two main tasks: page layout analysis, including text block and line segmentation, and OCR. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. Both approaches are state of the art in the relevant fields. We have created a novel real dataset for OCR from the Porta fontium portal. This corpus is freely available for research, and all proposed methods are evaluated on these data. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim at determining the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than the scores of several state-of-the-art systems. To sum up, this paper shows how to create an efficient OCR system for historical documents that needs only a little annotated training data.

Journal Article
TL;DR: In this paper, an adaptive multi-mode EC (AMMEC) algorithm at the decoder, based on utilizing the pre-processing flexible macro-block ordering error resilience (FMO-ER) technique at the encoder, is proposed to efficiently conceal the erroneous MBs of intra- and inter-coded frames of 3D video.
Abstract: 3D Multi-View Video (MVV) consists of multiple video streams shot by several cameras around one scene simultaneously. In Multi-view Video Coding (MVC), the spatio-temporal and inter-view correlations between frames and views are often used for error concealment. 3D video transmission over erroneous networks remains a substantial issue due to restricted resources and the presence of severe channel errors. Efficiently compressing 3D video at a low transmission rate, while maintaining a high quality of the received 3D video, is extremely challenging. Since it is not feasible to re-transmit all the corrupted Macro-Blocks (MBs) in real-time applications with limited resources, it is mandatory to retrieve the lost MBs at the decoder side using suitable post-processing schemes, such as Error Concealment (EC). EC algorithms have the advantage of enhancing the received 3D video quality without modifications to the transmission rate or to the encoder hardware or software. This work explores a variety of Adaptive Multi-Mode EC (AMMEC) algorithms at the decoder, based on utilizing various adaptive pre-processing techniques, i.e., Flexible Macro-block Ordering Error Resilience (FMO-ER) at the encoder, to efficiently conceal and recover the erroneous MBs of intra- and inter-coded frames of the transmitted 3D video, together with extensive experimental simulation results showing that the proposed schemes can significantly improve the objective and subjective 3D video quality. In addition, secure, timely, fast, and reliable transmission of Wireless Capsule Endoscopy (WCE) images containing abnormalities to physicians is considered. The proposed algorithm uses image pre-processing followed by edge detection using the Fisher Transform (FT) and morphological operations in order to extract features. A binary classifier, a Linear Support Vector Machine (LSVM), is implemented in order to classify the WCE images; depending on the channel condition, specific frames are then transmitted to the physician. Experimental simulation results show that the proposed FMO-ER/AMMEC schemes can significantly improve the objective and subjective 3D video quality. Text superimposed on video frames provides supplemental but important information for video indexing and retrieval. The detection and recognition of text from video is thus a crucial issue in automated content-based indexing of visual information in video archives. Text of interest is not limited to static text: it may be scrolling in a linear motion, where only a part of the text information is available in different frames of the video. The problem is further complicated if the video is corrupted with noise. An algorithm is proposed to detect, classify and segment both static and simple linearly moving text against a complex noisy background.
The extracted texts are further processed using averaging to achieve a quality suitable for text recognition by commercial optical character recognition (OCR) software. We have developed a system with multiple pan-tilt cameras for capturing high-resolution videos of a moving person. This system controls the cameras so that each camera captures the best view of the person (i.e., one of the body parts such as the head, torso, and limbs) based on criteria for camera-work optimization. For achieving this optimization in real time, time-consuming pre-processes, which give useful clues for the optimization, are performed during a training stage. Specifically, a target performance (e.g., a dance) is captured to acquire the configuration of the body parts at each frame. During the real capture stage, the system compares an online-reconstructed shape with those in the training data for fast retrieval of the configuration of the body parts. The retrieved configuration is employed by an efficient scheme for optimizing the camera work. Experimental results show the camera work optimized in accordance with the given criteria. High-resolution 3D videos produced by the proposed system are also shown as a typical use of high-resolution videos.

Journal ArticleDOI
TL;DR: A comprehensive survey of the work done in the various phases of an OCR system, with special focus on OCR for ancient text documents, is presented, along with future directions for upcoming researchers in the field of ancient text recognition.
Abstract: Optical character recognition (OCR) is an important research area in the field of pattern recognition. A lot of research has been done on OCR in the last 60 years. There is a large volume of paper-based data in various libraries and offices. Also, there is a wealth of knowledge in the form of ancient text documents. It is a challenge to maintain and search this paper-based data. At many places, efforts are being made to digitize this data. Paper-based documents are scanned to digitize the data, but the scanned data is in pictorial form. It cannot be recognized by computers, because computers understand standard alphanumeric characters as ASCII or some other codes. Therefore, alphanumeric information must be retrieved from the scanned images. An optical character recognition system allows us to convert a document into electronic text, which can be used for operations such as editing and searching. An OCR system is the machine replication of human reading and has been the subject of intensive research for more than six decades. This paper presents a comprehensive survey of the work done in the various phases of an OCR system, with special focus on OCR for ancient text documents. This paper will help novice researchers by providing a comprehensive study of the various phases, namely segmentation, feature extraction and classification techniques, required for an OCR system, especially for ancient documents. It has been observed that limited work has been done on the recognition of ancient documents, especially for the Devanagari script. This article also presents future directions for upcoming researchers in the field of ancient text recognition.

Proceedings ArticleDOI
01 Jul 2020
TL;DR: This paper proposes a model based on a hierarchical stack of Transformers to approach the NER task for historical data, and shows that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.
Abstract: This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.
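Applying an off-the-shelf Transformer NER tagger to OCR'd text, the baseline against which the paper's hierarchical stack is an improvement, can be done with the Hugging Face pipeline API. The checkpoint below is just a common example, not the one used in the paper, and the sample sentence contains artificial OCR noise:

# pip install transformers torch
from transformers import pipeline

ner = pipeline("token-classification",
               model="dbmdz/bert-large-cased-finetuned-conll03-english",
               aggregation_strategy="simple")

ocr_text = "Tlie city of Strasbovrg celebrated the visit of Emperor Wilhelm II."
for ent in ner(ocr_text):
    # Misspellings introduced by OCR often change which entities are found and their scores.
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))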

Proceedings ArticleDOI
31 Jan 2020
TL;DR: In this article, the authors presented a new dataset, the MIDV-2019 dataset, containing video clips shot with modern high-resolution mobile cameras, with strong projective distortions and with low lighting conditions.
Abstract: Recognition of identity documents using mobile devices has become a topic of a wide range of computer vision research. The portfolio of methods and algorithms for solving such tasks as face detection, document detection and rectification, text field recognition, and others is growing, and the scarcity of datasets has become an important issue. One of the openly accessible datasets for evaluating such methods is MIDV-500, containing video clips of 50 identity document types in various conditions. However, the variability of capturing conditions in MIDV-500 did not address some of the key issues, mainly significant projective distortions and different lighting conditions. In this paper we present the MIDV-2019 dataset, containing video clips shot with modern high-resolution mobile cameras, with strong projective distortions and with low lighting conditions. A description of the added data is presented, and experimental baselines for text field recognition in different conditions are provided.

Proceedings ArticleDOI
15 Jul 2020
TL;DR: The concept of text extraction, the process of extraction, and the latest techniques, technologies, and current research in the area are introduced to help other researchers in the field to get an overview of the technology.
Abstract: In the digital era, almost everything is automated, and information is stored and communicated in digital forms. However, there are several situations where the data is not digitized, and it might become essential to extract text from those sources to store it in digitized form. The latest technology, such as text recognition software, has completely revolutionized the process of text extraction using Optical Character Recognition. Therefore, this paper introduces the concept, explains the process of extraction, and presents the latest techniques, technologies, and current research in the area. Such a review will help other researchers in the field to get an overview of the technology.

Proceedings ArticleDOI
Wei Han1, Hantao Huang1, Tao Han1
01 Dec 2020
TL;DR: This paper proposes a localization-aware answer prediction network (LaAP-Net) that not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer.
Abstract: Image text carries essential information to understand the scene and perform reasoning. Text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. Positional information of text is underused and there is a lack of evidence for the generated answer. As such, this paper proposes a localization-aware answer prediction network (LaAP-Net) to address this challenge. Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.

Journal ArticleDOI
TL;DR: This paper uses MobileNet, a state-of-the-art convolutional neural network (CNN) architecture designed for mobile devices as it requires less computing power, for handwritten character recognition and achieves 96.46% accuracy in recognizing 231 classes.
Abstract: Handwritten character recognition is a very tough task in the case of a complex-shaped alphabet set like the Bangla script. As optical character recognition (OCR) has huge applications in mobile devices, the model needs to be suitable for mobile applications. Much research has been performed in this area, but none of it has achieved satisfactory accuracy or been able to detect more than 200 characters. MobileNet is a state-of-the-art convolutional neural network (CNN) architecture designed for mobile devices, as it requires less computing power. In this paper, we used MobileNet for handwritten character recognition. It has achieved 96.46% accuracy in recognizing 231 classes (171 compound, 50 basic and 10 numerals), 96.17% accuracy on 171 compound character classes, 98.37% accuracy on 50 basic character classes and 99.56% accuracy on 10 numeral character classes.
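A transfer-learning sketch in the spirit of the paper, a MobileNet backbone with a new 231-way classification head, is shown below; the input size, head design and training settings are placeholders rather than the authors' configuration:

import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.MobileNet(
    input_shape=(128, 128, 3), include_top=False, weights="imagenet")

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(231, activation="softmax"),   # 171 compound + 50 basic + 10 numeral classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=20)  # character datasets assumed available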

Journal ArticleDOI
TL;DR: Using machine learning applied to OCR-extracted text has the potential to accurately identify clinically-relevant scanned content within EHRs.