
Showing papers on "Optical character recognition" published in 2022


Proceedings ArticleDOI
02 Jan 2022
TL;DR: A traditional-split versus leave-one-dataset-out experimental setup is proposed to empirically assess the cross-dataset generalization of 12 Optical Character Recognition models applied to LP recognition on nine publicly available datasets with a great variety in several aspects.
Abstract: Automatic License Plate Recognition (ALPR) systems have shown remarkable performance on license plates (LPs) from multiple regions due to advances in deep learning and the increasing availability of datasets. The evaluation of deep ALPR systems is usually done within each dataset; therefore, it is questionable if such results are a reliable indicator of generalization ability. In this paper, we propose a traditional-split versus leave-one-dataset-out experimental setup to empirically assess the cross-dataset generalization of 12 Optical Character Recognition (OCR) models applied to LP recognition on nine publicly available datasets with a great variety in several aspects (e.g., acquisition settings, image resolution, and LP layouts). We also introduce a public dataset for end-to-end ALPR that is the first to contain images of vehicles with Mercosur LPs and the one with the highest number of motorcycle images. The experimental results shed light on the limitations of the traditional-split protocol for evaluating approaches in the ALPR context, as there are significant drops in performance for most datasets when training and testing the models in a leave-one-dataset-out fashion.

19 citations
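The contrast between the two protocols is straightforward to express in code. Below is a minimal, hypothetical sketch of the traditional-split versus leave-one-dataset-out loops; the `load_split`, `train_ocr_model`, and `evaluate` stubs and the dataset names are placeholders, not the authors' actual pipeline.

```python
# Hypothetical sketch: traditional within-dataset split vs. leave-one-dataset-out.
# The stubs below stand in for a real license-plate OCR pipeline.

def load_split(dataset, split):
    """Placeholder: return a list of (image, label) samples for the split."""
    return []

def train_ocr_model(samples):
    """Placeholder: train an OCR model on the samples and return it."""
    return object()

def evaluate(model, samples):
    """Placeholder: return recognition accuracy of the model on the samples."""
    return 0.0

datasets = ["dataset_a", "dataset_b", "dataset_c"]   # stand-ins for the nine public sets

# Traditional split: train and test inside each dataset separately.
traditional = {d: evaluate(train_ocr_model(load_split(d, "train")),
                           load_split(d, "test"))
               for d in datasets}

# Leave-one-dataset-out: train on every other dataset, test on the held-out one.
leave_one_out = {}
for held_out in datasets:
    train_samples = [s for d in datasets if d != held_out
                     for s in load_split(d, "train")]
    leave_one_out[held_out] = evaluate(train_ocr_model(train_samples),
                                       load_split(held_out, "test"))

print(traditional)
print(leave_one_out)
```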


Journal ArticleDOI
TL;DR: This work proposes an accurate and efficient framework, named OCR-RCNN, for elevator button recognition, comprised of an R-CNN based button detector and an attention-RNN based character recognizer that outperforms alternative strategies and other state-of-the-art methods in the literature.
Abstract: Autonomous elevator operation is considered a promising solution for mobile navigation in office buildings. As a fundamental function, elevator button recognition remains unsolved due to the challenging image conditions and severe data imbalance problem. In this article, we propose an accurate and efficient framework, named OCR-RCNN, for elevator button recognition. The framework comprises a region-based convolutional neural network (R-CNN)-based button detector and an attention-RNN-based character recognizer. Leveraging the two components, we further propose an end-to-end architecture and a cascaded architecture to explore the most effective network design for the framework. Moreover, a perspective distortion removal algorithm is also developed to enhance the inference performance of OCR-RCNN. Another key contribution of this work is that we release the first large-scale elevator panel dataset with 2005 images and 21,767 button labels. Extensive experiments are conducted on the released dataset and two other publicly available datasets. The proposed framework achieves F1 scores of 0.94, 1.00, and 1.00 in the detection task, and accuracies of 79.6%, 96.5%, and 96.4% in the character recognition task. The results demonstrate the advantages of our method, outperforming alternative strategies and other state-of-the-art methods in the literature. The data and code are available on the project webpage https://github.com/zhudelong/ocr-rcnn-v2 .

14 citations



Journal ArticleDOI
TL;DR: In this article, a handwritten document recognition system based on the convolutional neural network technique is presented, which performs image pre-processing stages to prepare data for training using a CNN and then segments the input document using line, word and character segmentation.
Abstract: This paper presents a handwritten document recognition system based on the convolutional neural network technique. In today's world, handwritten document recognition is rapidly attracting the attention of researchers due to its promise as an assistive technology for visually impaired users. This technology is also helpful for automatic data entry systems. For the proposed system, a dataset of English-language handwritten character images was prepared. The proposed system has been trained on this large set of sample data and tested on sample images of user-defined handwritten documents. In this research, multiple experiments yielded very good recognition results. The proposed system first performs image pre-processing stages to prepare data for training with a convolutional neural network. After this processing, the input document is segmented using line, word and character segmentation. The proposed system achieves a character segmentation accuracy of up to 86%. The segmented characters are then sent to a convolutional neural network for recognition. The recognition and segmentation techniques proposed in this paper provide highly accurate results on the given dataset. The proposed work reaches an accuracy of up to 93% during convolutional neural network training, with the validation accuracy slightly lower at 90.42%.

12 citations
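As a rough illustration of the recognition stage described above, the following is a minimal tf.keras sketch of a CNN character classifier; the input size, layer widths, and the assumption of 26 English letter classes are illustrative, not the paper's exact architecture.

```python
# A minimal CNN character classifier (illustrative, not the paper's exact model).
import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 26  # assumption: one class per English letter

model = models.Sequential([
    layers.Input(shape=(32, 32, 1)),            # one segmented character, grayscale
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(train_images, train_labels, validation_split=0.1, epochs=20)
```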


Book ChapterDOI
01 Jan 2022
TL;DR: A detailed overview of general extraction methods from different types of documents with different forms of data is presented in this article, which is expected to advance OCR research by providing a better understanding and assisting researchers in determining which method is ideal for OCR.
Abstract: Many businesses and applications generate huge amounts of data, in various forms, that must be processed and stored daily. Being able to search quickly through this enormous volume of documents and data is an implicit requirement. Documents are being digitized in all possible fields, as collecting the required data from these documents manually is very time consuming and tedious. OCR has saved a huge amount of effort in creating, processing, and storing scanned documents, and it proves very efficient thanks to its use in a variety of applications across the healthcare, education, banking, and insurance industries. Sufficient research and papers exist describing methods for converting the data residing in documents into machine-readable form. This paper gives a detailed overview of general extraction methods for different types of documents with different forms of data and, in addition, illustrates various OCR platforms. The current study is expected to advance OCR research by providing a better understanding and assisting researchers in determining which method is ideal for OCR.

12 citations


Book ChapterDOI
25 Feb 2022
TL;DR: In this paper, the authors make public the OCR annotations for IDL documents, produced with a commercial OCR engine given its superior performance over open-source OCR models, which can be used as a starting point for future work on Document Intelligence.
Abstract: Pretraining has proven successful in Document Intelligence tasks, where a deluge of documents is used to pretrain models that are later fine-tuned on downstream tasks. One of the problems of pretraining approaches is the inconsistent use of pretraining data with different OCR engines, which leads to incomparable results between models. In other words, it is not obvious whether the performance gain comes from the varying amounts of data and distinct OCR engines or from the proposed models. To remedy this problem, we make public the OCR annotations for IDL documents, produced with a commercial OCR engine given its superior performance over open-source OCR models. It is our hope that OCR-IDL can be a starting point for future work on Document Intelligence. All of our data and its collection process with the annotations can be found in https://github.com/furkanbiten/idl_data .

11 citations


Journal ArticleDOI
TL;DR: In this article, the authors evaluated two key indicators for sleep apnea, the apnea-hypopnea index (AHI) and oxygen saturation (SaO2), from 955 scanned sleep study reports.
Abstract: Objective: Scanned documents in electronic health records (EHR) have been a challenge for decades and are expected to remain so for the foreseeable future. Current approaches for processing include image preprocessing, optical character recognition (OCR), and natural language processing (NLP). However, there is limited work evaluating the interaction of image preprocessing methods, NLP models, and document layout. Materials and Methods: We evaluated two key indicators for sleep apnea, the apnea-hypopnea index (AHI) and oxygen saturation (SaO2), from 955 scanned sleep study reports. Image preprocessing methods include gray-scaling, dilating, eroding, and contrast adjustment. OCR was implemented with Tesseract. Seven traditional machine learning models and three deep learning models were evaluated. We also evaluated combinations of image preprocessing methods and two deep learning architectures (with and without structured input providing document layout information), with the goal of optimizing end-to-end performance. Results: Our proposed method using ClinicalBERT reached an AUROC of 0.9743 and document accuracy of 94.76% for AHI, and an AUROC of 0.9523 and document accuracy of 91.61% for SaO2. Discussion: There are multiple, inter-related steps to extract meaningful information from scanned reports. While it would be infeasible to experiment with all possible option combinations, we experimented with several of the most critical steps for information extraction, including image processing and NLP. Given that scanned documents will likely be part of healthcare for years to come, it is critical to develop NLP systems to extract key information from this data. Conclusion: We demonstrated that the proper use of image preprocessing and document layout can be beneficial to scanned document processing.

10 citations
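A minimal sketch of the preprocessing-plus-OCR front end described above, using OpenCV and pytesseract; the file name and parameter values are assumptions, and the downstream NLP models are omitted.

```python
# Illustrative preprocessing (gray-scaling, contrast, dilate/erode) + Tesseract OCR.
import cv2
import numpy as np
import pytesseract

img = cv2.imread("sleep_report_page.png")            # hypothetical scanned page
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)          # gray-scaling
gray = cv2.convertScaleAbs(gray, alpha=1.5, beta=0)   # contrast adjustment
kernel = np.ones((2, 2), np.uint8)
dilated = cv2.dilate(gray, kernel, iterations=1)      # dilating
eroded = cv2.erode(dilated, kernel, iterations=1)     # eroding
text = pytesseract.image_to_string(eroded)            # OCR with Tesseract
print(text)
```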


Journal ArticleDOI
TL;DR: In this article, a simple model based on a deep neural network architecture that combines recent advances in computer vision and machine learning is presented, which can be used to detect and convert a table into a format that can be edited or searched.

9 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a segmentation-free method based on a deep convolutional recurrent neural network to solve the problem of cursive text recognition, particularly focusing on Urdu text in natural scenes.
Abstract: Text recognition in natural scene images is a challenging problem in computer vision. Unlike conventional optical character recognition (OCR), text recognition in natural scene images is more complex due to variations in text size, colors, fonts, orientations, complex backgrounds, occlusion, illuminations and uneven lighting conditions. In this paper, we propose a segmentation-free method based on a deep convolutional recurrent neural network to solve the problem of cursive text recognition, particularly focusing on Urdu text in natural scenes. Compared to non-cursive scripts, Urdu text recognition is more complex due to variations in the writing styles, several shapes of the same character, connected text, ligature overlapping, and stretched, diagonal and condensed text. The proposed model takes a whole word image as input, without pre-segmenting it into individual characters, and then transforms it into a sequence of the relevant features. Our model is based on three components: a deep convolutional neural network (CNN) with shortcut connections to extract and encode the features, a recurrent neural network (RNN) to decode the convolutional features, and a connectionist temporal classification (CTC) layer to map the predicted sequences onto the target labels. To increase text recognition accuracy further, we explore deeper CNN architectures like VGG-16, VGG-19, ResNet-18 and ResNet-34 to extract more appropriate Urdu text features, and compare the recognition results. To conduct the experiments, a new large-scale benchmark dataset of cropped Urdu word images in natural scenes is developed. The experimental results show that the proposed deep CRNN with shortcut connections outperforms the other network architectures. The dataset is publicly available and can be downloaded from https://data.mendeley.com/datasets/k5fz57zd9z/1 .

9 citations
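For readers unfamiliar with the CNN-RNN-CTC combination described above, the following is a minimal PyTorch sketch of a segmentation-free CRNN trained with CTC loss; layer sizes, image shapes, and the vocabulary size are illustrative assumptions, not the authors' network.

```python
# Illustrative segmentation-free CRNN with CTC loss (not the authors' exact network).
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # CNN: collapses image height so the width dimension becomes the time axis
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),          # -> (B, 256, 1, W')
        )
        self.rnn = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes)         # num_classes includes the CTC blank

    def forward(self, x):                             # x: (B, 1, H, W)
        features = self.cnn(x).squeeze(2)             # (B, 256, W')
        features = features.permute(0, 2, 1)          # (B, W', 256)
        out, _ = self.rnn(features)                   # (B, W', 256)
        return self.fc(out)                           # (B, W', num_classes)

num_classes = 60                                      # assumed Urdu character set + blank
model = CRNN(num_classes)
images = torch.randn(4, 1, 64, 256)                   # dummy batch of cropped word images
logits = model(images)
log_probs = logits.log_softmax(2).permute(1, 0, 2)    # (T, B, C) as expected by CTCLoss
targets = torch.randint(1, num_classes, (4, 10))      # dummy label sequences (0 = blank)
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((4,), logits.size(1), dtype=torch.long),
    target_lengths=torch.full((4,), 10, dtype=torch.long),
)
print(loss.item())
```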


Journal ArticleDOI
TL;DR: In this article, the characteristics and inherent ambiguities of Bengali handwritten digits, along with a comprehensive insight into two decades of state-of-the-art datasets and approaches towards offline BHDR, have been analyzed.
Abstract: Handwritten Digit Recognition (HDR) is one of the most challenging tasks in the domain of Optical Character Recognition (OCR). Irrespective of language, there are some inherent challenges of HDR, which mostly arise due to the variations in writing styles across individuals, writing medium and environment, inability to maintain the same strokes while writing any digit repeatedly, etc. In addition to that, the structural complexities of the digits of a particular language may lead to ambiguous scenarios of HDR. Over the years, researchers have developed numerous offline and online HDR pipelines, where different image processing techniques are combined with traditional Machine Learning (ML)-based and/or Deep Learning (DL)-based architectures. Although evidence of extensive review studies on HDR exists in the literature for languages, such as English, Arabic, Indian, Farsi, Chinese, etc., few surveys on Bengali HDR (BHDR) can be found, which lack a comprehensive analysis of the challenges, the underlying recognition process, and possible future directions. In this paper, the characteristics and inherent ambiguities of Bengali handwritten digits along with a comprehensive insight of two decades of the state-of-the-art datasets and approaches towards offline BHDR have been analyzed. Furthermore, several real-life application-specific studies, which involve BHDR, have also been discussed in detail. This paper will also serve as a compendium for researchers interested in the science behind offline BHDR, instigating the exploration of newer avenues of relevant research that may further lead to better offline recognition of Bengali handwritten digits in different application areas.

8 citations


Posted Content
TL;DR: The authors proposed a new approach for paragraph identification by spatial graph convolutional neural networks (GCN) applied on OCR text boxes, where two steps, namely line splitting and line clustering, are performed to extract paragraphs from the lines in OCR results.
Abstract: Paragraphs are an important class of document entities. We propose a new approach for paragraph identification by spatial graph convolutional neural networks (GCN) applied on OCR text boxes. Two steps, namely line splitting and line clustering, are performed to extract paragraphs from the lines in OCR results. Each step uses a beta-skeleton graph constructed from bounding boxes, where the graph edges provide efficient support for graph convolution operations. With only pure layout input features, the GCN model size is 3~4 orders of magnitude smaller compared to R-CNN based models, while achieving comparable or better accuracies on PubLayNet and other datasets. Furthermore, the GCN models show good generalization from synthetic training data to real-world images, and good adaptivity for variable document styles.

Book ChapterDOI
TL;DR: Zhang et al. as mentioned in this paper presented a weakly supervised pre-training method, oCLIP, which can acquire effective scene text representations by jointly learning and aligning visual and textual information.
Abstract: Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which intuitively helps in Optical Character Recognition (OCR) tasks due to the rich visual and textual information in scene text images. However, these methods cannot cope well with OCR tasks because of the difficulty in both instance-level text encoding and image-text pair acquisition (i.e. images and captured texts in them). This paper presents a weakly supervised pre-training method, oCLIP, which can acquire effective scene text representations by jointly learning and aligning visual and textual information. Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features, respectively, as well as a visual-textual decoder that models the interaction among textual and visual features for learning effective scene text representations. With the learning of textual features, the pre-trained model can attend to texts in images well with character awareness. Besides, these designs enable learning from weakly annotated texts (i.e. partial texts in images without text bounding boxes), which mitigates the data annotation constraint greatly. Experiments over the weakly annotated images in ICDAR2019-LSVT show that our pre-trained model improves F-score by +2.5% and +4.8% while transferring its weights to other text detection and spotting networks, respectively. In addition, the proposed method outperforms existing pre-training techniques consistently across multiple public datasets (e.g., +3.2% and +1.3% for Total-Text and CTW1500). Keywords: Vision-language pre-training, Scene text detection, Scene text spotting

Journal ArticleDOI
TL;DR: This article explored the robustness of two different language-independent event detection models to OCR noise, over two datasets that cover different event types and multiple languages, and concluded that the imbalance of the datasets, the richness of the different annotation styles, and the language characteristics are the most important factors that can influence event detection in digitised documents.
Abstract: Event detection is a crucial task for natural language processing and it involves the identification of instances of specified types of events in text and their classification into event types. The detection of events from digitised documents could enable historians to gather and combine a large amount of information into an integrated whole, a panoramic interpretation of the past. However, the level of degradation of digitised documents and the quality of the optical character recognition (OCR) tools might hinder the performance of an event detection system. While several studies have been performed in detecting events from historical documents, the transcribed documents needed to be hand-validated which implied a great effort of human expertise and manual labour-intensive work. Thus, in this study, we explore the robustness of two different event detection language-independent models to OCR noise, over two datasets that cover different event types and multiple languages. We aim at analysing their ability to mitigate problems caused by the low quality of the digitised documents and we simulate the existence of transcribed data, synthesised from clean annotated text, by injecting synthetic noise. For creating the noisy synthetic data, we chose to utilise four main types of noise that commonly occur after the digitisation process: Character Degradation, Bleed Through, Blur, and Phantom Character. Finally, we conclude that the imbalance of the datasets, the richness of the different annotation styles, and the language characteristics are the most important factors that can influence event detection in digitised documents.
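The paper injects image-level degradations (character degradation, bleed-through, blur, phantom characters) before recognition; as a simpler, purely illustrative stand-in, the sketch below injects character-level noise directly into clean text to simulate OCR output. The confusion table and error rate are assumptions.

```python
# Hypothetical character-level noise injection to simulate OCR errors in clean text.
import random

def inject_ocr_noise(text, error_rate=0.05, seed=0):
    """Randomly substitute, delete, or insert characters to mimic OCR mistakes."""
    rng = random.Random(seed)
    confusions = {"l": "1", "o": "0", "e": "c", "a": "o", "m": "rn"}  # assumed table
    out = []
    for ch in text:
        if rng.random() < error_rate:
            op = rng.choice(["substitute", "delete", "insert"])
            if op == "substitute":
                out.append(confusions.get(ch.lower(), "#"))
            elif op == "insert":
                out.append(ch + rng.choice(".,'~"))
            # "delete": drop the character entirely
        else:
            out.append(ch)
    return "".join(out)

print(inject_ocr_noise("The treaty was signed in Paris in 1898.", error_rate=0.15))
```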

Proceedings ArticleDOI
06 Jun 2022
TL;DR: This paper explores deep learning models and OCR methods to effectively extract textual information from engineering documents collected by the NAVY's military sealift command division, using a deep learning-based optical character recognition (OCR) framework which integrates several modules, including a pre-trained text detection model, a fine-tuned OCR algorithm, and a deep generative model to augment data for the fine-tuning.
Abstract: Digital engineering, the digital transformation of engineering practice, is profoundly changing the traditional engineering practice towards the fast integration of digital technologies and digital models in the engineering processes' life cycles. The traditional engineering process heavily relies on static engineering documents (e.g., spreadsheets, technical drawings, and scanned documents) to store and share information across the engineering process. A critical task in digital engineering is to extract relevant textual information from traditional engineering documents into machine-readable and editable formats. This paper explores deep learning models and OCR methods to effectively extract textual information from engineering documents collected by the NAVY's military sealift command division. We propose a deep learning-based optical character recognition (OCR) framework for this task, which integrates several modules including a pre-trained text detection model, a fine-tuned OCR algorithm, and a deep generative model to augment data for the fine-tuning. Experimental results showed that the fine-tuning method significantly improved word accuracies of OCR models from 60%-70% to 90% and above. Furthermore, the deep adversarial generative approach had proved to be an effective model for data augmentation.
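As a lightweight stand-in for the generative data augmentation described above, the following sketch renders synthetic word images with Pillow and applies mild degradations, the kind of material typically used to fine-tune an OCR model; the font path, word list, and degradation parameters are assumptions.

```python
# Illustrative synthetic word-image generator for OCR fine-tuning (a simple
# stand-in for the deep generative augmentation used in the paper).
import random
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def render_word(word, font_path="DejaVuSans.ttf", size=32, seed=0):
    rng = random.Random(seed)
    font = ImageFont.truetype(font_path, size)           # assumed font file
    canvas = Image.new("L", (10 + size * len(word), size * 2), color=255)
    ImageDraw.Draw(canvas).text((5, size // 2), word, fill=0, font=font)
    # mild degradations so the fine-tuned OCR model sees imperfect samples
    canvas = canvas.rotate(rng.uniform(-2, 2), expand=True, fillcolor=255)
    canvas = canvas.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.0, 1.0)))
    return canvas

for i, word in enumerate(["PUMP", "VALVE", "MANIFOLD"]):  # hypothetical vocabulary
    render_word(word, seed=i).save(f"synthetic_{word}.png")
```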

Journal ArticleDOI
30 Apr 2022
TL;DR: This work collected 3900 distorted Hindi characters and extracted six different types of features from these characters to analyze the recognition accuracy, achieving a maximum recognition accuracy of 91.1%.

Journal ArticleDOI
TL;DR: The basics of Tamil text, previous tasks in Tamil text attention, Tamil letter recognition algorithms, and recognition barriers are covered in this study.
Abstract: The concept of visual cues in a paperless workspace is based on converting scanned images into machine-readable text. It requires the development of a variety of new applications, such as automated postal systems, banks, institutions, word processing, and library systems, to name a few. Artificial Intelligence (AI) is an area of computer science that focuses on making machines smart. Hand typing, script, and print text recognition are key study subjects because no 100% recognition can be achieved even if the scanned image is accurate. Text, numbers, and images can all be used to write text in a variety of ways. The basics of Tamil text, previous tasks in Tamil text attention, Tamil letter recognition algorithms, and recognition barriers are covered in this study.

Journal ArticleDOI
TL;DR: A Deep Convolutional Neural Network has been proposed that learns deep features for offline Gurmukhi handwritten character and numeral recognition (HCNR) and works efficiently for training as well as testing and exhibits a good recognition performance.
Abstract: Over the last few years, several researchers have worked on handwritten character recognition and have proposed various techniques to improve the performance of Indic and non-Indic script recognition. Here, a Deep Convolutional Neural Network has been proposed that learns deep features for offline Gurmukhi handwritten character and numeral recognition (HCNR). The proposed network works efficiently for training as well as testing and exhibits good recognition performance. Two primary datasets comprising offline handwritten Gurmukhi characters and Gurmukhi numerals have been employed in the present work. The testing accuracies achieved using the proposed network are 98.5% for characters and 98.6% for numerals.

Proceedings ArticleDOI
01 Jan 2022
TL;DR: In this paper, spatial graph convolutional networks (GCN) are applied on OCR text boxes to extract paragraphs from the lines in OCR results, where each step uses a β-skeleton graph constructed from bounding boxes.
Abstract: We propose a new approach for paragraph recognition in document images by spatial graph convolutional networks (GCN) applied on OCR text boxes. Two steps, namely line splitting and line clustering, are performed to extract paragraphs from the lines in OCR results. Each step uses a β-skeleton graph constructed from bounding boxes, where the graph edges provide efficient support for graph convolution operations. With pure layout input features, the GCN model size is 3~4 orders of magnitude smaller compared to RCNN based models, while achieving comparable or better accuracies on PubLayNet and other datasets. Furthermore, the GCN models show good generalization from synthetic training data to real-world images, and good adaptivity for variable document styles.
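A small sketch of the graph-construction step this work relies on: for simplicity it builds the Gabriel graph, i.e. the β-skeleton with β = 1, over OCR box centers (an edge is kept only if no other center lies inside the circle whose diameter joins the two endpoints). The box coordinates are invented.

```python
# Illustrative Gabriel graph (β-skeleton, β = 1) over OCR bounding-box centers.
import itertools

boxes = [(10, 10, 60, 24), (70, 10, 130, 24), (10, 40, 90, 54)]  # x0, y0, x1, y1

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

points = [center(b) for b in boxes]

def gabriel_edges(points):
    edges = []
    for i, j in itertools.combinations(range(len(points)), 2):
        (xi, yi), (xj, yj) = points[i], points[j]
        cx, cy = (xi + xj) / 2.0, (yi + yj) / 2.0          # circle center
        r2 = ((xi - xj) ** 2 + (yi - yj) ** 2) / 4.0        # squared radius
        blocked = any(
            (xk - cx) ** 2 + (yk - cy) ** 2 < r2
            for k, (xk, yk) in enumerate(points) if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges

print(gabriel_edges(points))  # edges that would feed the graph convolution layers
```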

Book ChapterDOI
08 Nov 2022
TL;DR: Neuro-OCR as mentioned in this paper is an interactive book reader for blind people based on optical character recognition (OCR), which is made up of a camera-based architecture that aids blind people in reading text on labels, printed notes, and objects.
Abstract: Everyone deserves to live freely, even those who are impaired. In recent decades, technology has focused on empowering disabled people to have as much control over their lives as possible. The braille system, which allows the blind to read, is currently the only effective system available. However, this approach is demanding, and it takes a long time to recognize the text. Our goal is to cut down on the time it takes to read. Our article presents a ground-breaking interactive book reader for blind people based on optical character recognition. Optical character recognition is among the most effective technology applications in artificial intelligence and pattern recognition. A simple content reader needs to be accessible, inexpensive, and easily obtainable by the public. The framework is made up of a camera-based architecture that aids blind people in reading text on labels, printed notes, and objects. Text-to-speech (TTS), OCR, image processing methods, and a synthesis module are all part of our framework. Neuro-OCR incorporates a complete text read-out device suited for the visually handicapped. We used Google Tesseract as the OCR engine and Pico as the TTS engine in our work. The voice output is then sent to the Telegram application, where the user receives it.
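A minimal sketch of the capture-recognize-speak loop described above, using pytesseract for OCR and pyttsx3 for speech output; the paper uses Google Tesseract with Pico TTS and forwards the audio to Telegram, so pyttsx3 and the webcam index here are substitutions for a self-contained example.

```python
# Illustrative capture -> OCR -> speech loop (pyttsx3 substituted for Pico TTS).
import cv2
import pytesseract
import pyttsx3

camera = cv2.VideoCapture(0)          # default webcam (assumed device index)
ok, frame = camera.read()
camera.release()

if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    text = pytesseract.image_to_string(gray)      # recognize printed text
    engine = pyttsx3.init()
    engine.say(text if text.strip() else "No text detected")
    engine.runAndWait()                            # speak the recognized text
```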

Journal ArticleDOI
TL;DR: The results showed that the Google Cloud Vision API works well for the Thai vehicle registration certificate, with an accuracy of 84.43%, whereas the Tesseract OCR showed an accuracy of 47.02%; the proposed conditions facilitate the implementation of a Thai vehicle registration certificate recognition system.
Abstract: Optical character recognition (OCR) is a technology to digitize a paper-based document into digital form. This research studies the extraction of the characters from a Thai vehicle registration certificate via the Google Cloud Vision API and the Tesseract OCR engine. The recognition performance of both OCR APIs is also examined. The 84 color image files comprised three image sizes/resolutions and five image characteristics. For comparison of suitable image types, greyscale and binary images were converted from the color images. Furthermore, three pre-processing techniques, sharpening, contrast adjustment, and brightness adjustment, were also applied to enhance the quality of the images before applying the two OCR APIs. The recognition performance was evaluated in terms of accuracy and readability. The results showed that the Google Cloud Vision API works well for the Thai vehicle registration certificate, with an accuracy of 84.43%, whereas the Tesseract OCR showed an accuracy of 47.02%. The highest accuracy came from the color image at 1024×768 px and 300 dpi, using sharpening and brightness adjustment as pre-processing techniques. In terms of readability, the Google Cloud Vision API produced more readable output than Tesseract. The proposed conditions facilitate the implementation of a Thai vehicle registration certificate recognition system.
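A minimal sketch of running one preprocessed image through both OCR services compared above; the sharpening kernel, brightness offset, file name, and language setting are assumptions, and the Google Cloud Vision call requires credentials to be configured.

```python
# Illustrative comparison of Tesseract and Google Cloud Vision on one image.
import cv2
import numpy as np
import pytesseract
from google.cloud import vision

img = cv2.imread("registration_certificate.jpg")              # hypothetical scan
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])     # sharpening kernel
img = cv2.filter2D(img, -1, sharpen)
img = cv2.convertScaleAbs(img, alpha=1.0, beta=30)            # brightness adjustment

# Tesseract (assumes the Thai language pack is installed)
tesseract_text = pytesseract.image_to_string(img, lang="tha")

# Google Cloud Vision (assumes GOOGLE_APPLICATION_CREDENTIALS is configured)
client = vision.ImageAnnotatorClient()
_, encoded = cv2.imencode(".jpg", img)
response = client.text_detection(image=vision.Image(content=encoded.tobytes()))
vision_text = response.text_annotations[0].description if response.text_annotations else ""

print("Tesseract:", tesseract_text[:200])
print("Cloud Vision:", vision_text[:200])
```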

Journal ArticleDOI
TL;DR: Chen et al. as mentioned in this paper proposed an end-to-end structured multimodal attention (SMA) neural network to solve the problems of poor text reading ability, lack of textual-visual reasoning capacity, and the choice of a discriminative answering mechanism over its generative counterpart.
Abstract: Text-based Visual Question Answering (TextVQA) is a recently raised challenge requiring models to read text in images and answer natural language questions by jointly reasoning over the question, textual information and visual content. The introduction of this new modality, Optical Character Recognition (OCR) tokens, ushers in demanding reasoning requirements. Most state-of-the-art (SoTA) VQA methods fail when answering these questions because of three reasons: (1) poor text reading ability; (2) lack of textual-visual reasoning capacity; and (3) choosing a discriminative answering mechanism over its generative counterpart (although this has been further addressed by M4C). In this paper, we propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above. SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it. Finally, the outputs from the above modules are processed by a global-local attentional answering module to produce an answer, splicing together tokens from both OCR and general vocabulary iteratively by following M4C. Our proposed model outperforms the SoTA models on the TextVQA dataset and two tasks of the ST-VQA dataset among all models except the pre-training based TAP. Demonstrating strong reasoning ability, it also won first place in the TextVQA Challenge 2020. We extensively test different OCR methods on several reasoning models and investigate the impact of gradually increased OCR performance on the TextVQA benchmark. With better OCR results, different models show dramatic improvements in VQA accuracy, but our model benefits most, thanks to its strong textual-visual reasoning ability. To grant our method an upper bound and make a fair testing base available for further works, we also provide human-annotated ground-truth OCR annotations for the TextVQA dataset, which were not given in the original release. The code and ground-truth OCR annotations for the TextVQA dataset are available at https://github.com/ChenyuGAO-CS/SMA.



Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper developed an automatic OCR system designed to identify up to 13,070 large-scale printed Chinese characters by using deep learning neural networks and fine-tuning techniques.
Abstract: In the field of computer vision, large-scale image classification tasks are both important and highly challenging. With the ongoing advances in deep learning and optical character recognition (OCR) technologies, neural networks designed to perform large-scale classification play an essential role in facilitating OCR systems. In this study, we developed an automatic OCR system designed to identify up to 13,070 large-scale printed Chinese characters by using deep learning neural networks and fine-tuning techniques. The proposed framework comprises four components, including training dataset synthesis and background simulation, image preprocessing and data augmentation, the process of training the model, and transfer learning. The training data synthesis procedure is composed of a character font generation step and a background simulation process. Three background models are proposed to simulate the factors of the background noise patterns on ID cards. To expand the diversity of the synthesized training dataset, rotation and zooming data augmentation are applied. A massive dataset comprising more than 19.6 million images was thus created to accommodate the variations in the input images and improve the learning capacity of the CNN model. Subsequently, we modified the GoogLeNet neural architecture by replacing the fully connected layer with a global average pooling layer to avoid overfitting caused by a massive amount of training data. Consequently, the number of model parameters was reduced. Finally, we employed the transfer learning technique to further refine the CNN model using a small number of real data samples. Experimental results show that the overall recognition performance of the proposed approach is significantly better than that of prior methods and thus demonstrate the effectiveness of proposed framework, which exhibited a recognition accuracy as high as 99.39% on the constructed real ID card dataset.
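The head replacement described above (fully connected layer swapped for global average pooling before a 13,070-way softmax) can be sketched in a few lines of tf.keras; the use of InceptionV3 as a stand-in for the modified GoogLeNet, the input size, and weights=None are illustrative assumptions.

```python
# Illustrative head replacement: global average pooling + 13,070-way softmax.
import tensorflow as tf

num_classes = 13070
base = tf.keras.applications.InceptionV3(include_top=False,
                                         weights=None,        # trained from synthetic data
                                         input_shape=(96, 96, 3))
x = tf.keras.layers.GlobalAveragePooling2D()(base.output)     # replaces the FC layer
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
model = tf.keras.Model(base.input, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
print(model.count_params())
```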

Journal ArticleDOI
TL;DR: Li et al. as discussed by the authors proposed PageNet for weakly supervised page-level HCTR, which detects and recognizes characters and predicts the reading order between them, which is more robust and flexible when dealing with complex layouts including multi-directional and curved text lines.
Abstract: Handwritten Chinese text recognition (HCTR) has been an active research topic for decades. However, most previous studies solely focus on the recognition of cropped text line images, ignoring the error caused by text line detection in real-world applications. Although some approaches aimed at page-level text recognition have been proposed in recent years, they either are limited to simple layouts or require very detailed annotations including expensive line-level and even character-level bounding boxes. To this end, we propose PageNet for end-to-end weakly supervised page-level HCTR. PageNet detects and recognizes characters and predicts the reading order between them, which is more robust and flexible when dealing with complex layouts including multi-directional and curved text lines. Utilizing the proposed weakly supervised learning framework, PageNet requires only transcripts to be annotated for real data; however, it can still output detection and recognition results at both the character and line levels, avoiding the labor and cost of labeling bounding boxes of characters and text lines. Extensive experiments conducted on five datasets demonstrate the superiority of PageNet over existing weakly supervised and fully supervised page-level methods. These experimental results may spark further research beyond the realms of existing methods based on connectionist temporal classification or attention. The source code is available at https://github.com/shannanyinxiang/PageNet .

Journal ArticleDOI
TL;DR: A flexible and user-friendly solution for determining the medication quantity given to patients, using augmented reality and optical character recognition algorithm capabilities, is proposed.
Abstract: The trend of personalized medicine and the increasing flexibility of drug dosage, both relevant goals of the 21st century, represent the foundation for the current research. To obtain doses smaller than the smallest available, physicians frequently write prescriptions for children and adults without preserving the integrity of the pill. Moreover, patients purchase large amounts of medication for cost-saving reasons. To support the correct administration of the remedies and partial alignment with the personalized treatment trend, this paper proposes a flexible and user-friendly solution for determining the medication quantity given to patients, using augmented reality and optical character recognition algorithm capabilities. Via the MATLAB development environment and a Logitech HD Pro C920 webcam, the results were 80% correct in identifying the cutting position of the pill, by means of the Hough transform, and 30% correct in weight recognition using an optical character recognition (OCR) algorithm. In future work, a higher-resolution camera and a more powerful computer can be used to increase the percentages mentioned above. In addition, a 3D scan of the pill fragmentation, combined with the weight recognition functionality, could increase the accuracy of the splitting procedure conducted by the patient or the patient's caretaker.
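A minimal OpenCV sketch of locating a pill's score line with the Hough transform, in the spirit of the cutting-position step above (the paper works in MATLAB); the file name, Canny thresholds, and Hough parameters are assumptions.

```python
# Illustrative score-line detection on a pill image via the probabilistic Hough transform.
import cv2
import numpy as np

img = cv2.imread("pill.jpg")                       # hypothetical pill photo
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)                   # edge map feeding the Hough transform
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=40,
                        minLineLength=30, maxLineGap=5)

if lines is not None:
    # take the longest detected segment as the candidate cutting position
    longest = max(lines[:, 0, :], key=lambda l: np.hypot(l[2] - l[0], l[3] - l[1]))
    x1, y1, x2, y2 = longest
    print(f"Suggested cut along ({x1},{y1}) -> ({x2},{y2})")
else:
    print("No score line found")
```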

Proceedings ArticleDOI
29 Mar 2022
TL;DR: In this paper, computer vision technology is extrapolated onto the system to enhance the text inside the digitized image, which can be helpful in divination based on knowledge engineering and qualitative analysis in the near future.
Abstract: Optical Character Recognition (OCR) is a predominant approach for transmuting scanned images and other visuals into text. Computer vision technology is extrapolated onto the system to enhance the text inside the digitized image. This preliminary provisional setup holds the invoice's information and converts it into JSON and CSV formats. This model can be helpful in divination based on knowledge engineering and qualitative analysis in the near future. The existing system contains data extraction and nothing more. Image pre-processing techniques such as black-and-white conversion, inversion, noise removal, grayscaling, font thickening, and Canny edge detection are applied to improve the quality of the picture. With the enhanced image, further OpenCV procedures are carried out. In the next step, three different OCRs are used: Keras OCR, Easy OCR, and Tesseract OCR, of which Tesseract OCR gives the most precise results. After the initial steps, the undesirable symbols (\t, \n) are removed to obtain the cleaned text as output. Eventually, a unique work that is highly accurate in producing JSON and CSV formats is developed. Impact statement: In our project, a front-end Android app is developed which takes input from the user and stores the output in a database. The JSON and CSV files can be viewed through the app by the end user.
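A minimal sketch of the preprocess-recognize-export flow described above, using OpenCV, Tesseract, and the standard json/csv modules; the naive line-based "field extraction" and file names are assumptions rather than the paper's full pipeline.

```python
# Illustrative invoice OCR: binarize, recognize with Tesseract, export JSON and CSV.
import csv
import json
import cv2
import pytesseract

img = cv2.imread("invoice.jpg")                                # hypothetical invoice scan
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # black & white
text = pytesseract.image_to_string(bw)
lines = [ln.strip() for ln in text.splitlines() if ln.strip()]  # drop \n, \t debris

with open("invoice.json", "w", encoding="utf-8") as f:
    json.dump({"lines": lines}, f, ensure_ascii=False, indent=2)

with open("invoice.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["line_no", "text"])
    writer.writerows(enumerate(lines, start=1))
```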

Journal ArticleDOI
TL;DR: A deep analysis of OCR errors that impact the performance of NER and NEL is provided, and subsequent recommendations on the adequate documents, the OCR quality levels, and the post-OCR correction strategies required to perform reliable NER and NEL are presented.
Abstract: Named entities (NEs) are among the most relevant type of information that can be used to properly index digital documents and thus easily retrieve them. It has long been observed that NEs are key to accessing the contents of digital library portals as they are contained in most user queries. However, most digitized documents are indexed through their optical character recognition (OCRed) version which include numerous errors. Although OCR engines have considerably improved over the last few years, OCR errors still considerably impact document access. Previous works were conducted to evaluate the impact of OCR errors on named entity recognition (NER) and named entity linking (NEL) techniques separately. In this article, we experimented with a variety of OCRed documents with different levels and types of OCR noise to assess in depth the impact of OCR on named entity processing. We provide a deep analysis of OCR errors that impact the performance of NER and NEL. We then present the resulting exhaustive study and subsequent recommendations on the adequate documents, the OCR quality levels, and the post-OCR correction strategies required to perform reliable NER and NEL.
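A small sketch of the kind of comparison the study performs: running the same sentence through an NER model in clean and OCR-degraded form; spaCy's en_core_web_sm model and the example sentences are assumptions, as the paper does not prescribe a specific toolkit.

```python
# Illustrative check of how OCR noise degrades named entity recognition.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes: python -m spacy download en_core_web_sm

clean = "Charles Dickens visited Boston in January 1842."
noisy = "Charles Dickcns visited Bost0n in Januarv 1842."   # simulated OCR errors

for label, text in [("clean", clean), ("noisy", noisy)]:
    doc = nlp(text)
    print(label, [(ent.text, ent.label_) for ent in doc.ents])
```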

Journal ArticleDOI
TL;DR: In this article, a segmentation-based, omnifont, open-vocabulary OCR for printed Arabic text was proposed, which uses an explicit, indirect character segmentation method.

Journal ArticleDOI
TL;DR: In this paper, a deep neural network-based self-supervised pre-training technique is used to predict hidden (masked) sections of text to fill in the gaps of non-transcribable parts of the documents being processed.
Abstract: According to a recent Deloitte study, the COVID-19 pandemic continues to place a huge strain on the global health care sector. Covid-19 has also catalysed digital transformation across the sector for improving operational efficiencies. As a result, the amount of digitally stored patient data such as discharge letters, scan images, test results or free text entries by doctors has grown significantly. In 2020, 2314 exabytes of medical data was generated globally. This medical data does not conform to a generic structure and is mostly in the form of unstructured digitally generated or scanned paper documents stored as part of a patient’s medical reports. This unstructured data is digitised using Optical Character Recognition (OCR) process. A key challenge here is that the accuracy of the OCR process varies due to the inability of current OCR engines to correctly transcribe scanned or handwritten documents in which text may be skewed, obscured or illegible. This is compounded by the fact that processed text is comprised of specific medical terminologies that do not necessarily form part of general language lexicons. The proposed work uses a deep neural network based self-supervised pre-training technique: Robustly Optimized Bidirectional Encoder Representations from Transformers (RoBERTa) that can learn to predict hidden (masked) sections of text to fill in the gaps of non-transcribable parts of the documents being processed. Evaluating the proposed method on domain-specific datasets which include real medical documents, shows a significantly reduced word error rate demonstrating the effectiveness of the approach.
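A minimal sketch of using a masked language model to propose text for a non-transcribable span, in the spirit of the approach above; the generic roberta-base checkpoint and the example sentence are assumptions, whereas the paper fine-tunes RoBERTa on domain-specific medical text.

```python
# Illustrative masked-token prediction to fill a gap left by a failed OCR region.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")   # generic checkpoint, not the paper's
sentence = "The patient was discharged with a prescription for <mask> 500 mg."
for candidate in fill(sentence, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```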