Bio: Agam Dwivedi is an academic researcher from the International Institute of Information Technology, Hyderabad. The author has contributed to research in topics: Word error rate & Optical character recognition. The author has an h-index of 1, having co-authored 1 publication receiving 5 citations.
01 Jun 2020
TL;DR: A Sanskrit-specific OCR system for printed classical Indic documents written in Sanskrit is developed, along with an attention-based LSTM model for reading Sanskrit characters in line images, setting the stage for applying OCR to large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.
Abstract: OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. It involves complexities such as image degradation, a lack of datasets, and long words. Due to these challenges, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at the line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for the application of OCR to large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.
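The word and character error rates reported above are conventionally computed as the Levenshtein edit distance between the OCR output and the ground truth, normalized by the reference length. A minimal sketch of that standard computation (the function names are illustrative, not from the paper's codebase):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)
```

For long, highly conjoined Sanskrit words, a single character mistake makes the whole word wrong, which is why WER (15.97%) is much higher than CER (3.71%).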
05 Sep 2021
TL;DR: In this article, the authors present a new dataset for identity document (ID) recognition called MIDV-LAIT, which includes textual fields in Perso-Arabic, Thai, and Indian scripts.
Abstract: In this paper, we present a new dataset for identity document (ID) recognition called MIDV-LAIT. The main feature of the dataset is its textual fields in Perso-Arabic, Thai, and Indian scripts. Since open datasets with real IDs cannot be published, we synthetically generated all the images and data. Even the faces are generated and do not belong to any particular person. Some datasets have recently appeared for evaluating ID detection, type identification, and recognition, but they cover only Latin-based and Cyrillic-based languages. The proposed dataset aims to fix this issue and make it easier to evaluate and compare various methods. As a baseline, we process all the textual field images in MIDV-LAIT with Tesseract OCR. The resulting recognition accuracy shows that the dataset is challenging and is of use for further research.
05 Sep 2021
TL;DR: In this article, the authors compare features such as the size (width and height) of word images and word length statistics across English and non-Latin datasets, and discover that the number of fonts and the number of generated word images are critical factors for scene-text recognition systems.
Abstract: Scene-text recognition is remarkably better in Latin languages than in non-Latin languages due to several factors such as multiple fonts, simpler vocabulary statistics, updated data generation tools, and writing systems. This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages. We compare various features such as the size (width and height) of word images and word length statistics. Over the last decade, generating synthetic datasets with powerful deep learning techniques has tremendously improved scene-text recognition. Several controlled experiments are performed on English by varying the number of (i) fonts used to create the synthetic data and (ii) generated word images. We discover that these factors are critical for scene-text recognition systems. The English synthetic datasets utilize over 1400 fonts, while Arabic and other non-Latin datasets utilize fewer than 100 fonts for data generation. Since some of these languages are used across different regions, we garner additional fonts through a region-based search to improve the scene-text recognition models for Arabic and Devanagari. We improve the Word Recognition Rates (WRRs) on the Arabic MLT-17 and MLT-19 datasets by 24.54% and 2.32% compared to previous works or baselines. We achieve WRR gains of 7.88% and 3.72% for the IIIT-ILST and MLT-19 Devanagari datasets.
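Unlike the edit-distance-based error rates used for document OCR, the Word Recognition Rate (WRR) reported for scene-text benchmarks is typically the fraction of cropped word images whose prediction matches the ground truth exactly. A minimal sketch of that convention (exact matching rules, e.g. case sensitivity, vary per benchmark):

```python
def word_recognition_rate(references, hypotheses):
    """Fraction of word images whose predicted transcription matches
    the ground-truth word exactly (case-sensitive here; benchmarks
    may normalize case or punctuation before comparing)."""
    if len(references) != len(hypotheses):
        raise ValueError("reference and hypothesis lists must align")
    correct = sum(r == h for r, h in zip(references, hypotheses))
    return correct / len(references)
```

Under this metric, a WRR gain of 24.54% means roughly one in four additional word images is now read perfectly end to end.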
01 Jan 2022