Bio: Dipankar Ganguly is an academic researcher. The author has contributed to research on topics including encoders and artificial intelligence. The author has an h-index of 1, having co-authored 1 publication that has received 1 citation.
01 Nov 2017
TL;DR: In this paper, a machine learning-based scheme for converting classifier symbols to Unicode was proposed, which outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.
Abstract: Classical OCRs based on isolated character (symbol) recognition have been the fundamental way of generating textual representations, particularly for Indian scripts, until transcription-based approaches gained momentum. Though the former approaches have been criticized as prone to failure, their accuracy has nevertheless been fairly decent in comparison with the newer transcription-based approaches. Analysis of isolated character recognition OCRs for Hindi and Bangla revealed that most errors were generated in converting the output of the classifier to valid Unicode sequences, i.e., in script grammar generation. Linguistic rules for generating scripts are inadequately integrated, resulting in a rigid Unicode generation scheme that is cumbersome to understand and error-prone when adapted to new Indian scripts. In this paper we propose a machine learning-based classifier-symbols-to-Unicode generation scheme which outperforms the existing generation scheme and improves accuracy for Devanagari and Bangla scripts.
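The script-grammar problem the abstract describes can be illustrated with a toy hand-coded rule (not the paper's learned model): in Devanagari, the vowel sign i (U+093F) is printed visually before its consonant, so a classifier scanning a line emits it first, but valid Unicode encodes it after the consonant. A minimal sketch of one such reordering rule:

```python
# Toy illustration of classifier-symbols-to-Unicode generation for Devanagari.
# A symbol classifier emits glyphs in visual order; the vowel sign i (U+093F)
# appears visually BEFORE its consonant but must be encoded AFTER it. This is
# one of the script-grammar rules that the paper proposes learning rather
# than hand-coding per script.

I_MATRA = "\u093f"  # Devanagari vowel sign i

def symbols_to_unicode(symbols):
    """Reorder visual-order classifier symbols into logical (Unicode) order."""
    out = []
    pending_i = False
    for s in symbols:
        if s == I_MATRA:
            pending_i = True          # hold the matra until its consonant arrives
        else:
            out.append(s)
            if pending_i:             # attach the held matra after the consonant
                out.append(I_MATRA)
                pending_i = False
    return "".join(out)

# Visual order [ि, क, त, ा, ब] becomes the valid sequence "किताब" ("book").
print(symbols_to_unicode([I_MATRA, "\u0915", "\u0924", "\u093e", "\u092c"]))
```

A full generator needs many such rules per script (conjuncts, reph placement, nukta handling), which is why the hand-coded version becomes rigid and why the paper replaces it with a learned mapping.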
01 Jan 2022
TL;DR: In this paper, a Transformer-based recognition engine fused with a Bidirectional Encoder Representations from Transformers (BERT) language model is proposed for text recognition from degraded historical document images of books.
Abstract: Text recognition from degraded historical document images of books presents several challenges for the vision and document analytics community. The accuracy of existing recognition engines drops sharply with poor image quality as well as the use of decades-old vocabulary and fonts. To overcome these challenges, we introduce a Transformer-based recognition engine fused with a Bidirectional Encoder Representations from Transformers (BERT) language model, preceded by a deep back projection network (DBPN) along with cascaded segmentation carried out using the U-net framework. In essence, an end-to-end segmentation framework is cascaded with an end-to-end Transformer network for performing recognition. We have intensively tested our framework on Odia literature documents provided by Odia Virtual Academy. Significant improvements in results are observed, and the proposed methodology empirically outperformed the state of the art for Odia script.
01 Jun 2020
TL;DR: A Sanskrit-specific OCR system for printed classical Indic documents is developed, and an attention-based LSTM model for reading Sanskrit characters in line images is presented, setting the stage for applying OCR to large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.
Abstract: OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. It involves complexities such as image degradation, lack of datasets, and long words. Due to these challenges, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for application of OCRs on large corpora of classical Sanskrit texts containing arbitrarily long and highly conjoined words.
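The character and word error rates reported above are standard OCR metrics; a minimal sketch of how they are computed (not the paper's evaluation code) is Levenshtein edit distance normalized by reference length, with WER applying the same computation over whitespace-split tokens:

```python
# Minimal sketch of the CER/WER metrics reported in the abstract:
# edit distance between reference and hypothesis, divided by reference length.

def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance over sequences."""
    dp = list(range(len(hyp) + 1))        # dp[j] = distance for empty ref prefix
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i            # prev holds dp[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (r != h))   # substitution (0 if match)
    return dp[-1]

def cer(ref, hyp):
    """Character error rate: edits per reference character."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate: edits per reference word."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

print(cer("abcd", "abxd"))   # one substitution over four characters
```

For highly conjoined Sanskrit words, a single character error can invalidate an entire long word, which is why WER (15.97%) is much higher than CER (3.71%).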