Journal ArticleDOI

Benchmark databases of handwritten Bangla-Roman and Devanagari-Roman mixed-script document images

01 Apr 2018-Multimedia Tools and Applications (Springer US)-Vol. 77, Iss: 7, pp 8441-8473
TL;DR: This paper addresses three key challenges: collection, compilation and organization of benchmark databases of images of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document pages; script-level annotation of the words on those pages; and development of bi-script and tri-script word-level script identification modules using a Modified log-Gabor filter as feature extractor.
Abstract: Handwritten document image datasets are among the basic necessities for conducting research on developing Optical Character Recognition (OCR) systems. In a multilingual country like India, handwritten documents often contain more than one script, leading to complex pattern analysis problems. In this paper, we highlight two such situations where Devanagari and Bangla scripts, the two most widely used scripts in the Indian sub-continent, are individually used along with Roman script in documents. We address three key challenges here: 1) collection, compilation and organization of benchmark databases of images of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document pages respectively, 2) script-level annotation of 18931 Bangla words, 15528 Devanagari words and 10331 Roman words in those 300 document pages, and 3) development of a bi-script and tri-script word-level script identification module using a Modified log-Gabor filter as feature extractor. The technique is statistically validated using multiple classifiers, and the Multi-Layer Perceptron (MLP) classifier is found to perform the best. Average word-level script identification accuracies of 92.32%, 95.30% and 93.78% are achieved using 3-fold cross validation for the Bangla-Roman, Devanagari-Roman and Bangla-Devanagari-Roman databases respectively. Both mixed-script document databases, along with the script-level annotations and 44790 extracted word images of the three aforementioned scripts, are freely available at https://code.google.com/p/cmaterdb/ .
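The paper's "Modified log-Gabor" variant is not specified on this page; a minimal sketch of conventional log-Gabor filtering in the frequency domain, taking the mean and standard deviation of the response magnitude per (scale, orientation) as word-level features for a downstream classifier such as an MLP, might look like this (all parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gabor_features(word_img, scales=(0.1, 0.25), orientations=4, sigma_ratio=0.55):
    """Sketch of log-Gabor feature extraction for a single word image.

    Builds a radial log-Gabor profile times a Gaussian angular profile in the
    frequency domain, filters the image, and returns mean/std of the response
    magnitude for each (scale, orientation) pair.
    """
    h, w = word_img.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.hypot(fx, fy)
    radius[0, 0] = 1.0                      # avoid log(0) at the DC component
    theta = np.arctan2(fy, fx)
    spectrum = np.fft.fft2(word_img.astype(float))
    feats = []
    for f0 in scales:
        # radial log-Gabor profile centered on frequency f0
        radial = np.exp(-(np.log(radius / f0) ** 2) / (2 * np.log(sigma_ratio) ** 2))
        radial[0, 0] = 0.0                  # zero the DC response
        for o in range(orientations):
            angle = o * np.pi / orientations
            # wrapped angular distance from the filter orientation
            d_theta = np.arctan2(np.sin(theta - angle), np.cos(theta - angle))
            angular = np.exp(-(d_theta ** 2) / (2 * (np.pi / orientations) ** 2))
            mag = np.abs(np.fft.ifft2(spectrum * radial * angular))
            feats.extend([mag.mean(), mag.std()])
    return np.array(feats)

word = rng.random((32, 96))                 # stand-in for a binarized word image
feats = log_gabor_features(word)            # 2 scales x 4 orientations x 2 stats
```

With two scales and four orientations this yields a 16-dimensional descriptor per word; the real pipeline would extract these from segmented word images of all three scripts.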
Citations
Journal ArticleDOI
TL;DR: A novel feature selection (FS) model called cooperative genetic algorithm (CGA) is proposed to select some of the most important and discriminating features from the entire feature set to improve the classification accuracy as well as the time requirement of the activity recognition mechanism.
Abstract: Recognition of human actions from visual content is a budding field of computer vision and image understanding. The problem with such a recognition system is the huge dimensionality of the feature vectors, many of whose features are irrelevant to the classification mechanism. For this reason, in this paper we propose a novel feature selection (FS) model called cooperative genetic algorithm (CGA) to select some of the most important and discriminating features from the entire feature set, improving both the classification accuracy and the time requirement of the activity recognition mechanism. In CGA, we embed concepts from cooperative game theory in the GA to create a two-way reinforcement mechanism that improves the solutions of the FS model. The proposed FS model is tested on four benchmark video datasets, namely Weizmann, KTH, UCF11 and HMDB51, and on two sensor-based UCI HAR datasets. The experiments are conducted using four state-of-the-art feature descriptors, namely HOG, GLCM, SURF, and GIST. A significant improvement in overall classification accuracy is found while considering only a very small fraction of the original feature vector.
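The cooperative game-theoretic payoff sharing that distinguishes CGA is not detailed on this page; a plain genetic-algorithm wrapper for feature selection over binary masks, using a nearest-centroid surrogate fitness with a per-feature penalty, can be sketched as follows (all parameters and the fitness surrogate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, X, y, penalty=0.01):
    """Wrapper fitness: nearest-centroid accuracy on the selected features,
    minus a small penalty per selected feature (illustrative surrogate)."""
    if mask.sum() == 0:
        return 0.0
    Xs = X[:, mask]
    c0, c1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    pred = np.linalg.norm(Xs - c1, axis=1) < np.linalg.norm(Xs - c0, axis=1)
    return (pred == y).mean() - penalty * mask.sum()

def ga_feature_select(X, y, pop=20, gens=30, p_mut=0.05):
    n = X.shape[1]
    P = rng.random((pop, n)) < 0.5                       # random binary masks
    for _ in range(gens):
        f = np.array([fitness(ind, X, y) for ind in P])
        P = P[np.argsort(f)[::-1]]                        # elitist sort by fitness
        children = []
        while len(children) < pop // 2:
            a, b = P[rng.integers(0, pop // 2, 2)]        # parents from top half
            cut = rng.integers(1, n)
            child = np.concatenate([a[:cut], b[cut:]])    # one-point crossover
            child ^= rng.random(n) < p_mut                # bit-flip mutation
            children.append(child)
        P = np.vstack([P[: pop - len(children)], children])
    f = np.array([fitness(ind, X, y) for ind in P])
    return P[np.argmax(f)]

# synthetic data: only the first 3 of 15 features carry the label signal
X = rng.normal(size=(120, 15))
y = (X[:, :3].sum(axis=1) > 0).astype(int)
mask = ga_feature_select(X, y)
```

The returned boolean mask selects the feature subset; in CGA, the selection pressure would additionally be shaped by cooperative-game payoffs among features rather than this simple penalty term.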

42 citations

Journal ArticleDOI
TL;DR: A comprehensive survey of the techniques developed for handwritten Indic script identification is presented, including classifiers used for script identification, and their merits and demerits are discussed.
Abstract: Script identification is crucial for automating optical character recognition (OCR) in multi-script documents since OCRs are script-dependent. In this paper, we present a comprehensive survey of th...

26 citations

Journal ArticleDOI
TL;DR: In this paper, an attempt is made to analyze and classify various script identification schemes for document images; a comparison is made between these schemes, and their merits and demerits are discussed on a common platform.
Abstract: Script identification is a widely accepted technique for selecting the appropriate script-specific OCR (Optical Character Recognition) engine in multilingual document images. Extensive research has been done in this field, but it still suffers from low identification accuracy, owing to faded document images and variations in illumination and position during scanning. Noise is also a major obstacle in the script identification process; it can be minimized up to a level, but cannot be removed completely. In this paper, an attempt is made to analyze and classify various script identification schemes for document images. A comparison is also made between these schemes, and their merits and demerits are discussed on a common platform. This will help researchers understand the complexity of the issue and identify possible directions for research in this field.

21 citations


Cites methods from "Benchmark databases of handwritten ..."

  • ...[107] used modified Gabor filter-based features for classification of Bangla, Devanagari and Roman scripts....


Journal ArticleDOI
TL;DR: A novel benchmark is established, delivering state-of-the-art results on two regional handwritten character recognition tasks, and the mathematical rationale for using non-linearity in the deep learning (DL) model is examined.
Abstract: Recognition of handwritten characters in two Indic scripts, Bangla and Meitei Mayek, is a challenging task due to intricate patterns and the scarcity of standard datasets. The Convolutional Neural Network (CNN) is one of the most stable and well-known techniques for classifying objects across distinct specialties, as it has an extraordinary capability for discovering complex patterns. In this paper, we design a unique CNN architecture from scratch, which has manifold advantages over classical machine learning (ML) approaches and the unique ability to consolidate feature extraction and classification altogether. Further, we extend our work to uncover the mathematical rationale for using non-linearity in the deep learning (DL) model. Our proposed CNN architecture consists of four layer types, namely the convolutional layer (CL), non-linear activation layer (AL), pooling layer (PL), and fully connected layer (FCL), and is evaluated on the two existing accessible Bangla datasets, CMATERdb and the ISI Bangla dataset. The same model is also validated on a proposed Manipuri character dataset, called "Mayek27". Moreover, we perform an in-depth comparison of different batch sizes and optimization techniques over all the datasets to understand their behavior. The model delivers state-of-the-art results on the two regional handwritten character recognition tasks.
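The four layer types named in the abstract (CL, AL, PL, FCL) compose a standard CNN forward pass; a dependency-free NumPy sketch of one such pass, with illustrative shapes (a 28x28 input, four 3x3 kernels, 10 output classes) standing in for the paper's unspecified architecture, could look like this:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv2d(x, k):
    """Valid cross-correlation of a single-channel image with one kernel (CL)."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling (PL)."""
    h, w = x.shape
    return x[:h - h % s, :w - w % s].reshape(h // s, s, w // s, s).max(axis=(1, 3))

def forward(img, kernels, W, b):
    maps = [np.maximum(conv2d(img, k), 0) for k in kernels]   # CL + ReLU (AL)
    pooled = [max_pool(m) for m in maps]                      # PL
    flat = np.concatenate([p.ravel() for p in pooled])        # flatten
    logits = flat @ W + b                                     # FCL
    e = np.exp(logits - logits.max())
    return e / e.sum()                                        # softmax over classes

img = rng.random((28, 28))                    # stand-in for a character image
kernels = rng.normal(size=(4, 3, 3))          # four hypothetical 3x3 filters
W = rng.normal(size=(4 * 13 * 13, 10)) * 0.01 # 26x26 maps pooled to 13x13
b = np.zeros(10)
probs = forward(img, kernels, W, b)
```

The ReLU non-linearity in the activation layer is what the abstract's mathematical rationale concerns: without it, stacked convolutions and the fully connected layer would collapse into a single linear map.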

20 citations

Journal ArticleDOI
TL;DR: A novel framework based on improved particle swarm optimization (PSO) algorithm to automatically construct optimal convolutional neural network (CNN) architecture has been proposed with an aim to outperform the existing techniques.
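How the framework encodes CNN architectures into particles is not given here; a minimal particle swarm optimization loop over a continuous search space, minimizing a toy objective as a stand-in for the (expensive) validation-error fitness of a candidate CNN, can be sketched as follows (all hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def pso_minimize(objective, dim, n_particles=15, iters=60,
                 w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0):
    """Plain global-best PSO: each particle tracks its personal best, and the
    swarm tracks a global best that attracts all particles."""
    X = rng.uniform(lo, hi, (n_particles, dim))     # positions
    V = np.zeros((n_particles, dim))                # velocities
    pbest = X.copy()
    pbest_val = np.array([objective(x) for x in X])
    g = pbest[np.argmin(pbest_val)].copy()          # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (g - X)
        X = np.clip(X + V, lo, hi)
        vals = np.array([objective(x) for x in X])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = X[improved], vals[improved]
        g = pbest[np.argmin(pbest_val)].copy()
    return g, pbest_val.min()

# toy objective standing in for a CNN's validation error
sphere = lambda x: float(np.sum(x ** 2))
best_x, best_val = pso_minimize(sphere, dim=3)
```

In architecture search, each particle dimension would instead encode a design choice (e.g. filter counts or kernel sizes), with rounding or decoding applied before training the candidate network.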

19 citations

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the splitting, and the ideas are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 1996, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
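The abstract's key ingredients (bootstrap-sampled trees, random feature selection per split, internal out-of-bag error estimates, variable importance) map directly onto scikit-learn's implementation; a brief sketch on synthetic data (dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic classification problem with 8 informative features out of 20
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",   # random feature selection at each split
    oob_score=True,        # internal (out-of-bag) generalization estimate
    random_state=0,
)
clf.fit(X_tr, y_tr)

oob = clf.oob_score_                         # internal error estimate
top5 = clf.feature_importances_.argsort()[::-1][:5]  # variable importance
```

The out-of-bag score approximates test accuracy without a held-out set, which is exactly the "internal estimates" role described in the abstract.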

79,257 citations

Journal ArticleDOI
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
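scikit-learn's SVC is implemented on top of LIBSVM, so the abstract's themes of parameter selection and probability estimates can be demonstrated through it; a brief sketch with grid-searched C and gamma and Platt-scaled probabilities (dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# probability=True enables LIBSVM's internal probability estimates
model = make_pipeline(StandardScaler(),
                      SVC(kernel="rbf", probability=True, random_state=0))

# parameter selection: grid search over C and gamma with cross-validation
grid = GridSearchCV(model,
                    {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]},
                    cv=3)
grid.fit(X_tr, y_tr)
proba = grid.predict_proba(X_te)   # per-class probability estimates
```

Scaling the inputs before an RBF-kernel SVM is the standard practice the LIBSVM guide itself recommends, since the kernel is sensitive to feature ranges.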

40,826 citations


"Benchmark databases of handwritten ..." refers background in this paper

  • ...SVM: The SVM classifier [7] effectively maps pattern vectors to a high-dimensional feature space where a 'best' separating hyperplane (the maximal margin hyperplane) is constructed....


Proceedings Article
02 Aug 1996
TL;DR: In this paper, a density-based notion of clusters is proposed to discover clusters of arbitrary shape, which can be used for class identification in large spatial databases and is shown to be more efficient than the well-known algorithm CLARANS.
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN, relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
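DBSCAN's behavior described in the abstract, recovering dense clusters while marking sparse points as noise from essentially a neighborhood radius (eps) and a minimum point count, can be shown with scikit-learn in a few lines (the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# two dense groups plus a sprinkling of uniform background noise
X, _ = make_blobs(n_samples=200, centers=[(0, 0), (8, 8)],
                  cluster_std=0.6, random_state=0)
noise = np.random.default_rng(0).uniform(-4, 12, (20, 2))
X = np.vstack([X, noise])

# eps: neighborhood radius; min_samples: density threshold for a core point
labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
```

Unlike k-means, the number of clusters is not an input: it emerges from the density parameters, and isolated noise points are left unassigned rather than forced into a cluster.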

17,056 citations

Journal ArticleDOI
01 Aug 1996
TL;DR: Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy.
Abstract: Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.
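The abstract's claim, that bagging helps when the base predictor is unstable, can be checked directly by comparing a single decision tree against a bagged ensemble of trees on noisy synthetic data (all dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# label noise (flip_y) makes individual deep trees unstable
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

tree = DecisionTreeClassifier(random_state=0)       # unstable base learner
bag = BaggingClassifier(tree, n_estimators=50,      # 50 bootstrap replicates,
                        random_state=0)             # aggregated by plurality vote

tree_acc = cross_val_score(tree, X, y, cv=5).mean()
bag_acc = cross_val_score(bag, X, y, cv=5).mean()
```

Each ensemble member is trained on a bootstrap replicate of the learning set, and classification is by plurality vote, exactly the procedure the abstract describes; for a stable learner (e.g. nearest centroid) the gain would largely vanish.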

16,118 citations

Proceedings Article
01 Jan 1996
TL;DR: DBSCAN, a new clustering algorithm relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape, is presented which requires only one input parameter and supports the user in determining an appropriate value for it.
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN, relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

14,297 citations


"Benchmark databases of handwritten ..." refers methods in this paper

  • ...The feature points generated from Harris corner point detection are passed on to Density-based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [19]....

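The cited pipeline, Harris corner points passed to DBSCAN, can be sketched with a small NumPy Harris detector and scikit-learn's clustering; the k, sigma, threshold, and eps values below are illustrative assumptions, and a synthetic square stands in for a document image:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.cluster import DBSCAN

def harris_corners(img, k=0.05, sigma=1.5, thresh_rel=0.1):
    """Minimal Harris detector: returns (x, y) points whose corner response
    exceeds a fraction of the maximum response."""
    img = img.astype(float)
    Ix = np.gradient(img, axis=1)
    Iy = np.gradient(img, axis=0)
    # smoothed products of gradients (structure tensor entries)
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    # Harris response: det(M) - k * trace(M)^2
    R = Sxx * Syy - Sxy ** 2 - k * (Sxx + Syy) ** 2
    ys, xs = np.where(R > thresh_rel * R.max())
    return np.column_stack([xs, ys])

# synthetic "page": one bright square on a dark background
img = np.zeros((60, 60))
img[20:40, 20:40] = 1.0

pts = harris_corners(img)
# group nearby corner responses into spatial clusters, as in the cited method
grouping = DBSCAN(eps=3.0, min_samples=2).fit_predict(pts)
n_groups = len(set(grouping) - {-1})
```

On a real document, each cluster of corner points would correspond to a dense text region; here the groups gather around the square's vertices.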