
Showing papers on "Devanagari published in 2018"


Journal ArticleDOI
TL;DR: A page-level handwritten document image dataset of 11 official Indic scripts, composed of 1458 document text-pages written by 463 individuals from various parts of India, is presented and the benchmark results for handwritten script identification (HSI) are reported.
Abstract: Without a publicly available dataset, specifically in handwritten document recognition (HDR), we cannot make a fair and/or reliable comparison between methods. Considering HDR, Indic script document recognition is still in its early stage compared to others such as Roman and Arabic. In this paper, we present a page-level handwritten document image dataset (PHDIndic_11) of 11 official Indic scripts: Bangla, Devanagari, Roman, Urdu, Oriya, Gurumukhi, Gujarati, Tamil, Telugu, Malayalam and Kannada. PHDIndic_11 is composed of 1458 document text-pages written by 463 individuals from various parts of India. Further, we report the benchmark results for handwritten script identification (HSI). Besides script identification, the dataset can be effectively used in many other applications of document image analysis such as script sentence recognition/understanding, text-line segmentation, word segmentation/recognition, word spotting, handwritten and machine printed texts separation and writer identification.

70 citations


Proceedings ArticleDOI
24 Apr 2018
TL;DR: A Convolutional Neural Network based Optical Character Recognition system (OCR) which accurately digitizes Ancient Sanskrit manuscripts (Devanagari Script) that are not necessarily in good condition.
Abstract: Ancient Sanskrit manuscripts are a rich source of knowledge about Science, Mathematics, Hindu mythology, Indian civilization, and culture. It therefore becomes critical that access to these manuscripts is made easy, to share this knowledge with the world and to facilitate further research on this Ancient literature. In this paper, we propose a Convolutional Neural Network (CNN) based Optical Character Recognition system (OCR) which accurately digitizes Ancient Sanskrit manuscripts (Devanagari Script) that are not necessarily in good condition. We use an image segmentation algorithm for calculating pixel intensities to identify letters in the image. The OCR considers typical compound characters (half letter combinations) as separate classes in order to improve the segmentation accuracy. The novelty of the OCR is its robustness to image quality, image contrast, font style and font size, which makes it an ideal choice for digitizing soiled and poorly maintained Sanskrit manuscripts.

40 citations


Proceedings ArticleDOI
24 Apr 2018
TL;DR: This paper releases a new handwritten word dataset for Devanagari, IIIT-HW-Dev, and empirically shows that usage of synthetic data and cross lingual transfer learning helps alleviate the issue of lack of training data.
Abstract: Handwriting recognition (HWR) in Indic scripts, like Devanagari is very challenging due to the subtleties in the scripts, variations in rendering and the cursive nature of the handwriting. Lack of public handwriting datasets in Indic scripts has long stymied the development of offline handwritten word recognizers and made comparison across different methods a tedious task in the field. In this paper, we release a new handwritten word dataset for Devanagari, IIIT-HW-Dev to alleviate some of these issues. We benchmark the IIIT-HW-Dev dataset using a CNN-RNN hybrid architecture. Furthermore, using this architecture, we empirically show that usage of synthetic data and cross lingual transfer learning helps alleviate the issue of lack of training data. We use this proposed pipeline on a public dataset, RoyDB and achieve state of the art results.

38 citations


Proceedings ArticleDOI
24 Apr 2018
TL;DR: The recognition accuracy obtained in the best case improves significantly the existing state-of-the-art of this handwriting recognition problem and further analysis of the simulation results provides an answer to the question: does an increase in the depth of the network eventually lead to an improved recognition performance on unknown samples?
Abstract: Deep neural network architectures have been used successfully in various document analysis studies. Its strength in producing human like performance has already been explored in handwritten English numeral recognition task. In this context, a natural question that often arises in a practitioner's mind: does an increase in the depth of the network eventually lead to an improved recognition performance on unknown samples? A goal of the present work is to search for an answer of the same through a case study of a larger class handwriting recognition problem. Here, we have studied recognition of handwritten Devanagari characters. In this study, we have implemented convolutional neural network (CNN) architectures of five different depths. We have also implemented additional neural architectures by adding two Bidirectional Long Short Term Memory (BLSTM) layers between the convolutional stack and the fully connected part of each of these five CNN networks. Simulations have been performed on two different databases of handwritten Devanagari characters consisting of 30408 and 36172 samples and a combined set consisting of 58451 samples. The recognition accuracy obtained in the best case improves significantly the existing state-of-the-art of this handwriting recognition problem. Also, further analysis of our simulation results provides an answer to the above question. Additionally, we have trained a BLSTM network alone using the Histogram of Oriented Gradient (HOG) features. Performance of this architecture failed to compete with the performance of CNN-BLSTM hybrid architecture.

34 citations


Proceedings ArticleDOI
01 Aug 2018
TL;DR: This work classifies handwritten Devanagari characters using transfer learning with AlexNet, a convolutional neural network, and reports impressive results.
Abstract: Over the past few years, deep neural networks have become widely used in computer vision and machine learning tasks such as regression, segmentation, classification, detection, and pattern recognition because of their outstanding performance. Recognition of handwritten Devanagari characters is a challenging task, but deep learning can be used effectively as a solution for such problems. Person-to-person variation in writing style makes handwritten character recognition one of the most difficult tasks. In this experiment, we classify handwritten Devanagari characters using transfer learning with AlexNet. AlexNet, a convolutional neural network, is trained on a dataset of around 16870 samples of 22 consonants of the Devanagari script and shows impressive results. Transfer learning helps the network learn faster and better than training a CNN from scratch, even when few data samples are available.

34 citations


Journal ArticleDOI
TL;DR: In this article, a cross-language platform is proposed for handwritten word recognition and spotting in low-resource scripts, where training is performed with a sufficiently large dataset of an available script and testing is done on other scripts (considered as target scripts).

29 citations


Journal ArticleDOI
TL;DR: A lexicon-free approach for the recognition of 3D handwritten words in Latin and Devanagari scripts that combines multiple classifiers using the Recognizer Output Voting Error Reduction (ROVER) framework.

29 citations


Proceedings ArticleDOI
12 Mar 2018
TL;DR: Results indicate that the proposed deep learning architecture for the recognition of handwritten multi-language numerals (mixed numerals belonging to multiple languages) produces better results than methods suggested in the previous literature.
Abstract: Deep learning systems have recently gained importance as the architecture of choice in artificial intelligence (AI). Handwritten numeral recognition is essential for the development of systems that can accurately recognize digits in different languages, which is a challenging task due to varying writing styles. Developing an optimized, writer-independent multi-language technique for numerals is still an open area of research. In this paper, we propose a deep learning architecture for the recognition of handwritten multi-language numerals (mixed numerals belonging to multiple languages): Eastern Arabic, Persian, Devanagari, Urdu, and Western Arabic. The overall accuracy on the combined multi-language database was 99.26% with a precision of 99.29% on average. The average accuracy of each individual language was found to be 99.322%. Results indicate that the proposed deep learning architecture produces better results compared to methods suggested in the previous literature.

27 citations


Journal ArticleDOI
TL;DR: A Customized Convolutional Neural Network (CCNN) that learns features automatically and predicts the class of numerals from a wide-ranging dataset; verified using K-fold cross-validation, it achieves an average accuracy of 94.93% on the test datasets.

27 citations


Journal ArticleDOI
TL;DR: This paper addresses three key challenges here: collection, compilation and organization of benchmark databases of images of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document pages respectively, and development of a bi-script and tri-script word-level script identification module using Modified log-Gabor filter as feature extractor.
Abstract: Handwritten document image dataset is one of the basic necessities to conduct research on developing Optical Character Recognition (OCR) systems. In a multilingual country like India, handwritten documents often contain more than one script, leading to complex pattern analysis problems. In this paper, we highlight two such situations where Devanagari and Bangla scripts, two most widely used scripts in Indian sub-continent, are individually used along with Roman script in documents. We address three key challenges here: 1) collection, compilation and organization of benchmark databases of images of 150 Bangla-Roman and 150 Devanagari-Roman mixed-script handwritten document pages respectively, 2) script-level annotation of 18931 Bangla words, 15528 Devanagari words and 10331 Roman words in those 300 document pages, and 3) development of a bi-script and tri-script word-level script identification module using Modified log-Gabor filter as feature extractor. The technique is statistically validated using multiple classifiers and it is found that Multi-Layer Perceptron (MLP) classifier performs the best. Average word-level script identification accuracies of 92.32%, 95.30% and 93.78% are achieved using 3-fold cross validation for Bangla-Roman, Devanagari-Roman and Bangla-Devanagari-Roman databases respectively. Both the mixed-script document databases along with the script-level annotations and 44790 extracted word images of the three aforementioned scripts are available freely at https://code.google.com/p/cmaterdb/ .

27 citations


Proceedings ArticleDOI
01 Feb 2018
TL;DR: The proposed approach achieves the maximum of 99.27% classification accuracy in training and is able to recognize the different handwritten Devanagari characters with an average accuracy of 97.06%.
Abstract: Hindi is a common and highly popular language in countries such as India and Nepal. People use this language not only for conversation but also on vehicle license plates, documents, sign boards, handwritten notes, etc. In recent years, many approaches have been proposed for Hindi character recognition, along with applications such as text-to-speech translation and automatic license plate recognition. Some computationally expensive approaches have achieved desirable accuracy, but for light computing devices, recognition of handwritten characters is still a challenging task. This paper proposes an approach for the recognition of handwritten Devanagari characters. The shape variance of characters in the Devanagari script is exhibited by variants of curves. These characters are distinguished using feature extraction in a piecewise manner. An image partitioning technique is used for piecewise extraction of histogram of oriented gradients (HOG) features. To train the neural network, a feature vector comprising the HOG features of all partitions is used. The proposed approach achieves a maximum of 99.27% classification accuracy in training and is able to recognize the different handwritten Devanagari characters with an average accuracy of 97.06%. The proposed approach may be useful in applications that help blind people read handwritten content.
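
The piecewise HOG idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function name `piecewise_hog`, the 2x2 grid, the 9 unsigned orientation bins, and the per-partition L2 normalisation are all assumptions chosen for illustration, and the image is assumed to be a list of rows of grayscale values.

```python
import math

def piecewise_hog(image, grid=(2, 2), bins=9):
    """Concatenate one gradient-orientation histogram per image partition.

    Hypothetical helper: grid size, bin count and normalisation are
    illustrative assumptions, not values taken from the paper.
    """
    h, w = len(image), len(image[0])
    rows, cols = grid
    features = []
    for r in range(rows):
        for c in range(cols):
            # Pixel bounds of this partition
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            hist = [0.0] * bins
            for y in range(y0, y1):
                for x in range(x0, x1):
                    # Central differences, clamped at the image border
                    gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
                    gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
                    mag = math.hypot(gx, gy)
                    if mag == 0:
                        continue
                    # Unsigned orientation in [0, 180) degrees
                    angle = math.degrees(math.atan2(gy, gx)) % 180.0
                    b = min(int(angle / (180.0 / bins)), bins - 1)
                    hist[b] += mag  # magnitude-weighted vote
            # L2-normalise each partition's histogram
            norm = math.sqrt(sum(v * v for v in hist)) or 1.0
            features.extend(v / norm for v in hist)
    return features

# Usage: an 8x8 image with a single vertical edge
image = [[0] * 4 + [1] * 4 for _ in range(8)]
vector = piecewise_hog(image)
print(len(vector))  # one 9-bin histogram for each of the 4 partitions
```

The concatenated vector would then be fed to a classifier; the abstract mentions a neural network, whose architecture is not described there.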

Journal ArticleDOI
TL;DR: This paper reports an effective synthesizer for static and dynamic signatures written in Devanagari or Bengali scripts, and obtains promising results with artificially generated signatures in terms of appearance and performance when compared with those for real signatures.
Abstract: Developing an automatic signature verification system is challenging and demands a large number of training samples. This is why synthetic handwriting generation is an emerging topic in document image analysis. Some handwriting synthesizers use the motor equivalence model, the well-established hypothesis from neuroscience, which analyses how a human being accomplishes movement. Specifically, a motor equivalence model divides human actions into two steps: 1) the effector independent step at cognitive level and 2) the effector dependent step at motor level. In fact, recent work reports the successful application to Western scripts of a handwriting synthesizer based on this theory. This paper aims to adapt this scheme for the generation of synthetic signatures in two Indic scripts, Bengali (Bangla) and Devanagari (Hindi). For this purpose, we use two different online and offline databases for both Bengali and Devanagari signatures. This paper reports an effective synthesizer for static and dynamic signatures written in Devanagari or Bengali scripts. We obtain promising results with artificially generated signatures in terms of appearance and performance when we compare the results with those for real signatures.

Proceedings ArticleDOI
01 Aug 2018
TL;DR: A framework for annotating large scale of handwritten word images with ease and speed is proposed, and a new handwritten word dataset for Telugu is released, which is collected and annotated using the proposed framework.
Abstract: Handwriting recognition (HWR) in Indic scripts is a challenging problem due to the inherent subtleties in the scripts, cursive nature of the handwriting and similar shape of the characters. Lack of publicly available handwriting datasets in Indic scripts has affected the development of handwritten word recognizers, and made direct comparisons across different methods an impossible task in the field. In this paper, we propose a framework for annotating large scale of handwritten word images with ease and speed. We also release a new handwritten word dataset for Telugu, which is collected and annotated using the proposed framework. We also benchmark major Indic scripts such as Devanagari, Bangla and Telugu for the tasks of word spotting and handwriting recognition using state of the art deep neural architectures. Finally, we evaluate the proposed pipeline on RoyDB, a public dataset, and achieve significant reduction in error rates.

Book ChapterDOI
18 Dec 2018
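TL;DR: This study uses readily available pre-trained Convolutional Neural Network architectures on four different Indic scripts, viz. Bangla, Devanagari, Oriya, and Telugu, to achieve a satisfactory recognition rate for handwritten Indic numerals.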
Abstract: Filling up forms at post offices, railway counters, and for application of jobs has become a routine for modern people, especially in a developing country like India. Research on automation for the recognition of such handwritten forms has become mandatory. This applies more for a multilingual country like India. In the present work, we use readily available pre-trained Convolutional Neural Network (CNN) architectures on four different Indic scripts, viz. Bangla, Devanagari, Oriya, and Telugu to achieve a satisfactory recognition rate for handwritten Indic numerals. Furthermore, we have mixed Bangla and Oriya numerals and applied transfer learning for recognition. The main objective of this study is to realize how good a CNN model trained on an entire different dataset (of natural images) works for small and unrelated datasets. As a part of practical application, we have applied the proposed approach to recognize Bangla handwritten pin codes after their extraction from postal letters.

Proceedings ArticleDOI
01 Jan 2018
TL;DR: A script invariant feature vector is designed here based on the concept of the DAISY descriptor which has previously been applied in different research domains and is a computationally inexpensive approach when compared to other state-of-the-art prevalent architectures like LSTM or CNN.
Abstract: Handwritten digit recognition is a highly evolved research domain. The major issues that make this domain challenging are different photometric discrepancies, along with computation complexity. A script invariant feature vector is designed here based on the concept of the DAISY descriptor which has previously been applied in different research domains. We have applied this feature descriptor after suitable customization to fit it into the aforesaid classification problem. We have tested the same on handwritten digits written in four different scripts namely Arabic, Bangla, Devanagari and Roman. Bangla dataset is in-house, while the remaining are the standard databases. Experimental results demonstrate the effectiveness of the said feature descriptor for digit recognition. It is a computationally inexpensive approach when compared to other state-of-the-art prevalent architectures like LSTM or CNN.

Journal ArticleDOI
TL;DR: The authors showed that copying and writing complex akshara improves learning and that copying is more time efficient than writing from memory, suggesting that having beginning learners copy akshara is an important pedagogical tool to use in classrooms.
Abstract: Hindi graphs, called akshara, are difficult to learn because of their visual complexity and large set of graphs. Akshara containing multiple consonants (complex akshara) are particularly difficult. In Hindi, complex akshara are formed by fusing individual consonantal graphs. Some complex akshara look similar to their component parts (transparent), whereas others do not (opaque). We taught 35 English-speaking adults a semi-artificial orthography that was modeled on the Devanagari script used for Hindi and other Indic languages. Participants were taught 80 complex akshara using 4 different methods: (1) choosing the components (from several choices) given the graph (2) choosing the correct graph (from several choices) given its components, (3) copying a graph while the graph and its components are displayed, and (4) writing a graph from memory given its components. Methods 1 and 2 compare emphasis on part-whole versus whole-part relationships, methods 1 & 2 and 3 & 4 compare motor effects, and methods 3 and 4 compare testing effects. We found that transparent graphs were better learned than opaque graphs. Testing on the akshara typically did not improve learning and there were few effects of emphasis on part-whole versus whole-part relationships. There was evidence for motor effects; copying & writing the akshara improved pure orthographic knowledge and people’s ability to produce the phonological form of a given akshara. These results corroborate other studies showing that copying and writing graphs helps beginning learners of English, Chinese, and Arabic build orthographic knowledge. Copying was more time efficient than writing, suggesting that having beginning learners copy akshara is an important pedagogical tool to use in classrooms.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: An effective handwritten numeral recognition approach based on Convolutional Neural Network (CNN) and Support Vector Machine (SVM) and the results show that the performance of the proposed approach is better than state-of-art approaches.
Abstract: Handwritten numeral recognition is an interesting area of research in the field of computer vision and pattern recognition. It plays an important role in postal automation services especially in a country like India where multiple languages and scripts are used. So, the recognition system needs to deal with many challenges like varying writing styles and cursive nature of handwriting. This paper proposes an effective handwritten numeral recognition approach based on Convolutional Neural Network (CNN) and Support Vector Machine (SVM). The proposed work is an attempt to develop a recognition system for recognizing the handwritten digits written in any one of the regional languages: Bangla, Devanagari, Oriya, and Telugu. The proposed system first normalizes the input image having single digit thereafter CNN works as a feature extractor while SVM as a classifier. Experiments have been conducted on benchmark database of ISI Kolkata (having numerals of Bangla, Devanagari, and Oriya languages) and CMaterdb (having numerals of Telugu language) (Jadavpur University). The results show that our model performs best for Devanagari language with accuracy 99.41% and for Bangla, Telugu, and Oriya, the accuracies are 99.14%, 99.16%, and 94.54% respectively. So, the performance of the proposed approach is better than state-of-art approaches.


Proceedings ArticleDOI
01 Dec 2018
TL;DR: Two feature extraction techniques, namely DCT (Discrete Cosine Transformation) zigzag features and histogram of oriented gradients, are considered for extracting features of ancient Devanagari manuscripts, for the recognition of ancient documents in Devanagari script.
Abstract: In the present work, a system for recognition of ancient documents in Devanagari script is presented. Two feature extraction techniques, namely, DCT (Discrete Cosine Transformation) zigzag features and Histogram of oriented gradients are considered for extracting features of Devanagari ancient manuscripts. For recognition, three classification techniques, namely, SVM (Support Vector Machine), decision tree, and Naive Bayes are used. A database for the experiments is collected from various libraries and museums. Using SVM classifier with RBF kernel, a recognition accuracy of 90.70% with DCT zigzag feature vector of length 100 has been reported. A recognition accuracy of 90.70% with a partitioning strategy of dataset (80% data as training data and the remaining 20% data as testing data) has been achieved.
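
A DCT zigzag feature vector of length 100, as mentioned in the abstract above, could be computed roughly as follows. The abstract does not give the image size, the DCT normalisation, or the scan order, so this sketch assumes a square grayscale image, an orthonormal 2D DCT-II, and the JPEG-style zigzag scan; the function names are hypothetical.

```python
import math

def dct2(block):
    """Naive orthonormal 2D DCT-II of a square matrix (O(n^4), for clarity)."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for y in range(n):
                cy = math.cos((2 * y + 1) * u * math.pi / (2 * n))
                for x in range(n):
                    cx = math.cos((2 * x + 1) * v * math.pi / (2 * n))
                    s += block[y][x] * cy * cx
            out[u][v] = alpha(u) * alpha(v) * s
    return out

def zigzag_indices(n):
    """(row, col) pairs of an n x n matrix in JPEG-style zigzag order."""
    order = []
    for s in range(2 * n - 1):
        # Anti-diagonal s, listed with the row index increasing
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        # Even anti-diagonals are walked with the row decreasing, odd ones as listed
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order

def dct_zigzag_features(image, length=100):
    """Keep the first `length` DCT coefficients in zigzag (low-frequency-first) order."""
    coeffs = dct2(image)
    return [coeffs[r][c] for r, c in zigzag_indices(len(image))[:length]]

# Usage on a hypothetical 16x16 character image
image = [[1.0] * 16 for _ in range(16)]
features = dct_zigzag_features(image, length=100)
print(len(features))
```

The zigzag scan orders coefficients from low to high spatial frequency, so truncating to the first 100 keeps the coarse shape information and drops fine detail.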

Proceedings ArticleDOI
TL;DR: In this paper, Mask Oriented Directional (MOD) features are used for the recognition of handwritten Devanagari digits, achieving 95.02% accuracy with an SVM classifier.
Abstract: Recognition of handwritten Roman characters and numerals has been extensively studied in the last few decades, and its accuracy has reached a satisfactory level. The same cannot be said of the Devanagari script, which is one of the most popular scripts in India. This paper proposes an efficient digit recognition system for handwritten Devanagari script. The system uses a novel set of 196-element Mask Oriented Directional (MOD) features for recognition. The methodology is tested using five conventional classifiers on 6000 handwritten digit samples. Using a 3-fold cross-validation scheme, the proposed system yields its highest recognition accuracy of 95.02% with a Support Vector Machine (SVM) classifier.
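
For readers unfamiliar with the 3-fold cross-validation scheme mentioned above, a minimal sketch of the index splitting follows. The helper names, the seeded shuffle, and the way fold accuracies are averaged are assumptions for illustration; the abstract does not describe how the folds were drawn.

```python
import random

def k_fold_splits(n_samples, k=3, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    # Spread any remainder over the first folds so sizes differ by at most one
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def mean_accuracy(fold_accuracies):
    """Average the per-fold accuracies into a single reported figure."""
    return sum(fold_accuracies) / len(fold_accuracies)

# Usage: 6000 samples and 3 folds, as in the abstract
for train, test in k_fold_splits(6000, k=3):
    print(len(train), len(test))
```

Each sample is used for testing exactly once, and a classifier is trained from scratch on the remaining two thirds in every fold.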

Proceedings ArticleDOI
11 Jul 2018
TL;DR: This paper shows improved accuracy on Hindi, English and Bangla digit datasets using the proposed approach, and reports a number of cross-validation experiments on all three datasets using image augmentation.
Abstract: Handwritten digit recognition has turned into one of the demanding areas of research in the field of image processing. Many approaches have been proposed, including statistical methods, fuzzy techniques, and neural networks for feature classification and feature selection, but none has been found to use a convolutional autoencoder for Devanagari digits after performing image augmentation on the training dataset. This paper shows the use of unsupervised training using a convolutional autoencoder with a deep ConvNet in order to detect handwritten Devanagari digits, i.e., 0–9. A convolutional autoencoder is a type of autoencoder that encodes the input to extract important features and then tries to reconstruct the input image. This paper shows the improved accuracy on Hindi, English and Bangla digit datasets using the proposed approach, and also performs a number of cross-validation experiments on all three datasets using image augmentation.

Proceedings ArticleDOI
01 Aug 2018
TL;DR: This article proposes a novel approach for online handwritten word recognition in Devanagari script based on two recently developed models of Recurrent Neural Network (RNN), termed as Long-Short Term Memory (LSTM) and Bidirectional Long-Short Term Memory (BLSTM), specifically designed for sequential data where the segmentation of data into basic unit level is very difficult.
Abstract: Devanagari script is the most popular script in India. However, very little recognized work has been done in this script towards the development of online handwritten text recognition systems. The large number of symbols and the symbol order variations in this script have led to low recognition rates for even the best existing recognition systems. Most of the existing studies in Devanagari script have relied upon the same Hidden Markov Model (HMM) which has been used for so many years in handwriting recognition, despite its familiar shortcomings. This article proposes a novel approach for online handwritten word recognition in Devanagari script based on two recently developed models of Recurrent Neural Network (RNN), termed as Long-Short Term Memory (LSTM) and Bidirectional Long-Short Term Memory (BLSTM), specifically designed for sequential data where the segmentation of data into basic unit level is very difficult. Analysis shows that words are written in non-cursive fashion in Devanagari script. The proposed approach considers the local zone wise analysis of each basic stroke of a word to extract various features from each basic stroke. In this local zone wise feature extraction approach, dominant points are detected from strokes using slope angles, to find the local features. These features are then studied using both LSTM and BLSTM versions of RNN. Most of the existing word recognition systems in this script have followed the typical holistic approach, whereas the proposed system has been developed in an analytical scheme with a total of 10K words in the lexicon. An exhaustive experiment on large datasets has been performed to evaluate the performance of the proposed recognition approach using both LSTM and BLSTM to make a comparative performance analysis. Experimental results show that the proposed system outperforms existing HMM based systems in the literature.
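
The abstract above mentions detecting dominant points from strokes using slope angles but does not spell out the rule. One plausible reading, sketched below purely as an assumption, keeps the stroke endpoints plus every point where the writing direction turns by at least a threshold angle. The function name and the 30 degree threshold are hypothetical.

```python
import math

def dominant_points(stroke, angle_threshold_deg=30.0):
    """Return indices of stroke points where the writing direction turns sharply.

    `stroke` is a list of (x, y) pen positions in time order. The endpoints
    are always kept. The turning-angle rule is an illustrative assumption.
    """
    if len(stroke) < 3:
        return list(range(len(stroke)))
    keep = [0]
    for i in range(1, len(stroke) - 1):
        (x0, y0), (x1, y1), (x2, y2) = stroke[i - 1], stroke[i], stroke[i + 1]
        # Slope angles of the incoming and outgoing segments
        a_in = math.atan2(y1 - y0, x1 - x0)
        a_out = math.atan2(y2 - y1, x2 - x1)
        turn = abs(math.degrees(a_out - a_in))
        turn = min(turn, 360.0 - turn)  # wrap the difference into [0, 180]
        if turn >= angle_threshold_deg:
            keep.append(i)
    keep.append(len(stroke) - 1)
    return keep

# Usage: an L-shaped stroke with one corner at index 2
stroke = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(dominant_points(stroke))
```

Local features could then be computed per zone around the retained points before being passed, as a sequence, to an LSTM or BLSTM.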

Journal ArticleDOI
TL;DR: This study reports on two experiments investigating the effects of script differences on masked translation priming in highly proficient early Hindi-English bilinguals and provides alternative accounts for these results in terms of how orthographic cues provided by L1 targets might lead to the discontinuation or disruption of processing for L2 primes.
Abstract: This study reports on two experiments investigating the effects of script differences on masked translation priming in highly proficient early Hindi-English bilinguals. In Experiment 1 (the cross-script experiment), L1 Hindi was presented in the standard Devanagari script, while L2 English was presented in the Roman alphabet. In Experiment 2 (the same-script experiment), both L1 Hindi and L2 English were presented in the Roman alphabet. Both experiments revealed translation priming in the L1-L2 direction. However, L2-L1 priming was obtained in the same-script experiment, but not in the cross-script experiment. These findings are discussed in relation to the orthographic cue hypothesis as well as hypotheses that hold that script differences influence the distance between the L1 and L2 in lexical space and/or cross-language lateral inhibition. We also provide alternative accounts for these results in terms of how orthographic cues provided by L1 targets might lead to the discontinuation or disruption of processing for L2 primes.

Book ChapterDOI
21 Dec 2018
TL;DR: A HWDCR system that recognizes handwritten characters in Devanagari, the most popular script in India, is proposed; it uses an MLP-BP neural network classifier and achieves an average recognition accuracy of 90% on on-line data.
Abstract: Nowadays, handwritten character recognition is receiving high significance because of numerous applications such as educational software, on-line signature verification, bank cheque processing, postal code recognition, and electronic libraries. Very little work has been reported on Devanagari handwritten character recognition (HWDCR), so there is large scope for research in this area. In this paper, we propose a HWDCR system that recognizes handwritten characters in Devanagari, the most popular script in India. A handwritten character is input using a pen tablet, and its on-line features, such as the sequence of (x, y) coordinates, stroke, and pressure information, are extracted and passed to a classifier. We use an MLP-BP neural network classifier for classification. The average recognition accuracy achieved by the proposed HWDCR system is 90% using on-line data.

Journal ArticleDOI
17 Dec 2018
TL;DR: A new category of features called ‘sub-stroke-wise relative feature’ (SRF), based on relative information of the constituent parts of handwritten strokes, is proposed; it significantly outperforms state-of-the-art feature sets for online Bangla and Devanagari cursive word recognition.
Abstract: The main problem of Bangla (Bengali) and Devanagari handwriting recognition is the shape similarity of characters. There are only a few pieces of work on writer-independent cursive online Indian text recognition, and the shape similarity problem needs more attention from the researchers. To handle the shape similarity problem of cursive characters of Bangla and Devanagari scripts, in this article, we propose a new category of features called ‘sub-stroke-wise relative feature’ (SRF) which are based on relative information of the constituent parts of the handwritten strokes. Relative information among some of the parts within a character can be a distinctive feature as it scales up small dissimilarities and enhances discrimination among similar-looking shapes. Also, contextual anticipatory phenomena are automatically modeled by this type of feature, as it takes into account the influence of previous and forthcoming strokes. We have tested popular state-of-the-art feature sets as well as proposed SRF using various (up to 20,000-word) lexicons and noticed that SRF significantly outperforms the state-of-the-art feature sets for online Bangla and Devanagari cursive word recognition.

01 Jan 2018
TL;DR: Efforts towards developing general domain English–Bangla MT systems that are deployable to the Web are described; the NMT systems gain significant improvement over SMT baselines.
Abstract: A large percentage of the world’s population speaks a language of the Indian subcontinent, what we will call here Indic languages, comprising languages from both Indo-European (e.g., Hindi, Bangla, Gujarati, etc.) and Dravidian (e.g., Tamil, Telugu, Malayalam, etc.) families, upwards of 1.5 Billion people. A universal characteristic of Indic languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high quality parallel data, can make developing machine translation (MT) for these languages difficult. In this paper, we describe our efforts towards developing general domain English–Bangla MT systems which are deployable to the Web. We initially developed and deployed SMT-based systems, but over time migrated to NMT-based systems. Our initial SMT-based systems had reasonably good BLEU scores, however, using NMT systems, we have gained significant improvement over SMT baselines. This is achieved using a number of ideas to boost the data store and counter data sparsity: crowd translation of intelligently selected monolingual data (throughput enhanced by an IME (Input Method Editor) designed specifically for QWERTY keyboard entry for Devanagari scripted languages), back-translation, different regularization techniques, dataset augmentation and early stopping.

Book ChapterDOI
21 Dec 2018
TL;DR: A novel iterative isotropic dilation algorithm is proposed here to convert the components into a single component object and promising accuracy has been observed.
Abstract: In this work, a new problem of script identification, named artistic multi-character script identification, is addressed. Two types of datasets of artistic documents/images prepared with Bangla, Devanagari and Roman scripts have been used: one consists of real-life artistic multi-character script images and the other of synthetic artistic multi-character script images. After binarization using Otsu’s algorithm, some character images were found to be broken into components. To overcome this, a novel iterative isotropic dilation algorithm is proposed here to convert the components into a single-component object. Then two types of features, namely shape-based and texture-based features, have been considered. A discrete Gabor wavelet with 2 scales and 4 orientations has been used for texture feature extraction, and PCA is used to reduce the dimensionality of the texture feature space. The performance of the proposed algorithm has been tested with different machine learning classifiers and promising accuracy has been observed.
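
The binarize-then-merge step described in this abstract can be sketched in a few lines of pure Python. This is only a minimal illustration under stated assumptions, not the authors' algorithm: the abstract does not specify the structuring element or the stopping rule, so the sketch assumes a disk-shaped (Euclidean) structuring element whose radius grows by one pixel per iteration, and it stops as soon as the image contains a single 8-connected component. The function names (`otsu_threshold`, `dilate`, `count_components`, `merge_by_dilation`) and the `max_radius` parameter are hypothetical.

```python
from collections import deque


def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, the rest the other class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def dilate(img, radius):
    """Dilate a binary image (list of 0/1 rows) with a disk of the given radius."""
    h, w = len(img), len(img[0])
    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)
               if dy * dy + dx * dx <= radius * radius]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        out[ny][nx] = 1
    return out


def count_components(img):
    """Count 8-connected foreground components with a breadth-first search."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                count += 1
                seen[y][x] = True
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
    return count


def merge_by_dilation(img, max_radius=10):
    """Grow the disk radius (assumed rule) until one component remains.

    Returns (dilated image, radius used). If max_radius is reached first,
    the last dilated image is returned even if it is still fragmented.
    """
    grown = img
    for radius in range(max_radius + 1):
        grown = dilate(img, radius)
        if count_components(grown) <= 1:
            return grown, radius
    return grown, max_radius
```

To use it on a grayscale image stored as rows of 0-255 integers with dark ink on a light background, one would flatten the image, call `otsu_threshold`, mark pixels at or below the threshold as foreground, and pass the resulting binary image to `merge_by_dilation`. Dilating the original image with a growing disk, rather than repeatedly applying a 3x3 square, keeps the growth close to equal in all directions, which is one plausible reading of "isotropic".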

Proceedings ArticleDOI
01 Jun 2018
TL;DR: A robust model for writer identification for Indic languages is proposed and it is efficient for recognizing and classifying data because of its feature extraction and training at different convolution and pooling stages.
Abstract: Writer identification plays an important role in fraud detection, for example in cases of unauthorized access to banks and in other security checks. Forensic document analysis is also carried out on handwritten documents in Devanagari-script languages. We propose a robust model for writer identification for Indic languages. It is difficult to efficiently extract words and characters from a Devanagari handwritten document because of overlapping and touching characters, compound characters, modifiers, etc. The proposed model is efficient for recognizing and classifying data because of its feature extraction and training at different convolution and pooling stages. We have prepared a Devanagari (Hindi) dataset from 80 students. The proposed model is trained using the prepared Hindi dataset, and it does not require any domain knowledge for handwriting recognition. Experiments were carried out on three different languages (Hindi, Kannada and Arabic) and gave satisfactory results.

Journal ArticleDOI
TL;DR: An automatic approach for line-level handwritten script identification (HSI), considering eight official Indic scripts namely: Bangla, Devanagari, Kannada, Malayalam, Oriya, Roman, Telugu, and Urdu is proposed, and multilayer perceptron (MLP) is found as the best performer.
Abstract: Script identification is a well-studied problem for automatic processing of document images. Several attempts have been made so far, but the problem is still far from completely solved. In this paper, an automatic approach for line-level handwritten script identification (HSI), considering eight official Indic scripts, namely Bangla, Devanagari, Kannada, Malayalam, Oriya, Roman, Telugu, and Urdu, is proposed. We consider a 148-dimensional feature vector using image component fractal dimension, structural and visual appearance, directional stroke, interpolation and Gabor energy based texture features. For classification, we divide the whole script dataset based on different regions of India, to study region-wise classification performance. Experimentation was carried out using state-of-the-art classifiers: multilayer perceptron (MLP), support vector machine (SVM), random forest (RF), and fuzzy unordered rule induction algorithm (FURIA). Among all, we found MLP to be the best performer, with average accuracies of 98.2%, 99.5%, 99.1%, 99.5%, 99.9%, 98%, 98.9% for the eight-script, bi-script, eastern, north, south Indian script groups, scripts with ‘matra’ vs. without ‘matra’, and Dravidian vs. non-Dravidian groups, respectively.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: This research paper proposes a lightweight inflectional stemmer using an affix stripping approach for Sindhi in Devanagari script.
Abstract: Nowadays the web is multilingual and holds a large amount of information. In order to access information in different languages, we require language processing tools such as a stemmer, a part-of-speech tagger, etc. Although plenty of language processing tools are available for some languages, other languages are still not getting the attention of the research community. The Sindhi language is one of them. In this research paper, we propose a lightweight inflectional stemmer using an affix stripping approach for Sindhi in Devanagari script.
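
As a rough illustration of what affix stripping means in practice, the sketch below removes at most one prefix and one suffix from a word, trying longer affixes first and refusing to strip when the remaining stem would be too short. It is a generic sketch, not the stemmer proposed in the paper: the abstract gives no rule set, so the function name `strip_affixes`, the `min_stem_len` parameter and the `DEMO_SUFFIXES` tuple are all assumptions, and the demo suffixes are placeholder Devanagari strings chosen only to exercise the code, not a validated list of Sindhi inflections.

```python
# Placeholder suffixes for demonstration only (assumption): these are NOT
# taken from the paper and are not a validated Sindhi suffix list.
DEMO_SUFFIXES = ("ों", "ें", "ी")


def strip_affixes(word, suffixes=DEMO_SUFFIXES, prefixes=(), min_stem_len=2):
    """Strip at most one prefix and one suffix, longest match first.

    An affix is removed only if at least min_stem_len code points remain,
    which guards against reducing short words to nothing.
    """
    for pre in sorted(prefixes, key=len, reverse=True):
        if pre and word.startswith(pre) and len(word) - len(pre) >= min_stem_len:
            word = word[len(pre):]
            break
    for suf in sorted(suffixes, key=len, reverse=True):
        if suf and word.endswith(suf) and len(word) - len(suf) >= min_stem_len:
            word = word[:-len(suf)]
            break
    return word
```

Lengths here are counted in Unicode code points, so a consonant plus a vowel sign counts as two units. A real stemmer for a Devanagari-script language would likely need normalization (for example, a consistent Unicode normalization form) before matching, and a linguistically motivated affix list.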