Other affiliations: University of Mysore
Bio: Umapada Pal is an academic researcher from Indian Statistical Institute. The author has contributed to research in topic(s): Feature extraction & Handwriting recognition. The author has an hindex of 47, co-authored 478 publication(s) receiving 9925 citation(s). Previous affiliations of Umapada Pal include University of Mysore.
Topics: Feature extraction, Handwriting recognition, Image segmentation, Optical character recognition, Feature (machine learning)
TL;DR: A review of the OCR work done on Indian language scripts and the scope of future work and further steps needed for Indian script OCR development is presented.
Abstract: Intensive research has been done on optical character recognition (OCR) and a large number of articles have been published on this topic during the last few decades. Many commercial OCR systems are now available in the market. But most of these systems work for Roman, Chinese, Japanese and Arabic characters. There are no sufficient number of work on Indian language character recognition although there are 12 major scripts in India. In this paper, we present a review of the OCR work done on Indian language scripts. The review is organized into 5 sections. Sections 1 and 2 cover introduction and properties on Indian scripts. In Section 3, we discuss different methodologies in OCR development as well as research work done on Indian scripts recognition. In Section 4, we discuss the scope of future work and further steps needed for Indian script OCR development. In Section 5 we conclude the paper.
TL;DR: A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented and extension of the work to Devnagari, the third most popular Script in the World, is discussed.
Abstract: A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented This is the first OCR system among all script forms used in the Indian sub-continent The problem is difficult because (i) there are about 300 basic, modified and compound character shapes in the script, (ii) the characters in a word are topologically connected and (iii) Bangla is an inflectional language In our system the document image captured by Flat-bed scanner is subject to skew correction, text graphics separation, line segmentation, zone detection, word and character segmentation using some conventional and some newly developed techniques From zonal information and shape characteristics, the basic, modified and compound characters are separated for the convenience of classification The basic and modified characters which are about 75 in number and which occupy about 96% of the text corpus, are recognized by a structural-feature-based tree classifier The compound characters are recognized by a tree classifier followed by template-matching approach The feature detection is simple and robust where preprocessing like thinning and pruning are avoided The character unigram statistics is used to make the tree classifier efficient Several heuristics are also used to speed up the template matching approach A dictionary-based error-correction scheme has been used where separate dictionaries are compiled for root word and suffixes that contain morpho-syntactic informations as well For single font clear documents 9550% word level (which is equivalent to 9910% character level) recognition accuracy has been obtained Extension of the work to Devnagari, the third most popular script in the world, is also discussed
01 Nov 2017
TL;DR: This paper presents the dataset, the tasks and the findings of this RRC-MLT challenge, which aims at assessing the ability of state-of-the-art methods to detect Multi-Lingual Text in scene images, such as in contents gathered from the Internet media and in modern cities where multiple cultures live and communicate together.
Abstract: Text detection and recognition in a natural environment are key components of many applications, ranging from business card digitization to shop indexation in a street. This competition aims at assessing the ability of state-of-the-art methods to detect Multi-Lingual Text (MLT) in scene images, such as in contents gathered from the Internet media and in modern cities where multiple cultures live and communicate together. This competition is an extension of the Robust Reading Competition (RRC) which has been held since 2003 both in ICDAR and in an online context. The proposed competition is presented as a new challenge of the RRC. The dataset built for this challenge largely extends the previous RRC editions in many aspects: the multi-lingual text, the size of the dataset, the multi-oriented text, the wide variety of scenes. The dataset is comprised of 18,000 images which contain text belonging to 9 languages. The challenge is comprised of three tasks related to text detection and script classification. We have received a total of 16 participations from the research and industrial communities. This paper presents the dataset, the tasks and the findings of this RRC-MLT challenge.
••18 Aug 1997
TL;DR: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent, and shows a good performance for single font scripts printed on clear documents.
Abstract: An OCR system is proposed that can read two Indian language scripts: Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent. These scripts, having the same origin in ancient Brahmi script, have many features in common and hence a single system can be modeled to recognize them. In the proposed model, document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, character grouping into basic, modifier and compound character category are done for both scripts by the same set of algorithms. The feature sets and classification tree as well as the knowledge base required for error correction (such as lexicon) differ for Bangla and Devnagari. The system shows a good performance for single font scripts printed on clear documents.
••03 Aug 2003
TL;DR: A robust scheme to segment unconstrained handwritten Banglatexts into lines, words and characters based on water reservoir principle is proposed to take care of variability involved in the writing style of different individuals.
Abstract: To take care of variability involved in the writing style ofdifferent individuals in this paper we propose a robustscheme to segment unconstrained handwritten Banglatexts into lines, words and characters. For linesegmentation, at first, we divide the text into verticalstripes. Stripe width of a document is computed bystatistical analysis of the text height in the document.Next we determine horizontal histogram of these stripesand the relationship of the minimal values of thehistograms is used to segment text lines. Based onvertical projection profile lines are segmented intowords. Segmentation of characters from handwrittenword is very tricky as the characters are seldomvertically separable. We use a concept based on waterreservoir principle for the purpose. Here we, at first,identify isolated and connected (touching) characters ina word. Next touching characters of the word aresegmented based on the reservoir base area points andstructural feature of the component.
01 Jan 2006
Christopher M. Bishop1•Institutions (1)
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.
01 Jan 1990
Teuvo Kohonen1•Institutions (1)
Abstract: An overview of the self-organizing map algorithm, on which the papers in this issue are based, is presented in this article.
••15 Oct 2004
Author's H-index: 47