An Empirical Intrinsic mode based characterization of
Indian Scripts
Kavita Bhardwaj∗
Indian Institute of Technology
New Delhi, India
kavitab788@gmail.com
Santanu Chaudhury
Indian Institute of Technology
New Delhi, India
schaudhury@gmail.com
Sumantra Dutta Roy
Indian Institute of Technology
New Delhi, India
sumantra.dutta.roy@gmail.com
ABSTRACT
In this paper, we describe a novel technique for document script
identification (DSI) from printed documents using Empirical Mode
Decomposition (EMD). Because EMD is adaptive, it decomposes script
images into a series of modes that represent different local features
of the images. In this method, Radon-transformed script images are
decomposed into a finite set of Intrinsic Mode Functions (IMFs). The
energy concentration in a particular orientation characterises a
script texture, as it indicates the dominance of an individual script
in that direction. We demonstrate how the proposed method uses these
IMFs as feature vectors to distinguish various scripts.
Keywords:
Empirical mode decomposition (EMD), Radon transform, Intrinsic
mode function, AdaBoostM1
1. INTRODUCTION
Identifying the script used in printed documents is useful for
digitizing conventional paper documents, sorting document images
according to the scripts in which they are written, selecting
appropriate script-specific OCRs for the retrieval of online archives
of document images, and indexing documents in digital libraries.
Ghosh et al. [1] propose categorising script recognition methods as
structure-based and visual appearance-based. They discuss the methods
of both categories at page level, paragraph level, word level and
character level, and present an extensive survey of each category.
Following Wang et al. [9] and Ghosh et al. [1], the methods in either
category can be grouped, according to the feature extraction used,
into three major groups: statistical information-based methods,
structure-based methods, and texture-based methods. Statistical
information-based algorithms use character density distribution and
vertical and horizontal projections to classify printed documents.
Waked et al. [8] used bounding box size distribution, character
density distribution,
∗ Corresponding author
DAR ’12, December 16, 2012, Mumbai, IN, India
Copyright 2012 ACM 978-1-4503-1797-9/12/12 ...$15.00.
vertical and horizontal projections for the classification of printed
documents. Lam et al. [5] also used statistical features for script
identification in printed documents. These methods are more useful
for scripts that differ significantly in style. Structure-based
methods focus on the extraction and analysis of connected components
and use the identification results of these "signatures" to determine
the script(s) used. These methods in general have the advantage of
discriminating similar scripts. Hochberg et al. [2] exploited the
shape characteristics of "textual symbols" for the identification of
script(s). Pal and Chaudhuri [6] presented script characteristics and
shape-based features for script identification. Visual
appearance-based and texture analysis-based methods are related,
because the appearance of a text block determines which texture
analysis-based method can be used to extract features. Joshi et al.
[4] proposed Gabor function-based texture analysis to extract
features and used hierarchical classification to distinguish among
the script(s). Tan [7] developed Gabor function-based texture
analysis for machine-printed script identification that discriminates
Chinese, Latin, Greek, Russian, Persian, and Malayalam script
documents.
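As an illustration of the vertical and horizontal projections used by
the statistical methods surveyed above, the following is a minimal
sketch, not taken from any of the cited works. It assumes a binarized
image stored as a list of rows in which 1 marks an ink pixel and 0 the
background; the function name and the toy image are hypothetical. Row
and column sums of this kind are also the discrete projections of the
image along two perpendicular directions.

```python
def projection_profiles(image):
    """Return (horizontal, vertical) projection profiles of a binary image.

    Assumption: `image` is a list of equal-length rows, with 1 for an
    ink pixel and 0 for background.
    """
    horizontal = [sum(row) for row in image]        # ink pixels per row
    vertical = [sum(col) for col in zip(*image)]    # ink pixels per column
    return horizontal, vertical


# Toy 3x4 image (made up for illustration): a top bar with a vertical stem.
toy = [
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
h, v = projection_profiles(toy)
print(h)  # [4, 1, 1]
print(v)  # [1, 3, 1, 1]
```

The strong first entry of the horizontal profile reflects the top bar
of the toy pattern, which is the kind of cue such profiles capture.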
For our problem, we propose an algorithm based on Empirical Mode
Decomposition (EMD) for textural analysis of script classes.
Directionality and periodicity indicate the effective directions for
textural processing of subpatterns. Each script class exhibits a
specific periodicity at a particular angular orientation. This is
observed in the Radon-transformed images of the script classes
considered, which are decomposed into mode functions to compute the
directional energy captured by each IMF. The scripts considered in
this paper are Devnagari, Roman (English), Malayalam, Bangla and
Gurumukhi. Cosine similarity is used as the measure to identify the
most similar script class for discrimination. We use an AdaBoost
binary decision tree to improve the classification.
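The cosine similarity step can be sketched as follows. This is a
minimal illustration under stated assumptions rather than the
implementation used in this paper: each script class is assumed to be
summarised by one reference feature vector, the labels and numbers
below are made up, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


def most_similar_script(query, references):
    """Return the label whose reference vector is most similar to `query`."""
    return max(references,
               key=lambda label: cosine_similarity(query, references[label]))


# Hypothetical energy feature vectors (illustrative values only).
references = {
    "script_A": [0.9, 0.1, 0.2],
    "script_B": [0.1, 0.8, 0.3],
}
query = [0.8, 0.2, 0.1]
best = most_similar_script(query, references)
print(best)  # script_A
```

Because cosine similarity depends only on the direction of the
vectors, scaling all energies of a sample by a constant does not
change which class is selected.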
The rest of the paper is organized as follows. Sec. 2 summarizes the
proposed approach, and the framework for the problem is described in
Subsec. 2.1. The results are shown in Table 3 in Sec. 3. Experimental
observations are described in Sec. 4. Conclusions and future work are
discussed in Sec. 5.
2. THE PROPOSED APPROACH
The method described in this paper involves four main steps.
• First, preprocessing, which includes binarization of the document
images, is performed to remove noise.
• Second, the Radon transform is computed on document images of each
script at different angles of orientation between 0° and 90°. The
unique characteristic of each script is observed at a particular
orientation in the Radon-transformed image. The