
A study on word-level multi-script identification from video
frames
Author
Sharma, Nabin, Pal, Umapada, Blumenstein, Michael
Published
2014
Conference Title
2014 International Joint Conference on Neural Networks (IJCNN)
DOI
https://doi.org/10.1109/IJCNN.2014.6889906
Copyright Statement
© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing this
material for advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in other
works.
Downloaded from
http://hdl.handle.net/10072/66731
Link to published version
http://www.ieee-wcci2014.org/
Griffith Research Online
https://research-repository.griffith.edu.au

A Study on Word-Level Multi-script Identification
from Video Frames
Nabin Sharma
, Umapada Pal
, Michael Blumenstein
School of Information and Communication Technology, Griffith University, Australia
Email: {nabin.sharma, m.blumenstein}@griffith.edu.au
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India
Email: umapada@isical.ac.in
Abstract—The presence of multiple scripts in multi-lingual
document images makes Optical Character Recognition (OCR)
of such documents a challenging task. Due to the unavailability
of a single OCR system which can handle multiple scripts,
script identification becomes an essential step for choosing
the appropriate OCR. Although various techniques are
available for script identification from handwritten and
printed documents with simple backgrounds, script
identification from video frames has seldom been explored.
Video frames are coloured and suffer from low resolution,
blur, complex backgrounds and noise, among other issues, which
makes the script identification process a challenging task. This
paper presents a study of various combinations of features and
classifiers to explore whether the traditional script identification
techniques can be applied to video frames. A texture-based
feature, namely Local Binary Pattern (LBP), and gradient-based
features, namely Histogram of Oriented Gradient (HoG) and
Gradient Local Auto-Correlation (GLAC), were used in the
study. Combinations of these features with SVMs and ANNs were
used for classification. Three popular scripts, namely English,
Bengali and Hindi were considered in the present study. Due
to the inherent problems with the video, a super resolution
technique was applied as a pre-processing step. Experiments
show that the GLAC feature has performed better than the
other features, and an accuracy of 94.25% was achieved when
testing on 1271 words from three different scripts. The study
also reveals that gradient features are more suitable for script
identification than the texture features when using traditional
script identification techniques on video frames.
Keywords: Video document analysis, Script identification, Word
segmentation, OCR.
I. INTRODUCTION
India is a multi-lingual and multi-script country where
the use of multiple scripts is quite common for informa-
tion communication through news and advertisement videos
transmitted across various television channels. The massive
information explosion across multiple communication channels
creates a very large database of videos, which makes indexing
an essential task for effective management of the database.
Thus, text present in the video plays an important role in
automatic video indexing and retrieval. Hence, OCR of the
multi-lingual video text is essential. Due to the unavailability
of a universal OCR to recognize the multi-lingual text, script
identification followed by the use of appropriate OCR is a
legitimate approach to recognizing the text.
The research on script identification to date primarily
focuses on processing scanned documents with simple back-
grounds and good resolution required for OCR. Whereas the
difficulties involved in script identification from video frames
include low resolution, blur, complex backgrounds, multiple
font types and sizes, and orientation of the text [2], [3]. Samples
of video frames having text written in multiple scripts are
shown in Figure 1. Figure 1(a) is an example of a video
frame having text written in English and Hindi with different
orientations, fonts, and sizes. Figures 1(b) and 1(c) are examples of
video frames having text in low resolution and blur. Figure
1(b) has text written in Hindi and English in a single text line.
Figure 1(c) is an example of a video frame having text written
in Bengali (Bangla) and English in a single text line. Figure
1(d) is an example of a video frame having both graphics
and scene text written in Hindi and English, respectively. The
English text line is slightly blurred compared to the Hindi text,
which is much clearer. Figure 1 itself illustrates the necessity
of script identification and the challenges involved when video
frames are considered. An important characteristic of multilin-
gual videos in India is that the text is generally written in two
scripts, where the first script is English (Roman) and the other
one is a regional language.
Fig. 1: Samples of video frames having text in multiple scripts
Script identification from video frames has not been ex-
plored much as compared to traditional scanned documents.

Recently, a few papers [4], [5], [6] have been published, which
focus on the video script identification problem. Sharma et al.
[4] presented a study on word-wise script identification from
video frames using three different features namely, Zernike
moment, Gabor and 400-dimensional gradient. They used
SVMs for classification. The study established that traditional
script identification techniques can be applied to video frames
provided appropriate pre-processing techniques are applied to
the video frames to overcome the problems with video. Zhao
et al. [5] on the other hand proposed Spatial-Gradient-Features
at the block level to identify six different scripts. The method
considers text lines extracted from the video frames for the
experiments, assuming that a video frame contains text written
in a single script. Six different scripts were considered in
the work and an average classification rate of 82.1% was
reported on a dataset of 770 frames. Phan et al. [6] also
proposed a line-wise script identification technique based on
the smoothness and cursiveness of the lines. A video text line
was horizontally divided into five equal zones to study the
smoothness and cursiveness of the upper and lower lines for
script identification. English, Chinese and Tamil script pairs
were considered in their experiments. Li and Tan [14] proposed
a statistical script identification approach from camera-based
images.
There are many methods [1], [9], [8], [12], [7] avail-
able for script identification from scanned documents having
simple backgrounds. A review of various script identification
techniques used for script identification at the page, line
and word levels, was presented by Ghosh et al. [1]. The
various techniques can be classified into two broad categories,
namely: structure-based and visual appearance-based methods.
The review [1] shows that the methods used for traditional
scanned documents can be used for camera-based documents
even though the former have much better resolution than
the video frames, and the latter suffer from issues such as
low resolution, and complex backgrounds, to mention a few.
A two-stage approach based word-wise script identification
technique was proposed by Chanda et al. [9]. In the first
stage, a high-speed method for identifying scripts in noisy
environments was used. The second stage processed the samples
for which low recognition confidence was achieved. Finally, they
used a majority voting-based method to identify the script.
Two different features namely, 64-dimensional chain code
histogram and 400-dimensional gradient features were used in
the first and second stages, respectively. English, Devanagari
and Bengali scripts were considered for the experiments. The
study presented by Pati and Ramakrishnan [8] revealed
that the use of Gabor features with nearest neighbor or SVM
classifiers gave a better performance for word-level multi-
script identification. A combination of discrete cosine trans-
form (DCT) features with SVMs, nearest neighbor and linear
discriminant classifiers were also evaluated in their study. The
authors [8] used a dataset comprising images with simple
backgrounds for their experiments.
Although there are works on line-wise video script identi-
fication, to the best of our knowledge there is only one work
[4] reported in the literature on word-wise script identification
from video. In this paper, a study of word-wise script identifi-
cation techniques from video is presented considering Indian
languages. The three most popular scripts in India namely,
English, Bengali and Hindi (Devanagari) were considered for
experimentation. Considering words for script identification
rather than a complete text line allows the identification of
the words written in different scripts, which in turn helps
in better OCR of the complete text line written in multiple
scripts. This is an important advantage of considering words
to identify scripts. Our previous study [4] revealed that the use
of appropriate pre-processing techniques on the video frame
is essential in order to use traditional techniques for video
script identification. The present study attempts to investigate
the type of features more suitable for video script identifica-
tion considering the inherent problems with video, which is
the main contribution. Hence, a comparison of texture-based
features with gradient-based features is performed. A very
popular texture-based feature namely, Local Binary Pattern
(LBP) and two gradient-based features namely, Histogram of
Oriented Gradient (HoG) and Gradient Local Auto-Correlation
(GLAC) were used in the present study. Support Vector
Machines (SVMs) and Artificial Neural Networks (ANNs)
were used for classification. As mentioned in our previous
study [4] pre-processing does help in improving the video
script identification accuracy, but the choice of features is also
equally important. Hence, the features used in the present study
were carefully selected based on their ability to provide better
randomness and description of the structural differences of the
scripts, which will increase the overall accuracy.
The rest of the paper is organized as follows. The pre-
processing technique used is discussed in Section II. Section III
presents a brief description of the feature extraction techniques
used in the present study. In Section IV, the details of SVM
and ANN classifiers are discussed. Experimental results and a
discussion are presented in Section V. Section V also provides
an analysis of the erroneous results. Section VI concludes the
paper providing future directions for video script identification.
Fig. 2: Sample video word images of English (1st Row),
Bengali (2nd Row) and Hindi (3rd Row) scripts.
II. PRE-PROCESSING
The text lines from the video frames were detected using
[11]. The words were segmented from the text lines using
our word segmentation technique [10] and were used as input
for our experiments. A few samples of segmented word images
from video frames for the three scripts are shown in Figure 2.
The images shown in Figure 2 reveal that the text extracted
from the video frames suffers from low resolution, blur, and
complex backgrounds, to mention a few issues.

Our study in [4] showed that super resolution techniques
resulted in better accuracy. Hence we used the super reso-
lution technique for pre-processing the words to get better
resolution images for further processing. A single level of
super-resolution images was used for our experiments. The
resolution of each word image was increased by a factor of 1.5 using
a cubic interpolation method [13]. Cubic interpolation was
chosen because it creates better images preserving the shape
of the original word images.
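As a rough sketch of this pre-processing step (the paper does not name an implementation; `scipy.ndimage.zoom` and the factor of 1.5 shown here are assumptions based on the description above):

```python
import numpy as np
from scipy.ndimage import zoom

def upscale_word_image(gray, factor=1.5):
    """Upscale a grayscale word image using cubic spline interpolation
    (order=3), which preserves stroke shape better than nearest-neighbour
    or bilinear resampling."""
    return zoom(gray.astype(np.float64), factor, order=3)

word = np.arange(20 * 60, dtype=np.float64).reshape(20, 60)
print(upscale_word_image(word).shape)  # (30, 90)
```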
III. FEATURE EXTRACTION TECHNIQUES
Three feature extraction techniques were considered for
the present study. One texture-based feature namely, Local
Binary Pattern (LBP) and two gradient-based features namely,
Histogram of Oriented Gradients (HoG) and Gradient Local
Auto-Correlation (GLAC) were used. A brief description of
each feature extraction technique is given below.
A. Local Binary Pattern (LBP)
Local Binary Pattern (LBP) [15] is an efficient texture
operator which labels each pixel of an image by thresholding
its neighbours. The idea behind the LBP operator is to
describe image texture using two measures, namely local
spatial patterns and gray-scale contrast.
We considered the original version of the LBP operator [15]
which forms labels of image pixels by thresholding the 3 × 3
neighbourhood of each pixel with the centre pixel value and the
result is considered as a binary number. As the neighbourhood
of the centre pixel has 8 pixels, 2^8 = 256 different labels can
be obtained and used as a texture descriptor. A histogram is
then computed over the cells, and forms the feature vector.
The basic LBP_{P,R} operator is defined as follows:

    LBP_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} S(g_p - g_c) \, 2^p        (1)
where

    S(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ 0, & \text{otherwise} \end{cases}

S(x) is a thresholding function, (x_c, y_c) is the centre pixel
in the 8-pixel neighbourhood, g_c is the gray level of the centre
pixel, and g_p denotes the gray value of a sampling point in an
equally spaced circular neighbourhood of P sampling points
and radius R around the point (x_c, y_c). An illustration of LBP
computation is shown in Figure 3. Figure 4 shows the LBP
images corresponding to the sample video word images shown
in Figure 2.
LBP was chosen for the present study because of its
ability to describe local spatial patterns, which is required
to discriminate between structurally similar scripts.
Fig. 3: An example of LBP computation — (a) a sample 3×3 pixel
neighbourhood with centre value 50 and neighbour values 32, 43,
74, 83, 67, 77, 50, 34; (b) the differences from the centre:
-18, -7, 24, 33, 17, 27, 0, -16; (c) the thresholding result:
0, 0, 1, 1, 1, 1, 1, 0.
Fig. 4: LBP Images of the corresponding video word images
shown in Figure 2 for the three scripts.
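As a concrete illustration, the basic 3×3 LBP code for the Figure 3 neighbourhood can be computed with a short NumPy sketch. The bit ordering is a convention; the sketch below assigns bit p to the p-th neighbour in the listed order:

```python
import numpy as np

def lbp_code(centre, neighbours):
    """Basic LBP: threshold each neighbour against the centre pixel and
    pack the resulting bits into an integer label in [0, 255]."""
    bits = (np.asarray(neighbours) >= centre).astype(int)
    return int((bits * (2 ** np.arange(len(bits)))).sum())

def lbp_histogram(codes, n_labels=256):
    """256-bin histogram of LBP labels, used as the texture descriptor."""
    hist, _ = np.histogram(codes, bins=n_labels, range=(0, n_labels))
    return hist

# The worked example from Fig. 3: centre pixel 50, eight neighbours.
code = lbp_code(50, [32, 43, 74, 83, 67, 77, 50, 34])
print(code)  # 124 (binary 01111100 in this bit ordering)
```

Applying `lbp_code` at every pixel of a word image and histogramming the labels yields the 256-dimensional descriptor used in the experiments.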
B. Histogram of Oriented Gradients (HoG)
Histogram of Oriented Gradients (HoG) [16] is a robust
feature descriptor commonly used in computer vision and
image processing for object detection. Dalal and Triggs [16]
first described the HoG descriptors and primarily focused on
pedestrian detection in static images. The basic idea behind the
HoG descriptor is that the shape and appearance of an object
within an image can be described by the intensity gradient
distribution or the edge directions.
The HoG descriptors are typically computed by dividing
an image into small spatial regions called ’cells’. A histogram
of the gradient direction of the pixels within the cells is
computed. The histogram bins/channels are evenly spaced
over 0° to 180° (unsigned gradients) or 0° to 360° (signed
gradients). Combining the histogram of all
the cells produces the descriptor. To improve accuracy,
the local histograms can be contrast-normalized [16]. More
information about the HoG descriptor can be found in [16].
For our study the HoG feature suits the problem well
because it operates on the localized cells and it is capable
of describing the shape and appearance of the object, which
is the word in the present context. Figure 5 shows the HoG
images corresponding to the sample video word images shown
in Figure 2.
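A simplified sketch of this descriptor, using the cell and bin counts reported later in the experimental settings (a 5×5 grid with 16 bins, i.e. a 400-dimensional vector); `np.gradient` and the omission of block normalization are simplifications, not the paper's exact pipeline:

```python
import numpy as np

def hog_descriptor(gray, n_cells=5, n_bins=16):
    """Simplified HoG: an n_cells x n_cells grid of cells with an n_bins
    orientation histogram each (5 * 5 * 16 = 400 dimensions). Block-level
    contrast normalization is omitted for brevity."""
    gy, gx = np.gradient(gray.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned gradients
    h, w = gray.shape
    feats = []
    for i in range(n_cells):
        for j in range(n_cells):
            ys = slice(i * h // n_cells, (i + 1) * h // n_cells)
            xs = slice(j * w // n_cells, (j + 1) * w // n_cells)
            # Magnitude-weighted orientation histogram for this cell.
            hist, _ = np.histogram(orientation[ys, xs], bins=n_bins,
                                   range=(0.0, 180.0),
                                   weights=magnitude[ys, xs])
            feats.append(hist)
    return np.concatenate(feats)

word = np.random.default_rng(1).random((40, 100))
print(hog_descriptor(word).shape)  # (400,)
```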
C. Gradient Local Auto-Correlation (GLAC)
Gradient Local Auto-Correlation (GLAC) was proposed
by Kobayashi and Otsu [17]. It utilizes the spatial and the
orientational auto-correlations of local gradients for feature
extraction. The features not only capture the information about
the gradients but also the curvature of the image surface, and
are described in terms of both magnitude and orientation.

Fig. 5: HoG Images of the corresponding video word images
shown in Figure 2 for the three scripts.
GLAC can be viewed as an extension of 1st order statistics
(i.e. histograms) to the 2nd order statistics (that is the auto-
correlations).
Detailed information about GLAC features can be found in
[17]. The ability of GLAC features to describe the curvature
of the image surface, in addition to the gradient information,
inspired us to consider it in our study.
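The idea can be sketched as follows. This is a heavily simplified illustration of gradient auto-correlation, not the exact formulation of [17]: gradients are hard-assigned to orientation bins (the original uses soft voting), and only a few spatial offsets are correlated:

```python
import numpy as np

def glac_features(gray, n_bins=9, offsets=((0, 1), (1, 0), (1, 1), (1, -1))):
    """Simplified GLAC sketch: 0th-order statistics are a magnitude-weighted
    orientation histogram; 1st-order statistics are spatial auto-correlations
    of orientation-binned gradients over a set of small pixel offsets."""
    g = gray.astype(float)
    # Roberts-style diagonal differences for the gradient (an approximation
    # of the exact filter masks used in the paper's settings).
    g1 = g[:-1, :-1] - g[1:, 1:]
    g2 = g[:-1, 1:] - g[1:, :-1]
    mag = np.hypot(g1, g2)
    ang = np.arctan2(g2, g1) % (2 * np.pi)
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    h, w = bins.shape
    # Hard-assign each pixel's gradient magnitude to one orientation bin.
    onehot = np.zeros((h, w, n_bins))
    onehot[np.arange(h)[:, None], np.arange(w)[None, :], bins] = mag
    f0 = onehot.sum(axis=(0, 1))          # 0th order: n_bins values
    f1 = []
    for dy, dx in offsets:                # 1st order: bin co-occurrences
        a = onehot[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
        b = onehot[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
        f1.append(np.einsum('ijk,ijl->kl', a, b).ravel())
    return np.concatenate([f0] + f1)

word = np.random.default_rng(2).random((30, 80))
print(glac_features(word).shape)  # (333,) = 9 + 4 * 9 * 9
```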
IV. CLASSIFIERS
We considered Support Vector Machines (SVMs) and Ar-
tificial Neural Networks (ANNs) for classification of the three
scripts. A brief description of the classifiers is given below.
A. Support Vector Machine (SVM)
Given a training database of M data: {x_m | m = 1, ..., M},
the linear SVM classifier is then defined as:

    f(x) = \sum_j \alpha_j y_j \, (x_j \cdot x) + b        (2)
where x_j are the support vectors, y_j are the corresponding
class labels in {+1, -1}, and the parameters \alpha_j and b are
determined by solving a quadratic programming problem [18]. The linear
SVM can be extended to various non-linear variants; details
can be found in [18], [19]. In our experiments Gaussian kernel
SVM outperformed linear and other non-linear SVM kernels.
The Gaussian kernel is of the form:

    k(x, y) = \exp\!\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)        (3)
We noticed from the initial experiments that the Gaussian
kernel gave the highest accuracy when the value of its gamma
parameter (1/(2σ^2)) was varied between 1.0 and 5.0 for the
three different features and the penalty multiplier parameter
was set to 1. LibSVM [20] was used to conduct the SVM
classification experiments.
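Eq. (3), with gamma = 1/(2σ²) as searched above, can be sketched directly (in LibSVM this corresponds to the `-g` option):

```python
import math
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2),
    where gamma = 1 / (2 * sigma^2) as in Eq. (3)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(d, d)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 4.0])
print(gaussian_kernel(x, x))             # 1.0: identical points
print(gaussian_kernel(x, y, gamma=1.0))  # exp(-1), about 0.3679
```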
B. Artificial Neural Networks (ANNs)
In this study, feed-forward Multi-Layered Perceptrons
(MLPs) trained with the resilient backpropagation (BP) algo-
rithm were used. For experimental purposes, the architectures
were modified varying the number of inputs and the hidden
units. The number of output units was fixed at three, as
three scripts were considered in the present study. The number
of input units varied because the three features have
different dimensionalities.
The number of hidden units investigated during ANN
training was varied experimentally from 8 up to 30.
The number of iterations set for training was increased from
1000 up to 3000. All the ANNs were trained with a learning
rate of 0.1 and a momentum rate of 0.1.
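The architecture search above can be sketched as follows. Note this is an illustrative stand-in only: scikit-learn has no resilient backpropagation (Rprop) solver, so plain SGD with the stated learning rate and momentum is used, and the data are random placeholders:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 256))        # e.g. 256-dimensional LBP vectors (random here)
y = np.repeat([0, 1, 2], 20)     # three script classes

for hidden in (8, 16, 30):       # range of hidden units explored
    clf = MLPClassifier(hidden_layer_sizes=(hidden,), solver='sgd',
                        learning_rate_init=0.1, momentum=0.1,
                        max_iter=300, random_state=0)
    clf.fit(X, y)
print(clf.n_outputs_)  # 3 output units, one per script
```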
V. EXPERIMENTAL RESULTS
This section presents the experimental results obtained
using the various combinations of the three features, as well
as the SVM and ANN classifiers. In order to study the per-
formance of the features and classifiers, a video word dataset
was created, as there is no standard dataset available. Text
portions of the video frames were extracted using our video text
detection algorithm [11] and the words were later segmented
using our word segmentation technique [10]. A dataset of
1271 words was created after extracting the words from the
video text lines. The dataset comprised 430 Hindi, 410
English and 431 Bengali words. The results obtained using the
combinations of features and classifiers are reported in Tables
I, II, III, IV, V, VI and VII. For all the experiments we used
a five fold cross validation technique to compute the script
identification accuracy. The reason for using cross validation
is that it provides unbiased results over the complete dataset.
The various experiments conducted in the study included
the performance evaluation of:
LBP features with SVM and ANN classifiers,
HoG features with SVM and ANN classifiers,
GLAC features with SVM and ANN classifiers.
Additionally, we also conducted experiments on Long-words
(words having four or more characters) and Short words (words
having three or fewer characters). We evaluated the performance
of HoG and GLAC features with SVM and ANN classifiers
on both Long and Short words. These experiments revealed the
discriminative capacity and robustness of the features when
applied on the same dataset and their impact on the accuracy.
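The evaluation protocol above (a feature extractor plus classifier scored with five-fold cross-validation) can be sketched as follows; the features and labels below are random placeholders standing in for the 1271-word dataset:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((120, 400))                  # 400-dim HoG/GLAC-style vectors
y = rng.integers(0, 3, size=120)            # three script classes

clf = SVC(kernel='rbf', gamma=1.0, C=1.0)   # Gaussian kernel, penalty C = 1
scores = cross_val_score(clf, X, y, cv=5)   # five-fold cross-validation
print(len(scores))  # 5 per-fold accuracies
```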
A. Experimental settings
The parameter settings considered for each of the feature
extraction techniques used in the present study are given below.
1) LBP feature: as mentioned earlier, the basic LBP was
considered in our study; with 8 neighbours, the
dimensionality of the LBP feature vector was 256.
2) HoG feature: The block size considered was 5. That
is, an image was divided into 5×5 blocks. The gradient
orientation was quantized into 16 directions/bins.
Thus, the dimensionality of a HoG feature vector
was 5 × 5 × 16 = 400.
3) GLAC feature: the Roberts filter was used for gradi-
ent computation and the number of orientation bins
was set to 9. The other baseline parameter settings as
given in [17] were considered.

References
C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines."
V. N. Vapnik, "The Nature of Statistical Learning Theory."
N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection."
C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition."
T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns."
Frequently Asked Questions (19)
Q1. What contributions have the authors mentioned in the paper "A study on word-level multi-script identification from video frames"?

This paper presents a study of various combinations of features and classifiers to explore whether traditional script identification techniques can be applied to video frames. A texture-based feature, namely Local Binary Pattern (LBP), and gradient-based features, namely Histogram of Oriented Gradient (HoG) and Gradient Local Auto-Correlation (GLAC), were used in the study. Three popular scripts, namely English, Bengali and Hindi, were considered in the present study. The study also reveals that gradient features are more suitable for script identification than the texture features when using traditional script identification techniques on video frames.

Future research plans include to study classifier fusion and feature-fusion based techniques on more Indian scripts in order to create a more robust system capable of handling multiple scripts, accurately. 

The basic idea behind the HoG descriptor is that the shape and appearence of the object within an image can be described by the intensity gradient distribution or the edge directions. 

The reason behind the better performance using GLAC features is that they use gradients and the curvature of the image surface for feature description.

The words were segmented from the text lines using their word segmentation technique [10] and were used as input for their experiments.

In total, the low resolution and blurred image dataset was formed using 235 word images, comprising 71 English, 92 Bengali and 73 Hindi words.

The two texture-based features, namely Zernike moments and the Gabor filter, used in their previous study [4] also did not perform well compared to the gradient feature.

The better resolution and sharp image dataset comprised 1035 words, having 360 English, 318 Bengali and 357 Hindi words.

For Hindi script, 6.85% error occurred when low resolution and blurred images were considered, whereas the error reduced to 2.52% for high resolution images.

Considering the low resolution and blurred images in English script, a 5.71% error rate was observed, whereas 2.5% error occurred with high resolution images.

Histogram of Oriented Gradients (HoG) [16] is a robust feature descriptor commonly used in computer vision and image processing for object detection. 

The Bengali word images in Figure 6 (c, d) were misclassified as Hindi because of the very low resolution, blur and the fewer number of characters. 

This confirms the observation from their previous study [4] that, due to the presence of more characters in long words, more script-specific information is available, which results in increased accuracy.

As the neighbourhood of the centre pixel has 8 pixels, 2^8 = 256 different labels can be obtained and used as a texture descriptor.

The error obtained on the complete dataset, which is a mixture of both low and high resolution images, was also computed to understand how much the low resolution and blurred images contributed to the overall error.

The parameter settings considered for each of the feature extraction techniques used in the present study are given below. 1) LBP feature: as mentioned earlier, the basic LBP was considered in their study; with 8 neighbours, the dimensionality of the LBP feature vector was 256.

The Hindi word images shown in Figure 6 (e, f) were misclassified as Bengali; low resolution and blur were the main reasons.

The confusion matrix also reveals that the highest confusion, of about 9.51% and 11.37%, was between Bengali and Hindi using the SVM and ANN classifiers, respectively.