
A study on word-level multi-script identification from video
frames
Author
Sharma, Nabin, Pal, Umapada, Blumenstein, Michael
Published
2014
Conference Title
2014 International Joint Conference on Neural Networks (IJCNN)
DOI
https://doi.org/10.1109/IJCNN.2014.6889906
Copyright Statement
© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing this
material for advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in other
works.
Downloaded from
http://hdl.handle.net/10072/66731
Link to published version
http://www.ieee-wcci2014.org/
Griffith Research Online
https://research-repository.griffith.edu.au

A Study on Word-Level Multi-script Identification
from Video Frames
Nabin Sharma
, Umapada Pal
, Michael Blumenstein
School of Information and Communication Technology, Griffith University, Australia
Email: {nabin.sharma, m.blumenstein}@griffith.edu.au
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India
Email: umapada@isical.ac.in
Abstract—The presence of multiple scripts in multi-lingual
document images makes Optical Character Recognition (OCR)
of such documents a challenging task. Due to the unavailability
of a single OCR system which can handle multiple scripts,
script identification becomes an essential step for choosing
the appropriate OCR. Although various techniques are
available for script identification from handwritten and
printed documents with simple backgrounds, script
identification from video frames has seldom been explored.
Video frames are coloured and suffer from low resolution,
blur, complex backgrounds and noise, among other issues, which
makes the script identification process a challenging task. This
paper presents a study of various combinations of features and
classifiers to explore whether the traditional script identification
techniques can be applied to video frames. A texture-based
feature, namely Local Binary Pattern (LBP), and gradient-based
features, namely Histogram of Oriented Gradient (HoG) and
Gradient Local Auto-Correlation (GLAC), were used in the
study. Combinations of these features with SVMs and ANNs were
used for classification. Three popular scripts, namely English,
Bengali and Hindi were considered in the present study. Due
to the inherent problems with the video, a super resolution
technique was applied as a pre-processing step. Experiments
show that the GLAC feature has performed better than the
other features, and an accuracy of 94.25% was achieved when
testing on 1271 words from three different scripts. The study
also reveals that gradient features are more suitable for script
identification than the texture features when using traditional
script identification techniques on video frames.
Keywords: Video document analysis, Script identification, Word
segmentation, OCR.
I. INTRODUCTION
India is a multi-lingual and multi-script country where
the use of multiple scripts is quite common for informa-
tion communication through news and advertisement videos
transmitted across various television channels. The massive
information explosion across multiple communication channels
creates a very large database of videos, which makes indexing
an essential task for effective management of the database.
Thus, text present in the video plays an important role in
automatic video indexing and retrieval. Hence, OCR of the
multi-lingual video text is essential. Due to the unavailability
of a universal OCR to recognize the multi-lingual text, script
identification followed by the use of appropriate OCR is a
legitimate approach to recognizing the text.
The research on script identification to date primarily
focuses on processing scanned documents with simple back-
grounds and good resolution required for OCR. Whereas the
difficulties involved in script identification from video frames
include low resolution, blur, complex backgrounds, multiple
font types and sizes, and orientation of the text [2], [3]. Samples
of video frames having text written in multiple scripts are
shown in Figure 1. Figure 1(a) is an example of a video
frame having text written in English and Hindi with different
orientations, fonts, and sizes. Figures 1(b) and 1(c) are examples of
video frames having text in low resolution and blur. Figure
1(b) has text written in Hindi and English in a single text line.
Figure 1(c) is an example of a video frame having text written
in Bengali (Bangla) and English in a single text line. Figure
1(d) is an example of a video frame having both graphics
and scene text written in Hindi and English, respectively. The
English text line is slightly blurred compared to the Hindi text,
which is much clearer. Figure 1 itself illustrates the necessity
of script identification and the challenges involved when video
frames are considered. An important characteristic of multilin-
gual videos in India is that the text is generally written in two
scripts, where the first script is English (Roman) and the other
one is a regional language.
Fig. 1: Samples of video frames having text in multiple scripts
Script identification from video frames has not been ex-
plored much as compared to traditional scanned documents.

Recently, a few papers [4], [5], [6] have been published, which
focus on the video script identification problem. Sharma et al.
[4] presented a study on word-wise script identification from
video frames using three different features namely, Zernike
moment, Gabor and 400-dimensional gradient. They used
SVMs for classification. The study established that traditional
script identification techniques can be applied to video frames
provided appropriate pre-processing techniques are applied to
the video frames to overcome the problems with video. Zhao
et al. [5] on the other hand proposed Spatial-Gradient-Features
at the block level to identify six different scripts. The method
considers text lines extracted from the video frames for the
experiments, assuming that a video frame contains text written
in a single script. Six different scripts were considered in
the work and an average classification rate of 82.1% was
reported on a dataset of 770 frames. Phan et al. [6] also
proposed a line-wise script identification technique based on
the smoothness and cursiveness of the lines. A video text line
was horizontally divided into five equal zones to study the
smoothness and cursiveness of the upper and lower lines for
script identification. English, Chinese and Tamil script pairs
were considered in their experiments. Li and Tan [14] proposed
a statistical script identification approach from camera-based
images.
There are many methods [1], [9], [8], [12], [7] avail-
able for script identification from scanned documents having
simple backgrounds. A review of various script identification
techniques used for script identification at the page, line
and word levels, was presented by Ghosh et al. [1]. The
various techniques can be classified into two broad categories,
namely: structure-based and visual appearance-based methods.
The review [1] shows that the methods used for traditional
scanned documents can be used for camera-based documents
even though the former have much better resolution than
the video frames, and the latter suffer from issues such as
low resolution, and complex backgrounds, to mention a few.
A two-stage approach based word-wise script identification
technique was proposed by Chanda et al. [9]. In the first
stage, a high-speed method for identifying scripts in noisy
environments was used. The second stage processed the samples
for which low recognition confidence was achieved. Finally, they
used a majority voting-based method to identify the script.
Two different features namely, 64-dimensional chain code
histogram and 400-dimensional gradient features were used in
the first and second stages, respectively. English, Devanagari
and Bengali scripts were considered for the experiments. The
study presented by Pati and Ramakrishnan [8] revealed
that the use of Gabor features with nearest neighbor or SVM
classifiers gave a better performance for word-level multi-
script identification. A combination of discrete cosine trans-
form (DCT) features with SVMs, nearest neighbor and linear
discriminant classifiers were also evaluated in their study. The
authors [8] used a dataset comprising images with simple
backgrounds for their experiments.
Although there are works on line-wise video script identi-
fication, to the best of our knowledge there is only one work
[4] reported in the literature on word-wise script identification
from video. In this paper, a study of word-wise script identifi-
cation techniques from video is presented considering Indian
languages. The three most popular scripts in India namely,
English, Bengali and Hindi (Devanagari) were considered for
experimentation. Considering words for script identification
rather than a complete text line allows the identification of
the words written in different scripts, which in turn helps
in better OCR of the complete text line written in multiple
scripts. This is an important advantage of considering words
to identify scripts. Our previous study [4] revealed that the use
of appropriate pre-processing techniques on the video frame
is essential in order to use traditional techniques for video
script identification. The present study attempts to investigate
the type of features more suitable for video script identifica-
tion considering the inherent problems with video, which is
the main contribution. Hence, a comparison of texture-based
features with gradient-based features is performed. A very
popular texture-based feature namely, Local Binary Pattern
(LBP) and two gradient-based features namely, Histogram of
Oriented Gradient (HoG) and Gradient Local Auto-Correlation
(GLAC) were used in the present study. Support Vector
Machines (SVMs) and Artificial Neural Networks (ANNs)
were used for classification. As mentioned in our previous
study [4] pre-processing does help in improving the video
script identification accuracy, but the choice of features is also
equally important. Hence, the features used in the present study
were carefully selected based on their ability to provide better
randomness and description of the structural differences of the
scripts, which will increase the overall accuracy.
The rest of the paper is organized as follows. The pre-
processing technique used is discussed in Section II. Section III
presents a brief description of the feature extraction techniques
used in the present study. In Section IV, the details of SVM
and ANN classifiers are discussed. Experimental results and a
discussion are presented in Section V. Section V also provides
an analysis of the erroneous results. Section VI concludes the
paper providing future directions for video script identification.
Fig. 2: Sample video word images of English (1st Row),
Bengali (2nd Row) and Hindi (3rd Row) scripts.
II. PRE-PROCESSING
The text lines from the video frames were detected using
[11]. The words were segmented from the text lines using
our word segmentation technique [10] and were used as input
for our experiments. A few samples of segmented word images
from video frames for the three scripts are shown in Figure 2.
The images shown in Figure 2 reveal that the text extracted
from the video frames suffers from low resolution, blur, and
complex backgrounds, to mention a few issues.

Our study in [4] showed that super resolution techniques
resulted in better accuracy. Hence we used the super reso-
lution technique for pre-processing the words to get better
resolution images for further processing. A single level of
super-resolution images was used for our experiments. The
resolution of each word image was increased by a factor of 1.5 using
a cubic interpolation method [13]. Cubic interpolation was
chosen because it creates better images preserving the shape
of the original word images.
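As a rough sketch of this pre-processing step (the paper does not name an implementation; `scipy.ndimage.zoom` and the factor of 1.5 shown here are assumptions based on the description above):

```python
import numpy as np
from scipy.ndimage import zoom

def upscale_word_image(gray, factor=1.5):
    """Upscale a grayscale word image using cubic spline interpolation
    (order=3), which preserves stroke shape better than nearest-neighbour
    or bilinear resampling."""
    return zoom(gray.astype(np.float64), factor, order=3)

word = np.arange(20 * 60, dtype=np.float64).reshape(20, 60)
print(upscale_word_image(word).shape)  # (30, 90)
```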
III. FEATURE EXTRACTION TECHNIQUES
Three feature extraction techniques were considered for
the present study. One texture-based feature namely, Local
Binary Pattern (LBP) and two gradient-based features namely,
Histogram of Oriented Gradients (HoG) and Gradient Local
Auto-Correlation (GLAC) were used. A brief description of
each feature extraction technique is given below.
A. Local Binary Pattern (LBP)
Local Binary Pattern (LBP) [15] is an efficient texture
operator which labels each pixel of an image by thresholding
its neighbours. The idea behind the LBP operator is to
describe image texture using two measures, namely local
spatial patterns and gray-scale contrast.
We considered the original version of the LBP operator [15]
which forms labels of image pixels by thresholding the 3 × 3
neighbourhood of each pixel with the centre pixel value and the
result is considered as a binary number. As the neighbourhood
of the centre pixel has 8 pixels, 2^8 = 256 different labels can
be obtained and used as a texture descriptor. A histogram is
then computed over the cells, and forms the feature vector.
The basic LBP_{P,R} operator is defined as follows:

    LBP_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} S(g_p - g_c) \, 2^p        (1)
where

    S(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ 0, & \text{otherwise} \end{cases}

S(x) is a thresholding function, (x_c, y_c) is the centre pixel
in the 8-pixel neighbourhood, g_c is the gray level of the centre
pixel, and g_p denotes the gray value of a sampling point in an
equally spaced circular neighbourhood of P sampling points
and radius R around the point (x_c, y_c). An illustration of LBP
computation is shown in Figure 3. Figure 4 shows the LBP
images corresponding to the sample video word images shown
in Figure 2.
LBP was chosen for the present study because of its
ability to describe local spatial patterns, which is required
to discriminate between structurally similar scripts.
Fig. 3: An example of LBP computation — (a) a sample 3×3 pixel
neighbourhood with centre value 50 and neighbour values 32, 43,
74, 83, 67, 77, 50, 34; (b) the differences from the centre:
-18, -7, 24, 33, 17, 27, 0, -16; (c) the thresholding result:
0, 0, 1, 1, 1, 1, 1, 0.
Fig. 4: LBP Images of the corresponding video word images
shown in Figure 2 for the three scripts.
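As a concrete illustration, the basic 3×3 LBP code for the Figure 3 neighbourhood can be computed with a short NumPy sketch. The bit ordering is a convention; the sketch below assigns bit p to the p-th neighbour in the listed order:

```python
import numpy as np

def lbp_code(centre, neighbours):
    """Basic LBP: threshold each neighbour against the centre pixel and
    pack the resulting bits into an integer label in [0, 255]."""
    bits = (np.asarray(neighbours) >= centre).astype(int)
    return int((bits * (2 ** np.arange(len(bits)))).sum())

def lbp_histogram(codes, n_labels=256):
    """256-bin histogram of LBP labels, used as the texture descriptor."""
    hist, _ = np.histogram(codes, bins=n_labels, range=(0, n_labels))
    return hist

# The worked example from Fig. 3: centre pixel 50, eight neighbours.
code = lbp_code(50, [32, 43, 74, 83, 67, 77, 50, 34])
print(code)  # 124 (binary 01111100 in this bit ordering)
```

Applying `lbp_code` at every pixel of a word image and histogramming the labels yields the 256-dimensional descriptor used in the experiments.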
B. Histogram of Oriented Gradients (HoG)
Histogram of Oriented Gradients (HoG) [16] is a robust
feature descriptor commonly used in computer vision and
image processing for object detection. Dalal and Triggs [16]
first described the HoG descriptors and primarily focused on
pedestrian detection in static images. The basic idea behind the
HoG descriptor is that the shape and appearance of an object
within an image can be described by the intensity gradient
distribution or the edge directions.
The HoG descriptors are typically computed by dividing
an image into small spatial regions called ’cells’. A histogram
of the gradient direction of the pixels within the cells is
computed. The histogram bins/channels are evenly spaced
over 0° to 180° (unsigned gradients) or 0° to 360° (signed
gradients). Combining the histogram of all
the cells produces the descriptor. To improve accuracy,
the local histograms can be contrast-normalized [16]. More
information about the HoG descriptor can be found in [16].
For our study the HoG feature suits the problem well
because it operates on the localized cells and it is capable
of describing the shape and appearance of the object, which
is the word in the present context. Figure 5 shows the HoG
images corresponding to the sample video word images shown
in Figure 2.
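A simplified sketch of this descriptor, using the cell and bin counts reported later in the experimental settings (a 5×5 grid with 16 bins, i.e. a 400-dimensional vector); `np.gradient` and the omission of block normalization are simplifications, not the paper's exact pipeline:

```python
import numpy as np

def hog_descriptor(gray, n_cells=5, n_bins=16):
    """Simplified HoG: an n_cells x n_cells grid of cells with an n_bins
    orientation histogram each (5 * 5 * 16 = 400 dimensions). Block-level
    contrast normalization is omitted for brevity."""
    gy, gx = np.gradient(gray.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned gradients
    h, w = gray.shape
    feats = []
    for i in range(n_cells):
        for j in range(n_cells):
            ys = slice(i * h // n_cells, (i + 1) * h // n_cells)
            xs = slice(j * w // n_cells, (j + 1) * w // n_cells)
            # Magnitude-weighted orientation histogram for this cell.
            hist, _ = np.histogram(orientation[ys, xs], bins=n_bins,
                                   range=(0.0, 180.0),
                                   weights=magnitude[ys, xs])
            feats.append(hist)
    return np.concatenate(feats)

word = np.random.default_rng(1).random((40, 100))
print(hog_descriptor(word).shape)  # (400,)
```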
C. Gradient Local Auto-Correlation (GLAC)
Gradient Local Auto-Correlation (GLAC) was proposed
by Kobayashi and Otsu [17]. It utilizes the spatial and the
orientational auto-correlations of local gradients for feature
extraction. The features not only capture the information about
the gradients but also the curvature of the image surface, and
are described in terms of both magnitude and orientation.

Fig. 5: HoG Images of the corresponding video word images
shown in Figure 2 for the three scripts.
GLAC can be viewed as an extension of 1st order statistics
(i.e. histograms) to the 2nd order statistics (that is the auto-
correlations).
Detailed information about GLAC features can be found in
[17]. The ability of GLAC features to describe the curvature
of the image surface, in addition to the gradient information,
inspired us to consider it in our study.
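The idea can be sketched as follows. This is a heavily simplified illustration of gradient auto-correlation, not the exact formulation of [17]: gradients are hard-assigned to orientation bins (the original uses soft voting), and only a few spatial offsets are correlated:

```python
import numpy as np

def glac_features(gray, n_bins=9, offsets=((0, 1), (1, 0), (1, 1), (1, -1))):
    """Simplified GLAC sketch: 0th-order statistics are a magnitude-weighted
    orientation histogram; 1st-order statistics are spatial auto-correlations
    of orientation-binned gradients over a set of small pixel offsets."""
    g = gray.astype(float)
    # Roberts-style diagonal differences for the gradient (an approximation
    # of the exact filter masks used in the paper's settings).
    g1 = g[:-1, :-1] - g[1:, 1:]
    g2 = g[:-1, 1:] - g[1:, :-1]
    mag = np.hypot(g1, g2)
    ang = np.arctan2(g2, g1) % (2 * np.pi)
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    h, w = bins.shape
    # Hard-assign each pixel's gradient magnitude to one orientation bin.
    onehot = np.zeros((h, w, n_bins))
    onehot[np.arange(h)[:, None], np.arange(w)[None, :], bins] = mag
    f0 = onehot.sum(axis=(0, 1))          # 0th order: n_bins values
    f1 = []
    for dy, dx in offsets:                # 1st order: bin co-occurrences
        a = onehot[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
        b = onehot[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
        f1.append(np.einsum('ijk,ijl->kl', a, b).ravel())
    return np.concatenate([f0] + f1)

word = np.random.default_rng(2).random((30, 80))
print(glac_features(word).shape)  # (333,) = 9 + 4 * 9 * 9
```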
IV. CLASSIFIERS
We considered Support Vector Machines (SVMs) and Ar-
tificial Neural Networks (ANNs) for classification of the three
scripts. A brief description of the classifiers is given below.
A. Support Vector Machine (SVM)
Given a training database of M data: {x_m | m = 1, ..., M},
the linear SVM classifier is then defined as:

    f(x) = \sum_j \alpha_j y_j \, (x_j \cdot x) + b        (2)
where x_j are the support vectors, y_j are the corresponding
class labels in {+1, -1}, and the parameters \alpha_j and b are
determined by solving a quadratic programming problem [18]. The linear
SVM can be extended to various non-linear variants; details
can be found in [18], [19]. In our experiments Gaussian kernel
SVM outperformed linear and other non-linear SVM kernels.
The Gaussian kernel is of the form:

    k(x, y) = \exp\!\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)        (3)
We noticed from the initial experiments that the Gaussian
kernel gave the highest accuracy when the value of its gamma
parameter (1/(2σ^2)) was varied between 1.0 and 5.0 for the
three different features and the penalty multiplier parameter
was set to 1. LibSVM [20] was used to conduct the SVM
classification experiments.
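Eq. (3), with gamma = 1/(2σ²) as searched above, can be sketched directly (in LibSVM this corresponds to the `-g` option):

```python
import math
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2),
    where gamma = 1 / (2 * sigma^2) as in Eq. (3)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(d, d)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 4.0])
print(gaussian_kernel(x, x))             # 1.0: identical points
print(gaussian_kernel(x, y, gamma=1.0))  # exp(-1), about 0.3679
```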
B. Artificial Neural Networks (ANNs)
In this study, feed-forward Multi-Layered Perceptrons
(MLPs) trained with the resilient backpropagation (BP) algo-
rithm were used. For experimental purposes, the architectures
were modified varying the number of inputs and the hidden
units. The number of output units was fixed at three, as
three scripts were considered in the present study. The number
of input units varied because the three features have
different dimensionalities.
The number of hidden units investigated during ANN
training was varied experimentally from 8 up to 30.
The number of iterations set for training was increased from
1000 up to 3000. All the ANNs were trained with a learning
rate of 0.1 and a momentum rate of 0.1.
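The architecture search above can be sketched as follows. Note this is an illustrative stand-in only: scikit-learn has no resilient backpropagation (Rprop) solver, so plain SGD with the stated learning rate and momentum is used, and the data are random placeholders:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 256))        # e.g. 256-dimensional LBP vectors (random here)
y = np.repeat([0, 1, 2], 20)     # three script classes

for hidden in (8, 16, 30):       # range of hidden units explored
    clf = MLPClassifier(hidden_layer_sizes=(hidden,), solver='sgd',
                        learning_rate_init=0.1, momentum=0.1,
                        max_iter=300, random_state=0)
    clf.fit(X, y)
print(clf.n_outputs_)  # 3 output units, one per script
```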
V. EXPERIMENTAL RESULTS
This section presents the experimental results obtained
using the various combinations of the three features, as well
as the SVM and ANN classifiers. In order to study the per-
formance of the features and classifiers, a video word dataset
was created, as there is no standard dataset available. Text
portions of the video frames were extracted using our video text
detection algorithm [11] and the words were later segmented
using our word segmentation technique [10]. A dataset of
1271 words was created after extracting the words from the
video text lines. The dataset comprised 430 Hindi, 410
English and 431 Bengali words. The results obtained using the
combinations of features and classifiers are reported in Tables
I, II, III, IV, V, VI and VII. For all the experiments we used
a five fold cross validation technique to compute the script
identification accuracy. The reason for using cross validation
is that it provides unbiased results over the complete dataset.
The various experiments conducted in the study included
the performance evaluation of:
LBP features with SVM and ANN classifiers,
HoG features with SVM and ANN classifiers,
GLAC features with SVM and ANN classifiers.
Additionally, we also conducted experiments on Long-words
(words having four or more characters) and Short words (words
having three or fewer characters). We evaluated the performance
of HoG and GLAC features with SVM and ANN classifiers
on both Long and Short words. These experiments revealed the
discriminative capacity and robustness of the features when
applied on the same dataset and their impact on the accuracy.
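The evaluation protocol above (a feature extractor plus classifier scored with five-fold cross-validation) can be sketched as follows; the features and labels below are random placeholders standing in for the 1271-word dataset:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((120, 400))                  # 400-dim HoG/GLAC-style vectors
y = rng.integers(0, 3, size=120)            # three script classes

clf = SVC(kernel='rbf', gamma=1.0, C=1.0)   # Gaussian kernel, penalty C = 1
scores = cross_val_score(clf, X, y, cv=5)   # five-fold cross-validation
print(len(scores))  # 5 per-fold accuracies
```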
A. Experimental settings
The parameter settings considered for each of the feature
extraction techniques used in the present study are given below.
1) LBP feature: as mentioned earlier, the basic LBP was
considered in our study; with 8 neighbours, the
dimensionality of the LBP feature vector was 256.
2) HoG feature: The block size considered was 5. That
is, an image was divided into 5×5 blocks. The gradient
orientation was quantized into 16 directions/bins.
Thus, the dimensionality of a HoG feature vector
was 5 × 5 × 16 = 400.
3) GLAC feature: the Roberts filter was used for gradi-
ent computation and the number of orientation bins
was set to 9. The other baseline parameter settings as
given in [17] were considered.

References
C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines."
V. N. Vapnik, "The Nature of Statistical Learning Theory."
N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection."
C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition."
T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns."
Frequently Asked Questions (19)
Q1. What contributions have the authors mentioned in the paper "A study on word-level multi-script identification from video frames"?

This paper presents a study of various combinations of features and classifiers to explore whether traditional script identification techniques can be applied to video frames. A texture-based feature, namely Local Binary Pattern (LBP), and gradient-based features, namely Histogram of Oriented Gradient (HoG) and Gradient Local Auto-Correlation (GLAC), were used in the study. Three popular scripts, namely English, Bengali and Hindi, were considered in the present study. The study also reveals that gradient features are more suitable for script identification than the texture features when using traditional script identification techniques on video frames.

Future research plans include to study classifier fusion and feature-fusion based techniques on more Indian scripts in order to create a more robust system capable of handling multiple scripts, accurately. 

The basic idea behind the HoG descriptor is that the shape and appearence of the object within an image can be described by the intensity gradient distribution or the edge directions. 

The reason behind the better performance using GLAC features is that they use gradients and the curvature of the image surface for feature description.

The words were segmented from the text lines using their word segmentation technique [10] and were used as input for their experiments.

In total, the low resolution and blurred image dataset was formed using 235 word images, comprising 71 English, 92 Bengali and 73 Hindi words.

The two texture-based features, namely Zernike moments and the Gabor filter, used in their previous study [4] also did not perform well compared to the gradient feature.

The better resolution and sharp image dataset comprised 1035 words, having 360 English, 318 Bengali and 357 Hindi words.

For Hindi script, 6.85% error occurred when low resolution and blurred images were considered, whereas the error reduced to 2.52% for high resolution images.

Considering the low resolution and blurred images in English script, a 5.71% error rate was observed, whereas 2.5% error occurred with high resolution images.

Histogram of Oriented Gradients (HoG) [16] is a robust feature descriptor commonly used in computer vision and image processing for object detection. 

The Bengali word images in Figure 6 (c, d) were misclassified as Hindi because of the very low resolution, blur and the fewer number of characters. 

This confirms the observation from their previous study [4] that, due to the presence of more characters in long words, more script-specific information is available, which results in increased accuracy.

As the neighbourhood of the centre pixel has 8 pixels, 2^8 = 256 different labels can be obtained and used as a texture descriptor.

The error obtained on the complete dataset, which is a mixture of both low and high resolution images, was also computed to understand how much the low resolution and blurred images contributed to the overall error.

The parameter settings considered for each of the feature extraction techniques used in the present study are given below. 1) LBP feature: as mentioned earlier, the basic LBP was considered in their study; with 8 neighbours, the dimensionality of the LBP feature vector was 256.

The Hindi word images shown in Figure 6 (e, f) were misclassified as Bengali; low resolution and blur were the main reasons.

The confusion matrix also reveals that the highest confusion, of about 9.51% and 11.37%, was between Bengali and Hindi using the SVM and ANN classifiers, respectively.