Script Recognition—A Review

doi:10.1109/TPAMI.2010.30

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009


D. Ghosh, T. Dube, and A.P. Shivaprasad

Abstract—A variety of different scripts are used in writing languages throughout the world. In a multi-script, multilingual environment, it

is essential to know the script used in writing a document before an appropriate character recognition and document analysis algorithm

can be chosen. In view of this, several methods for automatic script identiﬁcation have been developed so far. They mainly belong to

two broad categories – structure-based and visual appearance-based techniques. This survey report gives an overview of the different

script identiﬁcation methodologies under each of these categories. Methods for script identiﬁcation in online data and video-texts are

also presented. It is noted that research in this field is relatively thin and that more work is needed, particularly for handwritten documents.

Index Terms—Document analysis, Optical character recognition, Script identiﬁcation, Multi-script document.


1 INTRODUCTION

One interesting and challenging field of research in pattern recognition is Optical Character Recognition (OCR). Optical character recognition is the process

in which a paper document is optically scanned and then

converted into computer processable electronic format

by recognizing and associating symbolic identity with

every individual character in the document.

With the increasing demand for creating a paperless

world, many OCR algorithms have been developed over

the years [1], [2], [3], [4], [5], [6]. However, most OCR

systems are script-speciﬁc in the sense that they can read

characters written in one particular script only. Script is

deﬁned as the graphic form of the writing system used

to write statements expressible in language. That means,

a script class refers to a particular style of writing and

the set of characters used in it. Languages throughout

this world are typeset in many different scripts. A script

may be used by only one language or may be shared by

many languages, sometimes with slight variations from one language to another. For example, Devnagari is used for writing a number of Indian languages like Sanskrit, Hindi, Konkani, Marathi, etc.; English, French, German and some other European languages use different variants of the Latin alphabet; and so on. Some languages even use different scripts at different points in time and space. One good example is Malay, which nowadays uses the Latin alphabet in place of the previously used Jawi. Another example is Sanskrit, which is mainly written in Devnagari in India but is also written in Sinhala

script in Sri Lanka. Therefore, in this multilingual and multi-script world, OCR systems need to be capable of recognizing characters irrespective of the script in which they are written. In general, recognition of characters from different scripts in a single OCR module is difficult. This is because the features necessary for character recognition depend on the structural properties, style and nature of writing, which generally differ from one script to another. For example, features used for recognition of the English alphabet are in general not good for recognizing Chinese logograms.

• D. Ghosh is with the Department of Electronics & Computer Engineering, Indian Institute of Technology, Roorkee 247 667, India. E-mail: ghoshfec@iitr.ernet.in

• T. Dube is with the Indian Institute of Management, Ahmedabad 380 015, India.

• A.P. Shivaprasad is with the Department of Electronics & Communication Engineering, Sambhram Institute of Technology, Bangalore 560 097, India.

Another option for handling documents in a multi-

script environment is to use a bank of OCRs corre-

sponding to all different scripts expected to be seen. The

characters in an input document can then be recognized

reliably by selecting the appropriate OCR system from

the OCR bank. Nevertheless, this requires knowing a priori the script in which the input document is written. Unfortunately, this information may not be readily

available. At the same time, manual identiﬁcation of the

documents’ scripts may be tedious and time consuming.

Therefore, automatic script recognition techniques are

necessary to identify the script in the input document

and then redirect it to the appropriate character recog-

nition module, as illustrated in Fig. 1.
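The routing idea described above, a script recognizer placed in front of a bank of script-specific OCR engines, can be sketched as follows. This is only an illustration under stated assumptions: the names read_document, toy_recognizer and toy_bank are hypothetical, images are represented as plain bytes, and the toy recognizer stands in for any of the identification methods surveyed later.

```python
from typing import Callable, Dict

# Hypothetical interfaces: both stages take a raw image (bytes here).
ScriptRecognizer = Callable[[bytes], str]   # image -> script name
OcrEngine = Callable[[bytes], str]          # image -> recognized text


def read_document(image: bytes,
                  recognize_script: ScriptRecognizer,
                  ocr_bank: Dict[str, OcrEngine]) -> str:
    """Identify the script first, then route the image to the matching engine."""
    script = recognize_script(image)
    if script not in ocr_bank:
        raise ValueError(f"no OCR engine registered for script {script!r}")
    return ocr_bank[script](image)


# Toy stand-ins so the sketch runs end to end (not real recognizers).
def toy_recognizer(image: bytes) -> str:
    return "Latin" if image.startswith(b"L") else "Han"


toy_bank: Dict[str, OcrEngine] = {
    "Latin": lambda img: "latin-ocr:" + img.decode("ascii"),
    "Han": lambda img: "han-ocr:" + img.decode("ascii"),
}
```

Keeping the recognizer and the OCR bank as separate arguments mirrors the two-stage arrangement: either stage can be replaced without touching the other.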

A script recognizer is also useful in reading multi-script

documents in which different paragraphs, text-blocks,

textlines or words in a page are written in different

scripts. Fig. 2 shows several examples of multi-script

documents. Analysis of such documents works in two

stages — identiﬁcation and separation of different script

Fig. 1. Stages of document processing in a multi-script environment.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.


Fig. 2. Examples of multi-script document images: (a) a government report in China containing a mix of Chinese and English words, (b) a medical report in Arabic containing words in English that do not have an exact Arabic equivalent, (c) portion of an official application form in India containing different script-lines typeset in Hindi and English.

regions in the document, followed by reading of each individual script region using the corresponding OCR system.

Script identiﬁcation also serves as an essential precur-

sor for recognizing the language in which a document

is written. This is necessary for further processing of the

document, such as routing, indexing or translation. For

scripts used by only one language, script identiﬁcation

itself accomplishes language identiﬁcation. For scripts

shared by many languages, script recognition acts as

the ﬁrst level of classiﬁcation followed by language

identiﬁcation within the script.

Script recognition also helps in text area identiﬁcation,

video indexing and retrieval, and document sorting in

digital libraries when dealing with a multi-script envi-

ronment. Text area detection refers to either segment-

ing out text-blocks from other non-textual regions like

halftones, images, line drawings, etc. in a document

image, or extracting text printed against textured back-

grounds and/or embedded in images within a docu-

ment. To do this, the system takes advantage of script

speciﬁc distinctive characteristics of text which make it

stand out from other non-textual parts in the document.

Text extraction is also required in images and videos

for content-based browsing. One powerful index for

image/video retrieval is the text appearing in them.

Efﬁcient indexing and retrieval of digital image/video in

an international scenario, therefore, requires text extrac-

tion followed by script identiﬁcation and then character

recognition. Similarly, text found in documents can be

used for their annotation, indexing, sorting and retrieval.

Thus, script identiﬁcation plays an important role in

building a digital library containing documents written

in different scripts.

In short, automatic script identiﬁcation is crucial to

meet the growing demand for electronic processing of

volumes of documents written in different scripts. This is important for business transactions across Europe and the Orient, and has great significance in a country like India, which has many official state languages and scripts. Due

to this, there has been a growing interest in multi-script

OCR technology during recent years. A brief survey of methods for script recognition was reported earlier in [7], with emphasis on script identification in Indian multi-script documents but little insight into the script recognition methods for non-Indian scripts. A review

of script identiﬁcation research for Indian documents is

also available in [8]. A report on the key technologies in multilingual OCR and their application in building a multilingual digital library can be found in [9].

In this paper, we present a comprehensive survey of

different script recognition techniques developed mainly

for identiﬁcation of certain major scripts of the world,

viz. Chinese, Japanese, Korean, Arabic, Hebrew, Latin,

Cyrillic and the Brahmic family of Indian scripts. To

begin with, in Section 2, we give a brief description

of different script types highlighting their main dis-

criminating features. Methods for script recognition in

document images are described in Section 3 giving

comparative analysis among them. Section 4 discusses

several methods for script recognition in the realm of pen

computing. As said before, script identiﬁcation in video

text is also important. However, not much research has

been done on this topic. The only work that we have

found on this is outlined in Section 5. Section 6 raises

issues related to performance evaluation of multi-script

OCR systems. Finally, we state our concluding remarks

in Section 7, including some insights on the recent trends

and future scope of work in this ﬁeld.

2 WRITING SYSTEMS AND SCRIPTS OF THE WORLD

In the context of script recognition, it may be worth

studying the characteristics of various writing systems

and the structural properties of the characters used in

certain major scripts of the world. In Fig. 3, we draw

a tree diagram showing different classes of writing

systems. As said in [10], [11] and depicted in the tree

diagram, there are six prominent writing systems. Major

scripts that follow each of these writing systems are also

shown in the tree diagram and are described below.

2.1 Logographic system

A logogram, also called ideogram, refers to a symbol that

graphically represents a complete word. Accordingly,

the number of characters in a script for an ideographic

writing system generally runs into the thousands. This

makes recognition of logographic characters a difﬁcult

but interesting problem.

An example of a logographic script is Han, which is mainly associated with Chinese. Japanese and Korean writings also include Han, modified as Kanji and Hanja, respectively. Han characters are generally composed of multiple short strokes, giving them a complex and dense look, distinctly different from Western and other Asian scripts. Accordingly, character optical density and certain other visual appearance-based features have been

utilized by many researchers in distinguishing Han from

other scripts. Another interesting property of Han is its

directionality — words in a textline are written either

from left to right or from top to bottom.


GHOSH

ET AL.: SCRIPT RECOGNITION – A REVIEW 3

Fig. 3. Tree diagram showing broad classification of prominent writing systems and scripts of the present world.

2.2 Syllabic system

In a syllabic system, every written symbol represents a

phonetic sound or syllable, as used in Japanese. The symbols representing the Japanese syllables are known as Kanas, which are of two types: Hiragana and Katakana.

As indicated in Fig. 3, Japanese script uses a mix of lo-

gographic Kanji and syllabic Kanas. Hence, it is visually

similar to Chinese, but less dense due to the presence of

simpler Kanas in between the logograms.

2.3 Alphabetic system

An alphabet is a set of characters representing phonemes

of a spoken language. Examples of scripts following this

system are Greek, Latin, Cyrillic and Armenian. The

Latin script, also called Roman script, is used by many

languages throughout this world with varying degrees

of modiﬁcations from one language to another. It is used

for writing many European languages like English, Ital-

ian, French, German, Portuguese, Spanish, etc., and has

been adopted in many Amerindian and Austronesian

languages including modern Malay, Vietnamese and Indonesian. Fig. 4 shows a few such variants of the Latin script. Compared to other scripts, classical Latin characters are simple in structure, mainly composed of a few lines and arcs. The other major script under the

alphabetic system is Cyrillic. This script is used by some

languages of Eastern Europe, Asia and Slavic regions

that include Bulgarian, Russian, Macedonian, Ukrainian,

Mongolian, etc. The basic properties of this script are somewhat similar to those of Latin except that it uses a different alphabet set. Some characters in the Cyrillic alphabet are also borrowed from Latin and Greek, modified with cedillas, crosshatches or diacritical marks. This induces recognition ambiguity between Cyrillic, Latin and Greek.

Fig. 4. Examples of some languages using the Latin alphabet with different modifications.

2.4 Abjads

The Abjad system of writing is similar to the alpha-

betic system, but has symbols for consonantal sounds

only. Unlike most other scripts in the world, Abjads are

written from right to left within a textline. This unique

feature is particularly useful for identifying Abjad-based

scripts in pen computing.

Two important scripts under this category are Arabic

and Hebrew. A typical Arabic character is formed of

a long main stroke along with one to three dots. The

characters in a word are generally conjoined giving

an overall cursive appearance to the written text. This

provides an important clue for the recognition of Arabic

script. The same applies to some other scripts of Arabic

origin such as Farsi (Persian), Urdu, Sindhi, Jawi, etc.

On the other hand, character strokes in Hebrew are more

uniform in length and the letters in a word are generally

discrete.

2.5 Abugidas

Abugida is another alphabetic-like writing system used

by the Brahmic family of scripts that originated from the

ancient Indian Brahmi script and includes nearly all the

scripts of India and southeast Asia. In Fig. 5, we draw a

tree diagram to illustrate the evolution of major Brahmic

scripts in India and southeast Asia. The northern group

of Brahmic scripts (e.g. Devnagari, Bengali, Manipuri,

Gurumukhi, Gujarati and Oriya) bears a strong resemblance

to the original Brahmi script. On the other hand, scripts

in south India (Tamil, Telugu, Kannada and Malayalam)

as well as in southeast Asia (e.g. Thai, Lao, Burmese,

Javanese and Balinese) are derived from Brahmi through


many changes and so look quite different from the northern group. One important characteristic of Devnagari, Bengali, Gurumukhi and Manipuri is that the characters in a word are generally written together without spaces, so that the top bar is unbroken. This results in the formation of a headline, called shirorekha, at the top of each word. Accordingly, these scripts can be separated from other script types by detecting the presence of a large number of horizontal lines in the textual portions of a document.

Fig. 5. The Brahmic family of scripts used in India and southeast Asia.
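The headline cue described above lends itself to a very simple test on a binarized word image. The sketch below is an assumption-laden illustration, not a method taken from any of the surveyed papers: the image is a list of rows of 0/1 values, only the upper half of the word is searched, and the 0.8 width fraction is a hypothetical threshold.

```python
from typing import List

Bitmap = List[List[int]]  # 1 = ink, 0 = background


def longest_run(row: List[int]) -> int:
    """Length of the longest unbroken run of ink pixels in one row."""
    best = current = 0
    for pixel in row:
        current = current + 1 if pixel else 0
        best = max(best, current)
    return best


def has_headline(word: Bitmap, min_fraction: float = 0.8) -> bool:
    """True if some row in the upper half of the word image carries an ink
    run spanning at least `min_fraction` of the width (hypothetical value)."""
    if not word or not word[0]:
        return False
    width = len(word[0])
    upper_half = word[: max(1, len(word) // 2)]
    return any(longest_run(row) >= min_fraction * width for row in upper_half)
```

A word with an unbroken top bar passes the test, while a word made of discrete vertical strokes does not; real documents would additionally need skew correction and noise handling, which are outside this sketch.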

2.6 Featural system

The last signiﬁcant form of writing system is the featural

system in which the symbols or characters represent the

features that make up the phonemes. One prominent

script of this sort is the Korean Hangul. As indicated in

Fig. 3, the Korean script is formed by mixing logographic

Hanja with featural Hangul. However, modern Korean

contains more of Hangul than Hanja. Consequently,

Korean script is relatively less complex and less dense

compared to Chinese and Japanese, containing more

circles and ellipses.

3 SCRIPT RECOGNITION METHODOLOGIES

Script identification relies on the fact that each script has a unique spatial distribution and visual attributes that

make it possible to distinguish it from other scripts. So,

the basic task involved in script recognition is to devise a

technique to discover these features from a given docu-

ment and then classify the document’s script accordingly.

Based on the nature of approach and features used,

these methods may be divided into two broad cate-

gories — structure-based and visual appearance-based

methods. Script recognition techniques in each of these

two categories may be further classiﬁed on the basis of

the level at which they are applied inside a document

image, viz. page-wise, paragraph-wise, textline-wise and

word-wise. The application mode of a method depends

on the minimum size of the text from which the fea-

tures proposed in the method can be extracted reliably.

Various algorithms under each of these categories are

summarized below.

3.1 Structure-based script recognition

In general, script classes differ from each other in their

stroke structure and connections, and the writing styles

associated with the character sets they use. One ap-

proach to script recognition may be to extract con-

nected components (continuous runs of pixels) in a doc-

ument [12] and then analyze their shapes and structures

so as to reveal the intrinsic morphological characteristics

of the script used in the document. In machine-printed


Latin, Greek, Han, etc., every individual character or part of a character is a connected component. On the other hand, in cursive handwritten documents, the characters in a word or part of a word can touch each other to form one single connected component. Likewise, in scripts like Devnagari, Bengali, Arabic, etc., a word or a part of a word forms a connected component. Script identification methods that are based on extraction and analysis of connected components fall under the category of structure-based methods.

Fig. 6. Spitz’s method of script identification.
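Connected-component extraction, the first step shared by the structure-based methods, can be illustrated with a plain breadth-first labeling pass. This is a minimal sketch and not the procedure of [12]: the binary image is a list of rows of 0/1 values, 8-connectivity is an assumption, and the function names are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Bitmap = List[List[int]]      # 1 = ink, 0 = background
Pixel = Tuple[int, int]       # (row, column)


def connected_components(image: Bitmap) -> List[List[Pixel]]:
    """Return the 8-connected ink components, in raster-scan order."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components: List[List[Pixel]] = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            # Start a new component and flood outwards from this pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            component: List[Pixel] = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(component)
    return components


def bounding_box(component: List[Pixel]) -> Tuple[int, int, int, int]:
    """(top, left, bottom, right) of a component, inclusive."""
    ys = [y for y, _ in component]
    xs = [x for _, x in component]
    return min(ys), min(xs), max(ys), max(xs)
```

The resulting components and their bounding boxes are the raw material from which features such as height distributions or optical density can then be measured.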

3.1.1 Page-wise script identiﬁcation methods

A script identiﬁcation method that relies on the spa-

tial relationship of character structures was developed

by Spitz for differentiating Han and Latin scripts in

machine-printed documents. In his ﬁrst work on this

topic [13], he used character optical density for classi-

fying individual textlines in a document as being ei-

ther English or Japanese. In another paper, Spitz used

vertical distribution of upward concavities in characters

for discriminating Han from Latin with 100% success

in continuous production use [14]. Later, he developed

a two stage classiﬁer in [15] by combining these two

features. In the ﬁrst stage, Latin is separated from

Han-based scripts by comparing the variances of their

upward concavity distributions. Further classiﬁcation

within the Han-based scripts is performed by analyzing

the distribution of optical density in the text image. The

system also has provisions for language identiﬁcation

within documents using the Latin alphabet by observing

the most frequently occurring character shape codes. A

schematic diagram showing the ﬂow of information in

the process is given in Fig. 6.
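The two features combined in the classifier of [15] can be computed along the following lines. This sketch covers the feature measurements only; the decision rules and thresholds of the original system are not given in this review and are not reproduced here. Optical density is taken to be the fraction of ink pixels in a region, the concavity positions are assumed to come from an upstream detector that is not shown, and both function names are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence

Bitmap = List[List[int]]  # 1 = ink, 0 = background


def optical_density(region: Bitmap) -> float:
    """Fraction of ink pixels in a region (assumed definition)."""
    total = sum(len(row) for row in region)
    if total == 0:
        return 0.0
    ink = sum(sum(row) for row in region)
    return ink / total


def concavity_height_variance(heights: Sequence[float]) -> float:
    """Population variance of the vertical positions of upward concavities.

    `heights` is assumed to be produced by a concavity detector
    that is outside the scope of this sketch.
    """
    return pvariance(heights) if len(heights) > 0 else 0.0
```

A two-stage classifier in the spirit of the text would compare the variance value first and fall back on the density distribution for the scripts that the first stage cannot separate.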

The above work by Spitz was extended by Lee et al. in [16] and by Waked et al. in [17] by incorporating some additional features. In [16], the script of a

printed document is identiﬁed via textline-wise script

recognition followed by a majority vote of the already

decided textline classiﬁcation results. The features used

are character height distribution and the top and bot-

tom proﬁles of character bounding boxes, in addition

to upward concavity distribution and optical density

features. Experimental results showed that these fea-

tures can separate Han-based (Chinese and Japanese)

documents from Latin-based (English, French, German,

Italian and Spanish) documents in 98.16% of cases. In [17],

Waked et al used bounding box size distribution, char-

acter density distribution and horizontal projections, for

classifying printed documents written in Han, Latin,

Cyrillic and Arabic. These statistical features are more

robust compared to the structural features proposed by

Spitz and Lee et al. However, Waked et al achieved an

accuracy rate of only 91% when tested on documents of

varying kinds, diverse formats and qualities. This drop

in recognition accuracy is mainly due to misclassiﬁcation

between Latin and Cyrillic scripts, which are similar-

looking under this measure. Also, some test documents

of extremely poor quality account for this degradation

in performance.
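The page-level decision used in [16], a majority vote over textline-wise results, is easy to state in code. The sketch below assumes the textline labels have already been produced by some textline classifier; the function name is hypothetical, and the tie-breaking behaviour (first label encountered wins) is an assumption, since the review does not say how ties are handled.

```python
from collections import Counter
from typing import Iterable


def document_script(textline_labels: Iterable[str]) -> str:
    """Return the script label that wins a majority vote over textlines."""
    labels = list(textline_labels)
    if not labels:
        raise ValueError("no textline labels to vote on")
    # most_common orders ties by first occurrence (an assumed tie rule).
    return Counter(labels).most_common(1)[0][0]
```

Voting makes the page-level decision tolerant of a few misclassified textlines, which is the reason for deciding at the textline level first.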

Script identiﬁcation in machine-printed documents us-

ing statistical features has also been explored by Lam

et al [18]. In a ﬁrst level of classiﬁcation, documents

are classiﬁed as Latin, Chinese, Japanese or Korean

using horizontal projection proﬁles, height distributions

of connected components and enclosing structure of

connected components. Non-Latin documents that can-

not be recognized in this stage are classiﬁed in a sec-

ond level of recognition using structural features like

character complexity, presence of circles, ellipses and

vertical strokes. In the process, more than 95% correct

recognition was achieved.

The fact that every script class is composed of some “textual symbols” of unique characteristic shapes was exploited by Hochberg et al. in identifying the script of a printed document [19]. First, textual symbols

obtained from documents of a known script are resized

and clustered to generate template symbols for that

script class, as depicted in Fig. 7. Textual symbols in-

clude character fragments, discrete characters, adjoined

characters, and even whole words. During classiﬁcation,

textual symbols extracted from the input document are

compared to the template symbols using Hamming dis-

tance and then scored against every script class on the

basis of their distances from the best match template

symbols in that script class. The script class with the best

average score is chosen as the script of the document.

Hochberg et al tested their method on as many as thir-

teen scripts, viz. Arabic, Armenian, Burmese, Chinese,

Cyrillic, Devanagari, Ethiopic, Greek, Hebrew, Japanese,

Korean, Latin and Thai, and obtained 96% accuracy.
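The template-matching step described above can be sketched as follows. This is an illustration under assumptions, not the implementation of [19]: symbols are taken to be already resized to a common size and flattened into 0/1 sequences, the score of a script is simply the mean Hamming distance to the best-matching template, and the clustering stage that produces the templates is omitted. The names hamming and classify are hypothetical.

```python
from typing import Dict, List, Sequence

Symbol = Sequence[int]  # flattened, size-normalized binary symbol


def hamming(a: Symbol, b: Symbol) -> int:
    """Number of positions at which two equal-length symbols differ."""
    if len(a) != len(b):
        raise ValueError("symbols must have the same size")
    return sum(x != y for x, y in zip(a, b))


def classify(symbols: List[Symbol],
             templates: Dict[str, List[Symbol]]) -> str:
    """Choose the script whose templates give the lowest mean
    best-match distance over all input symbols (assumed scoring)."""
    if not symbols:
        raise ValueError("no symbols extracted from the document")

    def mean_best_distance(script_templates: List[Symbol]) -> float:
        best = [min(hamming(s, t) for t in script_templates) for s in symbols]
        return sum(best) / len(best)

    return min(templates, key=lambda name: mean_best_distance(templates[name]))
```

Averaging over all symbols in the document means that a few symbols resembling another script do not change the overall decision.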

In [20], Hochberg and others proposed a feature-

based approach for script identiﬁcation in handwritten

documents and achieved 88% accuracy in distinguishing

Arabic, Chinese, Cyrillic, Devnagari, Japanese and Latin.

In their method, a handwritten document is character-

ized in terms of mean, standard deviation and skew of

Fig. 7. Hochberg et al.’s method of script identification in printed documents.
