
Script Recognition—A Review

TLDR
An overview is given of the different script identification methodologies under each of two broad categories: structure-based and visual-appearance-based techniques.
Abstract
A variety of different scripts are used in writing languages throughout the world. In a multiscript, multilingual environment, it is essential to know the script used in writing a document before an appropriate character recognition and document analysis algorithm can be chosen. In view of this, several methods for automatic script identification have been developed so far. They mainly belong to two broad categories: structure-based and visual-appearance-based techniques. This survey report gives an overview of the different script identification methodologies under each of these categories. Methods for script identification in online data and video-texts are also presented. It is noted that research in this field is relatively thin and that more remains to be done, particularly in the case of handwritten documents.


IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009
D. Ghosh, T. Dube, and A.P. Shivaprasad
Index Terms—Document analysis, Optical character recognition, Script identification, Multi-script document.
1 INTRODUCTION
One interesting and challenging field of research in pattern recognition is Optical Character Recognition (OCR). Optical character recognition is the process in which a paper document is optically scanned and then converted into a computer-processable electronic format by recognizing every individual character in the document and associating a symbolic identity with it.
With the increasing demand for creating a paperless world, many OCR algorithms have been developed over the years [1], [2], [3], [4], [5], [6]. However, most OCR systems are script-specific in the sense that they can read characters written in one particular script only. Script is defined as the graphic form of the writing system used to write statements expressible in language. That is, a script class refers to a particular style of writing and the set of characters used in it. Languages throughout the world are typeset in many different scripts. A script may be used by only one language or may be shared by many languages, sometimes with slight variations from one language to another. For example, Devnagari is used for writing a number of Indian languages like Sanskrit, Hindi, Konkani, Marathi, etc.; English, French, German and some other European languages use different variants of the Latin alphabet; and so on. Some languages even use different scripts at different points in time and space. One good example of this is Malay, which nowadays uses the Latin alphabet in place of the previously used Jawi. Another example is Sanskrit, which is mainly written in Devnagari in India but is also written in Sinhala script in Sri Lanka. Therefore, in this multilingual and multi-script world, OCR systems need to be capable of recognizing characters irrespective of the script in which they are written. In general, recognition of characters from different scripts in a single OCR module is difficult. This is because the features necessary for character recognition depend on the structural properties, style and nature of writing, which generally differ from one script to another. For example, features used for recognition of English letters are in general not good for recognizing Chinese logograms.

D. Ghosh is with the Department of Electronics & Computer Engineering, Indian Institute of Technology, Roorkee 247 667, India. E-mail: ghoshfec@iitr.ernet.in
T. Dube is with the Indian Institute of Management, Ahmedabad 380 015, India.
A.P. Shivaprasad is with the Department of Electronics & Communication Engineering, Sambhram Institute of Technology, Bangalore 560 097, India.
Another option for handling documents in a multi-script environment is to use a bank of OCRs corresponding to all the different scripts expected to be seen. The characters in an input document can then be recognized reliably by selecting the appropriate OCR system from the OCR bank. Nevertheless, this requires knowing a priori the script in which the input document is written. Unfortunately, this information may not be readily available. At the same time, manual identification of the documents’ scripts may be tedious and time consuming. Therefore, automatic script recognition techniques are necessary to identify the script in the input document and then redirect it to the appropriate character recognition module, as illustrated in Fig. 1.
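The routing idea in this paragraph can be sketched in a few lines. This is only an illustration under stated assumptions: `route_document`, `identify_script` and `ocr_bank` are hypothetical names introduced here, the "page" is an opaque placeholder rather than a real image, and the recognizers are stubs standing in for script-specific OCR engines.

```python
from typing import Callable, Dict

# Hypothetical alias: an OCR engine maps a page (placeholder string) to text.
Recognizer = Callable[[str], str]


def route_document(page: str,
                   identify_script: Callable[[str], str],
                   ocr_bank: Dict[str, Recognizer]) -> str:
    """Identify the script first, then pass the page to the matching OCR engine."""
    script = identify_script(page)
    if script not in ocr_bank:
        # No engine in the bank handles this script; fail loudly.
        raise KeyError(f"no OCR engine registered for script {script!r}")
    return ocr_bank[script](page)


# Stub engines standing in for real script-specific OCR systems (assumption).
ocr_bank: Dict[str, Recognizer] = {
    "Latin": lambda page: f"latin-ocr({page})",
    "Han": lambda page: f"han-ocr({page})",
}

# Stub script recognizer that always answers "Han" (assumption, for the demo).
result = route_document("page-001", lambda page: "Han", ocr_bank)
```

Keeping the script recognizer and the OCR bank as separate arguments mirrors the two-step flow of Fig. 1: either part can be replaced without touching the other.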
A script recognizer is also useful in reading multi-script documents in which different paragraphs, text-blocks, textlines or words in a page are written in different scripts. Fig. 2 shows several examples of multi-script documents. Analysis of such documents works in two stages: identification and separation of the different script regions in the document, followed by reading of each individual script region using the corresponding OCR system.

Fig. 1. Stages of document processing in a multi-script environment.
Fig. 2. Examples of multi-script document images: (a) a government report in China containing a mix of Chinese and English words, (b) a medical report in Arabic containing words in English that do not have an exact Arabic equivalent, (c) portion of an official application form in India containing different script-lines typeset in Hindi and English.

Digital Object Identifier 10.1109/TPAMI.2010.30 0162-8828/10/$26.00 © 2010 IEEE
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
Script identification also serves as an essential precursor for recognizing the language in which a document is written. This is necessary for further processing of the document, such as routing, indexing or translation. For scripts used by only one language, script identification itself accomplishes language identification. For scripts shared by many languages, script recognition acts as the first level of classification followed by language identification within the script.
Script recognition also helps in text area identification, video indexing and retrieval, and document sorting in digital libraries when dealing with a multi-script environment. Text area detection refers to either segmenting out text-blocks from non-textual regions like halftones, images, line drawings, etc. in a document image, or extracting text printed against textured backgrounds and/or embedded in images within a document. To do this, the system takes advantage of script-specific distinctive characteristics of text which make it stand out from the non-textual parts of the document. Text extraction is also required in images and videos for content-based browsing. One powerful index for image/video retrieval is the text appearing in them. Efficient indexing and retrieval of digital image/video in an international scenario, therefore, requires text extraction followed by script identification and then character recognition. Similarly, text found in documents can be used for their annotation, indexing, sorting and retrieval. Thus, script identification plays an important role in building a digital library containing documents written in different scripts.
In short, automatic script identification is crucial to meet the growing demand for electronic processing of volumes of documents written in different scripts. This is important for business transactions across Europe and the Orient, and has great significance in a country like India, which has many official state languages and scripts. Due to this, there has been a growing interest in multi-script OCR technology in recent years. A brief survey of methods for script recognition was reported earlier in [7], with emphasis on script identification in Indian multi-script documents but little insight into script recognition methods for non-Indian scripts. A review of script identification research for Indian documents is also available in [8]. A report on the key technologies in multilingual OCR and their application in building a multilingual digital library can be found in [9].

In this paper, we present a comprehensive survey of different script recognition techniques developed mainly for identification of certain major scripts of the world, viz. Chinese, Japanese, Korean, Arabic, Hebrew, Latin, Cyrillic and the Brahmic family of Indian scripts. To begin with, in Section 2, we give a brief description of different script types, highlighting their main discriminating features. Methods for script recognition in document images are described in Section 3, together with a comparative analysis among them. Section 4 discusses several methods for script recognition in the realm of pen computing. As said before, script identification in video text is also important. However, not much research has been done on this topic. The only work that we have found on this is outlined in Section 5. Section 6 raises issues related to performance evaluation of multi-script OCR systems. Finally, we state our concluding remarks in Section 7, including some insights on the recent trends and future scope of work in this field.
2 WRITING SYSTEMS AND SCRIPTS OF THE WORLD
In the context of script recognition, it may be worth
studying the characteristics of various writing systems
and the structural properties of the characters used in
certain major scripts of the world. In Fig. 3, we draw
a tree diagram showing different classes of writing
systems. As said in [10], [11] and depicted in the tree
diagram, there are six prominent writing systems. Major
scripts that follow each of these writing systems are also
shown in the tree diagram and are described below.
2.1 Logographic system
A logogram, also called ideogram, refers to a symbol that
graphically represents a complete word. Accordingly,
the number of characters in a script for an ideographic
writing system generally runs into the thousands. This
makes recognition of logographic characters a difficult
but interesting problem.
An example of a logographic script is Han, which is mainly associated with Chinese. Japanese and Korean writings also include Han, modified as Kanji and Hanja, respectively. Han characters are generally composed of multiple short strokes, giving them a complex and dense look, distinctly different from other Western and Asian scripts. Accordingly, character optical density and certain other visual appearance-based features have been utilized by many researchers in distinguishing Han from other scripts. Another interesting property of Han is its directionality: words in a textline are written either from left to right or from top to bottom.

Fig. 3. Tree diagram showing broad classification of prominent writing systems and scripts of the present world.
2.2 Syllabic system
In a syllabic system, every written symbol represents a phonetic sound or syllable, as used in Japanese. The symbols representing the Japanese syllables are known as Kanas, which are of two types: Hiragana and Katakana. As indicated in Fig. 3, Japanese script uses a mix of logographic Kanji and syllabic Kanas. Hence, it is visually similar to Chinese, but less dense due to the presence of simpler Kanas in between the logograms.
2.3 Alphabetic system
An alphabet is a set of characters representing phonemes of a spoken language. Examples of scripts following this system are Greek, Latin, Cyrillic and Armenian. The Latin script, also called Roman script, is used by many languages throughout the world with varying degrees of modification from one language to another. It is used for writing many European languages like English, Italian, French, German, Portuguese, Spanish, etc., and has been adopted in many Amerindian and Austronesian languages, including modern Malay, Vietnamese and Indonesian. Fig. 4 shows a few such variants of the Latin script. Compared to other scripts, classical Latin characters are simple in structure, mainly composed of a few lines and arcs. The other major script under the alphabetic system is Cyrillic. This script is used by some languages of Eastern Europe, Asia and Slavic regions that include Bulgarian, Russian, Macedonian, Ukrainian, Mongolian, etc. The basic properties of this script are somewhat similar to those of Latin except that it uses a different alphabet set. Some characters in the Cyrillic alphabet are also borrowed from Latin and Greek, modified with cedillas, crosshatches or diacritical marks. This induces recognition ambiguity between Cyrillic, Latin and Greek.

Fig. 4. Examples of some languages using the Latin alphabet with different modifications.
2.4 Abjads
The Abjad system of writing is similar to the alphabetic system, but has symbols for consonantal sounds only. Unlike most other scripts in the world, Abjads are written from right to left within a textline. This unique feature is particularly useful for identifying Abjad-based scripts in pen computing.
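To make the direction cue concrete, here is a minimal sketch of how writing direction might be estimated from online pen data. It is not a method taken from the surveyed literature: the stroke representation (a time-ordered list of strokes, each a list of (x, y) points with x increasing to the right) and the function name `writing_direction` are assumptions made for illustration.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def writing_direction(strokes: Sequence[Sequence[Point]]) -> str:
    """Guess horizontal writing direction from the order in which strokes were drawn."""
    # Mean x-coordinate of each non-empty stroke, kept in temporal order.
    centers = [sum(x for x, _ in s) / len(s) for s in strokes if s]
    if len(centers) < 2:
        return "unknown"
    # Count stroke-to-stroke moves to the right versus to the left.
    rightward = sum(1 for a, b in zip(centers, centers[1:]) if b > a)
    leftward = sum(1 for a, b in zip(centers, centers[1:]) if b < a)
    if rightward == leftward:
        return "unknown"
    return "left-to-right" if rightward > leftward else "right-to-left"


# Toy textline whose strokes progress from high x to low x.
rtl_line = [[(10.0, 0.0), (9.0, 1.0)], [(7.0, 0.0), (6.0, 1.0)], [(4.0, 0.0), (3.0, 2.0)]]
direction = writing_direction(rtl_line)
```

Counting moves instead of comparing only the first and last stroke makes the guess less sensitive to a single late stroke, such as a dot added after the word body is finished.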
Two important scripts under this category are Arabic
and Hebrew. A typical Arabic character is formed of
a long main stroke along with one to three dots. The
characters in a word are generally conjoined giving
an overall cursive appearance to the written text. This
provides an important clue for the recognition of Arabic
script. The same applies to some other scripts of Arabic
origin such as Farsi (Persian), Urdu, Sindhi, Jawi, etc.
On the other hand, character strokes in Hebrew are more
uniform in length and the letters in a word are generally
discrete.
2.5 Abugidas
Abugida is another alphabetic-like writing system, used by the Brahmic family of scripts that originated from the ancient Indian Brahmi script and includes nearly all the scripts of India and southeast Asia. In Fig. 5, we draw a tree diagram to illustrate the evolution of major Brahmic scripts in India and southeast Asia. The northern group of Brahmic scripts (e.g. Devnagari, Bengali, Manipuri, Gurumukhi, Gujarati and Oriya) bears a strong resemblance to the original Brahmi script. On the other hand, scripts in south India (Tamil, Telugu, Kannada and Malayalam) as well as in southeast Asia (e.g. Thai, Lao, Burmese, Javanese and Balinese) are derived from Brahmi through many changes and so look quite different from the northern group. One important characteristic of Devnagari, Bengali, Gurumukhi and Manipuri is that the characters in a word are generally written together without spaces, so that the top bar is unbroken. This results in the formation of a headline, called shirorekha, at the top of each word. Accordingly, these scripts can be separated from other script types by detecting the presence of a large number of horizontal lines in the textual portions of a document.

Fig. 5. The Brahmic family of scripts used in India and southeast Asia.
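The headline (shirorekha) cue described in this subsection can be illustrated with a simple row-wise ink count. This is a sketch under assumptions, not a published detector: the word image is taken to be a binary grid (1 = ink), and the function name `has_headline` and the `min_fill` threshold of 0.8 are hypothetical choices.

```python
from typing import List


def has_headline(word: List[List[int]], min_fill: float = 0.8) -> bool:
    """Return True if some row in the upper half of a binary word image is nearly all ink."""
    if not word or not word[0]:
        return False
    width = len(word[0])
    # Only the upper half is inspected, since the headline sits at the top of a word.
    upper_rows = word[: max(1, len(word) // 2)]
    return any(sum(row) / width >= min_fill for row in upper_rows)


# Toy word with an unbroken top bar joining its characters.
headline_word = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
# Toy word with discrete characters and no top bar.
plain_word = [
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 1],
]
```

On real scans the threshold would need tuning for skew and broken strokes; the fixed value here only serves the toy grids.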
2.6 Featural system
The last significant form of writing system is the featural
system in which the symbols or characters represent the
features that make up the phonemes. One prominent
script of this sort is the Korean Hangul. As indicated in
Fig. 3, the Korean script is formed by mixing logographic
Hanja with featural Hangul. However, modern Korean
contains more of Hangul than Hanja. Consequently,
Korean script is relatively less complex and less dense
compared to Chinese and Japanese, containing more
circles and ellipses.
3 SCRIPT RECOGNITION METHODOLOGIES
Script identification relies on the fact that each script has a unique spatial distribution and visual attributes that make it possible to distinguish it from other scripts. So, the basic task involved in script recognition is to devise a technique to discover these features from a given document and then classify the document’s script accordingly. Based on the nature of the approach and the features used, these methods may be divided into two broad categories: structure-based and visual appearance-based methods. Script recognition techniques in each of these two categories may be further classified on the basis of the level at which they are applied inside a document image, viz. page-wise, paragraph-wise, textline-wise and word-wise. The application mode of a method depends on the minimum size of the text from which the features proposed in the method can be extracted reliably. Various algorithms under each of these categories are summarized below.
3.1 Structure-based script recognition
In general, script classes differ from each other in their stroke structure and connections, and in the writing styles associated with the character sets they use. One approach to script recognition may be to extract connected components (continuous runs of pixels) in a document [12] and then analyze their shapes and structures so as to reveal the intrinsic morphological characteristics of the script used in the document. In machine-printed Latin, Greek, Han, etc., every individual character or part of a character is a connected component. On the other hand, in cursive handwritten documents, the characters in a word or part of a word can touch each other to form one single connected component. Likewise, in scripts like Devnagari, Bengali, Arabic, etc., a word or a part of a word forms a connected component. Script identification methods that are based on extraction and analysis of connected components fall under the category of structure-based methods.

Fig. 6. Spitz’s method of script identification.
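Connected-component extraction, the starting point of the structure-based methods, can be sketched as a breadth-first flood fill. The example assumes a binary image stored as a list of rows (1 = ink) and 8-connectivity; both are illustrative choices, and the function name `connected_components` is introduced here and is not the procedure of [12].

```python
from collections import deque
from typing import List, Tuple


def connected_components(image: List[List[int]]) -> List[List[Tuple[int, int]]]:
    """Group ink pixels (value 1) into 8-connected components via breadth-first search."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components: List[List[Tuple[int, int]]] = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new component from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit all 8 neighbours; the centre pixel is already marked seen.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


# Two discrete vertical strokes: two components, as with separate printed characters.
discrete = [
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
# The same strokes joined by a top bar: one component, as with a headline-joined word.
joined = [
    [1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0],
]
```

The two toy grids mirror the contrast drawn in the text: discrete printed characters yield one component each, while a joined word collapses into a single component.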
3.1.1 Page-wise script identification methods
A script identification method that relies on the spatial relationship of character structures was developed by Spitz for differentiating Han and Latin scripts in machine-printed documents. In his first work on this topic [13], he used character optical density for classifying individual textlines in a document as being either English or Japanese. In another paper, Spitz used the vertical distribution of upward concavities in characters for discriminating Han from Latin with 100% success in continuous production use [14]. Later, he developed a two-stage classifier in [15] by combining these two features. In the first stage, Latin is separated from Han-based scripts by comparing the variances of their upward concavity distributions. Further classification within the Han-based scripts is performed by analyzing the distribution of optical density in the text image. The system also has provisions for language identification within documents using the Latin alphabet by observing the most frequently occurring character shape codes. A schematic diagram showing the flow of information in the process is given in Fig. 6.
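As a rough illustration of the optical density feature, the sketch below measures the fraction of ink pixels inside a character region and averages it over a textline. The survey does not give Spitz’s exact definition, so this ratio is an assumed reading of the term; the function names and toy grids are hypothetical.

```python
from typing import List

Region = List[List[int]]  # binary grid for one character bounding box, 1 = ink


def optical_density(region: Region) -> float:
    """Fraction of pixels in the region that are ink."""
    total = sum(len(row) for row in region)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in region) / total


def mean_density(regions: List[Region]) -> float:
    """Average optical density over the character regions of one textline."""
    if not regions:
        return 0.0
    return sum(optical_density(r) for r in regions) / len(regions)


# A dense, many-stroke cell versus a sparse single-stroke cell (toy data).
dense_cell = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
sparse_cell = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

A classifier built on this feature would still need a decision rule learned from labeled text; none is assumed here.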
The above work by Spitz was extended by Lee et al. in [16] and by Waked et al. in [17] by incorporating some additional features. In [16], the script of a printed document is identified via textline-wise script recognition followed by a majority vote over the textline classification results. The features used are character height distribution and the top and bottom profiles of character bounding boxes, in addition to upward concavity distribution and optical density features. Experimental results showed that these features can separate Han-based (Chinese and Japanese) documents from Latin-based (English, French, German, Italian and Spanish) documents in 98.16% of cases. In [17], Waked et al. used bounding box size distribution, character density distribution and horizontal projections for classifying printed documents written in Han, Latin, Cyrillic and Arabic. These statistical features are more robust compared to the structural features proposed by Spitz and Lee et al. However, Waked et al. achieved an accuracy rate of only 91% when tested on documents of varying kinds, diverse formats and qualities. This drop in recognition accuracy is mainly due to misclassification between Latin and Cyrillic scripts, which are similar-looking under this measure. Also, some test documents of extremely poor quality account for this degradation in performance.
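The majority-vote step attributed to [16] above is easy to sketch once each textline already carries a script label. The per-line classifier is outside this example; the function name `vote_document_script` and the tie-breaking behaviour are assumptions of the sketch, not details reported for [16].

```python
from collections import Counter
from typing import Sequence


def vote_document_script(line_labels: Sequence[str]) -> str:
    """Pick the script label assigned to the largest number of textlines."""
    if not line_labels:
        raise ValueError("at least one textline label is required")
    counts = Counter(line_labels)
    # most_common keeps first-seen order among equal counts, so ties go to
    # the label that appeared earliest in the page (an arbitrary choice).
    return counts.most_common(1)[0][0]


# Five textlines, one of them misclassified by the per-line stage.
page_script = vote_document_script(["Han", "Han", "Latin", "Han", "Han"])
```

Voting lets a few misread textlines be outvoted, which is why a page-level decision can be more reliable than any single line-level one.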
Script identification in machine-printed documents using statistical features has also been explored by Lam et al. [18]. In a first level of classification, documents are classified as Latin, Chinese, Japanese or Korean using horizontal projection profiles, height distributions of connected components and enclosing structure of connected components. Non-Latin documents that cannot be recognized in this stage are classified in a second level of recognition using structural features like character complexity, presence of circles, ellipses and vertical strokes. In the process, more than 95% correct recognition was achieved.
The fact that every script class is composed of some “textual symbols” of unique characteristic shapes was exploited by Hochberg et al. in identifying the script of a printed document [19]. First, textual symbols obtained from documents of a known script are resized and clustered to generate template symbols for that script class, as depicted in Fig. 7. Textual symbols include character fragments, discrete characters, adjoined characters, and even whole words. During classification, textual symbols extracted from the input document are compared to the template symbols using Hamming distance and then scored against every script class on the basis of their distances from the best-match template symbols in that script class. The script class with the best average score is chosen as the script of the document. Hochberg et al. tested their method on as many as thirteen scripts, viz. Arabic, Armenian, Burmese, Chinese, Cyrillic, Devanagari, Ethiopic, Greek, Hebrew, Japanese, Korean, Latin and Thai, and obtained 96% accuracy.
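The template-matching classification described above can be sketched as follows. Several simplifications are assumed: symbols are already resized to a common size and flattened into equal-length binary tuples, the clustering that produces the templates is skipped, and the "best average score" is taken to mean the lowest mean Hamming distance. The function names and toy templates are hypothetical and do not reproduce the system of [19].

```python
from typing import Dict, List, Optional, Sequence

Symbol = Sequence[int]  # flattened binary image of a resized textual symbol


def hamming(a: Symbol, b: Symbol) -> int:
    """Number of positions at which two equal-length binary symbols differ."""
    if len(a) != len(b):
        raise ValueError("symbols must have the same size")
    return sum(x != y for x, y in zip(a, b))


def classify_script(symbols: List[Symbol],
                    templates: Dict[str, List[Symbol]]) -> Optional[str]:
    """Return the script whose templates match the symbols best on average."""
    if not symbols:
        return None
    best_script: Optional[str] = None
    best_score: Optional[float] = None
    for script, script_templates in templates.items():
        if not script_templates:
            continue
        # Score each symbol by its distance to the closest template of this script.
        distances = [min(hamming(s, t) for t in script_templates) for s in symbols]
        average = sum(distances) / len(distances)
        if best_score is None or average < best_score:
            best_script, best_score = script, average
    return best_script


# Toy 2x2 templates for two made-up script classes "A" and "B".
templates = {
    "A": [(1, 1, 0, 0), (1, 0, 1, 0)],
    "B": [(0, 0, 1, 1), (0, 1, 0, 1)],
}
document_symbols = [(1, 1, 0, 1), (1, 0, 1, 0)]
predicted = classify_script(document_symbols, templates)
```

In this toy run the mean distance is 0.5 for class "A" and 1.5 for class "B", so "A" is selected.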
In [20], Hochberg and others proposed a feature-based approach for script identification in handwritten documents and achieved 88% accuracy in distinguishing Arabic, Chinese, Cyrillic, Devnagari, Japanese and Latin. In their method, a handwritten document is characterized in terms of mean, standard deviation and skew of

Fig. 7. Hochberg et al.’s method of script identification in printed documents.