Proceedings ArticleDOI

Whole is Greater than Sum of Parts: Recognizing Scene Text Words

TL;DR: This work presents a holistic word recognition framework that represents the scene text image and synthetic images generated from lexicon words using gradient-based features, and recognizes the text in the image by matching the scene and synthetic image features with the novel weighted Dynamic Time Warping (wDTW) approach.
Abstract: Recognizing text in images taken in the wild is a challenging problem that has received great attention in recent years. Previous methods addressed this problem by first detecting individual characters, and then forming them into words. Such approaches often suffer from weak character detections, due to large intra-class variations, even more so than characters from scanned documents. We take a different view of the problem and present a holistic word recognition framework. In this, we first represent the scene text image and synthetic images generated from lexicon words using gradient-based features. We then recognize the text in the image by matching the scene and synthetic image features with our novel weighted Dynamic Time Warping (wDTW) approach. We perform experimental analysis on challenging public datasets, such as Street View Text and ICDAR 2003. Our proposed method significantly outperforms our earlier work in Mishra et al. (CVPR 2012), as well as many other recent works, such as Novikova et al. (ECCV 2012), Wang et al. (ICPR 2012), and Wang et al. (ICCV 2011).

Summary (2 min read)

I. INTRODUCTION

  • The document image analysis community has shown a huge interest in the problem of scene text understanding in recent years [6], [15], [19].
  • In [18] , each word in the lexicon is matched to the detected set of character windows, and the one with the highest score is reported as the predicted word.
  • This strongly top-down approach is prone to errors when characters are missed or detected with low confidence.
  • The authors, however, differ from their approach as follows.
  • The main contributions of their work are twofold: (i) showing that holistic word recognition for scene text images is possible with high accuracy, achieving a significant improvement over prior art; and (ii) the proposed method does not use any language-specific information, and thus can be easily adapted to any language.

II. WORD REPRESENTATION AND MATCHING

  • The authors extract features from the image, and match them with those computed for each word in the lexicon.
  • To this end, the authors present a gradient based feature set, and then a weighted Dynamic Time Warping scheme in the remainder of this section.
  • The gradient orientations are accumulated into histograms over vertical strips extracted from the image.
  • The problem is how to match the scene text and the synthetic lexicon-based images.
  • For this, the authors cluster all the feature vectors computed over vertical strips of synthetic images and compute the entropy of each cluster as follows.

H(cluster_p) = −∑_{k=1}^{K} Pr(y_j ∈ ω_k, y_j ∈ cluster_p) × log_K(Pr(y_j ∈ ω_k, y_j ∈ cluster_p))

  • High entropy of a cluster indicates that the features corresponding to that cluster are almost equally distributed in all the word classes.
  • In other words, such features are less informative, and thus are assigned a low weight during matching.
  • To give a high penalty to those warping paths which deviate from the near-diagonal paths, the authors multiply them with a penalty function log_10(wp − wp_o), where wp and wp_o are the warping path of the DTW matching and the diagonal warping path respectively.
  • Given a scene text and a ranked list of matched synthetic words (each corresponding to one of the lexicon words), their goal is to find the text label.
  • Randomness is maximum when all the top k retrievals are different words, and is minimum (i.e. zero) when all the top k retrievals are the same.

B. Implementation Details

  • For every lexicon word the authors generated synthetic words with 20 different styles and fonts using ImageMagick.
  • The authors' observations suggest that font selection is not a very crucial step for the overall performance of their method.
  • A five pixel-width padding was done for all the images.
  • The authors used the binarization method in [10] prior to computing the profile features.
  • Given a scene text image to recognize, the authors retrieve word images from a database of synthetic words.

C. Comparison with Previous Work

  • The authors retrieve synthetic word images corresponding to lexicon words and use dynamic k-NN to assign a text label to a given scene text image.
  • A specific preprocessing or more variations in the synthetic dataset may be needed to deal with such fonts.
  • Fig. 4 shows the qualitative performance of the proposed method on sample images.
  • In addition to being simple, their method significantly improves the prior art.
  • This gain in accuracy can be attributed to the robustness of their method, which (i) does not rely on character segmentation but instead performs holistic word recognition; and (ii) learns the discriminativeness of features in a principled way and uses this information for robust matching using wDTW.

IV. CONCLUSION

  • The authors' method neither requires character segmentation nor relies on binarization, but instead performs holistic word recognition.
  • The authors show a significantly improved performance over the most recent works from 2011 and 2012.
  • The authors thus establish a new state-of-the-art on lexicon-driven scene text recognition.
  • The robustness of their word matching approach shows that the natural extension of this work can be in the direction of "text to scene image" retrieval.




Whole is Greater than Sum of Parts: Recognizing Scene Text Words

Vibhor Goel¹  Anand Mishra¹  Karteek Alahari²  C. V. Jawahar¹
¹ Center for Visual Information Technology, IIIT Hyderabad, India
² INRIA - WILLOW / École Normale Supérieure, Paris, France
Abstract—Recognizing text in images taken in the wild is a
challenging problem that has received great attention in recent
years. Previous methods addressed this problem by first detecting
individual characters, and then forming them into words. Such
approaches often suffer from weak character detections, due to
large intra-class variations, even more so than characters from
scanned documents. We take a different view of the problem
and present a holistic word recognition framework. In this,
we first represent the scene text image and synthetic images
generated from lexicon words using gradient-based features. We
then recognize the text in the image by matching the scene and
synthetic image features with our novel weighted Dynamic Time
Warping (wDTW) approach.
We perform experimental analysis on challenging public
datasets, such as Street View Text and ICDAR 2003. Our
proposed method significantly outperforms our earlier work in
Mishra et al. (CVPR 2012), as well as many other recent works,
such as Novikova et al. (ECCV 2012), Wang et al. (ICPR 2012),
Wang et al. (ICCV 2011).
I. INTRODUCTION
The document image analysis community has shown a
huge interest in the problem of scene text understanding in
recent years [6], [15], [19]. This problem involves various sub-
tasks, such as text detection, isolated character recognition,
and word recognition. Due to recent works [5], [8], [13], text
detection accuracies have significantly improved. However, the
success of methods for recognizing words still leaves a lot to
be desired. We aim to address this issue in our work.
The problem of recognizing words has been looked at
in two broad contexts, with and without the use of a
lexicon [11], [12], [18], [20]. In the case of lexicon-driven
word recognition, a list of words is available for every scene
text image. The problem of recognizing the word now reduces
to that of finding the best match from the list. This is relevant
in many applications, such as: (1) recognizing certain text in
a grocery store, where a list of grocery items can serve as a
lexicon, (2) robotic vision in an indoor/outdoor environment.
Lexicon-driven scene text recognition may appear to be
an easy task, but the best methods up until now have only
achieved accuracies in the low 70s on this problem. Some of
these recent methods can be summarized as follows. In [18],
each word in the lexicon is matched to the detected set
of character windows, and the one with the highest score
is reported as the predicted word. This strongly top-down
approach is prone to errors when characters are missed or
detected with low confidence. In our earlier work [12], we
improved upon this model by introducing a framework,
which uses top-down as well as bottom-up cues. Rather
than pre-selecting a set of character detections, we defined
a global model that incorporates language priors (top-down)
and all potential characters (bottom-up). In [19], Wang et al.
combined unsupervised feature learning and multi-layer neural
networks for scene text detection and recognition. While both
these recent methods improved the previous art significantly,
they suffer from the following drawbacks: (i) The need for
language-specific character training data. (ii) They do not use the
entire visual appearance of the word. (iii) They are prone to errors due
to false or weak character detections.
In this paper, we choose an alternative path and propose
a holistic word recognition method for scene text images. We
address the problem in a recognition-by-retrieval framework.
This is achieved by transforming the lexicon into a collection
of synthetic word images, and then posing the recognition task
as the problem of retrieving the best match from the lexicon
image set. The retrieval framework introduced in our approach
is similar in spirit to the influential work of [16] in the area
of handwritten and printed word spotting. We, however, differ
from their approach as follows. (1) Our matching score is based
on a novel feature set, which shows better performance than
the profile features in [16]. (2) We formulate the problem of
finding the best word match in a maximum likelihood frame-
work and maximize the probability of two feature sequences
originating from the same word class. (3) We propose a robust way
to find the match for a word, where k in k-NN is not hand-picked,
but dynamically decided based on the randomness
of the top retrievals.
Motivation and Overview. The problem of recognizing text
(including printed and handwritten text) has been addressed in
many ways. Detecting characters and combining them to form
a word is a popular approach as mentioned above [12], [18].
Often these methods suffer from weak character detections
as shown in Fig. 2(a). An alternative scheme is to learn a
model for words [4]. There are also approaches that recognize
a word by first binarizing the image, and then finding each
connected component [9]. These methods inherently rely on
finding a model to represent each character or word. In the
context of scene text recognition, this creates the need for a
large amount of training data to cover the variations in scene
text. Examples of such variations are shown in Fig. 2(b). Our
method is designed to overcome these issues.
We begin by generating synthetic images for the words
from the lexicon with various fonts and styles. Then, we
compute gradient-based features for all these images as well
as the scene text (test) image. We then recognize the text in

Fig. 1. Overview of the proposed system. We recognize the word in the test image by matching it with synthetic images corresponding to the lexicon words.
A novel gradient based feature set is used to represent words. Matching is done with weighted DTW scores computed with these features. We use the top k
matches to determine the most likely word in the scene text image.
the image by matching the scene and synthetic image features
with our novel weighted Dynamic Time Warping (wDTW).
The weights in the DTW matching scores are learned from
the synthetic images, and determine the discriminativeness of
the features. We use the top k retrieved synthetic images to
determine the word most likely to represent the scene text
image (see Section II). An overview of our method is shown
in Fig. 1.
We present results on two challenging public datasets,
namely Street View Text (SVT) and ICDAR 2003 (see Sec-
tion III). We experimentally show that popular features like
profile features are not robust enough to deal with challenging
scene text images. Our experiments also suggest that the
proposed gradient at edges based features outperform profile
features for the word matching task. In addition to being
simple, the proposed method improves the accuracy by more
than 5% over recent works [12], [18], [19].
The main contributions of our work are two fold: (i) We
show that holistic word recognition for scene text images
is possible with high accuracy, and achieve a significant
improvement over prior art. (ii) The proposed method does not
use any language-specific information, and thus can be easily
adapted to any language. Additionally, the robust synthetic
word retrieval for scene text queries also shows that our
framework can be easily extended for text to image retrieval.
However, this is beyond the scope of the paper.
II. WORD REPRESENTATION AND MATCHING
We propose a novel method to recognize the word con-
tained in an image as a whole. We extract features from the
image, and match them with those computed for each word in
the lexicon. To this end, we present a gradient based feature
set, and then a weighted Dynamic Time Warping scheme in
the remainder of this section.
Gradient based features. Some of the previous approaches
binarize a word image into character vs non-character regions
before computing features [9]. While such pre-processing
steps can be effective to reduce the dimensionality of the
(a) Weak character detections due to high inter-class and intra-class confusion
as noted in [12].
(b) Large intra-class variations in scene text words.
Fig. 2. (a) Character detection is a challenging problem in the context of
scene text images. A couple of examples are shown, where weak character
detections lead to incorrect word recognition. (b) Large intra-class variations
in scene text images make it challenging to learn models to represent words.
Moreover, getting sufficient training data for each word is not trivial.
feature space, it comes with its disadvantages. The results
of binarization are seldom perfect and contain noise, and this
continues to be an unsolved problem in the context of scene
text images. Thus, we look for other effective features, which
do not rely on binarized images. Inspired by the success of
Histogram of Oriented Gradient (HOG) features [7] in many
vision tasks, we adapted them to the word recognition problem.
To compute the adapted HOG features, we begin by
applying the Canny edge operator on the image. Note that
we do not expect a clean edge map from this result. We
then compute the orientation of gradient at each edge pixel.
The gradient orientations are accumulated into histograms over
vertical (overlapping) strips extracted from the image. The
histograms are weighted by the magnitude of the gradient.
An illustration of the feature computation process is shown
in Fig. 3. At the end of this step, we have a representation of
the image in terms of a set of histograms. In the experimental
section we will show that these easy to compute features are
robust for the word matching problem.
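
As a concrete illustration of the description above, here is a minimal sketch of the strip-wise features, assuming OpenCV and NumPy. The strip width (4 px), shift (2 px), and 9 signed-orientation bins follow Section III; the Canny thresholds and the per-strip normalization are illustrative assumptions, not the paper's stated settings.

```python
import cv2
import numpy as np

def word_features(gray, strip_width=4, shift=2, n_bins=9):
    """Histograms of signed gradient orientations at edge pixels,
    accumulated over overlapping vertical strips (one histogram per strip)."""
    edges = cv2.Canny(gray, 100, 200)             # a noisy edge map is acceptable
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy)            # signed orientation in [0, 2*pi)
    feats = []
    for x0 in range(0, gray.shape[1] - strip_width + 1, shift):
        cols = slice(x0, x0 + strip_width)
        on = edges[:, cols] > 0                   # histogram only at edge pixels
        hist, _ = np.histogram(ang[:, cols][on], bins=n_bins,
                               range=(0, 2 * np.pi),
                               weights=mag[:, cols][on])  # magnitude-weighted
        feats.append(hist / (hist.sum() + 1e-8))  # per-strip normalization (assumed)
    return np.asarray(feats)                      # shape: (num_strips, n_bins)
```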
Matching words. Once words are represented using a set of
features, we need a mechanism to match them. The problem

Fig. 3. An illustration of feature computation. We divide the word image into
vertical strips. In each strip we compute a histogram of gradient orientations
at edges. These features are computed for overlapping vertical strips.
is how to match the scene text and the synthetic lexicon-based
images (details on generating these images are given in Section III).
We formulate the problem of matching scene text
and synthetic words in a maximum likelihood framework.
Let X = {x_1, x_2, . . . , x_m} and Y = {y_1, y_2, . . . , y_m} be
the feature sequences from a given word and its candidate
match respectively. Each vector x_i and y_i is a histogram of
gradient features extracted from a vertical strip. Let ω =
{ω_1, ω_2, . . . , ω_K} represent the set of word classes, where K is
the total number of lexicon words. Since we assume the features
at each vertical strip are independent, the joint probability that
the feature sequences X and Y originate from the same word
ω_k, i.e. P(X, Y | ω_k), can be written as the product of the
joint probabilities of features originating from the same strip,
i.e.,

P(X, Y | ω_k) = ∏_i P(x_i, y_i | ω_k).   (1)
In a maximum likelihood framework, the problem of finding
an optimal feature sequence Y for a given feature sequence X
is equivalent to maximizing ∏_i P(x_i, y_i | ω_k) over all possible
Y s. Taking the negative logarithm turns this product maximization
into a sum minimization, so it can be written as the minimization
of an objective function f, i.e., min_Y ∑_i f(x_i, y_i | ω_k), where f
is the weighted squared l2-distance between the feature sequences,
i.e., f(x_i, y_i) = (x_i − y_i)^T w_i (x_i − y_i). Here w_i is the weight
of feature x_i. These weights are learned from the synthetic images,
and are proportional to the discriminativeness of the features. In
other words, given a feature sequence X and a set of candidate
sequences Y s, the problem of finding the optimal matching sequence
becomes that of minimizing f over all candidate sequences Y. This
leads to the problem of alignment of sequences. We propose a
weighted dynamic programming based solution to this problem.
Dynamic Time Warping [17] is used to compute a distance between
two time series. The weighted DTW distance DTW(m, n) between the
sequences X and Y can be computed recursively using dynamic
programming as:

DTW(i, j) = min{ DTW(i−1, j) + D(i, j),
                 DTW(i, j−1) + D(i, j),
                 DTW(i−1, j−1) + D(i, j) },   (2)
where D(i, j) is the distance between features x_i and y_j, and
the local distance matrix D is written as: D = (X − Y)^T W (X − Y).
The diagonal matrix W is learnt from the synthetic images. For
this, we cluster all the feature vectors computed over vertical
strips of the synthetic images and compute the entropy of each
cluster as follows.
H(cluster_p) = −∑_{k=1}^{K} Pr(y_j ∈ ω_k, y_j ∈ cluster_p)
               × log_K(Pr(y_j ∈ ω_k, y_j ∈ cluster_p)),   (3)
where Pr is the joint probability of feature y_j originating
from class ω_k and falling in cluster_p. High entropy of a
cluster indicates that the features corresponding to that cluster
are almost equally distributed across all the word classes. In
other words, such features are less informative, and thus
are assigned a low weight during matching. The weight w_j
associated with a feature vector y_j is computed as:
w_j = 1 − H(cluster_p), if y_j ∈ cluster_p.
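
A sketch of this weight-learning step under stated assumptions: the paper does not name a clustering algorithm, so k-means (via scikit-learn) stands in, with the 30 clusters reported in Section III; the joint probability is estimated by simple counting over the synthetic strips.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_strip_weights(strip_feats, word_ids, n_clusters=30):
    """strip_feats: (N, d) features from all synthetic-image strips;
    word_ids: (N,) int array, lexicon-word class of each strip's image."""
    K = int(word_ids.max()) + 1                   # number of word classes
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(strip_feats)
    entropy = np.zeros(n_clusters)
    for p in range(n_clusters):
        in_p = km.labels_ == p
        # joint probability Pr(y_j in class k, y_j in cluster p), by counting
        joint = np.array([np.mean(in_p & (word_ids == k)) for k in range(K)])
        nz = joint > 0
        entropy[p] = -np.sum(joint[nz] * np.log(joint[nz]) / np.log(K))  # log base K
    return km, 1.0 - entropy                      # w_j = 1 - H(cluster_p)
```

A strip y_j is then weighted by the entry of the returned vector for the cluster it falls in, e.g. weights[km.predict(y_j[None])[0]].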
Warping path deviation based penalty. To give a high penalty
to those warping paths which deviate from the near-diagonal
paths, we multiply them with a penalty function log_10(wp − wp_o),
where wp and wp_o are the warping path of the DTW matching
and the diagonal warping path respectively. This penalizes warping
paths where a small portion in one word is matched with a
large portion in another word.
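
A minimal NumPy sketch of the recursion in Eq. (2), applying the learned weight as a per-strip scalar on the squared l2 distance. The warping-path deviation penalty is omitted here, since the paper leaves the exact measurement of wp − wp_o underspecified; it would multiply the returned score.

```python
import numpy as np

def wdtw(X, Y, wY):
    """X: (m, d) scene-word strip features; Y: (n, d) synthetic-word strip
    features; wY: (n,) per-strip weights (1 - cluster entropy) for Y."""
    m, n = len(X), len(Y)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # weighted squared l2 distance between strips x_i and y_j
            cost = wY[j - 1] * np.sum((X[i - 1] - Y[j - 1]) ** 2)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]
```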
Dynamic k-NN. Given a scene text and a ranked list of
matched synthetic words (each corresponding to one of the
lexicon words), our goal is to find the text label. To do so,
we apply k-nearest neighbor. One of the issues with a nearest
neighbor approach is finding a good k. This parameter is often
set manually. To avoid this, we use dynamic k-NN. We start
with an initial value of k and measure the randomness of the
top k retrievals. Randomness is maximum when all the top k
retrievals are different words, and is minimum (i.e. zero) when
all the top k retrievals are the same. We increment k by 1 until this
randomness decreases. At this point we assign the label of the
most frequently occurring synthetic word to a given scene text.
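
One way to read this stopping rule in code, assuming a normalized label-diversity measure for the randomness (the paper gives no explicit formula) and the initial k = 3 reported in Section III:

```python
from collections import Counter

def dynamic_knn(ranked_labels, k0=3):
    """ranked_labels: lexicon words of the retrieved synthetic images,
    best match first. Returns the predicted word."""
    def randomness(labels):
        # 0 when all top-k labels agree, 1 when they are all different
        return (len(set(labels)) - 1) / max(len(labels) - 1, 1)
    k = k0
    r = randomness(ranked_labels[:k])
    # grow k while the diversity of the top-k retrievals keeps dropping
    while k < len(ranked_labels) and randomness(ranked_labels[:k + 1]) < r:
        k += 1
        r = randomness(ranked_labels[:k])
    return Counter(ranked_labels[:k]).most_common(1)[0][0]
```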
In summary, given a scene text word and a set of lexicon
words, we transform each lexicon into a collection of synthetic
images, and then represent each image as a sequence of fea-
tures. We then pose the problem of finding candidate optimal
matches for a scene text image in a maximum likelihood
framework and solve it using weighted DTW. The weighted
DTW scheme provides a set of candidate optimal matches. We
then use dynamic k-NN to find the optimal word in a given
scene text image.
III. EXPERIMENTS AND RESULTS
In this section we present implementation details of our
approach, and its detailed evaluation, and compare it with the
best performing methods for this task namely [12], [14], [18],
[19].
A. Datasets
For the experimental analysis we used two datasets, namely
Street View Text (SVT) [1] and ICDAR 2003 robust word
recognition [2]. The SVT dataset contains images taken from
Google Street View. We used the SVT-word dataset, which
contains 647 images, relevant for the recognition task. A lexi-
con of 50 words is also provided with each image. The lexicon
for the ICDAR dataset was obtained from [18]. Following the
protocol of [18], we ignore words with less than two characters
or with non-alphanumeric characters, which results in 863
words overall. Note that we could not use the ICDAR 2011
dataset since it has no associated lexicon.

Fig. 4. A few sample results: top-5 synthetic word retrievals for scene text queries. The first column shows the test image; the top-5 retrievals for it are
shown from left to right in each row. The icon in the rightmost column shows whether a word is correctly recognized or not. We observe that the proposed word
matching method is robust to variations in fonts and character size. In the fourth row, despite the unseen style of the word image “liquid”, the top two retrievals are
correct. (Note that following the experimental protocol of [18], we do case-insensitive recognition.) The last two rows are failure cases of our method, mainly
due to near edit distance words (like center and centers) or high degradation in the word image.
Method SVT-WORD ICDAR(50)
Profile features + DTW [16] 38.02 55.39
Gradient based features + wDTW 75.43 87.25
NL + Gradient based features + wDTW 77.28 89.69
TABLE I. Feature Comparison: We observe that gradient based features
outperform profile features for the holistic word recognition task. This is
primarily due to the robustness of gradient features in dealing with blur,
noise, large intra-class variations. Non-local (NL) means filtering of scene
text images further improves recognition performance.
B. Implementation Details
Synthetic Word Generation. For every lexicon word we
generated synthetic words with 20 different styles and fonts
using ImageMagick (www.imagemagick.org). We chose some of the most commonly
occurring fonts, such as Arial, Times, Georgia. Our obser-
vations suggest that font selection is not a very crucial step
for the overall performance of our method. A five pixel-width
padding was done for all the images. We noted that all the
lexicon words were in uppercase, and that the scene text may
contain lowercase letters. To account for these variations, we
also generated word images where (i) only the first character
is in upper case; and (ii) all characters are in lower case.
This results in 3 × lexicon size × 20 images in the synthetic
database. For the SVT dataset, the synthetic dataset contains
around 3000 images.
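
A sketch of this generation step, shelling out from Python to ImageMagick's label: renderer; the font names, point size, and file naming here are illustrative assumptions rather than the paper's exact settings.

```python
import subprocess

FONTS = ["Arial", "Times-New-Roman", "Georgia"]   # a subset, for illustration

def render_lexicon(lexicon, out_dir="synth"):
    for word in lexicon:
        # three casings per word: UPPER, Title, lower
        for variant in (word.upper(), word.capitalize(), word.lower()):
            for font in FONTS:
                out = f"{out_dir}/{variant}_{font}.png"
                subprocess.run(
                    ["convert", "-background", "white", "-fill", "black",
                     "-font", font, "-pointsize", "64", f"label:{variant}",
                     "-bordercolor", "white", "-border", "5",  # 5-pixel padding
                     out],
                    check=True)
```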
Preprocessing. Prior to feature computation, we resized all
the word images to a width of 300 pixels, with the respective
aspect ratio. We then applied the popular non-local means filter
smoothing on scene text images. We also remove stray
edge pixels fewer than 20 in number. Empirically, we did not
find this filtering step to be very critical in our approach.
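
The preprocessing, sketched with OpenCV equivalents; the non-local means filter strength and the use of connected components for the stray-edge removal are assumptions.

```python
import cv2

def preprocess(gray, width=300):
    """Resize to a fixed width (preserving aspect ratio), denoise, and
    drop small stray edge components before feature computation."""
    ht, wd = gray.shape
    gray = cv2.resize(gray, (width, int(ht * width / wd)))
    gray = cv2.fastNlMeansDenoising(gray, h=10)    # non-local means smoothing
    edges = cv2.Canny(gray, 100, 200)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(edges)
    for c in range(1, n):                          # component 0 is background
        if stats[c, cv2.CC_STAT_AREA] < 20:        # stray edges, < 20 pixels
            edges[labels == c] = 0
    return gray, edges
```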
Features. We used vertical strips of width 4 pixels and a
2-pixel horizontal shift to extract the histogram of gradient
orientation features. We computed signed gradient orienta-
tion in this step. Each vertical strip was represented with a
histogram of 9 bins. We evaluated the performance of these
features in Table I, in comparison with that of profile features
used in [16]. Profile features consist of: (1) projection profile,
which counts the number of black pixels in each column.
(2) upper and lower profile, which measures the number of
background pixels between the word and the word-boundary.
(3) transition profile, which is calculated as the number of text-
background transitions per column. We used the binarization
method in [10] prior to computing the profile features. Profile
features have shown noteworthy performance on tasks such as
handwritten and printed word spotting, but fail to cope with
the additional complexities in scene text (e.g., low contrast,
noise, blur, large intra-class variations). In fact, our results show
that gradient features substantially outperform profile based
features for scene text recognition.
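
For reference, the profile features of [16] used in this comparison are straightforward to compute from a binarized word image; a sketch, with text pixels as 1 and background as 0:

```python
import numpy as np

def profile_features(binary):
    """binary: (H, W) word image, text = 1, background = 0."""
    b = binary.astype(int)
    H = b.shape[0]
    proj = b.sum(axis=0)                                  # projection profile
    ink = b.any(axis=0)
    upper = np.where(ink, b.argmax(axis=0), H)            # background rows above the word
    lower = np.where(ink, b[::-1].argmax(axis=0), H)      # background rows below the word
    trans = np.abs(np.diff(b, axis=0)).sum(axis=0)        # text/background transitions
    return np.stack([proj, upper, lower, trans], axis=1)  # (W, 4) per-column features
```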
Weighted Dynamic Time Warping. In our experiments we
used 30 clusters to compute the weights. Our analysis com-
paring various methods is shown in Table I. We observe that
with wDTW, we achieve a high recognition accuracy on both
the datasets.
Dynamic k-Nearest Neighbor. Given a scene text image to
recognize, we retrieve word images from a database of synthetic
words. The retrieval is ranked based on similarity score. In

Fig. 5. A few images from the ICDAR 2003 dataset where our method fails.
This may be addressed by including more variations in our synthetic image
database.
Method SVT-WORD ICDAR(50)
ABBYY [3] 35 56
Wang et al. [18] 56 72
Wang et al. [19] 70 90
Novikova et al. [14] 72 82
Mishra et al. [12] 73 82
This work 77.28 89.69
TABLE II. Cropped Word Recognition Accuracy (in %): We show a
comparison of the proposed method to the popular commercial OCR system
ABBYY and many recent methods. We achieve a significant improvement
over previous works on SVT and ICDAR.
other words, synthetic words more similar to the scene text
word get a higher rank. We use dynamic k-NN with an initial
value of k = 3 for all the experiments.
We estimate all the parameters on the train sets of respec-
tive datasets. Code for synthetic image generation and feature
computation will be made available on our project page
(cvit.iiit.ac.in/projects/SceneTextUnderstanding/).
C. Comparison with Previous Work
We retrieve synthetic word images corresponding to lexicon
words and use dynamic k-NN to assign a text label to a
given scene text image. We compared our method with the
most recent previous works related to this task, and also the
commercial OCR ABBYY in Table II. From the results, we
see that the proposed holistic word matching based scheme
outperforms not only our earlier work [12], but also many
recent works such as [14], [18], [19] on the SVT dataset. On the
ICDAR dataset, we perform better than almost all the methods,
except [19]. This marginally inferior performance (of about
0.3%) is mainly because our synthetic database fails to model a
few of the fonts in the ICDAR dataset (Fig. 5). These types of fonts
are rare in the street view images. A specific preprocessing or
more variations in the synthetic dataset may be needed to deal
with such fonts. Fig. 4 shows the qualitative performance of
the proposed method on sample images. We observe that the
proposed method is robust to noise, blur, low contrast and
background variations.
In addition to being simple, our method significantly improves
the prior art. This gain in accuracy can be attributed to the
robustness of our method, which (i) does not rely on character
segmentation but instead performs holistic word recognition; and
(ii) learns the discriminativeness of features in a principled way
and uses this information for robust matching using wDTW.
IV. CONCLUSION
In summary, we proposed an effective method to recognize
scene text. Our method neither requires character segmentation
nor relies on binarization, but instead performs holistic word
recognition. We show a significantly improved performance
over the most recent works from 2011 and 2012. We thus
establish a new state-of-the-art on lexicon-driven scene text
recognition. The robustness of our word matching approach
shows that a natural extension of this work can be in the
direction of “text to scene image” retrieval. As a part of future
work, we would explore the benefits of introducing hidden
Markov models for this problem.
Acknowledgments. This work is partly supported by MCIT,
New Delhi. Anand Mishra is supported by the Microsoft Research
India PhD fellowship 2012 award. Karteek Alahari is partly
supported by the Quaero programme funded by the OSEO.
REFERENCES
[1] Street View Text dataset, http://vision.ucsd.edu/~kai/svt/.
[2] Robust word recognition dataset, http://algoval.essex.ac.uk/icdar/RobustWord.html.
[3] ABBYY Finereader 9.0. http://www.abbyy.com/.
[4] J. Almazán, A. Gordo, A. Fornés, and E. Valveny. Efficient exemplar
word spotting. In BMVC, 2012.
[5] X. Chen and A. L. Yuille. Detecting and reading text in natural scenes.
In CVPR, 2004.
[6] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang,
D. J. Wu, and A. Y. Ng. Text detection and character recognition in
scene images with unsupervised feature learning. In ICDAR, 2011.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human
detection. In CVPR, 2005.
[8] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural scenes
with stroke width transform. In CVPR, 2010.
[9] D. Kumar, M. N. A. Prasad, and A. G. Ramakrishnan. Maps: midline
analysis and propagation of segmentation. In ICVGIP, 2012.
[10] A. Mishra, K. Alahari, and C. V. Jawahar. An MRF model for
binarization of natural scene text. In ICDAR, 2011.
[11] A. Mishra, K. Alahari, and C. V. Jawahar. Scene text recognition using
higher order language priors. In BMVC, 2012.
[12] A. Mishra, K. Alahari, and C. V. Jawahar. Top-down and bottom-up
cues for scene text recognition. In CVPR, 2012.
[13] L. Neumann and J. Matas. Real-time scene text localization and
recognition. In CVPR, 2012.
[14] T. Novikova, O. Barinova, P. Kohli, and V. S. Lempitsky. Large-lexicon
attribute-consistent text recognition in natural images. In ECCV, 2012.
[15] T. Q. Phan, P. Shivakumara, B. Su, and C. L. Tan. A gradient vector
flow-based method for video character segmentation. In ICDAR, 2011.
[16] T. M. Rath and R. Manmatha. Word image matching using dynamic
time warping. In CVPR, 2003.
[17] D. Sankoff and J. Kruskal. Time warps, string edits, and macro-
molecules: the theory and practice of sequence comparison. Addison-
Wesley, 1983.
[18] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text
recognition. In ICCV, 2011.
[19] T. Wang, D. Wu, A. Coates, and A. Ng. End-to-end text recognition
with convolutional neural networks. In ICPR, 2012.
[20] J. J. Weinman, E. G. Learned-Miller, and A. R. Hanson. Scene
text recognition using similarity and a lexicon with sparse belief
propagation. PAMI, 2009.
Citations
Journal ArticleDOI
TL;DR: A novel neural network architecture which integrates feature extraction, sequence modeling and transcription into a unified framework is proposed, achieving remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks.
Abstract: Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.

2,184 citations

Journal ArticleDOI
TL;DR: An end-to-end system for text spotting—localising and recognising text in natural scene images—and text based image retrieval and a real-world application to allow thousands of hours of news footage to be instantly searchable via a text query is demonstrated.
Abstract: In this work we present an end-to-end system for text spotting--localising and recognising text in natural scene images--and text based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at the same time, departing from the character classifier based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.

1,054 citations


Cites background or methods from "Whole is Greater than Sum of Parts:..."

  • ...[25] use whole word sub-image features to recognise words by comparing to simple black-and-white font-renderings of lexicon words....

  • ...For scene text recognition, methods can be split into two groups – character based recognition [5, 7, 31, 48, 49, 56–59] and whole word based recognition [4, 25, 30, 39, 45, 51]....

Posted Content
TL;DR: This work presents a framework for the recognition of natural scene text that does not require any human-labelled data, and performs word recognition on the whole image holistically, departing from the character based recognition systems of the past.
Abstract: In this work we present a framework for the recognition of natural scene text. Our framework does not require any human-labelled data, and performs word recognition on the whole image holistically, departing from the character based recognition systems of the past. The deep neural network models at the centre of this framework are trained solely on data produced by a synthetic text generation engine -- synthetic data that is highly realistic and sufficient to replace real data, giving us infinite amounts of training data. This excess of data exposes new possibilities for word recognition models, and here we consider three models, each one "reading" words in a different way: via 90k-way dictionary encoding, character sequence encoding, and bag-of-N-grams encoding. In the scenarios of language based and completely unconstrained text recognition we greatly improve upon state-of-the-art performance on standard datasets, using our fast, simple machinery and requiring zero data-acquisition costs.

875 citations


Cites background or methods from "Whole is Greater than Sum of Parts:..."

  • ...In contrast to these approaches based on character classification, the work by [7, 17, 21, 24] instead uses the notion of holistic word recognition....

  • ...[7] use whole word-image features to recognize words by comparing to simple black-and-white font-renderings of lexicon words....

Journal ArticleDOI
TL;DR: This review provides a fundamental comparison and analysis of the remaining problems in the field and summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems.
Abstract: This paper analyzes, compares, and contrasts technical challenges, methods, and the performance of text detection and recognition research in color imagery. It summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems. Existing techniques are categorized as either stepwise or integrated, and sub-problems are highlighted including text localization, verification, segmentation and recognition. Special issues associated with the enhancement of degraded text and the processing of video text, multi-oriented, perspectively distorted and multilingual text are also addressed. The categories and sub-categories of text are illustrated, benchmark datasets are enumerated, and the performance of the most representative approaches is compared. This review provides a fundamental comparison and analysis of the remaining problems in the field.

709 citations


Cites background from "Whole is Greater than Sum of Parts:..."

  • ...Goel [179] 2013 Holistic recognition by gradient based features and dynamic matching ICDAR’03 50 0....

  • ...The motivation of word spotting is that ”the whole is greater than the sum of parts”, and the task looks to match specific words in a given lexicon with image patches using character and word models [118], [179]....

Book ChapterDOI
06 Sep 2014
TL;DR: A Convolutional Neural Network classifier is developed that can be used for text spotting in natural images and a method of automated data mining of Flickr, that generates word and character level annotations is used to form an end-to-end, state-of-the-art text spotting system.
Abstract: The goal of this work is text spotting in natural images. This is divided into two sequential tasks: detecting words regions in the image, and recognizing the words within these regions. We make the following contributions: first, we develop a Convolutional Neural Network (CNN) classifier that can be used for both tasks. The CNN has a novel architecture that enables efficient feature sharing (by using a number of layers in common) for text detection, character case-sensitive and insensitive classification, and bigram classification. It exceeds the state-of-the-art performance for all of these. Second, we make a number of technical changes over the traditional CNN architectures, including no downsampling for a per-pixel sliding window, and multi-mode learning with a mixture of linear models (maxout). Third, we have a method of automated data mining of Flickr, that generates word and character level annotations. Finally, these components are used together to form an end-to-end, state-of-the-art text spotting system. We evaluate the text-spotting system on two standard benchmarks, the ICDAR Robust Reading data set and the Street View Text data set, and demonstrate improvements over the state-of-the-art on multiple measures.

681 citations


Cites background from "Whole is Greater than Sum of Parts:..."

  • ...3% off state-of-the-art, improves on the next best result [19] by 8....

References
Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"Whole is Greater than Sum of Parts:..." refers methods in this paper

  • ...Inspired by the success of Histogram of Oriented Gradient (HOG) features [7] in many vision tasks, we adapted them to the word recognition problem....

Book
01 Aug 1983
D. Sankoff and J. Kruskal. Time warps, string edits, and macromolecules: the theory and practice of sequence comparison. Addison-Wesley, 1983.

1,669 citations

Proceedings ArticleDOI
13 Jun 2010
TL;DR: A novel image operator is presented that seeks to find the value of stroke width for each image pixel, and its use on the task of text detection in natural images is demonstrated.
Abstract: We present a novel image operator that seeks to find the value of stroke width for each image pixel, and demonstrate its use on the task of text detection in natural images. The suggested operator is local and data dependent, which makes it fast and robust enough to eliminate the need for multi-scale computation or scanning windows. Extensive testing shows that the suggested scheme outperforms the latest published algorithms. Its simplicity allows the algorithm to detect texts in many fonts and languages.

1,531 citations


"Whole is Greater than Sum of Parts:..." refers background in this paper

  • ...Due to recent works [5], [8], [13], text detection accuracies have significantly improved....

Proceedings ArticleDOI
06 Nov 2011
TL;DR: While scene text recognition has generally been treated with highly domain-specific methods, the results demonstrate the suitability of applying generic computer vision methods.
Abstract: This paper focuses on the problem of word detection and recognition in natural images. The problem is significantly more challenging than reading text in scanned documents, and has only recently gained attention from the computer vision community. Sub-components of the problem, such as text detection and cropped image word recognition, have been studied in isolation [7, 4, 20]. However, what is unclear is how these recent approaches contribute to solving the end-to-end problem of word recognition. We fill this gap by constructing and evaluating two systems. The first, representing the de facto state-of-the-art, is a two stage pipeline consisting of text detection followed by a leading OCR engine. The second is a system rooted in generic object recognition, an extension of our previous work in [20]. We show that the latter approach achieves superior performance. While scene text recognition has generally been treated with highly domain-specific methods, our results demonstrate the suitability of applying generic computer vision methods. Adopting this approach opens the door for real world scene text recognition to benefit from the rapid advances that have been taking place in object recognition.

1,074 citations


"Whole is Greater than Sum of Parts:..." refers background or methods in this paper

  • ...(Note that following the experimental protocol of [18], we do case-insensitive recognition)....

  • ...From the results, we see that the proposed holistic word matching based scheme outperforms not only our earlier work [12], but also many recent works as [14], [18], [19] on the SVT dataset....

  • ...In [18], each word in the lexicon is matched to the detected set of character windows, and the one with the highest score is reported as the predicted word....

  • ...The problem of recognizing words has been looked at in two broads contexts – with and without the use of a lexicon [11], [12], [18], [20]....

  • ...Following the protocol of [18], we ignore words with less than two characters or with non-alphanumeric characters, which results in 863 words overall....

Proceedings Article
01 Nov 2012
TL;DR: This paper combines the representational power of large, multilayer neural networks together with recent developments in unsupervised feature learning, which allows them to use a common framework to train highly-accurate text detector and character recognizer modules.
Abstract: Full end-to-end text recognition in natural images is a challenging problem that has received much attention recently. Traditional systems in this area have relied on elaborate models incorporating carefully hand-engineered features or large amounts of prior knowledge. In this paper, we take a different route and combine the representational power of large, multilayer neural networks together with recent developments in unsupervised feature learning, which allows us to use a common framework to train highly-accurate text detector and character recognizer modules. Then, using only simple off-the-shelf methods, we integrate these two modules into a full end-to-end, lexicon-driven, scene text recognition system that achieves state-of-the-art performance on standard benchmarks, namely Street View Text and ICDAR 2003.

900 citations


"Whole is Greater than Sum of Parts:..." refers background or methods in this paper

  • ...From the results, we see that the proposed holistic word matching based scheme outperforms not only our earlier work [12], but also many recent works as [14], [18], [19] on the SVT dataset....

  • ...In this section we present implementation details of our approach, and its detailed evaluation, and compare it with the best performing methods for this task namely [12], [14], [18], [19]....

  • ...In addition to being simple, the proposed method improves the accuracy by more than 5% over recent works [12], [18], [19]....

  • ...The document image analysis community has shown a huge interest in the problem of scene text understanding in recent years [6], [15], [19]....

  • ...On the ICDAR dataset, we perform better than almost all the methods, except [19]....

Frequently Asked Questions (17)
Q1. What are the contributions in "Whole is greater than sum of parts: recognizing scene text words" ?

The authors take a different view of the problem and present a holistic word recognition framework. In this, the authors first represent the scene text image and synthetic images generated from lexicon words using gradient-based features. The authors then recognize the text in the image by matching the scene and synthetic image features with their novel weighted Dynamic Time Warping ( wDTW ) approach. The authors perform experimental analysis on challenging public datasets, such as Street View Text and ICDAR 2003. 

As a part of future work, the authors would explore the benefits of introducing hidden Markov models for this problem.

In a maximum likelihood framework, the problem of finding an optimal feature sequence Y for a given feature sequence X is equivalent to maximizing ∏i P (xi, yi|ωk) over all possible Y s.

The authors used vertical strips of width 4 pixels and a 2-pixel horizontal shift to extract the histogram of gradient orientation features. 

For the experimental analysis the authors used two datasets, namely Street View Text (SVT) [1] and ICDAR 2003 robust word recognition [2]. 

In other words, given a feature sequence X and a set of candidate sequences Y s, the problem of finding the optimal matching sequence becomes that of minimizing f over all candidate sequences Y.

Since the authors assume the features at each vertical strip are independent, the joint probability that the feature sequences X and Y originate from the same word ωk, i.e. P (X, Y |ωk), can be written as the product of the joint probabilities of features originating from the same strip, i.e., P (X, Y |ωk) = ∏i P (xi, yi|ωk).

To give a high penalty to those warping paths which deviate from the near-diagonal paths, the authors multiply them with a penalty function log10 (wp − wpo), where wp and wpo are the warping path of the DTW matching and the diagonal warping path respectively.

Profile features have shown noteworthy performance on tasks such as handwritten and printed word spotting, but fail to cope with the additional complexities in scene text (e.g., low contrast, noise, blur, large intra-class variations). 

Given a scene text and a ranked list of matched synthetic words (each corresponding to one of the lexicon words), their goal is to find the text label. 

This gain in accuracy can be attributed to the robustness of their method, which (i) does not rely on character segmentation but instead performs holistic word recognition; and (ii) learns the discriminativeness of features in a principled way and uses this information for robust matching using wDTW.

Inspired by the success of Histogram of Oriented Gradient (HOG) features [7] in many vision tasks, the authors adapted them to the word recognition problem. 

High entropy of a cluster indicates that the features corresponding to that cluster are almost equally distributed in all the word classes. 

Following the protocol of [18], the authors ignore words with less than two characters or with non-alphanumeric characters, which results in 863 words overall. 

Let X = {x1, x2, . . . , xm} and Y = {y1, y2, . . . , ym} be the feature sequences from a given word and its candidate match respectively. 

In summary, given a scene text word and a set of lexicon words, the authors transform each lexicon into a collection of synthetic images, and then represent each image as a sequence of features. 

To this end, the authors present a gradient based feature set, and then a weighted Dynamic Time Warping scheme in the remainder of this section.