Proceedings ArticleDOI

Top-down and bottom-up cues for scene text recognition

TL;DR: This work presents a framework that exploits both bottom-up and top-down cues in the problem of recognizing text extracted from street images, and shows significant improvements in accuracies on two challenging public datasets, namely Street View Text and ICDAR 2003.
Abstract: Scene text recognition has gained significant attention from the computer vision community in recent years. Recognizing such text is a challenging problem, even more so than the recognition of scanned documents. In this work, we focus on the problem of recognizing text extracted from street images. We present a framework that exploits both bottom-up and top-down cues. The bottom-up cues are derived from individual character detections from the image. We build a Conditional Random Field model on these detections to jointly model the strength of the detections and the interactions between them. We impose top-down cues obtained from a lexicon-based prior, i.e. language statistics, on the model. The optimal word represented by the text image is obtained by minimizing the energy function corresponding to the random field model. We show significant improvements in accuracies on two challenging public datasets, namely Street View Text (over 15%) and ICDAR 2003 (nearly 10%).

Summary (3 min read)

1. Introduction

  • The problem of understanding scenes semantically has been one of the challenging goals in computer vision for many decades.
  • Popular recognition methods ignore the text, and identify other objects such as car, person, and tree, and regions such as road and sky.
  • The probabilistic approach the authors propose in this paper achieves an accuracy of over 73% under identical experimental settings.
  • The authors build a Conditional Random Field (CRF) model [21] on these detections to determine not only the true positive detections, but also what word they represent jointly.

2.1. Sliding Window Detection

  • Sliding window based detectors have been very successful for challenging tasks, such as face [28] and pedestrian [8] detection.
  • In Figure 3(a), the window containing parts of the two 'o' characters can be confused with 'x'.
  • Let φ_i denote the features extracted from a window location l_i.
  • This basic sliding window detection approach produces many potential character windows, but not all of them are useful for recognizing words.

2.2. Pruning Windows

  • For every potential character window, the authors compute a score based on: (i) classifier confidence; and (ii) a measure of the aspect ratio of the character detected and the aspect ratio learnt for that character from training data.
  • The mean aspect ratio (computed from training data) for the character c_j is denoted by μ_aj.
  • A low goodness score indicates a weak detection, which is removed from the set of candidate character windows.
  • The authors select detections which have a high confidence score, and do not overlap significantly with any of the other stronger detections.
  • The authors believe that this bottom-up approach alone cannot address all the issues related to detecting characters.

3. Recognizing Words

  • The character detection step provides us with a large set of windows potentially containing characters within them.
  • The authors' goal is to find the most likely word from this set of characters.
  • The authors formulate this problem in an energy minimization framework, where the best energy solution represents the ground truth word they aim to find.

3.1. The Word Model

  • Note that the set of random variables includes windows representing not only true positive detections, but also many false positive detections, which must be suppressed.
  • The authors assume here that the windows have been pruned based on the aspect ratio of character windows.
  • The TRW-S algorithm maximizes a concave lower bound on the energy.
  • These distributions are then used to reweight the messages being passed during loopy BP [24] on each tree.

3.2. Computing the Lexicon Prior

  • Such language models are frequently used in speech recognition, machine translation, and to some extent in OCR systems [27].
  • The authors explore two types of lexicon priors for the word recognition problem.
  • Let P(c_i, c_j) denote the probability of occurrence of a character pair (c_i, c_j) in the lexicon.
  • When the lexicon increases in size, the bi-gram model loses its effectiveness.
  • The node-specific pairwise cost for the character pair (P,R) to occur at the beginning of the word is higher than for it to occur at the end of the word.

4. Experiments

  • Given a word image extracted from a street scene and a lexicon, their problem is to find all the characters, and also to recognize the word as a whole.
  • The authors evaluate various components of the proposed approach to justify their choices.
  • The authors compare their results with two of the best performing methods [29, 30] for this task.

4.1. Datasets

  • The authors used the Street View Text (SVT) [30] and the ICDAR 2003 robust word recognition [1] datasets in their experiments.
  • To maintain identical experimental settings to those in [29], the authors use the lexica provided by them for these two datasets.
  • The dataset is divided into SVT-SPOT and SVT-WORD, meant for the tasks of locating words and recognizing words respectively.
  • Since the authors focus on the word recognition task, they used the SVT-WORD dataset, which contains 647 word images.

4.2. Character Detection

  • Sliding window based character detection is an important component of their framework, as their random field model is defined on the detections obtained.
  • The authors use the intersection over union measure [1, 12] thresholded at 90% to determine whether a detection has been retrieved or not.
  • The authors used the goodness score measure in (1), and discarded all windows with a score less than 0.1.
  • Table 1 summarizes an evaluation of the quality of their sliding window approach for the SVT-CHAR dataset.
  • Note that more than 97% of the characters are detected.

4.3. Cropped Word Recognition

  • The authors use the detections obtained to build the CRF model as discussed in Section 3.1.
  • The authors add one node for every detection window, and connect it to other windows based on its spatial distance and overlap.
  • The authors set the lexicon prior parameter in (7) and (8) to λ_l = 2 for all their experiments.
  • The overlap penalty parameter in (5) and (6) is set to λ_o = 1, empirically, for all their experiments.
  • Figure 5 shows a few challenging character examples the authors missed in the sliding window stage.

4.4. Results and Discussion

  • Similar to the evaluation scheme in [29], the authors use the inferred result to retrieve the word with the smallest edit distance in the lexicon (a minimal sketch of this retrieval step follows this list).
  • The best known results on the SVT and ICDAR datasets are reported in [29], where the authors used a pictorial structures model for recognizing words.
  • In their evaluation, the authors found the main causes of such failures to be: (i) a weak unary term; and (ii) missing characters in the detection stage itself.
  • The authors' method is inspired by the work of [29] and [10], but differs from them in many aspects as detailed below.
  • In contrast, the authors build a model from the entire lexicon (top-down cues), combine it with all the character detections (bottom-up cues), whether their scores are low or high, and infer the word with a joint inference scheme.
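To make the retrieval step concrete, here is a minimal sketch of edit-distance-based lexicon matching. The function names and the toy lexicon are illustrative, not from the paper.

```python
def recognize(inferred, lexicon):
    """Return the lexicon word with the smallest edit distance to the
    string read off the CRF labelling (the evaluation scheme of [29])."""
    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]
    return min(lexicon, key=lambda word: edit_distance(inferred, word))

print(recognize("CVPB", ["CVPR", "ICPR"]))   # -> CVPR
```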

5. Conclusion

  • The authors model combines bottom-up cues from character detections and top-down cues from lexica.
  • The authors infer the location of true characters and the word they represent as a whole jointly.
  • The authors showed results on two of the most challenging scene text databases, and significantly improved on the best results published at ICCV 2011 [29].
  • The authors' results show that scene text can be recognized with a reasonably high accuracy in natural, unconstrained images.
  • This could help in building vision systems which can solve higher-level semantic tasks, such as those in [15, 19].


Top-Down and Bottom-up Cues for Scene Text Recognition

Anand Mishra¹   Karteek Alahari²   C. V. Jawahar¹
¹CVIT, IIIT Hyderabad, India
²INRIA - WILLOW / École Normale Supérieure, Paris, France
Abstract

Scene text recognition has gained significant attention from the computer vision community in recent years. Recognizing such text is a challenging problem, even more so than the recognition of scanned documents. In this work, we focus on the problem of recognizing text extracted from street images. We present a framework that exploits both bottom-up and top-down cues. The bottom-up cues are derived from individual character detections from the image. We build a Conditional Random Field model on these detections to jointly model the strength of the detections and the interactions between them. We impose top-down cues obtained from a lexicon-based prior, i.e. language statistics, on the model. The optimal word represented by the text image is obtained by minimizing the energy function corresponding to the random field model.

We show significant improvements in accuracies on two challenging public datasets, namely Street View Text (over 15%) and ICDAR 2003 (nearly 10%).
1. Introduction

The problem of understanding scenes semantically has been one of the challenging goals in computer vision for many decades. It has gained considerable attention over the past few years, in particular, in the context of street scenes [3, 20]. This problem has manifested itself in various forms, namely, object detection [10, 13], object recognition and segmentation [22, 25]. There have also been significant attempts at addressing all these tasks jointly [14, 16, 20]. Although these approaches interpret most of the scene successfully, regions containing text tend to be ignored. As an example, consider an image of a typical street scene taken from Google Street View in Figure 1. One of the first things we notice in this scene is the sign board and the text it contains. However, popular recognition methods ignore the text, and identify other objects such as car, person, tree, and regions such as road, sky. The importance of text in images is also highlighted in the experimental study conducted by Judd et al. [17]. They found that viewers fixate on text when shown images containing text and other objects. This is further evidence that text recognition forms a useful component of the scene understanding problem.

Figure 1: A typical street scene image taken from Google Street View [29]. It contains very prominent sign boards (with text) on the building and its windows. It also contains objects such as car, person, tree, and regions such as road, sky. Many scene understanding methods recognize these objects and regions in the image successfully, but tend to ignore the text on the sign board, which contains rich, useful information. Our goal is to fill in this gap in understanding the scene.
Given the rapid growth of camera-based applications readily available on mobile phones, understanding scene text is more important than ever. One could, for instance, foresee an application to answer questions such as, "What does this sign say?". This is related to the problem of Optical Character Recognition (OCR), which has a long history in the computer vision community. However, the success of OCR systems is largely restricted to text from scanned documents. Scene text exhibits a large variability in appearances, as shown in Figures 1 and 2, and can prove to be challenging even for the state-of-the-art OCR methods.
A few recent works have explored the problem of detecting and/or recognizing text in scenes [4, 6, 7, 11, 23, 26, 29, 30, 31]. Chen and Yuille [6] and later, Epshtein et al. [11] have addressed the problem of detecting text in natural scenes. These two methods achieve significant detection results, but rely on an off-the-shelf OCR for subsequent recognition. Thus, they are not directly applicable for the challenging datasets we consider. De Campos et al. [9] proposed a method for recognizing cropped scene text characters. Although character recognition forms an essential component of text understanding, extending this framework to recognize words is not trivial. Weinman et al. [31] and Smith et al. [26] showed impressive scene text recognition results using similarity constraints and language statistics, but on a simpler dataset. It consists of "roughly fronto-parallel" pictures of signs [31], which are quite similar to those found in a traditional OCR setting. In contrast, we show results on a more challenging street view dataset [29], where the words vary in appearance significantly. Furthermore, we evaluate our approach on over 1000 words compared to 215 words in [26, 31].

Figure 2: Scene text often contains examples that have a large variety of appearances. Here we show a few sample images from the SVT [30] and ICDAR [1] datasets, with issues such as very different fonts, shadows, low resolution, and occlusions. These images are much more complex than the ones seen in typical OCR datasets. Standard off-the-shelf OCRs perform very poorly on these datasets [23, 29].
The proposed approach is more closely related to those in [23, 29, 30], which address the problem of simultaneously localizing and recognizing words. On one hand, these methods localize text with a significant accuracy, but on the other hand, their recognition results leave a lot to be desired. Since words in the scene text dataset have been localized with a good accuracy, we focus on the problem of recognizing words, given their location. This is commonly referred to as the cropped word recognition problem. Note that the challenges of this task are evident from the best published accuracy of only 56% on the scene text dataset [29]. The probabilistic approach we propose in this paper achieves an accuracy of over 73% under identical experimental settings.

Our method is inspired by the many advancements made in the object detection and recognition problems [8, 10, 13, 25]. We present a framework that exploits both bottom-up and top-down cues. The bottom-up cues are derived from individual character detections from the image. Naturally, these windows contain true as well as false positive detections of characters. We build a Conditional Random Field (CRF) model [21] on these detections to determine not only the true positive detections, but also what word they represent jointly. We impose top-down cues obtained from a lexicon-based prior, i.e. language statistics, on the model. In addition to disambiguating between characters, this prior also helps us in recognizing words.

The remainder of the paper is organized as follows. In Section 2 we present our character detection method. Our framework to build the random field model with a top-down lexicon-based prior on these detections is described in Section 3. We provide results on two public datasets and compare our method to related work in Section 4. Implementation details are also given in this section. We then make concluding remarks in Section 5.
2. Character Detection

The first step in our approach is to detect potential locations of characters in a word image. We propose a sliding window based approach to achieve this.

2.1. Sliding Window Detection

Sliding window based detectors have been very successful for challenging tasks, such as face [28] and pedestrian [8] detection. Although character detection is similar to such problems, it has its unique challenges. Firstly, there is the issue of dealing with a large number of categories (62 in all). Secondly, there is a large amount of inter-character and intra-character confusion, as illustrated in Figure 3. When a window contains parts of two characters next to each other, it may have a very similar appearance to another character. In Figure 3(a), the window containing parts of the two 'o' characters can be confused with 'x'. Furthermore, a part of one character can have the same appearance as that of another. In Figure 3(b), a part of the character 'B' can be confused with 'E'. We have adopted an additional pruning stage to overcome some of these issues.
We consider windows at multiple scales and spatial locations. The location of the i-th window, l_i, is given by its center and size. The set K = {c_1, c_2, ..., c_k} denotes the set of character classes in the dataset, e.g. k = 62 for English characters and digits. Let φ_i denote the features extracted from a window location l_i. Given the window l_i, we compute the likelihood, p(c_i | φ_i), of it taking a label c_i for all the classes in K. In our implementation, we used Histogram of Gradient (HOG) features [8] for φ_i, and the likelihoods p(·) were learnt using a multi-class Support Vector Machine (SVM) [5]. Details of the training procedure are provided in Section 4.2.

Figure 3: Typical challenges in multi-class character detection. (a) Inter-character confusion: A window containing parts of the two o's is falsely detected as x. (b) Intra-character confusion: A window containing a part of the character B is recognized as E.
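To illustrate this detection step, here is a minimal sketch that scans windows at multiple scales and scores each one with a trained classifier. It is not the authors' code: the scales, step size, default aspect ratio, and the scikit-learn-style predict_proba interface are all assumptions.

```python
def sliding_windows(img_w, img_h, heights=(16, 24, 32), step=4, aspect=0.7):
    """Enumerate candidate window locations l_i at multiple scales.
    Yields (x, y, w, h) boxes; the scales, step, and width/height ratio
    are illustrative choices, not values from the paper."""
    for h in heights:
        w = max(1, int(round(aspect * h)))
        for y in range(0, img_h - h + 1, step):
            for x in range(0, img_w - w + 1, step):
                yield (x, y, w, h)

def score_windows(image, classifier, feature_fn):
    """For each window l_i, compute the likelihoods p(c | phi_i) over all
    character classes (k = 62 in the paper). Assumes a trained classifier
    exposing a scikit-learn-style predict_proba, and a feature function
    mapping an image patch to a descriptor phi_i (e.g. HOG)."""
    img_h, img_w = image.shape[:2]
    detections = []
    for (x, y, w, h) in sliding_windows(img_w, img_h):
        phi = feature_fn(image[y:y + h, x:x + w])
        probs = classifier.predict_proba([phi])[0]
        detections.append({"box": (x, y, w, h), "probs": probs})
    return detections
```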
This basic sliding window detection approach produces
many potential character windows, but not all of them are
useful for recognizing words. We discard some of the weak
detection windows using the following pruning method.
2.2. Pruning Windows

For every potential character window, we compute a score based on: (i) classifier confidence; and (ii) a measure of the aspect ratio of the character detected and the aspect ratio learnt for that character from training data. The intuition behind this score is that a strong character window candidate should have a high classifier confidence score, and must fall within some range of sizes observed in the training data. For a window l_i with an aspect ratio a_i, let c_j denote the character with the best classifier confidence value given by S_ij. The mean aspect ratio (computed from training data) for the character c_j is denoted by μ_aj. We define a goodness score (GS) for the window l_i as:

GS(l_i) = S_ij · exp(−(μ_aj − a_i)² / (2 σ_aj²)),   (1)

where σ_aj² is the variance of the aspect ratio for character class c_j in the training data. Note that the aspect ratio statistics are character-specific. A low goodness score indicates a weak detection, which is removed from the set of candidate character windows.
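A minimal sketch of this pruning step follows, under an assumed detection format (one dict per window with its box and class probabilities); the 0.1 threshold is the value reported in the experiments.

```python
import numpy as np

def goodness_score(svm_score, a_i, mu_a, sigma2_a):
    """Eq. (1): GS(l_i) = S_ij * exp(-(mu_aj - a_i)^2 / (2 * sigma_aj^2))."""
    return svm_score * np.exp(-(mu_a - a_i) ** 2 / (2.0 * sigma2_a))

def prune_by_goodness(detections, aspect_stats, threshold=0.1):
    """Drop weak windows. `aspect_stats[c]` holds the (mean, variance)
    aspect-ratio statistics of class c from training data; the 0.1
    threshold is the value the experiments report."""
    kept = []
    for d in detections:
        x, y, w, h = d["box"]
        a_i = w / float(h)
        c = int(np.argmax(d["probs"]))            # best class c_j for l_i
        mu_a, sigma2_a = aspect_stats[c]
        if goodness_score(d["probs"][c], a_i, mu_a, sigma2_a) >= threshold:
            kept.append(d)
    return kept
```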
We then apply Non-Maximum Suppression (NMS), similar to other sliding window detection methods [13], to address the issue of multiple overlapping detections for each instance of a character. We select detections which have a high confidence score, and do not overlap significantly with any of the other stronger detections. We perform NMS after the aspect ratio pruning because wide windows containing many characters may suppress overlapping single character windows, when they are weaker.
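The NMS step can be sketched as greedy suppression; the 0.5 overlap threshold and the use of the best class probability as window strength are our assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / float(union) if union else 0.0

def non_max_suppression(detections, overlap_thresh=0.5):
    """Greedy NMS: keep the strongest window, then drop any weaker window
    that overlaps a kept one by more than the threshold."""
    ranked = sorted(detections, key=lambda d: max(d["probs"]), reverse=True)
    kept = []
    for d in ranked:
        if all(iou(d["box"], k["box"]) < overlap_thresh for k in kept):
            kept.append(d)
    return kept
```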
We perform both the pruning steps conservatively, and only discard the obvious false detections. We believe that this bottom-up approach alone cannot address all the issues related to detecting characters. Hence, we introduce lexicon-based top-down cues to discard the remaining false positives. We also use these cues to recognize the word as described in the following section.
3. Recognizing Words

The character detection step provides us with a large set of windows potentially containing characters within them. Our goal is to find the most likely word from this set of characters. We formulate this problem in an energy minimization framework, where the best energy solution represents the ground truth word we aim to find.
3.1. The Word Model

Each detection window is represented by a random variable X_i, which takes a label x_i.¹ Let n be the total number of detection windows. Note that the set of random variables includes windows representing not only true positive detections, but also many false positive detections, which must be suppressed. We introduce a null (or void) label ε to account for these false windows. Thus, x_i ∈ K_ε = K ∪ {ε}. The set K_ε^n represents the set of all possible assignments of labels to the random variables. An energy function E : K_ε^n → R maps any labelling to a real number E(·) called its energy or cost. The function E(·) is commonly represented as a sum of unary and pairwise terms as:

E(x) = Σ_{i=1..n} E_i(x_i) + Σ_{(i,j)∈E} E_ij(x_i, x_j),   (2)

where x = {x_i | i = 1, 2, ..., n}, E_i(·) represents the unary term, E_ij(·,·) is the pairwise term, and E represents the set of pairs of interacting detection windows, which is determined by the structure of the underlying graphical model.

Graph construction. We order the windows based on their horizontal location, and add one node each for every window sequentially from left to right. The nodes are then connected by edges. One could make a complete graph by connecting each node to every other node. However, it is not natural for a window on the extreme left to be related to another window on the extreme right, as evident in Figure 4.² Thus, we only connect windows with a significant overlap between them or windows which are close to each other. In the example in Figure 4, we show a few window samples and the edges between them. The intuition behind connecting overlapping or close-proximity windows is that they could represent either overlapping detections of the same character or detections of two separate characters. As we will see later, the edges are used to encode the language model as top-down cues.

¹Our model has similarities to that proposed in [10] for object detection, but the challenges (e.g. inter- and intra-character confusions) are greatly different from those in the object detection problem.
²We assume here that the windows have been pruned based on the aspect ratio of character windows. Without this pruning step, a window may contain multiple characters and will perhaps require a more complete graph.
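For a given labelling, the energy of Eq. (2) is straightforward to evaluate. The sketch below makes this concrete; the data layout (a cost vector per node, a callable for the pairwise term, and the null label stored in the last unary slot) is our convention, not the paper's.

```python
NULL = -1   # index convention for the null label; its unary cost is
            # stored in the last slot of each node's cost vector

def energy(labels, unary, pairwise, edges):
    """Eq. (2): E(x) = sum_i E_i(x_i) + sum_{(i,j) in E} E_ij(x_i, x_j).
    `labels[i]` is a class index or NULL, `unary[i]` a cost vector, and
    `pairwise(i, j, xi, xj)` supplies the terms of Eqs. (5)-(6)."""
    total = sum(unary[i][labels[i]] for i in range(len(labels)))
    total += sum(pairwise(i, j, labels[i], labels[j]) for i, j in edges)
    return total
```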
CRF energy. The unary term E_i(x_i), which denotes the cost of a node x_i taking label c_j ≠ ε, is defined as:

E_i(x_i = c_j) = 1 − p(c_j | x_i),   (3)

where p(c_j | x_i) is the classifier confidence (e.g. SVM score) of character class c_j for node x_i. For the null label ε,

E_i(x_i = ε) = max_j p(c_j | x_i) · exp(−(μ_aj − a_i)² / σ_aj²),   (4)

where a_i is the aspect ratio of the window corresponding to node x_i, c_j is the character detected at node x_i, and μ_aj and σ_aj² are, respectively, the mean and the variance of the aspect ratio of the character detected, learnt from the training data. For a true window, which has a relatively good SVM score and matches the average aspect ratio size, this cost of assigning a null label is high. On the other hand, false windows, which either have poor SVM scores or vary from the average aspect ratio size, or both, will be more likely to take the null label ε.
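A sketch of the unary terms of Eqs. (3) and (4) for one detection window, under the same assumed detection format as before; the null cost is appended as the last slot of the cost vector, matching the NULL = -1 convention above.

```python
import numpy as np

def unary_costs(det, aspect_stats):
    """Unary terms for one window: Eq. (3) for each character label and
    Eq. (4) for the null label, appended as the last slot."""
    probs = np.asarray(det["probs"])
    x, y, w, h = det["box"]
    a_i = w / float(h)
    j = int(np.argmax(probs))                     # detected character c_j
    mu_a, sigma2_a = aspect_stats[j]
    char_costs = 1.0 - probs                      # Eq. (3): 1 - p(c_j | x_i)
    null_cost = probs[j] * np.exp(-(mu_a - a_i) ** 2 / sigma2_a)   # Eq. (4)
    return np.append(char_costs, null_cost)
```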
The pairwise term E_ij(x_i, x_j) is used to encode the top-down cues from the language model in a principled way. The cost of two neighbouring nodes x_i and x_j taking labels c_i ≠ ε and c_j ≠ ε respectively is given by:

E_ij(x_i, x_j) = E^l_ij(x_i, x_j) + λ_o · exp(−ψ(x_i, x_j)).   (5)

Here, ψ(x_i, x_j) = (100 − Overlap(x_i, x_j))² is a measure of the overlap percentage between the two windows X_i and X_j. The function E^l_ij(x_i, x_j) denotes the lexicon prior. The parameter λ_o determines the overlap-based penalty. Computation of the lexicon prior E^l_ij(·,·) is discussed in Section 3.2. The pairwise cost (5) ensures that two windows with sufficiently high overlap cannot take non-null labels, i.e. at least one of them is likely to be a false window. The costs involving the null label ε are computed as:

E_ij(x_i = c_i, x_j = ε) = λ_o · exp(−ψ(x_i, x_j)).   (6)

The pairwise cost E_ij(x_i = ε, x_j = c_j) is defined similarly. Further, E_ij(x_i = ε, x_j = ε) is uniformly set to zero.
Inference. Given these unary and pairwise terms, we minimize the energy function (2). We use the sequential tree-reweighted message passing (TRW-S) algorithm [18] because of its efficiency and accuracy on our recognition problem. The TRW-S algorithm maximizes a concave lower bound on the energy. It begins by considering a set of trees from the random field, and computes probability distributions over each tree. These distributions are then used to reweight the messages being passed during loopy BP [24] on each tree. The algorithm terminates when the lower bound cannot be increased further, or the maximum number of iterations has been reached.

Figure 4: Summary of our approach. We first find a set of potential character windows, shown at the top (only a few are shown here for readability). We then build a random field model on these detection windows by connecting them with edges. The edge weights are computed based on the characteristics of the two windows. Edges shown in green indicate that the two windows they connect have a high probability of occurring together. Edges shown in red connect two windows that are unlikely to be characters following one another. An edge shown in red forces one of the two windows to take the ε label, i.e. removes it from the candidate character set. Based on these edges and the SVM scores for each window, we infer the character classes of each window as well as the word, which is indicated by the green edge path. (Best viewed in colour.)
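The paper relies on TRW-S for inference. As a stand-in that makes the objective concrete, the sketch below minimizes Eq. (2) exhaustively; this is feasible only for a handful of nodes, and a real run would hand the same terms to a TRW-S implementation instead.

```python
from itertools import product

def exact_map(unary, pairwise, edges, num_classes):
    """Exhaustive minimization of Eq. (2) over K union {null}; the null
    label is -1, matching the last-slot convention used earlier."""
    label_set = list(range(num_classes)) + [-1]
    n = len(unary)
    best_energy, best_labels = float("inf"), None
    for labels in product(label_set, repeat=n):
        e = sum(unary[i][labels[i]] for i in range(n))
        e += sum(pairwise(i, j, labels[i], labels[j]) for i, j in edges)
        if e < best_energy:
            best_energy, best_labels = e, labels
    return best_labels, best_energy
```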
3.2. Computing the Lexicon Prior

We use a lexicon to compute the prior E^l_ij(x_i, x_j) in (5). Such language models are frequently used in speech recognition, machine translation, and to some extent in OCR systems [27]. We explore two types of lexicon priors for the word recognition problem.

Bi-gram. Bi-gram based lexicon priors are learnt from joint occurrences of characters in the lexicon. Character pairs which never occur are highly penalized. Let P(c_i, c_j) denote the probability of occurrence of a character pair (c_i, c_j) in the lexicon. The pairwise cost is:

E^l_ij(x_i = c_i, x_j = c_j) = λ_l · (1 − P(c_i, c_j)),   (7)

where the parameter λ_l determines the penalty for a character pair occurring in the lexicon.
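A sketch of learning the bi-gram prior of Eq. (7) from a lexicon; normalising counts over all adjacent pairs is one plausible choice, and λ_l = 2 is the value used in the experiments.

```python
from collections import Counter

def bigram_prior(lexicon, lam_l=2.0):
    """Learn P(c_i, c_j) from adjacent character pairs in the lexicon and
    return the Eq. (7) cost lam_l * (1 - P(c_i, c_j))."""
    pair_counts = Counter()
    for word in lexicon:
        pair_counts.update(zip(word, word[1:]))
    total = float(sum(pair_counts.values())) or 1.0
    def cost(ci, cj):
        return lam_l * (1.0 - pair_counts[(ci, cj)] / total)
    return cost

# Pairs absent from the lexicon pay the full penalty lam_l.
prior = bigram_prior(["CVPR", "ICPR"])
print(prior("P", "R"), prior("Z", "Z"))   # frequent pair vs. full penalty
```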

Node-specific prior. When the lexicon increases in size, the bi-gram model loses its effectiveness. It also fails to capture the location-specific information of pairs of characters. As a toy example, consider a lexicon with only two words, CVPR and ICPR. The node-specific pairwise cost for the character pair (P,R) to occur at the beginning of the word is higher than for it to occur at the end of the word. This useful cue is ignored in the bi-gram prior model.

To overcome this, we divide each lexicon word into n parts, where n is determined based on the number of nodes in the graph and the spatial distance between nodes. We then use only the first 1/n-th of the word for computing the pairwise cost between initial nodes, similarly the next 1/n-th for computing the cost between the next few nodes, and so on. In other words, we do a region of interest (ROI) based search in the lexicon. The ROI is determined based on the spatial position of a detected window in the word, e.g. if two windows are on the left most side, then only the first couple of characters of the lexicon words are considered for calculating the pairwise term between windows.

The pairwise cost using this prior is given by:

E^l_ij(x_i = c_i, x_j = c_j) = 0 if (c_i, c_j) ∈ ROI, and λ_l otherwise.   (8)

We evaluate our approach with both these pairwise terms, and find that the node-specific prior achieves better performance.
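One possible reading of the ROI-based prior of Eq. (8) is sketched below; how a node pair maps to a part of the word, and the substring test over the ROI, are our simplifications of the description above.

```python
def node_specific_prior(lexicon, n_parts, lam_l=2.0):
    """Eq. (8): the pair (c_i, c_j) is free only if it occurs inside the
    region of interest of some lexicon word; otherwise it pays lam_l.
    The caller is assumed to map each node pair to a part index from
    the windows' spatial positions."""
    def cost(ci, cj, part):
        for word in lexicon:
            size = max(1, len(word) // n_parts)
            roi = word[part * size:(part + 1) * size + 1]  # +1 spans the boundary
            if (ci + cj) in roi:
                return 0.0
        return lam_l
    return cost

# Toy check from the text: with lexicon {CVPR, ICPR} and two parts,
# (P, R) is free at the end of the word but penalised at the beginning.
prior = node_specific_prior(["CVPR", "ICPR"], n_parts=2)
print(prior("P", "R", part=0), prior("P", "R", part=1))   # lam_l, then 0.0
```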
4. Experiments

In this section we present a detailed evaluation of our method. Given a word image extracted from a street scene and a lexicon, our problem is to find all the characters, and also to recognize the word as a whole. We evaluate various components of the proposed approach to justify our choices. We compare our results with two of the best performing methods [29, 30] for this task.

4.1. Datasets

We used the Street View Text (SVT) [30]³ and the ICDAR 2003 robust word recognition [1] datasets in our experiments. To maintain identical experimental settings to those in [29], we use the lexica provided by them for these two datasets.

SVT. The Street View Text (SVT)⁴ dataset contains images taken from Google Street View. As noted in [30], most of the images come from business signage and exhibit a high degree of variability in appearance and resolution. The dataset is divided into SVT-SPOT and SVT-WORD, meant for the tasks of locating words and recognizing words respectively. Since, in our work, we focus on the word recognition task, we used the SVT-WORD dataset, which contains 647 word images.

Our basic unit of recognition is a character, which needs to be detected or localized before classification. A miss in the localization will result in poorer word recognition. To improve the robustness of the recognition architecture, we need to quantitatively measure the accuracy of the character detection module. For this purpose, we created ground truth data for characters in the SVT-WORD dataset. Using the ground truth at the character level, we evaluated the performance of the SVM classifier used for this task. Note that no such evaluation has been reported on the SVT dataset as yet. Our ground truth data set contains around 4000 characters of 52 classes overall. We refer to this dataset as SVT-CHAR.⁵

ICDAR 2003 Dataset. The ICDAR 2003 dataset was originally created for cropped character classification, full image text detection, cropped and full image word recognition, and various other tasks in document analysis. We used the part corresponding to cropped image word recognition, called Robust Word Recognition [1]. Similar to [29], we ignore words with less than two characters or with non-alphanumeric characters, which results in 829 words overall. For subsequent discussion we refer to this dataset as ICDAR(50).

³http://vision.ucsd.edu/~kai/svt
⁴Note that this dataset has been slightly updated since its original release in [30]. We use the updated version [29] in our experiments.
4.2. Character Detection

Sliding window based character detection is an important component of our framework, as our random field model is defined on the detections obtained. At every possible location of the sliding window, we test a character classifier. This provides a likelihood of the window containing a certain character. The alphabet of characters recognized consists of 26 lowercase and 26 uppercase letters, and 10 digits.

We evaluated various features for recognizing characters. We observed that the HOG feature [8] outperforms the features reported in [9], which uses a bag-of-words model. This is perhaps due to the lack of geometric information in the model. We computed dense HOG features with a cell size of 4 × 4 using 10 bins, after resizing each image to a 22 × 20 window. We learnt a 1-vs-all SVM classifier with an RBF kernel using these features. We used the standard LIBSVM package [5] for training and testing the SVMs. For the SVT-CHAR evaluation, we trained the model on the ICDAR 2003 dataset due to the relatively small size of SVT-CHAR (around 4000 characters). We observed that the method using HOG descriptors performs significantly better than others, with a classification accuracy of 61.86%.

⁵Available at http://cvit.iiit.ac.in/projects/SceneTextUnderstanding
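The reported configuration can be approximated with common libraries, as in the sketch below. It assumes grayscale character patches; also, scikit-learn's SVC wraps LIBSVM [5] but trains one-vs-one multi-class models, so this is only close to the 1-vs-all setup described.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def char_descriptor(patch):
    """Dense HOG roughly matching the reported setup: each grayscale
    character patch resized to 22 x 20, 4 x 4 cells, 10 orientation
    bins (the block layout is an assumption)."""
    patch = resize(patch, (22, 20))
    return hog(patch, orientations=10, pixels_per_cell=(4, 4),
               cells_per_block=(1, 1))

def train_character_classifier(patches, labels):
    """RBF-kernel SVM on HOG descriptors, approximating the 1-vs-all
    LIBSVM training described above."""
    X = np.array([char_descriptor(p) for p in patches])
    return SVC(kernel="rbf", probability=True).fit(X, labels)
```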

Citations
More filters
Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work proposes a simple yet powerful pipeline that yields fast and accurate text detection in natural scenes, and significantly outperforms state-of-the-art methods in terms of both accuracy and efficiency.
Abstract: Previous approaches for scene text detection have already achieved promising performances across various benchmarks. However, they usually fall short when dealing with challenging scenarios, even when equipped with deep neural network models, because the overall performance is determined by the interplay of multiple stages and components in the pipelines. In this work, we propose a simple yet powerful pipeline that yields fast and accurate text detection in natural scenes. The pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images, eliminating unnecessary intermediate steps (e.g., candidate aggregation and word partitioning), with a single neural network. The simplicity of our pipeline allows concentrating efforts on designing loss functions and neural network architecture. Experiments on standard datasets including ICDAR 2015, COCO-Text and MSRA-TD500 demonstrate that the proposed algorithm significantly outperforms state-of-the-art methods in terms of both accuracy and efficiency. On the ICDAR 2015 dataset, the proposed algorithm achieves an F-score of 0.7820 at 13.2fps at 720p resolution.

1,161 citations


Cites background from "Top-down and bottom-up cues for sce..."

  • ...Numerous inspiring ideas and effective approaches [5, 25, 26, 24, 27, 37, 11, 12, 7, 41, 42, 31] have been investigated....

    [...]

Proceedings Article
01 Nov 2012
TL;DR: This paper combines the representational power of large, multilayer neural networks together with recent developments in unsupervised feature learning, which allows them to use a common framework to train highly-accurate text detector and character recognizer modules.
Abstract: Full end-to-end text recognition in natural images is a challenging problem that has received much attention recently. Traditional systems in this area have relied on elaborate models incorporating carefully hand-engineered features or large amounts of prior knowledge. In this paper, we take a different route and combine the representational power of large, multilayer neural networks together with recent developments in unsupervised feature learning, which allows us to use a common framework to train highly-accurate text detector and character recognizer modules. Then, using only simple off-the-shelf methods, we integrate these two modules into a full end-to-end, lexicon-driven, scene text recognition system that achieves state-of-the-art performance on standard benchmarks, namely Street View Text and ICDAR 2003.

900 citations


Cites background or result from "Top-down and bottom-up cues for sce..."

  • ...Sophisticated models such as conditional random fields [11, 19] or pictorial structures [18] are also often required to combine the raw detection/recognition outputs into a complete system....

    [...]

  • ...Table 1 compares our results with [18] and the very recent work of [11]....

    [...]

Proceedings ArticleDOI
07 Sep 2009
TL;DR: A framework is presented that uses a higher order prior computed from an English dictionary to recognize a word, which may or may not be a part of the dictionary, and achieves significant improvement in word recognition accuracies without using a restricted word list.
Abstract: The problem of recognizing text in images taken in the wild has gained significant attention from the computer vision community in recent years. Contrary to recognition of printed documents, recognizing scene text is a challenging problem. We focus on the problem of recognizing text extracted from natural scene images and the web. Significant attempts have been made to address this problem in the recent past. However, many of these works benefit from the availability of strong context, which naturally limits their applicability. In this work we present a framework that uses a higher order prior computed from an English dictionary to recognize a word, which may or may not be a part of the dictionary. We show experimental results on publicly available datasets. Furthermore, we introduce a large challenging word dataset with five thousand words to evaluate various steps of our method exhaustively. The main contributions of this work are: (1) We present a framework, which incorporates higher order statistical language models to recognize words in an unconstrained manner (i.e. we overcome the need for restricted word lists, and instead use an English dictionary to compute the priors). (2) We achieve significant improvement (more than 20%) in word recognition accuracies without using a restricted word list. (3) We introduce a large word recognition dataset (atleast 5 times larger than other public datasets) with character level annotation and benchmark it.

789 citations


Cites background or methods from "Top-down and bottom-up cues for sce..."

  • ...Our method outperforms [13] not only on the (smaller) SVT and ICDAR 2003 datasets, but also on the IIIT 5K-Word dataset....

    [...]

  • ...Note that recent works [13, 20, 21] on scene text recognition recognize a word with the help of an image-specific small size lexicon (around...

    [...]

Journal ArticleDOI
TL;DR: This review provides a fundamental comparison and analysis of the remaining problems in the field and summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems.
Abstract: This paper analyzes, compares, and contrasts technical challenges, methods, and the performance of text detection and recognition research in color imagery. It summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems. Existing techniques are categorized as either stepwise or integrated and sub-problems are highlighted including text localization, verification, segmentation and recognition. Special issues associated with the enhancement of degraded text and the processing of video text, multi-oriented, perspectively distorted and multilingual text are also addressed. The categories and sub-categories of text are illustrated, benchmark datasets are enumerated, and the performance of the most representative approaches is compared. This review provides a fundamental comparison and analysis of the remaining problems in the field.

709 citations


Cites background or methods from "Top-down and bottom-up cues for sce..."

  • ...Shi et al. [195] proposed using DPMs to detect and recognize characters, then building a CRF model on the potential character locations to incorporate the classification scores, spatial constraints, and language priors for word recognition (Fig....

    [...]

  • ...Mishra [161] 2012 Recognition by integrating language prior and appearance features using CRF ICDAR’03 50 0....

    [...]

  • ...bottom-up cues [161], and high order language priors [165]....

    [...]

  • ...In this case, the character segmentation and character recognition can be integrated with language priors using optimization methods including Bayesian inference [25], [57], [64], [100], Integer programming [145], Markov [36], [83], [119], [206], CRF [161], [195], and graph models [56], [70], [123], [141], [143], [158], [189]....

    [...]

  • ...They use sliding window classification to obtain local maximum character detections, and a CRF model to jointly model the strength of the detections and the interactions among them....

    [...]

Journal ArticleDOI
TL;DR: This work introduces ASTER, an end-to-end neural network model that comprises a rectification network and a recognition network that predicts a character sequence directly from the rectified image.
Abstract: A challenging aspect of scene text recognition is to handle text with distortions or irregular layout. In particular, perspective text and curved text are common in natural scenes and are difficult to recognize. In this work, we introduce ASTER, an end-to-end neural network model that comprises a rectification network and a recognition network. The rectification network adaptively transforms an input image into a new one, rectifying the text in it. It is powered by a flexible Thin-Plate Spline transformation which handles a variety of text irregularities and is trained without human annotations. The recognition network is an attentional sequence-to-sequence model that predicts a character sequence directly from the rectified image. The whole model is trained end to end, requiring only images and their groundtruth text. Through extensive experiments, we verify the effectiveness of the rectification and demonstrate the state-of-the-art recognition performance of ASTER. Furthermore, we demonstrate that ASTER is a powerful component in end-to-end recognition systems, for its ability to enhance the detector.

592 citations


Cites methods from "Top-down and bottom-up cues for sce..."

  • ...IIIT5k-Words (IIIT5k) [44] contains 3,000 test images col-...

    [...]

  • ...Others take a learning-based approach and localize characters using the sliding window technique [35], [44], [45],...

    [...]

References
More filters
Journal ArticleDOI
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

40,826 citations


"Top-down and bottom-up cues for sce..." refers methods in this paper

  • ...We observed that all these potential character windows were missed due to poor SVM scores....

    [...]

  • ...On the other hand, false windows, which either have poor SVM scores or vary from the average aspect ratio size or both will be more likely to take the null label ǫ....

    [...]

  • ...For a true window, which has a relatively good SVM score, and matches the average aspect ratio size, this cost of assigning a null label is high....

    [...]

  • ...Some of the typical failures are due to poor resolution of the images, which leads to very weak SVM scores, as shown in Figure 5....

    [...]

  • ...We learnt a 1-vs-all SVM classifier with an RBF kernel using these features....

    [...]

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"Top-down and bottom-up cues for sce..." refers methods in this paper

  • ...Sliding window based detectors have been very successful for challenging tasks, such as face [28] and pedestrian [8] detection....

    [...]

  • ...In our implementation, we used Histogram of Gradient (HOG) features [8] for φi, and the likelihoods p(·) were learnt using a multi-class Support Vector Machine (SVM) [5]....

    [...]

  • ...We observed that the HOG feature [8] outperforms the features reported in [9], which uses a bag-of-words model....

    [...]

  • ...Our method is inspired by the many advancements made in the object detection and recognition problems [8, 10, 13, 25]....

    [...]

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations


Additional excerpts

  • ...Sliding window based detectors have been very successful for challenging tasks, such as face [28] and pedestrian [8] detection....

    [...]

Journal ArticleDOI
TL;DR: The state-of-the-art in evaluated methods for both classification and detection are reviewed, whether the methods are statistically different, what they are learning from the images, and what the methods find easy or confuse.
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

15,935 citations


"Top-down and bottom-up cues for sce..." refers methods in this paper

  • ...We computed the intersection over union measure of a detected window compared to the ground truth, similar to PASCAL VOC [12] and ICDAR 2003 [1]....

    [...]

  • ...We use the intersection over union measure [1, 12] thresholded at 90% to determine whether a detection has been retrieved or not....

    [...]

Book
01 Jan 1988
TL;DR: Probabilistic Reasoning in Intelligent Systems as mentioned in this paper is a complete and accessible account of the theoretical foundations and computational methods that underlie plausible reasoning under uncertainty, and provides a coherent explication of probability as a language for reasoning with partial belief.
Abstract: From the Publisher: Probabilistic Reasoning in Intelligent Systems is a complete and accessible account of the theoretical foundations and computational methods that underlie plausible reasoning under uncertainty. The author provides a coherent explication of probability as a language for reasoning with partial belief and offers a unifying perspective on other AI approaches to uncertainty, such as the Dempster-Shafer formalism, truth maintenance systems, and nonmonotonic logic. The author distinguishes syntactic and semantic approaches to uncertainty, and offers techniques, based on belief networks, that provide a mechanism for making semantics-based systems operational. Specifically, network-propagation techniques serve as a mechanism for combining the theoretical coherence of probability theory with modern demands of reasoning-systems technology: modular declarative inputs, conceptually meaningful inferences, and parallel distributed computation. Application areas include diagnosis, forecasting, image interpretation, multi-sensor fusion, decision support systems, plan recognition, planning, and speech recognition; in short, almost every task requiring that conclusions be drawn from uncertain clues and incomplete information. Probabilistic Reasoning in Intelligent Systems will be of special interest to scholars and researchers in AI, decision theory, statistics, logic, philosophy, cognitive psychology, and the management sciences. Professionals in the areas of knowledge-based systems, operations research, engineering, and statistics will find theoretical and computational tools of immediate practical use. The book can also be used as an excellent text for graduate-level courses in AI, operations research, or applied probability.

15,671 citations

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Top-down and bottom-up cues for scene text recognition"?

In this work, the authors focus on the problem of recognizing text extracted from street images. The authors present a framework that exploits both bottom-up and top-down cues. The authors show significant improvements in accuracies on two challenging public datasets, namely Street View Text (over 15%) and ICDAR 2003 (nearly 10%).