Proceedings ArticleDOI

Top-down and bottom-up cues for scene text recognition

TL;DR: This work presents a framework that exploits both bottom-up and top-down cues in the problem of recognizing text extracted from street images, and shows significant improvements in accuracies on two challenging public datasets, namely Street View Text and ICDAR 2003.
Abstract: Scene text recognition has gained significant attention from the computer vision community in recent years. Recognizing such text is a challenging problem, even more so than the recognition of scanned documents. In this work, we focus on the problem of recognizing text extracted from street images. We present a framework that exploits both bottom-up and top-down cues. The bottom-up cues are derived from individual character detections from the image. We build a Conditional Random Field model on these detections to jointly model the strength of the detections and the interactions between them. We impose top-down cues obtained from a lexicon-based prior, i.e. language statistics, on the model. The optimal word represented by the text image is obtained by minimizing the energy function corresponding to the random field model. We show significant improvements in accuracies on two challenging public datasets, namely Street View Text (over 15%) and ICDAR 2003 (nearly 10%).

Summary (3 min read)

1. Introduction

  • The problem of understanding scenes semantically has been one of the challenging goals in computer vision for many decades.
  • Popular recognition methods ignore the text, and identify other objects such as car, person, and tree, and regions such as road and sky.
  • The probabilistic approach the authors propose in this paper achieves an accuracy of over 73% under identical experimental settings.
  • The authors build a Conditional Random Field (CRF) model [21] on these detections to determine not only the true positive detections, but also what word they represent jointly.

2.1. Sliding Window Detection

  • Sliding window based detectors have been very successful for challenging tasks, such as face [28] and pedestrian [8] detection.
  • In Figure 3(a), the window containing parts of the two 'o' characters can be confused with 'x'.
  • Let φ_i denote the features extracted from a window location l_i.
  • This basic sliding window detection approach produces many potential character windows, but not all of them are useful for recognizing words.

2.2. Pruning Windows

  • For every potential character window, the authors compute a score based on: (i) classifier confidence; and (ii) a measure of the aspect ratio of the character detected and the aspect ratio learnt for that character from training data.
  • The mean aspect ratio (computed from training data) for the character c_j is denoted by μ_aj.
  • A low goodness score indicates a weak detection, which is removed from the set of candidate character windows.
  • The authors select detections which have a high confidence score, and do not overlap significantly with any of the other stronger detections.
  • The authors believe that this bottom-up approach alone cannot address all the issues related to detecting characters.

3. Recognizing Words

  • The character detection step provides us with a large set of windows potentially containing characters within them.
  • The authors' goal is to find the most likely word from this set of characters.
  • The authors formulate this problem in an energy minimization framework, where the best energy solution represents the ground truth word they aim to find.

3.1. The Word Model

  • Note that the set of random variables includes windows representing not only true positive detections, but also many false positive detections, which must be suppressed.
  • The authors assume here that the windows have been pruned based on the aspect ratio of character windows.
  • The TRW-S algorithm maximizes a concave lower bound on the energy.
  • These distributions are then used to reweight the messages being passed during loopy BP [24] on each tree.

3.2. Computing the Lexicon Prior

  • Such language models are frequently used in speech recognition, machine translation, and to some extent in OCR systems [27].
  • The authors explore two types of lexicon priors for the word recognition problem.
  • Let P(c_i, c_j) denote the probability of occurrence of a character pair (c_i, c_j) in the lexicon.
  • When the lexicon increases in size, the bi-gram model loses its effectiveness.
  • The node-specific pairwise cost for the character pair (P,R) to occur at the beginning of the word is higher than for it to occur at the end of the word.

4. Experiments

  • Given a word image extracted from a street scene and a lexicon, their problem is to find all the characters, and also to recognize the word as a whole.
  • The authors evaluate various components of the proposed approach to justify their choices.
  • The authors compare their results with two of the best performing methods [29, 30] for this task.

4.1. Datasets

  • The authors used the Street View Text (SVT) [30] and the ICDAR 2003 robust word recognition [1] datasets in their experiments.
  • To maintain identical experimental settings to those in [29], the authors use the lexica provided by them for these two datasets.
  • The dataset is divided into SVT-SPOT and SVT-WORD, meant for the tasks of locating words and recognizing words respectively.
  • Since the authors focus on the word recognition task, they used the SVT-WORD dataset, which contains 647 word images.

4.2. Character Detection

  • Sliding window based character detection is an important component of their framework, as their random field model is defined on the detections obtained.
  • The authors use the intersection over union measure [1, 12] thresholded at 90% to determine whether a detection has been retrieved or not.
  • The authors used the goodness score measure in (1), and discarded all windows with a score less than 0.1.
  • Table 1 summarizes an evaluation of the quality of their sliding window approach for the SVT-CHAR dataset.
  • Note that more than 97% of the characters are detected.

4.3. Cropped Word Recognition

  • The authors use the detections obtained to build the CRF model as discussed in Section 3.1.
  • The authors add one node for every detection window, and connect it to other windows based on its spatial distance and overlap.
  • The authors set the lexicon prior parameter in (7) and (8) to λ_l = 2 for all their experiments.
  • The overlap penalty parameter in (5) and (6) is set to λ_o = 1, empirically, for all their experiments.
  • Figure 5 shows a few challenging character examples the authors missed in the sliding window stage.

4.4. Results and Discussion

  • Similar to the evaluation scheme in [29], the authors use the inferred result to retrieve the word with the smallest edit distance in the lexicon (a minimal sketch of this retrieval step follows this list).
  • The best known results on the SVT and ICDAR datasets are reported in [29], where the authors used a pictorial structures model for recognizing words.
  • In their evaluation, the authors found the main causes of such failures to be: (i) a weak unary term; and (ii) missing characters in the detection stage itself.
  • The authors' method is inspired by the work of [29] and [10], but differs from them in many aspects as detailed below.
  • In contrast, the authors build a model from the entire lexicon (top-down cues), combine it with all the character detections (bottom-up cues), whether their scores are low or high, and infer the word with a joint inference scheme.
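To make the retrieval step concrete, here is a minimal sketch of edit-distance-based lexicon matching. The function names and the toy lexicon are illustrative, not from the paper.

```python
def recognize(inferred, lexicon):
    """Return the lexicon word with the smallest edit distance to the
    string read off the CRF labelling (the evaluation scheme of [29])."""
    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]
    return min(lexicon, key=lambda word: edit_distance(inferred, word))

print(recognize("CVPB", ["CVPR", "ICPR"]))   # -> CVPR
```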

5. Conclusion

  • The authors model combines bottom-up cues from character detections and top-down cues from lexica.
  • The authors infer the location of true characters and the word they represent as a whole jointly.
  • The authors showed results on two of the most challenging scene text databases, and significantly improved on the best results published at ICCV 2011 [29].
  • The authors' results show that scene text can be recognized with a reasonably high accuracy in natural, unconstrained images.
  • This could help in building vision systems which can solve higher-level semantic tasks, such as those in [15, 19].


Top-Down and Bottom-up Cues for Scene Text Recognition

Anand Mishra¹   Karteek Alahari²   C. V. Jawahar¹
¹CVIT, IIIT Hyderabad, India
²INRIA - WILLOW / École Normale Supérieure, Paris, France
Abstract

Scene text recognition has gained significant attention from the computer vision community in recent years. Recognizing such text is a challenging problem, even more so than the recognition of scanned documents. In this work, we focus on the problem of recognizing text extracted from street images. We present a framework that exploits both bottom-up and top-down cues. The bottom-up cues are derived from individual character detections from the image. We build a Conditional Random Field model on these detections to jointly model the strength of the detections and the interactions between them. We impose top-down cues obtained from a lexicon-based prior, i.e. language statistics, on the model. The optimal word represented by the text image is obtained by minimizing the energy function corresponding to the random field model.

We show significant improvements in accuracies on two challenging public datasets, namely Street View Text (over 15%) and ICDAR 2003 (nearly 10%).
1. Introduction

The problem of understanding scenes semantically has been one of the challenging goals in computer vision for many decades. It has gained considerable attention over the past few years, in particular, in the context of street scenes [3, 20]. This problem has manifested itself in various forms, namely, object detection [10, 13], object recognition and segmentation [22, 25]. There have also been significant attempts at addressing all these tasks jointly [14, 16, 20]. Although these approaches interpret most of the scene successfully, regions containing text tend to be ignored. As an example, consider an image of a typical street scene taken from Google Street View in Figure 1. One of the first things we notice in this scene is the sign board and the text it contains. However, popular recognition methods ignore the text, and identify other objects such as car, person, tree, and regions such as road, sky. The importance of text in images is also highlighted in the experimental study conducted by Judd et al. [17]. They found that viewers fixate on text when shown images containing text and other objects. This is further evidence that text recognition forms a useful component of the scene understanding problem.

Figure 1: A typical street scene image taken from Google Street View [29]. It contains very prominent sign boards (with text) on the building and its windows. It also contains objects such as car, person, tree, and regions such as road, sky. Many scene understanding methods recognize these objects and regions in the image successfully, but tend to ignore the text on the sign board, which contains rich, useful information. Our goal is to fill in this gap in understanding the scene.
Given the rapid growth of camera-based applications readily available on mobile phones, understanding scene text is more important than ever. One could, for instance, foresee an application to answer questions such as, "What does this sign say?". This is related to the problem of Optical Character Recognition (OCR), which has a long history in the computer vision community. However, the success of OCR systems is largely restricted to text from scanned documents. Scene text exhibits a large variability in appearances, as shown in Figures 1 and 2, and can prove to be challenging even for the state-of-the-art OCR methods.
A few recent works have explored the problem of detecting and/or recognizing text in scenes [4, 6, 7, 11, 23, 26, 29, 30, 31]. Chen and Yuille [6] and later, Epshtein et al. [11] have addressed the problem of detecting text in natural scenes. These two methods achieve significant detection results, but rely on an off-the-shelf OCR for subsequent recognition. Thus, they are not directly applicable for the challenging datasets we consider. De Campos et al. [9] proposed a method for recognizing cropped scene text characters. Although character recognition forms an essential component of text understanding, extending this framework to recognize words is not trivial. Weinman et al. [31] and Smith et al. [26] showed impressive scene text recognition results using similarity constraints and language statistics, but on a simpler dataset. It consists of "roughly fronto-parallel" pictures of signs [31], which are quite similar to those found in a traditional OCR setting. In contrast, we show results on a more challenging street view dataset [29], where the words vary in appearance significantly. Furthermore, we evaluate our approach on over 1000 words compared to 215 words in [26, 31].

Figure 2: Scene text often contains examples that have a large variety of appearances. Here we show a few sample images from the SVT [30] and ICDAR [1] datasets, with issues such as very different fonts, shadows, low resolution, and occlusions. These images are much more complex than the ones seen in typical OCR datasets. Standard off-the-shelf OCRs perform very poorly on these datasets [23, 29].
The proposed approach is more closely related to those in [23, 29, 30], which address the problem of simultaneously localizing and recognizing words. On one hand, these methods localize text with a significant accuracy, but on the other hand, their recognition results leave a lot to be desired. Since words in the scene text dataset have been localized with a good accuracy, we focus on the problem of recognizing words, given their location. This is commonly referred to as the cropped word recognition problem. Note that the challenges of this task are evident from the best published accuracy of only 56% on the scene text dataset [29]. The probabilistic approach we propose in this paper achieves an accuracy of over 73% under identical experimental settings.

Our method is inspired by the many advancements made in the object detection and recognition problems [8, 10, 13, 25]. We present a framework that exploits both bottom-up and top-down cues. The bottom-up cues are derived from individual character detections from the image. Naturally, these windows contain true as well as false positive detections of characters. We build a Conditional Random Field (CRF) model [21] on these detections to determine not only the true positive detections, but also what word they represent jointly. We impose top-down cues obtained from a lexicon-based prior, i.e. language statistics, on the model. In addition to disambiguating between characters, this prior also helps us in recognizing words.

The remainder of the paper is organized as follows. In Section 2 we present our character detection method. Our framework to build the random field model with a top-down lexicon-based prior on these detections is described in Section 3. We provide results on two public datasets and compare our method to related work in Section 4. Implementation details are also given in this section. We then make concluding remarks in Section 5.
2. Character Detection

The first step in our approach is to detect potential locations of characters in a word image. We propose a sliding window based approach to achieve this.

2.1. Sliding Window Detection

Sliding window based detectors have been very successful for challenging tasks, such as face [28] and pedestrian [8] detection. Although character detection is similar to such problems, it has its unique challenges. Firstly, there is the issue of dealing with a large number of categories (62 in all). Secondly, there is a large amount of inter-character and intra-character confusion, as illustrated in Figure 3. When a window contains parts of two characters next to each other, it may have a very similar appearance to another character. In Figure 3(a), the window containing parts of the two 'o' characters can be confused with 'x'. Furthermore, a part of one character can have the same appearance as that of another. In Figure 3(b), a part of the character 'B' can be confused with 'E'. We have adopted an additional pruning stage to overcome some of these issues.
We consider windows at multiple scales and spatial locations. The location of the i-th window, l_i, is given by its center and size. The set K = {c_1, c_2, ..., c_k} denotes the set of character classes in the dataset, e.g. k = 62 for English characters and digits. Let φ_i denote the features extracted from a window location l_i. Given the window l_i, we compute the likelihood, p(c_i | φ_i), of it taking a label c_i for all the classes in K. In our implementation, we used Histogram of Gradient (HOG) features [8] for φ_i, and the likelihoods p(·) were learnt using a multi-class Support Vector Machine (SVM) [5]. Details of the training procedure are provided in Section 4.2.

Figure 3: Typical challenges in multi-class character detection. (a) Inter-character confusion: A window containing parts of the two o's is falsely detected as x. (b) Intra-character confusion: A window containing a part of the character B is recognized as E.
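To illustrate this detection step, here is a minimal sketch that scans windows at multiple scales and scores each one with a trained classifier. It is not the authors' code: the scales, step size, default aspect ratio, and the scikit-learn-style predict_proba interface are all assumptions.

```python
def sliding_windows(img_w, img_h, heights=(16, 24, 32), step=4, aspect=0.7):
    """Enumerate candidate window locations l_i at multiple scales.
    Yields (x, y, w, h) boxes; the scales, step, and width/height ratio
    are illustrative choices, not values from the paper."""
    for h in heights:
        w = max(1, int(round(aspect * h)))
        for y in range(0, img_h - h + 1, step):
            for x in range(0, img_w - w + 1, step):
                yield (x, y, w, h)

def score_windows(image, classifier, feature_fn):
    """For each window l_i, compute the likelihoods p(c | phi_i) over all
    character classes (k = 62 in the paper). Assumes a trained classifier
    exposing a scikit-learn-style predict_proba, and a feature function
    mapping an image patch to a descriptor phi_i (e.g. HOG)."""
    img_h, img_w = image.shape[:2]
    detections = []
    for (x, y, w, h) in sliding_windows(img_w, img_h):
        phi = feature_fn(image[y:y + h, x:x + w])
        probs = classifier.predict_proba([phi])[0]
        detections.append({"box": (x, y, w, h), "probs": probs})
    return detections
```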
This basic sliding window detection approach produces
many potential character windows, but not all of them are
useful for recognizing words. We discard some of the weak
detection windows using the following pruning method.
2.2. Pruning Windows

For every potential character window, we compute a score based on: (i) classifier confidence; and (ii) a measure of the aspect ratio of the character detected and the aspect ratio learnt for that character from training data. The intuition behind this score is that a strong character window candidate should have a high classifier confidence score, and must fall within some range of sizes observed in the training data. For a window l_i with an aspect ratio a_i, let c_j denote the character with the best classifier confidence value given by S_ij. The mean aspect ratio (computed from training data) for the character c_j is denoted by μ_aj. We define a goodness score (GS) for the window l_i as:

GS(l_i) = S_ij · exp(−(μ_aj − a_i)² / (2 σ_aj²)),   (1)

where σ_aj² is the variance of the aspect ratio for character class c_j in the training data. Note that the aspect ratio statistics are character-specific. A low goodness score indicates a weak detection, which is removed from the set of candidate character windows.
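A minimal sketch of this pruning step follows, under an assumed detection format (one dict per window with its box and class probabilities); the 0.1 threshold is the value reported in the experiments.

```python
import numpy as np

def goodness_score(svm_score, a_i, mu_a, sigma2_a):
    """Eq. (1): GS(l_i) = S_ij * exp(-(mu_aj - a_i)^2 / (2 * sigma_aj^2))."""
    return svm_score * np.exp(-(mu_a - a_i) ** 2 / (2.0 * sigma2_a))

def prune_by_goodness(detections, aspect_stats, threshold=0.1):
    """Drop weak windows. `aspect_stats[c]` holds the (mean, variance)
    aspect-ratio statistics of class c from training data; the 0.1
    threshold is the value the experiments report."""
    kept = []
    for d in detections:
        x, y, w, h = d["box"]
        a_i = w / float(h)
        c = int(np.argmax(d["probs"]))            # best class c_j for l_i
        mu_a, sigma2_a = aspect_stats[c]
        if goodness_score(d["probs"][c], a_i, mu_a, sigma2_a) >= threshold:
            kept.append(d)
    return kept
```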
We then apply Non-Maximum Suppression (NMS), similar to other sliding window detection methods [13], to address the issue of multiple overlapping detections for each instance of a character. We select detections which have a high confidence score, and do not overlap significantly with any of the other stronger detections. We perform NMS after the aspect ratio pruning because wide windows containing many characters may suppress overlapping single character windows, when they are weaker.
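The NMS step can be sketched as greedy suppression; the 0.5 overlap threshold and the use of the best class probability as window strength are our assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / float(union) if union else 0.0

def non_max_suppression(detections, overlap_thresh=0.5):
    """Greedy NMS: keep the strongest window, then drop any weaker window
    that overlaps a kept one by more than the threshold."""
    ranked = sorted(detections, key=lambda d: max(d["probs"]), reverse=True)
    kept = []
    for d in ranked:
        if all(iou(d["box"], k["box"]) < overlap_thresh for k in kept):
            kept.append(d)
    return kept
```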
We perform both the pruning steps conservatively, and only discard the obvious false detections. We believe that this bottom-up approach alone cannot address all the issues related to detecting characters. Hence, we introduce lexicon-based top-down cues to discard the remaining false positives. We also use these cues to recognize the word as described in the following section.
3. Recognizing Words

The character detection step provides us with a large set of windows potentially containing characters within them. Our goal is to find the most likely word from this set of characters. We formulate this problem in an energy minimization framework, where the best energy solution represents the ground truth word we aim to find.
3.1. The Word Model

Each detection window is represented by a random variable X_i, which takes a label x_i.¹ Let n be the total number of detection windows. Note that the set of random variables includes windows representing not only true positive detections, but also many false positive detections, which must be suppressed. We introduce a null (or void) label ε to account for these false windows. Thus, x_i ∈ K_ε = K ∪ {ε}. The set K_ε^n represents the set of all possible assignments of labels to the random variables. An energy function E : K_ε^n → R maps any labelling to a real number E(·) called its energy or cost. The function E(·) is commonly represented as a sum of unary and pairwise terms as:

E(x) = Σ_{i=1..n} E_i(x_i) + Σ_{(i,j)∈E} E_ij(x_i, x_j),   (2)

where x = {x_i | i = 1, 2, ..., n}, E_i(·) represents the unary term, E_ij(·,·) is the pairwise term, and E represents the set of pairs of interacting detection windows, which is determined by the structure of the underlying graphical model.

Graph construction. We order the windows based on their horizontal location, and add one node each for every window sequentially from left to right. The nodes are then connected by edges. One could make a complete graph by connecting each node to every other node. However, it is not natural for a window on the extreme left to be related to another window on the extreme right, as evident in Figure 4.² Thus, we only connect windows with a significant overlap between them or windows which are close to each other. In the example in Figure 4, we show a few window samples and the edges between them. The intuition behind connecting overlapping or close-proximity windows is that they could represent either overlapping detections of the same character or detections of two separate characters. As we will see later, the edges are used to encode the language model as top-down cues.

¹Our model has similarities to that proposed in [10] for object detection, but the challenges (e.g. inter- and intra-character confusions) are greatly different from those in the object detection problem.
²We assume here that the windows have been pruned based on the aspect ratio of character windows. Without this pruning step, a window may contain multiple characters and will perhaps require a more complete graph.
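For a given labelling, the energy of Eq. (2) is straightforward to evaluate. The sketch below makes this concrete; the data layout (a cost vector per node, a callable for the pairwise term, and the null label stored in the last unary slot) is our convention, not the paper's.

```python
NULL = -1   # index convention for the null label; its unary cost is
            # stored in the last slot of each node's cost vector

def energy(labels, unary, pairwise, edges):
    """Eq. (2): E(x) = sum_i E_i(x_i) + sum_{(i,j) in E} E_ij(x_i, x_j).
    `labels[i]` is a class index or NULL, `unary[i]` a cost vector, and
    `pairwise(i, j, xi, xj)` supplies the terms of Eqs. (5)-(6)."""
    total = sum(unary[i][labels[i]] for i in range(len(labels)))
    total += sum(pairwise(i, j, labels[i], labels[j]) for i, j in edges)
    return total
```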
CRF energy. The unary term E_i(x_i), which denotes the cost of a node x_i taking label c_j ≠ ε, is defined as:

E_i(x_i = c_j) = 1 − p(c_j | x_i),   (3)

where p(c_j | x_i) is the classifier confidence (e.g. SVM score) of character class c_j for node x_i. For the null label ε,

E_i(x_i = ε) = max_j p(c_j | x_i) · exp(−(μ_aj − a_i)² / σ_aj²),   (4)

where a_i is the aspect ratio of the window corresponding to node x_i, c_j is the character detected at node x_i, and μ_aj and σ_aj² are, respectively, the mean and the variance of the aspect ratio of the character detected, learnt from the training data. For a true window, which has a relatively good SVM score and matches the average aspect ratio size, this cost of assigning a null label is high. On the other hand, false windows, which either have poor SVM scores or vary from the average aspect ratio size, or both, will be more likely to take the null label ε.
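A sketch of the unary terms of Eqs. (3) and (4) for one detection window, under the same assumed detection format as before; the null cost is appended as the last slot of the cost vector, matching the NULL = -1 convention above.

```python
import numpy as np

def unary_costs(det, aspect_stats):
    """Unary terms for one window: Eq. (3) for each character label and
    Eq. (4) for the null label, appended as the last slot."""
    probs = np.asarray(det["probs"])
    x, y, w, h = det["box"]
    a_i = w / float(h)
    j = int(np.argmax(probs))                     # detected character c_j
    mu_a, sigma2_a = aspect_stats[j]
    char_costs = 1.0 - probs                      # Eq. (3): 1 - p(c_j | x_i)
    null_cost = probs[j] * np.exp(-(mu_a - a_i) ** 2 / sigma2_a)   # Eq. (4)
    return np.append(char_costs, null_cost)
```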
The pairwise term E_ij(x_i, x_j) is used to encode the top-down cues from the language model in a principled way. The cost of two neighbouring nodes x_i and x_j taking labels c_i ≠ ε and c_j ≠ ε respectively is given by:

E_ij(x_i, x_j) = E^l_ij(x_i, x_j) + λ_o · exp(−ψ(x_i, x_j)).   (5)

Here, ψ(x_i, x_j) = (100 − Overlap(x_i, x_j))² is a measure of the overlap percentage between the two windows X_i and X_j. The function E^l_ij(x_i, x_j) denotes the lexicon prior. The parameter λ_o determines the overlap-based penalty. Computation of the lexicon prior E^l_ij(·,·) is discussed in Section 3.2. The pairwise cost (5) ensures that two windows with sufficiently high overlap cannot take non-null labels, i.e. at least one of them is likely to be a false window. The costs involving the null label ε are computed as:

E_ij(x_i = c_i, x_j = ε) = λ_o · exp(−ψ(x_i, x_j)).   (6)

The pairwise cost E_ij(x_i = ε, x_j = c_j) is defined similarly. Further, E_ij(x_i = ε, x_j = ε) is uniformly set to zero.
Inference. Given these unary and pairwise terms, we minimize the energy function (2). We use the sequential tree-reweighted message passing (TRW-S) algorithm [18] because of its efficiency and accuracy on our recognition problem. The TRW-S algorithm maximizes a concave lower bound on the energy. It begins by considering a set of trees from the random field, and computes probability distributions over each tree. These distributions are then used to reweight the messages being passed during loopy BP [24] on each tree. The algorithm terminates when the lower bound cannot be increased further, or the maximum number of iterations has been reached.

Figure 4: Summary of our approach. We first find a set of potential character windows, shown at the top (only a few are shown here for readability). We then build a random field model on these detection windows by connecting them with edges. The edge weights are computed based on the characteristics of the two windows. Edges shown in green indicate that the two windows they connect have a high probability of occurring together. Edges shown in red connect two windows that are unlikely to be characters following one another. An edge shown in red forces one of the two windows to take the ε label, i.e. removes it from the candidate character set. Based on these edges and the SVM scores for each window, we infer the character classes of each window as well as the word, which is indicated by the green edge path. (Best viewed in colour.)
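The paper relies on TRW-S for inference. As a stand-in that makes the objective concrete, the sketch below minimizes Eq. (2) exhaustively; this is feasible only for a handful of nodes, and a real run would hand the same terms to a TRW-S implementation instead.

```python
from itertools import product

def exact_map(unary, pairwise, edges, num_classes):
    """Exhaustive minimization of Eq. (2) over K union {null}; the null
    label is -1, matching the last-slot convention used earlier."""
    label_set = list(range(num_classes)) + [-1]
    n = len(unary)
    best_energy, best_labels = float("inf"), None
    for labels in product(label_set, repeat=n):
        e = sum(unary[i][labels[i]] for i in range(n))
        e += sum(pairwise(i, j, labels[i], labels[j]) for i, j in edges)
        if e < best_energy:
            best_energy, best_labels = e, labels
    return best_labels, best_energy
```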
3.2. Computing the Lexicon Prior

We use a lexicon to compute the prior E^l_ij(x_i, x_j) in (5). Such language models are frequently used in speech recognition, machine translation, and to some extent in OCR systems [27]. We explore two types of lexicon priors for the word recognition problem.

Bi-gram. Bi-gram based lexicon priors are learnt from joint occurrences of characters in the lexicon. Character pairs which never occur are highly penalized. Let P(c_i, c_j) denote the probability of occurrence of a character pair (c_i, c_j) in the lexicon. The pairwise cost is:

E^l_ij(x_i = c_i, x_j = c_j) = λ_l · (1 − P(c_i, c_j)),   (7)

where the parameter λ_l determines the penalty for a character pair occurring in the lexicon.
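A sketch of learning the bi-gram prior of Eq. (7) from a lexicon; normalising counts over all adjacent pairs is one plausible choice, and λ_l = 2 is the value used in the experiments.

```python
from collections import Counter

def bigram_prior(lexicon, lam_l=2.0):
    """Learn P(c_i, c_j) from adjacent character pairs in the lexicon and
    return the Eq. (7) cost lam_l * (1 - P(c_i, c_j))."""
    pair_counts = Counter()
    for word in lexicon:
        pair_counts.update(zip(word, word[1:]))
    total = float(sum(pair_counts.values())) or 1.0
    def cost(ci, cj):
        return lam_l * (1.0 - pair_counts[(ci, cj)] / total)
    return cost

# Pairs absent from the lexicon pay the full penalty lam_l.
prior = bigram_prior(["CVPR", "ICPR"])
print(prior("P", "R"), prior("Z", "Z"))   # frequent pair vs. full penalty
```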

Node-specific prior. When the lexicon increases in size, the bi-gram model loses its effectiveness. It also fails to capture the location-specific information of pairs of characters. As a toy example, consider a lexicon with only two words, CVPR and ICPR. The node-specific pairwise cost for the character pair (P,R) to occur at the beginning of the word is higher than for it to occur at the end of the word. This useful cue is ignored in the bi-gram prior model.

To overcome this, we divide each lexicon word into n parts, where n is determined based on the number of nodes in the graph and the spatial distance between nodes. We then use only the first 1/n-th of the word for computing the pairwise cost between initial nodes, similarly the next 1/n-th for computing the cost between the next few nodes, and so on. In other words, we do a region of interest (ROI) based search in the lexicon. The ROI is determined based on the spatial position of a detected window in the word, e.g. if two windows are on the left most side, then only the first couple of characters of the lexicon words are considered for calculating the pairwise term between windows.

The pairwise cost using this prior is given by:

E^l_ij(x_i = c_i, x_j = c_j) = 0 if (c_i, c_j) ∈ ROI, and λ_l otherwise.   (8)

We evaluate our approach with both these pairwise terms, and find that the node-specific prior achieves better performance.
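One possible reading of the ROI-based prior of Eq. (8) is sketched below; how a node pair maps to a part of the word, and the substring test over the ROI, are our simplifications of the description above.

```python
def node_specific_prior(lexicon, n_parts, lam_l=2.0):
    """Eq. (8): the pair (c_i, c_j) is free only if it occurs inside the
    region of interest of some lexicon word; otherwise it pays lam_l.
    The caller is assumed to map each node pair to a part index from
    the windows' spatial positions."""
    def cost(ci, cj, part):
        for word in lexicon:
            size = max(1, len(word) // n_parts)
            roi = word[part * size:(part + 1) * size + 1]  # +1 spans the boundary
            if (ci + cj) in roi:
                return 0.0
        return lam_l
    return cost

# Toy check from the text: with lexicon {CVPR, ICPR} and two parts,
# (P, R) is free at the end of the word but penalised at the beginning.
prior = node_specific_prior(["CVPR", "ICPR"], n_parts=2)
print(prior("P", "R", part=0), prior("P", "R", part=1))   # lam_l, then 0.0
```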
4. Experiments

In this section we present a detailed evaluation of our method. Given a word image extracted from a street scene and a lexicon, our problem is to find all the characters, and also to recognize the word as a whole. We evaluate various components of the proposed approach to justify our choices. We compare our results with two of the best performing methods [29, 30] for this task.

4.1. Datasets

We used the Street View Text (SVT) [30]³ and the ICDAR 2003 robust word recognition [1] datasets in our experiments. To maintain identical experimental settings to those in [29], we use the lexica provided by them for these two datasets.

SVT. The Street View Text (SVT)⁴ dataset contains images taken from Google Street View. As noted in [30], most of the images come from business signage and exhibit a high degree of variability in appearance and resolution. The dataset is divided into SVT-SPOT and SVT-WORD, meant for the tasks of locating words and recognizing words respectively. Since, in our work, we focus on the word recognition task, we used the SVT-WORD dataset, which contains 647 word images.

Our basic unit of recognition is a character, which needs to be detected or localized before classification. A miss in the localization will result in poorer word recognition. To improve the robustness of the recognition architecture, we need to quantitatively measure the accuracy of the character detection module. For this purpose, we created ground truth data for characters in the SVT-WORD dataset. Using the ground truth at the character level, we evaluated the performance of the SVM classifier used for this task. Note that no such evaluation has been reported on the SVT dataset as yet. Our ground truth data set contains around 4000 characters of 52 classes overall. We refer to this dataset as SVT-CHAR.⁵

ICDAR 2003 Dataset. The ICDAR 2003 dataset was originally created for cropped character classification, full image text detection, cropped and full image word recognition, and various other tasks in document analysis. We used the part corresponding to cropped image word recognition, called Robust Word Recognition [1]. Similar to [29], we ignore words with less than two characters or with non-alphanumeric characters, which results in 829 words overall. For subsequent discussion we refer to this dataset as ICDAR(50).

³http://vision.ucsd.edu/~kai/svt
⁴Note that this dataset has been slightly updated since its original release in [30]. We use the updated version [29] in our experiments.
4.2. Character Detection

Sliding window based character detection is an important component of our framework, as our random field model is defined on the detections obtained. At every possible location of the sliding window, we test a character classifier. This provides a likelihood of the window containing a certain character. The alphabet of characters recognized consists of 26 lowercase and 26 uppercase letters, and 10 digits.

We evaluated various features for recognizing characters. We observed that the HOG feature [8] outperforms the features reported in [9], which uses a bag-of-words model. This is perhaps due to the lack of geometric information in the model. We computed dense HOG features with a cell size of 4 × 4 using 10 bins, after resizing each image to a 22 × 20 window. We learnt a 1-vs-all SVM classifier with an RBF kernel using these features. We used the standard LIBSVM package [5] for training and testing the SVMs. For the SVT-CHAR evaluation, we trained the model on the ICDAR 2003 dataset due to the relatively small size of SVT-CHAR (around 4000 characters). We observed that the method using HOG descriptors performs significantly better than others, with a classification accuracy of 61.86%.

⁵Available at http://cvit.iiit.ac.in/projects/SceneTextUnderstanding
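The reported configuration can be approximated with common libraries, as in the sketch below. It assumes grayscale character patches; also, scikit-learn's SVC wraps LIBSVM [5] but trains one-vs-one multi-class models, so this is only close to the 1-vs-all setup described.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def char_descriptor(patch):
    """Dense HOG roughly matching the reported setup: each grayscale
    character patch resized to 22 x 20, 4 x 4 cells, 10 orientation
    bins (the block layout is an assumption)."""
    patch = resize(patch, (22, 20))
    return hog(patch, orientations=10, pixels_per_cell=(4, 4),
               cells_per_block=(1, 1))

def train_character_classifier(patches, labels):
    """RBF-kernel SVM on HOG descriptors, approximating the 1-vs-all
    LIBSVM training described above."""
    X = np.array([char_descriptor(p) for p in patches])
    return SVC(kernel="rbf", probability=True).fit(X, labels)
```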

Citations
More filters
Proceedings ArticleDOI
01 Jul 2017
TL;DR: This work proposes a simple yet powerful pipeline that yields fast and accurate text detection in natural scenes, and significantly outperforms state-of-the-art methods in terms of both accuracy and efficiency.
Abstract: Previous approaches for scene text detection have already achieved promising performances across various benchmarks. However, they usually fall short when dealing with challenging scenarios, even when equipped with deep neural network models, because the overall performance is determined by the interplay of multiple stages and components in the pipelines. In this work, we propose a simple yet powerful pipeline that yields fast and accurate text detection in natural scenes. The pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images, eliminating unnecessary intermediate steps (e.g., candidate aggregation and word partitioning), with a single neural network. The simplicity of our pipeline allows concentrating efforts on designing loss functions and neural network architecture. Experiments on standard datasets including ICDAR 2015, COCO-Text and MSRA-TD500 demonstrate that the proposed algorithm significantly outperforms state-of-the-art methods in terms of both accuracy and efficiency. On the ICDAR 2015 dataset, the proposed algorithm achieves an F-score of 0.7820 at 13.2fps at 720p resolution.

1,161 citations


Cites background from "Top-down and bottom-up cues for sce..."

  • ...Numerous inspiring ideas and effective approaches [5, 25, 26, 24, 27, 37, 11, 12, 7, 41, 42, 31] have been investigated....

    [...]

Proceedings Article
01 Nov 2012
TL;DR: This paper combines the representational power of large, multilayer neural networks together with recent developments in unsupervised feature learning, which allows them to use a common framework to train highly-accurate text detector and character recognizer modules.
Abstract: Full end-to-end text recognition in natural images is a challenging problem that has received much attention recently. Traditional systems in this area have relied on elaborate models incorporating carefully hand-engineered features or large amounts of prior knowledge. In this paper, we take a different route and combine the representational power of large, multilayer neural networks together with recent developments in unsupervised feature learning, which allows us to use a common framework to train highly-accurate text detector and character recognizer modules. Then, using only simple off-the-shelf methods, we integrate these two modules into a full end-to-end, lexicon-driven, scene text recognition system that achieves state-of-the-art performance on standard benchmarks, namely Street View Text and ICDAR 2003.

900 citations


Cites background or result from "Top-down and bottom-up cues for sce..."

  • ...Sophisticated models such as conditional random fields [11, 19] or pictorial structures [18] are also often required to combine the raw detection/recognition outputs into a complete system....

    [...]

  • ...Table 1 compares our results with [18] and the very recent work of [11]....

    [...]

Proceedings ArticleDOI
07 Sep 2009
TL;DR: A framework is presented that uses a higher order prior computed from an English dictionary to recognize a word, which may or may not be a part of the dictionary, and achieves significant improvement in word recognition accuracies without using a restricted word list.
Abstract: The problem of recognizing text in images taken in the wild has gained significant attention from the computer vision community in recent years. Contrary to recognition of printed documents, recognizing scene text is a challenging problem. We focus on the problem of recognizing text extracted from natural scene images and the web. Significant attempts have been made to address this problem in the recent past. However, many of these works benefit from the availability of strong context, which naturally limits their applicability. In this work we present a framework that uses a higher order prior computed from an English dictionary to recognize a word, which may or may not be a part of the dictionary. We show experimental results on publicly available datasets. Furthermore, we introduce a large challenging word dataset with five thousand words to evaluate various steps of our method exhaustively. The main contributions of this work are: (1) We present a framework, which incorporates higher order statistical language models to recognize words in an unconstrained manner (i.e. we overcome the need for restricted word lists, and instead use an English dictionary to compute the priors). (2) We achieve significant improvement (more than 20%) in word recognition accuracies without using a restricted word list. (3) We introduce a large word recognition dataset (atleast 5 times larger than other public datasets) with character level annotation and benchmark it.

789 citations


Cites background or methods from "Top-down and bottom-up cues for sce..."

  • ...Our method outperforms [13] not only on the (smaller) SVT and ICDAR 2003 datasets, but also on the IIIT 5K-Word dataset....

    [...]

  • ...Note that recent works [13, 20, 21] on scene text recognition recognize a word with the help of an image-specific small size lexicon (around...

    [...]

Journal ArticleDOI
TL;DR: This review provides a fundamental comparison and analysis of the remaining problems in the field and summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems.
Abstract: This paper analyzes, compares, and contrasts technical challenges, methods, and the performance of text detection and recognition research in color imagery. It summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems. Existing techniques are categorized as either stepwise or integrated and sub-problems are highlighted including text localization, verification, segmentation and recognition. Special issues associated with the enhancement of degraded text and the processing of video text, multi-oriented, perspectively distorted and multilingual text are also addressed. The categories and sub-categories of text are illustrated, benchmark datasets are enumerated, and the performance of the most representative approaches is compared. This review provides a fundamental comparison and analysis of the remaining problems in the field.

709 citations


Cites background or methods from "Top-down and bottom-up cues for sce..."

  • ...Shi et al. [195] proposed using DPMs to detect and recognize characters, then building a CRF model on the potential character locations to incorporate the classification scores, spatial constraints, and language priors for word recognition (Fig....

    [...]

  • ...Mishra [161] 2012 Recognition by integrating language prior and appearance features using CRF ICDAR’03 50 0....

    [...]

  • ...bottom-up cues [161], and high order language priors [165]....

    [...]

  • ...In this case, the character segmentation and character recognition can be integrated with language priors using optimization methods including Bayesian inference [25], [57], [64], [100], Integer programming [145], Markov [36], [83], [119], [206], CRF [161], [195], and graph models [56], [70], [123], [141], [143], [158], [189]....

    [...]

  • ...They use sliding window classification to obtain local maximum character detections, and a CRF model to jointly model the strength of the detections and the interactions among them....

    [...]

Journal ArticleDOI
TL;DR: This work introduces ASTER, an end-to-end neural network model that comprises a rectification network and a recognition network that predicts a character sequence directly from the rectified image.
Abstract: A challenging aspect of scene text recognition is to handle text with distortions or irregular layout. In particular, perspective text and curved text are common in natural scenes and are difficult to recognize. In this work, we introduce ASTER, an end-to-end neural network model that comprises a rectification network and a recognition network. The rectification network adaptively transforms an input image into a new one, rectifying the text in it. It is powered by a flexible Thin-Plate Spline transformation which handles a variety of text irregularities and is trained without human annotations. The recognition network is an attentional sequence-to-sequence model that predicts a character sequence directly from the rectified image. The whole model is trained end to end, requiring only images and their groundtruth text. Through extensive experiments, we verify the effectiveness of the rectification and demonstrate the state-of-the-art recognition performance of ASTER. Furthermore, we demonstrate that ASTER is a powerful component in end-to-end recognition systems, for its ability to enhance the detector.

592 citations


Cites methods from "Top-down and bottom-up cues for sce..."

  • ...IIIT5k-Words (IIIT5k) [44] contains 3,000 test images col-...

    [...]

  • ...Others take a learning-based approach and localize characters using the sliding window technique [35], [44], [45],...

    [...]

References
More filters
Journal ArticleDOI
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

40,826 citations


"Top-down and bottom-up cues for sce..." refers methods in this paper

  • ...We observed that all these potential character windows were missed due to poor SVM scores....

    [...]

  • ...On the other hand, false windows, which either have poor SVM scores or vary from the average aspect ratio size or both will be more likely to take the null label ǫ....

    [...]

  • ...For a true window, which has a relatively good SVM score, and matches the average aspect ratio size, this cost of assigning a null label is high....

    [...]

  • ...Some of the typical failures are due to poor resolution of the images, which leads to very weak SVM scores, as shown in Figure 5....

    [...]

  • ...We learnt a 1-vs-all SVM classifier with an RBF kernel using these features....

    [...]

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"Top-down and bottom-up cues for sce..." refers methods in this paper

  • ...Sliding window based detectors have been very successful for challenging tasks, such as face [28] and pedestrian [8] detection....

    [...]

  • ...In our implementation, we used Histogram of Gradient (HOG) features [8] for φi, and the likelihoods p(·) were learnt using a multi-class Support Vector Machine (SVM) [5]....

    [...]

  • ...We observed that the HOG feature [8] outperforms the features reported in [9], which uses a bag-of-words model....

    [...]

  • ...Our method is inspired by the many advancements made in the object detection and recognition problems [8, 10, 13, 25]....

    [...]

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations


Additional excerpts

  • ...Sliding window based detectors have been very successful for challenging tasks, such as face [28] and pedestrian [8] detection....

    [...]

Journal ArticleDOI
TL;DR: The state-of-the-art in evaluated methods for both classification and detection are reviewed, whether the methods are statistically different, what they are learning from the images, and what the methods find easy or confuse.
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

15,935 citations


"Top-down and bottom-up cues for sce..." refers methods in this paper

  • ...We computed the intersection over union measure of a detected window compared to the ground truth, similar to PASCAL VOC [12] and ICDAR 2003 [1]....

    [...]

  • ...We use the intersection over union measure [1, 12] thresholded at 90% to determine whether a detection has been retrieved or not....

    [...]

Book
01 Jan 1988
TL;DR: Probabilistic Reasoning in Intelligent Systems as mentioned in this paper is a complete and accessible account of the theoretical foundations and computational methods that underlie plausible reasoning under uncertainty, and provides a coherent explication of probability as a language for reasoning with partial belief.
Abstract: From the Publisher: Probabilistic Reasoning in Intelligent Systems is a complete and accessible account of the theoretical foundations and computational methods that underlie plausible reasoning under uncertainty. The author provides a coherent explication of probability as a language for reasoning with partial belief and offers a unifying perspective on other AI approaches to uncertainty, such as the Dempster-Shafer formalism, truth maintenance systems, and nonmonotonic logic. The author distinguishes syntactic and semantic approaches to uncertainty, and offers techniques, based on belief networks, that provide a mechanism for making semantics-based systems operational. Specifically, network-propagation techniques serve as a mechanism for combining the theoretical coherence of probability theory with modern demands of reasoning-systems technology: modular declarative inputs, conceptually meaningful inferences, and parallel distributed computation. Application areas include diagnosis, forecasting, image interpretation, multi-sensor fusion, decision support systems, plan recognition, planning, and speech recognition; in short, almost every task requiring that conclusions be drawn from uncertain clues and incomplete information. Probabilistic Reasoning in Intelligent Systems will be of special interest to scholars and researchers in AI, decision theory, statistics, logic, philosophy, cognitive psychology, and the management sciences. Professionals in the areas of knowledge-based systems, operations research, engineering, and statistics will find theoretical and computational tools of immediate practical use. The book can also be used as an excellent text for graduate-level courses in AI, operations research, or applied probability.

15,671 citations

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Top-down and bottom-up cues for scene text recognition"?

In this work, the authors focus on the problem of recognizing text extracted from street images. The authors present a framework that exploits both bottom-up and top-down cues. The authors show significant improvements in accuracies on two challenging public datasets, namely Street View Text (over 15%) and ICDAR 2003 (nearly 10%).