Detecting and Reading Text in Natural Scenes

Xiangrong Chen¹ and Alan L. Yuille¹,²
Departments of Statistics¹ and Psychology²,
University of California, Los Angeles, Los Angeles, CA 90095.
emails: {xrchen,yuille}@stat.ucla.edu
Abstract
This paper gives an algorithm for detecting and reading text in natural images. The algorithm is intended for use by blind and visually impaired subjects walking through city scenes. We first obtain a dataset of city images taken by blind and normally sighted subjects. From this dataset, we manually label and extract the text regions. Next we perform statistical analysis of the text regions to determine which image features are reliable indicators of text and have low entropy (i.e. the feature response is similar for all text images). We obtain weak classifiers by using joint probabilities for feature responses on and off text. These weak classifiers are used as input to an AdaBoost machine learning algorithm to train a strong classifier. In practice, we trained a cascade with 4 strong classifiers containing 79 features. An adaptive binarization and extension algorithm is applied to those regions selected by the cascade classifier. Commercial OCR software is used to read the text or reject it as a non-text region. The overall algorithm has a success rate of over 90% (evaluated by complete detection and reading of the text) on the test set, and the unread text is typically small and distant from the viewer.
1. Introduction
This paper presents an algorithm for detecting and reading text in city scenes. This text includes stereotypical forms such as street signs, hospital signs, and bus numbers, as well as more variable forms such as shop signs, house numbers, and billboards. Our database of city images was taken in San Francisco, partly by normally sighted viewers and partly by blind volunteers who were accompanied by sighted guides (for safety reasons), used automatic camera settings, and had little practical knowledge of where the text was located in the image. The databases have been labelled to enable us to train part of our algorithm and to evaluate its performance.
The first, and most important, component of the algorithm is a strong classifier which is trained by the AdaBoost learning algorithm [4],[19],[20] on labelled data. AdaBoost requires specifying a set of features from which to build the strong classifier. This paper selects this feature set guided by the principle of informative features (the feature set used in [19] is not suitable for this problem). We calculate joint probability distributions of these feature responses on and off text, so weak classifiers can be obtained as log-likelihood ratio tests. The strong classifier is applied to sub-regions of the image (at multiple scales) and outputs text candidate regions. In this application, there are typically between 2-5 false positives in images of 2,048 x 1,536 pixels. The second component is an extension and binarization [12] algorithm that acts on the text region candidates. The extension and binarization algorithm takes the text regions as inputs, extends these regions so as to include text that the strong classifier did not detect, and binarizes them (ideally, so that the text is white and the background is black). The third component is an OCR software program which acts on the binarized regions (the OCR software gave far worse performance when applied directly to the image). The OCR software either determines that the regions are text, and reads them, or rejects them as non-text.
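As a rough illustration of how these three components fit together, the following minimal Python sketch chains them; the function names (detect_text_regions, extend_and_binarize, run_ocr) are placeholders for the components described above, not the authors' actual code.

```python
# Hypothetical end-to-end pipeline mirroring the three components described above.
# detect_text_regions, extend_and_binarize and run_ocr are placeholder names,
# not functions from the paper's implementation.

def read_scene_text(image, detect_text_regions, extend_and_binarize, run_ocr):
    """Detect candidate text regions, binarize them, and read them with OCR."""
    results = []
    for region in detect_text_regions(image):               # AdaBoost cascade (component 1)
        binary_patch = extend_and_binarize(image, region)    # extension + binarization (component 2)
        text = run_ocr(binary_patch)                          # commercial OCR (component 3)
        if text:                                              # OCR may reject the region as non-text
            results.append((region, text))
    return results
```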
The performance is as follows: (I) Speed. The current algorithm runs in under 3 seconds on images of size 2,048 by 1,536. (II) Quality of Results. We are able to detect text of almost all forms with a false negative rate of 2.8%. We are able to read the detected text correctly 93.0% of the time (correctness is measured per complete word and not per letter). We incorrectly read non-text as text in 10% of cases, but only 1% remains incorrectly read after we prune out text which does not form coherent words. (Many of the remaining errors correspond to outputting "111" due to vertical structures in the image.)
2. Previous Work
There has been recent successful work on detecting text in images. Some of it has concentrated on detecting individual letters [1],[6],[7]. More relevant work is reported in [8],[9],[10],[22],[23]. In particular, Lucas et al [10] report on performance analysis of text detection algorithms on a standardized database. It is hard to do a direct comparison to these papers. None of these methods use AdaBoost learning, and the details of the algorithms evaluated by Lucas et al are not given. The performance we report in this paper is better than those reported in Lucas et al, but the datasets are different and more precise comparisons on the same datasets are needed. We will be making our dataset available for testing.
3. The Datasets
We used two image datasets, one for training the AdaBoost learning algorithm and the other for testing it.

The training dataset consisted of 162 images, 41 of which were taken by scientists from SKERI (the Smith-Kettlewell Eye Research Institute) and the rest by blind volunteers under the supervision of SKERI scientists.

The test dataset of 117 images was taken entirely by blind volunteers. Briefly, the blind volunteers were equipped with a Nikon camera mounted on the shoulder or the stomach. They walked around the streets of San Francisco taking photographs. Two observers from SKERI accompanied the volunteers to ensure their safety but took no part in taking the photographs. The camera was set to the default automatic setting for focus and contrast gain control.

From the dataset construction, see figure (1), we noted that: (I) Blind volunteers could keep the camera approximately horizontal. (II) They could hold the camera fairly steady, so there was very little blur. (III) The automatic contrast gain control of the cameras was almost always sufficient to allow the images to have good contrast.
4. Selection of Features for AdaBoost
The AdaBoost algorithm is a method for combining a
set of weak classifiers to make a strong classifier. The weak
classifiers correspond to image features. Typically a large set of features is specified in advance, and the algorithm selects which ones to use and how to combine them.
The problem is that the choice of feature set is critical to the success and transparency of the algorithm. The set of features used for face detection by Viola and Jones [19] consists of a subset of Haar basis functions. But there was no rationale for this choice of feature set apart from computational efficiency. Also, there are important differences between text and face stimuli because the spatial variation per pixel of text images is far greater than for faces. Facial features, such as eyes, are in approximately the same spatial position for any face and have similar appearance. But the positions of letters in text are varied and the shapes of letters differ. For example, PCA analysis of text, see figure (2), has far more non-zero eigenvalues than for faces (where Pentland reported that 15 eigenfaces capture over ninety percent of the variance [14]).

Figure 1. Example images in the training dataset taken by blind volunteers (top two panels) and by scientists from SKERI (bottom two panels). The blind volunteers are, of course, poor at centering the signs and at keeping the camera horizontally aligned.
Ideally, we should select informative features which give similar results on all text regions, and hence have low entropy, and which are also good for discriminating between text and non-text. Statistical analysis of the dataset of training text images shows that there are many statistical regularities.

For example, we align samples from our text dataset (precise alignment is unnecessary) and analyze the response of the modulus of the x and y derivative filters at each pixel.
The means of the derivatives have an obvious pattern, see figure (3), where the derivatives are small in the background regions above and below the text.

Figure 2. PCA on our dataset of text images (40 x 20 pixels). Observe that about 150 components are required to get 90 percent of the variance. Faces require only 15 components to achieve this variance [14].
The x derivatives tend to be large in the central (i.e. text) region, while the y derivatives are large at the top and bottom of the text and small in the central region. But the variances of the x derivatives are very large within the central region (because letters have different shapes and positions). However, the y derivatives tend to have low variance, and hence low entropy.

Our first set of features is based on these observations. By averaging over regions we obtain features which have lower entropy. Based on the observations in figure (3), we designed block patterns inside the sub-window corresponding to the horizontal and vertical derivatives. We also designed three symmetrical block patterns, see figure (4), which are chosen so that there is (usually) a text element within each sub-window. This gives features based on the block-based mean and STD of the intensity and of the moduli of the x and y derivative filters.
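As a concrete illustration of block-based features of this kind, the sketch below computes the mean and STD of the moduli of the x and y derivatives inside a few vertical strips of a candidate window. The 3-strip layout is an assumption for illustration, not the exact block patterns of figure (4).

```python
import numpy as np

def block_derivative_features(window, n_blocks=3):
    """Mean and STD of |dI/dx| and |dI/dy| inside vertical strips of a window.

    The 3-strip split is illustrative only; the paper uses several hand-designed
    block patterns (figure 4) rather than this particular layout.
    """
    window = window.astype(np.float64)
    dx = np.abs(np.diff(window, axis=1))   # modulus of horizontal derivative
    dy = np.abs(np.diff(window, axis=0))   # modulus of vertical derivative

    features = []
    for grad in (dx, dy):
        # Averaging within each strip smooths out the per-letter fluctuations,
        # which is what gives these features low entropy.
        for block in np.array_split(grad, n_blocks, axis=1):
            features.append(block.mean())
            features.append(block.std())
    return np.array(features)

# Example: a random 20x40 "window" (height x width, the 2:1 ratio used in the paper).
feats = block_derivative_features(np.random.rand(20, 40))
print(feats.shape)   # (12,) = 2 gradients x 3 blocks x (mean, std)
```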
We build weak classifiers from these features by computing probability distributions. Formally, a good feature f(I) will determine two probability distributions P(f(I)|text) and P(f(I)|non-text). We can obtain a weak classifier by using the log-likelihood ratio test. This is made easier if we can find tests for which P(f(I)|text) is strongly peaked (i.e. has low entropy because it gives similar results for every image of text), provided this peak occurs at a place where P(f(I)|non-text) is small. Such tests are computationally cheap to implement because they only involve checking the value of f(I) within a small range.
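A minimal sketch of how such a weak classifier could be built from empirical histograms of a feature response on and off text follows; the number of bins and the smoothing constant are illustrative assumptions, not values from the paper.

```python
import numpy as np

def make_llr_weak_classifier(f_text, f_nontext, n_bins=32, eps=1e-3):
    """Weak classifier h(f) = sign(log P(f|text) - log P(f|non-text)).

    f_text / f_nontext are 1-D arrays of a feature response measured on text
    and non-text training windows. Bin counts are lightly smoothed with eps.
    """
    lo = min(f_text.min(), f_nontext.min())
    hi = max(f_text.max(), f_nontext.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    p_text, _ = np.histogram(f_text, bins=bins)
    p_non, _ = np.histogram(f_nontext, bins=bins)
    p_text = (p_text + eps) / (p_text + eps).sum()
    p_non = (p_non + eps) / (p_non + eps).sum()
    llr = np.log(p_text) - np.log(p_non)     # log-likelihood ratio per bin

    def h(f):
        idx = np.clip(np.digitize(f, bins) - 1, 0, n_bins - 1)
        return np.where(llr[idx] > 0, 1, -1)  # +1 = text, -1 = non-text
    return h
```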
We also have a second class of features which are more
complicated. These include tests based on the histograms
of the intensity, gradient direction, and intensity gradient.
Figure 3. The means of the moduli of the x (left top) and y (right top) derivative filters have this pattern. Observe that the average is different directly above/below the text compared to the response on the text. The y derivative is small everywhere. The x derivatives tend to have large variance (bottom left) and the y derivatives have small variance (bottom right); the bottom panels show the STD of the modulus of the horizontal and vertical derivatives.
In ideal text images, we would be able to classify pixels as text or background directly from the intensity histogram, which should have two peaks corresponding to the text and background mean intensities. But, in practice, the histograms typically have only a single peak, see figure (5) (top right). By computing a joint histogram of the intensity and the intensity derivative, see figure (5) (bottom left), we are able to estimate the text and background mean intensities. These joint histograms are useful tests for distinguishing between text and non-text.
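One way to realize this idea, sketched below, is to restrict the intensity histogram to low-gradient pixels, which removes the contaminating edge pixels described in figure (5). The gradient threshold, bin count, and peak-separation margin are illustrative assumptions.

```python
import numpy as np

def estimate_text_background_intensities(patch, grad_frac=0.25, n_bins=32):
    """Estimate the two dominant intensities (text vs background) in an 8-bit patch.

    Pixels with a large intensity-gradient modulus (edge pixels) are discarded
    before histogramming; grad_frac is an illustrative threshold, not a value
    taken from the paper.
    """
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)
    grad = np.hypot(gx, gy)
    keep = grad < grad_frac * max(grad.max(), 1e-9)      # low-gradient pixels only

    hist, edges = np.histogram(patch[keep], bins=n_bins, range=(0, 255))
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Take the two strongest, well-separated histogram peaks as the two means.
    order = np.argsort(hist)[::-1]
    first = order[0]
    second = next((i for i in order[1:] if abs(centers[i] - centers[first]) > 32),
                  order[1])
    return sorted((centers[first], centers[second]))
```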
Figure 4. Block patterns. Features which compute properties averaged within these regions will typically have low entropy, because the fluctuations shown in the previous figure have been averaged out.

Figure 5. The original image (top left) has an intensity histogram (top right) with only a single peak. But the joint histogram of intensity and intensity gradient shows two peaks (bottom left), also shown in profile (bottom right). The intensity histogram is contaminated by edge pixels which have high intensity gradient and intensity values intermediate between the background and foreground mean intensities. The intensity gradient information helps remove this contamination.
Our third, and final, class of features is based on performing edge detection, by intensity gradient thresholding, followed by edge linking. These features are more computationally expensive than the previous tests, so we only use them later in the AdaBoost cascade, see the next section. Such features count the number of extended edges in the image. These are also properties with low entropy, since there will typically be a fixed number of long edges whatever the letters in the text region.
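A rough sketch of this class of feature follows, under the assumption that an "extended edge" is a connected component of thresholded gradient pixels above a minimum length; both thresholds here are illustrative rather than the paper's settings.

```python
import numpy as np
from scipy import ndimage

def count_extended_edges(window, grad_thresh=30.0, min_length=15):
    """Count connected edge chains longer than min_length pixels.

    Edge pixels are obtained by thresholding the intensity-gradient modulus and
    linked by 8-connected component labelling; grad_thresh and min_length are
    illustrative assumptions.
    """
    gy, gx = np.gradient(window.astype(np.float64))
    edges = np.hypot(gx, gy) > grad_thresh
    labels, n = ndimage.label(edges, structure=np.ones((3, 3)))
    if n == 0:
        return 0
    sizes = ndimage.sum(edges, labels, index=np.arange(1, n + 1))
    return int(np.sum(sizes >= min_length))
```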
In summary, we had: (i) 40 first class features, including 4 intensity mean features, 12 intensity standard deviation features, and 24 derivative features; (ii) 24 second class features, including 14 histogram features; and (iii) 25 third class features based on edge linking.
Ideally, we would learn joint distributions P(f(I)|text) and P(f(I)|non-text) for all features f. In practice, this is impossible because of the dimensionality of the feature set and because we do not know which set of features should be chosen. We would need an immense amount of training data.
Instead, we use both single features and joint distributions for pairs of features, followed by log-likelihood ratio tests, as our weak classifiers, see figure (6). These are then combined together by standard AdaBoost techniques. It is worth noting that all the weak classifiers selected by AdaBoost are from joint distributions, indicating that they are more "discriminant" and making the learning process less greedy.

Figure 6. Joint histograms of the first features that AdaBoost selected.

The result of this feature selection approach is that our final strong classifier, see the next section, uses far fewer filters than Viola and Jones' face detection classifier [19]. This helps the transparency of the system.
5. AdaBoost Learning
The AdaBoost algorithm [4] has been shown to be arguably the most effective method for detecting target objects in images [19]. Its performance on detecting faces [19] compares favorably with other successful algorithms for detecting faces [3, 15, 17, 21, 24].
The standard AdaBoost algorithm learns a "strong classifier" $H_{\mathrm{Ada}}(I)$ by combining a set of $T$ "weak classifiers" $\{h_t(I)\}$ using a set of weights $\{\alpha_t\}$:

$$H_{\mathrm{Ada}}(I) = \mathrm{sign}\Big(\sum_{t=1}^{T} \alpha_t h_t(I)\Big).$$
The selection of features and weights is learned through supervised training off-line [4]. Formally, AdaBoost uses a set of input data $\{I_i, y_i : i = 1, \ldots, N\}$, where $I_i$ is the input, in this case the image windows described below, and $y_i$ is the classification: $y_i = 1$ indicates text and $y_i = -1$ indicates non-text. The algorithm uses a set of weak classifiers denoted by $\{h_\mu(\cdot)\}$. These weak classifiers correspond to a decision of text or non-text based on simple tests of visual cues (see the next paragraph). These weak classifiers are only required to make the correct classification slightly over half the time. The AdaBoost algorithm proceeds by defining a set of weights $D_t(i)$ on the samples. At $t = 1$, the samples are equally weighted so $D_1(i) = 1/N$. The update rule consists of three stages. Firstly, update the weights by

$$D_{t+1}(i) = D_t(i)\, e^{-y_i \alpha_t h_t(I_i)} / Z_t[\alpha_t, h_t],$$

where $Z_t$ is a normalization factor chosen so that $\sum_{i=1}^{N} D_{t+1}(i) = 1$. The algorithm selects the $\alpha_t, h_t(\cdot)$ that minimize $Z_t[\alpha_t, h_t(\cdot)]$. Then the process repeats and outputs a strong classifier $H_{\mathrm{Ada}}(I) = \mathrm{sign}(\sum_{t=1}^{T} \alpha_t h_t(I))$. It can be shown that this classifier will converge to the optimal classifier as the number of weak classifiers increases [4].
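A compact sketch of the discrete AdaBoost loop described by these equations is given below. It selects each weak classifier by minimizing the weighted error, which for ±1-valued weak classifiers with the optimal α_t is equivalent to minimizing Z_t [4]; the candidate weak classifiers are passed in as plain functions, so this is a schematic of the procedure rather than the paper's implementation.

```python
import numpy as np

def adaboost_train(X, y, weak_learners, T=50):
    """Discrete AdaBoost.

    X: (N, d) feature responses, y: labels in {-1, +1}.
    weak_learners: candidate functions h(X) -> predictions in {-1, +1}.
    Returns the selected (alpha_t, h_t) pairs defining the strong classifier.
    """
    N = len(y)
    D = np.full(N, 1.0 / N)                 # D_1(i) = 1/N
    strong = []
    for _ in range(T):
        # Pick the weak classifier with the smallest weighted error eps_t.
        errs = [np.sum(D * (h(X) != y)) for h in weak_learners]
        t = int(np.argmin(errs))
        eps = max(errs[t], 1e-12)
        if eps >= 0.5:                      # no weak classifier better than chance
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        pred = weak_learners[t](X)
        D *= np.exp(-y * alpha * pred)      # D_{t+1}(i) proportional to D_t(i) e^{-y_i alpha_t h_t(I_i)}
        D /= D.sum()                        # division by Z_t so the weights sum to 1
        strong.append((alpha, weak_learners[t]))
    return strong

def adaboost_predict(strong, X):
    """H_Ada(I) = sign(sum_t alpha_t h_t(I))."""
    score = sum(alpha * h(X) for alpha, h in strong)
    return np.where(score >= 0, 1, -1)
```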
AdaBoost requires a set of classified data with image windows labelled manually as being text or non-text. Figure (7) shows some text examples. We performed this labelling for the training dataset and divided each text window into several overlapping text segments with a fixed width-to-height ratio of 2:1. This led to a total of 7,132 text segments which were used as positive examples. The negative examples were obtained by a bootstrap process similar to Drucker et al [2]. First we selected negative examples by randomly sampling from windows in the image dataset. After training with these samples, we applied the AdaBoost algorithm to classify all windows in the training images (at a range of sizes). Those misclassified as text were then used as negative examples for retraining AdaBoost. The image regions most easily confused with text were vegetation, repetitive structures such as railings or building facades, and some chance patterns.
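A schematic of this bootstrap (hard negative mining) loop is sketched below; sample_random_windows, scan_all_windows and train_adaboost are placeholder names standing in for the steps described in the text, not real APIs.

```python
# Schematic bootstrap loop for collecting negative examples.
# The helper functions are placeholders passed in by the caller.

def bootstrap_negatives(train_images, positives, train_adaboost,
                        sample_random_windows, scan_all_windows, rounds=2):
    negatives = sample_random_windows(train_images)   # initial random negatives
    classifier = train_adaboost(positives, negatives)
    for _ in range(rounds):
        # Windows the current classifier wrongly labels as text become
        # additional negative examples for retraining.
        false_positives = [w for w in scan_all_windows(train_images, classifier)
                           if w.predicted_text and not w.is_text]
        negatives += false_positives
        classifier = train_adaboost(positives, negatives)
    return classifier
```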
Figure 7. Text examples used for getting positive examples for training AdaBoost. Observe the low quality of some of the examples.
The previous section described the weak classifiers we used for training AdaBoost. We used standard AdaBoost training methods to learn the strong classifier [4],[5], combined with Viola and Jones' cascade approach which uses asymmetric weighting [19]. The cascade approach enables the algorithm to rule out most of the image as text locations with a few tests (so we do not have to apply all the tests everywhere in the image). This makes the algorithm extremely fast when applied to the test dataset and yields an order of magnitude speed-up over standard AdaBoost [19]. Our algorithm had a total of 4 cascade layers, with 1, 10, 30, and 50 tests respectively. The overall algorithm uses 91 different feature tests. The first three layers of the cascade only use mean, STD, and modulus-of-derivative features, since they can be easily calculated from integral images [19]. Computation-intensive features, histograms and edge linking, involve all pixels inside the sub-window, so we only allow them to be selected in the last layer.
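A minimal sketch of cascade evaluation follows: a window is rejected as soon as any layer's strong classifier falls below its threshold, so most windows only pay for the cheap first layer. Representing each layer as a scoring function plus a threshold is an assumption for illustration (e.g. the weighted sum from the AdaBoost sketch above).

```python
def cascade_classify(window_features, layers):
    """Return True if a window passes all cascade layers (i.e. is labelled text).

    `layers` is a list of (score_fn, threshold) pairs ordered cheapest first,
    mirroring the 4-layer structure (1, 10, 30, 50 tests) described above;
    score_fn(features) returns the weighted sum of that layer's weak classifiers.
    """
    for score_fn, threshold in layers:
        if score_fn(window_features) < threshold:
            return False    # early rejection: the later, costlier layers are skipped
    return True
```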
In the test stage, we applied the AdaBoost strong classifier H(I) to windows of the input images at a range of scales. There was a total of 14 different window sizes, ranging from 20 by 10 to 212 by 106, with a scaling factor of 1.2. Each window was classified by the algorithm as text or non-text. There was often overlap between windows classified as text. We merged these regions by taking the union of the text windows.
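A sketch of this multi-scale scan is shown below: 14 window sizes growing by a factor of 1.2 from 20 by 10 pixels, each slid over the image. The stride is an illustrative assumption, as the paper does not state it here.

```python
import itertools

def window_sizes(n_scales=14, base=(20, 10), factor=1.2):
    """Window sizes from 20x10 upward, scaling by 1.2 per step (roughly 212x106 at the top)."""
    w, h = base
    return [(round(w * factor**s), round(h * factor**s)) for s in range(n_scales)]

def sliding_windows(image_shape, stride_frac=0.25):
    """Yield (x, y, w, h) boxes at all scales; stride_frac is an illustrative choice."""
    H, W = image_shape
    for w, h in window_sizes():
        step_x = max(1, int(w * stride_frac))
        step_y = max(1, int(h * stride_frac))
        for y, x in itertools.product(range(0, H - h + 1, step_y),
                                      range(0, W - w + 1, step_x)):
            yield x, y, w, h
```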
In our test stage, AdaBoost gave very high performance with low false positives and false negatives (in agreement with previous work on faces [19]). When applied to over 20,000,000 image windows, taken from 35 images, the total number of false positives was 118 and the number of false negatives was 27. By altering the threshold we could reduce the number of false negatives to 5, but at the price of raising the number of false positives, see table (1). We decided not to alter the threshold, so as to keep the number of false positives down to an average of 4 per image (almost all of which will be eliminated at the reading stage).
Thresh    False Pos.    False Neg.    Images    Subwindows
 0.00        118            27          35      20,183,316
-0.05       1879             5          35      20,183,316

Table 1. Performance of AdaBoost at different thresholds. Observe the excellent overall performance and the trade-off between false positives and false negatives.
We illustrate these results by showing the windows that
AdaBoost classifies as text for typical images in the test
dataset, see figure (8).
6. Extension and Binarization
Our next stage produces binarized text regions to be used as inputs to the OCR reading stage. (It is possible to run OCR directly on intensity images, but we obtain substantially worse performance if we do so.) In addition to binarization, we must extend the text regions found by the AdaBoost strong classifier, because these regions sometimes miss letters or digits at the start and end of the text.

We start by applying adaptive binarization [12] to the text regions detected by the AdaBoost strong classifier. This is followed by a connected component algorithm [13] which
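The adaptive binarization of [12] is Niblack's method, which thresholds each pixel x at T_r(x) = mu_r(x) + k * sigma_r(x), where mu_r and sigma_r are the mean and standard deviation of the intensities in a local window of size r. Below is a minimal sketch of this step; the window size, the constant k, and the dark-text-on-light-background polarity are illustrative choices rather than the paper's settings.

```python
import numpy as np
from scipy import ndimage

def niblack_binarize(patch, r=15, k=-0.2):
    """Niblack-style adaptive binarization: threshold T_r(x) = mu_r(x) + k * sigma_r(x).

    mu_r and sigma_r are the local mean and standard deviation of the intensities
    in an r x r window around each pixel. r and k are illustrative values only.
    """
    patch = patch.astype(np.float64)
    mu = ndimage.uniform_filter(patch, size=r)
    mu_sq = ndimage.uniform_filter(patch * patch, size=r)
    sigma = np.sqrt(np.maximum(mu_sq - mu * mu, 0.0))
    threshold = mu + k * sigma
    # Assumes dark text on a lighter background; flip the comparison for the
    # opposite polarity so that, ideally, text comes out white on black.
    return patch < threshold
```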
