Detecting and Reading Text in Natural Scenes

Xiangrong Chen¹ and Alan L. Yuille¹,²
Departments of Statistics¹ and Psychology²,
University of California, Los Angeles, Los Angeles, CA 90095.
emails: {xrchen,yuille}@stat.ucla.edu
Abstract
This paper gives an algorithm for detecting and reading text in natural images. The algorithm is intended for use by blind and visually impaired subjects walking through city scenes. We first obtain a dataset of city images taken by blind and normally sighted subjects. From this dataset, we manually label and extract the text regions. Next we perform statistical analysis of the text regions to determine which image features are reliable indicators of text and have low entropy (i.e. the feature response is similar for all text images). We obtain weak classifiers by using joint probabilities for feature responses on and off text. These weak classifiers are used as input to an AdaBoost machine learning algorithm to train a strong classifier. In practice, we trained a cascade with 4 strong classifiers containing 79 features. An adaptive binarization and extension algorithm is applied to those regions selected by the cascade classifier. Commercial OCR software is used to read the text or reject it as a non-text region. The overall algorithm has a success rate of over 90% (evaluated by complete detection and reading of the text) on the test set, and the unread text is typically small and distant from the viewer.
1. Introduction
This paper presents an algorithm for detecting and reading text in city scenes. This text includes stereotypical forms such as street signs, hospital signs, and bus numbers, as well as more variable forms such as shop signs, house numbers, and billboards. Our database of city images was taken in San Francisco, partly by normally sighted viewers and partly by blind volunteers who were accompanied by sighted guides (for safety reasons), used automatic camera settings, and had little practical knowledge of where the text was located in the image. The databases have been labelled to enable us to train part of our algorithm and to evaluate its performance.
The first, and most important, component of the algorithm is a strong classifier which is trained by the AdaBoost learning algorithm [4],[19],[20] on labelled data. AdaBoost requires specifying a set of features from which to build the strong classifier. This paper selects this feature set guided by the principle of informative features (the feature set used in [19] is not suitable for this problem). We calculate joint probability distributions of these feature responses on and off text, so weak classifiers can be obtained as log-likelihood ratio tests. The strong classifier is applied to sub-regions of the image (at multiple scales) and outputs text candidate regions. In this application, there are typically between 2-5 false positives in images of 2,048 x 1,536 pixels. The second component is an extension and binarization [12] algorithm that acts on the text region candidates. The extension and binarization algorithm takes the text regions as inputs, extends these regions so as to include text that the strong classifier did not detect, and binarizes them (ideally, so that the text is white and the background is black). The third component is an OCR software program which acts on the binarized regions (the OCR software gave far worse performance when applied directly to the image). The OCR software either determines that the regions are text, and reads them, or rejects them as non-text.
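As a rough illustration of how these three components fit together, the following minimal Python sketch chains them; the function names (detect_text_regions, extend_and_binarize, run_ocr) are placeholders for the components described above, not the authors' actual code.

```python
# Hypothetical end-to-end pipeline mirroring the three components described above.
# detect_text_regions, extend_and_binarize and run_ocr are placeholder names,
# not functions from the paper's implementation.

def read_scene_text(image, detect_text_regions, extend_and_binarize, run_ocr):
    """Detect candidate text regions, binarize them, and read them with OCR."""
    results = []
    for region in detect_text_regions(image):               # AdaBoost cascade (component 1)
        binary_patch = extend_and_binarize(image, region)    # extension + binarization (component 2)
        text = run_ocr(binary_patch)                          # commercial OCR (component 3)
        if text:                                              # OCR may reject the region as non-text
            results.append((region, text))
    return results
```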
The performance is as follows: (I) Speed. The current algorithm runs in under 3 seconds on images of size 2,048 by 1,536. (II) Quality of Results. We are able to detect text of almost all forms with a false negative rate of 2.8%. We are able to read the detected text correctly 93.0% of the time (correctness is measured per complete word and not per letter). We incorrectly read non-text as text in 10% of cases, but only 1% remains incorrectly read after we prune out text which does not form coherent words. (Many of the remaining errors correspond to outputting "111" due to vertical structures in the image.)
2. Previous Work
There has been recent successful work on detecting text in images. Some of it has concentrated on detecting individual letters [1],[6],[7]. More relevant work is reported in [8],[9],[10],[22],[23]. In particular, Lucas et al [10] report on performance analysis of text detection algorithms on a standardized database. It is hard to do a direct comparison to these papers. None of these methods use AdaBoost learning, and the details of the algorithms evaluated by Lucas et al are not given. The performance we report in this paper is better than those reported in Lucas et al, but the datasets are different and more precise comparisons on the same datasets are needed. We will be making our dataset available for testing.
3. The Datasets
We used two image datasets, one for training the AdaBoost learning algorithm and the other for testing it.

The training dataset consisted of 162 images, 41 of which were taken by scientists from SKERI (the Smith-Kettlewell Eye Research Institute) and the rest by blind volunteers under the supervision of SKERI scientists.

The test dataset of 117 images was taken entirely by blind volunteers. Briefly, the blind volunteers were equipped with a Nikon camera mounted on the shoulder or the stomach. They walked around the streets of San Francisco taking photographs. Two observers from SKERI accompanied the volunteers to ensure their safety but took no part in taking the photographs. The camera was set to the default automatic setting for focus and contrast gain control.

From the dataset construction, see figure (1), we noted that: (I) Blind volunteers could keep the camera approximately horizontal. (II) They could hold the camera fairly steady, so there was very little blur. (III) The automatic contrast gain control of the cameras was almost always sufficient to allow the images to have good contrast.
4. Selection of Features for AdaBoost
The AdaBoost algorithm is a method for combining a
set of weak classifiers to make a strong classifier. The weak
classifiers correspond to image features. Typically a large set of features is specified in advance, and the algorithm selects which ones to use and how to combine them.
The problem is that the choice of feature set is critical to the success and transparency of the algorithm. The set of features used for face detection by Viola and Jones [19] consists of a subset of Haar basis functions. But there was no rationale for this choice of feature set apart from computational efficiency. Also, there are important differences between text and face stimuli because the spatial variation per pixel of text images is far greater than for faces. Facial features, such as eyes, are in approximately the same spatial position for any face and have similar appearance. But the positions of letters in text are varied and the shapes of letters differ. For example, PCA analysis of text, see figure (2), has far more non-zero eigenvalues than for faces (where Pentland reported that 15 eigenfaces capture over ninety percent of the variance [14]).

Figure 1. Example images in the training dataset taken by blind volunteers (top two panels) and by scientists from SKERI (bottom two panels). The blind volunteers are, of course, poor at centering the signs and at keeping the camera horizontally aligned.
Ideally, we should select informative features which give similar results on all text regions, and hence have low entropy, and which are also good for discriminating between text and non-text. Statistical analysis of the dataset of training text images shows that there are many statistical regularities.

For example, we align samples from our text dataset (precise alignment is unnecessary) and analyze the response of the modulus of the x and y derivative filters at each pixel.
The means of the derivatives have an obvious pattern, see figure (3), where the derivatives are small in the background regions above and below the text.

Figure 2. PCA on our dataset of text images (40 x 20 pixels). Observe that about 150 components are required to get 90 percent of the variance. Faces require only 15 components to achieve this variance [14].
The x derivatives tend to be large in the central (i.e. text) region, while the y derivatives are large at the top and bottom of the text and small in the central region. But the variances of the x derivatives are very large within the central region (because letters have different shapes and positions). However, the y derivatives tend to have low variance, and hence low entropy.

Our first set of features is based on these observations. By averaging over regions we obtain features which have lower entropy. Based on the observations in figure (3), we designed block patterns inside the sub-window corresponding to the horizontal and vertical derivatives. We also designed three symmetrical block patterns, see figure (4), which are chosen so that there is (usually) a text element within each sub-window. This gives features based on the block-based mean and STD of the intensity and of the moduli of the x and y derivative filters.
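As a concrete illustration of block-based features of this kind, the sketch below computes the mean and STD of the moduli of the x and y derivatives inside a few vertical strips of a candidate window. The 3-strip layout is an assumption for illustration, not the exact block patterns of figure (4).

```python
import numpy as np

def block_derivative_features(window, n_blocks=3):
    """Mean and STD of |dI/dx| and |dI/dy| inside vertical strips of a window.

    The 3-strip split is illustrative only; the paper uses several hand-designed
    block patterns (figure 4) rather than this particular layout.
    """
    window = window.astype(np.float64)
    dx = np.abs(np.diff(window, axis=1))   # modulus of horizontal derivative
    dy = np.abs(np.diff(window, axis=0))   # modulus of vertical derivative

    features = []
    for grad in (dx, dy):
        # Averaging within each strip smooths out the per-letter fluctuations,
        # which is what gives these features low entropy.
        for block in np.array_split(grad, n_blocks, axis=1):
            features.append(block.mean())
            features.append(block.std())
    return np.array(features)

# Example: a random 20x40 "window" (height x width, the 2:1 ratio used in the paper).
feats = block_derivative_features(np.random.rand(20, 40))
print(feats.shape)   # (12,) = 2 gradients x 3 blocks x (mean, std)
```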
We build weak classifiers from these features by computing probability distributions. Formally, a good feature f(I) will determine two probability distributions P(f(I)|text) and P(f(I)|non-text). We can obtain a weak classifier by using the log-likelihood ratio test. This is made easier if we can find tests for which P(f(I)|text) is strongly peaked (i.e. has low entropy because it gives similar results for every image of text), provided this peak occurs at a place where P(f(I)|non-text) is small. Such tests are computationally cheap to implement because they only involve checking the value of f(I) within a small range.
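A minimal sketch of how such a weak classifier could be built from empirical histograms of a feature response on and off text follows; the number of bins and the smoothing constant are illustrative assumptions, not values from the paper.

```python
import numpy as np

def make_llr_weak_classifier(f_text, f_nontext, n_bins=32, eps=1e-3):
    """Weak classifier h(f) = sign(log P(f|text) - log P(f|non-text)).

    f_text / f_nontext are 1-D arrays of a feature response measured on text
    and non-text training windows. Bin counts are lightly smoothed with eps.
    """
    lo = min(f_text.min(), f_nontext.min())
    hi = max(f_text.max(), f_nontext.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    p_text, _ = np.histogram(f_text, bins=bins)
    p_non, _ = np.histogram(f_nontext, bins=bins)
    p_text = (p_text + eps) / (p_text + eps).sum()
    p_non = (p_non + eps) / (p_non + eps).sum()
    llr = np.log(p_text) - np.log(p_non)     # log-likelihood ratio per bin

    def h(f):
        idx = np.clip(np.digitize(f, bins) - 1, 0, n_bins - 1)
        return np.where(llr[idx] > 0, 1, -1)  # +1 = text, -1 = non-text
    return h
```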
We also have a second class of features which are more
complicated. These include tests based on the histograms
of the intensity, gradient direction, and intensity gradient.
Figure 3. The means of the moduli of the x (left top) and y (right top) derivative filters have this pattern. Observe that the average is different directly above/below the text compared to the response on the text. The y derivative is small everywhere. The x derivatives tend to have large variance (bottom left) and the y derivatives have small variance (bottom right); the bottom panels show the STD of the modulus of the horizontal and vertical derivatives.
In ideal text images, we would be able to classify pixels as text or background directly from the intensity histogram, which should have two peaks corresponding to the text and background mean intensities. But, in practice, the histograms typically have only a single peak, see figure (5) (top right). By computing a joint histogram of the intensity and the intensity derivative, see figure (5) (bottom left), we are able to estimate the text and background mean intensities. These joint histograms are useful tests for distinguishing between text and non-text.
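One way to realize this idea, sketched below, is to restrict the intensity histogram to low-gradient pixels, which removes the contaminating edge pixels described in figure (5). The gradient threshold, bin count, and peak-separation margin are illustrative assumptions.

```python
import numpy as np

def estimate_text_background_intensities(patch, grad_frac=0.25, n_bins=32):
    """Estimate the two dominant intensities (text vs background) in an 8-bit patch.

    Pixels with a large intensity-gradient modulus (edge pixels) are discarded
    before histogramming; grad_frac is an illustrative threshold, not a value
    taken from the paper.
    """
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)
    grad = np.hypot(gx, gy)
    keep = grad < grad_frac * max(grad.max(), 1e-9)      # low-gradient pixels only

    hist, edges = np.histogram(patch[keep], bins=n_bins, range=(0, 255))
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Take the two strongest, well-separated histogram peaks as the two means.
    order = np.argsort(hist)[::-1]
    first = order[0]
    second = next((i for i in order[1:] if abs(centers[i] - centers[first]) > 32),
                  order[1])
    return sorted((centers[first], centers[second]))
```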
Figure 4. Block patterns. Features which compute properties averaged within these regions will typically have low entropy, because the fluctuations shown in the previous figure have been averaged out.

Figure 5. The original image (top left) has an intensity histogram (top right) with only a single peak. But the joint histogram of intensity and intensity gradient shows two peaks (bottom left), also shown in profile (bottom right). The intensity histogram is contaminated by edge pixels which have high intensity gradient and intensity values intermediate between the background and foreground mean intensities. The intensity gradient information helps remove this contamination.
Our third, and final, class of features is based on performing edge detection, by intensity gradient thresholding, followed by edge linking. These features are more computationally expensive than the previous tests, so we only use them later in the AdaBoost cascade, see the next section. Such features count the number of extended edges in the image. These are also properties with low entropy, since there will typically be a fixed number of long edges whatever the letters in the text region.
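A rough sketch of this class of feature follows, under the assumption that an "extended edge" is a connected component of thresholded gradient pixels above a minimum length; both thresholds here are illustrative rather than the paper's settings.

```python
import numpy as np
from scipy import ndimage

def count_extended_edges(window, grad_thresh=30.0, min_length=15):
    """Count connected edge chains longer than min_length pixels.

    Edge pixels are obtained by thresholding the intensity-gradient modulus and
    linked by 8-connected component labelling; grad_thresh and min_length are
    illustrative assumptions.
    """
    gy, gx = np.gradient(window.astype(np.float64))
    edges = np.hypot(gx, gy) > grad_thresh
    labels, n = ndimage.label(edges, structure=np.ones((3, 3)))
    if n == 0:
        return 0
    sizes = ndimage.sum(edges, labels, index=np.arange(1, n + 1))
    return int(np.sum(sizes >= min_length))
```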
In summary, we had: (i) 40 first class features, including 4 intensity mean features, 12 intensity standard deviation features, and 24 derivative features; (ii) 24 second class features, including 14 histogram features; and (iii) 25 third class features based on edge linking.
Ideally, we would learn joint distributions P(f(I)|text) and P(f(I)|non-text) for all features f. In practice, this is impossible because of the dimensionality of the feature set and because we do not know which set of features should be chosen. We would need an immense amount of training data.
Instead, we use both single features and joint distributions for pairs of features, followed by log-likelihood ratio tests, as our weak classifiers, see figure (6). These are then combined together by standard AdaBoost techniques. It is worth noting that all the weak classifiers selected by AdaBoost are from joint distributions, indicating that they are more "discriminant" and making the learning process less greedy.

Figure 6. Joint histograms of the first features that AdaBoost selected.

The result of this feature selection approach is that our final strong classifier, see the next section, uses far fewer filters than Viola and Jones' face detection classifier [19]. This helps the transparency of the system.
5. AdaBoost Learning
The AdaBoost algorithm [4] has been shown to be arguably the most effective method for detecting target objects in images [19]. Its performance on detecting faces [19] compares favorably with other successful algorithms for detecting faces [3, 15, 17, 21, 24].
The standard AdaBoost algorithm learns a "strong classifier" $H_{\mathrm{Ada}}(I)$ by combining a set of $T$ "weak classifiers" $\{h_t(I)\}$ using a set of weights $\{\alpha_t\}$:

$$H_{\mathrm{Ada}}(I) = \mathrm{sign}\Big(\sum_{t=1}^{T} \alpha_t h_t(I)\Big).$$
The selection of features and weights is learned through supervised training off-line [4]. Formally, AdaBoost uses a set of input data $\{I_i, y_i : i = 1, \ldots, N\}$, where $I_i$ is the input, in this case the image windows described below, and $y_i$ is the classification: $y_i = 1$ indicates text and $y_i = -1$ indicates non-text. The algorithm uses a set of weak classifiers denoted by $\{h_\mu(\cdot)\}$. These weak classifiers correspond to a decision of text or non-text based on simple tests of visual cues (see the next paragraph). These weak classifiers are only required to make the correct classification slightly over half the time. The AdaBoost algorithm proceeds by defining a set of weights $D_t(i)$ on the samples. At $t = 1$, the samples are equally weighted so $D_1(i) = 1/N$. The update rule consists of three stages. Firstly, update the weights by

$$D_{t+1}(i) = D_t(i)\, e^{-y_i \alpha_t h_t(I_i)} / Z_t[\alpha_t, h_t],$$

where $Z_t$ is a normalization factor chosen so that $\sum_{i=1}^{N} D_{t+1}(i) = 1$. The algorithm selects the $\alpha_t, h_t(\cdot)$ that minimize $Z_t[\alpha_t, h_t(\cdot)]$. Then the process repeats and outputs a strong classifier $H_{\mathrm{Ada}}(I) = \mathrm{sign}(\sum_{t=1}^{T} \alpha_t h_t(I))$. It can be shown that this classifier will converge to the optimal classifier as the number of weak classifiers increases [4].
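A compact sketch of the discrete AdaBoost loop described by these equations is given below. It selects each weak classifier by minimizing the weighted error, which for ±1-valued weak classifiers with the optimal α_t is equivalent to minimizing Z_t [4]; the candidate weak classifiers are passed in as plain functions, so this is a schematic of the procedure rather than the paper's implementation.

```python
import numpy as np

def adaboost_train(X, y, weak_learners, T=50):
    """Discrete AdaBoost.

    X: (N, d) feature responses, y: labels in {-1, +1}.
    weak_learners: candidate functions h(X) -> predictions in {-1, +1}.
    Returns the selected (alpha_t, h_t) pairs defining the strong classifier.
    """
    N = len(y)
    D = np.full(N, 1.0 / N)                 # D_1(i) = 1/N
    strong = []
    for _ in range(T):
        # Pick the weak classifier with the smallest weighted error eps_t.
        errs = [np.sum(D * (h(X) != y)) for h in weak_learners]
        t = int(np.argmin(errs))
        eps = max(errs[t], 1e-12)
        if eps >= 0.5:                      # no weak classifier better than chance
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        pred = weak_learners[t](X)
        D *= np.exp(-y * alpha * pred)      # D_{t+1}(i) proportional to D_t(i) e^{-y_i alpha_t h_t(I_i)}
        D /= D.sum()                        # division by Z_t so the weights sum to 1
        strong.append((alpha, weak_learners[t]))
    return strong

def adaboost_predict(strong, X):
    """H_Ada(I) = sign(sum_t alpha_t h_t(I))."""
    score = sum(alpha * h(X) for alpha, h in strong)
    return np.where(score >= 0, 1, -1)
```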
AdaBoost requires a set of classified data with image windows labelled manually as being text or non-text. Figure (7) shows some text examples. We performed this labelling for the training dataset and divided each text window into several overlapping text segments with a fixed width-to-height ratio of 2:1. This led to a total of 7,132 text segments which were used as positive examples. The negative examples were obtained by a bootstrap process similar to Drucker et al [2]. First we selected negative examples by randomly sampling from windows in the image dataset. After training with these samples, we applied the AdaBoost algorithm to classify all windows in the training images (at a range of sizes). Those misclassified as text were then used as negative examples for retraining AdaBoost. The image regions most easily confused with text were vegetation, repetitive structures such as railings or building facades, and some chance patterns.
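A schematic of this bootstrap (hard negative mining) loop is sketched below; sample_random_windows, scan_all_windows and train_adaboost are placeholder names standing in for the steps described in the text, not real APIs.

```python
# Schematic bootstrap loop for collecting negative examples.
# The helper functions are placeholders passed in by the caller.

def bootstrap_negatives(train_images, positives, train_adaboost,
                        sample_random_windows, scan_all_windows, rounds=2):
    negatives = sample_random_windows(train_images)   # initial random negatives
    classifier = train_adaboost(positives, negatives)
    for _ in range(rounds):
        # Windows the current classifier wrongly labels as text become
        # additional negative examples for retraining.
        false_positives = [w for w in scan_all_windows(train_images, classifier)
                           if w.predicted_text and not w.is_text]
        negatives += false_positives
        classifier = train_adaboost(positives, negatives)
    return classifier
```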
Figure 7. Text examples used for getting positive examples for training AdaBoost. Observe the low quality of some of the examples.
The previous section described the weak classifiers we used for training AdaBoost. We used standard AdaBoost training methods to learn the strong classifier [4],[5], combined with Viola and Jones' cascade approach which uses asymmetric weighting [19]. The cascade approach enables the algorithm to rule out most of the image as text locations with a few tests (so we do not have to apply all the tests everywhere in the image). This makes the algorithm extremely fast when applied to the test dataset and yields an order of magnitude speed-up over standard AdaBoost [19]. Our algorithm had a total of 4 cascade layers, with 1, 10, 30, and 50 tests respectively. The overall algorithm uses 91 different feature tests. The first three layers of the cascade only use mean, STD, and modulus-of-derivative features, since they can be easily calculated from integral images [19]. Computation-intensive features, histograms and edge linking, involve all pixels inside the sub-window, so we only allow them to be selected in the last layer.
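A minimal sketch of cascade evaluation follows: a window is rejected as soon as any layer's strong classifier falls below its threshold, so most windows only pay for the cheap first layer. Representing each layer as a scoring function plus a threshold is an assumption for illustration (e.g. the weighted sum from the AdaBoost sketch above).

```python
def cascade_classify(window_features, layers):
    """Return True if a window passes all cascade layers (i.e. is labelled text).

    `layers` is a list of (score_fn, threshold) pairs ordered cheapest first,
    mirroring the 4-layer structure (1, 10, 30, 50 tests) described above;
    score_fn(features) returns the weighted sum of that layer's weak classifiers.
    """
    for score_fn, threshold in layers:
        if score_fn(window_features) < threshold:
            return False    # early rejection: the later, costlier layers are skipped
    return True
```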
In the test stage, we applied the AdaBoost strong classifier H(I) to windows of the input images at a range of scales. There was a total of 14 different window sizes, ranging from 20 by 10 to 212 by 106, with a scaling factor of 1.2. Each window was classified by the algorithm as text or non-text. There was often overlap between windows classified as text. We merged these regions by taking the union of the text windows.
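A sketch of this multi-scale scan is shown below: 14 window sizes growing by a factor of 1.2 from 20 by 10 pixels, each slid over the image. The stride is an illustrative assumption, as the paper does not state it here.

```python
import itertools

def window_sizes(n_scales=14, base=(20, 10), factor=1.2):
    """Window sizes from 20x10 upward, scaling by 1.2 per step (roughly 212x106 at the top)."""
    w, h = base
    return [(round(w * factor**s), round(h * factor**s)) for s in range(n_scales)]

def sliding_windows(image_shape, stride_frac=0.25):
    """Yield (x, y, w, h) boxes at all scales; stride_frac is an illustrative choice."""
    H, W = image_shape
    for w, h in window_sizes():
        step_x = max(1, int(w * stride_frac))
        step_y = max(1, int(h * stride_frac))
        for y, x in itertools.product(range(0, H - h + 1, step_y),
                                      range(0, W - w + 1, step_x)):
            yield x, y, w, h
```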
In our test stage, AdaBoost gave very high performance with low false positives and false negatives (in agreement with previous work on faces [19]). When applied to over 20,000,000 image windows, taken from 35 images, the total number of false positives was 118 and the number of false negatives was 27. By altering the threshold we could reduce the number of false negatives to 5, but at the price of raising the number of false positives, see table (1). We decided not to alter the threshold, so as to keep the number of false positives down to an average of 4 per image (almost all of which will be eliminated at the reading stage).
Thresh    False Pos.    False Neg.    Images    Subwindows
 0.00        118            27          35      20,183,316
-0.05       1879             5          35      20,183,316

Table 1. Performance of AdaBoost at different thresholds. Observe the excellent overall performance and the trade-off between false positives and false negatives.
We illustrate these results by showing the windows that
AdaBoost classifies as text for typical images in the test
dataset, see figure (8).
6. Extension and Binarization
Our next stage produces binarized text regions to be used as inputs to the OCR reading stage. (It is possible to run OCR directly on intensity images, but we obtain substantially worse performance if we do so.) In addition to binarization, we must extend the text regions found by the AdaBoost strong classifier, because these regions sometimes miss letters or digits at the start and end of the text.

We start by applying adaptive binarization [12] to the text regions detected by the AdaBoost strong classifier. This is followed by a connected component algorithm [13] which
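The adaptive binarization of [12] is Niblack's method, which thresholds each pixel x at T_r(x) = mu_r(x) + k * sigma_r(x), where mu_r and sigma_r are the mean and standard deviation of the intensities in a local window of size r. Below is a minimal sketch of this step; the window size, the constant k, and the dark-text-on-light-background polarity are illustrative choices rather than the paper's settings.

```python
import numpy as np
from scipy import ndimage

def niblack_binarize(patch, r=15, k=-0.2):
    """Niblack-style adaptive binarization: threshold T_r(x) = mu_r(x) + k * sigma_r(x).

    mu_r and sigma_r are the local mean and standard deviation of the intensities
    in an r x r window around each pixel. r and k are illustrative values only.
    """
    patch = patch.astype(np.float64)
    mu = ndimage.uniform_filter(patch, size=r)
    mu_sq = ndimage.uniform_filter(patch * patch, size=r)
    sigma = np.sqrt(np.maximum(mu_sq - mu * mu, 0.0))
    threshold = mu + k * sigma
    # Assumes dark text on a lighter background; flip the comparison for the
    # opposite polarity so that, ideally, text comes out white on black.
    return patch < threshold
```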
