What is the method for detecting text from uncontrolled images?

They typically rely on brittle techniques such as binarization, where the first stage of processing is a simple thresholding operation used to divide an image into text and non-text pixels [19].

How much does the larger beam improve the recall?

The larger beam width improves recall by only 0.5%, suggesting that approximate inference is not an important limit on performance.

How many languages can be recognized simultaneously?

A single worker is designed to recognize multiple languages simultaneously; at present the authors support 29 languages with Latin script.

How does the system perform on the major text recognition benchmarks?

The system achieves record performance on all major text recognition benchmarks, and high quality text extraction from typical smartphone imagery with sub-second latency.

What is the classifier probability for the ith segment?

$c$ is a vector of class assignments, the $i$-th segment being assigned label $c_i$. $\Psi(c_i, b_i, b_{i+1})$ is the classifier probability for class assignment $c_i$ to the pixels between $b_i$ and $b_{i+1}$.

What is the way to correct the character labels?

The extracted character bounding boxes come from their OCR system, but any errors in the character labels are corrected by alignment against the source text.

How much of the word error rate is reduced by adding the character level model?

Adding the word level model gives a further 4% word error rate reduction.

Training Set Size | Character Classifier Accuracy (%) | Word Recognition Rate (%)
1.1 × 10^7 (*) | 92.18 | 70.99
3.9 × 10^6 | 91.79 | 70.47
1.9 × 10^6 | 90.98 | 68.83
9.7 × 10^5 | 90.60 | 69.20
4.9 × 10^5 | 89.19 | 65.77
2.5 × 10^5 | 88.38 | 60.50
1.2 × 10^5 | 86.74 | 53.20
6.3 × 10^4 | 85.21 | 46.74

(Open Access) PhotoOCR: Reading Text in Uncontrolled Conditions (2013) | Alessandro Bissacco

Q: What have the authors contributed in "Photoocr: reading text in uncontrolled conditions" ?

The authors describe PhotoOCR, a system for text extraction from images. Their approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. The authors evaluate their system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks.

Q: How do the authors train their neural network character classifier?

The authors train their neural network character classifier using stochastic gradient descent with Adagrad [7] and dropout [10], using the distributed training design described in [6].

Q: How many tokens are trained on the deep neural network?

In particular, their deep neural network character classifier is trained on up to 2 million manually labelled examples, and their language model is learned on a corpus of more than a trillion tokens.

PhotoOCR: Reading Text in Uncontrolled Conditions

Alessandro Bissacco∗, Mark Cummins∗, Yuval Netzer∗, Hartmut Neven

Google Inc.

Abstract

We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.

1. Introduction

Extraction of text from uncontrolled images is a challenging problem with many practical applications. Reliable text recognition would provide a useful input modality for smartphones, particularly in applications such as translation where the text may be difficult for a user to input by other means. Text extraction is also useful in robotics, as a search signal in large image collections, in wearable devices and numerous other areas.

Commercially available OCR systems are designed primarily for document images such as those from a flatbed scanner, and perform poorly on general imagery. They typically rely on brittle techniques such as binarization, where the first stage of processing is a simple thresholding operation used to divide an image into text and non-text pixels [19]. Challenging input for existing commercial systems includes both scene text (such as Figure 1) and also more document-like text that suffers from blur, low resolution or other degradations that are common in smartphone imagery (see Figure 2).

∗ These authors contributed equally.

Figure 1: An example of scene text detected and recognized by our system.

To address these problems, this paper describes the design of a complete OCR system built using modern computer vision techniques. In particular we take advantage of substantial recent progress in machine learning [6, 10] and large scale language modelling [1]. Our system outperforms previous approaches by a wide margin, more than halving the error rate on the main public benchmarks. We scale the individual components of our system to a regime orders of magnitude larger than explored in prior work. In particular, our deep neural network character classifier is trained on up to 2 million manually labelled examples, and our language model is learned on a corpus of more than a trillion tokens. We maintain sub-second recognition latency primarily through careful engineering. We have trained versions of our system for 29 languages with Latin script, plus Greek, Hebrew and four Cyrillic languages.

2. Related Work

Design of a complete OCR system for natural images is a substantial task, and as such there are relatively few examples in the literature. Many publications address sub-tasks such as text detection and isolated character classification. One of the best performing text detection methods is the stroke width transform of [8]. Isolated character classification is widely used as a machine learning benchmark [17]. However, the mid-level problem of fusing character classifier and language model signals for complete text extraction is less commonly addressed. Application papers often perform text detection and preprocessing before applying a commercial OCR system designed for printed documents, as for example in [4].

2013 IEEE International Conference on Computer Vision

DOI 10.1109/ICCV.2013.102

Among fully complete systems for the scene text extraction problem, language modelling is often less developed than the image processing components. For example the method of [18] uses a bigram language model together with a set of hand-designed image features and a support vector machine classifier for text detection and recognition. In the method of [27], text detection is assumed and recognition is performed by fusing appearance, self-similarity, lexicon and bigram language signals in a sparse belief propagation framework. The work of [23] focuses on the use of appearance similarity constraints to improve performance.

Several other systems simplify the problem by assuming that the words to be recognized come from a fixed lexicon. The system of [20] describes a large-lexicon design, using weighted finite state transducers to perform joint inference over appearance, self-consistency and language signals. Other systems assume a small lexicon (transforming the problem from OCR to word spotting). One such method is [25], where detection and character classification are performed in a single step using randomized ferns. The method of [15] also relies on small lexicons to allow for a strong bigram constraint in a conditional random field (CRF) model.

Finally, we note that the classic work on handwriting recognition systems [14] tackles essentially the same problem as discussed in this paper, and our system bears many similarities to that design. Work on low-resolution document OCR [11] is also closely related.

3. System Design

In general outline our system takes a conventional multi-stage approach to text extraction, similar to designs such as [2]. We begin by performing text detection on the input image. The detector returns candidate regions containing individual lines of text. Detection is tuned for high recall, relying on later stages of processing to reject false positives. Candidate text lines from the detection stage are processed for text recognition. Recognition begins with a 1D over-segmentation of the text line to identify candidate character regions. We then search through the space of segmentations to maximize a score which combines the character classifier and language model likelihoods. The top hypotheses produced by this search are then re-scored using additional signals which are too expensive to apply during initial inference. The reason for this staged approach is computational: it would be prohibitively expensive to apply full inference at all locations and scales in the input image, or to apply our character classifier at all locations in each candidate text region. Our primary intended application is text extraction as an input modality for smartphone users, which limits total acceptable processing time to at most one or two seconds per image. This constraint informs several of our design decisions. The following sections describe each stage of the process in detail.

3.1. Text Detection

A detailed description of the text detection portion of our system is outside the scope of this paper. Briefly, we combine the output of three different text detection approaches. The first approach is a Viola-Jones style boosted cascade of Haar wavelets [24]. The second approach extracts connected components from a binarized image and uses graph cuts to find a text/non-text labelling by minimizing an MRF with learned energy terms, broadly similar to [21]. The final detector is a novel approach based on anisotropic Gaussian filters. This portion of the system also deals with splitting text regions into individual lines and correcting orientation to horizontal, both of which are relatively trivial. For the remainder of the paper we will focus on extracting text from the horizontal line region candidates.

3.2. Over-Segmentation

The over-segmentation step divides the text line into segments which should contain no more than one character (but characters may be split into multiple segments). This is a 1D segmentation task. We combine the output of two different segmentation methods to improve recall.

The first segmentation method used is typical of document OCR systems. The input image is binarized using Niblack binarization [19], a morphological opening operation is applied, and connected components are extracted from the resulting binary image. This simple approach is very effective on easier images, but can fail in the presence of blur, low resolution, complex backgrounds, etc.
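As a rough illustration of the first segmentation method, the sketch below applies Niblack's local threshold (a pixel counts as text when it is darker than the local mean plus k times the local standard deviation) and then splits the line at columns that contain no text pixels. The window radius, the value k = -0.2, the dark-text-on-light-background assumption, and the column-gap splitting (used here in place of morphological opening and 2D connected components) are all assumptions chosen for a small example, not settings reported in the paper.

```python
import math

def niblack_binarize(img, radius=2, k=-0.2):
    """Mark a pixel as text (1) when it is darker than mean + k * std of its local window."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mean = sum(window) / len(window)
            std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
            out[y][x] = 1 if img[y][x] < mean + k * std else 0
    return out

def column_runs(binary):
    """Return (start, end_exclusive) column ranges containing at least one text pixel."""
    has_text = [any(row[x] for row in binary) for x in range(len(binary[0]))]
    runs, start = [], None
    for x, flag in enumerate(has_text + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = x
        elif not flag and start is not None:
            runs.append((start, x))
            start = None
    return runs

# Toy text line: light background (200) with two dark vertical strokes (20).
image = [[20 if x in (2, 6) else 200 for x in range(9)] for _ in range(5)]
segments = column_runs(niblack_binarize(image))
```

On clean input like this toy image the two strokes come out as separate segments; the failure cases mentioned above (blur, low contrast, clutter) are exactly where a local threshold of this kind stops separating strokes from background.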

The second segmentation approach is intended to handle these more complex cases. It consists of a binary sliding window classifier, trained to detect segmentation points between characters. We use a combination of HOG features [5] and Weighted Direction Code Histogram features (WDCH) [13]. WDCH features are similar to HOG computed on a binarized version of the image, and are typical in OCR systems. A binary logistic classifier is trained on the combined feature vector. The classifier is evaluated in a window of width equal to the line height with a stride of 0.1 times line height, and all responses above a threshold are accepted as segmentation points. We evaluated multiple classifier and feature options for this segmenter and chose this configuration as balancing speed and recall.
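The window geometry described above can be sketched as follows. The window width (one line height) and the stride (0.1 times line height) come from the text; rounding the stride to whole pixels, reporting the window centre as the segmentation point, and the 0.5 threshold are assumptions made for illustration. The classifier responses are passed in as plain numbers because the HOG and WDCH feature extraction is not reproduced here.

```python
def window_positions(line_width, line_height, stride_frac=0.1):
    """Left edges of windows of width line_height, stepped by stride_frac * line_height."""
    stride = max(1, round(stride_frac * line_height))
    last_left_edge = max(0, line_width - line_height)
    return list(range(0, last_left_edge + 1, stride))

def segmentation_points(scores, positions, line_height, threshold=0.5):
    """Keep the centre of every window whose classifier response exceeds the threshold."""
    return [pos + line_height // 2
            for pos, score in zip(positions, scores)
            if score > threshold]

# A 200-pixel-wide line of height 40 gives a 4-pixel stride.
positions = window_positions(200, 40)
points = segmentation_points([0.1, 0.9, 0.6], positions[:3], 40)
```

Accepting every response above the threshold, rather than only local maxima, favours recall: extra cut points cost only search time later, while a missed cut can never be recovered.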

The segmentation stage outputs a vector B containing the positions of the detected segmentation points, including the start and end points of the text detection box.

3.3. Beam Search

We now search the space of segmentations to find one which maximizes our score function, given by:

$$S(b, c) = \frac{1}{N} \sum_{i=1}^{N} \left[ \log \Psi(c_i, b_i, b_{i+1}) + \alpha \log \Phi(c_i, c_{i-1}, \ldots, c_1) \right] \tag{1}$$

Here $b \subset B$ is a vector of $N+1$ segmentation points $b_i$ which define a segmentation of the line into $N$ segments. $c$ is a vector of class assignments, the $i$-th segment being assigned label $c_i$. $\Psi(c_i, b_i, b_{i+1})$ is the classifier probability for class assignment $c_i$ to the pixels between $b_i$ and $b_{i+1}$. $\Phi(c_i, c_{i-1}, \ldots, c_1)$ is the language model probability for the $i$-th class assignment given the previous class assignments in the line. $\alpha$ controls the relative strength of the character classifier and language model. The total score is thus the average per-character log-likelihood of the text line under the classifier and language model. We choose the per-character average as it gives a score which is comparable across lines with different numbers of recognized characters. We now wish to find $b$, $c$ that maximize $S$.
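A minimal sketch of this score, assuming the per-segment classifier probabilities and language model probabilities have already been computed and are passed in as two equal-length lists (the function and argument names are illustrative):

```python
import math

def line_score(char_probs, lm_probs, alpha):
    """Mean per-character log-likelihood: classifier term plus alpha-weighted LM term."""
    if not char_probs or len(char_probs) != len(lm_probs):
        raise ValueError("need one classifier and one LM probability per segment")
    total = sum(math.log(p) + alpha * math.log(q)
                for p, q in zip(char_probs, lm_probs))
    return total / len(char_probs)

# The same per-character quality gives the same score for short and long lines.
short_line = line_score([0.5] * 3, [0.25] * 3, alpha=2.0)
long_line = line_score([0.5] * 7, [0.25] * 7, alpha=2.0)
```

Dividing by the number of segments is what makes the two lines comparable; a plain sum would always favour hypotheses with fewer characters.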

We perform this maximization using beam search [22], which is the typical approach for similar problems in speech recognition. The result space naturally forms a graph defined by the segmentation points. Beam search is a best-first search through this graph which relies on the fact that each node in the graph is a partial result (corresponding to the recognition of part of the text line) which can be scored by our scoring function. At each step of the beam search, all successors of the current search nodes are scored, but only a fixed number of top scoring nodes (the beam width) are retained for the next search step. We initialize the search simply at the left edge of the text detection box. We prune the search slightly by dropping segmentation candidates whose aspect ratio is too large.
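The search just described can be sketched as below. This is an illustrative reading of the text, not the authors' implementation: `classify` and `lm` are hypothetical callbacks standing in for the character classifier and the character language model, partial hypotheses are ranked by their mean per-character score, and the `max_aspect` value is an assumption. The segmentation points are assumed to be sorted from left to right.

```python
import math

def beam_search(points, line_height, classify, lm,
                alpha=1.0, beam_width=5, max_aspect=2.0):
    """Approximately maximize the mean per-character log-likelihood over segmentations.

    points   : sorted segmentation points, including both ends of the line
    classify : (start, end) -> {label: probability}   (hypothetical callback)
    lm       : (previous_labels, label) -> probability (hypothetical callback)
    """
    last = len(points) - 1
    beam = [(0, (), 0.0)]      # (index of last point used, labels, summed log score)
    finished = []
    while beam:
        successors = []
        for i, labels, total in beam:
            for j in range(i + 1, last + 1):
                if (points[j] - points[i]) / line_height > max_aspect:
                    break      # later points only make the segment wider
                for label, p in classify(points[i], points[j]).items():
                    q = lm(labels, label)
                    if p <= 0.0 or q <= 0.0:
                        continue
                    score = total + math.log(p) + alpha * math.log(q)
                    successors.append((j, labels + (label,), score))
        # Keep only the beam_width best partial results.
        successors.sort(key=lambda h: h[2] / len(h[1]), reverse=True)
        kept = successors[:beam_width]
        finished += [h for h in kept if h[0] == last]
        beam = [h for h in kept if h[0] != last]
    if not finished:
        return None
    best = max(finished, key=lambda h: h[2] / len(h[1]))
    return list(best[1]), best[2] / len(best[1])
```

Because the callbacks can look at the whole label history, nothing in this loop limits the order of the language model, which is the flexibility the beam search is chosen for.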

Clearly there is no guarantee of locating an optimal solution using beam search. The reason for using this approach in preference to a framework such as a CRF is that it permits greater flexibility in the design of the score function. In particular, the language model imposes high-order constraints (up to order eight in our case) which make exact inference in a CRF-type framework effectively intractable. In our experience, it is better to use a good score function with approximate inference than a weaker score function with perfect inference. Results are presented in Section 5 which suggest that in any case our approximate inference often locates the optimal solution, despite the combinatorial search space. The practical bottleneck on recognition performance appears to come from classifier and language model quality, rather than failure to find the solution which maximizes the score function.

3.4. Character Classiﬁer

We use a deep neural network for character classification. We have explored networks trained on both raw pixels and HOG features. The networks trained on raw pixels achieve similar or slightly better performance than those trained on HOGs. However, we find that the raw pixel networks are deeper and wider than HOG-input networks at comparable performance. Even accounting for HOG computation time, the HOG-input networks have lower computational cost. Since we are designing for speed as well as high accuracy, we find that overall the HOG-input networks are preferable. This finding may be specific to text, since the strong gradients present in characters are a good match for HOG features.

Our best performing configuration is a network with five hidden layers in configuration 422-960-480-480-480-480-100. The layers are fully connected and use rectified linear units [16]. The 422-parameter input layer consists of HOG coefficients and three geometry features. The output layer is a softmax over 99 character classes plus a noise class. We quantize weights and activations to 8 bits, which improves speed with negligible accuracy penalty. Details of training are discussed in Section 4.
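To give a sense of the size of this network, the short calculation below counts the weights of a fully connected 422-960-480-480-480-480-100 network. Counting one bias per non-input unit is an assumption; the paper does not say how biases are handled.

```python
def count_parameters(layers):
    """Weights and biases of a fully connected network with the given layer sizes."""
    weights = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))
    biases = sum(layers[1:])   # assumption: one bias per hidden and output unit
    return weights, biases

weights, biases = count_parameters([422, 960, 480, 480, 480, 480, 100])
```

This gives about 1.6 million weights, so at 8 bits per weight the quantized network occupies roughly 1.6 MB.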

We run this classifier on all combinations of segments chosen by the beam search. Since the segmentation provided is only 1D, for each segment we refine the top and bottom boundaries by a simple heuristic which snaps to the first strong edge. This provides a more tightly cropped character to the classifier. We compute two HOG features on this character patch. The first uses a 5x5 grid with 5 bins per histogram, computed directly on the (non-square) patch. The second uses unsigned gradient histograms in a 7x7 grid with 6 bins per histogram. The character patch is normalized to 65x65 pixels for computing this second feature. The three geometry features used in addition to HOG encode the original aspect ratio and the position of the top and bottom edge of the pixels relative to the height of the overall text detection.
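The feature dimensions above can be checked against the 422-unit input layer. The sketch assumes each grid cell contributes exactly one histogram with no overlapping blocks; the fact that the total comes out at 422 is consistent with that reading.

```python
def hog_length(grid_rows, grid_cols, bins_per_histogram):
    """Length of a HOG descriptor with one histogram per grid cell."""
    return grid_rows * grid_cols * bins_per_histogram

first_hog = hog_length(5, 5, 5)    # 5x5 grid, 5 bins, on the non-square patch
second_hog = hog_length(7, 7, 6)   # 7x7 grid, 6 unsigned bins, on the 65x65 patch
geometry = 3                       # aspect ratio, top edge, bottom edge
input_size = first_hog + second_hog + geometry
```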

It is worth mentioning that we also explored convolutional neural networks, but they are not computationally competitive with non-convolutional networks in our architecture. Our segmentation based approach offers fewer opportunities for the convolutional network to reuse computations compared to a sliding window design.

3.5. Language Model

In structured classification tasks such as OCR and speech recognition, a strong language prior makes a major contribution to final performance. Some classes are almost indistinguishable as isolated examples, such as the number “1” and the letter “l”; these cases must be disambiguated by context.

We use a standard ngram approach for language modelling. Because our system is designed for use in datacenter environments, we adopt a two-level language model design. A compact character-level ngram model is held in RAM by each OCR worker. This model provides the beam search language score, $\Phi(c_i, c_{i-1}, \ldots, c_1)$. This score forms part of the inner loop of our system, so rapid evaluation is essential. Our second level language model is a much larger distributed word-ngram model using the design of [1]. This model is shared by all OCR workers in a data center and is accessed over the network. Due to the latency overhead this imposes, we evaluate this model only during reranking (Section 3.6).

For our character-level ngram model, we are quite limited in RAM budget. A copy of this model is held by each OCR worker. A single worker is designed to recognize multiple languages simultaneously; at present we support 29 languages with Latin script. Consequently we allocate only 60 MB of RAM per language for the character ngram model. In this regime, in common with [3], we have found little benefit in going beyond 8-gram models. We train each of these models on 10^[…] characters of training data, retaining all ngrams which occur 40 times or more. For a fixed size model, we find negligible benefit in increasing the training data beyond 10^[…] characters. We have also compared smoothing methods and we find that the very simple Stupid Backoff method [1] performs as well as more complex methods in terms of final recognition performance. Since it permits an optimized implementation, we choose this approach.
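For readers unfamiliar with Stupid Backoff, the following is a small character-level sketch. It returns relative-frequency scores rather than normalized probabilities: if the full context plus the next character was seen, the score is the ratio of the two counts; otherwise the context is shortened by one character and the score is multiplied by a fixed factor. The factor 0.4 is the value suggested in the original Stupid Backoff work, and the toy order of 3 and the `min_count` default of 1 are assumptions for the example (the text above uses 8-grams and a cutoff of 40 occurrences).

```python
from collections import Counter

def train_counts(text, max_order=3, min_count=1):
    """Count all character ngrams up to max_order, dropping those seen fewer than min_count times."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    kept = Counter({gram: c for gram, c in counts.items() if c >= min_count})
    return kept, len(text)

def stupid_backoff(counts, total_chars, context, char, max_order=3, backoff=0.4):
    """Score of `char` following `context`; a score, not a normalized probability."""
    context = context[-(max_order - 1):] if max_order > 1 else ""
    penalty = 1.0
    while context:
        joint = counts[context + char]
        if joint > 0:
            return penalty * joint / counts[context]
        context = context[1:]      # back off to a shorter context
        penalty *= backoff
    return penalty * counts[char] / total_chars

counts, total = train_counts("abababab")
seen = stupid_backoff(counts, total, "ab", "a")     # "aba" was observed
unseen = stupid_backoff(counts, total, "ab", "b")   # backs off twice to the unigram
```

There is no discount bookkeeping or normalization pass, so scoring reduces to a few table lookups, which is what makes an optimized implementation straightforward.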

In addition to the character ngram model, we also maintain a simple dictionary of the top 100k words per language. We use this as an additional signal in our language score; it provides a small performance increase over the character ngram model alone. The dictionary provides a soft scoring signal only; recognition is not limited to dictionary words.

Our second-level distributed language model uses word 4-grams. The English model is trained on a $1.3 \times 10^{12}$ token training set. The final model contains 2 × 10^[…] ngrams over a 1M word vocabulary. Our models of non-English languages are approximately 3x smaller in ngram count and 10x smaller in training set size. At serving time the combined multi-language model is distributed over 80 machines and occupies approximately 400 GB of RAM. In the word-level language model we also incorporate a small number of parsers for common patterns such as emails and URLs, which are not well captured by ngrams.

3.6. Reranking

The beam search terminates with a ranked list of recognition hypotheses, of beam width size. We perform several post-processing and reranking operations on this list:

Punctuation Search

The first operation we perform is a punctuation search. Punctuation is recognized in the same way as any other character class in the initial beam search, but recall is comparatively low since it can be difficult to distinguish small punctuation characters from background clutter. We thus perform a second pass search for punctuation taking advantage of the stronger scale and location constraints available after initial recognition.

Secondary Language Scoring

As discussed in Section 3.5, we use a distributed word-level language model which cannot be accessed during the beam search for latency reasons. We instead use this model to re-score the final recognition hypotheses. We re-score in a straightforward way by adding log-likelihoods:

$$S^{*}(b, c) = S(b, c) + \beta L(c) \tag{2}$$

where $L(c)$ is the score from the word-ngram model, and $\beta$ is a weighting parameter. As in Equation 1, we use mean per-character log-likelihood for the language model score, in order to make it comparable between lines of different lengths. Out-of-vocabulary words are assigned a fixed per-character log-likelihood.
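The re-scoring step amounts to adding a weighted second score to each hypothesis and sorting again. In the sketch below, `word_lm_score` is a hypothetical callback standing in for the distributed word-ngram model, and the toy scores and the out-of-vocabulary value of -5.0 are invented for the example.

```python
def rerank(hypotheses, word_lm_score, beta):
    """Re-score (text, S) pairs as S + beta * L(text) and sort best first."""
    rescored = [(text, s + beta * word_lm_score(text)) for text, s in hypotheses]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy word model: known words get a good per-character score, anything else a fixed penalty.
known = {"clean": -1.0}
toy_lm = lambda text: known.get(text, -5.0)

# The beam search slightly preferred "c1ean"; the word-level score reverses the order.
ranked = rerank([("c1ean", -0.30), ("clean", -0.35)], toy_lm, beta=0.1)
```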

Shape Model

We also re-score the hypotheses using shape information. The shape model computes the expected relative size and position of character bounding boxes for the recognized text in 20 common fonts, and scores the line based on the deviation from the best matching font. The shape model gives a small but consistent improvement to overall performance. In principle this could be incorporated into the beam search score, but for computational reasons we use it as a reranker.

Junk Filter

Finally we apply a junk filter with a few simple heuristics to discard common OCR errors such as recognizing repeating vertical structures as “iiiii”. While these are naturally penalized by the language model, a simple filter provides some additional precision improvement.

4. Training

We train our neural network character classifier using stochastic gradient descent with Adagrad [7] and dropout [10], using the distributed training design described in [6]. We minimize log-loss. We achieve best performance with dropout of 5% in the input layer and 12.5% in the hidden layers. This dropout setting is lower than has been reported elsewhere, but seems consistent with the relatively small size of our network and the large quantity of training data used. We train the network on 200 cores for two days. We trained several dozen networks with varying depth and width parameters and selected the candidate which offered the best compromise between accuracy and computational cost. We have tried training the best network for longer, but it does not appear to improve further.

Our training set consists of 2.2 million manually labelled characters. The training set is extracted from a random sample of the past OCR queries of users who have agreed for their data to be used to improve Google services. Text lines in the imagery are manually annotated with the class of each character and the segmentation points between characters. We do not use synthetic data or synthetic distortions. The data is used to train both our segmenter and character classifier. We first train the segmenter on the labelled segmentation points. To train the character classifier, we do not use the manually segmented characters directly since they may be dissimilar to what our automatic segmentation produces. Instead we run our segmenter on the annotated lines and choose segments visited by the beam search which have high overlap with a manually labelled character as positive training examples. We also select training examples for a “noise class” consisting of segments visited by the beam search which overlap partial/multiple ground truth characters or background clutter. The final training set consists of 45% noise examples and 55% character class examples, for 3.9 million total training examples.
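The selection rule described above can be sketched with 1D intervals. Measuring overlap as intersection over union and using 0.8 as the cutoff are assumptions; the text says only that positive examples have "high overlap" with a labelled character.

```python
def overlap_ratio(segment, truth):
    """Intersection over union of two 1D intervals given as (start, end)."""
    intersection = max(0, min(segment[1], truth[1]) - max(segment[0], truth[0]))
    union = (segment[1] - segment[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0

def label_segment(segment, ground_truth, min_overlap=0.8):
    """Return the labelled character the segment matches, or 'NOISE' if none matches well.

    ground_truth: list of ((start, end), character) pairs from manual annotation.
    """
    best_char, best_ratio = "NOISE", 0.0
    for interval, char in ground_truth:
        ratio = overlap_ratio(segment, interval)
        if ratio > best_ratio:
            best_char, best_ratio = char, ratio
    return best_char if best_ratio >= min_overlap else "NOISE"

truth = [((0, 10), "a"), ((10, 20), "b")]
examples = [label_segment(s, truth) for s in [(1, 10), (5, 15), (0, 20)]]
```

The three toy segments cover a tight match, a segment straddling two characters, and a segment spanning both; only the first becomes a positive example. As a consistency check on the figures above, 55% of 3.9 million is about 2.1 million, in line with the 2.2 million labelled characters.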

For the Latin alphabet version of our system, we learn 99 character classes plus the noise class. The character classes cover upper and lowercase letters, punctuation and some additional common characters such as currency symbols. Many characters exist in multiple diacritic variants, such as “aäáâ”. For training we consider these variants together as a single class, and rely on the language model to select the correct variant at recognition time. Our final output is thus from a larger space of several hundred possible character classes.
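One way to approximate this grouping of diacritic variants is Unicode decomposition: split each character into a base letter plus combining marks and drop the marks. This is an illustrative stand-in, not the authors' class mapping; letters without a canonical decomposition (for example "ß" or "ø") are left unchanged by this approach.

```python
import unicodedata

def base_class(ch):
    """Map a character to its base letter by removing combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch

variants = [base_class(c) for c in "aäáâ"]
```

At recognition time the mapping is used in reverse: the classifier proposes the base class and the language model chooses among the variants that share it.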

Finally we set free parameters of the system, such as the language model weights $\alpha$ and $\beta$, by optimizing end-to-end system performance on a validation set using Powell’s method.

Self-supervised training data

Figure 2: An example of training data automatically generated by our self-supervision approach. The green boxes show characters which have been extracted by our system and verified by alignment against an online text. The lower part of the figure shows an enlarged view of one of the automatically ground truthed regions. To preserve privacy the image shown here was taken by the authors, but it is typical of the user imagery in our logs.

We built an additional larger training set using a self-supervision mechanism. A full description of the system is beyond the scope of the paper; however, we present a brief summary here. We began by running our OCR system on 5 million images from the logs of Google Goggles and Google Translate for Android. The key idea is that much of the text in these real-world images also exists verbatim somewhere on the web. If we achieve partially correct text extraction from an image, we can often locate the source text by issuing multiple web queries. We can verify the match by performing edit distance based alignment. This produces a partial ground truth for the image. An example result is shown in Figure 2. The extracted character bounding boxes come from our OCR system, but any errors in the character labels are corrected by alignment against the source text. The accuracy of labels produced by this approach is at least as good as our manually labelled data. The majority of matches come from images of dense text such as newspaper articles. However, we find matches for a wide variety of other objects such as food containers, posters, wine bottles, packaged goods, etc.
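The label-correction idea can be illustrated with Python's standard `difflib`. Note that `SequenceMatcher` aligns by longest matching blocks rather than by a true edit distance, so it is swapped in here only as a convenient stand-in; the 0.8 acceptance ratio and the rule of correcting only same-length substitutions are assumptions made for the sketch.

```python
from difflib import SequenceMatcher

def correct_labels(ocr_text, source_text, min_ratio=0.8):
    """Align OCR output with a candidate source text and return corrected labels.

    Returns one label per OCR character (None where no label can be assigned),
    or None if the two texts are too dissimilar to count as a match.
    """
    matcher = SequenceMatcher(None, ocr_text, source_text, autojunk=False)
    if matcher.ratio() < min_ratio:
        return None
    labels = [None] * len(ocr_text)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length_swap = tag == "replace" and (i2 - i1) == (j2 - j1)
        if tag == "equal" or same_length_swap:
            # Each OCR character box keeps its position and takes the source character.
            labels[i1:i2] = source_text[j1:j2]
    return labels

fixed = correct_labels("he1lo wor1d", "hello world")
rejected = correct_labels("xyz", "hello world")
```

In this toy case the two misread characters are replaced by the letters from the source text, while the character positions (and hence the bounding boxes) still come from the OCR output.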

From 5 million source images, we find acceptable matches for 200k, yielding 40 million automatically labelled characters. These characters are obviously biased towards easier examples, so to augment our training set we use 4 million characters with the lowest confidence scores.

5. Results

For comparison to other published work, we first present results on public OCR benchmarks. The most suitable pub-


References

Gradient-based learning applied to document recognition

Histograms of oriented gradients for human detection

Rapid object detection using a boosted cascade of simple features

Artificial Intelligence: A Modern Approach

Rectified Linear Units Improve Restricted Boltzmann Machines
