
HAL Id: hal-00817972
https://hal.inria.fr/hal-00817972
Submitted on 17 Oct 2013
To cite this version:
Anand Mishra, Karteek Alahari, C.V. Jawahar. An MRF Model for Binarization of Natural Scene Text. ICDAR - International Conference on Document Analysis and Recognition, Sep 2011, Beijing, China. DOI: 10.1109/ICDAR.2011.12. hal-00817972.

An MRF Model for Binarization of Natural Scene Text
Anand Mishra, Karteek Alahari and C.V. Jawahar
International Institute of Information Technology Hyderabad, India
INRIA - Willow, ENS, Paris, France
Email: anand.mishra@research.iiit.ac.in, karteek.alahari@ens.fr, jawahar@iiit.ac.in
Abstract—Inspired by the success of MRF models for solving
object segmentation problems, we formulate the binarization
problem in this framework. We represent the pixels in a docu-
ment image as random variables in an MRF, and introduce a
new energy (or cost) function on these variables. Each variable
takes a foreground or background label, and the quality of the
binarization (or labelling) is determined by the value of the
energy function. We minimize the energy function, i.e. find the
optimal binarization, using an iterative graph cut scheme. Our
model is robust to variations in foreground and background
colours as we use a Gaussian Mixture Model in the energy
function. In addition, our algorithm is efficient to compute,
and adapts to a variety of document images. We show results
on word images from the challenging ICDAR 2003 dataset, and
compare our performance with previously reported methods.
Our approach shows significant improvement in pixel level
accuracy as well as OCR accuracy.
Keywords-MRF, GMM, Graph Cut, Binarization
I. INTRODUCTION
Binarization is one of the key preprocessing steps in any document image analysis system. The performance of subsequent steps such as character segmentation and recognition is highly dependent on the success of binarization. Document image binarization has been an active area of research for many years. Is binarization a solved problem? Obviously not, especially given the emerging need to recognize text in video sequences, digital-born (Web and email) images, old historic manuscripts and natural scenes, where state-of-the-art recognition performance is still poor. In this regard, designing a powerful binarization algorithm can be considered a major step towards robust text understanding. The community's recent interest, reflected in the DIBCO 2009 binarization contest [1] held at the 10th International Conference on Document Analysis and Recognition (ICDAR 2009), also supports this claim. Note that DIBCO 2009 received 43 submissions, which shows the active interest in this research area.
In this work, we focus on binarization of natural scene text. Natural scene text contains numerous degradations not usually present in machine-printed documents, such as uneven lighting, blur, complex backgrounds, perspective distortion and multiple colours. Methods such as the interactive graph cut of Boykov et al. [2] and, subsequently, GrabCut [3] have shown promising performance for foreground/background segmentation of natural scenes in recent years. We formulate the binarization problem in this framework (where text is foreground and everything else is background), and define a novel energy (cost) function such that the quality of the binarization is determined by the energy value. We minimize this energy function to find the optimal binarization using an iterative graph cut scheme. The graph cut method needs to be initialized with foreground/background seeds. To make the binarization fully automatic, we obtain the initial seeds for graph cut with our auto-seeding algorithm. At each iteration of graph cut, the seeds and the binarization are refined. This makes it more powerful than a one-shot graph cut algorithm. Moreover, we model the foreground and background colours in a GMMRF framework [4] to make the binarization robust to variations in foreground and background colours.

Figure 1. Some sample images considered in this work.
The remainder of the paper is organised as follows.
We discuss related work in Section II. In Section III, the
binarization problem is formulated as a labelling problem,
where we define an energy function such that its minimum
corresponds to the target binary image. This section also
briefly introduces the graph cut method. Section IV explains the
proposed iterative graph cut based binarization scheme. It
also elaborates on the method of finding auto-seeds for the
graph cut. Section V describes experiments and results based
on the challenging ICDAR 2003 word dataset. Some sample
images of this dataset are shown in Figure 1. We finally
conclude the work in Section VI.
II. RELATED WORK
Traditional thresholding-based binarization methods fall into two categories: those that use a global threshold for the entire document (e.g. Otsu [5], Kittler et al. [6]) and those that use local thresholds (e.g. Sauvola [7], Niblack [8]). An exhaustive review of thresholding-based binarization is beyond the scope of this paper; the reader is referred to [9]. Although most of these algorithms perform satisfactorily in many cases, they suffer from problems such as: (1) manual tuning of parameters, (2) high sensitivity to the choice of parameters, and (3) difficulty handling images with uneven lighting, noisy backgrounds, or similar foreground and background colours.
Recently, Markov Random Field (MRF) based binarization has been applied to degraded documents. In [10], Wolf et al. proposed binarization in an energy minimization framework and applied the less powerful and computationally expensive simulated annealing (SA) for energy minimization. In [11], the authors classified the document into Text Region (TR), Near Text Region (NTR) and Background Region (BR), and then applied graph cut to produce the final binary image. MRF-based binarization for document images captured with hand-held devices was proposed in [12], where the authors first used a thresholding-based technique to produce a binary image and then applied graph cuts to remove noise and smooth the binarization output. However, these methods cannot be directly applied to natural scene text images due to additional challenges such as blur, hardly distinguishable foreground/background colours, and variable font sizes and styles.
Researchers have also shown interest in colour image binarization in recent years (see [13], [14]). But these methods lack a principled formulation of the binarization problem for complex colour documents, and hence cannot be generalized.
III. THE BINARIZATION PROBLEM
We define the binarization problem in a labelling framework as follows: the binarization of an image can be expressed as a vector of binary random variables $X = \{X_1, X_2, \ldots, X_n\}$, where each random variable $X_i$ takes a label $x_i \in \{0, 1\}$ based on whether it is text (foreground) or non-text (background). Most heuristic-based algorithms decide whether to assign label 0 or 1 to $x_i$ based on the pixel value at that position or on local statistics. Such algorithms are not effective in our case because of the variations in the foreground/background colour distributions.
In this work, we formulate the problem in a more principled framework, where we represent image pixels as nodes in a Markov Random Field and associate unary and pairwise costs with labelling the pixels. We then solve the problem in an energy minimization framework, where a Gibbs energy function $E$ of the following form is defined:

$$E(\mathbf{x}, \theta, \mathbf{z}) = E_i(\mathbf{x}, \theta, \mathbf{z}) + E_{ij}(\mathbf{x}, \mathbf{z}), \quad (1)$$

such that its minimum corresponds to the target binary image. Here $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ is the set of labels at the pixels, $\theta$ is the set of model parameters learnt from the foreground/background colour distributions, and the vector $\mathbf{z} = \{z_1, z_2, \ldots, z_n\}$ denotes the colour intensities of the pixels.
In Equation (1), $E_i(\cdot)$ and $E_{ij}(\cdot)$ correspond to the data term and the smoothness term respectively. The data term $E_i(\cdot)$ measures the degree of agreement of the inferred label $x_i$ with the observed image data $z_i$. The smoothness term measures the cost of assigning labels $x_i$, $x_j$ to adjacent pixels and is used to impose spatial smoothness. A typical unary term can be expressed as:

$$E_i(\mathbf{x}, \theta, \mathbf{z}) = -\sum_i \log p(x_i \mid z_i).$$
Similarly, the smoothness term most commonly used in the literature is the Potts model:

$$E_{ij}(\mathbf{x}, \mathbf{z}) = \lambda \sum_{(i,j) \in \mathcal{N}} \frac{\exp\!\left(-\frac{(z_i - z_j)^2}{2\beta^2}\right) [x_i \neq x_j]}{\mathrm{dist}(i, j)},$$

where $\lambda$ determines the degree of smoothness, $\mathrm{dist}(i, j)$ is the Euclidean distance between neighbouring pixels $i$ and $j$, and the constant $\beta$ allows discontinuity-preserving smoothing. $\mathcal{N}$ denotes the neighbourhood system defined on the MRF. Further, the smoothness term imposes a cost only on those adjacent pixels which have different labels (i.e. where $[x_i \neq x_j] = 1$).
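To make these terms concrete, the following is a minimal sketch (ours, in Python with NumPy; the names `p_fg` and `z` are hypothetical) of how the unary costs and a Potts-style pairwise weight of the form above could be computed.

```python
import numpy as np

def unary_costs(p_fg, eps=1e-10):
    """Negative log-likelihood costs for labels 0 (background) and 1 (foreground).

    p_fg is a hypothetical H x W array of per-pixel foreground probabilities.
    Returns an H x W x 2 array with cost[..., l] = -log p(x_i = l | z_i).
    """
    p_fg = np.clip(p_fg, eps, 1.0 - eps)
    return np.stack([-np.log(1.0 - p_fg), -np.log(p_fg)], axis=-1)

def potts_weight(z_i, z_j, lam, beta, dist=1.0):
    """Cost paid when two neighbouring pixels take different labels.

    Follows the Potts form above: lam * exp(-(z_i - z_j)^2 / (2 beta^2)) / dist.
    Similar pixels (small intensity difference) get a high weight, so separating
    them is expensive; dissimilar pixels are cheap to separate.
    """
    return lam * np.exp(-((z_i - z_j) ** 2) / (2.0 * beta ** 2)) / dist
```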
The problem of binarization is now to find the global minimum of the Gibbs energy, i.e.,

$$\mathbf{x}^{*} = \arg\min_{\mathbf{x}} E(\mathbf{x}, \theta, \mathbf{z}). \quad (2)$$

The global minimum of this energy function can be computed efficiently by graph cut [15], subject to the energy satisfying the submodularity criterion [16]. For this, a weighted graph $G = (V, E)$ is formed where each vertex corresponds to an image pixel, and edges link adjacent pixels. Two additional vertices, the source ($s$) and the sink ($t$), are added to the graph, and all the other vertices are connected to them with weighted edges. The weights of all the edges are defined in such a way that every cut of the graph is equivalent to some label assignment of the energy function. Note that a cut of the graph $G$ is a partition of the set of vertices $V$ into two disjoint sets $S$ and $T$, and the cost of the cut is defined as the sum of the weights of the edges going from vertices in $S$ to vertices in $T$ (see [16]). The min cut of such a graph corresponds to the global minimum of the energy function. Efficient implementations are available for finding the min cut of such a graph [15].
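As an illustration of this construction, here is a minimal sketch using the PyMaxflow library (an assumption on our part; the paper's implementation uses the C++ code of [15]). It builds a grid graph from a unary cost map and a constant Potts weight, and reads back the min-cut labelling.

```python
import numpy as np
import maxflow  # PyMaxflow, a Python wrapper around the max-flow code of [15]

def graph_cut_binarize(unary, pairwise_weight):
    """unary: H x W x 2 array of costs for labels 0 and 1 (see the sketch above).
    pairwise_weight: scalar Potts penalty for neighbouring pixels with different
    labels (a simplification; the paper uses per-edge weights)."""
    h, w = unary.shape[:2]
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((h, w))
    # 4-connected grid edges with a constant Potts weight.
    g.add_grid_edges(nodes, pairwise_weight)
    # Terminal edges: source capacity = cost of label 1, sink capacity = cost of
    # label 0, so that the min cut assigns each pixel its cheaper label.
    g.add_grid_tedges(nodes, unary[..., 1], unary[..., 0])
    g.maxflow()
    # get_grid_segments returns True for nodes on the sink side, i.e. label 1 here.
    return g.get_grid_segments(nodes).astype(np.uint8)
```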
In [2], the set of model parameters θ describes the image foreground/background histograms. The histograms are constructed directly from foreground/background seeds obtained through user interaction. However, the foreground/background distributions in our case (see the images in Figure 1) cannot be captured efficiently by a naive histogram. Rather, we assume that each pixel colour is generated from a Gaussian Mixture Model (GMM). In this regard, we are highly inspired by the success of GrabCut [3] for object segmentation. At the same time, we want to avoid any user interaction to make the binarization fully automatic. We achieve this with our auto-seeding algorithm, which we describe in Section IV-A. Furthermore, iterative graph cut based binarization is also more suitable for our application as it refines the seeds and the binarization output at each iteration, and thus produces a clean binarization result even in the case of noisy foreground/background distributions.
IV. ITERATIVE GRAPH CUT BASED BINARIZATION
In the GMMRF framework [4], each pixel colour is generated from one of $2c$ Gaussian mixture components ($c$ components each for the foreground and background GMMs), with mean $\mu$ and covariance $\Sigma$, i.e. each foreground colour pixel is generated from the following distribution:

$$p(z_i \mid x_i, \theta, k_i) = \mathcal{N}(z_i, \theta;\, \mu(x_i, k_i), \Sigma(x_i, k_i)), \quad (3)$$

where $\mathcal{N}$ denotes a Gaussian distribution, $x_i \in \{0, 1\}$ and $k_i \in \{1, \ldots, c\}$. To model the foreground colour using the above distribution, an additional vector $\mathbf{k} = \{k_1, k_2, \ldots, k_n\}$ is introduced, where each $k_i$ takes one of the $c$ GMM components. Similarly, the background colour is modelled with one of its $c$ GMM components. Further, the likelihood of the observations can be assumed to be independent across pixel positions, and can thus be expressed as:

$$p(\mathbf{z} \mid \mathbf{x}, \theta, \mathbf{k}) = \prod_i p(z_i \mid x_i, \theta, k_i) = \prod_i \frac{\pi(x_i, k_i)}{\sqrt{\det \Sigma(x_i, k_i)}} \exp\!\left(-\tfrac{1}{2}(z_i - \mu(x_i, k_i))^{T}\, \Sigma(x_i, k_i)^{-1}\, (z_i - \mu(x_i, k_i))\right).$$

Here $\pi(\cdot)$ is the Gaussian mixture weighting coefficient.
Due to the introduction of the GMMs, the energy function in Equation (1) now becomes:

$$E(\mathbf{x}, \mathbf{k}, \theta, \mathbf{z}) = E_i(\mathbf{x}, \mathbf{k}, \theta, \mathbf{z}) + E_{ij}(\mathbf{x}, \mathbf{z}), \quad (4)$$

i.e. the data term now depends on the assignment of pixels to GMM components. It is given by:

$$E_i(\mathbf{x}, \mathbf{k}, \theta, \mathbf{z}) = -\sum_i \log p(z_i \mid x_i, \theta, k_i). \quad (5)$$
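As an illustration of the data term in Equation (5), the sketch below fits one colour GMM each for the foreground and background seed pixels using scikit-learn (an assumed library choice; the paper does not name one) and evaluates the per-pixel negative log-likelihoods that act as unary costs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_colour_gmms(fg_pixels, bg_pixels, c=5, seed=0):
    """fg_pixels, bg_pixels: N x 3 arrays of RGB values taken from the
    foreground/background seeds. c mixture components per model, as in the paper."""
    fg_gmm = GaussianMixture(n_components=c, covariance_type='full',
                             random_state=seed).fit(fg_pixels)
    bg_gmm = GaussianMixture(n_components=c, covariance_type='full',
                             random_state=seed).fit(bg_pixels)
    return fg_gmm, bg_gmm

def gmm_unary_costs(image, fg_gmm, bg_gmm):
    """image: H x W x 3 array. Returns H x W x 2 costs, where
    cost[..., 1] = -log p(z_i | foreground GMM) and cost[..., 0] is the
    background counterpart, matching the data term of Equation (5)."""
    h, w, _ = image.shape
    z = image.reshape(-1, 3).astype(np.float64)
    cost_fg = -fg_gmm.score_samples(z)   # score_samples returns log-likelihoods
    cost_bg = -bg_gmm.score_samples(z)
    return np.stack([cost_bg, cost_fg], axis=-1).reshape(h, w, 2)
```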
In order to make the energy function robust to low-contrast colour images, we modify the smoothness term of the energy function by adding a new term which measures the "edginess" of the pixels, as follows:

$$E_{ij}(\mathbf{x}, \mathbf{z}) = \lambda_1 \sum_{(i,j) \in \mathcal{N}} [x_i \neq x_j] \exp(-\beta \|z_i - z_j\|^2) + \lambda_2 \sum_{(i,j) \in \mathcal{N}} [x_i \neq x_j] \exp(-\beta \|w_i - w_j\|^2). \quad (6)$$

Here $w_i$ denotes the magnitude of the gradient (edginess) at pixel $i$, and $\mathcal{N}$ denotes the neighbourhood system defined for the MRF model. Two neighbouring pixels with similar edginess values are more likely to belong to the same class; the edginess term enforces this constraint. The constants $\lambda_1$ and $\lambda_2$ determine the relative strength of the colour and edginess differences respectively. The parameters $\lambda_i$ and $\beta$ are learnt automatically from the image.
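As a rough illustration (not the authors' code), the per-edge weights of Equation (6) for a 4-connected grid could be computed as below; the gradient magnitude is taken with a simple NumPy gradient, and the values of λ1, λ2 and β are taken as given rather than learnt.

```python
import numpy as np

def edge_weights(image, lam1, lam2, beta):
    """Per-edge weights for a 4-connected grid, following Equation (6).

    image: H x W x 3 float array of colour values z_i.
    Returns (right_w, down_w): weights of the edges to the right and downward
    neighbours. A weight is paid only when the two endpoints of an edge receive
    different labels."""
    grey = image.mean(axis=2)
    gy, gx = np.gradient(grey)
    w = np.sqrt(gx ** 2 + gy ** 2)            # gradient magnitude ("edginess")

    def weight(z_a, z_b, w_a, w_b):
        colour = lam1 * np.exp(-beta * np.sum((z_a - z_b) ** 2, axis=-1))
        edgy = lam2 * np.exp(-beta * (w_a - w_b) ** 2)
        return colour + edgy

    right_w = weight(image[:, :-1], image[:, 1:], w[:, :-1], w[:, 1:])
    down_w = weight(image[:-1, :], image[1:, :], w[:-1, :], w[1:, :])
    return right_w, down_w
```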
The Gaussian Mixture Models in Equation (5) need to be initialized with foreground/background seeds. Since our objective is to make the binarization fully automatic, we initialize the GMMs with foreground and background seeds obtained from our auto-seeding algorithm. Then, at each iteration, the seeds are refined and new GMMs are learnt from them. This makes the algorithm more powerful and allows it to adapt to variations in the foreground and background.
A. Auto-seeding
To perform automatic binarization we need to compute foreground and background seeds for the graph cut. Given an image, we first convert it to an edge image using the Canny edge operator, and then find the foreground and background seeds as follows:
1) Foreground seeds: Our foreground seeding algorithm is motivated by the fact that for every edge curve (line) in a character there exists a parallel edge curve (line), i.e. if an edge pixel has gradient orientation θ, then in the direction of θ there exists an edge pixel whose gradient orientation is approximately π − θ.
Step 1: Let p be a non-traversed edge pixel with gradient orientation θ. For every such edge pixel p we traverse the edge image in the direction of θ until we hit an edge pixel q whose gradient orientation is (π − θ) ± π/36 (i.e. an approximately opposite gradient direction). We mark this line segment pq as a foreground seed candidate and store its length. We repeat this process for all non-traversed edge pixels. After finding all foreground seed candidates, we remove the line segments whose length is too large or too small with respect to the majority of the seed candidates. The remaining line segments are marked as foreground seeds.
Step 2: Handling images with light text on a dark background: for such images we rarely obtain parallel edge curves (lines) with the above-mentioned traversal; rather, many line segments pq hit the image boundary. We automatically detect such situations, subtract π from the original orientation, and then follow the same process as Step 1.
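The following is a rough sketch of Step 1 under our own simplifying assumptions (fixed Canny thresholds, no bookkeeping of already-traversed pixels, no length-based filtering): it walks from each edge pixel along its gradient direction until it meets an edge pixel with approximately opposite orientation and records the segment as a foreground seed candidate.

```python
import numpy as np
import cv2

def foreground_seed_candidates(grey, max_steps=200, tol=np.pi / 36):
    """grey: H x W uint8 image. Returns a list of (p, q, length) candidate segments."""
    edges = cv2.Canny(grey, 50, 150)                       # thresholds are an assumption
    gx = cv2.Sobel(grey, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(grey, cv2.CV_64F, 0, 1, ksize=3)
    theta = np.arctan2(gy, gx)                             # gradient orientation per pixel
    h, w = grey.shape
    candidates = []
    ys, xs = np.nonzero(edges)
    for y, x in zip(ys, xs):
        t = theta[y, x]
        dy, dx = np.sin(t), np.cos(t)
        for step in range(1, max_steps):
            yy, xx = int(round(y + step * dy)), int(round(x + step * dx))
            if not (0 <= yy < h and 0 <= xx < w):
                break                                      # hit the image boundary
            if edges[yy, xx]:
                # Opposite orientation means the angular difference is close to pi.
                diff = np.abs(np.angle(np.exp(1j * (theta[yy, xx] - t + np.pi))))
                if diff <= tol:
                    candidates.append(((y, x), (yy, xx), step))
                break
    return candidates
```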
2) Background seeds: For background seeding we adopt the following scheme: given the edge image, we find the horizontal and vertical lines containing no edge pixel and mark them as background. When this yields no background seeds, we relax the criterion and mark as background all regions that are reachable (without hitting an edge pixel) from at least two sides of the image boundary. In practice, some cases still do not provide enough background seeds even after relaxation. For such cases we traverse the edge image inward from all four sides of the image boundary until we hit an edge, and mark all these regions as background seeds. Figure 2 shows typical initial seeds for the iterative graph cut.
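A minimal sketch of the strictest background criterion (edge-free rows and columns) could look as follows; the relaxed fall-back criteria described above are omitted and the function name is ours.

```python
import numpy as np

def background_seed_lines(edges):
    """edges: H x W binary Canny edge map (non-zero = edge pixel).
    Marks every row and column that contains no edge pixel as background,
    following the strictest criterion above."""
    bg = np.zeros(edges.shape, dtype=bool)
    edge_free_rows = ~edges.any(axis=1)
    edge_free_cols = ~edges.any(axis=0)
    bg[edge_free_rows, :] = True
    bg[:, edge_free_cols] = True
    return bg
```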

Figure 2. (a) Input image. (b) Its foreground-background seeds; red and blue show the foreground and background seeds respectively (best viewed in colour).
Figure 3. Images where auto-seeding fails
Although the proposed auto-seeding method performs satisfactorily, it tends to fail in cases where the Canny edge operator produces too many noisy or broken edges. In such cases some foreground regions are falsely marked as background and vice versa, which leads to poor binarization. We show two such examples in Figure 3, where our auto-seeding algorithm fails to mark the foreground and background regions appropriately.
In summary, once we obtain the initial seeds, GMMs for the foreground and background colours are learnt. Then, based on the data and smoothness terms in Equations (5) and (6) respectively, the graph is formed. We use the standard graph cut algorithm [15] to obtain an initial binarization result. We then re-estimate the GMMs using this initial binarization result and iterate the graph cut with the new data and smoothness terms until convergence. This refines the binary image at each iteration and finally produces a clean binary image.
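Putting the pieces together, one possible shape of the iterative loop is sketched below. It reuses the hypothetical helpers from the earlier sketches; `rasterize_segments` (turning seed segments into a pixel mask) is also hypothetical, and the scalar pairwise weight is a simplification of the per-edge weights of Equation (6).

```python
import numpy as np
import cv2

def iterative_graph_cut_binarize(image, n_iters=8, lam1=25.0, lam2=25.0,
                                 beta=0.005, c=5):
    """Sketch of the overall loop; the numeric parameter values are placeholders,
    whereas the paper learns lambda_i and beta from the image and uses 8
    iterations with c = 5 components per GMM."""
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(grey, 50, 150)

    # 1. Auto-seeding: foreground segments and edge-free rows/columns.
    fg_mask = rasterize_segments(foreground_seed_candidates(grey), grey.shape)  # hypothetical helper
    bg_mask = background_seed_lines(edges)

    labels = None
    for _ in range(n_iters):
        # 2. (Re-)learn the colour GMMs from the current foreground/background sets.
        fg_px = image[fg_mask] if labels is None else image[labels == 1]
        bg_px = image[bg_mask] if labels is None else image[labels == 0]
        fg_gmm, bg_gmm = fit_colour_gmms(fg_px, bg_px, c=c)

        # 3. Build the energy terms and run one graph cut.
        unary = gmm_unary_costs(image, fg_gmm, bg_gmm)
        right_w, down_w = edge_weights(image.astype(np.float64), lam1, lam2, beta)
        # Simplification: a single scalar pairwise weight instead of per-edge weights.
        labels = graph_cut_binarize(unary, pairwise_weight=float(right_w.mean()))
    return labels
```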
V. RESULTS AND DISCUSSIONS
We use sample images from the ICDAR 2003 Robust Word Recognition dataset [17] for our experiments. It consists of 171 natural scene text images. These images have several degradations due to uneven lighting, complex backgrounds, blur and similar foreground and background colours. To evaluate the performance of the proposed binarization algorithm, we compare it with well-known thresholding-based binarization techniques such as Otsu [5], Sauvola [7], Niblack [8] and Kittler et al. [6]. We also compare our binarization algorithm with the colour thresholding based method proposed in [14]. Note that these classical binarization algorithms produce white text on a black background in the case of images with light text on a dark background. In contrast, our binarization algorithm works in an object segmentation framework and thus always produces black text on a white background. For a fair comparison we therefore invert the binarized output of the classical methods if they produce white text on a black background.
For the proposed binarization algorithm we used 10 GMM components (5 each for foreground and background). We empirically set the number of graph cut iterations to 8, since no significant change in the binarization was observed beyond 8 iterations. We also show our results with and without the edginess difference in the pairwise term. (Note that by the edginess difference term we mean the energy function with the gradient magnitude difference in addition to the difference in RGB colour space.) For parameter-sensitive algorithms like [7] and [8] we use the parameters for which we obtain the best OCR accuracy.
The proposed method is implemented using the C++ graph cut code of [15] and Matlab. It takes 32 seconds on average to produce the final binary result for an image on a system with 2 GB RAM and an Intel® Core™ 2 Duo CPU running at 2.93 GHz.
A. Qualitative evaluation
First we compare the proposed binarization algorithm with thresholding-based methods qualitatively in Figure 4. Sample images with uneven lighting, hardly distinguishable foreground/background colours and noisy foreground colours are shown in this figure. We observe that our approach produces clearly readable binary images. Further, our algorithm produces less noise than local thresholding-based algorithms like [7], [8], which also helps to improve the OCR accuracy.
B. Quantitative evaluation
Quantitative evaluation of binarization is one of the biggest challenges for the document image community [9]. In this work, we demonstrate the performance of binarization not only in terms of OCR accuracy but also in terms of pixel-level accuracy.
1) OCR accuracy: We measure OCR accuracy to verify the robustness of our algorithm. For this we fed the binarization results of all the algorithms to the commercial OCR engine ABBYY FineReader 9.0 [18]. The word and character recognition accuracies are summarized in Table I. Since this dataset consists of images with tight word boundaries, global methods (like [5], [6]) perform better than popular local methods. Furthermore, OCR fails to perform well on noisy binarization output (as in the case of Sauvola and Niblack). Otsu followed by the colour thresholding binarization proposed in [14] improves the word recognition accuracy, but not significantly. However, since the proposed algorithm produces clean binary images, it shows a significant improvement in OCR accuracy.
2) Pixel-level accuracy: To compare the various binarization algorithms based on pixel accuracy, we picked 30 images from the ICDAR 2003 word dataset and produced pixel-level binarization ground truth for them. These images

Citations
More filters
Proceedings ArticleDOI

Whole is Greater than Sum of Parts: Recognizing Scene Text Words

TL;DR: This work presents a holistic word recognition framework that represents the scene text image and synthetic images generated from lexicon words using gradient-based features, and recognizes the text in the image by matching the scene and synthetic image features with the novel weighted Dynamic Time Warping (wDTW) approach.
Journal ArticleDOI

Toward Integrated Scene Text Reading

TL;DR: This work describes and evaluates a reading system that combines several pieces, using probabilistic methods for coarsely binarizing a given text region, identifying baselines, and jointly performing word and character segmentation during the recognition process.
Journal ArticleDOI

Strokelets: A Learned Multi-Scale Mid-Level Representation for Scene Text Recognition

TL;DR: This paper proposes a novel multi-scale representation, which leads to accurate, robust character identification and recognition, which consists of a set of mid-level primitives, termed strokelets, which capture the underlying substructures of characters at different granularities.
Journal ArticleDOI

Scene Text Detection and Segmentation Based on Cascaded Convolution Neural Networks

TL;DR: In this method, a CNN-based text-aware candidate text region (CTR) extraction model is designed and trained using both the edges and the whole regions of text, with which coarse CTRs are detected.
Proceedings ArticleDOI

Image Binarization for End-to-End Text Understanding in Natural Images

TL;DR: The main finding is the fact that image binarization methods combined with additional filtering of generated connected components and off-the-shelf OCR engines can achieve state-of- the-art performance for end-to-end text understanding in natural images.
References
C. Rother, V. Kolmogorov and A. Blake. "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 2004.
Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004.
Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. ICCV, 2001.
Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. EMMCVPR, 2001.