
Journal ArticleDOI

Preprocessing of Low-Quality Handwritten Documents Using Markov Random Fields

01 Jul 2009-IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE Computer Society)-Vol. 31, Iss: 7, pp 1184-1194

TL;DR: This paper presents a statistical approach to the preprocessing of degraded handwritten forms, including the steps of binarization and form line removal, and modifies the MRF model to drop the preprinted ruling lines from the image.

Abstract: This paper presents a statistical approach to the preprocessing of degraded handwritten forms including the steps of binarization and form line removal. The degraded image is modeled by a Markov random field (MRF) where the hidden-layer prior probability is learned from a training set of high-quality binarized images and the observation probability density is learned on-the-fly from the gray-level histogram of the input image. We have modified the MRF model to drop the preprinted ruling lines from the image. We use the patch-based topology of the MRF and belief propagation (BP) for efficiency in processing. To further improve the processing speed, we prune unlikely solutions from the search space while solving the MRF. Experimental results show higher accuracy on two data sets of degraded handwritten images than previously used methods.

Topics: Markov random field (58%), Image processing (55%), Markov model (54%), Image segmentation (54%), Histogram (52%)

Summary (4 min read)

1 INTRODUCTION

  • The goal of this paper is the preprocessing of degraded handwritten document images such as carbon forms for subsequent recognition and retrieval.
  • This is largely due to the extremely low image quality.
  • People tend to write lightly at the turns of strokes.
  • Therefore, binarizing the carbon copy images of handwritten documents is very challenging.
  • The authors can learn the observation model on the fly from the local histogram of the test image.

2.1 Locally Adaptive Methods for Binarization

  • By assuming that the background changes slowly, the problem of varying illumination is solved by adaptive binarization algorithms such as Niblack [15] and Sauvola [18].
  • The idea is to determine the threshold locally, using histogram analysis, statistical measures (mean, variance, etc.), or the intensity of the extracted background.
  • The resulting blurring affects handwriting recognition accuracy.
  • Approaches based on heuristic analysis of local connectivity, such as Kamel/Zhao [11], Yang/Yan [21], and Milewski/Govindaraju [14], solve the problem to some extent by searching for stroke locations and targeting only nonstroke areas.
  • In all of these approaches, the spatial constraints applied to the images are determined by a heuristic.

2.2 The Markov Random Field for Binarization

  • In recent years, inspired by the success of applying the MRF to image restoration [4], [5], [6], attempts have been made to apply MRF to the preprocessing of degraded document images [7], [8], [20].
  • Wolf and Doermann [20] defined the prior model on a 4 × 4 clique, which is appropriate for textual images in low-resolution video.
  • Gupta et al. [7], [8] studied the restoration and binarization of blurred images of license plate digits.
  • They adopted the factorized style of MRF using the product of compatibility functions [4], [5], [6], which are defined as mixtures of multivariate normal distributions computed over samples of the training set.
  • The authors describe an MRF adapted for handling handwritten documents that overcomes the computational challenges caused by high-resolution data and low accuracy rates of current handwriting recognizers.

2.3 Ruling Line Removal

  • The process of removing preprinted ruling lines while preserving the overlapping textual matter is referred to as image in-painting (Fig. 1) and is performed by inferring the removed overlapping portion of images from spatial constraints.
  • Previously reported work on line removal in document images uses heuristics [1], [14], [23].
  • Bai and Huo [1] remove the underline in machine-printed documents by estimating its width.
  • Yoo et al. [23] describe a sophisticated method that classifies the missing parts of strokes into different categories such as horizontal, vertical, and diagonal and connects them with runs (of black pixels) in the corresponding directions.
  • It relies on many heuristic rules and is not accurate when strokes are lightly touching the ruling line.

3 MARKOV RANDOM FIELD MODEL FOR HANDWRITING IMAGES

  • The authors use an MRF model (Fig. 2) with the same topology as the one described in [5].
  • Each binarized patch conditionally depends on its four neighboring binarized patches in both the horizontal and vertical directions, and each observed patch conditionally depends only on its corresponding binarized patch.
  • An edge in the graph represents the conditional dependence of two vertices.
  • It is impossible to compute either (3) or (4) directly for large graphs because the computation grows exponentially as the number of vertices increases.

4.1 Belief Propagation

  • An iteration only involves local computation between the neighboring vertices.
  • The formulas for the BP algorithm for MAP estimation are similar to (8) and (9) except that $\sum_{x_j} x_j$ and $\sum_{x_k}$ are replaced with $\arg\max_{x_j}$ and $\max_{x_k}$, respectively: $\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} \phi(x_j, y_j)\prod_k M^k_j$ (10).
  • The pairwise compatibility functions $\psi$ and $\phi$ are usually heuristically defined as functions with the distance between two patches as the variable.

4.3 Learning the Observation Model Pr(y_j | x_j)

  • The observation model on the pixel level can be estimated from the distribution of gray-scale densities of pixels [20].
  • The authors' algorithm is described as follows: 1. Background extraction.
  • The authors mark the background pixels in the original image using the binarized image and estimate the mean $\mu_{b0}$ and variance $\sigma_{b0}^2$ of the background density $p_b$ from the extracted background pixels.
  • EM algorithm for estimating the 2-GMM.
  • The p.d.f. estimation algorithm using the EM algorithm has an advantage over the algorithms using Niblack thresholding because it avoids the problem of sharply cutting the histogram and has a smoother estimation at the intersection of two Gaussian distributions (a sketch of this fitting step follows below).
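The bullets above describe fitting the foreground/background gray-level densities as a two-component Gaussian mixture with EM, seeded from a background estimate, instead of hard Niblack cuts. A minimal sketch of such a fitting step using scikit-learn; seeding one component from the extracted background mean is an illustrative choice, not the paper's exact initialization:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_foreground_background(gray_pixels, background_mean=None):
    """
    Fit a 2-component GMM to the gray levels of an image and return
    (mean, std) for the foreground (darker) and background (brighter) components.
    gray_pixels: 1-D array of pixel intensities from the input image.
    """
    x = gray_pixels.reshape(-1, 1).astype(np.float64)
    means_init = None
    if background_mean is not None:
        # Seed one component near the estimated background level (illustrative).
        means_init = np.array([[float(x.min())], [float(background_mean)]])
    gmm = GaussianMixture(n_components=2, covariance_type="full",
                          means_init=means_init, random_state=0)
    gmm.fit(x)
    means = gmm.means_.ravel()
    stds = np.sqrt(gmm.covariances_.reshape(-1))
    order = np.argsort(means)              # darker component first = foreground
    return (means[order[0]], stds[order[0]]), (means[order[1]], stds[order[1]])
```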

4.4 Ruling Line Removal

  • First, the ruling lines are located by template matching (see the sketch after this list); this is relatively straightforward to implement because of the fixed form layout and is true for most types of forms in other applications as well.
  • The authors replace (24) with (29) for the compound tasks of binarization and line removal.
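The ruling lines are located by template matching against the known form layout before in-painting; the matching routine itself is not reproduced in this summary. One plausible, simplified reading is to refine each nominal line position from the form template by searching a small band of rows for the darkest row profile (the nominal rows, band size, and darkness criterion below are assumptions):

```python
import numpy as np

def locate_ruling_lines(gray, nominal_rows, band=10):
    """
    Refine the row position of each preprinted horizontal ruling line.
    gray: 2-D gray-level image; nominal_rows: line rows from the form template;
    band: search +/- this many rows around each nominal position.
    Returns the row index with the darkest mean intensity near each line.
    """
    H = gray.shape[0]
    found = []
    for r0 in nominal_rows:
        lo, hi = max(0, r0 - band), min(H, r0 + band + 1)
        # Darker rows have lower mean intensity; the minimum marks the line.
        row_means = gray[lo:hi].mean(axis=1)
        found.append(lo + int(row_means.argmin()))
    return found
```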

4.5 Pruning the Search Space of MRF Inference

  • To this point, MRF-based preprocessing has been presented as a self-contained general-purpose algorithm.
  • From the above analysis, the authors have the following two-step strategy to accelerate the algorithm (sketched after this list): 1. Find a global threshold thr_prune such that 90 percent of the pixels in the test image are below thr_prune.
  • If PRUNEjðlÞ is true, Cl is pruned from the search space for solving xj.
  • For the patches that contain pixels to in-paint, Pr_min should be smaller than the prior probability of any state in the codebook, i.e., $Pr_{min} < \min_l \Pr(C_l)$, so that no state is pruned in the first iteration of BP.
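Reading the two-step strategy above literally: a global gray level thr_prune is chosen so that 90 percent of the pixels fall below it, and codebook states are then pruned per patch. The exact pruning test PRUNE_j(l) is not given in this summary, so the background shortcut in the sketch below is an assumed reading rather than the paper's definition:

```python
import numpy as np

def prune_candidates(obs_patch, thr_prune, prior, pr_min, background_state=0):
    """
    Return the indices of codebook states kept for one patch.
    thr_prune: global level below which 90% of the image's pixels fall;
    prior: (M,) prior probabilities of the codebook states;
    pr_min: pruning threshold on the prior.
    The background shortcut is an assumed reading of PRUNE_j(l), not its exact definition.
    """
    if (obs_patch > thr_prune).all():
        # Patch lies entirely in the brightest 10% of the image: treat as plain
        # background and keep only the all-background state.
        return np.array([background_state])
    keep = np.flatnonzero(prior >= pr_min)
    return keep if keep.size else np.arange(len(prior))

# Global threshold: 90 percent of the pixels lie below it.
# thr_prune = np.percentile(gray_image, 90)
```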

5.1 Test Data Sets

  • The authors' test data includes the PCR carbon forms and handwriting images from the IAM database 3.0 [13]: 1. PCR forms.
  • In New York state, all patients who enter the Emergency Medical System (EMS) are tracked through their prehospital care to the emergency room using the PCRs.
  • The PCR is used to gather vital patient information.
  • Medical lexicons are large (more than 4,000 entries).
  • The IAM database contains high-quality images of unconstrained handwritten English text, which were scanned as gray-scale images at 300 dpi.

5.2 Display of Preprocessing Results

  • First, the authors applied their algorithm to the input image shown in Fig.
  • By aligning the input image with a template form image, rough estimations of the positions of lines and unwanted machine-printed blocks are detected.
  • The authors' test images and the images for training the prior model are from different writers.
  • After the first iteration, the message has not yet been passed between neighbors.
  • After four iterations, nearly all of the strokes are restored, although a few tiny artifacts are still visible.

5.3 Results of Acceleration: Speed versus Accuracy

  • The authors have tested the effect of different values of parameter Prmin on the speed and accuracy of their algorithm using the PCR carbon form image in Fig. 7 and the IAM handwriting image in Fig. 10.
  • In order to compare the results obtained by their algorithm with different values of Pr_min, the authors have taken the output images of Pr_min = 0 (which indicates no speedup) as reference images and have counted the pixels in the output images with various values of Pr_min that differ from the reference images (as in the sketch below).
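Counting the pixels that differ between an output image and the Pr_min = 0 reference is a one-line comparison; a small helper for completeness:

```python
import numpy as np

def pixels_changed(output, reference):
    """Number (and fraction) of pixels that differ from the Pr_min = 0 reference."""
    diff = int((output != reference).sum())
    return diff, diff / output.size
```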

5.4 Comparison to Other Preprocessing Methods

  • In Fig. 11, the authors compare their approach with the preprocessing algorithm of Milewski and Govindaraju [14], the Niblack algorithm [15], and the Otsu algorithm [16].
  • The Niblack and Otsu algorithms are for binarization only.
  • From the result of the MRF-based algorithm, the text “67 yo , pt found” is clear and the text “MFG X ray” is obscured but some letters are still legible.
  • Set #1 contains 1,203 word images that are not affected by overlapping form lines, i.e., no intersection of stroke and line.
  • The word recognition rates of the original images among all three methods are very close.

6 CONCLUSIONS

  • The authors have presented a novel method for binarizing degraded document images containing handwriting and removing preprinted form lines.
  • In their MRF model, the authors reduce the large search space of the prior model to a codebook of 114 representatives by VQ and learn the observation model directly from the input image.
  • The authors' work is the first attempt at applying a stochastic method to the preprocessing of degraded high-resolution handwritten documents.
  • The authors' model is targeted toward document images and therefore may not handle large variations in illumination, complex backgrounds, and blurring that are common in video and scene text processing.


Preprocessing of Low-Quality Handwritten
Documents Using Markov Random Fields
Huaigu Cao, Member, IEEE, and Venu Govindaraju, Fellow, IEEE
Abstract—This paper presents a statistical approach to the preprocessing of degraded handwritten forms including the steps of
binarization and form line removal. The degraded image is modeled by a Markov Random Field (MRF) where the hidden-layer prior
probability is learned from a training set of high-quality binarized images and the observation probability density is learned on-the-fly
from the gray-level histogram of the input image. We have modified the MRF model to drop the preprinted ruling lines from the image.
We use the patch-based topology of the MRF and Belief Propagation (BP) for efficiency in processing. To further improve the
processing speed, we prune unlikely solutions from the search space while solving the MRF. Experimental results show higher
accuracy on two data sets of degraded handwritten images than previously used methods.
Index Terms—Markov random field, image segmentation, document analysis, handwriting recognition.
1 INTRODUCTION

THE goal of this paper is the preprocessing of degraded handwritten document images such as carbon forms for subsequent recognition and retrieval. Carbon form recognition is generally considered to be a very hard problem. This is largely due to the extremely low image quality. Although the background variation is not very intense, the handwriting is often occluded by extreme noise from two sources: 1) the extra carbon powder imprinted on the form because of accidental pressure and 2) the inconsistent force of writing. For example, people tend to write lightly at the turns of strokes. This is not a serious problem for writing on regular paper. However, when writing on carbon paper, the light writing causes notches along the stroke. Furthermore, most multipart carbon forms have a colored background so that the different copies can be distinguished. This results in very low contrast and a very low signal-to-noise ratio. Thus, the image quality of carbon copies is generally poorer than that of degraded documents that are not carbon copies. Therefore, binarizing the carbon copy images of handwritten documents is very challenging.
Traditional document image binarization algorithms [16], [15], [18], [11], [21] separate the foreground from the background by histogram thresholding and analysis of the connectivity of strokes. These algorithms, although effective, rely on heuristic rules of spatial constraints, which are not scalable across applications. Recent research [7], [8], [20] has applied the Markov random field (MRF) to document image binarization. Although these algorithms make various assumptions applicable only to low-resolution document images, we take advantage of the ability of the MRF to model spatial constraints in the case of high-resolution handwritten documents.

We present a method that uses a collection of standard patches to represent each patch of the binarized image from the test set. The input and output images are divided into nonoverlapping blocks (patches), and a Markov network is used to model the conditional dependence between neighboring patches. These representatives are obtained by clustering patches of binarized images in the training set. The use of representatives reduces the domain of the prior model to a manageable size. Since our objective is not image restoration (from linear or nonlinear degradation), we do not need an image/scene pair for learning the observation model. We can learn the observation model on the fly from the local histogram of the test image. Therefore, our algorithm achieves performance similar to adaptive thresholding algorithms [15], [18] even without using the prior model. As one might expect, the result improves with the inclusion of spatial constraints added by the prior model. In addition to binarization, we also apply our algorithm to the removal of form lines by modeling the way the probability density of the observation model is computed.

One significant improvement in this paper since our prior work [3] is the use of a more reliable method of estimating the observation model. This uses mathematical morphology to obtain the background, followed by Gaussian mixture modeling to estimate the foreground and background probability densities. Another improvement is the use of more efficient pruning methods to reduce the search space of the MRF effectively by identifying the patches that are surrounded by background patches. We present experimental results on the Prehospital Care Report (PCR) data set of handwritten carbon forms [14] and provide a quantitative comparison of word recognition rates on forms binarized by our method versus other approaches.
. The authors are with the Center for Unified Biometrics and Sensors
(CUBS), Department of Computer Science and Engineering, University at
Buffalo, Amherst, NY 14260.
E-mail: hcao@bbn.com, venu@cubs.buffalo.edu.
Manuscript received 26 June 2007; revised 13 Nov. 2007; accepted 6 May
2008; published online 14 May 2008.
Recommended for acceptance by D. Lopresti.
For information on obtaining reprints of this article, please send e-mail to:
tpami@computer.org, and reference IEEECS Log Number
TPAMI-2007-06-0389.
Digital Object Identifier no. 10.1109/TPAMI.2008.126.

2 RELATED WORK
2.1 Locally Adaptive Methods for Binarization
Usually, the quality of a document image is affected by
variations in illumination and noise. By assuming that the
background changes slowly, the problem of varying
illumination is solved by adaptive binarization algorithms
such as Niblack [15] and Sauvola [18]. The idea is to
determine the threshold locally, using histogram analysis,
statistical measures (mean, variance, etc.), or the intensity of
the extracted background. Although noise can be reduced
by smoothing, the resulting blurring affects handwriting
recognition accuracy. Approaches of heuristic analysis of
local connectivity, such as Kamel/Zhao [11], Yang/Yan
[21], and Milewski/Govindaraju [14], solve the problem to
some extent by searching for stroke locations and targeting
only nonstroke areas. The Kamel/Zhao algorithm strokes
by estimating the stroke width and then removes the noise
in nonstroke areas using an interpolation and thresholding
step. The Yang/Yan algorithm is a variant of the same
method. The Milewski/Govindaraju algorithm examines
neighboring blocks in orientations to search for nonstroke
areas. However, in all of these approaches, the spatial
constraints applied to the images are determined by a
heuristic. Our objective is to find a probabilistic trainable
approach to modeling the spatial constraints of the
binarized image.
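For reference, the Niblack rule mentioned above thresholds each pixel against its local statistics, T = m + k·s computed over a sliding window, with k typically negative for dark strokes on a light background. A minimal sketch with NumPy/SciPy; the window size and k below are illustrative defaults, not values from the paper:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarize(gray, window=25, k=-0.2):
    """Local Niblack threshold: T = mean + k * std over a window x window box."""
    img = gray.astype(np.float64)
    mean = uniform_filter(img, size=window)
    mean_sq = uniform_filter(img * img, size=window)
    std = np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))
    threshold = mean + k * std
    # Foreground (strokes) = pixels darker than the local threshold.
    return (img < threshold).astype(np.uint8)
```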
2.2 The Markov Random Field for Binarization
In recent years, inspired by the success of applying the MRF to image restoration [4], [5], [6], attempts have been made to apply MRF to the preprocessing of degraded document images [7], [8], [20]. The advantage of the MRF model over heuristic methods is that it allows us to describe the conditional dependence of neighboring pixels as the prior probability and to learn it from training data. Wolf and Doermann [20] defined the prior model on a 4 × 4 clique, which is appropriate for textual images in low-resolution video. However, for 300 dpi high-resolution handwritten document images, it is not computationally feasible to learn the potentials if we simply try to define a much larger neighborhood. Gupta et al. [7], [8] studied the restoration and binarization of blurred images of license plate digits. They adopted the factorized style of MRF using the product of compatibility functions [4], [5], [6], which are defined as mixtures of multivariate normal distributions computed over samples of the training set. They incorporated recognition into the MRF to reduce the number of samples involved in the calculation of the compatibility functions. However, this scheme also cannot be directly applied to unconstrained handwriting because of the larger number of classes and the low performance of existing handwriting recognition algorithms. In this paper, we describe an MRF adapted for handling handwritten documents that overcomes the computational challenges caused by high-resolution data and low accuracy rates of current handwriting recognizers.
2.3 Ruling Line Removal
The process of removing preprinted ruling lines while preserving the overlapping textual matter is referred to as image in-painting (Fig. 1) and is performed by inferring the removed overlapping portion of images from spatial constraints. MRF is ideally suited to this task and has been used successfully on natural scene images [2], [22]. Our task on document images is similar but more difficult: In both cases, spatial constraints are used to paint in the missing pixels, but the missing portions in document images often contain strokes with high-frequency components and details. Previously reported work on line removal in document images uses heuristics [1], [14], [23]. Bai and Huo [1] remove the underline in machine-printed documents by estimating its width. This works on machine-printed documents because the number of possible situations in which strokes and underlines intersect is limited. Milewski and Govindaraju [14] proposed restoring the strokes of handwritten forms using a simple interpolation of neighboring pixels. Yoo et al. [23] describe a sophisticated method that classifies the missing parts of strokes into different categories such as horizontal, vertical, and diagonal and connects them with runs (of black pixels) in the corresponding directions. It relies on many heuristic rules and is not accurate when strokes are lightly (tangentially) touching the ruling line.
3 MARKOV RANDOM FIELD MODEL FOR HANDWRITING IMAGES
We use an MRF model (Fig. 2) with the same topology as the one described in [5]. A binarized image x is divided into nonoverlapping square patches, $x_1, x_2, \ldots, x_N$, and the input image, or the observation y, is also divided into patches $y_1, y_2, \ldots, y_N$ so that $x_i$ corresponds to $y_i$ for any $1 \le i \le N$. Each binarized patch conditionally depends on its four neighboring binarized patches in both the horizontal and vertical directions, and each observed patch conditionally depends only on its corresponding binarized patch. Thus,

$$\Pr(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_N, y_1, \ldots, y_N) = \Pr(x_i \mid x_{n_1,i}, x_{n_2,i}, x_{n_3,i}, x_{n_4,i}), \quad 1 \le i \le N, \quad (1)$$

where $x_{n_1,i}$, $x_{n_2,i}$, $x_{n_3,i}$, and $x_{n_4,i}$ are the four neighboring vertices of $x_i$, and

$$\Pr(y_i \mid x_1, \ldots, x_N, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_N) = \Pr(y_i \mid x_i), \quad 1 \le i \le N. \quad (2)$$
An edge in the graph represents the conditional dependence of two vertices. The advantage of such a patch-based topology is that relatively large areas of the local image are conditionally dependent.

Fig. 1. Stroke-preserving line removal. (a) A word image with an underline across the text. (b) Binarized image with the underline removed. (c) Binarized image with the underline removed and strokes repaired.

Our objective is to estimate the binarized image x from the posterior probability $\Pr(x \mid y) = \Pr(x, y)/\Pr(y)$. Since $\Pr(y)$ is a constant over x, we only need to estimate x from the joint probability $\Pr(x, y) = \Pr(x_1, \ldots, x_N, y_1, \ldots, y_N)$. This can be done by either the MMSE or the MAP approach [4], [5]. In the MMSE approach, the estimation of each $x_j$ is obtained by computing the marginal probability:

$$\hat{x}_{j\,\mathrm{MMSE}} = \sum_{x_j} x_j \sum_{x_1 \ldots x_{j-1} x_{j+1} \ldots x_N} \Pr(x_1, \ldots, x_N, y_1, \ldots, y_N). \quad (3)$$
In the MAP approach, the estimation of each $x_j$ is obtained by taking the maximum of the probability $\Pr(x_1, \ldots, x_N, y_1, \ldots, y_N)$, i.e.,

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j}\, \max_{x_1 \ldots x_{j-1} x_{j+1} \ldots x_N} \Pr(x_1, \ldots, x_N, y_1, \ldots, y_N). \quad (4)$$
Estimation of the hidden vertices $\{x_j\}$ using (3) or (4) is referred to as inference. It is impossible to compute either (3) or (4) directly for large graphs because the computation grows exponentially as the number of vertices increases. We can use the Belief Propagation (BP) algorithm [17] to approximate the MMSE or MAP estimation in time linear in the number of vertices in the graph.
4 INFERENCE IN THE MRF USING BELIEF PROPAGATION

4.1 Belief Propagation
In the BP algorithm, the joint probability of the hidden image x and the observed image y from an MRF is represented by the following factorized form [5], [6]:

$$\Pr(x_1, \ldots, x_N, y_1, \ldots, y_N) = \prod_{(i,j)} \psi(x_i, x_j) \prod_k \phi(x_k, y_k), \quad (5)$$
where $(i, j)$ are neighboring hidden nodes and $\psi$ and $\phi$ are pairwise compatibility functions between neighboring nodes, learned from the training data. The MMSE and MAP objective functions can be rewritten as

$$\hat{x}_{j\,\mathrm{MMSE}} = \sum_{x_j} x_j \sum_{x_1 \ldots x_{j-1} x_{j+1} \ldots x_N} \prod_{(i,j)} \psi(x_i, x_j) \prod_k \phi(x_k, y_k), \quad (6)$$

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j}\, \max_{x_1 \ldots x_{j-1} x_{j+1} \ldots x_N} \prod_{(i,j)} \psi(x_i, x_j) \prod_k \phi(x_k, y_k). \quad (7)$$
The BP algorithm provides an approximate estimation of $\hat{x}_{j\,\mathrm{MMSE}}$ or $\hat{x}_{j\,\mathrm{MAP}}$ in (6) and (7) by iterative steps. An iteration only involves local computation between the neighboring vertices. In the BP algorithm for MMSE, (6) is approximately computed by two iterative equations:

$$\hat{x}_{j\,\mathrm{MMSE}} = \sum_{x_j} x_j\, \phi(x_j, y_j) \prod_k M^k_j, \quad (8)$$

$$M^k_j = \sum_{x_k} \psi(x_j, x_k)\, \phi(x_k, y_k) \prod_{l \ne j} \tilde{M}^l_k. \quad (9)$$

In (8), k runs over any of the four neighboring hidden vertices of $x_j$. $M^k_j$ is the "message" passed from j to k and is calculated from (9) (the expression of $M^k_j$ only involves the compatibility functions related to vertices j and k, so $M^k_j$ can be thought of as the message passed from vertex j to vertex k). $\tilde{M}^l_k$ is $M^l_k$ from the previous iteration. Note that $M^k_j$ is actually a function of $x_j$. Initially, $M^k_j(x_j) = 1$ for any j and any value of $x_j$.
The formulas for the BP algorithm for MAP estimation are similar to (8) and (9) except that $\sum_{x_j} x_j$ and $\sum_{x_k}$ are replaced with $\arg\max_{x_j}$ and $\max_{x_k}$, respectively:

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} \phi(x_j, y_j) \prod_k M^k_j, \quad (10)$$

$$M^k_j = \max_{x_k} \psi(x_j, x_k)\, \phi(x_k, y_k) \prod_{l \ne j} \tilde{M}^l_k. \quad (11)$$

In our experiments, we use MAP estimation. The pairwise compatibility functions $\psi$ and $\phi$ are usually heuristically defined as functions with the distance between two patches as the variable. We have found that a simple form is not suitable for binarized images because the distance can only take on a few values. Another way to select the form of $\psi$ and $\phi$ is to use pairwise joint probabilities [4], [5]:
$$\psi(x_j, x_k) = \frac{\Pr(x_j, x_k)}{\Pr(x_j)\,\Pr(x_k)}, \quad (12)$$

$$\phi(x_k, y_k) = \Pr(x_k, y_k). \quad (13)$$

Fig. 2. The topology of the Markov network. (a) The input image y and the inferred image x. (b) The Markov network generalized from (a). In (b), each node $x_i$ in the field is connected to its four neighbors. Each observation node $y_i$ is connected to node $x_i$. An edge indicates the conditional dependence of two nodes.
Replacing the $\psi$ and $\phi$ functions in (10) and (11) with the definitions in (12) and (13), we obtain

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} \Pr(x_j)\,\Pr(y_j \mid x_j) \prod_k M^k_j \quad (14)$$

and

$$M^k_j = \max_{x_k} \Pr(x_k \mid x_j)\,\Pr(y_k \mid x_k) \prod_{l \ne j} \tilde{M}^l_k. \quad (15)$$
In order to avoid arithmetic overflow, we calculate the log values of the factors in (14) and (15):

$$L^k_j = \max_{x_k} \left( \log \Pr(x_k \mid x_j) + \log \Pr(y_k \mid x_k) + \sum_{l \ne j} \tilde{L}^l_k \right), \quad (16)$$

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} \left( \log \Pr(x_j) + \log \Pr(y_j \mid x_j) + \sum_k L^k_j \right), \quad (17)$$

where $L^k_j = \log M^k_j$, $\tilde{L}^l_k = \log \tilde{M}^l_k$, and the initial values of the $\tilde{L}^k_j$ are set to zero.
To use (14) and (15), the probabilities $\Pr(x_j)$ and $\Pr(x_k \mid x_j)$ (prior model) and the observation probability density $\Pr(y_j \mid x_j)$ (observation model) have to be estimated.
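As a concrete illustration of the log-domain updates (16) and (17), the sketch below runs max-product BP over a grid of patch sites with an M-state codebook. It assumes the log-prior, log-conditional, and log-likelihood tables have already been estimated (Sections 4.2 and 4.3), and it uses a single pairwise table for both neighbor directions, whereas the paper learns horizontal and vertical models separately; the names and dense message storage are illustrative only:

```python
import numpy as np

def bp_map_binarize(log_prior, log_pair, log_like, n_iters=4):
    """
    Log-domain max-product BP on a grid MRF of patches, following (16)-(17).
    log_prior: (M,) log Pr(x = C_l); log_pair: (M, M) with
    log_pair[a, b] = log Pr(x_k = C_a | x_j = C_b);
    log_like: (H, W, M) log Pr(y_j | x_j = C_l) per patch site.
    Returns an (H, W) array of MAP codebook indices.
    """
    H, W, M = log_like.shape
    offs = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # offset of the sender of message d
    L = np.zeros((4, H, W, M))                  # log-messages, initialized to log 1 = 0

    for _ in range(n_iters):
        L_new = np.zeros_like(L)
        for d, (dr, dc) in enumerate(offs):
            for r in range(H):
                for c in range(W):
                    rk, ck = r + dr, c + dc     # sender site k
                    if not (0 <= rk < H and 0 <= ck < W):
                        continue
                    # Messages into k, excluding the one that came from (r, c).
                    incoming = np.zeros(M)
                    for d2, (dr2, dc2) in enumerate(offs):
                        rs, cs = rk + dr2, ck + dc2
                        if (rs, cs) == (r, c) or not (0 <= rs < H and 0 <= cs < W):
                            continue
                        incoming += L[d2, rk, ck]
                    # (16): maximize over the sender state x_k.
                    scores = log_pair + (log_like[rk, ck] + incoming)[:, None]
                    L_new[d, r, c] = scores.max(axis=0)
        L = L_new

    # (17): belief = log prior + log likelihood + sum of incoming log-messages.
    belief = log_prior[None, None, :] + log_like + L.sum(axis=0)
    return belief.argmax(axis=-1)
```

Four iterations match the number reported later in the experiments; in practice the per-site loops would be vectorized or restricted to the pruned candidate sets of Section 4.5.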
4.2 Learning the Prior Model Pr(x_j) and Pr(x_k | x_j)
The prior probabilities $\Pr(x_j)$ and $\Pr(x_k \mid x_j)$ are learned from a training set of clean handwriting images. The training set contains three high-quality binarized handwriting images from different writers. We can extract about two million patch images from these samples. Some samples from the training set are shown in Fig. 5. For training, we use clean samples because, unlike the observed image, the hidden image should have good quality. Assuming that the size of a patch is $B \times B$, the number of states of a binarized patch $x_j$ is $2^{B^2}$. If $B = 5$, for example, there will be about 34 million states. This makes searching for the maximum in (15) intractable. In order to solve this problem, we convert the original set of states to a much smaller set and then estimate the probabilities over the smaller set of states. Normally, this is done by dimension reduction using transforms like PCA. It is difficult, however, to apply such a transform to binarized images. Therefore, we use a number of standard patches to represent all of the $2^{B^2}$ states. This is similar to vector quantization (VQ) used in data compression. The set of representatives is referred to as the VQ codebook. Our method is inspired by the idea that images of similar objects can be represented by a very small number of the shared patches in the spatial domain.
Recently, Jojic et al. [10] explored this possibility of representing an image by shared patches. Similarly, the binarized document images with handwriting of nearly the same stroke width under the same resolution can also be decomposed into patches that appear frequently (Fig. 3). The representatives are learned by clustering all the patches in our training set. We use the following approach: After every iteration of K-Means clustering, we round all the dimensions of each cluster center to zero or one. Given a training set of $B \times B$ binary patches, represented by $\{p_i\}$, we run the K-Means clustering starting with 1,024 clusters and remove the duplicate clusters and clusters containing less than 1,000 samples. The remaining cluster centers are taken as the representatives.
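A minimal sketch of this codebook construction: K-Means over binary patches with the centers rounded back to {0, 1} after every iteration, then removal of duplicate and under-populated clusters. The chunked nearest-center assignment is an implementation convenience, not part of the paper:

```python
import numpy as np

def _assign(patches, centers, chunk=100_000):
    """Nearest-center index for each patch, computed in chunks to bound memory."""
    labels = np.empty(len(patches), dtype=np.int64)
    c = centers.astype(np.float64)
    c2 = (c ** 2).sum(1)
    for s in range(0, len(patches), chunk):
        block = patches[s:s + chunk].astype(np.float64)
        d2 = (block ** 2).sum(1, keepdims=True) - 2.0 * block @ c.T + c2
        labels[s:s + chunk] = d2.argmin(1)
    return labels

def learn_binary_codebook(patches, k=1024, min_cluster=1000, n_iters=20, seed=0):
    """
    K-Means over B*B binary patches with cluster centers rounded to {0, 1}
    after every iteration; duplicate and under-populated clusters are dropped.
    patches: (n, B*B) array of 0/1 values. Returns the codebook (M, B*B).
    """
    rng = np.random.default_rng(seed)
    centers = patches[rng.choice(len(patches), size=k, replace=False)].astype(np.float64)
    for _ in range(n_iters):
        labels = _assign(patches, centers)
        for j in range(k):
            members = patches[labels == j]
            if len(members):
                centers[j] = np.round(members.mean(0))   # keep centers binary
    labels = _assign(patches, centers)
    counts = np.bincount(labels, minlength=k)
    keep, seen = [], set()
    for j in range(k):
        key = tuple(int(v) for v in centers[j])
        if counts[j] >= min_cluster and key not in seen:
            seen.add(key)
            keep.append(j)
    return centers[keep].astype(np.uint8)
```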
If the codebook is denoted by $\tilde{C} = \{C_1, C_2, \ldots, C_M\}$, where $C_1, \ldots, C_M$ are M representatives, the error of VQ is given by the following equation:

$$\epsilon_{vq} = \frac{\sum_i \left[ d(p_i, \tilde{C}) \right]^2}{\#\{p_i\}\, B^2}, \quad (18)$$

where $d(p_i, \tilde{C})$ denotes the euclidean distance from $p_i$ to its nearest neighbor(s) in $\tilde{C}$ and $\#\{p_i\}$ denotes the number of elements in $\{p_i\}$. $\epsilon_{vq}$ is the square error normalized by the total number of pixels in the training set.

We can use the quantization error $\epsilon_{vq}$ to determine the parameter B. A larger patch size provides stronger local dependence, but it is difficult to represent very large patches because of the variety of writing styles exhibited by different writers. We tried different values of B ranging between five and eight, which coincide with the range of a typical stroke width in handwriting images scanned at 300 dpi, and chose the largest value of B that led to an $\epsilon_{vq}$ below 0.01. Thus, we determined the patch size $B = 5$. Then, the representation error $\epsilon_{vq} = 0.0079$ and 114 representatives are generated (Fig. 4). The size of the search space of a binarized patch is reduced from $2^{5^2}$ (about 34 million) to 114.
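The quantization error in (18) is simple to evaluate once a codebook is available; a short helper, assuming patches and codebook are 0/1 arrays of shape (n, B²) and (M, B²):

```python
import numpy as np

def vq_error(patches, codebook, chunk=100_000):
    """
    Normalized VQ error from (18): squared distance of each patch to its
    nearest codebook entry, summed and divided by (#patches * B^2).
    """
    c = codebook.astype(np.float64)
    c2 = (c ** 2).sum(1)
    total = 0.0
    for s in range(0, len(patches), chunk):
        block = patches[s:s + chunk].astype(np.float64)
        d2 = (block ** 2).sum(1, keepdims=True) - 2.0 * block @ c.T + c2
        total += np.maximum(d2.min(1), 0.0).sum()
    return total / (len(patches) * patches.shape[1])
```

Sweeping B from five to eight and keeping the largest B whose error stays below 0.01 reproduces the selection rule described above.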
Fig. 3. Shared patches in a binary document image.

Now, we can estimate the prior probability $\Pr(x_j)$ over codebook $\tilde{C}$,

$$\sum_{l=1}^{M} \Pr(x_j = C_l) = 1, \quad (19)$$

so that the prior probabilities $\Pr(x_j)$ over the reduced search space must add up to one. We estimate $\Pr(x_j)$ from the relative size of the cluster centered at $C_l$. A patch $p_i$ from the training set is a member of cluster $C_l$ ($1 \le l \le M$) if $C_l$ is a nearest neighbor of $p_i$ among all of $C_1, \ldots, C_M$ and is denoted by $p_i \in C_l$. Note that a patch $p_i$ from the training set may have multiple nearest neighbors among $C_1, \ldots, C_M$. The number of nearest neighbors of $p_i$ in $\tilde{C}$ is denoted by $n_{\tilde{C}}(p_i)$. Thus, the probability $\Pr(x_j)$ is estimated by
$$\hat{\Pr}(x_j = C_l) = \frac{\sum_{p_i \in C_l} \frac{1}{n_{\tilde{C}}(p_i)}}{\#\{p_i\}}, \quad l = 1, 2, \ldots, M, \quad (20)$$

where $\#\{p_i\}$ is the number of patches in $\{p_i\}$. $\hat{\Pr}(x_j = C_l)$ in (20) is estimated by the size of cluster $C_l$ normalized by the total number of training patches. It is easy to verify that the probabilities in (20) add up to one.
$\Pr(x_j, x_k)$ are estimated in the horizontal and vertical directions, respectively. Similar to (20), the $\Pr(x_j, x_k)$ ($x_j, x_k \in \tilde{C}$) in the horizontal direction is estimated by

$$\hat{\Pr}(x_j = C_{l_1}, x_k = C_{l_2}) = \frac{\sum_{(p_{i_1}, p_{i_2}):\, p_{i_1} \in C_{l_1},\, p_{i_2} \in C_{l_2}} \frac{1}{n_{\tilde{C}}(p_{i_1})\, n_{\tilde{C}}(p_{i_2})}}{\#\{(p_{i_1}, p_{i_2})\}}, \quad l_1 = 1, 2, \ldots, M, \; l_2 = 1, 2, \ldots, M, \quad (21)$$

where $(p_{i_1}, p_{i_2})$ runs for all pairs of patches in the training set $\{p_i\}$ such that $p_{i_1}$ is the left neighbor of $p_{i_2}$ and $\#\{(p_{i_1}, p_{i_2})\}$ is the number of pairs of left-and-right neighboring patches in $\{p_i\}$.

The $\Pr(x_j, x_k)$ ($x_j, x_k \in \tilde{C}$) in the vertical direction is estimated by an equation similar to (21) except that $p_{i_1}$ is the upper neighbor of $p_{i_2}$.
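A sketch of the prior estimation in (19)-(21), under the simplification that every training patch is mapped to a single nearest codebook entry, so the tie-handling weight $1/n_{\tilde{C}}(p_i)$ of (20) and (21) degenerates to 1:

```python
import numpy as np

def estimate_priors(labels_grid, M, eps=1e-12):
    """
    Estimate Pr(x_j) and the horizontal/vertical conditionals Pr(x_k | x_j)
    from a grid of codebook indices for a training image, as in (19)-(21).
    labels_grid: (H, W) integer array of codebook indices; M: codebook size.
    """
    prior = np.bincount(labels_grid.ravel(), minlength=M).astype(np.float64)
    prior /= prior.sum()

    def joint(a, b):
        j = np.zeros((M, M))
        np.add.at(j, (a.ravel(), b.ravel()), 1.0)   # count co-occurring index pairs
        return j / j.sum()

    joint_h = joint(labels_grid[:, :-1], labels_grid[:, 1:])   # left/right neighbors
    joint_v = joint(labels_grid[:-1, :], labels_grid[1:, :])   # upper/lower neighbors

    # Conditionals Pr(x_k | x_j) used in the message updates (15)/(16).
    cond_h = joint_h / (joint_h.sum(axis=1, keepdims=True) + eps)
    cond_v = joint_v / (joint_v.sum(axis=1, keepdims=True) + eps)
    return prior, cond_h, cond_v
```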
4.3 Learning the Observation Model Pr(y_j | x_j)
The observation model on the pixel level can be estimated from the distribution of gray-scale densities of pixels [20]. For the patch-level observation model, we need to map the single-pixel version to the vector space of patches. The pixels of an observed patch $y_j$ are denoted by $y_j^{r,s}$, $1 \le r, s \le 5$. The pixels of a binarized patch $x_j$ are denoted by $x_j^{r,s}$, $1 \le r, s \le 5$. We assume that the pixels inside an observed patch $y_j$ and the respective binarized patch $x_j$ obey a similar conditional dependence assumption as the patches in the patch-based topology (2), i.e.,

$$\Pr\!\left(y_j^{r,s} \,\middle|\, y_j^{1,1}, \ldots, y_j^{r,s-1}, y_j^{r,s+1}, \ldots, y_j^{5,5}, x_j^{1,1}, \ldots, x_j^{5,5}\right) = \Pr\!\left(y_j^{r,s} \mid x_j^{r,s}\right), \quad 1 \le r, s \le 5. \quad (22)$$
Thus, it can be proven that

$$\Pr\!\left(y_j^{1,1}, \ldots, y_j^{5,5} \,\middle|\, x_j^{1,1}, \ldots, x_j^{5,5}\right) = \prod_{r=1}^{5} \prod_{s=1}^{5} \Pr\!\left(y_j^{r,s} \mid x_j^{r,s}\right). \quad (23)$$
Given the distribution of the intensity of foreground (strokes) $p_f(y_j^{r,s}) = \Pr(y_j^{r,s} \mid x_j^{r,s} = 0)$ and the distribution of the intensity of background $p_b(y_j^{r,s}) = \Pr(y_j^{r,s} \mid x_j^{r,s} = 1)$, according to (23), the conditional p.d.f. $\Pr(y_j \mid x_j)$ is calculated as

$$\Pr(y_j \mid x_j) = \prod_{\substack{1 \le r, s \le 5 \\ x_j^{r,s} = 0}} p_f\!\left(y_j^{r,s}\right) \prod_{\substack{1 \le r, s \le 5 \\ x_j^{r,s} = 1}} p_b\!\left(y_j^{r,s}\right). \quad (24)$$

The expression $1 \le r, s \le 5$, $x_j^{r,s} = 0$ means that the scope of the product is for any r and s such that $1 \le r, s \le 5$ and $x_j^{r,s} = 0$. The expression $1 \le r, s \le 5$, $x_j^{r,s} = 1$ is specified in the same way.
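Given per-pixel foreground and background densities $p_f$ and $p_b$, the patch likelihood in (24) is a product over pixels selected by the candidate binary patch. A short log-domain helper, assuming the densities are tabulated over the 256 gray levels:

```python
import numpy as np

def patch_log_likelihoods(obs_patch, codebook, log_pf, log_pb):
    """
    log Pr(y_j | x_j = C_l) for one observed patch and every codebook entry,
    i.e., (24) evaluated in the log domain.
    obs_patch: (B*B,) integer gray levels in 0..255;
    codebook:  (M, B*B) binary patches with 0 = stroke, 1 = background;
    log_pf, log_pb: (256,) lookup tables of log p_f and log p_b.
    """
    lf = log_pf[obs_patch]                    # per-pixel log p_f(y^{r,s})
    lb = log_pb[obs_patch]                    # per-pixel log p_b(y^{r,s})
    # Use the foreground term where the candidate patch has a stroke pixel.
    return np.where(codebook == 0, lf[None, :], lb[None, :]).sum(axis=1)
```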
The probability densities $p_f$ and $p_b$ change over an image while the intensity of the background is changing. However, this is not a problem as we can use regularization techniques such as Background Surface Thresholding (BST) [19] to obtain the background and normalize the images. This background mapping technique is equivalent to adaptive thresholding algorithms such as the Niblack algorithm [15].

Learning the probability density functions $p_f$ and $p_b$ is unsupervised. Assuming that $p_f$ and $p_b$ are two normal distributions, one way to compute $p_f$ and $p_b$ is given as follows: First, we determine a threshold T by an adaptive thresholding method such as the Niblack algorithm. Then, we use all of the pixels with gray level $\le T$ to estimate the mean and variance of $p_f$ and use the remaining pixels to estimate the mean and variance of $p_b$. This method for estimating the observation probability densities is affected by the sharp truncation of "tails" in both normal distributions. Instead, we estimate the densities by modeling them as a two-Gaussian Mixture Model (2-GMM) using the Expectation-Maximization (EM) algorithm. The 2-GMM is not always reliable, due to the fact that the signals are not strictly Gaussian and that the algorithm is unsupervised with respect to the foreground/background categories. Our strategy is to get a reliable estimation of the p.d.f. of the background by background extraction and refine it when fitting the mixture model. Our algorithm is described as follows:
1. Background extraction. Estimate the mean $\mu$ and variance $\sigma^2$ of the entire input image. Binarize the
Fig. 4. The 114 representatives of shared patches obtained from clustering.
Fig. 5. Binarized images from three writers for learning the prior model.

Citations

Journal ArticleDOI
TL;DR: This paper addresses a pixel-based binarization evaluation methodology for historical handwritten/machine-printed document images using a weighting scheme that diminishes any potential evaluation bias.
Abstract: Document image binarization is of great importance in the document image analysis and recognition pipeline since it affects further stages of the recognition process. The evaluation of a binarization method aids in studying its algorithmic behavior, as well as verifying its effectiveness, by providing qualitative and quantitative indication of its performance. This paper addresses a pixel-based binarization evaluation methodology for historical handwritten/machine-printed document images. In the proposed evaluation scheme, the recall and precision evaluation measures are properly modified using a weighting scheme that diminishes any potential evaluation bias. Additional performance metrics of the proposed evaluation scheme consist of the percentage rates of broken and missed text, false alarms, background noise, character enlargement, and merging. Several experiments conducted in comparison with other pixel-based evaluation measures demonstrate the validity of the proposed evaluation scheme.

115 citations


Journal ArticleDOI
TL;DR: This work achieves binarization of document images by taking advantage of local probabilistic models and of a flexible active contour scheme, which is highly successful in other contexts, such as medical image segmentation and road network extraction from satellite images.
Abstract: Document image binarization is a difficult task, especially for complex document images. Nonuniform background, stains, and variation in the intensity of the printed characters are some examples of challenging document features. In this work, binarization is accomplished by taking advantage of local probabilistic models and of a flexible active contour scheme. More specifically, local linear models are used to estimate both the expected stroke and the background pixel intensities. This information is then used as the main driving force in the propagation of an active contour. In addition, a curvature-based force is used to control the viscosity of the contour and leads to more natural-looking results. The proposed implementation benefits from the level set framework, which is highly successful in other contexts, such as medical image segmentation and road network extraction from satellite images. The validity of the proposed approach is demonstrated on both recent and historical document images of various types and languages. In addition, this method was submitted to the Document Image Binarization Contest (DIBCO’09), at which it placed 3rd.

47 citations


Cites background from "Preprocessing of Low-Quality Handwr..."

  • ...A few other successful approaches in binarization of document images are morphological operators [15], Markov Random Fields [16], local adaptive partitioning methods [17]....



Proceedings ArticleDOI
Hongwei Zhang, Changsong Liu, Cheng Yang, Xiaoqing Ding, Kongqiao Wang
18 Sep 2011
TL;DR: A two-step iterative CRF algorithm with a Belief Propagation inference and an OCR filtering stage for extracting multiple text lines and two kinds of neighborhood relationship graph are used.
Abstract: Over the past few years, research on scene text extraction has developed rapidly. Recently, condition random field (CRF) has been used to give connected components (CCs) 'text' or 'non-text' labels. However, a burning issue in CRF model comes from multiple text lines extraction. In this paper, we propose a two-step iterative CRF algorithm with a Belief Propagation inference and an OCR filtering stage. Two kinds of neighborhood relationship graph are used in the respective iterations for extracting multiple text lines. Furthermore, OCR confidence is used as an indicator for identifying the text regions, while a traditional OCR filter module only considered the recognition results. The first CRF iteration aims at finding certain text CCs, especially in multiple text lines, and sending uncertain CCs to the second iteration. The second iteration gives second chance for the uncertain CCs and filter false alarm CCs with the help of OCR. Experiments based on the public dataset of ICDAR 2005 prove that the proposed method is comparative with the existing algorithms.

39 citations


Cites background from "Preprocessing of Low-Quality Handwr..."

  • ...MRF and CRF based approaches have been successful in modeling low level vision problems such as image restoration, segmentation [4], etc....



Journal ArticleDOI
TL;DR: This paper proposes a novel multiscale segmentation scheme for MRC document encoding based upon the sequential application of two algorithms and shows that the new algorithm achieves greater accuracy of text detection but with a lower false detection rate of nontext features.
Abstract: The mixed raster content (MRC) standard (ITU-T T.44) specifies a framework for document compression which can dramatically improve the compression/quality tradeoff as compared to traditional lossy image compression algorithms. The key to MRC compression is the separation of the document into foreground and background layers, represented as a binary mask. Therefore, the resulting quality and compression ratio of a MRC document encoder is highly dependent upon the segmentation algorithm used to compute the binary mask. In this paper, we propose a novel multiscale segmentation scheme for MRC document encoding based upon the sequential application of two algorithms. The first algorithm, cost optimized segmentation (COS), is a blockwise segmentation algorithm formulated in a global cost optimization framework. The second algorithm, connected component classification (CCC), refines the initial segmentation by classifying feature vectors of connected components using an Markov random field (MRF) model. The combined COS/CCC segmentation algorithms are then incorporated into a multiscale framework in order to improve the segmentation accuracy of text with varying size. In comparisons to state-of-the-art commercial MRC products and selected segmentation algorithms in the literature, we show that the new algorithm achieves greater accuracy of text detection but with a lower false detection rate of nontext features. We also demonstrate that the proposed segmentation algorithm can improve the quality of decoded documents while simultaneously lowering the bit rate.

37 citations


Journal ArticleDOI
TL;DR: Experimental results on a set of machine-printed documents which have been annotated by multiple writers in an office/collaborative environment show that the proposed segmentation of handwritten text and machine printed text from annotated documents is robust and provides good text separation performance.
Abstract: The convenience of search, both on the personal computer hard disk as well as on the web, is still limited mainly to machine printed text documents and images because of the poor accuracy of handwriting recognizers. The focus of research in this paper is the segmentation of handwritten text and machine printed text from annotated documents sometimes referred to as the task of "ink separation" to advance the state-of-art in realizing search of hand-annotated documents. We propose a method which contains two main steps--patch level separation and pixel level separation. In the patch level separation step, the entire document is modeled as a Markov Random Field (MRF). Three different classes (machine printed text, handwritten text and overlapped text) are initially identified using G-means based classification followed by a MRF based relabeling procedure. A MRF based classification approach is then used to separate overlapped text into machine printed text and handwritten text using pixel level features forming the second step of the method. Experimental results on a set of machine-printed documents which have been annotated by multiple writers in an office/collaborative environment show that our method is robust and provides good text separation performance.

34 citations


Cites methods from "Preprocessing of Low-Quality Handwr..."

  • ...Cao and Govindaraju [6,7] proposed a method using small fixed size patches to represent handwriting and restore broken handwritten text based on a MRF framework....



References

Journal ArticleDOI

31,977 citations


"Preprocessing of Low-Quality Handwr..." refers methods in this paper

  • ...Traditional document image binarization algorithms [ 16 ], [15], [18], [11], [21] separate the foreground from the background by histogram thresholding and analysis of the connectivity of strokes....


  • ...5.4 Comparison to Other Preprocessing Methods In Fig. 11, we compare our approach with the preprocessing algorithm of Milewski and Govindaraju [14], the Niblack algorithm [15], and the Otsu algorithm [ 16 ]....



Journal ArticleDOI
TL;DR: The analogy between images and statistical mechanics systems is made and the analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations, creating a highly parallel ``relaxation'' algorithm for MAP estimation.
Abstract: We make an analogy between images and statistical mechanics systems. Pixel gray levels and the presence and orientation of edges are viewed as states of atoms or molecules in a lattice-like physical system. The assignment of an energy function in the physical system determines its Gibbs distribution. Because of the Gibbs distribution, Markov random field (MRF) equivalence, this assignment also determines an MRF image model. The energy function is a more convenient and natural mechanism for embodying picture attributes than are the local characteristics of the MRF. For a range of degradation mechanisms, including blurring, nonlinear deformations, and multiplicative or additive noise, the posterior distribution is an MRF with a structure akin to the image model. By the analogy, the posterior distribution defines another (imaginary) physical system. Gradual temperature reduction in the physical system isolates low energy states (``annealing''), or what is the same thing, the most probable states under the Gibbs distribution. The analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations. The result is a highly parallel ``relaxation'' algorithm for MAP estimation. We establish convergence properties of the algorithm and we experiment with some simple pictures, for which good restorations are obtained at low signal-to-noise ratios.

18,328 citations


Additional excerpts

  • ...The goal of this paper is the preprocessing of degraded handwritten document images such as carbon forms for subsequent recognition and retrieval....



Book
01 Jan 1988
Abstract: From the Publisher: Probabilistic Reasoning in Intelligent Systems is a complete andaccessible account of the theoretical foundations and computational methods that underlie plausible reasoning under uncertainty. The author provides a coherent explication of probability as a language for reasoning with partial belief and offers a unifying perspective on other AI approaches to uncertainty, such as the Dempster-Shafer formalism, truth maintenance systems, and nonmonotonic logic. The author distinguishes syntactic and semantic approaches to uncertainty—and offers techniques, based on belief networks, that provide a mechanism for making semantics-based systems operational. Specifically, network-propagation techniques serve as a mechanism for combining the theoretical coherence of probability theory with modern demands of reasoning-systems technology: modular declarative inputs, conceptually meaningful inferences, and parallel distributed computation. Application areas include diagnosis, forecasting, image interpretation, multi-sensor fusion, decision support systems, plan recognition, planning, speech recognition—in short, almost every task requiring that conclusions be drawn from uncertain clues and incomplete information. Probabilistic Reasoning in Intelligent Systems will be of special interest to scholars and researchers in AI, decision theory, statistics, logic, philosophy, cognitive psychology, and the management sciences. Professionals in the areas of knowledge-based systems, operations research, engineering, and statistics will find theoretical and computational tools of immediate practical use. The book can also be used as an excellent text for graduate-level courses in AI, operations research, or applied probability.

15,149 citations


Proceedings ArticleDOI
01 Jul 2000
TL;DR: A novel algorithm for digital inpainting of still images that attempts to replicate the basic techniques used by professional restorators, and does not require the user to specify where the novel information comes from.
Abstract: Inpainting, the technique of modifying an image in an undetectable form, is as ancient as art itself. The goals and applications of inpainting are numerous, from the restoration of damaged paintings and photographs to the removal/replacement of selected objects. In this paper, we introduce a novel algorithm for digital inpainting of still images that attempts to replicate the basic techniques used by professional restorators. After the user selects the regions to be restored, the algorithm automatically fills-in these regions with information surrounding them. The fill-in is done in such a way that isophote lines arriving at the regions' boundaries are completed inside. In contrast with previous approaches, the technique here introduced does not require the user to specify where the novel information comes from. This is automatically done (and in a fast way), thereby allowing to simultaneously fill-in numerous regions containing completely different structures and surrounding backgrounds. In addition, no limitations are imposed on the topology of the region to be inpainted. Applications of this technique include the restoration of old photographs and damaged film; removal of superimposed text like dates, subtitles, or publicity; and the removal of entire objects from the image like microphones or wires in special effects.

3,421 citations


"Preprocessing of Low-Quality Handwr..." refers methods in this paper

  • ...MRF is ideally suited to this task and has been used successfully on natural scene images [2], [22]....



Book
01 Jan 1986

1,719 citations


Frequently Asked Questions (2)
Q1. What contributions have the authors mentioned in the paper "Preprocessing of low-quality handwritten documents using markov random fields" ?

This paper presents a statistical approach to the preprocessing of degraded handwritten forms including the steps of binarization and form line removal. To further improve the processing speed, the authors prune unlikely solutions from the search space while solving the MRF. 

The authors will investigate approaches to generalize their model to these applications in their future work.