
Flattening Curved Documents in Images
Jian Liang, Daniel DeMenthon, David Doermann
Language And Media Processing Laboratory
University of Maryland
College Park, MD 20742
{lj,daniel,doermann}@cfar.umd.edu
Abstract
Compared to scanned images, document pictures captured by camera can suffer from distortions due to perspective and page warping. It is necessary to restore a frontal planar view of the page before other OCR techniques can be applied. In this paper we describe a novel approach for flattening a curved document in a single picture captured by an uncalibrated camera. To our knowledge this is the first reported method able to process general curved documents in images without camera calibration. We propose to model the page surface by a developable surface, and exploit the properties (parallelism and equal line spacing) of the printed textual content on the page to recover the surface shape. Experiments show that the output images are much more OCR friendly than the original ones. While our method is designed to work with any general developable surface, it can be adapted for typical special cases including planar pages, scans of thick books, and opened books.
1. Introduction
Digital cameras have proliferated rapidly in recent years due to their small size, ease of use, fast response, rich set of features, and dropping price. For the OCR community, they present an attractive alternative to scanners as imaging devices for capturing documents because of their flexibility. However, compared to digital scans, camera-captured document images often suffer from many degradations, both from intrinsic limits of the devices and because of the unconstrained external environment. Among many new challenges, one of the most important is the distortion due to perspective and curved pages. Current OCR techniques are designed to work with scans of flat 2D documents, and cannot handle distortions involving 3D factors.
One way of dealing with these 3D factors is to use special equipment such as structured light to measure the 3D range data of the document, and recover the 2D plane of the page [1, 12]. The requirement for costly equipment, however, makes these approaches unattractive.
The problem of recovering planar surface orientations from images has been addressed by many researchers inside the general framework of shape estimation [5, 7, 10], and applied to the removal of perspective in images of flat documents [3, 4, 11]. However, page warping adds a non-linear, non-parametric process on top of this, making it much more difficult to recover the 3D shape. As a way out, researchers add more domain knowledge and constraints. For example, when scanning thick books, the portion near the book spine forms a cylinder shape [8], and results in curved text lines in the image. Zhang and Tan [16] estimate the cylinder shape from the varying shade in the image, assuming that flatbed scanners have a fixed light projection direction. For camera-captured document images, Cao et al. [2] use a parametric approach to estimate the cylinder shape of an opened book. Their method relies on text lines formed by bottom-up clustering of connected components. Apart from the cylinder shape assumption, they also have a restriction on the pose that requires the image plane to be parallel to the generatrix of the page cylinder. Gumerov et al. [6] present a method for shape estimation from single views of developable surfaces. They do not require cylinder shapes or special poses. However, they require correspondences between closed contours in the image and in the unrolled page. They propose to use the rectilinear page boundaries or margins in document images as contours. This may not be applicable when part of the page is occluded.
Another way out is to bypass the shape estimation step and produce an approximate flat view of the page, with what we call shape-free methods. For scans of thick bound volumes, Zhang and Tan [15] have another method for straightening curved text lines. They find text line curves by clustering connected components, and move the components to restore straight horizontal baselines. The shape is still unknown, but the image can be OCRed. Under the same cylinder shape and parallel view assumptions as Cao et al., Tsoi et al. [14] flatten images of opened books by a bilinear morphing operation which maps the curved page boundaries to a rectangle. Their method is also shape-free. Although shape-free methods are simpler, they can only deal with small distortions and cannot be applied when shape and pose are arbitrary.
Our goal is to restore a frontal planar image of a warped document page from a single picture captured by an uncalibrated digital camera. Our method is based on two key observations: 1) a curved document page can be modeled by a developable surface, and 2) printed textual content on the page forms texture flow fields that provide strong constraints on the underlying surface shape [9]. More specifically, we extract two texture flow fields from the textual area in the projected image, which represent the local orientations of projected text lines and vertical character strokes respectively. The intrinsic parallelism of the texture flow vectors on the curved page is used to detect the projected rulings, and the equal text line spacing property on the page is used to compute the vanishing points of the surface rulings. Then a developable surface is fitted to the rulings and texture flow fields, and the surface is unrolled to generate the flat page image.
Printed textual content provides the most prominent and stable visual features in document images [3, 11, 2, 15]. In real applications, other visual cues are not as reliable. For example, shading may be biased by multiple light sources; contours and edges may be occluded. In terms of how printed textual content is used, our work differs from [15, 2] in that we do not rely on connected component analysis, which may have difficulty with figures or tables. The mixture of text and non-text elements also makes traditional shape-from-texture techniques difficult to apply, while our texture flow based method still works. Overall, compared to others' work, our method does not require a flat page, does not require 3D range data, does not require camera calibration, does not require special shapes or poses, and can be applied to arbitrary developable document pages.
The remainder of this paper is organized into five sections. Section 2 introduces developable surfaces and describes the texture flow fields generated by printed text on document pages. Section 3 focuses on texture flow field extraction. We describe the details of surface estimation in Section 4, and discuss the experimental results in Section 5. Section 6 concludes the paper.

Figure 1. Strip approximation of a developable surface.
2. Problem Modeling
The shape of a smoothly rolled document page can be modeled by a developable surface. A developable surface can be mapped isometrically onto a Euclidean plane, or in plain English, can be unrolled onto a plane without tearing or stretching. This process is called development. Development does not change the intrinsic properties of the surface, such as curve length or the angle formed by curves.
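As a concrete illustration of this isometry, the minimal Python sketch below (not part of the original paper; the curve and cylinder radius are arbitrary choices) rolls a flat page onto a cylinder, one kind of developable surface, and checks numerically that a curve drawn on the page keeps its length:

```python
import numpy as np

def roll_onto_cylinder(u, v, r=0.5):
    """Isometric map from flat-page coordinates (u, v) onto a cylinder of
    radius r; the u-axis wraps around the cylinder, v runs along its axis."""
    return np.stack([r * np.sin(u / r), v, r * (1.0 - np.cos(u / r))], axis=-1)

# An arbitrary curve drawn on the flat page.
t = np.linspace(0.0, 1.0, 2000)
u, v = 2.0 * t, np.sin(3.0 * t)
flat = np.stack([u, v], axis=-1)
curved = roll_onto_cylinder(u, v)

# Polyline lengths agree to discretization error: development preserves length.
length = lambda pts: np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
print(length(flat), length(curved))
```

The map's partial derivatives form an orthonormal frame, so lengths and angles are preserved exactly; the printed values differ only by the polyline discretization.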
Rulings play a very important role in defining developable surfaces. Through any point on the surface there is one and only one ruling, except in the degenerate case of a plane. No two rulings intersect, except at conic vertices. All points along a ruling share a common tangent plane. It is well known in elementary differential geometry that, given sufficient differentiability, a developable surface is either a plane, a cylinder, a cone, the collection of the tangents of a curve in space, or a composition of these types. On a cylindrical surface, all rulings are parallel; on a conic surface, all rulings intersect at the conic vertex; for the tangent surface case, the rulings are the tangent lines of the underlying space curve; only in the planar case are rulings not uniquely defined.
The fact that all points along a ruling of a developable surface share a common tangent plane to the surface leads to the result that the surface is the envelope of a one-parameter family of planes, which are its tangent planes. Therefore a developable surface can be piecewise approximated by planar strips that belong to the family of tangent planes (Fig. 1). Although this is only a first-order approximation, it is sufficient for our application. The group of planar strips can be fully described by a set of reference points $\{P_i\}$ along a curve on the surface, and the surface normals $\{N_i\}$ at these points.
Suppose that for every point on a developable surface we select a tangent vector; we say that the tangents are parallel with respect to the underlying surface if, when the surface is developed, all tangents are parallel in the 2D space. A developable surface covered by a uniformly distributed non-isotropic texture can result in the perception of a parallel tangent field. On document pages, the texture of printed textual content forms two parallel tangent fields: the first field follows the local text line orientation, and the second field follows the vertical character stroke orientation. Since the text line orientation is more prominent, we call the first field the major tangent field and the second the minor tangent field.
The two 3D tangent fields are projected to two 2D flow fields in camera-captured images, which we call the major and minor texture flow fields, denoted $E_M$ and $E_m$. The 3D rulings on the surface are also projected to 2D lines on the image, which we call the 2D rulings or projected rulings.
The texture flow fields and 2D rulings are not directly visible. Section 3 introduces our method of extracting texture flow from textual regions of document images. The texture flow is used in Section 4 to derive projected rulings, find vanishing points of rulings, and estimate the page shape.
3. Texture Flow Computation
We are only interested in the texture flow produced by printed textual content in the image, therefore we need to first detect the textual area and textual content. Among the various text detection schemes proposed in the literature we adopt a simple one, since this is not the focus of this work. We use an edge detector to find pixels with strong gradient, and apply an open operator to expand those pixels into the textual area. Although simple, this method works well for document images with simple backgrounds. Then we use Niblack's adaptive thresholding [13] to get binary images of textual content (Fig. 2). The binarization does not have to be perfect, since we only use it to compute the texture flow fields, not for OCR.
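A minimal sketch of this detection-plus-binarization step follows, using OpenCV and scikit-image. It is an illustration rather than the paper's implementation: the Canny thresholds, kernel size, and Niblack window/k are assumed values, morphological closing stands in for the paper's pixel-expansion step, and `gray` is assumed to be an 8-bit grayscale image.

```python
import cv2
import numpy as np
from skimage.filters import threshold_niblack

def detect_text_and_binarize(gray):
    # Pixels with strong gradient are candidate text pixels.
    edges = cv2.Canny(gray, 50, 150)
    # Expand edge pixels into a solid text-area mask (assumed kernel size).
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))
    text_area = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
    # Niblack's adaptive threshold (T = mean - k * std in scikit-image).
    t = threshold_niblack(gray, window_size=25, k=0.2)
    binary = (gray < t) & (text_area > 0)    # dark text on light background
    return text_area, binary.astype(np.uint8)
```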
Figure 2. Text area detection and text binarization. (a) A document image captured by a camera. (b) Detected text area. (c) Binary text image.

The local texture flow direction can be viewed as a local skew direction. We divide the image into small blocks, and use projection profile analysis to compute the local skew at the center of each block. Instead of computing one skew angle, we compute several prominent skew angles as candidates. Initially their confidence values represent the energy of the corresponding projection profiles. A relaxation process follows to adjust confidences in such a way that candidates that agree with their neighbors get higher confidences. As a result, the local text line directions are found. The relaxation process is necessary because, due to randomness in small image blocks, the text line orientations may not initially be the most prominent. We use interpolation and extrapolation to fill a dense texture flow field $E_M$ that covers every pixel. Next, we remove the major texture flow directions from the local skew candidates, reset confidences for the remaining candidates, and apply the relaxation again. This time the results are the local vertical character stroke orientations. We compute a dense minor texture flow field $E_m$ in the same way.
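One way to realize the per-block skew-candidate step is sketched below; the angle grid, bin count, and the squared-bin "energy" score are assumptions, and the relaxation stage is omitted:

```python
import numpy as np

def skew_candidates(block, top_k=4):
    """Return the top_k candidate skew angles (radians) for one binary
    image block, scored by projection-profile energy."""
    ys, xs = np.nonzero(block)                     # text pixel coordinates
    if xs.size == 0:
        return np.array([]), np.array([])
    angles = np.deg2rad(np.arange(-45.0, 45.1, 1.0))
    scores = np.empty(len(angles))
    for i, a in enumerate(angles):
        # Signed distance of each pixel to a line at angle a through the origin.
        proj = ys * np.cos(a) - xs * np.sin(a)
        hist, _ = np.histogram(proj, bins=max(block.shape))
        scores[i] = np.sum(hist.astype(float) ** 2)  # profile "energy"
    best = np.argsort(scores)[::-1][:top_k]
    return angles[best], scores[best]
```

Sharp, well-aligned text lines concentrate pixels into few profile bins, which maximizes the squared-bin energy; this is the standard projection-profile skew criterion the paper builds on.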
Fig. 3 shows the major and minor texture flow fields computed from a binarized text image. Notice that $E_M$ is quite good in Fig. 3(c) even though two figures are embedded in the text. Overall, $E_M$ is much more accurate than $E_m$.
4. Page Shape Estimation
4.1. Finding Projected Rulings
Figure 3. Texture flow detection. (a) The four local skew candidates in a small block; after relaxation the two middle candidates are eliminated. (b)(c) Visualization of the major texture flow field. (d)(e) Visualization of the minor texture flow field.

Consider a developable surface D, a ruling R on D, the tangent plane T at R, and a parallel tangent field V defined on D. For a group of points $\{P_i\}$ along R, all the tangents $\{V(P_i)\}$ at these points lie on T, and are parallel. Suppose the camera projection maps $\{P_i\}$ to $\{p_i\}$, and $\{V(P_i)\}$ to $\{v(p_i)\}$. Then under orthographic projection, the $\{v(p_i)\}$ are parallel lines on the image plane; under spherical projection, the $\{v(p_i)\}$ all lie on great circles on the view sphere that intersect at two common points; and under perspective projection, the $\{v(p_i)\}$ are lines that share a common vanishing point. Therefore, in theory, given $E_M$ or $E_m$ we can detect projected rulings by testing the texture flow orientations along a ruling candidate against the above principles.
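The perspective case is easy to verify numerically. In the toy sketch below (unit focal length assumed; not from the paper), lines sharing a 3D direction $d$ project to image lines that converge to the common vanishing point $(d_x/d_z,\, d_y/d_z)$:

```python
import numpy as np

def project(P):
    """Pinhole projection with unit focal length, camera at the origin."""
    P = np.atleast_2d(np.asarray(P, dtype=float))
    return P[:, :2] / P[:, 2:3]

d = np.array([1.0, 0.5, 2.0])                  # shared 3D direction
v = d[:2] / d[2]                               # predicted vanishing point
for P0 in ([0.0, 0.0, 5.0], [1.0, -1.0, 6.0]):
    far = project(np.asarray(P0) + 1e7 * d)    # walk far along each line
    print(far, v)                              # projections approach v
```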
However, due to the errors in the estimated $E_M$ and $E_m$, we have found that the texture flow at individual pixels has too much noise for this direct method to work well. Instead, we propose to use small blocks of the texture flow field to increase the robustness of ruling detection.
In a simplified case, consider the image of a cylindrical surface covered with a parallel tangent field under orthographic projection. Suppose we take two small patches of the same shape (this is possible for a cylindrical surface) on the surface along a ruling. We can show that the two tangent sub-fields in the two patches project to two identical texture flow sub-fields in the image. This idea can be extended to general developable surfaces and perspective projection, as locally a developable surface can be approximated by cylindrical surfaces, and the projection can be approximated by orthographic projection. If the two patches are not taken along the same ruling, however, the above property will not hold. Therefore we have the following pseudo-code for detecting a 2D ruling that passes through a given point (x, y) (see Fig. 4):
1. For each ruling direction candidate $\theta \in [0, \pi)$ do the following:

   (a) Fix the line $l(\theta, x, y)$ that passes through $(x, y)$ and has angle $\theta$ with respect to the x-axis.

   (b) Slide the center of a window along $l$ at equal steps and collect the major texture flow field inside the window as a sub-field, giving $\{E_i\}_{i=1}^{n}$, where $n$ is the number of such sub-fields.

   (c) The score of the candidate $l(\theta, x, y)$ is

   $$s(\theta) = \frac{\sum_{i=2}^{n} d(E_{i-1}, E_i)}{n} \qquad (1)$$

   where $d(E_{i-1}, E_i)$ measures the difference between two sub-fields, which in our implementation is the sum of squared differences.

2. Output the $\theta$ that corresponds to the smallest $s(\theta)$ as the ruling direction.
We have found that the result is insensitive to the window size and step length over a large range.
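A sketch of this scoring loop is given below. It assumes `flow` is a dense array of text-line angles in radians; the window size, step length (twice the window half-width), and angle resolution are assumed values, not the paper's:

```python
import numpy as np

def ruling_direction(flow, x, y, win=15, n_steps=6):
    """Pick the ruling angle through (x, y) whose candidate line crosses
    the most self-similar sequence of texture-flow sub-fields (Eq. 1)."""
    h, w = flow.shape
    best_theta, best_score = 0.0, np.inf
    for theta in np.linspace(0.0, np.pi, 180, endpoint=False):
        dx, dy = np.cos(theta), np.sin(theta)
        subs = []
        for i in range(-n_steps, n_steps + 1):
            cx = int(round(x + 2 * win * i * dx))
            cy = int(round(y + 2 * win * i * dy))
            if win <= cx < w - win and win <= cy < h - win:
                subs.append(flow[cy - win:cy + win, cx - win:cx + win])
        if len(subs) < 2:
            continue
        # Mean sum-of-squared-differences between consecutive sub-fields.
        score = np.mean([np.sum((a - b) ** 2)
                         for a, b in zip(subs, subs[1:])])
        if score < best_score:
            best_theta, best_score = theta, score
    return best_theta
```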
To find a group of projected rulings that cover the whole text area, first a group of reference points is automatically selected, then for each point a projected ruling is computed. Because no two rulings intersect inside the 3D page, we add the restriction that two nearby projected rulings must not intersect inside the textual area.
As Fig. 4 shows, our ruling detection scheme works better in high-curvature parts of the surface than in flat parts. One reason is that in flat parts the rulings are not uniquely defined. On the other hand, note that when the surface curvature is small, the shape recovery is not sensitive to the ruling detection result, so the reduced accuracy in ruling computation does not have severe adverse effects on the final result.
4.2. Computing Vanishing Points of Rulings
We compute the vanishing points of rulings based on the equal text line spacing property in documents. For printed text lines in a paragraph, the line spacing is usually fixed. When a 3D ruling intersects these text lines, the intersections are equidistant in 3D space. Under perspective projection, if the 3D ruling is not parallel to the image plane, these intersections project to non-equidistant points on the image, and the changes of distances can reveal the vanishing point position:
Let $\{P_i\}_{i=-\infty}^{\infty}$ be a set of points along a line in 3D space such that $|P_i P_{i+1}|$ is constant. A perspective projection maps $P_i$ to $p_i$ on the image plane. Then by the invariance of the cross ratio we have

$$\frac{|p_i p_j|\,|p_k p_l|}{|p_i p_k|\,|p_j p_l|} = \frac{|P_i P_j|\,|P_k P_l|}{|P_i P_k|\,|P_j P_l|} = \frac{|i-j|\,|k-l|}{|i-k|\,|j-l|}, \quad \forall\, i, j, k, l. \qquad (2)$$

As a result we have

$$\frac{|p_i p_{i+1}|\,|p_{i+2} p_{i+3}|}{|p_i p_{i+2}|\,|p_{i+1} p_{i+3}|} = \frac{1}{4}, \quad \forall\, i, \qquad (3)$$

and

$$\frac{|p_i p_{i+1}|\,|p_{i+2}\, v|}{|p_i p_{i+2}|\,|p_{i+1}\, v|} = \frac{1}{2}, \quad \forall\, i, \qquad (4)$$

where $v$ is the vanishing point corresponding to $p_{\infty}$ or $p_{-\infty}$.

Figure 4. Projected ruling estimation. (a) Two projected ruling candidates and three image patches along the ruling candidates. (b) The estimated rulings. (c)(d)(e) Enlarged image patches. Notice that (c) and (d) have similar texture flow (but dissimilar texture) and are both different from (e).
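Eq. 3 and Eq. 4 are easy to check numerically. The toy sketch below (unit focal length assumed; not from the paper) projects equidistant 3D points through a pinhole and evaluates both ratios:

```python
import numpy as np

project = lambda P: P[:, :2] / P[:, 2:3]       # pinhole, unit focal length
dist = lambda a, b: np.linalg.norm(a - b)

d = np.array([0.3, 0.8, 1.0])                  # line direction, tilted in depth
P = np.array([0.2, -0.1, 4.0]) + 0.5 * np.arange(4)[:, None] * d
p = project(P)                                 # images of equidistant 3D points
v = d[:2] / d[2]                               # vanishing point of the line

r3 = dist(p[0], p[1]) * dist(p[2], p[3]) / (dist(p[0], p[2]) * dist(p[1], p[3]))
r4 = dist(p[0], p[1]) * dist(p[2], v) / (dist(p[0], p[2]) * dist(p[1], v))
print(r3, r4)                                  # 0.25 and 0.5, per Eq. 3 and Eq. 4
```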
We will come back to Eq. 4 and Eq. 3 after we describe how we find $\{p_i\}$. We use a modified projection profile analysis to find the intersections of a projected ruling and the text lines. Usually a projection profile is built by projecting pixels in a fixed direction onto a base line, such that each bin of the profile is $\sum I(x, y : ax + by = 0)$. We call this a linear projection profile, which is suitable for straight text lines. When text lines are curved, we instead project pixels along the curve onto the base line (the projected ruling in our context), such that each bin is $\sum I(x, y : f(x, y) = 0)$, where $f$ defines the curve. We call the result a curve-based projection profile (CBPP). The peaks of a CBPP correspond to positions where text lines intersect the base line (assuming text pixels have intensity 1).

Figure 5. Computing the vanishing point of a 2D ruling. (a) A 2D ruling in the document image. (b) The curve-based projection profile (CBPP) along the ruling in (a). (c) The smoothed and binarized CBPP with three text blocks identified. In each text block, the line spacing between top lines in the image (to the left in the profile graph) is smaller than that between lower lines (although this is not very visible to the eye). This difference is due to perspective foreshortening and is exploited to recover the vanishing point. In this particular case, the true vanishing point is (3083.70, 6225.06) and the estimated value is (3113, 5907) (both in pixel units).

Fig. 5 shows how we identify the text line positions along a ruling.
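A sketch of building a CBPP follows. Here `binary` is the binary text image, `ruling` an (n, 2) array of (x, y) samples along the projected ruling (assumed to lie inside the image), and `flow` a dense field of text-line angles in radians; the trace length `reach` is an assumed parameter:

```python
import numpy as np

def trace_sum(binary, flow, x, y, sign, steps):
    """Sum text pixels along the texture-flow curve starting at (x, y)."""
    h, w = binary.shape
    total = 0.0
    for _ in range(steps):
        xi, yi = int(round(x)), int(round(y))
        if not (0 <= xi < w and 0 <= yi < h):
            break
        total += binary[yi, xi]
        a = flow[yi, xi]                     # local text-line direction
        x, y = x + sign * np.cos(a), y + sign * np.sin(a)
    return total

def cbpp(binary, ruling, flow, reach=60):
    """One bin per ruling sample: text pixels summed along the curved
    text line through that sample, traced in both directions."""
    profile = np.zeros(len(ruling))
    for b, (x0, y0) in enumerate(ruling):
        a0 = flow[int(round(y0)), int(round(x0))]
        fwd = trace_sum(binary, flow, x0, y0, +1.0, reach)
        # Start the backward trace one step in so the center counts once.
        bwd = trace_sum(binary, flow, x0 - np.cos(a0), y0 - np.sin(a0),
                        -1.0, reach)
        profile[b] = fwd + bwd
    return profile
```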
The sequence of text line positions is clustered into K groups $\{p_i^k\}_{k=1}^{K}$, such that each group $\{p_i^k\}_{i=1}^{n_k}$ satisfies Eq. 3 within an error threshold. The purpose of the clustering is to separate text paragraphs, and to remove paragraphs that have fewer than three lines.
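One simple greedy realization of this clustering is sketched below; the tolerance value and the greedy strategy are assumptions, not the paper's exact procedure, and the positions are assumed distinct and sorted along the ruling:

```python
import numpy as np

def eq3_residual(q):
    """Deviation of four consecutive 1D positions from the 1/4 cross ratio."""
    q0, q1, q2, q3 = q
    return abs((q1 - q0) * (q3 - q2) / ((q2 - q0) * (q3 - q1)) - 0.25)

def cluster_text_lines(a, tol=0.03):
    """Greedily grow groups of text-line positions a while Eq. 3 holds
    within tol; groups with fewer than three lines are discarded."""
    groups, current = [], [a[0]]
    for x in a[1:]:
        candidate = current + [x]
        if len(candidate) < 4 or eq3_residual(candidate[-4:]) < tol:
            current = candidate
        else:
            groups.append(current)
            current = [x]
    groups.append(current)
    return [g for g in groups if len(g) >= 3]
```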
To find the best vanishing point $v$ that satisfies Eq. 4 for every group $\{p_i^k\}_{i=1}^{n_k}$, we first represent $p_i^k$ by its 1D coordinate $a_i^k$ along the ruling $r$ (the origin can be any point on $r$). We write $a_i^k = b_i^k + e_i^k$, where $e_i^k$ is the error term and $b_i^k$ is the true but unknown position of the text line. Under the assumption that $e_i^k$ follows a normal distribution, the best $v$ should minimize the error function.
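The paper's error function lies beyond this excerpt, but Eq. 4 already yields a simple estimator: it is linear in $v$ for each consecutive triple of line positions, so per-triple closed-form estimates can be pooled. A sketch under that simplification (not the paper's exact minimizer):

```python
import numpy as np

def vanishing_point_1d(a):
    """Estimate the 1D vanishing-point coordinate v along a ruling from
    text-line positions a (equally spaced in 3D). Eq. 4 is linear in v
    for each consecutive triple; the median pools the estimates."""
    estimates = []
    for a0, a1, a2 in zip(a, a[1:], a[2:]):
        # (a1 - a0)(v - a2) / ((a2 - a0)(v - a1)) = 1/2, solved for v:
        den = 2.0 * (a1 - a0) - (a2 - a0)
        if abs(den) > 1e-9:                  # den ~ 0: v is at infinity
            num = 2.0 * (a1 - a0) * a2 - (a2 - a0) * a1
            estimates.append(num / den)
    return float(np.median(estimates)) if estimates else None
```

When the image spacings are equal (ruling parallel to the image plane), the denominator vanishes and the vanishing point recedes to infinity, which matches the geometry.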

References

- "Evaluation of binarization methods for document images" (journal article)
- "Metric rectification for perspective images of planes" (conference paper)
- "Global and local document degradation models" (conference paper)
- "Image restoration of arbitrarily warped documents" (journal article)
- "Structure of Applicable Surfaces from Single Views" (book chapter)