International Journal of Computer Vision 40(1), 25–47, 2000
© 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.
Learning Low-Level Vision
WILLIAM T. FREEMAN
Mitsubishi Electric Research Labs., 201 Broadway, Cambridge, MA 02139
Freeman@merl.com
EGON C. PASZTOR
MIT Media Laboratory, E15-385, 20 Ames St., Cambridge, MA, 02139
egon@media.mit.edu
OWEN T. CARMICHAEL
209 Smith Hall, Carnegie-Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213
otc@andrew.cmu.edu
Abstract. We describe a learning-based method for low-level vision problems—estimating scenes from images.
We generate a synthetic world of scenes and their corresponding rendered images, modeling their relationships
with a Markov network. Bayesian belief propagation allows us to efficiently find a local maximum of the posterior
probability for the scene, given an image. We call this approach VISTA—Vision by Image/Scene TrAining.
We apply VISTA to the “super-resolution” problem (estimating high frequency details from a low-resolution
image), showing good results. To illustrate the potential breadth of the technique, we also apply it in two other
problem domains, both simplified. We learn to distinguish shading from reflectance variations in a single image
under particular lighting conditions. For the motion estimation problem in a “blobs world”, we show figure/ground
discrimination, solution of the aperture problem, and filling-in arising from application of the same probabilistic
machinery.
Keywords: vision and learning, belief propagation, low-level vision, super-resolution, shading and reflectance,
motion estimation
1. Introduction
We seek machinery for learning low-level vision prob-
lems, such as motion analysis, inferring shape and re-
flectance from a photograph, or extrapolating image
detail. For these problems, given image data, we want
to estimate an underlying scene (Fig. 1). The scene
quantities to be estimated might be projected object
velocities, surface shapes and reflectance patterns, or
missing high frequency details. These estimates are important for various tasks in image analysis, database search, and robotics.
Low-level vision problems are typically under-
constrained, so Bayesian (Berger, 1985; Knill and
Richards, 1996; Szeliski, 1989) and regularization
techniques (Poggio et al., 1985) are fundamental.
There has been much work and progress (for example,
Knill and Richards, 1996; Landy and Movshon, 1991;
Horn, 1986), but difficulties remain in working with
complex, real images. Typically, prior probabilities or
constraints are hypothesized, rather than learned.
A recent research theme has been to study the statis-
tics of natural images. Researchers have related those
statistics to properties of the human visual system
(Olshausen and Field, 1996; Bell and Sejnowski, 1997;
Simoncelli, 1997), or have used statistical characterizations of images to analyse and synthesize realistic textures (Heeger and Bergen, 1995; DeBonet and Viola, 1998; Zhu and Mumford, 1997; Simoncelli, 1997).
These methods may help us understand the early stages

Figure 1. Example low-level vision problems. For given “image” information, we want to estimate an underlying “scene” that created it
(idealized scene estimates shown).
of representation and processing, but unfortunately,
they don’t address how a visual system might inter-
pret images, i.e., estimate the underlying scene.
We want to combine the two research themes of
scene estimation and statistical learning. We study the
statistical properties of a synthetically generated world
ofimageslabelledwiththeirunderlyingscenes,tolearn
how to infer scenes from images. Our prior probabili-
tiesand rendering models canthenbe rich ones,learned
from the training data.
Several researchers have applied related learning ap-
proaches to low-level vision problems, but restricted
themselves to linear models (Kersten et al., 1987; Hurlbert and Poggio, 1988), too weak for many applications. Our approach is similar in spirit to relaxation labelling (Rosenfeld et al., 1976; Kittler and Illingworth,
1985), but our Bayesian propagation algorithm is more
efficient and we use training data to derive propagation
parameters.
We interpret images by modeling the relationship between local regions of images and scenes, and between neighboring local scene regions. The former allows initial scene estimates; the latter allows the estimates to propagate. We train from image/scene pairs and apply
the Bayesian machinery of graphical models (Pearl,
1988; Binford et al., 1988; Jordan, 1998). We were
influenced by the work of Weiss (Weiss, 1997), who
pointed out the speed advantage of Bayesian methods
over conventional relaxation methods for propagating
localmeasurement information.Fora relatedapproach,
but with heuristically derived propagation rules, see
Saund (1999).
We call our approach VISTA, Vision by Image/Scene TrAining. It is a general machinery that may apply to
various vision problems. We illustrate it for estimating
missing image details, disambiguating shading from
reflectance effects, and estimating motion.
2. Markov Network
For given image data, y, we seek to estimate the un-
derlying scene, x (we omit the vector symbols for
notational simplicity). We first calculate the posterior probability, $P(x \mid y) = c P(x, y)$ (the normalization, $c = 1/P(y)$, is a constant over $x$). Under two common loss functions (Berger, 1985), the best scene estimate, $\hat{x}$, is the mean (minimum mean squared error, MMSE) or the mode (maximum a posteriori, MAP) of the posterior probability.
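As a concrete illustration of the two estimators (a sketch of ours, not from the paper; the discretized posterior is invented), the MMSE estimate is the posterior-weighted mean of the scene states and the MAP estimate is the single most probable state:

```python
import numpy as np

# Invented discretized posterior P(x | y) over five scene states.
x_states = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
posterior = np.array([0.05, 0.10, 0.20, 0.45, 0.20])  # sums to 1

x_mmse = np.sum(x_states * posterior)      # posterior mean (MMSE)
x_map = x_states[np.argmax(posterior)]     # posterior mode (MAP)
print(x_mmse, x_map)                       # 2.65 3.0
```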
In general, $\hat{x}$ can be difficult to compute without approximations (Knill and Richards, 1996). We
make the Markov assumption: we divide both the
image and scene into patches, and assign one node
of a Markov network (Geman and Geman, 1984;
Pearl, 1988; Jordan, 1998) to each patch. We draw
the network as nodes connected by lines, which in-
dicate statistical dependencies. Given the variables at
intervening nodes, two nodes of a Markov network
are statistically independent. We connect each scene
patch both to its corresponding image patch and to its
spatial neighbors, Fig. 2. For some problems where
long-range interactions are important, we add layers of
image and scene patches at other spatial scales, con-
necting scene patches to image patches at the same
scale, and to scene patches at neighboring scales and
positions. (Unlike Luettgen et al. (1994), this is not
a tree because of the connections between spatial
neighbors).
The Markov network topology of Fig. 2 implies that knowing the scene at position $j$: (1) provides all the information about the rendered image there, because $x_j$ has the only link to $y_j$, and (2) gives information about nearby scene values, by the links from $x_j$ to nearby scene neighbors. We will call problems with these properties low-level vision problems.

Figure 2. Markov network for vision problems. Each node in the
network describes a local patch of image or scene. Observations, y,
have underlying scene explanations, x. Lines in the graph indicate
statistical dependencies between nodes.
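The single-scale topology is easy to enumerate in code. Below is a minimal sketch (our illustration; the patch-grid dimensions are arbitrary) listing the two kinds of dependency links in Fig. 2: each scene patch links to its corresponding image patch and to its 4-connected scene neighbors:

```python
# Enumerate the statistical-dependency links of the single-scale
# network of Fig. 2 for an nrows x ncols grid of patches.
def grid_network_edges(nrows, ncols):
    scene_image_links = []  # (scene (r, c), image (r, c)) pairs
    scene_scene_links = []  # neighboring scene-patch pairs
    for r in range(nrows):
        for c in range(ncols):
            scene_image_links.append(((r, c), (r, c)))
            if r + 1 < nrows:  # neighbor below
                scene_scene_links.append(((r, c), (r + 1, c)))
            if c + 1 < ncols:  # neighbor to the right
                scene_scene_links.append(((r, c), (r, c + 1)))
    return scene_image_links, scene_scene_links

si, ss = grid_network_edges(3, 3)
print(len(si), len(ss))  # 9 scene-image links, 12 scene-scene links
```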
Solving a Markov network involves a learning
phase, where the parameters of the network connec-
tions are learned from training data, and an inference
phase, when the scene corresponding to particular im-
age data is estimated.
For a Markov random field, the joint probability over the scenes x and images y can be written (Besag, 1974; Geman and Geman, 1984; Geiger and Girosi, 1991):

$$P(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N) = \prod_{(i,j)} \Psi(x_i, x_j) \prod_k \Phi(x_k, y_k), \qquad (1)$$

where we have introduced pairwise compatibility functions, $\Psi$ and $\Phi$, which are learned from the training data. $(i, j)$ indicates neighboring nodes $i$, $j$ and $N$ is the number of image and scene nodes.
We can write the MAP and MMSE estimates for $\hat{x}_j$ by marginalizing (MMSE) or taking the maximum (MAP) over the other variables in the posterior probability. For discrete variables, the marginalization involves summations over the discrete values of the scene variables at each node, indicated by the summations below:

$$\hat{x}_{j\,\mathrm{MMSE}} = \sum_{x_j} x_j \sum_{\text{all } x_i,\; i \neq j} P(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N) \qquad (2)$$

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} \max_{\text{all } x_i,\; i \neq j} P(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N). \qquad (3)$$
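At toy scale, Eqs. (2) and (3) can be evaluated by direct enumeration; the sketch below (ours, with invented compatibilities for a three-node chain) does exactly that, and makes plain why the cost, $O(\text{states}^{\text{nodes}})$ terms, is prohibitive for real networks:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_states, n_nodes = 4, 3
# Invented compatibilities for a 3-node chain; observations held fixed.
Psi = [rng.random((n_states, n_states)) for _ in range(n_nodes - 1)]
Phi = [rng.random(n_states) for _ in range(n_nodes)]

def joint(xs):
    """Eq. (1): product of pairwise and local compatibilities."""
    p = 1.0
    for i in range(n_nodes - 1):
        p *= Psi[i][xs[i], xs[i + 1]]
    for k in range(n_nodes):
        p *= Phi[k][xs[k]]
    return p

# Eq. (3): MAP by exhaustive maximization over n_states**n_nodes terms.
x_map = max(itertools.product(range(n_states), repeat=n_nodes), key=joint)

# Eq. (2): MMSE at node j by exhaustive summation (posterior-normalized).
j = 0
marginal = np.zeros(n_states)
for xs in itertools.product(range(n_states), repeat=n_nodes):
    marginal[xs[j]] += joint(xs)
x_mmse_j = np.dot(np.arange(n_states), marginal / marginal.sum())
print(x_map, x_mmse_j)
```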
For networks larger than toy examples, Eqs. (2) and (3) are infeasible to evaluate directly because of the high dimensionality of the scene variables over which $P(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N)$ must be summed or maximized. When the networks form chains or trees, however, we can evaluate the equations.
2.1. Inference in Networks Without Loops
For networks without loops, the Markov assumption
leads to simple “message-passing” rules for computing
the MAP and MMSE estimates during inference
(Pearl, 1988; Weiss, 1998; Jordan, 1998). The factorized structure of Eq. (1) allows the marginalization and maximization operators of Eqs. (2) and (3) to pass through $\Psi$ and $\Phi$ factors with unrelated arguments. For example, for the network in Fig. 3, substituting Eq. (1) for $P(x, y)$ into Eq. (3) for $\hat{x}_{j\,\mathrm{MAP}}$ at node 1 gives

$$\hat{x}_{1\,\mathrm{MAP}} = \arg\max_{x_1} \max_{x_2} \max_{x_3} P(x_1, x_2, x_3, y_1, y_2, y_3) \qquad (4)$$

$$= \arg\max_{x_1} \max_{x_2} \max_{x_3} \Phi(x_1, y_1)\,\Phi(x_2, y_2)\,\Phi(x_3, y_3)\,\Psi(x_1, x_2)\,\Psi(x_2, x_3) \qquad (5)$$

$$= \arg\max_{x_1} \Phi(x_1, y_1) \max_{x_2} \Psi(x_1, x_2)\,\Phi(x_2, y_2) \max_{x_3} \Psi(x_2, x_3)\,\Phi(x_3, y_3). \qquad (6)$$

Each line of Eq. (6) is a local computation involving only one node and its neighbors. The analogous expressions for $\hat{x}_{2\,\mathrm{MAP}}$ and $\hat{x}_{3\,\mathrm{MAP}}$ also use local calculations. Passing local “messages” between neighbors, as described below, gives an efficient way to compute the MAP estimates.
Assuming a network without loops, Eqs. (3) and (2) can be computed by iterating the following steps (Pearl, 1988; Weiss, 1998; Jordan, 1998). The MAP estimate at node $j$ is

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} \Phi(x_j, y_j) \prod_k M^k_j, \qquad (7)$$

where $k$ runs over all scene node neighbors of node $j$, and $M^k_j$ is the message from node $k$ to node $j$. We calculate $M^k_j$ from:

$$M^k_j = \max_{x_k} \Psi(x_j, x_k)\,\Phi(x_k, y_k) \prod_{l \neq j} \tilde{M}^l_k, \qquad (8)$$

where $\tilde{M}^l_k$ is $M^l_k$ from the previous iteration. The initial $\tilde{M}^k_j$'s are set to column vectors of 1's, of the dimensionality of the variable $x_j$.

Figure 3. Example Markov network without any loops, used for the belief propagation example described in the text. The compatibility functions $\Phi$ and $\Psi$ are defined below.
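In code, each message update of Eq. (8) and the readout of Eq. (7) are a few lines of numpy per node. The sketch below is ours, for discrete states, with the message $M^k_j$ stored as a vector over the states of $x_j$:

```python
import numpy as np

def message_update(Psi_jk, Phi_k, incoming):
    """Eq. (8), max-product message from node k to node j.

    Psi_jk[a, b] = Psi(x_j = a, x_k = b); Phi_k[b] = Phi(x_k = b, y_k);
    incoming = previous-iteration messages M~^l_k for l != j, each a
    vector over the states of x_k.
    """
    prod = Phi_k * (np.prod(incoming, axis=0) if incoming else 1.0)
    return (Psi_jk * prod[None, :]).max(axis=1)   # max over x_k

def map_readout(Phi_j, messages):
    """Eq. (7): MAP state at node j from local evidence and messages."""
    belief = Phi_j * np.prod(messages, axis=0)
    return int(np.argmax(belief))

Psi = np.array([[1.0, 0.2], [0.2, 1.0]])  # favors matching neighbor states
Phi = np.array([0.9, 0.1])                # local evidence prefers state 0
print(message_update(Psi, Phi, []), map_readout(Phi, [np.ones(2)]))
```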
To illustrate how these equations are used, we show how Eq. (7) reduces to Eq. (6) for the example of Fig. 3. First, a note about the compatibility matrices, $\Psi$ and $\Phi$. For a given observed image patch, $y_k$, the image-scene compatibility function, $\Phi(x_k, y_k)$, is a column vector, indexed by the different possible states of $x_k$, the scene at node $k$. The scene-scene compatibility function, $\Psi(x_i, x_j)$, will be a matrix with the different possible states of $x_i$ and $x_j$, the scenes at nodes $i$ and $j$, indexing the rows and columns. Because the initial messages are 1's, at the first iteration, all the messages in the network are:
$$M^2_1 = \max_{x_2} \Psi(x_1, x_2)\,\Phi(x_2, y_2) \qquad (9)$$

$$M^3_2 = \max_{x_3} \Psi(x_2, x_3)\,\Phi(x_3, y_3) \qquad (10)$$

$$M^1_2 = \max_{x_1} \Psi(x_2, x_1)\,\Phi(x_1, y_1) \qquad (11)$$

$$M^2_3 = \max_{x_2} \Psi(x_3, x_2)\,\Phi(x_2, y_2). \qquad (12)$$
The second iteration uses the messages above as the $\tilde{M}$ variables in Eq. (8):

$$M^2_1 = \max_{x_2} \Psi(x_1, x_2)\,\Phi(x_2, y_2)\,\tilde{M}^3_2 \qquad (13)$$

$$M^3_2 = \max_{x_3} \Psi(x_2, x_3)\,\Phi(x_3, y_3) \qquad (14)$$

$$M^2_3 = \max_{x_2} \Psi(x_3, x_2)\,\Phi(x_2, y_2)\,\tilde{M}^1_2 \qquad (15)$$

$$M^1_2 = \max_{x_1} \Psi(x_2, x_1)\,\Phi(x_1, y_1). \qquad (16)$$
Substituting $M^3_2$ of Eq. (10) for $\tilde{M}^3_2$ in Eq. (13) gives

$$M^2_1 = \max_{x_2} \Psi(x_1, x_2)\,\Phi(x_2, y_2) \max_{x_3} \Psi(x_2, x_3)\,\Phi(x_3, y_3). \qquad (17)$$

For this example, the messages don't change in subsequent iterations. We substitute the final messages into Eq. (7) to compute the MAP estimates, for example,

$$\hat{x}_{1\,\mathrm{MAP}} = \arg\max_{x_1} \Phi(x_1, y_1)\,M^2_1. \qquad (18)$$

Substituting Eq. (17), the converged message value for $M^2_1$, in Eq. (18) above gives precisely Eq. (6) for $\hat{x}_{1\,\mathrm{MAP}}$. The exact MAP estimates for $x_2$ and $x_3$ are found analogously.
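As a quick numerical check of this claim (our sketch; the compatibilities are invented), after the two message iterations above, the readout of Eq. (18) matches the brute-force maximization of Eq. (6):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
S = 3                                      # states per node (arbitrary)
Psi12, Psi23 = rng.random((S, S)), rng.random((S, S))
Phi1, Phi2, Phi3 = rng.random(S), rng.random(S), rng.random(S)

M2_1 = (Psi12 * Phi2[None, :]).max(axis=1)             # Eq. (9)
M3_2 = (Psi23 * Phi3[None, :]).max(axis=1)             # Eq. (10)
M2_1 = (Psi12 * (Phi2 * M3_2)[None, :]).max(axis=1)    # Eq. (13), converged

x1_map = int(np.argmax(Phi1 * M2_1))                   # Eq. (18)

def joint(x1, x2, x3):                                 # Eq. (1) for Fig. 3
    return Phi1[x1] * Phi2[x2] * Phi3[x3] * Psi12[x1, x2] * Psi23[x2, x3]

x1_brute = max(product(range(S), repeat=3), key=lambda xs: joint(*xs))[0]
assert x1_map == x1_brute                              # agrees with Eq. (6)
```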
It can be shown (Pearl, 1988; Weiss, 1998; Jordan, 1998) that after at most one global iteration of Eq. (8) for each node in the network, Eq. (7) gives the desired optimal estimate, $\hat{x}_{j\,\mathrm{MAP}}$, at each node $j$.
The MMSE estimate, Eq. (2), has analogous formulae, with the $\max_{x_k}$ of Eq. (8) replaced by $\sum_{x_k}$, and the $\arg\max_{x_j}$ of Eq. (7) replaced by $\sum_{x_j} x_j$. For Markov networks without loops, these propagation rules are equivalent to standard Bayesian inference methods, such as the Kalman filter and the forward-backward algorithm for Hidden Markov Models (Pearl, 1988; Luettgen et al., 1994; Weiss, 1997; Smyth et al., 1997; Frey, 1998; Jordan, 1998).
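The MMSE rules are the same code after two substitutions; a minimal sketch (ours) of the sum-product analogue, with the $\max_{x_k}$ of Eq. (8) replaced by a sum and the readout replaced by a posterior-weighted mean:

```python
import numpy as np

def message_update_sum(Psi_jk, Phi_k, incoming):
    """Sum-product analogue of Eq. (8): sum over the states of x_k."""
    prod = Phi_k * (np.prod(incoming, axis=0) if incoming else 1.0)
    return Psi_jk @ prod

def mmse_readout(states_j, Phi_j, messages):
    """MMSE analogue of Eq. (7): mean of x_j under the local belief."""
    belief = Phi_j * np.prod(messages, axis=0)
    return np.dot(states_j, belief / belief.sum())
```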
A second factorization of the joint probability can also be used instead of Eq. (1), although it is only valid for chains or trees, while Eq. (1) is valid for general Markov networks. This is the chain rule factorization of the joint probability, similar to Pearl (1988). For Fig. 3, using the Markov properties, we can write
$$P(x_1, y_1, x_2, y_2, x_3, y_3) = P(x_1)\,P(y_1 \mid x_1)\,P(x_2 \mid x_1)\,P(y_2 \mid x_2)\,P(x_3 \mid x_2)\,P(y_3 \mid x_3). \qquad (19)$$
Following the same reasoning as in Eqs. (4)–(6), this factorization leads to the following MAP update and estimation rules:

$$M^k_j = \max_{x_k} P(x_k \mid x_j)\,P(y_k \mid x_k) \prod_{l \neq j} \tilde{M}^l_k, \qquad (20)$$

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} P(x_j)\,P(y_j \mid x_j) \prod_k M^k_j, \qquad (21)$$

where $k$ runs over all scene node neighbors of node $j$.
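In code, the only change from Eq. (8) is swapping the learned compatibilities for the conditionals $P(x_k \mid x_j)$ and $P(y_k \mid x_k)$, plus a prior $P(x_j)$ at readout; a minimal sketch (ours, with the same vector/matrix conventions as before):

```python
import numpy as np

def message_update_chain(P_xk_given_xj, P_yk_given_xk, incoming):
    """Eq. (20); P_xk_given_xj[a, b] = P(x_k = b | x_j = a)."""
    prod = P_yk_given_xk * (np.prod(incoming, axis=0) if incoming else 1.0)
    return (P_xk_given_xj * prod[None, :]).max(axis=1)

def map_readout_chain(prior_xj, P_yj_given_xj, messages):
    """Eq. (21): the prior P(x_j) joins the local evidence."""
    belief = prior_xj * P_yj_given_xj * np.prod(messages, axis=0)
    return int(np.argmax(belief))
```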
While the expression for the joint probability does not generalize to a network with loops, we nonetheless found good results for some problems using these update rules (for Section 5 and much of Section 3).
2.2. Networks with Loops
For a network with loops, Eqs. (2) and (3) do not fac-
tor into local calculations as in Eq. (6). Finding exact
MAP or MMSE values for a Markov network with
loops can be computationally prohibitive. Researchers
have proposed a variety of approximations (Geman
and Geman, 1984; Geiger and Girosi, 1991; Jordan,
1998). Strong empirical results in “Turbo codes”
(Kschischang and Frey, 1998; McEliece et al., 1998),
layered image analysis (Frey, 2000) and recent theo-
retical work (Weiss, 1998; Weiss and Freeman, 1999;
Yedidia et al., 2000) provide support for a very sim-
ple approximation: applying the propagation rules of
Eqs. (8) and (7) even in the network with loops. Table 1
summarizes results from Weiss and Freeman (1999):
(1) for Gaussian processes, the MMSE propagation
scheme can converge only to the true posterior means.
(2) Even for non-Gaussian processes, if the MAP prop-
agation scheme converges, it finds at least a local max-
imum of the true posterior probability. Furthermore,
this condition of local optimality for the converged so-
lution of the MAP algorithm is a strong one. For every
subset of nodes of the network which form a tree, if the
remaining network nodes are constrained to their con-
verged values, the values of the sub-tree’s nodes found
by the MAP algorithm are the global maximum over
that tree’s nodes (Weiss and Freeman, 2000). Yedidia
et al. (2000) show that the MMSE belief propagation
equations are equivalent to the stationarity conditions
for the Bethe approximation to the “free energy” of
the network. These experimental and theoretical re-
sults motivate applying the belief propagation rules of
Eqs. (8) and (7) even in a Markov network with loops. (There is no corresponding theoretical justification for applying Eqs. (20) and (21) in a network with loops; we rely on experiment.)

Table 1. Summary of results from Weiss and Freeman (1999) regarding belief propagation results after convergence.

                                Network topology
Belief propagation algorithm    No loops                        Arbitrary topology
MMSE rules                      MMSE, correct posterior         For Gaussians, correct
                                marginal probs.                 means, wrong covs.
MAP rules                       MAP estimate                    Local max. of posterior,
                                                                even for non-Gaussians.
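Concretely, the approximation is to run the update of Eq. (8) synchronously on the loopy graph until the messages stop changing, then read out Eq. (7). A compact sketch (ours; the four-node loop and random compatibilities are invented, and we normalize messages only for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(2)
S = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]        # a single loop
Psi = {e: rng.random((S, S)) for e in edges}    # Psi(x_i, x_j) per edge
Phi = [rng.random(S) for _ in range(4)]

def psi(i, j):  # symmetric lookup, indexed [x_i, x_j]
    return Psi[(i, j)] if (i, j) in Psi else Psi[(j, i)].T

# One message per directed edge, initialized to all ones (as in the text).
M = {(k, j): np.ones(S) for (a, b) in edges for (k, j) in [(a, b), (b, a)]}
for _ in range(50):                             # iterate Eq. (8)
    M_new = {}
    for (k, j) in M:
        prod = Phi[k].copy()
        for (l, kk) in M:                       # incoming messages to k
            if kk == k and l != j:
                prod *= M[(l, k)]
        msg = (psi(j, k) * prod[None, :]).max(axis=1)
        M_new[(k, j)] = msg / msg.sum()
    M = M_new

x_map = [int(np.argmax(Phi[j] * np.prod(        # Eq. (7) readout
    [m for (k, jj), m in M.items() if jj == j], axis=0)))
    for j in range(4)]
print(x_map)
```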
2.3. Representation
We need to choose a representation for the image and scene variables. The images and scenes are arrays of vector-valued pixels, indicating, for example, color image intensities or surface height and reflectance information. We divide these into patches. For both compression and generalization, we use principal components analysis (PCA) to find a set of lower dimensional basis functions for the patches of image and scene pixels. We measure distances in this representation using a Euclidean norm, unless otherwise stated.
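A minimal sketch of this preprocessing (ours; the patch size, number of components, and random stand-in for training images are placeholders): cut arrays into patches, then project each patch onto its leading principal components:

```python
import numpy as np

def to_patches(image, size):
    """Cut a 2-D array into non-overlapping size x size patches, flattened."""
    h, w = image.shape
    return np.array([image[r:r + size, c:c + size].ravel()
                     for r in range(0, h - size + 1, size)
                     for c in range(0, w - size + 1, size)])

def pca_basis(patches, n_components):
    """Learn a low-dimensional PCA basis from rows of flattened patches."""
    mean = patches.mean(axis=0)
    _, _, Vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, Vt[:n_components]   # principal directions as rows

rng = np.random.default_rng(3)
train = to_patches(rng.random((64, 64)), size=8)  # stand-in training image
mean, basis = pca_basis(train, n_components=10)
coeffs = (train - mean) @ basis.T                 # low-dimensional patch codes
print(coeffs.shape)                               # (64, 10)
```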
We also need to pick a form for the compatibility functions $\Phi(x_j, y_j)$ and $\Psi(x_j, x_k)$ in Eqs. (7) and (8), as well as for the messages, $M^k_j$. One could represent those functions as Gaussian mixtures (Freeman and Pasztor, 1999) over the joint spaces $x_j \times y_j$ and $x_j \times x_k$; however, multiplication of the Gaussian mixtures is cumbersome, requiring repeated pruning to restore the product Gaussian mixtures to a manageable number of Gaussians.
We prefer a discrete representation. The most
straight-forward approach would be to evenly sample
all possible states of each image and scene variable
at each patch. Unfortunately, for reasonably sized
patches, the scene and image variables need to be of a
high enough dimensionality that an evenly-spaced dis-
crete sampling of the entire high dimensional space is
not feasible.
To address that, we evaluate $\Phi(x_j, y_j)$ and $\Psi(x_j, x_k)$ only at a restricted set of discrete points, a subset of our training set. (For other sample-based representations see Isard and Blake (1996) and DeBonet and Viola (1998).) Our final MAP (or MMSE) estimates will be maxima over (or weights on) a subset of training samples. In all our examples, we used the MAP estimate. The estimated scene at each patch will always be some example from the training set.
At each node we collect a set of 10 or 20 “scene can-
didates” from the training data which have image data
closely matching the observation, or local evidence,
at that node. (We think of these as a “line-up of sus-
pects”, as in a police line-up.) We will evaluate proba-
bilities only at those scene values. This simplification
focuses the computational effort on only those scenes
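A sketch of this candidate-selection step (ours; the training-set size, patch dimensionality, and candidate count are placeholders): at one node, keep the training scenes whose paired image patches lie closest to the local observation:

```python
import numpy as np

def scene_candidates(y_obs, train_images, train_scenes, n_candidates=10):
    """Return the scene patches whose paired training image patches best
    match the local observation y_obs, under the Euclidean norm."""
    d = np.linalg.norm(train_images - y_obs[None, :], axis=1)
    idx = np.argsort(d)[:n_candidates]     # the "line-up of suspects"
    return train_scenes[idx]

rng = np.random.default_rng(4)
train_images = rng.random((5000, 20))  # stand-in PCA codes of image patches
train_scenes = rng.random((5000, 20))  # paired scene-patch codes
y_obs = rng.random(20)
print(scene_candidates(y_obs, train_images, train_scenes).shape)  # (10, 20)
```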

References
Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
Burt, P.J. and Adelson, E.H. (1983). The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.