International Journal of Computer Vision 40(1), 25–47, 2000
© 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.
Learning Low-Level Vision
WILLIAM T. FREEMAN
Mitsubishi Electric Research Labs., 201 Broadway, Cambridge, MA 02139
Freeman@merl.com
EGON C. PASZTOR
MIT Media Laboratory, E15-385, 20 Ames St., Cambridge, MA, 02139
egon@media.mit.edu
OWEN T. CARMICHAEL
209 Smith Hall, Carnegie-Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213
otc@andrew.cmu.edu
Abstract. We describe a learning-based method for low-level vision problems—estimating scenes from images.
We generate a synthetic world of scenes and their corresponding rendered images, modeling their relationships
with a Markov network. Bayesian belief propagation allows us to efficiently find a local maximum of the posterior
probability for the scene, given an image. We call this approach VISTA—Vision by Image/Scene TrAining.
We apply VISTA to the “super-resolution” problem (estimating high frequency details from a low-resolution
image), showing good results. To illustrate the potential breadth of the technique, we also apply it in two other
problem domains, both simplified. We learn to distinguish shading from reflectance variations in a single image
under particular lighting conditions. For the motion estimation problem in a “blobs world”, we show figure/ground
discrimination, solution of the aperture problem, and filling-in arising from application of the same probabilistic
machinery.
Keywords: vision and learning, belief propagation, low-level vision, super-resolution, shading and reflectance,
motion estimation
1. Introduction
We seek machinery for learning low-level vision prob-
lems, such as motion analysis, inferring shape and re-
flectance from a photograph, or extrapolating image
detail. For these problems, given image data, we want
to estimate an underlying scene (Fig. 1). The scene
quantities to be estimated might be projected object
velocities, surface shapes and reflectance patterns, or
missing high frequency details. These estimates are important for various tasks in image analysis, database search, and robotics.
Low-level vision problems are typically under-
constrained, so Bayesian (Berger, 1985; Knill and
Richards, 1996; Szeliski, 1989) and regularization
techniques (Poggio et al., 1985) are fundamental.
There has been much work and progress (for example,
Knill and Richards, 1996; Landy and Movshon, 1991;
Horn, 1986), but difficulties remain in working with
complex, real images. Typically, prior probabilities or
constraints are hypothesized, rather than learned.
A recent research theme has been to study the statis-
tics of natural images. Researchers have related those
statistics to properties of the human visual system
(Olshausen and Field, 1996; Bell and Sejnowski, 1997;
Simoncelli, 1997), or have used statistical characterizations of images to analyse and synthesize realistic textures (Heeger and Bergen, 1995; DeBonet and Viola, 1998; Zhu and Mumford, 1997; Simoncelli, 1997).
These methods may help us understand the early stages

Figure 1. Example low-level vision problems. For given “image” information, we want to estimate an underlying “scene” that created it
(idealized scene estimates shown).
of representation and processing, but unfortunately,
they don’t address how a visual system might inter-
pret images, i.e., estimate the underlying scene.
We want to combine the two research themes of
scene estimation and statistical learning. We study the
statistical properties of a synthetically generated world
ofimageslabelledwiththeirunderlyingscenes,tolearn
how to infer scenes from images. Our prior probabili-
tiesand rendering models canthenbe rich ones,learned
from the training data.
Several researchers have applied related learning ap-
proaches to low-level vision problems, but restricted
themselves to linear models (Kersten et al., 1987; Hurlbert and Poggio, 1988), too weak for many applications. Our approach is similar in spirit to relaxation labelling (Rosenfeld et al., 1976; Kittler and Illingworth,
1985), but our Bayesian propagation algorithm is more
efficient and we use training data to derive propagation
parameters.
We interpret images by modeling the relationship between local regions of images and scenes, and between neighboring local scene regions. The former allows initial scene estimates; the latter allows the estimates to propagate. We train from image/scene pairs and apply
the Bayesian machinery of graphical models (Pearl,
1988; Binford et al., 1988; Jordan, 1998). We were
influenced by the work of Weiss (Weiss, 1997), who
pointed out the speed advantage of Bayesian methods
over conventional relaxation methods for propagating
localmeasurement information.Fora relatedapproach,
but with heuristically derived propagation rules, see
Saund (1999).
We call our approach VISTA, Vision by Image/Scene TrAining. It is a general machinery that may apply to
various vision problems. We illustrate it for estimating
missing image details, disambiguating shading from
reflectance effects, and estimating motion.
2. Markov Network
For given image data, y, we seek to estimate the un-
derlying scene, x (we omit the vector symbols for
notational simplicity). We first calculate the posterior probability, $P(x \mid y) = c P(x, y)$ (the normalization, $c = 1/P(y)$, is a constant over $x$). Under two common loss functions (Berger, 1985), the best scene estimate, $\hat{x}$, is the mean (minimum mean squared error, MMSE) or the mode (maximum a posteriori, MAP) of the posterior probability.
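As a concrete illustration of the two estimators (a sketch of ours, not from the paper; the discretized posterior is invented), the MMSE estimate is the posterior-weighted mean of the scene states and the MAP estimate is the single most probable state:

```python
import numpy as np

# Invented discretized posterior P(x | y) over five scene states.
x_states = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
posterior = np.array([0.05, 0.10, 0.20, 0.45, 0.20])  # sums to 1

x_mmse = np.sum(x_states * posterior)      # posterior mean (MMSE)
x_map = x_states[np.argmax(posterior)]     # posterior mode (MAP)
print(x_mmse, x_map)                       # 2.65 3.0
```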
In general, $\hat{x}$ can be difficult to compute without approximations (Knill and Richards, 1996). We
make the Markov assumption: we divide both the
image and scene into patches, and assign one node
of a Markov network (Geman and Geman, 1984;
Pearl, 1988; Jordan, 1998) to each patch. We draw
the network as nodes connected by lines, which in-
dicate statistical dependencies. Given the variables at
intervening nodes, two nodes of a Markov network
are statistically independent. We connect each scene
patch both to its corresponding image patch and to its
spatial neighbors, Fig. 2. For some problems where
long-range interactions are important, we add layers of
image and scene patches at other spatial scales, con-
necting scene patches to image patches at the same
scale, and to scene patches at neighboring scales and
positions. (Unlike Luettgen et al. (1994), this is not
a tree because of the connections between spatial
neighbors).
The Markov network topology of Fig. 2 implies that knowing the scene at position $j$: (1) provides all the information about the rendered image there, because $x_j$ has the only link to $y_j$, and (2) gives information about nearby scene values, by the links from $x_j$ to nearby scene neighbors. We will call problems with these properties low-level vision problems.

Figure 2. Markov network for vision problems. Each node in the
network describes a local patch of image or scene. Observations, y,
have underlying scene explanations, x. Lines in the graph indicate
statistical dependencies between nodes.
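The single-scale topology is easy to enumerate in code. Below is a minimal sketch (our illustration; the patch-grid dimensions are arbitrary) listing the two kinds of dependency links in Fig. 2: each scene patch links to its corresponding image patch and to its 4-connected scene neighbors:

```python
# Enumerate the statistical-dependency links of the single-scale
# network of Fig. 2 for an nrows x ncols grid of patches.
def grid_network_edges(nrows, ncols):
    scene_image_links = []  # (scene (r, c), image (r, c)) pairs
    scene_scene_links = []  # neighboring scene-patch pairs
    for r in range(nrows):
        for c in range(ncols):
            scene_image_links.append(((r, c), (r, c)))
            if r + 1 < nrows:  # neighbor below
                scene_scene_links.append(((r, c), (r + 1, c)))
            if c + 1 < ncols:  # neighbor to the right
                scene_scene_links.append(((r, c), (r, c + 1)))
    return scene_image_links, scene_scene_links

si, ss = grid_network_edges(3, 3)
print(len(si), len(ss))  # 9 scene-image links, 12 scene-scene links
```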
Solving a Markov network involves a learning
phase, where the parameters of the network connec-
tions are learned from training data, and an inference
phase, when the scene corresponding to particular im-
age data is estimated.
For a Markov random field, the joint probability over the scenes x and images y can be written (Besag, 1974; Geman and Geman, 1984; Geiger and Girosi, 1991):

$$P(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N) = \prod_{(i,j)} \Psi(x_i, x_j) \prod_k \Phi(x_k, y_k), \qquad (1)$$

where we have introduced pairwise compatibility functions, $\Psi$ and $\Phi$, which are learned from the training data. $(i, j)$ indicates neighboring nodes $i$, $j$ and $N$ is the number of image and scene nodes.
We can write the MAP and MMSE estimates for $\hat{x}_j$ by marginalizing (MMSE) or taking the maximum (MAP) over the other variables in the posterior probability. For discrete variables, the marginalization involves summations over the discrete values of the scene variables at each node, indicated by the summations below:

$$\hat{x}_{j\,\mathrm{MMSE}} = \sum_{x_j} x_j \sum_{\text{all } x_i,\; i \neq j} P(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N) \qquad (2)$$

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} \max_{\text{all } x_i,\; i \neq j} P(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N). \qquad (3)$$
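At toy scale, Eqs. (2) and (3) can be evaluated by direct enumeration; the sketch below (ours, with invented compatibilities for a three-node chain) does exactly that, and makes plain why the cost, $O(\text{states}^{\text{nodes}})$ terms, is prohibitive for real networks:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_states, n_nodes = 4, 3
# Invented compatibilities for a 3-node chain; observations held fixed.
Psi = [rng.random((n_states, n_states)) for _ in range(n_nodes - 1)]
Phi = [rng.random(n_states) for _ in range(n_nodes)]

def joint(xs):
    """Eq. (1): product of pairwise and local compatibilities."""
    p = 1.0
    for i in range(n_nodes - 1):
        p *= Psi[i][xs[i], xs[i + 1]]
    for k in range(n_nodes):
        p *= Phi[k][xs[k]]
    return p

# Eq. (3): MAP by exhaustive maximization over n_states**n_nodes terms.
x_map = max(itertools.product(range(n_states), repeat=n_nodes), key=joint)

# Eq. (2): MMSE at node j by exhaustive summation (posterior-normalized).
j = 0
marginal = np.zeros(n_states)
for xs in itertools.product(range(n_states), repeat=n_nodes):
    marginal[xs[j]] += joint(xs)
x_mmse_j = np.dot(np.arange(n_states), marginal / marginal.sum())
print(x_map, x_mmse_j)
```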
For networks larger than toy examples, Eqs. (2) and (3) are infeasible to evaluate directly because of the high dimensionality of the scene variables over which $P(x_1, x_2, \ldots, x_N, y_1, y_2, \ldots, y_N)$ must be summed or maximized. When the networks form chains or trees, however, we can evaluate the equations.
2.1. Inference in Networks Without Loops
For networks without loops, the Markov assumption
leads to simple “message-passing” rules for computing
the MAP and MMSE estimates during inference
(Pearl, 1988; Weiss, 1998; Jordan, 1998). The factorized structure of Eq. (1) allows the marginalization and maximization operators of Eqs. (2) and (3) to pass through $\Psi$ and $\Phi$ factors with unrelated arguments. For example, for the network in Fig. 3, substituting Eq. (1) for $P(x, y)$ into Eq. (3) for $\hat{x}_{j\,\mathrm{MAP}}$ at node 1 gives

$$\hat{x}_{1\,\mathrm{MAP}} = \arg\max_{x_1} \max_{x_2} \max_{x_3} P(x_1, x_2, x_3, y_1, y_2, y_3) \qquad (4)$$

$$= \arg\max_{x_1} \max_{x_2} \max_{x_3} \Phi(x_1, y_1)\,\Phi(x_2, y_2)\,\Phi(x_3, y_3)\,\Psi(x_1, x_2)\,\Psi(x_2, x_3) \qquad (5)$$

$$= \arg\max_{x_1} \Phi(x_1, y_1) \max_{x_2} \Psi(x_1, x_2)\,\Phi(x_2, y_2) \max_{x_3} \Psi(x_2, x_3)\,\Phi(x_3, y_3). \qquad (6)$$

Each line of Eq. (6) is a local computation involving only one node and its neighbors. The analogous expressions for $\hat{x}_{2\,\mathrm{MAP}}$ and $\hat{x}_{3\,\mathrm{MAP}}$ also use local calculations. Passing local “messages” between neighbors, as described below, gives an efficient way to compute the MAP estimates.
Assuming a network without loops, Eqs. (3) and (2) can be computed by iterating the following steps (Pearl, 1988; Weiss, 1998; Jordan, 1998). The MAP estimate at node $j$ is

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} \Phi(x_j, y_j) \prod_k M^k_j, \qquad (7)$$

where $k$ runs over all scene node neighbors of node $j$, and $M^k_j$ is the message from node $k$ to node $j$. We calculate $M^k_j$ from:

$$M^k_j = \max_{x_k} \Psi(x_j, x_k)\,\Phi(x_k, y_k) \prod_{l \neq j} \tilde{M}^l_k, \qquad (8)$$

where $\tilde{M}^l_k$ is $M^l_k$ from the previous iteration. The initial $\tilde{M}^k_j$'s are set to column vectors of 1's, of the dimensionality of the variable $x_j$.

Figure 3. Example Markov network without any loops, used for the belief propagation example described in the text. The compatibility functions $\Phi$ and $\Psi$ are defined below.
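In code, each message update of Eq. (8) and the readout of Eq. (7) are a few lines of numpy per node. The sketch below is ours, for discrete states, with the message $M^k_j$ stored as a vector over the states of $x_j$:

```python
import numpy as np

def message_update(Psi_jk, Phi_k, incoming):
    """Eq. (8), max-product message from node k to node j.

    Psi_jk[a, b] = Psi(x_j = a, x_k = b); Phi_k[b] = Phi(x_k = b, y_k);
    incoming = previous-iteration messages M~^l_k for l != j, each a
    vector over the states of x_k.
    """
    prod = Phi_k * (np.prod(incoming, axis=0) if incoming else 1.0)
    return (Psi_jk * prod[None, :]).max(axis=1)   # max over x_k

def map_readout(Phi_j, messages):
    """Eq. (7): MAP state at node j from local evidence and messages."""
    belief = Phi_j * np.prod(messages, axis=0)
    return int(np.argmax(belief))

Psi = np.array([[1.0, 0.2], [0.2, 1.0]])  # favors matching neighbor states
Phi = np.array([0.9, 0.1])                # local evidence prefers state 0
print(message_update(Psi, Phi, []), map_readout(Phi, [np.ones(2)]))
```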
To illustrate how these equations are used, we show how Eq. (7) reduces to Eq. (6) for the example of Fig. 3. First, a note about the compatibility matrices, $\Psi$ and $\Phi$. For a given observed image patch, $y_k$, the image-scene compatibility function, $\Phi(x_k, y_k)$, is a column vector, indexed by the different possible states of $x_k$, the scene at node $k$. The scene-scene compatibility function, $\Psi(x_i, x_j)$, will be a matrix with the different possible states of $x_i$ and $x_j$, the scenes at nodes $i$ and $j$, indexing the rows and columns. Because the initial messages are 1's, at the first iteration, all the messages in the network are:
$$M^2_1 = \max_{x_2} \Psi(x_1, x_2)\,\Phi(x_2, y_2) \qquad (9)$$

$$M^3_2 = \max_{x_3} \Psi(x_2, x_3)\,\Phi(x_3, y_3) \qquad (10)$$

$$M^1_2 = \max_{x_1} \Psi(x_2, x_1)\,\Phi(x_1, y_1) \qquad (11)$$

$$M^2_3 = \max_{x_2} \Psi(x_3, x_2)\,\Phi(x_2, y_2). \qquad (12)$$
The second iteration uses the messages above as the $\tilde{M}$ variables in Eq. (8):

$$M^2_1 = \max_{x_2} \Psi(x_1, x_2)\,\Phi(x_2, y_2)\,\tilde{M}^3_2 \qquad (13)$$

$$M^3_2 = \max_{x_3} \Psi(x_2, x_3)\,\Phi(x_3, y_3) \qquad (14)$$

$$M^2_3 = \max_{x_2} \Psi(x_3, x_2)\,\Phi(x_2, y_2)\,\tilde{M}^1_2 \qquad (15)$$

$$M^1_2 = \max_{x_1} \Psi(x_2, x_1)\,\Phi(x_1, y_1). \qquad (16)$$
Substituting $M^3_2$ of Eq. (10) for $\tilde{M}^3_2$ in Eq. (13) gives

$$M^2_1 = \max_{x_2} \Psi(x_1, x_2)\,\Phi(x_2, y_2) \max_{x_3} \Psi(x_2, x_3)\,\Phi(x_3, y_3). \qquad (17)$$

For this example, the messages don't change in subsequent iterations. We substitute the final messages into Eq. (7) to compute the MAP estimates, for example,

$$\hat{x}_{1\,\mathrm{MAP}} = \arg\max_{x_1} \Phi(x_1, y_1)\,M^2_1. \qquad (18)$$

Substituting Eq. (17), the converged message value for $M^2_1$, in Eq. (18) above gives precisely Eq. (6) for $\hat{x}_{1\,\mathrm{MAP}}$. The exact MAP estimates for $x_2$ and $x_3$ are found analogously.
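As a quick numerical check of this claim (our sketch; the compatibilities are invented), after the two message iterations above, the readout of Eq. (18) matches the brute-force maximization of Eq. (6):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
S = 3                                      # states per node (arbitrary)
Psi12, Psi23 = rng.random((S, S)), rng.random((S, S))
Phi1, Phi2, Phi3 = rng.random(S), rng.random(S), rng.random(S)

M2_1 = (Psi12 * Phi2[None, :]).max(axis=1)             # Eq. (9)
M3_2 = (Psi23 * Phi3[None, :]).max(axis=1)             # Eq. (10)
M2_1 = (Psi12 * (Phi2 * M3_2)[None, :]).max(axis=1)    # Eq. (13), converged

x1_map = int(np.argmax(Phi1 * M2_1))                   # Eq. (18)

def joint(x1, x2, x3):                                 # Eq. (1) for Fig. 3
    return Phi1[x1] * Phi2[x2] * Phi3[x3] * Psi12[x1, x2] * Psi23[x2, x3]

x1_brute = max(product(range(S), repeat=3), key=lambda xs: joint(*xs))[0]
assert x1_map == x1_brute                              # agrees with Eq. (6)
```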
It can be shown (Pearl, 1988; Weiss, 1998; Jordan, 1998) that after at most one global iteration of Eq. (8) for each node in the network, Eq. (7) gives the desired optimal estimate, $\hat{x}_{j\,\mathrm{MAP}}$, at each node $j$.
The MMSE estimate, Eq. (2), has analogous formulae, with the $\max_{x_k}$ of Eq. (8) replaced by $\sum_{x_k}$, and the $\arg\max_{x_j}$ of Eq. (7) replaced by $\sum_{x_j} x_j$. For Markov networks without loops, these propagation rules are equivalent to standard Bayesian inference methods, such as the Kalman filter and the forward-backward algorithm for Hidden Markov Models (Pearl, 1988; Luettgen et al., 1994; Weiss, 1997; Smyth et al., 1997; Frey, 1998; Jordan, 1998).
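The MMSE rules are the same code after two substitutions; a minimal sketch (ours) of the sum-product analogue, with the $\max_{x_k}$ of Eq. (8) replaced by a sum and the readout replaced by a posterior-weighted mean:

```python
import numpy as np

def message_update_sum(Psi_jk, Phi_k, incoming):
    """Sum-product analogue of Eq. (8): sum over the states of x_k."""
    prod = Phi_k * (np.prod(incoming, axis=0) if incoming else 1.0)
    return Psi_jk @ prod

def mmse_readout(states_j, Phi_j, messages):
    """MMSE analogue of Eq. (7): mean of x_j under the local belief."""
    belief = Phi_j * np.prod(messages, axis=0)
    return np.dot(states_j, belief / belief.sum())
```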
A second factorization of the joint probability can also be used instead of Eq. (1), although it is only valid for chains or trees, while Eq. (1) is valid for general Markov networks. This is the chain rule factorization of the joint probability, similar to Pearl (1988). For Fig. 3, using the Markov properties, we can write
$$P(x_1, y_1, x_2, y_2, x_3, y_3) = P(x_1)\,P(y_1 \mid x_1)\,P(x_2 \mid x_1)\,P(y_2 \mid x_2)\,P(x_3 \mid x_2)\,P(y_3 \mid x_3). \qquad (19)$$
Following the same reasoning as in Eqs. (4)–(6), this factorization leads to the following MAP update and estimation rules:

$$M^k_j = \max_{x_k} P(x_k \mid x_j)\,P(y_k \mid x_k) \prod_{l \neq j} \tilde{M}^l_k, \qquad (20)$$

$$\hat{x}_{j\,\mathrm{MAP}} = \arg\max_{x_j} P(x_j)\,P(y_j \mid x_j) \prod_k M^k_j, \qquad (21)$$

where $k$ runs over all scene node neighbors of node $j$.
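In code, the only change from Eq. (8) is swapping the learned compatibilities for the conditionals $P(x_k \mid x_j)$ and $P(y_k \mid x_k)$, plus a prior $P(x_j)$ at readout; a minimal sketch (ours, with the same vector/matrix conventions as before):

```python
import numpy as np

def message_update_chain(P_xk_given_xj, P_yk_given_xk, incoming):
    """Eq. (20); P_xk_given_xj[a, b] = P(x_k = b | x_j = a)."""
    prod = P_yk_given_xk * (np.prod(incoming, axis=0) if incoming else 1.0)
    return (P_xk_given_xj * prod[None, :]).max(axis=1)

def map_readout_chain(prior_xj, P_yj_given_xj, messages):
    """Eq. (21): the prior P(x_j) joins the local evidence."""
    belief = prior_xj * P_yj_given_xj * np.prod(messages, axis=0)
    return int(np.argmax(belief))
```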
While the expression for the joint probability does not generalize to a network with loops, we nonetheless found good results for some problems using these update rules (for Section 5 and much of Section 3).
2.2. Networks with Loops
For a network with loops, Eqs. (2) and (3) do not fac-
tor into local calculations as in Eq. (6). Finding exact
MAP or MMSE values for a Markov network with
loops can be computationally prohibitive. Researchers
have proposed a variety of approximations (Geman
and Geman, 1984; Geiger and Girosi, 1991; Jordan,
1998). Strong empirical results in “Turbo codes”
(Kschischang and Frey, 1998; McEliece et al., 1998),
layered image analysis (Frey, 2000) and recent theo-
retical work (Weiss, 1998; Weiss and Freeman, 1999;
Yedidia et al., 2000) provide support for a very sim-
ple approximation: applying the propagation rules of
Eqs. (8) and (7) even in the network with loops. Table 1
summarizes results from Weiss and Freeman (1999):
(1) for Gaussian processes, the MMSE propagation
scheme can converge only to the true posterior means.
(2) Even for non-Gaussian processes, if the MAP prop-
agation scheme converges, it finds at least a local max-
imum of the true posterior probability. Furthermore,
this condition of local optimality for the converged so-
lution of the MAP algorithm is a strong one. For every
subset of nodes of the network which form a tree, if the
remaining network nodes are constrained to their con-
verged values, the values of the sub-tree’s nodes found
by the MAP algorithm are the global maximum over
that tree’s nodes (Weiss and Freeman, 2000). Yedidia
et al. (2000) show that the MMSE belief propagation
equations are equivalent to the stationarity conditions
for the Bethe approximation to the “free energy” of
the network. These experimental and theoretical re-
sults motivate applying the belief propagation rules of
Eqs. (8) and (7) even in a Markov network with loops. (There is no corresponding theoretical justification for applying Eqs. (20) and (21) in a network with loops; we rely on experiment.)

Table 1. Summary of results from Weiss and Freeman (1999) regarding belief propagation results after convergence.

                                Network topology
Belief propagation algorithm    No loops                        Arbitrary topology
MMSE rules                      MMSE, correct posterior         For Gaussians, correct
                                marginal probs.                 means, wrong covs.
MAP rules                       MAP estimate                    Local max. of posterior,
                                                                even for non-Gaussians.
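Concretely, the approximation is to run the update of Eq. (8) synchronously on the loopy graph until the messages stop changing, then read out Eq. (7). A compact sketch (ours; the four-node loop and random compatibilities are invented, and we normalize messages only for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(2)
S = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]        # a single loop
Psi = {e: rng.random((S, S)) for e in edges}    # Psi(x_i, x_j) per edge
Phi = [rng.random(S) for _ in range(4)]

def psi(i, j):  # symmetric lookup, indexed [x_i, x_j]
    return Psi[(i, j)] if (i, j) in Psi else Psi[(j, i)].T

# One message per directed edge, initialized to all ones (as in the text).
M = {(k, j): np.ones(S) for (a, b) in edges for (k, j) in [(a, b), (b, a)]}
for _ in range(50):                             # iterate Eq. (8)
    M_new = {}
    for (k, j) in M:
        prod = Phi[k].copy()
        for (l, kk) in M:                       # incoming messages to k
            if kk == k and l != j:
                prod *= M[(l, k)]
        msg = (psi(j, k) * prod[None, :]).max(axis=1)
        M_new[(k, j)] = msg / msg.sum()
    M = M_new

x_map = [int(np.argmax(Phi[j] * np.prod(        # Eq. (7) readout
    [m for (k, jj), m in M.items() if jj == j], axis=0)))
    for j in range(4)]
print(x_map)
```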
2.3. Representation
We need to choose a representation for the image and scene variables. The images and scenes are arrays of vector-valued pixels, indicating, for example, color image intensities or surface height and reflectance information. We divide these into patches. For both compression and generalization, we use principal components analysis (PCA) to find a set of lower dimensional basis functions for the patches of image and scene pixels. We measure distances in this representation using a Euclidean norm, unless otherwise stated.
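A minimal sketch of this preprocessing (ours; the patch size, number of components, and random stand-in for training images are placeholders): cut arrays into patches, then project each patch onto its leading principal components:

```python
import numpy as np

def to_patches(image, size):
    """Cut a 2-D array into non-overlapping size x size patches, flattened."""
    h, w = image.shape
    return np.array([image[r:r + size, c:c + size].ravel()
                     for r in range(0, h - size + 1, size)
                     for c in range(0, w - size + 1, size)])

def pca_basis(patches, n_components):
    """Learn a low-dimensional PCA basis from rows of flattened patches."""
    mean = patches.mean(axis=0)
    _, _, Vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, Vt[:n_components]   # principal directions as rows

rng = np.random.default_rng(3)
train = to_patches(rng.random((64, 64)), size=8)  # stand-in training image
mean, basis = pca_basis(train, n_components=10)
coeffs = (train - mean) @ basis.T                 # low-dimensional patch codes
print(coeffs.shape)                               # (64, 10)
```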
We also need to pick a form for the compatibility functions $\Phi(x_j, y_j)$ and $\Psi(x_j, x_k)$ in Eqs. (7) and (8), as well as for the messages, $M^k_j$. One could represent those functions as Gaussian mixtures (Freeman and Pasztor, 1999) over the joint spaces $x_j \times y_j$ and $x_j \times x_k$; however, multiplication of the Gaussian mixtures is cumbersome, requiring repeated pruning to restore the product Gaussian mixtures to a manageable number of Gaussians.
We prefer a discrete representation. The most
straight-forward approach would be to evenly sample
all possible states of each image and scene variable
at each patch. Unfortunately, for reasonably sized
patches, the scene and image variables need to be of a
high enough dimensionality that an evenly-spaced dis-
crete sampling of the entire high dimensional space is
not feasible.
To address that, we evaluate $\Phi(x_j, y_j)$ and $\Psi(x_j, x_k)$ only at a restricted set of discrete points, a subset of our training set. (For other sample-based representations see Isard and Blake (1996) and DeBonet and Viola (1998).) Our final MAP (or MMSE) estimates will be maxima over (or weights on) a subset of training samples. In all our examples, we used the MAP estimate. The estimated scene at each patch will always be some example from the training set.
At each node we collect a set of 10 or 20 “scene can-
didates” from the training data which have image data
closely matching the observation, or local evidence,
at that node. (We think of these as a “line-up of sus-
pects”, as in a police line-up.) We will evaluate proba-
bilities only at those scene values. This simplification
focuses the computational effort on only those scenes
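A sketch of this candidate-selection step (ours; the training-set size, patch dimensionality, and candidate count are placeholders): at one node, keep the training scenes whose paired image patches lie closest to the local observation:

```python
import numpy as np

def scene_candidates(y_obs, train_images, train_scenes, n_candidates=10):
    """Return the scene patches whose paired training image patches best
    match the local observation y_obs, under the Euclidean norm."""
    d = np.linalg.norm(train_images - y_obs[None, :], axis=1)
    idx = np.argsort(d)[:n_candidates]     # the "line-up of suspects"
    return train_scenes[idx]

rng = np.random.default_rng(4)
train_images = rng.random((5000, 20))  # stand-in PCA codes of image patches
train_scenes = rng.random((5000, 20))  # paired scene-patch codes
y_obs = rng.random(20)
print(scene_candidates(y_obs, train_images, train_scenes).shape)  # (10, 20)
```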

References
Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
Burt, P.J. and Adelson, E.H. (1983). The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.