(Open Access) Learning optimized MAP estimates in continuously-valued MRF models (2009) | Kegan G. G. Samuel

Learning Optimized MAP Estimates in Continuously-Valued MRF Models

Kegan G. G. Samuel, Marshall F. Tappen

University of Central Florida

School of Electrical Engineering and Computer Science, Orlando, FL

{kegan, mtappen}@eecs.ucf.edu

Abstract

We present a new approach for the discriminative training of continuous-valued Markov Random Field (MRF) model parameters. In our approach we train the MRF model by optimizing the parameters so that the minimum energy solution of the model is as similar as possible to the ground-truth. This leads to parameters which are directly optimized to increase the quality of the MAP estimates during inference. Our proposed technique allows us to develop a framework that is flexible and intuitively easy to understand and implement, which makes it an attractive alternative to learn the parameters of a continuous-valued MRF model. We demonstrate the effectiveness of our technique by applying it to the problems of image denoising and inpainting using the Field of Experts model. In our experiments, the performance of our system compares favourably to the Field of Experts model trained using contrastive divergence when applied to the denoising and inpainting tasks.

1. Introduction

Recent years have seen the introduction of several new approaches for learning the parameters of continuous-valued Markov Random Field models [4, 12, 16, 21, 10, 13]. Models based on these methods have proven to be particularly useful for expressing image priors in low-level vision systems, such as denoising, and have led to state-of-the-art results for MRF-based systems.

In this paper, we introduce a new approach for discriminatively training MRF parameters. As will be explained in Section 2, we train the MRF model by optimizing its parameters so that the minimum energy solution of the model is as similar as possible to the ground-truth. While previous work has relied on time-consuming iterative approximations [15] or stochastic approximations [10], we show how implicit differentiation can be used to analytically differentiate the overall training loss with respect to the MRF parameters. This leads to an efficient, flexible learning algorithm that can be applied to a number of different models. Using Roth and Black’s Field of Experts model, Section 4 shows how this new learning algorithm leads to improved results over FoE models trained using other methods.

Our approach also has unique benefits for systems based on MAP-inference. While Section 2.1 explores this issue in more detail, our high-level argument is that if one’s goal is to use MAP estimates and evaluate the results using some criterion, then optimizing MRF parameters using a probabilistic criterion, like maximum-likelihood parameter estimation, will not necessarily lead to the optimal system. Instead, the system should be directly optimized so that the MAP solution scores as well as possible.

2. The Basic Learning Model

Similar to the discriminative, max-margin framework proposed by Taskar et al. [17] and the Energy-Based Models [8], we learn the MRF parameters by defining a loss function that measures the similarity between the ground truth t and the minimum energy configuration of the MRF model. Throughout this paper, we will denote this minimum energy configuration as x*.

This discriminative training criterion can be expressed formally as the following minimization:

$$\min_{\theta} \; L(x^*(\theta), t) \quad \text{where} \quad x^*(\theta) = \arg\min_{x} E(x, y; w(\theta)) \tag{1}$$

The MRF model is defined by the energy function E(x, y; w(θ)), where x denotes the current state of the MRF, y denotes the observations, and θ denotes the parameters. The function w(·) transforms θ before it is actually used in the energy function. It could be as simple as w(θ) = θ or could aid in enforcing certain conditions. For instance, it is common to use an exponential function to ensure positive weighting coefficients, such as $w(\theta) = e^{\theta}$ for a scalar θ, as employed in work like [12]. Our primary motivation for introducing w(·) here is to aid in the derivation below.
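The exponential reparameterization can be illustrated with a minimal Python sketch. The function names `w` and `dw_dtheta` are chosen here for illustration only; the second function is the factor ∂w/∂θ that the chain rule needs when differentiating through w(·).

```python
import math

def w(theta: float) -> float:
    """Map an unconstrained parameter to a strictly positive weight."""
    return math.exp(theta)

def dw_dtheta(theta: float) -> float:
    """Derivative of w; for the exponential it equals w(theta) itself."""
    return math.exp(theta)

# Any real theta, including negative values, yields a positive weight.
for theta in (-3.0, 0.0, 2.5):
    assert w(theta) > 0.0
```

Because the optimizer works on θ rather than on the weight itself, no explicit positivity constraint has to be enforced during learning.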

Note that if E(x, y; w(θ)) is non-convex, we can only expect x* to be a local minimum. Practically, we have found that our method is still able to perform well when learning parameters for non-convex energy functions, as we will show in Section 4.

2.1. Advantages of Discriminative Learning

Typically, MRF models have been posed in a probabilistic framework. Likewise, parameter learning has been similarly posed probabilistically, often in a maximum likelihood framework. In many models, computing the partition function and performing inference are intractable, so approximate methods, such as the tree-based bounds of Wainwright et al. [19] or Hinton’s contrastive divergence method [5], must be used.

A key advantage of the discriminative training criterion is that it is not necessary to compute the partition function or expectations over various quantities. Instead, it is only necessary to minimize the energy function. Informally, this can be seen as reducing the most difficult step in training from integrating over the space of all possible images, which would be required to compute the partition function or expectations, to only having to minimize an energy function.

A second advantage of this discriminative training criterion is that it better matches the MAP inference procedure typically used in large, non-convex MRF models. MAP estimates are popular when inference, exact or approximate, is difficult. Unfortunately, parameters that maximize the likelihood may not lead to the highest-quality MAP solution.

The Field of Experts model is a good example of this issue. The parameters in this model were trained in a maximum-likelihood fashion, using the Contrastive Divergence method to compute approximate gradients. Using this model, the best results, with the highest PSNR, are found by terminating the minimization of the negative log-likelihood of the model before reaching a local minimum. As can be seen from Figure 1, if the FoE model trained using contrastive divergence is allowed to reach a local minimum then the performance decreases significantly. This happens because the maximum-likelihood criterion that was used to train the system is related, but not directly tied, to the PSNR of the MAP estimate.

Our argument is that if the intent is to use MAP estimates and evaluate the estimates using some criterion, like PSNR, a better strategy is to find the parameters such that the quality of the MAP estimates is directly optimized. Returning to the example in the previous paragraph, if the goal is to maximize PSNR, maximizing a likelihood and then hoping that it leads to MAP estimates with good PSNR values is not the best strategy. We argue that one should instead choose the parameters such that the MAP estimates have the highest PSNR possible (we wish to reiterate that our method is not tied to the PSNR image quality metric; any differentiable image loss function can be used). As Figure 1 shows, when the FoE model is trained using our proposed approach, the PSNR achieved when the minimization terminates is very close to the maximum PSNR achieved over the course of the minimization. This is important because the quality of the model is not tied to a particular minimization scheme. Using the model trained with our method, different types of optimization method could be used, while maintaining high-quality results.

[Figure 1 plot: "Final PSNR vs Maximum PSNR". Axes: "Maximum PSNR achieved" and "PSNR at a local minimum", ticks from 22 to 36. Legend: "Contrastive Divergence", "Our method".]

Figure 1. This figure shows the difference between training the FoE model using contrastive divergence and our proposed method. At termination, that is at a local minimum, our training method produces results that are very close to the maximum PSNR achieved over the course of the minimization on images with additive Gaussian noise, σ = 15.
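Since PSNR is the evaluation criterion discussed throughout this section, a small Python sketch of the standard definition may be useful. It is not taken from the paper; the function name `psnr` and the default peak value of 255 (8-bit images) are assumptions made here.

```python
import numpy as np

def psnr(estimate, target, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` is the assumed maximum pixel value."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(target, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))
```

PSNR decreases monotonically as the mean squared error grows, so a loss based on pixelwise squared error targets PSNR directly.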

2.2. Related Work

Our approach is most closely related to the Variational Mode Learning approach proposed in [15]. This method works by treating a series of variational optimization steps as a continuous function and differentiating the result with respect to the model parameters. The key advantage of our approach over Variational Mode Learning is that, in Variational Mode Learning, the result of the variational optimization must be recomputed every time the gradient of the loss function is recomputed. In [15], the variational optimization often required 20-30 steps. This translates into 20 to 30 calls to the matrix solver each time a gradient must be recomputed. On the other hand, our method only requires one call to the matrix solver to compute a gradient.

The approach proposed by Roth and Black [12] to learn the parameters of the Field of Experts model uses a sampling strategy and the idea of contrastive divergence [5] to estimate the expectation over the model distribution. Using this estimate they perform gradient ascent on the log-likelihood to update the parameters. This method, however, only computes approximate gradients and there is no guarantee that the contrastive divergence method converges.

In [13], Scharstein and Pal propose another method which uses MAP estimates to compute approximate expectation gradients to learn parameters of a CRF model. This method is similar to that used by Kumar et al. in [7]. However, this approach produces stability issues in models with a large number of parameters. Our method is able to effectively train a model with a large number of parameters.

Another recently proposed approach to learn the parameters of a continuous-state MRF model is to use simultaneous perturbation stochastic approximation (SPSA) to minimize the training loss across a set of ground truth images [10]. When using SPSA it is difficult to determine exact convergence and multiple runs are usually required to reduce the variance of the learnt parameters. Also, SPSA requires various coefficients which have to be ’tweaked’ to get optimal performance. While there are guidelines on how those coefficients can be chosen [14], it is still a matter of trial and error in determining the right values to achieve the best performance.

It should be noted that in [3], Do et al. were able to learn hyper-parameters, rather than parameters, of a CRF model by using a chain-structured model. Using a chain-structured model makes it possible to compute the Hessian of the CRF’s density function, thus enabling the method to learn hyper-parameters. Here, we consider problems where the density makes it impossible to compute this Hessian, as is the case in non-Gaussian models with loops. Because we cannot compute the Hessian, we cannot learn hyper-parameters.

2.3. Using Gradient Minimization

Taskar et al. have shown that for certain types of energy and loss functions, the learning task in Equation 1 can be accomplished with convex optimization algorithms [17, 18]. However, the similarity of Equation 1 to solutions for learning hyper-parameters, such as [3, 1, 6, 2], suggests that it may be possible to optimize the MRF parameters θ using simpler gradient-based algorithms. In the hyper-parameter learning work, the authors were able to use implicit differentiation to compute the gradient of a loss function with respect to hyper-parameters. In this section, we show how the same implicit differentiation technique can be applied to calculate the gradient of L(x*(θ), t) with respect to the parameters θ. This will enable us to use basic steepest-descent techniques to learn the parameters θ.

In the following section, we will begin with a general formulation, then, in later sections, show the derivations for a specific, filter-based MRF image prior.

2.4. Calculating the Gradient with Implicit Differentiation

In this section, we will show how the gradient vector of the loss can be computed with respect to some parameter $\theta_i$ using the implicit differentiation method used in hyper-parameter learning. We will begin by calculating the derivative vector of $x^*(\theta)$ with respect to $\theta_i$. Once we can differentiate $x^*(\theta)$ with respect to θ, a basic application of the chain rule will enable us to compute $\partial L / \partial \theta_i$.

Because $x^*(\theta) = \arg\min_{x} E(x, y; w(\theta))$, we can express the following condition on $x^*$:

$$\left. \frac{\partial E(x, y; w(\theta))}{\partial x} \right|_{x^*(\theta)} = 0 \tag{2}$$

This simply states that the gradient at a minimum must be equal to the zero vector. For clarity in the remaining discussion, we will replace $\partial E(x, y; w(\theta)) / \partial x$ with the function $g(x, w(\theta))$, such that

$$g(x, w(\theta)) \triangleq \frac{\partial E(x, y; w(\theta))}{\partial x}$$

Notice that we have retained θ as a parameter of g(·), though it first passes through the function w(·). We retain θ because it will eventually be treated as a parameter that can be varied, though in Equation 2 it is treated as a constant.

Using this notation, we can restate Equation 2 as

$$g(x^*(\theta), w(\theta)) = 0 \tag{3}$$

Note that g(·) is a vector function of vector inputs.

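We can now differentiate both sides with respect to the parameter $\theta_i$. After applying the chain rule, the derivative of the left side of Equation 3 is

$$\frac{\partial g}{\partial \theta_i} = \frac{\partial g}{\partial x^*} \frac{\partial x^*}{\partial \theta_i} + \frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i} \tag{4}$$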

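Note that if $x$ and $x^*$ are $N \times 1$ vectors, then $\partial g / \partial x^*$ is an $N \times N$ matrix. Using Equation 3, we can now solve for $\partial x^* / \partial \theta_i$:

$$0 = \frac{\partial g}{\partial x^*} \frac{\partial x^*}{\partial \theta_i} + \frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i}$$

$$\frac{\partial x^*}{\partial \theta_i} = -\left( \frac{\partial g}{\partial x^*} \right)^{-1} \frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i} \tag{5}$$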

Note that because $\theta_i$ is a scalar, $\frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i}$ is an $N \times 1$ vector.
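As a sanity check on this implicit-differentiation result, consider a toy scalar energy. It is not from the paper and is chosen here only because its minimizer has a closed form, with $w(\theta) = e^{\theta}$ as in the earlier example of a positive weight.

```latex
\begin{align*}
E(x, y; w(\theta)) &= w(\theta)\,x^2 + (x - y)^2, \qquad w(\theta) = e^{\theta} \\
g(x, w) &= 2wx + 2(x - y) = 0 \;\Rightarrow\; x^*(\theta) = \frac{y}{1 + e^{\theta}} \\
\frac{\partial g}{\partial x^*} &= 2(1 + e^{\theta}), \qquad
\frac{\partial g}{\partial w} = 2x^*, \qquad
\frac{\partial w}{\partial \theta} = e^{\theta} \\
\frac{\partial x^*}{\partial \theta} &= -\frac{2 x^* e^{\theta}}{2(1 + e^{\theta})}
  = -\frac{y\,e^{\theta}}{(1 + e^{\theta})^2}
\end{align*}
```

Differentiating the closed form $x^*(\theta) = y / (1 + e^{\theta})$ directly gives the same expression, as expected.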

The matrix $\partial g / \partial x^*$ is easily computed by noticing that $g(x^*, w(\theta))$ is the gradient of $E(\cdot)$, with each term from $x$ replaced with a term from $x^*$. This makes the $\partial g / \partial x^*$ term in the above equations just the Hessian matrix of $E(x)$ evaluated at $x^*$.

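Denoting the Hessian of $E(x)$ evaluated at $x^*$ as $H_E(x^*)$ and applying the chain rule leads to the derivative of the overall loss function with respect to $\theta_i$:

$$\frac{\partial L(x^*(\theta), t)}{\partial \theta_i} = -\left( \frac{\partial L(x^*(\theta), t)}{\partial x^*} \right)^{T} H_E(x^*)^{-1} \frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i} \tag{6}$$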

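Previous authors have pointed out two important points regarding Equation 6 [3, 16]. First, the Hessian does not need to be inverted. Instead, only the value $\left( \frac{\partial L(x^*(\theta), t)}{\partial x^*} \right)^{T} H_E(x^*)^{-1}$ needs to be computed. This can be accomplished efficiently in a number of ways, including iterative methods like conjugate-gradients. Second, by computing $\left( \frac{\partial L(x^*(\theta), t)}{\partial x^*} \right)^{T} H_E(x^*)^{-1}$ rather than $H_E(x^*)^{-1} \frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i}$, only one call to the solver is necessary to compute the gradient for all parameters in the vector θ, because the first product does not depend on the index $i$.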
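The single-solve idea can be sketched in Python with NumPy. The quadratic energy below is an assumption made for illustration (it is not the model used in the paper): it has a closed-form minimizer, which makes the implicit gradient easy to check against finite differences. All function names are hypothetical.

```python
import numpy as np

def diff_matrix(n, order):
    """Finite-difference operator of the given order on a length-n signal."""
    D = np.eye(n)
    for _ in range(order):
        D = D[1:] - D[:-1]
    return D

def solve_map(theta, y, filters):
    """Minimize E(x) = sum_k exp(theta_k) ||F_k x||^2 + ||x - y||^2 in closed form."""
    A = np.eye(y.size) + sum(np.exp(th) * F.T @ F for th, F in zip(theta, filters))
    return np.linalg.solve(A, y)

def loss(x, t):
    return float(np.sum((x - t) ** 2))

def loss_gradient(theta, y, t, filters):
    """Gradient of the loss w.r.t. every theta_k using one linear solve."""
    w = np.exp(theta)
    x_star = solve_map(theta, y, filters)
    # Hessian of E at x*, i.e. dg/dx*.
    H = 2.0 * np.eye(y.size) + 2.0 * sum(wk * F.T @ F for wk, F in zip(w, filters))
    dL_dx = 2.0 * (x_star - t)
    # H is symmetric, so v^T equals (dL/dx*)^T H^{-1}; this is the only solve.
    v = np.linalg.solve(H, dL_dx)
    grad = np.empty_like(theta)
    for k, F in enumerate(filters):
        dg_dw = 2.0 * F.T @ (F @ x_star)   # dg/dw_k
        grad[k] = -(v @ dg_dw) * w[k]      # dw_k/dtheta_k = exp(theta_k)
    return grad

# Compare against central finite differences on a small synthetic signal.
rng = np.random.default_rng(0)
n = 20
t = np.sin(np.linspace(0.0, 3.0, n))
y = t + 0.1 * rng.standard_normal(n)
filters = [diff_matrix(n, 1), diff_matrix(n, 2)]
theta = np.array([0.3, -0.5])
analytic = loss_gradient(theta, y, t, filters)
eps = 1e-6
numeric = np.array([
    (loss(solve_map(theta + eps * e, y, filters), t)
     - loss(solve_map(theta - eps * e, y, filters), t)) / (2 * eps)
    for e in np.eye(theta.size)
])
assert np.allclose(analytic, numeric, atol=1e-6)
```

The loop over parameters only performs matrix-vector products; the cost of the linear solve is paid once regardless of how many entries θ has.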

2.5. Overall Steps for Computing Gradient

This formulation provides us with a convenient framework to learn the parameters θ using basic optimization routines. The steps are:

1. Compute $x^*(\theta)$ for the current parameter vector θ. In our work, this is accomplished using non-linear conjugate gradient optimization.

2. Compute the Hessian matrix at $x^*(\theta)$, $H_E(x^*)$. Also compute the training loss, $L(x^*(\theta), t)$, and its gradient with respect to $x^*$.

3. Compute the gradient of $L(\cdot)$ using Equation 6. As described above, performing the computations in the correct order can lead to significant gains in computational efficiency.
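The three steps above can be sketched as a plain steepest-descent loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy energy E(x) = exp(θ)‖x‖² + ‖x − y‖² has a closed-form minimizer, so step 1 does not need non-linear conjugate gradients here, and the step size and iteration count are arbitrary choices.

```python
import numpy as np

def map_estimate(theta, y):
    # Step 1: minimize E(x) = exp(theta) * ||x||^2 + ||x - y||^2 (closed form here).
    return y / (1.0 + np.exp(theta))

def loss_and_gradient(theta, y, t):
    w = np.exp(theta)
    x_star = map_estimate(theta, y)
    # Step 2: Hessian of E at x*, training loss, and dL/dx*.
    H = 2.0 * (1.0 + w) * np.eye(y.size)
    L = float(np.sum((x_star - t) ** 2))
    dL_dx = 2.0 * (x_star - t)
    # Step 3: Equation 6, solving against the Hessian first.
    v = np.linalg.solve(H, dL_dx)
    dg_dw = 2.0 * x_star
    dw_dtheta = w
    return L, float(-(v @ dg_dw) * dw_dtheta)

def train(theta, y, t, lr=0.2, steps=200):
    """Steepest descent on the training loss with respect to theta."""
    for _ in range(steps):
        _, grad = loss_and_gradient(theta, y, t)
        theta -= lr * grad
    return theta

# The target is exactly y / 2, so the best weight is exp(theta) = 1.
y = np.array([1.0, 2.0, 3.0, 4.0])
t = 0.5 * y
theta_hat = train(1.0, y, t)
```

In this toy setting the loop drives θ towards 0, the value at which the minimum energy solution coincides with the ground truth.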

3. Application in Denoising Images

Having outlined our general approach, we now apply this approach to learning image priors. In this section, we will describe how this approach can be used to train a model similar to the Field of Experts model. As we will show in Section 4, training using our method leads to a denoising model that performs quite well.

3.1. Background: Field of Experts Model

Recent work has shown that image priors created from the combination of image filters and robust penalty functions are very effective for low-level vision tasks like denoising, in-painting, and de-blurring [9, 12, 15]. The Field of Experts model is defined by a set of linear filters, $f_1, \ldots, f_{N_f}$, and their associated weights $\alpha_1, \ldots, \alpha_{N_f}$. The Lorentzian penalty function, $\rho(x) = \log(1 + x^2)$, is used to define the clique potentials. This leads to a probability density function over an image, $x$, defined as:

$$p(x) = \frac{1}{Z} \exp\left( -\sum_{i=1}^{N_f} \alpha_i \sum_{p=1}^{N_p} \rho(x_p * f_i) \right) \tag{7}$$

where $N_p$ is the number of pixels, $N_f$ is the number of filters and $(x_p * f_i)$ denotes the result of convolving the patch at pixel $p$ in image $x$ with the filter $f_i$.

When applied to denoising images, the probability density in Eq. 7 is used as a prior and combined with a Gaussian likelihood function to form a posterior probability distribution of an image given by:

$$p(x \mid y) = \frac{1}{Z} \exp\left( -\sum_{i=1}^{N_f} \alpha_i \sum_{p=1}^{N_p} \rho(x_p * f_i) - \frac{1}{2\sigma^2} \sum_{p=1}^{N_p} (x_p - y_p)^2 \right) \tag{8}$$

where σ is the standard deviation of the Gaussian noise and y is a noisy observation of x.

3.2. Energy and Loss Function Formulation

As stated in Section 2, we learn the MRF parameters by defining a loss function that measures the similarity between the ground truth t and the minimum energy configuration of the MRF model. We use the negative log-posterior given in Equation 8 to form our energy function as:

$$E(x, y; \alpha, \beta) = \sum_{i=1}^{N_f} e^{\alpha_i} \sum_{p=1}^{N_p} \log\left( 1 + (F_i x)_p^2 \right) + \sum_{p=1}^{N_p} (x_p - y_p)^2 \tag{9}$$

Here we have used multiplication with a doubly-Toeplitz matrix $F_i$ to denote convolution with a filter $f_i$. Each filter, $F_i$, is formed from a linear combination of a set of basis filters $B_1, \ldots, B_{N_B}$. The parameters β determine the coefficients of the basis filters, that is, $F_i$ is defined as

$$F_i = \sum_{j=1}^{N_B} \beta_{ij} B_j \tag{10}$$

This formulation allows us to learn the filters, $F_1, \ldots, F_{N_f}$, via the parameters β as well as their respective weights via α.

In the following section we assume the loss function $L(x^*(\theta), t)$ to be the pixelwise squared error between $x^*$ and $t$. That is,

$$L(x^*(\theta), t) = \sum_{p=1}^{N_p} (x^*_p - t_p)^2 \tag{11}$$

where we have grouped the parameters α and β into a single vector θ. However, our proposed formulation is flexible enough to allow the user to choose a loss function that matches their notion of image quality.
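The energy and loss of this section can be written down compactly in Python. The sketch below makes simplifying assumptions that are not in the paper: signals are 1-D rather than images, so the doubly-Toeplitz matrices become ordinary Toeplitz matrices, and filters are applied as a sliding dot product over the valid region. The helper names are hypothetical.

```python
import numpy as np

def conv_matrix(kernel, n):
    """Toeplitz matrix that slides a 1-D filter over a length-n signal."""
    k = len(kernel)
    M = np.zeros((n - k + 1, n))
    for r in range(n - k + 1):
        M[r, r:r + k] = kernel
    return M

def build_filters(beta, basis):
    """Each filter matrix is a linear combination of the basis: sum_j beta[i, j] * B_j."""
    return [sum(b * B for b, B in zip(row, basis)) for row in beta]

def energy(x, y, alpha, beta, basis):
    """Lorentzian prior weighted by exp(alpha_i), plus a squared data term."""
    prior = sum(
        np.exp(a) * np.sum(np.log1p((F @ x) ** 2))
        for a, F in zip(alpha, build_filters(beta, basis))
    )
    return float(prior + np.sum((x - y) ** 2))

def loss(x_star, t):
    """Pixelwise squared error between the MAP estimate and the ground truth."""
    return float(np.sum((x_star - t) ** 2))
```

For example, with a single difference filter, a constant signal has zero prior energy, so its total energy reduces to the squared distance from the observation.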

3.3. Calculating Gradients in the FoE Model

For clarity, we will describe how to compute the required derivatives in the denoising formulation by considering a model with one filter. In this case the gradient of the energy function $E(x, y; \alpha, \beta)$ can then be written as

$$\frac{\partial E(x, y; \alpha, \beta)}{\partial x} = F^T \rho'(u) + 2(x - y) \tag{12}$$

where $\rho'(u)$ is a function that is applied elementwise to the vector $u = Fx$ and defined as

$$\rho'(z) = \exp(\alpha) \frac{2z}{1 + z^2} \tag{13}$$
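The single-filter energy gradient can be checked numerically. The Python sketch below assumes a 1-D signal and a first-difference filter matrix, both chosen here for brevity, and compares the analytic gradient with central finite differences of the energy.

```python
import numpy as np

def energy(x, y, alpha, F):
    """Single-filter energy: exp(alpha) * sum log(1 + (F x)^2) + ||x - y||^2."""
    return float(np.exp(alpha) * np.sum(np.log1p((F @ x) ** 2)) + np.sum((x - y) ** 2))

def rho_prime(z, alpha):
    """Derivative of exp(alpha) * log(1 + z^2), applied elementwise."""
    return np.exp(alpha) * 2.0 * z / (1.0 + z ** 2)

def energy_gradient(x, y, alpha, F):
    """Analytic gradient: F^T rho'(F x) + 2 (x - y)."""
    return F.T @ rho_prime(F @ x, alpha) + 2.0 * (x - y)

def numerical_gradient(x, y, alpha, F, eps=1e-6):
    """Central finite differences of the energy, one coordinate at a time."""
    grad = np.zeros_like(x)
    for p in range(x.size):
        step = np.zeros_like(x)
        step[p] = eps
        grad[p] = (energy(x + step, y, alpha, F) - energy(x - step, y, alpha, F)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n = 8
F = np.eye(n)[1:] - np.eye(n)[:-1]   # first-difference filter as a matrix
x = rng.standard_normal(n)
y = rng.standard_normal(n)
assert np.allclose(energy_gradient(x, y, 0.5, F), numerical_gradient(x, y, 0.5, F), atol=1e-6)
```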

Following the steps in Section 2.4 we can write the derivative of $x^*$ w.r.t. the parameter $\beta_j$ as the vector given by:

$$\frac{\partial x^*}{\partial \beta_j} = -(F^T W F + I)^{-1} \left( B_j^T \rho'(u^*) + F^T \rho''(u^*) B_j x^* \right) \tag{14}$$

where $W$ is an $N_p \times N_p$ diagonal matrix whose $[W]_{i,i}$ entry is determined by applying the function

$$\rho''(z) = \exp(\alpha) \frac{2 - 2z^2}{(1 + z^2)^2} \tag{15}$$

elementwise to the vector $u^* = F x^*$.

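This leads to the overall expression for computing the derivative of the loss function with respect to $\beta_j$:

$$\frac{\partial L(x^*(\theta), t)}{\partial \beta_j} = -(x^* - t)^T (F^T W F + I)^{-1} C_{\beta} \tag{16}$$

where $C_{\beta} = \left( B_j^T \rho'(u^*) + F^T \rho''(u^*) B_j x^* \right)$.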

In a similar fashion, we get the expression for computing the derivative of the loss function with respect to α:

$$\frac{\partial L(x^*(\theta), t)}{\partial \alpha} = -(x^* - t)^T (F^T W F + I)^{-1} C_{\alpha} \tag{17}$$

where $C_{\alpha} = F^T \rho'(u^*)$.

The above equations can be easily extended for multiple filters $f_1, \ldots, f_{N_f}$ and their corresponding doubly-Toeplitz matrices $F_1, \ldots, F_{N_f}$.

Now that we have the necessary information to compute the required gradients (Equations 16 and 17) we can follow the steps outlined in Section 2.5 to learn the parameters of the FoE model.
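A compact Python sketch of the single-filter gradients with respect to α and β follows, under several assumptions that are not from the paper: signals are 1-D, the MAP estimate is found with Newton iterations instead of non-linear conjugate gradients, and α is small enough that the toy energy is convex. Constant factors are kept explicit, so the Hessian of the energy whose gradient is Equation 12 appears as F^T W F + 2I and the loss gradient as 2(x* − t). All names are hypothetical.

```python
import numpy as np

def conv_matrix(kernel, n):
    """Toeplitz matrix that slides a 1-D filter over a length-n signal."""
    k = len(kernel)
    M = np.zeros((n - k + 1, n))
    for r in range(n - k + 1):
        M[r, r:r + k] = kernel
    return M

def rho_p(z, alpha):
    """First derivative of the weighted Lorentzian penalty."""
    return np.exp(alpha) * 2.0 * z / (1.0 + z ** 2)

def rho_pp(z, alpha):
    """Second derivative of the weighted Lorentzian penalty."""
    return np.exp(alpha) * (2.0 - 2.0 * z ** 2) / (1.0 + z ** 2) ** 2

def hessian(x, alpha, F):
    """Hessian of the single-filter energy: F^T W F + 2 I."""
    W = rho_pp(F @ x, alpha)
    return F.T @ (W[:, None] * F) + 2.0 * np.eye(x.size)

def map_estimate(alpha, beta, basis, y, iters=50):
    """Newton iterations on the energy gradient (assumes a small alpha)."""
    F = sum(b * B for b, B in zip(beta, basis))
    x = y.copy()
    for _ in range(iters):
        g = F.T @ rho_p(F @ x, alpha) + 2.0 * (x - y)
        x = x - np.linalg.solve(hessian(x, alpha, F), g)
    return x, F

def loss(x, t):
    return float(np.sum((x - t) ** 2))

def loss_gradients(alpha, beta, basis, y, t):
    x, F = map_estimate(alpha, beta, basis, y)
    u = F @ x
    # One linear solve, shared by the alpha and beta gradients.
    v = np.linalg.solve(hessian(x, alpha, F), 2.0 * (x - t))
    grad_alpha = -(v @ (F.T @ rho_p(u, alpha)))
    grad_beta = np.array([
        -(v @ (B.T @ rho_p(u, alpha) + F.T @ (rho_pp(u, alpha) * (B @ x))))
        for B in basis
    ])
    return float(grad_alpha), grad_beta
```

Only the final vector multiplying the solved system differs between the α and β gradients, so the Hessian is assembled and solved against once per gradient evaluation.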

3.4. Computational Issues

An important practical difference between this approach for learning MRF parameters and previous approaches for learning hyper-parameters is that the matrices $F_1, \ldots, F_{N_f}$ are sparse. We found that we were able to train on relatively large 51 × 51 image patches without having to use iterative solvers.

Another feature we can exploit is the fact that Equations 16 and 17 are identical except for the last C term. This means that we only have to compute the Hessian once per gradient calculation.

3.5. Potential Problems

Because the energy function has multiple local minima, it is likely that re-optimizing E(·) from an arbitrary initial condition will find a different minimum than the values of x* used during training. Despite this, the model trained with this method still performs very well, as we will show in the following section. Training on many images prevents the system from trying to exploit specific minima that may be unique to one image. Instead, the system is attempting to find parameters that lead to a good solution across the training set. Utilizing a large enough training set will prevent the system from overfitting to accommodate a certain minimum energy configuration for a particular image.

4. Experiments and Results

We conducted our experiments using the images from the Berkeley segmentation database [11] identified by Roth and Black in their experiments. We used the same 40 images for training and 68 images for testing. Since our approach allowed us to train on larger patches than those used by Roth and Black, the results reported in this paper are all from training using four randomly selected 51 × 51 patches from each training image, giving us a total of 160 training patches. We learnt 24 filters with dimension 5 × 5 pixels using the same basis as Roth and Black. Training was also done using 2000 randomly selected 15 × 15 patches, with slightly lower results.

In order to compare our results, we denoised the test images using the Field of Experts (FoE) implementation provided by Roth and Black. The parameters in this implementation were learnt using the contrastive divergence method mentioned in Section 2.2, and the implementation uses gradient ascent to denoise images. For convenience, we will refer to this system as FoE-GA (for gradient ascent). We ran those experiments for 2500 iterations per image as suggested by Roth and Black using a step size η = 0.6. We also implemented a denoising system using the same parameters from the FoE-GA system but using conjugate gradient descent instead of gradient ascent. We will refer to this system as FoE-CG (for conjugate gradient); its results are reported when the system converges at a local minimum.

As shown in Figure 2, we achieve similar and at times better performance at convergence when compared to FoE-GA across the test images when our MAP inference is allowed to terminate at a local minimum. We wish to reiterate that FoE-GA terminates after a fixed number of iterations which is chosen to maximize performance and not when the system reaches a local minimum. When compared to FoE-CG our performance is noticeably better. This shows a significant and important difference using our proposed training method: if the minimization is allowed to terminate at a local minimum then our performance is much better. Convergence is also achieved much faster using our training method as shown in Table 1. We used the same stopping criteria for our testing results and FoE-CG.

We also computed the perceptually based SSIM index [20] to measure the denoising results. Table 2 gives the average PSNR and SSIM index computed from the denoised images. Figures 3 and 4 show examples in which our results are better in terms of PSNR and SSIM. Visually, the texture is preserved better in our denoised images.
