Learning Optimized MAP Estimates in Continuously-Valued MRF Models
Kegan G. G. Samuel, Marshall F. Tappen
University of Central Florida
School of Electrical Engineering and Computer Science, Orlando, FL
{kegan, mtappen}@eecs.ucf.edu
Abstract
We present a new approach for the discriminative train-
ing of continuous-valued Markov Random Field (MRF)
model parameters. In our approach we train the MRF
model by optimizing the parameters so that the minimum
energy solution of the model is as similar as possible to the
ground-truth. This leads to parameters which are directly
optimized to increase the quality of the MAP estimates dur-
ing inference. Our proposed technique allows us to develop
a framework that is flexible and intuitively easy to under-
stand and implement, which makes it an attractive alter-
native to learn the parameters of a continuous-valued MRF
model. We demonstrate the effectiveness of our technique by
applying it to the problems of image denoising and inpaint-
ing using the Field of Experts model. In our experiments,
the performance of our system compares favourably to the
Field of Experts model trained using contrastive divergence
when applied to the denoising and inpainting tasks.
1. Introduction
Recent years have seen the introduction of several new
approaches for learning the parameters of continuous-
valued Markov Random Field models [4, 12, 16, 21, 10, 13].
Models based on these methods have proven to be particularly
useful for expressing image priors in low-level vision
systems, such as denoising, and have led to state-of-the-art
results for MRF-based systems.
In this paper, we introduce a new approach for discrimi-
natively training MRF parameters. As will be explained in
Section 2, we train the MRF model by optimizing its pa-
rameters so that the minimum energy solution of the model
is as similar as possible to the ground-truth. While previ-
ous work has relied on time-consuming iterative approxi-
mations [15] or stochastic approximations [10], we show
how implicit differentiation can be used to analytically differentiate
the overall training loss with respect to the MRF
parameters. This leads to an efficient, flexible learning al-
gorithm that can be applied to a number of different models.
Using Roth and Black’s Field of Experts model, Section 4
shows how this new learning algorithm leads to improved
results over FoE models trained using other methods.
Our approach also has unique benefits for systems based
on MAP-inference. While Section 2.1 explores this issue in
more detail, our high-level argument is that if one’s goal is
to use MAP estimates and evaluate the results using some
criterion, then optimizing MRF parameters using a probabilistic
criterion, like maximum-likelihood parameter estimation,
will not necessarily lead to the optimal system. Instead,
the system should be directly optimized so that the
MAP solution scores as well as possible.
2. The Basic Learning Model
Similar to the discriminative, max-margin framework
proposed by Taskar et al. [17] and the Energy-Based Mod-
els [8], we learn the MRF parameters by defining a loss
function that measures the similarity between the ground
truth t and the minimum energy configuration of the MRF
model. Throughout this paper, we will denote this minimum
energy configuration as x*.
This discriminative training criterion can be expressed
formally as the following minimization:
\[
\min_{\theta} \; L(x^*(\theta), t)
\quad \text{where} \quad
x^*(\theta) = \arg\min_{x} E(x, y; w(\theta))
\tag{1}
\]
The MRF model is defined by the energy function
E(x, y; w(θ)), where x denotes the current state of the MRF,
y denotes the observations, and θ denotes the parameters. The
function w(·) transforms θ before it is
actually used in the energy function. It could be as simple as
w(θ) = θ or could aid in enforcing certain conditions. For
instance, it is common to use an exponential function to ensure
positive weighting coefficients, such as $w(\theta) = e^{\theta}$ for
a scalar θ, as employed in work like [12]. Our primary motivation
for introducing w(·) here is to aid in the derivation
below.
Note that if E(x, y; w(θ)) is non-convex, we can only ex-
pect x* to be a local minimum. Practically, we have found
that our method is still able to perform well when learn-
ing parameters for non-convex energy functions, as we will
show in Section 4.
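To make the nested structure of Equation 1 concrete, the following is a minimal sketch on a toy problem. The quadratic energy, the first-difference smoothness term, and the names `solve_map` and `training_loss` are assumptions made for illustration and do not come from the paper; the energy is chosen only because its minimizer can be found with a single linear solve.

```python
import numpy as np

def solve_map(y, theta):
    """Return x*(theta) = argmin_x E(x, y; w(theta)) for a toy quadratic energy.

    Hypothetical energy (not from the paper):
        E(x, y; w) = sum_p (x_p - y_p)^2 + w * sum_p (x_{p+1} - x_p)^2,
    with w(theta) = exp(theta) to keep the smoothness weight positive.
    """
    n = y.size
    D = np.diff(np.eye(n), axis=0)          # (n-1) x n first-difference matrix
    w = np.exp(theta)
    # Setting the gradient 2(x - y) + 2 w D^T D x to zero gives a linear system.
    return np.linalg.solve(np.eye(n) + w * D.T @ D, y)

def training_loss(theta, y, t):
    """L(x*(theta), t): squared error between the MAP estimate and ground truth."""
    x_star = solve_map(y, theta)
    return float(np.sum((x_star - t) ** 2))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)               # smooth ground-truth signal
y = t + 0.1 * rng.standard_normal(t.size)   # noisy observation

# The outer problem searches over theta; here we only evaluate a few values.
for theta in (-4.0, 0.0, 4.0):
    print(theta, training_loss(theta, y, t))
```

The outer minimization over θ only ever sees the inner problem through x*(θ), which is why the remainder of the section concentrates on differentiating x*(θ).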

2.1. Advantages of Discriminative Learning
Typically, MRF models have been posed in a probabilistic
framework. Likewise, parameter learning has been
posed probabilistically, often in a maximum-likelihood
framework. In many models, computing the partition
function and performing inference are intractable, so approximate
methods, such as the tree-based bounds of Wainwright
et al. [19] or Hinton’s contrastive divergence method
[5], must be used.
A key advantage of the discriminative training criterion
is that it is not necessary to compute the partition func-
tion or expectations over various quantities. Instead, it is
only necessary to minimize the energy function. Informally,
this can be seen as reducing the most difficult step in train-
ing from integrating over the space of all possible images,
which would be required to compute the partition function
or expectations, to only having to minimize an energy func-
tion.
A second advantage of this discriminative training crite-
rion is that it better matches the MAP inference procedure
typically used in large, non-convex MRF models. MAP es-
timates are popular when inference, exact or approximate, is
difficult. Unfortunately, parameters that maximize the like-
lihood may not lead to the highest-quality MAP solution.
The Field of Experts model is a good example of this
issue. The parameters in this model were trained in a
maximum-likelihood fashion, using the Contrastive Diver-
gence method to compute approximate gradients. Using
this model, the best results, with the highest PSNR, are
found by terminating the minimization of the negative log-
likelihood of the model before reaching a local minimum.
As can be seen from Figure 1, if the FoE model trained using
contrastive divergence is allowed to reach a local minimum,
then the performance decreases significantly. This
happens because the maximum-likelihood criterion that was
used to train the system is related to, but not directly tied to,
the PSNR of the MAP estimate.
Our argument is that if the intent is to use MAP estimates
and evaluate the estimates using some criterion, like
PSNR, a better strategy is to find the parameters such that
the quality of the MAP estimates is directly optimized. Returning
to the example in the previous paragraph, if the goal
is to maximize PSNR, maximizing a likelihood and then
hoping that it leads to MAP estimates with good PSNR values
is not the best strategy. We argue that one should instead
choose the parameters such that the MAP estimates have the
highest PSNR possible¹. As Figure 1 shows, when the FoE
model is trained using our proposed approach, the PSNR
achieved when the minimization terminates is very close to
the maximum PSNR achieved over the course of the minimization.
This is important because the quality of the model
is not tied to a particular minimization scheme. Using the
model trained with our method, different types of optimization
methods could be used while maintaining high-quality
results.
¹ We wish to reiterate that our method is not tied to the PSNR image
quality metric; any differentiable image loss function can be used.
[Figure 1: plot titled "Final PSNR vs Maximum PSNR". Axes: PSNR at a local minimum; maximum PSNR achieved (both from 22 to 36). Legend: Contrastive Divergence; Our method.]
Figure 1. This figure shows the difference between training the
FoE model using contrastive divergence and our proposed method.
At termination, that is, at a local minimum, our training method
produces results that are very close to the maximum PSNR achieved
over the course of the minimization on images with additive Gaussian
noise, σ = 15.
2.2. Related Work
Our approach is most closely related to the Variational
Mode Learning approach proposed in [15]. This method
works by treating a series of variational optimization steps
as a continuous function and differentiating the result with
respect to the model parameters. The key advantage of our
approach over Variational Mode Learning is that, in Variational
Mode Learning, the result of the variational optimization must be
recomputed every time the gradient of the loss function is recomputed. In [15],
the variational optimization often required 20-30 steps. This
translates into 20 to 30 calls to the matrix solver each time
a gradient must be recomputed. On the other hand, our
method only requires one call to the matrix solver to com-
pute a gradient.
The approach proposed by Roth and Black [12] to learn
the parameters of the Field of Experts model uses a sam-
pling strategy and the idea of contrastive divergence [5] to
estimate the expectation over the model distribution. Us-
ing this estimate they perform gradient ascent on the log-
likelihood to update the parameters. This method, however,
only computes approximate gradients and there is no guar-
antee that the contrastive divergence method converges.
In [13], Scharstein and Pal propose another method
which uses MAP estimates to compute approximate expectation
gradients to learn parameters of a CRF model. However,
this approach produces stability issues with models
with a large number of parameters. Our method is able to
effectively train a model with a large number of parameters.
The method of [13] is similar to that used by Kumar et al. in [7].
Another recently proposed approach to learn the parame-
ters of a continuous-state MRF model is using simultaneous
perturbation stochastic approximation (SPSA) to minimize
the training loss across a set of ground truth images [10].
When using SPSA it is difficult to determine exact conver-
gence and multiple runs are usually required to reduce the
variance of the learnt parameters. Also, SPSA requires var-
ious coefficients which have to be ’tweaked’ to get optimal
performance. While there are guidelines on how those co-
efficients can be chosen [14], it is still a matter of trial and
error in determining the right values to achieve the best per-
formance.
It should be noted that in [3], Do et al. were able to learn
hyper-parameters, rather than parameters, of a CRF model by
using a chain-structured model. Using a chain-structured
model makes it possible to compute the Hessian of the
CRF’s density function, thus enabling the method to learn
hyper-parameters. Here, we consider problems where the
density makes it impossible to compute this Hessian, as
is the case in non-Gaussian models with loops. Because
we cannot compute the Hessian, we cannot learn hyper-
parameters.
2.3. Using Gradient Minimization
Taskar et al. have shown that for certain types of energy
and loss functions, the learning task in Equation 1 can be ac-
complished with convex optimization algorithms [17, 18].
However, the similarity of Equation 1 to solutions for learn-
ing hyper-parameters, such as [3, 1, 6, 2], suggests that it
may be possible to optimize the MRF parameters θ using
simpler gradient-based algorithms. In the hyper-parameter
learning work, the authors were able to use implicit differ-
entiation to compute the gradient of a loss function with
respect to hyper-parameters. In this section, we show how
the same implicit differentiation technique can be applied to
calculate the gradient of L(x*(θ), t) with respect to the pa-
rameters θ. This will enable us to use basic steepest-descent
techniques to learn the parameters θ.
In the following section, we will begin with a general
formulation, then, in later sections, show the derivations for
a specific, filter-based MRF image prior.
2.4. Calculating the Gradient with Implicit Differentiation
In this section, we will show how the gradient vector of
the loss can be computed with respect to some parameter
$\theta_i$ using the implicit differentiation method used in hyper-parameter
learning. We will begin by calculating the derivative
vector of x*(θ) with respect to $\theta_i$. Once we can differentiate
x*(θ) with respect to θ, a basic application of the
chain rule will enable us to compute $\partial L / \partial \theta_i$.
Because $x^*(\theta) = \arg\min_x E(x, y; w(\theta))$, we can express
the following condition on x*:
\[
\left. \frac{\partial E(x, y; w(\theta))}{\partial x} \right|_{x^*(\theta)} = 0
\tag{2}
\]
This simply states that the gradient at a minimum must
be equal to the zero vector. For clarity in the remaining
discussion, we will replace $\partial E(x, y; w(\theta)) / \partial x$ with the function
g(x, w(θ)), such that
\[
g(x, w(\theta)) \triangleq \frac{\partial E(x, y; w(\theta))}{\partial x}
\]
Notice that we have retained θ as a parameter of g(·),
though it first passes through the function w(·). We retain
θ because it will eventually be treated as a parameter
that can be varied, though in Equation 2 it is treated as a
constant.
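Using this notation, we can restate Equation 2 as
\[
g(x^*(\theta), w(\theta)) = 0
\tag{3}
\]
Note that g(·) is a vector function of vector inputs.
We can now differentiate both sides with respect to the
parameter $\theta_i$. After applying the chain rule, the derivative
of the left side of Equation 3 is
\[
\frac{\partial g}{\partial \theta_i}
= \frac{\partial g}{\partial x^*} \frac{\partial x^*}{\partial \theta_i}
+ \frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i}
\tag{4}
\]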
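Note that if x and x* are N × 1 vectors, then $\partial g / \partial x^*$ is an
N × N matrix. Using Equation 3, we can now solve for
$\partial x^* / \partial \theta_i$:
\[
\begin{aligned}
0 &= \frac{\partial g}{\partial x^*} \frac{\partial x^*}{\partial \theta_i}
   + \frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i} \\
\frac{\partial x^*}{\partial \theta_i}
  &= -\left( \frac{\partial g}{\partial x^*} \right)^{-1}
     \frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i}
\end{aligned}
\tag{5}
\]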
Note that because $\theta_i$ is a scalar, $\frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i}$ is an N × 1 vector.
The matrix $\partial g / \partial x^*$ is easily computed by noticing that
g(x*, w(θ)) is the gradient of E(·), with each term from
x replaced with a term from x*. This makes the $\partial g / \partial x^*$ term in
the above equations just the Hessian matrix of E(x) evaluated
at x*.
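As a quick sanity check of the implicit-differentiation result, consider a scalar toy energy with w(θ) = e^θ. This energy is an assumption chosen for illustration and does not appear in the paper; it is useful because x* has a closed form, so the implicit derivative can be compared against direct differentiation.

```latex
% Toy scalar energy (illustrative assumption), with w = w(\theta) = e^{\theta}:
E(x, y; w) = (x - y)^2 + w x^2
\quad\Longrightarrow\quad
g(x, w) = 2(x - y) + 2 w x ,
\qquad
x^* = \frac{y}{1 + w}

% Ingredients of the implicit derivative:
\frac{\partial g}{\partial x^*} = 2(1 + w), \qquad
\frac{\partial g}{\partial w} = 2 x^*, \qquad
\frac{\partial w}{\partial \theta} = w

% Implicit differentiation:
\frac{\partial x^*}{\partial \theta}
  = -\frac{1}{2(1 + w)} \cdot 2 x^* \cdot w
  = -\frac{w \, y}{(1 + w)^2}

% Direct differentiation of the closed form gives the same result:
\frac{d}{d\theta} \left( \frac{y}{1 + w} \right)
  = -\frac{y}{(1 + w)^2} \frac{dw}{d\theta}
  = -\frac{w \, y}{(1 + w)^2}
```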
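Denoting the Hessian of E(x) evaluated at x* as $H_E(x^*)$
and applying the chain rule leads to the derivative of the
overall loss function with respect to $\theta_i$:
\[
\frac{\partial L(x^*(\theta), t)}{\partial \theta_i}
= -\left( \frac{\partial L(x^*(\theta), t)}{\partial x^*} \right)^{T}
  H_E(x^*)^{-1}
  \frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i}
\tag{6}
\]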
Previous authors have pointed out two important points
regarding Equation 6 [3, 16]. First, the Hessian does
not need to be inverted. Instead, only the value
$\left( \frac{\partial L(x^*(\theta), t)}{\partial x^*} \right)^{T} H_E(x^*)^{-1}$ needs to be computed. This can
be accomplished efficiently in a number of ways, including
iterative methods like conjugate-gradients. Second,
by computing $\left( \frac{\partial L(x^*(\theta), t)}{\partial x^*} \right)^{T} H_E(x^*)^{-1}$ rather than
$H_E(x^*)^{-1} \frac{\partial g}{\partial w} \frac{\partial w}{\partial \theta_i}$, only one call to the solver is necessary
to compute the gradient for all parameters in the vector θ.
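The effect of the ordering can be illustrated numerically. The sketch below uses random stand-ins for the Hessian, the loss gradient, and the parameter-dependent term (none of these values come from the paper), and `np.linalg.solve` is used in place of whichever direct or iterative solver one prefers. Both orderings give the same gradient, but the second needs a single solve regardless of the number of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_params = 30, 8

# Stand-ins for the quantities in Equation 6 (random values, illustration only).
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                     # symmetric positive-definite "Hessian"
dL_dx = rng.standard_normal(n)                  # gradient of the loss w.r.t. x*
dg_dtheta = rng.standard_normal((n, n_params))  # column i: (dg/dw)(dw/dtheta_i)

# Ordering A: one linear solve per parameter (n_params solves in total).
grad_a = np.array([-dL_dx @ np.linalg.solve(H, dg_dtheta[:, i])
                   for i in range(n_params)])

# Ordering B: a single solve against the loss gradient, reused for every parameter.
v = np.linalg.solve(H.T, dL_dx)                 # v^T = dL_dx^T H^{-1}
grad_b = -v @ dg_dtheta

print(np.max(np.abs(grad_a - grad_b)))          # difference is at round-off level
```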
2.5. Overall Steps for Computing Gradient
This formulation provides us with a convenient framework
to learn the parameters θ using basic optimization routines.
The steps are:
1. Compute x*(θ) for the current parameter vector θ. In
our work, this is accomplished using non-linear conjugate
gradient optimization.
2. Compute the Hessian matrix at x*(θ), $H_E(x^*)$. Also
compute the training loss, L(x*(θ), t), and its gradient
with respect to x*.
3. Compute the gradient of L(·) using Equation 6. As
described above, performing the computations in the
correct order can lead to significant gains in computational
efficiency.
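The three steps can be sketched end to end on a toy model. The quadratic energy below, its first-difference matrix `D`, and the function name `loss_and_gradient` are assumptions for illustration rather than the model used in the paper; because the energy is quadratic, step 1 reduces to a linear solve instead of non-linear conjugate gradient. A central finite difference is included as a check on the analytic gradient.

```python
import numpy as np

def loss_and_gradient(theta, y, t):
    """Steps 1-3 for the hypothetical energy
    E(x, y; w) = ||x - y||^2 + w ||D x||^2, with w = exp(theta)."""
    n = y.size
    D = np.diff(np.eye(n), axis=0)           # first-difference matrix
    w = np.exp(theta)
    DtD = D.T @ D

    # Step 1: minimize the energy. It is quadratic here, so a linear solve suffices.
    x_star = np.linalg.solve(np.eye(n) + w * DtD, y)

    # Step 2: Hessian of E at x*, the loss, and the loss gradient w.r.t. x*.
    H = 2.0 * np.eye(n) + 2.0 * w * DtD
    loss = np.sum((x_star - t) ** 2)
    dL_dx = 2.0 * (x_star - t)

    # Step 3: Equation 6. Solve against dL_dx first (H is symmetric), then
    # multiply by (dg/dw)(dw/dtheta) = 2 * w * D^T D x*.
    v = np.linalg.solve(H, dL_dx)
    grad = -v @ (2.0 * w * DtD @ x_star)
    return float(loss), float(grad)

rng = np.random.default_rng(1)
t = np.sin(np.linspace(0.0, 3.0, 40))
y = t + 0.2 * rng.standard_normal(t.size)

theta, eps = 0.5, 1e-6
loss, grad = loss_and_gradient(theta, y, t)
fd = (loss_and_gradient(theta + eps, y, t)[0]
      - loss_and_gradient(theta - eps, y, t)[0]) / (2 * eps)
print(grad, fd)   # analytic and finite-difference gradients should agree closely
```

With the gradient available, θ can be updated with any standard first-order routine, such as the steepest-descent scheme mentioned in Section 2.3.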
3. Application in Denoising Images
Having outlined our general approach, we now apply this
approach to learning image priors. In this section, we will
describe how this approach can be used to train a model
similar to the Field of Experts model. As we will show in
Section 4, training using our method leads to a denoising
model that performs quite well.
3.1. Background: Field of Experts Model
Recent work has shown that image priors created from
the combination of image filters and robust penalty functions
are very effective for low-level vision tasks like denoising,
in-painting, and de-blurring [9, 12, 15]. The
Field of Experts model is defined by a set of linear filters,
$f_1, \ldots, f_{N_f}$, and their associated weights $\alpha_1, \ldots, \alpha_{N_f}$. The
Lorentzian penalty function, $\rho(x) = \log(1 + x^2)$, is used
to define the clique potentials. This leads to a probability
density function over an image, x, defined as:
\[
p(x) = \frac{1}{Z} \exp\left( -\sum_{i=1}^{N_f} \alpha_i \sum_{p=1}^{N_p} \rho(x_p \ast f_i) \right)
\tag{7}
\]
where $N_p$ is the number of pixels, $N_f$ is the number of filters,
and $(x_p \ast f_i)$ denotes the result of convolving the patch
at pixel p in image x with the filter $f_i$.
When applied to denoising images, the probability den-
sity in Eq. 7 is used as a prior and combined with a Gaussian
likelihood function to form a posterior probability distribu-
tion of an image given by:
\[
p(x \mid y) = \frac{1}{Z} \exp\left(
  -\sum_{i=1}^{N_f} \alpha_i \sum_{p=1}^{N_p} \rho(x_p \ast f_i)
  - \frac{1}{2\sigma^2} \sum_{p=1}^{N_p} (x_p - y_p)^2
\right)
\tag{8}
\]
where σ is the standard deviation of the Gaussian noise and
y is a noisy observation of x.
3.2. Energy and Loss Function Formulation
As stated in Section 2 we learn the MRF parameters
by defining a loss function that measures the similarity be-
tween the ground truth t and the minimum energy configu-
ration of the MRF model. We use the negative log-posterior
given in Equation 8 to form our energy function as:
\[
E(x, y; \alpha, \beta) =
  \sum_{i=1}^{N_f} e^{\alpha_i} \sum_{p=1}^{N_p} \log\left( 1 + (F_i x_p)^2 \right)
  + \sum_{p=1}^{N_p} (x_p - y_p)^2
\tag{9}
\]
Here we have used multiplication with a doubly-Toeplitz
matrix $F_i$ to denote convolution with a filter $f_i$. Each filter,
$F_i$, is formed from a linear combination of a set of basis
filters $B_1, \ldots, B_{N_B}$. The parameters β determine the coefficients
of the basis filters, that is, $F_i$ is defined as
\[
F_i = \sum_{j=1}^{N_B} \beta_{ij} B_j
\tag{10}
\]
This formulation allows us to learn the filters, $F_1, \ldots, F_{N_f}$, via
the parameters β as well as their respective weights via α.
In the following section we assume the loss function
L(x*(θ), t) to be the pixelwise squared error between
x* and t. That is,
\[
L(x^*(\theta), t) = \sum_{p=1}^{N_p} (x^*_p - t_p)^2
\tag{11}
\]
where we have grouped the parameters α and β into a single
vector θ. However, our proposed formulation is flexible
enough to allow the user to choose a loss function that
matches their notion of image quality.
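The energy, filter, and loss definitions in Equations 9 to 11 can be written down compactly for a one-dimensional signal. This is a sketch under simplifying assumptions: it uses 1-D signals and plain Toeplitz matrices rather than images and doubly-Toeplitz matrices, and the helper names (`conv_matrix`, `build_filters`, `energy`, `loss`) and the two basis filters are hypothetical.

```python
import numpy as np

def conv_matrix(f, n):
    """Toeplitz matrix T with T @ x == np.convolve(x, f, mode="valid")."""
    k = f.size
    T = np.zeros((n - k + 1, n))
    for r in range(n - k + 1):
        T[r, r:r + k] = f[::-1]              # convolution flips the filter
    return T

def build_filters(beta, basis, n):
    """Equation 10: F_i = sum_j beta_ij B_j, with each B_j a Toeplitz matrix."""
    B = np.stack([conv_matrix(b, n) for b in basis])   # (N_B, rows, n)
    return np.einsum("ij,jrc->irc", beta, B)           # (N_f, rows, n)

def energy(x, y, alpha, F):
    """Equation 9: weighted Lorentzian penalties plus the squared data term."""
    responses = F @ x                                  # (N_f, rows)
    prior = np.sum(np.exp(alpha)[:, None] * np.log1p(responses ** 2))
    return float(prior + np.sum((x - y) ** 2))

def loss(x_star, t):
    """Equation 11: pixelwise squared error."""
    return float(np.sum((x_star - t) ** 2))

n = 16
basis = [np.array([1.0, -1.0]), np.array([1.0, 1.0])]  # hypothetical basis filters
beta = np.array([[1.0, 0.0], [0.5, -0.5]])             # coefficients for N_f = 2 filters
alpha = np.zeros(2)
F = build_filters(beta, basis, n)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, n)
y = t + 0.1 * rng.standard_normal(n)
print(energy(y, y, alpha, F), loss(y, t))
```

Storing each filter as a matrix mirrors the notation of Equation 9 and makes the gradient and Hessian expressions that follow direct matrix products.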
3.3. Calculating Gradients in the FoE Model
For clarity, we will describe how to compute the required
derivatives in the denoising formulation by considering a
model with one filter. In this case the gradient of the energy
function E(x, y; α, β) can be written as
\[
\frac{\partial E(x, y; \alpha, \beta)}{\partial x} = F^T \rho'(u) + 2(x - y)
\tag{12}
\]
where $\rho'(u)$ is a function that is applied elementwise to the
vector u = F x and defined as
\[
\rho'(z) = \exp(\alpha) \frac{2z}{1 + z^2}
\tag{13}
\]

Following the steps in Section 2.4 we can write the
derivative of x* w.r.t. the parameter $\beta_j$ as the vector given
by:
\[
\frac{\partial x^*}{\partial \beta_j}
= -(F^T W F + I)^{-1} \left( B_j^T \rho'(u^*) + F^T \rho''(u^*) B_j x^* \right)
\tag{14}
\]
where W is an $N_p \times N_p$ diagonal matrix and the $[W]_{i,i}$ entry is
determined by applying the function
\[
\rho''(z) = \exp(\alpha) \frac{2 - 2z^2}{(1 + z^2)^2}
\tag{15}
\]
elementwise to the vector u* = F x*.
This leads to the overall expression for the
derivative of the loss function with respect to $\beta_j$:
\[
\frac{\partial L(x^*(\theta), t)}{\partial \beta_j}
= -(x^* - t)^T (F^T W F + I)^{-1} C_\beta
\tag{16}
\]
where $C_\beta = B_j^T \rho'(u^*) + F^T \rho''(u^*) B_j x^*$.
In a similar fashion, we get the expression for
the derivative of the loss function with respect to α:
\[
\frac{\partial L(x^*(\theta), t)}{\partial \alpha}
= -(x^* - t)^T (F^T W F + I)^{-1} C_\alpha
\tag{17}
\]
where $C_\alpha = F^T \rho'(u^*)$.
The above equations can be easily extended for multiple
filters $f_1, \ldots, f_{N_f}$ and their corresponding doubly-Toeplitz
matrices $F_1, \ldots, F_{N_f}$.
Now that we have the necessary information to compute
the required gradients (Equations 16 and 17) we can follow
the steps outlined in Section 2.5 to learn the parameters of
the FoE model.
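The single-filter derivation can be checked numerically with a small sketch. Several things here are assumptions rather than the paper's setup: the signal is one-dimensional, the two basis filters are made up, and plain gradient descent stands in for the non-linear conjugate gradient minimizer. The code differentiates Equations 9 and 11 directly, so the identity term in the Hessian and the loss gradient each carry the factor of 2 that comes from the squared terms. A finite-difference estimate is used to check the analytic gradient.

```python
import numpy as np

def conv_matrix(f, n):
    """Toeplitz matrix T with T @ x == np.convolve(x, f, mode="valid")."""
    k = f.size
    T = np.zeros((n - k + 1, n))
    for r in range(n - k + 1):
        T[r, r:r + k] = f[::-1]
    return T

def rho1(z, alpha):
    """First derivative of the weighted Lorentzian (Equation 13)."""
    return np.exp(alpha) * 2.0 * z / (1.0 + z ** 2)

def rho2(z, alpha):
    """Second derivative of the weighted Lorentzian (Equation 15)."""
    return np.exp(alpha) * (2.0 - 2.0 * z ** 2) / (1.0 + z ** 2) ** 2

def map_estimate(y, alpha, F, steps=2000, lr=0.1):
    """Minimize the energy by plain gradient descent (a stand-in for non-linear CG)."""
    x = y.copy()
    for _ in range(steps):
        x -= lr * (F.T @ rho1(F @ x, alpha) + 2.0 * (x - y))   # Equation 12
    return x

def loss_and_grads(alpha, beta, B, y, t):
    """Return the loss and its derivatives w.r.t. alpha and each beta_j."""
    F = np.tensordot(beta, B, axes=1)            # F = sum_j beta_j B_j
    x = map_estimate(y, alpha, F)
    u = F @ x
    W = np.diag(rho2(u, alpha))
    H = F.T @ W @ F + 2.0 * np.eye(x.size)       # Hessian of the energy at x*
    v = np.linalg.solve(H, 2.0 * (x - t))        # one solve shared by all parameters
    C_alpha = F.T @ rho1(u, alpha)
    C_beta = np.stack([Bj.T @ rho1(u, alpha) + F.T @ (W @ (Bj @ x)) for Bj in B])
    loss = float(np.sum((x - t) ** 2))
    return loss, float(-v @ C_alpha), -(C_beta @ v)

rng = np.random.default_rng(0)
n = 20
t = np.sin(np.linspace(0.0, 4.0, n))
y = t + 0.3 * rng.standard_normal(n)
B = np.stack([conv_matrix(np.array([1.0, 0.0, -1.0]), n),    # hypothetical basis
              conv_matrix(np.array([1.0, -2.0, 1.0]), n)])
alpha, beta = 0.0, np.array([0.5, 0.3])

loss, g_alpha, g_beta = loss_and_grads(alpha, beta, B, y, t)

eps = 1e-5
fd_alpha = (loss_and_grads(alpha + eps, beta, B, y, t)[0]
            - loss_and_grads(alpha - eps, beta, B, y, t)[0]) / (2 * eps)
print(g_alpha, fd_alpha)   # analytic and finite-difference values should agree closely
```

The parameter values were picked so that this toy energy is convex, which lets a fixed-step gradient descent converge reliably; a non-convex setting would need a more careful minimizer.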
3.4. Computational Issues
An important practical difference between this approach
for learning MRF parameters and previous approaches for
learning hyper-parameters is that the matrices $F_1, \ldots, F_{N_f}$
are sparse. We found that we were able to train on relatively
large 51 × 51 image patches without having to use iterative
solvers.
Another feature we can exploit is the fact that Equations
16 and 17 are identical except for the last $C_\theta$ term. This
means that we only have to compute the Hessian once per
gradient calculation.
3.5. Potential Problems
Because the energy function has multiple local minima,
it is likely that re-optimizing E(·) from an arbitrary initial
condition will find a different minimum than the values of
x* used during training. Despite this, the model trained with
this method still performs very well, as we will show in
the following section. Training on many images prevents
the system from trying to exploit specific minima that may
be unique to one image. Instead, the system is attempting
to find parameters that lead to a good solution across the
training set. Utilizing a large enough training set will pre-
vent the system from overfitting to accommodate a certain
minimum energy configuration for a particular image.
4. Experiments and Results
We conducted our experiments using the images from
the Berkeley segmentation database [11] identified by Roth
and Black in their experiments. We used the same 40 images
for training and 68 images for testing. Since our approach
allowed us to train on larger patches than those used
by Roth and Black, the results reported in this paper are all
from training using four randomly selected 51 × 51 patches
from each training image, giving us a total of 160 training
patches. We learnt 24 filters with dimension 5 × 5 pixels
using the same basis as Roth and Black. Training was
also done using 2000 randomly selected 15 × 15 patches, with
slightly lower results.
In order to compare our results, we denoised the test images
using the Field of Experts (FoE) implementation provided
by Roth and Black. The parameters in this implementation
were learnt using the contrastive divergence method
mentioned in Section 2.2, and the implementation uses gradient
ascent to denoise images. For convenience, we will refer to this
system as FoE-GA (for gradient ascent). We ran those experiments
for 2500 iterations per image as suggested by Roth
and Black using a step size η = 0.6. We also implemented a
denoising system using the same parameters from the FoE-GA
system but using conjugate gradient descent instead of
gradient ascent. We will refer to this system as FoE-CG
(for conjugate gradient), and results are reported when the
system converges at a local minimum.
As shown in Figure 2, we achieve similar and at times
better performance at convergence when compared to FoE-
GA across the test images when our MAP inference is al-
lowed to terminate at a local minimum. We wish to reiterate
that FoE-GA terminates after a fixed number of iterations
which is chosen to maximize performance and not when
the system reaches a local minimum. When compared to
FoE-CG our performance is noticeably much better. This
shows a significant and important difference using our pro-
posed training method: if the minimization is allowed to
terminate at a local minimum then our performance is much
better. Convergence is also achieved much faster using our
training method as shown in Table 1. We used the same
stopping criteria for our testing results and FoE-CG.
We also computed the perceptually based SSIM
index [20] to measure the denoising results. Table 2 gives
the average PSNR and SSIM index computed from the de-
noised images. Figures 3 and 4 show examples in which our
results are better in terms of PSNR and SSIM. Visually, the
texture is preserved better in our denoised images. How-

References
Image quality assessment: from error visibility to structural similarity. Journal article.
A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings article.
Training products of experts by minimizing contrastive divergence. Journal article.
Choosing Multiple Parameters for Support Vector Machines. Journal article.
Image and depth from a conventional camera with a coded aperture. Proceedings article.