JOURNAL OF PAMI, VOL. ?, NO. ?, JANUARY 20?? 1
Modeling Natural Images Using Gated MRFs
Marc’Aurelio Ranzato, Volodymyr Mnih, Joshua M. Susskind, Geoffrey E. Hinton
Abstract—This paper describes a Markov Random Field for real-valued image modeling that has two sets of latent variables.
One set is used to gate the interactions between all pairs of pixels while the second set determines the mean intensities of each
pixel. This is a powerful model with a conditional distribution over the input that is Gaussian with both mean and covariance
determined by the configuration of latent variables, which is unlike previous models that were restricted to use Gaussians with
either a fixed mean or a diagonal covariance matrix. Thanks to the increased flexibility, this gated MRF can generate more realistic
samples after training on an unconstrained distribution of high-resolution natural images. Furthermore, the latent variables of
the model can be inferred efficiently and can be used as very effective descriptors in recognition tasks. Both generation and
discrimination drastically improve as layers of binary latent variables are added to the model, yielding a hierarchical model called
a Deep Belief Network.
Index Terms—gated MRF, natural images, deep learning, unsupervised learning, density estimation, energy-based model,
Boltzmann machine, factored 3-way model, generative model, object recognition, denoising, facial expression recognition
1 INTRODUCTION
THE study of the statistical properties of natural images
has a long history and has influenced many fields, from
image processing to computational neuroscience [1]. In
computer vision, for instance, ideas and principles derived
from image statistics and from studying the processing
stages of the human visual system have had a significant
impact on the design of descriptors that are useful for
discrimination. A common paradigm has emerged over the
past few years in object and scene recognition systems.
Most methods [2] start by applying well-engineered
feature descriptors, such as SIFT [3], HoG [4], SURF [5], or
PHoG [6], to image patches, and then aggregate these features
at different spatial resolutions and on different parts of the
image to produce a feature vector which is subsequently fed
into a general purpose classifier, such as a Support Vector
Machine (SVM). Although very successful, these methods
rely heavily on human design of good patch descriptors
and ways to aggregate them. Given the large and growing
amount of easily available image data and continued ad-
vances in machine learning, it should be possible to exploit
the statistical properties of natural images more efficiently
by learning better patch descriptors and better ways of
aggregating them. This will be particularly significant for
data where human expertise is limited, such as microscopic,
radiographic or hyper-spectral imagery.
In this paper, we focus on probabilistic models of natural
images which are useful not only for extracting represen-
tations that can subsequently be used for discriminative
tasks [7], [8], [9], but also for providing adaptive priors
M. Ranzato, V. Mnih and G.E. Hinton are with the Department of
Computer Science, University of Toronto, Toronto, ON, M5S 3G4,
CANADA.
E-mail: see http://www.cs.toronto.edu/~ranzato
J.M. Susskind is with Machine Perception Laboratory, University of
California San Diego, La Jolla, 92093, U.S.A.
that can be used for image restoration tasks [10], [11],
[12]. Thanks to their generative ability, probabilistic models
can cope more naturally with ambiguities in the sensory
inputs and have the potential to produce more robust
features. Devising good models of natural images, however,
is a challenging task [1], [12], [13], because images are
continuous, high-dimensional and very highly structured.
Recent studies have tried to capture high-order dependen-
cies by using hierarchical models that extract highly non-
linear representations of the input [14], [15]. In particular,
deep learning methods construct hierarchies composed of
multiple layers by greedily training each layer separately
using unsupervised algorithms [8], [16], [17], [18]. These
methods are appealing because 1) they adapt to the input
data; 2) they recursively build hierarchies using unsu-
pervised algorithms, breaking up the difficult problem of
learning hierarchical non-linear systems into a sequence
of simpler learning tasks that use only unlabeled data; 3)
they have demonstrated good performance on a variety
of domains, from generic object recognition to action
recognition in video sequences [17], [18], [19].
In this paper we propose a probabilistic generative
model of images that can be used as the front-end of a
standard deep architecture, called a Deep Belief Network
(DBN) [20]. We test both the generative ability of this
model and the usefulness of the representations that it learns
for applications such as object recognition, facial expression
recognition and image denoising, and we demonstrate state-
of-the-art performance for several different tasks involving
several different types of image.
Our probabilistic model is called a gated Markov Ran-
dom Field (MRF) because it uses one of its two sets of
latent variables to create an image-specific energy function
that models the covariance structure of the pixels by switch-
ing in sets of pairwise interactions. It uses its other set of
latent variables to model the intensities of the pixels [13].
The DBN then uses several further layers of Bernoulli
latent variables to model the statistical structure in the
hidden activities of the two sets of latent variables of the
gated MRF. By replicating features in the lower layers it
is possible to learn a very good generative model of high-
resolution images and to use this as a principled framework
for learning adaptive descriptors that turn out to be very
useful for discriminative tasks.
In the remainder of this paper, we first discuss our new
contributions with respect to our previous published work
and then describe the model in detail. In sec. 2 we review
other popular generative models of images and motivate
the need for the model we propose, the gated MRF. In
sec. 3, we describe the learning algorithm as well as the
inference procedure for the gated MRF. In order to capture
the dependencies between the latent variables of the gated
MRF, several other layers of latent variables can be added,
yielding a DBN with many layers, as described in sec. 4.
Such models cannot be scaled in a simple way to deal with
high-resolution images because the number of parameters
scales quadratically with the dimensionality of the input at
each layer. Therefore, in sec. 5 an efficient and effective
weight-sharing scheme is introduced. The key idea is to
replicate parameters across local neighborhoods that do
not overlap in order to accomplish a twofold goal: exploit
stationarity of images while limiting the redundancy of
latent variables encoding features at nearby image locations.
Finally, we present a thorough validation of the model in
sec. 6 with comparisons to other models on a variety of
image types and tasks.
1.1 Contributions
This paper is a coherent synthesis of previously unpublished
results with the authors’ previous work on gated MRFs [21],
[9], [13], [22] that has appeared in several recent conference
papers and is intended to serve as the main reference on the
topic, describing in a more organized and consistent way
the major ideas behind this probabilistic model, clarifying
the relationship between the mPoT and mcRBM models
described below, and providing more details (including
pseudo-code) about the learning algorithms and the exper-
imental evaluations. We have included a subsection on the
relation to other classical probabilistic models that should
help the reader better understand the advantages of the
gated MRF and the similarities to other well-known models.
The paper includes empirical evaluations of the model on
an unusually large variety of tasks, not only on image
denoising and generation tasks that are standard ways to
evaluate probabilistic generative models of natural images,
but also on three very different recognition tasks (scenes,
generic object recognition, and facial expressions under
occlusion). The paper demonstrates that the gated MRF can
be used for a wide range of different vision tasks, and it
should suggest many other tasks that can benefit from the
generative power of the model.
2 THE GATED MRF
In this section, we first review some of the most popu-
lar probabilistic models of images and discuss how their
Fig. 1. Toy illustration comparing different models on two-pixel
images (each panel plots the first pixel x_1 against the second pixel
x_2). Blue dots are a dataset of two-pixel images. The red dot is the
data point we want to represent. The green dot is its (mean)
reconstruction. The models are: Principal Component Analysis (PCA),
Probabilistic PCA (PPCA), Factor Analysis (FA), Sparse Coding (SC),
Product of Student's t (PoT) and mean PoT (mPoT).
underlying assumptions limit their modeling abilities. This
motivates the introduction of the model we propose. After
describing our basic model and its learning and inference
procedures, we show how we can make it hierarchical and
how we can scale it up using parameter-sharing to deal with
high-resolution images.
2.1 Relation to Other Probabilistic Models
Natural images live in a very high dimensional space that
has as many dimensions as there are pixels, easily on the
order of millions or more. Yet it is believed that they
occupy a tiny fraction of that space, due to the structure
of the world, lying on a much lower dimensional
yet highly non-linear manifold [23]. The ultimate goal of
unsupervised learning is to discover representations that pa-
rameterize such a manifold, and hence, capture the intrinsic
structure of the input data. This structure is represented
through features, also called latent variables in probabilistic
models.
One simple way to check whether a model extracts
features that retain information about the input is to
reconstruct the input itself from the features. If the
reconstruction errors of inputs similar to training samples
are lower than the reconstruction errors of other input data
points, then the model must have learned interesting regularities [24]. In
PCA, for instance, the mapping into feature space is a linear
projection into the leading principal components and the
reconstruction is performed by another linear projection.
The reconstruction is perfect only for those data points that
lie in the linear subspace spanned by the leading principal
components. The principal components are the structure
captured by this model.
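As a concrete illustration of this reconstruction test (a sketch, not code from the paper), the following numpy snippet fits the leading principal component of a toy dataset of two-pixel images and measures the reconstruction error; the dataset, noise scale, and variable names are illustrative assumptions:

```python
import numpy as np

# Hypothetical toy data: 500 two-pixel "images" lying near a 1-D subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1)) @ np.array([[1.0, 1.0]]) \
    + 0.1 * rng.normal(size=(500, 2))
X -= X.mean(axis=0)                      # center the data

# Leading principal component via the covariance eigendecomposition.
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
U = eigvecs[:, -1:]                      # top component (largest eigenvalue)

# Map into feature space and back: h = U^T x, x_hat = U h.
H = X @ U
X_hat = H @ U.T

# Points near the learned subspace reconstruct almost perfectly; the
# residual error measures how much structure the model failed to capture.
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
```

For data generated this way, `err` is close to the variance of the noise in the discarded direction, while inputs far from the learned subspace would reconstruct poorly.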
A probabilistic framework also provides a mapping into
feature, or latent variable, space and back to image
space. The former is obtained through the posterior
distribution over the latent variables, p(h|x), where x is
the input and h the latent variables; the latter through the
conditional distribution over the input, p(x|h).
Just as in PCA one would reconstruct the input from the
features in order to assess the quality of the encoding,
in a probabilistic setting we can analyze and compare
different models in terms of their conditional p(x|h). We
can sample the latent variables, h̄ ∼ p(h|x̄), given an input
image x̄, and then look at how well the image x̄ can be
reconstructed using p(x|h̄). Reconstructions produced in
this way are typically much more like real data than true
samples from the underlying generative model, because the
latent variables are sampled from their posterior distribution,
p(h|x̄), rather than from their prior, p(h); but the
reconstructions do provide insight into how much of the
information in the image is preserved in the sampled values
of the latent variables.
As shown in fig. 1, most models such as Probabilistic
Principal Component Analysis (PPCA) [25], Factor Analy-
sis (FA) [26], Independent Component Analysis (ICA) [27],
Sparse Coding (SC) [28], and Gaussian Restricted Boltz-
mann Machines (GRBM) [29], assume that the conditional
distribution of the pixels p(x|h) is Gaussian with a mean
determined by the latent variables and a fixed, image-
independent covariance matrix. In PPCA the mean of
the distribution lies along the directions of the leading
eigenvectors while in SC it is along a linear combination
of a very small number of basis vectors (represented by
black arrows in the figure). From a generative point of view,
these are rather poor assumptions for modeling natural
images because much of the interesting structure of natural
images lies in the fact that the covariance structure of the
pixels varies considerably from image to image. A vertical
occluding edge, for example, eliminates the typical strong
correlation between pixels on opposite sides of the edge.
This limitation is addressed by models like Product of
Student’s t (PoT) [30], covariance Restricted Boltzmann
Machine (cRBM) [21] and the model proposed by Karklin
and Lewicki [14] each of which instead assume a Gaussian
conditional distribution with a fixed mean but with a full
covariance determined by the states of the latent vari-
ables. Latent variables explicitly account for the correlation
patterns of the input pixels, avoiding interpolation across
edges while smoothing within uniform regions. The mean,
however, is fixed to the average of the input data vectors
across the whole dataset. As shown in the next section,
this can yield very poor conditional models of the input
distribution.
In this work, we extend these two classes of models with
a new model whose conditional distribution over the input
has both a mean and a covariance matrix determined by
latent variables. We will introduce two such models, namely
the mean PoT (mPoT) [13] and the mean-covariance RBM
(mcRBM) [9], which differ only in the choice of their
distribution over latent variables. We refer to these models
as gated MRFs because they are pair-wise Markov Random
Fields (MRFs) with latent variables gating the couplings
between input variables. Their marginal distribution can
be interpreted as a mixture of Gaussians with an infinite
(mPoT) or exponential (mcRBM) number of components,
each with non-zero mean and full covariance matrix and
Fig. 2. In the first column, each image is zero mean. In the
second column, the whole data set is centered but each image
can have non-zero mean. First row: 8x8 natural image patches and
contours of the empirical distribution of (tiny) two-pixel images (the
x-axis being the first pixel and the y-axis the second pixel). Second
row: images generated by a model that does not account for mean
intensity, with plots of how such a model could fit the distribution of
two-pixel images using a mixture of Gaussians whose components
can choose between two covariances. Third row: images generated
by a model that has both “mean” and “covariance” hidden units, and a
toy illustration of how such a model can fit the distribution of two-pixel
images, discovering the manifold of structured images (along the anti-
diagonal) using a mixture of Gaussians with arbitrary means and only
two covariances.
tied parameters.
2.2 Motivation
A Product of Student’s t (PoT) model [31] can be viewed as
modelling image-specific, pair-wise relationships between
pixel values by using the states of its latent variables. It
is very good at representing the fact that two pixels have
very similar intensities and no good at all at modelling what
these intensities are. Failure to model the mean also leads
to impoverished modelling of the covariances when the
input images have non-zero mean intensity. The covariance
RBM (cRBM) [21] is another model that shares the same
limitation since it only differs from PoT in the distribution
of its latent variables: The posterior over the latent variables
p(h|x) is a product of Bernoulli distributions instead of
Gamma distributions as in PoT.
We explain the fundamental limitation of these models
by using a simple toy example: Modelling two-pixel images
using a cRBM with only one binary latent variable (see
fig. 2). This cRBM assumes that the conditional distribution
over the input p(x|h) is a zero-mean Gaussian with a
covariance that is determined by the state of the latent
variable. Since the latent variable is binary, the cRBM can
be viewed as a mixture of two zero-mean full covariance
Gaussians. The latent variable uses the pairwise relationship
between pixels to decide which of the two covariance
matrices should be used to model each image. When the
input data is pre-processed by making each image have zero
mean intensity (the plot of the empirical histogram is shown
in the first row and first column), most images lie near the
origin because most of the time nearby pixels are strongly
correlated. Less frequently we encounter edge images that
exhibit strong anti-correlation between the pixels, as shown
by the long tails along the anti-diagonal line. A cRBM
could model this data by using two Gaussians (second row
and first column): one that is spherical and tight at the origin
for smooth images and another one that has a covariance
elongated along the anti-diagonal for structured images.
If, however, the whole set of images is normalized
by subtracting from every pixel the mean value of all
pixels over all images (first row and second column), the
cRBM fails at modelling structured images (second row
and second column). It can fit a Gaussian to the smooth
images by discovering the direction of strong correlation
along the main diagonal, but it is very likely to fail to
discover the direction of anti-correlation, which is crucial
to represent discontinuities, because structured images with
different mean intensity appear to be evenly spread over the
whole input space.
If the model has another set of latent variables that
can change the means of the Gaussian distributions in the
mixture (as explained more formally below and yielding
the mPoT and mcRBM models), then the model can rep-
resent both changes of mean intensity and the correlational
structure of pixels (see last row). The mean latent variables
effectively subtract off the relevant mean from each data-
point, letting the covariance latent variable capture the
covariance structure of the data. As before, the covariance
latent variable needs only to select between two covariance
matrices.
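To make this toy example concrete, here is a small numpy sketch (an illustration, not the authors' code) of such a two-component, zero-mean mixture: two hand-specified covariances play the roles available to the binary latent variable, and the posterior responsibility decides which covariance models each image. The specific covariance values are assumptions chosen for illustration:

```python
import numpy as np

def gauss_pdf(x, cov):
    """Zero-mean multivariate Gaussian density (written out for clarity)."""
    k = len(x)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** k * np.linalg.det(cov))
    return norm * np.exp(-0.5 * x @ np.linalg.inv(cov) @ x)

# Hand-specified covariances (illustrative assumptions, not learned values):
sigma_smooth = 0.05 * np.eye(2)                      # tight and spherical at the origin
sigma_edge = np.array([[1.0, -0.9], [-0.9, 1.0]])    # elongated along the anti-diagonal

def p_edge_given_x(x):
    """Posterior probability (with equal mixing proportions) that the binary
    latent variable selects the 'edge' covariance for image x = (x1, x2)."""
    ps, pe = gauss_pdf(x, sigma_smooth), gauss_pdf(x, sigma_edge)
    return pe / (ps + pe)

smooth_image = np.array([0.05, 0.06])   # nearby pixels nearly equal
edge_image = np.array([0.8, -0.8])      # strongly anti-correlated pixels: an edge

print(p_edge_given_x(smooth_image))     # low: the smooth component wins
print(p_edge_given_x(edge_image))       # high: the edge component wins
```

The responsibility acts exactly like the single binary precision unit in the toy cRBM: it uses the pairwise relationship between the two pixels to pick one of the two covariance matrices.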
In fact, experiments on real 8x8 image patches confirm
these conjectures. Fig. 2 shows samples drawn from PoT
and mPoT. The mPoT model (and similarly mcRBM [9])
is better at modelling zero mean images and much better
at modelling images that have non-zero mean intensity.
This will be particularly relevant when we introduce a
convolutional extension of the model to represent spatially
stationary high-resolution images (as opposed to small im-
age patches), since it will not be possible to independently
normalize overlapping image patches.
As we shall see in sec. 6.1, models that do not account
for mean intensity cannot generate realistic samples of
natural images since samples drawn from the conditional
distribution over the input have expected intensity that is
constant everywhere regardless of the value of the latent
variables. In the model we propose instead there is a set
of latent variables whose role is to bias the average mean
intensity differently in different regions of the input image.
Combined with the correlational structure provided by the
covariance latent variables, this produces smooth images
that have sharp boundaries between regions of different
mean intensity.
2.3 Energy Functions
We start the discussion assuming the input is a small
vectorized image patch, denoted by x ∈ R^D, and the
latent variables are denoted by the vector h^p ∈ {0, 1}^N.
First, we consider a pair-wise MRF defined in terms of
an energy function E. The probability density function is
related to E by: p(x, h^p) = exp(−E(x, h^p))/Z, where Z
is an (intractable) normalization constant which is called
the partition function. The energy is:

E(x, h^p) = 1/2 ∑_{i,j,k} t_{ijk} x_i x_j h^p_k    (1)

The states of the latent variables, called precision hidden
units, modulate the pair-wise interactions t_{ijk} between all
pairs of input variables x_i and x_j, with i, j = 1..D. Similarly
to Sejnowski [32], the energy function is defined in
terms of 3-way multiplicative interactions. Unlike previous
work by Memisevic and Hinton [33] on modeling image
transformations, here we use this energy function to model
the joint distribution of the variables within the vector x.
This way of allowing hidden units to modulate inter-
actions between input units has far too many parameters.
For real images we expect the required lateral interactions
to have a lot of regular structure. A hidden unit that
represents a vertical occluding edge, for example, needs
to modulate the lateral interactions so as to eliminate
horizontal interpolation of intensities in the region of the
edge. This regular structure can be approximated by writing
the 3-dimensional tensor of parameters t as a sum of outer
products: t_{ijk} = ∑_f C^{(1)}_{if} C^{(2)}_{jf} P_{fk}, where f is an index
over F deterministic factors, C^{(1)}, C^{(2)} ∈ R^{D×F}, and
P ∈ R^{F×N}. Since the factors are connected twice to the
same image through matrices C^{(1)} and C^{(2)}, it is natural to
tie their weights, further reducing the number of parameters
and yielding the final parameterization t_{ijk} = ∑_f C_{if} C_{jf} P_{fk}.
Thus, taking into account also the hidden biases, eq. 1
becomes:

E(x, h^p) = 1/2 ∑_{f=1}^{F} ( ∑_{k=1}^{N} P_{fk} h^p_k ) ( ∑_{i=1}^{D} C_{if} x_i )^2 − ∑_{k=1}^{N} b^p_k h^p_k    (2)
which can be written more compactly in matrix form as:

E(x, h^p) = 1/2 x^T C diag(P h^p) C^T x − b^{pT} h^p    (3)

where diag(v) is a diagonal matrix with diagonal entries
given by the elements of vector v. This model can be
interpreted as an instance of an RBM modeling pair-wise
interactions between the input pixels^1 and we dub it
covariance RBM (cRBM) [21], [9]^2 since it models the
covariance structure of the input through the “precision”
latent variables h^p.
The hidden units remain conditionally independent given
the states of the input units and their binary states can be
sampled using:
p(h^p_k = 1|x) = σ( −1/2 ∑_{f=1}^{F} P_{fk} ( ∑_{i=1}^{D} C_{if} x_i )^2 + b^p_k )    (4)
1. More precisely, this is an instance of a semi-restricted Boltzmann
machine [34], [35], since only hidden units are “restricted”, i.e. lack lateral
interactions.
2. This model should not be confused with the conditional RBM [36].
where σ is the logistic function σ(v) = 1/(1 + exp(−v)).
Given the states of the hidden units, the input units form
an MRF in which the effective pairwise interaction weight
between x_i and x_j is 1/2 ∑_f ∑_k P_{fk} h^p_k C_{if} C_{jf}. Therefore,
the conditional distribution over the input is:

p(x|h^p) = N(0, Σ), with Σ^{−1} = C diag(P h^p) C^T    (5)
Notice that the covariance matrix is not fixed, but is a
function of the states of the precision latent variables h^p.
In order to guarantee positive definiteness of the covariance
matrix we need to constrain P to be non-negative and add a
small quadratic regularization term to the energy function^3,
here ignored for clarity of presentation.
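The quantities in eqs. 3–5 are straightforward to compute. The following numpy sketch (an illustration under assumed toy dimensions and random parameters, not the authors' code) infers the precision units with eq. 4 and builds the image-specific inverse covariance of eq. 5, including the non-negativity constraint on P and the small regularization term just mentioned:

```python
import numpy as np

rng = np.random.default_rng(1)
D, F, N = 16, 32, 24          # pixels, factors, precision units (toy sizes)

C = rng.normal(scale=0.1, size=(D, F))
P = rng.uniform(size=(F, N))  # non-negative, as required for positive definiteness
b_p = np.zeros(N)
eps = 1e-3                    # small quadratic regularization term

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def infer_precision_units(x):
    # Eq. 4: p(h^p_k = 1 | x) = sigma(-1/2 sum_f P_fk (sum_i C_if x_i)^2 + b^p_k)
    filt_sq = (C.T @ x) ** 2                  # squared factor outputs (C^T x)_f^2
    return sigmoid(-0.5 * (P.T @ filt_sq) + b_p)

def inverse_covariance(h_p):
    # Eq. 5 plus the regularizer: Sigma^{-1} = C diag(P h^p) C^T + eps I
    return C @ np.diag(P @ h_p) @ C.T + eps * np.eye(D)

x = rng.normal(size=D)
q = infer_precision_units(x)                      # Bernoulli means, conditionally independent
h_p = (rng.uniform(size=N) < q).astype(float)     # one sample of the precision units
Sigma_inv = inverse_covariance(h_p)
```

Because P and h^p are non-negative, C diag(P h^p) C^T is positive semi-definite, and the eps I term keeps the resulting precision matrix strictly positive definite.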
As described in sec. 2.2, we want the conditional dis-
tribution over the pixels to be a Gaussian with not only
its covariance but also its mean depending on the states of
the latent variables. Since the product of a full covariance
Gaussian (like the one in eq. 5) with a spherical non-
zero mean Gaussian is a non-zero mean full covariance
Gaussian, we simply add the energy function of cRBM in
eq. 3 to the energy function of a GRBM [29], yielding:
E(x, h^m, h^p) = 1/2 x^T C diag(P h^p) C^T x − b^{pT} h^p + 1/2 x^T x − h^{mT} W^T x − b^{mT} h^m − b_x^T x    (6)
where h^m ∈ {0, 1}^M are called “mean” latent variables because
they contribute to control the mean of the conditional
distribution over the input:

p(x|h^m, h^p) = N( Σ(W h^m + b_x), Σ ), with Σ^{−1} = C diag(P h^p) C^T + I    (7)
where I is the identity matrix, W ∈ R^{D×M} is a matrix of
trainable parameters and b_x ∈ R^D is a vector of trainable
biases for the input variables. The posterior distribution
over the mean latent variables is^4:

p(h^m_k = 1|x) = σ( ∑_{i=1}^{D} W_{ik} x_i + b^m_k )    (8)
The overall model, whose joint probability density function
is proportional to exp(−E(x, h^m, h^p)), is called a mean
covariance RBM (mcRBM) [9] and is represented in fig. 3.
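To make the interaction between the two sets of latent variables concrete, this numpy sketch (illustrative toy sizes and random parameters, not the authors' code) evaluates the conditional mean and covariance of eq. 7 and the mean-unit posterior of eq. 8:

```python
import numpy as np

rng = np.random.default_rng(2)
D, F, N, M = 16, 32, 24, 20   # pixels, factors, precision units, mean units

# Hypothetical parameters; small scales keep the toy example well conditioned.
C = rng.normal(scale=0.1, size=(D, F))
P = rng.uniform(size=(F, N))
W = rng.normal(scale=0.1, size=(D, M))
b_x = np.zeros(D)
b_m = np.zeros(M)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

h_p = rng.integers(0, 2, size=N).astype(float)   # precision units (binary)
h_m = rng.integers(0, 2, size=M).astype(float)   # mean units (binary)

# Eq. 7: p(x | h^m, h^p) = N(Sigma (W h^m + b_x), Sigma),
#        with Sigma^{-1} = C diag(P h^p) C^T + I.
Sigma_inv = C @ np.diag(P @ h_p) @ C.T + np.eye(D)
Sigma = np.linalg.inv(Sigma_inv)
mean_x = Sigma @ (W @ h_m + b_x)   # mean units propose intensities, precision units smooth them

# Eq. 8: the mean-unit posterior is a logistic function of a linear filter bank.
x = rng.normal(size=D)
q_m = sigmoid(W.T @ x + b_m)       # p(h^m_k = 1 | x)
```

The product Σ(W h^m + b_x) is exactly the cooperation discussed below: W h^m + b_x sets region intensities, and multiplication by the image-specific Σ spreads them according to the pairwise correlations selected by h^p.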
The demonstration in fig. 4 is designed to illustrate
how the mean and precision latent variables cooperate to
represent the input. Through the precision latent variables
the model knows about pair-wise correlations in the image.
3. In practice, this term is not needed when the dimensionality of h^p
is larger than that of x.
4. Notice how the mean latent variables compute a non-linear projection
of a linear filter bank, akin to the most simplified “simple-cell” model
of area V1 of the visual cortex, while the precision units perform an
operation similar to the “complex-cell” model because rectified (squared)
filter outputs are non-linearly pooled to produce their response. In this
model, simple and complex cells perform their operations in parallel (not
sequentially). If, however, we equate the factors used by the precision
units to simple cells, we recover the standard model in which simple cells
send their squared outputs to complex cells.
Fig. 3. Graphical model representation (with only three input
variables): There are two sets of latent variables (the mean and the
precision units) that are conditionally independent given the input
pixels and a set of deterministic factor nodes that connect triplets of
variables (pairs of input variables and one precision unit).
Fig. 4. A) Input image patch. B) Reconstruction performed using
only mean hiddens (i.e. W h^m + b_x) (top) and both mean and precision
hiddens (bottom), that is, multiplying the patch on the top by the
image-specific covariance Σ = (C diag(P h^p) C^T + I)^{−1}, see mean
of Gaussian in eq. 7. C) Reconstructions produced by combining the
correct image-specific covariance as above with the incorrect, hand-
specified pixel intensities shown in the top row. Knowledge about
pair-wise dependencies allows a blob of high or low intensity to be
spread out over the appropriate region. D) Reconstructions produced
like in C) showing that precision hiddens do not account for polarity
(nor for the exact intensity values of regions) but only for correlations.
For instance, it knows that the pixels in the lower part of
the image in fig. 4-A are strongly correlated; these pixels
are likely to take the same value, but the precision latent
variables do not carry any information about which value
this is. Then, very noisy information about the values of the
individual pixels in the lower part of the image, as those
provided by the mean latent variables, would be sufficient
to reconstruct the whole region quite well, since the model
knows which values can be smoothed. Mathematically, the
interaction between mean and precision latent variables
is expressed by the product between Σ (which depends
only on h
p
) and W h
m
+ b
x
in the mean of the Gaussian
distribution of eq. 7. We can repeat the same argument
for the pixels in the top right corner and for those in the
middle part of the image as well. Fig. 4-C illustrates this
concept, while fig. 4-D shows that flipping the sign of
the reconstruction of the mean latent variables flips the
sign of the overall reconstruction as expected. Information
about intensity is propagated over each region thanks to the
pair-wise dependencies captured by the precision hidden
units. Fig. 4-B shows the same using the actual mean
intensity produced by the mean hidden units (top). The
reconstruction produced by the model using the whole set
of hidden units is very close to the input image, as can be
seen in the bottom part of fig. 4-B.
In a mcRBM the posterior distribution over the latent
variables p(h|x) is a product of Bernoullis, as shown in
eqs. 8 and 4. This distribution is particularly easy to use
in a standard DBN [20] where each layer is trained using
a binary-binary RBM. Binary latent variables, however,
are not very good at representing different real-values of

Figures
Citations
More filters
Journal ArticleDOI

Deep learning

TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Book

Deep Learning

TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Proceedings ArticleDOI

Context Encoders: Feature Learning by Inpainting

TL;DR: It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Posted Content

Context Encoders: Feature Learning by Inpainting

TL;DR: Context Encoders as mentioned in this paper is a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings, which can be used for semantic inpainting tasks, either stand-alone or as initialization for nonparametric methods.
Proceedings Article

Deep generative image models using a Laplacian pyramid of adversarial networks

TL;DR: A generative parametric model capable of producing high quality samples of natural images using a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion.
References
More filters
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Journal ArticleDOI

Distinctive Image Features from Scale-Invariant Keypoints

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Journal ArticleDOI

Gradient-based learning applied to document recognition

TL;DR: In this article, a graph transformer network (GTN) is proposed for handwritten character recognition, which can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters.
Proceedings ArticleDOI

Histograms of oriented gradients for human detection

TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Journal ArticleDOI

Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images

TL;DR: The analogy between images and statistical mechanics systems is made and the analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations, creating a highly parallel ``relaxation'' algorithm for MAP estimation.
Frequently Asked Questions (17)
Q1. What are the contributions mentioned in the paper "Modeling Natural Images Using Gated MRFs"?

This paper describes a Markov Random Field for real-valued image modeling that has two sets of latent variables. Furthermore, the latent variables of the model can be inferred efficiently and can be used as very effective descriptors in recognition tasks. 

Finally, a very promising research avenue is to extend the model to video sequences in which the temporal regularities created by smoothly changing viewing transformations should make it far easier to learn to model depth, three-dimensional transformations and occlusion [ 71 ]. Feedforward inference in their hierarchical generative model can be viewed as a type of variational approximation that is only exactly correct for the top layer, but the inference for the lower layers is a very good approximation because of the way they are learned [ 20 ]. 

The most commonly used task to quantitatively validate a generative model of natural images is image denoising, assuming homogeneous additive Gaussian noise of known variance [10], [11], [37], [58], [12]. 

All layers are trained using FPCD but, as training proceeds, the number of Markov chain steps between weight updates is increased from 1 to 100 at the topmost layer in order to obtain a better approximation to the maximum likelihood gradient. 

The correct sampling procedure [20] consists of generating a sample from the topmost RBM, followed by back-projection to image space through the chain of conditional distributions for each layer given the layer above. 
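
This procedure can be sketched as follows (a minimal illustration, not the authors' code; `top_rbm_gibbs` and the per-layer `(W, b)` pairs are hypothetical stand-ins for the trained model, and each lower layer is assumed to be sampled from a conditional Bernoulli):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_dbn(top_rbm_gibbs, down_weights, n_gibbs=100):
    """Ancestral sampling from a DBN (sketch).

    top_rbm_gibbs: callable running n Gibbs steps in the topmost RBM
    and returning a binary sample of its visible units.
    down_weights: list of (W, b) pairs for the layers below, applied
    top-down; each layer is sampled from its conditional distribution
    given the layer above.
    """
    h = top_rbm_gibbs(n_gibbs)                    # sample from the topmost RBM
    for W, b in down_weights:                     # back-project layer by layer
        p = sigmoid(h @ W + b)                    # p(h_below | h_above)
        h = (rng.random(p.shape) < p).astype(float)
    return h
```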

The discriminative training consisted of training a linear multi-class logistic regression classifier on the top level representation without using back-propagation to jointly optimize the parameters across all layers. 
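
A minimal numpy sketch of this setup, where the top-level representation is treated as a fixed feature matrix and only a linear multi-class logistic regression (softmax) classifier is trained on top of it (the feature and label arrays in the test are illustrative stand-ins):

```python
import numpy as np

def train_softmax_classifier(feats, labels, n_classes, lr=0.1, steps=500):
    """Linear multi-class logistic regression on fixed features (sketch).

    No gradient flows back into the layers that produced `feats`,
    mirroring the absence of joint back-propagation.
    """
    W = np.zeros((feats.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p = p / p.sum(axis=1, keepdims=True)                 # softmax probabilities
        W = W - lr * (feats.T @ (p - onehot)) / len(feats)   # cross-entropy gradient
        b = b - lr * (p - onehot).mean(axis=0)
    return W, b
```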

If k different local filters are replicated over all possible integer positions in the image, the representation will be about k times overcomplete. 

Since the input images have fairly low resolution and the statistics across the images are strongly non-stationary (because the faces have been aligned), the authors trained a deep model without weight-sharing. 

If the sum of the kinetic and potential energy rises by ∆ due to inaccurate simulation of the dynamics, the system is returned to the initial state with probability 1 − exp(−∆). 
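
This Metropolis-style correction can be sketched as follows (a minimal illustration; the `energy` callable is a hypothetical stand-in for the sum of kinetic and potential energy of a state):

```python
import math
import random

def hmc_accept(initial_state, proposed_state, energy):
    """Metropolis correction for Hybrid Monte Carlo (sketch).

    If the total (kinetic + potential) energy rises by delta due to
    inaccurate simulation of the dynamics, the system is returned to
    the initial state with probability 1 - exp(-delta); if the energy
    does not rise, the proposal is always accepted.
    """
    delta = energy(proposed_state) - energy(initial_state)
    if delta <= 0 or random.random() < math.exp(-delta):
        return proposed_state   # accept the proposal
    return initial_state        # reject: return to the initial state
```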

In order to draw an unbiased sample from the deep model, the authors then map the second layer sample produced in this way through the conditional distributions p(hm|h2) and p(hp|h2) to sample the mean and precision latent variables. 

The update rule for gradient ascent in the likelihood is:

θ ← θ + η ( ⟨∂F/∂θ⟩_model − ⟨∂F/∂θ⟩_data )   (13)

where ⟨·⟩ denotes an expectation over samples from the model or the training data. 
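
A minimal numpy sketch of one such update step, where `grad_F` is a hypothetical per-sample gradient of the free energy and each expectation is approximated by an average over samples:

```python
import numpy as np

def update_step(theta, grad_F, model_samples, data_samples, lr=0.01):
    """One gradient-ascent step on the likelihood (Eq. 13, sketch).

    grad_F(theta, x) returns dF/dtheta for a single sample x; the
    model and data expectations are approximated by sample averages.
    """
    model_term = np.mean([grad_F(theta, x) for x in model_samples], axis=0)
    data_term = np.mean([grad_F(theta, x) for x in data_samples], axis=0)
    return theta + lr * (model_term - data_term)
```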

The latent representation in the higher layers is able to capture longer range structure and it does a better job at filling in the missing pixels. 

In order to fill in, the authors initialize the missing pixels at zero and propagate the occluded image through the four layers using the sequence of posterior expectations. 
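
The fill-in procedure can be sketched as one up-down pass (a minimal illustration; `encode` and `decode` are hypothetical stand-ins for the chained posterior expectations of the four layers and their top-down counterparts):

```python
import numpy as np

def fill_in(image, mask, encode, decode):
    """Fill in occluded pixels (sketch).

    image:  observed image; occluded entries may hold arbitrary values.
    mask:   boolean array, True where the pixel is observed.
    encode: maps an image to top-layer posterior expectations.
    decode: maps top-layer expectations back to an image.
    """
    x = np.where(mask, image, 0.0)       # initialize missing pixels at zero
    h = encode(x)                        # upward pass: posterior expectations
    recon = decode(h)                    # downward pass: reconstruction
    return np.where(mask, image, recon)  # keep observed pixels, fill the rest
```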

A cRBM could model this data by using two Gaussians (second row and first column): one that is spherical and tight at the origin for smooth images and another one that has a covariance elongated along the anti-diagonal for structured images. 

However, the variational bound on the likelihood that is improved as each layer is added assumes this form of incorrect inference, so the learning ensures that it works well. 

This will be particularly relevant when the authors introduce a convolutional extension of the model to represent spatially stationary high-resolution images (as opposed to small image patches), since it will not be possible to independently normalize overlapping image patches. 

Since the number of latent variables scales with the number of input variables, the number of parameters subject to learning scales quadratically with the size of the input, making learning infeasibly slow.