JOURNAL OF PAMI, VOL. ?, NO. ?, JANUARY 20?? 1
Modeling Natural Images Using Gated MRFs
Marc’Aurelio Ranzato, Volodymyr Mnih, Joshua M. Susskind, Geoffrey E. Hinton
Abstract—This paper describes a Markov Random Field for real-valued image modeling that has two sets of latent variables.
One set is used to gate the interactions between all pairs of pixels while the second set determines the mean intensities of each
pixel. This is a powerful model with a conditional distribution over the input that is Gaussian with both mean and covariance
determined by the configuration of latent variables, which is unlike previous models that were restricted to use Gaussians with
either a fixed mean or a diagonal covariance matrix. Thanks to the increased flexibility, this gated MRF can generate more realistic
samples after training on an unconstrained distribution of high-resolution natural images. Furthermore, the latent variables of
the model can be inferred efficiently and can be used as very effective descriptors in recognition tasks. Both generation and
discrimination drastically improve as layers of binary latent variables are added to the model, yielding a hierarchical model called
a Deep Belief Network.
Index Terms—gated MRF, natural images, deep learning, unsupervised learning, density estimation, energy-based model,
Boltzmann machine, factored 3-way model, generative model, object recognition, denoising, facial expression recognition
1 INTRODUCTION
THE study of the statistical properties of natural images
has a long history and has influenced many fields, from
image processing to computational neuroscience [1]. In
computer vision, for instance, ideas and principles derived
from image statistics and from studying the processing
stages of the human visual system have had a significant
impact on the design of descriptors that are useful for
discrimination. A common paradigm has emerged over the
past few years in object and scene recognition systems.
Most methods [2] start by applying well-engineered
feature descriptors, such as SIFT [3], HoG [4], SURF [5], or
PHoG [6], to image patches, and then aggregate these features
at different spatial resolutions and on different parts of the
image to produce a feature vector which is subsequently fed
into a general purpose classifier, such as a Support Vector
Machine (SVM). Although very successful, these methods
rely heavily on human design of good patch descriptors
and ways to aggregate them. Given the large and growing
amount of easily available image data and continued ad-
vances in machine learning, it should be possible to exploit
the statistical properties of natural images more efficiently
by learning better patch descriptors and better ways of
aggregating them. This will be particularly significant for
data where human expertise is limited, such as microscopic,
radiographic or hyper-spectral imagery.
In this paper, we focus on probabilistic models of natural
images which are useful not only for extracting represen-
tations that can subsequently be used for discriminative
tasks [7], [8], [9], but also for providing adaptive priors
M. Ranzato, V. Mnih and G.E. Hinton are with the Department of
Computer Science, University of Toronto, Toronto, ON, M5S 3G4,
CANADA.
E-mail: see http://www.cs.toronto.edu/~ranzato
J.M. Susskind is with Machine Perception Laboratory, University of
California San Diego, La Jolla, 92093, U.S.A.
that can be used for image restoration tasks [10], [11],
[12]. Thanks to their generative ability, probabilistic models
can cope more naturally with ambiguities in the sensory
inputs and have the potential to produce more robust
features. Devising good models of natural images, however,
is a challenging task [1], [12], [13], because images are
continuous, high-dimensional and very highly structured.
Recent studies have tried to capture high-order dependen-
cies by using hierarchical models that extract highly non-
linear representations of the input [14], [15]. In particular,
deep learning methods construct hierarchies composed of
multiple layers by greedily training each layer separately
using unsupervised algorithms [8], [16], [17], [18]. These
methods are appealing because 1) they adapt to the input
data; 2) they recursively build hierarchies using unsu-
pervised algorithms, breaking up the difficult problem of
learning hierarchical non-linear systems into a sequence
of simpler learning tasks that use only unlabeled data; 3)
they have demonstrated good performance on a variety
of domains, from generic object recognition to action
recognition in video sequences [17], [18], [19].
In this paper we propose a probabilistic generative
model of images that can be used as the front-end of a
standard deep architecture, called a Deep Belief Network
(DBN) [20]. We test both the generative ability of this
model and the usefulness of the representations that it learns
for applications such as object recognition, facial expression
recognition and image denoising, and we demonstrate state-
of-the-art performance for several different tasks involving
several different types of image.
Our probabilistic model is called a gated Markov Ran-
dom Field (MRF) because it uses one of its two sets of
latent variables to create an image-specific energy function
that models the covariance structure of the pixels by switch-
ing in sets of pairwise interactions. It uses its other set of
latent variables to model the intensities of the pixels [13].
The DBN then uses several further layers of Bernoulli
latent variables to model the statistical structure in the
hidden activities of the two sets of latent variables of the
gated MRF. By replicating features in the lower layers it
is possible to learn a very good generative model of high-
resolution images and to use this as a principled framework
for learning adaptive descriptors that turn out to be very
useful for discriminative tasks.
In the remainder of this paper, we first discuss our new
contributions with respect to our previous published work
and then describe the model in detail. In sec. 2 we review
other popular generative models of images and motivate
the need for the model we propose, the gated MRF. In
sec. 3, we describe the learning algorithm as well as the
inference procedure for the gated MRF. In order to capture
the dependencies between the latent variables of the gated
MRF, several other layers of latent variables can be added,
yielding a DBN with many layers, as described in sec. 4.
Such models cannot be scaled in a simple way to deal with
high-resolution images because the number of parameters
scales quadratically with the dimensionality of the input at
each layer. Therefore, in sec. 5 an efficient and effective
weight-sharing scheme is introduced. The key idea is to
replicate parameters across local neighborhoods that do
not overlap in order to accomplish a twofold goal: exploit
stationarity of images while limiting the redundancy of
latent variables encoding features at nearby image locations.
Finally, we present a thorough validation of the model in
sec. 6 with comparisons to other models on a variety of
image types and tasks.
1.1 Contributions
This paper is a coherent synthesis of previously unpublished
results with the authors’ previous work on gated MRFs [21],
[9], [13], [22] that has appeared in several recent conference
papers and is intended to serve as the main reference on the
topic, describing in a more organized and consistent way
the major ideas behind this probabilistic model, clarifying
the relationship between the mPoT and mcRBM models
described below, and providing more details (including
pseudo-code) about the learning algorithms and the exper-
imental evaluations. We have included a subsection on the
relation to other classical probabilistic models that should
help the reader better understand the advantages of the
gated MRF and the similarities to other well-known models.
The paper includes empirical evaluations of the model on
an unusually large variety of tasks, not only on image
denoising and generation tasks that are standard ways to
evaluate probabilistic generative models of natural images,
but also on three very different recognition tasks (scenes,
generic object recognition, and facial expressions under
occlusion). The paper demonstrates that the gated MRF can
be used for a wide range of different vision tasks, and it
should suggest many other tasks that can benefit from the
generative power of the model.
2 THE GATED MRF
In this section, we first review some of the most popu-
lar probabilistic models of images and discuss how their
Fig. 1. Toy illustration comparing different models on two-pixel
images (each panel plots the first pixel x_1 against the second pixel
x_2). Blue dots are a dataset of two-pixel images. The red dot is the
data point we want to represent. The green dot is its (mean)
reconstruction. The models are: Principal Component Analysis (PCA),
Probabilistic PCA (PPCA), Factor Analysis (FA), Sparse Coding (SC),
Product of Student's t (PoT) and mean PoT (mPoT).
underlying assumptions limit their modeling abilities. This
motivates the introduction of the model we propose. After
describing our basic model and its learning and inference
procedures, we show how we can make it hierarchical and
how we can scale it up using parameter-sharing to deal with
high-resolution images.
2.1 Relation to Other Probabilistic Models
Natural images live in a very high dimensional space that
has as many dimensions as there are pixels, easily on the
order of millions or more. Yet it is believed that they
occupy a tiny fraction of that space, due to the structure
of the world, lying on a much lower dimensional
yet highly non-linear manifold [23]. The ultimate goal of
unsupervised learning is to discover representations that pa-
rameterize such a manifold, and hence, capture the intrinsic
structure of the input data. This structure is represented
through features, also called latent variables in probabilistic
models.
One simple way to check whether a model extracts
features that retain information about the input is to
reconstruct the input itself from the features. If the
reconstruction errors of inputs similar to training samples
are lower than the reconstruction errors of other input data
points, then the model must have learned interesting regularities [24]. In
PCA, for instance, the mapping into feature space is a linear
projection into the leading principal components and the
reconstruction is performed by another linear projection.
The reconstruction is perfect only for those data points that
lie in the linear subspace spanned by the leading principal
components. The principal components are the structure
captured by this model.
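As a concrete illustration of this reconstruction test (a sketch, not code from the paper), the following numpy snippet fits the leading principal component of a toy dataset of two-pixel images and measures the reconstruction error; the dataset, noise scale, and variable names are illustrative assumptions:

```python
import numpy as np

# Hypothetical toy data: 500 two-pixel "images" lying near a 1-D subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1)) @ np.array([[1.0, 1.0]]) \
    + 0.1 * rng.normal(size=(500, 2))
X -= X.mean(axis=0)                      # center the data

# Leading principal component via the covariance eigendecomposition.
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
U = eigvecs[:, -1:]                      # top component (largest eigenvalue)

# Map into feature space and back: h = U^T x, x_hat = U h.
H = X @ U
X_hat = H @ U.T

# Points near the learned subspace reconstruct almost perfectly; the
# residual error measures how much structure the model failed to capture.
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
```

For data generated this way, `err` is close to the variance of the noise in the discarded direction, while inputs far from the learned subspace would reconstruct poorly.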
A probabilistic framework also provides a mapping into
feature, or latent variable, space and back to image
space. The former is obtained through the posterior
distribution over the latent variables, p(h|x), where x is
the input and h the latent variables; the latter through the
conditional distribution over the input, p(x|h).
Just as in PCA one would reconstruct the input from the
features in order to assess the quality of the encoding,
in a probabilistic setting we can analyze and compare
different models in terms of their conditional p(x|h). We
can sample the latent variables, h̄ ∼ p(h|x̄), given an input
image x̄, and then look at how well the image x̄ can be
reconstructed using p(x|h̄). Reconstructions produced in
this way are typically much more like real data than true
samples from the underlying generative model, because the
latent variables are sampled from their posterior distribution,
p(h|x̄), rather than from their prior, p(h); but the
reconstructions do provide insight into how much of the
information in the image is preserved in the sampled values
of the latent variables.
As shown in fig. 1, most models such as Probabilistic
Principal Component Analysis (PPCA) [25], Factor Analy-
sis (FA) [26], Independent Component Analysis (ICA) [27],
Sparse Coding (SC) [28], and Gaussian Restricted Boltz-
mann Machines (GRBM) [29], assume that the conditional
distribution of the pixels p(x|h) is Gaussian with a mean
determined by the latent variables and a fixed, image-
independent covariance matrix. In PPCA the mean of
the distribution lies along the directions of the leading
eigenvectors while in SC it is along a linear combination
of a very small number of basis vectors (represented by
black arrows in the figure). From a generative point of view,
these are rather poor assumptions for modeling natural
images because much of the interesting structure of natural
images lies in the fact that the covariance structure of the
pixels varies considerably from image to image. A vertical
occluding edge, for example, eliminates the typical strong
correlation between pixels on opposite sides of the edge.
This limitation is addressed by models like Product of
Student’s t (PoT) [30], covariance Restricted Boltzmann
Machine (cRBM) [21] and the model proposed by Karklin
and Lewicki [14] each of which instead assume a Gaussian
conditional distribution with a fixed mean but with a full
covariance determined by the states of the latent vari-
ables. Latent variables explicitly account for the correlation
patterns of the input pixels, avoiding interpolation across
edges while smoothing within uniform regions. The mean,
however, is fixed to the average of the input data vectors
across the whole dataset. As shown in the next section,
this can yield very poor conditional models of the input
distribution.
In this work, we extend these two classes of models with
a new model whose conditional distribution over the input
has both a mean and a covariance matrix determined by
latent variables. We will introduce two such models, namely
the mean PoT (mPoT) [13] and the mean-covariance RBM
(mcRBM) [9], which differ only in the choice of their
distribution over latent variables. We refer to these models
as gated MRFs because they are pair-wise Markov Random
Fields (MRFs) with latent variables gating the couplings
between input variables. Their marginal distribution can
be interpreted as a mixture of Gaussians with an infinite
(mPoT) or exponential (mcRBM) number of components,
each with non-zero mean and full covariance matrix and
Fig. 2. In the first column, each image is zero mean. In the
second column, the whole data set is centered but each image
can have non-zero mean. First row: 8x8 natural image patches and
contours of the empirical distribution of (tiny) two-pixel images (the
x-axis being the first pixel and the y-axis the second pixel). Second
row: images generated by a model that does not account for mean
intensity, with plots of how such a model could fit the distribution of
two-pixel images using a mixture of Gaussians whose components
can choose between two covariances. Third row: images generated
by a model that has both “mean” and “covariance” hidden units, and a
toy illustration of how such a model can fit the distribution of two-pixel
images, discovering the manifold of structured images (along the anti-
diagonal) using a mixture of Gaussians with arbitrary means and only
two covariances.
tied parameters.
2.2 Motivation
A Product of Student’s t (PoT) model [31] can be viewed as
modelling image-specific, pair-wise relationships between
pixel values by using the states of its latent variables. It
is very good at representing the fact that two pixels have
very similar intensities and no good at all at modelling what
these intensities are. Failure to model the mean also leads
to impoverished modelling of the covariances when the
input images have non-zero mean intensity. The covariance
RBM (cRBM) [21] is another model that shares the same
limitation since it only differs from PoT in the distribution
of its latent variables: The posterior over the latent variables
p(h|x) is a product of Bernoulli distributions instead of
Gamma distributions as in PoT.
We explain the fundamental limitation of these models
by using a simple toy example: Modelling two-pixel images
using a cRBM with only one binary latent variable (see
fig. 2). This cRBM assumes that the conditional distribution
over the input p(x|h) is a zero-mean Gaussian with a
covariance that is determined by the state of the latent
variable. Since the latent variable is binary, the cRBM can
be viewed as a mixture of two zero-mean full covariance
Gaussians. The latent variable uses the pairwise relationship
between pixels to decide which of the two covariance
matrices should be used to model each image. When the
input data is pre-processed by making each image have zero
mean intensity (the plot of the empirical histogram is shown
in the first row and first column), most images lie near the
origin because most of the time nearby pixels are strongly
correlated. Less frequently we encounter edge images that
exhibit strong anti-correlation between the pixels, as shown
by the long tails along the anti-diagonal line. A cRBM
could model this data by using two Gaussians (second row
and first column): one that is spherical and tight at the origin
for smooth images and another one that has a covariance
elongated along the anti-diagonal for structured images.
If, however, the whole set of images is normalized
by subtracting from every pixel the mean value of all
pixels over all images (first row and second column), the
cRBM fails at modelling structured images (second row
and second column). It can fit a Gaussian to the smooth
images by discovering the direction of strong correlation
along the main diagonal, but it is very likely to fail to
discover the direction of anti-correlation, which is crucial
to represent discontinuities, because structured images with
different mean intensity appear to be evenly spread over the
whole input space.
If the model has another set of latent variables that
can change the means of the Gaussian distributions in the
mixture (as explained more formally below and yielding
the mPoT and mcRBM models), then the model can rep-
resent both changes of mean intensity and the correlational
structure of pixels (see last row). The mean latent variables
effectively subtract off the relevant mean from each data-
point, letting the covariance latent variable capture the
covariance structure of the data. As before, the covariance
latent variable needs only to select between two covariance
matrices.
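To make this toy example concrete, here is a small numpy sketch (an illustration, not the authors' code) of such a two-component, zero-mean mixture: two hand-specified covariances play the roles available to the binary latent variable, and the posterior responsibility decides which covariance models each image. The specific covariance values are assumptions chosen for illustration:

```python
import numpy as np

def gauss_pdf(x, cov):
    """Zero-mean multivariate Gaussian density (written out for clarity)."""
    k = len(x)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** k * np.linalg.det(cov))
    return norm * np.exp(-0.5 * x @ np.linalg.inv(cov) @ x)

# Hand-specified covariances (illustrative assumptions, not learned values):
sigma_smooth = 0.05 * np.eye(2)                      # tight and spherical at the origin
sigma_edge = np.array([[1.0, -0.9], [-0.9, 1.0]])    # elongated along the anti-diagonal

def p_edge_given_x(x):
    """Posterior probability (with equal mixing proportions) that the binary
    latent variable selects the 'edge' covariance for image x = (x1, x2)."""
    ps, pe = gauss_pdf(x, sigma_smooth), gauss_pdf(x, sigma_edge)
    return pe / (ps + pe)

smooth_image = np.array([0.05, 0.06])   # nearby pixels nearly equal
edge_image = np.array([0.8, -0.8])      # strongly anti-correlated pixels: an edge

print(p_edge_given_x(smooth_image))     # low: the smooth component wins
print(p_edge_given_x(edge_image))       # high: the edge component wins
```

The responsibility acts exactly like the single binary precision unit in the toy cRBM: it uses the pairwise relationship between the two pixels to pick one of the two covariance matrices.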
In fact, experiments on real 8x8 image patches confirm
these conjectures. Fig. 2 shows samples drawn from PoT
and mPoT. The mPoT model (and similarly mcRBM [9])
is better at modelling zero mean images and much better
at modelling images that have non-zero mean intensity.
This will be particularly relevant when we introduce a
convolutional extension of the model to represent spatially
stationary high-resolution images (as opposed to small im-
age patches), since it will not be possible to independently
normalize overlapping image patches.
As we shall see in sec. 6.1, models that do not account
for mean intensity cannot generate realistic samples of
natural images since samples drawn from the conditional
distribution over the input have expected intensity that is
constant everywhere regardless of the value of the latent
variables. In the model we propose instead there is a set
of latent variables whose role is to bias the average mean
intensity differently in different regions of the input image.
Combined with the correlational structure provided by the
covariance latent variables, this produces smooth images
that have sharp boundaries between regions of different
mean intensity.
2.3 Energy Functions
We start the discussion assuming the input is a small
vectorized image patch, denoted by x ∈ R^D, and the
latent variables are denoted by the vector h^p ∈ {0, 1}^N.
First, we consider a pair-wise MRF defined in terms of
an energy function E. The probability density function is
related to E by: p(x, h^p) = exp(−E(x, h^p))/Z, where Z
is an (intractable) normalization constant which is called
the partition function. The energy is:

E(x, h^p) = 1/2 ∑_{i,j,k} t_{ijk} x_i x_j h^p_k    (1)

The states of the latent variables, called precision hidden
units, modulate the pair-wise interactions t_{ijk} between all
pairs of input variables x_i and x_j, with i, j = 1..D. Similarly
to Sejnowski [32], the energy function is defined in
terms of 3-way multiplicative interactions. Unlike previous
work by Memisevic and Hinton [33] on modeling image
transformations, here we use this energy function to model
the joint distribution of the variables within the vector x.
This way of allowing hidden units to modulate inter-
actions between input units has far too many parameters.
For real images we expect the required lateral interactions
to have a lot of regular structure. A hidden unit that
represents a vertical occluding edge, for example, needs
to modulate the lateral interactions so as to eliminate
horizontal interpolation of intensities in the region of the
edge. This regular structure can be approximated by writing
the 3-dimensional tensor of parameters t as a sum of outer
products: t_{ijk} = ∑_f C^{(1)}_{if} C^{(2)}_{jf} P_{fk}, where f is an index
over F deterministic factors, C^{(1)}, C^{(2)} ∈ R^{D×F}, and
P ∈ R^{F×N}. Since the factors are connected twice to the
same image through matrices C^{(1)} and C^{(2)}, it is natural to
tie their weights, further reducing the number of parameters
and yielding the final parameterization t_{ijk} = ∑_f C_{if} C_{jf} P_{fk}.
Thus, taking into account also the hidden biases, eq. 1
becomes:

E(x, h^p) = 1/2 ∑_{f=1}^{F} ( ∑_{k=1}^{N} P_{fk} h^p_k ) ( ∑_{i=1}^{D} C_{if} x_i )^2 − ∑_{k=1}^{N} b^p_k h^p_k    (2)
which can be written more compactly in matrix form as:

E(x, h^p) = 1/2 x^T C diag(P h^p) C^T x − b^{pT} h^p    (3)

where diag(v) is a diagonal matrix with diagonal entries
given by the elements of vector v. This model can be
interpreted as an instance of an RBM modeling pair-wise
interactions between the input pixels^1 and we dub it
covariance RBM (cRBM) [21], [9]^2 since it models the
covariance structure of the input through the “precision”
latent variables h^p.
The hidden units remain conditionally independent given
the states of the input units and their binary states can be
sampled using:
p(h^p_k = 1|x) = σ( −1/2 ∑_{f=1}^{F} P_{fk} ( ∑_{i=1}^{D} C_{if} x_i )^2 + b^p_k )    (4)
1. More precisely, this is an instance of a semi-restricted Boltzmann
machine [34], [35], since only hidden units are “restricted”, i.e. lack lateral
interactions.
2. This model should not be confused with the conditional RBM [36].
where σ is the logistic function σ(v) = 1/(1 + exp(−v)).
Given the states of the hidden units, the input units form
an MRF in which the effective pairwise interaction weight
between x_i and x_j is 1/2 ∑_f ∑_k P_{fk} h^p_k C_{if} C_{jf}. Therefore,
the conditional distribution over the input is:

p(x|h^p) = N(0, Σ), with Σ^{−1} = C diag(P h^p) C^T    (5)
Notice that the covariance matrix is not fixed, but is a
function of the states of the precision latent variables h^p.
In order to guarantee positive definiteness of the covariance
matrix we need to constrain P to be non-negative and add a
small quadratic regularization term to the energy function^3,
here ignored for clarity of presentation.
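The quantities in eqs. 3–5 are straightforward to compute. The following numpy sketch (an illustration under assumed toy dimensions and random parameters, not the authors' code) infers the precision units with eq. 4 and builds the image-specific inverse covariance of eq. 5, including the non-negativity constraint on P and the small regularization term just mentioned:

```python
import numpy as np

rng = np.random.default_rng(1)
D, F, N = 16, 32, 24          # pixels, factors, precision units (toy sizes)

C = rng.normal(scale=0.1, size=(D, F))
P = rng.uniform(size=(F, N))  # non-negative, as required for positive definiteness
b_p = np.zeros(N)
eps = 1e-3                    # small quadratic regularization term

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def infer_precision_units(x):
    # Eq. 4: p(h^p_k = 1 | x) = sigma(-1/2 sum_f P_fk (sum_i C_if x_i)^2 + b^p_k)
    filt_sq = (C.T @ x) ** 2                  # squared factor outputs (C^T x)_f^2
    return sigmoid(-0.5 * (P.T @ filt_sq) + b_p)

def inverse_covariance(h_p):
    # Eq. 5 plus the regularizer: Sigma^{-1} = C diag(P h^p) C^T + eps I
    return C @ np.diag(P @ h_p) @ C.T + eps * np.eye(D)

x = rng.normal(size=D)
q = infer_precision_units(x)                      # Bernoulli means, conditionally independent
h_p = (rng.uniform(size=N) < q).astype(float)     # one sample of the precision units
Sigma_inv = inverse_covariance(h_p)
```

Because P and h^p are non-negative, C diag(P h^p) C^T is positive semi-definite, and the eps I term keeps the resulting precision matrix strictly positive definite.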
As described in sec. 2.2, we want the conditional dis-
tribution over the pixels to be a Gaussian with not only
its covariance but also its mean depending on the states of
the latent variables. Since the product of a full covariance
Gaussian (like the one in eq. 5) with a spherical non-
zero mean Gaussian is a non-zero mean full covariance
Gaussian, we simply add the energy function of cRBM in
eq. 3 to the energy function of a GRBM [29], yielding:
E(x, h^m, h^p) = 1/2 x^T C diag(P h^p) C^T x − b^{pT} h^p + 1/2 x^T x − h^{mT} W^T x − b^{mT} h^m − b_x^T x    (6)
where h^m ∈ {0, 1}^M are called “mean” latent variables because
they contribute to control the mean of the conditional
distribution over the input:

p(x|h^m, h^p) = N( Σ(W h^m + b_x), Σ ), with Σ^{−1} = C diag(P h^p) C^T + I    (7)
where I is the identity matrix, W ∈ R^{D×M} is a matrix of
trainable parameters and b_x ∈ R^D is a vector of trainable
biases for the input variables. The posterior distribution
over the mean latent variables is^4:

p(h^m_k = 1|x) = σ( ∑_{i=1}^{D} W_{ik} x_i + b^m_k )    (8)
The overall model, whose joint probability density function
is proportional to exp(−E(x, h^m, h^p)), is called a mean
covariance RBM (mcRBM) [9] and is represented in fig. 3.
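To make the interaction between the two sets of latent variables concrete, this numpy sketch (illustrative toy sizes and random parameters, not the authors' code) evaluates the conditional mean and covariance of eq. 7 and the mean-unit posterior of eq. 8:

```python
import numpy as np

rng = np.random.default_rng(2)
D, F, N, M = 16, 32, 24, 20   # pixels, factors, precision units, mean units

# Hypothetical parameters; small scales keep the toy example well conditioned.
C = rng.normal(scale=0.1, size=(D, F))
P = rng.uniform(size=(F, N))
W = rng.normal(scale=0.1, size=(D, M))
b_x = np.zeros(D)
b_m = np.zeros(M)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

h_p = rng.integers(0, 2, size=N).astype(float)   # precision units (binary)
h_m = rng.integers(0, 2, size=M).astype(float)   # mean units (binary)

# Eq. 7: p(x | h^m, h^p) = N(Sigma (W h^m + b_x), Sigma),
#        with Sigma^{-1} = C diag(P h^p) C^T + I.
Sigma_inv = C @ np.diag(P @ h_p) @ C.T + np.eye(D)
Sigma = np.linalg.inv(Sigma_inv)
mean_x = Sigma @ (W @ h_m + b_x)   # mean units propose intensities, precision units smooth them

# Eq. 8: the mean-unit posterior is a logistic function of a linear filter bank.
x = rng.normal(size=D)
q_m = sigmoid(W.T @ x + b_m)       # p(h^m_k = 1 | x)
```

The product Σ(W h^m + b_x) is exactly the cooperation discussed below: W h^m + b_x sets region intensities, and multiplication by the image-specific Σ spreads them according to the pairwise correlations selected by h^p.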
The demonstration in fig. 4 is designed to illustrate
how the mean and precision latent variables cooperate to
represent the input. Through the precision latent variables
the model knows about pair-wise correlations in the image.
3. In practice, this term is not needed when the dimensionality of h^p
is larger than that of x.
4. Notice how the mean latent variables compute a non-linear projection
of a linear filter bank, akin to the most simplified “simple-cell” model
of area V1 of the visual cortex, while the precision units perform an
operation similar to the “complex-cell” model because rectified (squared)
filter outputs are non-linearly pooled to produce their response. In this
model, simple and complex cells perform their operations in parallel (not
sequentially). If, however, we equate the factors used by the precision
units to simple cells, we recover the standard model in which simple cells
send their squared outputs to complex cells.
Fig. 3. Graphical model representation (with only three input
variables): There are two sets of latent variables (the mean and the
precision units) that are conditionally independent given the input
pixels and a set of deterministic factor nodes that connect triplets of
variables (pairs of input variables and one precision unit).
Fig. 4. A) Input image patch. B) Reconstruction performed using
only mean hiddens (i.e. W h^m + b_x) (top) and both mean and precision
hiddens (bottom), that is, multiplying the patch on the top by the
image-specific covariance Σ = (C diag(P h^p) C^T + I)^{−1}, see mean
of Gaussian in eq. 7. C) Reconstructions produced by combining the
correct image-specific covariance as above with the incorrect, hand-
specified pixel intensities shown in the top row. Knowledge about
pair-wise dependencies allows a blob of high or low intensity to be
spread out over the appropriate region. D) Reconstructions produced
like in C) showing that precision hiddens do not account for polarity
(nor for the exact intensity values of regions) but only for correlations.
For instance, it knows that the pixels in the lower part of
the image in fig. 4-A are strongly correlated; these pixels
are likely to take the same value, but the precision latent
variables do not carry any information about which value
this is. Then, very noisy information about the values of the
individual pixels in the lower part of the image, as those
provided by the mean latent variables, would be sufficient
to reconstruct the whole region quite well, since the model
knows which values can be smoothed. Mathematically, the
interaction between mean and precision latent variables
is expressed by the product between Σ (which depends
only on h
p
) and W h
m
+ b
x
in the mean of the Gaussian
distribution of eq. 7. We can repeat the same argument
for the pixels in the top right corner and for those in the
middle part of the image as well. Fig. 4-C illustrates this
concept, while fig. 4-D shows that flipping the sign of
the reconstruction of the mean latent variables flips the
sign of the overall reconstruction as expected. Information
about intensity is propagated over each region thanks to the
pair-wise dependencies captured by the precision hidden
units. Fig. 4-B shows the same using the actual mean
intensity produced by the mean hidden units (top). The
reconstruction produced by the model using the whole set
of hidden units is very close to the input image, as can be
seen in the bottom part of fig. 4-B.
In a mcRBM the posterior distribution over the latent
variables p(h|x) is a product of Bernoullis, as shown in
eqs. 8 and 4. This distribution is particularly easy to use
in a standard DBN [20] where each layer is trained using
a binary-binary RBM. Binary latent variables, however,
are not very good at representing different real-values of

Figures
Citations
More filters
Journal ArticleDOI

Deep learning

TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Book

Deep Learning

TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Proceedings ArticleDOI

Context Encoders: Feature Learning by Inpainting

TL;DR: It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Posted Content

Context Encoders: Feature Learning by Inpainting

TL;DR: Context Encoders as mentioned in this paper is a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings, which can be used for semantic inpainting tasks, either stand-alone or as initialization for nonparametric methods.
Proceedings Article

Deep generative image models using a Laplacian pyramid of adversarial networks

TL;DR: A generative parametric model capable of producing high quality samples of natural images using a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion.
References
More filters
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Journal ArticleDOI

Distinctive Image Features from Scale-Invariant Keypoints

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Journal ArticleDOI

Gradient-based learning applied to document recognition

TL;DR: In this article, a graph transformer network (GTN) is proposed for handwritten character recognition, which can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters.
Proceedings ArticleDOI

Histograms of oriented gradients for human detection

TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Journal ArticleDOI

Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images

TL;DR: The analogy between images and statistical mechanics systems is made and the analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations, creating a highly parallel ``relaxation'' algorithm for MAP estimation.
Frequently Asked Questions (17)
Q1. What are the contributions mentioned in the paper "Modeling Natural Images Using Gated MRFs"?

This paper describes a Markov Random Field for real-valued image modeling that has two sets of latent variables. Furthermore, the latent variables of the model can be inferred efficiently and can be used as very effective descriptors in recognition tasks. 

Finally, a very promising research avenue is to extend the model to video sequences in which the temporal regularities created by smoothly changing viewing transformations should make it far easier to learn to model depth, three-dimensional transformations and occlusion [ 71 ]. Feedforward inference in their hierarchical generative model can be viewed as a type of variational approximation that is only exactly correct for the top layer, but the inference for the lower layers is a very good approximation because of the way they are learned [ 20 ]. 

The most commonly used task to quantitatively validate a generative model of natural images is image denoising, assuming homogeneous additive Gaussian noise of known variance [10], [11], [37], [58], [12]. 

All layers are trained using FPCD but, as training proceeds, the number of Markov chain steps between weight updates is increased from 1 to 100 at the topmost layer in order to obtain a better approximation to the maximum likelihood gradient. 

The correct sampling procedure [20] consists of generating a sample from the topmost RBM, followed by back-projection to image space through the chain of conditional distributions for each layer given the layer above. 
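
This procedure can be sketched as follows (a minimal illustration, not the authors' code; `top_rbm_gibbs` and the per-layer `(W, b)` pairs are hypothetical stand-ins for the trained model, and each lower layer is assumed to be sampled from a conditional Bernoulli):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_dbn(top_rbm_gibbs, down_weights, n_gibbs=100):
    """Ancestral sampling from a DBN (sketch).

    top_rbm_gibbs: callable running n Gibbs steps in the topmost RBM
    and returning a binary sample of its visible units.
    down_weights: list of (W, b) pairs for the layers below, applied
    top-down; each layer is sampled from its conditional distribution
    given the layer above.
    """
    h = top_rbm_gibbs(n_gibbs)                    # sample from the topmost RBM
    for W, b in down_weights:                     # back-project layer by layer
        p = sigmoid(h @ W + b)                    # p(h_below | h_above)
        h = (rng.random(p.shape) < p).astype(float)
    return h
```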

The discriminative training consisted of training a linear multi-class logistic regression classifier on the top level representation without using back-propagation to jointly optimize the parameters across all layers. 
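
A minimal numpy sketch of this setup, where the top-level representation is treated as a fixed feature matrix and only a linear multi-class logistic regression (softmax) classifier is trained on top of it (the feature and label arrays in the test are illustrative stand-ins):

```python
import numpy as np

def train_softmax_classifier(feats, labels, n_classes, lr=0.1, steps=500):
    """Linear multi-class logistic regression on fixed features (sketch).

    No gradient flows back into the layers that produced `feats`,
    mirroring the absence of joint back-propagation.
    """
    W = np.zeros((feats.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p = p / p.sum(axis=1, keepdims=True)                 # softmax probabilities
        W = W - lr * (feats.T @ (p - onehot)) / len(feats)   # cross-entropy gradient
        b = b - lr * (p - onehot).mean(axis=0)
    return W, b
```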

If k different local filters are replicated over all possible integer positions in the image, the representation will be about k times overcomplete. 

Since the input images have fairly low resolution and the statistics across the images are strongly non-stationary (because the faces have been aligned), the authors trained a deep model without weight-sharing. 

If the sum of the kinetic and potential energy rises by ∆ due to inaccurate simulation of the dynamics, the system is returned to the initial state with probability 1 − exp(−∆). 
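
This Metropolis-style correction can be sketched as follows (a minimal illustration; the `energy` callable is a hypothetical stand-in for the sum of kinetic and potential energy of a state):

```python
import math
import random

def hmc_accept(initial_state, proposed_state, energy):
    """Metropolis correction for Hybrid Monte Carlo (sketch).

    If the total (kinetic + potential) energy rises by delta due to
    inaccurate simulation of the dynamics, the system is returned to
    the initial state with probability 1 - exp(-delta); if the energy
    does not rise, the proposal is always accepted.
    """
    delta = energy(proposed_state) - energy(initial_state)
    if delta <= 0 or random.random() < math.exp(-delta):
        return proposed_state   # accept the proposal
    return initial_state        # reject: return to the initial state
```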

In order to draw an unbiased sample from the deep model, the authors then map the second layer sample produced in this way through the conditional distributions p(hm|h2) and p(hp|h2) to sample the mean and precision latent variables. 

The update rule for gradient ascent in the likelihood is:

θ ← θ + η ( ⟨∂F/∂θ⟩_model − ⟨∂F/∂θ⟩_data )   (13)

where ⟨·⟩ denotes an expectation over samples from the model or the training data. 
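
A minimal numpy sketch of one such update step, where `grad_F` is a hypothetical per-sample gradient of the free energy and each expectation is approximated by an average over samples:

```python
import numpy as np

def update_step(theta, grad_F, model_samples, data_samples, lr=0.01):
    """One gradient-ascent step on the likelihood (Eq. 13, sketch).

    grad_F(theta, x) returns dF/dtheta for a single sample x; the
    model and data expectations are approximated by sample averages.
    """
    model_term = np.mean([grad_F(theta, x) for x in model_samples], axis=0)
    data_term = np.mean([grad_F(theta, x) for x in data_samples], axis=0)
    return theta + lr * (model_term - data_term)
```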

The latent representation in the higher layers is able to capture longer range structure and it does a better job at filling in the missing pixels. 

In order to fill in, the authors initialize the missing pixels at zero and propagate the occluded image through the four layers using the sequence of posterior expectations. 
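
The fill-in procedure can be sketched as one up-down pass (a minimal illustration; `encode` and `decode` are hypothetical stand-ins for the chained posterior expectations of the four layers and their top-down counterparts):

```python
import numpy as np

def fill_in(image, mask, encode, decode):
    """Fill in occluded pixels (sketch).

    image:  observed image; occluded entries may hold arbitrary values.
    mask:   boolean array, True where the pixel is observed.
    encode: maps an image to top-layer posterior expectations.
    decode: maps top-layer expectations back to an image.
    """
    x = np.where(mask, image, 0.0)       # initialize missing pixels at zero
    h = encode(x)                        # upward pass: posterior expectations
    recon = decode(h)                    # downward pass: reconstruction
    return np.where(mask, image, recon)  # keep observed pixels, fill the rest
```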

A cRBM could model this data by using two Gaussians (second row and first column): one that is spherical and tight at the origin for smooth images and another one that has a covariance elongated along the anti-diagonal for structured images. 

However, the variational bound on the likelihood that is improved as each layer is added assumes this form of incorrect inference, so the learning ensures that it works well. 

This will be particularly relevant when the authors introduce a convolutional extension of the model to represent spatially stationary high-resolution images (as opposed to small image patches), since it will not be possible to independently normalize overlapping image patches. 

Since the number of latent variables scales with the number of input variables, the number of parameters subject to learning scales quadratically with the size of the input, making learning infeasibly slow.