
Extracting and Composing Robust Features with
Denoising Autoencoders
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol
Dept. IRO, Université de Montréal
C.P. 6128, Montreal, Qc, H3C 3J7, Canada
http://www.iro.umontreal.ca/lisa
Technical Report 1316, February 2008
Abstract
Previous work has shown that the difficulties in learning deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps inputs to useful intermediate representations. We introduce and motivate a new training principle for unsupervised learning of a representation based on the idea of making the learned representations robust to partial corruption of the input pattern. This approach can be used to train autoencoders, and these denoising autoencoders can be stacked to initialize deep architectures. The algorithm can be motivated from a manifold learning and information theoretic perspective or from a generative model perspective. Comparative experiments clearly show the surprising advantage of corrupting the input of autoencoders on a pattern classification benchmark suite.
1 Introduction
Recent theoretical studies indicate that deep architectures (Bengio & Le Cun, 2007; Bengio, 2007) may be needed to efficiently model complex distributions and achieve better generalization performance on challenging recognition tasks. The belief that additional levels of functional composition will yield increased representational and modeling power is not new (McClelland et al., 1986; Hinton, 1989; Utgoff & Stracuzzi, 2002). However, in practice, learning in deep architectures has proven to be difficult. One needs only to ponder the difficult problem of inference in deep directed graphical models, due to "explaining away". Also looking back at the history of multi-layer neural networks, their difficult optimization (Bengio et al., 2007; Bengio, 2007) has long prevented reaping the expected benefits of going beyond one or two hidden layers. However, this situation has recently changed with the successful approach of (Hinton et al., 2006; Hinton & Salakhutdinov, 2006; Bengio et al., 2007; Ranzato et al., 2007; Lee et al., 2008) for training Deep Belief Networks and stacked autoencoders.

One key ingredient to this success appears to be the use of an unsupervised training criterion to perform a layer-by-layer initialization: each layer is at first trained to produce a higher level (hidden) representation of the observed patterns, based on the representation it receives as input from the layer below, by optimizing a local unsupervised criterion. Each level produces a representation of the input pattern that is more abstract than the previous level's, because it is obtained by composing more operations. This initialization yields a starting point, from which a global fine-tuning of the model's parameters is then performed using another training criterion appropriate for the task at hand. This technique has been shown empirically to avoid getting stuck in the kind of poor solutions one typically reaches with random initializations. While unsupervised learning of a mapping that produces "good" intermediate representations of the input pattern seems to be key, little is understood regarding what constitutes "good" representations for initializing deep architectures, or what explicit criteria may guide learning such representations. We know of only a few algorithms that seem to work well for this purpose: Restricted Boltzmann Machines (RBMs) trained with contrastive divergence on one hand, and various types of autoencoders on the other.
The present research begins with the question of what explicit criteria a good intermediate representation should satisfy. Obviously, it should at a minimum retain a certain amount of "information" about its input, while at the same time being constrained to a given form (e.g. a real-valued vector of a given size in the case of an autoencoder). A supplemental criterion that has been proposed for such models is sparsity of the representation (Ranzato et al., 2008; Lee et al., 2008). Here we hypothesize and investigate an additional specific criterion: robustness to partial destruction of the input, i.e., partially destroyed inputs should yield almost the same representation. It is motivated by the following informal reasoning: a good representation is expected to capture stable structures in the form of dependencies and regularities characteristic of the (unknown) distribution of its observed input. For high dimensional redundant input (such as images) at least, such structures are likely to depend on evidence gathered from a combination of many input dimensions. They should thus be recoverable from partial observation only. A hallmark of this is our human ability to recognize partially occluded or corrupted images. Further evidence is our ability to form a high level concept associated to multiple modalities (such as image and sound) and recall it even when some of the modalities are missing.

To validate our hypothesis and assess its usefulness as one of the guiding principles in learning deep architectures, we propose a modification to the autoencoder framework to explicitly integrate robustness to partially destroyed inputs. Section 2 describes the algorithm in detail. Section 3 discusses links with other approaches in the literature. Section 4 is devoted to a closer inspection of the model from different theoretical standpoints. In section 5 we verify empirically whether the algorithm leads to a difference in performance. Section 6 concludes the study.

2 Description of the Algorithm
2.1 Notation and Setup
Let $X$ and $Y$ be two random variables with joint probability density $p(X, Y)$, with marginal distributions $p(X)$ and $p(Y)$. Throughout the text, we will use the following notation. Expectation: $\mathbb{E}_{p(X)}[f(X)] = \int p(x) f(x)\,dx$. Entropy: $\mathbb{H}(X) = \mathbb{H}(p) = \mathbb{E}_{p(X)}[-\log p(X)]$. Conditional entropy: $\mathbb{H}(X|Y) = \mathbb{E}_{p(X,Y)}[-\log p(X|Y)]$. Kullback-Leibler divergence: $\mathbb{D}_{\mathrm{KL}}(p\|q) = \mathbb{E}_{p(X)}\left[\log \frac{p(X)}{q(X)}\right]$. Cross-entropy: $\mathbb{H}(p\|q) = \mathbb{E}_{p(X)}[-\log q(X)] = \mathbb{H}(p) + \mathbb{D}_{\mathrm{KL}}(p\|q)$. Mutual information: $\mathbb{I}(X; Y) = \mathbb{H}(X) - \mathbb{H}(X|Y)$. Sigmoid: $s(x) = \frac{1}{1 + e^{-x}}$ and $s(\mathbf{x}) = (s(x_1), \ldots, s(x_d))^T$. Bernoulli distribution with mean $\mu$: $\mathcal{B}_{\mu}(x)$, and by extension $\mathcal{B}_{\boldsymbol{\mu}}(\mathbf{x}) = (\mathcal{B}_{\mu_1}(x_1), \ldots, \mathcal{B}_{\mu_d}(x_d))$.
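As a small numerical sanity check of the cross-entropy decomposition above (a minimal sketch of our own, not part of the original text; the Bernoulli parameters are arbitrary), one can verify that $\mathbb{H}(p\|q) = \mathbb{H}(p) + \mathbb{D}_{\mathrm{KL}}(p\|q)$ for two Bernoulli distributions:

```python
import numpy as np

def bernoulli_entropy(p):
    # H(p) = -E_p[log p(X)] for a Bernoulli(p) variable
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bernoulli_kl(p, q):
    # D_KL(p || q) between two Bernoulli distributions
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def bernoulli_cross_entropy(p, q):
    # H(p || q) = -E_p[log q(X)]
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

p, q = 0.3, 0.7  # arbitrary example parameters
assert np.isclose(bernoulli_cross_entropy(p, q),
                  bernoulli_entropy(p) + bernoulli_kl(p, q))
```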
The setup we consider is the typical supervised learning setup with a training set of $n$ (input, target) pairs $D_n = \{(\mathbf{x}^{(1)}, t^{(1)}), \ldots, (\mathbf{x}^{(n)}, t^{(n)})\}$, that we suppose to be an i.i.d. sample from an unknown distribution $q(X, T)$ with corresponding marginals $q(X)$ and $q(T)$.
2.2 The Basic Autoencoder
We begin by recalling the traditional autoencoder model such as the one used in (Bengio et al., 2007) to build deep networks. An autoencoder takes an input vector $\mathbf{x} \in [0,1]^d$, and first maps it to a hidden representation $\mathbf{y} \in [0,1]^{d'}$ through a deterministic mapping $\mathbf{y} = f_\theta(\mathbf{x}) = s(\mathbf{W}\mathbf{x} + \mathbf{b})$, parameterized by $\theta = \{\mathbf{W}, \mathbf{b}\}$. $\mathbf{W}$ is a $d' \times d$ weight matrix and $\mathbf{b}$ is a bias vector. The resulting latent representation $\mathbf{y}$ is then mapped back to a "reconstructed" vector $\mathbf{z} \in [0,1]^d$ in input space, $\mathbf{z} = g_{\theta'}(\mathbf{y}) = s(\mathbf{W}'\mathbf{y} + \mathbf{b}')$ with $\theta' = \{\mathbf{W}', \mathbf{b}'\}$. The weight matrix $\mathbf{W}'$ of the reverse mapping may optionally be constrained by $\mathbf{W}' = \mathbf{W}^T$, in which case the autoencoder is said to have tied weights. Each training $\mathbf{x}^{(i)}$ is thus mapped to a corresponding $\mathbf{y}^{(i)}$ and a reconstruction $\mathbf{z}^{(i)}$. The parameters of this model are optimized to minimize the average reconstruction error:

$$\theta^\star, \theta'^\star = \arg\min_{\theta,\theta'} \frac{1}{n} \sum_{i=1}^{n} L\left(\mathbf{x}^{(i)}, \mathbf{z}^{(i)}\right) = \arg\min_{\theta,\theta'} \frac{1}{n} \sum_{i=1}^{n} L\left(\mathbf{x}^{(i)}, g_{\theta'}(f_\theta(\mathbf{x}^{(i)}))\right) \quad (1)$$

where $L$ is a loss function such as the traditional squared error $L(\mathbf{x}, \mathbf{z}) = \|\mathbf{x} - \mathbf{z}\|^2$.
An alternative loss, suggested by the interpretation of $\mathbf{x}$ and $\mathbf{z}$ as either bit vectors or vectors of bit probabilities (Bernoullis), is the reconstruction cross-entropy:

$$L_{\mathbb{H}}(\mathbf{x}, \mathbf{z}) = \mathbb{H}(\mathcal{B}_{\mathbf{x}} \| \mathcal{B}_{\mathbf{z}}) = -\sum_{k=1}^{d} \left[ x_k \log z_k + (1 - x_k) \log(1 - z_k) \right] \quad (2)$$
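To make the mapping concrete, here is a minimal numpy sketch of the forward pass and of the two reconstruction losses above (squared error and cross-entropy). It is an illustrative implementation under our own naming conventions, not the authors' code; tied weights ($\mathbf{W}' = \mathbf{W}^T$) and arbitrary layer sizes are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, d_prime = 784, 256                      # input and hidden sizes (illustrative)
W = rng.normal(0, 0.01, (d_prime, d))      # encoder weights
b = np.zeros(d_prime)                      # encoder bias
b_prime = np.zeros(d)                      # decoder bias (tied weights: W' = W.T)

def encode(x):
    # y = f_theta(x) = s(Wx + b)
    return sigmoid(W @ x + b)

def decode(y):
    # z = g_theta'(y) = s(W'y + b'), with W' = W.T (tied weights)
    return sigmoid(W.T @ y + b_prime)

def squared_error(x, z):
    return np.sum((x - z) ** 2)

def cross_entropy(x, z, eps=1e-12):
    # eq. (2): -sum_k [ x_k log z_k + (1 - x_k) log(1 - z_k) ]
    z = np.clip(z, eps, 1 - eps)
    return -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))

x = rng.random(d)                          # a toy input in [0, 1]^d
z = decode(encode(x))
print(squared_error(x, z), cross_entropy(x, z))
```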

Figure 1: An example $\mathbf{x}$ is corrupted to $\tilde{\mathbf{x}}$ via the corruption process $q_D$. The autoencoder then maps it to $\mathbf{y}$ (through $f_\theta$) and attempts to reconstruct $\mathbf{x}$ (through $g_{\theta'}$).
Note that if $\mathbf{x}$ is a binary vector, $L_{\mathbb{H}}(\mathbf{x}, \mathbf{z})$ is a negative log-likelihood for the example $\mathbf{x}$, given the Bernoulli parameters $\mathbf{z}$. Equation 1 with $L = L_{\mathbb{H}}$ can be written

$$\theta^\star, \theta'^\star = \arg\min_{\theta,\theta'} \mathbb{E}_{q^0(X)}\left[ L_{\mathbb{H}}\left(X, g_{\theta'}(f_\theta(X))\right) \right] \quad (3)$$

where $q^0(X)$ denotes the empirical distribution associated to our $n$ training inputs. This optimization will typically be carried out by stochastic gradient descent.
2.3 The Denoising Autoencoder
To test our hypothesis and enforce robustness to partially destroyed inputs, we modify the basic autoencoder we just described. We will now train it to reconstruct a clean "repaired" input from a corrupted, partially destroyed one. This is done by first corrupting the initial input $\mathbf{x}$ to get a partially destroyed version $\tilde{\mathbf{x}}$ by means of a stochastic mapping $\tilde{\mathbf{x}} \sim q_D(\tilde{\mathbf{x}}|\mathbf{x})$. In our experiments, we considered the following corrupting process, parameterized by the desired proportion $\nu$ of "destruction": for each input $\mathbf{x}$, a fixed number $\nu d$ of components are chosen at random, and their value is forced to 0, while the others are left untouched. The procedure can be viewed as replacing a component considered missing by a default value, which is a common technique. A motivation for zeroing the destroyed components is that it simulates the removal of these components from the input. For images on a white (0) background, this corresponds to "salt noise". Note that alternative corrupting noises could be considered.¹
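A minimal sketch of the corruption process just described (our own illustrative code, not the authors'): for each input, a fixed number $\nu d$ of components is chosen uniformly at random and forced to 0.

```python
import numpy as np

def corrupt(x, nu, rng):
    """Zero out a fixed number nu*d of randomly chosen components of x."""
    d = x.shape[0]
    n_destroyed = int(nu * d)
    idx = rng.choice(d, size=n_destroyed, replace=False)
    x_tilde = x.copy()
    x_tilde[idx] = 0.0
    return x_tilde

rng = np.random.default_rng(0)
x = rng.random(784)
x_tilde = corrupt(x, nu=0.25, rng=rng)   # destroy 25% of the components
```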
The corrupted input $\tilde{\mathbf{x}}$ is then mapped, as with the basic autoencoder, to a hidden representation $\mathbf{y} = f_\theta(\tilde{\mathbf{x}}) = s(\mathbf{W}\tilde{\mathbf{x}} + \mathbf{b})$ from which we reconstruct a $\mathbf{z} = g_{\theta'}(\mathbf{y}) = s(\mathbf{W}'\mathbf{y} + \mathbf{b}')$ (see figure 1 for a schematic representation of the process). As before the parameters are trained to minimize the average reconstruction error $L_{\mathbb{H}}(\mathbf{x}, \mathbf{z}) = \mathbb{H}(\mathcal{B}_{\mathbf{x}} \| \mathcal{B}_{\mathbf{z}})$ over a training set, i.e. to have $\mathbf{z}$ as close as possible to the uncorrupted input $\mathbf{x}$. But the key difference is that $\mathbf{z}$ is now a deterministic function of $\tilde{\mathbf{x}}$ rather than $\mathbf{x}$, and thus the result of a stochastic mapping of $\mathbf{x}$.

Let us define the joint distribution

$$q^0(X, \widetilde{X}, Y) = q^0(X)\, q_D(\widetilde{X}|X)\, \delta_{f_\theta(\widetilde{X})}(Y) \quad (4)$$

¹ The approach we describe and our analysis are not specific to a particular kind of corrupting noise.

where $\delta_u(v)$ puts mass 0 when $u \neq v$. Thus $Y$ is a deterministic function of $\widetilde{X}$. $q^0(X, \widetilde{X}, Y)$ is parameterized by $\theta$. The objective function minimized by stochastic gradient descent becomes:

$$\arg\min_{\theta,\theta'} \mathbb{E}_{q^0(X, \widetilde{X})}\left[ L_{\mathbb{H}}\left(X, g_{\theta'}(f_\theta(\widetilde{X}))\right) \right]. \quad (5)$$

So from the point of view of the stochastic gradient descent algorithm, in addition to picking an input sample from the training set, we will also produce a random corrupted version of it, and take a gradient step towards reconstructing the uncorrupted version from the corrupted version. Note that in this way, the autoencoder cannot learn the identity, unlike the basic autoencoder, thus removing the constraint that $d' < d$ or the need to regularize specifically to avoid such a trivial solution.
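Putting the pieces together, a single stochastic gradient step on the criterion of eq. 5 can be sketched as follows. This is an illustrative implementation of our own, with hand-derived gradients for the sigmoid/cross-entropy case and tied weights; it reuses the toy `sigmoid` and `corrupt` helpers from the earlier sketches, and the learning rate and noise level are arbitrary.

```python
import numpy as np

def denoising_sgd_step(x, W, b, b_prime, nu, lr, rng):
    """One SGD step on eq. (5): corrupt x, encode, decode, and step
    towards reconstructing the *uncorrupted* x (sigmoid units,
    cross-entropy loss, tied weights W' = W.T)."""
    x_tilde = corrupt(x, nu, rng)                 # x_tilde ~ q_D(x_tilde | x)
    y = sigmoid(W @ x_tilde + b)                  # hidden representation
    z = sigmoid(W.T @ y + b_prime)                # reconstruction
    # Gradients of L_H(x, z) for sigmoid + cross-entropy:
    delta_out = z - x                             # dL/d(pre-activation of z)
    delta_hid = (W @ delta_out) * y * (1 - y)     # dL/d(pre-activation of y)
    grad_W = np.outer(delta_hid, x_tilde) + np.outer(y, delta_out)
    W -= lr * grad_W
    b -= lr * delta_hid
    b_prime -= lr * delta_out
    return W, b, b_prime
```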
2.4 Layer-wise Initialization and Fine Tuning
The basic autoencoder has been used as a building block to train deep networks (Bengio et al., 2007), with the representation of the k-th layer used as input for the (k+1)-th, and the (k+1)-th layer trained after the k-th has been trained. After a few layers have been trained, the parameters are used as initialization for a network optimized with respect to a supervised training criterion. This greedy layer-wise procedure has been shown to yield significantly better local minima than random initialization of deep networks (Bengio et al., 2007), achieving better generalization on a number of tasks (Larochelle et al., 2007).

The procedure to train a deep network using the denoising autoencoder is similar. The only difference is how each layer is trained, i.e., to minimize the criterion in eq. 5 instead of eq. 3. Note that the corruption process $q_D$ is only used during training, but not for propagating representations from the raw input to higher-level representations. Note also that when layer k is trained, it receives as input the uncorrupted output of the previous layers.
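As a sketch of this greedy layer-wise procedure (again illustrative and our own, building on the hypothetical `denoising_sgd_step` and `sigmoid` helpers above): each layer is trained as a denoising autoencoder on the uncorrupted outputs of the layer below, and corruption is applied only to that layer's own input during its training.

```python
import numpy as np

def train_denoising_layer(X, d_prime, nu, lr, epochs, rng):
    """Train one denoising autoencoder layer on a data matrix X (n x d)."""
    d = X.shape[1]
    W = rng.normal(0, 0.01, (d_prime, d))
    b, b_prime = np.zeros(d_prime), np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):
            W, b, b_prime = denoising_sgd_step(X[i], W, b, b_prime, nu, lr, rng)
    return W, b

def stack_layers(X, layer_sizes, nu, lr, epochs, rng):
    """Greedy layer-wise pretraining: corruption is used only while training
    a layer; representations are propagated upward without corruption."""
    params, H = [], X
    for d_prime in layer_sizes:
        W, b = train_denoising_layer(H, d_prime, nu, lr, epochs, rng)
        params.append((W, b))
        H = sigmoid(H @ W.T + b)   # uncorrupted input for the next layer
    return params                  # used to initialize a deep net for supervised fine-tuning
```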
3 Relationship to Other Approaches
Our training procedure for the denoising autoencoder involves learning to recover a clean input from a corrupted version, a task known as denoising. The problem of image denoising, in particular, has been extensively studied in the image processing community and many recent developments rely on machine learning approaches (see e.g. Roth and Black (2005); Elad and Aharon (2006); Hammond and Simoncelli (2007)). A particular form of gated autoencoders has also been used for denoising in Memisevic (2007). Denoising using autoencoders was actually introduced much earlier (LeCun, 1987; Gallinari et al., 1987), as an alternative to Hopfield models (Hopfield, 1982). Our objective however is fundamentally different from that of developing a competitive image denoising algorithm. We investigate explicit robustness to corrupting noise only as a criterion to guide the learning of suitable intermediate representations, with the

References

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), 2554–2558.

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.