
Extracting and Composing Robust Features with
Denoising Autoencoders
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol
Dept. IRO, Université de Montréal
C.P. 6128, Montreal, Qc, H3C 3J7, Canada
http://www.iro.umontreal.ca/lisa
Technical Report 1316, February 2008
Abstract
Previous work has shown that the difficulties in learning deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps inputs to useful intermediate representations. We introduce and motivate a new training principle for unsupervised learning of a representation based on the idea of making the learned representations robust to partial corruption of the input pattern. This approach can be used to train autoencoders, and these denoising autoencoders can be stacked to initialize deep architectures. The algorithm can be motivated from a manifold learning and information theoretic perspective or from a generative model perspective. Comparative experiments clearly show the surprising advantage of corrupting the input of autoencoders on a pattern classification benchmark suite.
1 Introduction
Recent theoretical studies indicate that deep architectures (Bengio & Le Cun, 2007; Bengio, 2007) may be needed to efficiently model complex distributions and achieve better generalization performance on challenging recognition tasks. The belief that additional levels of functional composition will yield increased representational and modeling power is not new (McClelland et al., 1986; Hinton, 1989; Utgoff & Stracuzzi, 2002). However, in practice, learning in deep architectures has proven to be difficult. One needs only to ponder the difficult problem of inference in deep directed graphical models, due to "explaining away". Also looking back at the history of multi-layer neural networks, their difficult optimization (Bengio et al., 2007; Bengio, 2007) has long prevented reaping the expected benefits of going beyond one or two hidden layers. However, this situation has recently changed with the successful approach of (Hinton et al., 2006; Hinton & Salakhutdinov, 2006; Bengio et al., 2007; Ranzato et al., 2007; Lee et al., 2008) for training Deep Belief Networks and stacked autoencoders.

One key ingredient to this success appears to be the use of an unsupervised training criterion to perform a layer-by-layer initialization: each layer is at first trained to produce a higher level (hidden) representation of the observed patterns, based on the representation it receives as input from the layer below, by optimizing a local unsupervised criterion. Each level produces a representation of the input pattern that is more abstract than the previous level's, because it is obtained by composing more operations. This initialization yields a starting point, from which a global fine-tuning of the model's parameters is then performed using another training criterion appropriate for the task at hand. This technique has been shown empirically to avoid getting stuck in the kind of poor solutions one typically reaches with random initializations. While unsupervised learning of a mapping that produces "good" intermediate representations of the input pattern seems to be key, little is understood regarding what constitutes "good" representations for initializing deep architectures, or what explicit criteria may guide learning such representations. We know of only a few algorithms that seem to work well for this purpose: Restricted Boltzmann Machines (RBMs) trained with contrastive divergence on one hand, and various types of autoencoders on the other.
The present research begins with the question of what explicit criteria a good intermediate representation should satisfy. Obviously, it should at a minimum retain a certain amount of "information" about its input, while at the same time being constrained to a given form (e.g. a real-valued vector of a given size in the case of an autoencoder). A supplemental criterion that has been proposed for such models is sparsity of the representation (Ranzato et al., 2008; Lee et al., 2008). Here we hypothesize and investigate an additional specific criterion: robustness to partial destruction of the input, i.e., partially destroyed inputs should yield almost the same representation. It is motivated by the following informal reasoning: a good representation is expected to capture stable structures in the form of dependencies and regularities characteristic of the (unknown) distribution of its observed input. For high dimensional redundant input (such as images) at least, such structures are likely to depend on evidence gathered from a combination of many input dimensions. They should thus be recoverable from partial observation only. A hallmark of this is our human ability to recognize partially occluded or corrupted images. Further evidence is our ability to form a high level concept associated to multiple modalities (such as image and sound) and recall it even when some of the modalities are missing.

To validate our hypothesis and assess its usefulness as one of the guiding principles in learning deep architectures, we propose a modification to the autoencoder framework to explicitly integrate robustness to partially destroyed inputs. Section 2 describes the algorithm in detail. Section 3 discusses links with other approaches in the literature. Section 4 is devoted to a closer inspection of the model from different theoretical standpoints. In section 5 we verify empirically whether the algorithm leads to a difference in performance. Section 6 concludes the study.

2 Description of the Algorithm
2.1 Notation and Setup
Let $X$ and $Y$ be two random variables with joint probability density $p(X, Y)$, with marginal distributions $p(X)$ and $p(Y)$. Throughout the text, we will use the following notation. Expectation: $\mathbb{E}_{p(X)}[f(X)] = \int p(x) f(x)\,dx$. Entropy: $\mathbb{H}(X) = \mathbb{H}(p) = \mathbb{E}_{p(X)}[-\log p(X)]$. Conditional entropy: $\mathbb{H}(X|Y) = \mathbb{E}_{p(X,Y)}[-\log p(X|Y)]$. Kullback-Leibler divergence: $\mathbb{D}_{\mathrm{KL}}(p\|q) = \mathbb{E}_{p(X)}\left[\log \frac{p(X)}{q(X)}\right]$. Cross-entropy: $\mathbb{H}(p\|q) = \mathbb{E}_{p(X)}[-\log q(X)] = \mathbb{H}(p) + \mathbb{D}_{\mathrm{KL}}(p\|q)$. Mutual information: $\mathbb{I}(X; Y) = \mathbb{H}(X) - \mathbb{H}(X|Y)$. Sigmoid: $s(x) = \frac{1}{1 + e^{-x}}$ and $s(\mathbf{x}) = (s(x_1), \ldots, s(x_d))^T$. Bernoulli distribution with mean $\mu$: $\mathcal{B}_{\mu}(x)$, and by extension $\mathcal{B}_{\boldsymbol{\mu}}(\mathbf{x}) = (\mathcal{B}_{\mu_1}(x_1), \ldots, \mathcal{B}_{\mu_d}(x_d))$.
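As a small numerical sanity check of the cross-entropy decomposition above (a minimal sketch of our own, not part of the original text; the Bernoulli parameters are arbitrary), one can verify that $\mathbb{H}(p\|q) = \mathbb{H}(p) + \mathbb{D}_{\mathrm{KL}}(p\|q)$ for two Bernoulli distributions:

```python
import numpy as np

def bernoulli_entropy(p):
    # H(p) = -E_p[log p(X)] for a Bernoulli(p) variable
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bernoulli_kl(p, q):
    # D_KL(p || q) between two Bernoulli distributions
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def bernoulli_cross_entropy(p, q):
    # H(p || q) = -E_p[log q(X)]
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

p, q = 0.3, 0.7  # arbitrary example parameters
assert np.isclose(bernoulli_cross_entropy(p, q),
                  bernoulli_entropy(p) + bernoulli_kl(p, q))
```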
The setup we consider is the typical supervised learning setup with a training set of $n$ (input, target) pairs $D_n = \{(\mathbf{x}^{(1)}, t^{(1)}), \ldots, (\mathbf{x}^{(n)}, t^{(n)})\}$, that we suppose to be an i.i.d. sample from an unknown distribution $q(X, T)$ with corresponding marginals $q(X)$ and $q(T)$.
2.2 The Basic Autoencoder
We begin by recalling the traditional autoencoder model such as the one used in (Bengio et al., 2007) to build deep networks. An autoencoder takes an input vector $\mathbf{x} \in [0,1]^d$, and first maps it to a hidden representation $\mathbf{y} \in [0,1]^{d'}$ through a deterministic mapping $\mathbf{y} = f_\theta(\mathbf{x}) = s(\mathbf{W}\mathbf{x} + \mathbf{b})$, parameterized by $\theta = \{\mathbf{W}, \mathbf{b}\}$. $\mathbf{W}$ is a $d' \times d$ weight matrix and $\mathbf{b}$ is a bias vector. The resulting latent representation $\mathbf{y}$ is then mapped back to a "reconstructed" vector $\mathbf{z} \in [0,1]^d$ in input space, $\mathbf{z} = g_{\theta'}(\mathbf{y}) = s(\mathbf{W}'\mathbf{y} + \mathbf{b}')$ with $\theta' = \{\mathbf{W}', \mathbf{b}'\}$. The weight matrix $\mathbf{W}'$ of the reverse mapping may optionally be constrained by $\mathbf{W}' = \mathbf{W}^T$, in which case the autoencoder is said to have tied weights. Each training $\mathbf{x}^{(i)}$ is thus mapped to a corresponding $\mathbf{y}^{(i)}$ and a reconstruction $\mathbf{z}^{(i)}$. The parameters of this model are optimized to minimize the average reconstruction error:

$$\theta^\star, \theta'^\star = \arg\min_{\theta,\theta'} \frac{1}{n} \sum_{i=1}^{n} L\left(\mathbf{x}^{(i)}, \mathbf{z}^{(i)}\right) = \arg\min_{\theta,\theta'} \frac{1}{n} \sum_{i=1}^{n} L\left(\mathbf{x}^{(i)}, g_{\theta'}(f_\theta(\mathbf{x}^{(i)}))\right) \quad (1)$$

where $L$ is a loss function such as the traditional squared error $L(\mathbf{x}, \mathbf{z}) = \|\mathbf{x} - \mathbf{z}\|^2$.
An alternative loss, suggested by the interpretation of $\mathbf{x}$ and $\mathbf{z}$ as either bit vectors or vectors of bit probabilities (Bernoullis), is the reconstruction cross-entropy:

$$L_{\mathbb{H}}(\mathbf{x}, \mathbf{z}) = \mathbb{H}(\mathcal{B}_{\mathbf{x}} \| \mathcal{B}_{\mathbf{z}}) = -\sum_{k=1}^{d} \left[ x_k \log z_k + (1 - x_k) \log(1 - z_k) \right] \quad (2)$$
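To make the mapping concrete, here is a minimal numpy sketch of the forward pass and of the two reconstruction losses above (squared error and cross-entropy). It is an illustrative implementation under our own naming conventions, not the authors' code; tied weights ($\mathbf{W}' = \mathbf{W}^T$) and arbitrary layer sizes are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, d_prime = 784, 256                      # input and hidden sizes (illustrative)
W = rng.normal(0, 0.01, (d_prime, d))      # encoder weights
b = np.zeros(d_prime)                      # encoder bias
b_prime = np.zeros(d)                      # decoder bias (tied weights: W' = W.T)

def encode(x):
    # y = f_theta(x) = s(Wx + b)
    return sigmoid(W @ x + b)

def decode(y):
    # z = g_theta'(y) = s(W'y + b'), with W' = W.T (tied weights)
    return sigmoid(W.T @ y + b_prime)

def squared_error(x, z):
    return np.sum((x - z) ** 2)

def cross_entropy(x, z, eps=1e-12):
    # eq. (2): -sum_k [ x_k log z_k + (1 - x_k) log(1 - z_k) ]
    z = np.clip(z, eps, 1 - eps)
    return -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))

x = rng.random(d)                          # a toy input in [0, 1]^d
z = decode(encode(x))
print(squared_error(x, z), cross_entropy(x, z))
```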

Figure 1: An example $\mathbf{x}$ is corrupted to $\tilde{\mathbf{x}}$ via the corruption process $q_D$. The autoencoder then maps it to $\mathbf{y}$ (through $f_\theta$) and attempts to reconstruct $\mathbf{x}$ (through $g_{\theta'}$).
Note that if $\mathbf{x}$ is a binary vector, $L_{\mathbb{H}}(\mathbf{x}, \mathbf{z})$ is a negative log-likelihood for the example $\mathbf{x}$, given the Bernoulli parameters $\mathbf{z}$. Equation 1 with $L = L_{\mathbb{H}}$ can be written

$$\theta^\star, \theta'^\star = \arg\min_{\theta,\theta'} \mathbb{E}_{q^0(X)}\left[ L_{\mathbb{H}}\left(X, g_{\theta'}(f_\theta(X))\right) \right] \quad (3)$$

where $q^0(X)$ denotes the empirical distribution associated to our $n$ training inputs. This optimization will typically be carried out by stochastic gradient descent.
2.3 The Denoising Autoencoder
To test our hypothesis and enforce robustness to partially destroyed inputs, we modify the basic autoencoder we just described. We will now train it to reconstruct a clean "repaired" input from a corrupted, partially destroyed one. This is done by first corrupting the initial input $\mathbf{x}$ to get a partially destroyed version $\tilde{\mathbf{x}}$ by means of a stochastic mapping $\tilde{\mathbf{x}} \sim q_D(\tilde{\mathbf{x}}|\mathbf{x})$. In our experiments, we considered the following corrupting process, parameterized by the desired proportion $\nu$ of "destruction": for each input $\mathbf{x}$, a fixed number $\nu d$ of components are chosen at random, and their value is forced to 0, while the others are left untouched. The procedure can be viewed as replacing a component considered missing by a default value, which is a common technique. A motivation for zeroing the destroyed components is that it simulates the removal of these components from the input. For images on a white (0) background, this corresponds to "salt noise". Note that alternative corrupting noises could be considered.¹
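A minimal sketch of the corruption process just described (our own illustrative code, not the authors'): for each input, a fixed number $\nu d$ of components is chosen uniformly at random and forced to 0.

```python
import numpy as np

def corrupt(x, nu, rng):
    """Zero out a fixed number nu*d of randomly chosen components of x."""
    d = x.shape[0]
    n_destroyed = int(nu * d)
    idx = rng.choice(d, size=n_destroyed, replace=False)
    x_tilde = x.copy()
    x_tilde[idx] = 0.0
    return x_tilde

rng = np.random.default_rng(0)
x = rng.random(784)
x_tilde = corrupt(x, nu=0.25, rng=rng)   # destroy 25% of the components
```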
The corrupted input $\tilde{\mathbf{x}}$ is then mapped, as with the basic autoencoder, to a hidden representation $\mathbf{y} = f_\theta(\tilde{\mathbf{x}}) = s(\mathbf{W}\tilde{\mathbf{x}} + \mathbf{b})$ from which we reconstruct a $\mathbf{z} = g_{\theta'}(\mathbf{y}) = s(\mathbf{W}'\mathbf{y} + \mathbf{b}')$ (see figure 1 for a schematic representation of the process). As before the parameters are trained to minimize the average reconstruction error $L_{\mathbb{H}}(\mathbf{x}, \mathbf{z}) = \mathbb{H}(\mathcal{B}_{\mathbf{x}} \| \mathcal{B}_{\mathbf{z}})$ over a training set, i.e. to have $\mathbf{z}$ as close as possible to the uncorrupted input $\mathbf{x}$. But the key difference is that $\mathbf{z}$ is now a deterministic function of $\tilde{\mathbf{x}}$ rather than $\mathbf{x}$, and thus the result of a stochastic mapping of $\mathbf{x}$.

Let us define the joint distribution

$$q^0(X, \widetilde{X}, Y) = q^0(X)\, q_D(\widetilde{X}|X)\, \delta_{f_\theta(\widetilde{X})}(Y) \quad (4)$$

¹ The approach we describe and our analysis are not specific to a particular kind of corrupting noise.

where $\delta_u(v)$ puts mass 0 when $u \neq v$. Thus $Y$ is a deterministic function of $\widetilde{X}$. $q^0(X, \widetilde{X}, Y)$ is parameterized by $\theta$. The objective function minimized by stochastic gradient descent becomes:

$$\arg\min_{\theta,\theta'} \mathbb{E}_{q^0(X, \widetilde{X})}\left[ L_{\mathbb{H}}\left(X, g_{\theta'}(f_\theta(\widetilde{X}))\right) \right]. \quad (5)$$

So from the point of view of the stochastic gradient descent algorithm, in addition to picking an input sample from the training set, we will also produce a random corrupted version of it, and take a gradient step towards reconstructing the uncorrupted version from the corrupted version. Note that in this way, the autoencoder cannot learn the identity, unlike the basic autoencoder, thus removing the constraint that $d' < d$ or the need to regularize specifically to avoid such a trivial solution.
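Putting the pieces together, a single stochastic gradient step on the criterion of eq. 5 can be sketched as follows. This is an illustrative implementation of our own, with hand-derived gradients for the sigmoid/cross-entropy case and tied weights; it reuses the toy `sigmoid` and `corrupt` helpers from the earlier sketches, and the learning rate and noise level are arbitrary.

```python
import numpy as np

def denoising_sgd_step(x, W, b, b_prime, nu, lr, rng):
    """One SGD step on eq. (5): corrupt x, encode, decode, and step
    towards reconstructing the *uncorrupted* x (sigmoid units,
    cross-entropy loss, tied weights W' = W.T)."""
    x_tilde = corrupt(x, nu, rng)                 # x_tilde ~ q_D(x_tilde | x)
    y = sigmoid(W @ x_tilde + b)                  # hidden representation
    z = sigmoid(W.T @ y + b_prime)                # reconstruction
    # Gradients of L_H(x, z) for sigmoid + cross-entropy:
    delta_out = z - x                             # dL/d(pre-activation of z)
    delta_hid = (W @ delta_out) * y * (1 - y)     # dL/d(pre-activation of y)
    grad_W = np.outer(delta_hid, x_tilde) + np.outer(y, delta_out)
    W -= lr * grad_W
    b -= lr * delta_hid
    b_prime -= lr * delta_out
    return W, b, b_prime
```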
2.4 Layer-wise Initialization and Fine Tuning
The basic autoencoder has been used as a building block to train deep networks (Bengio et al., 2007), with the representation of the k-th layer used as input for the (k+1)-th, and the (k+1)-th layer trained after the k-th has been trained. After a few layers have been trained, the parameters are used as initialization for a network optimized with respect to a supervised training criterion. This greedy layer-wise procedure has been shown to yield significantly better local minima than random initialization of deep networks (Bengio et al., 2007), achieving better generalization on a number of tasks (Larochelle et al., 2007).

The procedure to train a deep network using the denoising autoencoder is similar. The only difference is how each layer is trained, i.e., to minimize the criterion in eq. 5 instead of eq. 3. Note that the corruption process $q_D$ is only used during training, but not for propagating representations from the raw input to higher-level representations. Note also that when layer k is trained, it receives as input the uncorrupted output of the previous layers.
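As a sketch of this greedy layer-wise procedure (again illustrative and our own, building on the hypothetical `denoising_sgd_step` and `sigmoid` helpers above): each layer is trained as a denoising autoencoder on the uncorrupted outputs of the layer below, and corruption is applied only to that layer's own input during its training.

```python
import numpy as np

def train_denoising_layer(X, d_prime, nu, lr, epochs, rng):
    """Train one denoising autoencoder layer on a data matrix X (n x d)."""
    d = X.shape[1]
    W = rng.normal(0, 0.01, (d_prime, d))
    b, b_prime = np.zeros(d_prime), np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):
            W, b, b_prime = denoising_sgd_step(X[i], W, b, b_prime, nu, lr, rng)
    return W, b

def stack_layers(X, layer_sizes, nu, lr, epochs, rng):
    """Greedy layer-wise pretraining: corruption is used only while training
    a layer; representations are propagated upward without corruption."""
    params, H = [], X
    for d_prime in layer_sizes:
        W, b = train_denoising_layer(H, d_prime, nu, lr, epochs, rng)
        params.append((W, b))
        H = sigmoid(H @ W.T + b)   # uncorrupted input for the next layer
    return params                  # used to initialize a deep net for supervised fine-tuning
```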
3 Relationship to Other Approaches
Our training procedure for the denoising autoencoder involves learning to recover a clean input from a corrupted version, a task known as denoising. The problem of image denoising, in particular, has been extensively studied in the image processing community and many recent developments rely on machine learning approaches (see e.g. Roth and Black (2005); Elad and Aharon (2006); Hammond and Simoncelli (2007)). A particular form of gated autoencoders has also been used for denoising in Memisevic (2007). Denoising using autoencoders was actually introduced much earlier (LeCun, 1987; Gallinari et al., 1987), as an alternative to Hopfield models (Hopfield, 1982). Our objective however is fundamentally different from that of developing a competitive image denoising algorithm. We investigate explicit robustness to corrupting noise only as a criterion to guide the learning of suitable intermediate representations, with the

References

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), 2554–2558.

Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.