
Transforming Auto-encoders
G. E. Hinton, A. Krizhevsky & S. D. Wang
Department of Computer Science, University of Toronto
{geoffrey.hinton, akrizhevsky, sidawang88}@gmail.com
Abstract. The artificial neural networks that are used to recognize
shapes typically use one or more layers of learned feature detectors that
produce scalar outputs. By contrast, the computer vision community
uses complicated, hand-engineered features, like SIFT [6], that produce
a whole vector of outputs including an explicit representation of the pose
of the feature. We show how neural networks can be used to learn features
that output a whole vector of instantiation parameters and we argue that
this is a much more promising way of dealing with variations in position,
orientation, scale and lighting than the methods currently employed in
the neural networks community. It is also more promising than the hand-
engineered features currently used in computer vision because it provides
an efficient way of adapting the features to the domain.
Keywords: Invariance, auto-encoder, shape representation
1 Introduction
Current methods for recognizing objects in images perform poorly and use meth-
ods that are intellectually unsatisfying. Some of the best computer vision systems
use histograms of oriented gradients as “visual words” and model the spatial
distribution of these elements using a crude spatial pyramid. Such methods can
recognize objects correctly without knowing exactly where they are, an ability
that is used to diagnose brain damage in humans. The best artificial neural net-
works [4, 5, 10] use hand-coded weight-sharing schemes to reduce the number of
free parameters and they achieve local translational invariance by subsampling
the activities of local pools of translated replicas of the same kernel. This method
of dealing with the changes in images caused by changes in viewpoint is much
better than no method at all, but it is clearly incapable of dealing with recog-
nition tasks, such as facial identity recognition, that require knowledge of the
precise spatial relationships between high-level parts like a nose and a mouth.
After several stages of subsampling in a convolutional net, high-level features
have a lot of uncertainty in their poses. This is generally regarded as a desirable
property because it amounts to invariance to pose over some limited range,
but it makes it impossible to compute precise spatial relationships.
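
To make the cost of subsampling concrete, the short NumPy sketch below (not from the paper; the pooling helper is our own illustration) shows that a max-pooled feature map is identical for two inputs whose only difference is the precise position of a feature within a pool, so the precise position cannot be recovered from the pooled activities.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Subsample a 2-D feature map by taking the max over non-overlapping 2x2 pools."""
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A single active feature at (row 2, col 2) versus the same feature shifted by one pixel.
a = np.zeros((6, 6)); a[2, 2] = 1.0
b = np.zeros((6, 6)); b[3, 3] = 1.0   # shifted by (+1, +1), still inside the same 2x2 pool

print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True: the pooled outputs are
# identical, so the feature's precise position is not available to higher layers.
```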
This paper argues that convolutional neural networks are misguided in what
they are trying to achieve. Instead of aiming for viewpoint invariance in the
activities of “neurons” that use a single scalar output to summarize the activities
of a local pool of replicated feature detectors, artificial neural networks should
use local “capsules” that perform some quite complicated internal computations
on their inputs and then encapsulate the results of these computations into a
small vector of highly informative outputs. Each capsule learns to recognize
an implicitly defined visual entity over a limited domain of viewing conditions
and deformations and it outputs both the probability that the entity is present
within its limited domain and a set of “instantiation parameters” that may
include the precise pose, lighting and deformation of the visual entity relative
to an implicitly defined canonical version of that entity. When the capsule is
working properly, the probability of the visual entity being present is locally
invariant: it does not change as the entity moves over the manifold of possible
appearances within the limited domain covered by the capsule. The instantiation
parameters, however, are “equivariant”: as the viewing conditions change and
the entity moves over the appearance manifold, the instantiation parameters
change by a corresponding amount because they are representing the intrinsic
coordinates of the entity on the appearance manifold.
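
As a purely illustrative summary of this interface (the class and function names are ours, not the paper's), a capsule's output can be written as a presence probability plus a small vector of instantiation parameters, with the ideal behaviour under a translation being: p unchanged, pose shifted by the same amount.

```python
from typing import NamedTuple

class CapsuleOutput(NamedTuple):
    p: float   # probability that the capsule's visual entity is present (locally invariant)
    x: float   # instantiation parameters, here just a 2-D position (equivariant)
    y: float

def ideal_response_to_translation(out: CapsuleOutput, dx: float, dy: float) -> CapsuleOutput:
    # If the entity moves by (dx, dy) but stays inside the capsule's domain, the
    # probability should not change while the pose parameters shift by the same amount.
    return CapsuleOutput(p=out.p, x=out.x + dx, y=out.y + dy)
```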
One of the major advantages of capsules that output explicit instantiation
parameters is that they provide a simple way to recognize wholes by recognizing their parts. If a capsule can learn to output the pose of its visual entity in a vector that is linearly related to the “natural” representations of pose used in computer graphics, there is a simple and highly selective test for whether the visual entities represented by two active capsules, A and B, have the right spatial relationship to activate a higher-level capsule, C. Suppose that the pose outputs of capsule A are represented by a matrix, T_A, that specifies the coordinate transform between the canonical visual entity of A and the actual instantiation of that entity found by capsule A. If we multiply T_A by the part-whole coordinate transform T_AC that relates the canonical visual entity of A to the canonical visual entity of C, we get a prediction for T_C. Similarly, we can use T_B and T_BC to get another prediction. If these predictions are a good match, the instantiations found by capsules A and B are in the right spatial relationship to activate capsule C and the average of the predictions tells us how the larger visual entity represented by C is transformed relative to the canonical visual entity of C. If, for example, A represents a mouth and B represents a nose, they can each make a prediction for the pose of the face. If these predictions agree, the mouth and nose must be in the right spatial relationship to form a face. An interesting property of this way of performing shape recognition is that the knowledge of part-whole relationships is viewpoint-invariant and is represented by weight matrices whereas the knowledge of the instantiation parameters of currently observed objects and their parts is viewpoint-equivariant and is represented by neural activities [12].
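
The agreement test sketched above is just two matrix products and a comparison. The following NumPy sketch assumes the poses are 3x3 homogeneous transforms; the tolerance and the function name are illustrative, not taken from the paper.

```python
import numpy as np

def agreement_for_whole(T_A, T_AC, T_B, T_BC, tol=1e-1):
    """Do parts A and B stand in the right spatial relationship to activate whole C?

    T_A, T_B   : 3x3 poses of the parts, as output by their capsules.
    T_AC, T_BC : learned, viewpoint-invariant part-whole coordinate transforms.
    Returns (agree, T_C), where T_C is the averaged prediction for the pose of C.
    """
    pred_A = T_A @ T_AC                      # capsule A's prediction for T_C
    pred_B = T_B @ T_BC                      # capsule B's prediction for T_C
    agree = np.allclose(pred_A, pred_B, atol=tol)
    return agree, 0.5 * (pred_A + pred_B)    # the average tells us how C is transformed
```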
In order to get such a part-whole hierarchy off the ground, the “capsules” that
implement the lowest-level parts in the hierarchy need to extract explicit pose
parameters from pixel intensities. This paper shows that these capsules are quite
easy to learn from pairs of transformed images if the neural net has direct, non-
visual access to the transformations. In humans, for example, a saccade causes a pure translation of the retinal image and the cortex has non-visual access to information about eye-movements.

Fig. 1. Three capsules of a transforming auto-encoder that models translations. Each capsule in the figure has 3 recognition units and 4 generation units. The weights on the connections are learned by backpropagating the discrepancy between the actual and target outputs.
2 Learning the First Level of Capsules
Once pixel intensities have been converted into the outputs of a set of active,
first-level capsules each of which produces an explicit representation of the pose
of its visual entity, it is relatively easy to see how larger and more complex visual
entities can be recognized by using agreements of the poses predicted by active,
lower-level capsules. But where do the first-level capsules come from? How can
an artificial neural network learn to convert the language of pixel intensities
to the language of pose parameters? That is the question addressed by this
paper, and it turns out that there is a surprisingly simple answer, which we call a
“transforming auto-encoder”. We explain the idea using simple 2-D images and
capsules whose only pose outputs are an x and a y position. We generalize to
more complicated poses later.
Consider the feedforward neural network shown in figure 1. The network is
deterministic and, once it has been learned, it takes as inputs an image and
desired shifts, ∆x and ∆y, and it outputs the shifted image. The network is
composed of a number of separate capsules that only interact at the final layer
when they cooperate to produce the desired shifted image. Each capsule has its
own logistic “recognition units” that act as a hidden layer for computing three
numbers, x, y, and p, that are the outputs that the capsule will send to higher
levels of the vision system. p is the probability that the capsule’s visual entity is present in the input image. The capsule also has its own “generation units” that are used for computing the capsule’s contribution to the transformed image. The inputs to the generation units are x + ∆x and y + ∆y, and the contributions that the capsule’s generation units make to the output image are multiplied by p, so inactive capsules have no effect.

Fig. 2. Left: A scatterplot in which the vertical axis represents the x output of one of the capsules for each digit image and the horizontal axis represents the x output from the same capsule if that image is shifted by +3 or −3 pixels in the x direction. If the original image is already near the limit of the x positions that the capsule can represent, shifting further in that direction causes the capsule to produce the wrong answer, but this does not matter if the capsule sets its probability to 0 for data outside its domain of competence. Right: The outgoing weights of 10 of the 20 generative units for 9 of the capsules.
For the transforming auto-encoder to produce the correct output image, it
is essential that the x and y values computed by each active capsule correspond
to the actual x and y position of its visual entity, and we do not need to know
this visual entity or the origin of its coordinate frame in advance.
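
A minimal NumPy sketch of a forward pass through this architecture is given below. It follows the wiring in figure 1, but the layer sizes, the use of logistic generation units, and the omission of biases are our simplifications rather than details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def capsule_contribution(image, dx, dy, W_rec, W_xyp, W_gen, W_out):
    """One capsule of a translation transforming auto-encoder (biases omitted).

    image : flattened input image, shape (n_pixels,)
    W_rec : pixels -> recognition units,          shape (n_pixels, n_rec)
    W_xyp : recognition units -> (x, y, p_logit), shape (n_rec, 3)
    W_gen : (x + dx, y + dy) -> generation units, shape (2, n_gen)
    W_out : generation units -> output pixels,    shape (n_gen, n_pixels)
    """
    h = sigmoid(image @ W_rec)                       # logistic recognition units
    x, y, p_logit = h @ W_xyp
    p = sigmoid(p_logit)                             # probability the entity is present
    g = sigmoid(np.array([x + dx, y + dy]) @ W_gen)  # generation units see the shifted pose
    return p * (g @ W_out)                           # contribution gated by p

def predict_shifted_image(image, dx, dy, capsules):
    # Capsules only interact at the output layer, where their gated contributions are summed.
    return sum(capsule_contribution(image, dx, dy, *w) for w in capsules)
```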
As a simple demonstration of the efficacy of the transforming auto-encoder,
we trained a network with 30 capsules each of which had 10 recognition units
and 20 generation units. Each capsule sees the whole of an MNIST digit image.
Both the input and the output images are shifted randomly by -2, -1, 0, +1, or
+2 pixels in the x and y directions and the transforming auto-encoder is given
the resulting ∆x and ∆y as an additional input. Figure 2 shows that the capsules
do indeed output x and y values that shift in just the right way when the input
image is shifted. Figure 2 also shows that the capsules learn generative units with
projective fields that are highly localized. The receptive fields of the recognition
units are noisier and somewhat less localized.
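
The training cases for this experiment are easy to reproduce in outline. The sketch below (our own, simplified illustration) generates a randomly shifted input/target pair and the relative shift that is fed to the network; np.roll wraps pixels around the border, which is a simplification made for brevity.

```python
import numpy as np

def make_training_case(digit_2d, rng):
    """Return (input image, (dx, dy), target image) for one digit.

    Both images are copies of the digit shifted by -2..+2 pixels in x and y;
    (dx, dy) is the relative shift that the network receives as an extra input.
    """
    sx_in, sy_in, sx_out, sy_out = rng.integers(-2, 3, size=4)

    def shift(img, sx, sy):
        return np.roll(np.roll(img, sy, axis=0), sx, axis=1)  # wraps at the borders

    inp, out = shift(digit_2d, sx_in, sy_in), shift(digit_2d, sx_out, sy_out)
    return inp.ravel(), (sx_out - sx_in, sy_out - sy_in), out.ravel()

rng = np.random.default_rng(0)
x, (dx, dy), target = make_training_case(np.zeros((28, 28)), rng)
# Training backpropagates the discrepancy (e.g. squared error) between
# predict_shifted_image(x, dx, dy, capsules) and target.
```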

Fig. 3. Top: Full affine transformations using a transforming auto-encoder with 25 capsules each of which has 40 recognition units and 40 generation units. The top row shows input images; the middle row shows output images and the bottom row shows the correctly transformed output images. Bottom: The output weights of the first 20 generation units of the first 7 capsules for this transforming auto-encoder.
2.1 More Complex 2-D Transformations
If each capsule is given 9 real-valued outputs that are treated as a 3 × 3 ma-
trix A, a transforming auto-encoder can be trained to predict a full 2-D affine
transformation (translation, rotation, scaling and shearing). A known transfor-
mation matrix T is applied to the output of the capsule A to get the matrix TA.
The elements of TA are then used as the inputs to the generation units when
predicting the target output image.
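
Relative to the translation sketch above, the only changes are the capsule's pose output and how the known transformation enters. The following is again a hedged NumPy illustration: the presence probability and logistic units are carried over from the translation case, and the weight shapes are our own choices rather than the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def affine_capsule_contribution(image, T, W_rec, W_pose, W_gen, W_out):
    """One capsule of an affine transforming auto-encoder (biases omitted).

    T      : known 3x3 transformation between the input and the target image.
    W_pose : recognition units -> 9 pose outputs plus one presence logit, shape (n_rec, 10)
    W_gen  : elements of TA -> generation units,                          shape (9, n_gen)
    W_out  : generation units -> output pixels,                           shape (n_gen, n_pixels)
    """
    h = sigmoid(image @ W_rec)          # recognition units
    out = h @ W_pose
    A = out[:9].reshape(3, 3)           # the capsule's 3x3 pose matrix A
    p = sigmoid(out[9])                 # presence probability, as in the translation case
    TA = T @ A                          # apply the known transformation to the capsule's output
    g = sigmoid(TA.ravel() @ W_gen)     # generation units are driven by the elements of TA
    return p * (g @ W_out)              # gated contribution to the predicted target image
```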
2.2 Modeling Changes in 3-D Viewpoint
A major potential advantage of using matrix multiplies to model the effects of
viewpoint is that it should make it far easier to cope with 3-D. Our preliminary
experiments (see figure 4) used computer graphics to generate stereo images of
various types of car from many different viewpoints. The transforming auto-
encoder consisted of 900 capsules, each with two layers (32 then 64) of rectified
linear recognition units [8]. The capsules had 11x11 pixel receptive fields which
were arranged on a 30x30 grid over the 96x96 image, with a stride of 3 pixels
between neighbouring capsules. There was no weight-sharing. Each capsule pro-
duced from its layer of 64 recognition units a 3x3 matrix representation of the
3-D orientation of the feature that it was tuned to detect, as well as a probability
that its implicitly defined feature was present. This 3x3 matrix was then multi-
plied by the real transformation matrix between the source and target images,

References

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (1998)
Lowe, D.G.: Object recognition from local scale-invariant features. In: Proc. International Conference on Computer Vision (1999)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proc. International Conference on Machine Learning (2010)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proc. International Conference on Artificial Intelligence and Statistics (2010)
Riesenhuber, M., Poggio, T.: Hierarchical models of object recognition in cortex. Nature Neuroscience (1999)