
Transforming Auto-encoders
G. E. Hinton, A. Krizhevsky & S. D. Wang
Department of Computer Science, University of Toronto
{geoffrey.hinton, akrizhevsky, sidawang88}@gmail.com
Abstract. The artificial neural networks that are used to recognize
shapes typically use one or more layers of learned feature detectors that
produce scalar outputs. By contrast, the computer vision community
uses complicated, hand-engineered features, like SIFT [6], that produce
a whole vector of outputs including an explicit representation of the pose
of the feature. We show how neural networks can be used to learn features
that output a whole vector of instantiation parameters and we argue that
this is a much more promising way of dealing with variations in position,
orientation, scale and lighting than the methods currently employed in
the neural networks community. It is also more promising than the hand-
engineered features currently used in computer vision because it provides
an efficient way of adapting the features to the domain.
Keywords: Invariance, auto-encoder, shape representation
1 Introduction
Current methods for recognizing objects in images perform poorly and use meth-
ods that are intellectually unsatisfying. Some of the best computer vision systems
use histograms of oriented gradients as “visual words” and model the spatial
distribution of these elements using a crude spatial pyramid. Such methods can
recognize objects correctly without knowing exactly where they are, an ability
that is used to diagnose brain damage in humans. The best artificial neural net-
works [4, 5, 10] use hand-coded weight-sharing schemes to reduce the number of
free parameters and they achieve local translational invariance by subsampling
the activities of local pools of translated replicas of the same kernel. This method
of dealing with the changes in images caused by changes in viewpoint is much
better than no method at all, but it is clearly incapable of dealing with recog-
nition tasks, such as facial identity recognition, that require knowledge of the
precise spatial relationships between high-level parts like a nose and a mouth.
After several stages of subsampling in a convolutional net, high-level features
have a lot of uncertainty in their poses. This is generally regarded as a desirable
property because it amounts to invariance to pose over some limited range,
but it makes it impossible to compute precise spatial relationships.
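
To make the cost of subsampling concrete, the short NumPy sketch below (not from the paper; the pooling helper is our own illustration) shows that a max-pooled feature map is identical for two inputs whose only difference is the precise position of a feature within a pool, so the precise position cannot be recovered from the pooled activities.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Subsample a 2-D feature map by taking the max over non-overlapping 2x2 pools."""
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A single active feature at (row 2, col 2) versus the same feature shifted by one pixel.
a = np.zeros((6, 6)); a[2, 2] = 1.0
b = np.zeros((6, 6)); b[3, 3] = 1.0   # shifted by (+1, +1), still inside the same 2x2 pool

print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True: the pooled outputs are
# identical, so the feature's precise position is not available to higher layers.
```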
This paper argues that convolutional neural networks are misguided in what
they are trying to achieve. Instead of aiming for viewpoint invariance in the
activities of “neurons” that use a single scalar output to summarize the activities
of a local pool of replicated feature detectors, artificial neural networks should
use local “capsules” that perform some quite complicated internal computations
on their inputs and then encapsulate the results of these computations into a
small vector of highly informative outputs. Each capsule learns to recognize
an implicitly defined visual entity over a limited domain of viewing conditions
and deformations and it outputs both the probability that the entity is present
within its limited domain and a set of “instantiation parameters” that may
include the precise pose, lighting and deformation of the visual entity relative
to an implicitly defined canonical version of that entity. When the capsule is
working properly, the probability of the visual entity being present is locally
invariant: it does not change as the entity moves over the manifold of possible
appearances within the limited domain covered by the capsule. The instantiation
parameters, however, are “equivariant”: as the viewing conditions change and
the entity moves over the appearance manifold, the instantiation parameters
change by a corresponding amount because they are representing the intrinsic
coordinates of the entity on the appearance manifold.
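
As a purely illustrative summary of this interface (the class and function names are ours, not the paper's), a capsule's output can be written as a presence probability plus a small vector of instantiation parameters, with the ideal behaviour under a translation being: p unchanged, pose shifted by the same amount.

```python
from typing import NamedTuple

class CapsuleOutput(NamedTuple):
    p: float   # probability that the capsule's visual entity is present (locally invariant)
    x: float   # instantiation parameters, here just a 2-D position (equivariant)
    y: float

def ideal_response_to_translation(out: CapsuleOutput, dx: float, dy: float) -> CapsuleOutput:
    # If the entity moves by (dx, dy) but stays inside the capsule's domain, the
    # probability should not change while the pose parameters shift by the same amount.
    return CapsuleOutput(p=out.p, x=out.x + dx, y=out.y + dy)
```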
One of the major advantages of capsules that output explicit instantiation
parameters is that they provide a simple way to recognize wholes by recognizing their parts. If a capsule can learn to output the pose of its visual entity in a vector that is linearly related to the “natural” representations of pose used in computer graphics, there is a simple and highly selective test for whether the visual entities represented by two active capsules, A and B, have the right spatial relationship to activate a higher-level capsule, C. Suppose that the pose outputs of capsule A are represented by a matrix, T_A, that specifies the coordinate transform between the canonical visual entity of A and the actual instantiation of that entity found by capsule A. If we multiply T_A by the part-whole coordinate transform T_AC that relates the canonical visual entity of A to the canonical visual entity of C, we get a prediction for T_C. Similarly, we can use T_B and T_BC to get another prediction. If these predictions are a good match, the instantiations found by capsules A and B are in the right spatial relationship to activate capsule C and the average of the predictions tells us how the larger visual entity represented by C is transformed relative to the canonical visual entity of C. If, for example, A represents a mouth and B represents a nose, they can each make a prediction for the pose of the face. If these predictions agree, the mouth and nose must be in the right spatial relationship to form a face. An interesting property of this way of performing shape recognition is that the knowledge of part-whole relationships is viewpoint-invariant and is represented by weight matrices whereas the knowledge of the instantiation parameters of currently observed objects and their parts is viewpoint-equivariant and is represented by neural activities [12].
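
The agreement test sketched above is just two matrix products and a comparison. The following NumPy sketch assumes the poses are 3x3 homogeneous transforms; the tolerance and the function name are illustrative, not taken from the paper.

```python
import numpy as np

def agreement_for_whole(T_A, T_AC, T_B, T_BC, tol=1e-1):
    """Do parts A and B stand in the right spatial relationship to activate whole C?

    T_A, T_B   : 3x3 poses of the parts, as output by their capsules.
    T_AC, T_BC : learned, viewpoint-invariant part-whole coordinate transforms.
    Returns (agree, T_C), where T_C is the averaged prediction for the pose of C.
    """
    pred_A = T_A @ T_AC                      # capsule A's prediction for T_C
    pred_B = T_B @ T_BC                      # capsule B's prediction for T_C
    agree = np.allclose(pred_A, pred_B, atol=tol)
    return agree, 0.5 * (pred_A + pred_B)    # the average tells us how C is transformed
```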
In order to get such a part-whole hierarchy off the ground, the “capsules” that
implement the lowest-level parts in the hierarchy need to extract explicit pose
parameters from pixel intensities. This paper shows that these capsules are quite
easy to learn from pairs of transformed images if the neural net has direct, non-
visual access to the transformations. In humans, for example, a saccade causes a pure translation of the retinal image and the cortex has non-visual access to information about eye-movements.

Fig. 1. Three capsules of a transforming auto-encoder that models translations. Each capsule in the figure has 3 recognition units and 4 generation units. The weights on the connections are learned by backpropagating the discrepancy between the actual and target outputs.
2 Learning the First Level of Capsules
Once pixel intensities have been converted into the outputs of a set of active,
first-level capsules each of which produces an explicit representation of the pose
of its visual entity, it is relatively easy to see how larger and more complex visual
entities can be recognized by using agreements of the poses predicted by active,
lower-level capsules. But where do the first-level capsules come from? How can
an artificial neural network learn to convert the language of pixel intensities
to the language of pose parameters? That is the question addressed by this
paper, and it turns out that there is a surprisingly simple answer, which we call a
“transforming auto-encoder”. We explain the idea using simple 2-D images and
capsules whose only pose outputs are an x and a y position. We generalize to
more complicated poses later.
Consider the feedforward neural network shown in figure 1. The network is
deterministic and, once it has been learned, it takes as inputs an image and
desired shifts, ∆x and ∆y, and it outputs the shifted image. The network is
composed of a number of separate capsules that only interact at the final layer
when they cooperate to produce the desired shifted image. Each capsule has its
own logistic “recognition units” that act as a hidden layer for computing three
numbers, x, y, and p, that are the outputs that the capsule will send to higher
levels of the vision system. p is the probability that the capsule’s visual entity is present in the input image. The capsule also has its own “generation units” that are used for computing the capsule’s contribution to the transformed image. The inputs to the generation units are x + ∆x and y + ∆y, and the contributions that the capsule’s generation units make to the output image are multiplied by p, so inactive capsules have no effect.

Fig. 2. Left: A scatterplot in which the vertical axis represents the x output of one of the capsules for each digit image and the horizontal axis represents the x output from the same capsule if that image is shifted by +3 or −3 pixels in the x direction. If the original image is already near the limit of the x positions that the capsule can represent, shifting further in that direction causes the capsule to produce the wrong answer, but this does not matter if the capsule sets its probability to 0 for data outside its domain of competence. Right: The outgoing weights of 10 of the 20 generative units for 9 of the capsules.
For the transforming auto-encoder to produce the correct output image, it
is essential that the x and y values computed by each active capsule correspond
to the actual x and y position of its visual entity, and we do not need to know
this visual entity or the origin of its coordinate frame in advance.
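
A minimal NumPy sketch of a forward pass through this architecture is given below. It follows the wiring in figure 1, but the layer sizes, the use of logistic generation units, and the omission of biases are our simplifications rather than details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def capsule_contribution(image, dx, dy, W_rec, W_xyp, W_gen, W_out):
    """One capsule of a translation transforming auto-encoder (biases omitted).

    image : flattened input image, shape (n_pixels,)
    W_rec : pixels -> recognition units,          shape (n_pixels, n_rec)
    W_xyp : recognition units -> (x, y, p_logit), shape (n_rec, 3)
    W_gen : (x + dx, y + dy) -> generation units, shape (2, n_gen)
    W_out : generation units -> output pixels,    shape (n_gen, n_pixels)
    """
    h = sigmoid(image @ W_rec)                       # logistic recognition units
    x, y, p_logit = h @ W_xyp
    p = sigmoid(p_logit)                             # probability the entity is present
    g = sigmoid(np.array([x + dx, y + dy]) @ W_gen)  # generation units see the shifted pose
    return p * (g @ W_out)                           # contribution gated by p

def predict_shifted_image(image, dx, dy, capsules):
    # Capsules only interact at the output layer, where their gated contributions are summed.
    return sum(capsule_contribution(image, dx, dy, *w) for w in capsules)
```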
As a simple demonstration of the efficacy of the transforming auto-encoder,
we trained a network with 30 capsules each of which had 10 recognition units
and 20 generation units. Each capsule sees the whole of an MNIST digit image.
Both the input and the output images are shifted randomly by -2, -1, 0, +1, or
+2 pixels in the x and y directions and the transforming auto-encoder is given
the resulting ∆x and ∆y as an additional input. Figure 2 shows that the capsules
do indeed output x and y values that shift in just the right way when the input
image is shifted. Figure 2 also shows that the capsules learn generative units with
projective fields that are highly localized. The receptive fields of the recognition
units are noisier and somewhat less localized.
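
The training cases for this experiment are easy to reproduce in outline. The sketch below (our own, simplified illustration) generates a randomly shifted input/target pair and the relative shift that is fed to the network; np.roll wraps pixels around the border, which is a simplification made for brevity.

```python
import numpy as np

def make_training_case(digit_2d, rng):
    """Return (input image, (dx, dy), target image) for one digit.

    Both images are copies of the digit shifted by -2..+2 pixels in x and y;
    (dx, dy) is the relative shift that the network receives as an extra input.
    """
    sx_in, sy_in, sx_out, sy_out = rng.integers(-2, 3, size=4)

    def shift(img, sx, sy):
        return np.roll(np.roll(img, sy, axis=0), sx, axis=1)  # wraps at the borders

    inp, out = shift(digit_2d, sx_in, sy_in), shift(digit_2d, sx_out, sy_out)
    return inp.ravel(), (sx_out - sx_in, sy_out - sy_in), out.ravel()

rng = np.random.default_rng(0)
x, (dx, dy), target = make_training_case(np.zeros((28, 28)), rng)
# Training backpropagates the discrepancy (e.g. squared error) between
# predict_shifted_image(x, dx, dy, capsules) and target.
```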

Fig. 3. Top: Full affine transformations using a transforming auto-encoder with 25 capsules each of which has 40 recognition units and 40 generation units. The top row shows input images; the middle row shows output images and the bottom row shows the correctly transformed output images. Bottom: The output weights of the first 20 generation units of the first 7 capsules for this transforming auto-encoder.
2.1 More Complex 2-D Transformations
If each capsule is given 9 real-valued outputs that are treated as a 3 × 3 ma-
trix A, a transforming auto-encoder can be trained to predict a full 2-D affine
transformation (translation, rotation, scaling and shearing). A known transfor-
mation matrix T is applied to the output of the capsule A to get the matrix TA.
The elements of TA are then used as the inputs to the generation units when
predicting the target output image.
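
Relative to the translation sketch above, the only changes are the capsule's pose output and how the known transformation enters. The following is again a hedged NumPy illustration: the presence probability and logistic units are carried over from the translation case, and the weight shapes are our own choices rather than the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def affine_capsule_contribution(image, T, W_rec, W_pose, W_gen, W_out):
    """One capsule of an affine transforming auto-encoder (biases omitted).

    T      : known 3x3 transformation between the input and the target image.
    W_pose : recognition units -> 9 pose outputs plus one presence logit, shape (n_rec, 10)
    W_gen  : elements of TA -> generation units,                          shape (9, n_gen)
    W_out  : generation units -> output pixels,                           shape (n_gen, n_pixels)
    """
    h = sigmoid(image @ W_rec)          # recognition units
    out = h @ W_pose
    A = out[:9].reshape(3, 3)           # the capsule's 3x3 pose matrix A
    p = sigmoid(out[9])                 # presence probability, as in the translation case
    TA = T @ A                          # apply the known transformation to the capsule's output
    g = sigmoid(TA.ravel() @ W_gen)     # generation units are driven by the elements of TA
    return p * (g @ W_out)              # gated contribution to the predicted target image
```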
2.2 Modeling Changes in 3-D Viewpoint
A major potential advantage of using matrix multiplies to model the effects of
viewpoint is that it should make it far easier to cope with 3-D. Our preliminary
experiments (see figure 4) used computer graphics to generate stereo images of
various types of car from many different viewpoints. The transforming auto-
encoder consisted of 900 capsules, each with two layers (32 then 64) of rectified
linear recognition units [8]. The capsules had 11x11 pixel receptive fields which
were arranged on a 30x30 grid over the 96x96 image, with a stride of 3 pixels
between neighbouring capsules. There was no weight-sharing. Each capsule pro-
duced from its layer of 64 recognition units a 3x3 matrix representation of the
3-D orientation of the feature that it was tuned to detect, as well as a probability
that its implicitly defined feature was present. This 3x3 matrix was then multi-
plied by the real transformation matrix between the source and target images,

References

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (1998)
Lowe, D.G.: Object recognition from local scale-invariant features. In: Proc. International Conference on Computer Vision (1999)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proc. International Conference on Machine Learning (2010)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proc. International Conference on Artificial Intelligence and Statistics (2010)
Riesenhuber, M., Poggio, T.: Hierarchical models of object recognition in cortex. Nature Neuroscience (1999)