Transforming auto-encoders
Citations
Representation Learning: A Review and New Perspectives
Spatial transformer networks
beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
Dynamic Routing Between Capsules
Object Detection With Deep Learning: A Review
References
Gradient-based learning applied to document recognition
Object recognition from local scale-invariant features
Rectified Linear Units Improve Restricted Boltzmann Machines
Understanding the difficulty of training deep feedforward neural networks
Hierarchical models of object recognition in cortex
Frequently Asked Questions (15)
Q2. What can these models learn without using, and what is their limitation?
They learn without using knowledge of transformations, but they only learn instantiation parameters that are linear functions of the image.
Q3. What is the effect of inactive capsules?
The inputs to the generation units are x + ∆x and y + ∆y, and the contributions that the capsule’s generation units make to the output image are multiplied by p, so inactive capsules have no effect.
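To make the arithmetic concrete, here is a minimal numpy sketch of a single capsule in a translation-only transforming auto-encoder. The recognition units read out x, y and p from the input patch; all weight names (W_rec, w_x, W_gen and so on) are illustrative assumptions rather than names from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def capsule_forward(patch, params, dx, dy):
    """Forward pass of one capsule (illustrative sketch, not the paper's exact architecture).

    patch  : flattened input image patch (1-D array)
    params : dict of weight arrays; the names used here are assumptions
    dx, dy : the global shift applied to the image, given to the network as an extra input
    """
    # Recognition units: a hidden layer from which x, y and a gating probability p are read out.
    h = sigmoid(params["W_rec"] @ patch + params["b_rec"])
    x = params["w_x"] @ h
    y = params["w_y"] @ h
    p = sigmoid(params["w_p"] @ h)

    # Generation units receive the shifted coordinates x + dx and y + dy.
    g = sigmoid(params["W_gen"] @ np.array([x + dx, y + dy]) + params["b_gen"])

    # The capsule's contribution to the output image is multiplied by p,
    # so an inactive capsule (p close to 0) has no effect on the reconstruction.
    return p * (params["W_out"] @ g)
```

The full transforming auto-encoder sums such contributions over many capsules and trains the whole network to reconstruct the shifted image.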
Q4. What is the common way to apply a Kalman filter to data in which the dynamics is a non-linear function of the observations?
The usual way to apply Kalman filters to data in which the dynamics is a non-linear function of the observations is to use an “extended” Kalman filter that linearizes the dynamics about the current operating point.
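A hedged sketch of that linearization step, using a finite-difference Jacobian and the standard prediction equations of an extended Kalman filter; the function and variable names are illustrative.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        step = np.zeros(x.size)
        step[i] = eps
        J[:, i] = (f(x + step) - fx) / eps
    return J

def ekf_predict(x_est, P, f, Q):
    """Prediction step of an extended Kalman filter: linearize the non-linear
    dynamics f about the current state estimate x_est."""
    F = numerical_jacobian(f, x_est)   # local linear model at the operating point
    x_pred = f(x_est)                  # state propagated through the true non-linearity
    P_pred = F @ P @ F.T + Q           # covariance propagated through the linearization
    return x_pred, P_pred
```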
Q5. What is the output of a convolutional pool?
In a convolutional pool, the combined output after subsampling is typically the scalar activity of the most active unit in the pool [11].
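A minimal sketch of this pooling rule, assuming non-overlapping square pools over a 2-D map of unit activities.

```python
import numpy as np

def max_pool(activities, pool=2):
    """Non-overlapping max pooling: each output is the activity of the most active
    unit in its pool; where that unit was within the pool is thrown away."""
    h, w = activities.shape
    h2, w2 = h // pool, w // pool
    blocks = activities[:h2 * pool, :w2 * pool].reshape(h2, pool, w2, pool)
    return blocks.max(axis=(1, 3))
```

Only the winning activity is passed on; the position of the winner inside each pool is discarded.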
Q6. How did working with stereo pairs change what each capsule in the transforming auto-encoder had to do?
Since the data consisted of stereo pairs of images, each capsule had to look at an 11x11 patch in both members of the stereo pair, as well as reconstructing a 22x22 patch in both members.
Q7. What do the locally invariant probabilities and equivariant instantiation parameters computed by capsules resemble?
The locally invariant probabilities that capsules compute resemble the outputs of complex cells, and the equivariant instantiation parameters resemble the outputs of simple cells.
Q8. How do some of the best computer vision systems model the spatial distribution of visual words in images?
Some of the best computer vision systems use histograms of oriented gradients as “visual words” and model the spatial distribution of these elements using a crude spatial pyramid.
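A hedged sketch of such a crude spatial pyramid over quantized local descriptors ("visual words"); the function name, arguments, and grid sizes are assumptions for illustration.

```python
import numpy as np

def spatial_pyramid_histogram(word_ids, positions, image_size, vocab_size, levels=(1, 2, 4)):
    """Concatenated histograms of visual words over a crude spatial pyramid.

    word_ids   : (N,) integer index of the visual word assigned to each local descriptor
    positions  : (N, 2) pixel (x, y) location of each descriptor
    image_size : (width, height) of the image
    vocab_size : number of visual words in the codebook
    levels     : grid sizes; level g splits the image into g x g cells
    """
    w, h = image_size
    features = []
    for g in levels:
        hist = np.zeros((g, g, vocab_size))
        cx = np.minimum((positions[:, 0] * g // w).astype(int), g - 1)
        cy = np.minimum((positions[:, 1] * g // h).astype(int), g - 1)
        np.add.at(hist, (cy, cx, word_ids), 1)   # count each word in its pyramid cell
        features.append(hist.ravel())
    return np.concatenate(features)
```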
Q9. How can larger and more complex visual entities, such as a face, be recognized?
Once pixel intensities have been converted into the outputs of a set of active, first-level capsules each of which produces an explicit representation of the pose of its visual entity, it is relatively easy to see how larger and more complex visual entities can be recognized by using agreements of the poses predicted by active, lower-level capsules.
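As a minimal sketch of the idea, assume poses are represented as 3x3 matrices acting on 2-D homogeneous coordinates and that each part stores a fixed part-to-whole transformation; both choices are assumptions made here for illustration.

```python
import numpy as np

def predict_whole_pose(part_pose, part_to_whole):
    """Predict the pose of the whole entity from the pose of one of its parts.

    Both the part pose and the fixed part-to-whole relation are 3x3 matrices acting on
    2-D homogeneous coordinates; this parametrization is an assumption for illustration.
    """
    return part_pose @ part_to_whole

def poses_agree(pred_a, pred_b, tol=0.1):
    """The larger entity is recognized when the poses predicted by its active parts coincide."""
    return np.linalg.norm(pred_a - pred_b) < tol
```

For example, hypothetical "nose" and "mouth" capsules would each predict the pose of the face, and a face capsule would turn on only when the two predictions roughly coincide.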
Q10. How do the authors get such a part-whole hierarchy off the ground?
In order to get such a part-whole hierarchy off the ground, the “capsules” that implement the lowest-level parts in the hierarchy need to extract explicit pose parameters from pixel intensities.
Q11. How can the cost of using multiple real values to represent pose be ameliorated?
It can be ameliorated by making each of the lowest-level capsules operate over a very limited region of the pose space and only allowing larger regions for more complex visual entities that are much less densely distributed.
Q12. Why is the location of the most active unit not used by higher levels, even when it is used in reconstructing the image?
Even if the location of this unit is used when creating the reconstruction required for unsupervised learning, it is not used by higher levels [5] because the aim of a convolutional net is to make the activities translation invariant.
Q13. What is the cost of using multiple real values to represent pose?
Using multiple real values is the natural way to represent pose information and it is much more efficient than using coarse coding [3], but it comes at a price: each capsule can represent only one instance of its visual entity at a time.
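To illustrate the efficiency claim, here is a small sketch contrasting a coarse (population) code for a 1-D position with a single real-valued output; the Gaussian tuning curves and the particular numbers are assumptions for illustration.

```python
import numpy as np

def coarse_code(x, centres, width=1.0):
    """Population ('coarse') code: x is represented by the graded activities of many
    broadly tuned units with the given tuning-curve centres."""
    return np.exp(-0.5 * ((x - centres) / width) ** 2)

# Coarse coding of a single 1-D position needs a whole vector of activities...
centres = np.linspace(0.0, 10.0, 21)
population_code = coarse_code(3.7, centres)   # 21 numbers

# ...whereas an instantiation parameter can carry the same information as one real value.
real_valued_code = np.array([3.7])            # 1 number
```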
Q14. What is the main advantage of using matrix multiplies to model the effects of viewpoint?
A major potential advantage of using matrix multiplies to model the effects of viewpoint is that it should make it far easier to cope with 3-D.
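A minimal sketch of that point, assuming poses are carried as 4x4 matrices in 3-D homogeneous coordinates (a parametrization chosen here purely for illustration): a 3-D viewpoint change is itself such a matrix, so adjusting a capsule's pose output for a new viewpoint is a single matrix multiply.

```python
import numpy as np

def rotation_about_z(theta):
    """4x4 homogeneous transform for a rotation about the z-axis: one example of a
    3-D viewpoint change."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0, 0.0],
                     [s,  c, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def change_viewpoint(pose, viewpoint_change):
    """Adjust a capsule's pose output for a new viewpoint with a single matrix multiply."""
    return viewpoint_change @ pose

# A capsule's pose output (the identity here, purely for illustration) after a 30-degree
# rotation of the viewpoint; the generation units would then decode this new pose.
new_pose = change_viewpoint(np.eye(4), rotation_about_z(np.pi / 6))
```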
Q15. Why are replicated copies of the same weight kernel not ideal for extracting the pose of a visual entity?
Replicated copies of exactly the same weight kernel are far from optimal for extracting the pose of a visual entity over a limited domain, especially if the replication must cover scale and orientation as well as position.