Harmonic Networks: Deep Translation and Rotation Equivariance
Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov and Gabriel J. Brostow
{d.worrall, s.garbin, d.turmukhambetov, g.brostow}@cs.ucl.ac.uk
University College London
Abstract
Translating or rotating an input image should not affect the results of many computer vision tasks. Convolutional neural networks (CNNs) are already translation equivariant: input image translations produce proportionate feature map translations. This is not the case for rotations. Global rotation equivariance is typically sought through data augmentation, but patch-wise equivariance is more difficult. We present Harmonic Networks or H-Nets, a CNN exhibiting equivariance to patch-wise translation and 360°-rotation. We achieve this by replacing regular CNN filters with circular harmonics, returning a maximal response and orientation for every receptive field patch. H-Nets use a rich, parameter-efficient and fixed computational complexity representation, and we show that deep feature maps within the network encode complicated rotational invariants. We demonstrate that our layers are general enough to be used in conjunction with the latest architectures and techniques, such as deep supervision and batch normalization. We also achieve state-of-the-art classification on rotated-MNIST, and competitive results on other benchmark challenges.
1. Introduction
We tackle the challenge of representing 360°-rotations in convolutional neural networks (CNNs) [19]. Currently, convolutional layers are constrained by design to map an image to a feature vector, and translated versions of the image map to proportionally-translated versions of the same feature vector [21] (ignoring edge effects); see Figure 1. However, until now, if one rotates the CNN input, then the feature vectors do not necessarily rotate in a meaningful or easily predictable manner. The sought-after property, directly relating input transformations to feature vector transformations, is called equivariance.

A special case of equivariance is invariance, where feature vectors remain constant under all transformations of the input. This can be a desirable property globally for a model, such as a classifier, but we should be careful not to restrict all intermediate levels of processing to be transformation invariant.

Project page: http://visual.cs.ucl.ac.uk/pubs/harmonicNets/
Figure 1. Patch-wise translation equivariance in CNNs arises from translational weight tying, so that a translation π of the input image I leads to a corresponding translation ψ of the feature maps f(I), where π ≠ ψ in general, due to pooling effects. However, for rotations, CNNs do not yet have a feature space transformation ψ 'hard-baked' into their structure, and it is complicated to discover what ψ may be, if it exists at all. Harmonic Networks have a hard-baked representation, which allows for easier interpretation of feature maps; see Figure 3.
For example, consider detecting a deformable object, such as a butterfly. The pose of the wings is limited in range, and so there are only certain poses our detector should normally see. A transformation invariant detector, good at detecting wings, would detect them whether they were bigger, further apart, rotated, etc., and it would encode all these cases with the same representation. It would fail to notice nonsense situations, however, such as a butterfly with wings rotated past the usual range, because it has thrown that extra pose information away. An equivariant detector, on the other hand, does not dispose of local pose information, and so it hands on a richer and more useful representation to downstream processes. Equivariance conveys more information about an input to downstream processes, and it also constrains the space of possible learned models to those that are valid under the rules of natural image formation [30]. This makes learning more reliable and helps with generalization.

For instance, consider CNNs. The key insight is that the statistics of natural images, embodied in the correlations between pixels, are a) invariant to translation, and b) highly localized. Thus features at every layer in a CNN are computed on local receptive fields, where weights are shared across translated receptive fields. This weight-tying serves both as a constraint on the translational structure of image statistics, and as an effective technique to reduce the number of learnable parameters (see Figure 1). In essence, translational equivariance has been 'baked' into the architecture of existing CNN models. We do the same for rotation and refer to it as hard-baking.
The current widely accepted practice to cope with rotation is to train with aggressive data augmentation [16]. This certainly improves generalization, but it is not exact, fails to capture local equivariances, and does not ensure equivariance at every layer within a network. How to maintain the richness of local rotation information is what we present in this paper. Another disadvantage of data augmentation is that it leads to the so-called black-box problem, where there is a lack of feature map interpretability. Indeed, close inspection of first-layer weights in a CNN reveals that many of them are rotated, scaled, and translated copies of one another [34]. Why waste computation learning all of these redundant weights?
In this paper, we present Harmonic Networks, or H-Nets. They design patch-wise 360°-rotational equivariance into deep image representations, by constraining the filters to the family of circular harmonics. The circular harmonics are steerable filters [7], which means that we can represent all rotated versions of a filter using just a finite, linear combination of steering bases. This overcomes the issue of learning multiple filter copies in CNNs, guarantees rotational equivariance, and produces feature maps that transform predictably under input rotation.
2. Related Work
Multiple existing approaches seek to encode rotational equivariance into CNNs. Many of these follow a broad approach of introducing filter or feature map copies at different rotations. None has dominated as standard practice.
Steerable filters
At the root of H-Nets lies the property of filter steerability [7]. Filters exhibiting steerability can be constructed at any rotation as a finite, linear combination of base filters. This removes the need to learn multiple filters at different rotations, and has the bonus of constant memory requirements. As such, H-Nets could be thought of as using an infinite bank of rotated filter copies. A work which combines steerable filters with learning is [23]. They build shallow features from steerable filters, which are fed into a kernel SVM for object detection and rigid pose regression. H-Nets use the same filters, with an added rotation offset term, so that filters in different layers can have orientation-selectivity relative to one another.
Hard-baked transformations in CNNs
While H-Nets hard-bake patch-wise 360°-rotation into the feature representation, numerous related works have encoded equivariance to discrete rotations. These works can be grouped into those which encode global versus patch-wise equivariance, and those which rotate filters versus feature maps. [3] introduce equivariance to 90°-rotations and dihedral flips in CNNs by copying the transformed filters at different rotation–flip combinations. More recently they generalized this theory to all group-structured transformations in [4], but they only demonstrated applications on finite groups; an extension to continuous transformations would require a treatment of anti-aliasing and bandlimiting. [24] use a larger number of rotations for texture classification, and [26] also use many rotated handcrafted filter copies, opting not to learn the filters. To achieve equivariance to a greater number of rotations, these methods would need an infinite amount of computation. H-Nets achieve equivariance to all rotations, but with finite computation.

[6] feed in multiple rotated copies of the CNN input and fuse the output predictions. [17] do the same for a broader class of global image transformations, and propose a novel per-pixel pooling technique for output fusion. As discussed, these techniques lead to global equivariances only and do not produce interpretable feature maps. [5] go one step further and copy each feature map at four 90°-rotations. They propose four different equivariance-preserving feature map transformations. Their CNN is similar to [3] in terms of what is being computed, but rotates feature maps instead of filters. A downside of this is that all inputs and feature maps have to be square, whereas we can use any sized input.
Learning generalized transformations
Others have tried to learn the transformations directly from the data. While this is an appealing idea, as we have said, for certain transformations it makes more sense to hard-bake these in for interpretability and reliability. [25] construct a higher-order Boltzmann machine, which learns tuples of transformed linear filters in input–output pairs. Although powerful, this has only been shown to work on shallow architectures. [9] introduced capsules, units of neurons designed to mimic the action of cortical columns. Capsules are designed to be invariant to complicated transformations of the input. Their outputs are merged at the deepest layer, and so are only invariant to global transformations. [22] present a method to regress equivariant feature detectors using an objective which penalizes representations that lie far from the equivariant manifold. Again, this only encourages global equivariance, although this work could be adapted to encourage equivariance at every layer of a deep pipeline.
3. Problem analysis
Many computer vision systems strive to be view indepen-
dent, such as object recognition, which is invariant to affine
transformations, or boundary detection, which is equivariant
to non-rigid deformations. H-Nets hard-bake
360
-rotation
equivariance into their feature representation, by constraining
the convolutional filters of a CNN to be from the family of
circular harmonics. Below, we outline the formal definition of
equivariance (Section
3.1), how the circular harmonics exhibit
rotational equivariance (Section 3.2) and some properties of
the circular harmonics, which we must heed for successful
integration into the CNN framework (Section 3.2).

Figure 2. Real and imaginary parts of the complex Gaussian filter W_m(r, φ; e^{−r²}, 0) = e^{−r²} e^{imφ}, for some rotation orders. As a simple example, we have set R(r) = e^{−r²} and β = 0, but in general we learn these quantities. Cross-correlation of a feature map of rotation order n with one of these filters of rotation order m results in a feature map of rotation order m + n. Note that the negative rotation order filters have flipped imaginary parts compared to the positive orders.
Continuous domain feature maps
In deep learning we use feature maps that live in a discrete domain. We shall instead use continuous spaces, because the analysis is easier. Later, in Section 4.2, we shall demonstrate how to convert back to the discrete domain for practical implementation, but for now we work entirely in continuous Euclidean space.
3.1. Equivariance
Equivariance is a useful property to have, because transformations π of the input produce predictable transformations ψ of the features, which are interpretable and can make learning easier. Formally, we say that a feature mapping f : X → Y is equivariant to a group of transformations if we can associate every transformation π ∈ Π of the input x ∈ X with a transformation ψ ∈ Ψ of the features; that is,

ψ[f(x)] = f(π[x]).   (1)

This means that the order in which we apply the feature mapping and the transformation is unimportant: they commute. An example is depicted in Figure 1, which shows that in CNNs the order of application of integer pixel-translations and the feature mapping is interchangeable. An important point of note is that π ≠ ψ in general, so if we seek for Π to be rotations in the image domain, we do not require the set of f such that Ψ 'looks like' a rotation in feature space; rather, we are searching for the set of f such that there exists an equivalent class of transformations Ψ in feature space. A special case of equivariance is invariance, when Ψ = {I}, the identity.
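To make Equation 1 concrete, here is a minimal numeric sketch (our addition, not from the paper) for the translation case: with circular boundary conditions and no pooling, the feature-space transformation ψ coincides with the input translation π.

# Minimal numeric check of Eq. (1) for translations, assuming circular
# ('wrap') boundaries so edge effects vanish; with no pooling, psi = pi.
import numpy as np
from scipy.ndimage import correlate

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
kernel = rng.standard_normal((5, 5))

f = lambda x: correlate(x, kernel, mode="wrap")        # feature mapping f
pi = lambda x: np.roll(x, shift=(3, -2), axis=(0, 1))  # input translation
psi = pi                                               # feature translation

assert np.allclose(psi(f(image)), f(pi(image)))        # psi[f(x)] == f(pi[x])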
3.2. The Complex Circular Harmonics
With data augmentation CNNs may learn some rotation equivariance, but this is difficult to quantify [21]. H-Nets take the simpler approach of hard-baking this structure in. If f is the feature mapping of a standard convolutional layer, then 360°-rotational equivariance can be hard-baked in by restricting the filters to be from the circular harmonic family (proof in the Supplementary Material):

W_m(r, φ; R, β) = R(r) e^{i(mφ + β)}.   (2)
Figure 3. DOWN: Cross-correlation of the input patch with W_m yields a scalar complex-valued response. ACROSS-THEN-DOWN: Cross-correlation with the θ-rotated image yields another complex-valued response. BOTTOM: We transform from the unrotated response to the rotated response through multiplication by e^{imθ}.
Here (r, φ) are the spatial coordinates of image/feature maps, expressed in polar form, m ∈ Z is known as the rotation order, R : R+ → R is a function, called the radial profile, which controls the overall shape of the filter, and β ∈ [0, 2π) is a phase offset term, which gives the filter orientation-selectivity. During training, we learn the radial profile and phase offset terms. Examples of the real component of W_m for a 'Gaussian envelope' and different rotation orders are shown in Figure 2. Since we are dealing with complex-valued filters, all filter responses are complex-valued, and we assume from now on that the reader understands that all feature maps are complex-valued, unless otherwise specified. Note that there are other works (e.g., [32]) which use complex filters, but our treatment differs in that the complex phase of the response is explicitly tied to rotation angle.
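As a concrete sketch (ours, not the authors' code), Equation 2 can be sampled on a square grid; the Gaussian envelope below mirrors Figure 2, whereas in the real model R(r) and β are learned.

# Sketch of Eq. (2): W_m(r, phi; R, beta) = R(r) exp(i(m*phi + beta)),
# sampled on a size x size grid with a Gaussian envelope as R(r).
import numpy as np

def circular_harmonic(size, m, beta=0.0, sigma=None):
    sigma = sigma or size / 6.0                 # envelope width (our choice)
    coords = np.arange(size) - (size - 1) / 2.0
    x, y = np.meshgrid(coords, coords)
    r, phi = np.hypot(x, y), np.arctan2(y, x)
    radial = np.exp(-r**2 / (2 * sigma**2))     # R(r); learned in the real model
    return radial * np.exp(1j * (m * phi + beta))

W1 = circular_harmonic(9, m=1)                  # real/imag parts as in Figure 2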
Rotational Equivariance of the Circular Harmonics
Some deep learning libraries implement cross-correlation rather than convolution, and since the understanding is slightly easier to follow, we consider correlation. Strictly, cross-correlation with complex functions requires that one of the arguments be conjugated, but we do not do this in our model/implementation, so

[W ⋆ F](p′, q′) = ∫ W(p − p′, q − q′) F(p, q) dp dq,   (3)

[W ∗ F](p′, q′) = ∫ W(p′ − p, q′ − q) F(p, q) dp dq.   (4)
Consider correlating a circular harmonic of order m with a rotated image patch. We assume that the image patch is only able to rotate locally about the origin of the filter. This means that the cross-correlation response is a scalar function of the input image patch rotation θ. Using the notation from Equation 1, and recalling that we are working in polar coordinates (r, φ), counter-clockwise rotation of an image F(r, φ) about the origin by an angle θ is F(r_θ[φ]) = F(r, φ − θ). As a shorthand we denote F_θ := F(r_θ[φ]). It is a well-known result [23, 7] (proof in the Supplementary Material) that

[W_m ⋆ F_θ] = e^{imθ} [W_m ⋆ F_0],   (5)

where we have written W_m in place of W_m(r, φ; R, β) for brevity. We see that the response to a θ-rotated image F_θ with a circular harmonic of order m is equivalent to the cross-correlation of the unrotated image F_0 with the harmonic, followed by multiplication by e^{imθ}. While the rotation is done in input space, multiplication by e^{imθ} is performed in feature space, and so, using the notation from Equation 1, ψ^θ_m[•] = e^{imθ} · •. This process is shown in Figure 3. Note that we have included a subscript m on the feature space transformation. This is important, because the kind of feature space transformation we apply is dependent on the rotation order of the harmonic. Because the phase of the response rotates with the input at frequency m, we say that the response is an m-equivariant feature map. By thinking of an input image as a complex-valued feature map with zero imaginary part, we could think of it as 0-equivariant.

The rotation order of a filter defines its response properties to input rotation. In particular, rotation order m = 0 defines invariance and m = 1 defines linear equivariance. For m = 0 this is because, denoting f_m := [W_m ⋆ F_0], we have ψ^θ_0[f_m] = e^{i·0·θ} · f_m = f_m, which is independent of θ. For m = 1, ψ^θ_1[f_m] = e^{i·1·θ} f_m: as the input rotates, e^{iθ} f_m is a complex-valued number of constant magnitude |f_m|, spinning round with a phase equal to θ. Naturally, we are not constrained to using rotation orders 0 or 1 only, and we make use of higher and negative orders in our work.
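Equation 5 can be verified numerically. A rough sketch (ours, reusing circular_harmonic from above; interpolation and axis conventions introduce small errors and possible sign flips):

# Rough check of Eq. (5): the response to a theta-rotated patch equals
# exp(i*m*theta) times the unrotated response, up to interpolation error.
# If the measured phase has the opposite sign, your axis/rotation
# conventions differ; negate theta.
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

rng = np.random.default_rng(1)
m, theta = 1, np.deg2rad(30.0)
patch = gaussian_filter(rng.standard_normal((9, 9)), sigma=1.5)        # F_0
patch_rot = rotate(patch, np.rad2deg(theta), reshape=False, order=3)   # F_theta

W = circular_harmonic(9, m=m)              # from the earlier sketch
resp0 = (W * patch).sum()                  # [W_m * F_0] at the filter centre
resp_theta = (W * patch_rot).sum()         # [W_m * F_theta]
err = abs(resp_theta - np.exp(1j * m * theta) * resp0) / abs(resp0)
print(f"relative error: {err:.3f}")        # small, but not exactly zero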
Arithmetic and the Equivariance Condition
Further important properties of the circular harmonics, which are proven in the Supplementary Material, are: 1) chained cross-correlation of rotation orders m_1 and m_2 leads to a new response with rotation order m_1 + m_2; 2) point-wise nonlinearities h : C → C acting solely on the magnitudes maintain rotational equivariance, so we can interleave cross-correlations with typical CNN nonlinearities adapted to the complex domain; 3) the summation of two responses of the same order m remains of order m. Thus, to construct a CNN where the output is M-equivariant to the input rotation, we require that the sum of rotation orders along any path equals M:

∑_{i=1}^{N} m_i = M.   (6)

This is the fundamental condition underpinning the equivariance properties of H-Nets, so we call it the equivariance condition. We note here that for our purposes our filters satisfy W_{−m} = W_m* (the complex conjugate), which saves on parameters, but this does not necessarily imply conjugacy of the responses unless F is real, which is only true at the input.
Figure 4. An example of a two-hidden-layer H-Net with m = 0 output, input to output, left to right. Each horizontal stream represents a series of feature maps (circles) of constant rotation order. The edges represent cross-correlations and are numbered with the rotation order of the corresponding filter. The sum of rotation orders along any path of consecutive edges through the network must equal M = 0, to maintain disentanglement of rotation orders.
4. Method
We have considered the 360°-rotational equivariance of feature maps arising from cross-correlation with the circular harmonics, and we determined that the rotation orders of chained cross-correlations sum. Next, we use these results to construct a deep architecture which can leverage the equivariance properties of circular harmonics.
4.1. Harmonic Networks
The rotation orders of feature maps and filters sum upon cross-correlation, so to achieve a given output rotation order, we must obey the equivariance condition. In fact, the equivariance condition must be met at every feature map; otherwise, it would be possible to arrive at the same feature map along two different paths with different summed rotation orders. The problem is that combining complex features whose phases rotate at different frequencies leads to entanglement of the responses. The resultant feature map is no longer equivariant to a single rotation order, making it difficult to work with. We resolve this by enforcing the equivariance condition at every feature map.
Our solution is to create separate streams of constant rotation order responses running through the network (see Figure 4). These streams contain multiple layers of feature maps, separated by rotation order zero cross-correlations and nonlinearities. Moving between streams, we use cross-correlations of rotation order equal to the difference between those two streams. It is easy to check that the equivariance condition holds in these networks, as the sketch below illustrates.
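The check is a telescoping-sum argument; a toy sketch (ours): between streams of orders n and p one uses a filter of order p − n, so any path's filter orders sum to the output order automatically.

# Every path through streams of orders {0, 1} uses cross-stream filters of
# order (destination - source), so path sums telescope to M automatically.
from itertools import product

streams, n_hidden, M = (0, 1), 3, 0           # input is order 0; output order M
for hidden in product(streams, repeat=n_hidden):
    route = (0, *hidden, M)                   # stream order at each layer
    orders = [b - a for a, b in zip(route, route[1:])]   # filter orders used
    assert sum(orders) == M                   # equivariance condition, Eq. (6)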
When multiple responses converge at a feature map, we have multiple choices of how to combine them: we could stack them, we could pool across them, or we could sum them [5]. To save on memory, we chose to sum responses of the same rotation order:

Y_p = ∑_{m,n : m+n=p} W_m ⋆ F_n.   (7)

Figure 5. H-Nets operate in a continuous spatial domain, but we can implement them on pixel-domain data because sampling and cross-correlation commute. The schematic shows an example of a layer of an H-Net (magnitudes only). The solid arrows follow the path of the implementation, while the dashed arrows follow the possible alternative, which is easier to analyze but computationally infeasible. The introduction of sampling defines centers of equivariance at pixel centers (yellow dots), about which a feature map is rotationally equivariant.
Y_p is then fed into the next layer. Usually in our experiments we use streams of orders 0 and 1, which we found to work well; this is justified by the fact that CNN filters tend to contain very little high frequency information [12].
Above, we see that the structure of the Harmonic Network is very simple. We replaced regular CNN filters with radially reweighted and phase-shifted circular harmonics. This causes each filter response to be equivariant to input rotations with order m. To prevent responses of different rotation orders from entangling upon summation, we separated filter responses into streams of equal rotation order.
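To illustrate how the streams and Equation 7 combine in code, here is a sketch of a single layer (our illustration, not the authors' implementation; single-channel maps and circular boundaries are assumptions):

# One H-Net layer: responses landing on stream p sum over all filter/input
# rotation-order pairs with m + n = p, keeping orders disentangled.
import numpy as np
from scipy.ndimage import correlate

def ccorr(F, W):
    # complex cross-correlation from four real ones, cf. Eq. (9) below
    re = correlate(F.real, W.real, mode="wrap") - correlate(F.imag, W.imag, mode="wrap")
    im = correlate(F.imag, W.real, mode="wrap") + correlate(F.real, W.imag, mode="wrap")
    return re + 1j * im

def hnet_layer(features, filters, out_orders=(0, 1)):
    """features: {n: complex map}, filters: {m: kernel W_m}; returns {p: Y_p}."""
    out = {}
    for p in out_orders:
        Y = 0
        for n, F in features.items():
            m = p - n                      # Eq. (7): only m + n = p contribute
            if m in filters:
                Y = Y + ccorr(F, filters[m])
        out[p] = Y
    return out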
Complex nonlinearities
Between cross-correlations, we use complex nonlinearities that act on the magnitudes of the complex feature maps only, to preserve rotational equivariance. An example is a complex version of the ReLU:

C-ReLU_b(X e^{iφ}) = ReLU(X + b) e^{iφ}.   (8)

We can provide similar analogues for other nonlinearities and for Batch Normalization [11], which we use in our experiments.
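A direct sketch of Equation 8 (ours): the bias offsets the magnitude before rectification while the phase passes through untouched, which preserves rotational equivariance.

# Complex ReLU of Eq. (8): act on the magnitude, keep the phase.
import numpy as np

def c_relu(z, b):
    return np.maximum(np.abs(z) + b, 0.0) * np.exp(1j * np.angle(z))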
We have thus far presented the Harmonic Network. Each layer is a collection of feature maps of different rotation orders, which transform predictably under rotation of the input to the network, and the 360°-rotation equivariance is achieved with finite computation. Next, we show how to implement this in practice.
4.2. Implementation: Discrete cross-correlations
Until now, we have operated on a domain with continuous spatial dimensions, Ω = R × R × {1, ..., k}. However, the H-Net needs to operate on real-world images, which are sampled on a 2D grid; thus we need to anti-alias the input to each discretized layer. We do this with a simple Gaussian blur. We can then use a regular CNN architecture without any problems. This relies on the fact that the order of bandlimited sampling and cross-correlation is interchangeable [7]: either we correlate in continuous space and then downsample, or we downsample and then correlate in the discrete space. Since point-wise nonlinearities and sampling also commute, the entire H-Net, seen as a deep feature-mapping, commutes with sampling. This could allow us to implement the H-Net on non-regular grids, although we did not explore this.

Figure 6. Images are sampled on a rectangular grid but our filters are defined in the polar domain, so we bandlimit and resample the data before cross-correlation via Gaussian resampling. (Panels: pixel filter; polar filter; bandlimit and resample signal.)
Viewing cross-correlation on discrete domains sheds some insight into how the equivariance properties behave. In Figure 5, we see that the sampling strategy introduces multiple origins, one for each feature map patch. We call these centers of equivariance, because a feature map will exhibit local rotation equivariance about each of these points. If we move to using more exotic sampling strategies, such as strided cross-correlation or average pooling, then the centers of equivariance are ablated or shifted. If we were to use max-pooling, then the center of equivariance would be a complicated nonlinear function of the input image and harmonic weights. For this reason we have not used max-pooling in our experiments.
Complex cross-correlations
On a practical note, it is worth mentioning that complex cross-correlation can be implemented efficiently using four real cross-correlations:

W ⋆ F = [W^Re ⋆ F^Re − W^Im ⋆ F^Im] + i [W^Re ⋆ F^Im + W^Im ⋆ F^Re],   (9)

where the first bracket is the real response and the second the imaginary response. So circular harmonics can be implemented in current deep learning frameworks, with minor engineering.
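For instance, a minimal sketch of Equation 9 (PyTorch is our choice here, not prescribed by the paper; torch.nn.functional.conv2d computes cross-correlation, matching the convention above):

# One complex cross-correlation as four real ones, Eq. (9).
import torch
import torch.nn.functional as F

def complex_corr(f_re, f_im, w_re, w_im):
    # f_*: (batch, in_ch, H, W); w_*: (out_ch, in_ch, kH, kW)
    real = F.conv2d(f_re, w_re) - F.conv2d(f_im, w_im)   # real response
    imag = F.conv2d(f_im, w_re) + F.conv2d(f_re, w_im)   # imaginary response
    return real, imag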
We implement a grid-resampled version of the filters, W(x_i) = ∑_j g_i(r_j) W(r_j), with g_i(r_j) ∝ e^{−‖r_i − x_j‖² / (2σ²)} (see Figure 6). The polar representation (r_j, φ_j) can be mapped from the components r_j by r_j = [r_j cos φ_j, r_j sin φ_j]. If we stack all the polar filter samples into a matrix, we can write each point as the outer product of a radial tensor R_j and a trigonometric angular tensor [cos mΦ_{r_j}, i sin mΦ_{r_j}]. The phase offset β can be separated out by noting that

W_m(r_j) = ∑_{i=1}^{I} R(r_j) [I cos β, −I sin β; I sin β, I cos β] [cos mΦ_{r_j}; i sin mΦ_{r_j}],   (10)

where the complex exponential and trigonometric terms are element-wise, and I is the identity matrix. This is just a reweighting of the ring elements. In full generality, we could also use a per-radius phase β_{r_i}, which would allow for spiral-like left- and right-handed features, but we did not investigate this.
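A sketch of this Gaussian resampling step (ours; normalising the weights is our assumption, since the text only states proportionality):

# Splat polar samples W(r_j) onto pixel centres x_i with Gaussian weights
# g_i(r_j) ~ exp(-||x_i - r_j||^2 / (2 sigma^2)), as in Figure 6.
import numpy as np

def polar_to_grid(polar_xy, polar_vals, size, sigma=0.5):
    """polar_xy: (J, 2) Cartesian positions of polar samples; polar_vals: (J,)."""
    c = np.arange(size) - (size - 1) / 2.0
    xs, ys = np.meshgrid(c, c)
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1)   # pixel centres x_i
    d2 = ((grid[:, None, :] - polar_xy[None, :, :]) ** 2).sum(-1)
    g = np.exp(-d2 / (2 * sigma**2))
    g /= g.sum(axis=1, keepdims=True)                   # normalise (our assumption)
    return (g @ polar_vals).reshape(size, size)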

References (partial)

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.