Harmonic Networks: Deep Translation and Rotation Equivariance
Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov and Gabriel J. Brostow
{d.worrall, s.garbin, d.turmukhambetov, g.brostow}@cs.ucl.ac.uk
University College London
Abstract
Translating or rotating an input image should not affect the results of many computer vision tasks. Convolutional neural networks (CNNs) are already translation equivariant: input image translations produce proportionate feature map translations. This is not the case for rotations. Global rotation equivariance is typically sought through data augmentation, but patch-wise equivariance is more difficult. We present Harmonic Networks or H-Nets, a CNN exhibiting equivariance to patch-wise translation and 360°-rotation. We achieve this by replacing regular CNN filters with circular harmonics, returning a maximal response and orientation for every receptive field patch. H-Nets use a rich, parameter-efficient and fixed computational complexity representation, and we show that deep feature maps within the network encode complicated rotational invariants. We demonstrate that our layers are general enough to be used in conjunction with the latest architectures and techniques, such as deep supervision and batch normalization. We also achieve state-of-the-art classification on rotated-MNIST, and competitive results on other benchmark challenges.
1. Introduction
We tackle the challenge of representing 360°-rotations in convolutional neural networks (CNNs) [19]. Currently, convolutional layers are constrained by design to map an image to a feature vector, and translated versions of the image map to proportionally-translated versions of the same feature vector [21] (ignoring edge effects); see Figure 1. However, until now, if one rotates the CNN input, then the feature vectors do not necessarily rotate in a meaningful or easily predictable manner. The sought-after property, directly relating input transformations to feature vector transformations, is called equivariance.

A special case of equivariance is invariance, where feature vectors remain constant under all transformations of the input. This can be a desirable property globally for a model, such as a classifier, but we should be careful not to restrict all intermediate levels of processing to be transformation invariant.

Project page: http://visual.cs.ucl.ac.uk/pubs/harmonicNets/
Figure 1. Patch-wise translation equivariance in CNNs arises from translational weight tying, so that a translation π of the input image I leads to a corresponding translation ψ of the feature maps f(I), where π ≠ ψ in general, due to pooling effects. However, for rotations, CNNs do not yet have a feature space transformation ψ 'hard-baked' into their structure, and it is complicated to discover what ψ may be, if it exists at all. Harmonic Networks have a hard-baked representation, which allows for easier interpretation of feature maps; see Figure 3.
For example, consider detecting a deformable object, such as a butterfly. The pose of the wings is limited in range, and so there are only certain poses our detector should normally see. A transformation invariant detector, good at detecting wings, would detect them whether they were bigger, further apart, rotated, etc., and it would encode all these cases with the same representation. It would fail to notice nonsense situations, however, such as a butterfly with wings rotated past the usual range, because it has thrown that extra pose information away. An equivariant detector, on the other hand, does not dispose of local pose information, and so it hands on a richer and more useful representation to downstream processes. Equivariance conveys more information about an input to downstream processes, and it also constrains the space of possible learned models to those that are valid under the rules of natural image formation [30]. This makes learning more reliable and helps with generalization.

For instance, consider CNNs. The key insight is that the statistics of natural images, embodied in the correlations between pixels, are a) invariant to translation, and b) highly localized. Thus features at every layer in a CNN are computed on local receptive fields, where weights are shared across translated receptive fields. This weight-tying serves both as a constraint on the translational structure of image statistics, and as an effective technique to reduce the number of learnable parameters (see Figure 1). In essence, translational equivariance has been 'baked' into the architecture of existing CNN models. We do the same for rotation and refer to it as hard-baking.
The current widely accepted practice to cope with rotation is to train with aggressive data augmentation [16]. This certainly improves generalization, but it is not exact, fails to capture local equivariances, and does not ensure equivariance at every layer within a network. How to maintain the richness of local rotation information is what we present in this paper. Another disadvantage of data augmentation is that it leads to the so-called black-box problem, where there is a lack of feature map interpretability. Indeed, close inspection of first-layer weights in a CNN reveals that many of them are rotated, scaled, and translated copies of one another [34]. Why waste computation learning all of these redundant weights?
In this paper, we present Harmonic Networks, or H-Nets. They design patch-wise 360°-rotational equivariance into deep image representations, by constraining the filters to the family of circular harmonics. The circular harmonics are steerable filters [7], which means that we can represent all rotated versions of a filter using just a finite, linear combination of steering bases. This overcomes the issue of learning multiple filter copies in CNNs, guarantees rotational equivariance, and produces feature maps that transform predictably under input rotation.
2. Related Work
Multiple existing approaches seek to encode rotational equivariance into CNNs. Many of these follow a broad approach of introducing filter or feature map copies at different rotations. None has dominated as standard practice.
Steerable filters
At the root of H-Nets lies the property of filter steerability [7]. Filters exhibiting steerability can be constructed at any rotation as a finite, linear combination of base filters. This removes the need to learn multiple filters at different rotations, and has the bonus of constant memory requirements. As such, H-Nets could be thought of as using an infinite bank of rotated filter copies. A work which combines steerable filters with learning is [23]. They build shallow features from steerable filters, which are fed into a kernel SVM for object detection and rigid pose regression. H-Nets use the same filters, with an added rotation offset term, so that filters in different layers can have orientation-selectivity relative to one another.
Hard-baked transformations in CNNs
While H-Nets hard-bake patch-wise 360°-rotation into the feature representation, numerous related works have encoded equivariance to discrete rotations. These works can be grouped into those which encode global versus patch-wise equivariance, and those which rotate filters versus feature maps. [3] introduce equivariance to 90°-rotations and dihedral flips in CNNs by copying the transformed filters at different rotation–flip combinations. More recently they generalized this theory to all group-structured transformations in [4], but they only demonstrated applications on finite groups; an extension to continuous transformations would require a treatment of anti-aliasing and bandlimiting. [24] use a larger number of rotations for texture classification, and [26] also use many rotated handcrafted filter copies, opting not to learn the filters. To achieve equivariance to a greater number of rotations, these methods would need an infinite amount of computation. H-Nets achieve equivariance to all rotations, but with finite computation.

[6] feed in multiple rotated copies of the CNN input and fuse the output predictions. [17] do the same for a broader class of global image transformations, and propose a novel per-pixel pooling technique for output fusion. As discussed, these techniques lead to global equivariances only and do not produce interpretable feature maps. [5] go one step further and copy each feature map at four 90°-rotations. They propose four different equivariance-preserving feature map transformations. Their CNN is similar to [3] in terms of what is being computed, but rotates feature maps instead of filters. A downside of this is that all inputs and feature maps have to be square, whereas we can use any sized input.
Learning generalized transformations
Others have tried to learn the transformations directly from the data. While this is an appealing idea, as we have said, for certain transformations it makes more sense to hard-bake these in for interpretability and reliability. [25] construct a higher-order Boltzmann machine, which learns tuples of transformed linear filters in input–output pairs. Although powerful, this has only been shown to work on shallow architectures. [9] introduced capsules, units of neurons designed to mimic the action of cortical columns. Capsules are designed to be invariant to complicated transformations of the input. Their outputs are merged at the deepest layer, and so are only invariant to global transformations. [22] present a method to regress equivariant feature detectors using an objective which penalizes representations that lie far from the equivariant manifold. Again, this only encourages global equivariance, although this work could be adapted to encourage equivariance at every layer of a deep pipeline.
3. Problem analysis
Many computer vision systems strive to be view indepen-
dent, such as object recognition, which is invariant to affine
transformations, or boundary detection, which is equivariant
to non-rigid deformations. H-Nets hard-bake
360
-rotation
equivariance into their feature representation, by constraining
the convolutional filters of a CNN to be from the family of
circular harmonics. Below, we outline the formal definition of
equivariance (Section
3.1), how the circular harmonics exhibit
rotational equivariance (Section 3.2) and some properties of
the circular harmonics, which we must heed for successful
integration into the CNN framework (Section 3.2).

Figure 2. Real and imaginary parts of the complex Gaussian filter W_m(r, φ; e^{−r²}, 0) = e^{−r²} e^{imφ}, for some rotation orders. As a simple example, we have set R(r) = e^{−r²} and β = 0, but in general we learn these quantities. Cross-correlation of a feature map of rotation order n with one of these filters of rotation order m results in a feature map of rotation order m + n. Note that the negative rotation order filters have flipped imaginary parts compared to the positive orders.
Continuous domain feature maps
In deep learning we use feature maps that live in a discrete domain. We shall instead use continuous spaces, because the analysis is easier. Later, in Section 4.2, we shall demonstrate how to convert back to the discrete domain for practical implementation, but for now we work entirely in continuous Euclidean space.
3.1. Equivariance
Equivariance is a useful property to have, because transformations π of the input produce predictable transformations ψ of the features, which are interpretable and can make learning easier. Formally, we say that a feature mapping f : X → Y is equivariant to a group of transformations if we can associate every transformation π ∈ Π of the input x ∈ X with a transformation ψ ∈ Ψ of the features; that is,

ψ[f(x)] = f(π[x]).   (1)

This means that the order in which we apply the feature mapping and the transformation is unimportant: they commute. An example is depicted in Figure 1, which shows that in CNNs the order of application of integer pixel-translations and the feature mapping is interchangeable. An important point of note is that π ≠ ψ in general, so if we seek for Π to be rotations in the image domain, we do not require the set of f such that Ψ 'looks like' a rotation in feature space; rather, we are searching for the set of f such that there exists an equivalent class of transformations Ψ in feature space. A special case of equivariance is invariance, when Ψ = {I}, the identity.
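To make Equation 1 concrete, here is a minimal numeric sketch (our addition, not from the paper) for the translation case: with circular boundary conditions and no pooling, the feature-space transformation ψ coincides with the input translation π.

# Minimal numeric check of Eq. (1) for translations, assuming circular
# ('wrap') boundaries so edge effects vanish; with no pooling, psi = pi.
import numpy as np
from scipy.ndimage import correlate

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
kernel = rng.standard_normal((5, 5))

f = lambda x: correlate(x, kernel, mode="wrap")        # feature mapping f
pi = lambda x: np.roll(x, shift=(3, -2), axis=(0, 1))  # input translation
psi = pi                                               # feature translation

assert np.allclose(psi(f(image)), f(pi(image)))        # psi[f(x)] == f(pi[x])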
3.2. The Complex Circular Harmonics
With data augmentation CNNs may learn some rotation equivariance, but this is difficult to quantify [21]. H-Nets take the simpler approach of hard-baking this structure in. If f is the feature mapping of a standard convolutional layer, then 360°-rotational equivariance can be hard-baked in by restricting the filters to be from the circular harmonic family (proof in the Supplementary Material):

W_m(r, φ; R, β) = R(r) e^{i(mφ + β)}.   (2)
Figure 3. DOWN: Cross-correlation of the input patch with W_m yields a scalar complex-valued response. ACROSS-THEN-DOWN: Cross-correlation with the θ-rotated image yields another complex-valued response. BOTTOM: We transform from the unrotated response to the rotated response through multiplication by e^{imθ}.
Here (r, φ) are the spatial coordinates of image/feature maps, expressed in polar form, m ∈ Z is known as the rotation order, R : R+ → R is a function, called the radial profile, which controls the overall shape of the filter, and β ∈ [0, 2π) is a phase offset term, which gives the filter orientation-selectivity. During training, we learn the radial profile and phase offset terms. Examples of the real component of W_m for a 'Gaussian envelope' and different rotation orders are shown in Figure 2. Since we are dealing with complex-valued filters, all filter responses are complex-valued, and we assume from now on that the reader understands that all feature maps are complex-valued, unless otherwise specified. Note that there are other works (e.g., [32]) which use complex filters, but our treatment differs in that the complex phase of the response is explicitly tied to rotation angle.
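As a concrete sketch (ours, not the authors' code), Equation 2 can be sampled on a square grid; the Gaussian envelope below mirrors Figure 2, whereas in the real model R(r) and β are learned.

# Sketch of Eq. (2): W_m(r, phi; R, beta) = R(r) exp(i(m*phi + beta)),
# sampled on a size x size grid with a Gaussian envelope as R(r).
import numpy as np

def circular_harmonic(size, m, beta=0.0, sigma=None):
    sigma = sigma or size / 6.0                 # envelope width (our choice)
    coords = np.arange(size) - (size - 1) / 2.0
    x, y = np.meshgrid(coords, coords)
    r, phi = np.hypot(x, y), np.arctan2(y, x)
    radial = np.exp(-r**2 / (2 * sigma**2))     # R(r); learned in the real model
    return radial * np.exp(1j * (m * phi + beta))

W1 = circular_harmonic(9, m=1)                  # real/imag parts as in Figure 2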
Rotational Equivariance of the Circular Harmonics
Some deep learning libraries implement cross-correlation rather than convolution, and since the understanding is slightly easier to follow, we consider correlation. Strictly, cross-correlation with complex functions requires that one of the arguments be conjugated, but we do not do this in our model/implementation, so

[W ⋆ F](p′, q′) = ∫ W(p − p′, q − q′) F(p, q) dp dq,   (3)

[W ∗ F](p′, q′) = ∫ W(p′ − p, q′ − q) F(p, q) dp dq.   (4)
Consider correlating a circular harmonic of order m with a rotated image patch. We assume that the image patch is only able to rotate locally about the origin of the filter. This means that the cross-correlation response is a scalar function of the input image patch rotation θ. Using the notation from Equation 1, and recalling that we are working in polar coordinates (r, φ), counter-clockwise rotation of an image F(r, φ) about the origin by an angle θ is F(r_θ[φ]) = F(r, φ − θ). As a shorthand we denote F_θ := F(r_θ[φ]). It is a well-known result [23, 7] (proof in the Supplementary Material) that

[W_m ⋆ F_θ] = e^{imθ} [W_m ⋆ F_0],   (5)

where we have written W_m in place of W_m(r, φ; R, β) for brevity. We see that the response to a θ-rotated image F_θ with a circular harmonic of order m is equivalent to the cross-correlation of the unrotated image F_0 with the harmonic, followed by multiplication by e^{imθ}. While the rotation is done in input space, multiplication by e^{imθ} is performed in feature space, and so, using the notation from Equation 1, ψ^θ_m[•] = e^{imθ} · •. This process is shown in Figure 3. Note that we have included a subscript m on the feature space transformation. This is important, because the kind of feature space transformation we apply is dependent on the rotation order of the harmonic. Because the phase of the response rotates with the input at frequency m, we say that the response is an m-equivariant feature map. By thinking of an input image as a complex-valued feature map with zero imaginary part, we could think of it as 0-equivariant.

The rotation order of a filter defines its response properties to input rotation. In particular, rotation order m = 0 defines invariance and m = 1 defines linear equivariance. For m = 0 this is because, denoting f_m := [W_m ⋆ F_0], we have ψ^θ_0[f_m] = e^{i·0·θ} · f_m = f_m, which is independent of θ. For m = 1, ψ^θ_1[f_m] = e^{i·1·θ} f_m: as the input rotates, e^{iθ} f_m is a complex-valued number of constant magnitude |f_m|, spinning round with a phase equal to θ. Naturally, we are not constrained to using rotation orders 0 or 1 only, and we make use of higher and negative orders in our work.
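Equation 5 can be verified numerically. A rough sketch (ours, reusing circular_harmonic from above; interpolation and axis conventions introduce small errors and possible sign flips):

# Rough check of Eq. (5): the response to a theta-rotated patch equals
# exp(i*m*theta) times the unrotated response, up to interpolation error.
# If the measured phase has the opposite sign, your axis/rotation
# conventions differ; negate theta.
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

rng = np.random.default_rng(1)
m, theta = 1, np.deg2rad(30.0)
patch = gaussian_filter(rng.standard_normal((9, 9)), sigma=1.5)        # F_0
patch_rot = rotate(patch, np.rad2deg(theta), reshape=False, order=3)   # F_theta

W = circular_harmonic(9, m=m)              # from the earlier sketch
resp0 = (W * patch).sum()                  # [W_m * F_0] at the filter centre
resp_theta = (W * patch_rot).sum()         # [W_m * F_theta]
err = abs(resp_theta - np.exp(1j * m * theta) * resp0) / abs(resp0)
print(f"relative error: {err:.3f}")        # small, but not exactly zero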
Arithmetic and the Equivariance Condition
Further important properties of the circular harmonics, which are proven in the Supplementary Material, are: 1) chained cross-correlation of rotation orders m_1 and m_2 leads to a new response with rotation order m_1 + m_2; 2) point-wise nonlinearities h : C → C acting solely on the magnitudes maintain rotational equivariance, so we can interleave cross-correlations with typical CNN nonlinearities adapted to the complex domain; 3) the summation of two responses of the same order m remains of order m. Thus, to construct a CNN where the output is M-equivariant to the input rotation, we require that the sum of rotation orders along any path equals M:

∑_{i=1}^{N} m_i = M.   (6)

This is the fundamental condition underpinning the equivariance properties of H-Nets, so we call it the equivariance condition. We note here that for our purposes our filters satisfy W_{−m} = W_m* (the complex conjugate), which saves on parameters, but this does not necessarily imply conjugacy of the responses unless F is real, which is only true at the input.
Figure 4. An example of a two-hidden-layer H-Net with m = 0 output, input to output, left to right. Each horizontal stream represents a series of feature maps (circles) of constant rotation order. The edges represent cross-correlations and are numbered with the rotation order of the corresponding filter. The sum of rotation orders along any path of consecutive edges through the network must equal M = 0, to maintain disentanglement of rotation orders.
4. Method
We have considered the 360°-rotational equivariance of feature maps arising from cross-correlation with the circular harmonics, and we determined that the rotation orders of chained cross-correlations sum. Next, we use these results to construct a deep architecture which can leverage the equivariance properties of circular harmonics.
4.1. Harmonic Networks
The rotation orders of feature maps and filters sum upon cross-correlation, so to achieve a given output rotation order, we must obey the equivariance condition. In fact, the equivariance condition must be met at every feature map; otherwise, it would be possible to arrive at the same feature map along two different paths with different summed rotation orders. The problem is that combining complex features whose phases rotate at different frequencies leads to entanglement of the responses. The resultant feature map is no longer equivariant to a single rotation order, making it difficult to work with. We resolve this by enforcing the equivariance condition at every feature map.
Our solution is to create separate streams of constant rotation order responses running through the network (see Figure 4). These streams contain multiple layers of feature maps, separated by rotation order zero cross-correlations and nonlinearities. Moving between streams, we use cross-correlations of rotation order equal to the difference between those two streams. It is easy to check that the equivariance condition holds in these networks, as the sketch below illustrates.
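The check is a telescoping-sum argument; a toy sketch (ours): between streams of orders n and p one uses a filter of order p − n, so any path's filter orders sum to the output order automatically.

# Every path through streams of orders {0, 1} uses cross-stream filters of
# order (destination - source), so path sums telescope to M automatically.
from itertools import product

streams, n_hidden, M = (0, 1), 3, 0           # input is order 0; output order M
for hidden in product(streams, repeat=n_hidden):
    route = (0, *hidden, M)                   # stream order at each layer
    orders = [b - a for a, b in zip(route, route[1:])]   # filter orders used
    assert sum(orders) == M                   # equivariance condition, Eq. (6)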
When multiple responses converge at a feature map, we have multiple choices of how to combine them: we could stack them, we could pool across them, or we could sum them [5]. To save on memory, we chose to sum responses of the same rotation order:

Y_p = ∑_{m,n : m+n=p} W_m ⋆ F_n.   (7)

Figure 5. H-Nets operate in a continuous spatial domain, but we can implement them on pixel-domain data because sampling and cross-correlation commute. The schematic shows an example of a layer of an H-Net (magnitudes only). The solid arrows follow the path of the implementation, while the dashed arrows follow the possible alternative, which is easier to analyze but computationally infeasible. The introduction of sampling defines centers of equivariance at pixel centers (yellow dots), about which a feature map is rotationally equivariant.
Y_p is then fed into the next layer. Usually in our experiments we use streams of orders 0 and 1, which we found to work well; this is justified by the fact that CNN filters tend to contain very little high frequency information [12].
Above, we see that the structure of the Harmonic Network is very simple. We replaced regular CNN filters with radially reweighted and phase-shifted circular harmonics. This causes each filter response to be equivariant to input rotations with order m. To prevent responses of different rotation orders from entangling upon summation, we separated filter responses into streams of equal rotation order.
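To illustrate how the streams and Equation 7 combine in code, here is a sketch of a single layer (our illustration, not the authors' implementation; single-channel maps and circular boundaries are assumptions):

# One H-Net layer: responses landing on stream p sum over all filter/input
# rotation-order pairs with m + n = p, keeping orders disentangled.
import numpy as np
from scipy.ndimage import correlate

def ccorr(F, W):
    # complex cross-correlation from four real ones, cf. Eq. (9) below
    re = correlate(F.real, W.real, mode="wrap") - correlate(F.imag, W.imag, mode="wrap")
    im = correlate(F.imag, W.real, mode="wrap") + correlate(F.real, W.imag, mode="wrap")
    return re + 1j * im

def hnet_layer(features, filters, out_orders=(0, 1)):
    """features: {n: complex map}, filters: {m: kernel W_m}; returns {p: Y_p}."""
    out = {}
    for p in out_orders:
        Y = 0
        for n, F in features.items():
            m = p - n                      # Eq. (7): only m + n = p contribute
            if m in filters:
                Y = Y + ccorr(F, filters[m])
        out[p] = Y
    return out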
Complex nonlinearities
Between cross-correlations, we use complex nonlinearities that act on the magnitudes of the complex feature maps only, to preserve rotational equivariance. An example is a complex version of the ReLU:

C-ReLU_b(X e^{iφ}) = ReLU(X + b) e^{iφ}.   (8)

We can provide similar analogues for other nonlinearities and for Batch Normalization [11], which we use in our experiments.
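A direct sketch of Equation 8 (ours): the bias offsets the magnitude before rectification while the phase passes through untouched, which preserves rotational equivariance.

# Complex ReLU of Eq. (8): act on the magnitude, keep the phase.
import numpy as np

def c_relu(z, b):
    return np.maximum(np.abs(z) + b, 0.0) * np.exp(1j * np.angle(z))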
We have thus far presented the Harmonic Network. Each layer is a collection of feature maps of different rotation orders, which transform predictably under rotation of the input to the network, and the 360°-rotation equivariance is achieved with finite computation. Next, we show how to implement this in practice.
4.2. Implementation: Discrete cross-correlations
Until now, we have operated on a domain with continuous spatial dimensions, Ω = R × R × {1, ..., k}. However, the H-Net needs to operate on real-world images, which are sampled on a 2D grid; thus we need to anti-alias the input to each discretized layer. We do this with a simple Gaussian blur. We can then use a regular CNN architecture without any problems. This relies on the fact that the order of bandlimited sampling and cross-correlation is interchangeable [7]: either we correlate in continuous space and then downsample, or we downsample and then correlate in the discrete space. Since point-wise nonlinearities and sampling also commute, the entire H-Net, seen as a deep feature-mapping, commutes with sampling. This could allow us to implement the H-Net on non-regular grids, although we did not explore this.

Figure 6. Images are sampled on a rectangular grid but our filters are defined in the polar domain, so we bandlimit and resample the data before cross-correlation via Gaussian resampling. (Panels: pixel filter; polar filter; bandlimit and resample signal.)
Viewing cross-correlation on discrete domains sheds some insight into how the equivariance properties behave. In Figure 5, we see that the sampling strategy introduces multiple origins, one for each feature map patch. We call these centers of equivariance, because a feature map will exhibit local rotation equivariance about each of these points. If we move to using more exotic sampling strategies, such as strided cross-correlation or average pooling, then the centers of equivariance are ablated or shifted. If we were to use max-pooling, then the center of equivariance would be a complicated nonlinear function of the input image and harmonic weights. For this reason we have not used max-pooling in our experiments.
Complex cross-correlations
On a practical note, it is worth mentioning that complex cross-correlation can be implemented efficiently using four real cross-correlations:

W ⋆ F = [W^Re ⋆ F^Re − W^Im ⋆ F^Im] + i [W^Re ⋆ F^Im + W^Im ⋆ F^Re],   (9)

where the first bracket is the real response and the second the imaginary response. So circular harmonics can be implemented in current deep learning frameworks, with minor engineering.
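For instance, a minimal sketch of Equation 9 (PyTorch is our choice here, not prescribed by the paper; torch.nn.functional.conv2d computes cross-correlation, matching the convention above):

# One complex cross-correlation as four real ones, Eq. (9).
import torch
import torch.nn.functional as F

def complex_corr(f_re, f_im, w_re, w_im):
    # f_*: (batch, in_ch, H, W); w_*: (out_ch, in_ch, kH, kW)
    real = F.conv2d(f_re, w_re) - F.conv2d(f_im, w_im)   # real response
    imag = F.conv2d(f_im, w_re) + F.conv2d(f_re, w_im)   # imaginary response
    return real, imag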
We implement a grid-resampled version of the filters, W(x_i) = ∑_j g_i(r_j) W(r_j), with g_i(r_j) ∝ e^{−‖r_i − x_j‖² / (2σ²)} (see Figure 6). The polar representation (r_j, φ_j) can be mapped from the components r_j by r_j = [r_j cos φ_j, r_j sin φ_j]. If we stack all the polar filter samples into a matrix, we can write each point as the outer product of a radial tensor R_j and a trigonometric angular tensor [cos mΦ_{r_j}, i sin mΦ_{r_j}]. The phase offset β can be separated out by noting that

W_m(r_j) = ∑_{i=1}^{I} R(r_j) [I cos β, −I sin β; I sin β, I cos β] [cos mΦ_{r_j}; i sin mΦ_{r_j}],   (10)

where the complex exponential and trigonometric terms are element-wise, and I is the identity matrix. This is just a reweighting of the ring elements. In full generality, we could also use a per-radius phase β_{r_i}, which would allow for spiral-like left- and right-handed features, but we did not investigate this.
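A sketch of this Gaussian resampling step (ours; normalising the weights is our assumption, since the text only states proportionality):

# Splat polar samples W(r_j) onto pixel centres x_i with Gaussian weights
# g_i(r_j) ~ exp(-||x_i - r_j||^2 / (2 sigma^2)), as in Figure 6.
import numpy as np

def polar_to_grid(polar_xy, polar_vals, size, sigma=0.5):
    """polar_xy: (J, 2) Cartesian positions of polar samples; polar_vals: (J,)."""
    c = np.arange(size) - (size - 1) / 2.0
    xs, ys = np.meshgrid(c, c)
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1)   # pixel centres x_i
    d2 = ((grid[:, None, :] - polar_xy[None, :, :]) ** 2).sum(-1)
    g = np.exp(-d2 / (2 * sigma**2))
    g /= g.sum(axis=1, keepdims=True)                   # normalise (our assumption)
    return (g @ polar_vals).reshape(size, size)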

References (partial)

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.