Deep LAC: Deep Localization, Alignment and Classification
for Fine-grained Recognition
Di Lin†, Xiaoyong Shen†, Cewu Lu‡, Jiaya Jia†
†The Chinese University of Hong Kong
‡Hong Kong University of Science and Technology
Abstract
We propose a fine-grained recognition system that incorporates part localization, alignment, and classification in one deep neural network. This is a nontrivial process, as the input to the classification module should be a function that enables back-propagation when constructing the solver. Our major contribution is a valve linkage function (VLF) for back-propagation chaining, which forms our deep localization, alignment and classification (LAC) system. The VLF adaptively balances the errors of classification and alignment when training the LAC model, which in turn helps update localization. The performance on fine-grained object data bears out the effectiveness of our LAC system.
1. Introduction
Fine-grained object recognition aims to identify sub-category object classes, which includes finding subtle differences among species of animals, product brands, and even architectural styles. Thanks to the recent success of convolutional neural networks (CNN) [13], good performance has been achieved on fine-grained tasks [4, 27].

The flexibility of CNN structures leaves fine-grained recognition considerable room for improvement. One challenge is that discriminative patterns (e.g., the bird head in bird species recognition) may appear in different locations, and with rotation and scaling, in the collected images. Although the work of [17, 10] showed that CNN features are reasonably robust to scale and rotation variation, it is necessary to capture these types of change directly to increase recognition accuracy [27, 4].

Existing solutions perform localization, alignment, and classification independently and consecutively. This procedure is illustrated in Figure 1 using solid-line arrows, where parts are localized, aligned according to templates, and then fed into the classification neural network. Consequently, any error arising during localization can influence alignment and classification.
In this paper, we propose a feedback-control framework to back-propagate alignment and classification errors to localization, in order to optimally update all states over iterations. This process is highlighted by dashed arrows in Figure 1 and, in our experiments, benefits final classification. The framework is constructed as one deep neural network that includes all localization, alignment and classification tasks.

[Figure 1 diagram labels: Localized Part, Template Alignment, Pose-aligned Part, Iteration, Alignment Error, Classification Error; legend: Forward-propagation, Backward-propagation.]
Figure 1: The one-way procedure from localization to template alignment makes each module rely on results from the previous one. In contrast, back-propagation, highlighted by dashed arrows, makes it possible to refine localization according to the classification and alignment results. It forms a bi-directional refinement process.
The difficulty of forming a neural network for all modules stems from the special requirement on the input of the classification sub-network. As shown in Figure 1, the input to classification is an image after alignment. Treated as a fixed image, this input breaks the back-propagation chain across the whole network, because the derivative of a constant (here, the aligned region) is zero.
The main focus of this paper is thus to propose a valve linkage function (VLF) in the alignment sub-network to optimally connect the localization and classification modules in our deep LAC framework. The architecture is shown in Figure 2. Because we involve these tasks in one network, forward-propagation (FP) and backward-propagation (BP) solving procedures become available.

[Figure 2 diagram labels: Input Image, Localization Sub-network, Localized Part, Alignment Sub-network (Templates, Valve Linkage Function), Pose-aligned Part, Classification Sub-network, Category Label (American Goldfinch), Alignment Error, Classification Error; legend: Convolutional Layer, Fully Connected Layer, Forward-propagation, Backward-propagation.]
Figure 2: Deep LAC. It consists of localization, alignment and classification sub-networks. With the help of VLF, the alignment sub-network outputs pose-aligned part images for classification in the FP stage, while classification and alignment errors can be propagated back to localization in the BP stage.
In FP, the VLF outputs a pose-aligned part image to classification. In BP, it should be a function containing the parameters necessary for updating the localization sub-network. Our VLF not only connects all sub-networks, but also acts as an information valve that balances classification and alignment errors. If alignment is good enough in the FP stage, the VLF guarantees correspondingly accurate classification. Otherwise, errors propagated from classification fine-tune the previous modules. These effects make the whole network reach a stable state. Note that this scheme is general, as a similar VLF can be proposed for other networks that involve several modules and various parameters.
Other contributions include the new localization and alignment sub-networks. As shown in Figure 2, localization [20] is performed by regressing the part location. It differs from general object detection by making use of the relatively stable relationship between the fine-grained object (e.g., bird) and the part region (e.g., bird head), which is not preserved for general objects. On the alignment side, we introduce multi-template selection to effectively handle pose variance of parts.
With our system jointly modeling localization, alignment, and classification, decent performance is achieved in comparison to solutions where these modules are considered independently. We apply our method to data for automatic classification without part annotation in the testing phase.
2. Related Work
Pioneering work in this area concentrated on constructing discriminative whole-image representations [22, 15, 23]. This was later found to suffer from losing subtle differences between subordinate categories. Localization and alignment can ameliorate this problem by extracting parts from visually similar regions and reducing their variance. Recent work exploits these operations.
Farrell et al. [7] and Yao et al. [26] used templates to get the location of parts. Yang et al. [25] learned templates to localize important parts of fine-grained objects in an unsupervised manner. Gavves et al. [8] aligned images in order to accommodate the possibly large variation of poses. In [5, 1, 24], segmentation and part localization were unified in one framework to alleviate the distracting effect of background. Berg et al. [3] put forward the part-based one-vs-one feature (POOF). Zhang et al. [28] used DPM to extract part regions and features as the image representation.
Recently, fine-grained recognition was achieved by combining localization, alignment, and deep neural networks. Zhang et al. [28] applied a pre-trained convolutional neural network [13] to extract features on the localized part. In their later work [27], selective search [21] was used for part proposal. Branson et al. [4] studied higher-order geometric warping to align parts. In [27, 4], a CNN model fine-tuned on the dataset [22] was used to extract representations on parts. This method accomplished state-of-the-art results on bird identification [22].
These methods did not perform joint refinement of localization, alignment, and classification in one network. Employing these modules together in our system is found to be profitable for fine-grained recognition. We give more details below.
3. Our Approach
To recognize fine-grained classes, we learn deep LAC models for distinct and meaningful parts. Features extracted on parts are used in classical classifiers, e.g., SVM. The main network consists of three sub-networks for the aforementioned three tasks: the localization sub-network provides the part position; the alignment sub-network performs template alignment to offset translation, scaling and rotation of localized parts; and the pose-aligned parts are fed to the classification sub-network.
As mentioned above, the way to connect these three modules within a unified deep neural network is worth studying. In what follows, we first describe the localization and classification sub-networks, which are implemented according to the CNN model [13]. Then we detail our alignment sub-network, where the forward-propagation (FP) and backward-propagation (BP) stages are implemented.
3.1. Localization Sub-network
Our localization sub-network outputs the commonly used coordinates for the top-left and bottom-right bounding-box corners, denoted as $(x_1, y_1)$ and $(x_2, y_2)$, given an input natural image for fine-grained recognition. In the training phase, we regress bounding boxes of part regions. Ground truth bounding boxes are generated with part annotation. We unify the input image resolution and construct a localization sub-network [13], which consists of 5 convolutional layers and 3 fully connected ones. Our last fully connected layer is a 4-way output for regressing the bounding-box corners $(x_1, y_1)$ and $(x_2, y_2)$.
With output $L = (x_1, y_1, x_2, y_2)$, our localization sub-network is expressed as

$$L = f_l(W_l; I), \tag{1}$$

where $W_l$ is the weight parameter set and $I$ is the input image. During training, ground truth locations of parts $L_{gt}$ are used. The location objective function is given by

$$E_l(W_l; I, L_{gt}) = \frac{1}{2} \left\| f_l(W_l; I) - L_{gt} \right\|^2. \tag{2}$$

We minimize it over $W_l$. This framework works well on part location regression because the appearance of objects and part regions is generally stable in fine-grained tasks. The location of parts can thus be reasonably predicted. Figure 5 shows examples of localized bird heads and bodies.
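As an illustration of Eq. (2), the following minimal sketch evaluates the location objective and its gradient with respect to the predicted box. It is not the authors' implementation: the function names and the example boxes are hypothetical, and the network $f_l$ itself is replaced by a fixed predicted box.

```python
def location_loss(pred, gt):
    """E_l = 0.5 * ||pred - gt||^2 for boxes given as (x1, y1, x2, y2)."""
    return 0.5 * sum((p - g) ** 2 for p, g in zip(pred, gt))


def location_loss_grad(pred, gt):
    """Gradient of E_l with respect to the predicted box: pred - gt."""
    return [p - g for p, g in zip(pred, gt)]


# Hypothetical values standing in for f_l(W_l; I) and L_gt.
pred_box = (10.0, 20.0, 110.0, 140.0)
gt_box = (12.0, 18.0, 108.0, 144.0)

loss = location_loss(pred_box, gt_box)
grad = location_loss_grad(pred_box, gt_box)
print(loss, grad)
```

In a full network, this gradient would be the starting point for back-propagation through the localization layers to obtain the update of $W_l$.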
3.2. Classification Sub-network
The classification sub-network is the last module shown in Figure 2. It takes the pose-aligned part image, denoted as $\phi^*$, as input and generates the category label. This classification CNN [13] is expressed as

$$y = f_c(W_c; \phi^*), \tag{3}$$

where $W_c$ is the weight parameter set in this sub-network. The output is the category label $y$.
During training, the ground truth label $y_{gt}$ is provided. The predicted category label $y$ in Eq. (3) should be consistent with $y_{gt}$. We enforce a penalty on $y$, which is denoted as $E_c(W_c; \phi^*, y_{gt})$. In classification, we follow the method of [13] and use the softmax regression loss to penalize the classification error.
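For readers unfamiliar with the softmax regression loss, a minimal sketch is given below. It operates on a plain list of class scores rather than on a CNN output, and the function names are hypothetical; it only illustrates the form of the penalty $E_c$ and of its gradient with respect to the scores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, label):
    """Negative log-probability assigned to the ground truth label."""
    return -math.log(softmax(scores)[label])


def softmax_loss_grad(scores, label):
    """Gradient of the loss with respect to the scores: probs - one_hot(label)."""
    probs = softmax(scores)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


scores = [1.0, 2.0, 3.0]  # hypothetical scores for three categories
print(softmax_loss(scores, 2), softmax_loss_grad(scores, 2))
```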
Our major contribution in this system is the construction of the alignment sub-network, which is detailed below together with the formulation of $\phi^*$ in Eq. (3).
3.3. Alignment Sub-network
The alignment sub-network receives the part location $L$ (i.e., bounding box) from the localization module, performs template alignment [18] and feeds a pose-aligned part image to classification, as shown in Figure 2. Our alignment sub-network offsets translation, scaling, and rotation to generate the pose-aligned part region, which is important for accurate classification. Apart from pose aligning, this sub-network plays a crucial role in bridging the backward-propagation (BP) stage of the whole LAC model, which helps utilize the classification and alignment results to refine localization.
We propose a new valve linkage function (VLF) as the output of the alignment sub-network to accomplish the above goals. In what follows, we present our alignment part and then detail our VLF in line with the FP and BP stages of the LAC model.
3.3.1 Template Alignment
We rectify localized part regions, making their poses close to the templates. To evaluate the similarity of poses, we define the function between part regions $R_i$ and $R_j$ as

$$S[R_i, R_j] = \sum_{m=0}^{255} \sum_{n=0}^{255} p_{ij}(m, n) \log\left( \frac{p_{ij}(m, n)}{p_i(m)\, p_j(n)} \right), \tag{4}$$

where $p_i, p_j \in \mathbb{R}^{c}$ form the distributions of gray-scale values of the uniform-size images $R_i$ and $R_j$ respectively (the summation limits in Eq. (4) imply $c = 256$), and $p_{ij} \in \mathbb{R}^{256 \times 256}$ is the joint distribution. This pose similarity function is based on mutual information [18]. A large value means similar poses between $R_i$ and $R_j$.

Figure 3: Examples of alignment templates of bird (a) head and (b) body.
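The similarity of Eq. (4) can be sketched in a few lines. The version below is only illustrative: it assumes images are passed as flat sequences of integer gray levels, estimates the distributions from simple histograms, and skips zero-probability terms (the usual $0 \log 0 = 0$ convention). The function name `pose_similarity` is hypothetical.

```python
import math


def pose_similarity(region_i, region_j, levels=256):
    """Mutual information between two equal-size gray-scale images (Eq. 4)."""
    if len(region_i) != len(region_j):
        raise ValueError("regions must have the same number of pixels")
    n = len(region_i)

    # Joint histogram of co-occurring gray levels, stored sparsely.
    joint = {}
    for a, b in zip(region_i, region_j):
        joint[(a, b)] = joint.get((a, b), 0) + 1

    # Marginal distributions p_i and p_j obtained from the joint counts.
    p_i = [0.0] * levels
    p_j = [0.0] * levels
    for (a, b), count in joint.items():
        p_i[a] += count / n
        p_j[b] += count / n

    # Sum only over gray-level pairs that actually occur.
    score = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        score += p_ab * math.log(p_ab / (p_i[a] * p_j[b]))
    return score


same = [0, 0, 255, 255]
unrelated = [0, 255, 0, 255]
print(pose_similarity(same, same), pose_similarity(same, unrelated))
```

An image compared with itself yields the entropy of its gray-level distribution, while statistically independent images yield zero, which matches the statement that larger values mean more similar poses.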
To resist large pose variation, we generate a template set for alignment. For each pair among the $N$ training part images, we calculate the similarity using Eq. (4) and finally form a similarity matrix $S_t \in \mathbb{R}^{N \times N}$. $S_t$ is then processed with spectral clustering [16] to split the $N$ part images into $K$ clusters. From each cluster, we select the part region closest to the cluster center as the template representing this set. To include mirrored poses, we also flip each template. Eventually, we obtain a template set $T$.
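The selection step can be sketched as follows, under stated assumptions. The cluster labels are assumed to come from a spectral clustering of $S_t$ that has already been run, and "closest to the cluster center" is interpreted here as the member with the largest summed similarity to its own cluster, which is one possible reading rather than the paper's confirmed criterion. Both function names are hypothetical.

```python
def select_templates(similarity, labels):
    """Pick one representative index per cluster.

    similarity: N x N matrix (list of lists) such as S_t.
    labels: cluster label of each of the N part images (assumed given).
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    chosen = []
    for lab in sorted(clusters):
        members = clusters[lab]
        # Assumed criterion: maximal total similarity to the cluster members.
        best = max(members, key=lambda i: sum(similarity[i][j] for j in members))
        chosen.append(best)
    return chosen


def mirror(image):
    """Horizontally flip an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


S_t = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.1],
    [0.2, 0.8, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
print(select_templates(S_t, [0, 0, 0, 1]))
print(mirror([[1, 2, 3], [4, 5, 6]]))
```

Adding the mirrored copy of each selected part doubles the number of templates, so $K$ clusters give $2K$ templates.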
Figure 4 shows the pipeline of alignment. Given an input image $I$, the regressed part bounding box $L$ generated by the localization sub-network, and the center $c_r(L) = \left( \frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2} \right)$ of the bounding box, we assume the pose-aligned part region has center location $c$, is rotated by $\theta$ degrees, and is scaled by a factor $\alpha$. To compare it with a template $t$, we extract the region, denoted as $\phi(c, \theta, \alpha; I)$. Using the above similarity function, alignment is done by finding the $c$, $\theta$, $\alpha$ and $t$ that maximize

$$E_a(c, \theta, \alpha, t; I, L) = S[\phi(c, \theta, \alpha; I), t] + \lambda \exp\left( -\frac{1}{2} \left\| c - c_r(L) \right\|^2 \right),$$
$$c \in [x_1, x_2] \times [y_1, y_2],\ \theta \in \Theta,\ \alpha \in A,\ t \in T, \tag{5}$$

where $\lambda$ is a constant. Using the second term of Eq. (5), we adjust the aligning center $c$ according to the regressed center $c_r(L)$ of parts. This helps resist imperfectly regressed part centers and locate the aligning centers within part regions, making alignment more reliable. $\Theta$, $A$, and $T$ define the ranges of parameters. A large value of Eq. (5) indicates reliable alignment. Maximizing Eq. (5) is achieved by searching the quantized parametric space.

Figure 4: The alignment sub-network selects optimally pose-aligned parts for classification.
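The quantized search over Eq. (5) can be sketched as an exhaustive loop. This is a sketch under assumptions, not the authors' code: `extract` stands in for $\phi(c, \theta, \alpha; I)$ and `similarity` for $S[\cdot, \cdot]$, both supplied by the caller as hypothetical callables, and the list of candidate centers is an assumed quantization of the box interior.

```python
import itertools
import math


def alignment_energy(c, theta, alpha, template, box, extract, similarity, lam):
    """E_a of Eq. (5): pose similarity plus a center-proximity term."""
    x1, y1, x2, y2 = box
    cr = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)  # regressed center c_r(L)
    dist2 = (c[0] - cr[0]) ** 2 + (c[1] - cr[1]) ** 2
    region = extract(c, theta, alpha)  # hypothetical phi(c, theta, alpha; I)
    return similarity(region, template) + lam * math.exp(-0.5 * dist2)


def align(box, centers, thetas, alphas, templates, extract, similarity, lam=0.001):
    """Exhaustive search; returns (energy, c, theta, alpha, template_index)."""
    x1, y1, x2, y2 = box
    best = None
    grid = itertools.product(centers, thetas, alphas, range(len(templates)))
    for c, theta, alpha, k in grid:
        # Enforce the constraint that c lies inside the regressed box.
        if not (x1 <= c[0] <= x2 and y1 <= c[1] <= y2):
            continue
        energy = alignment_energy(
            c, theta, alpha, templates[k], box, extract, similarity, lam
        )
        if best is None or energy > best[0]:
            best = (energy, c, theta, alpha, k)
    return best
```

The returned tuple carries both the maximizing parameters and the energy value, since the energy is what the valve linkage function later reuses.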
3.3.2 Valve Linkage Function (VLF)
Our VLF defines the output of the alignment sub-network, which is important for linking the sub-networks and making them work as a whole in training and testing. It is expressed as

$$P(L; I, L_f) = \frac{E_a(c^*, \theta^*, \alpha^*, t^*; I, L)}{E_a(c^*, \theta^*, \alpha^*, t^*; I, L_f)}\, \phi(c^*, \theta^*, \alpha^*; I), \tag{6}$$

where

$$\{c^*, \theta^*, \alpha^*, t^*\} = \arg\max_{c, \theta, \alpha, t} E_a(c, \theta, \alpha, t; I, L_f),$$
$$\text{s.t. } c \in [x_1, x_2] \times [y_1, y_2],\ \theta \in \Theta,\ \alpha \in A,\ t \in T. \tag{7}$$

Here $\phi(c^*, \theta^*, \alpha^*; I)$ is the pose-aligned part and $L_f$ is the output of the localization sub-network in the current forward-propagation (FP) stage. The role of the valve function in FP and BP is discussed below.
FP stage. In the FP stage of the neural network, the alignment sub-network receives the part location $L_f$ and aligns it as $P(L_f; I, L_f)$ for further classification. The output is expressed as

$$P(L_f; I, L_f) = \phi(c^*, \theta^*, \alpha^*; I), \tag{8}$$

which is exactly the pose-aligned part.
BP stage. In the BP stage, the output of the alignment sub-network $P(L; I, L_f)$ becomes a function of $L$. Therefore, the objective function of the LAC model is formulated as

$$J(W_c, W_l; I, L_{gt}, y_{gt}) = E_c(W_c; P(L; I, L_f), y_{gt}) + E_l(W_l; I, L_{gt}), \tag{9}$$

where $W_c$ and $W_l$ are the parameters to be determined. $E_c$ and $E_l$ are defined in the two other sub-networks. We minimize this objective function to update the localization and classification sub-networks during training.
To update the classification sub-network, we compute the gradients of the objective function $J$ with respect to $W_c$. They are the same as those presented in [13].
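To update the localization sub-network, gradients with respect to $W_l$ are computed, written as

$$\nabla_{W_l} J = \frac{\partial E_l}{\partial W_l} + \frac{\partial E_c}{\partial W_l} = \frac{\partial E_l}{\partial W_l} + \frac{\partial E_c}{\partial P(L; I, L_f)} \frac{\partial P(L; I, L_f)}{\partial L} \frac{\partial L}{\partial W_l}, \tag{10}$$

where the former term $\frac{\partial E_l}{\partial W_l}$ represents the BP stage within localization.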
Analysis. In the second term of Eq. (10), $\frac{\partial E_c}{\partial P(L; I, L_f)}$ and $\frac{\partial L}{\partial W_l}$ pass useful information in the BP stages within the classification and localization sub-networks respectively. Without the valve linkage function part $\frac{\partial P(L; I, L_f)}{\partial L}$, information propagation from classification to localization would be blocked.
We further show that the VLF provides information control from classification to the other sub-networks. In the BP stage, $P(L; I, L_f)$ can be rewritten as

$$P(L; I, L_f) = \frac{1}{e}\, E_a(c^*, \theta^*, \alpha^*, t^*; I, L)\, \phi(c^*, \theta^*, \alpha^*; I), \tag{11}$$

where $e = E_a(c^*, \theta^*, \alpha^*, t^*; I, L_f)$ is the alignment energy generated in the FP stage. With $e$ becoming a constant in backward propagation, $\frac{\partial P(L; I, L_f)}{\partial L}$ can be expressed as

$$\frac{\partial P(L; I, L_f)}{\partial L} = \frac{1}{e}\, \phi(c^*, \theta^*, \alpha^*; I)\, \frac{\partial E_a}{\partial L}. \tag{12}$$
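The term $\frac{\partial E_a}{\partial L}$ expands to

$$\frac{\partial E_a}{\partial L} = -\frac{\lambda}{2} \exp\left( -\frac{1}{2} \left\| c - c_r(L) \right\|^2 \right) \frac{\partial \left\| c - c_r(L) \right\|^2}{\partial L}, \tag{13}$$

where $c = (c_x, c_y)$ and

$$\frac{\partial \left\| c - c_r(L) \right\|^2}{\partial L} = \left( \frac{x_1 + x_2}{2} - c_x,\ \frac{y_1 + y_2}{2} - c_y,\ \frac{x_1 + x_2}{2} - c_x,\ \frac{y_1 + y_2}{2} - c_y \right). \tag{14}$$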
Here, the factor $\frac{1}{e}$ can be deemed a valve controlling the influence from classification. As described in Section 3.3.1, a larger alignment score $e$ corresponds to better alignment in the FP stage. In the BP stage, $\frac{1}{e}$ is used to re-weight the BP error $\frac{\partial E_c}{\partial P(L; I, L_f)}$ from classification. It functions as a compromise between classification and alignment errors.
In this case, a large $e$ indicates good alignment, so in the BP stage the information from the classification sub-network is automatically reduced by the small $\frac{1}{e}$. In contrast, if $e$ is small, the current alignment is less reliable. Thus more classification information is automatically introduced by the large $\frac{1}{e}$ to guide the update of $W_l$. Simply put, one can understand $\frac{1}{e}$ as a dynamic learning rate in the BP stage. It is adaptive to the matching performance.
With this auto-adjustment mechanism in our VLF connecting classification and alignment, localization can be refined in the BP stage. We verify the effectiveness of this design in experiments.
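To make the valve behaviour of Eqs. (11)-(14) concrete, the sketch below computes the scalar factor $E_a(\cdot; I, L) / E_a(\cdot; I, L_f)$ that multiplies $\phi$ in Eq. (6), together with its gradient with respect to $L$ evaluated at $L = L_f$. It is illustrative only: the similarity term is passed in as a plain number `sim` because it does not depend on $L$ once $c^*, \theta^*, \alpha^*, t^*$ are fixed, the value of `lam` is chosen larger than the paper's setting so that the numbers are visible, and all function names are hypothetical.

```python
import math


def box_center(box):
    """Center c_r(L) of a box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def alignment_energy(sim, c, box, lam):
    """E_a with the similarity term held fixed at the value sim."""
    cx, cy = box_center(box)
    d2 = (c[0] - cx) ** 2 + (c[1] - cy) ** 2
    return sim + lam * math.exp(-0.5 * d2)


def energy_grad(c, box, lam):
    """dE_a/dL from Eqs. (13) and (14), ordered as (x1, y1, x2, y2)."""
    cx, cy = box_center(box)
    d2 = (c[0] - cx) ** 2 + (c[1] - cy) ** 2
    dd2 = (cx - c[0], cy - c[1], cx - c[0], cy - c[1])  # Eq. (14)
    coef = -0.5 * lam * math.exp(-0.5 * d2)             # Eq. (13)
    return [coef * g for g in dd2]


def vlf_scale(sim, c, box, box_f, lam):
    """Scalar factor of Eq. (6): E_a(.; L) / E_a(.; L_f)."""
    return alignment_energy(sim, c, box, lam) / alignment_energy(sim, c, box_f, lam)


def vlf_scale_grad(sim, c, box_f, lam):
    """Gradient of the factor at L = L_f: (1/e) * dE_a/dL, as in Eq. (12)."""
    e = alignment_energy(sim, c, box_f, lam)  # constant produced in the FP stage
    return [g / e for g in energy_grad(c, box_f, lam)]


box_f = (0.0, 0.0, 20.0, 20.0)
center = (10.5, 9.0)
# A well-aligned part (large sim, hence large e) passes back a smaller gradient.
print(vlf_scale_grad(0.8, center, box_f, 0.5))
print(vlf_scale_grad(0.1, center, box_f, 0.5))
```

In the forward pass the factor equals one, so the output is exactly the pose-aligned part of Eq. (8); only its derivative is non-trivial, which is what lets classification errors reach the localization sub-network.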
4. Experiments
We evaluate our method on two widely employed
datasets: 1) the Caltech-UCSD Bird-200-2011 [22] and 2)
Caltech-UCSD Bird-200-2010 [23].
In our implementation, we modify the Caffe platform [11] for CNN construction. Bird heads and bodies are considered as semantic parts. We train two deep LACs for them respectively. All CNN models are fine-tuned using the pre-trained ImageNet model. The 6th layer of each CNN classification model (i.e., two part models + one whole image model) is extracted to form a $4096 \times 3$-D feature. Then we follow the popular CNN-SVM scheme [19] to train an SVM classifier on our CNN feature.
The major parametric setting for each part model is as follows. 1) In the localization sub-network, all input images are resized to 227 × 227. We replace the original 1,000-way fully connected layer with a 4-way layer for regressing the part bounding box. The pre-trained ImageNet model is used to initialize our localization sub-network. 2) For alignment, in template selection, all 5,994 part annotations for head or body in the training set of Caltech-UCSD Bird-200-2011 [22] are used. The 5,994 parts are cropped and resized to 227 × 227. Using spectral clustering, we split the 5,994 parts into 30 clusters. From each cluster, we select the part region closest to the cluster center and its mirrored version as two templates. This process eventually forms a 60-template set T.
During template alignment, the rotation degree θ is an integer and its range is Θ = [−60, 60]. Meanwhile, we search the scale α within A = {2.5, 3, 3.5, 4, 4.5}. Another controllable parameter in alignment is λ in Eq. (5). Empirically, we set it to 0.001. Finally, the classification sub-network takes input images of size 227 × 227.
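The alignment settings above can be collected in a small configuration object, which also makes the size of the quantized search space explicit. This is a hypothetical container written for illustration (the class and attribute names are not from the paper); only the numeric values are taken from the text, and the number of candidate centers per box is not specified in this section, so the count below is per candidate center.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentConfig:
    """Hypothetical holder for the alignment settings stated in the text."""
    num_clusters: int = 30
    thetas: tuple = tuple(range(-60, 61))       # integer degrees in [-60, 60]
    alphas: tuple = (2.5, 3.0, 3.5, 4.0, 4.5)   # scale factors in A
    lam: float = 0.001                          # lambda in Eq. (5)
    input_size: int = 227                       # images resized to 227 x 227

    @property
    def num_templates(self):
        # Each cluster contributes one template and its mirrored version.
        return 2 * self.num_clusters

    def combinations_per_center(self):
        # Number of (theta, alpha, t) triples evaluated for one candidate center.
        return len(self.thetas) * len(self.alphas) * self.num_templates


cfg = AlignmentConfig()
print(cfg.num_templates, cfg.combinations_per_center())
```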