
Deep LAC: Deep localization, alignment and classification for fine-grained recognition

TLDR
A valve linkage function (VLF) for back-propagation chaining is proposed to form the deep localization, alignment and classification (LAC) system and can adaptively compromise the errors of classification and alignment when training the LAC model.
Abstract
We propose a fine-grained recognition system that incorporates part localization, alignment, and classification in one deep neural network. This is a nontrivial process, as the input to the classification module should be functions that enable back-propagation in constructing the solver. Our major contribution is to propose a valve linkage function (VLF) for back-propagation chaining and form our deep localization, alignment and classification (LAC) system. The VLF can adaptively compromise the errors of classification and alignment when training the LAC model. It in turn helps update localization. The performance on fine-grained object data bears out the effectiveness of our LAC system.


Di Lin
Xiaoyong Shen
Cewu Lu
Jiaya Jia
The Chinese University of Hong Kong
Hong Kong University of Science and Technology
1. Introduction
Fine-grained object recognition aims to identify sub-category object classes, which includes finding subtle differences among species of animals, product brands, and even architectural styles. Thanks to the recent success of convolutional neural networks (CNN) [13], good performance has been achieved on fine-grained tasks [4, 27].
The large flexibility of CNN structures means that fine-grained recognition still has much room to improve. One challenge is that discriminative patterns (e.g., the bird head in bird species recognition) may appear in different locations, and with rotation and scaling, in the collected images. Although the work of [17, 10] showed that CNN features are reasonably robust to scale and rotation variation, it is necessary to directly capture these types of change to increase recognition accuracy [27, 4].
Existing solutions perform localization, alignment, and classification independently and consecutively. This procedure is illustrated in Figure 1 using solid-line arrows, where parts are localized, aligned according to templates, and then fed into the classification neural network. Consequently, any error arising during localization can influence alignment and classification.
In this paper, we propose a feedback-control framework to back-propagate alignment and classification errors to localization, in order to optimally update all states in iterations. This process is highlighted by dashed arrows in Figure 1 and, in our experiments, benefits final classification. This framework is constructed as one deep neural network including all localization, alignment and classification tasks.

Figure 1: The one-way procedure from localization to template alignment makes each module rely on results from the previous one. In contrast, back-propagation, highlighted by dashed arrows, makes it possible to refine localization according to the classification and alignment results. It forms a bi-directional refinement process. (Diagram labels: forward-propagation, backward-propagation, localized part, template alignment, pose-aligned part, iteration, alignment error, classification error.)

The difficulty of forming a neural network for all modules stems from the special requirement of the classification sub-network input. As shown in Figure 1, the input to classification is an image after alignment. The back-propagation chain cannot be completed across the whole network because the derivative of a constant, which is what the aligned region is, is zero.
The main focus of this paper is thus to propose a valve linkage function (VLF) in the alignment sub-network to optimally connect the localization and classification modules in our deep LAC framework. The architecture is shown in Figure 2. Because we involve these tasks in one network, forward-propagation (FP) and backward-propagation (BP) solving procedures become available.

Figure 2: Deep LAC. It consists of localization, alignment and classification sub-networks. With the help of the VLF, the alignment sub-network outputs pose-aligned part images for classification in the FP stage, while classification and alignment errors can be propagated back to localization in the BP stage. (Diagram labels: input image, localization sub-network, localized part, alignment sub-network, templates, valve linkage function, pose-aligned part, classification sub-network, category label "American Goldfinch", alignment error, classification error, convolutional layer, fully connected layer, forward-propagation, backward-propagation.)
In FP, the VLF outputs a pose-aligned part image to classification. In BP, it should be a function containing the necessary parameters for updating the localization sub-network. Our VLF not only connects all sub-networks, but also functions as an information valve to compromise between classification and alignment errors. If alignment is good enough in the FP stage, the VLF guarantees correspondingly accurate classification. Otherwise, errors propagated from classification finely tune the previous modules. These effects make the whole network reach a stable state. Note that this scheme is general, as a similar VLF can be proposed for other networks that involve several modules and various parameters.
Other contributions include the new localization and alignment sub-networks. As shown in Figure 2, localization [20] is performed by regression of the part location. It differs from general object detection by making use of the relatively stable relationship between the fine-grained object (e.g., bird) and the part region (e.g., bird head), which, in contrast, cannot be preserved for general objects. On the alignment side, we introduce multi-template selection to effectively handle pose variance of parts.
With our system jointly modeling localization, alignment, and classification, decent performance is accomplished in comparison to solutions where these modules are considered independently. We apply our method to data for automatic classification without part annotation in the testing phase.
2. Related Work
Pioneering work in this area concentrated on constructing discriminative whole-image representations [22, 15, 23]. This approach was later found to suffer from the problem of losing subtle differences between subordinate categories. Localization and alignment can ameliorate this problem by extracting parts from visually similar regions and reducing their variance. Recent work exploits these operations.
Farrell et al. [7] and Yao et al. [26] used templates to get the location of parts. Yang et al. [25] learned templates to localize important parts of fine-grained objects in an unsupervised manner. Gavves et al. [8] aligned images in order to accommodate the possibly large variation of poses. In [5, 1, 24], segmentation and part localization were unified in one framework to alleviate the distracting effect of background. Berg et al. [3] put forward the part-based one-vs-one feature (POOF). Zhang et al. [28] employed DPM to extract part regions and features as the image representation.
Recently, fine-grained recognition was achieved by combining localization, alignment, and deep neural networks. Zhang et al. [28] applied a pre-trained convolutional neural network [13] to extract features on the localized part. In their later work [27], selective search [21] was used for part proposal. Branson et al. [4] studied higher-order geometric warping to align parts. In [27, 4], the CNN model fine-tuned on the dataset [22] was used to extract representations on parts. This method accomplished state-of-the-art results on bird identification [22].
These methods did not perform joint refinement of localization, alignment, and classification in one network. Employing these modules together in our system is found to be profitable for fine-grained recognition. We give more details below.
3. Our Approach
To recognize fine-grained classes, we learn deep LAC models for distinct and meaningful parts. Features extracted on parts are used in classical classifiers, e.g., SVM. The main network consists of three sub-networks for the aforementioned three tasks: the localization sub-network provides the part position; the alignment sub-network performs template alignment to offset translation, scaling and rotation of localized parts; and pose-aligned parts are fed to the classification sub-network.
As mentioned above, the way to connect these three modules within a unified deep neural network is worth studying. In what follows, we first describe the localization and classification sub-networks, which are implemented according to the CNN model [13]. Then we detail our alignment sub-network, where the forward-propagation (FP) and backward-propagation (BP) stages are implemented.
3.1. Localization Sub-network
Our localization sub-network outputs the commonly used coordinates for the top-left and bottom-right bounding-box corners, denoted as $(x_1, y_1)$ and $(x_2, y_2)$, given an input natural image for fine-grained recognition. In the training phase, we regress bounding boxes of part regions. Ground truth bounding boxes are generated with part annotation. We unify the input image resolution and construct a localization sub-network [13], which consists of 5 convolutional layers and 3 fully connected ones. Our last fully connected layer is a 4-way output for regressing the bounding-box corners $(x_1, y_1)$ and $(x_2, y_2)$.
With output $L = (x_1, y_1, x_2, y_2)$, our localization sub-network is expressed as

$$L = f_l(W_l; I), \tag{1}$$

where $W_l$ is the weight parameter set and $I$ is the input image. During training, ground truth locations of parts $L^{gt}$ are used. The location objective function is given by

$$E_l(W_l; I, L^{gt}) = \frac{1}{2} \left\| f_l(W_l; I) - L^{gt} \right\|^2. \tag{2}$$
We minimize it over $W_l$. This framework works well on part location regression because the appearance of objects and part regions is generally stable in fine-grained tasks. The location of parts can thus be reasonably predicted. Figure 5 shows examples of localized bird heads and bodies.
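As a minimal sketch of Eq. (2), the snippet below evaluates the regression loss and its gradient with respect to the predicted box for a single example. The function name `localization_loss` and the sample coordinates are illustrative assumptions and do not come from the paper; in the actual system the prediction is the output of the CNN $f_l$.

```python
def localization_loss(pred_box, gt_box):
    """Return E_l = 0.5 * ||pred_box - gt_box||^2 and its gradient w.r.t. pred_box.

    Both boxes are 4-tuples (x1, y1, x2, y2), as in Eq. (1).
    """
    diff = [p - g for p, g in zip(pred_box, gt_box)]
    loss = 0.5 * sum(d * d for d in diff)
    # The gradient of 0.5 * ||d||^2 with respect to the prediction is d itself.
    return loss, diff


# Hypothetical predicted and ground-truth boxes, for illustration only.
loss, grad = localization_loss((10.0, 20.0, 110.0, 90.0), (12.0, 18.0, 108.0, 94.0))
```

The gradient with respect to the box is simply the coordinate-wise residual, which is what would be propagated further into $W_l$ by ordinary back-propagation.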
3.2. Classification Sub-network
The classification sub-network is the last module shown in Figure 2. Our classification takes the pose-aligned part image as input, denoted as $\phi^*$, and generates the category label. This classification CNN [13] is expressed as

$$y = f_c(W_c; \phi^*), \tag{3}$$

where $W_c$ is the weight parameter set in this sub-network. The output is the category label $y$.
During training, the ground truth label $y^{gt}$ is provided. The predicted category label $y$ in Eq. (3) should be consistent with $y^{gt}$. We enforce a penalty on $y$, which is denoted as $E_c(W_c; \phi^*, y^{gt})$. In classification, we follow the method of [13] and use the softmax regression loss to penalize the classification error.
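For readers less familiar with the softmax regression loss mentioned above, the following is a small sketch of how such a penalty and its gradient with respect to the class scores can be computed. It is a generic illustration, not the Caffe layer used by the authors; the name `softmax_loss` and the idea of passing raw class scores are assumptions made for this example.

```python
import math


def softmax_loss(scores, label):
    """Softmax cross-entropy for one example.

    scores: raw class scores (one float per category).
    label:  index of the ground-truth category.
    Returns (loss, gradient of the loss w.r.t. scores).
    """
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[label])
    # d(loss)/d(score_k) = prob_k - 1[k == label]
    grad = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    return loss, grad
```

The gradient returned here plays the role of the classification error that is later propagated back through the alignment output.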
Our major contribution in this system is the construction of the alignment sub-network, which is detailed below together with the formulation of $\phi^*$ in Eq. (3).
3.3. Alignment Sub-network
The alignment sub-network receives the part location $L$ (i.e., bounding box) from the localization module, performs template alignment [18] and feeds a pose-aligned part image to classification, as shown in Figure 2. Our alignment sub-network offsets translation, scaling, and rotation to generate the pose-aligned part region, which is important for accurate classification. Apart from pose aligning, this sub-network plays a crucial role in bridging the backward-propagation (BP) stage of the whole LAC model, which helps utilize the classification and alignment results to refine localization.
We propose a new valve linkage function (VLF) as the output of the alignment sub-network to accomplish the above goals. In what follows, we present our alignment part and then detail our VLF in line with the FP and BP stages of the LAC model.
3.3.1 Template Alignment
We rectify localized part regions, making their poses close to the templates. To evaluate the similarity of poses, we define the function between part regions $R_i$ and $R_j$ as

$$S[R_i, R_j] = \sum_{m=0}^{255} \sum_{n=0}^{255} p_{ij}(m, n) \log\left( \frac{p_{ij}(m, n)}{p_i(m)\, p_j(n)} \right), \tag{4}$$

where $p_i, p_j \in \mathbb{R}^{256}$ form the distributions of gray-scale values of the uniform-size images $R_i$ and $R_j$ respectively, and $p_{ij} \in \mathbb{R}^{256 \times 256}$ is the joint distribution. This pose similarity function is based on mutual information [18]. A large value means similar poses between $R_i$ and $R_j$.

Figure 3: Examples of alignment templates of bird (a) head and (b) body.
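The pose similarity of Eq. (4) can be sketched in a few lines. The version below assumes each region is given as a flat sequence of integer gray values of equal length (the "uniform-size" requirement) and estimates the distributions by simple counting; the function name `pose_similarity` and the convention that empty histogram bins contribute zero are assumptions of this sketch, not details stated in the paper.

```python
import math


def pose_similarity(region_i, region_j):
    """Mutual information between the gray values of two equal-size regions."""
    if len(region_i) != len(region_j) or len(region_i) == 0:
        raise ValueError("regions must be non-empty and of the same size")
    n = len(region_i)

    # Joint histogram over co-located pixel pairs (m, n).
    joint = {}
    for a, b in zip(region_i, region_j):
        joint[(a, b)] = joint.get((a, b), 0) + 1

    # Marginal histograms p_i(m) and p_j(n).
    count_i, count_j = {}, {}
    for (a, b), count in joint.items():
        count_i[a] = count_i.get(a, 0) + count
        count_j[b] = count_j.get(b, 0) + count

    # Only non-empty bins are visited, so 0 * log(0) terms are skipped.
    score = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        score += p_ab * math.log(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return score
```

Two identical regions give the entropy of their gray-value distribution, while two regions whose gray values are statistically independent give a score of zero.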
To resist large pose variation, we generate a template set for alignment. For each pair among the $N$ training part images, we calculate the similarity using Eq. (4) and finally form a similarity matrix $S_t \in \mathbb{R}^{N \times N}$. $S_t$ is then processed with spectral clustering [16] to split the $N$ part images into $K$ clusters. From each cluster, we select the part region closest to the cluster center as the template to represent this set. To include mirrored poses, we also flip each template. Eventually, we obtain a template set $T$.
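The template selection step can be illustrated with a short sketch. It assumes the cluster labels have already been produced by some spectral clustering routine, and it interprets "closest to the cluster center" as the member with the highest total similarity to the other members of its cluster. That interpretation, as well as the names `select_templates` and `mirror`, are assumptions of this example rather than details given in the text.

```python
def select_templates(similarity, labels):
    """Pick one representative index per cluster.

    similarity: N x N nested list, similarity[i][j] as in Eq. (4).
    labels:     cluster label for each of the N parts (assumed precomputed).
    Returns a dict mapping cluster label -> index of the chosen template.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    chosen = {}
    for label, members in clusters.items():
        # Assumed criterion: maximize summed similarity to the rest of the cluster.
        chosen[label] = max(
            members,
            key=lambda i: sum(similarity[i][j] for j in members if j != i),
        )
    return chosen


def mirror(image):
    """Horizontally flip an image given as a list of pixel rows."""
    return [list(reversed(row)) for row in image]
```

With $K$ clusters this yields $K$ representatives, and adding `mirror(template)` for each one doubles the set to $2K$ templates.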
Figure 4 shows the pipeline of alignment. Given an input image $I$, the regressed part bounding box $L$ generated by the localization sub-network, and the center $c_r(L) = \left( \frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2} \right)$ of the bounding box, we assume the pose-aligned part region has center location $c$, is rotated by $\theta$ degrees and is scaled by factor $\alpha$. To compare it with a template $t$, we extract the region, denoted as $\phi(c, \theta, \alpha; I)$. Using the above similarity function, alignment is done by finding $c$, $\theta$, $\alpha$ and $t$ that maximize

$$\begin{aligned}
E_a(c, \theta, \alpha, t; I, L) &= S[\phi(c, \theta, \alpha; I), t] + \lambda \exp\left( -\frac{1}{2} \| c - c_r(L) \|^2 \right), \\
& c \in [x_1, x_2] \times [y_1, y_2],\ \theta \in \Theta,\ \alpha \in A,\ t \in T,
\end{aligned} \tag{5}$$
where $\lambda$ is a constant. Using the second term of Eq. (5), we adjust the aligning center $c$ according to the regressed center $c_r(L)$ of parts. This helps resist imperfectly regressed part centers and locate the aligning centers within part regions, making alignment more reliable. $\Theta$, $A$, and $T$ define the ranges of parameters. A large value from Eq. (5) indicates reliable alignment. Maximizing Eq. (5) is achieved by searching the quantized parametric space.

Figure 4: The alignment sub-network selects optimally pose-aligned parts for classification.
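The search over the quantized parameter space in Eq. (5) can be sketched as an exhaustive loop. In the example below, `extract_region` and `similarity` are hypothetical callables standing in for $\phi(c, \theta, \alpha; I)$ and $S[\cdot, \cdot]$; how candidate centers are quantized is also an assumption, since the text only states that $c$ lies inside the regressed box.

```python
import math
from itertools import product


def box_center(box):
    """Center c_r(L) of a box L = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def alignment_energy(sim, center, box, lam):
    """E_a from Eq. (5): similarity plus a bonus for staying near c_r(L)."""
    rx, ry = box_center(box)
    sq_dist = (center[0] - rx) ** 2 + (center[1] - ry) ** 2
    return sim + lam * math.exp(-0.5 * sq_dist)


def align(box, centers, thetas, alphas, templates,
          extract_region, similarity, lam=0.001):
    """Return (energy, c, theta, alpha, template_index) maximizing Eq. (5)."""
    x1, y1, x2, y2 = box
    best = None
    for c, theta, alpha, t in product(centers, thetas, alphas,
                                      range(len(templates))):
        # Enforce the constraint c in [x1, x2] x [y1, y2].
        if not (x1 <= c[0] <= x2 and y1 <= c[1] <= y2):
            continue
        region = extract_region(c, theta, alpha)
        energy = alignment_energy(similarity(region, templates[t]), c, box, lam)
        if best is None or energy > best[0]:
            best = (energy, c, theta, alpha, t)
    return best
```

The returned energy is the quantity that later serves as the alignment score in the valve linkage function.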
3.3.2 Valve Linkage Function (VLF)
Our VLF defines the output of the alignment sub-network, which is important to link the sub-networks and make them work as a whole in training and testing. It is expressed as

$$P(L; I, L_f) = \frac{E_a(c^*, \theta^*, \alpha^*, t^*; I, L)}{E_a(c^*, \theta^*, \alpha^*, t^*; I, L_f)}\, \phi(c^*, \theta^*, \alpha^*; I), \tag{6}$$

where

$$\begin{aligned}
\{c^*, \theta^*, \alpha^*, t^*\} &= \arg\max_{c, \theta, \alpha, t} E_a(c, \theta, \alpha, t; I, L_f), \\
\text{s.t. } & c \in [x_1, x_2] \times [y_1, y_2],\ \theta \in \Theta,\ \alpha \in A,\ t \in T.
\end{aligned} \tag{7}$$

Here $\phi(c^*, \theta^*, \alpha^*; I)$ is the pose-aligned part and $L_f$ is the output of the localization sub-network in the current forward-propagation (FP) stage. The role of the valve function in FP and BP is discussed below.
FP stage. In the FP stage of the neural network, the alignment sub-network receives part location $L_f$ and aligns it as $P(L_f; I, L_f)$ for further classification. The output is expressed as

$$P(L_f; I, L_f) = \phi(c^*, \theta^*, \alpha^*; I), \tag{8}$$

which is exactly the pose-aligned part.

BP stage. In the BP stage, the output of the alignment sub-network $P(L; I, L_f)$ becomes a function of $L$. Therefore, the objective function of the LAC model is formulated as

$$J(W_c, W_l; I, L^{gt}, y^{gt}) = E_c(W_c; P(L; I, L_f), y^{gt}) + E_l(W_l; I, L^{gt}), \tag{9}$$

where $W_c$ and $W_l$ are the parameters to be determined. $E_c$ and $E_l$ are defined in the two other sub-networks. We minimize this objective function to update the localization and classification sub-networks during training.
To update the classification sub-network, we compute the gradients of the objective function $J$ with respect to $W_c$. They are the same as those presented in [13].
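To update the localization sub-network, gradients with respect to $W_l$ are computed, written as

$$\nabla_{W_l} J = \frac{\partial E_l}{\partial W_l} + \frac{\partial E_c}{\partial W_l} = \frac{\partial E_l}{\partial W_l} + \frac{\partial E_c}{\partial P(L; I, L_f)} \frac{\partial P(L; I, L_f)}{\partial L} \frac{\partial L}{\partial W_l}, \tag{10}$$

where the former term $\frac{\partial E_l}{\partial W_l}$ represents the BP stage within localization.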
Analysis. In the second term of Eq. (10), $\frac{\partial E_c}{\partial P(L; I, L_f)}$ and $\frac{\partial L}{\partial W_l}$ pass useful information in the BP stages within the classification and localization sub-networks respectively. Without the valve linkage function part $\frac{\partial P(L; I, L_f)}{\partial L}$, information propagation from classification to localization would be blocked.
We further show that the VLF provides information control from classification to other sub-networks. In the BP stage, $P(L; I, L_f)$ can be rewritten as

$$P(L; I, L_f) = \frac{1}{e} E_a(c^*, \theta^*, \alpha^*, t^*; I, L)\, \phi(c^*, \theta^*, \alpha^*; I), \tag{11}$$

where $e = E_a(c^*, \theta^*, \alpha^*, t^*; I, L_f)$ is the alignment energy generated in the FP stage. With it becoming a constant in backward propagation, $\frac{\partial P(L; I, L_f)}{\partial L}$ can be expressed as

$$\frac{\partial P(L; I, L_f)}{\partial L} = \frac{1}{e}\, \phi(c^*, \theta^*, \alpha^*; I)\, \frac{\partial E_a}{\partial L}. \tag{12}$$
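And the term $\frac{\partial E_a}{\partial L}$ is extended to

$$\frac{\partial E_a}{\partial L} = -\frac{\lambda}{2} \exp\left( -\frac{1}{2} \| c - c_r(L) \|^2 \right) \frac{\partial \| c - c_r(L) \|^2}{\partial L}, \tag{13}$$

where $c = (c_x, c_y)$ and

$$\frac{\partial \| c - c_r(L) \|^2}{\partial L} = \left( \frac{x_1 + x_2}{2} - c_x,\ \frac{y_1 + y_2}{2} - c_y,\ \frac{x_1 + x_2}{2} - c_x,\ \frac{y_1 + y_2}{2} - c_y \right). \tag{14}$$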
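For completeness, the intermediate steps behind Eqs. (13) and (14) can be written out as follows. This sketch assumes that the similarity term $S[\cdot, \cdot]$ does not depend on $L$ once the alignment parameters are fixed in the FP stage, which is our reading of the text; the shorthand $u(L)$ is introduced here only for the derivation.

```latex
\begin{aligned}
u(L) &= \| c - c_r(L) \|^2
      = \left( c_x - \frac{x_1 + x_2}{2} \right)^2
      + \left( c_y - \frac{y_1 + y_2}{2} \right)^2, \\[4pt]
\frac{\partial E_a}{\partial L}
     &= \frac{\partial}{\partial L}
        \left[ S + \lambda \exp\left( -\tfrac{1}{2} u(L) \right) \right]
      = -\frac{\lambda}{2} \exp\left( -\tfrac{1}{2} u(L) \right)
        \frac{\partial u}{\partial L}, \\[4pt]
\frac{\partial u}{\partial x_1}
     &= 2 \left( c_x - \frac{x_1 + x_2}{2} \right) \left( -\frac{1}{2} \right)
      = \frac{x_1 + x_2}{2} - c_x
      = \frac{\partial u}{\partial x_2}, \\[4pt]
\frac{\partial u}{\partial y_1}
     &= 2 \left( c_y - \frac{y_1 + y_2}{2} \right) \left( -\frac{1}{2} \right)
      = \frac{y_1 + y_2}{2} - c_y
      = \frac{\partial u}{\partial y_2}.
\end{aligned}
```

Stacking the four partial derivatives in the order $(x_1, y_1, x_2, y_2)$ gives the vector in Eq. (14).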
Here, the factor $\frac{1}{e}$ can be deemed a valve controlling the influence from classification. As described in Section 3.3.1, a larger alignment score $e$ corresponds to better alignment in the FP stage. In the BP stage, $\frac{1}{e}$ is used to re-weight the BP error $\frac{\partial E_c}{\partial P(L; I, L_f)}$ from classification. It functions as a compromise between classification and alignment errors.
In this case, a large $e$ means good alignment in the BP stage, for which information from the classification sub-network is automatically reduced given a small $\frac{1}{e}$. In contrast, if $e$ is small, current alignment becomes less reliable. Thus more classification information is automatically introduced by the large $\frac{1}{e}$ to guide the $W_l$ update. Simply put, one can understand $\frac{1}{e}$ as a dynamic learning rate in the BP stage. It is adaptive to matching performance.
With this kind of auto-adjustment mechanism in our VLF connecting classification and alignment, localization can be refined in the BP stage. We verify the effectiveness of this design in experiments.
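To make the valve behaviour concrete, the sketch below evaluates $\frac{\partial E_a}{\partial L}$ from Eqs. (13) and (14) and scales it by $\frac{1}{e}$ as in Eq. (12). It omits the multiplication by the aligned image $\phi$ and by the classification error, so it only shows the per-coordinate factor that depends on the box; the function names and the sample values of $\lambda$ and $e$ used in the usage note are illustrative assumptions.

```python
import math


def energy_grad_wrt_box(center, box, lam):
    """dE_a/dL for L = (x1, y1, x2, y2), following Eqs. (13) and (14)."""
    cx, cy = center
    x1, y1, x2, y2 = box
    dx = (x1 + x2) / 2.0 - cx
    dy = (y1 + y2) / 2.0 - cy
    sq_dist = dx * dx + dy * dy
    scale = -0.5 * lam * math.exp(-0.5 * sq_dist)
    return [scale * dx, scale * dy, scale * dx, scale * dy]


def valve_scaled_grad(center, box, lam, e_forward):
    """(1 / e) * dE_a/dL, where e_forward is the FP alignment energy."""
    if e_forward == 0:
        raise ValueError("alignment energy must be non-zero")
    return [g / e_forward for g in energy_grad_wrt_box(center, box, lam)]
```

For the same box and center, calling `valve_scaled_grad` with `e_forward=0.5` yields a gradient four times larger in magnitude than with `e_forward=2.0`, which is the "dynamic learning rate" effect described above.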
4. Experiments
We evaluate our method on two widely employed
datasets: 1) the Caltech-UCSD Bird-200-2011 [22] and 2)
Caltech-UCSD Bird-200-2010 [23].
In implementation, we modify the Caffe platform [11] for CNN construction. Bird heads and bodies are considered as semantic parts. We train two deep LACs for them respectively. All CNN models are fine-tuned using the pre-trained ImageNet model. The 6th layer of the CNN classification models (i.e., two part models + one whole image model) is extracted to form a (4096 × 3)-dimensional feature. Then we follow the popular CNN-SVM scheme [19] to train an SVM classifier on our CNN feature.
The major parametric setting for each part model is as follows. 1) In the localization sub-network, all input images are resized to 227 × 227. We replace the original 1,000-way fully connected layer with a 4-way layer for regressing the part bounding box. The pre-trained ImageNet model is used to initialize our localization sub-network. 2) For alignment, in template selection, all 5,994 part annotations for head or body in the training set of Caltech-UCSD Bird-200-2011 [22] are used. The 5,994 parts are cropped and resized to 227 × 227. Using spectral clustering, we split the 5,994 parts into 30 clusters. From each cluster, we select the part region closest to the cluster center and its mirrored version as two templates. This process eventually forms the 60-template set $T$.
During template alignment, the rotation degree $\theta$ is an integer and its range is $\Theta = [-60, 60]$. Meanwhile, we search the scale $\alpha$ within $A = \{2.5, 3, 3.5, 4, 4.5\}$. Another controllable parameter in alignment is $\lambda$ in Eq. (5). Empirically, we set it to 0.001. Finally, the classification sub-network takes input images each with size 227 × 227.
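As a rough indication of the size of the quantized search space, the following sketch counts the $(\theta, \alpha, t)$ combinations implied by the settings above. It assumes that $\Theta$ covers every integer from -60 to 60 degrees and that all 60 templates are tried for each combination; the quantization of the center $c$ is not specified in this section, so the count is per candidate center.

```python
# Quantized alignment parameters, as read from the settings above.
thetas = range(-60, 61)             # assumed: integer degrees, both ends included
alphas = [2.5, 3.0, 3.5, 4.0, 4.5]  # scale factors A
num_templates = 30 * 2              # 30 cluster representatives plus mirrored copies

combos_per_center = len(thetas) * len(alphas) * num_templates
print(combos_per_center)  # 121 * 5 * 60 = 36300
```

Under these assumptions, each candidate center requires 36,300 similarity evaluations, which is why the parameters are quantized rather than searched continuously.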

References
ImageNet Classification with Deep Convolutional Neural Networks
On Spectral Clustering: Analysis and an Algorithm
Selective Search for Object Recognition
CNN Features off-the-shelf: an Astounding Baseline for Recognition
The Caltech-UCSD Birds-200-2011 Dataset