What contributions have the authors mentioned in the paper "Deep lac: deep localization, alignment and classification for fine-grained recognition" ?


The authors propose a fine-grained recognition system that incorporates part localization, alignment, and classification in one deep neural network.

(Open Access) Deep LAC: Deep localization, alignment and classification for fine-grained recognition (2015) | Di Lin

Deep LAC: Deep Localization, Alignment and Classification for Fine-grained Recognition

Di Lin†, Xiaoyong Shen†, Cewu Lu‡, Jiaya Jia†

†The Chinese University of Hong Kong

‡Hong Kong University of Science and Technology

Abstract

We propose a fine-grained recognition system that incorporates part localization, alignment, and classification in one deep neural network. This is a nontrivial process, as the input to the classification module should be functions that enable back-propagation in constructing the solver. Our major contribution is to propose a valve linkage function (VLF) for back-propagation chaining and form our deep localization, alignment and classification (LAC) system. The VLF can adaptively compromise the errors of classification and alignment when training the LAC model. It in turn helps update localization. The performance on fine-grained object data bears out the effectiveness of our LAC system.

1. Introduction

Fine-grained object recognition aims to identify sub-category object classes, which includes finding subtle differences among species of animals, product brands, and even architectural styles. Thanks to the recent success of convolutional neural networks (CNN) [13], good performance has been achieved on fine-grained tasks [4, 27].

The large flexibility of CNN structures leaves fine-grained recognition with much room to improve. One challenge is that discriminative patterns (e.g., the bird head in bird species recognition) may appear in different locations, and with rotation and scaling, in the collected images. Although the research of [17, 10] showed that CNN features are reasonably robust to scale and rotation variation, it is necessary to directly capture these types of change to increase the recognition accuracy [27, 4].

Existing solutions perform localization, alignment, and classification independently and consecutively. This procedure is illustrated in Figure 1 using solid-line arrows, where parts are localized, aligned according to templates, and then fed into the classification neural network. Any error arising during localization can therefore influence alignment and classification.

In this paper, we propose a feedback-control framework to back-propagate alignment and classification errors to localization, in order to optimally update all states in iterations. This process is highlighted by dashed arrows in Figure 1 and, in our experiments, benefits final classification. The framework is constructed as one deep neural network including all localization, alignment and classification tasks.

Figure 1: The one-way procedure from localization to template alignment makes each module rely on results from the previous one. In contrast, back-propagation, highlighted by dashed arrows, makes it possible to refine localization according to the classification and alignment results. It forms a bi-directional refinement process. (Diagram labels: localized part, template alignment, pose-aligned part; forward-propagation and backward-propagation arrows; alignment error and classification error fed back over iterations.)

The difficulty of forming a neural network for all modules stems from the special requirement of the classification sub-network input. As shown in Figure 1, the input to classification is an image after alignment. If the aligned region is treated as a constant, the back-propagation chain cannot pass through it during the whole network solving, because the derivative of a constant is zero.

The main focus of this paper is thus to propose a valve linkage function (VLF) in the alignment sub-network to optimally connect the localization and classification modules in our deep LAC framework. The architecture is shown in Figure 2. Because we involve these tasks in one network, forward-propagation (FP) and backward-propagation (BP) solving procedures become available.

Figure 2: Deep LAC. It consists of localization, alignment and classification sub-networks. With the help of VLF, the alignment sub-network outputs pose-aligned part images for classification in the FP stage, while classification and alignment errors can be propagated back to localization in the BP stage. (Diagram labels: input image; localization sub-network with convolutional and fully connected layers; localized part; alignment sub-network with templates and the valve linkage function; pose-aligned part; classification sub-network; category label, e.g. American Goldfinch; alignment error and classification error.)

In FP, VLF outputs a pose-aligned part image to classification. In BP, it should be a function containing the necessary parameters for updating the localization sub-network. Our VLF not only connects all sub-networks, but also functions as an information valve to compromise classification and alignment errors. If alignment is good enough in the FP stage, VLF guarantees correspondingly accurate classification. Otherwise, errors propagated from classification finely tune the previous modules. These effects make the whole network reach a stable state. Note that this scheme is general, as a similar VLF can be proposed for other networks that involve several modules and various parameters.

Other contributions include the new localization and alignment sub-networks. As shown in Figure 2, localization [20] is performed by regression of the part location. It differs from general object detection by making use of the relatively stable relationship between the fine-grained object (e.g., bird) and the part region (e.g., bird head), which in contrast is not preserved for general objects. On the alignment side, we introduce multi-template selection to effectively handle pose variance of parts.

With our system jointly modeling localization, alignment, and classification, decent performance is accomplished in comparison to solutions where these modules are considered independently. We apply our method to data for automatic classification without part annotation in the testing phase.

2. Related Work

Pioneering work in this area concentrated on constructing discriminative whole-image representations [22, 15, 23]. It was later found to suffer from losing the subtle differences between subordinate categories. Localization and alignment can ameliorate this problem by extracting parts from visually similar regions and reducing their variance. Recent work exploits these operations.

Farrell et al. [7] and Yao et al. [26] used templates to get the location of parts. Yang et al. [25] learned templates to localize important parts of fine-grained objects in an unsupervised manner. Gavves et al. [8] aligned images in order to accommodate the possibly large variation of poses. In [5, 1, 24], segmentation and part localization were unified in one framework to alleviate the distracting effect of background. Berg et al. [3] put forward the part-based one-vs-one feature (POOF). Zhang et al. [28] used DPM to extract part regions and features as the image representation.

Recently, fine-grained recognition was achieved by combining localization, alignment, and deep neural networks. Zhang et al. [28] applied a pre-trained convolutional neural network [13] to extract features on the localized part. In their later work [27], selective search [21] was used for part proposal. Branson et al. [4] studied higher-order geometric warping to align parts. In [27, 4], the CNN model fine-tuned on the dataset [22] was used to extract representations on parts. This approach accomplished state-of-the-art results on bird identification [22].

These methods did not perform joint refinement of localization, alignment, and classification in one network. Employing these modules together in our system is found beneficial for fine-grained recognition. We give more details below.

3. Our Approach

To recognize fine-grained classes, we learn deep LAC models for distinct and meaningful parts. Features extracted on parts are used in classical classifiers, e.g., SVM. The main network consists of three sub-networks for the aforementioned three tasks: the localization sub-network provides the part position; the alignment sub-network performs template alignment to offset translation, scaling and rotation of localized parts; and pose-aligned parts are fed to the classification sub-network.

As mentioned above, the way to connect these three modules within a unified deep neural network is worth studying. In what follows, we first describe the localization and classification sub-networks, which are implemented according to the CNN model [13]. Then we detail our alignment sub-network, where the forward-propagation (FP) and backward-propagation (BP) stages are implemented.

3.1. Localization Sub-network

Our localization sub-network outputs the commonly used coordinates for the top-left and bottom-right bounding-box corners, denoted as $(x_1, y_1)$ and $(x_2, y_2)$, given an input natural image for fine-grained recognition. In the training phase, we regress bounding boxes of part regions. Ground truth bounding boxes are generated with part annotation. We unify the input image resolution and construct a localization sub-network [13], which consists of 5 convolutional layers and 3 fully connected ones. Our last fully connected layer is a 4-way output for regressing the bounding-box corners $(x_1, y_1)$ and $(x_2, y_2)$.

With output $L = (x_1, y_1, x_2, y_2)$, our localization sub-network is expressed as

$$L = f_l(W_l; I), \tag{1}$$

where $W_l$ is the weight parameter set and $I$ is the input image. During training, ground truth locations of parts $L^{gt}$ are used. The location objective function is given by

$$E_l(W_l; I, L^{gt}) = \frac{1}{2}\|f_l(W_l; I) - L^{gt}\|^2. \tag{2}$$

We minimize it over $W_l$. This framework works well on part location regression because the appearance of objects and part regions is generally stable in fine-grained tasks. The location of parts can thus be reasonably predicted. Figure 5 shows examples of localized bird heads and bodies.
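As a minimal sketch of the location objective in Eq. (2), assuming it is half the squared Euclidean distance between the regressed box and the ground-truth box, the loss and its gradient with respect to the predicted box can be written as follows. The function names and the sample boxes are hypothetical and only illustrate the arithmetic; the regression network itself is not modeled.

```python
from typing import List, Sequence


def localization_loss(pred_box: Sequence[float], gt_box: Sequence[float]) -> float:
    """Half squared L2 distance between boxes given as (x1, y1, x2, y2)."""
    if len(pred_box) != 4 or len(gt_box) != 4:
        raise ValueError("boxes must be (x1, y1, x2, y2)")
    return 0.5 * sum((p - g) ** 2 for p, g in zip(pred_box, gt_box))


def localization_loss_grad(pred_box: Sequence[float], gt_box: Sequence[float]) -> List[float]:
    """Gradient of the loss with respect to the predicted box: pred - gt."""
    return [p - g for p, g in zip(pred_box, gt_box)]


# Hypothetical network output and ground-truth box, for illustration only.
pred = [30.0, 42.0, 120.0, 140.0]
gt = [32.0, 40.0, 118.0, 144.0]
loss = localization_loss(pred, gt)
grad = localization_loss_grad(pred, gt)
```

In a full network this gradient would be passed on to the weights of the last fully connected layer by ordinary back-propagation.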

3.2. Classification Sub-network

The classification sub-network is the last module shown in Figure 2. It takes the pose-aligned part image, denoted as $\phi^*$, as input and generates the category label. This classification CNN [13] is expressed as

$$y = f_c(W_c; \phi^*), \tag{3}$$

where $W_c$ is the weight parameter set in this sub-network. The output is the category label $y$.

During training, the ground truth label $y^{gt}$ is provided. The predicted category label $y$ in Eq. (3) should be consistent with $y^{gt}$. We enforce a penalty on $y$, which is denoted as $E_c(W_c; \phi^*, y^{gt})$. In classification, we follow the method of [13] and use the softmax regression loss to penalize the classification error.
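The softmax regression loss mentioned above can be illustrated with a small sketch, assuming the standard definition: the negative log of the softmax probability assigned to the ground-truth class. The class scores below are made-up values, not outputs of the paper's network.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores: Sequence[float], label: int) -> float:
    """Negative log-probability of the ground-truth class."""
    return -math.log(softmax(scores)[label])


# Hypothetical scores for three classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
loss_correct = softmax_loss(scores, 0)  # ground truth is the top-scoring class
loss_wrong = softmax_loss(scores, 2)    # ground truth is the lowest-scoring class
```

The loss is small when the ground-truth class receives a high score and grows as the prediction moves away from it.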

Our major contribution in this system is the construction of the alignment sub-network, which is detailed below together with the formulation of $\phi^*$ in Eq. (3).

3.3. Alignment Sub-network

The alignment sub-network receives the part location $L$ (i.e., bounding box) from the localization module, performs template alignment [18] and feeds a pose-aligned part image to classification, as shown in Figure 2. Our alignment sub-network offsets translation, scaling, and rotation for pose-aligned part region generation, which is important for accurate classification. Apart from pose aligning, this sub-network plays a crucial role in bridging the backward-propagation (BP) stage of the whole LAC model, which helps utilize the classification and alignment results to refine localization.

We propose a new valve linkage function (VLF) as the output of the alignment sub-network to accomplish the above goals. In what follows, we present our alignment part and then detail our VLF in line with the FP and BP stages of the LAC model.

3.3.1 Template Alignment

We rectify localized part regions, making their poses close to the templates. To evaluate the similarity of poses, we define the function between part regions $R_i$ and $R_j$ as

$$S[R_i, R_j] = \sum_{m=0}^{255}\sum_{n=0}^{255} p_{ij}(m, n)\,\log\!\left(\frac{p_{ij}(m, n)}{p_i(m)\,p_j(n)}\right), \tag{4}$$

where $p_i, p_j \in \mathbb{R}^{256}$ form the distributions of gray-scale values of the uniform-size images $R_i$ and $R_j$ respectively, and $p_{ij} \in \mathbb{R}^{256 \times 256}$ is the joint distribution. This pose similarity function is based on mutual information [18]. A large value means similar poses between $R_i$ and $R_j$.

Figure 3: Examples of alignment templates of bird (a) head and (b) body.
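The mutual-information similarity of Eq. (4) can be sketched in a few lines, assuming both regions are same-size gray-scale images stored as nested lists of integer intensities in 0..255. The function name `pose_similarity` is hypothetical. Only gray-level pairs that actually occur are summed, following the usual convention that terms with zero joint probability contribute nothing.

```python
import math
from collections import Counter
from typing import Sequence


def pose_similarity(region_i: Sequence[Sequence[int]],
                    region_j: Sequence[Sequence[int]]) -> float:
    """Mutual information between the gray-level distributions of two regions."""
    pix_i = [v for row in region_i for v in row]
    pix_j = [v for row in region_j for v in row]
    if len(pix_i) != len(pix_j) or not pix_i:
        raise ValueError("regions must be non-empty and of the same size")
    n = len(pix_i)
    joint = Counter(zip(pix_i, pix_j))   # co-occurrence counts behind p_ij(m, n)
    marg_i = Counter(pix_i)              # counts behind p_i(m)
    marg_j = Counter(pix_j)              # counts behind p_j(n)
    score = 0.0
    for (m, k), count in joint.items():
        p_mk = count / n
        score += p_mk * math.log(p_mk / ((marg_i[m] / n) * (marg_j[k] / n)))
    return score


# Tiny 2x2 toy regions: identical layout versus a statistically independent one.
a = [[0, 0], [255, 255]]
b = [[0, 255], [0, 255]]
same = pose_similarity(a, a)
independent = pose_similarity(a, b)
```

Identical regions give the entropy of the gray-level distribution, while regions whose intensities are statistically independent give zero.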

To resist large pose variation, we generate a template set for alignment. For each pair in the $N$ training part images, we calculate the similarity using Eq. (4) and finally form a similarity matrix $S \in \mathbb{R}^{N \times N}$. $S$ is then processed with spectral clustering [16] to split the $N$ part images into $K$ clusters. From each cluster, we select the part region closest to the cluster center as the template to represent this set. To include mirrored poses, we also flip each template. Eventually, we obtain a template set $T$.
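The template-selection step can be sketched as follows, assuming the cluster labels have already been produced by a spectral clustering routine that is not shown here. As a stand-in for "closest to the cluster center", which the text does not define precisely, this sketch picks the member with the largest total similarity to the other members of its cluster; that choice, and the names `select_templates` and `mirror`, are assumptions.

```python
from typing import Dict, List, Sequence


def select_templates(similarity: Sequence[Sequence[float]],
                     labels: Sequence[int]) -> Dict[int, int]:
    """Return, per cluster label, the index of its most central member."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    templates: Dict[int, int] = {}
    for lab, members in clusters.items():
        # Assumed centrality measure: total similarity to the other members.
        templates[lab] = max(
            members,
            key=lambda i: sum(similarity[i][j] for j in members if j != i),
        )
    return templates


def mirror(region: Sequence[Sequence[int]]) -> List[List[int]]:
    """Horizontally flip a region to obtain the mirrored template."""
    return [list(reversed(row)) for row in region]


# Toy 4x4 similarity matrix and hypothetical cluster labels.
S = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.2, 0.1],
    [0.8, 0.2, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
labels = [0, 0, 0, 1]
chosen = select_templates(S, labels)
flipped = mirror([[1, 2, 3]])
```

Each chosen region and its mirrored copy would then both be added to the template set.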

Figure 4 shows the pipeline of alignment. Given an input image $I$, the regressed part bounding box $L$ generated by the localization sub-network, and the center $c_r(L) = \left(\frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2}\right)$ of the bounding box, we assume the pose-aligned part region has center location $c$, is rotated by $\theta$ degrees and is scaled by factor $\alpha$. To compare it with a template $t$, we extract the region, denoted as $\phi(c, \theta, \alpha; I)$. Using the above similarity function, alignment is done by finding $c$, $\theta$, $\alpha$ and $t$ that maximize

$$E_a(c, \theta, \alpha, t; I, L) = S[\phi(c, \theta, \alpha; I), t] + \lambda \exp\!\left(-\frac{1}{2}\|c - c_r(L)\|^2\right),$$

$$\text{s.t. } c \in [x_1, x_2] \times [y_1, y_2],\ \theta \in \Theta,\ \alpha \in A,\ t \in T, \tag{5}$$

where $\lambda$ is a constant. Using the second term of Eq. (5), we adjust the aligning center $c$ according to the regressed center $c_r(L)$ of parts. This helps resist imperfectly regressed part centers and locates the aligning centers within part regions, making alignment more reliable. $\Theta$, $A$, and $T$ define the ranges of parameters. A large value from Eq. (5) indicates reliable alignment. Maximizing Eq. (5) is achieved by searching the quantized parametric space.

Figure 4: The alignment sub-network selects optimally pose-aligned parts for classification.
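The search over the quantized parametric space can be sketched as an exhaustive grid search. This is a minimal illustration, not the authors' implementation. Region extraction and template comparison are abstracted into a caller-supplied `similarity_fn(c, theta, alpha, t)`, a hypothetical stand-in for $S[\phi(c, \theta, \alpha; I), t]$. The center term is assumed to be $\lambda \exp(-\frac{1}{2}\|c - c_r(L)\|^2)$, and centers are assumed to lie on an integer pixel grid inside the box.

```python
import itertools
import math


def alignment_energy(similarity, c, c_r, lam):
    """Similarity term plus the assumed center-consistency term."""
    dist2 = (c[0] - c_r[0]) ** 2 + (c[1] - c_r[1]) ** 2
    return similarity + lam * math.exp(-0.5 * dist2)


def align(box, thetas, alphas, templates, similarity_fn, lam=0.001, step=1):
    """Exhaustively search centers, rotations, scales and templates.

    Returns (energy, c, theta, alpha, t) for the best combination found.
    """
    x1, y1, x2, y2 = box
    c_r = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    centers = itertools.product(
        range(int(x1), int(x2) + 1, step),
        range(int(y1), int(y2) + 1, step),
    )
    best = None
    for c, theta, alpha, t in itertools.product(centers, thetas, alphas, templates):
        e = alignment_energy(similarity_fn(c, theta, alpha, t), c, c_r, lam)
        if best is None or e > best[0]:
            best = (e, c, theta, alpha, t)
    return best


def toy_similarity(c, theta, alpha, t):
    """Toy score peaking at theta=10, alpha=3.0 and template 't1'."""
    return -abs(theta - 10) - abs(alpha - 3.0) - (0.0 if t == "t1" else 1.0)


best = align((0, 0, 4, 4), range(-20, 21, 10), [2.5, 3.0, 3.5], ["t0", "t1"], toy_similarity)
```

Because the toy similarity ignores the center, the center term alone decides $c$ and the search settles on the box center.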

3.3.2 Valve Linkage Function (VLF)

Our VLF defines the output of the alignment sub-network, which is important to link the sub-networks and make them work as a whole in training and testing. It is expressed as

$$P(L; I, L_f) = \frac{E_a(c^*, \theta^*, \alpha^*, t^*; I, L)}{E_a(c^*, \theta^*, \alpha^*, t^*; I, L_f)}\,\phi(c^*, \theta^*, \alpha^*; I), \tag{6}$$

where

$$\{c^*, \theta^*, \alpha^*, t^*\} = \arg\max_{c, \theta, \alpha, t} E_a(c, \theta, \alpha, t; I, L_f),$$

$$\text{s.t. } c \in [x_1, x_2] \times [y_1, y_2],\ \theta \in \Theta,\ \alpha \in A,\ t \in T. \tag{7}$$

Here $\phi(c^*, \theta^*, \alpha^*; I)$ is the pose-aligned part and $L_f$ is the output of the localization sub-network in the current forward-propagation (FP) stage. The role of the valve function in FP and BP is discussed below.

FP stage. In the FP stage of the neural network, the alignment sub-network receives the part location $L_f$ and aligns it as $P(L_f; I, L_f)$ for further classification. The output is expressed as

$$P(L_f; I, L_f) = \phi(c^*, \theta^*, \alpha^*; I), \tag{8}$$

which is exactly the pose-aligned part.

BP stage. In the BP stage, the output of the alignment sub-network $P(L; I, L_f)$ becomes a function of $L$. Therefore, the objective function of the LAC model is formulated as

$$J(W_c, W_l; I, L^{gt}, y^{gt}) = E_c(W_c; P(L; I, L_f), y^{gt}) + E_l(W_l; I, L^{gt}), \tag{9}$$

where $W_c$ and $W_l$ are the parameters to be determined. $E_c$ and $E_l$ are defined in the two other sub-networks. We minimize this objective function to update the localization and classification sub-networks during training.

To update the classification sub-network, we compute the gradients of the objective function $J$ with respect to $W_c$. The computation is the same as that presented in [13].

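To update the localization sub-network, gradients with respect to $W_l$ are computed, written as

$$\nabla_{W_l} J = \frac{\partial E_l}{\partial W_l} + \frac{\partial E_c}{\partial W_l} = \frac{\partial E_l}{\partial W_l} + \frac{\partial E_c}{\partial P(L; I, L_f)}\,\frac{\partial P(L; I, L_f)}{\partial L}\,\frac{\partial L}{\partial W_l}, \tag{10}$$

where the former term $\frac{\partial E_l}{\partial W_l}$ represents the BP stage within localization.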

Analysis. In the second term of Eq. (10), $\frac{\partial E_c}{\partial P(L; I, L_f)}$ and $\frac{\partial L}{\partial W_l}$ pass useful information in the BP stages within the classification and localization sub-networks respectively. Without the valve linkage function part $\frac{\partial P(L; I, L_f)}{\partial L}$, information propagation from classification to localization would be blocked.

We further show that VLF provides information control from classification to other sub-networks. In the BP stage, $P(L; I, L_f)$ can be rewritten as

$$P(L; I, L_f) = \frac{1}{e}\,E_a(c^*, \theta^*, \alpha^*, t^*; I, L)\,\phi(c^*, \theta^*, \alpha^*; I), \tag{11}$$

where $e = E_a(c^*, \theta^*, \alpha^*, t^*; I, L_f)$ is the alignment energy generated in the FP stage. With it becoming a constant in backward propagation, $\frac{\partial P(L; I, L_f)}{\partial L}$ can be expressed as

$$\frac{\partial P(L; I, L_f)}{\partial L} = \frac{1}{e}\,\phi(c^*, \theta^*, \alpha^*; I)\,\frac{\partial E_a}{\partial L}. \tag{12}$$

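And the term $\frac{\partial E_a}{\partial L}$ is extended to

$$\frac{\partial E_a}{\partial L} = -\frac{\lambda}{2}\exp\!\left(-\frac{1}{2}\|c - c_r(L)\|^2\right)\frac{\partial \|c - c_r(L)\|^2}{\partial L}, \tag{13}$$

where $c = (c_x, c_y)$ and

$$\frac{\partial \|c - c_r(L)\|^2}{\partial L} = \left(\frac{x_1 + x_2}{2} - c_x,\ \frac{y_1 + y_2}{2} - c_y,\ \frac{x_1 + x_2}{2} - c_x,\ \frac{y_1 + y_2}{2} - c_y\right). \tag{14}$$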
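The gradient of the alignment energy with respect to the box can be checked numerically. The sketch below assumes the center term of the energy is $\lambda \exp(-\frac{1}{2}\|c - c_r(L)\|^2)$ with $c_r(L)$ the box center, so that only this term depends on $L = (x_1, y_1, x_2, y_2)$ once $c$, $\theta$, $\alpha$ and $t$ are fixed. It compares the resulting analytic gradient with central finite differences. All names and numbers are illustrative.

```python
import math


def center_term(box, c, lam):
    """lam * exp(-0.5 * ||c - c_r(L)||^2), the only part of E_a depending on L."""
    x1, y1, x2, y2 = box
    cx_r, cy_r = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    d2 = (c[0] - cx_r) ** 2 + (c[1] - cy_r) ** 2
    return lam * math.exp(-0.5 * d2)


def center_term_grad(box, c, lam):
    """Analytic gradient with respect to (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    dx = (x1 + x2) / 2.0 - c[0]   # derivative of ||c - c_r||^2 w.r.t. x1 and x2
    dy = (y1 + y2) / 2.0 - c[1]   # derivative of ||c - c_r||^2 w.r.t. y1 and y2
    scale = -0.5 * center_term(box, c, lam)
    return [scale * dx, scale * dy, scale * dx, scale * dy]


def numeric_grad(box, c, lam, h=1e-6):
    """Central finite differences, used only to validate the analytic form."""
    grads = []
    for k in range(4):
        up, down = list(box), list(box)
        up[k] += h
        down[k] -= h
        grads.append((center_term(up, c, lam) - center_term(down, c, lam)) / (2 * h))
    return grads


box = [10.0, 20.0, 12.0, 23.0]   # hypothetical regressed box
c = (11.5, 21.0)                 # hypothetical aligning center
analytic = center_term_grad(box, c, lam=0.5)
numeric = numeric_grad(box, c, lam=0.5)
```

The two gradients agree to within finite-difference error, and the $x_1$ and $x_2$ components are equal because both corners move the box center by the same amount.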

Here, the factor $\frac{1}{e}$ can be deemed as a valve controlling the influence from classification. As described in Section 3.3.1, a larger alignment score $e$ corresponds to better alignment in the FP stage. In the BP stage, $\frac{1}{e}$ is used to re-weight the BP error $\frac{\partial E_c}{\partial P(L; I, L_f)}$ from classification. It functions as a compromise between classification and alignment errors.

In this case, a large $e$ means good alignment, for which information from the classification sub-network is automatically reduced in the BP stage given a small $\frac{1}{e}$. In contrast, if $e$ is small, the current alignment becomes less reliable. Thus more classification information is automatically introduced by the large $\frac{1}{e}$ to guide the $W_l$ update. Simply put, one can understand $\frac{1}{e}$ as a dynamic learning rate in the BP stage. It is adaptive to matching performance.
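The valve behaviour can be illustrated with a small sketch. It assumes the classification error reaches the box through the product of the classification gradient with respect to the aligned part, the factor $\frac{1}{e}$ times the aligned part, and the gradient of the alignment energy with respect to $L$. The aligned part is treated as a flat list of pixel values, and every number below is made up.

```python
def backprop_to_box(grad_wrt_part, aligned_part, grad_energy_wrt_box, e):
    """Classification gradient w.r.t. the box, re-weighted by the valve 1/e."""
    if e <= 0:
        raise ValueError("alignment energy e must be positive")
    # Inner product of dE_c/dP with the aligned part phi (both flattened).
    inner = sum(g * p for g, p in zip(grad_wrt_part, aligned_part))
    valve = 1.0 / e
    return [valve * inner * d for d in grad_energy_wrt_box]


grad_wrt_part = [0.2, -0.1, 0.4]        # hypothetical dE_c/dP
aligned_part = [1.0, 2.0, 0.5]          # hypothetical pixels of phi
grad_energy = [1.0, -1.0, 1.0, -1.0]    # hypothetical dE_a/dL

weak = backprop_to_box(grad_wrt_part, aligned_part, grad_energy, e=2.0)    # good alignment
strong = backprop_to_box(grad_wrt_part, aligned_part, grad_energy, e=0.5)  # poor alignment
```

With the same classification gradient, a poorly aligned part (small $e$) sends a larger correction back to localization than a well aligned one.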

With this kind of auto-adjustment mechanism in our VLF connecting classification and alignment, localization can be refined in the BP stage. We verify the effectiveness of this design in experiments.

4. Experiments

We evaluate our method on two widely employed datasets: 1) Caltech-UCSD Bird-200-2011 [22] and 2) Caltech-UCSD Bird-200-2010 [23].

In implementation, we modify the Caffe platform [11] for CNN construction. Bird heads and bodies are considered as semantic parts. We train two deep LACs for them respectively. All CNN models are fine-tuned using the pre-trained ImageNet model. The 6th layer of the CNN classification models (i.e., two part models + one whole image model) is extracted to form a 4096 × 3 dimensional feature. Then we follow the popular CNN-SVM scheme [19] to train an SVM classifier on our CNN feature.

The major parametric setting for each part model is as follows. 1) In the localization sub-network, all input images are resized to 227 × 227. We replace the original 1,000-way fully connected layer with a 4-way layer for regressing the part bounding box. The pre-trained ImageNet model is used to initialize our localization sub-network. 2) For alignment, in template selection, all 5,994 part annotations for head or body in the training set of Caltech-UCSD Bird-200-2011 [22] are used. The 5,994 parts are cropped and resized to 227 × 227. Using spectral clustering, we split the 5,994 parts into 30 clusters. From each cluster, we select the part region closest to the cluster center and its mirrored version as two templates. This process eventually forms a 60-template set $T$.

During template alignment, the rotation degree $\theta$ is an integer and its range is $\Theta = [-60, 60]$. Meanwhile, we search the scale $\alpha$ within $A = \{2.5, 3, 3.5, 4, 4.5\}$. Another controllable parameter in alignment is $\lambda$ in Eq. (5). Empirically, we set it to 0.001. Finally, the classification sub-network takes input images each with size 227 × 227.
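To give a sense of the size of the quantized search space implied by these settings, the short calculation below counts the rotation, scale and template combinations evaluated per candidate center. It assumes integer rotation steps of one degree over [−60, 60] and 30 clusters with one mirrored copy each. The number of candidate centers depends on the box size and is left out.

```python
# Settings taken from the text; variable names are illustrative.
thetas = list(range(-60, 61))          # integer rotations in degrees
alphas = [2.5, 3.0, 3.5, 4.0, 4.5]     # scale factors
num_clusters = 30
num_templates = 2 * num_clusters       # each template plus its mirrored version

combos_per_center = len(thetas) * len(alphas) * num_templates
```

Under these assumptions there are 121 rotations, 5 scales and 60 templates, so 36,300 combinations per candidate center.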
