Proceedings ArticleDOI

Fine-Grained Categorization by Alignments

TL;DR: It is argued that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching-oriented features like HOG.
Abstract: The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The alignments are then used to transfer part annotations from training images to test images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We furthermore argue that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching-oriented features like HOG. We evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art.

Summary (3 min read)

1. Introduction

  • Fine-grained categorization relies on identifying the subtle differences in appearance of specific object parts.
  • Parts may be divided into intrinsic parts [3, 16] such as the head of a dog or the body of a bird, and distinctive parts [32, 31] specific to few sub-categories.
  • The large variability that naturally arises for a large number of classes complicates their detection.
  • Furthermore, rough alignment is not sub-category specific, thus the object representation becomes independent of the number of classes or training images [33, 32].
  • In contrast to the raw SIFT or template features preferred in the fine-grained literature [16, 31, 32], such localized feature encodings are less sensitive to misalignments.

3. Alignments

  • In the following the authors will employ both shape masks and ellipses as local frames of reference.
  • Consistent means that corresponding parts are found in similar locations, when expressed relative to this frame of reference.
  • As is common in fine-grained categorization [33, 32, 31], the authors have available both at training and at test time the bounding box locations of the object of interest.
  • Ignoring the image content outside the bounding box is a reasonable thing to do, since context is unlikely to play any major role in recognition of sub-categories, e.g., all birds are usually either on trees or flying in the sky.
  • The rectangular bounding box around an object allows for extracting important information, such as the approximate shape of the object.

3.1. Supervised alignments

  • In the supervised scenario the ground truth locations of basic object parts, such as the beak or the tail of the birds, are available in the training set.
  • This gives a shape mask for the image, which the authors effectively summarize in the form of HOG features [7].
  • Therefore, the authors can expect that given an object, there are several others with similar shapes and, that due to the anatomical constraints of the super-category they belong to, are likely to be found in similar poses.
  • The authors are now in a position to use the ground truth locations of the parts in the training images and predict the corresponding locations in the test image.
  • The authors experimentally witnessed that averaging yields results accurate enough to recover rough alignments.

3.2. Unsupervised alignments

  • In the unsupervised scenario no ground truth information of the training part locations is available.
  • Since no ground truth part locations are available, it does not make sense to align the test image to a small subset of training images.
  • More specifically, the authors fit an ellipse to the pixels X of the segmentation mask and compute the local 2-d geometry in the form of the two principal axes $a_j = \bar{x} + \vec{e}_j\sqrt{\lambda_j}$ (eq. 1).
  • Regarding the ancillary axis, the authors cannot easily define an origin in a consistent way.
  • This procedure fully defines the frame of reference, see Fig. 4.

4. Final Image Representation

  • Thus, using features that are precise, but sensitive to common image transformations, is likely to be suboptimal.
  • Instead, the authors propose to use Fisher vectors [23] extracted in the predicted parts/regions.
  • The authors turn their focus to two approaches, one more relevant to part-based models and another more relevant to consistent regions.
  • Together with the object information this approach also captures some of the context that surrounds the object parts.
  • For the second approach the authors sample densely every d pixels only on the intersection area of the segmentation mask and the region.

5.1. Experimental setup

  • The authors first run their experiments on the CU-2011 Birds dataset [30], one of the most extensive datasets in the fine-grained literature.
  • The CU-2011 Birds dataset is composed of 200 sub-species of birds, several of which bear tremendous similarities, especially under common image transformations.
  • Following the standard evaluation protocol [33, 32, 31], the authors mirror the train images to double the size of the training set and use the bounding boxes to normalize the images.
  • The authors use the ground truth part annotations only during learning, unless stated otherwise.
  • For Fisher vectors the authors use a Gaussian mixture model with 256 components; a minimal encoding sketch follows below.
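
As a rough illustration of this setup, the sketch below fits a 256-component diagonal-covariance GMM and encodes one region's descriptors as an improved Fisher vector. It is a minimal sketch, not the authors' implementation: `train_sift` (a matrix of local SIFT descriptors) is hypothetical, and the power/l2 normalization follows the standard Fisher vector formulation rather than anything specified in this summary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical (N, D) array of training SIFT descriptors.
gmm = GaussianMixture(n_components=256, covariance_type='diag')
gmm.fit(train_sift)

def fisher_vector(descs, gmm):
    """Fisher vector of one region: gradients w.r.t. the GMM means and
    variances, with power and l2 normalization (standard formulation)."""
    q = gmm.predict_proba(descs)                       # (n, K) soft assignments
    n = descs.shape[0]
    diff = (descs[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_)[None]
    g_mu = (q[..., None] * diff).sum(0) / (n * np.sqrt(gmm.weights_)[:, None])
    g_sig = (q[..., None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * gmm.weights_)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)           # l2 normalization
```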

5.2. Matching vs Classification Descriptors

  • In this first experiment the authors evaluate which descriptors are suited to describing parts in a fine-grained categorization setting.
  • In order to avoid too strong a correlation between the parts, and also to control the dimensionality of the final feature vector, the authors use only the following 7 parts, which cover the bird silhouette: beak, belly, forehead, left wing, right wing, tail and throat.
  • Similarly, for the HOG object descriptors the authors also compute a HOG vector using the bounding box, rescaled to 100×100 pixels (see the sketch after this list).
  • For fine-grained classes the gradients are often quite similar, since they belong to the same superclass; Fisher vectors are better able to describe the little nuances in the gradients, as they are specifically designed to also capture first and second order statistics of the gradient information.
  • The authors plot in the left image of Fig. 5 the individual accuracies per class for Fisher vectors and for HOG, noticing that Fisher vectors outperform for 184 out of the 200 sub-categories.
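
For concreteness, a HOG baseline descriptor along these lines could be computed as below. This is a hedged sketch using skimage; the cell and block sizes are our assumptions, since the summary only specifies the 100×100 rescaling.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def hog_bbox_descriptor(image_gray, bbox):
    """HOG vector over the whole bounding box, rescaled to 100x100 pixels.
    bbox = (x, y, w, h); cell/block sizes are illustrative assumptions."""
    x, y, w, h = bbox
    crop = resize(image_gray[y:y + h, x:x + w], (100, 100))
    return hog(crop, orientations=9, pixels_per_cell=(10, 10),
               cells_per_block=(2, 2))
```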

5.3. Supervised alignments

  • In the second experiment the authors test whether supervised alignments actually benefit the recognition of fine-grained categories, as compared to a standard classification pipeline.
  • The authors use the same 7 parts as in the previous experiment plus a Fisher vector extracted from the whole bounding box.
  • Also, inspired by [16], the authors repeat the same experiment using only the predicted location of the beak, whose window captures most of the information around the head.
  • Furthermore, the authors note that extracting Fisher vectors on the supervised alignments is 47.1% accurate, which is rather close to the 52.5% obtained when extracting Fisher vectors on the parts provided by the ground truth.
  • This indicates that the authors capture the part locations well enough for an appearance descriptor like the Fisher vector.

5.4. Unsupervised Alignments

  • In this experiment the authors compare the unsupervised alignments with the supervised ones.
  • After extracting the principal axis the authors split the bird mask into four regions, starting from the highest point, considering only the pixels within the segmentation mask.
  • The authors furthermore compare their method against a horizontally split [4×1] spatial pyramid.
  • The authors repeat the experiment considering different numbers of regions.
  • The authors furthermore plot the individual accuracy differences per class for supervised and unsupervised alignments in the right picture of Fig. 5.

5.5. State-of-the-art comparison

  • In experiment 4, the authors compare their unsupervised alignments with state-of-the-art methods reported on CU-2011 Birds and Stanford Dogs.
  • The authors add color by sampling SIFT descriptors from the opponent color spaces [27].
  • Compared to the learned features proposed in [12], unsupervised alignments perform 36.5% better.
  • The authors also report some numbers from prior works on CU-2010 Birds, which is the previous version of CU-2011 Birds.
  • In Fig. 7 the authors show images of the two categories most confused with each other: Loggerhead Shrike and Great Grey Shrike.

6. Conclusions

  • In this paper the authors aim for fine-grained categorization without human interaction.
  • Different from prior work, the authors show that localizing distinctive details by roughly aligning the objects allows for successful recognition of fine-grained subclasses.
  • The authors present two methods for extracting alignments, requiring different levels of supervision.
  • The authors evaluate on the CU-2011 Birds and Stanford Dogs datasets, outperforming the state-of-the-art.
  • The authors conclude that rough alignments lead to accurate fine-grained categorization.



Fine-Grained Categorization by Alignments

E. Gavves¹, B. Fernando², C.G.M. Snoek¹, A.W.M. Smeulders¹,³, and T. Tuytelaars²

¹University of Amsterdam, ISIS   ²KU Leuven, ESAT-PSI, iMinds   ³CWI Amsterdam
Abstract

The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The alignments are then used to transfer part annotations from training images to test images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We furthermore argue that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching-oriented features like HOG. We evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art.
1. Introduction

Fine-grained categorization relies on identifying the subtle differences in appearance of specific object parts. Research in cognitive psychology has suggested [24] and recent works in computer vision have confirmed [10, 31, 34] this mechanism. Humans learn to distinguish different types of birds by addressing the differences in specific details. The same holds for car types [8], sailing boat types, dog breeds [15, 16], but also when learning to discriminate different types of pathologies. For this purpose, active learning methods have been proposed to extract attributes [9], volumetric models [10] or part models [3]. They require expert-level knowledge at run time, which is often unavailable. In contrast, we aim for fine-grained categorization without human interaction.
Figure 1. The first image shows a Hooded Warbler, whereas the second image shows a Kentucky Warbler. Based on example images like these, fine-grained categorization tries to answer the question: what fine-grained bird category do we have in the third image? Rather than directly trying to localize parts (be it distinctive or intrinsic), we show in this paper that better results can be obtained if one first tries to align the birds based on their global shape, ignoring the actual bird categories.

Various methods have been proposed to learn in an unsupervised manner what details to focus on for identifying fine-grained sub-categories, such as the recent works relying on templates [31, 32]. In [32] templates rely on high dimensionalities to arrive at good results, while in [31] they are designed to be precise, being effectively analogous to "parts" [11]. Yet, it remains unclear what is the most critical aspect of "parts" in a fine-grained categorization context: is it the ability to accurately localize corresponding locations over object instances, or simply the ability to capture detailed information? While often these go hand in hand, as indeed is the case for templates, we defend the view that actually it is the latter that matters. We argue that a very precise "part" localization is not necessary and rough alignments suffice, as long as one manages to capture the fine-grained details in the appearance.
Parts may be divided into intrinsic parts [3, 16] such as the head of a dog or the body of a bird, and distinctive parts [32, 31] specific to few sub-categories. Recovering intrinsic parts implies that such parts are seen throughout the whole dataset. However, the large variability that naturally arises for a large number of classes complicates their detection. Distinctive parts, on the other hand, are destined to be found on few sub-categories only. They are more consistent in appearance, as the distinctive details are better tailored to be detected on few sub-categories. On the downside, however, the number of sub-category-specific parts soon becomes huge for a large number of classes, each trained on a small number of examples. This limits their ability to robustly capture the viewpoint, pose and lighting condition changes. Hence, detecting parts, be it intrinsic or distinctive, seems to involve contradictory requirements.
Different from prior work, we propose not to learn detectors for individual parts, but instead to localize distinctive details by first roughly aligning the objects. This alignment is rough and insensitive to vast appearance variations for a large number of sub-categories. Furthermore, rough alignment is not sub-category specific, thus the object representation becomes independent of the number of classes or training images [33, 32]. For alignments we only use the overall shape.

A first novelty of our work is based on the observation that all sub-categories belonging to the same super-category share similar global characteristics regarding their shape and poses. Therefore, it is effective to align objects, as we will pursue. In the supervised case, annotated details are transferred from training images to test images. In the unsupervised case, we use alignments to delineate corresponding object regions that we will use in the differential classification.
Our second novelty is based on the observation that starting from rough alignments instead of precise part locations, noticeable appearance perturbations will appear even between very similar objects, due to common image deformations such as small translations, viewpoint variations and partial occlusions. Using as fine-grained representations [10, 32, 34, 1] raw descriptors such as [17, 2, 32], which are precise yet sensitive to common image transformations, is therefore likely to be a sub-optimal choice, especially when part detection becomes challenging. We propose to use state-of-the-art feature encodings, like Fisher vectors [23], typically used for image classification, as local descriptors. In contrast to the raw SIFT or template features preferred in the fine-grained literature [16, 31, 32], such localized feature encodings are less sensitive to misalignments. Indeed, as our experiments indicate, they are better suited than matching based features.

We present two methods for recovering alignments that require varying levels of part supervision during training. We evaluate our methods on the CU-2011 Birds and Stanford Dogs datasets [30]. The results vouch for unsupervised alignments, which outperform previously published results.
2. Related work

Fine-grained categorization has entered the stage in the computer vision literature only recently. Prior works have focused on various aspects of fine-grained categorization, such as the description of fine-grained objects, the detection of fine-grained objects and the use of human interaction to boost recognition.

Fine-grained description. For the description of fine-grained objects various proposals have been made in the literature. In [32] Yao et al. propose to use color and gradient pixel values, arriving at high-dimensional histograms. Farrell et al. [10] use color SIFT features, whereas Yang et al. [31] propose to use shape, color and texture based kernel descriptors [2]. Different from the above works, we propose to use strong classification-oriented, and not matching-oriented, encodings to describe the alignment parts and regions. Sanchez et al. in [13] and Chai et al. in [6] rely on classification-oriented encodings, Fisher vectors specifically, to learn a global, object-level representation. Inspired by their work we also adopt Fisher vectors. However, we use Fisher vectors not only as global, object-level representations, but also as localized appearance descriptors.
Fine-grained detection. The detection of objects in a fine-grained categorization setting ranges from the segmentation of the object of interest [19, 5, 6] to fitting ellipsoids [10] and detecting individual parts and templates [33, 34, 32, 31, 16]. In their seminal work [19] Nilsback and Zisserman show the importance of segmenting out background information for recognizing flowers. Furthermore, in [5, 6] Chai et al. demonstrate how co-segmentation may be employed to improve classification. In the current work we also use segmentation, but with the intention to acquire an impression of the object's shape and to recover interesting object regions.

Targeting more towards parts instead of segmentations, Yao et al. propose to either sample discriminative features using randomized trees [33] or convolve images with hundreds of thousands of randomly generated templates [32]. Since a huge feature space is generated, tree pruning is employed to discard the unnecessary dimensions and make the problem tractable. In [10, 34] Farrell et al. capture the poses of birds, whereas in [34] Zhang et al. furthermore propose to normalize such poses and extract warped features, arriving at impressive results. In [21] Parkhi et al. propose to use deformable part models to detect the heads of cats and dogs and in [1] Berg and Belhumeur learn discriminative parts from pairwise comparisons between classes. Also, in [16] Liu et al. propose to share parts between classes to arrive at accurate part localization.

Different from the above works, we do not directly aim at localizing individual parts, but rather at aligning the object as a whole. Based on this alignment, we then derive a small number of predicted parts (supervised) or regions (unsupervised). Such regions are highly repeatable, while few in number, thus ensuring consistency across the dataset and a smaller parameter space to learn our fine-grained object descriptions.
Figure 2. The computation of the segmentation mask can be accurate as in the left, OK as in the middle, or completely fail as in the right image. Most times segmentations are somewhere in between the left and middle examples, thus allowing us to obtain a rather good impression of the object's shape.

Human interaction. In [20] Parikh and Grauman iteratively generate discriminative attributes. They then evaluate and retain the "nameable" ones, that is, the ones that can be interpreted by humans. In [4] Branson et al. try to determine the object's sub-category using visual properties that can be easily answered by a user, such as whether the object "has stripes". In [29] Wah et al. propose an active learning approach that considers user clicks on object part locations, so that the machine learns to select the most informative question to pose to the user. In [9] Duan et al. propose to use a latent conditional random field to generate localized attributes that are both machine and human friendly. A user then picks those attributes that are sensible. And in [3] Branson et al. show that part models designed for generic objects do not always perform equally well for fine-grained categories. They therefore propose online supervision to learn better part models. The above approaches require time-consuming user input and often expert knowledge. Hence, their applicability is usually restricted to small datasets covering only a limited number of fine-grained categories [9]. In the current work we propose a fine-grained categorization method that does not require any human interaction.
3. Alignments

A local frame of reference serves to identify the spatial properties of an object. In the following we will employ both shape masks and ellipses as local frames of reference. We say an image is aligned with other images if we have identified a local frame of reference in the image that is consistent with (a subset of) the frames of reference found in other images. Consistent means that corresponding parts are found in similar locations, when expressed relative to this frame of reference.

As is common in fine-grained categorization [33, 32, 31], we have available both at training and at test time the bounding box locations of the object of interest. We focus exclusively on the classification problem, leaving the problem of object detection for another occasion. Ignoring the image content outside the bounding box is a reasonable thing to do, since context is unlikely to play any major role in recognition of sub-categories, e.g., all birds are usually either on trees or flying in the sky.

The rectangular bounding box around an object allows for extracting important information, such as the approximate shape of the object. More specifically, we use GrabCut [25] on the bounding box to compute an accurate figure-ground segmentation. Although GrabCut is not always as accurate and in rare cases fails to recover even a basic contour, in the vast majority of cases it is able to return a rather precise contour of the object, see Fig. 2.
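
As one way to realize this step, a GrabCut call initialized from the bounding box could look as follows. This is a minimal OpenCV sketch under our own assumptions; the paper does not specify iteration counts or any post-processing.

```python
import numpy as np
import cv2

def object_shape_mask(image_bgr, bbox, n_iter=5):
    """Figure-ground segmentation inside the bounding box with GrabCut [25].
    bbox = (x, y, w, h); returns a binary foreground (shape) mask."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)  # internal buffers OpenCV requires
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, bbox, bgd_model, fgd_model,
                n_iter, cv2.GC_INIT_WITH_RECT)
    # Sure and probable foreground pixels together form the shape mask.
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return fg.astype(np.uint8)
```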
3.1. Supervised alignments

In the supervised scenario the ground truth locations of basic object parts, such as the beak or the tail of the birds, are available in the training set. This is a typical scenario when the number of images is limited, so that human experts can provide information at such a level of granularity. In this setting, we aim at accurately aligning the test image with a small number of training images. Then, we can use the common frame of reference to predict the part locations in the test image.

Our first goal is to retrieve a small number of training pictures that have a similar shape as the object in the test image. Note that, at this stage, it does not matter whether these are images that belong to the same sub-category or not. To this end, we first obtain the segmentation mask of the object as described before. Since we are interested only in the outer shape of the object, we suppress all the interior shape information. This gives us a shape mask for the image, which we effectively summarize in the form of HOG features [7].
A HOG feature forms in theory a high-dimensional, dense space. In practice, however, all the sub-categories belong to the same super-category, hence the generated poses will mainly lie on a lower dimensional manifold. Therefore, we can expect that given an object, there are several others with similar shapes that, due to the anatomical constraints of the super-category they belong to, are likely to be found in similar poses. Given the $\ell_2$-normalized HOG feature of the image shape mask, we retrieve the nearest neighbor images from the training set using a query-by-example setting. As a result, we end up with a shortlist of other similarly posed objects, see Fig. 3.
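
The retrieval step could be sketched as below: HOG on the shape mask, $\ell_2$ normalization, and nearest-neighbor lookup. The HOG parameters and the neighborhood size k are illustrative assumptions, not values from the paper.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def shape_hog(mask, size=(100, 100)):
    """l2-normalized HOG summary of a binary shape mask
    (interior detail already suppressed)."""
    h = hog(resize(mask.astype(float), size), orientations=9,
            pixels_per_cell=(10, 10), cells_per_block=(2, 2))
    return h / (np.linalg.norm(h) + 1e-12)

def nearest_shapes(test_mask, train_hogs, k=5):
    """Indices of the k training images whose shape masks are closest
    to the test shape, query-by-example style."""
    q = shape_hog(test_mask)
    dists = np.linalg.norm(train_hogs - q, axis=1)
    return np.argsort(dists)[:k]
```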
Having retrieved the training images with the most similar poses, the bounding boxes can be used as frames of reference. We are now in a position to use the ground truth locations of the parts in the training images and predict the corresponding locations in the test image. To calculate the positions of the same parts on the test image, one may apply several methods of varying sophistication, ranging from simple average pooling of part locations to local, independent optimization of parts based on HOG convolutions. We experimentally witnessed that averaging yields results accurate enough to recover rough alignments. To ensure maximum compatibility we repeat the above procedure for all training and testing images in the dataset, thus predicting part locations for all the objects in the dataset.
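
A minimal version of the averaging predictor follows. The array `neighbor_parts` is hypothetical: ground-truth part coordinates from the retrieved neighbors, expressed relative to each bounding box so that the frames of reference line up.

```python
import numpy as np

def predict_part_locations(neighbor_parts):
    """Average-pool part locations over the retrieved neighbors.
    neighbor_parts: (k, P, 2) array of (x, y) part coordinates relative to
    each neighbor's bounding box; NaN marks parts not visible in a neighbor."""
    return np.nanmean(neighbor_parts, axis=0)  # (P, 2) predicted locations
```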

Figure 3. In the top left, we have a test image, for which we want to predict part locations. On the right, we have the nearest neighbor training images, their ground truth part locations and their HOG shape representations, based on which they were retrieved. Regressing the locations from the nearest neighbors to the test image we get the predicted parts, shown as the colorful symbols. The predicted part locations look quite consistent.
3.2. Unsupervised alignments

In the unsupervised scenario no ground truth information of the training part locations is available. However, we still have the bounding box that surrounds the object, based on which we can derive a shape mask per object.

Since no ground truth part locations are available, it does not make sense to align the test image to a small subset of training images. Instead, we derive a frame of reference based on the global object shape, inspired by local affine frames used for affine invariant keypoint description [18]. While not as accurate as the alignments in the previous subsection, this procedure allows us to obtain robust and consistent alignments over the entire database.
More specifically, we fit an ellipse to the pixels $X$ of the segmentation mask and compute the local 2-d geometry in the form of the two principal axes

$$a_j = \bar{x} + \vec{e}_j \sqrt{\lambda_j} \qquad (1)$$

In eq. (1), $\lambda_j$ and $\vec{e}_j$ stand for the j-th eigenvalue and eigenvector of the covariance matrix $C = E[(X - \bar{x})(X - \bar{x})^T]$, and $\bar{x}$ is the average location of the mask pixels, see Fig. 4.
GrabCut does not always return very accurate contours around the objects. Still, the centre of mass of the object is relatively stable to random fluctuations of the object contour. Thus, we let the ellipse axes meet each other at this point. To this end we extract the principal axes using all the foreground pixels of the shape mask.

For objects that have an elliptical shape the longer axis is usually the principal axis. Additionally, we follow the gravity vector assumption [22] and adopt the highest end point of the principal axis as its origin. Regarding the ancillary axis, we cannot easily define an origin in a consistent way. We therefore decide not to use the ancillary axis in the generation of consistent regions. This procedure fully defines the frame of reference, see Fig. 4.
Relative to this frame of reference, we can define different locations or regions at will. Here, we divide the principal axis equally from the origin to the end in a fixed number of segments, and define regions as the part of the foreground mask that falls within one such segment. Given accurate segmentation masks, the corresponding locations in different fine-grained objects are visited in the same order, thus resulting in pose-normalized representations, see Fig. 4. Small errors in the segmentations, as in the last row of Fig. 4, have only a limited impact on the regions we obtain.
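
The frame-of-reference construction and the region split could be sketched as below. This is a minimal sketch under our own assumptions: image coordinates with y growing downward (so the "highest" end point has the smallest y), and the number of regions left as a free parameter, consistent with eq. (1) and the gravity-vector convention.

```python
import numpy as np

def frame_of_reference(mask):
    """Principal axis of the shape mask per eq. (1), with the origin at the
    highest end point of the axis (gravity vector assumption [22])."""
    ys, xs = np.nonzero(mask)
    X = np.stack([xs, ys], axis=1).astype(float)
    x_bar = X.mean(axis=0)                         # centre of mass
    lam, E = np.linalg.eigh(np.cov((X - x_bar).T)) # eigen-decomposition of C
    e1 = E[:, np.argmax(lam)]                      # principal eigenvector
    half = np.sqrt(lam.max())                      # a_j = x_bar + e_j * sqrt(lambda_j)
    end_a, end_b = x_bar + e1 * half, x_bar - e1 * half
    origin = end_a if end_a[1] < end_b[1] else end_b  # smaller y = higher in image
    return origin, x_bar

def split_regions(mask, origin, x_bar, n_regions=4):
    """Divide the principal axis equally from the origin onward and assign
    each foreground pixel to the segment its projection falls in."""
    ys, xs = np.nonzero(mask)
    direction = (x_bar - origin) / np.linalg.norm(x_bar - origin)
    proj = (xs - origin[0]) * direction[0] + (ys - origin[1]) * direction[1]
    proj = np.clip(proj, 0, None)
    edges = np.linspace(0, proj.max() + 1e-6, n_regions + 1)
    labels = np.digitize(proj, edges)              # 1 .. n_regions
    regions = np.zeros_like(mask, dtype=np.int32)
    regions[ys, xs] = labels                       # 0 stays background
    return regions
```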
4. Final Image Representation

Our alignments are designed to be rough. Thus, using features that are precise but sensitive to common image transformations is likely to be suboptimal. Instead, we propose to use Fisher vectors [23] extracted in the predicted parts/regions. There are different ways one could sample from the alignment region to generate a Fisher vector. We turn our focus to two approaches, one more relevant to part-based models and another more relevant to consistent regions. For the first approach we sample in a T × T window around the center of the part, sampling descriptors every d pixels. Together with the object information this approach also captures some of the context that surrounds the object parts. For the second approach we sample densely every d pixels only on the intersection area of the segmentation mask and the region. This approach includes less context, as no descriptors centered on the background are extracted. Note that although the second approach is theoretically more accurate in capturing only the object appearance details, at the same time it might either include background pixels or omit foreground pixels, since segmentation masks are not perfect.
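
The two sampling schemes can be made concrete as below: a dense grid in a T×T window around a part center, versus a dense grid restricted to the intersection of the segmentation mask and the aligned region. The descriptors extracted at the returned points would then be encoded into one Fisher vector per part/region; the parameter values here are placeholders, not values from the paper.

```python
import numpy as np

def part_window_grid(center, T=64, d=4):
    """First scheme: sample every d pixels in a T x T window around the part
    center; deliberately includes some context around the part."""
    cx, cy = center
    xs = np.arange(cx - T // 2, cx + T // 2 + 1, d)
    ys = np.arange(cy - T // 2, cy + T // 2 + 1, d)
    return [(x, y) for y in ys for x in xs]

def region_mask_grid(seg_mask, region_mask, d=4):
    """Second scheme: sample every d pixels only on the intersection of the
    segmentation mask and the region; no descriptors centered on background."""
    ys, xs = np.nonzero(np.logical_and(seg_mask, region_mask))
    keep = (xs % d == 0) & (ys % d == 0)
    return list(zip(xs[keep].tolist(), ys[keep].tolist()))
```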

Citations
Book ChapterDOI
06 Sep 2014
TL;DR: In this article, the authors propose a model for fine-grained categorization by leveraging deep convolutional features computed on bottom-up region proposals, which learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a finegrained category from a pose normalized representation.
Abstract: Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.

1,035 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: This paper proposes to apply visual attention to fine-grained classification task using deep neural network and achieves the best accuracy under the weakest supervision condition, and is competitive against other methods that rely on additional annotations.
Abstract: Fine-grained classification is challenging because categories can only be discriminated by subtle and local differences. Variances in the pose, scale or rotation usually make the problem more difficult. Most fine-grained classification systems follow the pipeline of finding foreground object or object parts (where) to extract discriminative features (what).

755 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: This work proposes a method for fine-grained recognition that uses no part annotations, based on generating parts using co-segmentation and alignment, which is combined in a discriminative mixture.
Abstract: Scaling up fine-grained recognition to all domains of fine-grained objects is a challenge the computer vision community will need to face in order to realize its goal of recognizing all object categories. Current state-of-the-art techniques rely heavily upon the use of keypoint or part annotations, but scaling up to hundreds or thousands of domains renders this annotation cost-prohibitive for all but the most important categories. In this work we propose a method for fine-grained recognition that uses no part annotations. Our method is based on generating parts using co-segmentation and alignment, which we combine in a discriminative mixture. Experimental results show its efficacy, demonstrating state-of-the-art results even when compared to methods that use part annotations during training.

507 citations


Cites methods from "Fine-Grained Categorization by Alig..."

  • ...Of the methods developed that do not use part annotations, there have been a few works philosophically similar to ours in the goal of finding localized parts or regions in an unsupervised fashion [15, 18, 10], with [18] and [10] more relevant....


  • ...[18, 19] segment images via GrabCut [37], and then roughly align objects by parameterizing them as an ellipse....


Posted Content
TL;DR: An architecture for fine-grained visual categorization that approaches expert human performance in the classification of bird species recognition is proposed, and a novel graph-based clustering algorithm for learning a compact pose normalization space is proposed.
Abstract: We propose an architecture for fine-grained visual categorization that approaches expert human performance in the classification of bird species. Our architecture first computes an estimate of the object's pose; this is used to compute local image features which are, in turn, used for classification. The features are computed by applying deep convolutional nets to image patches that are located and normalized by the pose. We perform an empirical study of a number of pose normalization schemes, including an investigation of higher order geometric warping functions. We propose a novel graph-based clustering algorithm for learning a compact pose normalization space. We perform a detailed investigation of state-of-the-art deep convolutional feature implementations and fine-tuning feature learning for fine-grained classification. We observe that a model that integrates lower-level feature layers with pose-normalized extraction routines and higher-level feature layers with unaligned image features works best. Our experiments advance state-of-the-art performance on bird species recognition, with a large improvement of correct classification rates over previous methods (75% vs. 55-65%).

473 citations


Cites background or methods from "Fine-Grained Categorization by Alig..."

  • ...fine-grained categorization over the past 5 years has been extensive. Areas explored include feature representations that better preserve fine-grained information [35,46,47,48], segmentation-based approaches [1,13,14,15,21,37] that facilitate extraction of purer features, and part/pose normalized feature spaces [5,6,19,33,38,39,43,50,51]. Among this large body of work, it is a goal of our paper to empirically investigate...

  • ...[from a results table: Method (Part Scheme, Features, Learning, % Acc)] POOF [5] (Sim-2-131, POOF, SVM, 56.8); Alignments [21] (Trans-X-4, Fisher, SVM, 62.7); Symbiotic [15] (Trans-1-1, Fisher, SVM, 61.0); DPD [51] (Trans-1-8, KDES, SVM, 51.0); Decaf [17] (Trans-1-8, CNN, logistic regression, 65.0); CUB [44] (Trans-1-15, BoW, SVM, 10.3); Visipedia [12...

  • ...improvements on the CUB datasets over the last few years have been remarkable, with early methods achieving 10-20% 200-way classification accuracy [10,44,45,47], and recent methods achieving 55-65% accuracy [5,12,15,17,21,51]. Here we report further accuracy gains up to 75.7%...

  • ...recently, methods that employed more modern features like POOF [5], Fisher-encoded SIFT and color descriptors [40], and Kernel Descriptors (KDES) [7] significantly boosted performance into the 50-62% range [5,12,15,21,51]. CNN features [28] have helped yield a second major jump in performance to 65-76%...

  • ...HOG is widely used as a good feature for localized models, whereas Fisher-encoded SIFT is widely used on CUB200-2011 with state-of-the-art results [12,15,21]. For HOG, we use the implementation/parameter settings of [20] and induce a 16×16×31 descriptor for each region type. For Fisher features, we use the implementation and parameter settings from [12]...

Book ChapterDOI
Ze Yang, Tiange Luo, Dong Wang, Zhiqiang Hu, Jun Gao, Liwei Wang
08 Sep 2018
TL;DR: In this paper, a self-supervision mechanism is proposed to locate informative regions without the need of bounding-box/part annotations, which consists of a navigator agent, a teacher agent and a scrutinizer agent.
Abstract: Fine-grained classification is challenging due to the difficulty of finding discriminative features. Finding those subtle traits that fully characterize the object is not straightforward. To handle this circumstance, we propose a novel self-supervision mechanism to effectively localize informative regions without the need of bounding-box/part annotations. Our model, termed NTS-Net for Navigator-Teacher-Scrutinizer Network, consists of a Navigator agent, a Teacher agent and a Scrutinizer agent. In consideration of intrinsic consistency between informativeness of the regions and their probability being ground-truth class, we design a novel training paradigm, which enables Navigator to detect most informative regions under the guidance from Teacher. After that, the Scrutinizer scrutinizes the proposed regions from Navigator and makes predictions. Our model can be viewed as a multi-agent cooperation, wherein agents benefit from each other, and make progress together. NTS-Net can be trained end-to-end, while provides accurate fine-grained classification predictions as well as highly informative regions during inference. We achieve state-of-the-art performance in extensive benchmark datasets.

433 citations

References
Proceedings ArticleDOI
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations


"Fine-Grained Categorization by Alig..." refers background in this paper

  • ...The same holds for car types [8], sailing boat types, dog breeds [15, 16], but also when learning to discriminate different types of pathologies....


Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"Fine-Grained Categorization by Alig..." refers background in this paper

  • ...This gives us a shape mask for the image, which we effectively summarize in the form of HOG features [7]....


Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI--SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.

10,501 citations


"Fine-Grained Categorization by Alig..." refers background in this paper

  • ...In [32] templates rely on high dimensionalities to arrive at good results, while in [31] they are designed to be precise, being effectively analogous to “parts” [11]....


Journal ArticleDOI
01 Aug 2004
TL;DR: A more powerful, iterative version of the optimisation of the graph-cut approach is developed and the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result.
Abstract: The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.

5,670 citations

Frequently Asked Questions
Q1. What contributions have the authors mentioned in the paper "Fine-Grained Categorization by Alignments"?

The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, the authors propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The authors evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art. The authors furthermore argue that in the distinction of finegrained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching oriented features like HOG. 
