Proceedings ArticleDOI

Fine-Grained Categorization by Alignments

TL;DR: It is argued that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching-oriented features like HOG.
Abstract: The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The alignments are then used to transfer part annotations from training images to test images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We furthermore argue that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching-oriented features like HOG. We evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art.

Summary (3 min read)

1. Introduction

  • Fine-grained categorization relies on identifying the subtle differences in appearance of specific object parts.
  • Parts may be divided into intrinsic parts [3, 16] such as the head of a dog or the body of a bird, and distinctive parts [32, 31] specific to few sub-categories.
  • The large variability that naturally arises for a large number of classes complicates their detection.
  • Furthermore, rough alignment is not sub-category specific, thus the object representation becomes independent of the number of classes or training images [33, 32].
  • In contrast to the raw SIFT or template features preferred in the fine-grained literature [16, 31, 32], such localized feature encodings are less sensitive to misalignments.

3. Alignments

  • In the following the authors will employ both shape masks and ellipses as local frames of reference.
  • Consistent means that corresponding parts are found in similar locations, when expressed relative to this frame of reference.
  • As is common in fine-grained categorization [33, 32, 31], the authors have available both at training and at test time the bounding box locations of the object of interest.
  • Ignoring the image content outside the bounding box is a reasonable thing to do, since context is unlikely to play any major role in recognition of sub-categories, e.g., all birds are usually either on trees or flying in the sky.
  • The rectangular bounding box around an object allows for extracting important information, such as the approximate shape of the object.

3.1. Supervised alignments

  • In the supervised scenario the ground truth locations of basic object parts, such as the beak or the tail of the birds, are available in the training set.
  • This gives a shape mask for the image, which the authors effectively summarize in the form of HOG features [7].
  • Therefore, the authors can expect that given an object, there are several others with similar shapes and, that due to the anatomical constraints of the super-category they belong to, are likely to be found in similar poses.
  • The authors are now in a position to use the ground truth locations of the parts in the training images and predict the corresponding locations in the test image.
  • The authors experimentally witnessed that averaging yields results accurate enough to recover rough alignments.

3.2. Unsupervised alignments

  • In the unsupervised scenario no ground truth information of the training part locations is available.
  • Since no ground truth part locations are available, it does not make sense to align the test image to a small subset of training images.
  • More specifically, the authors fit an ellipse to the pixels X of the segmentation mask and compute the local 2-d geometry in the form of the two principal axes $a_j = \bar{x} + \vec{e}_j\sqrt{\lambda_j}$ (eq. 1).
  • Regarding the ancillary axis, the authors cannot easily define an origin in a consistent way.
  • This procedure fully defines the frame of reference, see Fig. 4.

4. Final Image Representation

  • Thus, using features that are precise, but sensitive to common image transformations, is likely to be suboptimal.
  • Instead, the authors propose to use Fisher vectors [23] extracted in the predicted parts/regions.
  • The authors turn their focus to two approaches, one more relevant to part-based models and another more relevant to consistent regions.
  • Together with the object information this approach also captures some of the context that surrounds the object parts.
  • For the second approach the authors sample densely every d pixels only on the intersection area of the segmentation mask and the region.

5.1. Experimental setup

  • The authors first run their experiments on the CU-2011 Birds dataset [30], one of the most extensive datasets in the fine-grained literature.
  • The CU-2011 Birds dataset is composed of 200 sub-species of birds, several of which bear tremendous similarities, especially under common image transformations.
  • Following the standard evaluation protocol [33, 32, 31], the authors mirror the train images to double the size of the training set and use the bounding boxes to normalize the images.
  • The authors use the ground truth part annotations only during learning, unless stated otherwise.
  • For Fisher vectors the authors use a Gaussian mixture model with 256 components; a minimal encoding sketch follows below.
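
As a rough illustration of this setup, the sketch below fits a 256-component diagonal-covariance GMM and encodes one region's descriptors as an improved Fisher vector. It is a minimal sketch, not the authors' implementation: `train_sift` (a matrix of local SIFT descriptors) is hypothetical, and the power/l2 normalization follows the standard Fisher vector formulation rather than anything specified in this summary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical (N, D) array of training SIFT descriptors.
gmm = GaussianMixture(n_components=256, covariance_type='diag')
gmm.fit(train_sift)

def fisher_vector(descs, gmm):
    """Fisher vector of one region: gradients w.r.t. the GMM means and
    variances, with power and l2 normalization (standard formulation)."""
    q = gmm.predict_proba(descs)                       # (n, K) soft assignments
    n = descs.shape[0]
    diff = (descs[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_)[None]
    g_mu = (q[..., None] * diff).sum(0) / (n * np.sqrt(gmm.weights_)[:, None])
    g_sig = (q[..., None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * gmm.weights_)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)           # l2 normalization
```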

5.2. Matching vs Classification Descriptors

  • In this first experiment the authors evaluate which descriptors are suited to describing parts in a fine-grained categorization setting.
  • In order to avoid too strong a correlation between the parts, and also to control the dimensionality of the final feature vector, the authors use only the following 7 parts, which cover the bird silhouette: beak, belly, forehead, left wing, right wing, tail and throat.
  • Similarly, for the HOG object descriptors the authors also compute a HOG vector using the bounding box, rescaled to 100×100 pixels (see the sketch after this list).
  • For fine-grained classes the gradients are often quite similar, since they belong to the same superclass; Fisher vectors are better able to describe the little nuances in the gradients, as they are specifically designed to also capture first and second order statistics of the gradient information.
  • The authors plot in the left image of Fig. 5 the individual accuracies per class for Fisher vectors and for HOG, noticing that Fisher vectors outperform for 184 out of the 200 sub-categories.
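
For concreteness, a HOG baseline descriptor along these lines could be computed as below. This is a hedged sketch using skimage; the cell and block sizes are our assumptions, since the summary only specifies the 100×100 rescaling.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def hog_bbox_descriptor(image_gray, bbox):
    """HOG vector over the whole bounding box, rescaled to 100x100 pixels.
    bbox = (x, y, w, h); cell/block sizes are illustrative assumptions."""
    x, y, w, h = bbox
    crop = resize(image_gray[y:y + h, x:x + w], (100, 100))
    return hog(crop, orientations=9, pixels_per_cell=(10, 10),
               cells_per_block=(2, 2))
```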

5.3. Supervised alignments

  • In the second experiment the authors test whether supervised alignments actually benefit the recognition of fine-grained categories, as compared to a standard classification pipeline.
  • The authors use the same 7 parts as in the previous experiment plus a Fisher vector extracted from the whole bounding box.
  • Also, inspired by [16], the authors repeat the same experiment using only the predicted location of the beak, whose window captures most of the information around the head.
  • Furthermore, the authors note that extracting Fisher vectors on the supervised alignments is 47.1% accurate, which is rather close to the 52.5% obtained when extracting Fisher vectors on the parts provided by the ground truth.
  • This indicates that the authors capture the part locations well enough for an appearance descriptor like the Fisher vector.

5.4. Unsupervised Alignments

  • In this experiment the authors compare the unsupervised alignments with the supervised ones.
  • After extracting the principal axis the authors split the bird mask into four regions, starting from the highest point, considering only the pixels within the segmentation mask.
  • The authors furthermore compare their method against a horizontally split [4×1] spatial pyramid.
  • The authors repeat the experiment considering different numbers of regions.
  • The authors furthermore plot the individual accuracy differences per class for supervised and unsupervised alignments in the right picture of Fig. 5.

5.5. State-of-the-art comparison

  • In experiment 4, the authors compare their unsupervised alignments with state-of-the-art methods reported on CU-2011 Birds and Stanford Dogs.
  • The authors add color by sampling SIFT descriptors from the opponent color spaces [27].
  • Compared to the learned features proposed in [12], unsupervised alignments perform 36.5% better.
  • The authors also report some numbers from prior works on CU-2010 Birds, which is the previous version of CU-2011 Birds.
  • In Fig. 7 the authors show images of the two categories most confused with each other: Loggerhead Shrike and Great Grey Shrike.

6. Conclusions

  • In this paper the authors aim for fine-grained categorization without human interaction.
  • Different from prior work, the authors show that localizing distinctive details by roughly aligning the objects allows for successful recognition of fine-grained subclasses.
  • The authors present two methods for extracting alignments, requiring different levels of supervision.
  • The authors evaluate on the CU-2011 Birds and Stanford Dogs datasets, outperforming the state-of-the-art.
  • The authors conclude that rough alignments lead to accurate fine-grained categorization.



Fine-Grained Categorization by Alignments

E. Gavves¹, B. Fernando², C.G.M. Snoek¹, A.W.M. Smeulders¹,³, and T. Tuytelaars²

¹University of Amsterdam, ISIS   ²KU Leuven, ESAT-PSI, iMinds   ³CWI Amsterdam
Abstract

The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The alignments are then used to transfer part annotations from training images to test images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We furthermore argue that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching-oriented features like HOG. We evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art.
1. Introduction

Fine-grained categorization relies on identifying the subtle differences in appearance of specific object parts. Research in cognitive psychology has suggested [24] and recent works in computer vision have confirmed [10, 31, 34] this mechanism. Humans learn to distinguish different types of birds by addressing the differences in specific details. The same holds for car types [8], sailing boat types, dog breeds [15, 16], but also when learning to discriminate different types of pathologies. For this purpose, active learning methods have been proposed to extract attributes [9], volumetric models [10] or part models [3]. They require expert-level knowledge at run time, which is often unavailable. In contrast, we aim for fine-grained categorization without human interaction.
Figure 1. The first image shows a Hooded Warbler, whereas the second image shows a Kentucky Warbler. Based on example images like these, fine-grained categorization tries to answer the question: what fine-grained bird category do we have in the third image? Rather than directly trying to localize parts (be it distinctive or intrinsic), we show in this paper that better results can be obtained if one first tries to align the birds based on their global shape, ignoring the actual bird categories.

Various methods have been proposed to learn in an unsupervised manner what details to focus on for identifying fine-grained sub-categories, such as the recent works relying on templates [31, 32]. In [32] templates rely on high dimensionalities to arrive at good results, while in [31] they are designed to be precise, being effectively analogous to "parts" [11]. Yet, it remains unclear what is the most critical aspect of "parts" in a fine-grained categorization context: is it the ability to accurately localize corresponding locations over object instances, or simply the ability to capture detailed information? While often these go hand in hand, as indeed is the case for templates, we defend the view that actually it is the latter that matters. We argue that a very precise "part" localization is not necessary and rough alignments suffice, as long as one manages to capture the fine-grained details in the appearance.
Parts may be divided into intrinsic parts [3, 16] such as the head of a dog or the body of a bird, and distinctive parts [32, 31] specific to few sub-categories. Recovering intrinsic parts implies that such parts are seen throughout the whole dataset. However, the large variability that naturally arises for a large number of classes complicates their detection. Distinctive parts, on the other hand, are destined to be found on few sub-categories only. They are more consistent in appearance, as the distinctive details are better tailored to be detected on few sub-categories. On the downside, however, the number of sub-category-specific parts soon becomes huge for a large number of classes, each trained on a small number of examples. This limits their ability to robustly capture the viewpoint, pose and lighting condition changes. Hence, detecting parts, be it intrinsic or distinctive, seems to involve contradictory requirements.
Different from prior work, we propose not to learn detectors for individual parts, but instead to localize distinctive details by first roughly aligning the objects. This alignment is rough and insensitive to vast appearance variations for a large number of sub-categories. Furthermore, rough alignment is not sub-category specific, thus the object representation becomes independent of the number of classes or training images [33, 32]. For alignments we only use the overall shape.

A first novelty of our work is based on the observation that all sub-categories belonging to the same super-category share similar global characteristics regarding their shape and poses. Therefore, it is effective to align objects, as we will pursue. In the supervised case, annotated details are transferred from training images to test images. In the unsupervised case, we use alignments to delineate corresponding object regions that we will use in the differential classification.
Our second novelty is based on the observation that starting from rough alignments instead of precise part locations, noticeable appearance perturbations will appear even between very similar objects, due to common image deformations such as small translations, viewpoint variations and partial occlusions. Using as fine-grained representations [10, 32, 34, 1] raw descriptors such as [17, 2, 32], which are precise yet sensitive to common image transformations, is therefore likely to be a sub-optimal choice, especially when part detection becomes challenging. We propose to use state-of-the-art feature encodings, like Fisher vectors [23], typically used for image classification, as local descriptors. In contrast to the raw SIFT or template features preferred in the fine-grained literature [16, 31, 32], such localized feature encodings are less sensitive to misalignments. Indeed, as our experiments indicate, they are better suited than matching based features.

We present two methods for recovering alignments that require varying levels of part supervision during training. We evaluate our methods on the CU-2011 Birds and Stanford Dogs datasets [30]. The results vouch for unsupervised alignments, which outperform previously published results.
2. Related work

Fine-grained categorization has entered the stage in the computer vision literature only recently. Prior works have focused on various aspects of fine-grained categorization, such as the description of fine-grained objects, the detection of fine-grained objects and the use of human interaction to boost recognition.

Fine-grained description. For the description of fine-grained objects various proposals have been made in the literature. In [32] Yao et al. propose to use color and gradient pixel values, arriving at high-dimensional histograms. Farrell et al. [10] use color SIFT features, whereas Yang et al. [31] propose to use shape, color and texture based kernel descriptors [2]. Different from the above works, we propose to use strong classification-oriented, and not matching-oriented, encodings to describe the alignment parts and regions. Sanchez et al. in [13] and Chai et al. in [6] rely on classification-oriented encodings, Fisher vectors specifically, to learn a global, object-level representation. Inspired by their work we also adopt Fisher vectors. However, we use Fisher vectors not only as global, object-level representations, but also as localized appearance descriptors.
Fine-grained detection. The detection of objects in a fine-grained categorization setting ranges from the segmentation of the object of interest [19, 5, 6] to fitting ellipsoids [10] and detecting individual parts and templates [33, 34, 32, 31, 16]. In their seminal work [19] Nilsback and Zisserman show the importance of segmenting out background information for recognizing flowers. Furthermore, in [5, 6] Chai et al. demonstrate how co-segmentation may be employed to improve classification. In the current work we also use segmentation, but with the intention to acquire an impression of the object's shape and to recover interesting object regions.

Targeting more towards parts instead of segmentations, Yao et al. propose to either sample discriminative features using randomized trees [33] or convolve images with hundreds of thousands of randomly generated templates [32]. Since a huge feature space is generated, tree pruning is employed to discard the unnecessary dimensions and make the problem tractable. In [10, 34] Farrell et al. capture the poses of birds, whereas in [34] Zhang et al. furthermore propose to normalize such poses and extract warped features, arriving at impressive results. In [21] Parkhi et al. propose to use deformable part models to detect the heads of cats and dogs and in [1] Berg and Belhumeur learn discriminative parts from pairwise comparisons between classes. Also, in [16] Liu et al. propose to share parts between classes to arrive at accurate part localization.

Different from the above works, we do not directly aim at localizing individual parts, but rather at aligning the object as a whole. Based on this alignment, we then derive a small number of predicted parts (supervised) or regions (unsupervised). Such regions are highly repeatable, while few in number, thus ensuring consistency across the dataset and a smaller parameter space to learn our fine-grained object descriptions.
Figure 2. The computation of the segmentation mask can be accurate as in the left, OK as in the middle, or completely fail as in the right image. Most times segmentations are somewhere in between the left and middle examples, thus allowing us to obtain a rather good impression of the object's shape.

Human interaction. In [20] Parikh and Grauman iteratively generate discriminative attributes. They then evaluate and retain the "nameable" ones, that is, the ones that can be interpreted by humans. In [4] Branson et al. try to determine the object's sub-category using visual properties that can be easily answered by a user, such as whether the object "has stripes". In [29] Wah et al. propose an active learning approach that considers user clicks on object part locations, so that the machine learns to select the most informative question to pose to the user. In [9] Duan et al. propose to use a latent conditional random field to generate localized attributes that are both machine and human friendly. A user then picks those attributes that are sensible. And in [3] Branson et al. show that part models designed for generic objects do not always perform equally well for fine-grained categories. They therefore propose online supervision to learn better part models. The above approaches require time-consuming user input and often expert knowledge. Hence, their applicability is usually restricted to small datasets covering only a limited number of fine-grained categories [9]. In the current work we propose a fine-grained categorization method that does not require any human interaction.
3. Alignments

A local frame of reference serves to identify the spatial properties of an object. In the following we will employ both shape masks and ellipses as local frames of reference. We say an image is aligned with other images if we have identified a local frame of reference in the image that is consistent with (a subset of) the frames of reference found in other images. Consistent means that corresponding parts are found in similar locations, when expressed relative to this frame of reference.

As is common in fine-grained categorization [33, 32, 31], we have available both at training and at test time the bounding box locations of the object of interest. We focus exclusively on the classification problem, leaving the problem of object detection for another occasion. Ignoring the image content outside the bounding box is a reasonable thing to do, since context is unlikely to play any major role in recognition of sub-categories, e.g., all birds are usually either on trees or flying in the sky.

The rectangular bounding box around an object allows for extracting important information, such as the approximate shape of the object. More specifically, we use GrabCut [25] on the bounding box to compute an accurate figure-ground segmentation. Although GrabCut is not always as accurate and in rare cases fails to recover even a basic contour, in the vast majority of cases it is able to return a rather precise contour of the object, see Fig. 2.
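
As one way to realize this step, a GrabCut call initialized from the bounding box could look as follows. This is a minimal OpenCV sketch under our own assumptions; the paper does not specify iteration counts or any post-processing.

```python
import numpy as np
import cv2

def object_shape_mask(image_bgr, bbox, n_iter=5):
    """Figure-ground segmentation inside the bounding box with GrabCut [25].
    bbox = (x, y, w, h); returns a binary foreground (shape) mask."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)  # internal buffers OpenCV requires
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, bbox, bgd_model, fgd_model,
                n_iter, cv2.GC_INIT_WITH_RECT)
    # Sure and probable foreground pixels together form the shape mask.
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return fg.astype(np.uint8)
```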
3.1. Supervised alignments

In the supervised scenario the ground truth locations of basic object parts, such as the beak or the tail of the birds, are available in the training set. This is a typical scenario when the number of images is limited, so that human experts can provide information at such a level of granularity. In this setting, we aim at accurately aligning the test image with a small number of training images. Then, we can use the common frame of reference to predict the part locations in the test image.

Our first goal is to retrieve a small number of training pictures that have a similar shape as the object in the test image. Note that, at this stage, it does not matter whether these are images that belong to the same sub-category or not. To this end, we first obtain the segmentation mask of the object as described before. Since we are interested only in the outer shape of the object, we suppress all the interior shape information. This gives us a shape mask for the image, which we effectively summarize in the form of HOG features [7].
A HOG feature forms in theory a high-dimensional, dense space. In practice, however, all the sub-categories belong to the same super-category, hence the generated poses will mainly lie on a lower dimensional manifold. Therefore, we can expect that given an object, there are several others with similar shapes that, due to the anatomical constraints of the super-category they belong to, are likely to be found in similar poses. Given the $\ell_2$-normalized HOG feature of the image shape mask, we retrieve the nearest neighbor images from the training set using a query-by-example setting. As a result, we end up with a shortlist of other similarly posed objects, see Fig. 3.
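
The retrieval step could be sketched as below: HOG on the shape mask, $\ell_2$ normalization, and nearest-neighbor lookup. The HOG parameters and the neighborhood size k are illustrative assumptions, not values from the paper.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def shape_hog(mask, size=(100, 100)):
    """l2-normalized HOG summary of a binary shape mask
    (interior detail already suppressed)."""
    h = hog(resize(mask.astype(float), size), orientations=9,
            pixels_per_cell=(10, 10), cells_per_block=(2, 2))
    return h / (np.linalg.norm(h) + 1e-12)

def nearest_shapes(test_mask, train_hogs, k=5):
    """Indices of the k training images whose shape masks are closest
    to the test shape, query-by-example style."""
    q = shape_hog(test_mask)
    dists = np.linalg.norm(train_hogs - q, axis=1)
    return np.argsort(dists)[:k]
```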
Having retrieved the training images with the most similar poses, the bounding boxes can be used as frames of reference. We are now in a position to use the ground truth locations of the parts in the training images and predict the corresponding locations in the test image. To calculate the positions of the same parts on the test image, one may apply several methods of varying sophistication, ranging from simple average pooling of part locations to local, independent optimization of parts based on HOG convolutions. We experimentally witnessed that averaging yields results accurate enough to recover rough alignments. To ensure maximum compatibility we repeat the above procedure for all training and testing images in the dataset, thus predicting part locations for all the objects in the dataset.
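
A minimal version of the averaging predictor follows. The array `neighbor_parts` is hypothetical: ground-truth part coordinates from the retrieved neighbors, expressed relative to each bounding box so that the frames of reference line up.

```python
import numpy as np

def predict_part_locations(neighbor_parts):
    """Average-pool part locations over the retrieved neighbors.
    neighbor_parts: (k, P, 2) array of (x, y) part coordinates relative to
    each neighbor's bounding box; NaN marks parts not visible in a neighbor."""
    return np.nanmean(neighbor_parts, axis=0)  # (P, 2) predicted locations
```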

Figure 3. In the top left, we have a test image, for which we want to predict part locations. On the right, we have the nearest neighbor training images, their ground truth part locations and their HOG shape representations, based on which they were retrieved. Regressing the locations from the nearest neighbors to the test image we get the predicted parts, shown as the colorful symbols. The predicted part locations look quite consistent.
3.2. Unsupervised alignments

In the unsupervised scenario no ground truth information of the training part locations is available. However, we still have the bounding box that surrounds the object, based on which we can derive a shape mask per object.

Since no ground truth part locations are available, it does not make sense to align the test image to a small subset of training images. Instead, we derive a frame of reference based on the global object shape, inspired by local affine frames used for affine invariant keypoint description [18]. While not as accurate as the alignments in the previous subsection, this procedure allows us to obtain robust and consistent alignments over the entire database.
More specifically, we fit an ellipse to the pixels $X$ of the segmentation mask and compute the local 2-d geometry in the form of the two principal axes

$$a_j = \bar{x} + \vec{e}_j \sqrt{\lambda_j} \qquad (1)$$

In eq. (1), $\lambda_j$ and $\vec{e}_j$ stand for the j-th eigenvalue and eigenvector of the covariance matrix $C = E[(X - \bar{x})(X - \bar{x})^T]$, and $\bar{x}$ is the average location of the mask pixels, see Fig. 4.
GrabCut does not always return very accurate contours around the objects. Still, the centre of mass of the object is relatively stable to random fluctuations of the object contour. Thus, we let the ellipse axes meet each other at this point. To this end we extract the principal axes using all the foreground pixels of the shape mask.

For objects that have an elliptical shape the longer axis is usually the principal axis. Additionally, we follow the gravity vector assumption [22] and adopt the highest end point of the principal axis as its origin. Regarding the ancillary axis, we cannot easily define an origin in a consistent way. We therefore decide not to use the ancillary axis in the generation of consistent regions. This procedure fully defines the frame of reference, see Fig. 4.
Relative to this frame of reference, we can define different locations or regions at will. Here, we divide the principal axis equally from the origin to the end in a fixed number of segments, and define regions as the part of the foreground mask that falls within one such segment. Given accurate segmentation masks, the corresponding locations in different fine-grained objects are visited in the same order, thus resulting in pose-normalized representations, see Fig. 4. Small errors in the segmentations, as in the last row of Fig. 4, have only a limited impact on the regions we obtain.
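
The frame-of-reference construction and the region split could be sketched as below. This is a minimal sketch under our own assumptions: image coordinates with y growing downward (so the "highest" end point has the smallest y), and the number of regions left as a free parameter, consistent with eq. (1) and the gravity-vector convention.

```python
import numpy as np

def frame_of_reference(mask):
    """Principal axis of the shape mask per eq. (1), with the origin at the
    highest end point of the axis (gravity vector assumption [22])."""
    ys, xs = np.nonzero(mask)
    X = np.stack([xs, ys], axis=1).astype(float)
    x_bar = X.mean(axis=0)                         # centre of mass
    lam, E = np.linalg.eigh(np.cov((X - x_bar).T)) # eigen-decomposition of C
    e1 = E[:, np.argmax(lam)]                      # principal eigenvector
    half = np.sqrt(lam.max())                      # a_j = x_bar + e_j * sqrt(lambda_j)
    end_a, end_b = x_bar + e1 * half, x_bar - e1 * half
    origin = end_a if end_a[1] < end_b[1] else end_b  # smaller y = higher in image
    return origin, x_bar

def split_regions(mask, origin, x_bar, n_regions=4):
    """Divide the principal axis equally from the origin onward and assign
    each foreground pixel to the segment its projection falls in."""
    ys, xs = np.nonzero(mask)
    direction = (x_bar - origin) / np.linalg.norm(x_bar - origin)
    proj = (xs - origin[0]) * direction[0] + (ys - origin[1]) * direction[1]
    proj = np.clip(proj, 0, None)
    edges = np.linspace(0, proj.max() + 1e-6, n_regions + 1)
    labels = np.digitize(proj, edges)              # 1 .. n_regions
    regions = np.zeros_like(mask, dtype=np.int32)
    regions[ys, xs] = labels                       # 0 stays background
    return regions
```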
4. Final Image Representation

Our alignments are designed to be rough. Thus, using features that are precise but sensitive to common image transformations is likely to be suboptimal. Instead, we propose to use Fisher vectors [23] extracted in the predicted parts/regions. There are different ways one could sample from the alignment region to generate a Fisher vector. We turn our focus to two approaches, one more relevant to part-based models and another more relevant to consistent regions. For the first approach we sample in a T × T window around the center of the part, sampling descriptors every d pixels. Together with the object information this approach also captures some of the context that surrounds the object parts. For the second approach we sample densely every d pixels only on the intersection area of the segmentation mask and the region. This approach includes less context, as no descriptors centered on the background are extracted. Note that although the second approach is theoretically more accurate in capturing only the object appearance details, at the same time it might either include background pixels or omit foreground pixels, since segmentation masks are not perfect.
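
The two sampling schemes can be made concrete as below: a dense grid in a T×T window around a part center, versus a dense grid restricted to the intersection of the segmentation mask and the aligned region. The descriptors extracted at the returned points would then be encoded into one Fisher vector per part/region; the parameter values here are placeholders, not values from the paper.

```python
import numpy as np

def part_window_grid(center, T=64, d=4):
    """First scheme: sample every d pixels in a T x T window around the part
    center; deliberately includes some context around the part."""
    cx, cy = center
    xs = np.arange(cx - T // 2, cx + T // 2 + 1, d)
    ys = np.arange(cy - T // 2, cy + T // 2 + 1, d)
    return [(x, y) for y in ys for x in xs]

def region_mask_grid(seg_mask, region_mask, d=4):
    """Second scheme: sample every d pixels only on the intersection of the
    segmentation mask and the region; no descriptors centered on background."""
    ys, xs = np.nonzero(np.logical_and(seg_mask, region_mask))
    keep = (xs % d == 0) & (ys % d == 0)
    return list(zip(xs[keep].tolist(), ys[keep].tolist()))
```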

Citations
Book ChapterDOI
06 Sep 2014
TL;DR: In this article, the authors propose a model for fine-grained categorization by leveraging deep convolutional features computed on bottom-up region proposals, which learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a finegrained category from a pose normalized representation.
Abstract: Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.

1,035 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: This paper proposes to apply visual attention to fine-grained classification task using deep neural network and achieves the best accuracy under the weakest supervision condition, and is competitive against other methods that rely on additional annotations.
Abstract: Fine-grained classification is challenging because categories can only be discriminated by subtle and local differences. Variances in the pose, scale or rotation usually make the problem more difficult. Most fine-grained classification systems follow the pipeline of finding foreground object or object parts (where) to extract discriminative features (what).

755 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: This work proposes a method for fine-grained recognition that uses no part annotations, based on generating parts using co-segmentation and alignment, which is combined in a discriminative mixture.
Abstract: Scaling up fine-grained recognition to all domains of fine-grained objects is a challenge the computer vision community will need to face in order to realize its goal of recognizing all object categories. Current state-of-the-art techniques rely heavily upon the use of keypoint or part annotations, but scaling up to hundreds or thousands of domains renders this annotation cost-prohibitive for all but the most important categories. In this work we propose a method for fine-grained recognition that uses no part annotations. Our method is based on generating parts using co-segmentation and alignment, which we combine in a discriminative mixture. Experimental results show its efficacy, demonstrating state-of-the-art results even when compared to methods that use part annotations during training.

507 citations


Cites methods from "Fine-Grained Categorization by Alig..."

  • ...Of the methods developed that do not use part annotations, there have been a few works philosophically similar to ours in the goal of finding localized parts or regions in an unsupervised fashion [15, 18, 10], with [18] and [10] more relevant....


  • ...[18, 19] segment images via GrabCut [37], and then roughly align objects by parameterizing them as an ellipse....


Posted Content
TL;DR: An architecture for fine-grained visual categorization that approaches expert human performance in the classification of bird species recognition is proposed, and a novel graph-based clustering algorithm for learning a compact pose normalization space is proposed.
Abstract: We propose an architecture for fine-grained visual categorization that approaches expert human performance in the classification of bird species. Our architecture first computes an estimate of the object's pose; this is used to compute local image features which are, in turn, used for classification. The features are computed by applying deep convolutional nets to image patches that are located and normalized by the pose. We perform an empirical study of a number of pose normalization schemes, including an investigation of higher order geometric warping functions. We propose a novel graph-based clustering algorithm for learning a compact pose normalization space. We perform a detailed investigation of state-of-the-art deep convolutional feature implementations and fine-tuning feature learning for fine-grained classification. We observe that a model that integrates lower-level feature layers with pose-normalized extraction routines and higher-level feature layers with unaligned image features works best. Our experiments advance state-of-the-art performance on bird species recognition, with a large improvement of correct classification rates over previous methods (75% vs. 55-65%).

473 citations


Cites background or methods from "Fine-Grained Categorization by Alig..."

  • ...fine-grained categorization over the past 5 years has been extensive. Areas explored include feature representations that better preserve fine-grained information [35,46,47,48], segmentation-based approaches [1,13,14,15,21,37] that facilitate extraction of purer features, and part/pose normalized feature spaces [5,6,19,33,38,39,43,50,51]. Among this large body of work, it is a goal of our paper to empirically investigate...

  • ...[from a results table: Method (Part Scheme, Features, Learning, % Acc)] POOF [5] (Sim-2-131, POOF, SVM, 56.8); Alignments [21] (Trans-X-4, Fisher, SVM, 62.7); Symbiotic [15] (Trans-1-1, Fisher, SVM, 61.0); DPD [51] (Trans-1-8, KDES, SVM, 51.0); Decaf [17] (Trans-1-8, CNN, logistic regression, 65.0); CUB [44] (Trans-1-15, BoW, SVM, 10.3); Visipedia [12...

  • ...improvements on the CUB datasets over the last few years have been remarkable, with early methods achieving 10-20% 200-way classification accuracy [10,44,45,47], and recent methods achieving 55-65% accuracy [5,12,15,17,21,51]. Here we report further accuracy gains up to 75.7%...

  • ...recently, methods that employed more modern features like POOF [5], Fisher-encoded SIFT and color descriptors [40], and Kernel Descriptors (KDES) [7] significantly boosted performance into the 50-62% range [5,12,15,21,51]. CNN features [28] have helped yield a second major jump in performance to 65-76%...

  • ...HOG is widely used as a good feature for localized models, whereas Fisher-encoded SIFT is widely used on CUB200-2011 with state-of-the-art results [12,15,21]. For HOG, we use the implementation/parameter settings of [20] and induce a 16×16×31 descriptor for each region type. For Fisher features, we use the implementation and parameter settings from [12]...

Book ChapterDOI
Ze Yang, Tiange Luo, Dong Wang, Zhiqiang Hu, Jun Gao, Liwei Wang
08 Sep 2018
TL;DR: In this paper, a self-supervision mechanism is proposed to locate informative regions without the need of bounding-box/part annotations, which consists of a navigator agent, a teacher agent and a scrutinizer agent.
Abstract: Fine-grained classification is challenging due to the difficulty of finding discriminative features. Finding those subtle traits that fully characterize the object is not straightforward. To handle this circumstance, we propose a novel self-supervision mechanism to effectively localize informative regions without the need of bounding-box/part annotations. Our model, termed NTS-Net for Navigator-Teacher-Scrutinizer Network, consists of a Navigator agent, a Teacher agent and a Scrutinizer agent. In consideration of intrinsic consistency between informativeness of the regions and their probability being ground-truth class, we design a novel training paradigm, which enables Navigator to detect most informative regions under the guidance from Teacher. After that, the Scrutinizer scrutinizes the proposed regions from Navigator and makes predictions. Our model can be viewed as a multi-agent cooperation, wherein agents benefit from each other, and make progress together. NTS-Net can be trained end-to-end, while provides accurate fine-grained classification predictions as well as highly informative regions during inference. We achieve state-of-the-art performance in extensive benchmark datasets.

433 citations

References
Proceedings ArticleDOI
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations


"Fine-Grained Categorization by Alig..." refers background in this paper

  • ...The same holds for car types [8], sailing boat types, dog breeds [15, 16], but also when learning to discriminate different types of pathologies....


Journal ArticleDOI
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

46,906 citations

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"Fine-Grained Categorization by Alig..." refers background in this paper

  • ...This gives us a shape mask for the image, which we effectively summarize in the form of HOG features [7]....


Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI--SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.

10,501 citations


"Fine-Grained Categorization by Alig..." refers background in this paper

  • ...In [32] templates rely on high dimensionalities to arrive at good results, while in [31] they are designed to be precise, being effectively analogous to “parts” [11]....


Journal ArticleDOI
01 Aug 2004
TL;DR: A more powerful, iterative version of the optimisation of the graph-cut approach is developed and the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result.
Abstract: The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.

5,670 citations

Frequently Asked Questions
Q1. What contributions have the authors mentioned in the paper "Fine-Grained Categorization by Alignments"?

The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, the authors propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The authors evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art. The authors furthermore argue that in the distinction of finegrained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching oriented features like HOG. 
