Fine-Grained Categorization by Alignments
Summary
1. Introduction
- Fine-grained categorization relies on identifying the subtle differences in appearance of specific object parts.
- Parts may be divided into intrinsic parts [3, 16], such as the head of a dog or the body of a bird, and distinctive parts [32, 31], which are specific to a few sub-categories.
- The large variability that naturally arises for a large number of classes complicates their detection.
- Furthermore, rough alignment is not sub-category specific, so the object representation becomes independent of the number of classes or training images [33, 32].
- In contrast to the raw SIFT or template features preferred in the fine-grained literature [16, 31, 32], such localized feature encodings are less sensitive to misalignments.
3. Alignments
- In the following, the authors employ both shape masks and ellipses as local frames of reference.
- Consistent means that corresponding parts are found in similar locations when expressed relative to this frame of reference.
- As is common in fine-grained categorization [33, 32, 31], the authors have available both at training and at test time the bounding box locations of the object of interest.
- Ignoring the image content outside the bounding box is reasonable, since context is unlikely to play a major role in recognizing sub-categories; e.g., all birds are usually either perched on trees or flying in the sky.
- The rectangular bounding box around an object allows for extracting important information, such as the approximate shape of the object.
3.1. Supervised alignments
- In the supervised scenario the ground truth locations of basic object parts, such as the beak or the tail of the birds, are available in the training set.
- This gives us a shape mask for the image, which the authors effectively summarize in the form of HOG features [7].
- Therefore, the authors can expect that, given an object, there are several others with similar shapes which, due to the anatomical constraints of the super-category they belong to, are likely to be found in similar poses.
- The authors are now in a position to use the ground truth locations of the parts in the training images to predict the corresponding locations in the test image.
- The authors experimentally observed that averaging yields results accurate enough to recover rough alignments.
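The supervised-alignment idea above (retrieve training images with similar shape masks, then average their annotated part locations) can be sketched as follows. All shapes, dimensions, and data below are illustrative placeholders, not the paper's actual features:

```python
import numpy as np

# Hypothetical setup: l2-normalized HOG features of training shape masks,
# plus annotated part locations normalized to the bounding box.
rng = np.random.default_rng(0)
n_train, hog_dim, n_parts = 50, 128, 7

train_hog = rng.random((n_train, hog_dim))
train_hog /= np.linalg.norm(train_hog, axis=1, keepdims=True)
train_parts = rng.random((n_train, n_parts, 2))   # (x, y) per part

def predict_parts(test_hog, k=5):
    q = test_hog / np.linalg.norm(test_hog)
    sims = train_hog @ q                  # cosine similarity (unit-length vectors)
    nn = np.argsort(-sims)[:k]            # top-k nearest training shapes
    return train_parts[nn].mean(axis=0)   # average pooling of part locations

parts = predict_parts(rng.random(hog_dim))
print(parts.shape)                        # (7, 2): one predicted (x, y) per part
```

This corresponds to the simplest of the prediction strategies the summary mentions (average pooling); the paper also considers more sophisticated alternatives such as local HOG-convolution refinement.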
3.2. Unsupervised alignments
- In the unsupervised scenario no ground truth information of the training part locations is available.
- Since no ground truth part locations are available, it does not make sense to align the test image to a small subset of training images.
- More specifically, the authors fit an ellipse to the pixels X of the segmentation mask and compute the local 2-d geometry in the form of the ellipse's two principal axes.
- Regarding the minor axis, the authors cannot easily define an origin in a consistent way.
- This procedure fully defines the frame of reference, see Fig.
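Fitting the ellipse amounts to a PCA of the mask-pixel coordinates: the eigenvectors of their covariance give the two principal axes. A minimal sketch on a toy mask (not real data):

```python
import numpy as np

# Toy segmentation mask: an elongated horizontal blob.
mask = np.zeros((60, 100), dtype=bool)
mask[20:40, 10:90] = True

ys, xs = np.nonzero(mask)
X = np.stack([xs, ys], axis=1).astype(float)  # pixel coordinates of the mask
centroid = X.mean(axis=0)
cov = np.cov((X - centroid).T)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order

major = eigvecs[:, 1]                         # principal (major) axis direction
minor = eigvecs[:, 0]                         # minor axis direction
print(np.round(centroid, 1))                  # blob center, roughly (49.5, 29.5)
```

For this horizontal blob the major axis comes out as (±1, 0); note the eigenvector sign is arbitrary, which reflects the origin ambiguity the summary mentions for the minor axis.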
4. Final Image Representation
- Thus, using features that are precise, but sensitive to common image transformations, is likely to be suboptimal.
- Instead, the authors propose to use Fisher vectors [23] extracted in the predicted parts/regions.
- The authors focus on two approaches: one more relevant to part-based models and another more relevant to consistent regions.
- Together with the object information this approach also captures some of the context that surrounds the object parts.
- For the second approach the authors sample densely every d pixels only on the intersection area of the segmentation mask and the region.
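The second approach's sampling rule, restricting a dense grid of descriptor locations to the intersection of segmentation mask and region, can be sketched as below. The masks and stride are toy values for illustration:

```python
import numpy as np

def dense_points_in(mask, region, d=4):
    """Keep grid points (stride d) that lie inside both mask and region."""
    H, W = mask.shape
    ys, xs = np.mgrid[0:H:d, 0:W:d]          # dense grid every d pixels
    keep = mask[ys, xs] & region[ys, xs]     # mask-region intersection
    return np.stack([xs[keep], ys[keep]], axis=1)

# Toy example: square object mask, region covering the top half of the image.
mask = np.zeros((64, 64), dtype=bool); mask[8:56, 8:56] = True
region = np.zeros((64, 64), dtype=bool); region[0:32, :] = True

pts = dense_points_in(mask, region)
print(len(pts))                              # 72 sampling locations survive
```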
5.1. Experimental setup
- The authors first run their experiments on the CU-2011 Birds dataset [30], one of the most extensive datasets in the fine-grained literature.
- The CU-2011 Birds dataset is composed of 200 sub-species of birds, several of which bear tremendous similarities, especially under common image transformations, see Fig.
- Following the standard evaluation protocol [33, 32, 31], the authors mirror the train images to double the size of the training set and use the bounding boxes to normalize the images.
- The authors use the ground truth part annotations only during learning, unless stated otherwise.
- For Fisher vectors the authors use a Gaussian mixture model with 256 components.
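A Fisher vector encodes a set of local descriptors as soft-assigned first- and second-order statistics under a diagonal GMM. A toy sketch with tiny dimensions (the paper uses 256 components on SIFT descriptors); the weights, means, and data here are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N = 4, 8, 200                         # components, descriptor dim, #descriptors
w = np.full(K, 1.0 / K)                     # mixture weights
mu = rng.random((K, D))                     # component means
sigma = np.full((K, D), 0.5)                # diagonal standard deviations
X = rng.random((N, D))                      # local descriptors

# Soft assignments gamma[n, k] proportional to w_k * N(x_n | mu_k, sigma_k).
logp = -0.5 * (((X[:, None, :] - mu) / sigma) ** 2).sum(-1) \
       - np.log(sigma).sum(-1) + np.log(w)
gamma = np.exp(logp - logp.max(1, keepdims=True))
gamma /= gamma.sum(1, keepdims=True)

# First- and second-order gradient statistics per component.
u = (X[:, None, :] - mu) / sigma            # normalized residuals, (N, K, D)
fv_mu = np.einsum('nk,nkd->kd', gamma, u) / (N * np.sqrt(w)[:, None])
fv_sig = np.einsum('nk,nkd->kd', gamma, u ** 2 - 1) / (N * np.sqrt(2 * w)[:, None])
fv = np.concatenate([fv_mu.ravel(), fv_sig.ravel()])
print(fv.shape)                             # (2*K*D,) = (64,)
```

The 2KD dimensionality explains why the authors limit the number of parts: each part contributes its own Fisher vector to the concatenated representation.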
5.2. Matching vs Classification Descriptors
- In this first experiment the authors evaluate which descriptors are good for describing parts in a fine-grained categorization setting.
- In order to avoid overly strong correlations between the parts, and to control the dimensionality of the final feature vector, the authors use only the following 7 parts, which cover the bird silhouette: beak, belly, forehead, left wing, right wing, tail, and throat.
- Similarly, for the HOG object descriptor the authors compute a HOG vector over the bounding box, rescaled to 100×100 pixels.
- For fine-grained classes the gradients are often quite similar, since they belong to the same superclass.
- The authors plot in the left image of Fig. 5 the individual accuracies per class for Fisher vectors and for HOG, noticing that Fisher vectors outperform HOG for 184 out of the 200 sub-categories.
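For contrast with the Fisher vector, a HOG-style descriptor pools gradient orientation histograms over fixed cells of the rescaled crop. The following is a simplified numpy sketch of that idea, not the exact descriptor implementation used in the paper:

```python
import numpy as np

def hog_like(img, cell=25, bins=8):
    """Gradient-orientation histograms over cells, l2-normalized."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    H, W = img.shape
    feats = []
    for y in range(0, H, cell):
        for x in range(0, W, cell):
            a = ang[y:y + cell, x:x + cell].ravel()
            m = mag[y:y + cell, x:x + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
            feats.append(hist)
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-12)

img = np.tile(np.arange(100.0), (100, 1))    # toy 100x100 horizontal-ramp image
v = hog_like(img)
print(v.shape)                               # 4x4 cells * 8 bins = (128,)
```

Such a template-like descriptor keeps only binned orientation counts, which illustrates the summary's point that fine-grained classes with similar gradients are hard to separate with HOG alone.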
5.3. Supervised alignments
- In the second experiment the authors test whether supervised alignments actually benefit the recognition of fine-grained categories, as compared to a standard classification pipeline.
- The authors use the same 7 parts as in the previous experiment plus a Fisher vector extracted from the whole bounding box.
- Also, inspired by [16], the authors repeat the same experiment using only the predicted location of the beak, whose window captures most of the information around the head.
- Furthermore, the authors note that extracting Fisher vectors on the supervised alignments is 47.1% accurate, which is rather close to the 52.5% obtained when extracting Fisher vectors on the parts provided by the ground truth.
- This indicates that the authors capture the part locations well enough for an appearance descriptor like the Fisher vector.
5.4. Unsupervised Alignments
- In this experiment the authors compare the unsupervised alignments with the supervised ones.
- After extracting the principal axis, the authors split the bird mask into four regions, starting from the highest point and considering only the pixels within the segmentation mask.
- The authors furthermore compare their method against a horizontally split [4× 1] spatial pyramid.
- The authors repeat the experiment considering different numbers of regions.
- The authors furthermore plot the individual accuracy differences per class for supervised and unsupervised alignments in the right picture in Fig.
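The four-region split can be sketched by projecting mask pixels onto the principal axis and cutting the projections into four equal-width intervals. The mask below is a toy upright blob, for illustration only:

```python
import numpy as np

# Toy mask: an upright, elongated blob (60 rows x 10 columns of foreground).
mask = np.zeros((80, 40), dtype=bool)
mask[10:70, 15:25] = True

ys, xs = np.nonzero(mask)
X = np.stack([xs, ys], axis=1).astype(float)
Xc = X - X.mean(axis=0)
_, vecs = np.linalg.eigh(np.cov(Xc.T))
proj = Xc @ vecs[:, -1]                      # projection onto the major axis

# Four equal-width intervals along the axis, assigned per mask pixel.
edges = np.linspace(proj.min(), proj.max(), 5)
labels = np.clip(np.searchsorted(edges, proj, side="right") - 1, 0, 3)
print(np.bincount(labels))                   # pixel count per region
```

For this symmetric blob each of the four regions receives an equal share of the 600 mask pixels; on real bird masks the regions roughly correspond to head, upper body, lower body, and tail.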
5.5. State-of-the-art comparison
- In experiment 4, the authors compare their unsupervised alignments with state-of-the-art methods reported on CU-2011 Birds and Stanford Dogs.
- The authors add color by sampling SIFT descriptors from the opponent color spaces [27].
- Compared to the learned features proposed in [12], unsupervised alignments perform 36.5% better.
- The authors report also some numbers from prior works on CU-2010 Birds, which is the previous version of CU-2011 Birds.
- In Fig. 7 the authors show images of the two categories most confused with each other: Loggerhead Shrike and Great Grey Shrike.
6. Conclusions
- In this paper the authors aim for fine-grained categorization without human interaction.
- Different from prior work, the authors show that localizing distinctive details by roughly aligning the objects allows for successful recognition of fine-grained subclasses.
- The authors present two methods for extracting alignments, requiring different levels of supervision.
- The authors evaluate on the CU-2011 Birds and Stanford Dogs datasets, outperforming the state-of-the-art.
- The authors conclude that rough alignments lead to accurate fine-grained categorization.
Frequently Asked Questions (9)
Q2. What is the purpose of tree pruning?
Since a huge feature space is generated, tree pruning is employed to discard the unnecessary dimensions and make the problem tractable.
Q3. How do the authors retrieve the nearest neighbor images from the training set?
Given the ℓ2-normalized HOG feature of the image shape mask, the authors retrieve the nearest neighbor images from the training set using a query-by-example setting.
Q4. What is the way to calculate the positions of the parts on the test image?
To calculate the positions of the same parts on the test image, one may apply several methods of varying sophistication, ranging from simple average pooling of part locations to local, independent optimization of parts based on HOG convolutions.
Q5. How are Fisher vectors able to better describe the little nuances in the gradients?
Fisher vectors are able to better describe the little nuances in the gradients, since they are specifically designed to capture also first and second order statistics of the gradient information.
Q6. How accurate is the Fisher vector on the supervised alignments?
The authors note that extracting Fisher vectors on the supervised alignments is 47.1% accurate, which is rather close to the 52.5% obtained when extracting Fisher vectors on the parts provided by the ground truth.
Q7. How do the authors compute the final representation of the object?
The Fisher vectors from the 7 parts are concatenated with a Fisher vector from the whole bounding box to arrive at the final object representation.
Q8. How do they determine the sub-category of an object?
They try to determine the object's sub-category using visual properties that can easily be answered by a user, such as whether the object "has stripes".
Q9. What is the difference between the two novelty methods?
Their second novelty is based on the observation that starting from rough alignments instead of precise part locations, noticeable appearance perturbations will appear even between very similar objects, due to common image deformations such as small translations, viewpoint variations and partial occlusions.