Fine-grained recognition without part annotations

doi:10.1109/CVPR.2015.7299194

Fine-Grained Recognition without Part Annotations: Supplementary Material

Jonathan Krause¹, Hailin Jin², Jianchao Yang², Li Fei-Fei¹

¹Stanford University, ²Adobe Research

{jkrause,feifeili}@cs.stanford.edu {hljin,jiayang}@adobe.com

1. Network Architecture Comparison on cars-196

In the main text we showed large gains from using a VGGNet [5] architecture on the CUB-2011 [6] dataset. We show a similar comparison on the cars-196 [3] dataset in Tab. 1. As before, using a VGGNet architecture leads to large gains. Particularly striking is the gain from fine-tuning a VGGNet on cars-196: a basic R-CNN goes from 57.4% to 88.4% accuracy by fine-tuning alone, much larger than the already sizeable gain from fine-tuning a CaffeNet [2] (51.0% to 73.5%).
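To make the comparison concrete, the fine-tuning gains for the basic R-CNN rows of Table 1 can be computed directly. The snippet below is only an illustrative sketch: the dictionary layout and the function name `finetune_gain` are not from the paper, and the numbers are copied from Table 1.

```python
# Accuracies (%) copied from Table 1 for the basic R-CNN rows.
# Keys are (architecture, fine_tuned); this layout is illustrative only.
rcnn_accuracy = {
    ("CaffeNet", False): 51.0,
    ("CaffeNet", True): 73.5,
    ("VGGNet", False): 57.4,
    ("VGGNet", True): 88.4,
}


def finetune_gain(arch):
    """Absolute accuracy gain (percentage points) from fine-tuning."""
    gain = rcnn_accuracy[(arch, True)] - rcnn_accuracy[(arch, False)]
    return round(gain, 1)  # round away floating-point noise


for arch in ("CaffeNet", "VGGNet"):
    print(f"{arch}: +{finetune_gain(arch)} points from fine-tuning")
```

Under these numbers the gain is 22.5 points for the CaffeNet and 31.0 points for the VGGNet.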

2. Additional Visualizations

The visualizations in this section are expanded versions of figures from the main text.

2.1. Pose Nearest Neighbors

In Fig. 1 we show more examples of nearest neighbors using conv4 features, which is our heuristic for measuring the difference in pose between different images (cf. Fig. 4 of the main text). In most cases the nearest neighbors of an image come from a variety of fine-grained classes and tend to have similar poses, justifying their use as a heuristic. In cases where there are potentially many instances with similar poses (e.g. first row, third column, or fifth row, first column), the nearest neighbors may share more than just pose. This heuristic still works reasonably when the pose is relatively unusual (third row, first column, and fourth row, third column), although occasionally small pose differences persist (direction of the head in the third row, third column).
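A nearest-neighbor lookup of this kind might be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the conv4 activations have already been extracted and flattened into one vector per image, and it assumes Euclidean distance, which this supplement does not specify. The function name `nearest_neighbors` and the toy data are hypothetical.

```python
import math
import random


def nearest_neighbors(features, query_index, k=5):
    """Return indices of the k images closest to the query in feature space.

    features: list of equal-length feature vectors, one per image
              (assumed here to be flattened conv4 activations).
    """
    query = features[query_index]
    dists = []
    for i, feat in enumerate(features):
        if i == query_index:
            continue  # exclude the query image itself
        # Euclidean distance is an assumption, not stated in the source.
        dists.append((math.dist(query, feat), i))
    dists.sort()
    return [i for _, i in dists[:k]]


# Toy usage with random stand-in features; real features would come from a CNN.
random.seed(0)
toy_features = [[random.gauss(0, 1) for _ in range(64)] for _ in range(100)]
# Make image 7 a near-duplicate of image 3 so it should rank first.
toy_features[7] = [x + 0.01 for x in toy_features[3]]
print(nearest_neighbors(toy_features, query_index=3, k=3))
```

In the toy data the near-duplicate image is returned as the closest neighbor; with real features, the returned images would be the ones inspected for pose similarity.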

2.2. Foreground Refinement

Additional examples of images where the foreground refinement (cf. Sec. 3.1 and Fig. 3 of the main text) changes the segmentation are given in Fig. 2. Most errors in a GrabCut [4]+class model that can be corrected by a foreground refinement are undersegmentations. In the most extreme case, these undersegmentations can actually be empty, which the foreground refinement fixes. In all cases the segmentation after refinement is better than the segmentation before refinement, though the final segmentation may still have imperfections.

                                    CNN Used
Method                     CaffeNet [2]   VGGNet [5]
R-CNN [1]                  51.0           57.4
R-CNN+ft                   73.5           88.4
CNN+GT BBox                53.9           59.9
CNN+GT BBox+ft             75.4           89.0
PD+DCoP+flip               65.8           75.9
PD+DCoP+flip+ft            81.3           92.6
PD+DCoP+flip+GT BBox+ft    81.8           92.8

Table 1. Analysis of variations of our method on cars-196, comparing performance when using a CaffeNet [2] versus a CNN with a VGGNet architecture [5]. Performance is measured in 196-way accuracy (%).
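For readers unfamiliar with the metric in Table 1, the sketch below shows one way to compute an N-way accuracy. It assumes that "196-way accuracy" means standard top-1 classification accuracy over the 196 classes of cars-196; the function name `n_way_accuracy` and the toy labels are hypothetical and are not data from the paper.

```python
def n_way_accuracy(predicted_labels, true_labels):
    """Percentage of examples whose predicted class equals the true class."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("need at least one example")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return 100.0 * correct / len(true_labels)


# Toy example with class indices in [0, 196); not data from the paper.
toy_true = [0, 17, 42, 195]
toy_pred = [0, 17, 41, 195]
print(n_way_accuracy(toy_pred, toy_true))
```

Here three of the four toy predictions are correct, so the function returns 75.0.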

2.3. Co-segmentation

We show additional qualitative co-segmentation results in Fig. 3 to supplement the results in Fig. 6 of the main text. In general, co-segmentation works quite well, but in cases where part of the background is sufficiently different from the rest of the background the segmentation quality can suffer. Segmentation is also difficult at certain car parts, e.g. the wheels, since they look very different from the rest of the car. It is also difficult to properly segment the bottom of many cars, since the shadow of the car often looks similar to the foreground.

References

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.

[2] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[3] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In International Conference on Computer Vision Workshops (ICCVW), pages 554–561. IEEE, 2013.

[4] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG), volume 23, pages 309–314. ACM, 2004.

[5] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[6] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.

Figure 1. Additional visualizations for nearest neighbors with conv4 features, which tend to preserve pose.

Figure 2. Additional visualizations for the effect of foreground refinement. Within each column of images, the first image is the original image, the second is the GrabCut+class model (labeled "no refinement"), and the third is GrabCut+class+refine (labeled "with refinement").

Figure 3. Additional visualizations of co-segmentation results. The last results in each row are failure cases.
