Open Access Proceedings Article (DOI)

Fine-grained recognition without part annotations

Abstract
Scaling up fine-grained recognition to all domains of fine-grained objects is a challenge the computer vision community will need to face in order to realize its goal of recognizing all object categories. Current state-of-the-art techniques rely heavily upon the use of keypoint or part annotations, but scaling up to hundreds or thousands of domains renders this annotation cost-prohibitive for all but the most important categories. In this work we propose a method for fine-grained recognition that uses no part annotations. Our method is based on generating parts using co-segmentation and alignment, which we combine in a discriminative mixture. Experimental results show its efficacy, demonstrating state-of-the-art results even when compared to methods that use part annotations during training.



Fine-Grained Recognition without Part Annotations: Supplementary Material

Jonathan Krause¹  Hailin Jin²  Jianchao Yang²  Li Fei-Fei¹
¹Stanford University  ²Adobe Research
{jkrause,feifeili}@cs.stanford.edu  {hljin,jiayang}@adobe.com
1. Network Architecture Comparison on cars-196

In the main text we showed large gains from using a VGGNet [5] architecture on the CUB-2011 [6] dataset. We show a similar comparison on the cars-196 [3] dataset in Tab. 1. As before, using a VGGNet architecture leads to large gains. Particularly striking is the gain from fine-tuning a VGGNet on cars-196: a basic R-CNN goes from 57.4% to 88.4% accuracy by fine-tuning alone, a much larger improvement than the already sizeable gain from fine-tuning a CaffeNet [2].
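The table's "+ft" rows compare full fine-tuning against using frozen pretrained features. As a rough illustration of the cheapest point on that spectrum — training only a new softmax classifier head on frozen CNN features — here is a minimal numpy sketch; the feature dimensions, class count, and Gaussian "features" are synthetic stand-ins, not from the paper:

```python
import numpy as np

# Sketch: train a new classifier head on frozen "CNN features".
# All data here is synthetic; one Gaussian cluster stands in for each class.
rng = np.random.default_rng(0)
n_classes, feat_dim, n_per_class = 5, 64, 40

centers = rng.normal(size=(n_classes, feat_dim))
X = np.vstack([c + 0.5 * rng.normal(size=(n_per_class, feat_dim)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

W = np.zeros((feat_dim, n_classes))
b = np.zeros(n_classes)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(200):  # plain gradient descent on the cross-entropy loss
    p = softmax(X @ W + b)
    p[np.arange(len(y)), y] -= 1.0   # dL/dz for softmax + cross-entropy
    W -= 0.01 * (X.T @ p) / len(y)
    b -= 0.01 * p.mean(axis=0)

acc = (np.argmax(X @ W + b, axis=1) == y).mean()
print(f"training accuracy after head-only training: {acc:.2f}")
```

Full fine-tuning additionally backpropagates into the convolutional layers, which is what produces the much larger gains reported in Tab. 1.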
2. Additional Visualizations
The visualizations in this section are expanded versions
of figures from the main text.
2.1. Pose Nearest Neighbors

In Fig. 1 we show more examples of nearest neighbors using conv4 features, which is our heuristic for measuring the difference in pose between different images (cf. Fig. 4 of the main text). In most cases the nearest neighbors of an image come from a variety of fine-grained classes and tend to have similar poses, justifying their use as a heuristic. In cases where there are potentially many instances with similar poses (e.g. first row, third column, or fifth row, first column), the nearest neighbors may share more than just pose. This heuristic still works reasonably when the pose is relatively unusual (third row, first column, and fourth row, third column), although occasionally small pose differences persist (direction of the head in the third row, third column).
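The retrieval step behind this heuristic can be sketched as plain cosine nearest-neighbor search over pooled feature vectors. In the sketch below the descriptors are random stand-ins (in the paper they are conv4 activations), and the helper name is our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest_neighbors(feats, query_idx, k=5):
    """Return indices of the k images nearest to feats[query_idx] (cosine)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = f @ f[query_idx]
    sims[query_idx] = -np.inf            # exclude the query itself
    return np.argsort(-sims)[:k]

feats = rng.normal(size=(100, 512))      # 100 images, 512-dim descriptors
feats[7] = feats[3] + 0.01 * rng.normal(size=512)  # image 7: near-identical "pose" to 3
print(nearest_neighbors(feats, 3, k=1))  # -> [7]
```

Because mid-level conv activations are spatially organized, two images with similar descriptors tend to have foreground parts in similar positions, which is why the neighbors in Fig. 1 share pose rather than class.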
2.2. Foreground Refinement

Additional examples of images where the foreground refinement (cf. Sec. 3.1 and Fig. 3 of the main text) changes the segmentation are given in Fig. 2. Most errors in a GrabCut [4]+class model which can be corrected by a foreground refinement are undersegmentations. In the most extreme case, these undersegmentations can actually be empty, which the foreground refinement fixes. In all cases the segmentation after refinement is better than the segmentation before refinement, though the final segmentation may still have imperfections.

Method                     CaffeNet [2]   VGGNet [5]
R-CNN [1]                  51.0           57.4
R-CNN+ft                   73.5           88.4
CNN+GT BBox                53.9           59.9
CNN+GT BBox+ft             75.4           89.0
PD+DCoP+flip               65.8           75.9
PD+DCoP+flip+ft            81.3           92.6
PD+DCoP+flip+GT BBox+ft    81.8           92.8
Table 1. Analysis of variations of our method on cars-196, comparing performance when using a CaffeNet [2] versus a CNN with a VGGNet architecture [5]. Performance is measured in 196-way accuracy.
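The failure mode described here — an initial segmentation that is undersegmented or entirely empty, rescued by a class-level prior — can be illustrated with a deliberately simplified sketch. This is not the paper's exact refinement procedure; the function, thresholds, and the idea of falling back to a mean class mask are our own illustrative assumptions:

```python
import numpy as np

def refine_mask(mask, class_prior, min_frac=0.05, prior_thresh=0.5):
    """mask: HxW bool initial segmentation; class_prior: HxW in [0,1],
    e.g. the mean foreground mask over images of the same class."""
    h, w = mask.shape
    if mask.sum() < min_frac * h * w:      # degenerate: empty or undersegmented
        return class_prior > prior_thresh  # fall back to the class-level prior
    # otherwise, grow the mask with pixels the prior marks as confident foreground
    return mask | (class_prior > 0.9)

prior = np.zeros((10, 10))
prior[2:8, 2:8] = 1.0                      # class prior: central foreground blob
empty = np.zeros((10, 10), dtype=bool)     # the "empty GrabCut output" failure case
refined = refine_mask(empty, prior)
print(refined.sum())                       # -> 36 foreground pixels recovered
```

The point of the sketch is only that a per-image segmenter can fail completely while class-level statistics still recover a usable foreground, which matches the behavior shown in Fig. 2.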
2.3. Co-segmentation

We show additional qualitative co-segmentation results in Fig. 3 to supplement the results in Fig. 6 of the main text. In general, co-segmentation works quite well, but in cases where part of the background is sufficiently different from the rest of the background the segmentation quality can suffer. Segmentation is also difficult at certain car parts, e.g. the wheels, since they look very different from the rest of the car. It is also difficult to properly segment the bottom of many cars, since the shadow of the car often looks similar to the foreground.
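Both failure modes stem from co-segmentation being driven by shared appearance models: pixels that look unlike the pooled foreground model (wheels) or like it (shadows) get mislabeled. A toy sketch of this appearance-driven idea — a deliberate simplification using shared mean colors, not the paper's algorithm — makes the mechanism concrete:

```python
import numpy as np

rng = np.random.default_rng(2)

def co_segment(images, init_masks):
    """Pool pixels across all images, fit shared foreground/background
    mean-color models, then label each pixel by the nearer model."""
    pix = np.concatenate([im.reshape(-1, 3) for im in images])
    lab = np.concatenate([m.reshape(-1) for m in init_masks])
    fg_mu = pix[lab].mean(axis=0)          # shared foreground color model
    bg_mu = pix[~lab].mean(axis=0)         # shared background color model
    out = []
    for im in images:
        d_fg = ((im - fg_mu) ** 2).sum(axis=-1)
        d_bg = ((im - bg_mu) ** 2).sum(axis=-1)
        out.append(d_fg < d_bg)            # nearer model wins
    return out

def make():
    """One tiny synthetic 'image': a red square on a noisy green background."""
    im = np.tile([0.1, 0.8, 0.1], (8, 8, 1)) + 0.05 * rng.normal(size=(8, 8, 3))
    im[2:6, 2:6] = [0.9, 0.1, 0.1]
    return im

images = [make(), make()]
init = [np.zeros((8, 8), bool) for _ in images]
for m in init:
    m[3:5, 3:5] = True                     # rough initial foreground guesses
masks = co_segment(images, init)
print(masks[0][3, 3], masks[0][0, 0])      # car pixel True, background pixel False
```

Under such a model, a dark shadow pixel whose color sits closer to the pooled foreground mean than to the background mean would be labeled foreground, which is exactly the undersides-of-cars error seen in Fig. 3.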
Figure 1. Additional visualizations for nearest neighbors with conv4 features, which tend to preserve pose.
Figure 2. Additional visualizations for the effect of foreground refinement. Within each column of images, the first image is the original image, the second is the GrabCut+class model, and the third is GrabCut+class+refine.
Figure 3. Additional visualizations of co-segmentation results. The last results in each row are failure cases.

References
[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014.
[2] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[3] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In International Conference on Computer Vision Workshops (ICCVW), pages 554–561. IEEE, 2013.
[4] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM Transactions on Graphics (TOG), volume 23, pages 309–314. ACM, 2004.
[5] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[6] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.