Bilinear CNN Models for Fine-grained Visual Recognition

Open Access | Posted Content

Tsung-Yu Lin, +2 more

- 29 Apr 2015 -

arXiv: Computer Vision and Pattern Recog...

TLDR

This paper proposes bilinear models, an architecture consisting of two feature extractors whose outputs are multiplied using the outer product at each image location and pooled to obtain an image descriptor. The resulting descriptor models local pairwise feature interactions in a translationally invariant manner.

Abstract:

We propose bilinear models, a recognition architecture that consists of two feature extractors whose outputs are multiplied using the outer product at each location of the image and pooled to obtain an image descriptor. This architecture can model local pairwise feature interactions in a translationally invariant manner, which is particularly useful for fine-grained categorization. It also generalizes various orderless texture descriptors such as the Fisher vector, VLAD and O2P. We present experiments with bilinear models where the feature extractors are based on convolutional neural networks. The bilinear form simplifies gradient computation and allows end-to-end training of both networks using image labels only. Using networks initialized from the ImageNet dataset followed by domain-specific fine-tuning, we obtain 84.1% accuracy on the CUB-200-2011 dataset, requiring only category labels at training time. We present experiments and visualizations that analyze the effects of fine-tuning and the choice of the two networks on the speed and accuracy of the models. Results show that the architecture compares favorably to the existing state of the art on a number of fine-grained datasets while being substantially simpler and easier to train. Moreover, our most accurate model is fairly efficient, running at 8 frames/sec on an NVIDIA Tesla K40 GPU. The source code for the complete system will be made available at this http URL
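
The pooling step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes each extractor's output has already been flattened to a matrix with one row per image location, and it assumes sum pooling, which the abstract does not specify. The function names `bilinear_pool` and `bilinear_pool_grad` are hypothetical and are not taken from the authors' released code.

```python
import numpy as np


def bilinear_pool(feat_a, feat_b):
    """Sum-pool the outer products of two feature sets over image locations.

    feat_a: (L, M) array, output of the first extractor at L locations.
    feat_b: (L, N) array, output of the second extractor at the same locations.
    Returns a flat descriptor of length M * N.
    """
    if feat_a.shape[0] != feat_b.shape[0]:
        raise ValueError("both extractors must cover the same locations")
    # sum_l outer(a_l, b_l) is the matrix product A^T B.
    return (feat_a.T @ feat_b).reshape(-1)


def bilinear_pool_grad(feat_a, feat_b, grad_out):
    """Backpropagate a gradient on the descriptor to both feature sets.

    grad_out: flat array of length M * N (gradient of the loss w.r.t. the
    descriptor). Because the descriptor is linear in each input, each
    gradient is a single matrix product.
    """
    g = grad_out.reshape(feat_a.shape[1], feat_b.shape[1])  # (M, N)
    grad_a = feat_b @ g.T  # (L, N) @ (N, M) -> (L, M)
    grad_b = feat_a @ g    # (L, M) @ (M, N) -> (L, N)
    return grad_a, grad_b


# Toy data standing in for CNN feature maps (assumption: 6 locations).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
B = rng.standard_normal((6, 4))

descriptor = bilinear_pool(A, B)
print(descriptor.shape)  # (12,)

# Shuffling the locations leaves the descriptor unchanged, which is the
# orderless property behind the translational invariance.
perm = rng.permutation(6)
print(np.allclose(descriptor, bilinear_pool(A[perm], B[perm])))  # True
```

In this sketch the descriptor size grows as the product of the two feature dimensions, so realistic CNN feature widths would give a much larger vector than the toy example.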

Citations

Proceedings Article

Spatial transformer networks

Max Jaderberg, +3 more

TL;DR: This work introduces a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network, and can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps.

Journal Article | DOI

Recent advances in convolutional neural networks

Jiuxiang Gu, +10 more

- 01 May 2018 -

Pattern Recognition

TL;DR: This article gives a broad survey of recent advances in convolutional neural networks, discussing improvements to CNNs in layer design, activation function, loss function, regularization, optimization and fast computation.

Posted Content

Convolutional Two-Stream Network Fusion for Video Action Recognition

Christoph Feichtenhofer, +2 more

- 22 Apr 2016 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper shows that a spatial and a temporal network can be fused at the last convolution layer without loss of performance and with a substantial saving in parameters, and that pooling abstract convolutional features over spatiotemporal neighbourhoods further boosts performance.

Posted Content

Recent Advances in Convolutional Neural Networks

Jiuxiang Gu, +11 more

- 22 Dec 2015 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper details the improvements of CNN on different aspects, including layer design, activation function, loss function, regularization, optimization and fast computation, and introduces various applications of convolutional neural networks in computer vision, speech and natural language processing.

Proceedings Article | DOI

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

Akira Fukui, +5 more

TL;DR: This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.

References

Proceedings Article | DOI

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting networks won 1st place on the ILSVRC 2015 classification task.

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: The authors report state-of-the-art performance with a deep convolutional neural network that consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

Proceedings Article | DOI

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Journal Article | DOI

Gradient-based learning applied to document recognition

Yann LeCun, +6 more

TL;DR: This article shows that gradient-based learning can be used to synthesize a complex decision surface that classifies high-dimensional patterns such as handwritten characters, and proposes graph transformer networks (GTNs) for document recognition.

Related Papers (5)

Deep Residual Learning for Image Recognition

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet: A large-scale hierarchical image database

Very Deep Convolutional Networks for Large-Scale Image Recognition

Going deeper with convolutions