Very Deep Convolutional Networks for Large-Scale Image Recognition

Home
/
Papers
/
Very Deep Convolutional Networks for Large-Scale Image Recognition

Proceedings Article•

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan¹, Andrew Zisserman¹•Institutions (1)

04 Sep 2014-

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

read less

Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

...read moreread less

Citations

PDF

Open Access

More filters

Posted Content•

Fully Connected Deep Structured Networks

[...]

Alexander G. Schwing, Raquel Urtasun

09 Mar 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work unifies this two-stage process for semantic segmentation into a single joint training algorithm and demonstrates the method on the semantic image segmentation task and shows encouraging results on the challenging PASCAL VOC 2012 dataset.

...read moreread less

Abstract: Convolutional neural networks with many layers have recently been shown to achieve excellent results on many high-level tasks such as image classification, object detection and more recently also semantic segmentation. Particularly for semantic segmentation, a two-stage procedure is often employed. Hereby, convolutional networks are trained to provide good local pixel-wise features for the second step being traditionally a more global graphical model. In this work we unify this two-stage process into a single joint training algorithm. We demonstrate our method on the semantic image segmentation task and show encouraging results on the challenging PASCAL VOC 2012 dataset.

...read moreread less

324 citations

Cites background or methods from "Very Deep Convolutional Networks fo..."

..., we employ the 16 layer DeepNet model [31]....
[...]
...It was shown independently by many authors [31, 4], that successively increasing the number of parameters during training typically yields better performance due to better initialization of larger models....
[...]
...They have been shown to achieve state-of-the-art performance in a variety of vision problems, including image classification [19, 31], object detection [11], human pose estimation [32], stereo [36], and caption generation [15, 24, 35, 8, 14, 10]....
[...]
...Whereas the latter combines dense conditional random fields [17] with the fully convolutional networks presented by Long et al. [21], we employ and modify the 16 layer DeepNet architecture presented in work by Simonyan and Zisserman [31]....
[...]
...[21], we employ and modify the 16 layer DeepNet architecture presented in work by Simonyan and Zisserman [31]....
[...]

Proceedings Article•

Volumetric convnets with mixed residual connections for automated prostate segmentation from 3d MR images

[...]

Lequan Yu¹, Xin Yang¹, Hao Chen¹, Jing Qin², Pheng-Ann Heng¹ - Show less +1 more•Institutions (2)

The Chinese University of Hong Kong¹, Hong Kong Polytechnic University²

10 Feb 2017

TL;DR: The proposed volumetric convolutional neural network (ConvNet) with mixed residual connections is general enough and can be easily extended to other medical image analysis tasks, especially ones with limited training data.

...read moreread less

Abstract: Automated prostate segmentation from 3D MR images is very challenging due to large variations of prostate shape and indistinct prostate boundaries. We propose a novel volumetric convolutional neural network (ConvNet) with mixed residual connections to cope with this challenging problem. Compared with previous methods, our volumetric ConvNet has two compelling advantages. First, it is implemented in a 3D manner and can fully exploit the 3D spatial contextual information of input data to perform efficient, precise and volume-to-volume prediction. Second and more important, the novel combination of residual connections (i.e., long and short) can greatly improve the training efficiency and discriminative capability of our network by enhancing the information propagation within the ConvNet both locally and globally. While the forward propagation of location information can improve the segmentation accuracy, the smooth backward propagation of gradient flow can accelerate the convergence speed and enhance the discrimination capability. Extensive experiments on the open MICCAI PROMISE12 challenge dataset corroborated the effectiveness of the proposed volumetric ConvNet with mixed residual connections. Our method ranked the first in the challenge, outperforming other competitors by a large margin with respect to most of evaluation metrics. The proposed volumetric ConvNet is general enough and can be easily extended to other medical image analysis tasks, especially ones with limited training data.

...read moreread less

323 citations

Cites background from "Very Deep Convolutional Networks fo..."

...Previous studies (Simonyan and Zisserman 2014; Tran et al. 2015) have demonstrated that smaller convolutional kernels are more efficient in ConvNet design....
[...]

Proceedings Article•DOI•

Annotating Object Instances with a Polygon-RNN

[...]

Lluis Castrejon¹, Kaustav Kundu¹, Raquel Urtasun¹, Sanja Fidler¹•Institutions (1)

University of Toronto¹

01 Jul 2017

TL;DR: This work proposes an approach for semi-automatic annotation of object instances that takes as input an image crop and sequentially produces vertices of the polygon outlining the object, producing as accurate segmentation as desired by the annotator.

...read moreread less

Abstract: We propose an approach for semi-automatic annotation of object instances. While most current methods treat object segmentation as a pixel-labeling problem, we here cast it as a polygon prediction task, mimicking how most current datasets have been annotated. In particular, our approach takes as input an image crop and sequentially produces vertices of the polygon outlining the object. This allows a human annotator to interfere at any time and correct a vertex if needed, producing as accurate segmentation as desired by the annotator. We show that our approach speeds up the annotation process by a factor of 4.7 across all classes in Cityscapes, while achieving 78:4% agreement in IoU with original ground-truth, matching the typical agreement between human annotators. For cars, our speed-up factor is 7.3 for an agreement of 82:2%. We further show generalization capabilities of our approach to unseen datasets.

...read moreread less

323 citations

Cites methods from "Very Deep Convolutional Networks fo..."

...We adopt the VGG-16 architecture [27] and modify it for the purpose of our task....
[...]

Posted Content•

StyleBank: An Explicit Representation for Neural Image Style Transfer

[...]

Dongdong Chen¹, Lu Yuan², Jing Liao², Nenghai Yu¹, Gang Hua² - Show less +1 more•Institutions (2)

University of Science and Technology of China¹, Microsoft²

27 Mar 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes StyleBank, which is composed of multiple convolution filter banks and each filter bank explicitly represents one style, for neural image style transfer, the first style transfer network that links back to traditional texton mapping methods, and hence provides new understanding on neural style transfer.

...read moreread less

Abstract: We propose StyleBank, which is composed of multiple convolution filter banks and each filter bank explicitly represents one style, for neural image style transfer. To transfer an image to a specific style, the corresponding filter bank is operated on top of the intermediate feature embedding produced by a single auto-encoder. The StyleBank and the auto-encoder are jointly learnt, where the learning is conducted in such a way that the auto-encoder does not encode any style information thanks to the flexibility introduced by the explicit filter bank representation. It also enables us to conduct incremental learning to add a new image style by learning a new filter bank while holding the auto-encoder fixed. The explicit style representation along with the flexible network design enables us to fuse styles at not only the image level, but also the region level. Our method is the first style transfer network that links back to traditional texton mapping methods, and hence provides new understanding on neural style transfer. Our method is easy to train, runs in real-time, and produces results that qualitatively better or at least comparable to existing methods.

...read moreread less

323 citations

Cites background from "Very Deep Convolutional Networks fo..."

..., pre-trained VGG-16 [36]) feature domain....
[...]
...In all of our experiments, we compute content loss at layer relu4 2 and style loss at layer relu1 2, relu2 2, relu3 2, and relu4 2 of the pre-trained VGG-16 network....
[...]
...DeepDream [1] may be the first attempt to generate artistic work using CNN. Inspired by this work, Gatys et al. [12] successfully applies CNN (pre-trained VGG-16 networks) to neural style transfer and produces more impressive stylization results compared to classic texture transfer methods....
[...]
...(4) where F l and Gi are respectively feature map and Gram matrix computed from layer l of VGG-16 network [36](pretrained on the ImageNet dataset [34])....
[...]
...These CNN algorithms either apply an iterative optimization mechanism [12], or directly learn a feed-forward generator network [19, 37] to seek an image close to both the content image and the style image – all measured in the CNN (i.e., pre-trained VGG-16 [36]) feature domain....
[...]

Proceedings Article•DOI•

The microsoft 2016 conversational speech recognition system

[...]

Wayne Xiong¹, Jasha Droppo¹, Xuedong Huang¹, Frank Seide¹, Michael L. Seltzer¹, Andreas Stolcke¹, Dong Yu¹, Geoffrey Zweig¹ - Show less +4 more•Institutions (1)

Microsoft¹

05 Mar 2017

TL;DR: Microsoft's conversational speech recognition system is described, in which recent developments in neural-network-based acoustic and language modeling are combined to advance the state of the art on the Switchboard recognition task.

...read moreread less

Abstract: We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward and backward running RNNLMs, and word posterior-based system combination provide a 20% boost. The best single system uses a ResNet architecture acoustic model with RNNLM rescoring, and achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The combined system has an error rate of 6.2%, representing an improvement over previously reported results on this benchmark task.

...read moreread less

322 citations

Cites background from "Very Deep Convolutional Networks fo..."

...The first is the VGG architecture of [22]....
[...]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
…
152
153
154
155
156
157
158
…
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Proceedings Article•DOI•

ImageNet: A large-scale hierarchical image database

[...]

Jia Deng¹, Wei Dong¹, Richard Socher¹, Li-Jia Li¹, Kai Li¹, Li Fei-Fei¹ - Show less +2 more•Institutions (1)

Princeton University¹

20 Jun 2009

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

...read moreread less

Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

...read moreread less

49,639 citations

Proceedings Article•DOI•

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

[...]

Ross Girshick¹, Jeff Donahue¹, Trevor Darrell¹, Jitendra Malik¹•Institutions (1)

University of California, Berkeley¹

23 Jun 2014

TL;DR: RCNN as discussed by the authors combines CNNs with bottom-up region proposals to localize and segment objects, and when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

...read moreread less

Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

...read moreread less

21,729 citations

Posted Content•

Fully Convolutional Networks for Semantic Segmentation

[...]

Jonathan Long¹, Evan Shelhamer¹, Trevor Darrell¹•Institutions (1)

University of California, Berkeley¹

14 Nov 2014-arXiv: Computer Vision and Pattern Recognition

TL;DR: It is shown that convolutional networks by themselves, trained end- to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation.

...read moreread less

Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.

...read moreread less

9,803 citations

Journal Article•DOI•

Backpropagation applied to handwritten zip code recognition

[...]

Yann LeCun¹, Bernhard E. Boser¹, John S. Denker¹, D. Henderson¹, Richard Howard¹, W. Hubbard¹, Lawrence D. Jackel¹ - Show less +3 more•Institutions (1)

Bell Labs¹

01 Dec 1989-Neural Computation

TL;DR: This paper demonstrates how constraints from the task domain can be integrated into a backpropagation network through the architecture of the network, successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service.

...read moreread less

Abstract: The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.

...read moreread less

9,775 citations

Journal Article•DOI•

The Pascal Visual Object Classes Challenge: A Retrospective

[...]

Mark Everingham¹, S. M. Eslami², Luc Van Gool³, Christopher Williams⁴, John Winn², Andrew Zisserman⁵ - Show less +2 more•Institutions (5)

University of Leeds¹, Microsoft², ETH Zurich³, University of Edinburgh⁴, University of Oxford⁵

01 Jan 2015-International Journal of Computer Vision

TL;DR: A review of the Pascal Visual Object Classes challenge from 2008-2012 and an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.

...read moreread less

Abstract: The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008---2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community's progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.

...read moreread less

6,061 citations