Very Deep Convolutional Networks for Large-Scale Image Recognition

Home
/
Papers
/
Very Deep Convolutional Networks for Large-Scale Image Recognition

Proceedings Article•

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan¹, Andrew Zisserman¹•Institutions (1)

01 Jan 2015-

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

read less

Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

...read moreread less

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Tell Me Where to Look: Guided Attention Inference Network

[...]

Kunpeng Li¹, Ziyan Wu², Kuan-Chuan Peng², Jan Ernst², Yun Fu¹ - Show less +1 more•Institutions (2)

Northeastern University¹, Princeton University²

27 Feb 2018

TL;DR: This work makes attention maps an explicit and natural component of the end-to-end training for the first time and provides self-guidance directly on these maps by exploring supervision from the network itself to improve them, and seamlessly bridge the gap between using weak and extra supervision if available.

...read moreread less

Abstract: Weakly supervised learning with only coarse labels can obtain visual explanations of deep neural network such as attention maps by back-propagating gradients. These attention maps are then available as priors for tasks such as object localization and semantic segmentation. In one common framework we address three shortcomings of previous approaches in modeling such attention maps: We (1) make attention maps an explicit and natural component of the end-to-end training for the first time, (2) provide self-guidance directly on these maps by exploring supervision from the network itself to improve them, and (3) seamlessly bridge the gap between using weak and extra supervision if available. Despite its simplicity, experiments on the semantic segmentation task demonstrate the effectiveness of our methods. We clearly surpass the state-of-the-art on PASCAL VOC 2012 test and val. sets. Besides, the proposed framework provides a way not only explaining the focus of the learner but also feeding back with direct guidance towards specific tasks. Under mild assumptions our method can also be understood as a plug-in to existing weakly supervised learners to improve their generalization performance.

...read moreread less

451 citations

Proceedings Article•DOI•

Sliced Wasserstein Discrepancy for Unsupervised Domain Adaptation

[...]

Chen-Yu Lee¹, Tanmay Batra¹, Mohammad Haris Baig¹, Daniel Ulbricht¹•Institutions (1)

Apple Inc.¹

10 Mar 2019

TL;DR: The proposed sliced Wasserstein discrepancy (SWD) is designed to capture the natural notion of dissimilarity between the outputs of task-specific classifiers and enables efficient distribution alignment in an end-to-end trainable fashion.

...read moreread less

Abstract: In this work, we connect two distinct concepts for unsupervised domain adaptation: feature distribution alignment between domains by utilizing the task-specific decision boundary and the Wasserstein metric. Our proposed sliced Wasserstein discrepancy (SWD) is designed to capture the natural notion of dissimilarity between the outputs of task-specific classifiers. It provides a geometrically meaningful guidance to detect target samples that are far from the support of the source and enables efficient distribution alignment in an end-to-end trainable fashion. In the experiments, we validate the effectiveness and genericness of our method on digit and sign recognition, image classification, semantic segmentation, and object detection.

...read moreread less

451 citations

Proceedings Article•DOI•

A discriminative CNN video representation for event detection

[...]

Zhongwen Xu¹, Yi Yang¹, Alexander G. Hauptmann²•Institutions (2)

University of Technology, Sydney¹, Carnegie Mellon University²

07 Jun 2015

TL;DR: In this paper, a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available is proposed, which leverages deep convolutional neural networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkits.

...read moreread less

Abstract: In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkits. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating frame level static features, we show that performance can be significantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the Mean Average Precision (mAP) from 27.6% to 36.8% for the TRECVID MEDTest 14 dataset and from 34.0% to 44.6% for the TRECVID MEDTest 13 dataset.

...read moreread less

451 citations

Journal Article•DOI•

Deep learning based tissue analysis predicts outcome in colorectal cancer

[...]

Dmitrii Bychkov¹, Nina Linder¹, Nina Linder², Riku Turkki¹, Stig Nordling¹, Panu E. Kovanen¹, Clare Verrill³, Margarita Walliander¹, Mikael Lundin¹, Caj Haglund¹, Johan Lundin⁴, Johan Lundin¹ - Show less +8 more•Institutions (4)

University of Helsinki¹, Uppsala University², University of Oxford³, Karolinska Institutet⁴

21 Feb 2018-Scientific Reports

TL;DR: The results suggest that state-of-the-art deep learning techniques can extract more prognostic information from the tissue morphology of colorectal cancer than an experienced human observer.

...read moreread less

Abstract: Image-based machine learning and deep learning in particular has recently shown expert-level accuracy in medical image classification. In this study, we combine convolutional and recurrent architectures to train a deep network to predict colorectal cancer outcome based on images of tumour tissue samples. The novelty of our approach is that we directly predict patient outcome, without any intermediate tissue classification. We evaluate a set of digitized haematoxylin-eosin-stained tumour tissue microarray (TMA) samples from 420 colorectal cancer patients with clinicopathological and outcome data available. The results show that deep learning-based outcome prediction with only small tissue areas as input outperforms (hazard ratio 2.3; CI 95% 1.79-3.03; AUC 0.69) visual histological assessment performed by human experts on both TMA spot (HR 1.67; CI 95% 1.28-2.19; AUC 0.58) and whole-slide level (HR 1.65; CI 95% 1.30-2.15; AUC 0.57) in the stratification into low- and high-risk patients. Our results suggest that state-of-the-art deep learning techniques can extract more prognostic information from the tissue morphology of colorectal cancer than an experienced human observer.

...read moreread less

451 citations

Proceedings Article•DOI•

Designing Energy-Efficient Convolutional Neural Networks Using Energy-Aware Pruning

[...]

Tien-Ju Yang¹, Yu-Hsin Chen¹, Vivienne Sze¹•Institutions (1)

Massachusetts Institute of Technology¹

21 Jul 2017

TL;DR: Zhang et al. as discussed by the authors proposed an energy-aware pruning algorithm for CNNs, which directly uses the energy consumption of a CNN to guide the pruning process, and the energy estimation methodology uses parameters extrapolated from actual hardware measurements.

...read moreread less

Abstract: Deep convolutional neural networks (CNNs) are indispensable to state-of-the-art computer vision algorithms. However, they are still rarely deployed on battery-powered mobile devices, such as smartphones and wearable gadgets, where vision algorithms can enable many revolutionary real-world applications. The key limiting factor is the high energy consumption of CNN processing due to its high computational complexity. While there are many previous efforts that try to reduce the CNN model size or the amount of computation, we find that they do not necessarily result in lower energy consumption. Therefore, these targets do not serve as a good metric for energy cost estimation. To close the gap between CNN design and energy consumption optimization, we propose an energy-aware pruning algorithm for CNNs that directly uses the energy consumption of a CNN to guide the pruning process. The energy estimation methodology uses parameters extrapolated from actual hardware measurements. The proposed layer-by-layer pruning algorithm also prunes more aggressively than previously proposed pruning methods by minimizing the error in the output feature maps instead of the filter weights. For each layer, the weights are first pruned and then locally fine-tuned with a closed-form least-square solution to quickly restore the accuracy. After all layers are pruned, the entire network is globally fine-tuned using back-propagation. With the proposed pruning method, the energy consumption of AlexNet and GoogLeNet is reduced by 3.7x and 1.6x, respectively, with less than 1% top-5 accuracy loss. We also show that reducing the number of target classes in AlexNet greatly decreases the number of weights, but has a limited impact on energy consumption.

...read moreread less

451 citations

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
…
174
175
176
177
178
179
180
…
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Book Chapter•DOI•

I and J

[...]

William Marsden

01 Jan 2012

139,059 citations

Proceedings Article•

ImageNet Classification with Deep Convolutional Neural Networks

[...]

Alex Krizhevsky¹, Ilya Sutskever¹, Geoffrey E. Hinton¹•Institutions (1)

University of Toronto¹

03 Dec 2012

TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

...read moreread less

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overriding in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

...read moreread less

73,978 citations

Proceedings Article•DOI•

ImageNet: A large-scale hierarchical image database

[...]

Jia Deng¹, Wei Dong¹, Richard Socher¹, Li-Jia Li¹, Kai Li¹, Li Fei-Fei¹ - Show less +2 more•Institutions (1)

Princeton University¹

20 Jun 2009

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

...read moreread less

Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

...read moreread less

49,639 citations

Journal Article•DOI•

A and V.

[...]

Robert W. Stephenson

01 Nov 1962-British Journal of Ophthalmology

40,330 citations

Proceedings Article•DOI•

Going deeper with convolutions

[...]

Christian Szegedy¹, Wei Liu², Yangqing Jia¹, Pierre Sermanet¹, Scott Reed³, Dragomir Anguelov¹, Dumitru Erhan¹, Vincent Vanhoucke¹, Andrew Rabinovich - Show less +5 more•Institutions (3)

Google¹, University of North Carolina at Chapel Hill², University of Michigan³

07 Jun 2015

TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

...read moreread less

Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

...read moreread less

40,257 citations