Very Deep Convolutional Networks for Large-Scale Image Recognition

Proceedings Article•

Karen Simonyan¹, Andrew Zisserman¹•Institutions (1)

04 Sep 2014

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small (3x3) convolution filters, and shows that pushing the depth to 16-19 weight layers significantly improves on the prior-art configurations.

Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
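The abstract does not spell out why very small filters pair well with greater depth. A common back-of-envelope argument, sketched here under stated assumptions, is that a stack of stride-1 3x3 convolutions covers the same receptive field as one larger filter while using fewer weights. The function names and the channel width `C = 64` below are illustrative choices, not taken from this page.

```python
def stacked_receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel: int, channels: int, num_layers: int = 1) -> int:
    """Weight count (biases ignored) for `num_layers` k x k convolutions
    that each map `channels` inputs to `channels` outputs."""
    return num_layers * kernel * kernel * channels * channels


C = 64  # hypothetical channel width, chosen only for illustration

# Three stacked 3x3 layers see the same 7x7 window as one 7x7 layer.
rf_stack = stacked_receptive_field(3)

# Weight cost: 3 * (3*3*C*C) = 27*C^2 versus 7*7*C*C = 49*C^2.
w_stack = conv_weights(3, C, num_layers=3)
w_single = conv_weights(7, C)

print(rf_stack, w_stack, w_single)
```

The stack also places a non-linearity after every layer, so the smaller-filter design is both cheaper in weights and more expressive per unit of receptive field.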

Citations

Journal Article•DOI•

Deep Expectation of Real and Apparent Age from a Single Image Without Facial Landmarks

Rasmus Rothe¹, Radu Timofte¹, Luc Van Gool²•Institutions (2)

ETH Zurich¹, Katholieke Universiteit Leuven²

01 Apr 2018-International Journal of Computer Vision

TL;DR: A deep learning solution to age estimation from a single face image without the use of facial landmarks is proposed and the IMDB-WIKI dataset is introduced, the largest public dataset of face images with age and gender labels.

Abstract: In this paper we propose a deep learning solution to age estimation from a single face image without the use of facial landmarks and introduce the IMDB-WIKI dataset, the largest public dataset of face images with age and gender labels. If the real age estimation research spans over decades, the study of apparent age estimation or the age as perceived by other humans from a face image is a recent endeavor. We tackle both tasks with our convolutional neural networks (CNNs) of VGG-16 architecture which are pre-trained on ImageNet for image classification. We pose the age estimation problem as a deep classification problem followed by a softmax expected value refinement. The key factors of our solution are: deep learned models from large data, robust face alignment, and expected value formulation for age regression. We validate our methods on standard benchmarks and achieve state-of-the-art results for both real and apparent age estimation.
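The "softmax expected value" step mentioned in this abstract can be sketched as follows: the network scores a set of discrete age bins, the scores are softmax-normalized, and the prediction is the probability-weighted mean of the bin ages. This is a minimal sketch, not the authors' code; the 0..100 one-year bins and the made-up logits are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_age(logits, ages):
    """Probability-weighted mean of the age bins (the expected value)."""
    probs = softmax(logits)
    return sum(p * a for p, a in zip(probs, ages))


# Assumed bins: one class per year from 0 to 100 (illustrative only).
ages = list(range(0, 101))

# Made-up logits peaked around age 30, standing in for a network output.
logits = [-((a - 30) ** 2) / 50.0 for a in ages]

print(expected_age(logits, ages))
```

Compared with taking the single most probable bin, the expected value can fall between bins, which makes the classification output usable as a continuous age estimate.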

755 citations

Cites methods from "Very Deep Convolutional Networks fo..."

...Yang et al. [57] (SEU-NJU team, 4th place in LAP challenge) use face and landmark detection for face alignment and the VGG-16 architecture [50] for modeling....
[...]
...Thereby each filter in VGG-16 captures simpler geometrical structures but in comparison allows more complex reasoning through its increased depth....
[...]
...For our convolutional neural networks (CNNs) we use the deep VGG-16 architecture [48]....
[...]
...Our method uses a CNN with the VGG-16 (Simonyan and Zisserman 2014) architecture [cf....
[...]
...3.4 Output layer and expected value The pre-trained CNN (with VGG-16 architecture) for the ImageNet classification task has an output layer of 1000 softmax-normalized neurons, one for each of the object classes....
[...]

Journal Article•DOI•

U2-Net: Going deeper with nested U-structure for salient object detection

Xuebin Qin¹, Zichen Vincent Zhang¹, Chenyang Huang¹, Masood Dehghan¹, Osmar R. Zaïane¹, Martin Jagersand¹•Institutions (1)

University of Alberta¹

01 Oct 2020-Pattern Recognition

TL;DR: U2-Net is a simple yet powerful deep network architecture for salient object detection (SOD); its two-level nested U-structure allows a deep network to be trained from scratch without using backbones from image classification tasks.

753 citations

Cites background from "Very Deep Convolutional Networks fo..."

...There is a common pattern in the design of most SOD networks [18, 27, 41, 6], that is, they focus on making good use of deep features extracted by existing backbones, such as Alexnet [17], VGG [35], ResNet [12], ResNeXt [44], DenseNet [15], etc....
[...]
...In modern CNN designs, such as VGG, ResNet, DenseNet and so on, small convolutional filters with size of 1×1 or 3×3 are the most frequently used components for feature extraction....
[...]
...Practically, we adapt the backbones (VGG-16 and ResNet50) by adding an extra stage after their last convolutional stages to achieve the same receptive fields with our original U2-Net architecture design....
[...]
...To validate the backbone free design, we conduct ablation studies on replacing the encoder part of our full size U2-Net with different backbones: VGG16 and ResNet50....
[...]
...Different from the previous salient object detection models which use backbones (e.g. VGG, ResNet, etc.) as their encoders, our newly proposed U2-Net architecture is backbone free....
[...]

Proceedings Article•DOI•

SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints

Amir Sadeghian¹, Vineet Kosaraju¹, Ali Sadeghian², Noriaki Hirose¹, Hamid Rezatofighi¹, Silvio Savarese¹•Institutions (2)

Stanford University¹, University of Florida²

01 Jun 2019

TL;DR: In this paper, an interpretable framework based on Generative Adversarial Network (GAN) is proposed for path prediction for multiple interacting agents in a scene, which leverages two sources of information, the path history of all the agents in the scene, and the scene context information, using images of the scene.

Abstract: This paper addresses the problem of path prediction for multiple interacting agents in a scene, which is a crucial step for many autonomous platforms such as self-driving cars and social robots. We present SoPhie, an interpretable framework based on a Generative Adversarial Network (GAN), which leverages two sources of information: the path history of all the agents in a scene, and the scene context information, using images of the scene. To predict a future path for an agent, both physical and social information must be leveraged. Previous work has not succeeded in jointly modeling physical and social interactions. Our approach blends a social attention mechanism with a physical attention mechanism. The physical attention helps the model learn where to look in a large scene and extract the most salient parts of the image relevant to the path, whereas the social attention component aggregates information across the different agent interactions and extracts the most important trajectory information from the surrounding neighbors. SoPhie also takes advantage of the GAN to generate more realistic samples and to capture the uncertain nature of the future paths by modeling their distribution. All these mechanisms enable our approach to predict socially and physically plausible paths for the agents and to achieve state-of-the-art performance on several different trajectory forecasting benchmarks.

752 citations

Journal Article•DOI•

A Survey of Deep Learning-Based Object Detection

Licheng Jiao¹, Fan Zhang¹, Fang Liu¹, Shuyuan Yang¹, Lingling Li¹, Zhixi Feng¹, Rong Qu²•Institutions (2)

Xidian University¹, University of Nottingham²

05 Sep 2019-IEEE Access

TL;DR: This survey provides a comprehensive overview of a variety of object detection methods in a systematic manner, covering the one-stage and two-stage detectors, and lists the traditional and new applications.

Abstract: Object detection is one of the most important and challenging branches of computer vision, which has been widely applied in people's life, such as monitoring security, autonomous driving and so on, with the purpose of locating instances of semantic objects of a certain class. With the rapid development of deep learning algorithms for detection tasks, the performance of object detectors has been greatly improved. In order to understand the main development status of object detection pipeline thoroughly and deeply, in this survey, we analyze the methods of existing typical detection models and describe the benchmark datasets at first. Afterwards and primarily, we provide a comprehensive overview of a variety of object detection methods in a systematic manner, covering the one-stage and two-stage detectors. Moreover, we list the traditional and new applications. Some representative branches of object detection are analyzed as well. Finally, we discuss the architecture of exploiting these object detection methods to build an effective and efficient system and point out a set of development trends to better follow the state-of-the-art algorithms and further research.

749 citations

Cites background or methods from "Very Deep Convolutional Networks fo..."

...than Fast R-CNN (1830ms) with the same VGG [26] backbone, and processing rate was 5fps vs....
[...]
...As well, total running time of Faster R-CNN (198ms) is nearly 10 times lower than Fast R-CNN (1830ms) with the same VGG [24] backbone, and processing rate is 5fps vs. 0.5fps. D. Mask R-CNN Mask R-CNN [9] is an extending work to Faster R-CNN mainly for instance segmentation task....
[...]
...Experiments showed that SSD512 had a competitive result both mAP and speed with VGG-16 [24] backbone....
[...]
...Experiments showed that SSD512 had a competitive result on both mAP and speed with VGG-16 [26] backbone....
[...]
...For M2Det is an one-stage detector, it achieves AP of 41.0 at speed of 11.8 FPS with single-scale inference strategy and AP of 44.2 with multi-scale inference strategy utilizing VGG-16 on COCO test-dev set....
[...]

Journal Article•DOI•

Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly

Yongqin Xian¹, Christoph H. Lampert², Bernt Schiele¹, Zeynep Akata³•Institutions (3)

Max Planck Society¹, Institute of Science and Technology Austria², University of Amsterdam³

01 Sep 2019-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper introduces Animals with Attributes 2 (AWA2), a new dataset for zero-shot learning that is publicly available both in terms of image features and the images themselves.

Abstract: Due to the importance of zero-shot learning, i.e., classifying images where there is a lack of labeled training data, the number of proposed approaches has recently increased steadily. We argue that it is time to take a step back and to analyze the status quo of the area. The purpose of this paper is three-fold. First, given the fact that there is no agreed upon zero-shot learning benchmark, we first define a new benchmark by unifying both the evaluation protocols and data splits of publicly available datasets used for this task. This is an important contribution as published results are often not comparable and sometimes even flawed due to, e.g., pre-training on zero-shot test classes. Moreover, we propose a new zero-shot learning dataset, the Animals with Attributes 2 (AWA2) dataset which we make publicly available both in terms of image features and the images themselves. Second, we compare and analyze a significant number of the state-of-the-art methods in depth, both in the classic zero-shot setting but also in the more realistic generalized zero-shot setting. Finally, we discuss in detail the limitations of the current status of the area which can be taken as a basis for advancing it.

747 citations

References

Proceedings Article•DOI•

ImageNet: A large-scale hierarchical image database

Jia Deng¹, Wei Dong¹, Richard Socher¹, Li-Jia Li¹, Kai Li¹, Li Fei-Fei¹•Institutions (1)

Princeton University¹

20 Jun 2009

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations

Proceedings Article•DOI•

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

Ross Girshick¹, Jeff Donahue¹, Trevor Darrell¹, Jitendra Malik¹•Institutions (1)

University of California, Berkeley¹

23 Jun 2014

TL;DR: R-CNN combines CNNs with bottom-up region proposals to localize and segment objects, and shows that when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

21,729 citations

Posted Content•

Fully Convolutional Networks for Semantic Segmentation

Jonathan Long¹, Evan Shelhamer¹, Trevor Darrell¹•Institutions (1)

University of California, Berkeley¹

14 Nov 2014-arXiv: Computer Vision and Pattern Recognition

TL;DR: It is shown that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation.

Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.

9,803 citations

Journal Article•DOI•

Backpropagation applied to handwritten zip code recognition

Yann LeCun¹, Bernhard E. Boser¹, John S. Denker¹, D. Henderson¹, Richard Howard¹, W. Hubbard¹, Lawrence D. Jackel¹•Institutions (1)

Bell Labs¹

01 Dec 1989-Neural Computation

TL;DR: This paper demonstrates how constraints from the task domain can be integrated into a backpropagation network through the architecture of the network, successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service.

Abstract: The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.

9,775 citations

Journal Article•DOI•

The Pascal Visual Object Classes Challenge: A Retrospective

Mark Everingham¹, S. M. Eslami², Luc Van Gool³, Christopher Williams⁴, John Winn², Andrew Zisserman⁵•Institutions (5)

University of Leeds¹, Microsoft², ETH Zurich³, University of Edinburgh⁴, University of Oxford⁵

01 Jan 2015-International Journal of Computer Vision

TL;DR: A review of the Pascal Visual Object Classes challenge from 2008-2012 and an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.

Abstract: The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008-2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community's progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.

6,061 citations