Posted Content

Deep Residual Learning for Image Recognition

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
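The reformulation in the abstract, learning a residual F(x) = H(x) - x and outputting y = F(x) + x, is easy to see in code. Below is a minimal sketch of a residual building block with an identity shortcut, assuming PyTorch; class and variable names are illustrative, and the paper additionally covers projection shortcuts and bottleneck variants:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions learning the residual F(x); output is F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut: y = F(x) + x
```

If the stacked layers' best mapping were the identity, the block only has to drive F(x) toward zero, which is easier to optimize than fitting an identity through nonlinear layers.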
Citations
Posted Content
TL;DR: The authors present some updates to YOLO!
Abstract: We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at this https URL
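The mAP@50 numbers above count a detection as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. A minimal IoU sketch for axis-aligned boxes in plain Python; the (x1, y1, x2, y2) corner format is an assumption for illustration:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```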

12,770 citations


Cites background or methods from "Deep Residual Learning for Image Re..."

  • ...of COCOs weird average mean AP metric it is on par with the SSD variants but is 3× faster. It is still quite a bit behind other...

    backbone                                               AP    AP50  AP75  APS   APM   APL
    Two-stage methods
    Faster R-CNN+++ [5]         ResNet-101-C4              34.9  55.7  37.4  15.6  38.7  50.9
    Faster R-CNN w FPN [8]      ResNet-101-FPN             36.2  59.1  39.0  18.2  39.0  48.2
    Faster R-CNN by G-RMI [6]   Inception-ResNet-v2 [21]   34.7  55.5  36.7  13.5  38.1  52.0
    Faster...


  • ...owerful than Darknet19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

    Backbone          Top-1  Top-5  Bn Ops  BFLOP/s  FPS
    Darknet-19 [15]   74.1   91.8   7.29    1246     171
    ResNet-101 [5]    77.1   93.7   19.7    1039     53
    ResNet-152 [5]    77.6   93.8   29.4    1090     37
    Darknet-53        77.2   93.8   18.7    1457     78

    Table 2. Comparison of backbones. Accuracy, billions of operations, billion floating point operations per...


Proceedings Article
20 Mar 2017
TL;DR: This work presents a conceptually simple, flexible, and general framework for object instance segmentation that outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners.
Abstract: We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: this https URL
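The key addition described above, a mask branch running in parallel with the existing box branch, can be sketched as follows, assuming PyTorch. Layer shapes and names here are illustrative, not the paper's exact head architecture:

```python
import torch
import torch.nn as nn

class RoIHeads(nn.Module):
    """Per-RoI heads: box classification/regression plus a parallel mask branch."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # Existing Faster R-CNN branch: class scores and box deltas per RoI.
        self.box_head = nn.Sequential(
            nn.Flatten(), nn.Linear(in_channels * 7 * 7, 1024), nn.ReLU())
        self.cls_score = nn.Linear(1024, num_classes)
        self.box_delta = nn.Linear(1024, num_classes * 4)
        # Added mask branch: a small FCN predicting one binary mask per class.
        self.mask_head = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, roi_features: torch.Tensor):
        # roi_features: (N, C, 7, 7) pooled features, one per region of interest.
        h = self.box_head(roi_features)       # (N, 1024)
        masks = self.mask_head(roi_features)  # (N, num_classes, 14, 14)
        return self.cls_score(h), self.box_delta(h), masks
```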

11,343 citations


Cites background or methods from "Deep Residual Learning for Image Re..."

  • ...We train on 8 GPUs (so effective minibatch size is 16) for 160k iterations, with a learning rate of 0.02 which is decreased by 10 at the 120k iteration....


  • ...We hope such rapid training will remove a major hurdle in this area and encourage more people to perform research on this challenging topic....


  • ...We also report results on test-dev [28]....

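The first excerpt above quotes Mask R-CNN's training recipe: 8 GPUs (effective minibatch 16), 160k iterations, learning rate 0.02 divided by 10 at iteration 120k. A minimal sketch of that step schedule in plain Python; the function name is illustrative:

```python
def learning_rate(iteration: int) -> float:
    """Step schedule: 0.02 for the first 120k iterations, then divided by 10."""
    base_lr = 0.02
    return base_lr / 10 if iteration >= 120_000 else base_lr

print(learning_rate(0))        # 0.02
print(learning_rate(130_000))  # 0.002
```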

Journal ArticleDOI
TL;DR: This paper reviews the major deep learning concepts pertinent to medical image analysis and summarizes over 300 contributions to the field, most of which appeared in the last year, to survey the use of deep learning for image classification, object detection, segmentation, registration, and other tasks.

8,730 citations


Cites methods from "Deep Residual Learning for Image Re..."

  • ...Nie et al. (2016c) CNN Survival prediction; features from a multi-modal 3D CNN are fused with hand-crafted features to train an SVM...


  • ...Xie et al. (2015a) Nucleus detection Ki-67 CNN model that learns the voting offset vectors and voting...


  • ...Xie et al. (2016b) Perimysium segmentation H&E 2D spatial clockwork RNN...


  • ...Xie et al. (2015b) Nucleus detection H&E, Ki-67 CNN-based structured regression model for cell detection...


  • ...Xie et al. (2016a) Nucleus detection and cell counting FL and H&E Microscopy cell counting with fully convolutional regression...


Posted Content
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár
TL;DR: This paper proposes to address the extreme foreground-background class imbalance encountered during training of dense detectors by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples, and develops a novel Focal Loss, which focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.
Abstract: The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: this https URL.
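The reshaped loss described above can be sketched in a few lines of NumPy. The form below is the paper's α-balanced focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t); γ = 2 and α = 0.25 are the settings the paper reports working best:

```python
import numpy as np

def focal_loss(p: np.ndarray, y: np.ndarray,
               gamma: float = 2.0, alpha: float = 0.25) -> np.ndarray:
    """p: predicted foreground probability; y: 1 for foreground, 0 for background."""
    p_t = np.where(y == 1, p, 1.0 - p)             # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    # The modulating factor (1 - p_t)**gamma down-weights easy examples.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

A well-classified easy negative (p = 0.01, so p_t = 0.99) has its cross-entropy term scaled by (1 - 0.99)² = 10⁻⁴, which is what keeps the huge number of easy backgrounds from dominating training.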

7,715 citations


Cites methods from "Deep Residual Learning for Image Re..."

  • ...Plugging in ResNeXt-32x8d-101-FPN [40] as the RetinaNet backbone...

    TABLE 3: Object Detection Single-Model Results (Bounding Box AP), versus State-of-the-Art on COCO test-dev

    backbone                                                AP    AP50  AP75  APS   APM   APL
    Two-stage methods
    Faster R-CNN+++ [31]         ResNet-101-C4              34.9  55.7  37.4  15.6  38.7  50.9
    Faster R-CNN w FPN [4]       ResNet-101-FPN             36.2  59.1  39.0  18.2  39.0  48.2
    Faster R-CNN by G-RMI [33]   Inception-ResNet-v2 [38]   34.7  55.5  36.7  13.5  38.1  52.0
    Faster R-CNN w TDM [30]      Inception-ResNet-v2-TDM    36.8  57.7  39.2  16.2  39.8  52.1
    One-stage methods
    YOLOv2 [8]                   DarkNet-19 [8]             21.6  44.0  19.2  5.0   22.4  35.5
    SSD513 [9], [10]             ResNet-101-SSD             31.2  50.4  33.3  10.2  34.5  49.8
    DSSD513 [10]                 ResNet-101-DSSD            33.2  53.3  35.2  13.0  35.4  51.1
    RetinaNet (ours)             ResNet-101-FPN             39.1  59.1  42.3  21.8  42.7  50.2
    RetinaNet (ours)             ResNeXt-101-FPN            40.8  61.1  44.1  24.1  44.2  51.2

    We show results for our RetinaNet-101-800 model, trained with scale jitter and for 1.5× longer than the same model from Table 1e....


  • ...Following [4], we build FPN on top of the ResNet architecture [31]....


  • ...For all experiments we use depth 50 or 101 ResNets [31] with a Feature Pyramid Network [4] constructed on top....


  • ...ResNet-101 models are pre-trained on ImageNet1k; we use the models released by [31]....


  • ...The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) [4] backbone on top of a feedforward ResNet architecture [31] (a) to generate a rich, multi-scale convolutional feature pyramid (b)....


Posted Content
TL;DR: This work proposes a small DNN architecture called SqueezeNet, which achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters and is able to compress to less than 0.5MB (510x smaller than AlexNet).
Abstract: Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet). The SqueezeNet architecture is available for download here: this https URL
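SqueezeNet's parameter savings come from its Fire module: a 1×1 "squeeze" convolution reduces channels, then parallel 1×1 and 3×3 "expand" convolutions are concatenated. A minimal sketch assuming PyTorch, with illustrative channel counts:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet Fire module: squeeze (1x1), then concatenated 1x1/3x3 expands."""
    def __init__(self, in_ch: int, squeeze_ch: int, expand_ch: int):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.relu(self.squeeze(x))
        # Concatenate the two expand branches along the channel axis.
        return self.relu(torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1))
```

Keeping squeeze_ch small relative to in_ch is what cuts the 3×3 parameter count, since the expensive 3×3 filters only see the squeezed channels.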

5,904 citations


Cites background or methods from "Deep Residual Learning for Image Re..."

  • ...level, i.e. the contents of individual modules of the CNN. Now, we explore design decisions at the macroarchitecture level concerning the high-level connections among Fire modules. Inspired by ResNet [13], we explored three different architectures: Vanilla SqueezeNet (as per the prior sections). SqueezeNet with simple bypass connections between some Fire modules. SqueezeNet with complex bypass conn...


  • ... layers that deliver even higher ImageNet accuracy [14]. The choice of connections across multiple layers or modules is an emerging area of CNN macroarchitectural research. Residual Networks (ResNet) [13] and Highway Networks [29] each propose the use of connections that skip over multiple layers, for example additively connecting the activations from layer 3 to the activations from layer 6. We refer ...


References
Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification, as discussed by the authors.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
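The dropout method mentioned above randomly zeroes units during training so features cannot co-adapt. A minimal NumPy sketch; the "inverted" rescaling by 1/(1-p) shown here is a common modern variant rather than necessarily the paper's exact formulation:

```python
import numpy as np

def dropout(x: np.ndarray, p: float = 0.5, training: bool = True) -> np.ndarray:
    """Zero each unit with probability p during training; identity at test time."""
    if not training or p == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
    # Rescale so the expected activation matches the test-time (no-dropout) pass.
    return x * mask / (1.0 - p)
```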

73,978 citations

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
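The gate units and constant error carousel described above can be written compactly. A sketch in now-standard notation (the original 1997 formulation had no forget gate, so the additive cell update below reflects that original constant-error-carousel form; σ is the logistic sigmoid and ⊙ is elementwise multiplication):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i)        && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o)        && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= c_{t-1} + i_t \odot \tilde{c}_t            && \text{(constant error carousel)} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Because the cell state c_t is updated purely additively, the error signal flowing back through it neither vanishes nor explodes across long lags.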

72,897 citations


"Deep Residual Learning for Image Re..." refers background in this paper

  • ...Concurrent with our work, “highway networks” [42, 43] present shortcut connections with gating functions [15]....


Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
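A quick way to see why the very small 3×3 filters help: two stacked 3×3 convolutions cover the same 5×5 receptive field as one 5×5 convolution but use fewer parameters (and add an extra nonlinearity), as the paper argues. A short check in Python, assuming C input and C output channels throughout:

```python
C = 256
two_3x3 = 2 * (3 * 3 * C * C)  # 1,179,648 weights for two stacked 3x3 layers
one_5x5 = 5 * 5 * C * C        # 1,638,400 weights for a single 5x5 layer
print(two_3x3 < one_5x5)       # True: 18*C*C vs 25*C*C
```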

49,914 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
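The multi-scale intuition above is realized by the Inception module: parallel 1×1, 3×3, and 5×5 convolutions plus a pooling path, concatenated channel-wise, with 1×1 "reduction" convolutions keeping the computational budget down. A minimal sketch assuming PyTorch, with illustrative channel counts:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel multi-scale branches whose outputs are concatenated on channels."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
                                     nn.Conv2d(96, 128, 3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                     nn.Conv2d(16, 32, 5, padding=2))
        self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                         nn.Conv2d(in_ch, 32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # All branches preserve spatial size, so they concatenate cleanly.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)
```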

40,257 citations

Proceedings Article
Sergey Ioffe, Christian Szegedy
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
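The normalizing transform described above is simple to state: for each feature, subtract the mini-batch mean, divide by the mini-batch standard deviation, then apply a learned scale γ and shift β. A minimal NumPy sketch of the training-time step (at inference, running averages of the statistics are used instead):

```python
import numpy as np

def batch_norm_train(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
                     eps: float = 1e-5) -> np.ndarray:
    """x: (batch, features); gamma/beta: learned per-feature scale and shift."""
    mu = x.mean(axis=0)                   # per-feature mini-batch mean
    var = x.var(axis=0)                   # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps) # normalize each feature
    return gamma * x_hat + beta           # restore representational capacity
```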

30,843 citations


"Deep Residual Learning for Image Re..." refers background or methods in this paper

  • ...We do not use dropout [14], following the practice in [16]....


  • ...Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]....


  • ...These plain networks are trained with BN [16], which ensures forward propagated signals to have non-zero variances....


  • ...We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]....


  • ...9, and adopt the weight initialization in [13] and BN [16] but with no dropout....
