Posted Content

Deep Residual Learning for Image Recognition

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
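The reformulation in the abstract, learning a residual F(x) = H(x) - x and outputting y = F(x) + x, is easy to see in code. Below is a minimal sketch of a residual building block with an identity shortcut, assuming PyTorch; class and variable names are illustrative, and the paper additionally covers projection shortcuts and bottleneck variants:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions learning the residual F(x); output is F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut: y = F(x) + x
```

If the stacked layers' best mapping were the identity, the block only has to drive F(x) toward zero, which is easier to optimize than fitting an identity through nonlinear layers.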
Citations
Posted Content
TL;DR: The authors present some updates to YOLO!
Abstract: We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at this https URL
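The mAP@50 numbers above count a detection as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. A minimal IoU sketch for axis-aligned boxes in plain Python; the (x1, y1, x2, y2) corner format is an assumption for illustration:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```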

12,770 citations


Cites background or methods from "Deep Residual Learning for Image Re..."

  • ...of COCOs weird average mean AP metric it is on par with the SSD variants but is 3× faster. It is still quite a bit behind other...

    backbone                                               AP    AP50  AP75  APS   APM   APL
    Two-stage methods
    Faster R-CNN+++ [5]         ResNet-101-C4              34.9  55.7  37.4  15.6  38.7  50.9
    Faster R-CNN w FPN [8]      ResNet-101-FPN             36.2  59.1  39.0  18.2  39.0  48.2
    Faster R-CNN by G-RMI [6]   Inception-ResNet-v2 [21]   34.7  55.5  36.7  13.5  38.1  52.0
    Faster...


  • ...owerful than Darknet19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:

    Backbone          Top-1  Top-5  Bn Ops  BFLOP/s  FPS
    Darknet-19 [15]   74.1   91.8   7.29    1246     171
    ResNet-101 [5]    77.1   93.7   19.7    1039     53
    ResNet-152 [5]    77.6   93.8   29.4    1090     37
    Darknet-53        77.2   93.8   18.7    1457     78

    Table 2. Comparison of backbones. Accuracy, billions of operations, billion floating point operations per...


Proceedings Article
20 Mar 2017
TL;DR: This work presents a conceptually simple, flexible, and general framework for object instance segmentation that outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners.
Abstract: We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: this https URL
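The key addition described above, a mask branch running in parallel with the existing box branch, can be sketched as follows, assuming PyTorch. Layer shapes and names here are illustrative, not the paper's exact head architecture:

```python
import torch
import torch.nn as nn

class RoIHeads(nn.Module):
    """Per-RoI heads: box classification/regression plus a parallel mask branch."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # Existing Faster R-CNN branch: class scores and box deltas per RoI.
        self.box_head = nn.Sequential(
            nn.Flatten(), nn.Linear(in_channels * 7 * 7, 1024), nn.ReLU())
        self.cls_score = nn.Linear(1024, num_classes)
        self.box_delta = nn.Linear(1024, num_classes * 4)
        # Added mask branch: a small FCN predicting one binary mask per class.
        self.mask_head = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, roi_features: torch.Tensor):
        # roi_features: (N, C, 7, 7) pooled features, one per region of interest.
        h = self.box_head(roi_features)       # (N, 1024)
        masks = self.mask_head(roi_features)  # (N, num_classes, 14, 14)
        return self.cls_score(h), self.box_delta(h), masks
```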

11,343 citations


Cites background or methods from "Deep Residual Learning for Image Re..."

  • ...We train on 8 GPUs (so effective minibatch size is 16) for 160k iterations, with a learning rate of 0.02 which is decreased by 10 at the 120k iteration....


  • ...We hope such rapid training will remove a major hurdle in this area and encourage more people to perform research on this challenging topic....


  • ...We also report results on test-dev [28]....

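The first excerpt above quotes Mask R-CNN's training recipe: 8 GPUs (effective minibatch 16), 160k iterations, learning rate 0.02 divided by 10 at iteration 120k. A minimal sketch of that step schedule in plain Python; the function name is illustrative:

```python
def learning_rate(iteration: int) -> float:
    """Step schedule: 0.02 for the first 120k iterations, then divided by 10."""
    base_lr = 0.02
    return base_lr / 10 if iteration >= 120_000 else base_lr

print(learning_rate(0))        # 0.02
print(learning_rate(130_000))  # 0.002
```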

Journal ArticleDOI
TL;DR: This paper reviews the major deep learning concepts pertinent to medical image analysis and summarizes over 300 contributions to the field, most of which appeared in the last year, to survey the use of deep learning for image classification, object detection, segmentation, registration, and other tasks.

8,730 citations


Cites methods from "Deep Residual Learning for Image Re..."

  • ...Nie et al. (2016c) CNN Survival prediction; features from a multi-modal 3D CNN are fused with hand-crafted features to train an SVM...


  • ...Xie et al. (2015a) Nucleus detection Ki-67 CNN model that learns the voting offset vectors and voting...


  • ...Xie et al. (2016b) Perimysium segmentation H&E 2D spatial clockwork RNN...


  • ...Xie et al. (2015b) Nucleus detection H&E, Ki-67 CNN-based structured regression model for cell detection...


  • ...Xie et al. (2016a) Nucleus detection and cell counting FL and H&E Microscopy cell counting with fully convolutional regression...


Posted Content
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár
TL;DR: This paper proposes to address the extreme foreground-background class imbalance encountered during training of dense detectors by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples, and develops a novel Focal Loss, which focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.
Abstract: The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: this https URL.
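The reshaped loss described above can be sketched in a few lines of NumPy. The form below is the paper's α-balanced focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t); γ = 2 and α = 0.25 are the settings the paper reports working best:

```python
import numpy as np

def focal_loss(p: np.ndarray, y: np.ndarray,
               gamma: float = 2.0, alpha: float = 0.25) -> np.ndarray:
    """p: predicted foreground probability; y: 1 for foreground, 0 for background."""
    p_t = np.where(y == 1, p, 1.0 - p)             # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    # The modulating factor (1 - p_t)**gamma down-weights easy examples.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

A well-classified easy negative (p = 0.01, so p_t = 0.99) has its cross-entropy term scaled by (1 - 0.99)² = 10⁻⁴, which is what keeps the huge number of easy backgrounds from dominating training.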

7,715 citations


Cites methods from "Deep Residual Learning for Image Re..."

  • ...Plugging in ResNeXt-32x8d-101-FPN [40] as the RetinaNet backbone...

    TABLE 3: Object Detection Single-Model Results (Bounding Box AP), versus State-of-the-Art on COCO test-dev

    backbone                                                AP    AP50  AP75  APS   APM   APL
    Two-stage methods
    Faster R-CNN+++ [31]         ResNet-101-C4              34.9  55.7  37.4  15.6  38.7  50.9
    Faster R-CNN w FPN [4]       ResNet-101-FPN             36.2  59.1  39.0  18.2  39.0  48.2
    Faster R-CNN by G-RMI [33]   Inception-ResNet-v2 [38]   34.7  55.5  36.7  13.5  38.1  52.0
    Faster R-CNN w TDM [30]      Inception-ResNet-v2-TDM    36.8  57.7  39.2  16.2  39.8  52.1
    One-stage methods
    YOLOv2 [8]                   DarkNet-19 [8]             21.6  44.0  19.2  5.0   22.4  35.5
    SSD513 [9], [10]             ResNet-101-SSD             31.2  50.4  33.3  10.2  34.5  49.8
    DSSD513 [10]                 ResNet-101-DSSD            33.2  53.3  35.2  13.0  35.4  51.1
    RetinaNet (ours)             ResNet-101-FPN             39.1  59.1  42.3  21.8  42.7  50.2
    RetinaNet (ours)             ResNeXt-101-FPN            40.8  61.1  44.1  24.1  44.2  51.2

    We show results for our RetinaNet-101-800 model, trained with scale jitter and for 1.5× longer than the same model from Table 1e....


  • ...Following [4], we build FPN on top of the ResNet architecture [31]....


  • ...For all experiments we use depth 50 or 101 ResNets [31] with a Feature Pyramid Network [4] constructed on top....


  • ...ResNet-101 models are pre-trained on ImageNet1k; we use the models released by [31]....


  • ...The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) [4] backbone on top of a feedforward ResNet architecture [31] (a) to generate a rich, multi-scale convolutional feature pyramid (b)....


Posted Content
TL;DR: This work proposes a small DNN architecture called SqueezeNet, which achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters and is able to compress to less than 0.5MB (510x smaller than AlexNet).
Abstract: Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet). The SqueezeNet architecture is available for download here: this https URL
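SqueezeNet's parameter savings come from its Fire module: a 1×1 "squeeze" convolution reduces channels, then parallel 1×1 and 3×3 "expand" convolutions are concatenated. A minimal sketch assuming PyTorch, with illustrative channel counts:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet Fire module: squeeze (1x1), then concatenated 1x1/3x3 expands."""
    def __init__(self, in_ch: int, squeeze_ch: int, expand_ch: int):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.relu(self.squeeze(x))
        # Concatenate the two expand branches along the channel axis.
        return self.relu(torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1))
```

Keeping squeeze_ch small relative to in_ch is what cuts the 3×3 parameter count, since the expensive 3×3 filters only see the squeezed channels.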

5,904 citations


Cites background or methods from "Deep Residual Learning for Image Re..."

  • ...level, i.e. the contents of individual modules of the CNN. Now, we explore design decisions at the macroarchitecture level concerning the high-level connections among Fire modules. Inspired by ResNet [13], we explored three different architectures: Vanilla SqueezeNet (as per the prior sections). SqueezeNet with simple bypass connections between some Fire modules. SqueezeNet with complex bypass conn...


  • ... layers that deliver even higher ImageNet accuracy [14]. The choice of connections across multiple layers or modules is an emerging area of CNN macroarchitectural research. Residual Networks (ResNet) [13] and Highway Networks [29] each propose the use of connections that skip over multiple layers, for example additively connecting the activations from layer 3 to the activations from layer 6. We refer ...


References
Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification, as discussed by the authors.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
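The dropout method mentioned above randomly zeroes units during training so features cannot co-adapt. A minimal NumPy sketch; the "inverted" rescaling by 1/(1-p) shown here is a common modern variant rather than necessarily the paper's exact formulation:

```python
import numpy as np

def dropout(x: np.ndarray, p: float = 0.5, training: bool = True) -> np.ndarray:
    """Zero each unit with probability p during training; identity at test time."""
    if not training or p == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
    # Rescale so the expected activation matches the test-time (no-dropout) pass.
    return x * mask / (1.0 - p)
```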

73,978 citations

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
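The gate units and constant error carousel described above can be written compactly. A sketch in now-standard notation (the original 1997 formulation had no forget gate, so the additive cell update below reflects that original constant-error-carousel form; σ is the logistic sigmoid and ⊙ is elementwise multiplication):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i)        && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o)        && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= c_{t-1} + i_t \odot \tilde{c}_t            && \text{(constant error carousel)} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Because the cell state c_t is updated purely additively, the error signal flowing back through it neither vanishes nor explodes across long lags.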

72,897 citations


"Deep Residual Learning for Image Re..." refers background in this paper

  • ...Concurrent with our work, “highway networks” [42, 43] present shortcut connections with gating functions [15]....


Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
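A quick way to see why the very small 3×3 filters help: two stacked 3×3 convolutions cover the same 5×5 receptive field as one 5×5 convolution but use fewer parameters (and add an extra nonlinearity), as the paper argues. A short check in Python, assuming C input and C output channels throughout:

```python
C = 256
two_3x3 = 2 * (3 * 3 * C * C)  # 1,179,648 weights for two stacked 3x3 layers
one_5x5 = 5 * 5 * C * C        # 1,638,400 weights for a single 5x5 layer
print(two_3x3 < one_5x5)       # True: 18*C*C vs 25*C*C
```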

49,914 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
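The multi-scale intuition above is realized by the Inception module: parallel 1×1, 3×3, and 5×5 convolutions plus a pooling path, concatenated channel-wise, with 1×1 "reduction" convolutions keeping the computational budget down. A minimal sketch assuming PyTorch, with illustrative channel counts:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel multi-scale branches whose outputs are concatenated on channels."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
                                     nn.Conv2d(96, 128, 3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                     nn.Conv2d(16, 32, 5, padding=2))
        self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                         nn.Conv2d(in_ch, 32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # All branches preserve spatial size, so they concatenate cleanly.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)
```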

40,257 citations

Proceedings Article
Sergey Ioffe, Christian Szegedy
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
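The normalizing transform described above is simple to state: for each feature, subtract the mini-batch mean, divide by the mini-batch standard deviation, then apply a learned scale γ and shift β. A minimal NumPy sketch of the training-time step (at inference, running averages of the statistics are used instead):

```python
import numpy as np

def batch_norm_train(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
                     eps: float = 1e-5) -> np.ndarray:
    """x: (batch, features); gamma/beta: learned per-feature scale and shift."""
    mu = x.mean(axis=0)                   # per-feature mini-batch mean
    var = x.var(axis=0)                   # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps) # normalize each feature
    return gamma * x_hat + beta           # restore representational capacity
```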

30,843 citations


"Deep Residual Learning for Image Re..." refers background or methods in this paper

  • ...We do not use dropout [14], following the practice in [16]....


  • ...Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]....


  • ...These plain networks are trained with BN [16], which ensures forward propagated signals to have non-zero variances....


  • ...We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]....


  • ...9, and adopt the weight initialization in [13] and BN [16] but with no dropout....
