
YOLO9000: Better, Faster, Stronger

Joseph Redmon∗×, Ali Farhadi∗†×
∗University of Washington, †Allen Institute for AI, ×XNOR.ai
http://pjreddie.com/yolo9000/
Abstract
We introduce YOLO9000, a state-of-the-art, real-time
object detection system that can detect over 9000 object
categories. First we propose various improvements to the
YOLO detection method, both novel and drawn from prior
work. The improved model, YOLOv2, is state-of-the-art on
standard detection tasks like PASCAL VOC and COCO. Us-
ing a novel, multi-scale training method the same YOLOv2
model can run at varying sizes, offering an easy tradeoff
between speed and accuracy. At 67 FPS, YOLOv2 gets
76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6
mAP, outperforming state-of-the-art methods like Faster R-
CNN with ResNet and SSD while still running significantly
faster. Finally we propose a method to jointly train on ob-
ject detection and classification. Using this method we train
YOLO9000 simultaneously on the COCO detection dataset
and the ImageNet classification dataset. Our joint training
allows YOLO9000 to predict detections for object classes
that don’t have labelled detection data. We validate our
approach on the ImageNet detection task. YOLO9000 gets
19.7 mAP on the ImageNet detection validation set despite
only having detection data for 44 of the 200 classes. On
the 156 classes not in COCO, YOLO9000 gets 16.0 mAP.
YOLO9000 predicts detections for more than 9000 different
object categories, all in real-time.
1. Introduction
General purpose object detection should be fast, accu-
rate, and able to recognize a wide variety of objects. Since
the introduction of neural networks, detection frameworks
have become increasingly fast and accurate. However, most
detection methods are still constrained to a small set of ob-
jects.
Current object detection datasets are limited compared
to datasets for other tasks like classification and tagging.
The most common detection datasets contain thousands to
hundreds of thousands of images with dozens to hundreds
of tags [3] [10] [2]. Classification datasets have millions
of images with tens or hundreds of thousands of categories
[20] [2].
We would like detection to scale to the level of object clas-
sification. However, labelling images for detection is far
more expensive than labelling for classification or tagging
(tags are often user-supplied for free). Thus we are unlikely
to see detection datasets on the same scale as classification
datasets in the near future.

Figure 1: YOLO9000. YOLO9000 can detect a wide variety of object classes in real-time.
We propose a new method to harness the large amount
of classification data we already have and use it to expand
the scope of current detection systems. Our method uses a
hierarchical view of object classification that allows us to
combine distinct datasets together.
We also propose a joint training algorithm that allows
us to train object detectors on both detection and classifica-
tion data. Our method leverages labeled detection images to
learn to precisely localize objects while it uses classification
images to increase its vocabulary and robustness.
Using this method we train YOLO9000, a real-time ob-
ject detector that can detect over 9000 different object cat-
egories. First we improve upon the base YOLO detection
system to produce YOLOv2, a state-of-the-art, real-time
detector. Then we use our dataset combination method
and joint training algorithm to train a model on more than
9000 classes from ImageNet as well as detection data from
COCO.
All of our code and pre-trained models are available on-
line at
http://pjreddie.com/yolo9000/.
2. Better
YOLO suffers from a variety of shortcomings relative to
state-of-the-art detection systems. Error analysis of YOLO
compared to Fast R-CNN shows that YOLO makes a sig-
nificant number of localization errors. Furthermore, YOLO
has relatively low recall compared to region proposal-based
methods. Thus we focus mainly on improving recall and
localization while maintaining classification accuracy.
Computer vision generally trends towards larger, deeper
networks [6] [18] [17]. Better performance often hinges on
training larger networks or ensembling multiple models to-
gether. However, with YOLOv2 we want a more accurate
detector that is still fast. Instead of scaling up our network,
we simplify the network and then make the representation
easier to learn. We pool a variety of ideas from past work
with our own novel concepts to improve YOLO’s perfor-
mance. A summary of results can be found in Table 2.
Batch Normalization. Batch normalization leads to sig-
nificant improvements in convergence while eliminating the
need for other forms of regularization [7]. By adding batch
normalization on all of the convolutional layers in YOLO
we get more than 2% improvement in mAP. Batch normal-
ization also helps regularize the model. With batch nor-
malization we can remove dropout from the model without
overfitting.
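As a rough sketch of this change (not the authors' Darknet code), a convolutional block in PyTorch-style Python pairs every convolution with batch normalization and omits dropout entirely; the channel counts and the 0.1 leaky-ReLU slope are assumptions for illustration.

    import torch.nn as nn

    def conv_bn_leaky(in_ch, out_ch, kernel_size=3):
        """Convolution + batch norm + leaky ReLU.

        Illustrates the 'batch norm on every conv layer, no dropout' recipe
        described above; channel counts are placeholders, not YOLO's.
        """
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(out_ch),          # regularizes enough to drop dropout
            nn.LeakyReLU(0.1, inplace=True),
        )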
High Resolution Classifier. All state-of-the-art detec-
tion methods use classifiers pre-trained on ImageNet [16].
Starting with AlexNet most classifiers operate on input im-
ages smaller than 256 × 256 [8]. The original YOLO trains
the classifier network at 224 × 224 and increases the reso-
lution to 448 for detection. This means the network has to
simultaneously switch to learning object detection and ad-
just to the new input resolution.
For YOLOv2 we first fine tune the classification network
at the full 448 × 448 resolution for 10 epochs on ImageNet.
This gives the network time to adjust its filters to work better
on higher resolution input. We then fine tune the resulting
network on detection. This high resolution classification
network gives us an increase of almost 4% mAP.
Convolutional With Anchor Boxes. YOLO predicts
the coordinates of bounding boxes directly using fully con-
nected layers on top of the convolutional feature extractor.
Instead of predicting coordinates directly Faster R-CNN
predicts bounding boxes using hand-picked priors [15]. Us-
ing only convolutional layers the region proposal network
(RPN) in Faster R-CNN predicts offsets and confidences for
anchor boxes. Since the prediction layer is convolutional,
the RPN predicts these offsets at every location in a feature
map. Predicting offsets instead of coordinates simplifies the
problem and makes it easier for the network to learn.
We remove the fully connected layers from YOLO and
use anchor boxes to predict bounding boxes. First we
eliminate one pooling layer to make the output of the net-
work’s convolutional layers higher resolution. We also
shrink the network to operate on 416 input images instead
of 448×448. We do this because we want an odd number of
locations in our feature map so there is a single center cell.
Objects, especially large objects, tend to occupy the center
of the image so it’s good to have a single location right at
the center to predict these objects instead of four locations
that are all nearby. YOLO’s convolutional layers downsam-
ple the image by a factor of 32 so by using an input image
of 416 we get an output feature map of 13 × 13.
When we move to anchor boxes we also decouple the
class prediction mechanism from the spatial location and
instead predict class and objectness for every anchor box.
Following YOLO, the objectness prediction still predicts
the IOU of the ground truth and the proposed box and the
class predictions predict the conditional probability of that
class given that there is an object.
Using anchor boxes we get a small decrease in accuracy.
YOLO only predicts 98 boxes per image but with anchor
boxes our model predicts more than a thousand. Without
anchor boxes our intermediate model gets 69.5 mAP with a
recall of 81%. With anchor boxes our model gets 69.2 mAP
with a recall of 88%. Even though the mAP decreases, the
increase in recall means that our model has more room to
improve.
Dimension Clusters. We encounter two issues with an-
chor boxes when using them with YOLO. The first is that
the box dimensions are hand picked. The network can learn
to adjust the boxes appropriately but if we pick better priors
for the network to start with we can make it easier for the
network to learn to predict good detections.
Instead of choosing priors by hand, we run k-means
clustering on the training set bounding boxes to automat-
ically find good priors.

Figure 2: Clustering box dimensions on VOC and COCO. We run k-means clustering on the dimensions of bounding boxes to get good priors for our model. The left plot shows the average IOU we get with various choices for k; k = 5 gives a good tradeoff for recall vs. complexity of the model. The right plot shows the relative centroids for VOC and COCO. COCO has greater variation in size than VOC.

If we use standard k-means with
Euclidean distance larger boxes generate more error than
smaller boxes. However, what we really want are priors
that lead to good IOU scores, which is independent of the
size of the box. Thus for our distance metric we use:
d(box, centroid) = 1 − IOU(box, centroid)

We run k-means for various values of k and plot the av-
erage IOU with closest centroid, see Figure 2. We choose
k = 5 as a good tradeoff between model complexity and
high recall. The cluster centroids are significantly different
than hand-picked anchor boxes. There are fewer short, wide
boxes and more tall, thin boxes.
We compare the average IOU to closest prior of our clus-
tering strategy and the hand-picked anchor boxes in Table 1.
At only 5 priors the centroids perform similarly to 9 anchor
boxes with an average IOU of 61.0 compared to 60.9. If
we use 9 centroids we see a much higher average IOU. This
indicates that using k-means to generate our bounding box
starts the model off with a better representation and makes
the task easier to learn.
Box Generation # Avg IOU
Cluster SSE 5 58.7
Cluster IOU 5 61.0
Anchor Boxes [15] 9 60.9
Cluster IOU 9 67.2
Table 1: Average IOU of boxes to closest priors on VOC 2007.
The average IOU of objects on VOC 2007 to their closest, unmod-
ified prior using different generation methods. Clustering gives
much better results than using hand-picked priors.
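The clustering procedure above is straightforward to sketch. The following NumPy snippet is a minimal illustration, not the authors' code; it assumes boxes are given as (width, height) pairs and updates each centroid with the mean dimensions of its cluster, which is a common simplification.

    import numpy as np

    def iou_wh(boxes, centroids):
        """IOU between (w, h) pairs, treating all boxes as sharing one center."""
        w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
        h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
        inter = w * h
        union = (boxes[:, 0] * boxes[:, 1])[:, None] \
              + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
        return inter / union

    def kmeans_priors(boxes, k=5, iters=100, seed=0):
        """k-means on box dimensions using d(box, centroid) = 1 - IOU."""
        rng = np.random.default_rng(seed)
        centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
        for _ in range(iters):
            assign = (1.0 - iou_wh(boxes, centroids)).argmin(axis=1)
            new = np.array([boxes[assign == j].mean(axis=0)
                            if np.any(assign == j) else centroids[j]
                            for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        return centroids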
Direct location prediction. When using anchor boxes
with YOLO we encounter a second issue: model instability,
especially during early iterations. Most of the instability
comes from predicting the (x, y) locations for the box. In
region proposal networks the network predicts values t_x and
t_y and the (x, y) center coordinates are calculated as:

x = (t_x × w_a) + x_a
y = (t_y × h_a) + y_a

For example, a prediction of t_x = 1 would shift the box
to the right by the width of the anchor box, a prediction of
t_x = −1 would shift it to the left by the same amount.
This formulation is unconstrained so any anchor box can
end up at any point in the image, regardless of what loca-
tion predicted the box. With random initialization the model
takes a long time to stabilize to predicting sensible offsets.
Instead of predicting offsets we follow the approach of
YOLO and predict location coordinates relative to the loca-
tion of the grid cell. This bounds the ground truth to fall
between 0 and 1. We use a logistic activation to constrain
the network’s predictions to fall in this range.
The network predicts 5 bounding boxes at each cell in
the output feature map. The network predicts 5 coordinates
for each bounding box: t_x, t_y, t_w, t_h, and t_o. If the cell is
offset from the top left corner of the image by (c_x, c_y) and
the bounding box prior has width and height p_w, p_h, then
the predictions correspond to:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w e^{t_w}
b_h = p_h e^{t_h}
Pr(object) × IOU(b, object) = σ(t_o)
Since we constrain the location prediction the
parametrization is easier to learn, making the network
more stable. Using dimension clusters along with directly
predicting the bounding box center location improves
YOLO by almost 5% over the version with anchor boxes.
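For concreteness, the decode step implied by these equations can be sketched as follows; the prior sizes, the raw predictions, and the stride of 32 in the example call are illustrative values.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def decode_box(t, cell_xy, prior_wh, stride=32):
        """Decode (t_x, t_y, t_w, t_h, t_o) for one prior in one grid cell.

        cell_xy  -- (c_x, c_y), the cell offset from the top-left of the grid
        prior_wh -- (p_w, p_h), the dimension-cluster prior, in grid units
        Returns the box center and size in pixels plus the objectness score.
        """
        t_x, t_y, t_w, t_h, t_o = t
        b_x = sigmoid(t_x) + cell_xy[0]      # center constrained to its cell
        b_y = sigmoid(t_y) + cell_xy[1]
        b_w = prior_wh[0] * np.exp(t_w)      # size as an offset from the prior
        b_h = prior_wh[1] * np.exp(t_h)
        objectness = sigmoid(t_o)            # estimates Pr(object) * IOU(b, object)
        return b_x * stride, b_y * stride, b_w * stride, b_h * stride, objectness

    # made-up prediction in cell (6, 6) of a 13 x 13 grid with a 3.0 x 4.5 prior
    print(decode_box(np.array([0.2, -0.1, 0.0, 0.3, 1.5]), (6, 6), (3.0, 4.5)))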
Fine-Grained Features. This modified YOLO predicts
detections on a 13 × 13 feature map. While this is suffi-
cient for large objects, it may benefit from finer grained fea-
tures for localizing smaller objects. Faster R-CNN and SSD
both run their proposal networks at various feature maps in
the network to get a range of resolutions. We take a differ-
ent approach, simply adding a passthrough layer that brings
features from an earlier layer at 26 × 26 resolution.
The passthrough layer concatenates the higher resolution
features with the low resolution features by stacking adja-
cent features into different channels instead of spatial lo-
cations, similar to the identity mappings in ResNet. This
turns the 26 × 26 × 512 feature map into a 13 × 13 × 2048
feature map, which can be concatenated with the original
features.

Figure 3: Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function.

Our detector runs on top of this expanded feature
map so that it has access to fine grained features. This gives
a modest 1% performance increase.
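The passthrough layer is essentially a space-to-depth rearrangement. A minimal NumPy sketch, assuming NCHW layout and a stride of 2 (the exact channel ordering of Darknet's reorg layer may differ):

    import numpy as np

    def passthrough(x, stride=2):
        """Stack adjacent spatial positions into channels (space-to-depth)."""
        n, c, h, w = x.shape
        x = x.reshape(n, c, h // stride, stride, w // stride, stride)
        x = x.transpose(0, 3, 5, 1, 2, 4)      # move each 2x2 block into channels
        return x.reshape(n, c * stride * stride, h // stride, w // stride)

    features = np.zeros((1, 512, 26, 26), dtype=np.float32)
    print(passthrough(features).shape)         # (1, 2048, 13, 13)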
Multi-Scale Training. The original YOLO uses an input
resolution of 448 × 448. With the addition of anchor boxes
we changed the resolution to 416×416. However, since our
model only uses convolutional and pooling layers it can be
resized on the fly. We want YOLOv2 to be robust to running
on images of different sizes so we train this into the model.
Instead of fixing the input image size we change the net-
work every few iterations. Every 10 batches our network
randomly chooses new image dimensions. Since our model
downsamples by a factor of 32, we pull from the following
multiples of 32: {320, 352, ..., 608}. Thus the smallest op-
tion is 320 × 320 and the largest is 608 × 608. We resize the
network to that dimension and continue training.
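A sketch of this schedule (the trainer interface is hypothetical; only the choice of sizes follows the text):

    import random

    SCALES = list(range(320, 608 + 1, 32))     # {320, 352, ..., 608}, multiples of 32

    def maybe_resize(batch_idx, current_size):
        """Every 10 batches, pick a new square input resolution at random."""
        if batch_idx % 10 == 0:
            return random.choice(SCALES)
        return current_size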
This regime forces the network to learn to predict well
across a variety of input dimensions. This means the same
network can predict detections at different resolutions. The
network runs faster at smaller sizes so YOLOv2 offers an
easy tradeoff between speed and accuracy.
At low resolutions YOLOv2 operates as a cheap, fairly
accurate detector. At 288 × 288 it runs at more than 90 FPS
with mAP almost as good as Fast R-CNN. This makes it
ideal for smaller GPUs, high framerate video, or multiple
video streams.
At high resolution YOLOv2 is a state-of-the-art detector
with 78.6 mAP on VOC 2007 while still operating above
real-time speeds. See Table 3 for a comparison of YOLOv2
with other frameworks on VOC 2007; Figure 4 plots accuracy
against speed.
Figure 4: Accuracy and speed on VOC 2007 (mean average precision vs. frames per second for R-CNN, YOLO, Fast R-CNN, Faster R-CNN with VGG and ResNet, SSD300, SSD512, and YOLOv2).

Further Experiments. We train YOLOv2 for detection
on VOC 2012. Table 4 shows the comparative performance
of YOLOv2 versus other state-of-the-art detection systems.
YOLOv2 achieves 73.4 mAP while running far faster than
other methods. We also train on COCO, see Table 5. On the
VOC metric (IOU = .5) YOLOv2 gets 44.0 mAP, compara-
ble to SSD and Faster R-CNN.
3. Faster
We want detection to be accurate but we also want it to be
fast. Most applications for detection, like robotics or self-
driving cars, rely on low latency predictions. In order to
maximize performance we design YOLOv2 to be fast from
the ground up.
Most detection frameworks rely on VGG-16 as the base
feature extractor [17]. VGG-16 is a powerful, accurate clas-
sification network but it is needlessly complex. The con-
volutional layers of VGG-16 require 30.69 billion floating
point operations for a single pass over a single image at
224 × 224 resolution.
The YOLO framework uses a custom network based on
the GoogLeNet architecture [19]. This network is faster than
VGG-16, only using 8.52 billion operations for a forward
pass. However, its accuracy is slightly worse than VGG-
16. For single-crop, top-5 accuracy at 224 × 224, YOLO’s
custom model gets 88.0% on ImageNet compared to 90.0% for
VGG-16.
Darknet-19. We propose a new classification model to
be used as the base of YOLOv2. Our model builds off of
prior work on network design as well as common knowl-
edge in the field. Similar to the VGG models we use mostly
3 × 3 filters and double the number of channels after ev-
ery pooling step [17]. Following the work on Network in
Network (NIN) we use global average pooling to make pre-
dictions as well as 1 × 1 filters to compress the feature rep-
resentation between 3 × 3 convolutions [9]. We use batch
normalization to stabilize training, speed up convergence,

YOLO → YOLOv2, cumulative design changes and VOC2007 mAP:
YOLO baseline: 63.4
+ batch norm: 65.8
+ hi-res classifier: 69.5
+ convolutional with anchor boxes: 69.2
+ new network: 69.6
+ dimension priors and direct location prediction (anchor boxes removed): 74.4
+ passthrough: 75.4
+ multi-scale training: 76.8
+ hi-res detector (YOLOv2): 78.6
Table 2: The path from YOLO to YOLOv2. Most of the listed design decisions lead to significant increases in mAP. Two
exceptions are switching to a fully convolutional network with anchor boxes and using the new network. Switching to the
anchor box style approach increased recall without changing mAP while using the new network cut computation by 33%.
Detection Frameworks            Train      mAP   FPS
Fast R-CNN [5]                  2007+2012  70.0   0.5
Faster R-CNN VGG-16 [15]        2007+2012  73.2   7
Faster R-CNN ResNet [6]         2007+2012  76.4   5
YOLO [14]                       2007+2012  63.4   45
SSD300 [11]                     2007+2012  74.3   46
SSD500 [11]                     2007+2012  76.8   19
YOLOv2 288 × 288                2007+2012  69.0   91
YOLOv2 352 × 352                2007+2012  73.7   81
YOLOv2 416 × 416                2007+2012  76.8   67
YOLOv2 480 × 480                2007+2012  77.8   59
YOLOv2 544 × 544                2007+2012  78.6   40
Table 3: Detection frameworks on PASCAL VOC 2007.
YOLOv2 is faster and more accurate than prior detection meth-
ods. It can also run at different resolutions for an easy tradeoff
between speed and accuracy. Each YOLOv2 entry is actually the
same trained model with the same weights, just evaluated at a dif-
ferent size. All timing information is on a Geforce GTX Titan X
(original, not Pascal model).
and regularize the model [7].
Our final model, called Darknet-19, has 19 convolutional
layers and 5 maxpooling layers. For a full description see
Table 6. Darknet-19 only requires 5.58 billion operations
to process an image yet achieves 72.9% top-1 accuracy and
91.2% top-5 accuracy on ImageNet.
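The layer layout below is a reconstruction consistent with this description (mostly 3×3 filters, channels doubled after each of the 5 maxpools, 1×1 compression layers in between, and a 1×1 classification layer before global average pooling); the authoritative listing is Table 6 of the paper, which is not reproduced here.

    # Each tuple is (out_channels, kernel_size); "M" is a 2x2, stride-2 maxpool.
    DARKNET19_CFG = [
        (32, 3), "M",
        (64, 3), "M",
        (128, 3), (64, 1), (128, 3), "M",
        (256, 3), (128, 1), (256, 3), "M",
        (512, 3), (256, 1), (512, 3), (256, 1), (512, 3), "M",
        (1024, 3), (512, 1), (1024, 3), (512, 1), (1024, 3),
        (1000, 1),   # 1x1 classifier, followed by global average pooling + softmax
    ]

    assert sum(1 for layer in DARKNET19_CFG if layer != "M") == 19   # conv layers
    assert sum(1 for layer in DARKNET19_CFG if layer == "M") == 5    # maxpool layers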
Training for classification. We train the network on
the standard ImageNet 1000 class classification dataset for
160 epochs using stochastic gradient descent with a starting
learning rate of 0.1, polynomial rate decay with a power of
4, weight decay of 0.0005 and momentum of 0.9 using the
Darknet neural network framework [13]. During training
we use standard data augmentation tricks including random
crops, rotations, and hue, saturation, and exposure shifts.
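The learning-rate schedule is then a one-line function; this sketch assumes the polynomial decay runs from the base rate of 0.1 down to zero over the full 160 epochs (the paper does not spell out whether the decay is applied per epoch or per iteration):

    def poly_lr(epoch, total_epochs=160, base_lr=0.1, power=4):
        """Polynomial decay: lr = base_lr * (1 - epoch / total_epochs) ** power."""
        return base_lr * (1.0 - epoch / total_epochs) ** power

    print(poly_lr(0), poly_lr(80), poly_lr(159))   # 0.1, 0.00625, ~1.5e-10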
As discussed above, after our initial training on images
at 224 × 224 we fine tune our network at a larger size, 448.
For this fine tuning we train with the above parameters but
for only 10 epochs and starting at a learning rate of 10⁻³. At
this higher resolution our network achieves a top-1 accuracy
of 76.5% and a top-5 accuracy of 93.3%.
Training for detection. We modify this network for de-
tection by removing the last convolutional layer and instead
adding on three 3 × 3 convolutional layers with 1024 fil-
ters each followed by a final 1 × 1 convolutional layer with
the number of outputs we need for detection. For VOC we
predict 5 boxes with 5 coordinates each and 20 classes per
box so 125 filters. We also add a passthrough layer from the
final 3 × 3 × 512 layer to the second to last convolutional
layer so that our model can use fine grain features.
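The filter count follows directly from the parametrization: num_anchors × (5 + num_classes) = 5 × (5 + 20) = 125 for VOC. A PyTorch-style sketch of the head described above (the passthrough concatenation and batch norm layers are omitted for brevity; names and layer details are illustrative, not the authors' Darknet config):

    import torch.nn as nn

    def detection_head(in_ch=1024, num_anchors=5, num_classes=20):
        """Three 3x3x1024 convs plus a final 1x1 conv with the detection outputs."""
        out_ch = num_anchors * (5 + num_classes)       # 125 for VOC
        return nn.Sequential(
            nn.Conv2d(in_ch, 1024, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(1024, out_ch, 1),
        )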
We train the network for 160 epochs with a starting
learning rate of 10⁻³, dividing it by 10 at 60 and 90 epochs.
We use a weight decay of 0.0005 and momentum of 0.9.
We use a similar data augmentation to YOLO and SSD with
random crops, color shifting, etc. We use the same training
strategy on COCO and VOC.
4. Stronger
We propose a mechanism for jointly training on classi-
fication and detection data. Our method uses images la-
belled for detection to learn detection-specific information
like bounding box coordinate prediction and objectness as
well as how to classify common objects. It uses images with
only class labels to expand the number of categories it can
detect.
During training we mix images from both detection and
classification datasets. When our network sees an image
labelled for detection we can backpropagate based on the
full YOLOv2 loss function. When it sees a classification
image we only backpropagate loss from the classification-
specific parts of the architecture.
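A sketch of how the loss is routed during joint training; detection_loss and classification_loss stand in for the full YOLOv2 loss and its classification-only part, and the dict-style target is an assumption made for illustration:

    def joint_training_step(model, image, target, optimizer,
                            detection_loss, classification_loss):
        """Backprop the full loss for detection images, classification loss otherwise."""
        pred = model(image)
        if target.get("boxes") is not None:            # image from the detection dataset
            loss = detection_loss(pred, target)        # full YOLOv2 loss
        else:                                          # image from the classification dataset
            loss = classification_loss(pred, target)   # classification terms only
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss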
This approach presents a few challenges. Detection
datasets have only common objects and general labels, like
"dog" or "boat". Classification datasets have a much wider
and deeper range of labels.
