YOLO9000: Better, Faster, Stronger

Joseph Redmon∗×, Ali Farhadi∗†×
∗University of Washington, †Allen Institute for AI, ×XNOR.ai
http://pjreddie.com/yolo9000/
Abstract
We introduce YOLO9000, a state-of-the-art, real-time
object detection system that can detect over 9000 object
categories. First we propose various improvements to the
YOLO detection method, both novel and drawn from prior
work. The improved model, YOLOv2, is state-of-the-art on
standard detection tasks like PASCAL VOC and COCO. Us-
ing a novel, multi-scale training method the same YOLOv2
model can run at varying sizes, offering an easy tradeoff
between speed and accuracy. At 67 FPS, YOLOv2 gets
76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6
mAP, outperforming state-of-the-art methods like Faster R-
CNN with ResNet and SSD while still running significantly
faster. Finally we propose a method to jointly train on ob-
ject detection and classification. Using this method we train
YOLO9000 simultaneously on the COCO detection dataset
and the ImageNet classification dataset. Our joint training
allows YOLO9000 to predict detections for object classes
that don’t have labelled detection data. We validate our
approach on the ImageNet detection task. YOLO9000 gets
19.7 mAP on the ImageNet detection validation set despite
only having detection data for 44 of the 200 classes. On
the 156 classes not in COCO, YOLO9000 gets 16.0 mAP.
YOLO9000 predicts detections for more than 9000 different
object categories, all in real-time.
1. Introduction
General purpose object detection should be fast, accu-
rate, and able to recognize a wide variety of objects. Since
the introduction of neural networks, detection frameworks
have become increasingly fast and accurate. However, most
detection methods are still constrained to a small set of ob-
jects.
Current object detection datasets are limited compared
to datasets for other tasks like classification and tagging.
The most common detection datasets contain thousands to
hundreds of thousands of images with dozens to hundreds
of tags [3] [10] [2]. Classification datasets have millions
of images with tens or hundreds of thousands of categories
[20] [2].
We would like detection to scale to the level of object clas-
sification. However, labelling images for detection is far
more expensive than labelling for classification or tagging
(tags are often user-supplied for free). Thus we are unlikely
to see detection datasets on the same scale as classification
datasets in the near future.

[Figure 1: YOLO9000. YOLO9000 can detect a wide variety of
object classes in real-time.]
We propose a new method to harness the large amount
of classification data we already have and use it to expand
the scope of current detection systems. Our method uses a
hierarchical view of object classification that allows us to
combine distinct datasets together.
We also propose a joint training algorithm that allows
us to train object detectors on both detection and classifica-
tion data. Our method leverages labeled detection images to
learn to precisely localize objects while it uses classification
images to increase its vocabulary and robustness.
Using this method we train YOLO9000, a real-time ob-
ject detector that can detect over 9000 different object cat-
egories. First we improve upon the base YOLO detection
system to produce YOLOv2, a state-of-the-art, real-time
detector. Then we use our dataset combination method
and joint training algorithm to train a model on more than
9000 classes from ImageNet as well as detection data from
COCO.
All of our code and pre-trained models are available on-
line at
http://pjreddie.com/yolo9000/.
2. Better
YOLO suffers from a variety of shortcomings relative to
state-of-the-art detection systems. Error analysis of YOLO
compared to Fast R-CNN shows that YOLO makes a sig-
nificant number of localization errors. Furthermore, YOLO
has relatively low recall compared to region proposal-based
methods. Thus we focus mainly on improving recall and
localization while maintaining classification accuracy.
Computer vision generally trends towards larger, deeper
networks [6] [18] [17]. Better performance often hinges on
training larger networks or ensembling multiple models to-
gether. However, with YOLOv2 we want a more accurate
detector that is still fast. Instead of scaling up our network,
we simplify the network and then make the representation
easier to learn. We pool a variety of ideas from past work
with our own novel concepts to improve YOLO’s perfor-
mance. A summary of results can be found in Table 2.
Batch Normalization. Batch normalization leads to sig-
nificant improvements in convergence while eliminating the
need for other forms of regularization [7]. By adding batch
normalization on all of the convolutional layers in YOLO
we get more than 2% improvement in mAP. Batch normal-
ization also helps regularize the model. With batch nor-
malization we can remove dropout from the model without
overfitting.
High Resolution Classifier. All state-of-the-art detec-
tion methods use classifiers pre-trained on ImageNet [16].
Starting with AlexNet most classifiers operate on input im-
ages smaller than 256 × 256 [8]. The original YOLO trains
the classifier network at 224 × 224 and increases the reso-
lution to 448 for detection. This means the network has to
simultaneously switch to learning object detection and ad-
just to the new input resolution.
For YOLOv2 we first fine tune the classification network
at the full 448 × 448 resolution for 10 epochs on ImageNet.
This gives the network time to adjust its filters to work better
on higher resolution input. We then fine tune the resulting
network on detection. This high resolution classification
network gives us an increase of almost 4% mAP.
Convolutional With Anchor Boxes. YOLO predicts
the coordinates of bounding boxes directly using fully con-
nected layers on top of the convolutional feature extractor.
Instead of predicting coordinates directly Faster R-CNN
predicts bounding boxes using hand-picked priors [15]. Us-
ing only convolutional layers the region proposal network
(RPN) in Faster R-CNN predicts offsets and confidences for
anchor boxes. Since the prediction layer is convolutional,
the RPN predicts these offsets at every location in a feature
map. Predicting offsets instead of coordinates simplifies the
problem and makes it easier for the network to learn.
We remove the fully connected layers from YOLO and
use anchor boxes to predict bounding boxes. First we
eliminate one pooling layer to make the output of the net-
work’s convolutional layers higher resolution. We also
shrink the network to operate on 416 × 416 input images instead
of 448×448. We do this because we want an odd number of
locations in our feature map so there is a single center cell.
Objects, especially large objects, tend to occupy the center
of the image so it’s good to have a single location right at
the center to predict these objects instead of four locations
that are all nearby. YOLO’s convolutional layers downsam-
ple the image by a factor of 32 so by using an input image
of 416 we get an output feature map of 13 × 13.
When we move to anchor boxes we also decouple the
class prediction mechanism from the spatial location and
instead predict class and objectness for every anchor box.
Following YOLO, the objectness prediction still predicts
the IOU of the ground truth and the proposed box and the
class predictions predict the conditional probability of that
class given that there is an object.
Using anchor boxes we get a small decrease in accuracy.
YOLO only predicts 98 boxes per image but with anchor
boxes our model predicts more than a thousand. Without
anchor boxes our intermediate model gets 69.5 mAP with a
recall of 81%. With anchor boxes our model gets 69.2 mAP
with a recall of 88%. Even though the mAP decreases, the
increase in recall means that our model has more room to
improve.
Dimension Clusters. We encounter two issues with an-
chor boxes when using them with YOLO. The first is that
the box dimensions are hand picked. The network can learn
to adjust the boxes appropriately but if we pick better priors
for the network to start with we can make it easier for the
network to learn to predict good detections.
Instead of choosing priors by hand, we run k-means
clustering on the training set bounding boxes to automat-
ically find good priors. If we use standard k-means with
Euclidean distance, larger boxes generate more error than
smaller boxes. However, what we really want are priors
that lead to good IOU scores, which is independent of the
size of the box. Thus for our distance metric we use:

d(box, centroid) = 1 − IOU(box, centroid)

[Figure 2: Clustering box dimensions on VOC and COCO. We
run k-means clustering on the dimensions of bounding boxes to get
good priors for our model. The left plot shows the average IOU we
get with various choices for k; k = 5 gives a good tradeoff for recall
vs. complexity of the model. The right plot shows the relative
centroids for VOC and COCO. COCO has greater variation in size
than VOC.]

We run k-means for various values of k and plot the av-
erage IOU with closest centroid, see Figure 2. We choose
k = 5 as a good tradeoff between model complexity and
high recall. The cluster centroids are significantly different
than hand-picked anchor boxes. There are fewer short, wide
boxes and more tall, thin boxes.
We compare the average IOU to closest prior of our clus-
tering strategy and the hand-picked anchor boxes in Table 1.
At only 5 priors the centroids perform similarly to 9 anchor
boxes with an average IOU of 61.0 compared to 60.9. If
we use 9 centroids we see a much higher average IOU. This
indicates that using k-means to generate our bounding box
starts the model off with a better representation and makes
the task easier to learn.
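The paper does not include code for this step, but the procedure is short. Below is a minimal numpy sketch, under the assumption that the training boxes are given as (width, height) pairs; the function names are ours, not the paper's:

    import numpy as np

    def iou_wh(boxes, centroids):
        # IOU between (w, h) pairs, treating every box as anchored at
        # the same corner, so only shape matters, not position.
        inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
                 np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
        areas = boxes[:, 0] * boxes[:, 1]
        careas = centroids[:, 0] * centroids[:, 1]
        return inter / (areas[:, None] + careas[None, :] - inter)

    def kmeans_iou(boxes, k, iters=100, seed=0):
        # Standard k-means, but with d = 1 - IOU as the distance.
        rng = np.random.default_rng(seed)
        centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
        for _ in range(iters):
            assign = np.argmax(iou_wh(boxes, centroids), axis=1)
            centroids = np.array([boxes[assign == j].mean(axis=0)
                                  if np.any(assign == j) else centroids[j]
                                  for j in range(k)])
        return centroids

Note that minimizing d = 1 − IOU is the same as assigning each box to the centroid it overlaps most, which is what the argmax above does.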
Box Generation       #   Avg IOU
Cluster SSE          5   58.7
Cluster IOU          5   61.0
Anchor Boxes [15]    9   60.9
Cluster IOU          9   67.2

Table 1: Average IOU of boxes to closest priors on VOC 2007.
The average IOU of objects on VOC 2007 to their closest, unmod-
ified prior using different generation methods. Clustering gives
much better results than using hand-picked priors.
Direct location prediction. When using anchor boxes
with YOLO we encounter a second issue: model instability,
especially during early iterations. Most of the instability
comes from predicting the (x, y) locations for the box. In
region proposal networks the network predicts values t_x and
t_y and the (x, y) center coordinates are calculated as:

x = (t_x · w_a) + x_a
y = (t_y · h_a) + y_a

For example, a prediction of t_x = 1 would shift the box
to the right by the width of the anchor box, a prediction of
t_x = −1 would shift it to the left by the same amount.
This formulation is unconstrained so any anchor box can
end up at any point in the image, regardless of what loca-
tion predicted the box. With random initialization the model
takes a long time to stabilize to predicting sensible offsets.
Instead of predicting offsets we follow the approach of
YOLO and predict location coordinates relative to the loca-
tion of the grid cell. This bounds the ground truth to fall
between 0 and 1. We use a logistic activation to constrain
the network’s predictions to fall in this range.
The network predicts 5 bounding boxes at each cell in
the output feature map. The network predicts 5 coordinates
for each bounding box, t_x, t_y, t_w, t_h, and t_o. If the cell is
offset from the top left corner of the image by (c_x, c_y) and
the bounding box prior has width and height p_w, p_h, then
the predictions correspond to:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
Pr(object) · IOU(b, object) = σ(t_o)
Since we constrain the location prediction the
parametrization is easier to learn, making the network
more stable. Using dimension clusters along with directly
predicting the bounding box center location improves
YOLO by almost 5% over the version with anchor boxes.
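As a concrete reading of this parametrization, here is a minimal decoding sketch (our own helper, not the paper's code); coordinates come out in feature-map cells and can be scaled by the 32-pixel stride to get pixels:

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def decode_box(t_x, t_y, t_w, t_h, t_o, c_x, c_y, p_w, p_h):
        # All quantities are in feature-map cells; (c_x, c_y) is the
        # cell's offset from the top-left corner, (p_w, p_h) the prior.
        b_x = sigmoid(t_x) + c_x          # center stays inside the cell
        b_y = sigmoid(t_y) + c_y
        b_w = p_w * np.exp(t_w)           # size is a scaling of the prior
        b_h = p_h * np.exp(t_h)
        conf = sigmoid(t_o)               # estimates Pr(object) * IOU(b, object)
        return b_x, b_y, b_w, b_h, conf

The sigmoid on t_x and t_y is what keeps each predicted center inside its own grid cell, which is the source of the stability described above.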
Fine-Grained Features. This modified YOLO predicts
detections on a 13 × 13 feature map. While this is suffi-
cient for large objects, it may benefit from finer grained fea-
tures for localizing smaller objects. Faster R-CNN and SSD
both run their proposal networks at various feature maps in
the network to get a range of resolutions. We take a differ-
ent approach, simply adding a passthrough layer that brings
features from an earlier layer at 26 × 26 resolution.
The passthrough layer concatenates the higher resolution
features with the low resolution features by stacking adja-
cent features into different channels instead of spatial lo-
cations, similar to the identity mappings in ResNet. This
turns the 26 × 26 × 512 feature map into a 13 × 13 × 2048
feature map, which can be concatenated with the original
features. Our detector runs on top of this expanded feature
map so that it has access to fine grained features. This gives
a modest 1% performance increase.

[Figure 3: Bounding boxes with dimension priors and location
prediction. We predict the width and height of the box as offsets
from cluster centroids. We predict the center coordinates of the
box relative to the location of filter application using a sigmoid
function.]
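The passthrough layer amounts to a space-to-depth rearrangement. A minimal numpy sketch of one plausible ordering follows; Darknet's actual reorg layer may order channels differently, but the shapes match:

    import numpy as np

    def passthrough(x, stride=2):
        # Stack each stride x stride block of spatial locations into
        # channels, e.g. (26, 26, 512) -> (13, 13, 2048).
        h, w, c = x.shape
        x = x.reshape(h // stride, stride, w // stride, stride, c)
        x = x.transpose(0, 2, 1, 3, 4)
        return x.reshape(h // stride, w // stride, stride * stride * c)

    fine = np.zeros((26, 26, 512), dtype=np.float32)
    print(passthrough(fine).shape)   # (13, 13, 2048)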
Multi-Scale Training. The original YOLO uses an input
resolution of 448 × 448. With the addition of anchor boxes
we changed the resolution to 416×416. However, since our
model only uses convolutional and pooling layers it can be
resized on the fly. We want YOLOv2 to be robust to running
on images of different sizes so we train this into the model.
Instead of fixing the input image size we change the net-
work every few iterations. Every 10 batches our network
randomly chooses new image dimensions. Since our model
downsamples by a factor of 32, we pull from the following
multiples of 32: {320, 352, ..., 608}. Thus the smallest op-
tion is 320 × 320 and the largest is 608 × 608. We resize the
network to that dimension and continue training.
This regime forces the network to learn to predict well
across a variety of input dimensions. This means the same
network can predict detections at different resolutions. The
network runs faster at smaller sizes so YOLOv2 offers an
easy tradeoff between speed and accuracy.
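Schematically, the schedule looks like the sketch below; resize_network and train_one_batch are hypothetical stand-ins for the real framework calls:

    import random

    # Hypothetical stand-ins for the real framework calls.
    def resize_network(w, h): pass
    def train_one_batch(size): pass

    SIZES = list(range(320, 609, 32))   # multiples of 32: 320, 352, ..., 608

    size = 416
    for batch in range(10000):
        if batch % 10 == 0:              # pick a new resolution every 10 batches
            size = random.choice(SIZES)
            resize_network(size, size)   # cheap: the net is fully convolutional
        train_one_batch(size)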
At low resolutions YOLOv2 operates as a cheap, fairly
accurate detector. At 288 × 288 it runs at more than 90 FPS
with mAP almost as good as Fast R-CNN. This makes it
ideal for smaller GPUs, high framerate video, or multiple
video streams.
At high resolution YOLOv2 is a state-of-the-art detector
with 78.6 mAP on VOC 2007 while still operating above
real-time speeds. See Table 3 and Figure 4 for a comparison
of YOLOv2 with other frameworks on VOC 2007.
[Figure 4: Accuracy and speed on VOC 2007. Mean average
precision vs. frames per second for R-CNN, Fast R-CNN, Faster
R-CNN (with VGG-16 and ResNet), SSD300, SSD512, YOLO,
and YOLOv2.]

Further Experiments. We train YOLOv2 for detection
on VOC 2012. Table 4 shows the comparative performance
of YOLOv2 versus other state-of-the-art detection systems.
YOLOv2 achieves 73.4 mAP while running far faster than
other methods. We also train on COCO, see Table 5. On the
VOC metric (IOU = .5) YOLOv2 gets 44.0 mAP, compara-
ble to SSD and Faster R-CNN.
3. Faster
We want detection to be accurate but we also want it to be
fast. Most applications for detection, like robotics or self-
driving cars, rely on low latency predictions. In order to
maximize performance we design YOLOv2 to be fast from
the ground up.
Most detection frameworks rely on VGG-16 as the base
feature extractor [17]. VGG-16 is a powerful, accurate clas-
sification network but it is needlessly complex. The con-
volutional layers of VGG-16 require 30.69 billion floating
point operations for a single pass over a single image at
224 × 224 resolution.
The YOLO framework uses a custom network based on
the GoogLeNet architecture [19]. This network is faster than
VGG-16, only using 8.52 billion operations for a forward
pass. However, its accuracy is slightly worse than VGG-
16. For single-crop, top-5 accuracy at 224 × 224, YOLO's
custom model gets 88.0% on ImageNet compared to 90.0% for
VGG-16.
Darknet-19. We propose a new classification model to
be used as the base of YOLOv2. Our model builds off of
prior work on network design as well as common knowl-
edge in the field. Similar to the VGG models we use mostly
3 × 3 filters and double the number of channels after ev-
ery pooling step [17]. Following the work on Network in
Network (NIN) we use global average pooling to make pre-
dictions as well as 1 × 1 filters to compress the feature rep-
resentation between 3 × 3 convolutions [9]. We use batch
normalization to stabilize training, speed up convergence,

                        YOLO                                            YOLOv2
batch norm?                   ✓     ✓     ✓     ✓     ✓     ✓     ✓     ✓
hi-res classifier?                  ✓     ✓     ✓     ✓     ✓     ✓     ✓
convolutional?                            ✓     ✓     ✓     ✓     ✓     ✓
anchor boxes?                             ✓     ✓
new network?                                    ✓     ✓     ✓     ✓     ✓
dimension priors?                                     ✓     ✓     ✓     ✓
location prediction?                                  ✓     ✓     ✓     ✓
passthrough?                                                ✓     ✓     ✓
multi-scale?                                                      ✓     ✓
hi-res detector?                                                        ✓
VOC2007 mAP             63.4  65.8  69.5  69.2  69.6  74.4  75.4  76.8  78.6
Table 2: The path from YOLO to YOLOv2. Most of the listed design decisions lead to significant increases in mAP. Two
exceptions are switching to a fully convolutional network with anchor boxes and using the new network. Switching to the
anchor box style approach increased recall without changing mAP while using the new network cut computation by 33%.
Detection Frameworks         Train      mAP   FPS
Fast R-CNN [5]               2007+2012  70.0  0.5
Faster R-CNN VGG-16 [15]     2007+2012  73.2  7
Faster R-CNN ResNet [6]      2007+2012  76.4  5
YOLO [14]                    2007+2012  63.4  45
SSD300 [11]                  2007+2012  74.3  46
SSD500 [11]                  2007+2012  76.8  19
YOLOv2 288 × 288             2007+2012  69.0  91
YOLOv2 352 × 352             2007+2012  73.7  81
YOLOv2 416 × 416             2007+2012  76.8  67
YOLOv2 480 × 480             2007+2012  77.8  59
YOLOv2 544 × 544             2007+2012  78.6  40
Table 3: Detection frameworks on PASCAL VOC 2007.
YOLOv2 is faster and more accurate than prior detection meth-
ods. It can also run at different resolutions for an easy tradeoff
between speed and accuracy. Each YOLOv2 entry is actually the
same trained model with the same weights, just evaluated at a dif-
ferent size. All timing information is on a Geforce GTX Titan X
(original, not Pascal model).
and regularize the model [7].
Our final model, called Darknet-19, has 19 convolutional
layers and 5 maxpooling layers. For a full description see
Table 6. Darknet-19 only requires 5.58 billion operations
to process an image yet achieves 72.9% top-1 accuracy and
91.2% top-5 accuracy on ImageNet.
Training for classification. We train the network on
the standard ImageNet 1000 class classification dataset for
160 epochs using stochastic gradient descent with a starting
learning rate of 0.1, polynomial rate decay with a power of
4, weight decay of 0.0005 and momentum of 0.9 using the
Darknet neural network framework [13]. During training
we use standard data augmentation tricks including random
crops, rotations, and hue, saturation, and exposure shifts.
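Assuming the usual "poly" policy, the rate schedule can be read as the following sketch; treat the exact form as our interpretation rather than verified Darknet behavior:

    def poly_lr(batch, max_batches, lr0=0.1, power=4):
        # lr0 * (1 - progress)^power; assumed form of the "poly" policy.
        return lr0 * (1.0 - batch / max_batches) ** power

    print(poly_lr(0, 500000))       # 0.1 at the start
    print(poly_lr(250000, 500000))  # 0.00625 halfway through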
As discussed above, after our initial training on images
at 224 × 224 we fine tune our network at a larger size, 448.
For this fine tuning we train with the above parameters but
for only 10 epochs and starting at a learning rate of 10⁻³. At
this higher resolution our network achieves a top-1 accuracy
of 76.5% and a top-5 accuracy of 93.3%.
Training for detection. We modify this network for de-
tection by removing the last convolutional layer and instead
adding on three 3 × 3 convolutional layers with 1024 fil-
ters each followed by a final 1 × 1 convolutional layer with
the number of outputs we need for detection. For VOC we
predict 5 boxes with 5 coordinates each and 20 classes per
box so 125 filters. We also add a passthrough layer from the
final 3 × 3 × 512 layer to the second to last convolutional
layer so that our model can use fine grain features.
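As a quick sanity check on that filter count (our own arithmetic, mirroring the text):

    num_priors  = 5    # boxes per cell (the dimension clusters)
    num_coords  = 5    # t_x, t_y, t_w, t_h, t_o
    num_classes = 20   # PASCAL VOC
    filters = num_priors * (num_coords + num_classes)
    print(filters)     # 125 -> the final feature map is 13 x 13 x 125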
We train the network for 160 epochs with a starting
learning rate of 10⁻³, dividing it by 10 at 60 and 90 epochs.
We use a weight decay of 0.0005 and momentum of 0.9.
We use a similar data augmentation to YOLO and SSD with
random crops, color shifting, etc. We use the same training
strategy on COCO and VOC.
4. Stronger
We propose a mechanism for jointly training on classi-
fication and detection data. Our method uses images la-
belled for detection to learn detection-specific information
like bounding box coordinate prediction and objectness as
well as how to classify common objects. It uses images with
only class labels to expand the number of categories it can
detect.
During training we mix images from both detection and
classification datasets. When our network sees an image
labelled for detection we can backpropagate based on the
full YOLOv2 loss function. When it sees a classification
image we only backpropagate loss from the classification-
specific parts of the architecture.
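Schematically, the routing can be read as the sketch below; the stub losses and field names are hypothetical, and the real loop lives in the Darknet framework:

    import random

    # Hypothetical stubs standing in for the real Darknet losses.
    def full_yolo_loss(pred, boxes):   # localization + objectness + classification
        return 0.0
    def class_only_loss(pred, label):  # classification terms only
        return 0.0

    def joint_step(model, detection_data, classification_data):
        # One step over the mixed dataset: detection images train the
        # full objective, classification images only the class terms.
        example = random.choice(detection_data + classification_data)
        pred = model(example["image"])
        if "boxes" in example:             # labelled for detection
            return full_yolo_loss(pred, example["boxes"])
        return class_only_loss(pred, example["label"])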
This approach presents a few challenges. Detection
datasets have only common objects and general labels, like
"dog" or "boat".

References

[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[16] O. Russakovsky, J. Deng, H. Su, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[17] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.