DOTA: A Large-scale Dataset for Object Detection in Aerial Images
Gui-Song Xia¹, Xiang Bai², Jian Ding¹, Zhen Zhu², Serge Belongie³, Jiebo Luo⁴, Mihai Datcu⁵, Marcello Pelillo⁶, Liangpei Zhang¹
¹Wuhan University, ²Huazhong Univ. Sci. and Tech., ³Cornell University, ⁴University of Rochester, ⁵German Aerospace Center (DLR), ⁶University of Venice
{guisong.xia, jian.ding, zlp62}@whu.edu.cn, {xbai, zzhu}@hust.edu.cn
sjb344@cornell.edu, jiebo.luo@gmail.com, mihai.datcu@dlr.de, pelillo@dsi.unive.it
Abstract
Object detection is an important and challenging problem in computer vision. Although the past decade has witnessed major advances in object detection in natural scenes, such successes have been slow to transfer to aerial imagery, not only because of the huge variation in the scale, orientation and shape of object instances on the earth's surface, but also due to the scarcity of well-annotated datasets of objects in aerial scenes. To advance object detection research in Earth Vision, also known as Earth Observation and Remote Sensing, we introduce a large-scale Dataset for Object deTection in Aerial images (DOTA). To this end, we collect 2806 aerial images from different sensors and platforms. Each image is about 4000 × 4000 pixels in size and contains objects exhibiting a wide variety of scales, orientations, and shapes. These DOTA images are then annotated by experts in aerial image interpretation using 15 common object categories. The fully annotated DOTA images contain 188,282 instances, each of which is labeled by an arbitrary (8 d.o.f.) quadrilateral. To build a baseline for object detection in Earth Vision, we evaluate state-of-the-art object detection algorithms on DOTA. Experiments demonstrate that DOTA well represents real Earth Vision applications and is quite challenging.
1. Introduction
Object detection in Earth Vision refers to localizing objects of interest (e.g., vehicles, airplanes) on the earth's surface and predicting their categories. In contrast to conventional object detection datasets, where objects are generally oriented upward due to gravity, the object instances in aerial images often appear with arbitrary orientations, as illustrated in Fig. 1, depending on the perspective of the Earth Vision platforms.
The DOTA website is https://captain-whu.github.io/DOTA.
Figure 1: An example taken from DOTA. (a) A typical image in DOTA, consisting of many instances across multiple categories. (b) Illustration of the variety in instance orientation and size. (c), (d) Illustration of sparse instances and crowded instances, respectively. The panels cover ship, harbor, car and roundabout, four of the fifteen possible categories in DOTA. The examples in (b), (c), (d) are cropped from the source image (a). The histograms (e), (f) show the distribution of instances with respect to size (in pixels) and orientation (in radians) in DOTA.
Extensive studies have been devoted to object detection in aerial images [24, 15, 18, 3, 20, 39, 19, 32, 31, 22], drawing upon recent advances in Computer Vision and accounting for the high demands of Earth Vision applications. Most of these methods [39, 19, 32, 3] attempt to transfer object detection algorithms developed for natural scenes to the aerial image domain. Recently, driven by the successes of deep learning-based algorithms for object detection, Earth Vision researchers have pursued approaches based on fine-tuning networks pre-trained on large-scale image datasets (e.g., ImageNet [6] and MSCOCO [14]) for detection in the aerial domain, see e.g. [19, 30, 2, 3].
While such fine-tuning based approaches are a reasonable avenue to explore, images such as Fig. 1 reveal that the task of object detection in aerial images is distinguished from the conventional object detection task in several respects:
- The scale variations of object instances in aerial images are huge. This is not only because of the different spatial resolutions of sensors, but also due to size variations within the same object category.
- Many small object instances are crowded in aerial images, for example the ships in a harbor and the vehicles in a parking lot, as illustrated in Fig. 1. Moreover, the frequencies of instances in aerial images are unbalanced: some small-size (e.g. 1k × 1k) images contain as many as 1,900 instances, while some large-size (e.g. 4k × 4k) images may contain only a handful of small instances.
- Objects in aerial images often appear in arbitrary orientations. There are also some instances with an extremely large aspect ratio, such as bridges.
Besides these difficulties, research on object detection in Earth Vision is also challenged by the dataset bias problem [29], i.e. the degree of generalizability across datasets is often low. To alleviate such biases, a dataset should be annotated to reflect the demands of real-world applications.
Therefore, it is not surprising that object detectors learned from natural images are not suitable for aerial images. However, existing annotated datasets for object detection in aerial images, such as UCAS-AOD [41] and NWPU VHR-10 [2], tend to use images in ideal conditions (clear backgrounds and without densely distributed instances), which cannot adequately reflect the problem complexity.
To advance object detection research in Earth Vision, this paper introduces a large-scale Dataset for Object deTection in Aerial images (DOTA). We collect 2806 aerial images from different sensors and platforms with crowd-sourcing. Each image is about 4k × 4k pixels in size and contains objects of different scales, orientations and shapes. These DOTA images are annotated by experts in aerial image interpretation, with respect to 15 common object categories. The fully annotated DOTA dataset contains 188,282 instances, each of which is labeled by an arbitrary quadrilateral, instead of an axis-aligned bounding box, as is typically used for object annotation in natural scenes. The main contributions of this work are:
- To our knowledge, DOTA is the largest annotated object dataset with a wide variety of categories in Earth Vision.¹ It can be used to develop and evaluate object detectors in aerial images. We will continue to update DOTA, growing its size and scope to reflect evolving real-world conditions.
- We also benchmark state-of-the-art object detection algorithms on DOTA, which can serve as a baseline for future algorithm development.

¹The DIUx xView Detection Challenge, with more categories and instances, opened in Feb. 2018: http://xviewdataset.org
In addition to advancing object detection studies in Earth Vision, DOTA will also pose interesting algorithmic questions for conventional object detection in computer vision.
2. Motivations
Datasets have played an important role in data-driven research in recent years [36, 6, 14, 40, 38, 33]. Large datasets like MSCOCO [14] are instrumental in promoting object detection and image captioning research; the same is true for ImageNet [6] in classification and Places [40] in scene recognition.
However, in aerial object detection, a dataset resembling MSCOCO and ImageNet, both in the number of images and in detailed annotations, has been missing, which has become one of the main obstacles to research in Earth Vision, especially for developing deep learning-based algorithms. Aerial object detection is extremely helpful for remote object tracking and unmanned driving. Therefore, a large-scale and challenging aerial object detection benchmark, as close as possible to real-world applications, is imperative for promoting research in this field.
We argue that a good aerial image dataset should possess four properties, namely 1) a large number of images, 2) many instances per category, 3) properly oriented object annotation, and 4) many different classes of objects, which together bring it close to real-world applications. However, existing aerial image datasets [41, 18, 16, 25] share several shortcomings: insufficient data and classes, a lack of detailed annotations, and low image resolution. Moreover, their complexity is inadequate to be considered a reflection of the real world.
Datasets like TAS [9], VEDAI [25], COWC [21] and DLR 3K Munich Vehicle [16] focus only on vehicles. UCAS-AOD [41] contains vehicles and planes, while HRSC2016 [18] contains only ships, even though fine-grained category information is given. All of these datasets offer only a small number of classes, which restricts their applicability to complicated scenes. In contrast, NWPU VHR-10 [2] is composed of ten different classes of objects, but its total number of instances is only around 3000. Detailed comparisons of these existing datasets are shown in Tab. 1. Compared to these aerial datasets, as we shall see in Section 4, DOTA is challenging for its tremendous number of object instances, arbitrary but well-distributed orientations, various categories and complicated aerial scenes. Moreover, scenes in DOTA are consistent with real-world scenes, so DOTA is more helpful for real-world applications.

| Dataset | Annotation way | #main categories | #Instances | #Images | Image width |
|---|---|---|---|---|---|
| NWPU VHR-10 [2] | horizontal BB | 10 | 3,651 | 800 | ~1,000 |
| SZTAKI-INRIA [1] | oriented BB | 1 | 665 | 9 | ~800 |
| TAS [9] | horizontal BB | 1 | 1,319 | 30 | 792 |
| COWC [21] | one dot | 1 | 32,716 | 53 | 2,000–19,000 |
| VEDAI [25] | oriented BB | 3 | 2,950 | 1,268 | 512, 1,024 |
| UCAS-AOD [41] | oriented BB | 2 | 14,596 | 1,510 | ~1,000 |
| HRSC2016 [18] | oriented BB | 1 | 2,976 | 1,061 | ~1,100 |
| 3K Vehicle Detection [16] | oriented BB | 2 | 14,235 | 20 | 5,616 |
| DOTA | oriented BB | 14 | 188,282 | 2,806 | 800–4,000 |

Table 1: Comparison between DOTA and object detection datasets in aerial images. BB is short for bounding box. One-dot refers to annotations that provide only the center coordinates of an instance. Fine-grained categories are not taken into account: for example, DOTA consists of 15 different categories but only 14 main categories, because small vehicle and large vehicle are both sub-categories of vehicle.
When it comes to general object datasets, ImageNet and MSCOCO are favored for their large numbers of images, many categories and detailed annotations. ImageNet has the largest number of images among all object detection datasets. However, its average number of instances per image is far smaller than in MSCOCO and our DOTA, and it is further limited by its clean backgrounds and carefully selected scenes. Images in DOTA contain an extremely large number of object instances; some images hold more than 1,000. The PASCAL VOC dataset [7] is similar to ImageNet in instances per image and scene types, but its inadequate number of images makes it unsuitable for most detection needs. Our DOTA resembles MSCOCO in terms of instance numbers and scene types, but DOTA's categories are fewer than MSCOCO's, because objects that can be seen clearly in aerial images are quite limited.
Besides, what makes DOTA unique among the above-mentioned large-scale general object detection benchmarks is that its objects are annotated with properly oriented bounding boxes (OBB for short). An OBB can better enclose an object and differentiate crowded objects from each other. The benefits of annotating objects in aerial images with OBBs are further described in Section 3. We draw a comparison among DOTA, PASCAL VOC, ImageNet and MSCOCO to show the differences in Tab. 2.
| Dataset | Category | Image quantity | BBox quantity | Avg. BBox quantity |
|---|---|---|---|---|
| PASCAL VOC (07++12) | 20 | 21,503 | 62,199 | 2.89 |
| MSCOCO (2014 trainval) | 80 | 123,287 | 886,266 | 7.19 |
| ImageNet (2017 train) | 200 | 349,319 | 478,806 | 1.37 |
| DOTA | 15 | 2,806 | 188,282 | 67.10 |

Table 2: Comparison between DOTA and other general object detection datasets. BBox is short for bounding box; Avg. BBox quantity indicates the average number of bounding boxes per image. Note that DOTA hugely surpasses the other datasets in the average number of instances per image.
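The last column is simply the ratio of the two preceding ones; as a quick arithmetic check of DOTA's row:

$$\frac{188{,}282\ \text{instances}}{2{,}806\ \text{images}} \approx 67.10\ \text{instances per image}.$$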
3. Annotation of DOTA
3.1. Image collection
In aerial images, the resolution and the variety of the sensors used are factors that produce dataset biases [5]. To eliminate these biases, images in our dataset are collected from multiple sensors and platforms (e.g. Google Earth) at multiple resolutions. To increase the diversity of the data, we collect images shot over multiple cities, carefully chosen by experts in aerial image interpretation. We record the exact geographical coordinates of the location and the capture time of each image to ensure there are no duplicate images.
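The paper does not detail how the duplicate check works; one plausible sketch keys each image on its recorded coordinates and capture time (the record fields here are our assumption, not part of DOTA):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImageKey:
    # Hypothetical fields; the paper only says coordinates and
    # capture time are recorded, not how they are stored.
    lat: float
    lon: float
    captured_at: str  # e.g. an ISO-8601 timestamp

def deduplicate(records):
    """Keep the first image seen for each (location, time) key."""
    seen, unique = set(), []
    for rec in records:
        key = ImageKey(round(rec["lat"], 5), round(rec["lon"], 5), rec["time"])
        if key not in seen:  # a repeated key indicates a duplicate capture
            seen.add(key)
            unique.append(rec)
    return unique
```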
3.2. Category selection
Fifteen categories are chosen and annotated in our DOTA dataset: plane, ship, storage tank, baseball diamond, tennis court, swimming pool, ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer ball field and basketball court.
The categories were selected by experts in aerial image interpretation, according to whether a kind of object is common and valuable for real-world applications. The first 10 categories are common in existing datasets, e.g. [16, 2, 41, 21]. We keep them all, except that we further split vehicle into large vehicle and small vehicle, because there is an obvious difference between these two sub-categories in aerial images. The other categories were added mainly for their value in real applications. For example, we select helicopter considering that moving objects are of significant importance in aerial images; roundabout is chosen because it plays an important role in roadway analysis.
It is worth discussing whether to take "stuff" categories into account. There are usually no clear definitions for "stuff" categories (e.g. harbor, airport, parking lot), as shown in the SUN dataset [34]. However, the context information they provide may be helpful for detection. Among them, we adopt only the harbor category, because its border is relatively easy to define and there are abundant harbor instances in our image sources. Soccer ball field is another new category in DOTA.
In Fig. 2, we compare the categories of DOTA with those of NWPU VHR-10 [2], which has the largest number of categories among previous aerial object detection datasets. Note that DOTA surpasses NWPU VHR-10 not only in the number of categories, but also in the number of instances per category.
Figure 2: Comparison between DOTA and NWPU VHR-10 in categories and the corresponding number of instances per category (instance counts plotted on a logarithmic scale from 100 to 100,000).
3.3. Annotation method
We consider different ways of annotating. In computer vision, many visual concepts, such as region descriptions, objects, attributes, and relationships, are annotated with bounding boxes, as shown in [12]. A common description of a bounding box is $(x_c, y_c, w, h)$, where $(x_c, y_c)$ is the center location and $w$ and $h$ are the width and height of the bounding box, respectively.
Objects without many orientations can be adequately annotated with this method. However, bounding boxes labeled in this way cannot accurately or compactly outline oriented instances such as text or objects in aerial images. In an extreme but actually common case, shown in Fig. 3 (c) and (d), the overlap between two horizontal bounding boxes is so large that state-of-the-art object detection methods cannot differentiate the instances. To remedy this, we need an annotation method suited to oriented objects.
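To see why such overlap defeats detectors, recall that near-duplicate detections are suppressed by intersection-over-union (IoU); a small sketch of ours, with made-up coordinates for two diagonally parked vehicles:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# Two neighboring diagonally parked vehicles: their horizontal boxes
# overlap heavily, so non-maximum suppression would likely discard one.
print(iou((10, 10, 60, 40), (20, 12, 70, 42)))  # ~0.60, above typical NMS thresholds (0.3-0.5)
```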
One option for annotating oriented objects is the θ-based oriented bounding box adopted in some text detection benchmarks [37], namely $(x_c, y_c, w, h, \theta)$, where $\theta$ denotes the angle from the horizontal direction of the standard bounding box. A flaw of this method is its inability to compactly enclose oriented objects with large deformation among different parts. Considering the complicated scenes and various orientations of objects in aerial images, we abandon this method in favor of a more flexible and easy-to-understand alternative: arbitrary quadrilateral bounding boxes, denoted $\{(x_i, y_i),\ i = 1, 2, 3, 4\}$, where $(x_i, y_i)$ denotes the position of the $i$-th vertex of the oriented bounding box in the image. The vertices are arranged in clockwise order. This representation is widely adopted in oriented text detection benchmarks [11]. We draw inspiration from this research and use arbitrary quadrilateral bounding boxes to annotate objects.
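For comparison, the two notations describe the same rectangle when the quadrilateral happens to be one; a sketch of ours, under the common convention that $\theta$ is in radians and rotation is about the box center, converting $(x_c, y_c, w, h, \theta)$ into four clockwise vertices:

```python
import math

def obb_to_quad(xc, yc, w, h, theta):
    """Convert (xc, yc, w, h, theta) to four clockwise vertices.

    theta is measured from the horizontal axis; the vertex order
    starts at the rotated top-left corner.
    """
    c, s = math.cos(theta), math.sin(theta)
    # Corner offsets of the axis-aligned box, listed clockwise in image
    # coordinates (y grows downward): TL, TR, BR, BL.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each offset by theta, then translate to the center.
    return [(xc + dx * c - dy * s, yc + dx * s + dy * c) for dx, dy in offsets]
```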
To make the annotation more detailed, as shown in Fig. 3, we emphasize the importance of the first point $(x_1, y_1)$, which normally implies the "head" of the object. For helicopter, large vehicle, small vehicle, harbor, baseball diamond, ship and plane, we carefully denote the first point to enrich potential usages. For soccer-ball field, swimming pool, bridge, ground track field, basketball court and tennis court, there are no visual clues to decide the first point, so we choose the top-left point as the starting point.
Some samples of annotated patches (not whole original images) from our dataset are shown in Fig. 4.
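As an illustration only (this is not the released DOTA file layout), a quadrilateral annotation can be stored as four (x, y) vertices, from which the enclosing horizontal bounding box follows directly:

```python
def quad_to_hbb(quad):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) enclosing
    a quadrilateral given as [(x1, y1), ..., (x4, y4)]."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return min(xs), min(ys), max(xs), max(ys)

# A hypothetical small-vehicle annotation: four clockwise vertices,
# first vertex at the object's "head".
quad = [(102.0, 58.0), (118.0, 64.0), (112.0, 80.0), (96.0, 74.0)]
print(quad_to_hbb(quad))  # (96.0, 58.0, 118.0, 80.0)
```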
It is worth noting that Papadopoulos et al. [23] have explored an alternative annotation method and verified its efficiency and robustness. We expect that annotations would be more precise and robust with more elaborately designed annotation methods, and that alternative annotation protocols would facilitate more efficient crowd-sourced image annotation.
3.4. Dataset splits
To ensure that the training and test data distributions approximately match, we randomly select half of the original images as the training set, 1/6 as the validation set, and 1/3 as the testing set. We will publicly provide all original images with ground truth for the training and validation sets, but not for the testing set, for which we are currently building an evaluation server.
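A minimal sketch of such a split; the 1/2, 1/6, 1/3 proportions come from the text, while the function itself and its seed are our assumption:

```python
import random

def split_dataset(image_ids, seed=0):
    """Randomly split image ids into train (1/2), val (1/6) and test (1/3)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = n // 2, n // 6
    return (ids[:n_train],                 # training set
            ids[n_train:n_train + n_val],  # validation set
            ids[n_train + n_val:])         # testing set (remaining ~1/3)
```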
4. Properties of DOTA
4.1. Image size
Aerial images are usually very large compared to those in natural image datasets. The original size of images in our dataset ranges from about 800 × 800 to about 4k × 4k pixels, while most images in regular datasets (e.g. PASCAL-VOC and MSCOCO) are no larger than 1k × 1k. We annotate the original full images without partitioning them into pieces, to avoid cases where a single instance is split across different pieces.
4.2. Various orientations of instances
As shown in Fig. 1 (f), our dataset achieves a good balance among instances of different orientations, which is significantly helpful for learning a robust detector. Moreover, our dataset is closer to real scenes, because it is common to see objects in all kinds of orientations in the real world.

Figure 3: Visualization of the adopted annotation method. The yellow point represents the starting point, which refers to: (a) the top-left corner of a plane, (b) the center of a sector-shaped baseball diamond, (c) the top-left corner of a large vehicle. (d) is a failure case of horizontal rectangle annotation, which brings much higher overlap than (c).
Figure 4: Samples of annotated images in DOTA. We show three samples per category, except six for large-vehicle.
Figure 5: Statistics of instances in DOTA. AR denotes aspect ratio. (a) The AR of horizontal bounding boxes. (b) The AR of oriented bounding boxes. (c) Histogram of the number of annotated instances per image.

Citations
- Deep Learning for Generic Object Detection: A Survey. TL;DR: A comprehensive survey of the recent achievements in this field brought about by deep learning techniques, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics.
- Object Detection in 20 Years: A Survey. TL;DR: This paper extensively reviews 400+ papers of object detection in the light of its technical evolution, spanning over a quarter-century (from the 1990s to 2019), and makes an in-depth analysis of their challenges as well as technical improvements in recent years.
- Object detection in optical remote sensing images: A survey and a new benchmark. TL;DR: A comprehensive review of recent deep learning based object detection progress in both the computer vision and earth observation communities is provided, and a large-scale, publicly available benchmark for object DetectIon in Optical Remote sensing images, named DIOR, is proposed.
- A Survey of Deep Learning-Based Object Detection. TL;DR: This survey provides a comprehensive overview of a variety of object detection methods in a systematic manner, covering the one-stage and two-stage detectors, and lists the traditional and new applications.
- Learning RoI Transformer for Oriented Object Detection in Aerial Images. TL;DR: The core idea of RoI Transformer is to apply spatial transformations on RoIs and learn the transformation parameters under the supervision of oriented bounding box (OBB) annotations.
References
- Deep Residual Learning for Image Recognition. TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won 1st place in the ILSVRC 2015 classification task.
- ImageNet: A large-scale hierarchical image database. TL;DR: A new database called "ImageNet" is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
- Going deeper with convolutions. TL;DR: Inception is a deep convolutional neural network architecture that achieved the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
- Microsoft COCO: Common Objects in Context. TL;DR: A new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding, gathering images of complex everyday scenes containing common objects in their natural context.
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.