
Learning Aerial Image Segmentation From Online Maps

21 Jul 2017 - IEEE Transactions on Geoscience and Remote Sensing (IEEE) - Vol. 55, Iss. 11, pp. 6054-6068
TL;DR: In this article, a state-of-the-art CNN architecture was adapted for semantic segmentation of buildings and roads in aerial images, and its performance was compared when training on data sets ranging from manually labeled ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations.
Abstract: This paper deals with semantic segmentation of high-resolution (aerial) images where a semantic class label is assigned to each pixel via supervised classification as a basis for automatic map generation. Recently, deep convolutional neural networks (CNNs) have shown impressive performance and have quickly become the de-facto standard for semantic segmentation, with the added benefit that task-specific feature design is no longer necessary. However, a major downside of deep learning methods is that they are extremely data hungry, thus aggravating the perennial bottleneck of supervised classification, to obtain enough annotated training data. On the other hand, it has been observed that they are rather robust against noise in the training labels. This opens up the intriguing possibility to avoid annotating huge amounts of training data, and instead train the classifier from existing legacy data or crowd-sourced maps that can exhibit high levels of noise. The question addressed in this paper is: can training with large-scale publicly available labels replace a substantial part of the manual labeling effort and still achieve sufficient performance? Such data will inevitably contain a significant portion of errors, but in return virtually unlimited quantities of it are available in larger parts of the world. We adapt a state-of-the-art CNN architecture for semantic segmentation of buildings and roads in aerial images, and compare its performance when using different training data sets, ranging from manually labeled pixel-accurate ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations. We report our results that indicate that satisfying performance can be obtained with significantly less manual annotation effort, by exploiting noisy large-scale training data.

Summary (3 min read)

Introduction

  • 4) If low-accuracy, large-scale training data help, they may also allow one to substitute a large portion of the manually annotated high-quality data.
  • At the same time, they also fulfill the other requirements for their study: they are data hungry and robust to label noise [4].
  • For practical reasons, their study is limited to buildings and roads, which are available from OSM, and to RGB images from Google Maps, subject to unknown radiometric manipulations.

A. Generation of Training Data

  • The authors use a simple automatic approach to generate data sets of VHR aerial images in RGB format and corresponding labels for classes building, road, and background.
  • Aerial images are downloaded from Google Maps, and geographic coordinates of buildings and roads are downloaded from OSM.
  • 3 OSM data can be accessed and manipulated in vector format, and each object type comes with meta data and identifiers that allow straightforward filtering.
  • This simple strategy works reasonably well, with a mean error of ≈11 pixels for the road boundary, compared with ≈100 pixels of road width.
  • In (very rare) cases where the ad hoc procedure produced label collisions, pixels claimed by both building and road were assigned to buildings.

B. Neural Network Architecture

  • Following the standard neural network concept, transformations are ordered in sequential layers that gradually transform the pixel values to label probabilities.
  • 4Note that it is technically possible to obtain world coordinates of objects in Google Maps and enter those into OSM, and this might in practice also be done to some extent.
  • 5Average deviation based on ten random samples of Potsdam, Chicago, Paris, and Zurich.
  • Convolutional layers are interspersed with max-pooling layers that downsample the image and retain only the maximum value inside a (2 × 2) neighborhood.
  • Note that adding the third skip connection does not increase the total number of parameters but, on the contrary, slightly reduces it ( [5]: 134′277′737, ours: 134′276′540; the small difference is due to the decomposition of the final upsampling kernel into two smaller ones).

D. Training

  • All model parameters are learned by minimizing a multinomial logistic loss, summed over the entire 500 × 500 pixel patch that serves as input to the FCN.
  • Prior to training/inference, intensity distributions are centered independently per patch by subtracting the mean, separately for each channel (RGB).
  • Learning rates always start from 5 × 10⁻⁹ and are reduced by a factor of ten twice when the loss and average F1 scores stop improving.
  • Starting from pretrained models, even if these have been trained on a completely different image data set, often improves performance, because low-level features like contrast edges and blobs learned in early network layers are very similar across different kinds of images.
  • Either the authors rely on weights previously learned on the Pascal VOC benchmark [53] (made available by Long et al. [5]), or they pretrain with OSM data themselves.

IV. EXPERIMENTS

  • The authors present extensive experiments on four large data sets of different cities to explore the following scenarios.
  • Note that all experiments are designed to investigate different aspects of the hypotheses made in the introduction.

A. Data Sets

  • Four large data sets were downloaded from Google Maps and OSM, for the cities of Chicago, Paris, Zurich, and Berlin.
  • Example images and segmentation maps of Paris and Zurich are shown in Fig. 1. In Fig. 4, the authors show the full extent of the Potsdam scene, dictated by the available images and ground truth in the ISPRS benchmark.
  • In particular, the benchmark ground truth does not have a label street, but instead uses a broader class impervious surfaces, also comprising sidewalks, tarmacked courtyards, and so on.
  • To allow for a direct and fair comparison, the authors downsample the ISPRS Potsdam data, which comes at a GSD of 5 cm, to the same GSD as the Potsdam–Google data (9.1 cm); a resampling sketch follows this list.
  • Each data set is split into mutually exclusive training, validation, and test regions.
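A minimal Python sketch of that GSD-matching step is shown below. It is an illustration under assumptions (Pillow resampling, a synthetic tile standing in for ISPRS imagery), not the authors' code.

# Sketch of the resampling step (our illustration, not the authors' code):
# going from 5 cm to 9.1 cm GSD shrinks the image by a factor of 5/9.1 ≈ 0.55.
# The input image is a synthetic stand-in for an ISPRS Potsdam tile.
from PIL import Image
import numpy as np

SRC_GSD = 0.05    # meters per pixel (ISPRS Potsdam)
DST_GSD = 0.091   # meters per pixel (Potsdam-Google)

def resample_to_gsd(img):
    scale = SRC_GSD / DST_GSD
    new_size = (round(img.width * scale), round(img.height * scale))
    # Bilinear for imagery; NEAREST should be used for label maps to keep class ids intact.
    return img.resize(new_size, resample=Image.BILINEAR)

tile = Image.fromarray(np.random.randint(0, 255, (1000, 1000, 3), dtype=np.uint8))
print(resample_to_gsd(tile).size)   # (549, 549): 1000 px at 5 cm -> ~9.1 cm GSD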

B. Results and Discussion

  • First, the authors validate their modifications of the FCN architecture, by comparing it with the original model of [5].
  • The visual comparison between baseline II in Fig. 7(g)–(i) and IV in Fig. 9(a)–(c) shows that buildings are segmented equally well, but roads deteriorate significantly.
  • The authors first train the FCN model on Google/OSM data of Chicago, Paris, Zurich, and Berlin, and use the resulting network weights as initial value, from which the model is tuned for the ISPRS data, using all the 21 training images as in baseline II.
  • The success of pretraining in previous experiments raises the question—also asked in [50]—of whether one could reduce the annotation effort and use a smaller hand-labeled training set, in conjunction with large-scale OSM labels.
  • Performance increases by 7 percentage points to 0.837 over baseline Ia, where the model is trained from scratch on the same high-accuracy labels; a sketch of this pretrain-then-fine-tune setup follows this list.
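Below is a hedged sketch of that pretrain-then-fine-tune setup. It is not the authors' implementation: a torchvision FCN stands in for their VGG-based network, random tensors stand in for the small hand-labeled ISPRS set, and the OSM checkpoint path is hypothetical.

# Illustration only (assumed PyTorch): initialize from weights pretrained on
# large-scale OSM labels, then fine-tune on a small hand-labeled set.
import torch
from torch import nn, optim
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(num_classes=3)          # building / road / background (stand-in model)
# model.load_state_dict(torch.load("osm_pretrained.pth"))  # hypothetical OSM-pretrained checkpoint

criterion = nn.CrossEntropyLoss(reduction="sum")   # multinomial logistic loss, summed per patch
optimizer = optim.SGD(model.parameters(), lr=5e-9, momentum=0.9, weight_decay=5e-4)

images = torch.rand(2, 3, 500, 500)          # stand-in for a few pixel-accurate patches
labels = torch.randint(0, 3, (2, 500, 500))  # per-pixel class ids

model.train()
for i in range(images.shape[0]):             # fine-tune on the small manual set
    optimizer.zero_grad()
    out = model(images[i:i + 1])["out"]      # per-pixel class scores
    loss = criterion(out, labels[i:i + 1])
    loss.backward()
    optimizer.step()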

V. CONCLUSION

  • Traditionally, semantic segmentation of aerial and satellite images crucially relies on manually labeled images as training data.
  • Generating such training data for a new project is costly and time consuming, and presents a bottleneck for automatic image analysis.
  • Here, the authors have explored a possible solution, namely, to exploit existing data, in their case open image and map data from the Internet for supervised learning with deep CNNs.
  • Such training data are available in much larger quantities, but “weaker” in the sense that the images are not representative of the test images’ radiometry, and labels automatically generated from external maps are noisier than dedicated ground truth annotations.
  • 3) Even if high-quality training data are available, the large volume of additional training data improves classification.

Learning Aerial Image Segmentation
From Online Maps
Pascal Kaiser, Jan Dirk Wegner, Aurélien Lucchi, Martin Jaggi, Thomas Hofmann,
and Konrad Schindler, Senior Member, IEEE
Index Terms—Crowdsourcing, image classification, machine learning, neural networks, supervised learning, terrain mapping, urban areas.
I. INTRODUCTION
HUGE volumes of optical overhead imagery are captured
every day with airborne or spaceborne platforms, and
that volume is still growing. This “data deluge” makes manual
interpretation prohibitive, and hence machine vision must be
employed if we want to make any use of the available data.
Perhaps the fundamental step of automatic mapping is to
assign a semantic class to each pixel, i.e., convert the raw
data to a semantically meaningful raster map (which can then
be further processed as appropriate with, e.g., vectorization or
map generalization techniques). The most popular tool for that
task is supervised machine learning. Supervision with human-
annotated training data is necessary to inject the task-specific
class definitions into the generic statistical analysis. In most
cases, reference data for classifier training are generated man-
ually for each new project, which is a time-consuming and
costly process. Manual annotation must be repeated every time
the task, the geographic location, the sensor characteristics, or
the imaging conditions change, and hence the process scales
poorly. In this paper, we explore the tradeoff between the
following:
1) pixel-accurate ground truth that is available only in small quantities;
2) less accurate reference data that are readily available in
arbitrary quantities, at no cost.
For our study, we make use of online map data from Open-
StreetMap [1]–[3] (OSM, http://www.openstreetmap.org) to
automatically derive weakly labeled training data for three
classes: buildings, roads, and background (i.e., all others).
These data are typically collected using two main sources.
1) Volunteers collect OSM data either in situ with GPS
trackers or by manually digitizing very high resolution
(VHR) aerial or satellite images that have been donated.
2) National mapping agencies donate their data to OSM to
make it available to a wider public.
Since OSM is generated by volunteers, our approach can
be seen as a form of crowd-sourced data annotation; but other
existing map databases, e.g., legacy data within a mapping
agency, could also be used.
As image data for our study, we employ high-resolution
RGB orthophotographs from Google Maps,¹ since we could
not easily get access to comparable amounts of other high-
resolution imagery [>100 km² at 10-cm ground sampling
distance (GSD)].
Clearly, these types of training data will be less accurate.
Sources of errors include coregistration errors, e.g., in our
case, OSM polygons and Google images were independently
geo-referenced; limitations of the data format, e.g., OSM only
has road centerlines and category, but no road boundaries;
temporal changes not depicted in outdated map or image data;
or simply sloppy annotations, not only because of a lack of
training or motivation, but also because the use cases of most
OSM users require not even meter-level accuracy.
Our study is driven by the following hypotheses.
1) The sheer volume of training data can possibly compen-
sate for the lower accuracy (if used with an appropriate
robust learning method).
¹ Specifications of Google Maps data can be found at https://support.google.com/mapcontentpartners/answer/144284?hl=en

2) The large variety present in very large training sets
(e.g., spanning multiple different cities) could potentially
improve the classifier’s ability to generalize to new
unseen locations.
3) Even if high-quality training data are available, the
large volume of additional training data could potentially
improve the classification.
4) If low-accuracy large-scale training data help, then it
may also allow one to substitute a large portion of the
manually annotated high-quality data.
We investigate these hypotheses when using deep convolu-
tional neural networks (CNNs). Deep networks are at present
the top-performing method for high-resolution semantic label-
ing and are therefore the most appropriate choice for our
study.² At the same time, they also fulfill the other require-
ments for our study: they are data hungry and robust to label
noise [4]. And they make manual feature design somewhat
obsolete: once training data are available, retraining for differ-
ent sensor types or imaging conditions is fully automatic, with-
out scene-specific user interaction such as feature definition or
preprocessing. We adopt a variant of the fully convolutional
network (FCN) [5], and explore the potential of combining
end-to-end trained deep networks with massive amounts of
noisy OSM labels. We evaluate the extreme variant of our
approach, without any manual labeling, on three major cities
(Chicago, Paris, and Zurich) with different urban structures.
Since quantitative evaluations on these large data sets are
limited by the inaccuracy of the labels, which is also present
in the test sets, we also perform experiments for a smaller
data set from the city of Potsdam. There, high-precision
manually annotated ground truth is available, which allows us
to compare different levels of project-specific input, including
the baseline where only manually labeled training data are
used, the extreme case of only automatically generated training
labels, and variants in between. We also assess the mod-
els’ capabilities regarding generalization and transfer learning
between unseen geographic locations.
We find in this paper that training on noisy labels does
work well, but only with substantially larger training sets;
with small training sets (≈2 km²), it does not reach
the performance of hand-labeled pixel-accurate training data.
Moreover, even in the presence of high-quality training data,
massive OSM labels further improve the classifier, and hence
can be used to significantly reduce the manual labeling efforts.
According to our experiments, the differences are really due to
the training labels, since segmentation performance of OSM
labels is stable across different image sets of the same scene.
For practical reasons, our study is limited to buildings
and roads, which are available from OSM, and to RGB
images from Google Maps, subject to unknown radiometric
manipulations. We hope that similar studies will also be
performed with the vast archives of proprietary image and
map data held by state mapping authorities and commercial
satellite providers. Finally, this is a step in a journey that will
ultimately bring us closer to the utopian vision that a whole
range of mapping tasks no longer need user input, but can be
completely automated by the World Wide Web.

² All top-performing methods on big benchmarks are CNN variants, both in generic computer vision, e.g., the Pascal VOC Challenge, http://host.robots.ox.ac.uk/pascal/VOC/, and in remote sensing, e.g., the ISPRS semantic labeling challenge, http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html
II. RELATED WORK
There is a huge literature about semantic segmentation in
remote sensing. A large part deals with rather low-resolution
satellite images, whereas our work in this paper deals with
VHR aerial images (see [6] for an overview).
Aerial data with a ground sampling distance GSD ≤ 20 cm
contains rich details about urban objects such as roads, build-
ings, trees, and cars, and is a standard source for urban
mapping projects. Since urban environments are designed by
humans according to relatively stable design constraints, early
work attempted to construct object descriptors via sets of rules,
most prominently for building detection in 2-D [7], [8] or in
3-D [9]–[11], and for road extraction [12]–[14]. A general
limitation of hierarchical rule systems, be they top-down or
bottom-up, is poor generalization across different city layouts.
Hard thresholds at early stages tend to delete information
that can hardly be recovered later, and hard-coded expert
knowledge often misses important evidence that is less obvious
to the human observer.
Machine learning thus aims to learn classification rules
directly from the data. As local evidence, conventional classi-
fiers are fed with raw pixel intensities, simple arithmetic com-
binations such as vegetation indices, and different statistics or
filter responses that describe the local image texture [15]–[17].
An alternative is to precompute a large redundant set of
local features for training and let a discriminative classi-
fier (e.g., boosting and random forest) select the optimal
subset [18]–[21] for the task.
More global object knowledge that cannot be learned from
local pixel features can be introduced via probabilistic priors.
Two related probabilistic frameworks have been successfully
applied to this task, marked point processes (MPPs) and
graphical models. For example, [22] and [23] formulate MPPs
that explicitly model road network topologies, while [24] use
a similar approach to extract building footprints. MPPs rely
on object primitives like lines or rectangles that are matched
to the image data by sampling. Even if data driven [25], such
Monte Carlo sampling has high computational cost and does
not always find good configurations. Graphical models provide
similar modeling flexibility, but in general also lead to hard
optimization problems. For restricted cases (e.g., submodular
objective functions), efficient optimizers exist. Although there
is a large body of literature that aims to tailor conditional ran-
dom fields for object extraction in computer vision and remote
sensing, relatively few authors tackle semantic segmentation
in urban scenes (see [26]–[30]).
Given the difficulty of modeling high-level correlations,
much effort has gone into improving the local evidence by
finding more discriminative object features [21], [31], [32].
The resulting feature vectors are fed to a standard classifier
(e.g., decision trees or support vector machines) to infer
probabilities per object category. Some authors invest a lot
of efforts to reduce the dimension of the feature space to

a maximally discriminative subset (see [33]–[36]), although
this seems to have only limited effect—at least with modern
discriminative classifiers.
Deep neural networks do not require a separate feature
definition step, but instead learn the most discriminative
feature set for a given data set and task directly from raw
images. They go back to [37] and [38], but at the time were
limited by a lack of computing power and training data. After
their comeback in the 2012 ImageNet challenge [39], [40],
deep learning approaches, and in particular deep CNNs, have
achieved impressive results for diverse image analysis tasks.
State-of-the-art network architectures (see [41]) have many
(often 10–20, but up to >100) layers of local filters and
thus large receptive fields in the deep layers, which makes
it possible to learn complex local-to-global (nonlinear) object
representations and long-range contextual relations directly
from raw image data. An important property of deep CNNs
is that both training and inference are easily parallelizable,
especially on GPUs, and thus scale to millions of training and
testing images.
Quickly, CNNs were also applied to semantic segmenta-
tion of images [42]. Our approach in this paper is based
on the FCN architecture of [5], which returns a structured
spatially explicit label image (rather than a global image label).
While spatial aggregation is nevertheless required to represent
context, FCNs also include in-network upsampling back to
the resolution of the original image. They have already been
successfully applied to semantic segmentation of aerial images
(see [43]–[45]). In fact, the top performers on the ISPRS
semantic segmentation benchmark all use CNNs. We note that
(nonconvolutional) deep networks in conjunction with OSM
labels have also been applied for patch-based road extraction
in overhead images of 1 m GSD at large scale [46], [47].
More recently, Máttyus et al. [48] combine OSM data with
aerial images to augment maps with additional information
from imagery like road widths. They design a sophisticated
random field to probabilistically combine various sources of
road evidence, for instance, cars, to estimate road widths at
global scale using OSM and aerial images.
To the best of our knowledge, only two works have made
attempts to investigate how results of CNNs trained on large-
scale OSM labels can be fine-tuned to achieve more accurate
results for labeling remote sensing images [49], [50]. However,
we are not aware of any large-scale, systematic, comparative,
and quantitative study that investigates using large-scale train-
ing labels from inaccurate map data for semantic segmentation
of aerial images.
III. METHODS
We first describe our straightforward approach to generate
training data automatically from OSM, and then give technical
details about the employed FCN architecture and our training procedure.
A. Generation of Training Data
We use a simple automatic approach to generate data sets of
VHR aerial images in RGB format and corresponding labels
for classes building, road, and background. Aerial images are
downloaded from Google Maps, and geographic coordinates
of buildings and roads are downloaded from OSM. We prefer
to use OSM maps instead of Google Maps, because the
latter can only be downloaded as raster images.³ OSM data
can be accessed and manipulated in vector format, and each
object type comes with meta data and identifiers that allow
straightforward filtering. Regarding coregistration, we find that
OSM and Google Maps align relatively well, even though
they have been acquired and processed separately.⁴ Most
local misalignments are caused by facades of high buildings
that overlap with roads or background due to perspective
effects. It is apparent that in our test areas Google provides
orthophotographs rectified with respect to a bare earth digital
terrain model (DTM), not “true” orthophotographs rectified
with a digital surface model (DSM). According to our own
measurements on a subset of the data, this effect is relatively
mild, generally < 10 pixels displacement. We found that this
does not introduce major errors as long as there are no high-
rise buildings. It may be more problematic for extreme scenes
such as Singapore or Manhattan.
To generate pixel-wise label maps, the geographic coor-
dinates of OSM building corners and road center lines are
transformed to pixel coordinates. For each building, a polygon
through the corner points is plotted at the corresponding image
location. For roads, the situation is slightly more complex.
OSM provides only coordinates of road center lines, but
no precise road widths. There is, however, a road category
label (“highway tag”) for most roads. We determined an
average road width for each category on a small subset of
the data, and validated it on a larger subset (manually, one-
off). This simple strategy works reasonably well, with a
mean error of ≈11 pixels for the road boundary, compared
with ≈100 pixels of road width.⁵ In (very rare) cases where
the ad hoc procedure produced label collisions, pixels claimed
by both building and road were assigned to buildings. Pixels
neither labeled building nor road form the background class.
Examples of images overlaid with automatically generated
OSM labels are shown in Fig. 1.
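The following Python sketch illustrates the label-generation recipe just described: filled building polygons, road center lines buffered to a per-category width, and buildings winning label collisions. It is our illustration, not the authors' code; coordinates are assumed to be already converted to pixel space, and the class ids and per-tag road widths are made-up values.

# Minimal sketch of OSM label rasterization (not the authors' code).
from PIL import Image, ImageDraw

BACKGROUND, BUILDING, ROAD = 0, 1, 2
ROAD_WIDTH_PX = {"residential": 60, "primary": 110, "footway": 25}  # illustrative per-tag widths

def rasterize_labels(size, buildings, roads):
    """size: (width, height); buildings: list of [(x, y), ...] polygons;
    roads: list of (highway_tag, [(x, y), ...]) center lines, already in pixel coordinates."""
    mask = Image.new("L", size, BACKGROUND)
    draw = ImageDraw.Draw(mask)
    for tag, line in roads:                     # draw roads first ...
        draw.line(line, fill=ROAD, width=ROAD_WIDTH_PX.get(tag, 60))
    for poly in buildings:                      # ... so buildings overwrite label collisions
        draw.polygon(poly, fill=BUILDING)
    return mask

labels = rasterize_labels(
    (500, 500),
    buildings=[[(50, 50), (150, 50), (150, 150), (50, 150)]],
    roads=[("residential", [(0, 300), (499, 320)])],
)
labels.save("osm_labels.png")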
B. Neural Network Architecture
We use a variant of FCNs in this paper (see Fig. 2). Fol-
lowing the standard neural network concept, transformations
are ordered in sequential layers that gradually transform the
pixel values to label probabilities. Most layers implement
learned convolution filters, where each neuron at level l takes
its input values only from a fixed-size spatially localized
window W in the previous layer (l−1), and outputs a vector of
differently weighted sums of those values, $c^{l} = \sum_{i \in W} w_{i}\, c^{l-1}_{i}$.
Weights $w_{i}$ are shared across all neurons of a layer, which
reflects the shift invariance of the image signal and drastically
³ Note that some national mapping agencies also provide publicly available map and other geo-data, e.g., the USGS national map program: https://nationalmap.gov/
⁴ Note that it is technically possible to obtain world coordinates of objects in Google Maps and enter those into OSM, and this might in practice also be done to some extent. However, OSM explicitly asks users not to do that.
⁵ Average deviation based on ten random samples of Potsdam, Chicago, Paris, and Zurich.

Fig. 1. Example of OSM labels overlaid with Google Maps images for (a) Zurich and (b) Paris. (Left) Aerial image and a magnified detail. (Right) Same
images overlaid with building (red) and road (blue) labels. Background is transparent in the label map.
Fig. 2. Conceptual illustration of the data flow through our variant of an FCN, which is used for the semantic segmentation of aerial images. Three skip
connections are highlighted by pale red, pale green, and pale blue, respectively. Note that we added a third (pale red) skip connection in addition to the
original ones (pale green and pale blue) of [5].
reduces the number of parameters. Each convolutional layer is
followed by a rectified linear unit (ReLU), $c^{l}_{\mathrm{rec}} = \max(0, c^{l})$,
which simply truncates all negative values to 0 and leaves
positive values unchanged [51].⁶ Convolutional layers are
interspersed with max-pooling layers that downsample the
image and retain only the maximum value inside a (2 × 2)
neighborhood. The downsampling increases the receptive field
of subsequent convolutions, and lets the network learn corre-
lations over a larger spatial context. Moreover, max-pooling
achieves local translation invariance at object level. The out-
puts of the last convolutional layers (which are very big to
capture global context, equivalent to a fully connected layer
of standard CNNs) are converted to a vector of scores for the
three target classes. These score maps are of low resolution,
and hence they are gradually upsampled again with convo-
lutional layers using a stride of only 1/2 pixel.⁷ Repeated
downsampling causes a loss of high-frequency content, which
leads to blurry boundaries that are undesirable for pixel-wise
semantic segmentation. To counter this effect, feature maps
at intermediate layers are merged back in during upsampling
(the so-called “skip connections,” see Fig. 2). The final full-
resolution score maps are then converted to label probabilities
with the softmax function.

⁶ Other nonlinearities are sometimes used, but ReLU has been shown to facilitate training (backpropagation) and has become the de-facto standard.
⁷ This operation is done by layers that are usually called “deconvolution layers” in [5] (and also in Fig. 3), although the use of this terminology has been criticized, since most implementations do not perform a real deconvolution but rather a transposed convolution.
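To make the three building blocks above concrete, the following NumPy sketch (ours, not the authors' code) implements the weighted sum over a local window W, the ReLU truncation, and 2 × 2 max-pooling for a single channel.

# NumPy illustration of convolution, ReLU, and 2x2 max-pooling (single channel,
# "valid" padding, unit stride; kernel values are arbitrary).
import numpy as np

def conv2d(x, w):
    """c^l[r, s] = sum_{i in W} w_i * c^{l-1}_i over the window anchored at (r, s)."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for s in range(out.shape[1]):
            out[r, s] = np.sum(w * x[r:r + kh, s:s + kw])
    return out

def relu(x):
    return np.maximum(0.0, x)          # truncate negative responses to zero

def max_pool_2x2(x):
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

patch = np.random.rand(8, 8)           # stand-in for one input channel
kernel = np.random.randn(3, 3)         # shared weights w_i of one filter
features = max_pool_2x2(relu(conv2d(patch, kernel)))
print(features.shape)                  # (3, 3): 8 -> 6 after valid conv, -> 3 after pooling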

Fig. 3. Our FCN architecture, which adds one more skip connection (after Pool_2, shown red) to the original model of [5]. Neurons form a 3-D structure per
layer: dimensions are written in brackets, where the first number indicates the number of feature channels, and the second and third represent spatial dimensions.
C. Implementation Details
The FCN we use is an adaptation of the architecture
proposed in [5], which itself is largely based on the VGG-16
network architecture [41]. In our implementation, we slightly
modify the original FCN and introduce a third skip connection
(marked red in Fig. 2), to preserve even finer image details.
We found that the original architecture, which has two skip
connections after Pool_3 and Pool_4 (see Fig. 3), was still
not delivering sufficiently sharp edges. The additional higher
resolution skip connection consistently improved the results
for our data (see Section IV-B). Note that adding the third skip
connection does not increase the total number of parameters
but, on the contrary, slightly reduces it ([5]: 134,277,737; ours:
134,276,540; the small difference is due to the decomposition
of the final upsampling kernel into two smaller ones).
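A simplified PyTorch sketch of such an FCN with skip connections after Pool_2, Pool_3, and Pool_4 is given below. It is our illustration, not the authors' implementation: the fc6/fc7 layers and the learned "deconvolution" upsampling of the original FCN are replaced by a single 1 × 1 scoring convolution per skip and plain bilinear upsampling, to keep the example short.

# Sketch of a VGG-16-based FCN with three skip connections (assumptions noted above).
import torch
from torch import nn
import torch.nn.functional as F
from torchvision.models import vgg16

class SkipFCN(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        feats = vgg16().features                      # randomly initialized VGG-16 backbone
        self.stage2 = feats[:10]                      # up to Pool_2 (1/4 resolution)
        self.stage3 = feats[10:17]                    # up to Pool_3 (1/8)
        self.stage4 = feats[17:24]                    # up to Pool_4 (1/16)
        self.stage5 = feats[24:]                      # up to Pool_5 (1/32)
        self.score2 = nn.Conv2d(128, num_classes, 1)  # extra skip after Pool_2
        self.score3 = nn.Conv2d(256, num_classes, 1)  # skip after Pool_3
        self.score4 = nn.Conv2d(512, num_classes, 1)  # skip after Pool_4
        self.score5 = nn.Conv2d(512, num_classes, 1)

    def forward(self, x):
        p2 = self.stage2(x)
        p3 = self.stage3(p2)
        p4 = self.stage4(p3)
        p5 = self.stage5(p4)
        s = self.score5(p5)
        for skip, score in ((p4, self.score4), (p3, self.score3), (p2, self.score2)):
            s = F.interpolate(s, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            s = s + score(skip)                       # merge finer details back in
        # back to full resolution; softmax over classes would give label probabilities
        return F.interpolate(s, size=x.shape[-2:], mode="bilinear", align_corners=False)

scores = SkipFCN()(torch.rand(1, 3, 500, 500))
print(scores.shape)                                   # torch.Size([1, 3, 500, 500])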
D. Training
All model parameters are learned by minimizing a multino-
mial logistic loss, summed over the entire 500 × 500 pixel
patch that serves as input to the FCN. Prior to train-
ing/inference, intensity distributions are centered indepen-
dently per patch by subtracting the mean, separately for each
channel (RGB).
All models are trained with stochastic gradient descent
with a momentum of 0.9, and minibatch size of one image.
Learning rates always start from 5 × 10⁻⁹ and are reduced
by a factor of ten twice when the loss and average F1
scores stopped improving. The learning rates for biases of
convolutional layers were doubled with respect to learning
rates of the filter weights. Weight decay was set to 5 × 10⁻⁴,
and dropout probability for neurons in layers ReLU_6 and
ReLU_7 was always 0.5.
ReLU_7wasalways0.5.
Training was run until the average F
1
-score on the validation
data set stopped improving, which took between 45 000 and
140 000 iterations (3.5–6.5 epochs). Weights were initialized
as in [52], except for experiments with pretrained weights.
It is a common practice in deep learning to publish pre-
trained models together with source code and paper, to ease
repeatability of results and to help others avoid training from
scratch. Starting from pretrained models, even if these have
been trained on a completely different image data set, often
improves performance, because low-level features like contrast
edges and blobs learned in early network layers are very
similar across different kinds of images.
We will use two different forms of pretraining. Either
we rely on weights previously learned on the Pascal VOC
benchmark [53] (made available by Long et al. [5]), or we
pretrain ourselves with OSM data. In Section IV, it is always
specified whether we use VOC, OSM, or no pretraining at all.
IV. E
XPERIMENTS
We present extensive experiments on four large data sets of
different cities to explore the following scenarios.
1) Complete Substitution: Can semantic segmentation be
learned without any manual labeling? What performance

Citations
Proceedings ArticleDOI
15 Jun 2019
TL;DR: This work introduces two simple yet effective network units, the spatial relation module and the channel relation module, to learn and reason about global relationships between any two spatial positions or feature maps, and then produce relation-augmented feature representations.
Abstract: Most current semantic segmentation approaches fall back on deep convolutional neural networks (CNNs). However, their use of convolution operations with local receptive fields causes failures in modeling contextual spatial relations. Prior works have sought to address this issue by using graphical models or spatial propagation modules in networks. But such models often fail to capture long-range spatial relationships between entities, which leads to spatially fragmented predictions. Moreover, recent works have demonstrated that channel-wise information also acts a pivotal part in CNNs. In this work, we introduce two simple yet effective network units, the spatial relation module and the channel relation module, to learn and reason about global relationships between any two spatial positions or feature maps, and then produce relation-augmented feature representations. The spatial and channel relation modules are general and extensible, and can be used in a plug-and-play fashion with the existing fully convolutional network (FCN) framework. We evaluate relation module-equipped networks on semantic segmentation tasks using two aerial image datasets, which fundamentally depend on long-range spatial relational reasoning. The networks achieve very competitive results, bringing significant improvements over baselines.

167 citations

Journal ArticleDOI
TL;DR: In this paper, a patch attention module (PAM) is proposed to enhance the embedding of context information based on a patchwise calculation of local attention, which can be applied to process the extracted features of convolutional neural networks (CNNs).
Abstract: The trade-off between feature representation power and spatial localization accuracy is crucial for the dense classification/semantic segmentation of remote sensing images (RSIs). High-level features extracted from the late layers of a neural network are rich in semantic information, yet have blurred spatial details; low-level features extracted from the early layers of a network contain more pixel-level information but are isolated and noisy. It is therefore difficult to bridge the gap between high- and low-level features due to their difference in terms of physical information content and spatial distribution. In this article, we contribute to solve this problem by enhancing the feature representation in two ways. On the one hand, a patch attention module (PAM) is proposed to enhance the embedding of context information based on a patchwise calculation of local attention. On the other hand, an attention embedding module (AEM) is proposed to enrich the semantic information of low-level features by embedding local focus from high-level features. Both proposed modules are lightweight and can be applied to process the extracted features of convolutional neural networks (CNNs). Experiments show that, by integrating the proposed modules into a baseline fully convolutional network (FCN), the resulting local attention network (LANet) greatly improves the performance over the baseline and outperforms other attention-based methods on two RSI data sets.

160 citations

Journal ArticleDOI
TL;DR: Results indicate that U-Nets trained on weak labels outperform baseline methods with as few as 100 labels, and Neural networks can combine superior classification performance with efficient label usage, and allow pixel-level labels to be obtained from image labels.
Abstract: Accurate automated segmentation of remote sensing data could benefit applications from land cover mapping and agricultural monitoring to urban development surveyal and disaster damage assessment. While convolutional neural networks (CNNs) achieve state-of-the-art accuracy when segmenting natural images with huge labeled datasets, their successful translation to remote sensing tasks has been limited by low quantities of ground truth labels, especially fully segmented ones, in the remote sensing domain. In this work, we perform cropland segmentation using two types of labels commonly found in remote sensing datasets that can be considered sources of “weak supervision”: (1) labels comprised of single geotagged points and (2) image-level labels. We demonstrate that (1) a U-Net trained on a single labeled pixel per image and (2) a U-Net image classifier transferred to segmentation can outperform pixel-level algorithms such as logistic regression, support vector machine, and random forest. While the high performance of neural networks is well-established for large datasets, our experiments indicate that U-Nets trained on weak labels outperform baseline methods with as few as 100 labels. Neural networks, therefore, can combine superior classification performance with efficient label usage, and allow pixel-level labels to be obtained from image labels.

153 citations

Journal ArticleDOI
TL;DR: An end-to-end trainable gated residual refinement network (GRRNet) that fuses high-resolution aerial images and LiDAR point clouds for building extraction and has competitive building extraction performance in comparison with other approaches is developed.
Abstract: Automated extraction of buildings from remotely sensed data is important for a wide range of applications but challenging due to difficulties in extracting semantic features from complex scenes like urban areas. The recently developed fully convolutional neural networks (FCNs) have shown to perform well on urban object extraction because of the outstanding feature learning and end-to-end pixel labeling abilities. The commonly used feature fusion or skip-connection refine modules of FCNs often overlook the problem of feature selection and could reduce the learning efficiency of the networks. In this paper, we develop an end-to-end trainable gated residual refinement network (GRRNet) that fuses high-resolution aerial images and LiDAR point clouds for building extraction. The modified residual learning network is applied as the encoder part of GRRNet to learn multi-level features from the fusion data and a gated feature labeling (GFL) unit is introduced to reduce unnecessary feature transmission and refine classification results. The proposed model - GRRNet is tested in a publicly available dataset with urban and suburban scenes. Comparison results illustrated that GRRNet has competitive building extraction performance in comparison with other approaches. The source code of the developed GRRNet is made publicly available for studies.

141 citations

Journal ArticleDOI
TL;DR: This paper proposes a so-called untied denoising autoencoder with sparsity, in which the encoder and decoder of the network are independent, and only the decoding of thenetwork is enforced to be nonnegative, and makes two critical additions to the network design.
Abstract: Linear spectral unmixing is the practice of decomposing the mixed pixel into a linear combination of the constituent endmembers and the estimated abundances. This paper focuses on unsupervised spectral unmixing where the endmembers are unknown a priori . Conventional approaches use either geometrical- or statistical-based approaches. In this paper, we address the challenges of spectral unmixing with unsupervised deep learning models, in specific, the autoencoder models, where the decoder serves as the endmembers and the hidden layer output serves as the abundances. In several recent attempts, part-based autoencoders have been designed to solve the unsupervised spectral unmixing problem. However, the performance has not been satisfactory. In this paper, we first discuss some important findings we make on issues with part-based autoencoders. By proof of counterexample, we show that all existing part-based autoencoder networks with nonnegative and tied encoder and decoder are inherently defective by making these inappropriate assumptions on the network structure. As a result, they are not suitable for solving the spectral unmixing problem. We propose a so-called untied denoising autoencoder with sparsity, in which the encoder and decoder of the network are independent, and only the decoder of the network is enforced to be nonnegative. Furthermore, we make two critical additions to the network design. First, since denoising is an essential step for spectral unmixing, we propose to incorporate the denoising capacity into the network optimization in the format of a denoising constraint rather than cascading another denoising preprocessor in order to avoid the introduction of additional reconstruction error. Second, to be more robust to the inaccurate estimation of a number of endmembers, we adopt an $l_{21}$ -norm on the encoder of the network to reduce the redundant endmembers while decreasing the reconstruction error simultaneously. The experimental results demonstrate that the proposed approach outperforms several state-of-the-art methods, especially for highly noisy data.

135 citations

References
Proceedings Article
03 Dec 2012
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

28,225 citations

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations

Frequently Asked Questions (11)
Q1. What are the contributions in "Learning aerial image segmentation from online maps" ?

This paper deals with semantic segmentation of high-resolution ( aerial ) images where a semantic class label is assigned to each pixel via supervised classification as a basis for automatic map generation. The question addressed in this paper is: can training with large-scale publicly available labels replace a substantial part of the manual labeling effort and still achieve sufficient performance ? The authors adapt a state-of-the-art CNN architecture for semantic segmentation of buildings and roads in aerial images, and compare its performance when using different training data sets, ranging from manually labeled pixel-accurate ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations. The authors report their results that indicate that satisfying performance can be obtained with significantly less manual annotation effort, by exploiting noisy large-scale training data. 

In future work, it may be useful to experiment with even larger amounts of open data. On the other hand, buildings are detected equally well, and no further improvement can be noticed. Locally well-defined compact objects of similar shape and appearance are easier to learn, so further training data do not add relevant information. While pretraining is nowadays a standard practice, the authors go one step further and pretrain with aerial images and the correct set of output labels, generated automatically from free map data. 

In other words, fine-tuning with a limited quantity of problem-specific high-accuracy labels compensates for a large portion (≈65%) of the loss between experiments II and IV, with only 15% of the labeling effort.

It is a common practice in deep learning to publish pretrained models together with source code and paper, to ease repeatability of results and to help others avoid training from scratch. 

A visionary goal would be a large free publicly available “model zoo” of pretrained classifiers for the most important remote sensing applications, from which users world-wide can download suitable models and either apply them directly to their region of interest or use them as initialization for their own training. 

To generate pixel-wise label maps, the geographic coordinates of OSM building corners and road center lines are transformed to pixel coordinates. 

Following the standard neural network concept, transformations are ordered in sequential layers that gradually transform the pixel values to label probabilities. 

A possible interpretation is that complex network structures with long-range dependencies are hard to learn for the classifier, and thus more training data help. 

Two related probabilistic frameworks have been successfully applied to this task, marked point processes (MPPs) and graphical models. 

Semantic segmentation of overhead images can indeed be learned from OSM maps without any manual labeling effort, albeit at the cost of reduced segmentation accuracy.

4) Large-scale (but low-accuracy) training data allow substitution of the large majority (85% in their case) of the manually annotated high-quality data.