
Learning Aerial Image Segmentation From Online Maps

21 Jul 2017 - IEEE Transactions on Geoscience and Remote Sensing (IEEE) - Vol. 55, Iss. 11, pp. 6054-6068
TL;DR: In this article, a state-of-the-art CNN architecture was adapted for semantic segmentation of buildings and roads in aerial images, and its performance was compared when training on data sets ranging from manually labeled ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations.
Abstract: This paper deals with semantic segmentation of high-resolution (aerial) images where a semantic class label is assigned to each pixel via supervised classification as a basis for automatic map generation. Recently, deep convolutional neural networks (CNNs) have shown impressive performance and have quickly become the de-facto standard for semantic segmentation, with the added benefit that task-specific feature design is no longer necessary. However, a major downside of deep learning methods is that they are extremely data hungry, thus aggravating the perennial bottleneck of supervised classification, to obtain enough annotated training data. On the other hand, it has been observed that they are rather robust against noise in the training labels. This opens up the intriguing possibility to avoid annotating huge amounts of training data, and instead train the classifier from existing legacy data or crowd-sourced maps that can exhibit high levels of noise. The question addressed in this paper is: can training with large-scale publicly available labels replace a substantial part of the manual labeling effort and still achieve sufficient performance? Such data will inevitably contain a significant portion of errors, but in return virtually unlimited quantities of it are available in larger parts of the world. We adapt a state-of-the-art CNN architecture for semantic segmentation of buildings and roads in aerial images, and compare its performance when using different training data sets, ranging from manually labeled pixel-accurate ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations. We report our results that indicate that satisfying performance can be obtained with significantly less manual annotation effort, by exploiting noisy large-scale training data.

Summary (3 min read)

Introduction

  • 4) If low-accuracy, large-scale training data help, they may also allow one to substitute a large portion of the manually annotated high-quality data.
  • At the same time, they also fulfill the other requirements for their study: they are data hungry and robust to label noise [4].
  • For practical reasons, their study is limited to buildings and roads, which are available from OSM, and to RGB images from Google Maps, subject to unknown radiometric manipulations.

A. Generation of Training Data

  • The authors use a simple automatic approach to generate data sets of VHR aerial images in RGB format and corresponding labels for classes building, road, and background.
  • Aerial images are downloaded from Google Maps, and geographic coordinates of buildings and roads are downloaded from OSM.
  • 3 OSM data can be accessed and manipulated in vector format, and each object type comes with meta data and identifiers that allow straightforward filtering.
  • This simple strategy works reasonably well, with a mean error of ≈11 pixels for the road boundary, compared with ≈100 pixels of road width.
  • In (very rare) cases where the ad hoc procedure produced label collisions, pixels claimed by both building and road were assigned to buildings.

B. Neural Network Architecture

  • Following the standard neural network concept, transformations are ordered in sequential layers that gradually transform the pixel values to label probabilities.
  • 4Note that it is technically possible to obtain world coordinates of objects in Google Maps and enter those into OSM, and this might in practice also be done to some extent.
  • 5Average deviation based on ten random samples of Potsdam, Chicago, Paris, and Zurich.
  • Convolutional layers are interspersed with max-pooling layers that downsample the image and retain only the maximum value inside a (2 × 2) neighborhood.
  • Note that adding the third skip connection does not increase the total number of parameters but, on the contrary, slightly reduces it ( [5]: 134′277′737, ours: 134′276′540; the small difference is due to the decomposition of the final upsampling kernel into two smaller ones).

D. Training

  • All model parameters are learned by minimizing a multinomial logistic loss, summed over the entire 500 × 500 pixel patch that serves as input to the FCN.
  • Prior to training/inference, intensity distributions are centered independently per patch by subtracting the mean, separately for each channel (RGB).
  • Learning rates always start from 5 × 10⁻⁹ and are reduced by a factor of ten twice when the loss and average F1 scores stop improving.
  • Starting from pretrained models, even if these have been trained on a completely different image data set, often improves performance, because low-level features like contrast edges and blobs learned in early network layers are very similar across different kinds of images.
  • Either the authors rely on weights previously learned on the Pascal VOC benchmark [53] (made available by Long et al. [5]), or they pretrain with OSM data themselves.

IV. EXPERIMENTS

  • The authors present extensive experiments on four large data sets of different cities to explore the following scenarios.
  • Note that all experiments are designed to investigate different aspects of the hypotheses made in the introduction.

A. Data Sets

  • Four large data sets were downloaded from Google Maps and OSM, for the cities of Chicago, Paris, Zurich, and Berlin.
  • Example images and segmentation maps of Paris and Zurich are shown in Fig. 1. In Fig. 4, the authors show the full extent of the Potsdam scene, dictated by the available images and ground truth in the ISPRS benchmark.
  • In particular, the benchmark ground truth does not have a label street, but instead uses a broader class impervious surfaces, also comprising sidewalks, tarmacked courtyards, and so on.
  • To allow for a direct and fair comparison, the authors downsample the ISPRS Potsdam data, which comes at a GSD of 5 cm, to the same GSD as the Potsdam–Google data (9.1 cm); a resampling sketch follows this list.
  • Each data set is split into mutually exclusive training, validation, and test regions.
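A minimal Python sketch of that GSD-matching step is shown below. It is an illustration under assumptions (Pillow resampling, a synthetic tile standing in for ISPRS imagery), not the authors' code.

# Sketch of the resampling step (our illustration, not the authors' code):
# going from 5 cm to 9.1 cm GSD shrinks the image by a factor of 5/9.1 ≈ 0.55.
# The input image is a synthetic stand-in for an ISPRS Potsdam tile.
from PIL import Image
import numpy as np

SRC_GSD = 0.05    # meters per pixel (ISPRS Potsdam)
DST_GSD = 0.091   # meters per pixel (Potsdam-Google)

def resample_to_gsd(img):
    scale = SRC_GSD / DST_GSD
    new_size = (round(img.width * scale), round(img.height * scale))
    # Bilinear for imagery; NEAREST should be used for label maps to keep class ids intact.
    return img.resize(new_size, resample=Image.BILINEAR)

tile = Image.fromarray(np.random.randint(0, 255, (1000, 1000, 3), dtype=np.uint8))
print(resample_to_gsd(tile).size)   # (549, 549): 1000 px at 5 cm -> ~9.1 cm GSD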

B. Results and Discussion

  • First, the authors validate their modifications of the FCN architecture, by comparing it with the original model of [5].
  • The visual comparison between baseline II in Fig. 7(g)–(i) and IV in Fig. 9(a)–(c) shows that buildings are segmented equally well, but roads deteriorate significantly.
  • The authors first train the FCN model on Google/OSM data of Chicago, Paris, Zurich, and Berlin, and use the resulting network weights as initial value, from which the model is tuned for the ISPRS data, using all the 21 training images as in baseline II.
  • The success of pretraining in previous experiments raises the question—also asked in [50]—of whether one could reduce the annotation effort and use a smaller hand-labeled training set, in conjunction with large-scale OSM labels.
  • Performance increases by 7 percentage points to 0.837 over baseline Ia, where the model is trained from scratch on the same high-accuracy labels; a sketch of this pretrain-then-fine-tune setup follows this list.
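Below is a hedged sketch of that pretrain-then-fine-tune setup. It is not the authors' implementation: a torchvision FCN stands in for their VGG-based network, random tensors stand in for the small hand-labeled ISPRS set, and the OSM checkpoint path is hypothetical.

# Illustration only (assumed PyTorch): initialize from weights pretrained on
# large-scale OSM labels, then fine-tune on a small hand-labeled set.
import torch
from torch import nn, optim
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(num_classes=3)          # building / road / background (stand-in model)
# model.load_state_dict(torch.load("osm_pretrained.pth"))  # hypothetical OSM-pretrained checkpoint

criterion = nn.CrossEntropyLoss(reduction="sum")   # multinomial logistic loss, summed per patch
optimizer = optim.SGD(model.parameters(), lr=5e-9, momentum=0.9, weight_decay=5e-4)

images = torch.rand(2, 3, 500, 500)          # stand-in for a few pixel-accurate patches
labels = torch.randint(0, 3, (2, 500, 500))  # per-pixel class ids

model.train()
for i in range(images.shape[0]):             # fine-tune on the small manual set
    optimizer.zero_grad()
    out = model(images[i:i + 1])["out"]      # per-pixel class scores
    loss = criterion(out, labels[i:i + 1])
    loss.backward()
    optimizer.step()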

V. CONCLUSION

  • Traditionally, semantic segmentation of aerial and satellite images crucially relies on manually labeled images as training data.
  • Generating such training data for a new project is costly and time consuming, and presents a bottleneck for automatic image analysis.
  • Here, the authors have explored a possible solution, namely, to exploit existing data, in their case open image and map data from the Internet for supervised learning with deep CNNs.
  • Such training data are available in much larger quantities, but “weaker” in the sense that the images are not representative of the test images’ radiometry, and labels automatically generated from external maps are noisier than dedicated ground truth annotations.
  • 3) Even if high-quality training data are available, the large volume of additional training data improves classification.

Learning Aerial Image Segmentation
From Online Maps
Pascal Kaiser, Jan Dirk Wegner, Aurélien Lucchi, Martin Jaggi, Thomas Hofmann,
and Konrad Schindler, Senior Member, IEEE
Index Terms—Crowdsourcing, image classification, machine learning, neural networks, supervised learning, terrain mapping, urban areas.
I. INTRODUCTION
HUGE volumes of optical overhead imagery are captured
every day with airborne or spaceborne platforms, and
that volume is still growing. This “data deluge” makes manual
interpretation prohibitive, and hence machine vision must be
employed if we want to make any use of the available data.
Perhaps the fundamental step of automatic mapping is to
assign a semantic class to each pixel, i.e., convert the raw
data to a semantically meaningful raster map (which can then
be further processed as appropriate with, e.g., vectorization or
map generalization techniques). The most popular tool for that
task is supervised machine learning. Supervision with human-
annotated training data is necessary to inject the task-specific
class definitions into the generic statistical analysis. In most
cases, reference data for classifier training are generated man-
ually for each new project, which is a time-consuming and
costly process. Manual annotation must be repeated every time
the task, the geographic location, the sensor characteristics, or
the imaging conditions change, and hence the process scales
poorly. In this paper, we explore the tradeoff between the
following:
1) pixel-accurate ground truth that is available only in small quantities;
2) less accurate reference data that are readily available in
arbitrary quantities, at no cost.
For our study, we make use of online map data from Open-
StreetMap [1]–[3] (OSM, http://www.openstreetmap.org) to
automatically derive weakly labeled training data for three
classes: buildings, roads, and background (i.e., all others).
These data are typically collected using two main sources.
1) Volunteers collect OSM data either in situ with GPS
trackers or by manually digitizing very high resolution
(VHR) aerial or satellite images that have been donated.
2) National mapping agencies donate their data to OSM to
make it available to a wider public.
Since OSM is generated by volunteers, our approach can
be seen as a form of crowd-sourced data annotation; but other
existing map databases, e.g., legacy data within a mapping
agency, could also be used.
As image data for our study, we employ high-resolution
RGB orthophotographs from Google Maps,¹ since we could
not easily get access to comparable amounts of other high-
resolution imagery [>100 km² at 10-cm ground sampling
distance (GSD)].
Clearly, these types of training data will be less accurate.
Sources of errors include coregistration errors, e.g., in our
case, OSM polygons and Google images were independently
geo-referenced; limitations of the data format, e.g., OSM only
has road centerlines and category, but no road boundaries;
temporal changes not depicted in outdated map or image data;
or simply sloppy annotations, not only because of a lack of
training or motivation, but also because the use cases of most
OSM users require not even meter-level accuracy.
Our study is driven by the following hypotheses.
1) The sheer volume of training data can possibly compen-
sate for the lower accuracy (if used with an appropriate
robust learning method).
¹ Specifications of Google Maps data can be found at https://support.google.com/mapcontentpartners/answer/144284?hl=en

2) The large variety present in very large training sets
(e.g., spanning multiple different cities) could potentially
improve the classifier’s ability to generalize to new
unseen locations.
3) Even if high-quality training data are available, the
large volume of additional training data could potentially
improve the classification.
4) If low-accuracy large-scale training data help, then it
may also allow one to substitute a large portion of the
manually annotated high-quality data.
We investigate these hypotheses when using deep convolu-
tional neural networks (CNNs). Deep networks are at present
the top-performing method for high-resolution semantic label-
ing and are therefore the most appropriate choice for our
study.² At the same time, they also fulfill the other require-
ments for our study: they are data hungry and robust to label
noise [4]. And they make manual feature design somewhat
obsolete: once training data are available, retraining for differ-
ent sensor types or imaging conditions is fully automatic, with-
out scene-specific user interaction such as feature definition or
preprocessing. We adopt a variant of the fully convolutional
network (FCN) [5], and explore the potential of combining
end-to-end trained deep networks with massive amounts of
noisy OSM labels. We evaluate the extreme variant of our
approach, without any manual labeling, on three major cities
(Chicago, Paris, and Zurich) with different urban structures.
Since quantitative evaluations on these large data sets are
limited by the inaccuracy of the labels, which is also present
in the test sets, we also perform experiments for a smaller
data set from the city of Potsdam. There, high-precision
manually annotated ground truth is available, which allows us
to compare different levels of project-specific input, including
the baseline where only manually labeled training data are
used, the extreme case of only automatically generated training
labels, and variants in between. We also assess the mod-
els’ capabilities regarding generalization and transfer learning
between unseen geographic locations.
We find in this paper that training on noisy labels does
work well, but only with substantially larger training sets;
with small training sets (≈2 km²), it does not reach
the performance of hand-labeled pixel-accurate training data.
Moreover, even in the presence of high-quality training data,
massive OSM labels further improve the classifier, and hence
can be used to significantly reduce the manual labeling efforts.
According to our experiments, the differences are really due to
the training labels, since segmentation performance of OSM
labels is stable across different image sets of the same scene.
For practical reasons, our study is limited to buildings
and roads, which are available from OSM, and to RGB
images from Google Maps, subject to unknown radiometric
manipulations. We hope that similar studies will also be
performed with the vast archives of proprietary image and
map data held by state mapping authorities and commercial
satellite providers. Finally, this is a step in a journey that will
ultimately bring us closer to the utopian vision that a whole
range of mapping tasks no longer need user input, but can be
completely automated by the World Wide Web.

² All top-performing methods on big benchmarks are CNN variants, both in generic computer vision, e.g., the Pascal VOC Challenge, http://host.robots.ox.ac.uk/pascal/VOC/, and in remote sensing, e.g., the ISPRS semantic labeling challenge, http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html
II. RELATED WORK
There is a huge literature about semantic segmentation in
remote sensing. A large part deals with rather low-resolution
satellite images, whereas our work in this paper deals with
VHR aerial images (see [6] for an overview).
Aerial data with a ground sampling distance GSD ≤ 20 cm
contains rich details about urban objects such as roads, build-
ings, trees, and cars, and is a standard source for urban
mapping projects. Since urban environments are designed by
humans according to relatively stable design constraints, early
work attempted to construct object descriptors via sets of rules,
most prominently for building detection in 2-D [7], [8] or in
3-D [9]–[11], and for road extraction [12]–[14]. A general
limitation of hierarchical rule systems, be they top-down or
bottom-up, is poor generalization across different city layouts.
Hard thresholds at early stages tend to delete information
that can hardly be recovered later, and hard-coded expert
knowledge often misses important evidence that is less obvious
to the human observer.
Machine learning thus aims to learn classification rules
directly from the data. As local evidence, conventional classi-
fiers are fed with raw pixel intensities, simple arithmetic com-
binations such as vegetation indices, and different statistics or
filter responses that describe the local image texture [15]–[17].
An alternative is to precompute a large redundant set of
local features for training and let a discriminative classi-
fier (e.g., boosting and random forest) select the optimal
subset [18]–[21] for the task.
More global object knowledge that cannot be learned from
local pixel features can be introduced via probabilistic priors.
Two related probabilistic frameworks have been successfully
applied to this task, marked point processes (MPPs) and
graphical models. For example, [22] and [23] formulate MPPs
that explicitly model road network topologies, while [24] use
a similar approach to extract building footprints. MPPs rely
on object primitives like lines or rectangles that are matched
to the image data by sampling. Even if data driven [25], such
Monte Carlo sampling has high computational cost and does
not always find good configurations. Graphical models provide
similar modeling flexibility, but in general also lead to hard
optimization problems. For restricted cases (e.g., submodular
objective functions), efficient optimizers exist. Although there
is a large body of literature that aims to tailor conditional ran-
dom fields for object extraction in computer vision and remote
sensing, relatively few authors tackle semantic segmentation
in urban scenes (see [26]–[30]).
Given the difficulty of modeling high-level correlations,
much effort has gone into improving the local evidence by
finding more discriminative object features [21], [31], [32].
The resulting feature vectors are fed to a standard classifier
(e.g., decision trees or support vector machines) to infer
probabilities per object category. Some authors invest a lot
of efforts to reduce the dimension of the feature space to

a maximally discriminative subset (see [33]–[36]), although
this seems to have only limited effect—at least with modern
discriminative classifiers.
Deep neural networks do not require a separate feature
definition step, but instead learn the most discriminative
feature set for a given data set and task directly from raw
images. They go back to [37] and [38], but at the time were
limited by a lack of computing power and training data. After
their comeback in the 2012 ImageNet challenge [39], [40],
deep learning approaches, and in particular deep CNNs, have
achieved impressive results for diverse image analysis tasks.
State-of-the-art network architectures (see [41]) have many
(often 10–20, but up to >100) layers of local filters and
thus large receptive fields in the deep layers, which makes
it possible to learn complex local-to-global (nonlinear) object
representations and long-range contextual relations directly
from raw image data. An important property of deep CNNs
is that both training and inference are easily parallelizable,
especially on GPUs, and thus scale to millions of training and
testing images.
Quickly, CNNs were also applied to semantic segmenta-
tion of images [42]. Our approach in this paper is based
on the FCN architecture of [5], which returns a structured
spatially explicit label image (rather than a global image label).
While spatial aggregation is nevertheless required to represent
context, FCNs also include in-network upsampling back to
the resolution of the original image. They have already been
successfully applied to semantic segmentation of aerial images
(see [43]–[45]). In fact, the top performers on the ISPRS
semantic segmentation benchmark all use CNNs. We note that
(nonconvolutional) deep networks in conjunction with OSM
labels have also been applied for patch-based road extraction
in overhead images of 1 m GSD at large scale [46], [47].
More recently, Máttyus et al. [48] combine OSM data with
aerial images to augment maps with additional information
from imagery like road widths. They design a sophisticated
random field to probabilistically combine various sources of
road evidence, for instance, cars, to estimate road widths at
global scale using OSM and aerial images.
To the best of our knowledge, only two works have made
attempts to investigate how results of CNNs trained on large-
scale OSM labels can be fine-tuned to achieve more accurate
results for labeling remote sensing images [49], [50]. However,
we are not aware of any large-scale, systematic, comparative,
and quantitative study that investigates using large-scale train-
ing labels from inaccurate map data for semantic segmentation
of aerial images.
III. METHODS
We first describe our straightforward approach to generate
training data automatically from OSM, and then give technical
details about the employed FCN architecture and our training procedure.
A. Generation of Training Data
We use a simple automatic approach to generate data sets of
VHR aerial images in RGB format and corresponding labels
for classes building, road, and background. Aerial images are
downloaded from Google Maps, and geographic coordinates
of buildings and roads are downloaded from OSM. We prefer
to use OSM maps instead of Google Maps, because the
latter can only be downloaded as raster images.³ OSM data
can be accessed and manipulated in vector format, and each
object type comes with meta data and identifiers that allow
straightforward filtering. Regarding coregistration, we find that
OSM and Google Maps align relatively well, even though
they have been acquired and processed separately.⁴ Most
local misalignments are caused by facades of high buildings
that overlap with roads or background due to perspective
effects. It is apparent that in our test areas Google provides
orthophotographs rectified with respect to a bare earth digital
terrain model (DTM), not “true” orthophotographs rectified
with a digital surface model (DSM). According to our own
measurements on a subset of the data, this effect is relatively
mild, generally < 10 pixels displacement. We found that this
does not introduce major errors as long as there are no high-
rise buildings. It may be more problematic for extreme scenes
such as Singapore or Manhattan.
To generate pixel-wise label maps, the geographic coor-
dinates of OSM building corners and road center lines are
transformed to pixel coordinates. For each building, a polygon
through the corner points is plotted at the corresponding image
location. For roads, the situation is slightly more complex.
OSM provides only coordinates of road center lines, but
no precise road widths. There is, however, a road category
label (“highway tag”) for most roads. We determined an
average road width for each category on a small subset of
the data, and validated it on a larger subset (manually, one-
off). This simple strategy works reasonably well, with a
mean error of ≈11 pixels for the road boundary, compared
with ≈100 pixels of road width.⁵ In (very rare) cases where
the ad hoc procedure produced label collisions, pixels claimed
by both building and road were assigned to buildings. Pixels
neither labeled building nor road form the background class.
Examples of images overlaid with automatically generated
OSM labels are shown in Fig. 1.
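The following Python sketch illustrates the label-generation recipe just described: filled building polygons, road center lines buffered to a per-category width, and buildings winning label collisions. It is our illustration, not the authors' code; coordinates are assumed to be already converted to pixel space, and the class ids and per-tag road widths are made-up values.

# Minimal sketch of OSM label rasterization (not the authors' code).
from PIL import Image, ImageDraw

BACKGROUND, BUILDING, ROAD = 0, 1, 2
ROAD_WIDTH_PX = {"residential": 60, "primary": 110, "footway": 25}  # illustrative per-tag widths

def rasterize_labels(size, buildings, roads):
    """size: (width, height); buildings: list of [(x, y), ...] polygons;
    roads: list of (highway_tag, [(x, y), ...]) center lines, already in pixel coordinates."""
    mask = Image.new("L", size, BACKGROUND)
    draw = ImageDraw.Draw(mask)
    for tag, line in roads:                     # draw roads first ...
        draw.line(line, fill=ROAD, width=ROAD_WIDTH_PX.get(tag, 60))
    for poly in buildings:                      # ... so buildings overwrite label collisions
        draw.polygon(poly, fill=BUILDING)
    return mask

labels = rasterize_labels(
    (500, 500),
    buildings=[[(50, 50), (150, 50), (150, 150), (50, 150)]],
    roads=[("residential", [(0, 300), (499, 320)])],
)
labels.save("osm_labels.png")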
B. Neural Network Architecture
We use a variant of FCNs in this paper (see Fig. 2). Fol-
lowing the standard neural network concept, transformations
are ordered in sequential layers that gradually transform the
pixel values to label probabilities. Most layers implement
learned convolution filters, where each neuron at level l takes
its input values only from a fixed-size spatially localized
window W in the previous layer (l−1), and outputs a vector of
differently weighted sums of those values, $c^{l} = \sum_{i \in W} w_{i}\, c^{l-1}_{i}$.
Weights $w_{i}$ are shared across all neurons of a layer, which
reflects the shift invariance of the image signal and drastically
³ Note that some national mapping agencies also provide publicly available map and other geo-data, e.g., the USGS national map program: https://nationalmap.gov/
⁴ Note that it is technically possible to obtain world coordinates of objects in Google Maps and enter those into OSM, and this might in practice also be done to some extent. However, OSM explicitly asks users not to do that.
⁵ Average deviation based on ten random samples of Potsdam, Chicago, Paris, and Zurich.

Fig. 1. Example of OSM labels overlaid with Google Maps images for (a) Zurich and (b) Paris. (Left) Aerial image and a magnified detail. (Right) Same
images overlaid with building (red) and road (blue) labels. Background is transparent in the label map.
Fig. 2. Conceptual illustration of the data flow through our variant of an FCN, which is used for the semantic segmentation of aerial images. Three skip
connections are highlighted by pale red, pale green, and pale blue, respectively. Note that we added a third (pale red) skip connection in addition to the
original ones (pale green and pale blue) of [5].
reduces the number of parameters. Each convolutional layer is
followed by a rectified linear unit (ReLU), $c^{l}_{\mathrm{rec}} = \max(0, c^{l})$,
which simply truncates all negative values to 0 and leaves
positive values unchanged [51].⁶ Convolutional layers are
interspersed with max-pooling layers that downsample the
image and retain only the maximum value inside a (2 × 2)
neighborhood. The downsampling increases the receptive field
of subsequent convolutions, and lets the network learn corre-
lations over a larger spatial context. Moreover, max-pooling
achieves local translation invariance at object level. The out-
puts of the last convolutional layers (which are very big to
capture global context, equivalent to a fully connected layer
of standard CNNs) are converted to a vector of scores for the
three target classes. These score maps are of low resolution,
and hence they are gradually upsampled again with convo-
lutional layers using a stride of only 1/2 pixel.⁷ Repeated
downsampling causes a loss of high-frequency content, which
leads to blurry boundaries that are undesirable for pixel-wise
semantic segmentation. To counter this effect, feature maps
at intermediate layers are merged back in during upsampling
(the so-called “skip connections,” see Fig. 2). The final full-
resolution score maps are then converted to label probabilities
with the softmax function.

⁶ Other nonlinearities are sometimes used, but ReLU has been shown to facilitate training (backpropagation) and has become the de-facto standard.
⁷ This operation is done by layers that are usually called “deconvolution layers” in [5] (and also in Fig. 3), although the use of this terminology has been criticized, since most implementations do not perform a real deconvolution but rather a transposed convolution.
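To make the three building blocks above concrete, the following NumPy sketch (ours, not the authors' code) implements the weighted sum over a local window W, the ReLU truncation, and 2 × 2 max-pooling for a single channel.

# NumPy illustration of convolution, ReLU, and 2x2 max-pooling (single channel,
# "valid" padding, unit stride; kernel values are arbitrary).
import numpy as np

def conv2d(x, w):
    """c^l[r, s] = sum_{i in W} w_i * c^{l-1}_i over the window anchored at (r, s)."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for s in range(out.shape[1]):
            out[r, s] = np.sum(w * x[r:r + kh, s:s + kw])
    return out

def relu(x):
    return np.maximum(0.0, x)          # truncate negative responses to zero

def max_pool_2x2(x):
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

patch = np.random.rand(8, 8)           # stand-in for one input channel
kernel = np.random.randn(3, 3)         # shared weights w_i of one filter
features = max_pool_2x2(relu(conv2d(patch, kernel)))
print(features.shape)                  # (3, 3): 8 -> 6 after valid conv, -> 3 after pooling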

Fig. 3. Our FCN architecture, which adds one more skip connection (after Pool_2, shown red) to the original model of [5]. Neurons form a 3-D structure per
layer: dimensions are written in brackets, where the first number indicates the number of feature channels, and the second and third represent spatial dimensions.
C. Implementation Details
The FCN we use is an adaptation of the architecture
proposed in [5], which itself is largely based on the VGG-16
network architecture [41]. In our implementation, we slightly
modify the original FCN and introduce a third skip connection
(marked red in Fig. 2), to preserve even finer image details.
We found that the original architecture, which has two skip
connections after Pool_3 and Pool_4 (see Fig. 3), was still
not delivering sufficiently sharp edges. The additional higher
resolution skip connection consistently improved the results
for our data (see Section IV-B). Note that adding the third skip
connection does not increase the total number of parameters
but, on the contrary, slightly reduces it ([5]: 134,277,737; ours:
134,276,540; the small difference is due to the decomposition
of the final upsampling kernel into two smaller ones).
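A simplified PyTorch sketch of such an FCN with skip connections after Pool_2, Pool_3, and Pool_4 is given below. It is our illustration, not the authors' implementation: the fc6/fc7 layers and the learned "deconvolution" upsampling of the original FCN are replaced by a single 1 × 1 scoring convolution per skip and plain bilinear upsampling, to keep the example short.

# Sketch of a VGG-16-based FCN with three skip connections (assumptions noted above).
import torch
from torch import nn
import torch.nn.functional as F
from torchvision.models import vgg16

class SkipFCN(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        feats = vgg16().features                      # randomly initialized VGG-16 backbone
        self.stage2 = feats[:10]                      # up to Pool_2 (1/4 resolution)
        self.stage3 = feats[10:17]                    # up to Pool_3 (1/8)
        self.stage4 = feats[17:24]                    # up to Pool_4 (1/16)
        self.stage5 = feats[24:]                      # up to Pool_5 (1/32)
        self.score2 = nn.Conv2d(128, num_classes, 1)  # extra skip after Pool_2
        self.score3 = nn.Conv2d(256, num_classes, 1)  # skip after Pool_3
        self.score4 = nn.Conv2d(512, num_classes, 1)  # skip after Pool_4
        self.score5 = nn.Conv2d(512, num_classes, 1)

    def forward(self, x):
        p2 = self.stage2(x)
        p3 = self.stage3(p2)
        p4 = self.stage4(p3)
        p5 = self.stage5(p4)
        s = self.score5(p5)
        for skip, score in ((p4, self.score4), (p3, self.score3), (p2, self.score2)):
            s = F.interpolate(s, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            s = s + score(skip)                       # merge finer details back in
        # back to full resolution; softmax over classes would give label probabilities
        return F.interpolate(s, size=x.shape[-2:], mode="bilinear", align_corners=False)

scores = SkipFCN()(torch.rand(1, 3, 500, 500))
print(scores.shape)                                   # torch.Size([1, 3, 500, 500])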
D. Training
All model parameters are learned by minimizing a multino-
mial logistic loss, summed over the entire 500 × 500 pixel
patch that serves as input to the FCN. Prior to train-
ing/inference, intensity distributions are centered indepen-
dently per patch by subtracting the mean, separately for each
channel (RGB).
All models are trained with stochastic gradient descent
with a momentum of 0.9, and minibatch size of one image.
Learning rates always start from 5 × 10⁻⁹ and are reduced
by a factor of ten twice when the loss and average F1
scores stopped improving. The learning rates for biases of
convolutional layers were doubled with respect to learning
rates of the filter weights. Weight decay was set to 5 × 10⁻⁴,
and dropout probability for neurons in layers ReLU_6 and
ReLU_7 was always 0.5.
ReLU_7wasalways0.5.
Training was run until the average F
1
-score on the validation
data set stopped improving, which took between 45 000 and
140 000 iterations (3.5–6.5 epochs). Weights were initialized
as in [52], except for experiments with pretrained weights.
It is a common practice in deep learning to publish pre-
trained models together with source code and paper, to ease
repeatability of results and to help others avoid training from
scratch. Starting from pretrained models, even if these have
been trained on a completely different image data set, often
improves performance, because low-level features like contrast
edges and blobs learned in early network layers are very
similar across different kinds of images.
We will use two different forms of pretraining. Either
we rely on weights previously learned on the Pascal VOC
benchmark [53] (made available by Long et al. [5]), or we
pretrain ourselves with OSM data. In Section IV, it is always
specified whether we use VOC, OSM, or no pretraining at all.
IV. E
XPERIMENTS
We present extensive experiments on four large data sets of
different cities to explore the following scenarios.
1) Complete Substitution: Can semantic segmentation be
learned without any manual labeling? What performance

Citations
Proceedings ArticleDOI
15 Jun 2019
TL;DR: This work introduces two simple yet effective network units, the spatial relation module and the channel relation module, to learn and reason about global relationships between any two spatial positions or feature maps, and then produce relation-augmented feature representations.
Abstract: Most current semantic segmentation approaches fall back on deep convolutional neural networks (CNNs). However, their use of convolution operations with local receptive fields causes failures in modeling contextual spatial relations. Prior works have sought to address this issue by using graphical models or spatial propagation modules in networks. But such models often fail to capture long-range spatial relationships between entities, which leads to spatially fragmented predictions. Moreover, recent works have demonstrated that channel-wise information also acts a pivotal part in CNNs. In this work, we introduce two simple yet effective network units, the spatial relation module and the channel relation module, to learn and reason about global relationships between any two spatial positions or feature maps, and then produce relation-augmented feature representations. The spatial and channel relation modules are general and extensible, and can be used in a plug-and-play fashion with the existing fully convolutional network (FCN) framework. We evaluate relation module-equipped networks on semantic segmentation tasks using two aerial image datasets, which fundamentally depend on long-range spatial relational reasoning. The networks achieve very competitive results, bringing significant improvements over baselines.

167 citations

Journal ArticleDOI
TL;DR: In this paper, a patch attention module (PAM) is proposed to enhance the embedding of context information based on a patchwise calculation of local attention, which can be applied to process the extracted features of convolutional neural networks (CNNs).
Abstract: The trade-off between feature representation power and spatial localization accuracy is crucial for the dense classification/semantic segmentation of remote sensing images (RSIs). High-level features extracted from the late layers of a neural network are rich in semantic information, yet have blurred spatial details; low-level features extracted from the early layers of a network contain more pixel-level information but are isolated and noisy. It is therefore difficult to bridge the gap between high- and low-level features due to their difference in terms of physical information content and spatial distribution. In this article, we contribute to solve this problem by enhancing the feature representation in two ways. On the one hand, a patch attention module (PAM) is proposed to enhance the embedding of context information based on a patchwise calculation of local attention. On the other hand, an attention embedding module (AEM) is proposed to enrich the semantic information of low-level features by embedding local focus from high-level features. Both proposed modules are lightweight and can be applied to process the extracted features of convolutional neural networks (CNNs). Experiments show that, by integrating the proposed modules into a baseline fully convolutional network (FCN), the resulting local attention network (LANet) greatly improves the performance over the baseline and outperforms other attention-based methods on two RSI data sets.

160 citations

Journal ArticleDOI
TL;DR: Results indicate that U-Nets trained on weak labels outperform baseline methods with as few as 100 labels, and Neural networks can combine superior classification performance with efficient label usage, and allow pixel-level labels to be obtained from image labels.
Abstract: Accurate automated segmentation of remote sensing data could benefit applications from land cover mapping and agricultural monitoring to urban development surveyal and disaster damage assessment. While convolutional neural networks (CNNs) achieve state-of-the-art accuracy when segmenting natural images with huge labeled datasets, their successful translation to remote sensing tasks has been limited by low quantities of ground truth labels, especially fully segmented ones, in the remote sensing domain. In this work, we perform cropland segmentation using two types of labels commonly found in remote sensing datasets that can be considered sources of “weak supervision”: (1) labels comprised of single geotagged points and (2) image-level labels. We demonstrate that (1) a U-Net trained on a single labeled pixel per image and (2) a U-Net image classifier transferred to segmentation can outperform pixel-level algorithms such as logistic regression, support vector machine, and random forest. While the high performance of neural networks is well-established for large datasets, our experiments indicate that U-Nets trained on weak labels outperform baseline methods with as few as 100 labels. Neural networks, therefore, can combine superior classification performance with efficient label usage, and allow pixel-level labels to be obtained from image labels.

153 citations

Journal ArticleDOI
TL;DR: An end-to-end trainable gated residual refinement network (GRRNet) that fuses high-resolution aerial images and LiDAR point clouds for building extraction and has competitive building extraction performance in comparison with other approaches is developed.
Abstract: Automated extraction of buildings from remotely sensed data is important for a wide range of applications but challenging due to difficulties in extracting semantic features from complex scenes like urban areas. The recently developed fully convolutional neural networks (FCNs) have shown to perform well on urban object extraction because of the outstanding feature learning and end-to-end pixel labeling abilities. The commonly used feature fusion or skip-connection refine modules of FCNs often overlook the problem of feature selection and could reduce the learning efficiency of the networks. In this paper, we develop an end-to-end trainable gated residual refinement network (GRRNet) that fuses high-resolution aerial images and LiDAR point clouds for building extraction. The modified residual learning network is applied as the encoder part of GRRNet to learn multi-level features from the fusion data and a gated feature labeling (GFL) unit is introduced to reduce unnecessary feature transmission and refine classification results. The proposed model - GRRNet is tested in a publicly available dataset with urban and suburban scenes. Comparison results illustrated that GRRNet has competitive building extraction performance in comparison with other approaches. The source code of the developed GRRNet is made publicly available for studies.

141 citations

Journal ArticleDOI
TL;DR: This paper proposes a so-called untied denoising autoencoder with sparsity, in which the encoder and decoder of the network are independent, and only the decoding of thenetwork is enforced to be nonnegative, and makes two critical additions to the network design.
Abstract: Linear spectral unmixing is the practice of decomposing the mixed pixel into a linear combination of the constituent endmembers and the estimated abundances. This paper focuses on unsupervised spectral unmixing where the endmembers are unknown a priori . Conventional approaches use either geometrical- or statistical-based approaches. In this paper, we address the challenges of spectral unmixing with unsupervised deep learning models, in specific, the autoencoder models, where the decoder serves as the endmembers and the hidden layer output serves as the abundances. In several recent attempts, part-based autoencoders have been designed to solve the unsupervised spectral unmixing problem. However, the performance has not been satisfactory. In this paper, we first discuss some important findings we make on issues with part-based autoencoders. By proof of counterexample, we show that all existing part-based autoencoder networks with nonnegative and tied encoder and decoder are inherently defective by making these inappropriate assumptions on the network structure. As a result, they are not suitable for solving the spectral unmixing problem. We propose a so-called untied denoising autoencoder with sparsity, in which the encoder and decoder of the network are independent, and only the decoder of the network is enforced to be nonnegative. Furthermore, we make two critical additions to the network design. First, since denoising is an essential step for spectral unmixing, we propose to incorporate the denoising capacity into the network optimization in the format of a denoising constraint rather than cascading another denoising preprocessor in order to avoid the introduction of additional reconstruction error. Second, to be more robust to the inaccurate estimation of a number of endmembers, we adopt an $l_{21}$ -norm on the encoder of the network to reduce the redundant endmembers while decreasing the reconstruction error simultaneously. The experimental results demonstrate that the proposed approach outperforms several state-of-the-art methods, especially for highly noisy data.

135 citations

References
Proceedings Article
03 Dec 2012
TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

28,225 citations

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations

Frequently Asked Questions (11)
Q1. What are the contributions in "Learning aerial image segmentation from online maps" ?

This paper deals with semantic segmentation of high-resolution ( aerial ) images where a semantic class label is assigned to each pixel via supervised classification as a basis for automatic map generation. The question addressed in this paper is: can training with large-scale publicly available labels replace a substantial part of the manual labeling effort and still achieve sufficient performance ? The authors adapt a state-of-the-art CNN architecture for semantic segmentation of buildings and roads in aerial images, and compare its performance when using different training data sets, ranging from manually labeled pixel-accurate ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations. The authors report their results that indicate that satisfying performance can be obtained with significantly less manual annotation effort, by exploiting noisy large-scale training data. 

In future work, it may be useful to experiment with even larger amounts of open data. On the other hand, buildings are detected equally well, and no further improvement can be noticed. Locally well-defined compact objects of similar shape and appearance are easier to learn, so further training data do not add relevant information. While pretraining is nowadays a standard practice, the authors go one step further and pretrain with aerial images and the correct set of output labels, generated automatically from free map data. 

In other words, fine-tuning with a limited quantity of problem-specific high-accuracy labels compensates for a large portion (≈65%) of the loss between experiments II and IV, with only 15% of the labeling effort.

It is a common practice in deep learning to publish pretrained models together with source code and paper, to ease repeatability of results and to help others avoid training from scratch. 

A visionary goal would be a large free publicly available “model zoo” of pretrained classifiers for the most important remote sensing applications, from which users world-wide can download suitable models and either apply them directly to their region of interest or use them as initialization for their own training. 

To generate pixel-wise label maps, the geographic coordinates of OSM building corners and road center lines are transformed to pixel coordinates. 

Following the standard neural network concept, transformations are ordered in sequential layers that gradually transform the pixel values to label probabilities. 

A possible interpretation is that complex network structures with long-range dependencies are hard to learn for the classifier, and thus more training data help. 

Two related probabilistic frameworks have been successfully applied to this task, marked point processes (MPPs) and graphical models. 

Semantic segmentation of overhead images can indeed be learned from OSM maps without any manual labeling effort, albeit at the cost of reduced segmentation accuracy.

4) Large-scale (but low-accuracy) training data allow substitution of the large majority (85% in their case) of the manually annotated high-quality data.