Learning Aerial Image Segmentation From Online Maps
Summary (3 min read)
Introduction
- 4) If low-accuracy, large-scale training data help, then they may also allow one to substitute a large portion of the manually annotated high-quality data.
- At the same time, deep CNNs also fulfill the other requirements of the study: they are data hungry and robust to label noise [4].
- For practical reasons, their study is limited to buildings and roads, which are available from OSM, and to RGB images from Google Maps, subject to unknown radiometric manipulations.
A. Generation of Training Data
- The authors use a simple automatic approach to generate data sets of VHR aerial images in RGB format and corresponding labels for classes building, road, and background.
- Aerial images are downloaded from Google Maps, and geographic coordinates of buildings and roads are downloaded from OSM.
- OSM data can be accessed and manipulated in vector format, and each object type comes with metadata and identifiers that allow straightforward filtering.
- This simple strategy works reasonably well, with a mean error of ≈11 pixels for the road boundary, compared with ≈100 pixels of road width.
- In (very rare) cases where the ad hoc procedure produced label collisions, pixels claimed by both building and road were assigned to buildings.
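The label-generation step above can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes geographic coordinates have already been projected to pixel coordinates, uses a hypothetical label encoding (0 = background, 1 = road, 2 = building), and rasterizes with Pillow. Roads are drawn first so that, as in the paper, pixels claimed by both classes end up labeled as building.

```python
import numpy as np
from PIL import Image, ImageDraw

# Hypothetical label encoding (not from the paper).
BACKGROUND, ROAD, BUILDING = 0, 1, 2

def rasterize_labels(size, building_polygons, road_centerlines, road_width_px):
    """Rasterize OSM-style vector geometry (already in pixel coordinates)
    into a per-pixel label map. Buildings are drawn last, so label
    collisions are resolved in favor of buildings."""
    canvas = Image.new("L", size, BACKGROUND)
    draw = ImageDraw.Draw(canvas)
    for line in road_centerlines:
        # Buffer the road center line to a nominal road width.
        draw.line(line, fill=ROAD, width=road_width_px)
    for poly in building_polygons:
        draw.polygon(poly, fill=BUILDING)
    return np.asarray(canvas)

# Toy example: one square building overlapping one horizontal road.
labels = rasterize_labels(
    (64, 64),
    building_polygons=[[(10, 10), (30, 10), (30, 30), (10, 30)]],
    road_centerlines=[[(0, 20), (63, 20)]],
    road_width_px=6,
)
```

In the overlap region the building label wins, mirroring the collision rule described above.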
B. Neural Network Architecture
- Following the standard neural network concept, transformations are ordered in sequential layers that gradually transform the pixel values to label probabilities.
- Note that it is technically possible to obtain world coordinates of objects in Google Maps and enter those into OSM, and this might in practice also be done to some extent.
- Average deviation based on ten random samples of Potsdam, Chicago, Paris, and Zurich.
- Convolutional layers are interspersed with max-pooling layers that downsample the image and retain only the maximum value inside a (2 × 2) neighborhood.
- Note that adding the third skip connection does not increase the total number of parameters but, on the contrary, slightly reduces it ([5]: 134,277,737; ours: 134,276,540; the small difference is due to the decomposition of the final upsampling kernel into two smaller ones).
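The (2 × 2) max-pooling operation described above can be written in a few lines of numpy. This is a generic sketch of the pooling step, not the authors' code; it assumes a single-channel feature map and a stride of 2.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keep only the maximum value
    inside each non-overlapping 2x2 neighborhood, halving the
    spatial resolution of the feature map."""
    h, w = x.shape
    # Crop to even dimensions, then group pixels into 2x2 blocks.
    blocks = x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 2, 0, 1],
    [3, 0, 1, 5],
    [2, 2, 4, 0],
    [0, 1, 0, 3],
], dtype=float)
pooled = max_pool_2x2(feature_map)
# pooled -> [[3, 5], [2, 4]]
```

Each pooling layer halves the resolution, which is exactly why the FCN needs upsampling and skip connections to recover a full-resolution label map.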
D. Training
- All model parameters are learned by minimizing a multinomial logistic loss, summed over the entire 500 × 500 pixel patch that serves as input to the FCN.
- Prior to training/inference, intensity distributions are centered independently per patch by subtracting the mean, separately for each channel (RGB).
- Learning rates always start from 5 × 10⁻⁹ and are reduced by a factor of ten, twice, whenever the loss and average F1 score stop improving.
- Starting from pretrained models, even if these have been trained on a completely different image data set, often improves performance, because low-level features like contrast edges and blobs learned in early network layers are very similar across different kinds of images.
- Either the authors rely on weights previously learned on the Pascal VOC benchmark [53] (made available by Long et al. [5]), or they pretrain themselves with OSM data.
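The two preprocessing and loss steps above can be sketched with numpy. This is an illustrative sketch under the description in the text, not the authors' implementation: per-patch, per-channel mean centering, and a multinomial logistic (softmax cross-entropy) loss summed over all pixels of a patch; shapes and values are hypothetical.

```python
import numpy as np

def center_patch(patch):
    """Center an (H, W, 3) RGB patch by subtracting its own mean,
    computed independently per channel, as done per input patch."""
    return patch - patch.mean(axis=(0, 1), keepdims=True)

def multinomial_logistic_loss(logits, labels):
    """Softmax cross-entropy summed over the whole patch.
    logits: (H, W, C) raw class scores; labels: (H, W) integer labels."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    rows, cols = np.arange(h)[:, None], np.arange(w)[None, :]
    return -log_probs[rows, cols, labels].sum()

rng = np.random.default_rng(0)
patch = rng.random((4, 4, 3))          # toy stand-in for a 500x500 patch
centered = center_patch(patch)
logits = rng.standard_normal((4, 4, 3))
labels = rng.integers(0, 3, size=(4, 4))
loss = multinomial_logistic_loss(logits, labels)
```

With uniform (all-zero) logits over C classes, the loss reduces to H × W × log(C), a handy sanity check for the summation over the patch.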
IV. EXPERIMENTS
- The authors present extensive experiments on four large data sets of different cities to explore the following scenarios.
- Note that all experiments are designed to investigate different aspects of the hypotheses made in the introduction.
A. Data Sets
- Four large data sets were downloaded from Google Maps and OSM, for the cities of Chicago, Paris, Zurich, and Berlin.
- Example images and segmentation maps of Paris and Zurich are shown in Fig. 1. In Fig. 4, the authors show the full extent of the Potsdam scene, dictated by the available images and ground truth in the ISPRS benchmark.
- In particular, the benchmark ground truth does not have a label street, but instead uses a broader class impervious surfaces, also comprising sidewalks, tarmacked courtyards, and so on.
- To allow a direct and fair comparison, the authors downsample the ISPRS Potsdam data, which come at a GSD of 5 cm, to the same GSD as the Potsdam–Google data (9.1 cm).
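Matching the two ground sampling distances amounts to resizing by the GSD ratio, 5 / 9.1 ≈ 0.55. A minimal sketch with Pillow, using a hypothetical tile size (the actual resampling filter used by the authors is not stated here):

```python
from PIL import Image

SRC_GSD_CM = 5.0   # ISPRS Potsdam
DST_GSD_CM = 9.1   # Potsdam-Google

# Hypothetical 1000 x 1000 px source tile at 5 cm GSD.
src = Image.new("RGB", (1000, 1000))

# A pixel now covers more ground, so the image shrinks by the GSD ratio.
factor = SRC_GSD_CM / DST_GSD_CM
dst = src.resize(
    (round(src.width * factor), round(src.height * factor)),
    Image.BILINEAR,
)
```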
- Each data set is split into mutually exclusive training, validation, and test regions.
B. Results and Discussion
- First, the authors validate their modifications of the FCN architecture, by comparing it with the original model of [5].
- The visual comparison between baseline II in Fig. 7(g)–(i) and IV in Fig. 9(a)–(c) shows that buildings are segmented equally well, but roads deteriorate significantly.
- The authors first train the FCN model on Google/OSM data of Chicago, Paris, Zurich, and Berlin, and use the resulting network weights as initial value, from which the model is tuned for the ISPRS data, using all the 21 training images as in baseline II.
- The success of pretraining in previous experiments raises the question—also asked in [50]—of whether one could reduce the annotation effort and use a smaller hand-labeled training set, in conjunction with large-scale OSM labels.
- Performance increases by 7 percentage points to 0.837 over baseline Ia, where the model is trained from scratch on the same high-accuracy labels.
V. CONCLUSION
- Traditionally, semantic segmentation of aerial and satellite images crucially relies on manually labeled images as training data.
- Generating such training data for a new project is costly and time consuming, and presents a bottleneck for automatic image analysis.
- Here, the authors have explored a possible solution, namely, to exploit existing data, in their case open image and map data from the Internet for supervised learning with deep CNNs.
- Such training data are available in much larger quantities, but “weaker” in the sense that the images are not representative of the test images’ radiometry, and labels automatically generated from external maps are noisier than dedicated ground truth annotations.
- 3) Even if high-quality training data are available, the large volume of additional training data improves classification.
Frequently Asked Questions (11)
Q2. What are the future works in "Learning aerial image segmentation from online maps" ?
In future work, it may be useful to experiment with even larger amounts of open data. Roads benefit most: complex structures with long-range dependencies are harder to learn, so more training data help. Buildings, on the other hand, are detected equally well either way, and no further improvement can be noticed; locally well-defined, compact objects of similar shape and appearance are easier to learn, so further training data add little relevant information. While pretraining is nowadays standard practice, the authors go one step further and pretrain with aerial images and the correct set of output labels, generated automatically from free map data.
Q3. How much of the loss is compensated for by pretraining?
In other words, fine-tuning with a limited quantity of problem-specific, high-accuracy labels compensates for a large portion (≈65%) of the loss between experiments II and IV, with only 15% of the labeling effort.
Q4. Why is it common practice to publish pretrain models together with source code and paper?
It is a common practice in deep learning to publish pretrained models together with source code and paper, to ease repeatability of results and to help others avoid training from scratch.
Q5. What is the visionary goal of the project?
A visionary goal would be a large free publicly available “model zoo” of pretrained classifiers for the most important remote sensing applications, from which users world-wide can download suitable models and either apply them directly to their region of interest or use them as initialization for their own training.
Q6. How do you generate pixel-wise label maps?
To generate pixel-wise label maps, the geographic coordinates of OSM building corners and road center lines are transformed to pixel coordinates.
Q7. How do you order the pixel values to label probabilities?
Following the standard neural network concept, transformations are ordered in sequential layers that gradually transform the pixel values to label probabilities.
Q8. What is the possible interpretation of the effect of pretraining?
A possible interpretation is that complex network structures with long-range dependencies are hard to learn for the classifier, and thus more training data help.
Q9. What are the two related probabilistic frameworks that have been successfully applied to this task?
Two related probabilistic frameworks have been successfully applied to this task, marked point processes (MPPs) and graphical models.
Q10. How can the authors learn semantic segmentation of overhead images without manual labeling effort?
Semantic segmentation of overhead images can indeed be learned from OSM maps without any manual labeling effort, albeit at the cost of reduced segmentation accuracy.
Q11. How does the large scale training data improve the classification performance?
Large-scale (but low-accuracy) training data allow substitution of the large majority (85% in their case) of the manually annotated high-quality data.