
Showing papers on "Aerial image published in 2017"


Journal ArticleDOI
TL;DR: The Aerial Image Data Set (AID) as mentioned in this paper is a large-scale data set for aerial scene classification containing more than 10,000 aerial scene images collected from remote sensing imagery.
Abstract: Aerial scene classification, which aims to automatically label an aerial image with a specific semantic category, is a fundamental problem for understanding high-resolution remote sensing imagery. In recent years, it has become an active task in the remote sensing area, and numerous algorithms have been proposed for this task, including many machine learning and data-driven approaches. However, the existing data sets for aerial scene classification, such as the UC-Merced data set and WHU-RS19, are relatively small, and the results on them are already saturated. This largely limits the development of scene classification algorithms. This paper describes the Aerial Image data set (AID): a large-scale data set for aerial scene classification. The goal of AID is to advance the state of the art in scene classification of remote sensing images. For creating AID, we collect and annotate more than 10,000 aerial scene images. In addition, a comprehensive review of the existing aerial scene classification techniques as well as recent widely used deep learning methods is given. Finally, we provide a performance analysis of typical aerial scene classification and deep learning approaches on AID, which can serve as baseline results on this benchmark.

1,081 citations


Proceedings ArticleDOI
23 Jul 2017
TL;DR: This paper proposes an aerial image labeling dataset that covers a wide range of urban settlement appearances, from different geographic locations, and experiments with convolutional neural networks on this dataset.
Abstract: New challenges in remote sensing impose the necessity of designing pixel classification methods that, once trained on a certain dataset, generalize to other areas of the earth. This may include regions where the appearance of the same type of objects is significantly different. In the literature it is common to use a single image and split it into training and test sets to train a classifier and assess its performance, respectively. However, this does not prove the generalization capabilities to other inputs. In this paper, we propose an aerial image labeling dataset that covers a wide range of urban settlement appearances, from different geographic locations. Moreover, the cities included in the test set are different from those of the training set. We also experiment with convolutional neural networks on our dataset.

560 citations


Journal ArticleDOI
TL;DR: Can training with large-scale, publicly available labels replace a substantial part of the manual labeling effort and still achieve sufficient performance? Can satisfactory performance be obtained with significantly less manual annotation effort?
Abstract: This study deals with semantic segmentation of high-resolution (aerial) images where a semantic class label is assigned to each pixel via supervised classification as a basis for automatic map generation. Recently, deep convolutional neural networks (CNNs) have shown impressive performance and have quickly become the de-facto standard for semantic segmentation, with the added benefit that task-specific feature design is no longer necessary. However, a major downside of deep learning methods is that they are extremely data-hungry, thus aggravating the perennial bottleneck of supervised classification, to obtain enough annotated training data. On the other hand, it has been observed that they are rather robust against noise in the training labels. This opens up the intriguing possibility to avoid annotating huge amounts of training data, and instead train the classifier from existing legacy data or crowd-sourced maps which can exhibit high levels of noise. The question addressed in this paper is: can training with large-scale, publicly available labels replace a substantial part of the manual labeling effort and still achieve sufficient performance? Such data will inevitably contain a significant portion of errors, but in return virtually unlimited quantities of it are available in larger parts of the world. We adapt a state-of-the-art CNN architecture for semantic segmentation of buildings and roads in aerial images, and compare its performance when using different training data sets, ranging from manually labeled, pixel-accurate ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations. We report our results that indicate that satisfying performance can be obtained with significantly less manual annotation effort, by exploiting noisy large-scale training data.

202 citations


Journal ArticleDOI
TL;DR: In this article, a large-scale aerial image data set is constructed for remote sensing image captioning and a comprehensive review is presented on the proposed data set to fully advance the task of remote sensing captioning.
Abstract: Inspired by the recent development of artificial satellites, remote sensing images have attracted extensive attention. Recently, noticeable progress has been made in scene classification and target detection. However, it is still not clear how to describe the remote sensing image content with accurate and concise sentences. In this paper, we investigate how to describe remote sensing images with accurate and flexible sentences. First, some annotation instructions are presented to better describe remote sensing images considering their special characteristics. Second, in order to exhaustively exploit the contents of remote sensing images, a large-scale aerial image data set is constructed for remote sensing image captioning. Finally, a comprehensive review is presented on the proposed data set to fully advance the task of remote sensing image captioning. Extensive experiments on the proposed data set demonstrate that the content of a remote sensing image can be completely described by generating language descriptions. The data set is available at this https URL

196 citations


Proceedings ArticleDOI
01 Jul 2017
TL;DR: In this paper, instead of manually labeling the aerial imagery, the authors propose to predict (noisy) semantic features automatically extracted from co-located ground imagery, and apply an adaptive transformation to map these features into the ground-level perspective.
Abstract: We introduce a novel strategy for learning to extract semantically meaningful features from aerial imagery. Instead of manually labeling the aerial imagery, we propose to predict (noisy) semantic features automatically extracted from co-located ground imagery. Our network architecture takes an aerial image as input, extracts features using a convolutional neural network, and then applies an adaptive transformation to map these features into the ground-level perspective. We use an end-to-end learning approach to minimize the difference between the semantic segmentation extracted directly from the ground image and the semantic segmentation predicted solely based on the aerial image. We show that a model learned using this strategy, with no additional training, is already capable of rough semantic labeling of aerial imagery. Furthermore, we demonstrate that by finetuning this model we can achieve more accurate semantic segmentation than two baseline initialization strategies. We use our network to address the task of estimating the geolocation and geo-orientation of a ground image. Finally, we show how features extracted from an aerial image can be used to hallucinate a plausible ground-level panorama.

171 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrate that the trained RSRCNN model is able to advance the state-of-the-art road extraction for aerial images, in terms of precision, recall, F-score, and accuracy.
Abstract: In this letter, we propose a road structure refined convolutional neural network (RSRCNN) approach for road extraction in aerial images. In order to obtain structured output of road extraction, both deconvolutional and fusion layers are designed in the architecture of RSRCNN. For training RSRCNN, a new loss function is proposed to incorporate the geometric information of road structure in cross-entropy loss, thus called road-structure-based loss function. Experimental results demonstrate that the trained RSRCNN model is able to advance the state-of-the-art road extraction for aerial images, in terms of precision, recall, F-score, and accuracy.

159 citations


Journal ArticleDOI
TL;DR: A CNN framework specifically adapted to the semantic labeling problem, which integrates local and global information in an efficient and flexible manner and outperforms previous techniques.
Abstract: The problem of dense semantic labeling consists in assigning semantic labels to every pixel in an image. In the context of aerial image analysis, it is particularly important to yield high-resolution outputs. In order to use convolutional neural networks (CNNs) for this task, it is required to design new specific architectures to provide fine-grained classification maps. Many dense semantic labeling CNNs have been recently proposed. Our first contribution is an in-depth analysis of these architectures. We establish the desired properties of an ideal semantic labeling CNN, and assess how those methods stand with regard to these properties. We observe that even though they provide competitive results, these CNNs often underexploit properties of semantic labeling that could lead to more effective and efficient architectures. Out of these observations, we then derive a CNN framework specifically adapted to the semantic labeling problem. In addition to learning features at different resolutions, it learns how to combine these features. By integrating local and global information in an efficient and flexible manner, it outperforms previous techniques. We evaluate the proposed framework and compare it with state-of-the-art architectures on public benchmarks of high-resolution aerial image labeling.

155 citations


Journal ArticleDOI
TL;DR: In this article, a state-of-the-art CNN architecture is adapted for semantic segmentation of buildings and roads in aerial images, and its performance is compared when using different training data sets, ranging from manually labeled ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations.
Abstract: This paper deals with semantic segmentation of high-resolution (aerial) images where a semantic class label is assigned to each pixel via supervised classification as a basis for automatic map generation. Recently, deep convolutional neural networks (CNNs) have shown impressive performance and have quickly become the de-facto standard for semantic segmentation, with the added benefit that task-specific feature design is no longer necessary. However, a major downside of deep learning methods is that they are extremely data hungry, thus aggravating the perennial bottleneck of supervised classification, to obtain enough annotated training data. On the other hand, it has been observed that they are rather robust against noise in the training labels. This opens up the intriguing possibility to avoid annotating huge amounts of training data, and instead train the classifier from existing legacy data or crowd-sourced maps that can exhibit high levels of noise. The question addressed in this paper is: can training with large-scale publicly available labels replace a substantial part of the manual labeling effort and still achieve sufficient performance? Such data will inevitably contain a significant portion of errors, but in return virtually unlimited quantities of it are available in larger parts of the world. We adapt a state-of-the-art CNN architecture for semantic segmentation of buildings and roads in aerial images, and compare its performance when using different training data sets, ranging from manually labeled pixel-accurate ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations. We report our results that indicate that satisfying performance can be obtained with significantly less manual annotation effort, by exploiting noisy large-scale training data.

149 citations


Journal ArticleDOI
TL;DR: The final field emergence of maize can be rapidly assessed, allowing a more precise assessment of the final yield parameters; the approach has the potential to optimize farm management and to support field experimentation for agronomic and breeding purposes.
Abstract: Precision phenotyping, especially the use of image analysis, allows researchers to gain information on plant properties and plant health. Aerial image detection with unmanned aerial vehicles (UAVs) provides new opportunities in precision farming and precision phenotyping. Precision farming has created a critical need for spatial data on plant density. The plant number not only reflects the final field emergence but also allows a more precise assessment of the final yield parameters. The aim of this work is to advance UAV use and image analysis as a possible high-throughput phenotyping technique. In this study, four different maize cultivars were planted in plots with different seeding systems (in rows and equidistantly spaced) and different nitrogen fertilization levels (applied at 50, 150 and 250 kg N/ha). The experimental field, encompassing 96 plots, was overflown at a 50-m height with an octocopter equipped with a 10-megapixel camera taking a picture every 5 s. Images were recorded between BBCH 13–15 (the BBCH scale identifies the phenological development stage of a plant; here, the 3- to 5-leaf stage), when the color of young leaves differs from that of older leaves. Close correlations up to R2 = 0.89 were found between in situ and image-based plant counts by adapting a decorrelation stretch contrast enhancement procedure, which enhanced color differences in the images. On average, the error between visually and digitally counted plants was ≤5%. Ground cover, as determined by analyzing green pixels, ranged between 76% and 83% at these stages. However, the correlation between ground cover and digitally counted plants was very low. The presence of weeds and blur effects in the images represent possible sources of error in counting plants. In conclusion, the final field emergence of maize can be rapidly assessed, allowing a more precise assessment of the final yield parameters. The use of UAVs and image processing has the potential to optimize farm management and to support field experimentation for agronomic and breeding purposes.
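As an illustration of the green-pixel analysis mentioned above, the sketch below computes a ground-cover fraction from an RGB aerial image using a simple excess-green index; this is a hedged stand-in for the decorrelation-stretch contrast enhancement actually used in the study, and the threshold value is an assumption made for illustration.

```python
import numpy as np

def ground_cover_fraction(rgb, threshold=0.1):
    """Fraction of pixels classified as vegetation (ground cover).

    `rgb` is an H x W x 3 array scaled to [0, 1]. The excess-green index
    (2g - r - b) is an illustrative substitute for the decorrelation-stretch
    procedure used in the paper; the threshold is assumed.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    exg = 2.0 * g - r - b
    return float(np.mean(exg > threshold))
```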

130 citations


Proceedings ArticleDOI
08 May 2017
TL;DR: The experimental results show that the CNN-based washed-away building detection system achieves 94–96% classification accuracy across all conditions, indicating the promising applicability of CNNs for washed-away building detection.
Abstract: This paper explores the effective use of Convolutional Neural Networks (CNNs) in the context of washed-away building detection from pre- and post-tsunami aerial images. To this end, we compile a dedicated, labeled aerial image dataset to construct models that classify whether a building is washed-away. Each datum in the set is a pair of pre- and post-tsunami image patches and encompasses a target building at the center of the patch. Using this dataset, we comprehensively evaluate CNNs from a practical-application viewpoint, e.g., input scenarios (pre-tsunami images are not always available), input scales (building size varies) and different configurations for CNNs. The experimental results show that our CNN-based washed-away detection system achieves 94–96% classification accuracy across all conditions, indicating the promising applicability of CNNs for washed-away building detection.
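One common way to feed such pre/post patch pairs to a single CNN is to stack them along the channel axis; the sketch below illustrates that arrangement as an assumption for illustration, not the paper's exact input encoding (which also covers the case where the pre-tsunami image is unavailable).

```python
import numpy as np

def make_pair_input(pre_patch, post_patch):
    """Stack co-registered pre- and post-tsunami patches channel-wise.

    Both patches are H x W x C arrays centred on the target building.
    Channel stacking is one common encoding for image-pair CNN input;
    it is an illustrative choice, not necessarily the paper's.
    """
    return np.concatenate([pre_patch, post_patch], axis=-1)
```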

115 citations


Journal ArticleDOI
TL;DR: The underpinning concepts behind image capture are revisited, from which the requirements for acquiring sharp, well-exposed and suitable image data are derived, and recommendations for the reporting of results are given to help improve the confidence in, and reusability of, surveys.
Abstract: Aerial image capture has become very common within the geosciences due to the increasing affordability of low-payload (<20 kg) unmanned aerial vehicles (UAVs) for consumer markets. Their applicatio...

Journal ArticleDOI
24 Nov 2017-Sensors
TL;DR: A CNN-based detection model combining two independent convolutional neural networks, where the first network is applied to generate a set of vehicle-like regions from multi-feature maps of different hierarchies and scales, yields high performance, not only in detection accuracy but also in detection speed.
Abstract: Vehicle detection in aerial images is an important and challenging task. Traditionally, many target detection models based on sliding-window fashion were developed and achieved acceptable performance, but these models are time-consuming in the detection phase. Recently, with the great success of convolutional neural networks (CNNs) in computer vision, many state-of-the-art detectors have been designed based on deep CNNs. However, these CNN-based detectors are inefficient when applied in aerial image data due to the fact that the existing CNN-based models struggle with small-size object detection and precise localization. To improve the detection accuracy without decreasing speed, we propose a CNN-based detection model combining two independent convolutional neural networks, where the first network is applied to generate a set of vehicle-like regions from multi-feature maps of different hierarchies and scales. Because the multi-feature maps combine the advantage of the deep and shallow convolutional layer, the first network performs well on locating the small targets in aerial image data. Then, the generated candidate regions are fed into the second network for feature extraction and decision making. Comprehensive experiments are conducted on the Vehicle Detection in Aerial Imagery (VEDAI) dataset and Munich vehicle dataset. The proposed cascaded detection model yields high performance, not only in detection accuracy but also in detection speed.

Proceedings ArticleDOI
14 May 2017
TL;DR: A deep learning approach for power infrastructure detection is proposed, using graph-based post-processing and spectral clustering for the conductor lines, pylons and insulators that form the core parts of power infrastructure.
Abstract: Infrastructure detection and monitoring is a difficult task. Due to the advances in unmanned vehicles and image analytics, it is possible to decrease the human effort and achieve consistent results in infrastructure assessments using aerial image processing. Reliable detection and integrity checking of power infrastructure, including conductor lines, pylons and insulators in a diverse background, is the most challenging task in drone-based automatic infrastructure monitoring. Most techniques in the literature use a first-principles approach that tries to represent the image as features of interest. This paper proposes a deep learning approach for power infrastructure detection. Graph-based post-processing is applied to improve the outcomes of the generated deep model. An F-score of 75% is achieved using the deep model, which is further improved using spectral clustering for the conductor lines, pylons and insulators that form the core parts of power infrastructure.

Journal ArticleDOI
TL;DR: This letter proposes a novel rotation-invariant feature for object detection in optical remote sensing images that can incorporate partial angular spatial information in addition to radial spatial information and further calculated between different rings for a redundant representation of the spatial layout.
Abstract: This letter proposes a novel rotation-invariant feature for object detection in optical remote sensing images. Different from previous rotation-invariant features, the proposed rotation-invariant matrix (RIM) can incorporate partial angular spatial information in addition to radial spatial information. Moreover, it can be further calculated between different rings for a redundant representation of the spatial layout. Based on the RIM, we further propose an RIM_FV_RPP feature for object detection. For an image region, we first densely extract RIM features from overlapping blocks; then, these RIM features are encoded into Fisher vectors; finally, a pyramid pooling strategy that hierarchically accumulates Fisher vectors in ring subregions is used to encode richer spatial information while maintaining rotation invariance. Both of the RIM and RIM_FV_RPP are rotation invariant. Experiments on airplane and car detection in optical remote sensing images demonstrate the superiority of our feature to the state of the art.

Journal ArticleDOI
TL;DR: It is shown that an accuracy of a few centimeters can be reached with this system, which uses a low-cost UAV and GPS module coupled with the IGN-LOEMI home-made camera.
Abstract: This article presents a coupled system consisting of a single-frequency GPS receiver and a light photogrammetric quality camera embedded in an Unmanned Aerial Vehicle (UAV). The aim is to produce high quality data that can be used in metrology applications. The issue of Integrated Sensor Orientation (ISO) of camera poses using only GPS measurements is presented and discussed. The accuracy reached by our system based on sensors developed at the French Mapping Agency (IGN) Opto-Electronics, Instrumentation and Metrology Laboratory (LOEMI) is qualified. These sensors are specially designed for close-range aerial image acquisition with a UAV. Lever-arm calibration and time synchronization are explained and performed to reach maximum accuracy. All processing steps are detailed from data acquisition to quality control of final products. We show that an accuracy of a few centimeters can be reached with this system which uses low-cost UAV and GPS module coupled with the IGN-LOEMI home-made camera.
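The lever-arm calibration mentioned above amounts to shifting the GPS antenna position to the camera projection centre using the platform attitude. Below is a minimal sketch, assuming the body-to-world rotation and the lever-arm vector are already known; the full IGN-LOEMI processing chain (including time synchronization) is of course more involved.

```python
import numpy as np

def camera_position(antenna_pos_world, R_body_to_world, lever_arm_body):
    """Apply a lever-arm correction to a GPS antenna position.

    antenna_pos_world : (3,) antenna position in the mapping frame
    R_body_to_world   : (3, 3) platform attitude (body -> world rotation)
    lever_arm_body    : (3,) vector from camera projection centre to antenna,
                        expressed in the body frame (sign conventions vary)
    """
    return (np.asarray(antenna_pos_world)
            - np.asarray(R_body_to_world) @ np.asarray(lever_arm_body))
```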

Journal ArticleDOI
Chen Chen, Weiguo Gong, Yufeng Hu, Yongliang Chen, Y. Ding
TL;DR: The authors introduce a new oriented layer network for detecting the rotation angle of buildings on the basis of the successful VGG-net R-CNN model and propose a novel model with a key feature related to orientation.
Abstract: The automated detection of buildings in aerial images is a fundamental problem encountered in aerial and satellite image analysis. Recently, thanks to advances in feature descriptions, the Region-based CNN model (R-CNN) for object detection has been receiving increasing attention. Despite its excellent performance in object detection, it is problematic to directly leverage the features of the R-CNN model for building detection in a single aerial image. As we know, a single aerial image is in vertical view and buildings possess a significant directional feature. However, in the R-CNN model, the direction of the building is ignored and the detection results are represented by horizontal rectangles. For this reason, detection results with horizontal rectangles cannot describe buildings precisely. To address this problem, in this paper, we propose a novel model with a key feature related to orientation, namely, Oriented R-CNN (OR-CNN). Our contributions are mainly in the following two aspects: 1) introducing a new oriented layer network for detecting the rotation angle of buildings on the basis of the successful VGG-net R-CNN model; 2) proposing the oriented rectangle to leverage the powerful R-CNN for remote-sensing building detection. In experiments, we establish a complete and brand-new data set for training our oriented R-CNN model and comprehensively evaluate the proposed method on a publicly available building detection data set. We demonstrate state-of-the-art results compared with the previous baseline methods.
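To make the oriented-rectangle output concrete, the sketch below converts a (centre, size, angle) parameterisation into its four corner points; this is a common encoding for oriented boxes and is assumed here for illustration, not taken from the paper.

```python
import numpy as np

def oriented_box_corners(cx, cy, w, h, angle_rad):
    """Return the 4 corner points of an oriented rectangle.

    (cx, cy) is the centre, (w, h) the side lengths and angle_rad the
    rotation angle; a common parameterisation, assumed for illustration.
    """
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    R = np.array([[c, -s], [s, c]])
    half = 0.5 * np.array([[-w, -h], [w, -h], [w, h], [-w, h]])
    return half @ R.T + np.array([cx, cy])
```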

Journal ArticleDOI
TL;DR: In this paper, the suitability of aerial image point clouds for estimating canopy density and gaps is evaluated and compared with estimates from aerial image spectral data and Landsat 5 data; the results show that the point clouds describe only the outermost layer of the canopy and miss the details of the lower canopy.
Abstract: Canopy cover (CC) is a variable used to describe the status of forests and forested habitats, but also the variable used primarily to define what counts as a forest. The estimation of CC has relied heavily on remote sensing, with past studies focusing on satellite imagery as well as Airborne Laser Scanning (ALS) using light detection and ranging (lidar). Of these, ALS has proven highly accurate, because the fraction of pulses penetrating the canopy represents a direct measurement of the canopy gap percentage. However, the methods of photogrammetry can be applied to produce point clouds from aerial images that are fairly similar to airborne lidar data. Currently there is little information about how well such point clouds measure canopy density and gaps. The aim of this study was to assess the suitability of aerial image point clouds for CC estimation and compare the results with those obtained using spectral data from aerial images and Landsat 5. First, we modeled CC for n = 1149 lidar plots using field-measured CCs and lidar data. Next, these data were split into five subsets in the north-south direction (y-coordinate). Finally, four CC models (AerialSpectral, AerialPointcloud, AerialCombi (spectral + pointcloud) and Landsat) were created and used to predict new CC values for the lidar plots, subset by subset, using five-fold cross validation. The Landsat and AerialSpectral models performed with RMSEs of 13.8% and 12.4%, respectively. The AerialPointcloud model reached an RMSE of 10.3%, which was further improved by the inclusion of spectral data; the RMSE of the AerialCombi model was 9.3%. We noticed that the aerial image point clouds managed to describe only the outermost layer of the canopy and missed the details of the lower canopy, which resulted in weak characterization of the total CC variation, especially in the tails of the data.
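The model comparison above rests on cross-validated RMSE. A minimal sketch of that evaluation loop is shown below, using plain linear regression as a placeholder for the CC models and a random five-fold split rather than the paper's north-south (y-coordinate) subsets.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def cv_rmse(X, y, n_splits=5, seed=0):
    """Five-fold cross-validated RMSE for a canopy-cover model.

    LinearRegression and the random KFold split are placeholders; the
    paper splits plots into five north-south subsets and its CC models
    are not reproduced here.
    """
    errors = []
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = LinearRegression().fit(X[train], y[train])
        residual = model.predict(X[test]) - y[test]
        errors.append(np.sqrt(np.mean(residual ** 2)))
    return float(np.mean(errors))
```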

Journal ArticleDOI
TL;DR: Comparisons to state-of-the-art methods demonstrate the superiority of the proposed method in VHSR image classification.
Abstract: Aerial image classification has become popular and has attracted extensive research efforts in recent decades. The main challenge lies in its very high spatial resolution but relatively insufficient spectral information. To this end, spatial-spectral feature extraction is a popular strategy for classification. However, parameter determination for that feature extraction is usually time-consuming and depends excessively on experience. In this paper, an automatic spatial feature extraction approach based on image raster and segmental vector data cross-analysis is proposed for the classification of very high spatial resolution (VHSR) aerial imagery. First, multi-resolution segmentation is used to generate strongly homogeneous image objects and extract corresponding vectors. Then, to automatically explore the region of a ground target, two rules, which are derived from Tobler’s First Law of Geography (TFL) and a topological relationship of vector data, are integrated to constrain the extension of a region around a central object. Third, the shape and size of the extended region are described. A final classification map is achieved through a supervised classifier using shape, size, and spectral features. Experiments on three real aerial images of VHSR (0.1 to 0.32 m) are done to evaluate effectiveness and robustness of the proposed approach. Comparisons to state-of-the-art methods demonstrate the superiority of the proposed method in VHSR image classification.

Journal ArticleDOI
Menghan Xia, Jian Yao, Renping Xie, Li Li, Wei Zhang
TL;DR: A generic framework is proposed for globally consistent alignment of images captured from approximately planar scenes via topology analysis, capable of resisting perspective distortion while preserving local alignment accuracy.

Proceedings ArticleDOI
01 May 2017
TL;DR: This paper proposes Feature-based Localization between Air and Ground (FLAG), a method for computing global position updates by matching features observed from ground to features in an aerial image, using stable, descriptorless features associated with vertical structure in the environment around a ground robot in previously unmapped areas.
Abstract: In GPS-denied environments, robot systems typically revert to navigating with dead-reckoning and relative mapping, accumulating error in their global pose estimate. In this paper, we propose Feature-based Localization between Air and Ground (FLAG), a method for computing global position updates by matching features observed from ground to features in an aerial image. Our method uses stable, descriptorless features associated with vertical structure in the environment around a ground robot in previously unmapped areas, referencing only overhead imagery, without GPS. Multiple-hypothesis data association with a particle filter enables efficient recovery from data association error and odometry uncertainty. We implement a stereo system to demonstrate our vertical feature based global positioning approach in both indoor and outdoor scenarios, and show comparable performance to laser-scan-matching results in both environments.

Journal ArticleDOI
TL;DR: This paper presents research on an ortho-rectification technique based on a field programmable gate array (FPGA) platform that can be implemented on board spacecraft for (near) real-time processing.
Abstract: The traditional ortho-rectification technique for remotely sensed (RS) images, which is performed on the basis of a ground image processing platform, has been unable to meet timeliness or near timeliness requirements. To solve this problem, this paper presents research on an ortho-rectification technique based on a field programmable gate array (FPGA) platform that can be implemented on board spacecraft for (near) real-time processing. The proposed FPGA-based ortho-rectification method contains three modules, i.e., a memory module, a coordinate transformation module (including the transformation from geodetic coordinates to photo coordinates, and the transformation from photo coordinates to scanning coordinates), and an interpolation module. Two datasets, aerial images located in central Denver, Colorado, USA, and an aerial image from the example dataset of ERDAS IMAGINE 9.2, are used to validate the processing speed and accuracy. Compared to traditional ortho-rectification technology, the throughput from the proposed FPGA-based platform and the personal computer (PC)-based platform are 11,182.3 kilopixels per second and 2582.9 kilopixels per second, respectively. This means that the proposed FPGA-based platform is 4.3 times faster than the PC-based platform for processing the same RS images. In addition, the root-mean-square errors of the planimetric coordinates φX and φY and the distance φS are 1.09 m, 1.61 m, and 1.93 m, respectively, which can meet the requirements of correction accuracy in practice.
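The reported speed-up factor follows directly from the two throughput figures; the quick check below reproduces it.

```python
fpga_kpix_per_s = 11182.3   # FPGA-based platform throughput (kilopixels/s)
pc_kpix_per_s = 2582.9      # PC-based platform throughput (kilopixels/s)
print(round(fpga_kpix_per_s / pc_kpix_per_s, 1))  # -> 4.3, the reported factor
```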

Proceedings ArticleDOI
TL;DR: This work proposes a deep neural network derived from the Faster R-CNN approach for multi-category object detection in aerial images and shows how the detection accuracy can be improved by replacing the network architecture by an architecture especially designed for handling small object sizes.
Abstract: Multi-category object detection in aerial images is an important task for many applications such as surveillance, tracking, or search and rescue. In recent years, deep learning approaches using features extracted by convolutional neural networks (CNN) significantly improved the detection accuracy on detection benchmark datasets compared to traditional approaches based on hand-crafted features, as used for object detection in aerial images. However, these approaches are not directly transferable to aerial images, as the network architectures used have an insufficient feature-map resolution for handling small instances. This consequently results in poor localization accuracy or missed detections, as the network architectures are explored and optimized for datasets that differ considerably from aerial images, in particular in object size and the image fraction occupied by an object. In this work, we propose a deep neural network derived from the Faster R-CNN approach for multi-category object detection in aerial images. We show how the detection accuracy can be improved by replacing the network architecture with an architecture especially designed for handling small object sizes. Furthermore, we investigate the impact of different parameters of the detection framework on the detection accuracy for small objects. Finally, we demonstrate the suitability of our network for object detection in aerial images by comparing it to traditional baseline approaches and deep learning based approaches on the publicly available DLR 3K Munich Vehicle Aerial Image Dataset, which comprises multiple object classes such as car, van, truck, bus and camper.

Journal ArticleDOI
TL;DR: A framework to bridge the gap between sketches and aerial images is proposed and the Euclidean distance is used to measure the cross-domain similarity between aerial images and sketches.
Abstract: This paper investigates the problem of retrieving aerial scene images by using semantic sketches, since the state-of-the-art retrieval systems turn out to be invalid when there is no exemplar query aerial image available. However, due to the complex surface structures and huge variations of resolutions of aerial images, it is very challenging to retrieve aerial images with sketches and few studies have been devoted to this task. In this article, for the first time to our knowledge, we propose a framework to bridge the gap between sketches and aerial images. First, an aerial sketch-image database is collected, and the images and sketches it contains are augmented to various levels of details. We then train a multi-scale deep model by the new dataset. The fully-connected layers of the network in each scale are finally connected and used as cross-domain features, and the Euclidean distance is used to measure the cross-domain similarity between aerial images and sketches. Experiments on several commonly used aerial image datasets demonstrate the superiority of the proposed method compared with the traditional approaches.
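Retrieval in this setting reduces to nearest-neighbour search in the shared feature space. Below is a minimal sketch, assuming the cross-domain descriptors from the fully-connected layers have already been extracted as NumPy arrays.

```python
import numpy as np

def rank_by_sketch(sketch_feature, image_features):
    """Rank aerial images by Euclidean distance to a sketch descriptor.

    sketch_feature : (D,) cross-domain descriptor of the query sketch
    image_features : (N, D) descriptors of the database aerial images
    Returns image indices ordered from most to least similar.
    """
    distances = np.linalg.norm(image_features - sketch_feature[None, :], axis=1)
    return np.argsort(distances)
```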

Patent
08 Sep 2017
TL;DR: A deep learning-based insulator identification method is proposed, comprising pre-processing of aerial images and data augmentation via methods such as geometric transformation, contrast enhancement and analog noise addition.
Abstract: The present invention discloses a deep learning-based insulator identification method. The method comprises: pre-processing an aerial image; augmenting the data via methods such as geometric transformation, contrast enhancement and analog noise addition; acquiring insulator samples and classifying them according to insulator type; determining the model structure to be trained; feeding the samples into the model and continuously adjusting the weights and bias parameters by forward and backward propagation until an optimal set of model parameters is determined; and, based on the trained model, taking the image to be detected as the input signal and obtaining the final detection and identification result through the network's multi-layer convolution, pooling and fully connected operations. According to the present invention, the insulator characteristics are learned continuously by a deep learning method, a learning network model is determined, different insulators are identified under different background environments, and support is provided for electric power maintenance decisions.

Proceedings ArticleDOI
TL;DR: Preliminary results show an improvement in the deep learning algorithm when real image training data are augmented with the simulated images, especially when obtaining sufficient real data is particularly challenging.
Abstract: Training deep convolutional networks for satellite or aerial image analysis often requires a large amount of training data. For a more robust algorithm, training data need to have variations not only in the background and target, but also radiometric variations in the image such as shadowing, illumination changes, atmospheric conditions, and imaging platforms with different collection geometry. Data augmentation is a commonly used approach to generating additional training data. However, this approach is often insufficient in accounting for real world changes in lighting, location or viewpoint outside of the collection geometry. Alternatively, image simulation can be an efficient way to augment training data that incorporates all these variations, such as changing backgrounds, that may be encountered in real data. The Digital Imaging and Remote Sensing Image Generation (DIRSIG) model is a tool that produces synthetic imagery using a suite of physics-based radiation propagation modules. DIRSIG can simulate images taken from different sensors with variation in collection geometry, spectral response, solar elevation and angle, atmospheric models, target, and background. Simulation of Urban Mobility (SUMO) is a multi-modal traffic simulation tool that explicitly models vehicles that move through a given road network. The output of the SUMO model was incorporated into DIRSIG to generate scenes with moving vehicles. The same approach was used when using helicopters as targets, but with slight modifications. Using the combination of DIRSIG and SUMO, we quickly generated many small images, with the target at the center and with different backgrounds. The simulations generated images with vehicles and helicopters as targets, and corresponding images without targets. Using parallel computing, 120,000 training images were generated in about an hour. Some preliminary results show an improvement in the deep learning algorithm when real image training data are augmented with the simulated images, especially when obtaining sufficient real data was particularly challenging.

Posted Content
TL;DR: The Dataset for Object Detection in Aerial Images (DOTA) as mentioned in this paper is a large-scale dataset of aerial images collected from different sensors and platforms and contains objects exhibiting a wide variety of scales, orientations, and shapes.
Abstract: Object detection is an important and challenging problem in computer vision. Although the past decade has witnessed major advances in object detection in natural scenes, such successes have been slow to transfer to aerial imagery, not only because of the huge variation in the scale, orientation and shape of the object instances on the earth's surface, but also due to the scarcity of well-annotated datasets of objects in aerial scenes. To advance object detection research in Earth Vision, also known as Earth Observation and Remote Sensing, we introduce a large-scale Dataset for Object deTection in Aerial images (DOTA). To this end, we collect 2806 aerial images from different sensors and platforms. Each image is of the size about 4000-by-4000 pixels and contains objects exhibiting a wide variety of scales, orientations, and shapes. These DOTA images are then annotated by experts in aerial image interpretation using 15 common object categories. The fully annotated DOTA images contain 188,282 instances, each of which is labeled by an arbitrary (8 d.o.f.) quadrilateral. To build a baseline for object detection in Earth Vision, we evaluate state-of-the-art object detection algorithms on DOTA. Experiments demonstrate that DOTA well represents real Earth Vision applications and is quite challenging.
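Each DOTA instance is an arbitrary quadrilateral (four corner points, 8 degrees of freedom). The small helper below, written for illustration rather than taken from the DOTA toolkit, collapses such a quadrilateral to the axis-aligned box a conventional detector would use, which shows exactly what the richer annotation preserves and the plain box discards.

```python
def quad_to_hbb(points):
    """Axis-aligned bounding box of an 8-d.o.f. quadrilateral annotation.

    points : iterable of four (x, y) corner coordinates.
    Returns (xmin, ymin, xmax, ymax). Illustrative helper, not part of
    the official DOTA tooling.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)
```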

Proceedings ArticleDOI
08 May 2017
TL;DR: A new deep learning architecture for power line detection is proposed that uses Histogram of Gradient features as input instead of the image itself to ensure accurate capture of line features, and is compared with a pre-trained GoogleNet model.
Abstract: The use of drones in infrastructure monitoring aims at decreasing the human effort and achieving consistency. Accurate aerial image analysis is the key building block to achieve this. Reliable detection and integrity checking of power line conductors in a diverse background is the most challenging task in drone-based automatic infrastructure monitoring. Most techniques in the literature use a first-principles approach that tries to represent the image as features of interest. This paper proposes a machine learning approach for power line detection. A new deep learning architecture is proposed with very good results and is compared with a pre-trained GoogleNet model. The proposed architecture uses Histogram of Gradient features as the input instead of the image itself to ensure accurate capture of line features. The system is tested on aerial images collected using a drone. A healthy F-score of 84.6% is obtained using the proposed architecture, as against 81% using the GoogleNet model.
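A rough sketch of the kind of input pipeline described above, computing HOG features for an aerial patch with scikit-image, is shown below; the file name and the HOG parameters are assumptions, not the paper's settings.

```python
from skimage import color, io
from skimage.feature import hog

# Hypothetical aerial patch; parameter values are common defaults, not the paper's.
patch = color.rgb2gray(io.imread("aerial_patch.png"))
features = hog(patch,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               feature_vector=True)
# `features` would then be fed to the detection network instead of raw pixels.
```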

Journal ArticleDOI
TL;DR: The proposed approach recovers accurate camera pose and (sparse) 3-D structure using bundle adjustment for sequential imagery (BA4S) and then stabilizes the video from the moving platform by analytically solving for the image-plane-to-ground-plane homography transformation.
Abstract: We describe a fast and efficient camera pose refinement and Structure from Motion (SfM) method for sequential aerial imagery with applications to georegistration and 3-D reconstruction. Inputs to the system are 2-D images combined with initial noisy camera metadata measurements, available from on-board sensors (e.g., camera, global positioning system, and inertial measurement unit). Georegistration is required to stabilize the ground-plane motion to separate camera-induced motion from object motion to support vehicle tracking in aerial imagery. In the proposed approach, we recover accurate camera pose and (sparse) 3-D structure using bundle adjustment for sequential imagery (BA4S) and then stabilize the video from the moving platform by analytically solving for the image-plane-to-ground-plane homography transformation. Using this approach, we avoid relying upon image-to-image registration, which requires estimating feature correspondences (i.e., matching) followed by warping between images (in a 2-D space) that is an error prone process for complex scenes with parallax, appearance, and illumination changes. Both our SfM (BA4S) and our analytical ground-plane georegistration method avoid the use of iterative consensus combinatorial methods like RANdom SAmple Consensus which is a core part of many published approaches. BA4S is very efficient for long sequential imagery and is more than 130 times faster than VisualSfM, 35 times faster than MavMap, and about 274 times faster than Pix4D. Various experimental results demonstrate the efficiency and robustness of the proposed pipeline for the refinement of camera parameters in sequential aerial imagery and georegistration.
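Once the image-plane-to-ground-plane homography has been solved for analytically, stabilizing a frame is a single perspective warp. Below is a minimal OpenCV sketch, assuming the 3x3 matrix H is already available; it only illustrates this final warping step of georegistration.

```python
import cv2
import numpy as np

def stabilize_to_ground_plane(frame, H, out_size=None):
    """Warp a video frame with the image-to-ground-plane homography H.

    H is assumed to come from the analytically solved camera pose; the
    output size defaults to the input frame size.
    """
    h, w = frame.shape[:2]
    out_size = out_size or (w, h)
    return cv2.warpPerspective(frame, np.asarray(H, dtype=np.float64), out_size)
```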

Journal ArticleDOI
TL;DR: A cost-effective network extension scheme is proposed that addresses the problem of class imbalance in joint vehicle localization and categorization in high-resolution aerial images and shows equivalent or better performance while requiring the least parameter and memory overhead.
Abstract: Joint vehicle localization and categorization in high-resolution aerial images can provide useful information for applications such as traffic flow structure analysis. To maintain sufficient features to recognize small-scale vehicles, a regions-with-convolutional-neural-network-features (R-CNN)-like detection structure is employed. In this setting, cascaded localization error can be averted by treating the negatives and the differently typed positives equally as a multi-class classification task, but the problem of class imbalance remains. To address this issue, a cost-effective network extension scheme is proposed. In it, the correlated convolution and connection costs during extension are reduced by feature-map selection and bipartite main-side network construction, which are realized with the assistance of a novel feature-map class-importance measurement and a new class-imbalance sensitive main-side loss function. Using an image classification dataset established from a set of traditional real-colored aerial images with 0.13 m ground sampling distance, taken from a height of 1000 m by an imaging system composed of non-metric cameras, the effectiveness of the proposed network extension is verified by comparing it with similarly shaped strong counterparts. Experiments show equivalent or better performance while requiring the least parameter and memory overhead.
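The exact form of the class-imbalance sensitive main-side loss is not reproduced in the abstract; the sketch below shows a generic inverse-frequency-weighted cross-entropy in PyTorch as an illustrative stand-in for that idea.

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, targets, class_counts):
    """Cross-entropy with inverse-frequency class weights.

    class_counts : 1-D tensor of training samples per class.
    A generic remedy for class imbalance, shown only as a stand-in for
    the paper's class-imbalance sensitive main-side loss.
    """
    counts = class_counts.float()
    weights = counts.sum() / (counts.numel() * counts)
    return F.cross_entropy(logits, targets, weight=weights)
```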

Patent
20 Jun 2017
TL;DR: In this article, an aerial image-based method for detecting insulator deficiency in a power transmission line is proposed, in which the Faster R-CNN detection framework, comprising the Fast R-CNN detection network and the region proposal network (RPN), is employed.
Abstract: The invention relates to an aerial image-based method for detecting insulator deficiency in a power transmission line. The method comprises the steps of: 1) constructing a database; 2) training a detection module using the Faster R-CNN framework, which comprises the Fast R-CNN detection network and the region proposal network (RPN); 3) constructing the detection model by cross-training and fine-tuning, in which the network is retrained with a cross-training method and the Fast R-CNN detection network and the RPN are combined into an end-to-end convolutional neural network to form the deficient-insulator detection model; and 4) applying the detection model to the image to be examined to obtain candidate boxes for insulator targets, adjusting the threshold of the candidate boxes, and performing non-maximum suppression on the candidate boxes to obtain the final candidate boxes.