
Showing papers on "Aerial image published in 2019"


Proceedings ArticleDOI
01 Oct 2019
TL;DR: Zhang et al. as discussed by the authors proposed a cluster proposal sub-network (CPNet), a scale estimation sub-network (ScaleNet), and a dedicated detection network (DetecNet) to detect small objects in aerial images.
Abstract: Detecting objects in aerial images is challenging for at least two reasons: (1) target objects like pedestrians are very small in pixels, making them hardly distinguished from surrounding background; and (2) targets are in general sparsely and non-uniformly distributed, making the detection very inefficient. In this paper, we address both issues inspired by observing that these targets are often clustered. In particular, we propose a Clustered Detection (ClusDet) network that unifies object clustering and detection in an end-to-end framework. The key components in ClusDet include a cluster proposal sub-network (CPNet), a scale estimation sub-network (ScaleNet), and a dedicated detection network (DetecNet). Given an input image, CPNet produces object cluster regions and ScaleNet estimates object scales for these regions. Then, each scale-normalized cluster region is fed into DetecNet for object detection. ClusDet has several advantages over previous solutions: (1) it greatly reduces the number of chips for final object detection and hence achieves high running time efficiency, (2) the cluster-based scale estimation is more accurate than previously used single-object based ones, hence effectively improves the detection for small objects, and (3) the final DetecNet is dedicated for clustered regions and implicitly models the prior context information so as to boost detection accuracy. The proposed method is tested on three popular aerial image datasets including VisDrone, UAVDT and DOTA. In all experiments, ClusDet achieves promising performance in comparison with state-of-the-art detectors.

161 citations
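
The ClusDet pipeline above lends itself to a compact sketch. The following Python snippet illustrates, with NumPy only, how detections made on a scale-normalized cluster chip can be mapped back to full-image coordinates; the cluster boxes and per-cluster object sizes are assumed to come from CPNet and ScaleNet, and detect_chip is a stand-in for DetecNet, not the authors' implementation.

```python
# Hedged sketch of the ClusDet-style inference flow: cluster regions and scale
# estimates are assumed given; `detect_chip` is a stub, not the real DetecNet.
import numpy as np

def normalize_chip(image, cluster_box, est_obj_size, target_obj_size=32):
    """Crop a cluster region and rescale it so objects reach a canonical size."""
    x0, y0, x1, y1 = cluster_box
    chip = image[y0:y1, x0:x1]
    factor = target_obj_size / max(est_obj_size, 1e-6)
    new_h = max(1, int(round(chip.shape[0] * factor)))
    new_w = max(1, int(round(chip.shape[1] * factor)))
    # nearest-neighbour resize with NumPy only, to keep the sketch dependency-free
    rows = (np.arange(new_h) / factor).astype(int).clip(0, chip.shape[0] - 1)
    cols = (np.arange(new_w) / factor).astype(int).clip(0, chip.shape[1] - 1)
    return chip[rows][:, cols], factor

def detect_chip(chip):
    """Stand-in detector: returns boxes in chip coordinates (x0, y0, x1, y1, score)."""
    return np.array([[4.0, 4.0, 20.0, 20.0, 0.9]])

def detect_clusters(image, cluster_boxes, est_sizes):
    """Run the (stub) detector on each scale-normalized chip and map boxes back."""
    all_boxes = []
    for box, size in zip(cluster_boxes, est_sizes):
        chip, f = normalize_chip(image, box, size)
        dets = detect_chip(chip)
        dets[:, :4] /= f               # undo the scale normalization
        dets[:, [0, 2]] += box[0]      # shift back to global x
        dets[:, [1, 3]] += box[1]      # shift back to global y
        all_boxes.append(dets)
    return np.vstack(all_boxes) if all_boxes else np.empty((0, 5))

if __name__ == "__main__":
    img = np.zeros((1000, 1000, 3), dtype=np.uint8)
    boxes = detect_clusters(img, [(100, 100, 400, 400)], est_sizes=[8.0])
    print(boxes)  # detections in full-image coordinates
```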


Journal ArticleDOI
Mengya Zhang1, Guangluan Xu1, Keming Chen1, Menglong Yan1, Xian Sun1 
TL;DR: This letter presents a novel supervised change detection method based on a deep siamese semantic network framework, trained with an improved triplet loss function for optical aerial images, which produces comparable or even better results than the state-of-the-art methods in terms of F-measure.
Abstract: This letter presents a novel supervised change detection method based on a deep siamese semantic network framework, which is trained using an improved triplet loss function for optical aerial images. The proposed framework can not only extract features directly from image pairs, which include multiscale information and are more abstract and robust, but also enhance the interclass separability and the intraclass inseparability by learning semantic relations. The feature vectors of pixel pairs with the same label are closer, while the feature vectors of pixels with different labels are farther from each other. Moreover, we use the distance of the feature maps to detect the changes on the difference map between the image pair. A binarized change map can be obtained by a simple threshold. Experiments on an optical aerial image data set validate that the proposed approach produces comparable or even better results than the state-of-the-art methods in terms of F-measure.

144 citations
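
As a quick illustration of the distance-and-threshold step described in the abstract above, the hedged sketch below computes a per-pixel Euclidean distance between two feature maps and binarizes it; the feature maps are random placeholders for the learned siamese embeddings, and the threshold value is illustrative.

```python
# Minimal sketch: per-pixel feature distance followed by a simple threshold.
import numpy as np

def change_map(feat_a: np.ndarray, feat_b: np.ndarray, threshold: float) -> np.ndarray:
    """feat_a, feat_b: (H, W, C) feature maps; returns a binary (H, W) change map."""
    dist = np.linalg.norm(feat_a - feat_b, axis=-1)   # per-pixel Euclidean distance
    return (dist > threshold).astype(np.uint8)

# toy usage with random features standing in for the learned embeddings
rng = np.random.default_rng(0)
fa, fb = rng.normal(size=(64, 64, 16)), rng.normal(size=(64, 64, 16))
cm = change_map(fa, fb, threshold=5.0)
print(cm.sum(), "changed pixels")
```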


Journal ArticleDOI
TL;DR: A novel fully convolutional network (FCN), in which a spatial residual inception (SRI) module is proposed to capture and aggregate multi-scale contexts for semantic understanding by successively fusing multi-level features, shows promising potential for building detection from remote sensing images on a large scale.
Abstract: The rapid development in deep learning and computer vision has introduced new opportunities and paradigms for building extraction from remote sensing images. In this paper, we propose a novel fully convolutional network (FCN), in which a spatial residual inception (SRI) module is proposed to capture and aggregate multi-scale contexts for semantic understanding by successively fusing multi-level features. The proposed SRI-Net is capable of accurately detecting large buildings that might be easily omitted while retaining global morphological characteristics and local details. On the other hand, to improve computational efficiency, depthwise separable convolutions and convolution factorization are introduced to significantly decrease the number of model parameters. The proposed model is evaluated on the Inria Aerial Image Labeling Dataset and the Wuhan University (WHU) Aerial Building Dataset. The experimental results show that the proposed methods exhibit significant improvements compared with several state-of-the-art FCNs, including SegNet, U-Net, RefineNet, and DeepLab v3+. The proposed model shows promising potential for building detection from remote sensing images on a large scale.

134 citations


Proceedings ArticleDOI
15 Jun 2019
TL;DR: OriCNN as discussed by the authors proposes a Siamese network to explicitly encode the orientation (i.e., spherical directions) of each pixel of the images, which significantly boosts the discriminative power of the learned deep features, leading to much higher recall and precision outperforming all previous methods.
Abstract: This paper studies the image-based geo-localization (IBL) problem using ground-to-aerial cross-view matching. The goal is to predict the spatial location of a ground-level query image by matching it to a large geotagged aerial image database (e.g., satellite imagery). This is a challenging task due to the drastic differences in their viewpoints and visual appearances. Existing deep learning methods for this problem have focused on maximizing feature similarity between spatially close-by image pairs, while minimizing it for image pairs that are far apart. They do so by deep feature embedding based on visual appearance in those ground-and-aerial images. However, in everyday life, humans commonly use orientation information as an important cue for the task of spatial localization. Inspired by this insight, this paper proposes a novel method which endows deep neural networks with the `commonsense' of orientation. Given a ground-level spherical panoramic image as query input (and a large georeferenced satellite image database), we design a Siamese network which explicitly encodes the orientation (i.e., spherical directions) of each pixel of the images. Our method significantly boosts the discriminative power of the learned deep features, leading to much higher recall and precision, outperforming all previous methods. Our network is also more compact, using only one fifth the number of parameters of the previously best-performing network. To evaluate the generalization of our method, we also created a large-scale cross-view localization benchmark containing 100K geotagged ground-aerial pairs covering a city. Our codes and datasets are available at https://github.com/Liumouliu/OriCNN.

124 citations
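
A minimal sketch of the per-pixel orientation idea, assuming an equirectangular ground panorama: azimuth and elevation maps are appended to the image as extra channels before feeding the network. The exact encoding used by OriCNN may differ; this only conveys the notion that every pixel carries its spherical direction.

```python
# Hedged sketch: append per-pixel direction maps as extra input channels.
import numpy as np

def panorama_orientation_maps(h: int, w: int) -> np.ndarray:
    """Return (h, w, 2) azimuth/elevation maps for an equirectangular panorama."""
    azimuth = np.linspace(-np.pi, np.pi, w, endpoint=False)   # left to right
    elevation = np.linspace(np.pi / 2, -np.pi / 2, h)         # top to bottom
    az_map = np.broadcast_to(azimuth, (h, w))
    el_map = np.broadcast_to(elevation[:, None], (h, w))
    return np.stack([az_map, el_map], axis=-1).astype(np.float32)

def with_orientation(image: np.ndarray) -> np.ndarray:
    """Concatenate orientation channels to an (H, W, 3) image -> (H, W, 5)."""
    ori = panorama_orientation_maps(image.shape[0], image.shape[1])
    return np.concatenate([image.astype(np.float32), ori], axis=-1)

print(with_orientation(np.zeros((128, 512, 3))).shape)  # (128, 512, 5)
```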


Proceedings Article
28 Aug 2019
TL;DR: This work introduces the first benchmark dataset for instance segmentation in aerial imagery that combines instance-level object detection and pixel-level segmentation tasks, and introduces a large-scale and densely annotated Instance Segmentation in Aerial Images Dataset (iSAID).
Abstract: Existing Earth Vision datasets are either suitable for semantic segmentation or object detection. In this work, we introduce the first benchmark dataset for instance segmentation in aerial imagery that combines instance-level object detection and pixel-level segmentation tasks. In comparison to instance segmentation in natural scenes, aerial images present unique challenges, e.g., a huge number of instances per image, large object-scale variations and abundant tiny objects. Our large-scale and densely annotated Instance Segmentation in Aerial Images Dataset (iSAID) comes with 655,451 object instances for 15 categories across 2,806 high-resolution images. Such precise per-pixel annotations for each instance ensure accurate localization that is essential for detailed scene analysis. Compared to existing small-scale aerial image based instance segmentation datasets, iSAID contains 15× the number of object categories and 5× the number of instances. We benchmark our dataset using two popular instance segmentation approaches for natural images, namely Mask R-CNN and PANet. In our experiments we show that direct application of off-the-shelf Mask R-CNN and PANet on aerial images provides suboptimal instance segmentation results, thus requiring specialized solutions from the research community. The dataset is publicly available at: this https URL

123 citations


Journal ArticleDOI
TL;DR: A novel end-to-end network, namely class-wise attention-based convolutional and bidirectional LSTM network (CA-Conv-BiLSTM), for aerial image multi-label classification is proposed, which models the underlying class dependency in both directions and produces structured multiple object labels.
Abstract: Aerial image classification is of great significance in the remote sensing community, and much research has been conducted over the past few years. Among these studies, most focus on categorizing an image into one semantic label, while in the real world, an aerial image is often associated with multiple labels, e.g., multiple object-level labels in our case. Besides, a comprehensive picture of the objects present in a given high-resolution aerial image can provide a more in-depth understanding of the studied region. For these reasons, aerial image multi-label classification has been attracting increasing attention. However, one common limitation shared by existing methods in the community is that the co-occurrence relationship of various classes, the so-called class dependency, is underexplored and leads to suboptimal decisions. In this paper, we propose a novel end-to-end network, namely the class-wise attention-based convolutional and bidirectional LSTM network (CA-Conv-BiLSTM), for this task. The proposed network consists of three indispensable components: (1) a feature extraction module, (2) a class attention learning layer, and (3) a bidirectional LSTM-based sub-network. Particularly, the feature extraction module is designed for extracting fine-grained semantic feature maps, while the class attention learning layer aims at capturing discriminative class-specific features. As the most important part, the bidirectional LSTM-based sub-network models the underlying class dependency in both directions and produces structured multiple object labels. Experimental results on the UCM multi-label dataset and the DFC15 multi-label dataset validate the effectiveness of our model quantitatively and qualitatively.

121 citations
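
A hedged PyTorch sketch of the three components described above: class-wise attention maps pool the feature map into one feature vector per class, and a bidirectional LSTM scans the resulting class sequence to model class dependency. The layer sizes, the 1x1-convolution attention, and the per-class linear classifier are illustrative choices, not the authors' exact architecture.

```python
# Hedged sketch of a class-wise attention + BiLSTM head for multi-label output.
import torch
import torch.nn as nn

class ClasswiseAttentionBiLSTM(nn.Module):
    def __init__(self, in_channels: int, num_classes: int, hidden: int = 128):
        super().__init__()
        # one attention map per class over the spatial feature map
        self.attn = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.lstm = nn.LSTM(in_channels, hidden, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden, 1)

    def forward(self, feats):                                        # feats: (B, C, H, W)
        attn = torch.softmax(self.attn(feats).flatten(2), dim=-1)    # (B, K, H*W)
        flat = feats.flatten(2)                                      # (B, C, H*W)
        class_feats = torch.bmm(attn, flat.transpose(1, 2))          # (B, K, C) class-specific features
        seq, _ = self.lstm(class_feats)                              # BiLSTM over the class sequence
        return self.cls(seq).squeeze(-1)                             # (B, K) multi-label logits

logits = ClasswiseAttentionBiLSTM(256, num_classes=17)(torch.randn(2, 256, 16, 16))
print(logits.shape)  # torch.Size([2, 17])
```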


Journal ArticleDOI
Kai Yue1, Lei Yang1, Ruirui Li1, Wei Hu1, Fan Zhang1, Wei Li1 
TL;DR: This paper proposes TreeUNet, a tool that uses an adaptive network to increase the classification rate at the pixel level and shows that the improvement brought by the adaptive Tree-CNN block is significant.
Abstract: Fine-grained semantic segmentation results are typically difficult to obtain for subdecimeter aerial imagery segmentation as a result of complex remote sensing content and optical conditions. Recently, convolutional neural networks (CNNs) have shown outstanding performance on this task. Although many deep neural network structures and techniques have been applied to improve accuracy, few have attended to improving the differentiation of easily confused classes. In this paper, we propose TreeUNet, a tool that uses an adaptive network to increase the classification rate at the pixel level. Specifically, based on a deep semantic model infrastructure, a Tree-CNN block in which each node represents a ResNeXt unit is constructed adaptively in accordance with the confusion matrix and the proposed TreeCutting algorithm. By transmitting feature maps through concatenating connections, the Tree-CNN block fuses multiscale features and learns best weights for the model. In experiments on the ISPRS two-dimensional Vaihingen and Potsdam semantic labelling datasets, the results obtained by TreeUNet are competitive among published state-of-the-art methods. Detailed comparison and analysis show that the improvement brought by the adaptive Tree-CNN block is significant.

116 citations


Journal ArticleDOI
TL;DR: A DHA quality evaluation method is proposed by integrating some dehazing-relevant features, including image structure recovering, color rendition, and over-enhancement of low-contrast areas, which works for both types of images, but is further improved for aerial images by incorporating its specific characteristics.
Abstract: To enhance the visibility and usability of images captured in hazy conditions, many image dehazing algorithms (DHAs) have been proposed. With so many image DHAs, there is a need to evaluate and compare these DHAs. Due to the lack of the reference haze-free images, DHAs are generally evaluated qualitatively using real hazy images. But it is possible to perform quantitative evaluation using synthetic hazy images since the reference haze-free images are available and full-reference (FR) image quality assessment (IQA) measures can be utilized. In this paper, we follow this strategy and study DHA evaluation using synthetic hazy images systematically. We first build a synthetic haze removing quality (SHRQ) database. It consists of two subsets: regular and aerial image subsets, which include 360 and 240 dehazed images created from 45 and 30 synthetic hazy images using 8 DHAs, respectively. Since aerial imaging is an important application area of dehazing, we create an aerial image subset specifically. We then carry out subjective quality evaluation study on these two subsets. We observe that taking DHA evaluation as an exact FR IQA process is questionable, and the state-of-the-art FR IQA measures are not effective for DHA evaluation. Thus, we propose a DHA quality evaluation method by integrating some dehazing-relevant features, including image structure recovering, color rendition, and over-enhancement of low-contrast areas. The proposed method works for both types of images, but we further improve it for aerial images by incorporating its specific characteristics. Experimental results on two subsets of the SHRQ database validate the effectiveness of the proposed measures.

111 citations


Journal ArticleDOI
TL;DR: A deep scene representation to achieve the invariance of CNN features and further enhance the discriminative power is proposed and, even with a simple linear classifier, can achieve the state-of-the-art performance.
Abstract: As a fundamental problem in earth observation, aerial scene classification tries to assign a specific semantic label to an aerial image. In recent years, deep convolutional neural networks (CNNs) have shown advanced performance in aerial scene classification. Successful pretrained CNNs can be transferred to aerial images. However, global CNN activations may lack geometric invariance and, therefore, limit the improvement of aerial scene classification. To address this problem, this paper proposes a deep scene representation to achieve the invariance of CNN features and further enhance the discriminative power. The proposed method: 1) extracts CNN activations from the last convolutional layer of a pretrained CNN; 2) performs multiscale pooling (MSP) on these activations; and 3) builds a holistic representation by the Fisher vector method. MSP is a simple and effective multiscale strategy, which enriches multiscale spatial information in affordable computational time. The proposed representation is particularly suited to aerial scenes and consistently outperforms global CNN activations without requiring feature adaptation. Extensive experiments on five aerial scene data sets indicate that the proposed method, even with a simple linear classifier, can achieve state-of-the-art performance.

101 citations
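
The multiscale pooling (MSP) step can be sketched in a few lines; the grid sizes below are assumptions for illustration, and the subsequent Fisher-vector encoding and linear classifier are not reproduced.

```python
# Hedged sketch of multiscale pooling over the last-layer CNN activations.
import numpy as np

def multiscale_pool(activations: np.ndarray, grid_sizes=(1, 2, 4)) -> np.ndarray:
    """activations: (H, W, C). Returns (N, C) local descriptors, one per grid cell."""
    h, w, _ = activations.shape
    descriptors = []
    for g in grid_sizes:
        ys = np.linspace(0, h, g + 1).astype(int)
        xs = np.linspace(0, w, g + 1).astype(int)
        for i in range(g):
            for j in range(g):
                cell = activations[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                descriptors.append(cell.max(axis=(0, 1)))   # max-pool each grid cell
    return np.stack(descriptors)

print(multiscale_pool(np.random.rand(14, 14, 512)).shape)  # (21, 512) for grids 1, 2, 4
```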


Journal ArticleDOI
TL;DR: A generative adversarial network with spatial and channel attention mechanisms (GAN-SCA) is proposed for the robust segmentation of buildings in remote sensing images and outperforms several state-of-the-art approaches.
Abstract: Segmentation of high-resolution remote sensing images is an important challenge with wide practical applications. The increasing spatial resolution provides fine details for image segmentation but also incurs segmentation ambiguities. In this paper, we propose a generative adversarial network with spatial and channel attention mechanisms (GAN-SCA) for the robust segmentation of buildings in remote sensing images. The segmentation network (generator) of the proposed framework is composed of the well-known semantic segmentation architecture (U-Net) and the spatial and channel attention mechanisms (SCA). The adoption of SCA enables the segmentation network to selectively enhance more useful features in specific positions and channels and enables improved results closer to the ground truth. The discriminator is an adversarial network with channel attention mechanisms that can properly discriminate the outputs of the generator and the ground truth maps. The segmentation network and adversarial network are trained in an alternating fashion on the Inria aerial image labeling dataset and Massachusetts buildings dataset. Experimental results show that the proposed GAN-SCA achieves a higher score (the overall accuracy and intersection over the union of Inria aerial image labeling dataset are 96.61% and 77.75%, respectively, and the F1-measure of the Massachusetts buildings dataset is 96.36%) and outperforms several state-of-the-art approaches.

96 citations


Journal ArticleDOI
TL;DR: In this article, an architecture based on a deep convolutional neural network (CNN) is proposed in order to estimate the height values from a single aerial image, which is an ambiguous and ill-posed problem.
Abstract: Extracting 3D information from aerial images is an important and still challenging topic in photogrammetry and remote sensing. Height estimation from only a single aerial image is an ambiguous and ill-posed problem. To address this challenging problem, in this paper, an architecture based on a deep convolutional neural network (CNN) is proposed in order to estimate the height values from a single aerial image. Methodologies for data preprocessing, selection of training data as well as data augmentation are presented. Subsequently, a deep CNN architecture is proposed consisting of encoding and decoding steps. In the encoding part, a deep residual learning is employed for extracting the local and global features. An up-sampling approach is proposed in the decoding part for increasing the output resolution and skip connections are employed in each scale to modify the estimated height values at the object boundaries. Finally, a post-processing approach is proposed to merge the predicted height image patches and generate a seamless continuous height map. The quantitative evaluation of the proposed approaches on the ISPRS datasets indicates relative and root mean square errors of approximately 0.9 m and 3.2 m, respectively.
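
A minimal sketch of the final patch-merging step, assuming the per-patch height predictions and their image offsets are already available: overlapping predictions are accumulated and averaged to form a seamless height map. The paper's exact blending scheme may differ; simple averaging is one possible choice.

```python
# Hedged sketch: merge overlapping predicted height patches by averaging.
import numpy as np

def merge_patches(patches, positions, out_shape):
    """patches: list of (h, w) height predictions; positions: list of (row, col) offsets."""
    acc = np.zeros(out_shape, dtype=np.float64)
    cnt = np.zeros(out_shape, dtype=np.float64)
    for patch, (r, c) in zip(patches, positions):
        h, w = patch.shape
        acc[r:r + h, c:c + w] += patch
        cnt[r:r + h, c:c + w] += 1.0
    return acc / np.maximum(cnt, 1.0)                 # average where patches overlap

tiles = [np.full((64, 64), 10.0), np.full((64, 64), 12.0)]
print(merge_patches(tiles, [(0, 0), (0, 32)], (64, 96))[0, 40])  # 11.0 in the overlap
```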

Proceedings ArticleDOI
01 Oct 2019
TL;DR: This paper proposes a novel in-batch reweighting triplet loss to emphasize the positive effect of hard exemplars during end-to-end training and integrates an attention mechanism into the model using feature-level contextual information.
Abstract: The task of ground-to-aerial image geo-localization can be achieved by matching a ground view query image to a reference database of aerial/satellite images. It is highly challenging due to the dramatic viewpoint changes and unknown orientations. In this paper, we propose a novel in-batch reweighting triplet loss to emphasize the positive effect of hard exemplars during end-to-end training. We also integrate an attention mechanism into our model using feature-level contextual information. To analyze the difficulty level of each triplet, we first enforce a modified logistic regression to triplets with a distance rectifying factor. Then, the reference negative distances for corresponding anchors are set, and the relative weights of triplets are computed by comparing their difficulty to the corresponding references. To reduce the influence of extreme hard data and less useful simple exemplars, the final weights are pruned using upper and lower bound constraints. Experiments on two benchmark datasets show that the proposed approach significantly outperforms the state-of-the-art methods.
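
The in-batch reweighting idea can be sketched in PyTorch as below: each triplet's weight grows with its difficulty (a logistic function of the positive-minus-negative distance gap) and is clipped by upper and lower bounds before scaling a standard triplet hinge. The sigmoid form, the normalization by the batch mean, and all constants are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a difficulty-reweighted triplet loss with pruned weights.
import torch

def reweighted_triplet_loss(d_pos, d_neg, margin=1.0, alpha=5.0, w_min=0.2, w_max=2.0):
    """d_pos, d_neg: (N,) anchor-positive / anchor-negative distances for a batch of triplets."""
    difficulty = torch.sigmoid(alpha * (d_pos - d_neg))              # harder triplets -> larger value
    weights = (difficulty / difficulty.mean()).clamp(w_min, w_max)   # relative weights, pruned by bounds
    per_triplet = torch.relu(d_pos - d_neg + margin)                 # standard triplet hinge
    return (weights.detach() * per_triplet).mean()

loss = reweighted_triplet_loss(torch.rand(8), torch.rand(8))
print(loss.item())
```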

Journal ArticleDOI
TL;DR: A light-weight deep learning model integrating spatial pyramid pooling with an encoder-decoder structure that has the potential to deliver automatic building segmentation from high-resolution remote sensing images at an accuracy that makes it a useful tool for practical application scenarios is proposed.
Abstract: Automatic extraction of buildings from remote sensing imagery plays a significant role in many applications, such as urban planning and monitoring changes to land cover. Various building segmentation methods have been proposed for visible remote sensing images, especially state-of-the-art methods based on convolutional neural networks (CNNs). However, high-accuracy building segmentation from high-resolution remote sensing imagery is still a challenging task due to the potentially complex texture of buildings in general and image background. Repeated pooling and striding operations used in CNNs reduce feature resolution causing a loss of detailed information. To address this issue, we propose a light-weight deep learning model integrating spatial pyramid pooling with an encoder-decoder structure. The proposed model takes advantage of a spatial pyramid pooling module to capture and aggregate multi-scale contextual information and of the ability of encoder-decoder networks to restore losses of information. The proposed model is evaluated on two publicly available datasets; the Massachusetts roads and buildings dataset and the INRIA Aerial Image Labeling Dataset. The experimental results on these datasets show qualitative and quantitative improvement against established image segmentation models, including SegNet, FCN, U-Net, Tiramisu, and FRRN. For instance, compared to the standard U-Net, the overall accuracy gain is 1.0% (0.913 vs. 0.904) and 3.6% (0.909 vs. 0.877) with a maximal increase of 3.6% in model-training time on these two datasets. These results demonstrate that the proposed model has the potential to deliver automatic building segmentation from high-resolution remote sensing images at an accuracy that makes it a useful tool for practical application scenarios.
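
A hedged PyTorch sketch of a spatial pyramid pooling block of the kind described above: the encoder feature map is average-pooled at several grid sizes, projected, upsampled, and concatenated back. Bin sizes and channel counts are illustrative, not the paper's configuration.

```python
# Hedged sketch of a spatial pyramid pooling module for segmentation encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pool the encoder features at several grid sizes, project, upsample, concatenate."""
    def __init__(self, in_ch: int, out_ch: int, bins=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, out_ch, 1)) for b in bins
        ])
        self.fuse = nn.Conv2d(in_ch + out_ch * len(bins), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        pyramid = [F.interpolate(branch(x), size=(h, w), mode="bilinear", align_corners=False)
                   for branch in self.branches]
        return self.fuse(torch.cat([x] + pyramid, dim=1))

out = SpatialPyramidPooling(256, 64)(torch.randn(1, 256, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```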

Proceedings Article
01 Jan 2019
TL;DR: A new deep network is developed to explicitly address inherent differences between ground and aerial views, and introduces a feature aggregation strategy via learning multiple spatial embeddings to improve the robustness of feature representation.
Abstract: In this paper, we develop a new deep network to explicitly address the inherent differences between ground and aerial views. We observe that there exist some approximate domain correspondences between ground and aerial images. Specifically, pixels lying on the same azimuth direction in an aerial image approximately correspond to a vertical image column in the ground view image. Thus, we propose a two-step approach to exploit this prior knowledge. The first step is to apply a regular polar transform to warp an aerial image such that its domain is closer to that of a ground-view panorama. Note that the polar transform, as a pure geometric transformation, is agnostic to scene content and hence cannot bring the two domains into full alignment. Then, we add a subsequent spatial-attention mechanism which further brings corresponding deep features closer in the embedding space. To improve the robustness of the feature representation, we introduce a feature aggregation strategy via learning multiple spatial embeddings. With the above two-step approach, we achieve more discriminative deep representations, making cross-view geo-localization more accurate. Our experiments on standard benchmark datasets show a significant performance boost, achieving more than double the recall rate compared with the previous state of the art.
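
The polar transform is a purely geometric warp and can be sketched directly; the snippet below resamples a square aerial image around its centre so that azimuth maps to the horizontal axis, roughly mimicking a panorama layout. The output size and the radial ordering are assumptions for illustration.

```python
# Hedged sketch of a polar transform of an aerial image around its centre.
import numpy as np

def polar_transform(aerial: np.ndarray, out_h: int = 128, out_w: int = 512) -> np.ndarray:
    """aerial: (S, S, C) image; returns an (out_h, out_w, C) polar resampling."""
    s = aerial.shape[0]
    cy = cx = (s - 1) / 2.0
    max_r = s / 2.0
    rows = np.arange(out_h)[:, None]            # radial index
    cols = np.arange(out_w)[None, :]            # azimuth index
    theta = 2.0 * np.pi * cols / out_w
    r = max_r * (out_h - rows) / out_h          # far range at the top, like a horizon
    ys = np.clip((cy - r * np.cos(theta)).astype(int), 0, s - 1)
    xs = np.clip((cx + r * np.sin(theta)).astype(int), 0, s - 1)
    return aerial[ys, xs]

print(polar_transform(np.zeros((256, 256, 3))).shape)  # (128, 512, 3)
```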

Journal ArticleDOI
TL;DR: It is concluded that multiple different optical, topographical, and vegetation height datasets should be used when mapping vegetation in spatially heterogeneous landscapes, and that sub-meter resolution data (e.g. UAV or aerial) are necessary for the most accurate maps.

Journal ArticleDOI
TL;DR: This article addresses the question of mapping building functions jointly using both aerial and street view images via deep learning techniques through a decision-level fusion of a diverse ensemble of models trained from each image type independently, and chooses a highly compact classification scheme with four classes.
Abstract: This article addresses the question of mapping building functions jointly using both aerial and street view images via deep learning techniques. One of the central challenges here is determining a data fusion strategy that can cope with heterogeneous image modalities. We demonstrate that geometric combinations of the features of these two types of images, especially in an early stage of the convolutional layers, often lead to a destructive effect due to the spatial misalignment of the features. Therefore, we address this problem through a decision-level fusion of a diverse ensemble of models trained from each image type independently. In this way, the significant differences in appearance of aerial and street view images are taken into account. Compared to the common multi-stream end-to-end fusion approaches proposed in the literature, we are able to increase the precision scores from 68% to 76%. Another challenge is that sophisticated classification schemes needed for real applications are highly overlapping and not very well defined, without sharp boundaries. As a consequence, classification using machine learning becomes significantly harder. In this work, we choose a highly compact classification scheme with four classes: commercial, residential, public, and industrial, because such a classification has very high value for urban geography, being correlated with socio-demographic parameters such as population density and income.
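
Decision-level fusion of this kind reduces to combining per-model class probabilities; the sketch below averages the outputs of an aerial-view and a street-view classifier over the four classes named in the abstract. The equal weighting is an assumption, not the ensemble scheme used in the paper.

```python
# Minimal sketch of decision-level fusion of two independently trained models.
import numpy as np

CLASSES = ["commercial", "residential", "public", "industrial"]

def fuse_decisions(prob_aerial: np.ndarray, prob_street: np.ndarray, w_aerial=0.5) -> str:
    fused = w_aerial * prob_aerial + (1.0 - w_aerial) * prob_street
    return CLASSES[int(np.argmax(fused))]

print(fuse_decisions(np.array([0.1, 0.6, 0.2, 0.1]), np.array([0.3, 0.3, 0.2, 0.2])))  # residential
```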

Proceedings ArticleDOI
01 Oct 2019
TL;DR: In this paper, a joint feature learning approach is proposed to synthesize an aerial representation of a ground-level panorama query and use it to minimize the domain gap between the two views.
Abstract: The visual entities in cross-view (e.g. ground and aerial) images exhibit drastic domain changes due to the differences in viewpoints each set of images is captured from. Existing state-of-the-art methods address the problem by learning view-invariant image descriptors. We propose a novel method for solving this task by exploiting the generative powers of conditional GANs to synthesize an aerial representation of a ground-level panorama query and use it to minimize the domain gap between the two views. The synthesized image, being from the same view as the reference (target) image, helps the network to preserve important cues in aerial images following our Joint Feature Learning approach. We fuse the complementary features from a synthesized aerial image with the original ground-level panorama features to obtain a robust query representation. In addition, we employ multi-scale feature aggregation in order to preserve image representations at different scales, which is useful for solving this complex task. Experimental results show that our proposed approach performs significantly better than the state-of-the-art methods on the challenging CVUSA dataset in terms of top-1 and top-1% retrieval accuracies. Furthermore, we evaluate the generalization of the proposed method for urban landscapes on our newly collected cross-view localization dataset with geo-reference information.

Posted Content
TL;DR: This work synthesizes an aerial representation of a ground-level panorama query and uses it to minimize the domain gap between the two views, and fuses the complementary features from a synthesized aerial image with the original ground-level panorama features to obtain a robust query representation.
Abstract: The visual entities in cross-view images exhibit drastic domain changes due to the difference in viewpoints each set of images is captured from. Existing state-of-the-art methods address the problem by learning view-invariant descriptors for the images. We propose a novel method for solving this task by exploiting the generative powers of conditional GANs to synthesize an aerial representation of a ground level panorama and use it to minimize the domain gap between the two views. The synthesized image being from the same view as the target image helps the network to preserve important cues in aerial images following our Joint Feature Learning approach. Our Feature Fusion method combines the complementary features from a synthesized aerial image with the corresponding ground features to obtain a robust query representation. In addition, multi-scale feature aggregation preserves image representations at different feature scales useful for solving this complex task. Experimental results show that our proposed approach performs significantly better than the state-of-the-art methods on the challenging CVUSA dataset in terms of top-1 and top-1% retrieval accuracies. Furthermore, to evaluate the generalization of our method on urban landscapes, we collected a new cross-view localization dataset with geo-reference information.

Proceedings ArticleDOI
16 Jun 2019
TL;DR: A dedicated Aerial Image Database for Emergency Response (AIDER) applications is introduced and a lightweight convolutional neural network (CNN) architecture is developed, capable of running efficiently on an embedded platform, achieving roughly 3x higher performance than existing models with minimal memory requirements and less than a 2% accuracy drop compared to the state-of-the-art.
Abstract: Unmanned Aerial Vehicles (UAVs), equipped with camera sensors, can facilitate enhanced situational awareness for many emergency response and disaster management applications since they are capable of operating in remote and difficult-to-access areas. In addition, by utilizing an embedded platform and deep learning, UAVs can autonomously monitor a disaster-stricken area, analyze the images in real time and raise alerts in the presence of various calamities such as collapsed buildings, floods, or fire, in order to mitigate their effects on the environment and on the human population faster. To this end, this paper focuses on the automated aerial scene classification of disaster events from on board a UAV. Specifically, a dedicated Aerial Image Database for Emergency Response (AIDER) applications is introduced and a comparative analysis of existing approaches is performed. Through this analysis, a lightweight convolutional neural network (CNN) architecture is developed, capable of running efficiently on an embedded platform, achieving roughly 3x higher performance than existing models with minimal memory requirements and less than a 2% accuracy drop compared to the state-of-the-art. These preliminary results provide a solid basis for further experimentation towards real-time aerial image classification for emergency response applications using UAVs.

Journal ArticleDOI
TL;DR: The dense spatial pyramid pooling (DSPP) module is designed to extract dense and multi-scale features simultaneously, which facilitates the extraction of buildings at all scales and increases the prediction efficiency by two to four times.
Abstract: Automatic building extraction from high-resolution remote sensing images has many practical applications, such as urban planning and supervision. However, fine details and the various scales of building structures in high-resolution images bring new challenges to building extraction. An increasing number of neural network-based models have been proposed to handle these issues, but they are not efficient enough and still suffer from erroneous ground-truth labels. To this end, we propose an efficient end-to-end model, EU-Net, in this paper. We first design the dense spatial pyramid pooling (DSPP) module to extract dense and multi-scale features simultaneously, which facilitates the extraction of buildings at all scales. Then, the focal loss is used in reverse to suppress the impact of erroneous labels in the ground truth, making the training stage more stable. To assess the universality of the proposed model, we tested it on three public aerial remote sensing datasets: the WHU aerial imagery dataset, the Massachusetts buildings dataset, and the Inria aerial image labeling dataset. Experimental results show that the proposed EU-Net is superior to the state-of-the-art models on all three datasets and increases the prediction efficiency by two to four times.

Journal ArticleDOI
TL;DR: A novel fault detection method is proposed that can detect both single and multiple insulator faults in aerial images and is more effective and efficient than the state-of-the-art insulator fault detection methods.
Abstract: Insulator fault detection is an important task for high-voltage transmission line inspection. However, current methods often suffer from a lack of accuracy and robustness. Moreover, these methods can only detect one fault in an insulator string, but cannot detect multiple faults. In this paper, a novel method is proposed for detecting both single and multiple insulator faults in UAV-based aerial images, whose backgrounds usually contain much complex interference. The shapes of the insulators also vary considerably due to changes in filming angle and distance. To reduce the impact of complex interference on insulator fault detection, we make full use of a deep neural network to distinguish between insulators and background interference. First of all, plenty of insulator aerial images with manually labelled ground truth are collected to construct a standard insulator detection dataset ‘InST_detection’. Secondly, a new convolutional network is proposed to obtain accurate insulator string positions in aerial images. Finally, a novel fault detection method is proposed that can detect both single and multiple insulator faults in aerial images. Experimental results on a large number of aerial images show that our proposed method is more effective and efficient than the state-of-the-art insulator fault detection methods.

Journal ArticleDOI
TL;DR: A deep learning (DL)-based approach is proposed for the detection and reconstruction of buildings from a single aerial image and does not need any additional or auxiliary data and employs a single image to reconstruct the 3D models of buildings with the competitive precision.
Abstract: In this study, a deep learning (DL)-based approach is proposed for the detection and reconstruction of buildings from a single aerial image. The pre-required knowledge to reconstruct the 3D shapes of buildings, including the height data as well as the linear elements of individual roofs, is derived from the RGB image using an optimized multi-scale convolutional–deconvolutional network (MSCDN). The proposed network is composed of two feature extraction levels to first predict the coarse features, and then automatically refine them. The predicted features include the normalized digital surface models (nDSMs) and linear elements of roofs in three classes of eave, ridge, and hip lines. Then, the prismatic models of buildings are generated by analyzing the eave lines. The parametric models of individual roofs are also reconstructed using the predicted ridge and hip lines. The experiments show that, even in the presence of noises in height values, the proposed method performs well on 3D reconstruction of buildings with different shapes and complexities. The average root mean square error (RMSE) and normalized median absolute deviation (NMAD) metrics are about 3.43 m and 1.13 m, respectively for the predicted nDSM. Moreover, the quality of the extracted linear elements is about 91.31% and 83.69% for the Potsdam and Zeebrugge test data, respectively. Unlike the state-of-the-art methods, the proposed approach does not need any additional or auxiliary data and employs a single image to reconstruct the 3D models of buildings with the competitive precision of about 1.2 m and 0.8 m for the horizontal and vertical RMSEs over the Potsdam data and about 3.9 m and 2.4 m over the Zeebrugge test data.

Journal ArticleDOI
TL;DR: A novel road extraction method from aerial images based on an improved generative adversarial network, which is an end-to-end framework only requiring a few samples for training is presented.
Abstract: Aerial photographs and satellite images are among the resources used for earth observation. In practice, automated detection of roads in aerial images is of significant value for applications such as car navigation, law enforcement, and fire services. In this paper, we present a novel road extraction method for aerial images based on an improved generative adversarial network, which is an end-to-end framework requiring only a few samples for training. Experimental results on the Massachusetts Roads Dataset show that the proposed method provides better performance than several state-of-the-art techniques in terms of detection accuracy, recall, precision and F1-score.

Journal ArticleDOI
TL;DR: The proposed aerial count method will improve the accuracy of population estimates and will decrease the standard error of population estimates by 31% to 67%, and has the potential to outperform humans in detecting animals from the air when supplied with images taken at a fixed rate.
Abstract: Animal population sizes are often estimated using aerial sample counts by human observers, both for wildlife and livestock. The associated methods of counting remained more or less the same since the 1970s, but suffer from low precision and low accuracy of population estimates. Aerial counts using cost-efficient Unmanned Aerial Vehicles or microlight aircrafts with cameras and an automated animal detection algorithm can potentially improve this precision and accuracy. Therefore, we evaluated the performance of the multi-class convolutional neural network RetinaNet in detecting elephants, giraffes and zebras in aerial images from two Kenyan animal counts. The algorithm detected 95% of the number of elephants, 91% of giraffes and 90% of zebras that were found by four layers of human annotation, of which it correctly detected an extra 2.8% of elephants, 3.8% giraffes and 4.0% zebras that were missed by all humans, while detecting only 1.6 to 5.0 false positives per true positive. Furthermore, the animal detections by the algorithm were less sensitive to the sighting distance than humans were. With such a high recall and precision, we posit it is feasible to replace manual aerial animal count methods (from images and/or directly) by only the manual identification of image bounding boxes selected by the algorithm and then use a correction factor equal to the inverse of the undercounting bias in the calculation of the population estimates. This correction factor causes the standard error of the population estimate to increase slightly compared to a manual method, but this increase can be compensated for when the sampling effort would increase by 23%. However, an increase in sampling effort of 160% to 1,050% can be attained with the same expenses for equipment and personnel using our proposed semi-automatic method compared to a manual method. Therefore, we conclude that our proposed aerial count method will improve the accuracy of population estimates and will decrease the standard error of population estimates by 31% to 67%. Most importantly, this animal detection algorithm has the potential to outperform humans in detecting animals from the air when supplied with images taken at a fixed rate.
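
The correction-factor idea amounts to simple arithmetic: scale the algorithm's count by the inverse of its undercounting bias (its recall). The short example below reuses the recall values quoted above; the detection counts are hypothetical, and the survey sampling design is not modelled.

```python
# Worked example: population estimates corrected by the inverse of recall.
recall = {"elephant": 0.95, "giraffe": 0.91, "zebra": 0.90}   # from the abstract
detected = {"elephant": 380, "giraffe": 182, "zebra": 450}    # hypothetical detections

for species, n in detected.items():
    correction = 1.0 / recall[species]
    print(f"{species}: detected {n}, corrected estimate {n * correction:.0f}")
```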

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed an end-to-end multiple instance densely connected convolutional neural network (MIDCCNN) for aerial image scene classification, which is capable of preserving middle- and low-level convolutional features.
Abstract: With the development of deep learning, many state-of-the-art natural image scene classification methods have demonstrated impressive performance. While current convolutional neural networks tend to extract global features and global semantic information in a scene, geo-spatial objects can be located anywhere in an aerial image scene and their spatial arrangement tends to be more complicated. One possible solution is to preserve more local semantic information and enhance feature propagation. In this paper, an end-to-end multiple instance densely connected convolutional neural network (MIDCCNN) is proposed for aerial image scene classification. First, a 23-layer densely connected convolutional neural network (DCCNN) is built and serves as a backbone to extract convolutional features. It is capable of preserving middle- and low-level convolutional features. Then, an attention-based multiple instance pooling is proposed to highlight the local semantics in an aerial image scene. Finally, we minimize the loss between the bag-level predictions and the ground truth labels so that the whole framework can be trained directly. Experiments on three aerial image datasets demonstrate that our proposed method can outperform current baselines by a large margin.
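
Attention-based multiple instance pooling can be sketched compactly in PyTorch: each local (instance) feature receives a learned score, and a softmax-weighted sum produces the bag-level representation fed to the classifier. The two-layer scoring network and dimensions are illustrative, not the MIDCCNN specifics.

```python
# Hedged sketch of attention-based multiple instance pooling over local features.
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Weight instance features by a learned attention score and sum them into a bag feature."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, instances):                                # instances: (B, N, D) local features
        weights = torch.softmax(self.score(instances), dim=1)    # (B, N, 1) attention over instances
        return (weights * instances).sum(dim=1)                  # (B, D) bag representation

bag = AttentionMILPooling(512)(torch.randn(2, 49, 512))
print(bag.shape)  # torch.Size([2, 512])
```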

Book ChapterDOI
01 Jan 2019
TL;DR: The history of small-scale aerial (SSA) platforms for aerial image acquisition is examined as the context for the current growth in the popularity of UAVs, the associated technologies, and the range ofsmall-scale sensors currently available for environmental monitoring, mapping, and modeling applications.
Abstract: Both fixed-wing and multirotor platforms are increasingly being identified as potentially useful airborne platforms for a range of professional environmental remote-sensing applications, including both expensive high-end specialist solutions and lower-cost, off-the-shelf commercial options. Small UAVs are ideal for applications where aerial coverage requirements are small, flying experience and expertise are limited, and the operating budget is usually relatively small. With advances in battery technology, navigational controls, and payload capacities, many of the smaller UAVs are now capable of utilizing a number of different sensors to collect photographic data, video footage, and multispectral, thermal, and hyperspectral imagery as well as LiDAR. With the aid of specialist digital image processing and soft-copy photogrammetry software, aerial data and imagery can be processed into a number of different products including ortho-photos, mosaics, and digital elevation models (DEM), then analyzed to generate useful information for input to a GIS. This chapter begins by examining the history of small-scale aerial (SSA) platforms for aerial image acquisition as the context for the current growth in the popularity of UAVs, the associated technologies, and the range of small-scale sensors (e.g., GoPro Hero cameras) currently available for environmental monitoring, mapping, and modeling applications. This is taken together with image processing, analysis, and information extraction as well as soft-copy photogrammetry software such as Pix4D, AgiSoft, and AirPhotoSE. The past and present advantages and disadvantages of small aerial platforms are also considered, together with some of the emerging technologies and recent developments. Examples of monitoring and mapping macro-algal weedmats and coastal shorelines are used to illustrate the potential of today's small UAV platforms and the accompanying sensors to acquire low-cost airborne data and imagery.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: A novel multi-task model is proposed, which incorporates semantic edge detection and is better tuned for feature extraction from a wide range of scales, which achieves notable improvements over the baselines in region outlines and level of detail on both tasks.
Abstract: Understanding the complex urban infrastructure with centimeter-level accuracy is essential for many applications from autonomous driving to mapping, infrastructure monitoring, and urban management. Aerial images provide valuable information over a large area instantaneously; nevertheless, no current dataset captures the complexity of aerial scenes at the level of granularity required by real-world applications. To address this, we introduce SkyScapes, an aerial image dataset with highly-accurate, fine-grained annotations for pixel-level semantic labeling. SkyScapes provides annotations for 31 semantic categories ranging from large structures, such as buildings, roads and vegetation, to fine details, such as 12 (sub-)categories of lane markings. We have defined two main tasks on this dataset: dense semantic segmentation and multi-class lane-marking prediction. We carry out extensive experiments to evaluate state-of-the-art segmentation methods on SkyScapes. Existing methods struggle to deal with the wide range of classes, object sizes, scales, and fine details present. We therefore propose a novel multi-task model, which incorporates semantic edge detection and is better tuned for feature extraction from a wide range of scales. This model achieves notable improvements over the baselines in region outlines and level of detail on both tasks.

Journal ArticleDOI
TL;DR: By integrating the proposed modules into the baseline Fully Convolutional Network (FCN), the resulting local attention network (LANet) greatly improves the performance over the baseline and outperforms other attention based methods on two aerial image datasets.
Abstract: The trade-off between feature representation power and spatial localization accuracy is crucial for the dense classification/semantic segmentation of aerial images. High-level features extracted from the late layers of a neural network are rich in semantic information, yet have blurred spatial details; low-level features extracted from the early layers of a network contain more pixel-level information, but are isolated and noisy. It is therefore difficult to bridge the gap between high and low-level features due to their difference in terms of physical information content and spatial distribution. In this work, we contribute to solve this problem by enhancing the feature representation in two ways. On the one hand, a patch attention module (PAM) is proposed to enhance the embedding of context information based on a patch-wise calculation of local attention. On the other hand, an attention embedding module (AEM) is proposed to enrich the semantic information of low-level features by embedding local focus from high-level features. Both of the proposed modules are light-weight and can be applied to process the extracted features of convolutional neural networks (CNNs). Experiments show that, by integrating the proposed modules into the baseline Fully Convolutional Network (FCN), the resulting local attention network (LANet) greatly improves the performance over the baseline and outperforms other attention based methods on two aerial image datasets.
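
A hedged PyTorch sketch in the spirit of the patch attention module (PAM): channel attention is computed per local patch (here via average pooling with a fixed patch size) rather than from a single global descriptor, and the resulting weights re-scale the feature map. Patch size, reduction ratio, and the MLP form are assumptions; the attention embedding module (AEM) is not shown.

```python
# Hedged sketch of patch-wise channel attention over a CNN feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAttention(nn.Module):
    """Compute channel attention within local patches instead of over the whole image."""
    def __init__(self, channels: int, patch: int = 8, reduction: int = 4):
        super().__init__()
        self.patch = patch
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        pooled = F.avg_pool2d(x, self.patch)                # one descriptor per patch
        attn = F.interpolate(self.mlp(pooled), size=x.shape[-2:], mode="nearest")
        return x * attn                                     # re-weight features patch-wise

out = PatchAttention(256)(torch.randn(1, 256, 64, 64))
print(out.shape)  # torch.Size([1, 256, 64, 64])
```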

Posted Content
TL;DR: This paper proposes a Clustered Detection (ClusDet) network that unifies object clustering and detection in an end-to-end framework and achieves promising performance in comparison with state-of-the-art detectors.
Abstract: Detecting objects in aerial images is challenging for at least two reasons: (1) target objects like pedestrians are very small in pixels, making them hardly distinguished from surrounding background; and (2) targets are in general sparsely and non-uniformly distributed, making the detection very inefficient. In this paper, we address both issues inspired by observing that these targets are often clustered. In particular, we propose a Clustered Detection (ClusDet) network that unifies object clustering and detection in an end-to-end framework. The key components in ClusDet include a cluster proposal sub-network (CPNet), a scale estimation sub-network (ScaleNet), and a dedicated detection network (DetecNet). Given an input image, CPNet produces object cluster regions and ScaleNet estimates object scales for these regions. Then, each scale-normalized cluster region is fed into DetecNet for object detection. ClusDet has several advantages over previous solutions: (1) it greatly reduces the number of chips for final object detection and hence achieves high running time efficiency, (2) the cluster-based scale estimation is more accurate than previously used single-object based ones, hence effectively improves the detection for small objects, and (3) the final DetecNet is dedicated for clustered regions and implicitly models the prior context information so as to boost detection accuracy. The proposed method is tested on three popular aerial image datasets including VisDrone, UAVDT and DOTA. In all experiments, ClusDet achieves promising performance in comparison with state-of-the-art detectors. Code will be available at: this https URL

Proceedings ArticleDOI
01 Oct 2019
TL;DR: This paper proposes an adaptive cropping method based on a Difficult Region Estimation Network (DREN) to enhance the detection of the difficult targets, which allows the detector to fully exploit its performance during the testing phase.
Abstract: Detecting objects in aerial images usually faces two major challenges: (1) detecting difficult targets (e.g., small objects, objects that are interfered with by the background, or objects with various orientations, etc.); (2) the imbalance problems inherent in object detection (e.g., imbalanced quantities in different categories, imbalanced sampling methods, or imbalanced loss between classification and localization, etc.). Due to these challenges, detectors are often unable to perform the most effective training and testing. In this paper, we propose a simple but effective framework to address these concerns. First, we propose an adaptive cropping method based on a Difficult Region Estimation Network (DREN) to enhance the detection of difficult targets, which allows the detector to fully exploit its performance during the testing phase. Second, we use the well-trained DREN to generate more diverse and representative training images, which is effective in enhancing the training set. Besides, in order to alleviate the impact of imbalance during training, we add a balance module in which the IoU-balanced sampling method and balanced L1 loss are adopted. Finally, we evaluate our method on two aerial image datasets. Without bells and whistles, our framework achieves 8.0 points and 3.3 points higher Average Precision (AP) than the corresponding baselines on VisDrone and UAVDT, respectively.