
Showing papers on "Aerial image" published in 2021


Journal ArticleDOI
TL;DR: A deeply supervised (DS) attention metric-based network (DSAMNet) is proposed in this article to learn change maps by means of deep metric learning, in which convolutional block attention modules (CBAM) are integrated to provide more discriminative features.
Abstract: Change detection (CD) aims to identify surface changes from bitemporal images. In recent years, deep learning (DL)-based methods have made substantial breakthroughs in the field of CD. However, CD results can be easily affected by external factors, including illumination, noise, and scale, which leads to pseudo-changes and noise in the detection map. To deal with these problems and achieve more accurate results, a deeply supervised (DS) attention metric-based network (DSAMNet) is proposed in this article. A metric module is employed in DSAMNet to learn change maps by means of deep metric learning, in which convolutional block attention modules (CBAM) are integrated to provide more discriminative features. As an auxiliary, a DS module is introduced to enhance the feature extractor's learning ability and generate more useful features. Moreover, another challenge encountered by data-driven DL algorithms is posed by the limitations in change detection datasets (CDDs). Therefore, we create a CD dataset, Sun Yat-Sen University (SYSU)-CD, for bitemporal image CD, which contains a total of 20,000 aerial image pairs of size 256 x 256. Experiments are conducted on both the CDD and the SYSU-CD dataset. Compared to other state-of-the-art methods, our network achieves the highest accuracy on both datasets, with an F1 of 93.69% on the CDD dataset and 78.18% on the SYSU-CD dataset.
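
The CBAM blocks referenced in the abstract re-weight a feature map with channel attention followed by spatial attention. Below is a minimal PyTorch sketch of such a block; the reduction ratio, kernel size, and toy input are illustrative assumptions, not the DSAMNet implementation.

```python
# Illustrative sketch of a CBAM block: channel attention, then spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: conv over channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)                       # channel re-weighting
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))             # spatial re-weighting

features = torch.randn(2, 64, 64, 64)       # toy bitemporal difference features
print(CBAM(64)(features).shape)             # torch.Size([2, 64, 64, 64])
```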

206 citations


Posted Content
TL;DR: A Rotation-equivariant Detector (ReDet) is proposed, which explicitly encodes rotation equivariance and rotation invariance and incorporates rotation-equivariant networks into the detector to extract rotation-equivariant features, which can accurately predict the orientation and lead to a huge reduction of model size.
Abstract: Recently, object detection in aerial images has gained much attention in computer vision. Different from objects in natural images, aerial objects are often distributed with arbitrary orientation. Therefore, the detector requires more parameters to encode the orientation information, which are often highly redundant and inefficient. Moreover, as ordinary CNNs do not explicitly model the orientation variation, large amounts of rotation augmented data is needed to train an accurate object detector. In this paper, we propose a Rotation-equivariant Detector (ReDet) to address these issues, which explicitly encodes rotation equivariance and rotation invariance. More precisely, we incorporate rotation-equivariant networks into the detector to extract rotation-equivariant features, which can accurately predict the orientation and lead to a huge reduction of model size. Based on the rotation-equivariant features, we also present Rotation-invariant RoI Align (RiRoI Align), which adaptively extracts rotation-invariant features from equivariant features according to the orientation of RoI. Extensive experiments on several challenging aerial image datasets DOTA-v1.0, DOTA-v1.5 and HRSC2016, show that our method can achieve state-of-the-art performance on the task of aerial object detection. Compared with previous best results, our ReDet gains 1.2, 3.5 and 2.6 mAP on DOTA-v1.0, DOTA-v1.5 and HRSC2016 respectively while reducing the number of parameters by 60% (313 Mb vs. 121 Mb). The code is available at: https://github.com/csuhan/ReDet.

153 citations


Journal ArticleDOI
TL;DR: In this article, a large-scale dataset of object detection in aerial images (DOTA) is presented, which contains 1,793,658 object instances of 18 categories of oriented-bounding-box annotations collected from 11,268 aerial images.
Abstract: In the past decade, object detection has achieved significant progress in natural images but not in aerial images, due to the massive variations in the scale and orientation of objects caused by the bird's-eye view of aerial images. More importantly, the lack of large-scale benchmarks has become a major obstacle to the development of object detection in aerial images (ODAI). In this paper, we present a large-scale Dataset of Object deTection in Aerial images (DOTA) and comprehensive baselines for ODAI. The proposed DOTA dataset contains 1,793,658 object instances of 18 categories of oriented-bounding-box annotations collected from 11,268 aerial images. Based on this large-scale and well-annotated dataset, we build baselines covering 10 state-of-the-art algorithms with over 70 configurations, where the speed and accuracy performances of each model have been evaluated. Furthermore, we provide a code library for ODAI and build a website for evaluating different algorithms. Previous challenges run on DOTA have attracted more than 1300 teams worldwide. We believe that the expanded large-scale DOTA dataset, the extensive baselines, the code library and the challenges can facilitate the designs of robust algorithms and reproducible research on the problem of object detection in aerial images.

145 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: A Rotation-equivariant Detector (ReDet) is proposed that explicitly encodes rotation equivariance and rotation invariance, avoiding the large amounts of rotation-augmented data that ordinary CNNs need to train an accurate object detector.
Abstract: Recently, object detection in aerial images has gained much attention in computer vision. Different from objects in natural images, aerial objects are often distributed with arbitrary orientation. Therefore, the detector requires more parameters to encode the orientation information, which are often highly redundant and inefficient. Moreover, as ordinary CNNs do not explicitly model the orientation variation, large amounts of rotation augmented data is needed to train an accurate object detector. In this paper, we propose a Rotation-equivariant Detector (ReDet) to address these issues, which explicitly encodes rotation equivariance and rotation invariance. More precisely, we incorporate rotation-equivariant networks into the detector to extract rotation-equivariant features, which can accurately predict the orientation and lead to a huge reduction of model size. Based on the rotation-equivariant features, we also present Rotation-invariant RoI Align (RiRoI Align), which adaptively extracts rotation-invariant features from equivariant features according to the orientation of RoI. Extensive experiments on several challenging aerial image datasets DOTA-v1.0, DOTA-v1.5 and HRSC2016, show that our method can achieve state-of-the-art performance on the task of aerial object detection. Compared with previous best results, our ReDet gains 1.2, 3.5 and 2.6 mAP on DOTA-v1.0, DOTA-v1.5 and HRSC2016 respectively while reducing the number of parameters by 60% (313 Mb vs. 121 Mb). The code is available at: https://github.com/csuhan/ReDet.
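
The rotation-invariant pooling idea can be pictured with a toy example: rotation-equivariant features carry an explicit orientation axis, and aligning that axis to the RoI angle cancels the rotation. The sketch below only illustrates this circular-shift intuition with plain tensors; it is an assumption-level illustration, not the released RiRoI Align implementation.

```python
# Toy sketch: cancel an RoI's rotation by rolling the orientation axis of a
# rotation-equivariant feature (K discrete orientation bins).
import math
import torch

def align_orientation(feat, angle, num_orientations=8):
    """feat: [C, K, H, W] rotation-equivariant RoI feature; angle in radians."""
    shift = int(round(angle / (2 * math.pi / num_orientations))) % num_orientations
    # Two RoIs that differ only by rotation map to (approximately) the same feature.
    return torch.roll(feat, shifts=-shift, dims=1)

feat = torch.randn(256, 8, 7, 7)
aligned = align_orientation(feat, angle=math.pi / 4)
print(aligned.shape)  # torch.Size([256, 8, 7, 7])
```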

138 citations


Journal ArticleDOI
TL;DR: This letter proposes a new method, called self-attention-based deep feature fusion (SAFF), to aggregate deep layer features and emphasize the weights of the complex objects of remote sensing scene images for remote sensing scene classification.
Abstract: Remote sensing scene classification aims to automatically assign each aerial image a specific semantic label. In this letter, we propose a new method, called self-attention-based deep feature fusion (SAFF), to aggregate deep layer features and emphasize the weights of the complex objects of remote sensing scene images for remote sensing scene classification. First, the pretrained convolutional neural network (CNN) model is applied to extract the abstract multilayer feature maps from the original aerial imagery. Then, a nonparametric self-attention layer is proposed for spatial-wise and channel-wise weightings, which enhances the effects of the spatial responses of the representative objects and uses the infrequently occurring features more sufficiently. Thus, it can extract more discriminative features. Finally, the aggregated features are fed into a support vector machine (SVM) for classification. The proposed method is evaluated on several data sets, and the results prove the effectiveness and efficiency of the scheme for remote sensing scene classification.
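
As a rough illustration of the parameter-free spatial- and channel-wise weighting plus SVM pipeline described above, the sketch below weights a CNN feature map without any learned parameters and classifies the pooled descriptors. The specific weighting formulas (sum-over-channels saliency, inverse-frequency channel weights) and the toy data are assumptions for illustration, not the SAFF definitions.

```python
# Parameter-free weighting of a CNN feature map followed by a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def saff_like_descriptor(feat):
    """feat: [C, H, W] activations from a pretrained CNN layer."""
    spatial = feat.sum(axis=0)                       # [H, W] spatial saliency
    spatial = spatial / (np.linalg.norm(spatial) + 1e-8)
    nonzero_frac = (feat > 0).mean(axis=(1, 2))      # how often each channel fires
    channel = -np.log(nonzero_frac + 1e-8)           # boost rarely-firing channels
    weighted = feat * spatial[None] * channel[:, None, None]
    return weighted.sum(axis=(1, 2))                 # [C] pooled scene descriptor

rng = np.random.default_rng(0)
X = np.stack([saff_like_descriptor(np.maximum(rng.standard_normal((512, 14, 14)), 0))
              for _ in range(40)])
y = rng.integers(0, 4, size=40)                      # toy scene labels
print(LinearSVC(dual=True).fit(X, y).score(X, y))
```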

110 citations


Proceedings ArticleDOI
11 Mar 2021
TL;DR: PointFlow is a point-wise affinity propagation module built on the Feature Pyramid Network (FPN) framework; it generates a sparse affinity map over selected points between adjacent features, which reduces the noise introduced by the background.
Abstract: Aerial Image Segmentation is a particular semantic segmentation problem and has several challenging characteristics that general semantic segmentation does not have. There are two critical issues: The one is an extremely foreground-background imbalanced distribution, and the other is multiple small objects along with the complex background. Such problems make the recent dense affinity context modeling perform poorly even compared with baselines due to over-introduced background context. To handle these problems, we propose a point-wise affinity propagation module based on the Feature Pyramid Network (FPN) framework, named PointFlow. Rather than dense affinity learning, a sparse affinity map is generated upon selected points between the adjacent features, which reduces the noise introduced by the background while keeping efficiency. In particular, we design a dual point matcher to select points from the salient area and object boundaries, respectively. Experimental results on three different aerial segmentation datasets suggest that the proposed method is more effective and efficient than state-of-the-art general semantic segmentation methods. Especially, our methods achieve the best speed and accuracy trade-off on three aerial benchmarks. Further experiments on three general semantic segmentation datasets prove the generality of our method. Code and models are made available (https://github.com/lxtGH/PFSegNets).
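
The contrast with dense affinity modeling can be sketched as follows: only the top-k salient locations of the adjacent features participate in the affinity, so both the cost and the background noise shrink. The saliency criterion and hyper-parameters below are assumptions, not the paper's dual point matcher.

```python
# Sparse point-wise affinity propagation between two adjacent FPN levels.
import torch
import torch.nn.functional as F

def sparse_affinity_propagate(fine, coarse, k=64):
    """fine: [B, C, H, W] high-res FPN level; coarse: [B, C, h, w] lower-res level."""
    B, C, H, W = fine.shape
    up = F.interpolate(coarse, size=(H, W), mode="bilinear", align_corners=False)
    saliency = up.abs().mean(dim=1).flatten(1)                 # [B, H*W] crude saliency
    idx = saliency.topk(k, dim=1).indices                      # selected point indices
    f = fine.flatten(2).transpose(1, 2).contiguous()           # [B, H*W, C]
    u = up.flatten(2).transpose(1, 2)
    fk = torch.gather(f, 1, idx.unsqueeze(-1).expand(-1, -1, C))   # [B, k, C]
    uk = torch.gather(u, 1, idx.unsqueeze(-1).expand(-1, -1, C))
    affinity = torch.softmax(fk @ uk.transpose(1, 2) / C ** 0.5, dim=-1)  # [B, k, k]
    f[torch.arange(B).unsqueeze(1), idx] = fk + affinity @ uk  # update only the k points
    return f.transpose(1, 2).reshape(B, C, H, W)

out = sparse_affinity_propagate(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 32, 32))
print(out.shape)   # torch.Size([1, 32, 64, 64])
```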

63 citations


Journal ArticleDOI
TL;DR: Experimental results and analysis validate the claim that the proposed YOLOv3-dense network achieves good performance in the detection of different-sized insulators amid diverse background interference.
Abstract: Automatic inspection of insulators from high-voltage transmission lines is of paramount importance to the safety and reliable operation of the power grid. Due to different-sized insulators and the complex background of aerial images, it is a difficult task to recognize insulators in aerial views. Most of the traditional image processing methods and machine learning methods cannot achieve sufficient performance for insulator detection when diverse background interference is present. In this study, a deep learning method based on You Only Look Once (YOLO) is proposed, capable of detecting insulators from aerial images with complex backgrounds. Firstly, aerial images with common aerial scenes were collected by Unmanned Aerial Vehicle (UAV), and a novel insulator dataset was constructed. Secondly, to enhance feature reuse and propagation, on the basis of YOLOv3 and Dense-Blocks, the YOLOv3-dense network was utilized for insulator detection. To improve detection accuracy for different-sized insulators, a structure of multiscale feature fusion was adapted to the YOLOv3-dense network. To obtain abundant semantic information of upper and lower layers, multilevel feature mapping modules were employed across the YOLOv3-dense network. Finally, the YOLOv3-dense network and the compared networks were trained and tested on the constructed dataset. The average precision of YOLOv3-dense, YOLOv3, and YOLOv2 were 94.47%, 90.31%, and 83.43%, respectively. Experimental results and analysis validate the claim that the proposed YOLOv3-dense network achieves good performance in the detection of different-sized insulators amid diverse background interference.
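
The feature-reuse idea behind replacing parts of the YOLOv3 backbone with Dense-Blocks can be shown in a few lines: each layer's output is concatenated with everything computed before it. Growth rate and depth here are illustrative assumptions, not the paper's configuration.

```python
# A small DenseNet-style block of the kind swapped into YOLOv3 backbones.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.LeakyReLU(0.1, inplace=True),
                nn.Conv2d(ch, growth_rate, 3, padding=1, bias=False)))
            ch += growth_rate
        self.out_channels = ch

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # concatenate for feature reuse
        return x

block = DenseBlock(64)
print(block(torch.randn(1, 64, 52, 52)).shape)    # torch.Size([1, 192, 52, 52])
```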

48 citations


Journal ArticleDOI
TL;DR: The experimental results and analysis demonstrate that the proposed CSPD-YOLO model performs better in insulator fault detection from high-voltage transmission lines with a complex background.
Abstract: Insulator fault detection is one of the essential tasks for high-voltage transmission lines' intelligent inspection. In this study, a modified model based on You Only Look Once (YOLO) is proposed for detecting insulator faults in aerial images with a complex background. Firstly, aerial images with one fault or multiple faults are collected in diverse scenes, and then a novel dataset is established. Secondly, to increase feature reuse and propagation in the low-resolution feature layers, a Cross Stage Partial Dense YOLO (CSPD-YOLO) model is proposed based on YOLO-v3 and the Cross Stage Partial Network. The feature pyramid network and an improved loss function are adopted in the CSPD-YOLO model, improving the accuracy of insulator fault detection. Finally, the proposed CSPD-YOLO model and the compared models are trained and tested on the established dataset. The average precision of the CSPD-YOLO model is 4.9% and 1.8% higher than that of YOLO-v3 and YOLO-v4, respectively, and the running time of the CSPD-YOLO model (0.011 s) is slightly longer than that of YOLO-v3 (0.01 s) and YOLO-v4 (0.01 s). The experimental results and analysis demonstrate that the proposed CSPD-YOLO model performs better than the strong baselines YOLO-v3 and YOLO-v4 in insulator fault detection from high-voltage transmission lines with a complex background.
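
A Cross Stage Partial connection, the building block named above, routes part of the channels around the heavy stage and re-joins them by concatenation, cutting computation while keeping gradient diversity. The sketch below shows that wiring with a plain convolutional stage standing in for the dense stage; it is an assumption-level illustration rather than the CSPD-YOLO block.

```python
# Minimal Cross Stage Partial (CSP) connection: bypass path + processed path.
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.split_a = nn.Conv2d(channels, half, 1, bias=False)   # bypass path
        self.split_b = nn.Conv2d(channels, half, 1, bias=False)   # processed path
        self.stage = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(half, half, 3, padding=1, bias=False))
        self.transition = nn.Conv2d(2 * half, channels, 1, bias=False)

    def forward(self, x):
        a, b = self.split_a(x), self.stage(self.split_b(x))
        return self.transition(torch.cat([a, b], dim=1))

print(CSPBlock(128)(torch.randn(1, 128, 26, 26)).shape)   # torch.Size([1, 128, 26, 26])
```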

45 citations


Journal ArticleDOI
TL;DR: A one-stage, anchor-free detection approach to detect arbitrarily oriented vehicles in high-resolution aerial images by directly predicting high-level vehicle features via a fully convolutional network is proposed.
Abstract: Vehicle detection in aerial images is an important and challenging task in the field of remote sensing. Recently, deep learning technologies have yielded superior performance for object detection in remote sensing images. However, the detection results of the existing methods are horizontal bounding boxes that ignore vehicle orientations, thereby having limited applicability in scenes with dense vehicles or clutter backgrounds. In this article, we propose a one-stage, anchor-free detection approach to detect arbitrarily oriented vehicles in high-resolution aerial images. The vehicle detection task is transformed into a multitask learning problem by directly predicting high-level vehicle features via a fully convolutional network. That is, a classification subtask is created to look for vehicle central points and three regression subtasks are created to predict vehicle orientations, scales, and offsets of vehicle central points. First, coarse and fine feature maps outputted from different stages of a residual network are concatenated together by a feature pyramid fusion strategy. Upon the concatenated features, four convolutional layers are attached in parallel to predict high-level vehicle features. During training, task uncertainty learned from the training data is used to weight loss function in the multitask learning setting. For inferencing, oriented bounding boxes are generated using the predicted vehicle features, and oriented nonmaximum suppression (NMS) postprocessing is used to reduce redundant results. Experiments on two public aerial image data sets have shown the effectiveness of the proposed approach.
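
The "task uncertainty learned from the training data is used to weight the loss" step corresponds to the common log-variance weighting of multi-task losses. The sketch below shows that mechanism with four placeholder sub-task losses standing in for the classification and regression terms; it is not the paper's exact loss function.

```python
# Uncertainty-weighted multi-task loss (log-variance trick).
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self, num_tasks=4):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))  # learned log sigma^2 per task

    def forward(self, losses):
        losses = torch.stack(losses)
        # Each loss is scaled by exp(-log_var); the +log_var term stops the
        # weights from collapsing to zero.
        return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()

criterion = UncertaintyWeightedLoss()
center_cls, angle_reg, scale_reg, offset_reg = [torch.rand(()) for _ in range(4)]
total = criterion([center_cls, angle_reg, scale_reg, offset_reg])
total.backward()           # log_vars receive gradients along with the network
print(float(total))
```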

44 citations


Journal ArticleDOI
TL;DR: A local semantic enhanced ConvNet (LSE-Net) is proposed for aerial scene recognition; it mimics the human visual perception of key local regions in aerial scenes, in the hope of building a discriminative local semantic representation.
Abstract: Aerial scene recognition is challenging due to the complicated object distribution and spatial arrangement in a large-scale aerial image. Recent studies attempt to explore the local semantic representation capability of deep learning models, but how to exactly perceive the key local regions remains to be handled. In this paper, we present a local semantic enhanced ConvNet (LSE-Net) for aerial scene recognition, which mimics the human visual perception of key local regions in aerial scenes, in the hope of building a discriminative local semantic representation. Our LSE-Net consists of a context enhanced convolutional feature extractor, a local semantic perception module and a classification layer. Firstly, we design multi-scale dilated convolution operators to fuse multi-level and multi-scale convolutional features in a trainable manner in order to fully receive the local feature responses in an aerial scene. Then, these features are fed into our two-branch local semantic perception module. In this module, we design a context-aware class peak response (CACPR) measurement to precisely depict the visual impulse of key local regions and the corresponding context information. Also, a spatial attention weight matrix is extracted to describe the importance of each key local region for the aerial scene. Finally, the refined class confidence maps are fed into the classification layer. Exhaustive experiments on three aerial scene classification benchmarks indicate that our LSE-Net achieves the state-of-the-art performance, which validates the effectiveness of our local semantic perception module and CACPR measurement.

42 citations


Journal ArticleDOI
TL;DR: The proposed CF-Net captures more accurate small-scale semantic information from two aspects: on the one hand, a channel attention refinement block selects the informative features; on the other hand, a cross fusion block enlarges the receptive field of the low-level feature maps.
Abstract: Capturing accurate multiscale semantic information from the images is of great importance for high-quality semantic segmentation. Over the past years, a large number of methods attempt to improve the multiscale information capturing ability of the networks via various means. However, these methods always suffer unsatisfactory efficiency (e.g., speed or accuracy) on the images that include a large number of small-scale objects, for example, aerial images. In this article, we propose a new network named cross fusion net (CF-Net) for fast and effective extraction of the multiscale semantic information, especially for small-scale semantic information. In particular, the proposed CF-Net can capture more accurate small-scale semantic information from two aspects. On the one hand, we develop a channel attention refinement block to select the informative features. On the other hand, we propose a cross fusion block to enlarge the receptive field of the low-level feature maps. As a result, the network can encode more accurate semantic information from the small-scale objects, and the segmentation accuracy of the small-scale objects is improved accordingly. We have compared the proposed CF-Net with several state-of-the-art semantic segmentation methods on two popular aerial image segmentation data sets. Experimental results reveal that the average F₁ score gain brought by our CF-Net is about 0.43% and the F₁ score gain of the small-scale objects (e.g., cars) is about 2.61%. In addition, our CF-Net has the fastest inference speed, which proves its superiority in the aerial scenes. Our code will be released at: https://github.com/pcl111/CF-Net
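
One way to picture the cross fusion idea, enlarging the receptive field seen by the low-level (high-resolution) features, is a two-way exchange between a fine branch and a coarse branch. The resize-and-add fusion below is an assumption used for illustration, not the CF-Net block definition.

```python
# Two-way exchange between a fine (low-level) and a coarse (high-level) feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.refine_low = nn.Conv2d(channels, channels, 3, padding=1)
        self.refine_high = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, low, high):
        """low: [B, C, H, W] fine map; high: [B, C, H/2, W/2] coarse map."""
        up = F.interpolate(high, size=low.shape[-2:], mode="bilinear", align_corners=False)
        down = F.adaptive_avg_pool2d(low, high.shape[-2:])
        # Fine map gains context, coarse map gains detail.
        return self.refine_low(low + up), self.refine_high(high + down)

low, high = torch.randn(1, 64, 128, 128), torch.randn(1, 64, 64, 64)
new_low, new_high = CrossFusion(64)(low, high)
print(new_low.shape, new_high.shape)
```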

Proceedings ArticleDOI
Jinwang Wang, Wen Yang, Haowen Guo, Ruixiang Zhang, Gui-Song Xia
10 Jan 2021
TL;DR: Wang et al. propose a multiple center points based learning network (M-CenterNet) to improve the localization performance of tiny object detection, and experimental results show a significant performance gain over the competitors.
Abstract: Object detection in Earth Vision has achieved great progress in recent years. However, tiny object detection in aerial images remains a very challenging problem since the tiny objects contain a small number of pixels and are easily confused with the background. To advance tiny object detection research in aerial images, we present a new dataset for Tiny Object Detection in Aerial Images (AI-TOD). Specifically, AI-TOD comes with 700,621 object instances for eight categories across 28,036 aerial images. Compared to existing object detection datasets in aerial images, the mean size of objects in AI-TOD is about 12.8 pixels, which is much smaller than others. To build a benchmark for tiny object detection in aerial images, we evaluate the state-of-the-art object detectors on our AI-TOD dataset. Experimental results show that direct application of these approaches on AI-TOD produces suboptimal object detection results, thus new specialized detectors for tiny object detection need to be designed. Therefore, we propose a multiple center points based learning network (M-CenterNet) to improve the localization performance of tiny object detection, and experimental results show the significant performance gain over the competitors.
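
A center-point detector such as the CenterNet baseline that M-CenterNet extends is trained against Gaussian heatmap targets, and the sketch below shows why AI-TOD is hard: for objects of roughly 12.8 pixels the Gaussian collapses to one or two grid cells. The radius rule here is a simple assumption, not the multiple-center-point scheme of M-CenterNet.

```python
# CenterNet-style heatmap target: a Gaussian splat at each object centre cell.
import numpy as np

def draw_center_heatmap(shape, boxes, stride=4):
    """boxes: list of (x1, y1, x2, y2) in pixels; returns an [H/stride, W/stride] map."""
    H, W = shape[0] // stride, shape[1] // stride
    heatmap = np.zeros((H, W), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2 / stride, (y1 + y2) / 2 / stride
        ci, cj = int(cx), int(cy)                          # integer centre cell
        sigma = max(x2 - x1, y2 - y1) / stride / 6 + 0.5   # crude, clamped radius
        g = np.exp(-((xs - ci) ** 2 + (ys - cj) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)
    return heatmap

hm = draw_center_heatmap((256, 256), [(100, 100, 113, 112), (40, 60, 52, 71)])
print(hm.shape, hm.max())   # (64, 64), peak of 1.0 at each tiny-object centre cell
```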

Journal ArticleDOI
TL;DR: An efficient dual-pathway transformer structure is proposed that learns the long-term dependency of tokens in both spatial and channel dimensions and achieves state-of-the-art accuracy on benchmark building extraction datasets.
Abstract: Deep learning methods have achieved considerable progress in remote sensing image building extraction. Most building extraction methods are based on Convolutional Neural Networks (CNN). Recently, vision transformers have provided a better perspective for modeling long-range context in images, but usually suffer from high computational complexity and memory usage. In this paper, we explored the potential of using transformers for efficient building extraction. We design an efficient dual-pathway transformer structure that learns the long-term dependency of tokens in both their spatial and channel dimensions and achieves state-of-the-art accuracy on benchmark building extraction datasets. Since single buildings in remote sensing images usually only occupy a very small part of the image pixels, we represent buildings as a set of “sparse” feature vectors in their feature space by introducing a new module called “sparse token sampler”. With such a design, the computational complexity in transformers can be greatly reduced over an order of magnitude. We refer to our method as Sparse Token Transformers (STT). Experiments conducted on the Wuhan University Aerial Building Dataset (WHU) and the Inria Aerial Image Labeling Dataset (INRIA) suggest the effectiveness and efficiency of our method. Compared with some widely used segmentation methods and some state-of-the-art building extraction methods, STT has achieved the best performance with low time cost.
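
The computational saving comes from attending over a short list of sampled tokens instead of every spatial position. The sketch below scores locations with a 1x1 convolution, keeps the top-k, and runs a standard transformer encoder over them; the scorer, k, and layer sizes are assumptions, not the released STT code.

```python
# Score feature-map locations, keep top-k tokens, attend over the short sequence.
import torch
import torch.nn as nn

class SparseTokenAttention(nn.Module):
    def __init__(self, channels=256, k=64, heads=8):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, 1)                  # token saliency scorer
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.k = k

    def forward(self, feat):
        B, C, H, W = feat.shape
        scores = self.score(feat).flatten(1)                    # [B, H*W]
        idx = scores.topk(self.k, dim=1).indices
        tokens = feat.flatten(2).transpose(1, 2)                # [B, H*W, C]
        sparse = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))
        return self.encoder(sparse)                             # [B, k, C]

out = SparseTokenAttention()(torch.randn(1, 256, 64, 64))
print(out.shape)   # attention over 64 tokens instead of 4096 positions
```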

Journal ArticleDOI
TL;DR: This study introduces Lie group machine learning into the CNN model, combining both approaches to extract more discriminative and effective features, and proposes a novel network model, namely, the Lie group regional influence network (LGRIN).
Abstract: The existing convolutional neural network (CNN) models have shown excellent performance in remote sensing scene classification. However, the structure of such models is becoming more and more complex, and the learning of low-level features is difficult to interpret. To address this problem, in this study, we introduce Lie group machine learning into the CNN model, combine both approaches to extract more discriminative and effective features, and propose a novel network model, namely, the Lie group regional influence network (LGRIN). First, manifold space samples of the Lie group are obtained by mapping, and then, the features of the Lie group are extracted after the operations of image decomposition and integral image calculation. Second, the multidilation pooling is integrated into the CNN architecture. At the same time, the image regional influence network module is designed to guide the attention of the classification model by using the regional-level supervision of the decomposition. Finally, the fusion features are classified, and the predicted results are obtained. Our model takes full advantage of regional influence, the Lie group kernel function, and Lie group feature learning. Moreover, our model produces satisfactory performance on three public and challenging data sets: Aerial Image Dataset (AID), UC Merced, and NWPU-RESISC45. The experimental results verify that, compared with the state-of-the-art methods, this method is more explanatory and achieves higher accuracy.

Journal ArticleDOI
TL;DR: Experimental results on the aerial image data set (AID) and NWPU-RESISC45 datasets prove that the proposed GLDBS method achieves remarkable classification performance compared with some state-of-the-art (SOTA) methods.
Abstract: Scene classification of high-resolution images is an active research topic in the remote sensing community. Although convolutional neural network (CNN)-based methods have obtained good performance, large-scale changes of ground objects in complex scenes restrict the further improvement of classification accuracy. In this letter, a global-local dual-branch structure (GLDBS) is designed to explore discriminative features of the original images and the crucial areas, and the strategy of decision-level fusion is applied for performance improvement. To discover the crucial area of the original image, the energy map generated by CNNs is transformed to the binary image, and the coordinates of the maximally connected region can be obtained. Among them, two shallow CNNs, ResNet18 and ResNet34, are selected as the backbone to construct a dual-branch network, and a joint loss is designed to optimize the whole model. In the GLDBS, the two streams employ the same structure (ResNet18-ResNet34) as the backbone, while the parameters are not shared. Experimental results on the aerial image data set (AID) and NWPU-RESISC45 datasets prove that the proposed GLDBS method achieves remarkable classification performance compared with some state-of-the-art (SOTA) methods. The highest overall accuracies (OAs) on the AID and NWPU-RESISC45 datasets are 97.01% and 94.46%, respectively.
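
The crucial-area step described above (energy map, binarization, maximally connected region) can be prototyped with a few lines of SciPy. The threshold and the toy energy map below are assumptions; the actual GLDBS works on CNN-generated energy maps with its own settings.

```python
# Threshold an energy map, keep the largest connected blob, return its bounding box.
import numpy as np
from scipy import ndimage

def crucial_area_bbox(energy, threshold=0.5):
    """energy: [H, W] map; returns (y0, y1, x0, x1) of the largest connected region."""
    binary = energy > threshold * energy.max()
    labels, num = ndimage.label(binary)
    if num == 0:
        return 0, energy.shape[0], 0, energy.shape[1]
    sizes = ndimage.sum(binary, labels, index=range(1, num + 1))
    ys, xs = np.where(labels == (np.argmax(sizes) + 1))
    return ys.min(), ys.max() + 1, xs.min(), xs.max() + 1

energy = np.zeros((224, 224)); energy[60:120, 80:150] = 1.0     # toy energy map
y0, y1, x0, x1 = crucial_area_bbox(energy)
print((y0, y1, x0, x1))   # crop these coordinates from the original image for the local branch
```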

Journal ArticleDOI
Tao Xu, Xian Sun, Wenhui Diao, Liangjin Zhao, Kun Fu, Hongqi Wang
TL;DR: An efficient feature aligned single-shot detector (ASSD) is proposed, which consists of a novel pseudo anchor proposal module (PAPM) and a flexible context-based feature alignment module (CFAM), and achieves state-of-the-art performance.
Abstract: Object detection is a fundamental part of the interpretation of remote sensing imagery. The one-stage object detector has been adopted into this field because of its high computational efficiency. However, this detector suffers from the misalignment among predefined anchor, object, and feature extracted by standard convolution kernel both in spatial and scale. It limits the further improvement of performance, especially for the long-narrow and multiscale geospatial objects. In this article, the problem is defined as the feature misalignment problem. To deal with this issue, an efficient feature aligned single-shot detector (ASSD) is proposed, which consists of two modules: a novel pseudo anchor proposal module (PAPM) and a flexible context-based feature alignment module (CFAM). The PAPM replaces the regular anchor group with the proposed core anchor and refines it to get aligned locations. It can tackle the spatial misalignment between anchors and their corresponding objects and alleviate the negative/positive imbalance problem. Then, the CFAM adaptively adjusts the sampling points of the convolution kernel and collects the context information according to the aligned core anchor. This plug-and-play module can effectively rectify the misalignment between kernel and objects and extract aligned and robust features. A series of comprehensive experiments are conducted on two large-scale public remote sensing object detection datasets. Experiment results suggest that the proposed method is effective to alleviate the misalignment problem. Compared with the baseline model, the detection accuracy is improved by 8.5% mAP and 11.0% mAP on the challenging benchmark for object detection in optical remote sensing image (DIOR) and a large-scale dataset for object detection in aerial image (DOTA) dataset, respectively. Our best-resulting model achieves the state-of-the-art performance, surpassing other one-stage detectors both on the two datasets at a high detection speed of 21 FPS.

Journal ArticleDOI
02 Feb 2021
TL;DR: A novel solution is proposed to detect road curbs off-line using aerial images: the problem is formulated as an imitation learning problem, and a novel network and an innovative training strategy are designed to train an agent to iteratively find the road-curb graph.
Abstract: Detection of road curbs is an essential capability for autonomous driving. It can be used for autonomous vehicles to determine drivable areas on roads. Usually, road curbs are detected on-line using vehicle-mounted sensors, such as video cameras and 3-D Lidars. However, on-line detection using video cameras may suffer from challenging illumination conditions, and Lidar-based approaches may have difficulty detecting far-away road curbs due to the sparsity issue of point clouds. In recent years, aerial images are becoming increasingly available worldwide. We find that the visual appearances of road areas and off-road areas are usually different in aerial images, so we propose a novel solution to detect road curbs off-line using aerial images. The input to our method is an aerial image, and the output is directly a graph (i.e., vertices and edges) representing road curbs. To this end, we formulate the problem as an imitation learning problem, and design a novel network and an innovative training strategy to train an agent to iteratively find the road-curb graph. The experimental results on a public dataset confirm the effectiveness and superiority of our method. This work is accompanied by a demonstration video and a supplementary document at https://tonyxuqaq.github.io/iCurb/

Journal ArticleDOI
TL;DR: This article demonstrates the inappropriateness of ground-view images for aerial object detection and endorses the efficiency of YOLOv4, as it outperforms the other developed models by a minimum mAP margin of 88%.
Abstract: In the contemporary era, the global explosion of traffic has created many eye-catching concerns for policymakers. This not only enhances pollution but also leads to several road accident fatalities, which may be greatly reduced by proper monitoring and surveillance. Further, with the advent of UAV technology and due to the incompatibility of traditional techniques, surveillance has become one of UAVs' prominent application domains. However, it requires algorithmic analysis of aerial images, which becomes extremely challenging due to multi-scale rotating objects with large aspect ratios, extremely imbalanced categories, cluttered backgrounds, and the bird's-eye view. Therefore, this article presents novel aerial image traffic monitoring and surveillance algorithms based on the most advanced and popular DL object detection models (Faster-RCNN, SSD, YOLOv3, and YOLOv4) using the AU-AIR dataset. This dataset is exceedingly imbalanced, and to resolve this issue, another 500 images have been grabbed by web-mining techniques. The novel contribution of this work is two-fold. First, this article demonstrates the inappropriateness of ground-view images for aerial object detection. Second, a rigorous comparison of these algorithms has been made to investigate their effectiveness. Extensive experimental analysis endorses the efficiency of YOLOv4 as it outperforms the other developed models by a minimum mAP margin of 88%. Also, more than 6 times higher detection speed and greater adaptability with stronger detection robustness ensure its real-time practical implementation.

Journal ArticleDOI
Qi Bi, Han Zhang, Kun Qin
TL;DR: Bi et al. propose a multi-scale stacking attention pooling (MS2AP) to enhance the feature representation capability for remote sensing scenes, arguing that conventional pooling strategies are not well suited to classifying large-scale aerial images.

Journal ArticleDOI
Aihua Zheng, Ming Wang, Chenglong Li, Jin Tang, Bin Luo
TL;DR: An effective unsupervised domain adaptation approach for aerial image semantic segmentation is proposed, relying on a novel entropy-guided adversarial learning algorithm; it performs local feature alignment between domains by learning a self-adaptive weight from the target prediction probability map to measure the interdomain discrepancy.
Abstract: Recent advances on aerial image semantic segmentation mainly employ the domain adaption to transfer knowledge from the source domain to the target domain. Despite the remarkable achievement, most methods focus on the global marginal distribution alignment to reduce the domain shift between source and target domains, leading to a wrong mapping of the well-aligned features. In this article, we propose an effective unsupervised domain adaptation approach, which relies on a novel entropy guided adversarial learning algorithm, for aerial image semantic segmentation. In specific, we perform local feature alignment between domains by learning a self-adaptive weight from the target prediction probability map to measure the interdomain discrepancy. To exploit the meaningful structure information among semantic regions, we propose to utilize the graph convolutions for long-range semantic reasoning. Comprehensive experimental results on the benchmark dataset of aerial image semantic segmentation and natural scenes demonstrate the superior performance of the proposed method compared to the state-of-the-art methods.
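
The self-adaptive weight taken from the target prediction probability map is essentially a per-pixel entropy map. The sketch below computes that weight and applies it to a stand-in adversarial loss map; how the weight enters the adversarial objective in the paper may differ, so treat this as an assumption-level illustration.

```python
# Per-pixel entropy of the target-domain prediction used as an adaptive weight.
import torch

def entropy_weight(logits):
    """logits: [B, num_classes, H, W] target-domain predictions -> [B, H, W] weights."""
    p = torch.softmax(logits, dim=1)
    ent = -(p * torch.log(p + 1e-8)).sum(dim=1)                     # per-pixel entropy
    return ent / torch.log(torch.tensor(float(logits.shape[1])))    # normalize to [0, 1]

logits = torch.randn(2, 6, 64, 64)
w = entropy_weight(logits)
adv_per_pixel = torch.rand(2, 64, 64)        # stand-in for a discriminator loss map
weighted_adv_loss = (w * adv_per_pixel).mean()
print(w.shape, float(weighted_adv_loss))
```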

Journal ArticleDOI
TL;DR: Experimental results demonstrate that, compared with the existing CNN-based model, the proposed E-D-Net provides noticeably more robust and higher building extraction performance, thus making it a useful tool for practical application scenarios.
Abstract: The automatic extraction of buildings from high-resolution aerial imagery plays a significant role in many urban applications. Recently, the convolutional neural network (CNN) has gained much attention in the remote sensing field and achieved a remarkable performance in building segmentation from visible aerial images. However, most of the existing CNN-based methods still have the problem of tending to produce predictions with poor boundaries. To address this problem, in this article, a novel semantic segmentation neural network named edge-detail-network (E-D-Net) is proposed for building segmentation from visible aerial images. The proposed E-D-Net consists of two subnetworks, E-Net and D-Net. On the one hand, E-Net is designed to capture and preserve the edge information of the images. On the other hand, D-Net is designed to refine the results of E-Net and get a prediction with higher detail quality. Furthermore, a novel fusion strategy, which combines the outputs of the two subnetworks, is proposed to integrate edge information with fine details. Experimental results on the INRIA aerial image labeling dataset and the ISPRS Vaihingen 2-D semantic labeling dataset demonstrate that, compared with the existing CNN-based models, the proposed E-D-Net provides noticeably more robust and higher building extraction performance, thus making it a useful tool for practical application scenarios.

Journal ArticleDOI
Qinglin Tian, Yingjun Zhao, Yao Li, Chen Jun, Xuejiao Chen, Kai Qin
TL;DR: This study developed a novel multiscale building extraction method based on refined attention pyramid networks (RAPNets), and built an encoder–decoder structure to improve the performance of feature extraction in the encoding path.
Abstract: Automatic building extraction from high-resolution aerial and satellite images has many practical applications, such as urban planning and disaster management. However, the complex appearance and various scales of buildings in remote-sensing images bring a challenge for building extraction. In this study, we developed a novel multiscale building extraction method based on refined attention pyramid networks (RAPNets). We built an encoder-decoder structure and combined atrous convolution, deformable convolution, an attention mechanism, and a pyramid pooling module to improve the performance of feature extraction in the encoding path. Moreover, the salient multiscale features were extracted by embedding the convolutional block attention module into the lateral connections. Finally, the refined feature pyramid structure was adopted in the decoding path to fuse the multiscale features to obtain the final extraction results. Experiments on two standard data sets (Inria aerial image labeling data set and xBD data set) show that our method achieves reliable results and outperforms the compared methods.

Journal ArticleDOI
01 Jan 2021-Energies
TL;DR: The improved YOLOv3 model achieves good performance for insulator fault detection in aerial images with diverse backgrounds and has the smallest memory usage among the four compared models.
Abstract: Insulators play a significant role in high-voltage transmission lines, and detecting insulator faults timely and accurately is important for the safe and stable operation of power grids. Since insulator faults are extremely small and the backgrounds of aerial images are complex, insulator fault detection is a challenging task for automatically inspecting transmission lines. In this paper, a method based on deep learning is proposed for insulator fault detection in diverse aerial images. Firstly, to provide sufficient insulator fault images for training, a novel insulator fault dataset named “InSF-detection” is constructed. Secondly, an improved YOLOv3 model is proposed to reuse features and prevent feature loss. To improve the accuracy of insulator fault detection, SPP networks and a multi-scale prediction network are employed in the improved YOLOv3 model. Finally, the improved YOLOv3 model and the compared models are trained and tested on “InSF-detection”. The average precision (AP) of the improved YOLOv3 model is superior to that of the YOLOv3 and YOLOv3-dense models, and just a little (1.2%) lower than that of the CSPD-YOLO model; more importantly, the memory usage of the improved YOLOv3 model is 225 MB, which is the smallest among the four compared models. The experimental results and analysis validate that the improved YOLOv3 model achieves good performance for insulator fault detection in aerial images with diverse backgrounds.
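
The SPP networks mentioned above pool the same feature map with several large kernels at stride 1 and concatenate the results, cheaply widening the receptive field ahead of the YOLO head. The 5/9/13 kernel sizes below follow the common YOLOv3-SPP convention and are an assumption about this paper's exact settings.

```python
# Spatial Pyramid Pooling block: parallel stride-1 max-pooling plus concatenation.
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes])

    def forward(self, x):
        # Output keeps the spatial size but quadruples the channel count.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

x = torch.randn(1, 512, 13, 13)
print(SPP()(x).shape)   # torch.Size([1, 2048, 13, 13])
```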

Journal ArticleDOI
TL;DR: Taking advantage of the properties of capsules and fusing different levels of capsule features, the CapFPN accurately extracts building footprints; comparative studies with six existing methods confirm its superior performance.
Abstract: Building footprint extraction plays an important role in a wide range of applications. However, due to size and shape diversities, occlusions, and complex scenarios, it is still challenging to accurately extract building footprints from aerial images. This letter proposes a capsule feature pyramid network (CapFPN) for building footprint extraction from aerial images. Taking advantage of the properties of capsules and fusing different levels of capsule features, the CapFPN can extract high-resolution, intrinsic, and semantically strong features, which perform effectively in improving the pixel-wise building footprint extraction accuracy. With the use of signed distance maps as ground truths, the CapFPN can extract solid building regions free of tiny holes. Quantitative evaluations on an aerial image data set show that a precision, recall, intersection-over-union (IoU), and F-score of 0.928, 0.914, 0.853, and 0.921, respectively, are obtained. Comparative studies with six existing methods confirm the superior performance of the CapFPN in accurately extracting building footprints.
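
Using signed distance maps as ground truth, as the letter does, means regressing a map that is positive inside footprints and negative outside, which discourages tiny holes. The sketch below builds such a target from a binary mask; the clipping and normalization choices are assumptions, not the CapFPN recipe.

```python
# Signed distance map target from a binary building mask.
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask, clip=20.0):
    """mask: binary [H, W] building footprint mask -> signed distances in [-1, 1]."""
    inside = distance_transform_edt(mask)          # distance to nearest background pixel
    outside = distance_transform_edt(1 - mask)     # distance to nearest building pixel
    sdm = np.clip(inside, 0, clip) - np.clip(outside, 0, clip)
    return sdm / clip

mask = np.zeros((128, 128), dtype=np.uint8); mask[40:90, 30:100] = 1
sdm = signed_distance_map(mask)
print(sdm.min(), sdm.max())   # -1.0 far outside, up to +1.0 deep inside the footprint
```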

Journal ArticleDOI
TL;DR: A self-supervised learning approach, Self-Supervised Monocular Depth Estimation (SMDE), is presented; it does not need ground-truth depth or any extra information other than images for learning to estimate depth, and it outperforms the state-of-the-art methods in estimating depth.
Abstract: Unmanned Aerial Vehicles (UAVs) have become an essential photogrammetric measurement as they are affordable, easily accessible and versatile. Aerial images captured from UAVs have applications in small and large scale texture mapping, 3D modelling, object detection tasks, Digital Terrain Model (DTM) and Digital Surface Model (DSM) generation etc. Photogrammetric techniques are routinely used for 3D reconstruction from UAV images where multiple images of the same scene are acquired. Developments in computer vision and deep learning techniques have made Single Image Depth Estimation (SIDE) a field of intense research. Using SIDE techniques on UAV images can overcome the need for multiple images for 3D reconstruction. This paper aims to estimate depth from a single UAV aerial image using deep learning. We follow a self-supervised learning approach, Self-Supervised Monocular Depth Estimation (SMDE), which does not need ground truth depth or any extra information other than images for learning to estimate depth. Monocular video frames are used for training the deep learning model which learns depth and pose information jointly through two different networks, one each for depth and pose. The predicted depth and pose are used to reconstruct one image from the viewpoint of another image utilising the temporal information from videos. We propose a novel architecture with two 2D Convolutional Neural Network (CNN) encoders and a 3D CNN decoder for extracting information from consecutive temporal frames. A contrastive loss term is introduced for improving the quality of image generation. Our experiments are carried out on the public UAVid video dataset. The experimental results demonstrate that our model outperforms the state-of-the-art methods in estimating the depths.

Journal ArticleDOI
TL;DR: Mughal et al. as mentioned in this paper proposed a complete trainable pipeline to localize an aerial image in a pre-stored orthomosaic map in the context of UAV localization.
Abstract: In this article, we aim to explore the potential of using onboard cameras and pre-stored geo-referenced imagery for Unmanned Aerial Vehicle (UAV) localization. Such a vision-based localization enhancing system is of vital importance, particularly in situations where the integrity of the global positioning system (GPS) is in question (i.e., in the occurrence of GPS outages, jamming, etc.). To this end, we propose a complete trainable pipeline to localize an aerial image in a pre-stored orthomosaic map in the context of UAV localization. The proposed deep architecture extracts the features from the aerial imagery and localizes it in a pre-ordained, larger, and geotagged image. The idea is to train a deep learning model to find neighborhood consensus patterns that encapsulate the local patterns in the neighborhood of the established dense feature correspondences by introducing semi-local constraints. We qualitatively and quantitatively evaluate the performance of our approach on real UAV imagery. The training and testing data are acquired via multiple flights over different regions. The source code, along with the entire dataset including the annotations of the collected images, has been made public at https://github.com/m-hamza-mughal/Aerial-Template-Matching. To the best of our knowledge, such a dataset is novel and the first of its kind; it consists of 2052 high-resolution aerial images acquired at different times over three different areas in Pakistan, spanning a total area of around 2 km².

Journal ArticleDOI
Guohui Deng, Zhaocong Wu, Chengjun Wang, Miaozhong Xu, Yanfei Zhong
TL;DR: A novel class-constraint coarse-to-fine attentional (CCA) deep network is proposed, which enables the formation of class information constraints to obtain explicit long-range context information and achieves state-of-the-art performance on the IEEE GRSS Data Fusion Contest Zeebrugge data set.
Abstract: Semantic segmentation is important for the understanding of subdecimeter aerial images. In recent years, deep convolutional neural networks (DCNNs) have been used widely for semantic segmentation in the field of remote sensing. However, because of the highly complex subdecimeter resolution of aerial images, inseparability often occurs among some geographic entities of interest in the spectral domain. In addition, the semantic segmentation methods based on DCNNs mostly obtain context information using extra information within the added receptive field. However, the context information obtained this way is not explicit. We propose a novel class-constraint coarse-to-fine attentional (CCA) deep network, which enables the formation of class information constraints to obtain explicit long-range context information. Further, the performance of subdecimeter aerial image semantic segmentation can be improved, particularly for fine-structured geographic entities. Based on coarse-to-fine technology, we obtained a coarse segmentation result and constructed an image class feature library. We propose the use of the attention mechanism to obtain strong class-constrained features. Consequently, pixels of different geographic entities can adaptively match the corresponding categories in the class feature library. Additionally, we employed a novel loss function, CCA-loss to realize end-to-end training. The experimental results obtained using two popular open benchmarks, International Society for Photogrammetry and Remote Sensing (ISPRS) 2-D semantic labeling Vaihingen data set and Institute of Electrical and Electronics Engineers (IEEE) Geoscience and Remote Sensing Society (GRSS) Data Fusion Contest Zeebrugge data set, validated the effectiveness and superiority of our proposed model. The proposed method achieved state-of-the-art performance on the IEEE GRSS Data Fusion Contest Zeebrugge data set.

Posted Content
TL;DR: This work proposes a simple yet effective calibrated-guidance (CG) scheme to enhance channel communications in a feature transformer fashion, which can adaptively determine the calibration weights for each channel based on the global feature affinity correlations.
Abstract: Recently, the study on object detection in aerial images has made tremendous progress in the community of computer vision. However, most state-of-the-art methods tend to develop elaborate attention mechanisms for the space-time feature calibrations with high computational complexity, while surprisingly ignoring the importance of feature calibrations in channels. In this work, we propose a simple yet effective Calibrated-Guidance (CG) scheme to enhance channel communications in a feature transformer fashion, which can adaptively determine the calibration weights for each channel based on the global feature affinity-pairs. Specifically, given a set of feature maps, CG first computes the feature similarity between each channel and the remaining channels as the intermediary calibration guidance. Then, each channel is re-represented by aggregating all the channels weighted together via the guidance. Our CG can be plugged into any deep neural network, which is named CG-Net. To demonstrate its effectiveness and efficiency, extensive experiments are carried out on both oriented and horizontal object detection tasks of aerial images. Results on two challenging benchmarks (i.e., DOTA and HRSC2016) demonstrate that our CG-Net can achieve state-of-the-art performance in accuracy with a fair computational overhead. The code is available online.
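
The channel-affinity guidance can be sketched directly from the description: compare every channel with every other channel, normalize the affinities, and re-represent each channel as an affinity-weighted mixture. The cosine similarity, softmax normalization, and residual connection below are assumptions, not the CG-Net implementation.

```python
# Channel re-representation driven by channel-to-channel affinity.
import torch
import torch.nn.functional as F

def channel_calibration(feat):
    """feat: [B, C, H, W] -> calibrated feature of the same shape."""
    B, C, H, W = feat.shape
    flat = feat.flatten(2)                               # [B, C, H*W]
    sim = F.normalize(flat, dim=2) @ F.normalize(flat, dim=2).transpose(1, 2)  # [B, C, C]
    guidance = torch.softmax(sim, dim=-1)                # affinity-based calibration weights
    calibrated = guidance @ flat                          # mix channels by affinity
    return feat + calibrated.reshape(B, C, H, W)          # residual re-representation

x = torch.randn(2, 256, 32, 32)
print(channel_calibration(x).shape)   # torch.Size([2, 256, 32, 32])
```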

Proceedings ArticleDOI
01 Jan 2021
TL;DR: In this paper, a domain-aware hazy-to-hyperspectral (H2H) module and a conditional GAN (cGAN) based multi-cue image-to-image translation module (I2I) are proposed for haze removal in aerial images.
Abstract: Haze removal in aerial images is a challenging problem due to considerable variation in spatial details and varying contrast. Changes in particulate matter density often lead to degradation in visibility. Therefore, several approaches utilize multi-spectral data as auxiliary information for haze removal. In this paper, we propose SkyGAN for haze removal in aerial images. SkyGAN consists of 1) a domain-aware hazy-to-hyperspectral (H2H) module, and 2) a conditional GAN (cGAN) based multi-cue image-to-image translation module (I2I) for dehazing. The proposed H2H module reconstructs several visual bands from RGB images in an unsupervised manner, which overcomes the lack of hazy hyperspectral aerial image datasets. The module utilizes task supervision and domain adaptation in order to create a "hyperspectral catalyst" for image dehazing. The I2I module uses the hyperspectral catalyst along with a 12-channel multi-cue input and performs effective image de-hazing by utilizing the entire visual spectrum. In addition, this work introduces a new dataset, called the Hazy Aerial-Image (HAI) dataset, that contains more than 65,000 pairs of hazy and ground truth aerial images with realistic, non-homogeneous haze of varying density. The performance of SkyGAN is evaluated on the recent SateHaze1k dataset as well as the HAI dataset. We also present a comprehensive evaluation of the HAI dataset with a representative set of state-of-the-art techniques in terms of PSNR and SSIM.

Journal ArticleDOI
TL;DR: An approach for the multi-label classification of remote sensing images based on data-efficient transformers that extracts a compact feature representation from each image with the help of a self-attention mechanism, which can handle the global dependencies between different regions of the high-resolution aerial image.
Abstract: In this paper, we present an approach for the multi-label classification of remote sensing images based on data-efficient transformers. During the training phase, we generated a second view for each image from the training set using data augmentation. Then, both the image and its augmented version were reshaped into a sequence of flattened patches and then fed to the transformer encoder. The latter extracts a compact feature representation from each image with the help of a self-attention mechanism, which can handle the global dependencies between different regions of the high-resolution aerial image. On the top of the encoder, we mounted two classifiers, a token and a distiller classifier. During training, we minimized a global loss consisting of two terms, each corresponding to one of the two classifiers. In the test phase, we considered the average of the two classifiers as the final class labels. Experiments on two datasets acquired over the cities of Trento and Civezzano with a ground resolution of two-centimeter demonstrated the effectiveness of the proposed model.