
Showing papers on "Aerial image published in 2020"


Journal ArticleDOI
TL;DR: A novel cascading deep convolutional neural network (CNN) architecture is proposed for localizing insulators and detecting their defects; it uses a region-proposal-network-based CNN to transform defect inspection into a two-level object detection problem.
Abstract: As the failure of power line insulators leads to the failure of power transmission systems, an insulator inspection system based on an aerial platform is widely used. Insulator defect detection is performed against complex backgrounds in aerial images, presenting an interesting but challenging problem. Traditional methods, based on handcrafted features or shallow-learning techniques, can only localize insulators and detect faults under specific detection conditions, such as when sufficient prior knowledge is available, with low background interference, at certain object scales, or under specific illumination conditions. This paper discusses the automatic detection of insulator defects using aerial images, accurately localizing insulator defects appearing in input images captured from real inspection environments. We propose a novel deep convolutional neural network (CNN) cascading architecture for performing localization and detecting defects in insulators. The cascading network uses a CNN based on a region proposal network to transform defect inspection into a two-level object detection problem. To address the scarcity of defect images in a real inspection environment, a data augmentation method is also proposed that includes four operations: 1) affine transformation; 2) insulator segmentation and background fusion; 3) Gaussian blur; and 4) brightness transformation. Defect detection precision and recall of the proposed method are 0.91 and 0.96 using a standard insulator dataset, and insulator defects under various conditions can be successfully detected. Experimental results demonstrate that this method meets the robustness and accuracy requirements for insulator defect detection.
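As a rough illustration of the four augmentation operations described above, a minimal Python sketch with OpenCV/NumPy could look like the following; the function names, parameter values, and the alpha-mask fusion step are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the four augmentation operations described in the abstract.
# Parameter ranges and helper names are illustrative assumptions.
import cv2
import numpy as np

def affine_transform(img, angle=15, scale=1.1):
    # rotate and scale about the image center
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    return cv2.warpAffine(img, M, (w, h))

def fuse_with_background(insulator_rgba, background):
    # insulator_rgba: segmented insulator with an alpha mask (H, W, 4);
    # background is assumed to have the same height and width
    alpha = insulator_rgba[..., 3:4].astype(np.float32) / 255.0
    fg = insulator_rgba[..., :3].astype(np.float32)
    bg = background.astype(np.float32)
    return (alpha * fg + (1.0 - alpha) * bg).astype(np.uint8)

def gaussian_blur(img, ksize=5):
    # ksize must be odd
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def brightness_transform(img, alpha=1.2, beta=10):
    # alpha scales contrast, beta shifts brightness
    return cv2.convertScaleAbs(img, alpha=alpha, beta=beta)
```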

324 citations


Proceedings ArticleDOI
12 Apr 2020
TL;DR: This paper proposes a Density-Map guided object detection Network (DMNet), inspired by the observation that the object density map of an image presents how objects distribute in terms of the pixel intensity of the map.
Abstract: Object detection in high-resolution aerial images is a challenging task because of 1) the large variation in object size, and 2) non-uniform distribution of objects. A common solution is to divide the large aerial image into small (uniform) crops and then apply object detection on each small crop. In this paper, we investigate the image cropping strategy to address these challenges. Specifically, we propose a Density-Map guided object detection Network (DMNet), which is inspired by the observation that the object density map of an image presents how objects distribute in terms of the pixel intensity of the map. As pixel intensity varies, it is able to tell whether a region has objects or not, which in turn provides guidance for cropping images statistically. DMNet has three key components: a density map generation module, an image cropping module and an object detector. DMNet generates a density map and learns scale information based on density intensities to form cropping regions. Extensive experiments show that DMNet achieves state-of-the-art performance on two popular aerial image datasets, i.e., VisionDrone [30] and UAVDT [4].
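A minimal sketch of the density-map-guided cropping idea (threshold the density map and group connected high-density regions into crop windows); the threshold, minimum-area value, and use of connected components are illustrative assumptions rather than DMNet's exact cropping module.

```python
# Hedged sketch: turn a predicted object-density map into crop regions by
# thresholding pixel intensity and grouping connected areas.
import cv2
import numpy as np

def density_to_crops(density_map, thresh=0.1, min_area=100):
    mask = (density_map > thresh).astype(np.uint8)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    crops = []
    for i in range(1, num):             # label 0 is the background component
        x, y, w, h, area = stats[i]
        if area >= min_area:
            crops.append((x, y, w, h))  # region to crop and run detection on
    return crops
```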

112 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work presents Agriculture-Vision: a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns, and proposes an effective model designed for aerial agricultural pattern recognition.
Abstract: The success of deep learning in visual recognition tasks has driven advancements in multiple fields of research. Particularly, increasing attention has been drawn towards its application in agriculture. Nevertheless, while visual pattern recognition on farmlands carries enormous economic values, little progress has been made to merge computer vision and crop sciences due to the lack of suitable agricultural image datasets. Meanwhile, problems in agriculture also pose new challenges in computer vision. For example, semantic segmentation of aerial farmland images requires inference over extremely large-size images with extreme annotation sparsity. These challenges are not present in most of the common object datasets, and we show that they are more challenging than many other aerial image datasets. To encourage research in computer vision for agriculture, we present Agriculture-Vision: a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns. We collected 94,986 high-quality aerial images from 3,432 farmlands across the US, where each image consists of RGB and Near-infrared (NIR) channels with resolution as high as 10 cm per pixel. We annotate nine types of field anomaly patterns that are most important to farmers. As a pilot study of aerial agricultural semantic segmentation, we perform comprehensive experiments using popular semantic segmentation models; we also propose an effective model designed for aerial agricultural pattern recognition. Our experiments demonstrate several challenges Agriculture-Vision poses to both the computer vision and agriculture communities. Future versions of this dataset will include even more aerial images, anomaly patterns and image channels.

105 citations


Posted Content
TL;DR: This paper explores a relatively less-studied methodology based on classification for rotation detection, and proposes new techniques to inherently dismiss the boundary discontinuity issue as encountered by the regression-based detectors.
Abstract: Rotation detection serves as a fundamental building block in many visual applications involving aerial image, scene text, and face etc. Differing from the dominant regression-based approaches for orientation estimation, this paper explores a relatively less-studied methodology based on classification. The hope is to inherently dismiss the boundary discontinuity issue as encountered by the regression-based detectors. We propose new techniques to push its frontier in two aspects: i) new encoding mechanism: the design of two Densely Coded Labels (DCL) for angle classification, to replace the Sparsely Coded Label (SCL) in existing classification-based detectors, leading to three times training speed increase as empirically observed across benchmarks, further with notable improvement in detection accuracy; ii) loss re-weighting: we propose Angle Distance and Aspect Ratio Sensitive Weighting (ADARSW), which improves the detection accuracy especially for square-like objects, by making DCL-based detectors sensitive to angular distance and object's aspect ratio. Extensive experiments and visual analysis on large-scale public datasets for aerial images i.e. DOTA, UCAS-AOD, HRSC2016, as well as scene text dataset ICDAR2015 and MLT, show the effectiveness of our approach. The source code is available at this https URL and is also integrated in our open source rotation detection benchmark: this https URL.
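To illustrate the densely coded label idea, a sketch that discretizes the angle and encodes the class index in binary, so the code length is only log2 of the number of classes instead of a full one-hot vector; the interval size and angle range are assumptions, and the paper's exact coding scheme may differ.

```python
# Hedged sketch of a densely coded angle label: discretize the angle into
# classes and encode the class index in binary (LSB first).
import numpy as np

def dense_binary_code(angle_deg, omega=1.0, angle_range=180):
    num_classes = int(angle_range / omega)
    idx = int(angle_deg / omega) % num_classes
    code_len = int(np.ceil(np.log2(num_classes)))
    return np.array([(idx >> b) & 1 for b in range(code_len)], dtype=np.float32)

# Example: dense_binary_code(37.0) -> an 8-bit binary code for angle class 37
```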

87 citations


Journal ArticleDOI
TL;DR: A multiloss attention-based neural network, built on U-Net, is proposed; the attention block improves the sensitivity of the model and suppresses the influence of irrelevant background feature areas, extracting buildings with high accuracy.
Abstract: Semantic segmentation of high-resolution remote sensing images plays an important role in applications for building extraction. However, the current algorithms have some semantic information extraction limitations, and these can lead to poor segmentation results. To extract buildings with high accuracy, we propose a multiloss neural network based on attention. The designed network, based on U-Net, improves the sensitivity of the model through the attention block and suppresses the influence of irrelevant background feature areas. To improve the ability of the model, a multiloss approach is used when training the network. The experimental results show that the proposed model offers great improvement over other state-of-the-art methods. On the public Inria Aerial Image Labeling dataset, the F1 score reached 76.96%, and the model also showed good performance on the Aerial Imagery for Roof Segmentation dataset.
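A minimal sketch of an attention block on a U-Net skip connection, in the spirit of attention gates; the exact block used by the authors may differ, and it is assumed here that the gating and skip features share a spatial size.

```python
# Hedged sketch of an attention block for U-Net skip connections.
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, gate_ch, skip_ch, inter_ch):
        super().__init__()
        self.w_g = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)
        self.w_x = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, kernel_size=1), nn.Sigmoid())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, gate, skip):
        # assumes gate has already been resized to the skip feature's spatial size;
        # the attention coefficients suppress irrelevant background regions
        attn = self.psi(self.relu(self.w_g(gate) + self.w_x(skip)))
        return skip * attn
```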

73 citations


Journal ArticleDOI
TL;DR: A convolutional neural network (CNN)-based change detection method with a newly designed loss function achieves transfer learning among different data sets and compares favorably to state-of-the-art unsupervised methods.
Abstract: Considering the lack of labeled training data sets for the supervised change detection task, in this letter, we try to relieve this problem by proposing a convolutional neural network (CNN)-based change detection method with a newly designed loss function to achieve transfer learning among different data sets. To reach this goal, we first pretrain a U-Net model on an open source data set by taking advantage of the relatively sufficient training data used for the supervised semantic segmentation task. Then, we minimize a skillfully designed loss function to combine the high-level features extracted from the pretrained model and the semantic information contained in the change detection data set, by which transfer learning is achieved. Third, we compute the distance between the feature vectors obtained from the above step and produce a difference map. Finally, a simple clustering method applied to the difference map can obtain a satisfactory change map. Experiments carried out on typical optical aerial image data sets validate that the proposed approach compares favorably to state-of-the-art unsupervised methods.
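A minimal sketch of the last two steps (pixel-wise feature distance followed by two-class clustering of the difference map); the distance metric and the use of k-means are illustrative assumptions.

```python
# Hedged sketch: compare deep features of two co-registered images pixel-wise,
# then cluster the difference map into changed / unchanged.
import numpy as np
from sklearn.cluster import KMeans

def change_map(feat_t1, feat_t2):
    # feat_t1, feat_t2: (H, W, C) feature maps from the pretrained network
    diff = np.linalg.norm(feat_t1 - feat_t2, axis=-1)          # (H, W) distance map
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(diff.reshape(-1, 1))
    # assume the cluster with the larger mean distance corresponds to "changed"
    means = [diff.reshape(-1)[labels == k].mean() for k in (0, 1)]
    return labels.reshape(diff.shape) == int(np.argmax(means))
```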

62 citations


Journal ArticleDOI
TL;DR: A lightweight convolutional neural network architecture, referred to as EmergencyNet, is proposed based on atrous convolutions to process multiresolution features; it runs efficiently on low-power embedded platforms, achieving up to 20× higher performance than existing models with minimal memory requirements and less than 1% accuracy drop compared to state-of-the-art models.
Abstract: Deep learning-based algorithms can provide state-of-the-art accuracy for remote sensing technologies such as unmanned aerial vehicles (UAVs)/drones, potentially enhancing their remote sensing capabilities for many emergency response and disaster management applications. In particular, UAVs equipped with camera sensors can operate in remote and difficult-to-access disaster-stricken areas, analyze the imagery, and raise alerts in the presence of various calamities such as collapsed buildings, floods, or fires, in order to mitigate their effects on the environment and on the human population faster. However, the integration of deep learning introduces heavy computational requirements, preventing the deployment of such deep neural networks in many scenarios that impose low-latency constraints on inference in order to make mission-critical decisions in real time. To this end, this article focuses on efficient aerial image classification from on board a UAV for emergency response/monitoring applications. Specifically, a dedicated Aerial Image Database for Emergency Response applications is introduced and a comparative analysis of existing approaches is performed. Through this analysis, a lightweight convolutional neural network architecture is proposed, referred to as EmergencyNet, based on atrous convolutions to process multiresolution features and capable of running efficiently on low-power embedded platforms, achieving up to 20× higher performance compared to existing models with minimal memory requirements and less than 1% accuracy drop compared to state-of-the-art models.
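A minimal sketch of an atrous-convolution block that processes multiresolution context in parallel dilated branches; the branch count, dilation rates, and fusion are assumptions and may differ from the actual EmergencyNet block.

```python
# Hedged sketch of a multi-dilation atrous convolution block.
import torch
import torch.nn as nn

class AtrousBlock(nn.Module):
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        # parallel 3x3 branches with increasing dilation capture wider context
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```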

62 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This paper introduces a novel network, called RED-Net, for wide-range depth inference, developed from a recurrent encoder-decoder structure that regularizes cost maps across depths with a 2D fully convolutional network as the framework, and shows that the RED-Net model pre-trained on the synthetic WHU dataset can be efficiently transferred to very different multi-view aerial image datasets without any fine-tuning.
Abstract: A great deal of research has demonstrated recently that multi-view stereo (MVS) matching can be solved with deep learning methods. However, these efforts were focused on close-range objects, and only a few of the deep learning-based methods were specifically designed for large-scale 3D urban reconstruction due to the lack of multi-view aerial image benchmarks. In this paper, we present a synthetic aerial dataset, called the WHU dataset, that we created for MVS tasks and which, to our knowledge, is the first large-scale multi-view aerial dataset. It was generated from a highly accurate 3D digital surface model produced from thousands of real aerial images with precise camera parameters. We also introduce in this paper a novel network, called RED-Net, for wide-range depth inference, which we developed from a recurrent encoder-decoder structure to regularize cost maps across depths, with a 2D fully convolutional network as the framework. RED-Net’s low memory requirements and high performance make it suitable for large-scale and highly accurate 3D Earth surface reconstruction. Our experiments confirmed that our method not only exceeded the current state-of-the-art MVS methods by more than 50% in mean absolute error (MAE) with less memory and computational cost, but was also more efficient. It outperformed one of the best commercial software programs based on conventional methods, improving efficiency 16 times over. Moreover, we proved that our RED-Net model pre-trained on the synthetic WHU dataset can be efficiently transferred to very different multi-view aerial image datasets without any fine-tuning. Dataset and code are available at http://gpcv.whu.edu.cn/data.

56 citations


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed an attention-aware label relational reasoning network, which consists of three elemental modules: a label-wise feature parcel learning module, an attentional region extraction module, and a label relational inference module.
Abstract: Multilabel classification plays a momentous role in perceiving the intricate contents of an aerial image and has triggered several related studies over the last years. However, most of them devote little effort to exploiting label relations, while such dependencies are crucial for making accurate predictions. Although a long short-term memory (LSTM) layer can be introduced to model such label dependencies in a chain propagation manner, the efficiency might be questioned when certain labels are improperly inferred. To address this, we propose a novel aerial image multilabel classification network, the attention-aware label relational reasoning network. Particularly, our network consists of three elemental modules: 1) a label-wise feature parcel learning module; 2) an attentional region extraction module; and 3) a label relational inference module. To be more specific, the label-wise feature parcel learning module is designed for extracting high-level label-specific features. The attentional region extraction module aims at localizing discriminative regions in these features without region proposal generation, yielding attentional label-specific features. The label relational inference module finally predicts label existences using label relations reasoned from outputs of the previous module. The proposed network is characterized by its capacities of extracting discriminative label-wise features and reasoning about label relations naturally and interpretably. In our experiments, we evaluate the proposed model on two multilabel aerial image data sets, of which one is newly produced. Quantitative and qualitative results on these two data sets demonstrate the effectiveness of our model. To facilitate progress in multilabel aerial image classification, our produced data set will be made publicly available.

55 citations


Journal ArticleDOI
TL;DR: A new one-stage anchor-free method is proposed to detect orientated objects in a per-pixel prediction fashion with less computational complexity. A new aspect-ratio-aware orientation centerness method is also proposed to better weigh positive pixel points, guiding the network to learn discriminative features from complex backgrounds and improving detection of objects with large aspect ratios.
Abstract: Orientated object detection in aerial images is still a challenging task due to the bird’s eye view and the various scales and arbitrary angles of objects in aerial images. Most current methods for orientated object detection are anchor-based, which require considerable pre-defined anchors and are time consuming. In this article, we propose a new one-stage anchor-free method to detect orientated objects in per-pixel prediction fashion with less computational complexity. Arbitrary orientated objects are detected by predicting the axis of the object, which is the line connecting the head and tail of the object, and the width of the object is vertical to the axis. By predicting objects at the pixel level of feature maps directly, the method avoids setting a number of hyperparameters related to anchor and is computationally efficient. Besides, a new aspect-ratio-aware orientation centerness method is proposed to better weigh positive pixel points, in order to guide the network to learn discriminative features from a complex background, which brings improvements for large aspect ratio object detection. The method is tested on two common aerial image datasets, achieving better performance compared with most one-stage orientated methods and many two-stage anchor-based methods with a simpler procedure and lower computational complexity.
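A heavily hedged sketch of a centerness-style weighting for oriented boxes: the standard centerness term is modulated by the box aspect ratio so that elongated objects are not overly down-weighted along their axis. The exact aspect-ratio-aware formulation in the paper is not reproduced here; both the base term and the modulation are illustrative assumptions.

```python
# Hedged sketch of an aspect-ratio-modulated centerness weight for a pixel
# inside an oriented box; not the paper's exact formulation.
import numpy as np

def oriented_centerness(l, r, t, b, aspect_ratio):
    # l, r, t, b: distances from the pixel to the four box sides, measured
    # along and across the predicted object axis
    base = np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
    # soften the weighting for elongated objects (aspect_ratio > 1) so that
    # positive pixels far from the center along the axis keep a useful weight
    return base ** (1.0 / aspect_ratio) if aspect_ratio > 1 else base
```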

54 citations


Journal ArticleDOI
TL;DR: This paper proposes a residual attention based dense connected convolutional neural network (RADC-Net) that outperforms some state-of-the-art methods with far fewer parameters, and introduces an enhanced classification layer in this framework to further refine the extracted convolutional features and highlight local semantic information.

Journal ArticleDOI
TL;DR: This letter proposes an attention pooling-based dense connected convolutional network (APDC-Net) for aerial scene classification that uses a simplified dense connection structure as the backbone to preserve features from different levels and introduces a multi-level supervision strategy.
Abstract: Deep learning methods have boosted the performance of a series of visual tasks. However, the aerial image scene classification remains challenging. The object distribution and spatial arrangement in aerial scenes are often more complicated than in natural image scenes. Possible solutions include highlighting local semantics relevant to the scene label and preserving more discriminative features. To tackle this challenge, in this letter, we propose an attention pooling-based dense connected convolutional network (APDC-Net) for aerial scene classification. First, it uses a simplified dense connection structure as the backbone to preserve features from different levels. Then, we propose a trainable pooling to down-sample the feature maps and to enhance the local semantic representation capability. Finally, we introduce a multi-level supervision strategy, so that features from different levels are all allowed to supervise the training process directly. Exhaustive experiments on three aerial scene classification benchmarks demonstrate that our proposed APDC-Net outperforms other state-of-the-art methods with much fewer parameters and validate the effectiveness of our attention-based pooling and multi-level supervision strategy.
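A minimal sketch of a trainable attention pooling layer that learns spatial weights and aggregates the feature map with them instead of plain average pooling; the APDC-Net layer itself may be defined differently.

```python
# Hedged sketch of trainable attention pooling over a feature map.
import torch
import torch.nn as nn

class AttentionPool2d(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        # 1x1 conv produces a per-pixel attention score
        self.score = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        w_map = torch.softmax(self.score(x).view(b, 1, h * w), dim=-1)
        return (x.view(b, c, h * w) * w_map).sum(dim=-1)   # (B, C) pooled features
```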

Journal ArticleDOI
TL;DR: A new method named global multi-scale encoder-decoder network (GMEDN), developed with a local and global encoder and a distilling decoder, achieves better performance in building extraction than some existing methods.
Abstract: Semantic segmentation is an important and challenging task in the aerial image community since it can extract the target level information for understanding the aerial image. As a practical application of aerial image semantic segmentation, building extraction always attracts researchers’ attention as the building is a specific land cover in aerial images. There are two key points for building extraction from aerial images. One is learning the global and local features to fully describe the buildings with diverse shapes. The other one is mining the multi-scale information to discover the buildings with different resolutions. Taking these two key points into account, we propose a new method named global multi-scale encoder-decoder network (GMEDN) in this paper. Based on the encoder-decoder framework, GMEDN is developed with a local and global encoder and a distilling decoder. The local and global encoder aims at learning the representative features from the aerial images for describing the buildings, while the distilling decoder focuses on exploring the multi-scale information for the final segmentation masks. Combining them together, the building extraction is accomplished in an end-to-end manner. The effectiveness of our method is validated by experiments conducted on two public aerial image datasets. Compared with some existing methods, our model can achieve better performance.

Journal ArticleDOI
01 Nov 2020-Optik
TL;DR: An enhanced multilayer perceptron (MLP) based on the Adagrad optimizer is employed as a deep classifier in the classification step of this paper, contributing to higher classification performance relative to state-of-the-art methods.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed an end-to-end model, called ARC-Net, which includes residual blocks with asymmetric convolution (RBAC) to reduce the computational cost and to shrink the model size.
Abstract: Automatic building extraction based on high-resolution aerial images has important applications in urban planning and environmental management. In recent years, advances and performance improvements have been achieved in building extraction through the use of deep learning methods. However, existing models focus on improving accuracy through an excessive number of parameters and complex structural designs, resulting in large computational costs during the learning phase and low inference speed. To address these issues, we propose a new, efficient end-to-end model, called ARC-Net. The model includes residual blocks with asymmetric convolution (RBAC) to reduce the computational cost and to shrink the model size. In addition, dilated convolutions and multi-scale pyramid pooling modules are utilized to enlarge the receptive field and to enhance accuracy. We verify the performance and efficiency of the proposed ARC-Net on the INRIA Aerial Image Labeling dataset and WHU building dataset. Compared to available deep learning models, the proposed ARC-Net demonstrates better segmentation performance with less computational cost. This indicates that the proposed ARC-Net is both effective and efficient in automatic building extraction from high-resolution aerial images.
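A minimal sketch of a residual block with asymmetric convolution, factorizing a 3×3 convolution into 1×3 and 3×1 convolutions to cut parameters and computation; the actual RBAC block may differ in depth, normalization, and activation placement.

```python
# Hedged sketch of a residual block built from asymmetric (1x3 / 3x1) convolutions.
import torch
import torch.nn as nn

class AsymResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=(3, 1), padding=(1, 0)),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # residual connection keeps gradients well-behaved while the factorized
        # convolutions use fewer parameters than a full 3x3
        return self.relu(x + self.conv(x))
```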

Journal ArticleDOI
TL;DR: The developed CNN model is applied to aerial images in Chiba, Japan, damaged by the typhoon in September 2019, and the result shows that more than 90% of the building damage is correctly classified by the CNN model.
Abstract: A methodology for the automated identification of building damage from post-disaster aerial images was developed based on convolutional neural networks (CNNs) and building damage inventories. The aerial images and the building damage data obtained in the 2016 Kumamoto and the 1995 Kobe, Japan, earthquakes were analyzed. Since the roofs of many moderately damaged houses are covered with blue tarps immediately after disasters, not only collapsed and non-collapsed buildings but also the buildings covered with blue tarps were identified by the proposed method. The CNN architecture developed in this study correctly classifies the building damage with an accuracy of approximately 95% on both earthquake datasets. We applied the developed CNN model to aerial images in Chiba, Japan, damaged by the typhoon in September 2019. The result shows that more than 90% of the building damage is correctly classified by the CNN model.

Journal ArticleDOI
TL;DR: An image correction methodology is proposed, which first exploits recent machine learning procedures that recover depth from image-based dense point clouds and then corrects refraction on the original imaging dataset, resulting in highly accurate bathymetric maps.
Abstract: Although aerial image-based bathymetric mapping can provide, unlike acoustic or LiDAR (Light Detection and Ranging) sensors, both water depth and visual information, water refraction poses significant challenges for accurate depth estimation. In order to tackle this challenge, we propose an image correction methodology, which first exploits recent machine learning procedures that recover depth from image-based dense point clouds and then corrects refraction on the original imaging dataset. This way, the structure from motion (SfM) and multi-view stereo (MVS) processing pipelines are executed on a refraction-free set of aerial datasets, resulting in highly accurate bathymetric maps. Performed experiments and validation were based on datasets acquired during optimal sea state conditions and derived from four different test-sites characterized by excellent sea bottom visibility and textured seabed. Results demonstrated the high potential of our approach, both in terms of bathymetric accuracy, as well as texture and orthoimage quality.

Posted Content
TL;DR: In this article, a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns is presented, which includes 94,986 high-quality aerial images from 3,432 farmlands across the US, where each image consists of RGB and NIR channels with resolution as high as 10 cm per pixel.
Abstract: The success of deep learning in visual recognition tasks has driven advancements in multiple fields of research. Particularly, increasing attention has been drawn towards its application in agriculture. Nevertheless, while visual pattern recognition on farmlands carries enormous economic values, little progress has been made to merge computer vision and crop sciences due to the lack of suitable agricultural image datasets. Meanwhile, problems in agriculture also pose new challenges in computer vision. For example, semantic segmentation of aerial farmland images requires inference over extremely large-size images with extreme annotation sparsity. These challenges are not present in most of the common object datasets, and we show that they are more challenging than many other aerial image datasets. To encourage research in computer vision for agriculture, we present Agriculture-Vision: a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns. We collected 94,986 high-quality aerial images from 3,432 farmlands across the US, where each image consists of RGB and Near-infrared (NIR) channels with resolution as high as 10 cm per pixel. We annotate nine types of field anomaly patterns that are most important to farmers. As a pilot study of aerial agricultural semantic segmentation, we perform comprehensive experiments using popular semantic segmentation models; we also propose an effective model designed for aerial agricultural pattern recognition. Our experiments demonstrate several challenges Agriculture-Vision poses to both the computer vision and agriculture communities. Future versions of this dataset will include even more aerial images, anomaly patterns and image channels. More information at this https URL.

Journal ArticleDOI
TL;DR: A new framework for automated dead tree detection from aerial images is presented using a re-trained Mask RCNN (Mask Region-based Convolutional Neural Network) approach, with a transfer learning scheme, and eight fine-tuned models are compared.
Abstract: Global climate change has had a drastic impact on our environment. Previous studies showed that pest disasters arising from global climate change may cause a tremendous number of trees to die, and these dead trees inevitably become a factor in forest fires. An important portent of forest fire is therefore the condition of the forest. Aerial image-based forest analysis can give an early detection of dead trees and living trees. In this paper, we apply a synthetic method to enlarge the imagery dataset and present a new framework for automated dead tree detection from aerial images using a re-trained Mask RCNN (Mask Region-based Convolutional Neural Network) approach with a transfer learning scheme. We apply our framework to our aerial imagery datasets and compare eight fine-tuned models. The mean average precision (mAP) score for the best of these models reaches 54%. Following the automated detection, we are able to automatically produce and count dead-tree masks to label the dead trees in an image, as an indicator of forest health that could be linked to the causal analysis of environmental changes and the predictive likelihood of forest fire.

Journal ArticleDOI
TL;DR: This work proposes a novel metric learning method called center-metric learning and couples it with a new kind of loss called positive-negative center loss, which enables CNNs to cope successfully with within-class variations, and makes the first attempt to embed uncertainty regarding similarity into the training process.
Abstract: In recent years, convolutional neural networks (CNNs) have become the predominant method for content-based aerial image retrieval (CBAIR) and aerial scene classification (ASC) due to their overwhelming performance advantages. However, existing CNN-based models have the following shortcomings: first, they do not deal with large intraclass variations, thereby overlooking the possibility of fine-grained retrieval and classification; second, all similarity learning methods for CBAIR consider similarity between two images as a constant, neglecting the fact that image similarity is uncertain in nature; third, similarity learning is separated from ASC, ignoring the advantages of joint optimization. To address these issues, we propose a novel metric learning method called center-metric learning, and couple it with a new kind of loss called positive-negative center loss, which, with the help of several “experts,” enables CNNs to cope successfully with within-class variations. Besides, we propose similarity distribution learning, making the first attempt to embed uncertainty regarding similarity into the training process. The resulting fine-grained similarity predictions can further strengthen CNNs’ fine discrimination ability. Furthermore, three tasks, that is, center-metric learning, similarity distribution learning, and ASC, are incorporated into one CNN, benefitting from one another and leading to a better generalization capability. Just like an eagle, our model is able to discriminate subtle differences among aerial images, hence the name “eagle-eyed multitask CNN.” We carry out extensive experiments over four publicly available aerial image sets and achieve a performance better than all existing methods.

Posted Content
TL;DR: In this paper, the authors collected and released a new TT/PL Aerial-image (TTPLA) dataset, consisting of 1,100 images with the resolution of 3,840×2,160 pixels, as well as manually labeled 8,987 instances of transmission towers and power lines.
Abstract: Accurate detection and segmentation of transmission towers (TTs) and power lines (PLs) from aerial images plays a key role in protecting power-grid security and low-altitude UAV safety. Meanwhile, aerial images of TTs and PLs pose a number of new challenges to the computer vision researchers who work on object detection and segmentation -- PLs are long and thin, and may show similar color as the background; TTs can be of various shapes and most likely made up of line structures of various sparsity; The background scene, lighting, and object sizes can vary significantly from one image to another. In this paper we collect and release a new TT/PL Aerial-image (TTPLA) dataset, consisting of 1,100 images with the resolution of 3,840×2,160 pixels, as well as manually labeled 8,987 instances of TTs and PLs. We develop novel policies for collecting, annotating, and labeling the images in TTPLA. Different from other relevant datasets, TTPLA supports evaluation of instance segmentation, besides detection and semantic segmentation. To build a baseline for detection and segmentation tasks on TTPLA, we report the performance of several state-of-the-art deep learning models on our dataset. TTPLA dataset is publicly available at this https URL

Journal ArticleDOI
TL;DR: An oversampling and stitching data augmentation method is proposed to decrease the negative effect of category imbalance in the training dataset and to construct a new dataset with a balanced number of samples, together with a joint training loss function including a center loss for both horizontal and oriented bounding boxes to reduce the impact of small inter-class diversity on vehicle detection.
Abstract: Vehicles in aerial images generally have small sizes and an unbalanced number of samples, which leads to the poor performance of existing vehicle detection algorithms. Therefore, an oriented vehicle detection framework based on improved Faster RCNN is proposed for aerial images. First of all, we propose an oversampling and stitching data augmentation method to decrease the negative effect of category imbalance in the training dataset and construct a new dataset with a balanced number of samples. Then, considering that the pooling operation may weaken the discriminative ability of features for small objects, we propose to amplify the feature map so that detailed information hidden in the last feature map can be enriched. Finally, we design a joint training loss function including a center loss for both horizontal and oriented bounding boxes, reducing the impact of small inter-class diversity on vehicle detection. The proposed framework is evaluated on the VEDAI dataset, which consists of 9 vehicle categories. The experimental results show that the proposed framework outperforms previous approaches with a mean average precision of 60.4% and 60.1% in detecting horizontal and oriented bounding boxes respectively, which is about 8% better than Faster RCNN.
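A minimal sketch of a center-loss term that pulls per-class features toward learned class centers; how the paper combines it with the horizontal and oriented bounding-box losses is not shown here, and the feature and label shapes are assumptions.

```python
# Hedged sketch of a center loss over per-sample feature vectors.
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        # one learnable center per class
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):            # feats: (N, D), labels: (N,)
        # squared distance of each feature to its class center, averaged over the batch
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()
```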

Journal ArticleDOI
TL;DR: A new aerial image dataset, VAID (Vehicle Aerial Imaging from Drone), is presented for the development and evaluation of vehicle detection algorithms; it contains about 6000 images captured under different traffic conditions and annotated with 7 common vehicle categories for network training and testing.
Abstract: The availability of commercial UAVs and low-cost imaging devices has made airborne imagery popular and widely available. Aerial images are now extensively used for many applications, especially in the area of intelligent transportation systems. In this work, we present a new aerial image dataset, VAID (Vehicle Aerial Imaging from Drone), for the development and evaluation of vehicle detection algorithms. It contains about 6000 images captured under different traffic conditions, and annotated with 7 common vehicle categories for network training and testing. We compare the vehicle detection results using the current state-of-the-art network architectures and various aerial image datasets. The experiments have demonstrated that training the networks using our VAID dataset can provide the best vehicle detection results. Our aerial image dataset is made available publicly at http://vision.ee.ccu.edu.tw/aerialimage/ and the code is available at https://github.com/KaiChun-RVL/VAID_dataset .

Journal ArticleDOI
TL;DR: Competitive results demonstrate that the RDN based on channel-spatial attention for scene classification of a high-resolution remote sensing image can extract more effective features and is more conducive to classifying a scene.
Abstract: The scene classification of a remote sensing image has been widely used in various fields as an important task of understanding the content of a remote sensing image. In particular, a high-resolution remote sensing scene contains rich information and complex content. Considering that the scene content in a remote sensing image is tightly coupled to spatial relationship characteristics, how to design an effective feature extraction network that fully mines the spatial information in a high-resolution remote sensing image directly decides the quality of classification. In recent years, convolutional neural networks (CNNs) have achieved excellent performance in remote sensing image classification, especially the residual dense network (RDN), one of the representative CNN architectures, which shows a stronger feature learning ability as it fully utilizes all the convolutional layer information. Therefore, we design an RDN based on channel-spatial attention for scene classification of a high-resolution remote sensing image. First, multi-layer convolutional features are fused with residual dense blocks. Then, a channel-spatial attention module is added to obtain more effective feature representation. Finally, a softmax classifier is applied to classify the scene after adopting a data augmentation strategy to meet the training requirements of the network parameters. Five experiments are conducted on the UC Merced Land-Use Dataset (UCM) and Aerial Image Dataset (AID), and the competitive results demonstrate that our method can extract more effective features and is more conducive to classifying a scene.
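A minimal sketch of a channel-spatial attention module in the spirit of CBAM (channel attention followed by spatial attention); the module used in the paper may be structured differently.

```python
# Hedged sketch of a channel-then-spatial attention module.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, kernel_size=1), nn.Sigmoid())
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)                               # channel attention
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        return x * self.spatial(s)                            # spatial attention
```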

Posted Content
TL;DR: A new dataset, called Hazy Aerial-Image (HAI) dataset, is introduced that contains more than 65,000 pairs of hazy and ground truth aerial images with realistic, non- homogeneous haze of varying density.
Abstract: Haze removal in aerial images is a challenging problem due to considerable variation in spatial details and varying contrast. Changes in particulate matter density often lead to degradation in visibility. Therefore, several approaches utilize multi-spectral data as auxiliary information for haze removal. In this paper, we propose SkyGAN for haze removal in aerial images. SkyGAN consists of 1) a domain-aware hazy-to-hyperspectral (H2H) module, and 2) a conditional GAN (cGAN) based multi-cue image-to-image translation module (I2I) for dehazing. The proposed H2H module reconstructs several visual bands from RGB images in an unsupervised manner, which overcomes the lack of hazy hyperspectral aerial image datasets. The module utilizes task supervision and domain adaptation in order to create a "hyperspectral catalyst" for image dehazing. The I2I module uses the hyperspectral catalyst along with a 12-channel multi-cue input and performs effective image dehazing by utilizing the entire visual spectrum. In addition, this work introduces a new dataset, called Hazy Aerial-Image (HAI) dataset, that contains more than 65,000 pairs of hazy and ground truth aerial images with realistic, non-homogeneous haze of varying density. The performance of SkyGAN is evaluated on the recent SateHaze1k dataset as well as the HAI dataset. We also present a comprehensive evaluation of HAI dataset with a representative set of state-of-the-art techniques in terms of PSNR and SSIM.

Journal ArticleDOI
04 Aug 2020
TL;DR: This study proposes a cross-domain and cross-view image matching, using a color aerial image and an underwater acoustic image to identify if these images are captured in the same place.
Abstract: Underwater localization is a challenging task due to the lack of a Global Positioning System (GPS). However, the capability to match georeferenced aerial images and acoustic data can help with this task. Autonomous hybrid aerial and underwater vehicles also demand a new localization method capable of combining the perception from both environments. This study proposes a cross-domain and cross-view image matching, using a color aerial image and an underwater acoustic image to identify if these images are captured in the same place. The method is designed to match images acquired in partially structured environments with shared features, such as harbors and marinas. Our pipeline combines traditional image processing methods and deep neural network techniques. Real-world datasets from multiple regions are used to validate our work, obtaining a matching precision of up to 80%.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed an Atrous Spatial Pyramid Pooling (ASPP) module to extract features from multiple dilated convolution layers, and a postprocessing technique was designed to transform the predicted height map of each patch into a seamless height map.
Abstract: Understanding the 3-D geometric structure of the Earth's surface has been an active research topic in photogrammetry and remote sensing community for decades, serving as an essential building block for various applications such as 3-D digital city modeling, change detection, and city management. Previous research studies have extensively studied the problem of height estimation from aerial images based on stereo or multiview image matching. These methods require two or more images from different perspectives to reconstruct 3-D coordinates with camera information provided. In this letter, we deal with the ambiguous and unsolved problem of height estimation from a single aerial image. Driven by the great success of deep learning, especially deep convolutional neural networks (CNNs), some research studies have proposed to estimate height information from a single aerial image by training a deep CNN model with large-scale annotated data sets. These methods treat height estimation as a regression problem and directly use an encoder-decoder network to regress the height values. In this letter, we propose to divide height values into spacing-increasing intervals and transform the regression problem into an ordinal regression problem, using an ordinal loss for network training. To enable multiscale feature extraction, we further incorporate an Atrous Spatial Pyramid Pooling (ASPP) module to extract features from multiple dilated convolution layers. After that, a postprocessing technique is designed to transform the predicted height map of each patch into a seamless height map. Finally, we conduct extensive experiments on International Society for Photogrammetry and Remote Sensing (ISPRS) Vaihingen and Potsdam data sets. Experimental results demonstrate significantly better performance of our method compared to state-of-the-art methods.
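A minimal sketch of spacing-increasing discretization followed by ordinal encoding, in the spirit of DORN-style ordinal regression; the exact interval scheme and loss used in the paper may differ, and the shift value is an assumption to keep the logarithm well-defined at small heights.

```python
# Hedged sketch: spacing-increasing height intervals and ordinal label encoding.
import numpy as np

def si_thresholds(h_min, h_max, num_bins, eps=1.0):
    # bin edges grow log-uniformly (after a small shift) so low heights get
    # finer intervals than high ones
    i = np.arange(num_bins + 1)
    return np.exp(np.log(h_min + eps) +
                  i * (np.log(h_max + eps) - np.log(h_min + eps)) / num_bins) - eps

def ordinal_encode(height, thresholds):
    # ordinal label: 1 for every interior threshold the height exceeds;
    # the sum of the bits identifies the bin index
    return (height > thresholds[1:-1]).astype(np.float32)
```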

Proceedings ArticleDOI
30 Mar 2020
TL;DR: This work proposes TEMPO as a novel generative learning-based framework for efficient and accurate 3D aerial image prediction and demonstrates that it can obtain up to 1170x speedup compared with rigorous simulation while achieving satisfactory accuracy.
Abstract: With the continuous shrinking of the semiconductor device dimensions, mask topography effects stand out among the major factors influencing the lithography process. Including these effects in the lithography optimization procedure has become necessary for advanced technology nodes. However, conventional rigorous simulation for mask topography effects is extremely computationally expensive for high accuracy. In this work, we propose TEMPO as a novel generative learning-based framework for efficient and accurate 3D aerial image prediction. At its core, TEMPO comprises a generative adversarial network capable of predicting aerial image intensity at different resist heights. Compared to the default approach of building a unique model for each desired height, TEMPO takes as one of its inputs the desired height to produce the corresponding aerial image. In this way, the global model in TEMPO can capture the shared behavior among different heights, thus, resulting in smaller model size. Besides, across-height information sharing results in better model accuracy and generalization capability. Our experimental results demonstrate that TEMPO can obtain up to 1170x speedup compared with rigorous simulation while achieving satisfactory accuracy.

Journal ArticleDOI
TL;DR: This work proposes a novel encoder-decoder based architecture that introduces the recursive feature pyramid into a single-stage object detection framework; it achieves state-of-the-art accuracy while running very fast and is more robust to adversarial image patch attacks.

Journal ArticleDOI
TL;DR: A novel method precisely matches two aerial images obtained in different environments via a two-stream deep network that internally augments the target image; this alleviates asymmetric matching results and enables a significant improvement in performance by fusing the two outcomes.
Abstract: In this paper, we propose a novel method to precisely match two aerial images that were obtained in different environments via a two-stream deep network. By internally augmenting the target image, the network considers the two-stream with the three input images and reflects the additional augmented pair in the training. As a result, the training process of the deep network is regularized and the network becomes robust for the variance of aerial images. Furthermore, we introduce an ensemble method that is based on the bidirectional network, which is motivated by the isomorphic nature of the geometric transformation. We obtain two global transformation parameters without any additional network or parameters, which alleviate asymmetric matching results and enable significant improvement in performance by fusing two outcomes. For the experiment, we adopt aerial images from Google Earth and the International Society for Photogrammetry and Remote Sensing (ISPRS). To quantitatively assess our result, we apply the probability of correct keypoints (PCK) metric, which measures the degree of matching. The qualitative and quantitative results show the sizable gap of performance compared to the conventional methods for matching the aerial images. All code and our trained model, as well as the dataset are available online.