Journal ArticleDOI

Attention Mechanism Cloud Detection With Modified FCN for Infrared Remote Sensing Images

22 Oct 2021-IEEE Access (IEEE)-Vol. 9, pp 150975-150983
TL;DR: Zhang et al. propose a compact attention mechanism cloud detection network (AM-CDN) based on a modified FCN that refines and fuses multi-scale features for on-orbit cloud detection.
Abstract: Semantic segmentation (SS) has been widely applied for cloud detection (CD) in remote sensing images (RSIs) with high spatial and spectral resolution because of its effective pixel-level feature extraction structure. However, the typical lightweight SS model, the fully convolutional network (FCN) with only seven layers, has difficulty extracting high-level features, while the heavy pyramid scene parsing network (PSPNet), with its complicated calculations, is impractical for real-time CD, let alone on-orbit CD. In view of these problems, we propose a compact attention mechanism cloud detection network (AM-CDN) based on a modified FCN that refines and fuses multi-scale features for on-orbit CD. Specifically, taking the FCN as the baseline, our model increases the number of hidden layers and adds residual connections between the input and output to eliminate network degradation and extract advanced context feature maps effectively. To expand the receptive field without losing spatial information, the ordinary convolutions in the FCN are replaced by dilated convolutions in AM-CDN. Inspired by the selective kernels of human vision, we introduce a convolutional attention mechanism (AM) into the encoder to adaptively adjust the receptive field and highlight key texture features. According to experimental results on Landsat-8 infrared RSIs, the accuracy of the proposed CD method is 95.31%, which is 10.17% higher than that of the FCN, while the computational complexity of AM-CDN is only 7.63% of that of PSPNet.
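
To make the architectural description concrete, below is a minimal PyTorch sketch, written for this summary rather than taken from the paper, of a building block combining the three ingredients the abstract names: extra depth with a residual connection, dilated convolution for a wider receptive field, and a convolutional attention gate. All layer sizes and the attention design are illustrative assumptions.

```python
# Illustrative sketch only -- not the authors' AM-CDN code.
import torch
import torch.nn as nn

class ResidualDilatedAttentionBlock(nn.Module):
    def __init__(self, channels: int, dilation: int = 2, reduction: int = 4):
        super().__init__()
        # Dilated conv expands the receptive field without downsampling.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # A lightweight convolutional attention gate that re-weights channels.
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv(x)
        y = y * self.attention(y)   # highlight informative texture features
        return x + y                # residual connection against degradation

x = torch.randn(1, 32, 64, 64)
print(ResidualDilatedAttentionBlock(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```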
Citations
Journal ArticleDOI
TL;DR: A complete YOLO-based ship detection method (CYSDM) for TIRSIs under complex backgrounds is proposed, in which an improved YOLOv5s model detects ship candidate areas quickly.
Abstract: Automatic ship detection for thermal infrared remote sensing images (TIRSIs) is of great significance because of its broad applicability in maritime security, port management, and target searching, especially at night. Most ship detection algorithms use manual features to detect accurately cut visible image blocks, and they are limited by illumination, clouds, and strong atmospheric waves in practical applications. In this paper, a complete YOLO-based ship detection method (CYSDM) for TIRSIs under complex backgrounds is proposed. In addition, thermal infrared ship datasets were built using the SDGSAT-1 thermal imaging system. First, to avoid the loss of texture characteristics during large-scale deep convolution, the TIRSIs with a resolution of 30 m were up-sampled to 10 m via the bicubic interpolation method. Then, complete ships with similar characteristics were selected and marked in the middle of rivers, bays, and the sea. To enrich the datasets, a gray-value stretching module was also added. Finally, the improved YOLOv5s model was used to detect the ship candidate areas quickly. To reduce intra-class variation, ships with aspect ratios of 4.23–7.53 were manually selected during labeling, and 8–10.5 μm ship datasets were constructed. Test results show that the precision of CYSDM is 98.68%, which is 9.07% higher than that of the YOLOv5s algorithm. CYSDM provides an effective reference for large-scale, all-day ship detection.

25 citations
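
As a concrete illustration of the preprocessing steps named in the abstract, here is a hedged Python sketch, our own reading rather than the paper's code, of 3x bicubic up-sampling (30 m to 10 m) followed by linear gray-value stretching; the percentile bounds are assumptions.

```python
# Illustrative preprocessing sketch; percentile clip bounds are assumed.
import numpy as np
import torch
import torch.nn.functional as F

def preprocess_tir(image: np.ndarray) -> np.ndarray:
    """image: 2-D thermal infrared array (H, W), arbitrary radiometric range."""
    t = torch.from_numpy(image).float()[None, None]        # (1, 1, H, W)
    t = F.interpolate(t, scale_factor=3, mode="bicubic",   # 30 m -> 10 m
                      align_corners=False)
    arr = t[0, 0].numpy()
    # Gray-value stretching: clip to the 2nd/98th percentiles, rescale to 0-255.
    lo, hi = np.percentile(arr, (2, 98))
    arr = np.clip((arr - lo) / (hi - lo + 1e-6), 0.0, 1.0)
    return (arr * 255).astype(np.uint8)

print(preprocess_tir(np.random.rand(100, 100) * 300).shape)  # (300, 300)
```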

Journal ArticleDOI
01 May 2022
TL;DR: Wang et al. propose a multi-level pixel spatial attention network for thermal image segmentation in which edge and small-target features are fused with the backbone's output features and supervised by specialized loss functions.
Abstract: Thermal image segmentation has received wide attention as a way to alleviate the limitations that challenging environmental conditions impose on visible-spectrum imaging. Existing thermal image segmentation methods struggle to produce high-quality results because thermal images lack color information, have unclear edges, and show little detail. To this end, we propose a multi-level pixel spatial attention network for thermal image segmentation. Specifically, we design a pixel spatial attention module (MPAM) on each layer of the backbone network, which recovers more spatial detail and maintains more semantic information. We then design an edge extraction module (EEM) and a small target extraction module (STEM), which enhance the network's edge and small-target features by modeling them explicitly. Finally, the edge and small-target features are fused with the output features of the backbone, and specialized loss functions are used to supervise them. Experimental results on the SCUT-SEG, SODA, and STI-Cityscapes datasets demonstrate that our approach improves on other state-of-the-art algorithms by 2.2% in the same scenes.

4 citations
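
A minimal sketch, under our own assumptions rather than the paper's definition, of a per-pixel spatial attention gate of the kind attached to each backbone level above: a 1x1 convolution predicts a per-pixel weight map that rescales the features.

```python
# Illustrative pixel-wise spatial attention gate (not the paper's MPAM code).
import torch
import torch.nn as nn

class PixelSpatialAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),  # one weight per pixel
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.score(x)  # emphasize spatial detail, keep semantics

feats = torch.randn(1, 64, 32, 32)
print(PixelSpatialAttention(64)(feats).shape)  # torch.Size([1, 64, 32, 32])
```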

Journal ArticleDOI
TL;DR: Wang et al. propose a cloud detection method for satellite cloud images based on fused FCN features that combines spatial and high-level semantic information, with a voting ensemble strategy to improve the accuracy and robustness of cloud detection.
Abstract: Cloud detection for satellite cloud images is a challenging image processing task owing to the blurring of cloud boundaries and the multiplicity and complexity of cloud types. Commonly used cloud detection methods include the original fully convolutional network (original FCN), the FCN with an 8-pixel stride (FCN-8s), the FCN with a 2-pixel stride (FCN-2s), and so on. However, these methods rely exclusively on a single network layer, the final-layer feature map; thus, shallow cloud image information, such as cloud profile information, may not be captured. In this letter, a cloud detection method for satellite cloud images based on fused FCN features is proposed. The proposed method effectively fuses spatial and high-level semantic information, and a voting ensemble strategy is used to improve the accuracy and robustness of cloud detection. Experimental results demonstrate that the average overall accuracy (OA), average producer's accuracy (PA), and average user's accuracy (UA) of the proposed method, over multiple training sample sizes and image sizes on the collected Fengyun satellite (FY-2G) cloud image database, increased by 7.15%, 9.04%, and 8.46%, respectively, relative to the average accuracies of the original FCN, FCN-8s, FCN-2s, SegNet, and DeepLabV3 methods.

1 citation
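
The voting-ensemble idea mentioned above reduces to a pixel-wise majority vote over the masks produced by several segmentation models; the sketch below is our own minimal illustration, not the paper's implementation.

```python
# Illustrative majority-vote ensemble over binary cloud masks.
import torch

def majority_vote(masks: list[torch.Tensor]) -> torch.Tensor:
    """masks: list of binary (H, W) predictions from different FCN variants."""
    stacked = torch.stack(masks).float()          # (n_models, H, W)
    return (stacked.mean(dim=0) > 0.5).long()     # 1 where most models agree

preds = [torch.randint(0, 2, (4, 4)) for _ in range(5)]
print(majority_vote(preds))
```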

Journal ArticleDOI
TL;DR: In this article, a novel framework named cascaded dense dilated network (CDD-Net), which combines DenseNet, ASPP, and PointRend, is proposed for RRL extraction from VHR images.
Abstract: Accurate recognition and extraction of rural residential land (RRL) is significant for the scientific planning, utilization, and management of rural land. Very-high-resolution (VHR) unmanned aerial vehicle (UAV) images and deep learning techniques can provide data and methodological support for this target. However, RRL, as a complex land-use assemblage, exhibits features at different scales in VHR images, along with complex impervious layers and backgrounds such as natural surfaces and tree shadows in rural areas. How to handle multi-scale features and extract accurate edges in such scenarios still requires further research. In response to these problems, a novel framework named cascaded dense dilated network (CDD-Net), which combines DenseNet, ASPP, and PointRend, is proposed for RRL extraction from VHR images. The advantages of the proposed framework are as follows. First, DenseNet is used as the feature extraction network, allowing feature reuse and a better network design with fewer parameters. Second, the ASPP module better handles multi-scale features. Third, PointRend is added to the model to improve the segmentation accuracy of the edges. The study takes a plain-area village in China as the research area. Experimental results show that the precision, recall, F1 score, and Dice coefficient of our approach are 91.41%, 93.86%, 92.62%, and 0.8359, respectively, higher than those of the other advanced models used for comparison. The framework is feasible for high-precision extraction of RRL from VHR UAV images and could provide technical support for rural land planning, analysis, and the formulation of land management policies.
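
Since the abstract leans on ASPP for multi-scale features, here is a minimal atrous spatial pyramid pooling sketch; the dilation rates are the common DeepLab choices and are an assumption here, not taken from the paper.

```python
# Illustrative ASPP module; rates (1, 6, 12, 18) are assumed defaults.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3 if r > 1 else 1,
                      padding=r if r > 1 else 0, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Parallel dilated convs see the same input at different scales.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

print(ASPP(256, 64)(torch.randn(1, 256, 32, 32)).shape)  # (1, 64, 32, 32)
```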
References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: The authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously and won first place in the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40], but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won 1st place in the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations
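
The residual idea above reduces to a few lines: the stacked layers learn F(x) and the block outputs F(x) + x. Below is a simplified basic block in PyTorch (no stride or projection shortcut), a sketch rather than the paper's exact architecture.

```python
# Simplified ResNet basic block: output = ReLU(F(x) + x).
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.f(x) + x)  # identity shortcut eases optimization

print(BasicBlock(64)(torch.randn(1, 64, 8, 8)).shape)  # (1, 64, 8, 8)
```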

Book ChapterDOI
05 Oct 2015
TL;DR: Ronneberger et al. propose a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently; the network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.
Abstract: There is broad consensus that successful training of deep networks requires many thousands of annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC), we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast: segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .

49,590 citations
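
A one-level toy version of the architecture described above, showing the contracting path for context and the expanding path with a skip connection for localization; channel counts are illustrative, and the real network has more levels.

```python
# Toy one-level U-Net sketch; the published network is much deeper.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, base=16, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        # Decoder sees upsampled features concatenated with the skip.
        self.dec = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(base, n_classes, 1)

    def forward(self, x):
        e = self.enc(x)                       # contracting path
        m = self.mid(self.down(e))
        d = self.up(m)
        d = self.dec(torch.cat([d, e], 1))    # skip connection for localization
        return self.head(d)

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)  # (1, 2, 64, 64)
```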

Journal ArticleDOI
18 Jun 2018
TL;DR: This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
Abstract: The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing both spatial and channel-wise information within local receptive fields at each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in performance for existing state-of-the-art CNNs at slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission, which won first place and reduced the top-5 error to 2.251 percent, surpassing the winning entry of 2016 by a relative improvement of ~25 percent. Models and code are available at https://github.com/hujie-frank/SENet .

14,807 citations
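
The SE block is compact enough to state in code; the following PyTorch version follows the squeeze (global average pooling) and excitation (two-layer bottleneck with sigmoid) steps described above, with the paper's default reduction ratio of 16.

```python
# Squeeze-and-Excitation block, following the paper's description.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))       # squeeze: (B, C) channel descriptors
        return x * w.view(b, c, 1, 1)         # excite: channel-wise rescaling

print(SEBlock(64)(torch.randn(2, 64, 16, 16)).shape)  # (2, 64, 16, 16)
```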

Journal ArticleDOI
TL;DR: Quantitative assessments show that SegNet provides good performance with competitive inference time and the most memory-efficient inference compared to other architectures, including FCN and DeconvNet.
Abstract: We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network and a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1]. The role of the decoder network is to map the low-resolution encoder feature maps to full-input-resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower-resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well-known DeepLab-LargeFOV [3] and DeconvNet [4] architectures. This comparison reveals the memory-versus-accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications; hence, it is designed to be efficient in terms of both memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and the most memory-efficient inference compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/ .

13,468 citations
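
The key mechanic described above, reusing encoder max-pooling indices for non-linear upsampling in the decoder, maps directly onto PyTorch's MaxPool2d/MaxUnpool2d pair; shapes in this sketch are illustrative.

```python
# Sketch of SegNet-style index-based unpooling.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 8, 32, 32)
pooled, indices = pool(x)          # encoder: keep locations of the maxima
sparse = unpool(pooled, indices)   # decoder: place values back, zeros elsewhere
# A trainable conv then densifies the sparse map.
dense = nn.Conv2d(8, 8, 3, padding=1)(sparse)
print(dense.shape)                 # torch.Size([1, 8, 32, 32])
```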

Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper exploits the capability of global context information by different-region-based context aggregation through the pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet) to produce good quality results on the scene parsing task.
Abstract: Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective for producing good-quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in the ImageNet scene parsing challenge 2016, the PASCAL VOC 2012 benchmark, and the Cityscapes benchmark. A single PSPNet yields a new record of 85.4% mIoU on PASCAL VOC 2012 and 80.2% accuracy on Cityscapes.

10,189 citations
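
A minimal sketch of the pyramid pooling module at the heart of PSPNet: the feature map is average-pooled to several grid sizes (the paper uses 1, 2, 3, and 6), each level is reduced by a 1x1 convolution, upsampled back, and concatenated with the input. The channel bookkeeping here is an illustrative simplification.

```python
# Illustrative pyramid pooling module; channel split is a simplification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch: int, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), 1))
            for b in bins
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        priors = [F.interpolate(s(x), size=(h, w), mode="bilinear",
                                align_corners=False) for s in self.stages]
        return torch.cat([x] + priors, dim=1)  # global context + local features

print(PyramidPooling(512)(torch.randn(1, 512, 16, 16)).shape)  # (1, 1024, 16, 16)
```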

Trending Questions (2)
Is the attention mechanism used as a post-processing step or in the FCN encoder for remote sensing?

The attention mechanism is used in the FCN encoder for cloud detection in remote sensing images.

Is the attention mechanism used as a post-processing step or in the FCN encoder?

The attention mechanism is used in the FCN encoder to adaptively adjust the receptive field and highlight key texture features.