Author

Jian Su

Bio: Jian Su is an academic researcher from Nanjing University of Information Science and Technology. The author has contributed to research in topics: Feature extraction & Deep learning. The author has an h-index of 1 and has co-authored 3 publications receiving 1 citation.

Papers
Journal ArticleDOI
TL;DR: Wang et al. propose a multiscale attention network (MSA-Network) that integrates a multiscale (MS) module and a channel and position attention (CPA) module to boost the performance of remote sensing scene classification.
Abstract: Remote sensing scene image classification is of great value to both civil and military fields. Deep learning models, especially convolutional neural networks (CNNs), have achieved great success in this task, but they may suffer from two challenges. First, the sizes of category objects usually differ, while a conventional CNN extracts features with a fixed convolution extractor, which can cause it to fail to learn multiscale features. Second, some image regions are not useful during feature learning, so guiding the network to select and focus on the most relevant regions is vital for remote sensing scene image classification. To address these two challenges, we propose a multiscale attention network (MSA-Network), which integrates a multiscale (MS) module and a channel and position attention (CPA) module to boost the performance of remote sensing scene classification. The proposed MS module learns multiscale features by adopting sliding windows of various sizes over layers of different depths and receptive fields. The CPA module is composed of two parts: a channel attention (CA) module and a position attention (PA) module. The CA module learns global attention features at the channel level, and the PA module extracts local attention features at the pixel level. By fusing these two attention features, the network learns to focus automatically on the more critical and salient regions. Extensive experiments on the UC Merced, AID, and NWPU-RESISC45 datasets demonstrate that the proposed MSA-Network outperforms several state-of-the-art methods.
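
As a rough illustration of the attention design the abstract describes, here is a minimal PyTorch sketch of a channel and position attention (CPA) block; the squeeze-excite style channel branch, the 7x7 spatial branch, and the additive fusion are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global (channel-level) attention: squeeze-and-excite style reweighting."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)          # reweight each feature map

class PositionAttention(nn.Module):
    """Local (pixel-level) attention: a spatial mask over the feature map."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.conv(x)        # reweight each spatial position

class CPABlock(nn.Module):
    """Fuse channel- and position-attended features, as the abstract describes."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.pa = PositionAttention(channels)

    def forward(self, x):
        return self.ca(x) + self.pa(x)

feats = torch.randn(2, 256, 32, 32)    # a batch of CNN feature maps
out = CPABlock(256)(feats)             # same shape, attention-refined
```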

27 citations

Journal ArticleDOI
TL;DR: In this article, an unsupervised multiattention-guided network named UMAG-Net is proposed to fuse a low-resolution hyperspectral image (HSI) with a high-resolution (HR) multispectral image (MSI) of the same scene.
Abstract: To reconstruct images with high spatial resolution and high spectral resolution, one of the most common methods is to fuse a low-resolution hyperspectral image (HSI) with a high-resolution (HR) multispectral image (MSI) of the same scene. Deep learning has been widely applied to HSI-MSI fusion, but it is limited by hardware. To break these limits, we construct an unsupervised multiattention-guided network named UMAG-Net, which requires no training data, to better accomplish HSI-MSI fusion. UMAG-Net first extracts deep multiscale features of the MSI using a multiattention encoding network. Then, a loss function containing the HSI-MSI pair is used to iteratively update the parameters of UMAG-Net and learn prior knowledge of the fused image. Finally, a multiscale feature-guided network is constructed to generate an HR-HSI. Experimental results show the visual and quantitative superiority of the proposed method over competing methods.
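
To make the unsupervised, training-data-free fitting idea concrete, here is a minimal PyTorch sketch of optimizing a network against a single HSI/MSI pair; the bilinear downsampling, the spectral response matrix `srf`, and the L1 losses are assumptions for illustration, not UMAG-Net's actual loss or architecture.

```python
import torch
import torch.nn.functional as F

def fuse_unsupervised(lr_hsi, hr_msi, net, srf, scale, steps=2000, lr=1e-3):
    """Fit `net` to a single HSI/MSI pair so its output agrees with both observations.

    lr_hsi: (1, B, h, w) low-resolution hyperspectral image
    hr_msi: (1, b, H, W) high-resolution multispectral image, H = scale * h
    net:    maps the HR MSI to a candidate HR HSI of shape (1, B, H, W)
    srf:    assumed (b, B) spectral response matrix relating HSI bands to MSI bands
    """
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        hr_hsi = net(hr_msi)                                    # candidate fused image
        # Spatial consistency: downsampled estimate should match the observed LR HSI.
        est_lr = F.interpolate(hr_hsi, scale_factor=1 / scale, mode="bilinear")
        # Spectral consistency: SRF-projected estimate should match the observed HR MSI.
        est_msi = torch.einsum("cb,nbhw->nchw", srf, hr_hsi)
        loss = F.l1_loss(est_lr, lr_hsi) + F.l1_loss(est_msi, hr_msi)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net(hr_msi).detach()
```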

24 citations

Journal ArticleDOI
TL;DR: In this article, a label softening strategy is introduced to soften the binary label matrix, providing more freedom for label fitting, and a regularization term based on manifold learning is introduced to counter the overfitting caused by label softening.
Abstract: Driven by the needs of practical applications, multiple sensors are often used for data acquisition to realize a multimodal description of the same object. How to effectively fuse multimodal data has become a challenging problem in different scenarios, including remote sensing. Nonsparse multikernel learning has found many successful applications in multimodal data fusion because it makes full use of multiple kernels. Most existing models assume that the nonsparse combination of multiple kernels is infinitely close to a strict binary label matrix during the training process. However, this assumption is so strict that label fitting has very little freedom. To address this issue, we develop a novel nonsparse multikernel model for multimodal data fusion. Specifically, we introduce a label softening strategy that softens the binary label matrix and provides more freedom for label fitting. Additionally, we introduce a regularization term based on manifold learning to counter the overfitting caused by label softening. Experimental results on one synthetic dataset, several UCI multimodal datasets, and one multimodal remote sensing dataset demonstrate the promising performance of the proposed model.
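
A toy NumPy sketch of the two ingredients the abstract highlights, label softening and a manifold (graph Laplacian) regularizer on top of a nonsparse kernel combination; the uniform kernel weights, the epsilon softening rule, and the closed-form Laplacian-regularized ridge solve are simplifications, not the paper's optimization procedure.

```python
import numpy as np

def combined_kernel(kernels, weights):
    """Nonsparse combination: every base kernel keeps a nonzero weight."""
    return sum(w * K for w, K in zip(weights, kernels))

def soften_labels(Y, eps=0.1):
    """Relax the strict 0/1 label matrix so that label fitting has more freedom."""
    return Y * (1 - eps) + (1 - Y) * eps / (Y.shape[1] - 1)

def fit(kernels, Y, lam=1.0, gamma=0.1, eps=0.1):
    """Laplacian-regularized kernel ridge fit against the softened labels."""
    n = Y.shape[0]
    K = combined_kernel(kernels, np.full(len(kernels), 1.0 / len(kernels)))
    S = soften_labels(Y.astype(float), eps)
    A = np.maximum(K, 0.0)                               # similarity graph from the kernel
    L = np.diag(A.sum(axis=1)) - A                       # graph Laplacian (manifold term)
    alpha = np.linalg.solve(K + lam * np.eye(n) + gamma * L @ K, S)
    return K @ alpha                                     # fitted soft label scores

# Toy usage: two precomputed (n x n) kernels for n = 6 samples, 2 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
K_lin = X @ X.T
K_rbf = np.exp(-np.square(X[:, None] - X[None]).sum(-1))
Y = np.eye(2)[[0, 0, 0, 1, 1, 1]]                        # one-hot labels
print(fit([K_lin, K_rbf], Y).round(2))
```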

3 citations


Cited by
Journal ArticleDOI
Di Wang, Jing Zhang, Bo Du, Gui-Song Xia, Dacheng Tao 
TL;DR: Empirical study shows that RSP can help deliver distinctive performance in scene recognition tasks and in perceiving RS-related semantics such as “Bridge” and “Airplane”, and finds that, although RSP mitigates the data discrepancies of traditional ImageNet pretraining on RS images, it may still suffer from task discrepancies, where downstream tasks require representations different from those for scene recognition tasks.
Abstract: Deep learning has largely reshaped remote sensing (RS) research for aerial image understanding and has achieved great success. Nevertheless, most existing deep models are initialized with ImageNet pretrained weights, and since natural images inevitably present a large domain gap relative to aerial images, this probably limits the fine-tuning performance on downstream aerial scene tasks. This issue motivates us to conduct an empirical study of remote sensing pretraining (RSP) on aerial images. To this end, we train different networks from scratch on the largest RS scene recognition dataset to date, MillionAID, to obtain a series of RS pretrained backbones, including both convolutional neural networks (CNNs) and vision transformers such as Swin and ViTAE, which have shown promising performance on computer vision tasks. We then investigate the impact of RSP on representative downstream tasks, including scene recognition, semantic segmentation, object detection, and change detection, using these CNN and vision transformer backbones. The empirical study shows that RSP can help deliver distinctive performance in scene recognition tasks and in perceiving RS-related semantics such as “Bridge” and “Airplane”. We also find that, although RSP mitigates the data discrepancies of traditional ImageNet pretraining on RS images, it may still suffer from task discrepancies, where downstream tasks require representations different from those for scene recognition. These findings call for further research efforts on both large-scale pretraining datasets and effective pretraining methods. The code and pretrained models will be released at https://github.com/ViTAE-Transformer/ViTAE-Transformer-Remote-Sensing.
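
As a hedged sketch of how such RS-pretrained backbones would typically be reused downstream, the snippet below loads a hypothetical checkpoint into a torchvision ResNet-50 and swaps the classification head; the checkpoint filename and the ResNet-50 stand-in are assumptions, and the actual released RSP models live in the repository linked above.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Hypothetical checkpoint name; the actual RSP backbones (CNNs, Swin, ViTAE) are
# released in the repository linked above.
backbone = resnet50(weights=None)
state = torch.load("rsp_millionaid_resnet50.pth", map_location="cpu")
backbone.load_state_dict(state, strict=False)              # RS-pretrained weights
backbone.fc = nn.Linear(backbone.fc.in_features, 45)       # e.g. NWPU-RESISC45 head
# Fine-tune `backbone` as usual on the downstream scene recognition dataset.
```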

44 citations

Journal ArticleDOI
TL;DR: In this paper, a literature survey is conducted to analyze the trends of multimodal remote sensing data fusion, and some prevalent sub-fields in multimodal RS data fusion are reviewed in terms of the to-be-fused data modalities.
Abstract: With the extremely rapid advances in remote sensing (RS) technology, a great quantity of Earth observation (EO) data featuring considerable and complicated heterogeneity is now readily available, offering researchers an opportunity to tackle current geoscience applications in a fresh way. With the joint utilization of EO data, much research on multimodal RS data fusion has made tremendous progress in recent years, yet traditional algorithms inevitably meet a performance bottleneck due to their limited ability to comprehensively analyse and interpret such strongly heterogeneous data. This limitation further arouses an intense demand for an alternative tool with powerful processing competence. Deep learning (DL), as a cutting-edge technology, has achieved remarkable breakthroughs in numerous computer vision tasks owing to its impressive ability in data representation and reconstruction. Naturally, it has been successfully applied to multimodal RS data fusion, yielding great improvements over traditional methods. This survey aims to present a systematic overview of DL-based multimodal RS data fusion. More specifically, some essential knowledge about the topic is first given. Subsequently, a literature survey is conducted to analyse the trends of the field. Some prevalent sub-fields in multimodal RS data fusion are then reviewed in terms of the to-be-fused data modalities, i.e., spatiospectral, spatiotemporal, light detection and ranging-optical, synthetic aperture radar-optical, and RS-Geospatial Big Data fusion. Furthermore, we collect and summarize some valuable resources to support the development of multimodal RS data fusion. Finally, the remaining challenges and potential future directions are highlighted.

39 citations

Journal ArticleDOI
TL;DR: To handle the large image sizes and objects of various orientations in RS images, a new rotated varied-size window attention is proposed to replace the original full attention in transformers, which reduces the computational cost and memory footprint while learning better object representations by extracting rich context from the generated diverse windows.
Abstract: Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers (ViTs) being the primary choice due to their good scalability and representation ability. However, large-scale models have not yet been sufficiently explored in remote sensing (RS). In this article, we resort to plain ViTs with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks, investigating how such large models perform. To handle the large image sizes and objects of arbitrary orientations in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which can significantly reduce the computational cost and memory footprint while learning better object representations by extracting rich context from the generated diverse windows. Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mean average precision (mAP) on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also show competitive performance compared with existing advanced methods. Further experiments show the advantages of our models in terms of computational complexity and data efficiency in transferring. The code and models will be released at https://github.com/ViTAE-Transformer/Remote-Sensing-RVSA.
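
The sketch below is a heavily simplified, assumption-laden PyTorch rendering of the rotated varied-size window idea: queries come from fixed non-overlapping windows, while each window predicts a scale and rotation used to resample its key/value window from the feature map. The parameterization, pooling, and head layout are illustrative only and are not the released RVSA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotatedVariedWindowAttention(nn.Module):
    """Simplified sketch: per-window scale/rotation resamples key/value windows."""

    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.win, self.heads = window, heads
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        self.to_transform = nn.Linear(dim, 2)   # per-window (log-scale, rotation angle)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    @staticmethod
    def _windows(t, w):                         # (B, C, H, W) -> (B*nW, w*w, C)
        B, C, H, W = t.shape
        t = t.view(B, C, H // w, w, W // w, w)
        return t.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)

    def forward(self, x):                       # x: (B, C, H, W), H and W divisible by window
        B, C, H, W = x.shape
        w, nh = self.win, self.heads
        nWh, nWw = H // w, W // w
        q, k, v = self.qkv(x).chunk(3, dim=1)

        # One (scale, angle) pair per window, predicted from the pooled window feature.
        pooled = F.adaptive_avg_pool2d(x, (nWh, nWw)).permute(0, 2, 3, 1)
        params = self.to_transform(pooled)                      # (B, nWh, nWw, 2)
        scale = params[..., 0].exp().clamp(0.5, 2.0)            # varied window size
        theta = params[..., 1]                                  # window rotation

        # Build one w x w sampling grid per window in normalized [-1, 1] coordinates.
        ys = (torch.arange(nWh, device=x.device) + 0.5) * w / H * 2 - 1
        xs = (torch.arange(nWw, device=x.device) + 0.5) * w / W * 2 - 1
        cy, cx = torch.meshgrid(ys, xs, indexing="ij")          # window centers
        oy, ox = torch.meshgrid(
            torch.linspace(-1, 1, w, device=x.device) * (w / H),
            torch.linspace(-1, 1, w, device=x.device) * (w / W),
            indexing="ij",
        )                                                       # in-window offsets
        cos_t, sin_t = torch.cos(theta), torch.sin(theta)
        gx = cx[..., None, None] + scale[..., None, None] * (
            cos_t[..., None, None] * ox - sin_t[..., None, None] * oy)
        gy = cy[..., None, None] + scale[..., None, None] * (
            sin_t[..., None, None] * ox + cos_t[..., None, None] * oy)
        grid = torch.stack([gx, gy], dim=-1).view(B * nWh * nWw, w, w, 2)

        # Sample rotated/rescaled key and value windows, then do windowed attention.
        k_rep = k.repeat_interleave(nWh * nWw, dim=0)
        v_rep = v.repeat_interleave(nWh * nWw, dim=0)
        k_win = F.grid_sample(k_rep, grid, align_corners=False).flatten(2).transpose(1, 2)
        v_win = F.grid_sample(v_rep, grid, align_corners=False).flatten(2).transpose(1, 2)
        q_win = self._windows(q, w)                             # (B*nW, w*w, C)

        def split(t):                           # (B*nW, N, C) -> (B*nW, heads, N, C/heads)
            return t.view(t.shape[0], t.shape[1], nh, C // nh).transpose(1, 2)
        attn = F.scaled_dot_product_attention(split(q_win), split(k_win), split(v_win))
        out = attn.transpose(1, 2).reshape(B, nWh, nWw, w, w, C)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return self.proj(out)

x = torch.randn(1, 64, 32, 32)
y = RotatedVariedWindowAttention(dim=64, window=8, heads=4)(x)   # same shape as x
```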

29 citations

Journal ArticleDOI
TL;DR: Experimental results on four public remote sensing datasets demonstrate that the proposed ET-GSNet achieves superior classification performance compared with some state-of-the-art (SOTA) methods.
Abstract: Scene classification is an active research topic in the remote sensing community, and complex spatial layouts with various types of objects bring huge challenges to classification. Convolutional neural network (CNN)-based methods attempt to explore global features by gradually expanding the receptive field, but long-range contextual information is ignored. The vision transformer (ViT) can extract contextual features, but its ability to learn local information is limited, and it has a large computational complexity. In this article, an end-to-end method is proposed that employs a ViT as an excellent teacher for guiding small networks (ET-GSNet) in remote sensing image scene classification. In ET-GSNet, ResNet18 is selected as the student model, which integrates the strengths of the two models via knowledge distillation (KD) without increasing computational complexity. In the KD process, the ViT and ResNet18 are optimized together without independent pretraining; the learning rate of the teacher model gradually decreases until it reaches zero, while the weight coefficient of the KD loss module is doubled. With these procedures, dark knowledge from the teacher model can be transferred to the student model more smoothly. Experimental results on four public remote sensing datasets demonstrate that the proposed ET-GSNet achieves superior classification performance compared with some state-of-the-art (SOTA) methods. In addition, we evaluate ET-GSNet on a fine-grained ship recognition dataset, and the results show that our method generalizes well to different tasks in terms of several metrics.
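
A minimal sketch of the joint distillation setup the abstract outlines, with a ViT teacher and a ResNet18 student trained together, the teacher's learning rate decayed toward zero, and the KD weight doubled during training; the torchvision models, the linear decay, and the doubling schedule are illustrative assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18, vit_b_16

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Soft-target distillation loss, transferring the teacher's 'dark knowledge'."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

num_classes = 45                                       # e.g. NWPU-RESISC45
teacher = vit_b_16(num_classes=num_classes)            # stand-in ViT teacher
student = resnet18(num_classes=num_classes)            # ResNet18 student, as in the abstract
opt_t = torch.optim.SGD(teacher.parameters(), lr=1e-3)
opt_s = torch.optim.SGD(student.parameters(), lr=1e-2)

# Dummy one-batch loader so the sketch runs; replace with a real DataLoader.
loader = [(torch.randn(2, 3, 224, 224), torch.randint(0, num_classes, (2,)))]

epochs, kd_weight = 10, 0.5
for epoch in range(epochs):
    # Teacher LR decays toward zero while the KD weight grows (both schedules are
    # illustrative stand-ins for the procedure described in the abstract).
    opt_t.param_groups[0]["lr"] = 1e-3 * (1 - epoch / epochs)
    if epoch in (epochs // 3, 2 * epochs // 3):
        kd_weight *= 2
    for images, labels in loader:
        t_logits, s_logits = teacher(images), student(images)
        loss = (F.cross_entropy(t_logits, labels)
                + F.cross_entropy(s_logits, labels)
                + kd_weight * kd_loss(s_logits, t_logits.detach()))
        opt_t.zero_grad(); opt_s.zero_grad()
        loss.backward()
        opt_t.step(); opt_s.step()
```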

27 citations

Journal ArticleDOI
TL;DR: A homo-heterogeneous transformer learning (HHTL) framework for RS scene classification is proposed, and encouraging results demonstrate that the HHTL framework can outperform many state-of-the-art methods.
Abstract: Remote sensing (RS) scene classification plays an essential role in the RS community and has attracted increasing attention due to its wide applications. Recently, benefiting from the powerful feature learning capabilities of convolutional neural networks (CNNs), the accuracy of RS scene classification has been significantly improved. Although existing CNN-based methods achieve excellent results, there is still room for improvement. First, CNN-based methods are adept at capturing global information from RS scenes, but the context relationships hidden in RS scenes cannot be thoroughly mined. Second, due to their specific structure, normal CNNs readily exploit the heterogeneous information in RS scenes, whereas the homogeneous information, which is also crucial for comprehensively understanding the complex contents of RS scenes, does not get the attention it deserves. Third, most CNNs focus on establishing relationships between RS scenes and semantic labels, but the similarities between scenes, which help distinguish intra-/interclass samples, are not considered deeply. To overcome these limitations, we propose a homo-heterogeneous transformer learning (HHTL) framework for RS scene classification in this article. First, a patch generation module is designed to generate homogeneous and heterogeneous patches. Then, a dual-branch feature learning module (FLM) is proposed to mine homogeneous and heterogeneous information within RS scenes simultaneously. In the FLM, based on the vision transformer, not only global information but also local areas and their context information can be captured. Finally, we design a classification module consisting of a fusion submodule and a metric-learning submodule, which integrates the homo-heterogeneous information and compacts/separates samples from the same/different RS scene categories. Extensive experiments are conducted on four public RS scene datasets. The encouraging results demonstrate that our HHTL framework can outperform many state-of-the-art methods. Our source code is available at the website below.
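
As a loose sketch of the dual-branch layout described above, the snippet below runs homogeneous and heterogeneous patch embeddings through two small transformer encoders, fuses them, and combines a classification loss with a toy metric-learning term; the patch generation rule, encoder sizes, and the triplet loss are assumptions for illustration, not the HHTL implementation.

```python
import torch
import torch.nn as nn

class DualBranchHHTLSketch(nn.Module):
    """Two transformer-encoder branches (one per patch type), feature fusion, and a
    classification head; dimensions and depths are illustrative assumptions."""
    def __init__(self, dim=256, num_classes=45):
        super().__init__()
        def branch():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.homo_branch, self.hetero_branch = branch(), branch()
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, homo_tokens, hetero_tokens):        # each: (B, N, dim) patch embeddings
        f_homo = self.homo_branch(homo_tokens).mean(dim=1)
        f_hetero = self.hetero_branch(hetero_tokens).mean(dim=1)
        fused = torch.cat([f_homo, f_hetero], dim=1)       # fusion submodule
        return self.head(fused), fused

model = DualBranchHHTLSketch()
logits, feats = model(torch.randn(4, 16, 256), torch.randn(4, 16, 256))
labels = torch.randint(0, 45, (4,))
# Classification loss plus a toy metric-learning term (anchor/positive/negative picked
# arbitrarily here; the real framework compacts/separates intra-/interclass samples).
loss = nn.CrossEntropyLoss()(logits, labels) \
       + 0.1 * nn.TripletMarginLoss()(feats[0:1], feats[1:2], feats[2:3])
```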

19 citations