SwinFusion: Cross-domain Long-range Learning for General Image Fusion via Swin Transformer
TL;DR: An attention-guided cross-domain module is devised to achieve sufficient integration of complementary information and global interaction, and an elaborate loss function, consisting of SSIM loss, texture loss, and intensity loss, drives the network to preserve abundant texture details and structural information, as well as presenting optimal apparent intensity.
Abstract: This study proposes a novel general image fusion framework based on cross-domain long-range learning and Swin Transformer, termed SwinFusion. On the one hand, an attention-guided cross-domain module is devised to achieve sufficient integration of complementary information and global interaction. More specifically, the proposed method involves an intra-domain fusion unit based on self-attention and an inter-domain fusion unit based on cross-attention, which mine and integrate long-range dependencies within the same domain and across domains. Through long-range dependency modeling, the network is able to fully implement domain-specific information extraction and cross-domain complementary information integration, as well as maintain appropriate apparent intensity from a global perspective. In particular, we introduce the shifted windows mechanism into the self-attention and cross-attention, which allows our model to process images of arbitrary size. On the other hand, multi-scene image fusion problems are generalized to a unified framework with structure maintenance, detail preservation, and proper intensity control. Moreover, an elaborate loss function, consisting of SSIM loss, texture loss, and intensity loss, drives the network to preserve abundant texture details and structural information, as well as present optimal apparent intensity. Extensive experiments on both multi-modal image fusion and digital photography image fusion demonstrate the superiority of our SwinFusion over state-of-the-art unified image fusion algorithms and task-specific alternatives. Implementation code and pre-trained weights can be accessed at https://github.com/Linfeng-Tang/SwinFusion.
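For concreteness, the sketch below shows how such a composite objective could be assembled in PyTorch. The single-window SSIM approximation, the Sobel-based texture term, the max-based intensity target, and the weights are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sobel_gradient(img):
    """Approximate gradient magnitude of a single-channel (N,1,H,W) image with Sobel kernels."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3).contiguous()
    gx = F.conv2d(img, kx.to(img), padding=1)
    gy = F.conv2d(img, ky.to(img), padding=1)
    return gx.abs() + gy.abs()

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified single-window SSIM computed over the whole image, for illustration only."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def fusion_loss(fused, src_a, src_b, w_ssim=1.0, w_text=10.0, w_int=10.0):
    # SSIM term: keep the fused image structurally close to both sources.
    l_ssim = (1 - global_ssim(fused, src_a)) + (1 - global_ssim(fused, src_b))
    # Texture term: fused gradients track the element-wise maximum of source gradients.
    l_text = F.l1_loss(sobel_gradient(fused),
                       torch.maximum(sobel_gradient(src_a), sobel_gradient(src_b)))
    # Intensity term: fused intensity follows the element-wise maximum of the sources.
    l_int = F.l1_loss(fused, torch.maximum(src_a, src_b))
    return w_ssim * l_ssim + w_text * l_text + w_int * l_int
```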
Citations
TL;DR: Wang et al. proposed a decoupling network-based IVIF method (DNFusion), which utilizes decoupled maps to impose additional constraints on the network, forcing it to effectively retain the saliency information of the source images.
Abstract: In general, the goal of existing infrared and visible image fusion (IVIF) methods is to make the fused image contain both the high-contrast regions of the infrared image and the texture details of the visible image. However, this definition can lead the fused image to lose information from the visible image in high-contrast areas. To address this problem, this article proposes a decoupling network-based IVIF method (DNFusion), which utilizes decoupled maps to design additional constraints on the network, forcing it to effectively retain the saliency information of the source images. The current definition of image fusion is satisfied while the salient objects of the source images are effectively maintained. Specifically, the feature interaction module (FIM) inside effectively facilitates information exchange within the encoder and improves the utilization of complementary information. In addition, a hybrid loss function constructed from a weight fidelity loss, a gradient loss, and a decoupling loss ensures that the generated fused image effectively preserves the texture details and luminance information of the source images. Qualitative and quantitative comparisons in extensive experiments demonstrate that our model can generate a fused image containing the salient objects and clear details of the source images, and that the proposed method outperforms other state-of-the-art (SOTA) methods.
86 citations
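As a rough illustration of how a saliency-weighted hybrid objective of this kind might look, here is a hedged PyTorch sketch; the decoupled saliency maps sal_ir/sal_vis, the partition-style decoupling term, and all weights are assumptions rather than the paper's actual definitions.

```python
import torch
import torch.nn.functional as F

def grad_xy(x):
    """Finite-difference gradients of a (N,1,H,W) image along x and y."""
    dx = x[..., :, 1:] - x[..., :, :-1]
    dy = x[..., 1:, :] - x[..., :-1, :]
    return dx, dy

def dnfusion_style_loss(fused, ir, vis, sal_ir, sal_vis,
                        w_fid=1.0, w_grad=5.0, w_dec=0.1):
    # Weight fidelity (assumed form): each pixel follows the source favoured by its
    # decoupled saliency map.
    l_fid = (sal_ir * (fused - ir).abs() + sal_vis * (fused - vis).abs()).mean()
    # Gradient term: fused gradients track the stronger of the two source gradients.
    fdx, fdy = grad_xy(fused)
    irx, iry = grad_xy(ir)
    vix, viy = grad_xy(vis)
    l_grad = (F.l1_loss(fdx.abs(), torch.maximum(irx.abs(), vix.abs()))
              + F.l1_loss(fdy.abs(), torch.maximum(iry.abs(), viy.abs())))
    # Decoupling term (assumed form): the two saliency maps roughly partition the image.
    l_dec = ((sal_ir + sal_vis) - 1.0).abs().mean()
    return w_fid * l_fid + w_grad * l_grad + w_dec * l_dec
```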
TL;DR: Tang et al. proposed a novel image registration and fusion method, named SuperFusion, which combines image registration, image fusion, and the semantic requirements of high-level vision tasks in a single framework.
Abstract: Image fusion aims to integrate complementary information in source images to synthesize a fused image comprehensively characterizing the imaging scene. However, existing image fusion algorithms are only applicable to strictly aligned source images and cause severe artifacts in the fusion results when the input images have slight shifts or deformations. In addition, the fusion results typically have good visual effect but neglect the semantic requirements of high-level vision tasks. This study incorporates image registration, image fusion, and the semantic requirements of high-level vision tasks into a single framework and proposes a novel image registration and fusion method, named SuperFusion. Specifically, we design a registration network to estimate bidirectional deformation fields to rectify geometric distortions of input images under the supervision of both photometric and end-point constraints. The registration and fusion are combined in a symmetric scheme, in which mutual promotion is achieved by optimizing the naive fusion loss and is further enhanced by the mono-modal consistency constraint on symmetric fusion outputs. In addition, the image fusion network is equipped with a global spatial attention mechanism to achieve adaptive feature integration. Moreover, a semantic constraint based on a pre-trained segmentation model and the Lovasz-Softmax loss is deployed to guide the fusion network to focus more on the semantic requirements of high-level vision tasks. Extensive experiments on image registration, image fusion, and semantic segmentation tasks demonstrate the superiority of our SuperFusion compared to the state-of-the-art alternatives. The source code and pre-trained model are publicly available at https://github.com/Linfeng-Tang/SuperFusion.
50 citations
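To make the registration side more tangible, the sketch below warps a moving image with a dense deformation field via torch.nn.functional.grid_sample and applies an L1 photometric constraint; the pixel-offset field format and the L1 penalty are illustrative assumptions, not SuperFusion's exact design.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Warp img (N,C,H,W) with a per-pixel offset field flow (N,2,H,W), offsets in pixels."""
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device, dtype=img.dtype),
                            torch.arange(w, device=img.device, dtype=img.dtype),
                            indexing="ij")
    # Absolute sampling positions, normalised to [-1, 1] as grid_sample expects.
    gx = 2.0 * (xs + flow[:, 0]) / (w - 1) - 1.0
    gy = 2.0 * (ys + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)              # (N,H,W,2)
    return F.grid_sample(img, grid, align_corners=True)

def photometric_loss(moving, fixed, flow):
    """L1 photometric constraint between the warped moving image and the fixed image."""
    return F.l1_loss(warp(moving, flow), fixed)
```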
TL;DR: Wu et al. proposed a multitype fusion and enhancement network (MFENet) for RGB-thermal (RGB-T) salient object detection, which exploits the advantages of the RGB and thermal modalities through feature integration and enhancement.
Abstract: Recent progress in salient object detection (SOD) has been fueled substantially by the development of convolutional neural networks. However, several SOD methods do not fully exploit information from different modalities and consequently perform only marginally better than methods using a single modality. Therefore, we propose a multitype fusion and enhancement network (MFENet), following the three steps "Encoder, Pre-decoder, Decoder", for RGB-thermal (RGB-T) SOD, which fully exploits the advantages of the RGB and thermal modalities through feature integration and enhancement. To better fuse the features of the two modalities, we design a cross-modality fusion module (CMFM) in the encoder part. As shallow features describe details and deep features provide semantic information, a multiscale interactive refinement module is designed in the pre-decoder part to complement multilevel features. Additionally, to further sharpen salient objects, we propose a high-level/low-level module in the decoder part that takes inputs from adjacent layers and gradually translates them into a saliency map. This module provides semantic information for shallower features, and the boundaries of salient objects can be gradually sharpened with subtle details. Extensive experiments show the effectiveness and robustness of the proposed MFENet and its substantial improvement over state-of-the-art RGB-T SOD methods. The codes and results will be available at: https://github.com/wujunyi1412/MFENet_DSP.
18 citations
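The sketch below illustrates one plausible form of cross-modality feature fusion in PyTorch, with each modality reweighted by channel attention computed from the other; it is a hypothetical stand-in for MFENet's CMFM, not its published architecture.

```python
import torch
import torch.nn as nn

class CrossModalityFusion(nn.Module):
    """Hypothetical RGB-T fusion block: each modality is reweighted by channel
    attention derived from the other modality, then the two streams are summed."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        def gate():
            return nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid())
        self.gate_from_rgb = gate()   # attention computed from RGB features
        self.gate_from_t = gate()     # attention computed from thermal features

    def forward(self, feat_rgb, feat_t):
        # Cross attention: thermal features gate RGB features and vice versa.
        fused_rgb = feat_rgb * self.gate_from_t(feat_t)
        fused_t = feat_t * self.gate_from_rgb(feat_rgb)
        return fused_rgb + fused_t
```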
TL;DR: DIVFusion proposes a scene-illumination disentangled network (SIDNet) to strip illumination degradation from nighttime visible images while preserving the informative features of the source images, together with a texture-contrast enhancement fusion network (TCEFNet) devised to integrate complementary information and enhance the contrast and texture details of the fused features.
Abstract: As a vital image enhancement technology, infrared and visible image fusion aims to generate high-quality fused images with salient targets and abundant texture in extreme environments. However, current image fusion methods are all designed for infrared and visible images with normal illumination. In the night scene, existing methods suffer from weak texture details and poor visual perception due to the severe degradation in visible images, which affects subsequent visual applications. To this end, this paper advances a darkness-free infrared and visible image fusion method (DIVFusion), which reasonably lights up the darkness and facilitates complementary information aggregation. Specifically, to improve the fusion quality of nighttime images, which suffer from low illumination, texture concealment, and color distortion, we first design a scene-illumination disentangled network (SIDNet) to strip the illumination degradation in nighttime visible images while preserving informative features of source images. Then, a texture–contrast enhancement fusion network (TCEFNet) is devised to integrate complementary information and enhance the contrast and texture details of fused features. Moreover, a color consistency loss is designed to mitigate color distortion from enhancement and fusion. Finally, we fully consider the intrinsic relationship between low-light image enhancement and image fusion, achieving effective coupling and reciprocity. In this way, the proposed method is able to generate fused images with real color and significant contrast in an end-to-end manner. Extensive experiments demonstrate that DIVFusion is superior to state-of-the-art algorithms in terms of visual quality and quantitative evaluations. Particularly, low-light enhancement and dual-modal fusion provide more effective information to the fused image and boost high-level vision tasks. Our code is publicly available at https://github.com/Xinyu-Xiang/DIVFusion.
16 citations
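One plausible way to express a color consistency constraint of this kind is to penalise the per-pixel angular difference between RGB vectors, as sketched below; this particular cosine formulation is an assumption for illustration, not DIVFusion's exact loss.

```python
import torch
import torch.nn.functional as F

def color_consistency_loss(enhanced, reference):
    """Hypothetical color consistency term: penalise the angular difference between
    per-pixel RGB vectors of the enhanced/fused image and a reference, so that
    brightness may change while hue is preserved. Inputs are (N,3,H,W) in [0,1]."""
    cos = F.cosine_similarity(enhanced, reference, dim=1, eps=1e-6)  # (N,H,W)
    return (1.0 - cos).mean()
```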
TL;DR: This paper presents a thorough overview of image fusion approaches, including the associated background and current breakthroughs, which can assist researchers in coping with multiple imaging modalities, recent fusion developments, and future perspectives.
Abstract:
• The image fusion methods are comprehensively reviewed, and recent developments of DL are elaborated.
• The image fusion applications are briefly discussed.
• The imaging technologies are summarized for image fusion.
• The spectral and polarized image fusion is broadly conferred.
• Future perspectives are comprehensively discussed.
Multiple imaging modalities can be combined to provide more information about the real world than a single modality alone. Infrared images discriminate targets with respect to their thermal radiation differences, and visible images are promising for texture details. On the other hand, polarized images deliver intensity and polarization information, and multispectral images dispense the spatial, spectral, and temporal information depending upon the environment. Different sensors provide images with different characteristics, such as type of degradation, important features, textural attributes, etc. Several stimulating tasks have been explored in the last decades based on algorithms, performance assessments, processing techniques, and prospective applications. However, most of the reviews and surveys have not properly addressed the issues of additional possibilities of imaging fusion. The primary goal of this paper is to give a thorough overview of image fusion approaches, including associated background and current breakthroughs. We introduce image fusion and categorize the methods based on conventional image processing, deep learning (DL) architectures, and fusion scenarios. Further, we emphasize the recent DL developments in various image fusion scenarios. However, there are still several difficulties to overcome, including developing more advanced algorithms to support more dependable and real-time practical applications, discussed in future perspectives. This study can assist researchers in coping with multiple imaging modalities, recent fusion developments, and future perspectives.
12 citations
References
Proceedings Article
12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
52,856 citations
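The central operation of this architecture, scaled dot-product attention, can be written in a few lines of PyTorch (multi-head projections and other details omitted):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, the core Transformer operation.
    q, k, v have shape (..., seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)               # attention weights
    return weights @ v
```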
TL;DR: In this article, a structural similarity index is proposed for image quality assessment based on the degradation of structural information and is compared against both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000.
Abstract: Objective methods for assessing perceptual image quality traditionally attempted to quantify the visibility of errors (differences) between a distorted image and a reference image using a variety of known properties of the human visual system. Under the assumption that human visual perception is highly adapted for extracting structural information from a scene, we introduce an alternative complementary framework for quality assessment based on the degradation of structural information. As a specific example of this concept, we develop a structural similarity index and demonstrate its promise through a set of intuitive examples, as well as comparison to both subjective ratings and state-of-the-art objective methods on a database of images compressed with JPEG and JPEG2000. A MATLAB implementation of the proposed algorithm is available online at http://www.cns.nyu.edu/~lcv/ssim/.
40,609 citations
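A compact PyTorch version of the SSIM index is sketched below; it uses a uniform local window via average pooling instead of the Gaussian window of the original formulation, which is a simplification.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean SSIM between two single-channel images in [0,1], shape (N,1,H,W).
    A uniform window replaces the Gaussian window of the original paper."""
    mu_x = F.avg_pool2d(x, window, stride=1)
    mu_y = F.avg_pool2d(y, window, stride=1)
    var_x = F.avg_pool2d(x * x, window, stride=1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window, stride=1) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return ssim_map.mean()
```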
27 Jun 2016
TL;DR: Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
Abstract: We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
27,256 citations
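To illustrate the regression formulation, the sketch below decodes a YOLO-style grid prediction into absolute corner boxes; the tensor layout and the cell-relative/image-relative parameterisation are simplified assumptions loosely following YOLOv1.

```python
import torch

def decode_yolo_grid(pred, img_size):
    """Turn a YOLO-style grid prediction into absolute boxes.
    pred: (S, S, B, 5) with (tx, ty, w, h, conf); tx, ty are offsets of the box
    centre within its cell in [0, 1], and w, h are relative to the whole image."""
    s = pred.size(0)
    ys, xs = torch.meshgrid(torch.arange(s), torch.arange(s), indexing="ij")
    cx = (xs[..., None] + pred[..., 0]) / s * img_size     # absolute centre x
    cy = (ys[..., None] + pred[..., 1]) / s * img_size     # absolute centre y
    w = pred[..., 2] * img_size
    h = pred[..., 3] * img_size
    conf = pred[..., 4]
    boxes = torch.stack((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), dim=-1)
    return boxes, conf   # (S,S,B,4) corner boxes and (S,S,B) confidences
```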
Posted Content
TL;DR: PyTorch is a machine learning library that provides an imperative and Pythonic programming style, makes debugging easy, and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs.
Abstract: Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs.
In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.
12,767 citations
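A minimal example of the imperative, define-by-run style the abstract describes, using a toy model and random data; the forward pass is ordinary Python, so printing tensors or setting breakpoints works as usual.

```python
import torch
import torch.nn as nn

# Toy model and data: the model is just Python code, and the autograd graph is
# built on the fly as tensors flow through it.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10)
y = torch.randn(64, 1)

pred = model(x)                  # forward pass records operations for autograd
loss = loss_fn(pred, y)
optimizer.zero_grad()
loss.backward()                  # gradients via reverse-mode autodiff
optimizer.step()
print(loss.item())
```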
Posted Content
TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
12,690 citations
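For concreteness, here is a sketch of the patch-embedding step that turns an image into the token sequence a standard Transformer encoder consumes; the hyperparameters follow common ViT-Base defaults and are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to an embedding,
    the first step of a ViT-style model."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying
        # a shared linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                   # x: (N, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (N, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # prepend class token
        return torch.cat((cls, tokens), dim=1) + self.pos_embed
```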