Showing papers in "IEEE Transactions on Image Processing in 2021"
TL;DR: This paper proposes EnlightenGAN, a highly effective unsupervised generative adversarial network that can be trained without low/normal-light image pairs yet generalizes very well on various real-world test images.
Abstract: Deep learning-based methods have achieved remarkable success in image restoration and enhancement, but are they still competitive when there is a lack of paired training data? As one such example, this paper explores the low-light image enhancement problem, where in practice it is extremely challenging to simultaneously take a low-light and a normal-light photo of the same visual scene. We propose a highly effective unsupervised generative adversarial network, dubbed EnlightenGAN, that can be trained without low/normal-light image pairs, yet proves to generalize very well on various real-world test images. Instead of supervising the learning using ground truth data, we propose to regularize the unpaired training using information extracted from the input itself, and benchmark a series of innovations for the low-light image enhancement problem, including a global-local discriminator structure, a self-regularized perceptual loss fusion, and an attention mechanism. Through extensive experiments, our proposed approach outperforms recent methods on a variety of metrics in terms of visual quality and a subjective user study. Thanks to the great flexibility brought by unpaired training, EnlightenGAN is demonstrated to be easily adaptable to enhancing real-world images from various domains. Our code and pre-trained models are available at: https://github.com/VITA-Group/EnlightenGAN.
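The abstract does not spell out the attention mechanism, but EnlightenGAN's self-regularized attention is driven by the input's own illumination. Below is a minimal NumPy sketch under the assumption that illumination is taken as the per-pixel maximum over the RGB channels; the function name is illustrative and not from the released code:

```python
import numpy as np

def self_attention_map(rgb: np.ndarray) -> np.ndarray:
    """Self-regularized attention map for a low-light input.

    rgb: (H, W, 3) array with values in [0, 1].
    Illumination is taken as the per-pixel maximum over the RGB channels;
    darker pixels get attention values closer to 1, steering enhancement
    toward under-exposed regions without any ground truth.
    """
    illumination = rgb.max(axis=2)
    return 1.0 - illumination
```

Because the map is computed from the input alone, it requires no paired supervision, which is the point of the self-regularized design.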
TL;DR: Wu et al. developed a Joint Classification and Segmentation (JCS) system to perform real-time and explainable COVID-19 chest CT diagnosis, which obtains an average sensitivity of 95.0% and a specificity of 93.0%.
Abstract: Recently, the coronavirus disease 2019 (COVID-19) has caused a pandemic in over 200 countries, affecting billions of humans. To control the infection, identifying and separating the infected people is the most crucial step. The main diagnostic tool is the Reverse Transcription Polymerase Chain Reaction (RT-PCR) test. Still, the sensitivity of the RT-PCR test is not high enough to effectively prevent the pandemic. The chest CT scan test provides a valuable complementary tool to the RT-PCR test, and it can identify patients in the early stage with high sensitivity. However, the chest CT scan test is usually time-consuming, requiring about 21.5 minutes per case. This paper develops a novel Joint Classification and Segmentation (JCS) system to perform real-time and explainable COVID-19 chest CT diagnosis. To train our JCS system, we construct a large-scale COVID-19 Classification and Segmentation (COVID-CS) dataset, with 144,167 chest CT images of 400 COVID-19 patients and 350 uninfected cases. 3,855 chest CT images of 200 patients are annotated with fine-grained pixel-level labels of opacifications, which are increased attenuation of the lung parenchyma. We also annotate lesion counts, opacification areas, and locations, thus benefiting various diagnosis aspects. Extensive experiments demonstrate that the proposed JCS diagnosis system is very efficient for COVID-19 classification and segmentation. It obtains an average sensitivity of 95.0% and a specificity of 93.0% on the classification test set, and a 78.5% Dice score on the segmentation test set of our COVID-CS dataset. The COVID-CS dataset and code are available at https://github.com/yuhuan-wu/JCS.
TL;DR: Wang et al. proposed the Context Guided Network (CGNet), a light-weight and efficient network for semantic segmentation that learns the joint feature of local features and surrounding context effectively and efficiently.
Abstract: The demand for applying semantic segmentation models on mobile devices has been increasing rapidly. Current state-of-the-art networks have an enormous number of parameters and are hence unsuitable for mobile devices, while other small-memory-footprint models follow the spirit of classification networks and ignore the inherent characteristics of semantic segmentation. To tackle this problem, we propose a novel Context Guided Network (CGNet), which is a light-weight and efficient network for semantic segmentation. We first propose the Context Guided (CG) block, which learns the joint feature of both local features and surrounding context effectively and efficiently, and further improves the joint feature with the global context. Based on the CG block, we develop CGNet, which captures contextual information in all stages of the network. CGNet is specially tailored to exploit the inherent properties of semantic segmentation and increase segmentation accuracy. Moreover, CGNet is elaborately designed to reduce the number of parameters and save memory footprint. Under an equivalent number of parameters, the proposed CGNet significantly outperforms existing light-weight segmentation networks. Extensive experiments on the Cityscapes and CamVid datasets verify the effectiveness of the proposed approach. Specifically, without any post-processing or multi-scale testing, the proposed CGNet achieves 64.8% mean IoU on Cityscapes with less than 0.5 M parameters.
TL;DR: Li et al. proposed a simple yet effective method, called LayerCAM, to generate more fine-grained object localization information from class activation maps and locate target objects more accurately.
Abstract: Class activation maps are generated from the final convolutional layer of a CNN. They can highlight discriminative object regions for the class of interest. These discovered object regions have been widely used for weakly-supervised tasks. However, due to the small spatial resolution of the final convolutional layer, such class activation maps often locate only coarse regions of the target objects, limiting the performance of weakly-supervised tasks that need pixel-accurate object locations. Thus, we aim to generate more fine-grained object localization information from the class activation maps to locate the target objects more accurately. In this paper, by rethinking the relationships between the feature maps and their corresponding gradients, we propose a simple yet effective method, called LayerCAM. It can produce reliable class activation maps for different layers of a CNN. This property enables us to collect object localization information from coarse (rough spatial localization) to fine (precise fine-grained details) levels. We further integrate them into a high-quality class activation map, where the object-related pixels can be better highlighted. To evaluate the quality of the class activation maps produced by LayerCAM, we apply them to weakly-supervised object localization and semantic segmentation. Experiments demonstrate that the class activation maps generated by our method are more effective and reliable than those of the existing attention methods. The code will be made publicly available.
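The element-wise weighting idea behind LayerCAM can be sketched in a few lines of NumPy; `features` and `grads` stand for a chosen layer's activations and the gradients of the class score with respect to them. This is a sketch of the idea under those assumptions, not the authors' released code:

```python
import numpy as np

def layercam_map(features: np.ndarray, grads: np.ndarray) -> np.ndarray:
    """Class activation map for one layer, following the LayerCAM idea.

    features, grads: (C, H, W) arrays.
    Unlike Grad-CAM's single weight per channel, each spatial location is
    weighted by its own rectified gradient, which preserves fine detail
    in shallow layers.
    """
    weights = np.maximum(grads, 0.0)                 # keep positive gradients
    cam = np.maximum((weights * features).sum(axis=0), 0.0)
    if cam.max() > 0:                                # scale to [0, 1]
        cam = cam / cam.max()
    return cam
```

Maps computed this way at several layers, from coarse to fine, can then be combined (e.g. by element-wise maximum) into a single high-quality map.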
TL;DR: Li et al. proposed an underwater image enhancement network via medium transmission-guided multi-color space embedding, which enriches the diversity of feature representations by incorporating the characteristics of different color spaces into a unified structure.
Abstract: Underwater images suffer from color casts and low contrast due to wavelength- and distance-dependent attenuation and scattering. To solve these two degradation issues, we present an underwater image enhancement network via medium transmission-guided multi-color space embedding, called Ucolor. Concretely, we first propose a multi-color space encoder network, which enriches the diversity of feature representations by incorporating the characteristics of different color spaces into a unified structure. Coupled with an attention mechanism, the most discriminative features extracted from multiple color spaces are adaptively integrated and highlighted. Inspired by underwater imaging physical models, we design a medium transmission (indicating the percentage of the scene radiance reaching the camera)-guided decoder network to enhance the network's response to quality-degraded regions. As a result, our network can effectively improve the visual quality of underwater images by exploiting multi-color space embedding and the advantages of both physical model-based and learning-based methods. Extensive experiments demonstrate that our Ucolor achieves superior performance against state-of-the-art methods in terms of both visual quality and quantitative metrics. The code is publicly available at: https://li-chongyi.github.io/Proj_Ucolor.html.
TL;DR: Extensive experimental results indicate that the proposed self-supervised deep correlation tracker (self-SDCT) achieves competitive tracking performance compared with state-of-the-art supervised and unsupervised tracking methods on standard evaluation benchmarks.
Abstract: The training of a feature extraction network typically requires abundant manually annotated training samples, making this a time-consuming and costly process. Accordingly, we propose an effective self-supervised learning-based tracker in a deep correlation framework (named self-SDCT). Motivated by the forward-backward tracking consistency of a robust tracker, we propose a multi-cycle consistency loss as self-supervised information for learning the feature extraction network from adjacent video frames. At the training stage, we generate pseudo-labels of consecutive video frames by forward-backward prediction under a Siamese correlation tracking framework and utilize the proposed multi-cycle consistency loss to learn a feature extraction network. Furthermore, we propose a similarity dropout strategy that drops low-quality training sample pairs, and also adopt a cycle trajectory consistency loss for each sample pair to improve the training loss function. At the tracking stage, we employ the pre-trained feature extraction network to extract features and utilize a Siamese correlation tracking framework to locate the target using forward tracking alone. Extensive experimental results indicate that the proposed self-supervised deep correlation tracker (self-SDCT) achieves competitive tracking performance compared with state-of-the-art supervised and unsupervised tracking methods on standard evaluation benchmarks.
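One plausible form of the forward-backward consistency signal described above can be sketched in NumPy; the paper's multi-cycle objective is richer, and this only illustrates the core idea:

```python
import numpy as np

def cycle_consistency_loss(start_boxes, roundtrip_boxes) -> float:
    """Penalize disagreement between a target's initial position and the
    position recovered after tracking forward and then backward.

    Both arguments are (N, 4) box arrays [x, y, w, h]. A robust tracker
    should return to where it started, so this distance can supervise
    feature learning without manual annotations.
    """
    start = np.asarray(start_boxes, dtype=float)
    back = np.asarray(roundtrip_boxes, dtype=float)
    return float(np.mean(np.sum((start - back) ** 2, axis=1)))
```

Summing this loss over cycles of different lengths (frame t to t+k and back, for several k) gives the multi-cycle variant.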
TL;DR: Inspired by guided image filtering, a novel guided network is designed to predict kernel weights from the guidance image; these predicted kernels are then applied to extract depth image features.
Abstract: Dense depth perception is critical for autonomous driving and other robotics applications. However, modern LiDAR sensors only provide sparse depth measurements. It is thus necessary to complete the sparse LiDAR data, where a synchronized guidance RGB image is often used to facilitate this completion. Many neural networks have been designed for this task. However, they often naively fuse the LiDAR data and RGB image information by performing feature concatenation or element-wise addition. Inspired by guided image filtering, we design a novel guided network to predict kernel weights from the guidance image. These predicted kernels are then applied to extract the depth image features. In this way, our network generates content-dependent and spatially-variant kernels for multi-modal feature fusion. Dynamically generated spatially-variant kernels could lead to prohibitive GPU memory consumption and computation overhead. We further design a convolution factorization to reduce computation and memory consumption. The GPU memory reduction makes it possible for feature fusion to work in a multi-stage scheme. We conduct comprehensive experiments to verify our method on real-world outdoor, indoor, and synthetic datasets. Our method produces strong results. It outperforms state-of-the-art methods on the NYUv2 dataset and ranked 1st on the KITTI depth completion benchmark at the time of submission. It also presents strong generalization capability under different 3D point densities, various lighting and weather conditions, as well as cross-dataset evaluations. The code will be released for reproduction.
TL;DR: An end-to-end Dense Attention Fluid Network (DAFNet) for SOD in optical RSIs is proposed that significantly outperforms existing state-of-the-art SOD competitors, and a new and challenging optical RSI dataset is constructed that contains 2,000 images with pixel-wise saliency annotations.
Abstract: Despite the remarkable advances in visual saliency analysis for natural scene images (NSIs), salient object detection (SOD) for optical remote sensing images (RSIs) still remains an open and challenging problem. In this paper, we propose an end-to-end Dense Attention Fluid Network (DAFNet) for SOD in optical RSIs. A Global Context-aware Attention (GCA) module is proposed to adaptively capture long-range semantic context relationships, and is further embedded in a Dense Attention Fluid (DAF) structure that enables shallow attention cues to flow into deep layers to guide the generation of high-level feature attention maps. Specifically, the GCA module is composed of two key components, where the global feature aggregation module achieves mutual reinforcement of salient feature embeddings from any two spatial locations, and the cascaded pyramid attention module tackles the scale variation issue by building up a cascaded pyramid framework to progressively refine the attention map in a coarse-to-fine manner. In addition, we construct a new and challenging optical RSI dataset for SOD that contains 2,000 images with pixel-wise saliency annotations, which is currently the largest publicly available benchmark. Extensive experiments demonstrate that our proposed DAFNet significantly outperforms the existing state-of-the-art SOD competitors. https://github.com/rmcong/DAFNet_TIP20
TL;DR: This article proposes a Hierarchical Alternate Interactions Network (HAINet) for RGB-D Salient Object Detection, which not only achieves competitive performance as compared with 19 relevant state-of-the-art methods, but also reaches a real-time processing speed of 43 fps on a single NVIDIA Titan X GPU.
Abstract: Existing RGB-D Salient Object Detection (SOD) methods take advantage of depth cues to improve detection accuracy, while paying insufficient attention to the quality of depth information. In practice, a depth map often has uneven quality and sometimes suffers from distractors, due to various factors in the acquisition procedure. In this article, to mitigate distractors in depth maps and highlight salient objects in RGB images, we propose a Hierarchical Alternate Interactions Network (HAINet) for RGB-D SOD. Specifically, HAINet consists of three key stages: feature encoding, cross-modal alternate interaction, and saliency reasoning. The main innovation in HAINet is the Hierarchical Alternate Interaction Module (HAIM), which plays a key role in the second stage for cross-modal feature interaction. HAIM first uses RGB features to filter distractors in depth features, and then the purified depth features are exploited to enhance RGB features in turn. The alternate RGB-depth-RGB interaction proceeds in a hierarchical manner, which progressively integrates local and global contexts within a single feature scale. In addition, we adopt a hybrid loss function to facilitate the training of HAINet. Extensive experiments on seven datasets demonstrate that our HAINet not only achieves competitive performance as compared with 19 relevant state-of-the-art methods, but also reaches a real-time processing speed of 43 fps on a single NVIDIA Titan X GPU. The code and results of our method are available at https://github.com/MathLee/HAINet.
TL;DR: Zhang et al. proposed a graded-feature multilabel-learning network for RGB-thermal urban scene semantic segmentation, which splits multilevel features into junior, intermediate, and senior levels.
Abstract: Semantic segmentation is a fundamental task in computer vision, and it has various applications in fields such as robotic sensing, video surveillance, and autonomous driving. A major research topic in urban road semantic segmentation is the proper integration and use of cross-modal information for fusion. Here, we attempt to leverage inherent multimodal information and acquire graded features to develop a novel multilabel-learning network for RGB-thermal urban scene semantic segmentation. Specifically, we propose a strategy for graded-feature extraction to split multilevel features into junior, intermediate, and senior levels. Then, we integrate RGB and thermal modalities with two distinct fusion modules, namely a shallow feature fusion module and a deep feature fusion module, for junior and senior features respectively. Finally, we use multilabel supervision to optimize the network in terms of semantic, binary, and boundary characteristics. Experimental results confirm that the proposed architecture, the graded-feature multilabel-learning network, outperforms state-of-the-art methods for urban scene semantic segmentation, and it can be generalized to depth data.
TL;DR: An end-to-end learnt lossy image compression approach, built on top of a deep neural network (DNN)-based variational auto-encoder (VAE) structure with Non-Local Attention optimization and Improved Context modeling (NLAIC).
Abstract: This article proposes an end-to-end learnt lossy image compression approach, which is built on top of a deep neural network (DNN)-based variational auto-encoder (VAE) structure with Non-Local Attention optimization and Improved Context modeling (NLAIC). Our NLAIC 1) embeds non-local network operations as non-linear transforms in both main and hyper coders for deriving respective latent features and hyperpriors by exploiting both local and global correlations, 2) applies an attention mechanism to generate implicit masks that are used to weigh the features for adaptive bit allocation, and 3) implements improved conditional entropy modeling of latent features using joint 3D convolutional neural network (CNN)-based autoregressive contexts and hyperpriors. Towards practical application, additional enhancements are also introduced to speed up the computational processing (e.g., parallel 3D CNN-based context prediction), decrease the memory consumption (e.g., sparse non-local processing), and reduce the implementation complexity (e.g., a unified model for variable rates without re-training). The proposed model outperforms existing learnt and conventional (e.g., BPG, JPEG2000, JPEG) image compression methods, on both the Kodak and Tecnick datasets with state-of-the-art compression efficiency, for both PSNR and MS-SSIM quality measurements. We have made all materials publicly accessible at https://njuvision.github.io/NIC for reproducible research.
TL;DR: A novel network named DPANet is proposed to explicitly model the potentiality of the depth map and effectively integrate the cross-modal complementarity, guiding the fusion of the two modalities to prevent contamination from unreliable depth maps.
Abstract: There are two main issues in RGB-D salient object detection: (1) how to effectively integrate the complementarity of the cross-modal RGB-D data; (2) how to prevent the contamination effect from an unreliable depth map. In fact, these two problems are linked and intertwined, but previous methods tend to focus only on the first problem and ignore the quality of the depth map, which may cause the model to fall into a sub-optimal state. In this paper, we address these two issues synergistically in a holistic model, and propose a novel network named DPANet to explicitly model the potentiality of the depth map and effectively integrate the cross-modal complementarity. By introducing depth potentiality perception, the network can perceive the potentiality of depth information in a learning-based manner, and guide the fusion process of the two modalities to prevent contamination. The gated multi-modality attention module in the fusion process exploits the attention mechanism with a gate controller to capture long-range dependencies from a cross-modal perspective. Experimental results compared with 16 state-of-the-art methods on 8 datasets demonstrate the validity of the proposed approach both quantitatively and qualitatively. https://github.com/JosephChenHub/DPANet
TL;DR: In this article, a Stereoscopically Attentive Multi-Scale (SAM) module is proposed to adaptively fuse the features of various scales for salient object detection.
Abstract: Recent progress on salient object detection (SOD) mostly benefits from the explosive development of Convolutional Neural Networks (CNNs). However, much of the improvement comes with larger network size and heavier computation overhead, which, in our view, is not mobile-friendly and thus difficult to deploy in practice. To promote more practical SOD systems, we introduce a novel Stereoscopically Attentive Multi-scale (SAM) module, which adopts a stereoscopic attention mechanism to adaptively fuse the features of various scales. Embarking on this module, we propose an extremely lightweight network, namely SAMNet, for SOD. Extensive experiments on popular benchmarks demonstrate that the proposed SAMNet yields comparable accuracy with state-of-the-art methods while running at a GPU speed of 343 fps and a CPU speed of 5 fps for $336 \times 336$ inputs with only 1.33M parameters. Therefore, SAMNet paves a new path towards SOD. The source code is available on the project page https://mmcheng.net/SAMNet/.
TL;DR: Wang et al. proposed a new subspace clustering method that represents data samples as linear combinations of the bases in a given dictionary, where the sparse structure is induced by low-rank factorization of the affinity matrix.
Abstract: Hyperspectral image super-resolution by fusing a high-resolution multispectral image (HR-MSI) and a low-resolution hyperspectral image (LR-HSI) aims at reconstructing high-resolution spatial-spectral information of the scene. Existing methods, mostly based on spectral unmixing and sparse representation, are often developed from a low-level vision task perspective and cannot sufficiently make use of the spatial and spectral priors available from higher-level analysis. To address this issue, this paper proposes a novel HSI super-resolution method that fully considers the spatial/spectral subspace low-rank relationships between the available HR-MSI/LR-HSI and the latent HSI. Specifically, it relies on a new subspace clustering method named "structured sparse low-rank representation" (SSLRR) to represent the data samples as linear combinations of the bases in a given dictionary, where the sparse structure is induced by low-rank factorization of the affinity matrix. We then exploit the proposed SSLRR model to learn the SSLRR along the spatial/spectral domains from the MSI/HSI inputs. By using the learned spatial and spectral low-rank structures, we formulate the proposed HSI super-resolution model as a variational optimization problem, which can be readily solved by the ADMM algorithm. Compared with state-of-the-art hyperspectral super-resolution methods, the proposed method shows better performance on three benchmark datasets in terms of both visual and quantitative evaluation.
TL;DR: RefineDNet is a two-stage weakly supervised dehazing framework that adopts the dark channel prior to restore visibility and then refines preliminary dehazing results via adversarial learning with unpaired foggy and clear images.
Abstract: Haze-free images are the prerequisites of many vision systems and algorithms, and thus single image dehazing is of paramount importance in computer vision. In this field, prior-based methods have achieved initial success. However, they often introduce annoying artifacts to outputs because their priors can hardly fit all situations. By contrast, learning-based methods can generate more natural results. Nonetheless, due to the lack of paired foggy and clear outdoor images of the same scenes as training samples, their haze removal abilities are limited. In this work, we attempt to merge the merits of prior-based and learning-based approaches by dividing the dehazing task into two sub-tasks, i.e., visibility restoration and realness improvement. Specifically, we propose a two-stage weakly supervised dehazing framework, RefineDNet. In the first stage, RefineDNet adopts the dark channel prior to restore visibility. Then, in the second stage, it refines preliminary dehazing results of the first stage to improve realness via adversarial learning with unpaired foggy and clear images. To get more qualified results, we also propose an effective perceptual fusion strategy to blend different dehazing outputs. Extensive experiments corroborate that RefineDNet with the perceptual fusion has an outstanding haze removal capability and can also produce visually pleasing results. Even implemented with basic backbone networks, RefineDNet can outperform supervised dehazing approaches as well as other state-of-the-art methods on indoor and outdoor datasets. To make our results reproducible, relevant code and data are available at https://github.com/xiaofeng94/RefineDNet-for-dehazing .
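RefineDNet's first stage relies on the classical dark channel prior. Below is a minimal NumPy sketch of the dark channel itself; the full prior-based stage additionally estimates atmospheric light and transmission, which is omitted here:

```python
import numpy as np

def dark_channel(image: np.ndarray, patch: int = 15) -> np.ndarray:
    """Dark channel of an RGB image (H, W, 3) with values in [0, 1].

    For each pixel, take the minimum over the three color channels and
    over a local patch. In haze-free outdoor images this value tends to
    be near zero, so large dark-channel values indicate haze.
    """
    h, w, _ = image.shape
    min_rgb = image.min(axis=2)              # per-pixel channel minimum
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode="edge")
    dark = np.empty_like(min_rgb)
    for i in range(h):                       # brute-force local minimum filter
        for j in range(w):
            dark[i, j] = padded[i:i + patch, j:j + patch].min()
    return dark
```

The double loop is written for clarity; in practice an efficient minimum filter (e.g. scipy.ndimage.minimum_filter) does the same job.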
TL;DR: Wang et al. designed an end-to-end signal prior-guided layer separation and data-driven mapping network with layer-specified constraints for single-image low-light enhancement.
Abstract: Due to the absence of a desirable objective for low-light image enhancement, previous data-driven methods may produce undesirable enhanced results, including amplified noise, degraded contrast, and biased colors. In this work, inspired by Retinex theory, we design an end-to-end signal prior-guided layer separation and data-driven mapping network with layer-specified constraints for single-image low-light enhancement. A Sparse Gradient Minimization sub-Network (SGM-Net) is constructed to remove low-amplitude structures and preserve major edge information, which facilitates extracting paired illumination maps of low/normal-light images. After the learned decomposition, two sub-networks (Enhance-Net and Restore-Net) are utilized to predict the enhanced illumination and reflectance maps, respectively, which helps stretch the contrast of the illumination map and remove intensive noise in the reflectance map. All of the configured constraints, including the signal structure regularization and the losses, reinforce one another, leading to good reconstruction results in overall visual quality. The evaluation on both synthetic and real images, particularly on those containing intensive noise, compression artifacts, and their interleaved artifacts, shows the effectiveness of our novel models, which significantly outperform the state-of-the-art methods.
TL;DR: In this paper, an end-to-end fully convolutional segmentation network (FCSN) is proposed to simultaneously identify land-cover labels of all pixels in a hyperspectral image (HSI) cube.
Abstract: Recently, deep learning has drawn broad attention in the hyperspectral image (HSI) classification task. Many works have focused on elaborately designing various spectral-spatial networks, where the convolutional neural network (CNN) is one of the most popular structures. To explore the spatial information for HSI classification, pixels together with their adjacent pixels are usually directly cropped from hyperspectral data to form HSI cubes in CNN-based methods. However, the spatial land-cover distributions of cropped HSI cubes are usually complicated. The land-cover label of a cropped HSI cube cannot simply be determined by its center pixel. In addition, the spatial land-cover distribution of a cropped HSI cube is fixed and has little diversity. For CNN-based methods, training with cropped HSI cubes will result in poor generalization to changes of spatial land-cover distributions. In this paper, an end-to-end fully convolutional segmentation network (FCSN) is proposed to simultaneously identify the land-cover labels of all pixels in an HSI cube. First, several experiments are conducted to demonstrate that recent CNN-based methods show weak generalization capabilities. Second, a fine label style is proposed to label all pixels of HSI cubes to provide detailed spatial land-cover distributions of HSI cubes. Third, an HSI cube generation method is proposed to generate plentiful HSI cubes with fine labels to improve the diversity of spatial land-cover distributions. Finally, the FCSN is proposed to explore spectral-spatial features from finely labeled HSI cubes for HSI classification. Experimental results show that the FCSN has superior generalization capability to changes of spatial land-cover distributions.
TL;DR: Zhang et al. proposed a bilateral attention network (BiANet) for RGB-D salient object detection with a complementary attention mechanism, namely foreground-first (FF) attention and background-first (BF) attention.
Abstract: RGB-D salient object detection (SOD) aims to segment the most attractive objects in a pair of cross-modal RGB and depth images. Currently, most existing RGB-D SOD methods focus on the foreground region when utilizing the depth images. However, the background also provides important information in traditional SOD methods for promising performance. To better explore salient information in both foreground and background regions, this paper proposes a Bilateral Attention Network (BiANet) for the RGB-D SOD task. Specifically, we introduce a Bilateral Attention Module (BAM) with a complementary attention mechanism: foreground-first (FF) attention and background-first (BF) attention. The FF attention focuses on the foreground region with a gradual refinement style, while the BF one recovers potentially useful salient information in the background region. Benefiting from the proposed BAM, our BiANet can capture more meaningful foreground and background cues, and shift more attention to refining the uncertain details between foreground and background regions. Additionally, we extend our BAM by leveraging multi-scale techniques for better SOD performance. Extensive experiments on six benchmark datasets demonstrate that our BiANet outperforms other state-of-the-art RGB-D SOD methods in terms of objective metrics and subjective visual comparison. Our BiANet can run up to 80 fps on $224\times 224$ RGB-D images with an NVIDIA GeForce RTX 2080Ti GPU. Comprehensive ablation studies also validate our contributions.
TL;DR: Zhang et al. proposed a new unsupervised framework termed DerainCycleGAN for single image rain removal and generation, which fully utilizes the constrained transfer learning ability and circulatory structures of CycleGAN.
Abstract: Single Image Deraining (SID) is a relatively new and still challenging topic in emerging vision applications, and most of the recently emerged deraining methods use a supervised manner depending on the ground truth (i.e., paired data). However, in practice it is rather common to encounter unpaired images in real deraining tasks. In such cases, removing the rain streaks in an unsupervised way is challenging due to the lack of constraints between images, leading to low-quality restoration results. In this paper, we therefore explore the unsupervised SID issue using unpaired data, and propose a new unsupervised framework termed DerainCycleGAN for single image rain removal and generation, which can fully utilize the constrained transfer learning ability and circulatory structures of CycleGAN. In addition, we design an unsupervised rain attentive detector (UARD) for enhancing rain information detection by paying attention to both rainy and rain-free images. Besides, we also contribute a new synthetic way of generating rain streak information, which is different from the previous ones. Specifically, since the generated rain streaks have diverse shapes and directions, existing deraining methods trained on rainy images generated in this way can perform much better on real rainy images. Extensive experimental results on synthetic and real datasets show that our DerainCycleGAN is superior to current unsupervised and semi-supervised methods, and is also highly competitive with fully-supervised ones.
TL;DR: In this article, the authors proposed a generalized nonconvex low-rank tensor approximation (GNLTA) for multi-view subspace clustering, where instead of the pairwise correlation between data points, GNLTA adopts the low-rank tensor approximation to capture the high-order correlation among multiple views and proposes the generalized nonconvex tensor norm to well consider the physical meanings of different singular values.
Abstract: The low-rank tensor representation (LRTR) has become an emerging research direction to boost the multi-view clustering performance. This is because LRTR utilizes not only the pairwise relation between data points, but also the view relation of multiple views. However, there is one significant challenge: LRTR uses the tensor nuclear norm as the convex approximation but provides a biased estimation of the tensor rank function. To address this limitation, we propose the generalized nonconvex low-rank tensor approximation (GNLTA) for multi-view subspace clustering. Instead of the pairwise correlation, GNLTA adopts the low-rank tensor approximation to capture the high-order correlation among multiple views and proposes the generalized nonconvex low-rank tensor norm to well consider the physical meanings of different singular values. We develop a unified solver to solve the GNLTA model and prove that under mild conditions, any accumulation point is a stationary point of GNLTA. Extensive experiments on seven commonly used benchmark databases have demonstrated that the proposed GNLTA achieves better clustering performance over state-of-the-art methods.
TL;DR: In this paper, the authors proposed Data Augmentation Optimized for GAN (DAG) to enable the use of augmented data in GAN training to improve the learning of the original distribution.
Abstract: Recent successes in Generative Adversarial Networks (GAN) have affirmed the importance of using more data in GAN training. Yet it is expensive to collect data in many domains such as medical applications. Data Augmentation (DA) has been applied in these applications. In this work, we first argue that the classical DA approach could mislead the generator to learn the distribution of the augmented data, which could be different from that of the original data. We then propose a principled framework, termed Data Augmentation Optimized for GAN (DAG), to enable the use of augmented data in GAN training to improve the learning of the original distribution. We provide theoretical analysis to show that using our proposed DAG aligns with the original GAN in minimizing the Jensen–Shannon (JS) divergence between the original distribution and model distribution. Importantly, the proposed DAG effectively leverages the augmented data to improve the learning of discriminator and generator. We conduct experiments to apply DAG to different GAN models: unconditional GAN, conditional GAN, self-supervised GAN and CycleGAN using datasets of natural images and medical images. The results show that DAG achieves consistent and considerable improvements across these models. Furthermore, when DAG is used in some GAN models, the system establishes state-of-the-art Frechet Inception Distance (FID) scores. Our code is available ( https://github.com/tntrung/dag-gans ).
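The core DAG idea, applying each augmentation to both real and generated samples and judging each augmented branch with its own discriminator head, can be sketched as follows; the transform/head callables and the per-branch loss form are illustrative assumptions:

```python
import numpy as np

def dag_discriminator_losses(real, fake, transforms, disc_heads):
    """Sketch of DAG: every augmentation T_k is applied to BOTH real and
    generated samples, and each augmented branch gets its own discriminator
    head D_k, so the generator is still matched to the ORIGINAL data
    distribution rather than the augmented one."""
    losses = []
    for T, D in zip(transforms, disc_heads):
        d_real = D(T(real))   # head's belief that augmented real data is real
        d_fake = D(T(fake))   # head's belief that augmented fake data is real
        # standard GAN discriminator loss, averaged per branch
        losses.append(-np.mean(np.log(d_real + 1e-8)
                               + np.log(1.0 - d_fake + 1e-8)))
    return float(np.mean(losses))

real = np.random.rand(8, 32)
fake = np.random.rand(8, 32)
identity = lambda x: x                      # one trivial "augmentation"
half = lambda x: np.full(len(x), 0.5)       # untrained head outputting 0.5
loss = dag_discriminator_losses(real, fake, [identity], [half])
```

At the uninformative head output of 0.5 the branch loss is 2 log 2, the usual GAN equilibrium value, which is consistent with the paper's claim that DAG still minimizes the JS divergence to the original distribution.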
TL;DR: In this paper, a unified blind image quality assessment (BIQA) model was developed and an approach of training it for both synthetic and realistic distortions was proposed to confront the cross-distortion-scenario challenge.
Abstract: Performance of blind image quality assessment (BIQA) models has been significantly boosted by end-to-end optimization of feature engineering and quality regression. Nevertheless, due to the distributional shift between images simulated in the laboratory and captured in the wild, models trained on databases with synthetic distortions remain particularly weak at handling realistic distortions (and vice versa). To confront the cross-distortion-scenario challenge, we develop a unified BIQA model and an approach to training it for both synthetic and realistic distortions. We first sample pairs of images from individual IQA databases, and compute a probability that the first image of each pair is of higher quality. We then employ the fidelity loss to optimize a deep neural network for BIQA over a large number of such image pairs. We also explicitly enforce a hinge constraint to regularize uncertainty estimation during optimization. Extensive experiments on six IQA databases show the promise of the learned method in blindly assessing image quality in the laboratory and in the wild. In addition, we demonstrate the universality of the proposed training strategy by using it to improve existing BIQA models.
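The pairwise training signal can be sketched directly: a ground-truth preference probability computed Thurstone-style from per-image score statistics, and the fidelity loss between it and the model's predicted probability. The exact probability model here is an assumption for illustration:

```python
import numpy as np
from math import erf, sqrt

def gt_probability(mu1, mu2, sigma1, sigma2):
    """P(image 1 has higher quality), computed Thurstone-style from
    per-image mean opinion scores and their standard deviations."""
    z = (mu1 - mu2) / sqrt(sigma1 ** 2 + sigma2 ** 2 + 1e-12)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # Gaussian CDF at z

def fidelity_loss(p, p_hat):
    """Fidelity loss between the binary distributions (p, 1-p) and
    (p_hat, 1-p_hat); zero iff the two probabilities agree."""
    return 1.0 - np.sqrt(p * p_hat) - np.sqrt((1.0 - p) * (1.0 - p_hat))
```

When the two images have equal mean scores the target probability is exactly 0.5, and a perfectly matching prediction drives the fidelity loss to zero, which is why this loss can rank images across databases whose absolute score scales differ.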
TL;DR: Zhang et al. as discussed by the authors proposed a dynamic selection network (DSNet) to distinguish the corrupted regions from the valid ones throughout the entire network architecture, which may help make full use of the information in the known area.
Abstract: Image inpainting is a challenging computer vision task that aims to fill in missing regions of corrupted images with realistic contents. With the development of convolutional neural networks, many deep learning models have been proposed to solve image inpainting issues by learning information from a large amount of data. In particular, existing algorithms usually follow an encoding and decoding network architecture in which some operations with standard schemes are employed, such as static convolution, which only considers pixels with fixed grids, and the monotonous normalization style (e.g., batch normalization). However, these techniques are not well-suited for the image inpainting task because the randomly corrupted regions in the input images tend to mislead the inpainting process and generate unreasonable content. In this paper, we propose a novel dynamic selection network (DSNet) to solve this problem in image inpainting tasks. The principal idea of the proposed DSNet is to distinguish the corrupted region from the valid ones throughout the entire network architecture, which may help make full use of the information in the known area. Specifically, the proposed DSNet has two novel dynamic selection modules, namely, the validness migratable convolution (VMC) and regional composite normalization (RCN) modules, which share a dynamic selection mechanism that helps utilize valid pixels better. By replacing vanilla convolution with the VMC module, spatial sampling locations are dynamically selected in the convolution phase, resulting in a more flexible feature extraction process. Besides, the RCN module not only combines several normalization methods but also normalizes the feature regions selectively. Therefore, the proposed DSNet can generate realistic and finely detailed images by adaptively selecting features and normalization styles.
Experimental results on three public datasets show that our proposed method outperforms state-of-the-art methods both quantitatively and qualitatively.
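VMC dynamically selects spatial sampling locations during convolution; a simpler validity-renormalized (partial-convolution-style) variant conveys the underlying idea of letting only valid pixels contribute. This single-channel sketch is illustrative, not DSNet's actual module:

```python
import numpy as np

def partial_conv2d(image, mask, kernel):
    """Validity-aware convolution sketch: only pixels with mask==1
    contribute, and each response is renormalized by the fraction of
    valid pixels under the window. Returns the response map and an
    updated validity mask (1 wherever any valid pixel was seen)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    valid_mask = np.zeros_like(out)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            m = mask[i:i + kh, j:j + kw]
            n_valid = m.sum()
            if n_valid > 0:
                patch = image[i:i + kh, j:j + kw] * m
                # renormalize so fully-valid and partially-valid windows
                # produce comparable magnitudes
                out[i, j] = (patch * kernel).sum() * (kh * kw / n_valid)
                valid_mask[i, j] = 1.0
    return out, valid_mask

img = np.ones((5, 5))
msk = np.ones((5, 5))      # no holes: behaves like ordinary convolution
kern = np.ones((3, 3))
out, valid_mask = partial_conv2d(img, msk, kern)
```

With a hole in the mask, the renormalization keeps responses near the hole boundary unbiased, and the shrinking hole in the updated mask is what lets the known region propagate inward layer by layer.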
TL;DR: In this article, a correlation model is proposed to represent the latent multi-source correlation, and the correlation representations across modalities are fused via attention mechanism into a shared representation to emphasize the most important features for segmentation.
Abstract: Magnetic Resonance Imaging (MRI) is a widely used imaging technique to assess brain tumors. Accurately segmenting brain tumors from MR images is the key to clinical diagnostics and treatment planning. In addition, multi-modal MR images can provide complementary information for accurate brain tumor segmentation. However, it is common for some imaging modalities to be missing in clinical practice. In this paper, we present a novel brain tumor segmentation algorithm that handles missing modalities. Since a strong correlation exists between modalities, a correlation model is proposed to explicitly represent the latent multi-source correlation. Thanks to the obtained correlation representation, the segmentation becomes more robust in the case of a missing modality. First, the individual representation produced by each encoder is used to estimate the modality-independent parameters. Then, the correlation model transforms all the individual representations into the latent multi-source correlation representations. Finally, the correlation representations across modalities are fused via an attention mechanism into a shared representation to emphasize the most important features for segmentation. We evaluate our model on the BraTS 2018 and BraTS 2019 datasets, where it outperforms the current state-of-the-art methods and produces robust results when one or more modalities are missing.
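The final fusion step, attention weights over per-modality latent representations, can be sketched as follows; the shapes and the score-based weighting are illustrative assumptions, not the paper's exact attention design:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(reps, scores):
    """Fuse per-modality latent representations (M, D) into one shared
    representation via attention weights over hypothetical per-modality
    scores (M,). A missing modality can be suppressed by giving it a
    very negative score, so its weight vanishes after the softmax."""
    w = softmax(scores)                 # one attention weight per modality
    return (w[:, None] * reps).sum(axis=0)

reps = np.array([[1.0, 1.0],            # e.g. T1-derived representation
                 [3.0, 3.0]])           # e.g. FLAIR-derived representation
fused = attention_fuse(reps, np.array([0.0, 0.0]))
```

With equal scores the fusion reduces to a plain average; learned scores instead let the network emphasize whichever modality carries the most discriminative evidence for the tumor region.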
TL;DR: Zhang et al. as mentioned in this paper introduced a new parameter, i.e., light absorption coefficient, into the atmospheric scattering model (ASM) to address the dim effect and better model outdoor hazy scenes.
Abstract: Atmospheric scattering model (ASM) is one of the most widely used models to describe the imaging process of hazy images. However, we found that ASM has an intrinsic limitation which leads to a dim effect in the recovered results. In this paper, by introducing a new parameter, i.e., light absorption coefficient, into ASM, an enhanced ASM (EASM) is attained, which can address the dim effect and better model outdoor hazy scenes. Relying on this EASM, a simple yet effective gray-world-assumption-based technique called IDE is then developed to enhance the visibility of hazy images. Experimental results show that IDE eliminates the dim effect and exhibits excellent dehazing performance. It is worth mentioning that IDE does not require any training process or extra information related to scene depth, which makes it very fast and robust. Moreover, the global stretch strategy used in IDE can effectively avoid some undesirable effects in recovery results, e.g., over-enhancement, over-saturation, and mist residue. Comparison between the proposed IDE and other state-of-the-art techniques reveals the superiority of IDE in terms of both dehazing quality and efficiency.
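For context, the classical ASM that EASM extends models a hazy image as a transmission-weighted blend of scene radiance and global atmospheric light (standard formulation; the exact placement of the new light absorption coefficient in EASM follows the paper and is not reproduced here):

```latex
I(x) = J(x)\,t(x) + A\bigl(1 - t(x)\bigr),
\qquad t(x) = e^{-\beta d(x)},
```

where $I$ is the observed hazy image, $J$ the scene radiance to be recovered, $A$ the global atmospheric light, $t$ the medium transmission, $\beta$ the scattering coefficient, and $d$ the scene depth. The dim effect the paper identifies arises because this formulation accounts for scattering but not for light absorbed along the path.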
TL;DR: Zhang et al. as mentioned in this paper proposed a multi-interactive dual-decoder to mine and model the multi-type interactions for accurate RGB-thermal salient object detection (SOD).
Abstract: RGB-thermal salient object detection (SOD) aims to segment the common prominent regions of a visible image and the corresponding thermal infrared image, a task we call RGBT SOD. Existing methods do not fully explore and exploit the complementarity of different modalities and the multi-type cues of image contents, which play a vital role in achieving accurate results. In this paper, we propose a multi-interactive dual-decoder to mine and model the multi-type interactions for accurate RGBT SOD. Specifically, we first encode the two modalities into multi-level multi-modal feature representations. Then, we design a novel dual-decoder to conduct the interactions of multi-level features, two modalities, and global contexts. With these interactions, our method works well in diversely challenging scenarios, even in the presence of an invalid modality. Finally, we carry out extensive experiments on public RGBT and RGBD SOD datasets, and the results show that the proposed method achieves outstanding performance against state-of-the-art algorithms. The source code has been released at: https://github.com/lz118/Multi-interactive-Dual-decoder .
TL;DR: Li et al. as mentioned in this paper used pairwise similarity to weigh the reconstruction loss to capture local structure information, while a similarity matrix learned by the self-expression layer captures high-level similarity without feeding semantic labels.
Abstract: Auto-Encoder (AE)-based deep subspace clustering (DSC) methods have achieved impressive performance due to the powerful representation extracted using deep neural networks while prioritizing categorical separability. However, self-reconstruction loss of an AE ignores rich useful relation information and might lead to indiscriminative representation, which inevitably degrades the clustering performance. It is also challenging to learn high-level similarity without feeding semantic labels. Another unsolved problem facing DSC is the huge memory cost due to the $n\times n$ similarity matrix, which is incurred by the self-expression layer between an encoder and decoder. To tackle these problems, we use pairwise similarity to weigh the reconstruction loss to capture local structure information, while a similarity matrix is learned by the self-expression layer. Pseudo-graphs and pseudo-labels, which allow benefiting from uncertain knowledge acquired during network training, are further employed to supervise similarity learning. Joint learning and iterative training facilitate obtaining an overall optimal solution. Extensive experiments on benchmark datasets demonstrate the superiority of our approach. By combining with the $k$ -nearest neighbors algorithm, we further show that our method can address the large-scale and out-of-sample problems. The source code of our method is available at: https://github.com/sckangz/SelfsupervisedSC .
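The self-expression layer that produces the $n\times n$ similarity matrix optimizes a reconstruction-plus-regularization objective: each sample is rewritten as a combination of the others. A minimal sketch, with a Frobenius regularizer chosen here purely for illustration, is:

```python
import numpy as np

def self_expression_loss(X, C, lam=1.0):
    """Self-expression objective used in AE-based deep subspace
    clustering. X holds n samples as columns (D, n); C is the n x n
    coefficient matrix that doubles as the learned similarity/affinity
    matrix. The diagonal is zeroed to forbid trivial self-reconstruction."""
    C = C - np.diag(np.diag(C))
    recon = np.linalg.norm(X - X @ C, 'fro') ** 2   # X ≈ X C
    reg = np.linalg.norm(C, 'fro') ** 2             # one common regularizer
    return recon + lam * reg

X = np.ones((2, 3))                 # 3 toy samples as columns
base = self_expression_loss(X, np.zeros((3, 3)))   # reduces to ||X||_F^2
```

Storing and optimizing the full C is exactly the $n\times n$ memory cost the abstract mentions, which motivates the pseudo-graph supervision the paper introduces.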
TL;DR: AP-CNN as discussed by the authors proposes a dual pathway hierarchy structure with a top-down feature pathway and a bottom-up attention pathway, hence learning both high-level semantic and low-level detailed feature representation.
Abstract: Classifying the sub-categories of an object from the same super-category ( e.g. , bird species and cars) in fine-grained visual classification (FGVC) highly relies on discriminative feature representation and accurate region localization. Existing approaches mainly focus on distilling information from high-level features. In this article, by contrast, we show that by integrating low-level information ( e.g. , color, edge junctions, texture patterns), performance can be improved with enhanced feature representation and accurately located discriminative regions. Our solution, named Attention Pyramid Convolutional Neural Network (AP-CNN), consists of 1) a dual pathway hierarchy structure with a top-down feature pathway and a bottom-up attention pathway, hence learning both high-level semantic and low-level detailed feature representation, and 2) an ROI-guided refinement strategy with ROI-guided dropblock and ROI-guided zoom-in operation, which refines features with discriminative local regions enhanced and background noises eliminated. The proposed AP-CNN can be trained end-to-end, without the need of any additional bounding box/part annotation. Extensive experiments on three popularly tested FGVC datasets (CUB-200-2011, Stanford Cars, and FGVC-Aircraft) demonstrate that our approach achieves state-of-the-art performance. Models and code are available at https://github.com/PRIS-CV/AP-CNN_Pytorch-master .
TL;DR: In this article, the VIDeo quality EVALuator (VIDEVAL) is proposed to improve the performance of VQA models for UGC/consumer videos.
Abstract: Recent years have witnessed an explosion of user-generated content (UGC) videos shared and streamed over the Internet, thanks to the evolution of affordable and reliable consumer capture devices, and the tremendous popularity of social media platforms. Accordingly, there is a great need for accurate video quality assessment (VQA) models for UGC/consumer videos to monitor, control, and optimize this vast content. Blind quality prediction of in-the-wild videos is quite challenging, since the quality degradations of UGC videos are unpredictable, complicated, and often commingled. Here we contribute to advancing the UGC-VQA problem by conducting a comprehensive evaluation of leading no-reference/blind VQA (BVQA) features and models on a fixed evaluation architecture, yielding new empirical insights on both subjective video quality studies and objective VQA model design. By employing a feature selection strategy on top of efficient BVQA models, we are able to extract 60 out of 763 statistical features used in existing methods to create a new fusion-based model, which we dub the VIDeo quality EVALuator (VIDEVAL), that effectively balances the trade-off between VQA performance and efficiency. Our experimental results show that VIDEVAL achieves state-of-the-art performance at considerably lower computational cost than other leading models. Our study protocol also defines a reliable benchmark for the UGC-VQA problem, which we believe will facilitate further research on deep learning-based VQA modeling, as well as perceptually-optimized efficient UGC video processing, transcoding, and streaming. To promote reproducible research and public evaluation, an implementation of VIDEVAL has been made available online: https://github.com/vztu/VIDEVAL .
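The feature-selection step can be illustrated with a toy stand-in: rank candidate features by correlation with subjective scores and keep the top-k. The paper's actual selection (60 of 763 features) is a more elaborate model-based search; this sketch only shows the shape of the procedure:

```python
import numpy as np

def select_topk_features(F, mos, k=60):
    """Toy stand-in for VIDEVAL-style feature selection: score each of
    the candidate features in F (n_videos, n_features) by its absolute
    Pearson correlation with the subjective scores (MOS), keep top-k."""
    Fc = F - F.mean(axis=0)
    yc = mos - mos.mean()
    denom = np.sqrt((Fc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    corr = np.abs(Fc.T @ yc / denom)        # |correlation| per feature
    idx = np.argsort(corr)[::-1][:k]        # indices of the k best features
    return np.sort(idx)

rng = np.random.default_rng(0)
mos = rng.standard_normal(100)
F = rng.standard_normal((100, 5))
F[:, 2] = mos                               # one feature perfectly tracks MOS
chosen = select_topk_features(F, mos, k=1)
```

The selected subset would then feed a compact regressor, which is how the fused model trades a small accuracy margin for a large reduction in computational cost.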
TL;DR: Zhang et al. as mentioned in this paper proposed a novel Long Short-Term Memory (LSTM) to incorporate multiple sources of information from pedestrians and vehicles adaptively, such as vehicle speed and ego-motion, pedestrian intention and historical locations.
Abstract: Accurate predictions of future pedestrian trajectory could prevent a considerable number of traffic injuries and improve pedestrian safety. It involves multiple sources of information and real-time interactions, e.g. , vehicle speed and ego-motion, pedestrian intention and historical locations. Existing methods directly apply a simple concatenation operation to combine multiple cues, while their dynamics over time are less studied. In this paper, we propose a novel Long Short-Term Memory (LSTM) model to incorporate multiple sources of information from pedestrians and vehicles adaptively. Different from a standard LSTM, our model considers mutual interactions and explores intrinsic relations among multiple cues. First, we introduce extra memory cells to improve the transferability of LSTMs in modeling future variations. These extra memory cells include a speed cell to explicitly model vehicle speed dynamics, an intention cell to dynamically analyze pedestrian crossing intentions, and a correlation cell to exploit correlations among temporal frames. These three individual cells uncover the future movement of vehicles, pedestrians, and global scenes. Second, we propose a gated shifting operation to learn the movement of pedestrians. The intention of crossing the road or not significantly affects a pedestrian’s spatial location. To this end, global scene dynamics and pedestrian intention information are leveraged to model the spatial shifts. Third, we integrate the speed variations into the output gate and dynamically reweight the output channels via the scaling of vehicle speed. The movement of the vehicle alters the scale of the predicted pedestrian bounding box: as the vehicle gets closer to the pedestrian, the bounding box enlarges. Our rescaling process captures the relative movement and updates the size of pedestrian bounding boxes accordingly.
Experiments conducted on three pedestrian trajectory forecasting benchmarks show that our model achieves state-of-the-art performance.