
Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology in 2023"



Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a texture-aware refinement module to amplify the subtle texture difference between camouflaged objects and the background for camouflaged object detection by formulating multiple texture-aware refinement modules to learn texture-aware features in a deep CNN.
Abstract: Camouflaged object detection is a challenging task that aims to identify objects having similar texture to the surroundings. This paper proposes to amplify the subtle texture difference between camouflaged objects and the background for camouflaged object detection by formulating multiple texture-aware refinement modules to learn texture-aware features in a deep convolutional neural network. The texture-aware refinement module computes the biased co-variance matrices of feature responses to extract the texture information, adopts an affinity loss to learn a set of parameter maps that help to separate the texture between camouflaged objects and the background, and leverages a boundary-consistency loss to explore the structures of object details. We evaluate our network on the benchmark datasets for camouflaged object detection both qualitatively and quantitatively. Experimental results show that our approach outperforms various state-of-the-art methods by a large margin.
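As a rough illustration of the covariance idea above, the sketch below computes biased channel covariance matrices from a CNN feature map as a texture descriptor; the shapes, names, and PyTorch framing are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: biased channel covariance of CNN feature responses as a
# texture descriptor, loosely following the texture-aware refinement idea
# described above. Shapes and names are illustrative only.
import torch

def channel_covariance(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) -> biased covariance matrices (B, C, C)."""
    b, c, h, w = feat.shape
    x = feat.reshape(b, c, h * w)
    x = x - x.mean(dim=2, keepdim=True)                 # zero-mean per channel
    cov = torch.bmm(x, x.transpose(1, 2)) / (h * w)     # biased: divide by N, not N-1
    return cov

feat = torch.randn(2, 64, 32, 32)
print(channel_covariance(feat).shape)                   # torch.Size([2, 64, 64])
```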

16 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a joint attention and multi-scale fusion network (JAMSNet) for remote photoplethysmography (rPPG), which analyzes the distribution of the pulse signal and motion artifacts in different layers of a Gaussian pyramid and extracts the pulse signal at multiple scales.
Abstract: Remote photoplethysmography (rPPG) has been an active research topic in recent years. While most existing methods focus on eliminating motion artifacts in the raw traces obtained from a single-scale region-of-interest (ROI), it is worth noting that some noise signals that cannot be effectively separated in single-scale space can be separated more easily in multi-scale space. In this paper, we analyze the distribution of the pulse signal and motion artifacts in different layers of a Gaussian pyramid. We propose a method that combines multi-scale analysis and neural networks for pulse extraction at different scales, and a layer-wise attention mechanism to adaptively fuse the features according to signal strength. In addition, we propose a spatial-temporal joint attention module and a channel-temporal joint attention module to learn and exaggerate pulse features in the respective joint spaces. The proposed remote pulse extraction network is called the Joint Attention and Multi-Scale fusion Network (JAMSNet). Extensive experiments have been conducted on two publicly available datasets and one self-collected dataset. The results show that the proposed JAMSNet outperforms state-of-the-art methods.
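To make the multi-scale idea concrete, the following toy sketch builds a small pyramid by repeated downsampling and fuses per-layer features with learned layer-wise attention weights; the pooling-based pyramid, module names, and shapes are assumptions for illustration and are not the JAMSNet architecture.

```python
# Hedged sketch of multi-scale decomposition plus layer-wise attention fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pyramid(x: torch.Tensor, levels: int = 3):
    """x: (B, C, H, W) -> list of tensors, one per pyramid layer."""
    layers = [x]
    for _ in range(levels - 1):
        x = F.avg_pool2d(x, kernel_size=2)   # cheap stand-in for Gaussian blur + downsample
        layers.append(x)
    return layers

class LayerAttentionFusion(nn.Module):
    def __init__(self, levels: int = 3):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(levels))   # learned per-layer weights

    def forward(self, per_layer_feats):
        # per_layer_feats: list of (B, D) pooled features, one per pyramid layer
        w = torch.softmax(self.logits, dim=0)
        return sum(wi * f for wi, f in zip(w, per_layer_feats))

frames = torch.randn(4, 3, 64, 64)
layers = pyramid(frames)
feats = [p.mean(dim=(2, 3)) for p in layers]              # (B, 3) per layer
print(LayerAttentionFusion(len(layers))(feats).shape)     # torch.Size([4, 3])
```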

14 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a Motion Stimulation (MS) block, which is specifically designed to mine dynamic clues of the local regions autonomously from adjacent frames, which can be directly and conveniently integrated into existing video backbones to enhance the ability of compositional generalization for action recognition algorithms.
Abstract: Recognizing unseen combinations of actions and different objects, namely (zero-shot) compositional action recognition, is extremely challenging for conventional action recognition algorithms in real-world applications. Previous methods focus on enhancing the dynamic clues of objects that appear in the scene by building region features or tracklet embeddings from ground truths or detected bounding boxes. These methods rely heavily on manual annotation or the quality of detectors, which is inflexible for practical applications. In this work, we aim to mine the temporal clues from moving objects or hands without explicit supervision. Thus, we propose a novel Motion Stimulation (MS) block, which is specifically designed to mine dynamic clues of local regions autonomously from adjacent frames. The MS block consists of three steps: motion feature extraction, motion feature recalibration, and action-centric excitation. The proposed MS block can be directly and conveniently integrated into existing video backbones to enhance the compositional generalization ability of action recognition algorithms. Extensive experimental results on three action recognition datasets, Something-Else, IKEA-Assembly and EPIC-KITCHENS, indicate the effectiveness and interpretability of our MS block.
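A hedged sketch of the three named steps is given below, using adjacent-frame differencing as the motion feature and an SE-style channel gate for recalibration and excitation; the exact operations of the MS block may differ.

```python
# Hedged sketch of a motion-stimulation-style block: frame differencing,
# channel recalibration, residual excitation. Illustrative only.
import torch
import torch.nn as nn

class MotionStimulationSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, C, T, H, W) clip features."""
        motion = x[:, :, 1:] - x[:, :, :-1]                       # adjacent-frame differences
        motion = torch.cat([motion, motion[:, :, -1:]], dim=2)    # pad back to T frames
        weights = self.gate(motion)                               # recalibrated channel weights
        return x * weights + x                                    # residual excitation

clip = torch.randn(2, 64, 8, 14, 14)
print(MotionStimulationSketch(64)(clip).shape)                    # torch.Size([2, 64, 8, 14, 14])
```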

11 citations


Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a self-designed threshold prediction network and a probability estimation network with adaptive similarity mutual attention to help to find the overlapping area of the point clouds.
Abstract: Point cloud registration is a key problem in the application of computer vision to robotics, autopilot and other fields. However, because objects are partially occluded or the resolutions of 3D scanners differ, point clouds collected from the same scene may be inconsistent and even incomplete. To address this, we divide the partially overlapping point cloud registration task into two sub-tasks: overlapping area detection and registration. Inspired by recently proposed learning-based approaches, we propose the Inliers Estimation Network (INENet), which includes a self-designed threshold prediction network and a probability estimation network with adaptive similarity mutual attention to help find the overlapping area of the point clouds. The threshold prediction network automatically calculates the threshold according to the input point clouds, and the probability estimation network then estimates the overlapping points using this threshold. The advantages of the proposed approach include: (1) the threshold prediction network avoids the bias and complexity of manually adjusting the threshold; (2) the probability estimation network with a similarity matrix can deeply fuse the information between a pair of point clouds, which helps to improve accuracy; (3) INENet can be easily integrated into other overlap-sensitive algorithms without adjusting parameters. We conduct experiments on the ModelNet40, S3DIS and 3DMatch datasets. Specifically, the rotation error of the registration algorithm integrated with INENet is improved by at least 25% compared with direct partially overlapping registration, and our method improves the $F_{1}$ score by 5% and has better anti-noise ability compared with existing overlap detection methods, showing the effectiveness of the proposed method.
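The two-stage idea can be illustrated with the minimal sketch below, where a per-point overlap probability and a predicted threshold (both stand-ins for the two networks) select the points that are passed on to registration.

```python
# Hedged sketch: select overlapping points from a predicted probability and
# a predicted threshold. Tensors and names are illustrative, not INENet.
import torch

def select_overlap(points: torch.Tensor,
                   overlap_prob: torch.Tensor,
                   threshold: torch.Tensor) -> torch.Tensor:
    """points: (N, 3), overlap_prob: (N,), threshold: scalar tensor."""
    mask = overlap_prob >= threshold
    return points[mask]

pts = torch.randn(1024, 3)
prob = torch.rand(1024)          # stand-in for the probability estimation network
thr = torch.tensor(0.6)          # stand-in for the threshold prediction network
overlap_pts = select_overlap(pts, prob, thr)
print(overlap_pts.shape)         # roughly 40% of the points for uniform probabilities
```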

9 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a skip-attention based correspondence filtering network (SACF-Net) for point cloud registration, which utilizes both low-level geometric information and high-level context-aware information to enhance the original pointwise matching map.
Abstract: Rigid registration is a transformation estimation problem between two point clouds. The two captured point clouds may only partially overlap owing to different viewpoints and acquisition times. Some previous correspondence-matching based methods utilize an encoder-decoder network to carry out the partial-to-partial registration task and adopt a skip-connection structure to convey information between the encoder and decoder. However, equally revisiting features through skip-connections may introduce information redundancy and limit the feature learning ability of the entire network. To address these problems, we propose a skip-attention based correspondence filtering network (SACF-Net) for point cloud registration. A novel feature interaction mechanism is designed to utilize both low-level geometric information and high-level context-aware information to enhance the original pointwise matching map. Additionally, a skip-attention based correspondence filtering method is proposed to selectively revisit features in the encoder at different resolutions, allowing the decoder to extract high-quality correspondences within overlapping regions. We conduct comprehensive experiments on indoor and outdoor scene datasets, and the results show that the proposed SACF-Net yields unprecedented performance improvements.

8 citations


Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a context-aware mixup framework for domain adaptive semantic segmentation, which exploits the important clue of context-dependency as explicit prior knowledge in a fully end-to-end trainable manner.
Abstract: Unsupervised domain adaptation (UDA) aims to adapt a model trained on a labeled source domain to an unlabeled target domain. Existing UDA-based semantic segmentation approaches typically reduce the domain shift at the pixel level, feature level, and output level. However, almost all of them largely neglect the contextual dependency, which is generally shared across different domains, leading to less-desired performance. In this paper, we propose a novel Context-Aware Mixup (CAMix) framework for domain adaptive semantic segmentation, which exploits this important clue of context-dependency as explicit prior knowledge in a fully end-to-end trainable manner to enhance adaptability toward the target domain. Firstly, we present a contextual mask generation strategy by leveraging the accumulated spatial distributions and prior contextual relationships. The generated contextual mask is critical in this work and guides the context-aware domain mixup on three different levels. Besides, given the context knowledge, we introduce a significance-reweighted consistency loss to penalize the inconsistency between the mixed student prediction and the mixed teacher prediction, which alleviates negative transfer during adaptation, e.g., early performance degradation. Extensive experiments and analysis demonstrate the effectiveness of our method against state-of-the-art approaches on widely-used UDA benchmarks.
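A minimal sketch of mask-guided domain mixup is shown below: a binary contextual mask pastes source regions into a target image and mixes the corresponding labels. The mask here is random; in CAMix it would come from the contextual mask generation strategy.

```python
# Hedged sketch of mask-guided mixup between a source and a target sample.
import torch

def context_mixup(src_img, tgt_img, src_lbl, tgt_lbl, mask):
    """mask: (B, 1, H, W) with values in {0, 1}; 1 keeps the source pixel."""
    mixed_img = mask * src_img + (1 - mask) * tgt_img
    m = mask.squeeze(1).long()
    mixed_lbl = m * src_lbl + (1 - m) * tgt_lbl
    return mixed_img, mixed_lbl

b, h, w = 2, 64, 64
mask = (torch.rand(b, 1, h, w) > 0.5).float()        # stand-in for a contextual mask
img_s, img_t = torch.randn(b, 3, h, w), torch.randn(b, 3, h, w)
lbl_s, lbl_t = torch.randint(0, 19, (b, h, w)), torch.randint(0, 19, (b, h, w))
mix_img, mix_lbl = context_mixup(img_s, img_t, lbl_s, lbl_t, mask)
print(mix_img.shape, mix_lbl.shape)                  # (2, 3, 64, 64) (2, 64, 64)
```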

7 citations


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors designed an asynchronous updating Boolean network encryption algorithm based on chaos (ABNEA), which can complete the encryption tasks of asynchronously updating Boolean networks and exhibits good security characteristics.
Abstract: An asynchronous updating Boolean network is employed to simulate and analyze the gene expression of a particular tissue or species, revealing the life activity process from a system perspective in order to uncover disease mechanisms and guide treatment. Therefore, to ensure the safe transmission of asynchronous updating Boolean networks over a network, we design an asynchronous updating Boolean network encryption algorithm based on chaos (ABNEA). First, a novel 2D chaotic system (2D-FPSM) is designed. This system performs better than classical 2D chaotic systems and is well suited to generating key streams for cryptographic systems. Second, an encoding rule is designed to convert the asynchronous updating Boolean network to a Boolean matrix and propagate it over the network as an image. The receiver and sender jointly save the encoding rule. Last, to protect the safe propagation of the Boolean network matrix over the network, a synchronous scrambling-diffusion method is adopted to encrypt the Boolean network matrix based on the 2D-FPSM. Simulation experiments and security analysis show that the average correlations of adjacent ciphertext pixels are 0.0010, -0.0010, and -0.0020, and the average information entropy is 7.9984. ABNEA can complete the encryption tasks of asynchronously updating Boolean networks and exhibits good security characteristics.
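The keystream-plus-diffusion idea can be sketched as follows; a standard logistic map stands in for the unspecified 2D-FPSM, so this only illustrates XOR diffusion of an encoded Boolean matrix and is not the ABNEA algorithm.

```python
# Hedged sketch: chaotic keystream generation (logistic map stand-in) and
# XOR diffusion of a Boolean matrix; XOR-ing twice with the same key decrypts.
import numpy as np

def logistic_keystream(length: int, x0: float = 0.3141, r: float = 3.99) -> np.ndarray:
    x, out = x0, np.empty(length, dtype=np.uint8)
    for i in range(length):
        x = r * x * (1.0 - x)
        out[i] = int(x * 256) & 0xFF              # quantize the chaotic state to a byte
    return out

def xor_diffuse(matrix: np.ndarray, key: np.ndarray) -> np.ndarray:
    flat = matrix.astype(np.uint8).ravel()
    return (flat ^ key[: flat.size]).reshape(matrix.shape)

bool_net = np.random.randint(0, 2, size=(32, 32))     # encoded Boolean network matrix
key = logistic_keystream(bool_net.size)
cipher = xor_diffuse(bool_net, key)
plain = xor_diffuse(cipher, key)                      # XOR is its own inverse
assert np.array_equal(plain, bool_net.astype(np.uint8))
```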

6 citations


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed an end-to-end network to reconstruct high-resolution images from low-resolution spike streams, which adopts a multi-level features learning mechanism, including intra-stream feature extraction by spike encoder, inter-stream dependencies extraction based on optical flow module, and joint features learning via spike-based iterative projection.
Abstract: The spike camera is a new type of bio-inspired vision sensor, each pixel of which perceives the brightness of the scene independently and finally outputs 3-dimensional spatiotemporal spike streams. To bridge the spike camera and traditional frame-based vision, some works reconstruct spike streams into regular images. However, the low spatial resolution ($400\times 250$) of the spike camera limits the quality of the reconstructed images. Thus, it is meaningful to explore super-resolution reconstruction for spike streams. In this paper, we propose an end-to-end network to reconstruct high-resolution images from low-resolution spike streams. To utilize more spatiotemporal features of spike streams, our network adopts a multi-level feature learning mechanism, including intra-stream feature extraction by a spike encoder, inter-stream dependency extraction based on an optical flow module, and joint feature learning via spike-based iterative projection. Experimental results demonstrate that our network is superior to the combination of state-of-the-art intensity image reconstruction methods and super-resolution networks on simulated and real datasets.

5 citations


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a cross-modality double bidirectional interaction and fusion network (CMDBIF-Net) for RGB-T salient object detection.
Abstract: RGB-T salient object detection (SOD) aims to detect and segment salient regions in RGB images and the corresponding thermal maps. The ability to alleviate the modality difference between the RGB and thermal modalities plays a vital role in the development of RGB-T SOD. However, most existing methods try to integrate multi-modal information through various fusion strategies, or reduce the modality difference via unidirectional or undifferentiated bidirectional interaction, but fail in some challenging scenes. To deal with these issues, a novel Cross-Modality Double Bidirectional Interaction and Fusion Network (CMDBIF-Net) for RGB-T SOD is proposed. Specifically, we construct an interactive branch to indirectly bridge the RGB and thermal modalities. In addition, we propose a double bidirectional interaction (DBI) module composed of a forward interaction block (FIB) and a backward interaction block (BIB) to reduce the cross-modality differences. Moreover, a multi-scale feature enhancement and fusion (MSFEF) module is introduced to integrate the multi-modal features while considering the internal gap between different modalities. Finally, we use a cascaded decoder and a cross-level feature enhancement (CLFE) module to generate high-quality saliency maps. Extensive experiments conducted on three publicly available RGB-T SOD datasets show that the proposed CMDBIF-Net achieves outstanding performance against state-of-the-art (SOTA) RGB-T SOD methods.

5 citations


Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed an interpretable deep neural network, namely multisource aligning joint contextual representation model-informed interpretable classification network (MACRMoI-N), which aligns complementary spectral-spatial-elevation information during end-to-end training.
Abstract: The effective utilization of hyperspectral image (HSI) and light detection and ranging (LiDAR) data is essential for land cover classification. Recently, deep learning-based classification approaches have achieved remarkable success. However, most deep learning classification methods are data-driven and designed as black-box architectures, lacking sufficient interpretability and ignoring the potential correlation of heterogeneous complementary information between multisource data. To address these issues, we propose an interpretable deep neural network, namely the multisource aligning joint contextual representation model-informed interpretable classification network (MACRMoI-N), which fully exploits the correlation of multisource data by aligning complementary spectral-spatial-elevation information during end-to-end training. We first present a multimodal aligning joint contextual representation classification model (MACR-M), which incorporates local spatial-spectral prior information into the representation. MACR-M is optimized by an iterative algorithm to solve the dictionaries of HSI and LiDAR and their corresponding sparse coefficients, in which the dictionary distributions are aligned to enable the complementary information of multisource data to guide a more accurate classification. We further propose the unfolded MACRMoI-N, where each module corresponds to a specific operation of the optimization algorithm, and the parameters are optimized in an end-to-end manner. Comparative experiment results and ablation studies show that MACRMoI-N performs better than other advanced methods.

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a degradation-aware restoration network with GAN prior, dubbed DEAR-GAN, for FR tasks by explicitly learning the degradation representations (DR) to adapt to various degradation.
Abstract: With the development of generative adversarial networks (GANs), recent face restoration (FR) methods often utilize pre-trained GAN models (i.e., StyleGAN2) as priors to generate rich details. However, these methods usually struggle to balance realness and fidelity when facing various degradation levels. In this paper, we propose a novel DEgradation-Aware Restoration network with GAN prior, dubbed DEAR-GAN, for FR tasks by explicitly learning degradation representations (DR) to adapt to various degradations. Specifically, an unsupervised degradation representation learning (UDRL) strategy is first developed to extract the DR of the input degraded images. Then, a degradation-aware feature interpolation (DAFI) module is proposed to dynamically fuse the two types of informative features (i.e., features from degraded images and features from the GAN prior network) with flexible adaptation to various degradations based on the DR. Extensive experiments show that our proposed DEAR-GAN outperforms state-of-the-art methods for face restoration under multiple degradations and for face super-resolution, and demonstrate the effectiveness of feature interpolation, which can be extended to face inpainting to achieve excellent results.
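The DAFI idea of blending the two feature sources under control of the degradation representation might look roughly like the sketch below; the channel sizes, sigmoid gating, and layer choices are assumptions, not the paper's module.

```python
# Hedged sketch: a fusion weight predicted from the degradation
# representation blends degraded-image features with GAN-prior features.
import torch
import torch.nn as nn

class DegradationAwareFusion(nn.Module):
    def __init__(self, channels: int, dr_dim: int):
        super().__init__()
        self.to_weight = nn.Sequential(nn.Linear(dr_dim, channels), nn.Sigmoid())

    def forward(self, feat_img, feat_prior, dr):
        """feat_img, feat_prior: (B, C, H, W); dr: (B, dr_dim)."""
        alpha = self.to_weight(dr).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) per-channel weight
        return alpha * feat_img + (1 - alpha) * feat_prior

fuse = DegradationAwareFusion(channels=64, dr_dim=128)
out = fuse(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32), torch.randn(2, 128))
print(out.shape)   # torch.Size([2, 64, 32, 32])
```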

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a Transformer-Auxiliary operator-induced neural network (TANet) to localize forged regions for image manipulation localization, where a stacked multi-scale transformer (SMT) branch is introduced as a compensation for the feature representations of the mainstream convolutional neural network branch.
Abstract: Image manipulation localization (IML), which seeks to accurately segment tampered regions that are artfully blended into a normal image, is a fundamental yet challenging computer vision task. Although impressive results have been achieved by some progressive deep learning methods, they usually fail to capture the subtle manipulation artifacts at different object scales and are not competent to generate a perfect segmentation mask with complete and fine object structures. Besides, the problem of coarse boundaries also occurs frequently. To this end, in this paper, we propose a Transformer-Auxiliary operator-induced neural Network (TANet) to localize forged regions for IML. Specifically, a stacked multi-scale transformer (SMT) branch is first introduced as a compensation for the feature representations of the mainstream convolutional neural network branch. SMT can detect structured abnormalities of the input image at multiple levels by operating on patches of different sizes. Then TANet explicitly exploits an operator induction module (OIM) to excavate valuable, manipulated-region-related boundary semantics to guide the representation learning of the mainstream branch. The OIM encourages the network to generate features that highlight object structure, thereby promoting precise boundary localization of forged regions. We conduct extensive experiments on various datasets and settings to validate the effectiveness of TANet. Results show that TANet outperforms the state-of-the-art methods by a large margin under widely-used evaluation metrics.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed an ensemble approach to better exploit the priors from untrained NNs for BID, which aggregates the deblurring results of multiple untrained neural networks for improvement.
Abstract: Blind image deconvolution (BID) is about recovering a latent image with sharp details from a blurred observation generated by convolution with an unknown smoothing kernel. Recently, deep generative priors from untrained neural networks (NNs) have emerged as a promising deep learning approach for BID, with the benefit of being free of external training samples. However, existing untrained-NN-based BID methods may suffer from under-deblurring or overfitting. In this paper, we propose an ensemble approach to better exploit the priors from untrained NNs for BID, which aggregates the deblurring results of multiple untrained NNs for improvement. To enjoy both effectiveness and computational efficiency in ensemble learning, the untrained NNs are designed with a specific shared-base and multi-head architecture. In addition, a kernel-centering layer is proposed to handle the shift ambiguity among different predictions during ensembling, which also improves the robustness of kernel prediction to the setting of the kernel size parameter. Extensive experiments show that the proposed approach noticeably outperforms both existing dataset-free methods and dataset-based methods.
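A plausible form of the kernel-centering step is sketched below: the predicted blur kernel is shifted so that its centroid sits at the geometric center, which removes the shift ambiguity mentioned above. This NumPy version is illustrative, not the paper's layer.

```python
# Hedged sketch: recenter a blur kernel by moving its intensity centroid
# to the geometric center of the kernel support.
import numpy as np

def center_kernel(kernel: np.ndarray) -> np.ndarray:
    """kernel: (k, k), non-negative, sums to 1."""
    k = kernel.shape[0]
    ys, xs = np.mgrid[0:k, 0:k]
    cy = (kernel * ys).sum() / kernel.sum()            # centroid row
    cx = (kernel * xs).sum() / kernel.sum()            # centroid column
    shift_y = int(round((k - 1) / 2 - cy))
    shift_x = int(round((k - 1) / 2 - cx))
    return np.roll(np.roll(kernel, shift_y, axis=0), shift_x, axis=1)

kern = np.zeros((15, 15))
kern[2:5, 2:5] = 1.0 / 9.0                             # off-center box kernel
centered = center_kernel(kern)
ys, xs = np.mgrid[0:15, 0:15]
print((centered * ys).sum(), (centered * xs).sum())    # both ~7.0, i.e. centered
```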

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed a signal noise separation-based (SNIS) network to solve the problem of detecting post-processed image forgery, which adopts a signal noise separation module to separate the tampered region from the complex background region with post-processing noise, weakening or even eliminating the negative impact of post-processing on forgery detection.
Abstract: Image forgery detection has aroused widespread research interest in both academia and industry because of its potential security threats. Existing forgery detection methods achieve excellent tampered-region localization performance when forged images have not undergone post-processing, since forgeries can then be detected by observing changes in the statistical features of images. However, forged images may be carefully post-processed to conceal forgery boundaries in particular scenarios, which poses a tough challenge to these methods. In this paper, we draw an analogy between image forgery detection and blind signal separation, and formulate the post-processed image forgery detection problem as a signal noise separation problem. We also propose a signal noise separation-based (SNIS) network to solve the problem of detecting post-processed image forgeries. Specifically, we first adopt the signal noise separation module to separate the tampered region from the complex background region with post-processing noise, which weakens or even eliminates the negative impact of post-processing on forgery detection. Then, the multi-scale feature learning module uses a parallel atrous convolution architecture to learn high-level global features from multiple perspectives. Besides, a feature fusion module is utilized to enhance the discriminability of tampered regions and real regions by strengthening the boundary information. Finally, the prediction module is designed to predict the tampered region and classify the type of tampering operation. Extensive experiments show that the proposed SNIS is not only effective for forgery detection on forged images without post-processing, but also promising in robustness against multiple post-processing attacks. Furthermore, SNIS is robust in detecting forged images from unknown sources.
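The parallel atrous convolution architecture mentioned for the multi-scale feature learning module is commonly realized as an ASPP-like block; a generic sketch follows, with dilation rates and channel sizes chosen arbitrarily rather than taken from the paper.

```python
# Hedged sketch of a parallel atrous (dilated) convolution block for
# multi-scale feature learning.
import torch
import torch.nn as nn

class ParallelAtrous(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)   # merge the parallel branches

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 64, 56, 56)
print(ParallelAtrous(64, 32)(x).shape)   # torch.Size([1, 32, 56, 56])
```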

Journal ArticleDOI
TL;DR: Li et al. as mentioned in this paper proposed a local attention transformer-based approach to extract dependency features of a token (a patch or an image) on its neighborhood's tokens among image patches and among full-view images, respectively.
Abstract: Multi-view finger-vein recognition technology has attracted increasing attention in recent years. Despite recent advances in multi-view finger-vein identification, existing solutions employ multiple monocular cameras at different views to record two-dimensional (2D) projections of 3D vein vessels, which causes the following problems: 1) 2D images collected from limited views (two or three views) are insufficient for robust 3D vein vessel feature representation. Furthermore, image sequences of the same finger acquired from different views usually show significant differences. As a result, the existing works are still sensitive to positional variations of the fingers, specifically those caused by finger roll movements. 2) Using multiple cameras can lead to increased costs. Moreover, it is impossible to employ several cameras to acquire full-view images because of the limited space on capturing devices. To address the above issues, we present FV-LT, a Full-View Finger-Vein identification system based on a Local attention Transformer, by implementing an image acquisition device with a single camera. First, we design and implement a finger-vein acquisition prototype device that utilizes a single camera and an LED group that rotate around a finger for full-view image collection. This allows capturing all vein patterns concealed beneath human skin to form a complete representation of finger features. Second, given the full-view vein images, we propose a local attention transformer-based approach to extract dependency features of a token (a patch or an image) on its neighborhood's tokens among image patches and among full-view images, respectively. These dependency features are shown to be robust to positional variations induced by finger rolls. Based on the public database of full-view finger-vein images captured by our designed device, we verify the performance of the proposed FV-LT. The experimental results show that FV-LT significantly outperforms existing 2D/multi-view based approaches in terms of improving the tolerance to finger roll, and achieves state-of-the-art identification accuracy.

Journal ArticleDOI
TL;DR: In this paper , a variational Retinex model is presented to simultaneously estimate a smoothed illumination component and a detail-revealed reflectance component and predict the noise map from a pre-processed nighttime hazy image in a unified manner.
Abstract: Under a nighttime haze environment, the quality of acquired images deteriorates significantly owing to the influence of multiple adverse degradation factors. In this paper, we develop a multi-purpose oriented haze removal framework focusing on nighttime hazy images. First, we construct a nonlinear model based on the classic Retinex theory to formulate the multiple adverse degradations of a nighttime hazy image. Then, a novel variational Retinex model is presented to simultaneously estimate a smoothed illumination component and a detail-revealed reflectance component and to predict the noise map from a pre-processed nighttime hazy image in a unified manner. Specifically, an $\ell_0$ norm is imposed on the reflectance to reveal the structural details, an $\ell_1$ norm is used to constrain the piece-wise smoothness of the illumination, and an $\ell_2$ norm is applied to enforce the total intensity of the noise map. Afterwards, the haze in the illumination component is removed by a prior-based dehazing method and the contrast of the reflectance component is improved in the gradient domain. Finally, we combine the dehazed illumination and the improved reflectance to generate the haze-free image. Experiments show that our proposed framework performs better than well-known nighttime image dehazing methods both in visual effects and objective comparisons. In addition, the proposed framework is also applicable to other types of degraded images.
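Based only on the terms listed above (data fidelity, an $\ell_0$ term on the reflectance, an $\ell_1$ term on the illumination, and an $\ell_2$ term on the noise map), a plausible reconstruction of the variational Retinex energy is sketched below; placing the $\ell_0$ and $\ell_1$ terms in the gradient domain and the trade-off weights $\lambda_i$ are assumptions, not the authors' exact formulation.

```latex
% Hedged sketch of the variational Retinex energy implied above, for the
% decomposition S = R \circ L + N of the pre-processed nighttime hazy image S.
\begin{equation}
\min_{R,\,L,\,N}\;
\big\| R \circ L + N - S \big\|_2^2
\;+\; \lambda_1 \big\| \nabla R \big\|_0
\;+\; \lambda_2 \big\| \nabla L \big\|_1
\;+\; \lambda_3 \big\| N \big\|_2^2
\end{equation}
```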

Journal ArticleDOI
TL;DR: In this article , an oriented proposal generation mechanism is proposed to explicitly generate oriented proposals, which provides better positional priors for pooling features to modulate the cross-attention in the transformer decoder.
Abstract: Arbitrary-oriented object detection (AOOD) is a challenging task to detect objects in the wild with arbitrary orientations and cluttered arrangements. Existing approaches are mainly based on anchor-based boxes or dense points, which rely on complicated hand-designed processing steps and inductive bias, such as anchor generation, transformation, and non-maximum suppression reasoning. Recently, the emerging transformer-based approaches view object detection as a direct set prediction problem that effectively removes the need for hand-designed components and inductive biases. In this paper, we propose an Arbitrary-Oriented Object DEtection TRansformer framework, termed AO2-DETR, which comprises three dedicated components. More precisely, an oriented proposal generation mechanism is proposed to explicitly generate oriented proposals, which provides better positional priors for pooling features to modulate the cross-attention in the transformer decoder. An adaptive oriented proposal refinement module is introduced to extract rotation-invariant region features and eliminate the misalignment between region features and objects. And a rotation-aware set matching loss is used to ensure the one-to-one matching process for direct set prediction without duplicate predictions. Our method considerably simplifies the overall pipeline and presents a new AOOD paradigm. Comprehensive experiments on several challenging datasets show that our method achieves superior performance on the AOOD task.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper provided an extensive review to bridge the gap between audio-visual fusion and saliency detection, and provided a deep insight into key factors that could directly determine AVSD deep models' performances.
Abstract: Video saliency detection (VSD) aims at quickly locating the most attractive objects/things/patterns in a given video clip. Existing VSD-related works have mainly relied on the visual system but paid less attention to the audio aspect. In contrast, our audio system is the most vital complementary part of our visual system. Also, audio-visual saliency detection (AVSD), one of the most representative research topics for mimicking human perceptual mechanisms, is currently in its infancy, and none of the existing survey papers have touched on it, especially from the perspective of saliency detection. Thus, the ultimate goal of this paper is to provide an extensive review to bridge the gap between audio-visual fusion and saliency detection. In addition, as another highlight of this review, we have provided a deep insight into key factors that could directly determine AVSD deep models' performances. We claim that the audio-visual consistency degree (AVC), a long-overlooked issue, can directly influence the effectiveness of using audio to benefit its visual counterpart when performing saliency detection. Moreover, to make the AVC issue more practical and valuable for future followers, we have newly equipped almost all existing publicly available AVSD datasets with additional frame-wise AVC labels. Based on these upgraded datasets, we have conducted extensive quantitative evaluations to ground our claim on the importance of AVC in the AVSD task. In a word, our ideas and new datasets serve as a convenient platform with preliminaries and guidelines, all of which can potentially facilitate future works in further promoting state-of-the-art (SOTA) performance.

Journal ArticleDOI
TL;DR: In this paper, a modality-interaction-enabled (MIE) similarity generator is first trained to generate a superior MIE similarity matrix for the training set, and the generated MIE similarity matrix is then utilized as guiding information to train the deep hashing networks.
Abstract: Recently, numerous unsupervised cross-modal hashing methods have been proposed to deal with image-text retrieval tasks for unlabeled cross-modal data. However, when these methods learn to generate hash codes, almost all of them lack modality-interaction in the following two aspects: (1) The instance similarity matrix used to guide the training of the hashing networks is constructed without image-text interaction, which fails to capture the fine-grained cross-modal cues to elaborately characterize the intrinsic semantic similarity among the datapoints. (2) The binary codes used for the quantization loss are inferior because they are generated by directly quantizing a simple combination of continuous hash codes from different modalities without interaction among these continuous hash codes. Such problems will cause the generated hash codes to be of poor quality and degrade the retrieval performance. Hence, in this paper, we propose a novel Unsupervised Cross-modal Hashing with Modality-interaction, termed UCHM. Specifically, by optimizing a novel hash-similarity-friendly loss, a modality-interaction-enabled (MIE) similarity generator is first trained to generate a superior MIE similarity matrix for the training set. Then, the generated MIE similarity matrix is utilized as guiding information to train the deep hashing networks. Furthermore, during the process of training the hashing networks, a novel bit-selection module is proposed to generate high-quality unified binary codes for the quantization loss with interaction among continuous codes from different modalities, thereby further enhancing the retrieval performance. Extensive experiments on two widely used datasets show that the proposed UCHM outperforms state-of-the-art techniques on cross-modal retrieval tasks.

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a novel pseudo-monocular 3D object detection framework, which is called Pseudo-Mono, where stereo images are taken as input, then a lightweight depth predictor is used to generate the depth map of input images.
Abstract: Current monocular 3D object detection algorithms generally suffer from inaccurate depth estimation, which leads to reduced detection accuracy. The depth error from image-to-image generation for the stereo view is insignificant compared with the gap in single-image generation. Therefore, a novel pseudo-monocular 3D object detection framework, called Pseudo-Mono, is proposed, in which stereo images are brought into monocular 3D detection. Firstly, stereo images are taken as input, and a lightweight depth predictor is used to generate the depth maps of the input images. Secondly, the left input images obtained from the stereo camera are used as subjects to generate enhanced visual features and multi-scale depth features by depth indexing and feature matching probabilities, respectively. Finally, sparse anchors set by the foreground probability maps and the multi-scale feature maps are used as reference points to find a suitable initialization of the object queries. The encoded visual features are adopted to enhance the object queries, enabling deep interaction between visual features and depth features. Compared with popular monocular 3D object detection methods, Pseudo-Mono is able to achieve richer fine-grained information without additional data input. Extensive experimental results on the KITTI, NuScenes, and MS-COCO datasets demonstrate the generalizability and portability of the proposed method. The effectiveness and efficiency of Pseudo-Mono have been demonstrated by extensive ablation experiments. Experiments on a real vehicle platform have shown that the proposed method maintains high performance in complex real-world environments.

Journal ArticleDOI
TL;DR: LASNet as mentioned in this paper is a novel feature fusion-based network for RGB-T semantic segmentation that follows three steps of location, activation, and sharpening, fully considers the characteristics of cross-modal features at different levels, and accordingly proposes three specific modules for better segmentation.
Abstract: Semantic segmentation is important for scene understanding. To address scenes with adverse illumination conditions in natural images, thermal infrared (TIR) images are introduced. Most existing RGB-T semantic segmentation methods follow three cross-modal fusion paradigms, i.e., encoder fusion, decoder fusion, and feature fusion. Some methods, unfortunately, ignore the properties of RGB and TIR features or the properties of features at different levels. In this paper, we propose a novel feature fusion-based network for RGB-T semantic segmentation, named LASNet, which follows three steps of location, activation, and sharpening. The highlight of LASNet is that we fully consider the characteristics of cross-modal features at different levels, and accordingly propose three specific modules for better segmentation. Concretely, we propose a Collaborative Location Module (CLM) for high-level semantic features, aiming to locate all potential objects. We propose a Complementary Activation Module for middle-level features, aiming to activate exact regions of different objects. We propose an Edge Sharpening Module (ESM) for low-level texture features, aiming to sharpen the edges of objects. Furthermore, in the training phase, we attach a location supervision and an edge supervision after the CLM and ESM, respectively, and impose two semantic supervisions in the decoder part to facilitate network convergence. Experimental results on two public datasets demonstrate the superiority of our LASNet over relevant state-of-the-art methods. The code and results of our method are available at https://github.com/MathLee/LASNet.

Journal ArticleDOI
TL;DR: MSCAF-Net as mentioned in this paper adopts the improved Pyramid Vision Transformer (PVTv2) model as the backbone to extract global contextual information at multiple scales, and an enhanced receptive field (ERF) module is designed to refine the features at each scale.
Abstract: The aim of camouflaged object detection (COD) is to find objects that are hidden in their surrounding environment. Due to factors such as low illumination, occlusion, small size and high similarity to the background, COD is recognized as a very challenging task. In this paper, we propose a general COD framework, termed MSCAF-Net, focusing on learning multi-scale context-aware features. To achieve this target, we first adopt the improved Pyramid Vision Transformer (PVTv2) model as the backbone to extract global contextual information at multiple scales. An enhanced receptive field (ERF) module is then designed to refine the features at each scale. Further, a cross-scale feature fusion (CSFF) module is introduced to achieve sufficient interaction of multi-scale information, aiming to enrich the scale diversity of the extracted features. In addition, inspired by the mechanism of the human visual system, a dense interactive decoder (DID) module is devised to output a rough localization map, which is used to modulate the fused features obtained in the CSFF module for more accurate detection. The effectiveness of our MSCAF-Net is validated on four benchmark datasets. The results show that the proposed method significantly outperforms state-of-the-art (SOTA) COD models by a large margin. Besides, we also investigate the potential of our MSCAF-Net on some other vision tasks that are highly related to COD, such as polyp segmentation, COVID-19 lung infection segmentation, transparent object detection and defect detection. Experimental results demonstrate the high versatility of the proposed MSCAF-Net. The source code and results of our method are available at https://github.com/yuliu316316/MSCAF-COD.

Journal ArticleDOI
TL;DR: Li et al. as discussed by the authors proposed a novel RGB-T semantic segmentation network, called MMSMCNet, based on modal memory fusion and morphological multiscale assistance to address the problem of cross-modal feature fusion.
Abstract: Combining color (RGB) images with thermal images can facilitate semantic segmentation of poorly lit urban scenes. However, for RGB-thermal (RGB-T) semantic segmentation, most existing models address cross-modal feature fusion by focusing only on exploring the samples while neglecting the connections between different samples. Additionally, although the importance of boundary, binary, and semantic information is considered in the decoding process, the differences and complementarities between different morphological features are usually neglected. In this paper, we propose a novel RGB-T semantic segmentation network, called MMSMCNet, based on modal memory fusion and morphological multiscale assistance to address the aforementioned problems. For this network, in the encoding part, we used SegFormer for feature extraction of bimodal inputs. Next, our modal memory sharing module implements staged learning and memory sharing of sample information across modal multiscales. Furthermore, we constructed a decoding union unit comprising three decoding units in a layer-by-layer progression that can extract two different morphological features according to the information category and realize the complementary utilization of multiscale cross-modal fusion information. Each unit contains a contour positioning module based on detail information, a skeleton positioning module with deep features as the primary input, and a morphological complementary module for mutual reinforcement of the first two types of information and construction of semantic information. Based on this, we constructed a new supervision strategy, that is, a multi-unit-based complementary supervision strategy. Extensive experiments using two standard datasets showed that MMSMCNet outperformed related state-of-the-art methods. The code is available at: https://github.com/2021nihao/MMSMCNet.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a cross-view recurrence-based self-supervised mapping framework to correlate complementary information among views in the down-scaled input light field (LF) acquisition conditions.
Abstract: Compared with external-supervised learning-based (ESLB) methods, self-supervised learning-based (SSLB) methods can overcome the domain gap problem caused by different light field (LF) acquisition conditions, which results in performance degradation of light field super-resolution on unseen test datasets. Current SSLB methods exploit the cross-scale recurrence feature in single view images for super-resolution, ignoring the correlation information among views. Different from previous works, we propose a cross-view recurrence-based self-supervised mapping framework to correlate complementary information among views in the down-scaled input LF. Specifically, the cross-view recurrence information consists of geometry structure features and similar structure features. The former provide sub-pixel information according to disparity correlations among adjacent views, and the latter acquire similar color and contour information among arbitrary views, which can compensate for erroneous disparity guidance of the geometry structure features in sharp variance areas. Moreover, instead of the widely used “All-to-All” strategy, we propose a “Part-to-Part” mapping strategy, which is better suited to SSLB approaches with limited training examples extracted solely from the input LF. Finally, considering that self-supervised methods need to retrain from scratch for each test image, based on the proposed “Part-to-Part” strategy, an efficient end-to-end network is designed to extract these cross-view features for superior SASR performance with less training time. Experimental results demonstrate that our method outperforms other state-of-the-art ESLB methods on both large and small domain gap cases. Compared with the only SSLB method (LFZSSR [1]), our approach achieves better performance with 524 times less training time.

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed an effective guided feature domain denoising residual network (GDFN) for real-world noise level estimation using iteratively denoised features, initial image features, and noise level maps.
Abstract: Deep learning-based methods have dominated the field of image denoising with their superior performance. Most of them are non-blind denoising approaches that assume the noise is known at a specific level. However, real-world noise is complex and usually unknown. Since the distribution and level of the noise are often unavailable, non-blind denoising methods suffer severe performance degradation. Therefore, introducing noise levels is crucial for the challenging real-world denoising problem. Meanwhile, we observe that noise level mismatch brings artifacts to the denoised images. An intuitive solution is to use the intermediate denoised images to correct the inaccurate noise level maps. Thus, we introduce an iterative correction scheme, yielding better results than direct noise prediction. We further propose an effective guided feature domain denoising residual network that promotes denoising for various real-world noises using iteratively denoised features, initial image features, and noise level maps. Experimental results on real-world image datasets show that the proposed method provides excellent visual and objective performance for the real-world denoising task.
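The iterative correction scheme can be summarized by the loop below, where `denoiser` and `level_estimator` are hypothetical callables standing in for the paper's networks; the number of iterations and the toy placeholders are assumptions.

```python
# Hedged sketch of the iterative correction loop: denoise with the current
# noise level map, then refine the level map from the intermediate result.
import torch

def iterative_denoise(noisy, denoiser, level_estimator, iters: int = 3):
    """noisy: (B, C, H, W); returns the final denoised estimate."""
    level_map = level_estimator(noisy, None)           # initial noise level guess
    denoised = noisy
    for _ in range(iters):
        denoised = denoiser(noisy, level_map)          # denoise with current level map
        level_map = level_estimator(noisy, denoised)   # correct the level map
    return denoised

# Toy stand-ins so the sketch runs end-to-end (identity denoiser, zero levels).
denoiser = lambda x, lvl: x - 0.0 * lvl
level_estimator = lambda x, d: torch.zeros_like(x[:, :1])
print(iterative_denoise(torch.randn(1, 3, 32, 32), denoiser, level_estimator).shape)
```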

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed an effective no-reference Enhanced Colonoscopy Image Quality (ECIQ) method to automatically evaluate the perceptual quality of ECIs via analysis of brightness, contrast, colorfulness, naturalness, and noise.
Abstract: In colonoscopy, the captured images usually have low-quality appearance, such as non-uniform illumination and low contrast, due to the specialized imaging environment, which may provide poor visual feedback and bring challenges to subsequent disease analysis. Many low-light image enhancement (LIE) algorithms have recently been proposed to improve perceptual quality. However, how to fairly evaluate the quality of enhanced colonoscopy images (ECIs) generated by different LIE algorithms remains a rarely-mentioned and challenging problem. In this study, we carry out a pioneering investigation on perceptual quality assessment of ECIs. Firstly, considering the lack of specific datasets, we collect 300 low-light images with diverse contents during real-world colonoscopy and conduct rigorous subjective studies to compare the performance of 8 popular LIE methods, resulting in a benchmark dataset (named ECIQAD) for ECIs. Secondly, in view of the distinctive distortion characteristics of ECIs, we propose an effective no-reference Enhanced Colonoscopy Image Quality (ECIQ) method to automatically evaluate the perceptual quality of ECIs via analysis of brightness, contrast, colorfulness, naturalness, and noise. Extensive experiments on ECIQAD demonstrate the superiority of our proposed ECIQ method over 14 mainstream no-reference image quality assessment methods.
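The kind of hand-crafted statistics such a no-reference model might analyze can be illustrated as follows; the brightness and contrast measures are simple global statistics and the colorfulness index follows Hasler and Süsstrunk, none of which is claimed to be the authors' ECIQ formulation.

```python
# Hedged sketch of simple brightness, contrast, and colorfulness measures.
import numpy as np

def brightness(img: np.ndarray) -> float:
    return float(img.mean())                           # mean intensity, img in [0, 255]

def contrast(img: np.ndarray) -> float:
    return float(img.std())                            # global RMS contrast

def colorfulness(img: np.ndarray) -> float:
    # Hasler & Suesstrunk colorfulness index on an RGB image.
    r = img[..., 0].astype(float)
    g = img[..., 1].astype(float)
    b = img[..., 2].astype(float)
    rg, yb = r - g, 0.5 * (r + g) - b
    return float(np.sqrt(rg.std() ** 2 + yb.std() ** 2)
                 + 0.3 * np.sqrt(rg.mean() ** 2 + yb.mean() ** 2))

img = np.random.randint(0, 256, size=(256, 256, 3)).astype(np.uint8)
print(brightness(img), contrast(img), colorfulness(img))
```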

Journal ArticleDOI
TL;DR: In this paper , a 2D-discrete wavelet transform (DWT) is introduced to obtain the low-frequency and high-frequency components of features and further used to generate the two essential features following different routing paths in the encoder process.
Abstract: Anomaly detection plays an important role in manufacturing quality control/assurance. Among approaches adopting computer vision techniques, reconstruction-based methods learn a content-aware mapping function that transfers abnormal regions to normal regions in an unsupervised manner. Such methods usually have difficulty in improving both the reconstruction quality and the capacity for abnormal discovery. We observe that high-level semantic contextual features demonstrate a strong ability for abnormal discovery, while variational features help to preserve fine image details. Inspired by this observation, we propose a new anomaly detection model that utilizes features for different purposes depending on their frequency characteristics. The 2D discrete wavelet transform (DWT) is introduced to obtain the low-frequency and high-frequency components of features, which are further used to generate the two essential features following different routing paths in our encoder process. To further improve the capacity for abnormal discovery, we propose a novel feature augmentation module that is informed by a customized self-attention mechanism. Extensive experiments are conducted on two popular datasets: MVTec AD and BTAD. The experimental results illustrate that the proposed method outperforms other state-of-the-art approaches in terms of the image-level AUROC score. In particular, our method achieves a 100% image-level AUROC score on 8 out of 15 classes of the MVTec dataset.
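The low/high-frequency split via the 2D DWT can be illustrated with PyWavelets as below; the 'haar' wavelet and the single-channel input are assumptions for the sketch, not the paper's configuration.

```python
# Hedged sketch: split one feature channel into low- and high-frequency
# components with a 2D DWT, route them separately, and verify invertibility.
import numpy as np
import pywt

feat = np.random.randn(64, 64).astype(np.float32)       # one feature channel
cA, (cH, cV, cD) = pywt.dwt2(feat, 'haar')               # low-freq band and 3 high-freq bands

low_freq = cA                                            # e.g., routed to a "semantic" path
high_freq = np.stack([cH, cV, cD])                       # e.g., routed to a "detail" path
recon = pywt.idwt2((cA, (cH, cV, cD)), 'haar')           # reconstruction check
print(low_freq.shape, high_freq.shape, np.allclose(recon, feat, atol=1e-5))
```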

Journal ArticleDOI
TL;DR: In this article , the capacity of sharing multiple secrets in XOR-based VCS (XVCS) was exploited and three efficient base matrix constructions were proposed for realizing the XVCS.
Abstract: The multiple-secret visual cryptography scheme (MVCS) and the fully incrementing visual cryptography scheme (FIVCS) share the same functionality: different secrets are gradually revealed by stacking different numbers of shadows. In essence, MVCS and FIVCS are the same. However, both schemes suffer from large pixel expansion and deteriorated reconstructed image quality. In addition, MVCS and FIVCS require intensive computations to create base matrices. In this research, we exploit the capacity of sharing multiple secrets in XOR-based VCS (XVCS). First of all, three efficient base matrix constructions are proposed for realizing the $(k, n)$ non-monotonic XVCS (NXVCS), where the secret image is only revealed by XOR-ing exactly $k$ shadows. The $(k, n)$-NXVCS is adopted to constitute the multiple-secret XVCS (MXVCS). Theoretical analysis of the proposed constructions is provided. Extensive experiments and comparisons illustrate that the pixel expansion, the visual quality of the recovered image and the efficiency of generating base matrices are significantly improved by the proposed MXVCS compared with MVCS and FIVCS.
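The basic XOR-decryption principle behind XVCS can be demonstrated with a simple (n, n) sharing below; the paper's $(k, n)$ NXVCS base-matrix constructions are more involved and are not reproduced here.

```python
# Hedged sketch: (n, n) XOR secret sharing. n-1 random shadows plus one
# computed shadow XOR back to the secret binary image.
import numpy as np

def xor_share(secret: np.ndarray, n: int):
    """secret: binary image (H, W); returns n binary shadows."""
    shadows = [np.random.randint(0, 2, secret.shape, dtype=np.uint8) for _ in range(n - 1)]
    last = secret.copy()
    for s in shadows:
        last ^= s                                   # fold each random shadow into the last one
    return shadows + [last]

secret = np.random.randint(0, 2, (64, 64), dtype=np.uint8)
shadows = xor_share(secret, n=4)
recovered = np.bitwise_xor.reduce(np.stack(shadows), axis=0)
assert np.array_equal(recovered, secret)            # XOR-ing all n shadows reveals the secret
```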

Journal ArticleDOI
TL;DR: In this article, a meta-learning based decision tree framework is proposed to make the decision process of few-shot learning (FSL) interpretable, which is achieved from two aspects, i.e., a concept aspect and a visual aspect.
Abstract: Few-Shot Learning (FSL) is a challenging task, which aims to recognize novel classes with few examples. Recently, many methods have been proposed from the perspectives of meta-learning and representation learning. However, few works focus on the interpretability of the FSL decision process. In this paper, we take a step towards interpretable FSL by proposing a novel meta-learning based decision tree framework, namely MetaDT. In particular, FSL interpretability is achieved from two aspects, i.e., a concept aspect and a visual aspect. On the concept aspect, we first introduce a tree-like concept hierarchy as an FSL prior. Then, resorting to this prior, we split each few-shot task into a set of subtasks with different concept levels and perform class prediction via a decision tree model. The advantage of this design is that a sequence of high-level concept decisions leading up to the final class prediction can be obtained, which clarifies the FSL decision process. On the visual aspect, a set of subtask-specific classifiers with a visual attention mechanism is designed to perform the decision at each node of the decision tree. As a result, a subtask-specific heatmap visualization can be obtained to achieve the decision interpretability of each tree node. Finally, to alleviate the data scarcity issue of FSL, we regard the concept hierarchy prior as an undirected graph, and design a graph convolution-based decision tree inference network as our meta-learner to infer the parameters of the decision tree. Extensive experiments on performance comparison and interpretability analysis show the superiority of our MetaDT.