
Showing papers by "Wenhan Yang published in 2023"


Journal Article•DOI•
TL;DR: The authors propose a novel framework to learn a compact representation in the latent space serving as the metadata in an end-to-end manner, which leads to better reconstruction quality, a smaller metadata size, and faster speed.
Abstract: While raw images exhibit advantages over sRGB images (e.g., linearity and fine-grained quantization level), they are not widely used by common users due to the large storage requirements. Very recent works propose to compress raw images by designing the sampling masks in the raw image pixel space, leading to suboptimal image representations and redundant metadata. In this paper, we propose a novel framework to learn a compact representation in the latent space serving as the metadata in an end-to-end manner. Furthermore, we propose a novel sRGB-guided context model with improved entropy estimation strategies, which leads to better reconstruction quality, smaller size of metadata, and faster speed. We illustrate how the proposed raw image compression scheme can adaptively allocate more bits to image regions that are important from a global perspective. The experimental results show that the proposed method can achieve superior raw image reconstruction results using a smaller size of the metadata on both uncompressed sRGB images and JPEG images.

5 citations
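The paper describes learning a compact latent code that is stored as metadata next to the sRGB image and later used, together with the sRGB image, to reconstruct the raw image. As a rough illustration of that pipeline only (the module shapes, channel counts, and rounding-based quantization below are assumptions, not the authors' architecture, and the sRGB-guided context model is omitted), a minimal PyTorch-style sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentEncoder(nn.Module):
    """Maps a (packed, 4-channel) raw image to a compact latent stored as metadata."""
    def __init__(self, ch=64, latent_ch=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, latent_ch, 3, stride=2, padding=1),
        )

    def forward(self, raw):
        return self.net(raw)

class SRGBGuidedDecoder(nn.Module):
    """Reconstructs the raw image from the sRGB image plus the decoded latent."""
    def __init__(self, ch=64, latent_ch=8):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(3 + latent_ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 4, 3, padding=1),
        )

    def forward(self, srgb, latent):
        latent = F.interpolate(latent, size=srgb.shape[-2:], mode="nearest")
        return self.fuse(torch.cat([srgb, latent], dim=1))

def quantize(latent):
    # Hard rounding at test time; training would use additive uniform noise or a
    # straight-through estimator, and the rounded latent would be entropy-coded
    # (in the paper, with an sRGB-guided context model).
    return torch.round(latent)

# Toy end-to-end pass (both inputs assumed pre-aligned and in [0, 1]).
raw = torch.rand(1, 4, 256, 256)
srgb = torch.rand(1, 3, 256, 256)
metadata = quantize(LatentEncoder()(raw))
raw_hat = SRGBGuidedDecoder()(srgb, metadata)
distortion = F.mse_loss(raw_hat, raw)   # distortion term of a rate-distortion objective
```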


Journal Article•DOI•
TL;DR: The authors propose a frequency-based trigger injection model that adds triggers in the discrete cosine transform (DCT) domain to attack compression quality in terms of bit-rate and reconstruction quality.
Abstract: Recent deep-learning-based compression methods have achieved superior performance compared with traditional approaches. However, deep learning models have proven to be vulnerable to backdoor attacks, where some specific trigger patterns added to the input can lead to malicious behavior of the models. In this paper, we present a novel backdoor attack with multiple triggers against learned image compression models. Motivated by the widely used discrete cosine transform (DCT) in existing compression systems and standards, we propose a frequency-based trigger injection model that adds triggers in the DCT domain. In particular, we design several attack objectives for various attacking scenarios, including: 1) attacking compression quality in terms of bit-rate and reconstruction quality; 2) attacking task-driven measures, such as down-stream face recognition and semantic segmentation. Moreover, a novel simple dynamic loss is designed to balance the influence of different loss terms adaptively, which helps achieve more efficient training. Extensive experiments show that with our trained trigger injection models and simple modification of encoder parameters (of the compression model), the proposed attack can successfully inject several backdoors with corresponding triggers in a single image compression model.

2 citations
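As a loose illustration of what "adding a trigger in the DCT domain" means (the paper learns the trigger with a trigger-injection model; the fixed frequency positions and amplitude below are assumptions made only for this sketch):

```python
import numpy as np
from scipy.fft import dctn, idctn

def add_dct_trigger(img, amplitude=8.0, freqs=((3, 5), (5, 3), (7, 7))):
    """img: float array in [0, 255] with shape (H, W, C); returns the triggered image."""
    out = np.empty_like(img)
    for c in range(img.shape[2]):
        coeffs = dctn(img[..., c], norm="ortho")      # per-channel 2-D DCT
        for u, v in freqs:                            # perturb a few mid-frequency bins
            coeffs[u, v] += amplitude
        out[..., c] = idctn(coeffs, norm="ortho")     # back to the pixel domain
    return np.clip(out, 0.0, 255.0)

clean = np.random.rand(256, 256, 3) * 255.0
poisoned = add_dct_trigger(clean)                     # visually close to `clean`
```

Because the perturbation lives in a few frequency coefficients, it stays visually inconspicuous while remaining a consistent pattern a compromised encoder can key on.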


Journal Article•DOI•
TL;DR: In this article, a unified framework with two cooperative modules is proposed to remove image artifacts caused by a scratched lens protector, a task that is inherently challenging due to occasional flare artifacts and the co-occurring interference within mixed artifacts.
Abstract: A protector is placed in front of the camera lens for mobile devices to avoid damage, while the protector itself can be easily scratched accidentally, especially for plastic ones. The artifacts appear in a wide variety of patterns, making it difficult to see through them clearly. Removing image artifacts from the scratched lens protector is inherently challenging due to the occasional flare artifacts and the co-occurring interference within mixed artifacts. Though different methods have been proposed for some specific distortions, they seldom consider such inherent challenges. In our work, we consider the inherent challenges in a unified framework with two cooperative modules, which facilitate the performance boost of each other. We also collect a new dataset from the real world to facilitate training and evaluation purposes. The experimental results demonstrate that our method outperforms the baselines qualitatively and quantitatively. The code and datasets will be released after acceptance.

1 citation


Journal Article•DOI•
TL;DR: In graph contrastive learning, HLCL uses a high-pass and a low-pass graph filter to generate different views of the same node and contrasts the two filtered views to learn the final node representations.
Abstract: Graph Neural Networks are powerful tools for learning node representations when task-specific node labels are available. However, obtaining labels for graphs is expensive in many applications. This is particularly the case for large graphs. To address this, there has been a body of work to learn node representations in a self-supervised manner without labels. Contrastive learning (CL) has been particularly popular for learning representations in a self-supervised manner. In general, CL methods work by maximizing the similarity between representations of augmented views of the same example, and minimizing the similarity between augmented views of different examples. However, existing graph CL methods cannot learn high-quality representations under heterophily, where connected nodes tend to belong to different classes. This is because under heterophily, augmentations of the same example may not be similar to each other. In this work, we address the above problem by proposing the first graph CL method, HLCL, for learning node representations under heterophily. HLCL uses a high-pass and a low-pass graph filter to generate different views of the same node. Then, it contrasts the two filtered views to learn the final node representations. Effectively, the high-pass filter captures the dissimilarity between nodes in a neighborhood and the low-pass filter captures the similarity between neighboring nodes. Contrasting the two filtered views allows HLCL to learn rich node representations for graphs, under both heterophily and homophily. Empirically, HLCL outperforms state-of-the-art graph CL methods on benchmark heterophily datasets and large-scale real-world datasets by up to 10%.

1 citation
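A compact sketch of the two filtered views and the contrastive step described above, using a dense symmetrically normalized adjacency for clarity (HLCL's actual encoder, filters, and loss may differ; this is an illustrative assumption, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def normalized_adjacency(adj):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    a = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a.sum(dim=1).clamp(min=1e-12).pow(-0.5)
    return d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]

def filtered_views(adj, x):
    a_norm = normalized_adjacency(adj)
    low = a_norm @ x          # low-pass view: smooths features over neighbors
    high = x - a_norm @ x     # high-pass view: emphasizes neighborhood differences
    return low, high

def contrast(z1, z2, tau=0.5):
    """InfoNCE-style loss treating the two filtered views of each node as positives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

adj = (torch.rand(100, 100) < 0.05).float()
adj = ((adj + adj.t()) > 0).float()          # symmetric toy graph
x = torch.randn(100, 16)                     # node features (a learned encoder would go here)
loss = contrast(*filtered_views(adj, x))
```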


15 Jul 2023
TL;DR: The authors integrate a diffusion model with a physics-based exposure model, allowing restoration to start directly from a noisy image instead of pure noise and achieving significantly improved performance and reduced inference time compared with vanilla diffusion models.
Abstract: Previous raw image-based low-light image enhancement methods predominantly relied on feed-forward neural networks to learn deterministic mappings from low-light to normally-exposed images. However, they failed to capture critical distribution information, leading to visually undesirable results. This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Different from a vanilla diffusion model that has to perform Gaussian denoising, with the injected physics-based exposure model, our restoration process can directly start from a noisy image instead of pure noise. As such, our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. To make full use of the advantages of different intermediate steps, we further propose an adaptive residual layer that effectively screens out the side effects in the iterative refinement when the intermediate results have already been well exposed. Note that the proposed framework is compatible with real-paired datasets, real/synthetic noise models, and different backbone networks. We evaluate the proposed method on various public benchmarks, achieving promising results with consistent improvements using different exposure models and backbones. Besides, the proposed method achieves better generalization capacity for unseen amplifying ratios and better performance than a larger feed-forward neural model when only a few parameters are adopted.
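To make the "start from a noisy image instead of pure noise" idea concrete, here is a heavily simplified sketch: the amplified low-light observation is treated as a sample at an intermediate timestep and refined by a DDIM-style reverse pass. The exposure model, noise schedule, starting step, and denoiser below are placeholders, not the paper's design, and the adaptive residual layer is omitted.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def amplify(low_light, ratio):
    """Toy physics-based exposure model: linear amplification of the low-light signal."""
    return (low_light * ratio).clamp(0.0, 1.0)

@torch.no_grad()
def restore(low_light, denoiser, ratio=8.0, t_start=300):
    # Treat the amplified observation as a diffusion state at step t_start.
    a = alphas_cumprod[t_start]
    x = a.sqrt() * amplify(low_light, ratio) + (1.0 - a).sqrt() * torch.randn_like(low_light)
    for t in range(t_start, -1, -1):
        eps = denoiser(x, torch.tensor([t]))                   # predicted noise
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x0 = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean image
        x = a_prev.sqrt() * x0 + (1.0 - a_prev).sqrt() * eps   # deterministic DDIM-style step
    return x.clamp(0.0, 1.0)

# Stand-in denoiser so the sketch runs; a pretrained network would go here.
dummy_denoiser = lambda x, t: torch.zeros_like(x)
out = restore(torch.rand(1, 3, 64, 64) * 0.05, dummy_denoiser)
```

Starting at an intermediate step rather than step T is what cuts the number of reverse iterations and hence inference time.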

Journal Article•DOI•
TL;DR: SpuCo is a Python package with modular implementations of state-of-the-art solutions that enables easy and reproducible evaluation of current methods, and it demonstrates the limitations of existing datasets and evaluation schemes in validating the learning of predictive features over spurious ones.
Abstract: Deep neural networks often exploit non-predictive features that are spuriously correlated with class labels, leading to poor performance on groups of examples without such features. Despite the growing body of recent works on remedying spurious correlations, the lack of a standardized benchmark hinders reproducible evaluation and comparison of the proposed solutions. To address this, we present SpuCo, a Python package with modular implementations of state-of-the-art solutions enabling easy and reproducible evaluation of current methods. Using SpuCo, we demonstrate the limitations of existing datasets and evaluation schemes in validating the learning of predictive features over spurious ones. To overcome these limitations, we propose two new vision datasets: (1) SpuCoMNIST, a synthetic dataset that enables simulating the effect of real-world data properties, e.g., the difficulty of learning the spurious feature, as well as noise in the labels and features; (2) SpuCoAnimals, a large-scale dataset curated from ImageNet that captures spurious correlations in the wild much more closely than existing datasets. These contributions highlight the shortcomings of current methods and provide a direction for future research in tackling spurious correlations. SpuCo, containing the benchmark and datasets, can be found at https://github.com/BigML-CS-UCLA/SpuCo, with detailed documentation available at https://spuco.readthedocs.io/en/latest/.
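SpuCo's own API is documented at the links above; purely as a generic illustration of the kind of metric such group-robustness benchmarks report (not SpuCo code), here is a worst-group accuracy computation with groups formed by (class, spurious attribute):

```python
from collections import defaultdict

def worst_group_accuracy(preds, labels, spurious):
    """Accuracy of the worst-performing (class, spurious-attribute) group."""
    correct, total = defaultdict(int), defaultdict(int)
    for p, y, s in zip(preds, labels, spurious):
        g = (y, s)
        total[g] += 1
        correct[g] += int(p == y)
    return min(correct[g] / total[g] for g in total)

# Tiny toy example: one group drags the score down to 0.5 even though
# overall accuracy is 0.8.
print(worst_group_accuracy(preds=[0, 0, 1, 1, 0],
                           labels=[0, 1, 1, 1, 0],
                           spurious=[0, 1, 0, 1, 1]))
```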

Journal Article•DOI•
Yi Ma, Hua Yang, Wenhan Yang, Jianlong Fu, Jiaying Liu 
TL;DR: A plug-and-play sampling method is proposed to steadily sample high-quality SR images from pretrained diffusion-based SR models by solving diffusion ordinary differential equations (diffusion ODEs) with optimal boundary conditions (BCs), and the relationship between the choice of BCs and the corresponding SR results is analyzed.
Abstract: Diffusion models, as a kind of powerful generative model, have given impressive results on image super-resolution (SR) tasks. However, due to the randomness introduced in the reverse process of diffusion models, the performance of diffusion-based SR models fluctuates from one sampling run to the next, especially for samplers with few resampled steps. This inherent randomness of diffusion models results in ineffectiveness and instability, making it challenging for users to guarantee the quality of SR results. However, our work takes this randomness as an opportunity: fully analyzing and leveraging it leads to the construction of an effective plug-and-play sampling method that has the potential to benefit a series of diffusion-based SR methods. In more detail, we propose to steadily sample high-quality SR images from pretrained diffusion-based SR models by solving diffusion ordinary differential equations (diffusion ODEs) with optimal boundary conditions (BCs) and analyze the relationship between the choices of BCs and their corresponding SR results. Our analysis shows the route to obtain an approximately optimal BC via an efficient exploration of the whole space. The quality of SR results sampled by the proposed method with fewer steps outperforms the quality of results sampled by current methods with randomness from the same pretrained diffusion-based SR model, which means that our sampling method "boosts" current diffusion-based SR models without any additional training.
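A very rough sketch of the "approximately optimal boundary condition" idea: evaluate a handful of candidate initial latents once, keep the one whose SR outputs score best, and reuse it for deterministic ODE sampling afterwards. The solver, scoring function, and random-search strategy below are stand-ins, not the paper's procedure.

```python
import torch

@torch.no_grad()
def pick_boundary_condition(ode_solver, lr_images, score_fn, n_candidates=16,
                            shape=(1, 3, 256, 256)):
    """Return the candidate initial latent whose SR results score best on average."""
    best_z, best_score = None, float("-inf")
    for _ in range(n_candidates):
        z = torch.randn(shape)                      # candidate boundary condition
        score = sum(score_fn(ode_solver(z, lr)) for lr in lr_images) / len(lr_images)
        if score > best_score:
            best_z, best_score = z, score
    return best_z               # reused for all later (now deterministic) sampling

# Stand-ins so the sketch runs; a pretrained diffusion-ODE solver and a
# no-reference quality metric would go here.
solver = lambda z, lr: lr + 0.0 * z
quality = lambda sr: -float(sr.var())
z_star = pick_boundary_condition(solver, [torch.rand(1, 3, 256, 256)], quality, n_candidates=4)
```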

17 Jul 2023
TL;DR: The authors propose a similarity min-max paradigm that considers both image-level and model-level adaptation under a unified framework to solve zero-shot day-night domain adaptation.
Abstract: Low-light conditions not only hamper human visual experience but also degrade the model's performance on downstream vision tasks. While existing works make remarkable progress on day-night domain adaptation, they rely heavily on domain knowledge derived from the task-specific nighttime dataset. This paper challenges a more complicated scenario with broader applicability, i.e., zero-shot day-night domain adaptation, which eliminates reliance on any nighttime data. Unlike prior zero-shot adaptation approaches emphasizing either image-level translation or model-level adaptation, we propose a similarity min-max paradigm that considers them under a unified framework. On the image level, we darken images towards minimum feature similarity to enlarge the domain gap. Then, on the model level, we maximize the feature similarity between the darkened images and their normal-light counterparts for better model adaptation. To the best of our knowledge, this work represents the pioneering effort in jointly optimizing both aspects, resulting in a significant improvement of model generalizability. Extensive experiments demonstrate our method's effectiveness and broad applicability on various nighttime vision tasks, including classification, semantic segmentation, visual place recognition, and video action recognition. Code and pre-trained models are available at https://red-fairy.github.io/ZeroShotDayNightDA-Webpage/.
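A bare-bones sketch of alternating the two levels described above. The darkening module, feature extractor, and the omission of task losses are all simplifying assumptions for illustration; they are not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feat_sim(f1, f2):
    return F.cosine_similarity(f1.flatten(1), f2.flatten(1), dim=1).mean()

def image_level_min(darken, backbone, img, opt_darken):
    """Image level: update the darkening module to MINIMIZE feature similarity (enlarge the gap)."""
    loss = feat_sim(backbone(darken(img)), backbone(img).detach())
    opt_darken.zero_grad(); loss.backward(); opt_darken.step()

def model_level_max(darken, backbone, img, opt_model):
    """Model level: update the backbone to MAXIMIZE similarity between darkened and normal features."""
    with torch.no_grad():
        dark = darken(img)
    loss = -feat_sim(backbone(dark), backbone(img))
    opt_model.zero_grad(); loss.backward(); opt_model.step()

# Toy modules for one runnable round of the alternation.
darken = nn.Sequential(nn.Conv2d(3, 3, 1), nn.Sigmoid())
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1))
img = torch.rand(2, 3, 64, 64)
image_level_min(darken, backbone, img, torch.optim.SGD(darken.parameters(), lr=1e-3))
model_level_max(darken, backbone, img, torch.optim.SGD(backbone.parameters(), lr=1e-3))
```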



09 Jul 2023
TL;DR: The authors propose a novel approach to increase the visibility of images captured in low-light environments by removing the in-camera infrared (IR) cut-off filter, which allows more photons to be captured and improves the signal-to-noise ratio by including information from the IR spectrum.
Abstract: The low-light image enhancement task is essential yet challenging, as it is intrinsically ill-posed. Previous works mainly focus on low-light images captured in the visible spectrum using pixel-wise losses, which limits the capacity to recover brightness, contrast, and texture details due to the small number of incoming photons. In this work, we propose a novel approach to increase the visibility of images captured under low-light environments by removing the in-camera infrared (IR) cut-off filter, which allows for the capture of more photons and results in an improved signal-to-noise ratio due to the inclusion of information from the IR spectrum. To verify the proposed strategy, we collect a paired dataset of low-light images captured without the IR cut-off filter, with corresponding long-exposure reference images captured with an external filter. The experimental results on the proposed dataset demonstrate the effectiveness of the proposed method, showing better performance quantitatively and qualitatively. The dataset and code are publicly available at https://wyf0912.github.io/ELIEI/

Journal Article•DOI•
TL;DR: The authors propose a novel framework that learns a compact representation in the latent space, serving as metadata, in an end-to-end manner, and analyze the intrinsic difference of the raw image reconstruction task caused by the rich information available from the sRGB image.
Abstract: While raw images have distinct advantages over sRGB images, e.g., linearity and fine-grained quantization levels, they are not widely adopted by general users due to their substantial storage requirements. Very recent studies propose to compress raw images by designing sampling masks within the pixel space of the raw image. However, these approaches often leave space for pursuing more effective image representations and compact metadata. In this work, we propose a novel framework that learns a compact representation in the latent space, serving as metadata, in an end-to-end manner. Compared with lossy image compression, we analyze the intrinsic difference of the raw image reconstruction task caused by the rich information from the sRGB image. Based on the analysis, a novel backbone design with asymmetric and hybrid spatial feature resolutions is proposed, which significantly improves the rate-distortion performance. Besides, we propose a novel design of the context model, which can better predict the order masks of encoding/decoding based on both the sRGB image and the masks of already processed features. Benefiting from the better modeling of the correlation between order masks, the already processed information can be better utilized. Moreover, a novel sRGB-guided adaptive quantization precision strategy, which dynamically assigns varying levels of quantization precision to different regions, further enhances the representation ability of the model. Finally, based on the iterative properties of the proposed context model, we propose a novel strategy to achieve variable bit rates using a single model. This strategy allows for continuous coverage of a wide range of bit rates. Extensive experimental results demonstrate that the proposed method can achieve better reconstruction quality with a smaller metadata size.
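As one concrete reading of "sRGB-guided adaptive quantization precision": a small network predicts a per-region quantization step from the sRGB image, so complex regions get finer steps. The predictor, step range, and grid size below are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepPredictor(nn.Module):
    """Predicts a coarse per-region quantization step size from the sRGB image."""
    def __init__(self, min_step=0.25, max_step=2.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=8, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )
        self.min_step, self.max_step = min_step, max_step

    def forward(self, srgb):
        s = self.net(srgb)                                     # values in (0, 1)
        return self.min_step + (self.max_step - self.min_step) * s

def adaptive_quantize(latent, step):
    """Smaller step => higher precision (more bits) for that region."""
    step = F.interpolate(step, size=latent.shape[-2:], mode="nearest")
    return torch.round(latent / step) * step

latent = torch.randn(1, 8, 32, 32)                             # latent metadata to be coded
step = StepPredictor()(torch.rand(1, 3, 256, 256))             # guided by the sRGB image
quantized = adaptive_quantize(latent, step)
```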

Journal Article•DOI•
TL;DR: RoCLIP is a method for robust pretraining and fine-tuning of multimodal vision-language models that considers a pool of random examples and matches every image with the text in the pool that is most similar to its caption.
Abstract: Contrastive vision-language representation learning has achieved state-of-the-art performance for zero-shot classification by learning from millions of image-caption pairs crawled from the internet. However, the massive data that powers large multimodal models such as CLIP makes them extremely vulnerable to various types of adversarial attacks, including targeted and backdoor data poisoning attacks. Despite this vulnerability, robust contrastive vision-language pretraining against adversarial attacks has remained unaddressed. In this work, we propose RoCLIP, the first effective method for robust pretraining and fine-tuning of multimodal vision-language models. RoCLIP effectively breaks the association between poisoned image-caption pairs by considering a pool of random examples, and (1) matching every image with the text that is most similar to its caption in the pool, and (2) matching every caption with the image that is most similar to its image in the pool. Our extensive experiments show that our method renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training or fine-tuning of CLIP. In particular, RoCLIP decreases the poison and backdoor attack success rates down to 0% during pre-training and 1%-4% during fine-tuning, and effectively improves the model's performance.
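A sketch of the pool-matching step described in points (1) and (2): instead of pairing each image with its own caption, pair it with the nearest pool caption (and each caption with the nearest pool image) before computing the contrastive loss. The encoders and the pool update rule are placeholders here, not RoCLIP's implementation.

```python
import torch
import torch.nn.functional as F

def pool_match(img_emb, txt_emb, txt_pool, img_pool):
    img_emb, txt_emb = F.normalize(img_emb, dim=1), F.normalize(txt_emb, dim=1)
    txt_pool, img_pool = F.normalize(txt_pool, dim=1), F.normalize(img_pool, dim=1)
    # (1) For every image, take the pool text most similar to its own caption.
    nn_txt = txt_pool[(txt_emb @ txt_pool.t()).argmax(dim=1)]
    # (2) For every caption, take the pool image most similar to its own image.
    nn_img = img_pool[(img_emb @ img_pool.t()).argmax(dim=1)]
    return nn_txt, nn_img

img_emb, txt_emb = torch.randn(32, 512), torch.randn(32, 512)      # batch embeddings
txt_pool, img_pool = torch.randn(1024, 512), torch.randn(1024, 512)  # random-example pool
matched_txt, matched_img = pool_match(img_emb, txt_emb, txt_pool, img_pool)
# matched_txt / matched_img then replace the original pairs in the CLIP-style loss,
# breaking any image-caption association a poisoner relies on.
```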

Journal Article•DOI•
TL;DR: In this paper, a cross-modal label contrastive learning method is proposed to exploit multi-modal information among unlabeled audio and visual streams as self-supervision signals.
Abstract: This paper for the first time explores audio-visual event localization in an unsupervised manner. Previous methods tackle this problem in a supervised setting and require segment-level or video-level event category ground truth to train the model. However, building large-scale multi-modality datasets with category annotations is human-intensive and thus not scalable to real-world applications. To this end, we propose cross-modal label contrastive learning to exploit multi-modal information among unlabeled audio and visual streams as self-supervision signals. At the feature representation level, multi-modal representations are collaboratively learned from audio and visual components using self-supervised representation learning. At the label level, we propose a novel self-supervised pretext task, i.e., label contrasting, to self-annotate videos with pseudo-labels for localization model training. Note that irrelevant background would hinder the acquisition of high-quality pseudo-labels and thus lead to an inferior localization model. To address this issue, we then propose an expectation-maximization algorithm that optimizes the pseudo-label acquisition and the localization model in a coarse-to-fine manner. Extensive experiments demonstrate that our unsupervised approach performs reasonably well compared to state-of-the-art supervised methods.
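A generic sketch of the EM-style alternation described above: the E-step derives segment pseudo-labels from the current model, and the M-step refits the localization model on them, with progressively stricter labeling. The thresholding rule and the toy model interface are assumptions for illustration only, not the paper's label-contrastive formulation.

```python
import random

class ToyLocalizer:
    """Stand-in for the localization model (hypothetical interface)."""
    def segment_scores(self, video):
        return [random.random() for _ in range(10)]   # per-segment event scores

    def fit(self, videos, pseudo_labels):
        pass                                          # retraining omitted in this sketch

def em_pseudo_label_training(model, videos, rounds=3, threshold=0.5):
    for _ in range(rounds):
        # E-step: self-annotate segments with pseudo event labels.
        pseudo = {v: [int(s > threshold) for s in model.segment_scores(v)]
                  for v in videos}
        # M-step: refit the localization model on the pseudo-labeled segments.
        model.fit(videos, pseudo)
        threshold = min(threshold + 0.1, 0.9)         # coarse-to-fine: stricter each round
    return model

model = em_pseudo_label_training(ToyLocalizer(), ["video_a", "video_b"])
```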

Journal Article•DOI•
TL;DR: In this paper, a GAN inversion framework built on existing pretrained generative adversarial networks (GANs) is proposed to bridge the gap between full fidelity (for human vision) and high discrimination (for machine vision).
Abstract: Although recent learning-based image and video coding techniques have achieved rapid development, the signal-fidelity-driven target in these methods leads to divergence from a highly effective and efficient coding framework for both human and machine vision. In this paper, we aim to address the issue by making use of the power of generative models to bridge the gap between full fidelity (for human vision) and high discrimination (for machine vision). Therefore, relying on existing pretrained generative adversarial networks (GANs), we build a GAN inversion framework that projects the image into a low-dimensional natural image manifold. In this manifold, the feature is highly discriminative and also encodes the appearance information of the image, and is referred to as the latent code. Taking a variational bit-rate constraint with a hyperprior model to model/suppress the entropy of the image manifold code, our method is capable of fulfilling the needs of both machine and human vision at very low bit-rates. To improve the visual quality of image reconstruction, we further propose multiple latent codes and scalable inversion. The former obtains several latent codes in the inversion, while the latter additionally compresses and transmits a shallow compact feature to support visual reconstruction. Experimental results demonstrate the superiority of our method in both human vision tasks, i.e., image reconstruction, and machine vision tasks, including semantic parsing and attribute prediction.
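To illustrate the basic rate-constrained inversion loop (the generator, the crude rate proxy, and the loss weights below are stand-ins; the paper uses a hyperprior entropy model and pretrained GANs, not this toy setup):

```python
import torch
import torch.nn.functional as F

def invert(image, generator, latent_dim=512, steps=200, lam=0.01, lr=0.05):
    """Optimize a latent code so the frozen generator reproduces `image`,
    while a rate term keeps the code cheap to transmit."""
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        recon = generator(z)
        distortion = F.mse_loss(recon, image)
        rate = z.pow(2).mean()                 # crude proxy for the hyperprior entropy model
        loss = distortion + lam * rate
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()                          # this code would be quantized and entropy-coded

# Toy usage with a frozen linear stand-in for a pretrained GAN generator.
G = torch.nn.Sequential(torch.nn.Linear(512, 3 * 64 * 64), torch.nn.Unflatten(1, (3, 64, 64)))
for p in G.parameters():
    p.requires_grad_(False)
z_hat = invert(torch.rand(1, 3, 64, 64), G, steps=20)
```

The recovered latent code is both the compressed representation (for human-oriented reconstruction) and a discriminative feature usable by downstream machine vision tasks.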

DOI•
TL;DR: The authors propose a novel haze generation model called HazeGEN that couples a variational autoencoder with a generative adversarial network to automatically generate annotated datasets.
Abstract: Improving the performance of high-level computer vision tasks in adverse weather (e.g., haze) is highly critical for autonomous driving safety. However, collecting and annotating training sets for various high-level tasks in haze weather are expensive and time-consuming. To address this issue, we propose a novel haze generation model called HazeGEN by coupling the variational autoencoder and the generative adversarial network to automatically generate annotated datasets. The proposed HazeGEN leverages a shared latent space assumption based on an optimized encoder–decoder architecture, which guarantees high fidelity in the cross-domain image translations. To ensure that the generated image can truly facilitate high-level vision task performance, a semisupervised learning strategy is developed for HazeGEN to efficiently learn the useful knowledge from both the real-world images (with unsupervised losses) and the synthetic images generated following the atmosphere scattering model (with supervised losses). Extensive experiments and ablation studies demonstrate that training the model with our generated haze dataset greatly improves accuracy in high-level tasks such as semantic segmentation and object detection. Furthermore, one important but under-exploited issue is investigated to find out whether the developed dataset can be a good substitute for the real ones. Results show that the generated dataset has the most similar performance to the real-world collected haze dataset on multiple challenging industrial scenarios compared with prior works.
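The supervised branch mentioned above relies on images hazed with the atmospheric scattering model, I(x) = J(x) t(x) + A (1 - t(x)) with t(x) = exp(-beta d(x)). A minimal synthesis sketch (the depth map, beta, and airlight A are assumed inputs; HazeGEN itself is a learned VAE-GAN, not this formula alone):

```python
import numpy as np

def synthesize_haze(clear, depth, beta=1.0, airlight=0.9):
    """clear: (H, W, 3) image in [0, 1]; depth: (H, W) scene depth in arbitrary units."""
    t = np.exp(-beta * depth)[..., None]          # transmission map t(x) = exp(-beta * d(x))
    return clear * t + airlight * (1.0 - t)       # I = J * t + A * (1 - t)

hazy = synthesize_haze(np.random.rand(128, 128, 3), np.random.rand(128, 128) * 5.0)
```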