Showing papers by "Zhibo Chen" published in 2022


Journal ArticleDOI
TL;DR: A learning-based approach is proposed to effectively predict the QTMT structure without heuristically exploring the partitions of each layer; it requires only a single inference to obtain the entire partition information of the current CU and its sub-CUs, and the inference is highly parallel.
Abstract: As one of the key technologies of Versatile Video Coding (VVC), the flexible quad-tree with nested multi-type tree (QTMT) partition structure significantly improves rate-distortion (RD) performance. However, this structure brings additional complexity due to the recursive search for the best partition type. Traditional fast partition methods in previous encoders cannot adapt to this new complex structure, because it is too complicated to predict block sizes from one layer to the next. Some indirect bottom-up methods are simple enough, but they cannot predict specific split structures, which limits their acceleration capability. Therefore, in this paper, we propose a learning-based approach to effectively predict the QTMT structure without having to heuristically explore the partitions of each layer. Firstly, we propose a hierarchy grid fully convolutional network (HG-FCN) framework, which requires only a single inference to obtain the entire partition information of the current CU and its sub-CUs, and the inference is highly parallel. Secondly, we design a representation of the complicated QTMT CU partition in the form of a hierarchy grid map (HGM), which can directly and effectively predict the specific hierarchical split structure. Lastly, a dual-threshold decision scheme is adopted to automatically control the trade-off between coding performance and complexity. Extensive experiments demonstrate the effectiveness of HG-FCN, which reduces the complexity of VVC intra coding by 51.15%∼65.53% with a negligible 1.17%∼2.19% BD-BR increase, superior to other state-of-the-art methods.
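For illustration, a minimal sketch (not the authors' code; the threshold values and function names are assumptions) of how such a dual-threshold decision can prune the recursive RD search:

```python
# Given a network-predicted probability that a CU should be split, skip the RD checks the
# prediction rules out, and fall back to full RDO only in the uncertain band.

def decide_partition(split_prob: float, th_low: float = 0.2, th_high: float = 0.8) -> set:
    """Return the set of partition options that still must be RD-checked."""
    if split_prob >= th_high:          # confident split: skip the non-split RD check
        return {"split"}
    if split_prob <= th_low:           # confident non-split: prune all split modes
        return {"no_split"}
    return {"split", "no_split"}       # uncertain: run full RDO on both branches

# Example: a predicted probability of 0.93 prunes the non-split check entirely.
print(decide_partition(0.93))          # {'split'}
```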

16 citations


Proceedings Article
Tao Yu, Zhizheng Zhang, Cuiling Lan, Zhibo Chen, Yan Lu 
28 Jan 2022
TL;DR: A simple yet effective self-supervised method, Mask-based Latent Reconstruction (MLR), is proposed to predict complete state representations in the latent space from observations with spatially and temporally masked pixels; experiments demonstrate the superiority of MLR in improving the sample efficiency of RL algorithms.
Abstract: For deep reinforcement learning (RL) from pixels, learning effective state representations is crucial for achieving high performance. However, in practice, limited experience and high-dimensional inputs prevent effective representation learning. To address this, motivated by the success of mask-based modeling in other research fields, we introduce mask-based reconstruction to promote state representation learning in RL. Specifically, we propose a simple yet effective self-supervised method, Mask-based Latent Reconstruction (MLR), to predict complete state representations in the latent space from the observations with spatially and temporally masked pixels. MLR enables better use of context information when learning state representations to make them more informative, which facilitates the training of RL agents. Extensive experiments show that our MLR significantly improves the sample efficiency in RL and outperforms the state-of-the-art sample-efficient RL methods on multiple continuous and discrete control benchmarks. Our code is available at https://github.com/microsoft/Mask-based-Latent-Reconstruction.
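As a rough illustration of the idea, the toy sketch below (all module sizes are assumptions; the released code should be consulted for the actual method) masks random pixel patches of a short frame stack and trains a predictor to reconstruct the latent features of the unmasked observations; the paper uses spatio-temporal cube masking and a momentum target encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_obs(obs, patch=12, ratio=0.5):
    """Randomly zero square patches of each frame; obs has shape (B, T, C, H, W)."""
    b, t, c, h, w = obs.shape
    keep = (torch.rand(b, t, 1, h // patch, w // patch) > ratio).float()
    keep = F.interpolate(keep.view(b * t, 1, h // patch, w // patch), size=(h, w), mode="nearest")
    return obs * keep.view(b, t, 1, h, w)

encoder = nn.Sequential(nn.Conv2d(3, 32, 8, 4), nn.ReLU(),
                        nn.Conv2d(32, 32, 4, 2), nn.ReLU(),
                        nn.Flatten(), nn.LazyLinear(128))
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

obs = torch.rand(4, 3, 3, 84, 84)                  # batch of 3-frame pixel observations
masked = mask_obs(obs)

z_online = encoder(masked.flatten(0, 1))           # latents of the masked frames
predicted = predictor(z_online)
with torch.no_grad():                              # target latents from unmasked frames;
    target = encoder(obs.flatten(0, 1)).detach()   # the paper uses a momentum target encoder

loss = 1 - F.cosine_similarity(predicted, target, dim=-1).mean()
loss.backward()
```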

9 citations


Proceedings ArticleDOI
13 Jul 2022
TL;DR: A Progressive Reinforcement-learning-based Instance Discarding module (termed PRID) is proposed to progressively remove quality-irrelevant/negative instances for CCTA VIQA, based on end-to-end optimization.
Abstract: Coronary CT Angiography (CCTA) is susceptible to various distortions (e.g., artifacts and noise), which severely compromise the exact diagnosis of cardiovascular diseases. An appropriate CCTA Vessel-level Image Quality Assessment (CCTA VIQA) algorithm can be used to reduce the risk of misdiagnosis. The primary challenge of CCTA VIQA is that the local part of the coronary artery that determines the final quality is hard to locate. To tackle this challenge, we formulate CCTA VIQA as a multiple-instance learning (MIL) problem, and exploit a Transformer-based MIL backbone (termed T-MIL) to aggregate the multiple instances along the coronary centerline into the final quality. However, not all instances are informative for the final quality. There are some quality-irrelevant/negative instances interfering with exact quality assessment (e.g., instances covering only background, or instances in which the coronary is not identifiable). Therefore, we propose a Progressive Reinforcement-learning-based Instance Discarding module (termed PRID) to progressively remove quality-irrelevant/negative instances for CCTA VIQA. Based on the above two modules, we propose a Reinforced Transformer Network (RTN) for automatic CCTA VIQA based on end-to-end optimization. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art performance on a real-world CCTA dataset, exceeding previous MIL methods by a large margin.
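A hedged sketch of the Transformer-based MIL aggregation step (instance counts, dimensions, and the quality head are illustrative assumptions): a learnable quality token attends over the instance embeddings sampled along the centerline and is mapped to the vessel-level score.

```python
import torch
import torch.nn as nn

class TransformerMIL(nn.Module):
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        self.quality_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, 1)

    def forward(self, instances):                  # (B, N, dim) instance embeddings
        tok = self.quality_token.expand(instances.size(0), -1, -1)
        out = self.encoder(torch.cat([tok, instances], dim=1))
        return self.head(out[:, 0])                # vessel-level quality from the token

bag = torch.randn(2, 40, 256)                      # e.g. 40 instances along one centerline
print(TransformerMIL()(bag).shape)                 # torch.Size([2, 1])
```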

8 citations


Journal ArticleDOI
09 May 2022
TL;DR: A full-reference image quality assessment metric, SwinIQA, is designed to measure the perceptual quality of compressed images in a learned Swin distance space; it can not only help verify the performance of various compression algorithms but also help guide compression optimization in turn.
Abstract: Image compression has raised widespread interest recently due to its significant importance for multimedia storage and transmission. Meanwhile, a reliable image quality assessment (IQA) method for compressed images can not only help verify the performance of various compression algorithms but also help guide compression optimization in turn. In this paper, we design a full-reference image quality assessment metric, SwinIQA, to measure the perceptual quality of compressed images in a learned Swin distance space. It is known that compression artifacts are usually non-uniformly distributed, with diverse distortion types and degrees. To warp the compressed images into the shared representation space while maintaining the complex distortion information, we extract hierarchical feature representations from each stage of the Swin Transformer. Besides, we utilize a cross-attention operation to map the extracted feature representations into a learned Swin distance space. Experimental results show that the proposed metric achieves higher consistency with human perceptual judgment than both traditional and learning-based methods on the CLIC datasets.
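As a rough illustration (not the authors' implementation), the sketch below extracts hierarchical features of the reference and the compressed image from each stage of a backbone and lets a cross-attention module turn the paired stage features into a learned distance; a small CNN stands in for the Swin Transformer so the example runs.

```python
import torch
import torch.nn as nn

stages = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU()),
])
cross_attn = nn.ModuleList([nn.MultiheadAttention(c, 4, batch_first=True) for c in (32, 64, 128)])
heads = nn.ModuleList([nn.Linear(c, 1) for c in (32, 64, 128)])

def learned_distance(ref, dist):
    """Accumulate a learned quality distance over hierarchical stages."""
    score, x_r, x_d = 0.0, ref, dist
    for stage, attn, head in zip(stages, cross_attn, heads):
        x_r, x_d = stage(x_r), stage(x_d)
        t_r = x_r.flatten(2).transpose(1, 2)       # tokens: (B, H*W, C)
        t_d = x_d.flatten(2).transpose(1, 2)
        fused, _ = attn(query=t_d, key=t_r, value=t_r)   # distorted tokens query the reference
        score = score + head((fused - t_d).abs().mean(dim=1)).squeeze(-1)
    return score                                    # one distance value per image in the batch

ref, dist = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(learned_distance(ref, dist).shape)            # torch.Size([2])
```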

7 citations


Journal ArticleDOI
Bin Li, Xin Li, Sen Liu, Ruoyu Feng, Zhibo Chen 
21 Aug 2022
TL;DR: The Hierarchical Swin Transformer (HST) network is proposed to restore low-resolution compressed images; it jointly captures hierarchical feature representations and enhances the representation at each scale with Swin Transformer blocks.
Abstract: Compressed image super-resolution has attracted great attention in recent years, where images are degraded by both compression artifacts and low resolution. Because of these complex hybrid distortions, it is hard to restore the distorted image with a simple combination of super-resolution and compression artifact removal. In this paper, we take a step forward and propose the Hierarchical Swin Transformer (HST) network to restore the low-resolution compressed image; it jointly captures hierarchical feature representations and enhances the representation at each scale with Swin Transformer blocks. Moreover, we find that pretraining with a super-resolution (SR) task is vital for compressed image super-resolution. To explore the effects of different SR pretraining strategies, we take commonly-used SR tasks (e.g., bicubic and different real super-resolution simulations) as our pretraining tasks, and reveal that SR plays an irreplaceable role in compressed image super-resolution. With the cooperation of HST and pre-training, our HST achieved fifth place in the AIM 2022 challenge on the low-quality compressed image super-resolution track, with a PSNR of 23.51 dB. Extensive experiments and ablation studies have validated the effectiveness of our proposed methods. The code and models are available at https://github.com/USTC-IMCL/HST-for-Compressed-Image-SR.

7 citations


Journal ArticleDOI
TL;DR: This paper takes the first step towards source-free unsupervised domain adaptation (SFUDA) for BIQA in a simple yet efficient manner, tackling domain shift without access to the source data.
Abstract: Existing learning-based methods for blind image quality assessment (BIQA) are heavily dependent on large amounts of annotated training data, and usually suffer from severe performance degradation when encountering the domain/distribution shift problem. Thanks to the development of unsupervised domain adaptation (UDA), some works attempt to transfer the knowledge from a label-sufficient source domain to a label-free target domain under domain shift with UDA. However, this requires the coexistence of source and target data, which might be impractical for the source data due to privacy or storage issues. In this paper, we take the first step towards source-free unsupervised domain adaptation (SFUDA) for BIQA in a simple yet efficient manner, tackling the domain shift without access to the source data. Specifically, we cast the quality assessment task as a rating distribution prediction problem. Based on the intrinsic properties of BIQA, we present a group of well-designed self-supervised objectives to guide the adaptation of the BN affine parameters towards the target domain. Among them, minimizing the prediction entropy and maximizing the batch prediction diversity encourage more confident results while avoiding the trivial solution. Besides, based on the observation that the IQA rating distribution of a single image follows a Gaussian distribution, we apply Gaussian regularization to the predicted rating distribution to make it more consistent with the nature of human scoring. Extensive experimental results under cross-domain scenarios demonstrate the effectiveness of our proposed method in mitigating the domain shift.
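A minimal sketch of the self-supervised adaptation objectives, under assumed shapes and an assumed 5-bin rating scale (not the paper's code): only the BatchNorm affine parameters of the source-trained model would be optimized, with entropy minimization, batch-diversity maximization, and a Gaussian prior on the predicted rating distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bn_affine_parameters(model):
    """Yield only the gamma/beta parameters of BatchNorm layers (the only ones adapted)."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            yield from (p for p in (m.weight, m.bias) if p is not None)
# e.g. optimizer = torch.optim.Adam(bn_affine_parameters(source_model), lr=1e-4)  # hypothetical

def gaussian_target(mean, std, bins):
    """Discretized Gaussian over the rating bins, used as a prior on the prediction."""
    pdf = torch.exp(-0.5 * ((bins - mean.unsqueeze(1)) / std.unsqueeze(1)) ** 2)
    return pdf / pdf.sum(dim=1, keepdim=True)

def adaptation_loss(pred_dist, bins):
    p = pred_dist.clamp_min(1e-8)
    entropy = -(p * p.log()).sum(dim=1).mean()          # confident per-image predictions
    marginal = p.mean(dim=0)
    diversity = (marginal * marginal.log()).sum()        # negative batch entropy: avoid collapse
    mean = (p * bins).sum(dim=1)
    std = ((p * (bins - mean.unsqueeze(1)) ** 2).sum(dim=1) + 1e-8).sqrt()
    gauss = F.kl_div(p.log(), gaussian_target(mean, std, bins), reduction="batchmean")
    return entropy + diversity + gauss                    # loss weights omitted for brevity

bins = torch.linspace(1, 5, 5)                            # a made-up 5-bin rating scale
pred = torch.softmax(torch.randn(8, 5, requires_grad=True), dim=1)
print(adaptation_loss(pred, bins))
```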

5 citations


Proceedings ArticleDOI
05 Jul 2022
TL;DR: This paper attempts to develop an ICM framework by learning universal features as omnipotent features while also considering compression, and designs a novel information filtering module between them by co-optimization of instance distinguishment and entropy minimization.
Abstract: Image Coding for Machines (ICM) aims to compress images for AI task analysis rather than for human perception. Learning a kind of feature that is both general (for AI tasks) and compact (for compression) is pivotal for its success. In this paper, we attempt to develop an ICM framework by learning universal features while also considering compression. We name such features omnipotent features and the corresponding framework Omni-ICM. Considering that self-supervised learning (SSL) improves feature generalization, we integrate it with the compression task into the Omni-ICM framework to learn omnipotent features. However, it is non-trivial to coordinate semantics modeling in SSL and redundancy removal in compression, so we design a novel information filtering (IF) module between them by co-optimization of instance distinguishment and entropy minimization, to adaptively drop information that is weakly related to AI tasks (e.g., some texture redundancy). Different from previous task-specific solutions, Omni-ICM can directly support AI task analysis based on the learned omnipotent features without joint training or extra transformation. Albeit simple and intuitive, Omni-ICM significantly outperforms existing traditional and learning-based codecs on multiple fundamental vision tasks.
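A hedged sketch of the co-optimization idea (module names, the rate proxy, and the loss weights are illustrative assumptions, not the paper's entropy model): an InfoNCE instance-discrimination term preserves task-relevant semantics while a bit-rate term pushes the filtering module to drop information that costs bits without helping to distinguish instances.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """Instance discrimination: matching augmented views sit on the diagonal."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

class RateProxy(nn.Module):
    """Bit estimate under a learned factorized Gaussian (a stand-in for a real entropy model)."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))
        self.log_sigma = nn.Parameter(torch.zeros(dim))
    def forward(self, y):
        y = y + torch.empty_like(y).uniform_(-0.5, 0.5)   # additive-noise quantization proxy
        nll = 0.5 * ((y - self.mu) / self.log_sigma.exp()) ** 2 + self.log_sigma
        return nll.sum(dim=1).mean() / torch.log(torch.tensor(2.0))  # approx. bits per instance

filt = nn.Linear(256, 128)            # "information filtering" module on SSL features
rate = RateProxy(128)

f1, f2 = torch.randn(16, 256), torch.randn(16, 256)       # two augmented-view features
y1, y2 = filt(f1), filt(f2)
loss = info_nce(y1, y2) + 0.01 * (rate(y1) + rate(y2))
loss.backward()
```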

4 citations


Journal ArticleDOI
TL;DR: An advanced Semantically Structured Video Coding (SSVC) framework is proposed, which encodes continuous motion information and reduces cross-frame redundancy via a predictive coding architecture; the optical flow and residual information are then reorganized into the semantically structured bitstream (SSB), which enables the proposed SSVC to adaptively support video-based downstream intelligent applications.

3 citations


Journal ArticleDOI
TL;DR: Extensive experiments demonstrate the effectiveness of the proposed Deep Frequency Filtering and show that applying DFF on a plain baseline outperforms the state-of-the-art methods on different domain generalization tasks, including close-set classification and open-set retrieval.
Abstract: Improving the generalization ability of Deep Neural Networks (DNNs) is critical for their practical uses, which has been a longstanding challenge. Some theoretical studies have uncovered that DNNs have preferences for some frequency components in the learning process and indicated that this may affect the robustness of learned features. In this paper, we propose Deep Frequency Filtering (DFF) for learning domain-generalizable features, which is the first endeavour to explicitly modulate the frequency components of different transfer difficulties across domains in the latent space during training. To achieve this, we perform Fast Fourier Transform (FFT) for the feature maps at different layers, then adopt a light-weight module to learn attention masks from the frequency representations after FFT to enhance transferable components while suppressing the components not conducive to generalization. Further, we empirically compare the effectiveness of adopting different types of attention designs for implementing DFF. Extensive experiments demonstrate the effectiveness of our proposed DFF and show that applying our DFF on a plain baseline outperforms the state-of-the-art methods on different domain generalization tasks, including close-set classification and open-set retrieval.
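A minimal sketch of frequency-space filtering on a feature map (illustrative, not the paper's code): take a 2D FFT, predict a per-frequency attention mask with a lightweight module, re-weight the spectrum, and transform back with the inverse FFT.

```python
import torch
import torch.nn as nn

class FrequencyFilter(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # lightweight mask predictor operating on the (real, imaginary) spectrum
        self.mask_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(),
            nn.Conv2d(channels, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W) feature map
        spec = torch.fft.fft2(x, norm="ortho")   # complex spectrum
        mask_in = torch.cat([spec.real, spec.imag], dim=1)
        mask = self.mask_net(mask_in)            # attention over frequency components
        filtered = spec * mask                   # suppress hard-to-transfer components
        return torch.fft.ifft2(filtered, norm="ortho").real

feat = torch.randn(4, 64, 32, 32)
print(FrequencyFilter(64)(feat).shape)           # torch.Size([4, 64, 32, 32])
```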

3 citations


Journal ArticleDOI
TL;DR: This paper presents ActiveMLP, a general MLP-like backbone for computer vision, and proposes an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate contextual information from other tokens in the global scope into the given one.
Abstract: This paper presents ActiveMLP , a general MLP-like backbone for computer vision. The three existing dominant network families, i . e ., CNNs, Transformers and MLPs, differ from each other mainly in the ways to fuse contextual information into a given token, leaving the design of more effective token-mixing mechanisms at the core of backbone architecture development. In ActiveMLP, we propose an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate contextual information from other tokens in the global scope into the given one. This fundamental operator actively predicts where to capture useful contexts and learns how to fuse the captured contexts with the original information of the given token at channel levels. In this way, the spatial range of token-mixing is expanded and the way of token-mixing is reformed. With this design, ActiveMLP is endowed with the merits of global receptive fields and more flexible content-adaptive information fusion. Extensive experiments demonstrate that ActiveMLP is generally applicable and comprehensively surpasses different families of SOTA vision backbones by a clear margin on a broad range of vision tasks, including visual recognition and dense prediction tasks. The code and models will be available at https://github.com/microsoft/ActiveMLP .

3 citations


Journal ArticleDOI
Tao Yu, Zhihe Lu, Xin Jin, Zhibo Chen, Xinchao Wang 
TL;DR: Task Residual Tuning (TaskRes) is proposed as a new efficient tuning approach for VLMs, which performs directly on the text-based classifier and explicitly decouples the prior knowledge of the pre-trained models from the new knowledge regarding a target task.
Abstract: Large-scale vision-language models (VLMs) pre-trained on billion-level data have learned general visual representations and broad visual concepts. In principle, the well-learned knowledge structure of the VLMs should be inherited appropriately when being transferred to downstream tasks with limited data. However, most existing efficient transfer learning (ETL) approaches for VLMs either damage or are excessively biased towards the prior knowledge, e.g., prompt tuning (PT) discards the pre-trained text-based classifier and builds a new one while adapter-style tuning (AT) fully relies on the pre-trained features. To address this, we propose a new efficient tuning approach for VLMs named Task Residual Tuning (TaskRes), which performs directly on the text-based classifier and explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task. Specifically, TaskRes keeps the original classifier weights from the VLMs frozen and obtains a new classifier for the target task by tuning a set of prior-independent parameters as a residual to the original one, which enables reliable prior knowledge preservation and flexible task-specific knowledge exploration. The proposed TaskRes is simple yet effective, which significantly outperforms previous ETL methods (e.g., PT and AT) on 11 benchmark datasets while requiring minimal effort for the implementation. Our code is available at https://github.com/geekyutao/TaskRes.
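A minimal sketch of the task-residual idea (dimensions are made up; see the released code for the actual implementation): the text-based classifier weights from the VLM stay frozen, and only a zero-initialized residual added on top is tuned on the target task.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskResidualClassifier(nn.Module):
    def __init__(self, text_classifier_weights, alpha=0.5):
        super().__init__()
        # frozen prior: class embeddings obtained from the pre-trained text encoder
        self.register_buffer("base", text_classifier_weights)
        # prior-independent parameters, initialized to zero and tuned on the target task
        self.residual = nn.Parameter(torch.zeros_like(text_classifier_weights))
        self.alpha = alpha

    def forward(self, image_features):
        weights = F.normalize(self.base + self.alpha * self.residual, dim=-1)
        image_features = F.normalize(image_features, dim=-1)
        return image_features @ weights.t()        # cosine-similarity logits

# Toy usage with made-up dimensions: 10 classes, 512-d CLIP-like embeddings.
clf = TaskResidualClassifier(torch.randn(10, 512))
logits = clf(torch.randn(4, 512))
print([n for n, p in clf.named_parameters() if p.requires_grad])   # ['residual']
```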

Journal ArticleDOI
TL;DR: An effective perception-oriented unsupervised domain adaptation method StyleAM is proposed for NR-IQA, which transfers sufficient knowledge from label-rich source domain data to label-free target domain images via Style Alignment and Mixup.
Abstract: Deep neural networks (DNNs) have shown great potential in no-reference image quality assessment (NR-IQA). However, the annotation for NR-IQA is labor-intensive and time-consuming, which severely limits its application, especially for authentic images. To relieve the dependence on quality annotation, some works have applied unsupervised domain adaptation (UDA) to NR-IQA. However, these methods ignore that the alignment space used in classification is sub-optimal, since the space is not elaborately designed for perception. To solve this challenge, we propose an effective perception-oriented unsupervised domain adaptation method, StyleAM, for NR-IQA, which transfers sufficient knowledge from label-rich source domain data to label-free target domain images via Style Alignment and Mixup. Specifically, we find a more compact and reliable space, i.e., the feature style space, for perception-oriented UDA based on an interesting observation: the feature style (i.e., the mean and variance) of the deep layers in DNNs is closely associated with the quality score in NR-IQA. Therefore, we propose to align the source and target domains in this more perception-oriented space, i.e., the feature style space, to reduce the intervention from other quality-irrelevant feature factors. Furthermore, to increase the consistency between the quality score and its feature style, we also propose a novel feature augmentation strategy, Style Mixup, which mixes the feature styles (i.e., the mean and variance) before the last layer of the DNN together with mixing their labels. Extensive experimental results on two typical cross-domain settings (i.e., synthetic to authentic, and multiple distortions to one distortion) demonstrate the effectiveness of our proposed StyleAM on NR-IQA.
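A hedged sketch of the Style Mixup step (shapes and the Beta coefficient are assumptions): the channel-wise mean and standard deviation, i.e. the feature style, of two samples are mixed before the prediction head, and their quality labels are mixed with the same coefficient.

```python
import torch

def style_mixup(feat_a, feat_b, label_a, label_b, lam=None, eps=1e-6):
    """feat_*: (B, C, H, W) deep features; label_*: (B,) quality scores."""
    if lam is None:
        lam = torch.distributions.Beta(0.4, 0.4).sample().item()
    mu_a, sig_a = feat_a.mean(dim=(2, 3), keepdim=True), feat_a.std(dim=(2, 3), keepdim=True)
    mu_b, sig_b = feat_b.mean(dim=(2, 3), keepdim=True), feat_b.std(dim=(2, 3), keepdim=True)
    mu_mix = lam * mu_a + (1 - lam) * mu_b
    sig_mix = lam * sig_a + (1 - lam) * sig_b
    normalized = (feat_a - mu_a) / (sig_a + eps)           # keep content, swap in the mixed style
    mixed_feat = normalized * sig_mix + mu_mix
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_feat, mixed_label

fa, fb = torch.randn(8, 256, 7, 7), torch.randn(8, 256, 7, 7)
ya, yb = torch.rand(8), torch.rand(8)
mf, my = style_mixup(fa, fb, ya, yb)
print(mf.shape, my.shape)    # torch.Size([8, 256, 7, 7]) torch.Size([8])
```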

Journal ArticleDOI
TL;DR: MiNL is a novel MI-wise implicit neural representation for light fields that trains an MLP + CNN to learn a mapping from 2D MI coordinates to MI colors; it is more compact and efficient, with faster decoding speed as well as better visual quality.
Abstract: Traditional representations for light fields can be separated into two types: explicit representation and implicit representation. Unlike explicit representation, which represents light fields as Sub-Aperture Image (SAI) based arrays or Micro-Image (MI) based lenslet images, implicit representation treats light fields as neural networks, which is inherently a continuous representation in contrast to the discrete explicit representation. However, at present almost all implicit representations for light fields utilize SAIs to train an MLP to learn a pixel-wise mapping from a 4D spatial-angular coordinate to pixel colors, which is neither compact nor of low complexity. Instead, in this paper we propose MiNL, a novel MI-wise implicit neural representation for light fields that trains an MLP + CNN to learn a mapping from 2D MI coordinates to MI colors. Given a micro-image's coordinate, MiNL outputs the corresponding micro-image's RGB values. Light field encoding in MiNL is just training a neural network to regress the micro-images, and the decoding process is a simple feedforward operation. Compared with common pixel-wise implicit representations, MiNL is more compact and efficient, with faster decoding (80×∼180× speed-up) as well as better visual quality (1∼4 dB PSNR improvement on average). With such a representation, all the information of a light field is stored in the parameters of a neural network, which can support several light field-related tasks at the same time. For example, compared with mainstream light field compression methods that have complex processing pipelines, our proposed method transforms the light field compression task into a model compression task and can achieve performance comparable with state-of-the-art methods through simple neural network training, with about 1∼2 dB PSNR improvement over HEVC/H.265 at the same bit rate.
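A toy sketch of the MI-wise mapping (sizes, the decoder layout, and the absence of positional encoding are assumptions, not the paper's network): an MLP embeds the 2D micro-image coordinate and a small decoder outputs the whole micro-image in one forward pass.

```python
import torch
import torch.nn as nn

class MicroImageNet(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 64 * 5 * 5))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=3), nn.ReLU(),   # 5x5 -> 15x15
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, coords):                    # coords: (N, 2) normalized MI positions
        x = self.mlp(coords).view(-1, 64, 5, 5)
        return self.decoder(x)                    # (N, 3, 15, 15) micro-images

net = MicroImageNet()
coords = torch.rand(32, 2)                        # a batch of micro-image coordinates
target = torch.rand(32, 3, 15, 15)                # the corresponding ground-truth MIs
loss = torch.nn.functional.mse_loss(net(coords), target)
loss.backward()
```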

Xin Li, Xin Jin, Jun Fu, Xiaoyuan Yu, Bei Tong, Zhibo Chen 
TL;DR: This paper is the first to investigate few-shot real image super-resolution; it proposes a Distortion-Relation guided Transfer Learning (DRTL) framework and instantiates DRTL with pre-training and meta-learning pipelines as an embodiment to realize distortion-relation-aware FS-RSR.
Abstract: Collecting large clean-distorted training image pairs in the real world is non-trivial, which seriously limits the practical applications of supervised-learning-based image super-resolution (SR) methods. Previous works attempt to address this problem by leveraging unsupervised learning technologies to alleviate the dependency on paired training samples. However, these methods typically suffer from unsatisfactory texture synthesis due to the lack of clean image supervision. Compared with purely unsupervised solutions, the under-explored scheme with Few-Shot clean images (FS-RSR) is more feasible for tackling this challenging real image super-resolution task. In this paper, we are the first to investigate few-shot real image super-resolution and propose a Distortion-Relation guided Transfer Learning (DRTL) framework. DRTL assigns a knowledge graph to capture the distortion relation between auxiliary tasks (i.e., synthetic distortions) and target tasks (i.e., real distortions with few images), and then adopts a gradient weighting strategy to guide the knowledge transfer from auxiliary tasks to the target task. In this way, DRTL can quickly learn the most relevant knowledge from the prior distortions for the target distortion. We instantiate DRTL with pre-training and meta-learning pipelines as an embodiment to realize distortion-relation-aware FS-RSR. Extensive experiments on multiple benchmarks demonstrate the effectiveness of DRTL on few-shot real image super-resolution.
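A hedged sketch of the gradient-weighting idea (task names and relation scores are illustrative; in the paper the relation comes from a distortion-relation knowledge graph): scaling each auxiliary task's loss by its relation to the target distortion scales that task's gradient contribution accordingly.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()

# Assumed relation scores between auxiliary (synthetic) distortions and the target distortion.
relation = {"bicubic": 0.3, "blur_noise": 0.7}

batches = {
    "bicubic": (torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32)),
    "blur_noise": (torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32)),
}

optimizer.zero_grad()
total = 0.0
for task, (lq, hq) in batches.items():
    # scaling the loss by the relation weight scales that task's gradients accordingly
    total = total + relation[task] * criterion(model(lq), hq)
total.backward()
optimizer.step()
```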

Proceedings ArticleDOI
24 Aug 2022
TL;DR: This work proposes a learned lossless JPEG transcoding framework via joint lossy and residual compression, and is the first to utilize learned end-to-end lossy transform coding to reduce the redundancy of DCT coefficients in a compact representational domain.
Abstract: As a commonly-used image compression format, JPEG has been broadly applied in the transmission and storage of images. To further reduce the compression cost while maintaining the quality of JPEG images, lossless transcoding technology has been proposed to recompress the compressed JPEG image in the DCT domain. However, previous works typically reduce the redundancy of DCT coefficients and optimize the probability prediction of entropy coding in a hand-crafted manner that lacks generalization ability and flexibility. To tackle the above challenge, we propose a learned lossless JPEG transcoding framework via joint lossy and residual compression. Instead of directly optimizing the entropy estimation, we focus on the redundancy that exists in the DCT coefficients. To the best of our knowledge, we are the first to utilize learned end-to-end lossy transform coding to reduce the redundancy of DCT coefficients in a compact representational domain. We also introduce residual compression for lossless transcoding, which adaptively learns the distribution of the residual DCT coefficients before compressing them using context-based entropy coding. Our proposed transcoding architecture shows significant superiority in the compression of JPEG images thanks to the collaboration of learned lossy transform coding and residual entropy coding. Extensive experiments on multiple datasets have demonstrated that our proposed framework achieves about 21.49% bit savings on average relative to JPEG compression, outperforming the typical lossless transcoding framework JPEG-XL by 3.51%.

Proceedings ArticleDOI
07 Dec 2022
TL;DR: A novel light field compression scheme based on implicit neural representation is proposed to reduce redundancy between views, achieving comparable rate-distortion performance as well as superior perceptual quality over traditional methods.
Abstract: Light field, as a new data representation format in multimedia, has the ability to capture both the intensity and direction of light rays. However, the additional angular information also brings a large volume of data. Classical coding methods are not effective at describing the relationship between different views, leaving redundancy unexploited. To address this problem, we propose a novel light field compression scheme based on implicit neural representation to reduce redundancy between views. We store the information of a light field image implicitly in a neural network and adopt model compression methods to further compress the implicit representation. Extensive experiments have demonstrated the effectiveness of our proposed method, which achieves comparable rate-distortion performance as well as superior perceptual quality over traditional methods.

Book ChapterDOI
TL;DR: In this paper, the authors propose a new concept of go-getting domain labels (Go-labels) to replace the original immutable domain labels on the fly, and demonstrate through theoretical insights, empirical results on real data, and toy games that their method leads to efficient training without bells and whistles, while being robust to different backbones.
Abstract: In this paper, we propose an embarrassingly simple yet highly effective adversarial domain adaptation (ADA) method. We view the ADA problem primarily from an optimization perspective and point out a fundamental dilemma: real-world data often exhibits an imbalanced distribution in which large data clusters typically dominate and bias the adaptation process. Unlike prior works that attempt loss re-weighting or data re-sampling to alleviate this defect, we introduce a new concept of go-getting domain labels (Go-labels) to replace the original immutable domain labels on the fly. We call them "Go-labels" because "go-getting" means being able to deal with new or difficult situations easily; here, Go-labels adaptively transfer the model's attention from over-studied aligned data to overlooked samples, which allows each sample to be well studied (i.e., alleviating the influence of data imbalance) and fully unleashes the potential of the adaptation model. Albeit simple, this dynamic adversarial domain adaptation framework with Go-labels effectively addresses the data imbalance issue and promotes adaptation. We demonstrate through theoretical insights, empirical results on real data, and toy games that our method leads to efficient training without bells and whistles, while being robust to different backbones.

Journal ArticleDOI
TL;DR: ATMNet as mentioned in this paper proposes an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate contextual information from other tokens in the global scope into the given query token.
Abstract: The three existing dominant network families, i.e., CNNs, Transformers and MLPs, differ from each other mainly in the ways of fusing spatial contextual information, leaving designing more effective token-mixing mechanisms at the core of backbone architecture development. In this work, we propose an innovative token-mixer, dubbed Active Token Mixer (ATM), to actively incorporate contextual information from other tokens in the global scope into the given query token. This fundamental operator actively predicts where to capture useful contexts and learns how to fuse the captured contexts with the query token at channel level. In this way, the spatial range of token-mixing can be expanded to a global scope with limited computational complexity, where the way of token-mixing is reformed. We take ATMs as the primary operators and assemble them into a cascade architecture, dubbed ATMNet. Extensive experiments demonstrate that ATMNet is generally applicable and comprehensively surpasses different families of SOTA vision backbones by a clear margin on a broad range of vision tasks, including visual recognition and dense prediction tasks. Code is available at https://github.com/microsoft/ActiveMLP.

Proceedings ArticleDOI
24 Aug 2022
TL;DR: A Hierarchical Reinforcement Learning based task-driven Video Semantic Coding framework, named HRLVSC, is proposed, which achieves over 39% BD-rate savings for video semantic coding under the Low Delay P configuration.
Abstract: The rapid development of intelligent tasks, e.g., segmentation, detection, and classification, has brought an urgent need for semantic compression, which aims to reduce the compression cost while maintaining the original semantic information. However, it is impractical to directly integrate a semantic metric into traditional codecs, since they cannot be optimized in an end-to-end manner. To solve this problem, some pioneering works have applied reinforcement learning to implement image-wise semantic compression. Nevertheless, video semantic compression has not been explored because of its complex reference architectures and compression modes. In this paper, we take a step forward to video semantic compression and propose Hierarchical Reinforcement Learning based task-driven Video Semantic Coding, named HRLVSC. Specifically, to simplify the complex mode decision of video semantic coding, we divide the action space into frame-level and CTU-level spaces in a hierarchical manner, and then progressively explore the best mode selection for them with the cooperation of frame-level and CTU-level agents. Moreover, since the number of coding modes grows exponentially with the number of frames in a Group of Pictures (GOP), we carefully investigate the effects of different mode selections for video semantic coding and design a simple but effective mode simplification strategy. We have validated HRLVSC on the video segmentation task with the HEVC reference software HM16.19. Extensive experimental results demonstrate that HRLVSC achieves over 39% BD-rate savings for video semantic coding under the Low Delay P configuration.

Journal ArticleDOI
TL;DR: A Semantic-aware Message Broadcasting (SAMB) method is proposed, which enables more informative and flexible feature alignment for unsupervised domain adaptation (UDA).
Abstract: The vision transformer has demonstrated great potential in abundant vision tasks. However, it also inevitably suffers from poor generalization capability when a distribution shift occurs at test time (i.e., out-of-distribution data). To mitigate this issue, we propose a novel method, Semantic-aware Message Broadcasting (SAMB), which enables more informative and flexible feature alignment for unsupervised domain adaptation (UDA). Particularly, we study the attention module in the vision transformer and notice that the alignment space using one global class token lacks enough flexibility: it exchanges information with all image tokens in the same manner and ignores the rich semantics of different regions. In this paper, we aim to improve the richness of the alignment features by enabling semantic-aware adaptive message broadcasting. Specifically, we introduce a group of learned group tokens as nodes to aggregate the global information from all image tokens, but encourage different group tokens to adaptively focus their message broadcasting on different semantic regions. In this way, our message broadcasting encourages the group tokens to learn more informative and diverse information for effective domain alignment. Moreover, we systematically study the effects of adversarial-based feature alignment (ADA) and pseudo-label based self-training (PST) on UDA. We find that a simple two-stage training strategy with the cooperation of ADA and PST can further improve the adaptation capability of the vision transformer. Extensive experiments on DomainNet, OfficeHome, and VisDA-2017 demonstrate the effectiveness of our method for UDA.
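A minimal sketch of message broadcasting with learned group tokens (dimensions and the broadcasting rule are illustrative assumptions): group tokens first aggregate information from all image tokens, then each image token receives a message broadcast back from the group tokens, instead of relying on a single class token.

```python
import torch
import torch.nn as nn

class GroupTokenBroadcast(nn.Module):
    def __init__(self, dim=384, num_groups=8, heads=6):
        super().__init__()
        self.group_tokens = nn.Parameter(torch.randn(1, num_groups, dim) * 0.02)
        self.aggregate = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_tokens):               # (B, N, dim) patch tokens from a ViT
        g = self.group_tokens.expand(image_tokens.size(0), -1, -1)
        g, _ = self.aggregate(query=g, key=image_tokens, value=image_tokens)
        msg, _ = self.broadcast(query=image_tokens, key=g, value=g)
        return image_tokens + msg                   # tokens enriched by group-wise messages

tokens = torch.randn(4, 196, 384)                   # e.g. 14x14 patch tokens
print(GroupTokenBroadcast()(tokens).shape)          # torch.Size([4, 196, 384])
```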

Journal ArticleDOI
TL;DR: A two-step fast mode decision method is proposed that uses a convolutional neural network (CNN) to automatically extract useful features for fine-grained content classification; a content-aware early termination algorithm is further proposed to reduce the encoding complexity.
Abstract: With the rapid development of screen content video applications, screen content coding (SCC) is urgently needed in commercial codecs. However, the extra encoding complexity introduced by the new SCC tools poses a great challenge for its practical deployment. In this paper, motivated by our observation that there should be a fine-grained mapping between image content and candidate modes, we propose a two-step fast mode decision method to reduce the encoding complexity. First, we use a convolutional neural network (CNN) to automatically extract useful features for fine-grained content classification. Second, we build a precise and concise mapping from CUs to candidate modes by simultaneously considering the CU content type, CU size, and mode complexity. Note that the spatial correlations between neighboring CUs and the current CU are also utilized in candidate mode derivation. In addition to the two-step fast mode decision method, a content-aware early termination algorithm is further proposed to reduce the encoding complexity. Extensive experiments demonstrate that our method achieves better performance than state-of-the-art ones, with a 50.13% total encoding complexity reduction and only a 0.92% BD-rate increase.
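A hedged sketch of such a two-step decision (the content classes, mode table, and tiny classifier are illustrative assumptions, not the paper's design): a small CNN classifies the CU content, and a precomputed table keyed by content class and CU size returns the candidate SCC modes that still need an RD check.

```python
import torch
import torch.nn as nn

content_classifier = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, 3),                # e.g. natural / text-graphics / hybrid
)

# Candidate-mode table: content classes map to progressively richer SCC mode sets.
CANDIDATES = {
    (0, 32): ["intra"],                            # natural content: skip the SCC tools
    (1, 32): ["intra", "ibc", "plt"],              # text/graphics: keep IBC and palette
    (2, 32): ["intra", "ibc"],
}

def candidate_modes(cu_luma, cu_size=32):
    cls = content_classifier(cu_luma.unsqueeze(0)).argmax(dim=1).item()
    return CANDIDATES.get((cls, cu_size), ["intra", "ibc", "plt"])   # fall back to all modes

cu = torch.rand(1, 32, 32)                         # a single luma CU
print(candidate_modes(cu))
```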