Author

Siwei Ma

Bio: Siwei Ma is an academic researcher from Peking University. The author has contributed to research in topics: Coding tree unit & Motion compensation. The author has an h-index of 44 and has co-authored 464 publications receiving 7,878 citations. Previous affiliations of Siwei Ma include the University of Southern California & Beihang University.


Papers
Posted Content
TL;DR: To maximally exploit the capability of the transformer, the IPT model uses the well-known ImageNet benchmark to generate a large number of corrupted image pairs, and contrastive learning is introduced to adapt the model to different image processing tasks.
Abstract: As the computing power of modern hardware increases rapidly, pre-trained deep learning models (e.g., BERT, GPT-3) learned on large-scale datasets have shown their effectiveness over conventional methods. This progress is mainly attributed to the representation ability of the transformer and its variant architectures. In this paper, we study low-level computer vision tasks (e.g., denoising, super-resolution, and deraining) and develop a new pre-trained model, namely, the image processing transformer (IPT). To maximally exploit the capability of the transformer, we utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs. The IPT model is trained on these images with multiple heads and multiple tails. In addition, contrastive learning is introduced to adapt the model to different image processing tasks. The pre-trained model can therefore be efficiently employed on the desired task after fine-tuning. With only one pre-trained model, IPT outperforms the current state-of-the-art methods on various low-level benchmarks. Code is available at this https URL and this https URL

631 citations
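The multi-head/multi-tail design described in the abstract is easy to picture in code. Below is a minimal PyTorch sketch; the class name, layer sizes, and module choices are illustrative assumptions, not the paper's actual (much larger, patch-based) configuration:

```python
import torch
import torch.nn as nn

class IPTSketch(nn.Module):
    """Toy multi-head / multi-tail layout: one shared transformer body,
    with a task-specific head and tail per task. All sizes here are
    illustrative placeholders, not the paper's configuration."""

    def __init__(self, tasks=("denoise", "sr", "derain"), dim=64):
        super().__init__()
        # One lightweight head (feature extractor) per task.
        self.heads = nn.ModuleDict({t: nn.Conv2d(3, dim, 3, padding=1) for t in tasks})
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=4)  # shared body
        # One tail (reconstructor) per task.
        self.tails = nn.ModuleDict({t: nn.Conv2d(dim, 3, 3, padding=1) for t in tasks})

    def forward(self, x, task):
        f = self.heads[task](x)                # task-specific head
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        tokens = self.body(tokens)             # shared transformer body
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.tails[task](f)             # task-specific tail

# Example: out = IPTSketch()(torch.randn(1, 3, 48, 48), "denoise")
```

Because only the heads and tails are task-specific, a single pre-trained body can be fine-tuned for each downstream task, which is the point the abstract makes.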

Proceedings ArticleDOI
01 Jun 2021
TL;DR: A pre-trained image processing transformer (IPT) model is proposed for denoising, super-resolution, and deraining; it is trained on corrupted image pairs with multiple heads and multiple tails.
Abstract: As the computing power of modern hardware increases rapidly, pre-trained deep learning models (e.g., BERT, GPT-3) learned on large-scale datasets have shown their effectiveness over conventional methods. This progress is mainly attributed to the representation ability of the transformer and its variant architectures. In this paper, we study low-level computer vision tasks (e.g., denoising, super-resolution, and deraining) and develop a new pre-trained model, namely, the image processing transformer (IPT). To maximally exploit the capability of the transformer, we utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs. The IPT model is trained on these images with multiple heads and multiple tails. In addition, contrastive learning is introduced to adapt the model to different image processing tasks. The pre-trained model can therefore be efficiently employed on the desired task after fine-tuning. With only one pre-trained model, IPT outperforms the current state-of-the-art methods on various low-level benchmarks. Code is available at https://github.com/huawei-noah/Pretrained-IPT and https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/cv/IPT

416 citations

Proceedings ArticleDOI
13 Mar 2019
TL;DR: This work proposes a simple yet effective regularization term that addresses the mode collapse issue for cGANs: it explicitly maximizes the ratio of the distance between generated images to the distance between the corresponding latent codes, encouraging the generator to explore more minor modes during training.
Abstract: Most conditional generation tasks expect diverse outputs given a single conditional context. However, conditional generative adversarial networks (cGANs) often focus on the prior conditional information and ignore the input noise vectors, which contribute to the output variations. Recent attempts to resolve the mode collapse issue for cGANs are usually task-specific and computationally expensive. In this work, we propose a simple yet effective regularization term to address the mode collapse issue for cGANs. The proposed method explicitly maximizes the ratio of the distance between generated images to the distance between their corresponding latent codes, thus encouraging the generators to explore more minor modes during training. This mode seeking regularization term is readily applicable to various conditional generation tasks without imposing training overhead or modifying the original network structures. We validate the proposed algorithm on three conditional image synthesis tasks, including categorical generation, image-to-image translation, and text-to-image synthesis, with different baseline models. Both qualitative and quantitative results demonstrate the effectiveness of the proposed regularization method for improving diversity without loss of quality.

362 citations
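The regularization term described above reduces to a few lines. The sketch below assumes PyTorch and L1 distances; mode_seeking_loss is a hypothetical helper name, not the paper's code:

```python
import torch

def mode_seeking_loss(fake1, fake2, z1, z2, eps=1e-5):
    """Ratio of the distance between two images generated from the same
    condition to the distance between their latent codes (L1 distances
    assumed here). The generator is trained to maximize this ratio."""
    d_image = torch.mean(torch.abs(fake1 - fake2))  # distance between generated images
    d_latent = torch.mean(torch.abs(z1 - z2))       # distance between latent codes
    return d_image / (d_latent + eps)

# One common way to maximize the ratio with a minimizing optimizer is to
# add its reciprocal to the generator loss:
#   g_loss = g_loss_original + 1.0 / (mode_seeking_loss(f1, f2, z1, z2) + 1e-5)
```

Because the term only compares two generator outputs, it adds no extra networks, which is why the abstract can claim no training overhead or structural changes.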

Journal ArticleDOI
TL;DR: A new rate-distortion (R-D) model is proposed that utilizes the true quantization stepsize, and an improved rate-control scheme for the H.264/AVC encoder is developed based on this new R-D model.
Abstract: In this paper, an efficient rate-control scheme for H.264/AVC video encoding is proposed. Because of the redesigned quantization scheme in H.264/AVC, the relationship between the quantization parameter and the true quantization stepsize is no longer linear. Based on this observation, we propose a new rate-distortion (R-D) model that utilizes the true quantization stepsize, and we then develop an improved rate-control scheme for the H.264/AVC encoder based on this new R-D model. In general, the current R-D optimization (RDO) mode-selection scheme in the H.264/AVC test model makes rate control difficult: rate control usually requires a predetermined set of motion vectors and coding modes to select the quantization parameter, whereas RDO works in the opposite order and requires a predetermined quantization parameter to select motion vectors and coding modes. To tackle this problem, we develop a complexity-adjustable rate-control scheme based on the proposed R-D model. Briefly, the proposed scheme is a one-pass process at the frame level and a partial two-pass process at the macroblock level. Since the number of macroblocks with two-pass processing can be controlled by an encoder parameter, a fully one-pass implementation is a subset of the proposed algorithm. An additional topic discussed in this paper is video buffering. Since a hypothetical reference decoder (HRD) has been defined in H.264/AVC to guarantee that the buffers never overflow or underflow, more accurate rate-allocation schemes are proposed to satisfy these HRD requirements.

341 citations
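The nonlinearity the abstract builds on is concrete: in H.264/AVC the true quantization stepsize doubles for every increase of 6 in the quantization parameter. The sketch below encodes that mapping together with the classic quadratic rate-quantization model, used here only as a stand-in for the paper's R-D model; the coefficients a and b and the helper names are illustrative:

```python
# Qstep for QP 0..5; in H.264/AVC the stepsize then doubles every 6 QP
# levels, so the QP-to-Qstep relationship is exponential, not linear.
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp: int) -> float:
    """True quantization stepsize for a given quantization parameter QP."""
    return QSTEP_BASE[qp % 6] * (2 ** (qp // 6))

def rate_estimate(q: float, mad: float, a: float, b: float) -> float:
    """Quadratic rate-quantization model R(Q) = a*MAD/Q + b*MAD/Q^2, with
    MAD predicting frame complexity and a, b fitted from encoded frames."""
    return a * mad / q + b * mad / (q * q)

# Example: qstep(26) == 13.0 and qstep(32) == 26.0, i.e. six QP levels
# apart means exactly double the stepsize.
```

Building the R-D model on qstep(qp) rather than on qp itself is the observation the paper starts from.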

Proceedings ArticleDOI
29 Dec 2011
TL;DR: Experimental results show that the fast intra mode decision scheme provides 20% and 28% time savings on average in the all-intra high-efficiency and low-complexity cases, respectively, with negligible loss of coding efficiency.
Abstract: As the next-generation video coding standard, High Efficiency Video Coding (HEVC) is intended to provide significantly better coding efficiency than all existing video coding standards. To improve the coding efficiency of intra-frame coding, up to 34 intra prediction modes are defined in HEVC. The best mode among these pre-defined intra prediction modes is selected by rate-distortion optimization (RDO) for each block. Testing all directions in the RDO process is very time-consuming. To alleviate the encoder computation load, this paper proposes a new method to reduce the number of candidates in the RDO process. In addition, the direction information of neighboring blocks is fully exploited to speed up intra mode decision. Experimental results show that the proposed scheme provides 20% and 28% time savings in the intra high-efficiency and low-complexity cases, respectively, on average, compared to the default encoding scheme in HM 1.0, with almost the same coding efficiency. This algorithm has been proposed to the HEVC standard and partially adopted into the HEVC test model.

311 citations
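A minimal sketch of the candidate-reduction idea, assuming hypothetical argument names and a hypothetical candidate count (the paper's exact settings and HM integration details are not reproduced here):

```python
def select_rdo_candidates(satd_cost, neighbor_modes, keep=3):
    """Rank all intra modes by a cheap pre-RDO cost (here a dict mapping
    mode -> SATD cost of its prediction residual), keep only the cheapest
    few for full RDO, and always include the neighboring blocks' modes,
    since a block's best direction usually matches its neighbors'."""
    candidates = sorted(satd_cost, key=satd_cost.get)[:keep]
    for mode in neighbor_modes:  # direction info from the left/above blocks
        if mode is not None and mode not in candidates:
            candidates.append(mode)
    return candidates  # full RDO is run only on this short list
```

Shrinking the RDO list from up to 34 modes to a handful is where the reported encoder time savings come from.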


Cited by
Journal ArticleDOI

08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i, the square root of minus one: it seemed an odd beast, an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations

Christopher M. Bishop
01 Jan 2006
TL;DR: Probability distributions and linear models for regression and classification are covered, along with neural networks, kernel methods, graphical models, mixture models and EM, approximate inference, sampling methods, latent variables, sequential data, and combining models.
Abstract: Contents: Probability Distributions; Linear Models for Regression; Linear Models for Classification; Neural Networks; Kernel Methods; Sparse Kernel Machines; Graphical Models; Mixture Models and EM; Approximate Inference; Sampling Methods; Continuous Latent Variables; Sequential Data; Combining Models.

10,141 citations

Journal ArticleDOI
TL;DR: A novel feature similarity (FSIM) index for full-reference IQA is proposed, based on the fact that the human visual system (HVS) perceives an image mainly according to its low-level features.
Abstract: Image quality assessment (IQA) aims to use computational models to measure image quality consistently with subjective evaluations. The well-known structural similarity index brought IQA from the pixel-based to the structure-based stage. In this paper, a novel feature similarity (FSIM) index for full-reference IQA is proposed, based on the fact that the human visual system (HVS) perceives an image mainly according to its low-level features. Specifically, phase congruency (PC), a dimensionless measure of the significance of a local structure, is used as the primary feature in FSIM. Considering that PC is contrast invariant while contrast information does affect the HVS's perception of image quality, the image gradient magnitude (GM) is employed as the secondary feature. PC and GM play complementary roles in characterizing local image quality. After obtaining the local quality map, we use PC again as a weighting function to derive a single quality score. Extensive experiments on six benchmark IQA databases demonstrate that FSIM achieves much higher consistency with subjective evaluations than state-of-the-art IQA metrics.

4,028 citations
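The pooling step the abstract describes (local PC/GM similarities, then PC-weighted averaging) is compact enough to sketch. The function below assumes the phase-congruency and gradient-magnitude maps are already computed; the stabilizing constants t1 and t2 follow common FSIM implementations and are an assumption here:

```python
import numpy as np

def fsim_pool(pc1, pc2, g1, g2, t1=0.85, t2=160.0):
    """FSIM pooling from precomputed phase-congruency (pc*) and gradient-
    magnitude (g*) maps of the reference and distorted images."""
    s_pc = (2 * pc1 * pc2 + t1) / (pc1 ** 2 + pc2 ** 2 + t1)  # PC similarity
    s_g = (2 * g1 * g2 + t2) / (g1 ** 2 + g2 ** 2 + t2)       # GM similarity
    s_l = s_pc * s_g                    # local quality map
    pc_m = np.maximum(pc1, pc2)         # PC reused as the weighting function
    return float(np.sum(s_l * pc_m) / np.sum(pc_m))
```

Weighting by the maximum PC emphasizes image regions the HVS finds most significant, which is the paper's stated rationale for reusing PC in the pooling stage.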

Journal ArticleDOI
TL;DR: Despite its simplicity, BRISQUE is shown to be statistically better than the full-reference peak signal-to-noise ratio and the structural similarity index, and highly competitive with all present-day distortion-generic NR IQA algorithms.
Abstract: We propose a natural scene statistic-based distortion-generic blind/no-reference (NR) image quality assessment (IQA) model that operates in the spatial domain. The new model, dubbed blind/referenceless image spatial quality evaluator (BRISQUE) does not compute distortion-specific features, such as ringing, blur, or blocking, but instead uses scene statistics of locally normalized luminance coefficients to quantify possible losses of “naturalness” in the image due to the presence of distortions, thereby leading to a holistic measure of quality. The underlying features used derive from the empirical distribution of locally normalized luminances and products of locally normalized luminances under a spatial natural scene statistic model. No transformation to another coordinate frame (DCT, wavelet, etc.) is required, distinguishing it from prior NR IQA approaches. Despite its simplicity, we are able to show that BRISQUE is statistically better than the full-reference peak signal-to-noise ratio and the structural similarity index, and is highly competitive with respect to all present-day distortion-generic NR IQA algorithms. BRISQUE has very low computational complexity, making it well suited for real time applications. BRISQUE features may be used for distortion-identification as well. To illustrate a new practical application of BRISQUE, we describe how a nonblind image denoising algorithm can be augmented with BRISQUE in order to perform blind image denoising. Results show that BRISQUE augmentation leads to performance improvements over state-of-the-art methods. A software release of BRISQUE is available online: http://live.ece.utexas.edu/research/quality/BRISQUE_release.zip for public use and evaluation.

3,780 citations
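The locally normalized luminance coefficients at the heart of BRISQUE (often called MSCN coefficients) are simple to compute. A minimal sketch, assuming a Gaussian local window as in common BRISQUE implementations (the window choice is an assumption, not stated in the abstract):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(image, sigma=7 / 6, c=1.0):
    """Mean-subtracted contrast-normalized (MSCN) coefficients: each pixel
    minus its local Gaussian-weighted mean, divided by the local standard
    deviation plus a small constant c that avoids division by zero."""
    image = np.asarray(image, dtype=np.float64)
    mu = gaussian_filter(image, sigma)                     # local mean
    var = gaussian_filter(image * image, sigma) - mu * mu  # local variance
    return (image - mu) / (np.sqrt(np.abs(var)) + c)       # normalized luminance
```

For natural images these coefficients follow a predictable statistical distribution; distortions perturb that distribution, and BRISQUE's features quantify the deviation, all in the spatial domain with no DCT or wavelet transform.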