Showing papers by "Alan C. Bovik published in 2020"


Journal ArticleDOI
TL;DR: This work conducts a comprehensive evaluation of leading no-reference/blind VQA (BVQA) features and models on a fixed evaluation architecture, yielding new empirical insights on both subjective video quality studies and objective VQA model design.
Abstract: Recent years have witnessed an explosion of user-generated content (UGC) videos shared and streamed over the Internet, thanks to the evolution of affordable and reliable consumer capture devices, and the tremendous popularity of social media platforms. Accordingly, there is a great need for accurate video quality assessment (VQA) models for UGC/consumer videos to monitor, control, and optimize this vast content. Blind quality prediction of in-the-wild videos is quite challenging, since the quality degradations of UGC content are unpredictable, complicated, and often commingled. Here we contribute to advancing the UGC-VQA problem by conducting a comprehensive evaluation of leading no-reference/blind VQA (BVQA) features and models on a fixed evaluation architecture, yielding new empirical insights on both subjective video quality studies and VQA model design. By employing a feature selection strategy on top of leading VQA model features, we are able to extract 60 of the 763 statistical features used by the leading models to create a new fusion-based BVQA model, which we dub the VIDeo quality EVALuator (VIDEVAL), that effectively balances the trade-off between VQA performance and efficiency. Our experimental results show that VIDEVAL achieves state-of-the-art performance at considerably lower computational cost than other leading models. Our study protocol also defines a reliable benchmark for the UGC-VQA problem, which we believe will facilitate further research on deep learning-based VQA modeling, as well as perceptually-optimized efficient UGC video processing, transcoding, and streaming. To promote reproducible research and public evaluation, an implementation of VIDEVAL has been made available online: this https URL.
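A minimal sketch of the feature-selection-plus-regression fusion the abstract describes, assuming a precomputed 763-column feature matrix X (features pooled from existing BVQA models) and MOS labels y; the SelectKBest/SVR estimators, hyperparameters, and placeholder data are illustrative assumptions, not the released VIDEVAL code.

```python
# Illustrative fusion-based BVQA sketch (not the released VIDEVAL code).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 763))   # placeholder: features pooled from existing BVQA models
y = rng.uniform(1, 5, size=200)   # placeholder: mean opinion scores (MOS)

# Keep the 60 most quality-predictive features, then regress the selection onto MOS.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_regression, k=60),
    SVR(kernel="rbf", C=10.0, gamma="scale"),
)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("cross-validated MAE:", -scores.mean())
```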

113 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: In this article, the authors introduce the largest subjective picture quality database, containing about 40,000 real-world distorted pictures and 120,000 patches, on which they collected about 4M human judgments of picture quality, and used these picture and patch quality labels to learn to produce state-of-the-art global picture quality predictions as well as useful local picture quality maps.
Abstract: Blind or no-reference (NR) perceptual picture quality prediction is a difficult, unsolved problem of great consequence to the social and streaming media industries that impacts billions of viewers daily. Unfortunately, popular NR prediction models perform poorly on real-world distorted pictures. To advance progress on this problem, we introduce the largest (by far) subjective picture quality database, containing about 40,000 real-world distorted pictures and 120,000 patches, on which we collected about 4M human judgments of picture quality. Using these picture and patch quality labels, we built deep region-based architectures that learn to produce state-of-the-art global picture quality predictions as well as useful local picture quality maps. Our innovations include picture quality prediction architectures that produce global-to-local inferences as well as local-to-global inferences (via feedback). The dataset and source code are available at https://live.ece.utexas.edu/research.php.
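A minimal PyTorch sketch in the spirit of the region-based design described above, where a shared convolutional backbone feeds both a single global quality score and a coarse local quality map; the layer sizes, heads, and input shapes are illustrative assumptions, not the authors' released architecture.

```python
# Toy global + local picture-quality network (illustrative only).
import torch
import torch.nn as nn

class GlobalLocalIQA(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(            # small stand-in feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.local_head = nn.Conv2d(128, 1, 1)    # coarse per-region quality map
        self.global_head = nn.Linear(128, 1)      # single picture-level score

    def forward(self, x):
        f = self.backbone(x)
        quality_map = self.local_head(f)                      # (N, 1, H/8, W/8)
        quality_score = self.global_head(f.mean(dim=(2, 3)))  # global average pool
        return quality_score.squeeze(1), quality_map

score, qmap = GlobalLocalIQA()(torch.randn(2, 3, 224, 224))
print(score.shape, qmap.shape)
```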

94 citations


Journal ArticleDOI
TL;DR: The new LIVE-SJTU Audio and Video Quality Assessment (A/V-QA) Database includes 336 A/V sequences that were generated from 14 original source contents by applying 24 different A/V distortion combinations to them; the database was used to validate and test all of the objective A/V quality prediction models.
Abstract: The topics of visual and audio quality assessment (QA) have been widely researched for decades, yet nearly all of this prior work has focused only on single-mode visual or audio signals. However, visual signals are rarely presented without accompanying audio, including in heavy-bandwidth video streaming applications. Moreover, the distortions that may separately (or conjointly) afflict the visual and audio signals collectively shape user-perceived quality of experience (QoE). This motivated us to conduct a subjective study of audio and video (A/V) quality, which we then used to compare and develop A/V quality measurement models and algorithms. The new LIVE-SJTU Audio and Video Quality Assessment (A/V-QA) Database includes 336 A/V sequences that were generated from 14 original source contents by applying 24 different A/V distortion combinations to them. We then conducted a subjective A/V quality perception study on the database towards attaining a better understanding of how humans perceive the overall combined quality of A/V signals. We also designed four different families of objective A/V quality prediction models, using a multimodal fusion strategy. The different types of A/V quality models differ in both the unimodal audio and video quality prediction models comprising the direct signal measurements and in the way that the two perceptual signal modes are combined. The objective models are built using both existing state-of-the-art audio and video quality prediction models and some new prediction models, as well as quality-predictive features delivered by a deep neural network. The methods of fusing audio and video quality predictions that are considered include simple product combinations as well as learned mappings. Using the new subjective A/V database as a tool, we validated and tested all of the objective A/V quality prediction models. We will make the database publicly available to facilitate further research.
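A small sketch of the two fusion strategies mentioned above, combining unimodal audio and video quality scores by a weighted product and by a learned mapping; the weights, placeholder scores, and SVR regressor are assumptions for illustration only.

```python
# Two ways to fuse unimodal audio and video quality scores into an A/V quality estimate.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
q_video = rng.uniform(0, 1, size=336)   # placeholder unimodal video quality predictions
q_audio = rng.uniform(0, 1, size=336)   # placeholder unimodal audio quality predictions
mos = rng.uniform(1, 5, size=336)       # placeholder subjective A/V ratings

# 1) Simple weighted-product combination (the exponent is an assumed weight).
w = 0.7
q_product = (q_video ** w) * (q_audio ** (1.0 - w))

# 2) Learned mapping from the pair of unimodal scores to the subjective rating.
X = np.column_stack([q_video, q_audio])
q_learned = SVR(kernel="rbf").fit(X, mos).predict(X)

print(q_product[:3], q_learned[:3])
```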

92 citations


Proceedings ArticleDOI
02 Nov 2020
TL;DR: A large-scale comparative evaluation is conducted to assess the capabilities and limitations of multiple temporal pooling strategies on blind VQA of user-generated videos, and an ensemble pooling model built on top of high-performing temporal pooling models is proposed.
Abstract: Many objective video quality assessment (VQA) algorithms include a key step of temporal pooling of frame-level quality scores. However, less attention has been paid to studying the relative efficiencies of different pooling methods on no-reference (blind) VQA. Here we conduct a large-scale comparative evaluation to assess the capabilities and limitations of multiple temporal pooling strategies on blind VQA of user-generated videos. The study yields insights and general guidance regarding the application and selection of temporal pooling models. In addition, we also propose an ensemble pooling model built on top of high-performing temporal pooling models. Our experimental results demonstrate the relative efficacies of the evaluated temporal pooling models, using several popular VQA algorithms evaluated on two recent large-scale natural video quality databases. Finally, we provide an empirical recipe for applying temporal pooling of frame-based quality predictions.
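A short sketch of several common temporal pooling rules applied to frame-level quality scores, plus a naive ensemble of the pooled values; the specific pooling set, percentile, and placeholder scores are assumptions and do not reproduce the paper's evaluated models.

```python
# Common temporal pooling rules for frame-level quality scores, plus a naive ensemble.
import numpy as np

def mean_pool(q):                        # arithmetic mean over frames
    return float(np.mean(q))

def harmonic_pool(q, eps=1e-6):          # harmonic mean penalizes low-quality frames
    q = np.asarray(q, dtype=float)
    return len(q) / float(np.sum(1.0 / (q + eps)))

def worst_percentile_pool(q, p=10):      # average of the worst p% of frames
    q = np.sort(np.asarray(q, dtype=float))
    k = max(1, int(len(q) * p / 100))
    return float(np.mean(q[:k]))

def ensemble_pool(q):                    # naive ensemble: average the pooled values
    return float(np.mean([mean_pool(q), harmonic_pool(q), worst_percentile_pool(q)]))

frame_scores = np.random.default_rng(2).uniform(20, 80, size=300)
print(ensemble_pool(frame_scores))
```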

69 citations


Posted Content
TL;DR: The largest (by far) subjective video quality dataset is created, containing 38,811 real-world distorted videos and 116,433 space-time localized video patches (‘v-patches’), along with 5.5M human perceptual quality annotations, which were used to create two unique NR-VQA models.
Abstract: No-reference (NR) perceptual video quality assessment (VQA) is a complex, unsolved, and important problem for social and streaming media applications. Efficient and accurate video quality predictors are needed to monitor and guide the processing of billions of shared, often imperfect, user-generated content (UGC) videos. Unfortunately, current NR models are limited in their prediction capabilities on real-world, "in-the-wild" UGC video data. To advance progress on this problem, we created the largest (by far) subjective video quality dataset, containing 39,000 real-world distorted videos and 117,000 space-time localized video patches ('v-patches'), and 5.5M human perceptual quality annotations. Using this, we created two unique NR-VQA models: (a) a local-to-global region-based NR VQA architecture (called PVQ) that learns to predict global video quality and achieves state-of-the-art performance on 3 UGC datasets, and (b) a first-of-a-kind space-time video quality mapping engine (called PVQ Mapper) that helps localize and visualize perceptual distortions in space and time. We will make the new database and prediction models available immediately following the review process.

60 citations


Journal ArticleDOI
TL;DR: An effective fusion-based technique to enhance both day-time and night-time hazy scenes, which estimates the airlight on image patches rather than on the entire image and blends the derived inputs using a Laplacian pyramid decomposition.
Abstract: We introduce an effective fusion-based technique to enhance both day-time and night-time hazy scenes. When inverting the Koschmieder light transmission model, and by contrast with the common implementation of the popular dark-channel prior [1], we estimate the airlight on image patches and not on the entire image. Local airlight estimation is adopted because, under night-time conditions, the lighting generally arises from multiple localized artificial sources, and is thus intrinsically non-uniform. Selecting the sizes of the patches is, however, non-trivial. Small patches are desirable to achieve fine spatial adaptation to the atmospheric light, but large patches help improve the airlight estimation accuracy by increasing the possibility of capturing pixels with airlight appearance (due to severe haze). For this reason, multiple patch sizes are considered to generate several images, which are then merged together. The discrete Laplacian of the original image is provided as an additional input to the fusion process to reduce the glowing effect and to emphasize the finest image details. Similarly, for day-time scenes we apply the same principle but use a larger patch size. For each input, a set of weight maps is derived so as to assign higher weights to regions of high contrast, high saliency and small saturation. Finally, the derived inputs and the normalized weight maps are blended in a multi-scale fashion using a Laplacian pyramid decomposition. Extensive experimental results demonstrate the effectiveness of our approach as compared with recent techniques, both in terms of computational efficiency and the quality of the outputs.
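A toy sketch of the local-airlight idea under a dark-channel-style inversion of the Koschmieder model, run at several patch sizes and naively fused; the patch sizes, constants, and simple averaging stand in for the paper's weight-map and Laplacian-pyramid blending.

```python
# Toy multi-patch-size airlight estimation and naive fusion for a hazy image.
import numpy as np
from scipy.ndimage import maximum_filter

def local_airlight(img, patch):
    # Crude local airlight: per-channel maximum within each patch-sized neighborhood.
    return np.stack([maximum_filter(img[..., c], size=patch) for c in range(3)], axis=-1)

def dehaze_once(img, patch, t_min=0.1, omega=0.95):
    A = local_airlight(img, patch)
    # Dark-channel-style transmission estimate, then invert the Koschmieder model.
    t = np.clip(1.0 - omega * np.min(img / np.maximum(A, 1e-6), axis=-1), t_min, 1.0)
    return np.clip((img - A) / t[..., None] + A, 0.0, 1.0)

img = np.random.default_rng(3).uniform(0, 1, size=(128, 128, 3))   # placeholder hazy image
outputs = [dehaze_once(img, p) for p in (10, 40, 80)]               # several patch sizes
fused = np.mean(outputs, axis=0)                                    # stand-in for pyramid blending
```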

59 citations


Journal ArticleDOI
TL;DR: It is revealed that the different types of aesthetic labels can be handled within the same statistical framework, which is used to create a unified probabilistic formulation of all three IAA tasks.
Abstract: Image aesthetic assessment (IAA) has been attracting considerable attention in recent years due to the explosive growth of digital photography on the Internet and social networks. The IAA problem is inherently challenging, owing to the ineffable nature of the human sense of aesthetics and beauty, and its close relationship to understanding pictorial content. Three different approaches to framing and solving the problem have been posed: binary classification, average score regression and score distribution prediction. Solutions that have been proposed have utilized different types of aesthetic labels and loss functions to train deep IAA models. However, these studies ignore the fact that the three different IAA tasks are inherently related. Here, we reveal that the different types of aesthetic labels can be handled within the same statistical framework, which we use to create a unified probabilistic formulation of all three IAA tasks. This unified formulation motivates the use of an efficient and effective loss function for training deep IAA models to conduct different tasks. We also discuss the problem of learning from a noisy raw score distribution, which hinders network performance. We then show that by fitting the raw score distribution to a more stable and discriminative score distribution, we are able to train a single model which is able to obtain highly competitive performance on all three IAA tasks. Extensive qualitative analysis and experimental results on image aesthetic benchmarks validate the superior performance afforded by the proposed formulation. The source code is available at https://github.com/HuiZeng/Unified_IAA.
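A few lines illustrating the unified view: a single predicted aesthetic score distribution directly yields the binary label, the average score, and the distribution itself; the bin layout and classification threshold are assumptions for illustration.

```python
# One predicted aesthetic score distribution supports all three IAA tasks.
import numpy as np

bins = np.arange(1, 11)                      # assumed score bins 1..10
p = np.array([.02, .03, .05, .10, .20, .25, .18, .10, .05, .02])   # predicted distribution
p = p / p.sum()

mean_score = float(np.dot(bins, p))          # average-score regression
is_high_quality = mean_score > 5.0           # binary classification via an assumed threshold
print(p, mean_score, is_high_quality)        # the distribution prediction is p itself
```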

51 citations


Journal ArticleDOI
TL;DR: The next-generation surface water mapping model, DeepWaterMapV2, is presented, which uses an improved model architecture, data set, and training setup to create surface water maps at lower cost, with higher precision and recall, and which is memory efficient for large inputs.
Abstract: We present our next-generation surface water mapping model, DeepWaterMapV2, which uses an improved model architecture, data set, and training setup to create surface water maps at lower cost, with higher precision and recall. We designed DeepWaterMapV2 to be memory efficient for large inputs. Unlike earlier models, our new model is able to process a full Landsat scene in one shot, without dividing the input into tiles. DeepWaterMapV2 is robust against a variety of natural and artificial perturbations in the input, such as noise, different sensor characteristics, and small clouds. Our model can even “see” through the clouds without relying on any active sensor data, in cases where the clouds do not fully obstruct the scene. Although we trained the model on Landsat-8 images only, it also supports data from a variety of other Earth observing satellites, including Landsat-5, Landsat-7, and Sentinel-2, without any further training or calibration. Our code and trained model are available at https://github.com/isikdogan/deepwatermap.

40 citations


Journal ArticleDOI
TL;DR: In the study, 450 distorted images, obtained from 15 pristine 3D VR images modified by 6 types of distortion of varying severities, were evaluated by 42 subjects in a controlled VR setting; the subject ratings and eye tracking data are made available as part of the new database, in hopes that the relationships between gaze direction and perceived quality might be better understood.
Abstract: Virtual Reality (VR) and its applications have attracted significant and increasing attention. However, the requirements of much larger file sizes, different storage formats, and immersive viewing conditions pose significant challenges to the goals of acquiring, transmitting, compressing and displaying high quality VR content. Towards meeting these challenges, it is important to be able to understand the distortions that arise and that can affect the perceived quality of displayed VR content. It is also important to develop ways to automatically predict VR picture quality. Meeting these challenges requires basic tools in the form of large, representative subjective VR quality databases on which VR quality models can be developed and which can be used to benchmark VR quality prediction algorithms. Towards making progress in this direction, here we present the results of an immersive 3D subjective image quality assessment study. In the study, 450 distorted images obtained from 15 pristine 3D VR images modified by 6 types of distortion of varying severities were evaluated by 42 subjects in a controlled VR setting. Both the subject ratings as well as eye tracking data were recorded and made available as part of the new database, in hopes that the relationships between gaze direction and perceived quality might be better understood. We also evaluated several publicly available IQA models on the new database, and report a statistical evaluation of the performances of the compared IQA models.

39 citations


Proceedings ArticleDOI
04 May 2020
TL;DR: In this paper, a distortion-specific no-reference video quality model for predicting banding artifacts, called the Blind BANding Detector (BBAND index), was proposed.
Abstract: Banding artifact, or false contouring, is a common video compression impairment that tends to appear on large flat regions in encoded videos. These staircase-shaped color bands can be very noticeable in high-definition videos. Here we study this artifact, and propose a new distortion-specific no-reference video quality model for predicting banding artifacts, called the Blind BANding Detector (BBAND index). BBAND is inspired by human visual models. The proposed detector can generate a pixel-wise banding visibility map and output a banding severity score at both the frame and video levels. Experimental results show that our proposed method outperforms state-of-the-art banding detection algorithms and delivers better consistency with subjective evaluations.
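A toy per-pixel cue of the kind a banding detector aggregates, flagging weak, isolated luma steps inside otherwise flat regions; the filters and thresholds are assumptions and this is not the BBAND algorithm itself.

```python
# Toy per-pixel banding cue: weak, contour-like luma steps inside otherwise flat regions.
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def banding_cue(luma, step_lo=0.5, step_hi=8.0, flat_thresh=1.0):
    grad = np.hypot(sobel(luma, axis=0), sobel(luma, axis=1))
    is_step = (grad > step_lo) & (grad < step_hi)       # small steps, not strong edges
    local_activity = uniform_filter(grad, size=15)      # texture tends to mask banding
    return is_step & (local_activity < flat_thresh)     # pixel-wise visibility map

# Synthetic frame with shallow luma bands to exercise the cue.
luma = np.round(np.linspace(16, 20, 256))[None, :] * np.ones((128, 1))
vis_map = banding_cue(luma.astype(float))
frame_score = vis_map.mean()                            # crude frame-level severity
print(frame_score)
```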

34 citations


Journal ArticleDOI
TL;DR: A novel FR-IQA framework that dynamically generates receptive fields responsive to distortion type is proposed, which achieves state-of-the-art prediction accuracy on various open IQA databases.
Abstract: Most full-reference image quality assessment (FR-IQA) methods advanced to date have been holistically designed without regard to the type of distortion impairing the image. However, the perception of distortion depends nonlinearly on the distortion type. Here we propose a novel FR-IQA framework that dynamically generates receptive fields responsive to distortion type. Our proposed method, the dynamic receptive field generation based image quality assessor (DRF-IQA), separates the process of FR-IQA into two streams: 1) dynamic error representation and 2) visual sensitivity-based quality pooling. The first stream generates dynamic receptive fields on the input distorted image, implemented by a trained convolutional neural network (CNN); the generated receptive field profiles are then convolved with the distorted and reference images, and differenced to produce spatial error maps. In the second stream, a visual sensitivity map is generated. The visual sensitivity map is used to weight the spatial error map. The experimental results show that the proposed model achieves state-of-the-art prediction accuracy on various open IQA databases.
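A minimal illustration of the second stream described above: a spatial error map is weighted by a visual sensitivity map and pooled into a quality score; both maps here are random placeholders rather than outputs of the trained DRF-IQA networks.

```python
# Sensitivity-weighted pooling of a spatial error map into a quality score.
import numpy as np

rng = np.random.default_rng(4)
error_map = rng.uniform(0, 1, size=(64, 64))     # placeholder reference/distorted error map
sensitivity = rng.uniform(0, 1, size=(64, 64))   # placeholder visual sensitivity map

weighted_error = error_map * sensitivity
quality_score = 1.0 - weighted_error.sum() / (sensitivity.sum() + 1e-8)   # higher = better
print(quality_score)
```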

Journal ArticleDOI
TL;DR: An objective VQA model called Space-Time GeneRalized Entropic Difference (GREED) is devised which analyzes the statistics of spatial and temporal band-pass video coefficients and achieves state-of-the-art performance on the LIVE-YT-HFR Database when compared with existing VQA models.
Abstract: We consider the problem of conducting frame rate dependent video quality assessment (VQA) on videos of diverse frame rates, including high frame rate (HFR) videos. More generally, we study how perceptual quality is affected by frame rate, and how frame rate and compression combine to affect perceived quality. We devise an objective VQA model called Space-Time GeneRalized Entropic Difference (GREED) which analyzes the statistics of spatial and temporal band-pass video coefficients. A generalized Gaussian distribution (GGD) is used to model band-pass responses, while entropy variations between reference and distorted videos under the GGD model are used to capture video quality variations arising from frame rate changes. The entropic differences are calculated across multiple temporal and spatial subbands, and merged using a learned regressor. We show through extensive experiments that GREED achieves state-of-the-art performance on the LIVE-YT-HFR Database when compared with existing VQA models. The features used in GREED are highly generalizable and obtain competitive performance even on standard, non-HFR VQA databases. The implementation of GREED has been made available online: this https URL
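A simplified single-subband illustration of entropic differencing under a generalized Gaussian model, using scipy.stats.gennorm; GREED aggregates such differences over multiple spatial and temporal subbands and maps them through a learned regressor, which is not reproduced here.

```python
# Entropic differencing of band-pass coefficients under a generalized Gaussian model.
import numpy as np
from scipy.stats import gennorm

def ggd_entropy(coeffs):
    beta, loc, scale = gennorm.fit(coeffs, floc=0.0)    # fit a zero-mean GGD
    return gennorm(beta, loc=loc, scale=scale).entropy()

# Placeholder "reference" and "distorted" subband coefficients drawn from GGDs.
ref_band = gennorm.rvs(0.8, scale=1.0, size=5000, random_state=5)
dis_band = gennorm.rvs(0.8, scale=0.4, size=5000, random_state=6)

entropic_difference = abs(ggd_entropy(ref_band) - ggd_entropy(dis_band))
print(entropic_difference)
```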

Journal ArticleDOI
TL;DR: 1stepVQA overcomes limitations of Full-Reference, Reduced-Reference and No-Reference VQA models by exploiting the statistical regularities of both natural videos and distorted videos, and is able to more accurately predict the quality of compressed videos, given imperfect reference videos.
Abstract: Over the past decade, the online video industry has greatly expanded the volume of visual data that is streamed and shared over the Internet. Moreover, because of the increasing ease of video capture, many millions of consumers create and upload large volumes of User-Generated-Content (UGC) videos. Unlike streaming television or cinematic content produced by professional videographers and cinematographers, UGC videos are most commonly captured by naive users having limited skills and imperfect technique, and often are afflicted by highly diverse and mixed in-capture distortions. These UGC videos are then often uploaded for sharing onto cloud servers, where they are further compressed for storage and transmission. Our paper tackles the highly practical problem of predicting the quality of compressed videos (perhaps during the process of compression, to help guide it), with only (possibly severely) distorted UGC videos as references. To address this problem, we have developed a novel Video Quality Assessment (VQA) framework that we call 1stepVQA (to distinguish it from two-step methods that we discuss). 1stepVQA overcomes limitations of Full-Reference, Reduced-Reference and No-Reference VQA models by exploiting the statistical regularities of both natural videos and distorted videos. We show that 1stepVQA is able to more accurately predict the quality of compressed videos, given imperfect reference videos. We also describe a new dedicated video database which includes (typically distorted) UGC reference videos, and a large number of compressed versions of them. We show that the 1stepVQA model outperforms other VQA models in this scenario. We are providing the dedicated new database free of charge at this https URL.

Journal ArticleDOI
TL;DR: The experimental results show that the trained model has better quality evaluation performance on noisy images than existing blind noise assessment models, while also outperforming general-purpose blind and full-reference image quality assessment methods.
Abstract: Noise that afflicts natural images, regardless of the source, generally disturbs the perception of image quality by introducing a high-frequency random element that, when severe, can mask image content. Except at very low levels, where it may serve a purpose, it is annoying. There exist significant statistical differences between distortion-free natural images and noisy images that become evident upon comparing the empirical probability distribution histograms of their discrete wavelet transform (DWT) coefficients. The DWT coefficients of low- or no-noise natural images have leptokurtic, peaky distributions with heavy tails, while noisy images tend to be platykurtic with less peaky distributions and shallower tails. The sample kurtosis is a natural measure of the peakedness and tail weight of the distributions of random variables. Here, we study the efficacy of the sample kurtosis of image wavelet coefficients as a feature driving an extreme learning machine, which learns to map kurtosis values into perceptual quality scores. The model is trained and tested on five types of noisy images, including additive white Gaussian noise, additive Gaussian color noise, impulse noise, masked noise, and high-frequency noise from the LIVE, CSIQ, TID2008, and TID2013 image quality databases. The experimental results show that the trained model has better quality evaluation performance on noisy images than existing blind noise assessment models, while also outperforming general-purpose blind and full-reference image quality assessment methods.
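A short sketch of the feature-extraction step, computing the sample kurtosis of wavelet detail subbands, assuming the PyWavelets (pywt) and SciPy packages; the extreme learning machine that maps these features to quality scores is not reproduced here.

```python
# Sample kurtosis of wavelet detail subbands as noise-sensitivity features.
import numpy as np
import pywt                       # PyWavelets
from scipy.stats import kurtosis

def wavelet_kurtosis_features(gray_image, wavelet="db2", level=3):
    coeffs = pywt.wavedec2(gray_image, wavelet, level=level)
    feats = []
    for detail_level in coeffs[1:]:              # skip the approximation subband
        for subband in detail_level:             # (LH, HL, HH) details at each scale
            feats.append(kurtosis(subband.ravel(), fisher=False))   # Gaussian -> 3
    return np.array(feats)

img = np.random.default_rng(6).normal(size=(256, 256))    # noise-like test image
print(wavelet_kurtosis_features(img))                      # values near 3 for Gaussian noise
```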

Journal ArticleDOI
TL;DR: This work uses a conditional generative adversarial network (cGAN) which is trained to learn four kinds of realistic distortions and experimentally demonstrates that the learned model can produce the perceptual characteristics of several types of distortion.
Abstract: Modeling image and video distortions is an important, but difficult problem of great consequence to numerous and diverse image processing and computer vision applications. While many statistical models have been proposed to synthesize different types of image noise, real-world distortions are far more difficult to emulate. Toward advancing progress on this interesting problem, we consider distortion generation as an image-to-image transformation problem, and solve it via a data-driven approach. Specifically, we use a conditional generative adversarial network (cGAN) which we train to learn four kinds of realistic distortions. We experimentally demonstrate that the learned model can produce the perceptual characteristics of several types of distortion.

Journal ArticleDOI
TL;DR: A full-reference video quality assessment framework that integrates analysis of space–time slices (STSs) with frame-based image quality measurement (IQA) to form a high-performance video quality predictor is developed.
Abstract: We develop a full-reference (FR) video quality assessment framework that integrates analysis of space–time slices (STSs) with frame-based image quality measurement (IQA) to form a high-performance video quality predictor. The approach first arranges the reference and test video sequences into a space–time slice representation. To more comprehensively characterize space–time distortions, a collection of distortion-aware maps are computed on each reference–test video pair. These reference-distorted maps are then processed using a standard image quality model, such as peak signal-to-noise ratio (PSNR) or Structural Similarity (SSIM). A simple learned pooling strategy is used to combine the multiple IQA outputs to generate a final video quality score. This leads to an algorithm called Space–Time Slice PSNR (STS-PSNR), which we thoroughly tested on three publicly available video quality assessment databases and found to deliver significantly elevated performance relative to state-of-the-art video quality models. Source code for STS-PSNR is freely available at: http://live.ece.utexas.edu/research/Quality/STS-PSNR_release.zip.
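A simplified sketch of scoring space-time slices of a video volume with PSNR and averaging the results; the slice choices and the plain average stand in for the paper's collection of distortion-aware maps and learned pooling.

```python
# Scoring space-time slices (STS) of a video volume with PSNR, then averaging.
import numpy as np

def psnr(a, b, peak=255.0):
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / (mse + 1e-12))

def sts_psnr(ref, dis):
    # ref, dis: (T, H, W) grayscale volumes.
    t, h, w = ref.shape
    horiz = psnr(ref[:, h // 2, :], dis[:, h // 2, :])   # time-x slice at mid-height
    vert = psnr(ref[:, :, w // 2], dis[:, :, w // 2])    # time-y slice at mid-width
    frames = psnr(ref, dis)                              # conventional frame-domain PSNR
    return np.mean([horiz, vert, frames])                # plain average instead of learned pooling

rng = np.random.default_rng(7)
ref = rng.integers(0, 256, size=(60, 144, 176)).astype(np.uint8)
dis = np.clip(ref + rng.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(sts_psnr(ref, dis))
```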

Posted Content
TL;DR: This work proposes a new prototype model for no-reference video quality assessment (VQA) based on the natural statistics of space-time chips of videos, which achieves high correlation against human judgments of video quality and is competitive with state-of-the-art models.
Abstract: We propose a new prototype model for no-reference video quality assessment (VQA) based on the natural statistics of space-time chips of videos. Space-time chips (ST-chips) are a new, quality-aware feature space which we define as space-time localized cuts of video data in directions that are determined by the local motion flow. We use parametrized distribution fits to the bandpass histograms of space-time chips to characterize quality, and show that the parameters from these models are affected by distortion and can hence be used to objectively predict the quality of videos. Our prototype method, which we call ChipQA-0, is agnostic to the types of distortion affecting the video, and is based on identifying and quantifying deviations from the expected statistics of natural, undistorted ST-chips in order to predict video quality. We train and test our resulting model on several large VQA databases and show that our model achieves high correlation against human judgments of video quality and is competitive with state-of-the-art models.

Journal ArticleDOI
TL;DR: A new dataset of super-resolved images with associated human quality scores is introduced, and two no-reference (NR), opinion-distortion-unaware (ODU) IQA models are implemented, achieving better than state-of-the-art performance among the NR-IQA metrics.
Abstract: Methods for image Super Resolution (SR) have started to benefit from the development of perceptual quality predictors that are designed for super resolved images. However, extensive cross dataset validation studies have not yet been performed on Image Quality Assessment (IQA) for super resolved images. Moreover, powerful natural scene statistics-based approaches for IQA have not yet been studied for SR. To address these issues, we introduce a new dataset of super-resolved images with associated human quality scores. The dataset is based on the existing SupER dataset, which contains real low-resolution images. This new dataset also includes the outputs of 7 SR algorithms at three magnification scales. We selected optimal quality aware features to create two no-reference (NR), opinion-distortion-unaware (ODU) IQA models. Using the same set of selected features, we also implemented two NR-IQA opinion/distortion-aware (ODA) models. The selection process identified paired-product (PP) features and those derived from discrete cosine transform coefficients (DCT) as the most relevant for the quality prediction of SR images. We conducted cross dataset validation for several state-of-the-art quality algorithms in four datasets, including our new dataset. The conducted experiments indicate that our models achieved better than state-of-the-art performance among the NR-IQA metrics. Our NR-IQA source code and the dataset are available at https://github.com/juanpaberon/IQA_SR.

Journal ArticleDOI
TL;DR: This study examines the spatial variability of change in the GBMD channel network and finds that the anthropogenically modified embanked regions have much higher levels of geomorphic change than the adjacent natural Sundarban forest and that this change is primarily due to channel infilling and increased rates of channel migration.
Abstract: The Ganges Brahmaputra Meghna Delta (GBMD) is a large and complex coastal system whose channel network is vulnerable to morphological changes caused by sea level rise, subsidence, anthropogenic modifications, and changes to water and sediment loads. Locating and characterizing change is particularly challenging because of the wide range of forcings acting on the GBMD and because of the large range of scales over which these forcings act. In this study, we examine the spatial variability of change in the GBMD channel network. We quantify the relative magnitudes and directions of change across multiple scales and relate the spatial distribution of change to the spatial distribution of a variety of known system forcings. We quantify how the channelization varies by computing the Channelized Response Variance (CRV) on 30 years of remotely sensed imagery of the entire delta extent. The CRV analysis reveals hotspots of morphological change across the delta. We find that the magnitude of these hotspots is related to the spatial distribution of the dominant physiographic forcings in the system (tidal and fluvial influence levels, channel connectivity, and anthropogenic interference levels). We find that the anthropogenically modified embanked regions have much higher levels of geomorphic change than the adjacent natural Sundarban forest and that this change is primarily due to channel infilling and increased rates of channel migration. Having a better understanding of how anthropogenic changes affect delta channel networks over human timescales will help to inform policy decisions affecting the human and ecological presences on deltas around the world.

Journal ArticleDOI
TL;DR: In this paper, a hierarchical fully convolutional network (H-FCN) is proposed to predict intra-mode superblock partitions in the form of a four-level partition tree.
Abstract: In the VP9 video codec, the sizes of blocks are decided during encoding by recursively partitioning 64×64 superblocks using rate-distortion optimization (RDO). This process is computationally intensive because of the combinatorial search space of possible partitions of a superblock. Here, we propose a deep learning based alternative framework to predict the intra-mode superblock partitions in the form of a four-level partition tree, using a hierarchical fully convolutional network (H-FCN). We created a large database of VP9 superblocks and the corresponding partitions to train an H-FCN model, which was subsequently integrated with the VP9 encoder to reduce the intra-mode encoding time. The experimental results establish that our approach speeds up intra-mode encoding by 69.7% on average, at the expense of a 1.71% increase in the Bjontegaard-Delta bitrate (BD-rate). While VP9 provides several built-in speed levels which are designed to provide faster encoding at the expense of decreased rate-distortion performance, we find that our model is able to outperform the fastest recommended speed level of the reference VP9 encoder for the good quality intra encoding configuration, in terms of both speedup and BD-rate.

Journal ArticleDOI
TL;DR: This work proposes a new “naturalness”-based image quality predictor for generative images that is built using a multi-stage parallel boosting system based on structural similarity features and measurements of statistical similarity.
Abstract: In recent years, deep neural networks have been utilized in a wide variety of applications including image generation. In particular, generative adversarial networks (GANs) are able to produce highly realistic pictures as part of tasks such as image compression. As with standard compression, it is desirable to be able to automatically assess the perceptual quality of generative images to monitor and control the encode process. However, existing image quality algorithms are ineffective on GAN generated content, especially on textured regions and at high compressions. Here we propose a new “naturalness”-based image quality predictor for generative images. Our new GAN picture quality predictor is built using a multi-stage parallel boosting system based on structural similarity features and measurements of statistical similarity. To enable model development and testing, we also constructed a subjective GAN image quality database containing (distorted) GAN images and collected human opinions of them. Our experimental results indicate that our proposed GAN IQA model delivers superior quality predictions on the generative image datasets, as well as on traditional image quality datasets.

Journal ArticleDOI
TL;DR: The proposed debanding filter is able to adaptively smooth banded regions while preserving image edges and details, yielding perceptually enhanced gradient rendering with limited bit-depths.
Abstract: Banding artifacts, which manifest as staircase-like color bands on pictures or video frames, is a common distortion caused by compression of low-textured smooth regions. These false contours can be very noticeable even on high-quality videos, especially when displayed on high-definition screens. Yet, relatively little attention has been applied to this problem. Here we consider banding artifact removal as a visual enhancement problem, and accordingly, we solve it by applying a form of content-adaptive smoothing filtering followed by dithered quantization, as a post-processing module. The proposed debanding filter is able to adaptively smooth banded regions while preserving image edges and details, yielding perceptually enhanced gradient rendering with limited bit-depths. Experimental results show that our proposed debanding filter outperforms state-of-the-art false contour removing algorithms both visually and quantitatively.
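A toy sketch of the two stages described above, content-adaptive smoothing restricted to low-gradient regions followed by dithered quantization; the gradient threshold, filter size, and synthetic banded input are assumptions, not the proposed filter's actual parameters.

```python
# Content-adaptive smoothing followed by dithered quantization (toy debanding sketch).
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def deband(luma, grad_thresh=12.0, sigma=2.0, levels=256):
    grad = np.hypot(sobel(luma, axis=0), sobel(luma, axis=1))
    smooth = gaussian_filter(luma, sigma=sigma)
    # Smooth only low-gradient (banding-prone) pixels; keep edges and texture intact.
    out = np.where(grad < grad_thresh, smooth, luma)
    # Dithered quantization: add sub-step noise before rounding back to the target levels.
    dither = np.random.default_rng(8).uniform(-0.5, 0.5, luma.shape)
    step = 256.0 / levels
    return np.clip(np.round(out / step + dither) * step, 0, 255)

banded = np.repeat(np.arange(0, 64, 2), 8)[None, :] * np.ones((64, 1))   # synthetic bands
print(deband(banded.astype(float)).shape)
```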

Proceedings ArticleDOI
04 May 2020
TL;DR: In this article, a video compression framework using conditional Generative Adversarial Networks (GANs) is proposed, which relies on two encoders: one that deploys a standard video codec and another one which generates low-level soft edge maps.
Abstract: We propose a video compression framework using conditional Generative Adversarial Networks (GANs). We rely on two encoders: one that deploys a standard video codec and another one which generates low-level soft edge maps. For decoding, we use a standard video decoder as well as a decoder that is trained using a conditional GAN. Recent "deep" approaches to video compression require multiple videos to pre-train generative networks that conduct interpolation. By contrast, our scheme trains a generative decoder that requires only a small number of key frames and edge maps taken from a single video, without any interpolation. Experiments on two video datasets demonstrate that the proposed GAN-based compression engine is a promising alternative to traditional video codec approaches that can achieve higher quality reconstructions for very low bitrates.

Journal ArticleDOI
TL;DR: The experimental results show that the new approach to digitally implementing the Retinex using a local deviation based variational model can reconstruct more accurate recovered images than other state-of-the-art methods, while maintaining good contrast.
Abstract: A topic of continued interest in Retinex over the years has been finding ways to implement it with computational models of improved accuracy and efficiency. We have devised a new approach to digitally implementing the Retinex using a local deviation based variational model. The new model leads to improvements in the computed image quality with respect to illumination correction and image enhancement. Several contributions are made: 1) a new prior constraint, which we call local flatness, is proposed, and a new measure of Local Deviation (LD) is developed to quantify the degree of local illumination flatness; 2) a variational problem is defined and the solution is found by a logical sequence of steps; 3) discrete implementation of the variational solution is shown to effectively estimate and remove uneven illumination, yielding an accurate recovered image. Unlike other physical prior based variational Retinex models, which use the L2 norm of the illumination gradient to enforce smoothness of illumination, our LD prior selectively imposes local flatness on illumination by calculating the deviation between the estimated illumination surface and a reference plane. In the experiments, pseudo ground truth images are created by superimposing uneven illumination on real scenes, providing an effective way to objectively assess algorithm performance. The experimental results show that our method can reconstruct more accurate recovered images than other state-of-the-art methods, while maintaining good contrast.


Posted Content
TL;DR: This work constructs a proxy network, which mimics the perceptual model while serving as a loss layer of the network, and experimentally demonstrates how this optimization framework can be applied to train an end-to-end optimized image compression network.
Abstract: Mean squared error (MSE) and ℓp norms have largely dominated the measurement of loss in neural networks due to their simplicity and analytical properties. However, when used to assess visual information loss, these simple norms are not highly consistent with human perception. Here, we propose a different proxy approach to optimize image analysis networks against quantitative perceptual models. Specifically, we construct a proxy network, which mimics the perceptual model while serving as a loss layer of the network. We experimentally demonstrate how this optimization framework can be applied to train an end-to-end optimized image compression network. By building on top of a modern deep image compression model, we are able to demonstrate an averaged bitrate reduction of 287% over MSE optimization, given a specified perceptual quality (VMAF) level.
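A minimal PyTorch sketch of the proxy-loss idea: a small frozen network standing in for a trained metric-mimicking proxy is used as a differentiable loss while training a toy compression-style network; the architectures and sign convention are illustrative assumptions.

```python
# A frozen proxy network standing in for a perceptual metric, used as a training loss.
import torch
import torch.nn as nn

proxy = nn.Sequential(                        # stand-in for a trained metric-mimicking network
    nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
)
proxy.eval()
for p in proxy.parameters():                  # freeze the proxy during training
    p.requires_grad_(False)

autoencoder = nn.Sequential(                  # toy stand-in for a learned compression model
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 3, 3, padding=1),
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-4)

x = torch.rand(4, 3, 64, 64)                  # placeholder image batch
recon = autoencoder(x)
# A higher proxy output is assumed to mean better predicted quality, so minimize its negative.
loss = -proxy(torch.cat([x, recon], dim=1)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```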

Journal ArticleDOI
TL;DR: An objective video quality model is designed which builds on existing video quality algorithms by considering the fidelity of chroma channels in a principled way, and which implies that there is room for reducing bitrate consumption in modern video codecs by creatively increasing the compression factor on chroma channels.
Abstract: Measuring the quality of digital videos viewed by human observers has become a common practice in numerous multimedia applications, such as adaptive video streaming, quality monitoring, and other digital TV applications. Here we explore a significant, yet relatively unexplored problem: measuring perceptual quality on videos arising from both luma and chroma distortions from compression. Toward investigating this problem, it is important to understand the kinds of chroma distortions that arise, how they relate to luma compression distortions, and how they can affect perceived quality. We designed and carried out a subjective experiment to measure subjective video quality on both luma and chroma distortions, introduced both in isolation as well as together. Specifically, the new subjective dataset comprises a total of 210 videos afflicted by distortions caused by varying levels of luma quantization commingled with different amounts of chroma quantization. The subjective scores were evaluated by 34 subjects in a controlled environmental setting. Using the newly collected subjective data, we were able to demonstrate important shortcomings of existing video quality models, especially in regards to chroma distortions. Further, we designed an objective video quality model which builds on existing video quality algorithms, by considering the fidelity of chroma channels in a principled way. We also found that this quality analysis implies that there is room for reducing bitrate consumption in modern video codecs by creatively increasing the compression factor on chroma channels. We believe that this work will both encourage further research in this direction, as well as advance progress on the ultimate goal of jointly optimizing luma and chroma compression in modern video encoders.

Posted Content
TL;DR: A new distortion-specific no-reference video quality model for predicting banding artifacts, called the Blind BANding Detector (BBAND index), which is inspired by human visual models.
Abstract: Banding artifact, or false contouring, is a common video compression impairment that tends to appear on large flat regions in encoded videos. These staircase-shaped color bands can be very noticeable in high-definition videos. Here we study this artifact, and propose a new distortion-specific no-reference video quality model for predicting banding artifacts, called the Blind BANding Detector (BBAND index). BBAND is inspired by human visual models. The proposed detector can generate a pixel-wise banding visibility map and output a banding severity score at both the frame and video levels. Experimental results show that our proposed method outperforms state-of-the-art banding detection algorithms and delivers better consistency with subjective evaluations.

Journal ArticleDOI
TL;DR: A database of typical “billboard” and “thumbnail” images viewed on mobile streaming applications is created and the effects of compression, scaling and chroma-subsampling on perceived quality are studied by conducting a subjective study.
Abstract: With the growing use of smart cellular devices for entertainment purposes, audio and video streaming services now offer an increasingly wide variety of popular mobile applications that provide portable and accessible ways to consume content. The user interfaces of these applications have become increasingly visual in nature, and are commonly loaded with dense multimedia content such as thumbnail images, animated GIFs, and short videos. To efficiently render these and to aid rapid download to the client display, it is necessary to compress, scale and color subsample them. These operations introduce distortions, reducing the appeal of the application. It is desirable to be able to automatically monitor and govern the visual qualities of these images, which are usually quite small. However, while there exists a variety of high-performing image quality assessment (IQA) algorithms, none have been designed for this particular use case. This kind of content often has unique characteristics, such as overlaid graphics, intentional brightness gradients, text, and warping. We describe a study we conducted on the subjective and objective quality of images embedded in the displayed user interfaces of mobile streaming applications. We created a database of typical “billboard” and “thumbnail” images viewed on such services. Using the collected data, we studied the effects of compression, scaling and chroma-subsampling on perceived quality by conducting a subjective study. We also evaluated the performance of leading picture quality prediction models on the new database. We report some surprising results regarding algorithm performance, and find that there remains ample scope for future model development.

Journal ArticleDOI
TL;DR: A novel statistical entropic differencing method, based on a Generalized Gaussian Distribution model expressed in the spatial and temporal band-pass domains, measures the difference in quality between reference and distorted videos.
Abstract: High frame rate videos have become increasingly popular in recent years, driven by the strong requirements of the entertainment and streaming industries to provide high quality of experience to consumers. To achieve the best trade-offs between bandwidth requirements and video quality in terms of frame rate adaptation, it is imperative to understand the effects of frame rate on video quality. In this direction, we devise a novel statistical entropic differencing method based on a Generalized Gaussian Distribution model expressed in the spatial and temporal band-pass domains, which measures the difference in quality between reference and distorted videos. The proposed design is highly generalizable and can be employed when the reference and distorted sequences have different frame rates. Our proposed model correlates very well with subjective scores in the recently proposed LIVE-YT-HFR database and achieves state-of-the-art performance when compared with existing methodologies.