
Showing papers by "Alan C. Bovik published in 2021"


Journal ArticleDOI
01 Jan 2021
TL;DR: In this paper, the Rapid and Accurate Video Quality Evaluator (RAPIQUE) model is proposed for video quality prediction, which combines and leverages the advantages of both quality-aware scene statistics features and semantics-aware deep convolutional features.
Abstract: Blind or no-reference video quality assessment of user-generated content (UGC) has become a trending, challenging, heretofore unsolved problem. Accurate and efficient video quality predictors suitable for this content are thus in great demand to achieve more intelligent analysis and processing of UGC videos. Previous studies have shown that natural scene statistics and deep learning features are both sufficient to capture spatial distortions, which contribute to a significant aspect of UGC video quality issues. However, these models are either incapable or inefficient for predicting the quality of complex and diverse UGC videos in practical applications. Here we introduce an effective and efficient video quality model for UGC content, which we dub the Rapid and Accurate Video Quality Evaluator (RAPIQUE), which we show performs comparably to state-of-the-art (SOTA) models but with orders-of-magnitude faster runtime. RAPIQUE combines and leverages the advantages of both quality-aware scene statistics features and semantics-aware deep convolutional features, allowing us to design the first general and efficient spatial and temporal (space-time) bandpass statistics model for video quality modeling. Our experimental results on recent large-scale UGC video quality databases show that RAPIQUE delivers top performances on all the datasets at a considerably lower computational expense. We hope this work promotes and inspires further efforts towards practical modeling of video quality problems for potential real-time and low-latency applications.
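As a rough illustration of the kind of quality-aware space-time bandpass statistics RAPIQUE builds on (not the authors' released implementation, which also includes semantics-aware deep CNN features and different temporal filtering), a minimal NumPy sketch:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def mscn(frame, sigma=7/6, c=1.0):
        """Mean-subtracted, contrast-normalized coefficients of a grayscale frame."""
        mu = gaussian_filter(frame, sigma)
        var = np.maximum(gaussian_filter(frame * frame, sigma) - mu * mu, 0.0)
        return (frame - mu) / (np.sqrt(var) + c)

    def bandpass_stats(x):
        """Simple summary statistics of bandpass coefficients (illustrative only)."""
        return [x.mean(), x.std(), np.abs(x).mean(),
                ((x - x.mean()) ** 4).mean() / (x.var() ** 2 + 1e-12)]

    def spatiotemporal_features(frames):
        """frames: list of grayscale float frames. Hypothetical feature layout."""
        feats = []
        for prev, curr in zip(frames[:-1], frames[1:]):
            feats += bandpass_stats(mscn(curr))          # spatial statistics
            feats += bandpass_stats(mscn(curr - prev))   # temporal (frame-difference) statistics
        return np.mean(np.array(feats).reshape(len(frames) - 1, -1), axis=0)

    # Toy usage on random frames.
    video = [np.random.rand(120, 160) for _ in range(8)]
    print(spatiotemporal_features(video).shape)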

100 citations


Journal ArticleDOI
TL;DR: In this article, the VIDeo quality EVALuator (VIDEVAL), a fusion-based blind VQA model built from 60 features selected out of 763 existing statistical features, is proposed to balance the trade-off between VQA performance and efficiency on UGC/consumer videos.
Abstract: Recent years have witnessed an explosion of user-generated content (UGC) videos shared and streamed over the Internet, thanks to the evolution of affordable and reliable consumer capture devices, and the tremendous popularity of social media platforms. Accordingly, there is a great need for accurate video quality assessment (VQA) models for UGC/consumer videos to monitor, control, and optimize this vast content. Blind quality prediction of in-the-wild videos is quite challenging, since the quality degradations of UGC videos are unpredictable, complicated, and often commingled. Here we contribute to advancing the UGC-VQA problem by conducting a comprehensive evaluation of leading no-reference/blind VQA (BVQA) features and models on a fixed evaluation architecture, yielding new empirical insights on both subjective video quality studies and objective VQA model design. By employing a feature selection strategy on top of efficient BVQA models, we are able to extract 60 out of 763 statistical features used in existing methods to create a new fusion-based model, which we dub the VIDeo quality EVALuator (VIDEVAL), that effectively balances the trade-off between VQA performance and efficiency. Our experimental results show that VIDEVAL achieves state-of-the-art performance at considerably lower computational cost than other leading models. Our study protocol also defines a reliable benchmark for the UGC-VQA problem, which we believe will facilitate further research on deep learning-based VQA modeling, as well as perceptually-optimized efficient UGC video processing, transcoding, and streaming. To promote reproducible research and public evaluation, an implementation of VIDEVAL has been made available online: https://github.com/vztu/VIDEVAL .
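The select-then-fuse recipe can be illustrated with a small scikit-learn sketch; the feature matrix and MOS labels below are placeholders, and VIDEVAL's actual feature definitions and selection procedure are in the repository linked above:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.svm import SVR
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 763))       # stand-in for 763 BVQA features per video
    y = rng.uniform(1, 5, size=200)       # stand-in for mean opinion scores (MOS)

    # Keep the 60 most informative features, then fuse them with an SVR regressor.
    model = make_pipeline(MinMaxScaler(),
                          SelectKBest(f_regression, k=60),
                          SVR(C=1.0, gamma='scale'))
    print(cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=5).mean())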

74 citations


Journal ArticleDOI
TL;DR: A proxy network is constructed, broadly termed ProxIQA, which mimics the perceptual model while serving as a loss layer of the network and is able to demonstrate a bitrate reduction of as much as 31% over MSE optimization, given a specified perceptual quality (VMAF) level.
Abstract: The use of $\ell _{p}$ (p = 1,2) norms has largely dominated the measurement of loss in neural networks due to their simplicity and analytical properties. However, when used to assess the loss of visual information, these simple norms are not very consistent with human perception. Here, we describe a different “proximal” approach to optimize image analysis networks against quantitative perceptual models. Specifically, we construct a proxy network, broadly termed ProxIQA, which mimics the perceptual model while serving as a loss layer of the network. We experimentally demonstrate how this optimization framework can be applied to train an end-to-end optimized image compression network. By building on top of an existing deep image compression model, we are able to demonstrate a bitrate reduction of as much as 31% over MSE optimization, given a specified perceptual quality (VMAF) level.
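A minimal PyTorch sketch of the proxy-loss idea, with toy stand-ins for both the compression network and the non-differentiable perceptual metric (the paper's proxy mimics VMAF on top of a deep image compression model):

    import torch
    import torch.nn as nn

    class ProxyNet(nn.Module):
        """Tiny CNN mapping (reference, reconstruction) to a scalar quality estimate."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
        def forward(self, ref, rec):
            return self.net(torch.cat([ref, rec], dim=1))

    def black_box_metric(ref, rec):
        # Stand-in for a non-differentiable perceptual model such as VMAF.
        return (100.0 - 200.0 * (ref - rec).abs().mean(dim=(1, 2, 3))).detach()

    proxy = ProxyNet()
    codec = nn.Conv2d(3, 3, 3, padding=1)   # toy stand-in for a compression network
    opt_p = torch.optim.Adam(proxy.parameters(), lr=1e-3)
    opt_c = torch.optim.Adam(codec.parameters(), lr=1e-3)

    ref = torch.rand(4, 3, 64, 64)
    for step in range(10):
        rec = codec(ref)
        # 1) Fit the proxy to the black-box metric on current reconstructions.
        loss_p = ((proxy(ref, rec.detach()).squeeze(1) - black_box_metric(ref, rec)) ** 2).mean()
        opt_p.zero_grad(); loss_p.backward(); opt_p.step()
        # 2) Train the codec against the (now differentiable) proxy score.
        loss_c = -proxy(ref, codec(ref)).mean()
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()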

52 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: In this article, a local-to-global region-based no-reference perceptual video quality assessment (VQA) architecture is proposed to predict global video quality and achieves state-of-the-art performance on 3 UGC datasets.
Abstract: No-reference (NR) perceptual video quality assessment (VQA) is a complex, unsolved, and important problem for social and streaming media applications. Efficient and accurate video quality predictors are needed to monitor and guide the processing of billions of shared, often imperfect, user-generated content (UGC). Unfortunately, current NR models are limited in their prediction capabilities on real-world, "in-the-wild" UGC video data. To advance progress on this problem, we created the largest (by far) subjective video quality dataset, containing 38,811 real-world distorted videos and 116,433 space-time localized video patches (‘v-patches’), and 5.5M human perceptual quality annotations. Using this, we created two unique NR-VQA models: (a) a local-to-global region-based NR VQA architecture (called PVQ) that learns to predict global video quality and achieves state-of-the-art performance on 3 UGC datasets, and (b) a first-of-a-kind space-time video quality mapping engine (called PVQ Mapper) that helps localize and visualize perceptual distortions in space and time. The entire dataset and prediction models are freely available at https://live.ece.utexas.edu/research.php.

39 citations


Journal ArticleDOI
TL;DR: A new subjective resource, called the LIVE-YouTube-HFR (LIVE-YT-HFR) dataset, comprising 480 videos spanning 6 different frame rates, obtained from 16 diverse contents, and made available online for public use and evaluation purposes.
Abstract: High frame rate (HFR) videos are becoming increasingly common with the tremendous popularity of live, high-action streaming content such as sports. Although HFR contents are generally of very high quality, high bandwidth requirements make them challenging to deliver efficiently, while simultaneously maintaining their quality. To optimize trade-offs between bandwidth requirements and video quality, in terms of frame rate adaptation, it is imperative to understand the intricate relationship between frame rate and perceptual video quality. Towards advancing progress in this direction, we designed a new subjective resource, called the LIVE-YouTube-HFR (LIVE-YT-HFR) dataset, which is comprised of 480 videos having 6 different frame rates, obtained from 16 diverse contents. In order to understand the combined effects of compression and frame rate adjustment, we also processed videos at 5 compression levels at each frame rate. To obtain subjective labels on the videos, we conducted a human study yielding 19,000 human quality ratings obtained from a pool of 85 human subjects. We also conducted a holistic evaluation of existing state-of-the-art Full and No-Reference video quality algorithms, and statistically benchmarked their performance on the new database. The LIVE-YT-HFR database has been made available online for public use and evaluation purposes, with hopes that it will help advance research in this exciting video technology direction. It may be obtained at https://live.ece.utexas.edu/research/LIVE_YT_HFR/LIVE_YT_HFR/index.html .

32 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose a new no-reference VQA model based on highly-localized space-time slices, called Space-Time Chips (ST Chips), that implicitly capture motion.
Abstract: We propose a new model for no-reference video quality assessment (VQA). Our approach uses a new idea of highly-localized space-time (ST) slices called Space-Time Chips (ST Chips). ST Chips are localized cuts of video data along directions that implicitly capture motion. We use perceptually-motivated bandpass and normalization models to first process the video data, and then select oriented ST Chips based on how closely they fit parametric models of natural video statistics. We show that the parameters that describe these statistics can be used to reliably predict the quality of videos, without the need for a reference video. The proposed method implicitly models ST video naturalness, and deviations from naturalness. We train and test our model on several large VQA databases, and show that our model achieves state-of-the-art performance at reduced cost, without requiring motion computation.

26 citations


Journal ArticleDOI
TL;DR: The Structural Similarity (SSIM) Index is a very widely used image/video quality model that continues to play an important role in the perceptual evaluation of compression algorithms, encoding recipes and numerous other image processing algorithms as mentioned in this paper.
Abstract: The Structural Similarity (SSIM) Index is a very widely used image/video quality model that continues to play an important role in the perceptual evaluation of compression algorithms, encoding recipes and numerous other image/video processing algorithms. Several public implementations of the SSIM and Multiscale-SSIM (MS-SSIM) algorithms have been developed, which differ in efficiency and performance. This “bendable ruler” makes the process of quality assessment of encoding algorithms unreliable. To address this situation, we studied and compared the functions and performances of popular and widely used implementations of SSIM, and we also considered a variety of design choices. Based on our studies and experiments, we have arrived at a collection of recommendations on how to use SSIM most effectively, including ways to reduce its computational burden.
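For context, the scikit-image implementation exposes several of the design choices at issue here (window type and size, data range); the settings below are a common way to approximate the original Gaussian-windowed SSIM, though the paper's recommendations should be consulted for authoritative guidance:

    import numpy as np
    from skimage.metrics import structural_similarity

    ref = np.random.randint(0, 256, (256, 256), dtype=np.uint8)   # stand-in grayscale images
    dis = np.clip(ref.astype(int) + np.random.randint(-10, 10, ref.shape), 0, 255).astype(np.uint8)

    # Setting data_range explicitly avoids a common source of cross-implementation mismatch.
    score_uniform = structural_similarity(ref, dis, data_range=255)            # default uniform window
    score_gaussian = structural_similarity(ref, dis, data_range=255,
                                           gaussian_weights=True, sigma=1.5,
                                           use_sample_covariance=False)        # Gaussian-windowed variant
    print(score_uniform, score_gaussian)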

23 citations


Journal ArticleDOI
TL;DR: Different objective image and video quality assessment algorithms, including both FIQA / FVQA algorithms and non-foveated algorithms, are evaluated on the so-called LIVE-Facebook Technologies Foveation-Compressed Virtual Reality (LIVE-FBT-FCVR) databases, and a statistical evaluation of the relative performances of these algorithms is presented.
Abstract: In Virtual Reality (VR), the requirements of much higher resolution and smooth viewing experiences under rapid and often real-time changes in viewing direction lead to significant challenges in compression and communication. To reduce the stresses of very high bandwidth consumption, the concept of foveated video compression is being accorded renewed interest. By exploiting the space-variant property of retinal visual acuity, foveation has the potential to substantially reduce video resolution in the visual periphery, with hardly noticeable perceptual quality degradations. Accordingly, foveated image / video quality predictors are also becoming increasingly important, as a practical way to monitor and control future foveated compression algorithms. Towards advancing the development of foveated image / video quality assessment (FIQA / FVQA) algorithms, we have constructed 2D and (stereoscopic) 3D VR databases of foveated / compressed videos, and conducted a human study of perceptual quality on each database. Each database includes 10 reference videos and 180 foveated videos, which were processed by 3 levels of foveation on the reference videos. Foveation was applied by increasing compression with increased eccentricity. In the 2D study, each video was of resolution $7680\times 3840$ and was viewed and quality-rated by 36 subjects, while in the 3D study, each video was of resolution $5376\times 5376$ and rated by 34 subjects. Both studies were conducted on top of a foveated video player having low motion-to-photon latency (~50ms). We evaluated different objective image and video quality assessment algorithms, including both FIQA / FVQA algorithms and non-foveated algorithms, on our so-called LIVE-Facebook Technologies Foveation-Compressed Virtual Reality (LIVE-FBT-FCVR) databases. We also present a statistical evaluation of the relative performances of these algorithms. The LIVE-FBT-FCVR databases have been made publicly available and can be accessed at https://live.ece.utexas.edu/research/LIVEFBTFCVR/index.html .

20 citations


Journal ArticleDOI
TL;DR: A large, dedicated VR sickness/presence database is constructed, which contains 100 VR videos with associated human subjective ratings, and a statistical model of spatio-temporal and rotational frame difference maps to predict VR sickness and VR presence is developed.
Abstract: Although it is well-known that the negative effects of VR sickness, and the desirable sense of presence are important determinants of a user’s immersive VR experience, there remains a lack of definitive research outcomes to enable the creation of methods to predict and/or optimize the trade-offs between them. Most VR sickness assessment (VRSA) and VR presence assessment (VRPA) studies reported to date have utilized simple image patterns as probes, hence their results are difficult to apply to the highly diverse contents encountered in general, real-world VR environments. To help fill this void, we have constructed a large, dedicated VR sickness/presence (VR-SP) database, which contains 100 VR videos with associated human subjective ratings. Using this new resource, we developed a statistical model of spatio-temporal and rotational frame difference maps to predict VR sickness. We also designed an exceptional motion feature, which is expressed as the correlation between an instantaneous change feature and averaged temporal features. By adding additional features (visual activity, content features) to capture the sense of presence, we use the new data resource to explore the relationship between VRSA and VRPA. We also show the aggregate VR-SP model is able to predict VR sickness with an accuracy of 90% and VR presence with an accuracy of 75% using the new VR-SP dataset.

20 citations


Journal ArticleDOI
TL;DR: A new NR-IQA/BIQA model that operates on natural scene statistics in the contourlet domain is proposed that has high linearity against human subjective perception, and outperforms the state-of-the-art NR-IQA models.
Abstract: No-reference/blind image quality assessment (NR-IQA/BIQA) algorithms play an important role in image evaluation, as they can assess the quality of an image automatically, only using the distorted image whose quality is being assessed. Among the existing NR-IQA/BIQA methods, natural scene statistic (NSS) models which can be expressed in different bandpass domains show good consistency with human subjective judgments of quality. In this paper, we create new ‘quality-aware’ features: the energy differences of the sub-band coefficients across scales via contourlet transform, and propose a new NR-IQA/BIQA model that operates on natural scene statistics in the contourlet domain. Prior to applying the contourlet transform, we apply two preprocessing steps that help to create more information-dense, low-entropy representations. Specifically, we transform the picture into the CIELAB color space and gradient magnitude map. Then, a number of ‘quality-aware’ features are discovered in the contourlet transform domain: the energy of the sub-band coefficients within scales, and the energy differences between scales, as well as measurements of the statistical relationships of pixels across scales. A detailed analysis is conducted to show how different distortions affect the statistical characteristics of these features, and then features are fed to a support vector regression (SVR) model which learns to predict image quality. Experimental results show that the proposed method has high linearity against human subjective perception, and outperforms the state-of-the-art NR-IQA models.

19 citations


Posted Content
TL;DR: In this paper, prediction of distortion type and degree is used as an auxiliary task to learn image quality features from an unlabeled image dataset containing a mixture of synthetic and realistic distortions.
Abstract: We consider the problem of obtaining image quality representations in a self-supervised manner. We use prediction of distortion type and degree as an auxiliary task to learn features from an unlabeled image dataset containing a mixture of synthetic and realistic distortions. We then train a deep Convolutional Neural Network (CNN) using a contrastive pairwise objective to solve the auxiliary problem. We refer to the proposed training framework and resulting deep IQA model as the CONTRastive Image QUality Evaluator (CONTRIQUE). During evaluation, the CNN weights are frozen and a linear regressor maps the learned representations to quality scores in a No-Reference (NR) setting. We show through extensive experiments that CONTRIQUE achieves competitive performance when compared to state-of-the-art NR image quality models, even without any additional fine-tuning of the CNN backbone. The learned representations are highly robust and generalize well across images afflicted by either synthetic or authentic distortions. Our results suggest that powerful quality representations with perceptual relevance can be obtained without requiring large labeled subjective image quality datasets. The implementations used in this paper are available at \url{https://github.com/pavancm/CONTRIQUE}.
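The frozen-backbone, linear-readout evaluation protocol described above can be sketched as follows; a generic ResNet-50 stands in for the contrastively pre-trained CONTRIQUE encoder, and the images and MOS labels are placeholders:

    import numpy as np
    import torch
    import torch.nn as nn
    import torchvision.models as models
    from sklearn.linear_model import Ridge

    # Frozen feature extractor (stand-in for the pre-trained CONTRIQUE backbone).
    backbone = models.resnet50(weights=None)   # load trained weights in practice
    backbone.fc = nn.Identity()
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)

    def extract(images):
        with torch.no_grad():
            return backbone(images).numpy()

    # Placeholder images and MOS labels.
    train_x = extract(torch.rand(32, 3, 224, 224))
    train_y = np.random.uniform(1, 5, size=32)
    test_x = extract(torch.rand(8, 3, 224, 224))

    # A linear regressor maps the frozen representations to quality scores (NR setting).
    reg = Ridge(alpha=1.0).fit(train_x, train_y)
    print(reg.predict(test_x))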

Journal ArticleDOI
TL;DR: In this paper, a subjective experiment was carried out to measure subjective video quality on both luma and chroma distortions, introduced both in isolation and together, with subjective scores gathered from 34 subjects in a controlled environment.
Abstract: Measuring the quality of digital videos viewed by human observers has become a common practice in numerous multimedia applications, such as adaptive video streaming, quality monitoring, and other digital TV applications. Here we explore a significant, yet relatively unexplored problem: measuring perceptual quality on videos arising from both luma and chroma distortions from compression. Toward investigating this problem, it is important to understand the kinds of chroma distortions that arise, how they relate to luma compression distortions, and how they can affect perceived quality. We designed and carried out a subjective experiment to measure subjective video quality on both luma and chroma distortions, introduced both in isolation as well as together. Specifically, the new subjective dataset comprises a total of 210 videos afflicted by distortions caused by varying levels of luma quantization commingled with different amounts of chroma quantization. The subjective scores were evaluated by 34 subjects in a controlled environmental setting. Using the newly collected subjective data, we were able to demonstrate important shortcomings of existing video quality models, especially in regards to chroma distortions. Further, we designed an objective video quality model which builds on existing video quality algorithms, by considering the fidelity of chroma channels in a principled way. We also found that this quality analysis implies that there is room for reducing bitrate consumption in modern video codecs by creatively increasing the compression factor on chroma channels. We believe that this work will both encourage further research in this direction, as well as advance progress on the ultimate goal of jointly optimizing luma and chroma compression in modern video encoders.

Journal ArticleDOI
TL;DR: In this paper, a generalized Gaussian distribution (GGD) is used to model band-pass responses, while entropy variations between reference and distorted videos under the GGD model are used to capture video quality variations arising from frame rate changes.
Abstract: We consider the problem of conducting frame rate dependent video quality assessment (VQA) on videos of diverse frame rates, including high frame rate (HFR) videos. More generally, we study how perceptual quality is affected by frame rate, and how frame rate and compression combine to affect perceived quality. We devise an objective VQA model called Space-Time GeneRalized Entropic Difference (GREED) which analyzes the statistics of spatial and temporal band-pass video coefficients. A generalized Gaussian distribution (GGD) is used to model band-pass responses, while entropy variations between reference and distorted videos under the GGD model are used to capture video quality variations arising from frame rate changes. The entropic differences are calculated across multiple temporal and spatial subbands, and merged using a learned regressor. We show through extensive experiments that GREED achieves state-of-the-art performance on the LIVE-YT-HFR Database when compared with existing VQA models. The features used in GREED are highly generalizable and obtain competitive performance even on standard, non-HFR VQA databases. The implementation of GREED has been made available online: https://github.com/pavancm/GREED .
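The GGD fitting and entropy calculations underlying this style of entropic-difference model can be sketched as below (moment-matching fit and closed-form differential entropy); this is an illustrative re-derivation rather than the released GREED code linked above:

    import numpy as np
    from scipy.special import gamma

    def fit_ggd(x, betas=np.arange(0.2, 10, 0.001)):
        """Moment-matching fit of a generalized Gaussian: returns (alpha, beta)."""
        x = x.ravel()
        r = np.mean(np.abs(x)) ** 2 / (np.mean(x ** 2) + 1e-12)
        rho = gamma(2.0 / betas) ** 2 / (gamma(1.0 / betas) * gamma(3.0 / betas))
        beta = betas[np.argmin(np.abs(rho - r))]
        alpha = np.sqrt(np.mean(x ** 2) * gamma(1.0 / beta) / gamma(3.0 / beta))
        return alpha, beta

    def ggd_entropy(alpha, beta):
        """Differential entropy of a GGD with scale alpha and shape beta (in nats)."""
        return 1.0 / beta + np.log(2.0 * alpha * gamma(1.0 / beta) / beta)

    def entropic_difference(ref_band, dis_band):
        """Illustrative scalar entropic difference between two bandpass responses."""
        return abs(ggd_entropy(*fit_ggd(ref_band)) - ggd_entropy(*fit_ggd(dis_band)))

    # Toy usage: compare bandpass-like coefficients from two signals.
    ref = np.random.laplace(size=10000)
    dis = ref + 0.5 * np.random.normal(size=10000)
    print(entropic_difference(ref, dis))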

Journal ArticleDOI
TL;DR: In this article, the LIVE-NFLX-II database contains subjective QoE responses to various design dimensions, such as bitrate adaptation algorithms, network conditions and video content.
Abstract: Measuring Quality of Experience (QoE) and integrating these measurements into video streaming algorithms is a multi-faceted problem that fundamentally requires the design of comprehensive subjective QoE databases and objective QoE prediction models. To achieve this goal, we have recently designed the LIVE-NFLX-II database, a highly-realistic database which contains subjective QoE responses to various design dimensions, such as bitrate adaptation algorithms, network conditions and video content. Our database builds on recent advancements in content-adaptive encoding and incorporates actual network traces to capture realistic network variations on the client device. The new database focuses on low bandwidth conditions which are more challenging for bitrate adaptation algorithms, which often must navigate tradeoffs between rebuffering and video quality. Using our database, we study the effects of multiple streaming dimensions on user experience and evaluate video quality and quality of experience models and analyze their strengths and weaknesses. We believe that the tools introduced here will help inspire further progress on the development of perceptually-optimized client adaptation and video streaming strategies. The database is publicly available at http://live.ece.utexas.edu/research/LIVE_NFLX_II/live_nflx_plus.html .

Posted Content
TL;DR: The ETRI-LIVE Space-Time Subsampled Video Quality (ETRI-LIVE STSVQ) database, as discussed by the authors, contains 437 videos generated by applying various levels of combined space-time subsampling and video compression on 15 diverse video contents.
Abstract: Video dimensions are continuously increasing to provide more realistic and immersive experiences to global streaming and social media viewers. However, increments in video parameters such as spatial resolution and frame rate are inevitably associated with larger data volumes. Transmitting increasingly voluminous videos through limited bandwidth networks in a perceptually optimal way is a current challenge affecting billions of viewers. One recent practice adopted by video service providers is space-time resolution adaptation in conjunction with video compression. Consequently, it is important to understand how different levels of space-time subsampling and compression affect the perceptual quality of videos. Towards making progress in this direction, we constructed a large new resource, called the ETRI-LIVE Space-Time Subsampled Video Quality (ETRI-LIVE STSVQ) database, containing 437 videos generated by applying various levels of combined space-time subsampling and video compression on 15 diverse video contents. We also conducted a large-scale human study on the new dataset, collecting about 15,000 subjective judgments of video quality. We provide a rate-distortion analysis of the collected subjective scores, enabling us to investigate the perceptual impact of space-time subsampling at different bit rates. We also evaluated and compared the performance of leading video quality models on the new database.

Journal ArticleDOI
TL;DR: In this paper, the temporal bandpass (e.g., lag) filtering in lateral geniculate nucleus (LGN) and area V1 was used to model the statistics of the differences between adjacent or neighboring video frames that have been slightly spatially displaced relative to one another.
Abstract: It is well known that natural images possess statistical regularities that can be captured by bandpass decomposition and divisive normalization processes that approximate early neural processing in the human visual system. We expand on these studies and present new findings on the properties of space-time natural statistics that are inherent in motion pictures. Our model relies on the concept of temporal bandpass (e.g., lag) filtering in lateral geniculate nucleus (LGN) and area V1, which is similar to smoothed frame differencing of video frames. Specifically, we model the statistics of the differences between adjacent or neighboring video frames that have been slightly spatially displaced relative to one another. We find that when these space-time differences are further subjected to locally pooled divisive normalization, statistical regularities (or lack thereof) arise that depend on the local motion trajectory. We find that bandpass and divisively normalized frame differences that are displaced along the motion direction exhibit stronger statistical regularities than for other displacements. Conversely, the direction-dependent regularities of displaced frame differences can be used to estimate the image motion (optical flow) by finding the space-time displacement paths that best preserve statistical regularity.
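The displaced, divisively normalized frame differences described above can be sketched in a few lines of NumPy (integer-pixel displacements only, with the bandpass filtering and statistical modeling of the paper simplified away):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def displaced_difference(prev, curr, dx=0, dy=0):
        """Difference between the current frame and the previous frame shifted by (dy, dx)."""
        return curr - np.roll(prev, shift=(dy, dx), axis=(0, 1))

    def divisive_normalize(d, sigma=3.0, c=1.0):
        """Divide by a locally pooled (Gaussian-weighted) standard deviation."""
        mu = gaussian_filter(d, sigma)
        var = np.maximum(gaussian_filter(d * d, sigma) - mu * mu, 0.0)
        return (d - mu) / (np.sqrt(var) + c)

    # Toy test: global translation of 2 pixels to the right; differencing along the
    # true motion direction cancels the content, leaving a near-zero residual.
    prev = np.random.rand(128, 128)
    curr = np.roll(prev, shift=(0, 2), axis=(0, 1))
    for dx in (0, 1, 2, 3):
        nd = divisive_normalize(displaced_difference(prev, curr, dx=dx))
        print(dx, np.mean(nd ** 2))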

Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this paper, binary and ordinal classification methods are proposed to evaluate and compare no-reference quality models at coarser levels to make the problem more tractable; the proposed new tasks also carry more practical meaning for perceptually optimized UGC transcoding and for preprocessing on media processing platforms.
Abstract: Video and image quality assessment has long been cast as a regression problem, which requires predicting a continuous quality score given an input stimulus. However, recent efforts have shown that accurate quality score regression on real-world user-generated content (UGC) is a very challenging task. To make the problem more tractable, we propose two new methods - binary and ordinal classification - as alternatives to evaluate and compare no-reference quality models at coarser levels. Moreover, the proposed new tasks carry more practical meaning for perceptually optimized UGC transcoding and for preprocessing on media processing platforms. We conduct a comprehensive benchmark experiment of popular no-reference quality models on recent in-the-wild picture and video quality datasets, providing reliable baselines for both evaluation methods to support further studies. We hope this work promotes coarse-grained perceptual modeling and its applications to efficient UGC processing.
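A minimal sketch of the coarser-grained evaluation being proposed: threshold MOS into binary or ordinal labels and score a quality predictor with classification metrics instead of correlations (the thresholds and data below are placeholders):

    import numpy as np
    from sklearn.metrics import roc_auc_score, accuracy_score

    rng = np.random.default_rng(0)
    mos = rng.uniform(1, 5, size=500)                 # ground-truth mean opinion scores
    pred = mos + rng.normal(scale=0.6, size=500)      # stand-in model predictions

    # Binary task: is the video of acceptable quality? (hypothetical MOS threshold of 3.5)
    y_bin = (mos >= 3.5).astype(int)
    print("AUC:", roc_auc_score(y_bin, pred))

    # Ordinal task: low / medium / high quality buckets (hypothetical cut points).
    bins = [2.5, 3.5]
    y_ord = np.digitize(mos, bins)
    y_ord_pred = np.digitize(pred, bins)
    print("ordinal accuracy:", accuracy_score(y_ord, y_ord_pred))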

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, a video quality database for live streaming has been built, called the LIVE Livestream Database, which includes 315 videos of 45 contents impaired by 6 types of distortions.
Abstract: Video live streaming is gaining prevalence among video streaming services, especially for the delivery of popular sporting events. Many objective Video Quality Assessment (VQA) models have been developed to predict the perceptual quality of videos. Appropriate databases that exemplify the distortions encountered in live streaming videos are important to designing and learning objective VQA models. Towards making progress in this direction, we built a video quality database specifically designed for live streaming VQA research. The new video database is called the Laboratory for Image and Video Engineering (LIVE) Live stream Database. The LIVE Livestream Database includes 315 videos of 45 contents impaired by 6 types of distortions. We also performed a subjective quality study using the new database, whereby more than 12,000 human opinions were gathered from 40 subjects. We demonstrate the usefulness of the new resource by performing a holistic evaluation of the performance of current state-of-the-art (SOTA) VQA models. The LIVE Livestream database is being made publicly available for these purposes at https://live.ece.utexas.edu/research/LIVE_APV_Study/apv_index.html.

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, the Rapid and Accurate Video Quality Evaluator (RAPIQUE) is proposed to predict the quality of complex, diverse UGC videos in practical applications.
Abstract: Blind video quality assessment of user-generated content (UGC) has become a trending, challenging, unsolved problem. Accurate and efficient video quality predictors suitable for this content are thus in great demand to achieve intelligent analysis and processing of UGC videos. However, previous video quality models are either incapable or inefficient for predicting the quality of complex, diverse UGC videos in practical applications. Here we introduce an effective and efficient video quality model for UGC content, which we dub the Rapid and Accurate Video Quality Evaluator (RAPIQUE), which we show performs comparably to state-of-the-art models but with orders-of-magnitude faster runtime. Our experimental results on recent large-scale UGC video quality databases show that RAPIQUE delivers top performances on all datasets at a considerably lower computational expense. An implementation of RAPIQUE is online: https://github.com/vztu/RAPIQUE.

Journal ArticleDOI
TL;DR: The proposed method of estimating the contrast masking threshold on natural scene patches, using texture cues imparted by steerable filter responses, is able to outperform existing visual masking models in terms of estimation performance while being less computationally expensive than those models.
Abstract: A fast and accurate assessment of visual masking effects is desirable while encoding in order to utilize such effects to improve the quality of compressed videos through an adaptive quantization (AQ) scheme. Here, we propose a method of estimating the contrast masking threshold on natural scene patches, using texture cues imparted by steerable filter responses. We then employ the estimated thresholds to perform AQ for AV1 encoding. Our experimental results establish that the proposed method is able to outperform existing visual masking models in terms of estimation performance while being less computationally expensive than those models, and is also able to improve the variance-based AQ algorithm that is currently deployed in the SVT-AV1 codec. Using the multi-scale structural similarity index measure (MS-SSIM) as the quality model, our approach achieves an average BD-rate of -1.82% using the uniform quantization scheme as anchor as compared to 5.83% obtained with the variance-based method. We note that the proposed approach produces less visible compression artifacts than the variance-based AQ approach at lower bitrates, while maintaining similar encoding complexity.
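The BD-rate figures quoted above follow the standard Bjøntegaard calculation; a compact NumPy version is sketched below with placeholder rate-quality points (MS-SSIM plays the role of the quality axis in the paper's setting):

    import numpy as np

    def bd_rate(rate_anchor, qual_anchor, rate_test, qual_test, order=3):
        """Average % bitrate difference of 'test' vs 'anchor' at equal quality (Bjontegaard)."""
        pa = np.polyfit(qual_anchor, np.log(rate_anchor), order)
        pt = np.polyfit(qual_test, np.log(rate_test), order)
        lo = max(min(qual_anchor), min(qual_test))
        hi = min(max(qual_anchor), max(qual_test))
        ia, it = np.polyint(pa), np.polyint(pt)
        avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
        avg_t = (np.polyval(it, hi) - np.polyval(it, lo)) / (hi - lo)
        return (np.exp(avg_t - avg_a) - 1.0) * 100.0

    # Toy RD points: the 'test' curve needs ~3% less bitrate at equal quality.
    rate_a = np.array([1000, 2000, 4000, 8000], dtype=float)   # kbps
    qual_a = np.array([0.90, 0.94, 0.97, 0.99])                # e.g., MS-SSIM
    rate_t = rate_a * 0.97
    qual_t = qual_a
    print(bd_rate(rate_a, qual_a, rate_t, qual_t))             # approx. -3.0 (% bitrate savings)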

Proceedings ArticleDOI
19 Sep 2021
TL;DR: This work conducts a comprehensive evaluation of leading blind VQA models and creates a new fusion-based BVQA model, which it dubs the VIDeo quality EVALuator (VIDEVAL), that effectively balances the trade-off between performance and efficiency.

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the features from VMAF and GREED are fused in order to exploit the advantages of both models, and the proposed fusion framework results in more efficient features for predicting frame rate dependent video quality.
Abstract: The popularity of streaming videos with live, high-action content has led to an increased interest in High Frame Rate (HFR) videos. In this work we address the problem of frame rate dependent Video Quality Assessment (VQA) when the videos to be compared have different frame rates and compression factors. Current VQA models such as VMAF have superior correlation with perceptual judgments when the videos to be compared have the same frame rates and contain conventional distortions such as compression and scaling. However, this framework requires an additional pre-processing step when videos with different frame rates need to be compared, which can potentially limit its overall performance. Recently, the Generalized Entropic Difference (GREED) VQA model was proposed to account for artifacts that arise due to changes in frame rate, and showed superior performance on the LIVE-YT-HFR database, which contains frame rate dependent artifacts such as judder and strobing. In this paper we propose a simple extension, where the features from VMAF and GREED are fused in order to exploit the advantages of both models. We show through various experiments that the proposed fusion framework results in more efficient features for predicting frame rate dependent video quality. We also evaluate the fused feature set on standard non-HFR VQA databases and obtain superior performance than both GREED and VMAF, indicating the combined feature set captures complementary perceptual quality information.

Proceedings ArticleDOI
19 Sep 2021
TL;DR: A No-Reference (NR, or blind) method called “Space-Variant BRISQUE (SV-BRISQUE),” based on a new space-variant natural scene statistics model, which achieves state-of-the-art (SOTA) performance with a correlation of 0.90 against human subjective judgments.

Proceedings ArticleDOI
Yize Jin, Liang Zhao, Xin Zhao, Shan Liu, Alan C. Bovik
06 Jun 2021
TL;DR: In this article, two methods are proposed to further reduce the signaling cost of delta angles: cross-component delta angle coding and context-adaptive delta angle coding, whereby the cross-component and spatial correlation of the delta angles are explored, respectively.
Abstract: In AOMedia Video 1 (AV1), directional intra prediction modes are applied to model local texture patterns that present certain directionality. Each intra prediction direction is represented with a nominal mode index and a delta angle. The delta angle is entropy coded using shared context between luma and chroma, and the context is derived using the associated nominal mode. In this paper, two methods are proposed to further reduce the signaling cost of delta angles: cross-component delta angle coding, and context-adaptive delta angle coding, whereby the cross-component and spatial correlation of the delta angles are explored, respectively. The proposed methods were implemented on top of a recent version of libaom. Experimental results show that the proposed cross-component delta angle coding achieved average 0.4% BD-rate reduction with 4% encoding time saving over all intra configurations. By combining both methods, an average 1.2% BD-rate reduction is achieved.


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, a downsampling network architecture that progressively reconstructs residuals at different scales is proposed, which combines an upsampling sub-network and a downsampling sub-network, both with integer scale factors.
Abstract: In many image and video processing applications, the ability to resize by a fractional factor, such as from 1080p to 720p, is essential. However, conventional CNN layers can only be used to alter the resolution of their inputs with integer scale factors. In this paper, we propose a downsampling network architecture that progressively reconstructs residuals at different scales. In particular, the aforementioned problem is solved by combining an upsampling sub-network and a downsampling sub-network, both with integer scale factors. As an application, we apply the proposed downsampling network to an adaptive bitrate video streaming scenario. We extensively evaluate with different video codecs and upsampling algorithms to show the generality of our model. Our experimental results show that improvements in coding efficiency over conventional Lanczos downsampling and state-of-the-art methods are attained, as measured by different perceptual video quality models on large-resolution test videos.
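The integer-factor composition trick (e.g., obtaining a 2/3 resizing factor for 1080p-to-720p by upsampling by 2 and then downsampling by 3) can be sketched in PyTorch; the layers below are simple placeholders, not the paper's residual reconstruction architecture:

    import torch
    import torch.nn as nn

    class FractionalDownsampler(nn.Module):
        """Resize by a rational factor up/down by composing integer-factor sub-networks."""
        def __init__(self, up=2, down=3, channels=3):
            super().__init__()
            # Upsampling sub-network: integer-factor upsampling followed by a learned conv.
            self.upsample = nn.Sequential(
                nn.Upsample(scale_factor=up, mode='nearest'),
                nn.Conv2d(channels, channels, 3, padding=1))
            # Downsampling sub-network: learned strided convolution with integer stride.
            self.downsample = nn.Conv2d(channels, channels, 3, stride=down, padding=1)
        def forward(self, x):
            return self.downsample(self.upsample(x))

    x = torch.rand(1, 3, 1080, 1920)          # a 1080p frame
    y = FractionalDownsampler(up=2, down=3)(x)
    print(y.shape)                            # torch.Size([1, 3, 720, 1280])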

Proceedings ArticleDOI
12 Jun 2021
TL;DR: In this article, the authors proposed a full reference (FR) foveated image quality assessment algorithm, which employs the natural scene statistics of bandpass responses by applying differences of local entropies weighted by a foveation-based error sensitivity function.
Abstract: Virtual Reality is regaining attention due to recent advancements in hardware technology. Immersive images / videos are becoming widely adopted to carry omnidirectional visual information. However, due to the requirements for higher spatial and temporal resolution of real video data, immersive videos require significantly larger bandwidth consumption. To reduce stresses on bandwidth, foveated video compression is regaining popularity, whereby the space-variant spatial resolution of the retina is exploited. Towards advancing the progress of foveated video compression, we propose a full reference (FR) foveated image quality assessment algorithm, which we call foveated entropic differencing (FED), which employs the natural scene statistics of bandpass responses by applying differences of local entropies weighted by a foveation-based error sensitivity function. We evaluate the proposed algorithm by measuring the correlations of the predictions that FED makes against human judgements on the newly created 2D and 3D LIVE-FBT-FCVR databases for Virtual Reality (VR). The proposed algorithm yields state-of-the-art performance compared with other existing full reference algorithms. Software for FED has been made available at: http://live.ece.utexas.edu/research/Quality/FED.zip
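A heavily simplified sketch of foveation-weighted entropic differencing, using block-wise histogram entropies in place of the paper's statistical model and a hypothetical Gaussian eccentricity weighting:

    import numpy as np

    def block_entropy(img, block=16, bins=32):
        """Shannon entropy of pixel values in each non-overlapping block."""
        h, w = img.shape[0] // block, img.shape[1] // block
        ent = np.zeros((h, w))
        for i in range(h):
            for j in range(w):
                patch = img[i*block:(i+1)*block, j*block:(j+1)*block]
                p, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
                p = p / p.sum()
                ent[i, j] = -np.sum(p[p > 0] * np.log2(p[p > 0]))
        return ent

    def foveation_weights(shape, fixation, sigma_blocks=6.0):
        """Hypothetical eccentricity weighting: highest at the fixation point, decaying outward."""
        yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
        d2 = (yy - fixation[0]) ** 2 + (xx - fixation[1]) ** 2
        return np.exp(-d2 / (2 * sigma_blocks ** 2))

    def fed_like_score(ref, dis, fixation):
        e_ref, e_dis = block_entropy(ref), block_entropy(dis)
        w = foveation_weights(e_ref.shape, fixation)
        return np.sum(w * np.abs(e_ref - e_dis)) / np.sum(w)

    ref = np.random.rand(256, 256)
    dis = np.clip(ref + 0.1 * np.random.rand(256, 256), 0, 1)
    print(fed_like_score(ref, dis, fixation=(8, 8)))   # fixation given in block coordinates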

Posted Content
TL;DR: In this article, a video quality database called LIVE Livestream Database is proposed for live streaming VQA research, which includes 315 videos of 45 contents impaired by 6 types of distortions.
Abstract: Video live streaming is gaining prevalence among video streaming services, especially for the delivery of popular sporting events. Many objective Video Quality Assessment (VQA) models have been developed to predict the perceptual quality of videos. Appropriate databases that exemplify the distortions encountered in live streaming videos are important to designing and learning objective VQA models. Towards making progress in this direction, we built a video quality database specifically designed for live streaming VQA research. The new video database is called the Laboratory for Image and Video Engineering (LIVE) Live stream Database. The LIVE Livestream Database includes 315 videos of 45 contents impaired by 6 types of distortions. We also performed a subjective quality study using the new database, whereby more than 12,000 human opinions were gathered from 40 subjects. We demonstrate the usefulness of the new resource by performing a holistic evaluation of the performance of current state-of-the-art (SOTA) VQA models. The LIVE Livestream database is being made publicly available for these purposes at this https URL.

Posted Content
TL;DR: In this paper, the Rapid and Accurate Video Quality Evaluator (RAPIQUE) combines and leverages the advantages of both quality-aware scene statistics features and semantics-aware deep convolutional features, allowing the first general and efficient spatial and temporal (space-time) bandpass statistics model for video quality modeling.
Abstract: Blind or no-reference video quality assessment of user-generated content (UGC) has become a trending, challenging, unsolved problem. Accurate and efficient video quality predictors suitable for this content are thus in great demand to achieve more intelligent analysis and processing of UGC videos. Previous studies have shown that natural scene statistics and deep learning features are both sufficient to capture spatial distortions, which contribute to a significant aspect of UGC video quality issues. However, these models are either incapable or inefficient for predicting the quality of complex and diverse UGC videos in practical applications. Here we introduce an effective and efficient video quality model for UGC content, which we dub the Rapid and Accurate Video Quality Evaluator (RAPIQUE), which we show performs comparably to state-of-the-art (SOTA) models but with orders-of-magnitude faster runtime. RAPIQUE combines and leverages the advantages of both quality-aware scene statistics features and semantics-aware deep convolutional features, allowing us to design the first general and efficient spatial and temporal (space-time) bandpass statistics model for video quality modeling. Our experimental results on recent large-scale UGC video quality databases show that RAPIQUE delivers top performances on all the datasets at a considerably lower computational expense. We hope this work promotes and inspires further efforts towards practical modeling of video quality problems for potential real-time and low-latency applications. To promote public usage, an implementation of RAPIQUE has been made freely available online: \url{this https URL}.