
Showing papers by "Alan C. Bovik" published in 2018


Journal ArticleDOI
TL;DR: A new database comprising a total of 208 videos that model six common in-capture distortions of digital videos is presented; several top-performing no-reference IQA and VQA algorithms are evaluated on it, studying how real-world in-capture distortions challenge both human viewers and automatic perceptual quality prediction models.
Abstract: Digital videos often contain visual distortions that are introduced by the camera’s hardware or processing software during the capture process. These distortions often detract from a viewer’s quality of experience. Understanding how human observers perceive the visual quality of digital videos is of great importance to camera designers. Thus, the development of automatic objective methods that accurately quantify the impact of visual distortions on perception has greatly accelerated. Video quality algorithm design and verification require realistic databases of distorted videos and human judgments of them. However, most current publicly available video quality databases have been created under highly controlled conditions using graded, simulated, and post-capture distortions (such as jitter and compression artifacts) on high-quality videos. The commercial plethora of hand-held mobile video capture devices produces videos often afflicted by a variety of complex distortions generated during the capturing process. These in-capture distortions are not well-modeled by the synthetic, post-capture distortions found in existing VQA databases. Toward overcoming this limitation, we designed and created a new database that we call the LIVE-Qualcomm mobile in-capture video quality database, comprising a total of 208 videos, which model six common in-capture distortions. We also conducted a subjective quality assessment study using this database, in which each video was assessed by 39 unique subjects. Furthermore, we evaluated several top-performing no-reference IQA and VQA algorithms on the new database and studied how real-world in-capture distortions challenge both human viewers and automatic perceptual quality prediction models. The new database is freely available at: http://live.ece.utexas.edu/research/incaptureDatabase/index.html

120 citations


Journal ArticleDOI
TL;DR: This paper constructs a large-scale video quality assessment database containing 585 videos of unique content, captured by a large number of users, with wide ranges of levels of complex, authentic distortions, and demonstrates the value of the new resource, called the LIVE Video Quality Challenge Database (LIVE-VQC), by comparing leading NR video quality predictors on it.
Abstract: The great variations of videographic skills, camera designs, compression and processing protocols, and displays lead to an enormous variety of video impairments. Current no-reference (NR) video quality models are unable to handle this diversity of distortions. This is true in part because available video quality assessment databases contain very limited content at fixed resolutions, were captured using a small number of camera devices by a few videographers, and have been subjected to a modest number of distortions. As such, these databases fail to adequately represent real world videos, which contain very different kinds of content obtained under highly diverse imaging conditions and are subject to authentic, often commingled distortions that are impossible to simulate. As a result, NR video quality predictors tested on real-world video data often perform poorly. Towards advancing NR video quality prediction, we constructed a large-scale video quality assessment database containing 585 videos of unique content, captured by a large number of users, with wide ranges of levels of complex, authentic distortions. We collected a large number of subjective video quality scores via crowdsourcing. A total of 4776 unique participants took part in the study, yielding more than 205000 opinion scores, resulting in an average of 240 recorded human opinions per video. We demonstrate the value of the new resource, which we call the LIVE Video Quality Challenge Database (LIVE-VQC), by conducting a comparison of leading NR video quality predictors on it. This study is the largest video quality assessment study ever conducted along several key dimensions: number of unique contents, capture devices, distortion types and combinations of distortions, study participants, and recorded subjective scores. The database is available for download at: this http URL

97 citations


Journal ArticleDOI
TL;DR: A variety of recurrent dynamic neural networks are proposed that conduct continuous-time subjective QoE prediction on video streams impaired by both compression artifacts and rebuffering events; ways of aggregating different models into a forecasting ensemble that delivers improved results with reduced forecasting variance are also evaluated.
Abstract: Streaming video services represent a very large fraction of global bandwidth consumption. Due to the exploding demands of mobile video streaming services, coupled with limited bandwidth availability, video streams are often transmitted through unreliable, low-bandwidth networks. This unavoidably leads to two types of major streaming-related impairments: compression artifacts and/or rebuffering events. In streaming video applications, the end-user is a human observer; hence being able to predict the subjective Quality of Experience (QoE) associated with streamed videos could lead to the creation of perceptually optimized resource allocation strategies driving higher quality video streaming services. We propose a variety of recurrent dynamic neural networks that conduct continuous-time subjective QoE prediction. By formulating the problem as one of time-series forecasting, we train a variety of recurrent neural networks and non-linear autoregressive models to predict QoE using several recently developed subjective QoE databases. These models combine multiple, diverse neural network inputs, such as predicted video quality scores, rebuffering measurements, and data related to memory and its effects on human behavioral responses, using them to predict QoE on video streams impaired by both compression artifacts and rebuffering events. Instead of finding a single time-series prediction model, we propose and evaluate ways of aggregating different models into a forecasting ensemble that delivers improved results with reduced forecasting variance. We also deploy appropriate new evaluation metrics for comparing time-series predictions in streaming applications. Our experimental results demonstrate improved prediction performance that approaches human performance. An implementation of this work can be found at https://github.com/christosbampis/NARX_QoE_release .
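Illustrative sketch (not from the paper): one simple way to aggregate several models' continuous-time QoE forecasts into an ensemble is inverse-variance weighting, which down-weights noisier forecasters. The per-model predictions and residual variances below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical per-model continuous-time QoE predictions (n_models x n_timesteps);
# each row might come from a different recurrent or autoregressive forecaster.
preds = np.array([
    [72.1, 70.4, 55.2, 58.9],
    [69.8, 71.0, 57.5, 60.3],
    [74.0, 69.2, 53.8, 57.1],
])

# Illustrative residual variances measured on held-out data; noisier models
# receive smaller weights.
resid_var = np.array([4.0, 2.5, 6.0])
weights = (1.0 / resid_var) / np.sum(1.0 / resid_var)

# Weighted average per timestep: when model errors are not fully correlated,
# the combined forecast has lower variance than the individual forecasts.
ensemble = weights @ preds
print(ensemble)
```

Inverse-variance weighting is only one plausible aggregation rule; the paper evaluates several ways of combining forecasts into an ensemble.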

63 citations


Posted Content
TL;DR: The LIVE-NFLX-II database, a highly realistic database containing subjective QoE responses to various design dimensions, such as bitrate adaptation algorithms, network conditions, and video content, is designed; it builds on recent advancements in content-adaptive encoding.
Abstract: Measuring Quality of Experience (QoE) and integrating these measurements into video streaming algorithms is a multi-faceted problem that fundamentally requires the design of comprehensive subjective QoE databases and metrics. To achieve this goal, we have recently designed the LIVE-NFLX-II database, a highly-realistic database which contains subjective QoE responses to various design dimensions, such as bitrate adaptation algorithms, network conditions and video content. Our database builds on recent advancements in content-adaptive encoding and incorporates actual network traces to capture realistic network variations on the client device. Using our database, we study the effects of multiple streaming dimensions on user experience and evaluate video quality and quality of experience models. We believe that the tools introduced here will help inspire further progress on the development of perceptually-optimized client adaptation and video streaming strategies. The database is publicly available at this http URL.

60 citations


Proceedings ArticleDOI
06 Oct 2018
TL;DR: A probabilistic quality representation (PQR) is proposed, together with a more robust loss function for training deep BIQA models; the approach is shown not only to speed up the convergence of deep model training, but also to greatly improve quality prediction accuracy relative to scalar quality score regression methods under the same setting.
Abstract: Most existing blind image quality assessment (BIQA) methods learn a regression model to predict scalar quality scores. Such a scheme ignores the fact that an image will receive divergent subjective scores from different subjects, which cannot be adequately represented by a single scalar number. This is particularly true on complex, real-world distorted images. However, the more informative score distributions are unavailable in existing image quality assessment (IQA) databases and can be potentially noisy when a limited number of opinions is collected on each image. This paper proposes a probabilistic quality representation (PQR) and employs a more robust loss function to train deep BIQA models. Using a very straightforward implementation, the proposed method is shown not only to speed up the convergence of deep model training, but also to greatly improve the quality prediction accuracy relative to scalar quality score regression methods under the same setting. The source code is available at https://github.com/HuiZeng/BIQA_Toolbox.
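Illustrative sketch (not the paper's exact construction): the general idea of a probabilistic quality representation is to convert a scalar MOS and a spread estimate into a soft label over discrete quality anchors. The anchor grid and Gaussian kernel below are assumptions chosen for illustration.

```python
import numpy as np

def mos_to_pqr(mos, sigma, anchors=np.linspace(1.0, 5.0, 5)):
    """Turn a scalar MOS and a spread estimate into a probability vector
    over discrete quality anchors (illustrative construction only)."""
    logits = -((anchors - mos) ** 2) / (2.0 * sigma ** 2)
    p = np.exp(logits - logits.max())  # numerically stable softmax
    return p / p.sum()

# A deep BIQA model can then be trained against these soft labels with a
# robust loss (e.g., Huber) instead of regressing the scalar MOS directly.
print(mos_to_pqr(3.4, 0.8))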

59 citations


Journal ArticleDOI
TL;DR: A QoE evaluator that accounts for interactions between stalling events, analyzes the spatial and temporal content of a video, predicts the perceptual video quality, models the state of the client-side data buffer, and consequently predicts continuous-time quality scores that agree quite well with human opinion scores is created.
Abstract: Over-the-top adaptive video streaming services are frequently impacted by fluctuating network conditions that can lead to rebuffering events (stalling events) and sudden bitrate changes. These events visually impact video consumers' quality of experience (QoE) and can lead to consumer churn. The development of models that can accurately predict viewers' instantaneous subjective QoE under such volatile network conditions could potentially enable the more efficient design of quality-control protocols for media-driven services, such as YouTube, Amazon, Netflix, and so on. However, most existing models only predict a single overall QoE score on a given video and are based on simple global video features, without accounting for relevant aspects of human perception and behavior. We have created a QoE evaluator, called the time-varying QoE Indexer, that accounts for interactions between stalling events, analyzes the spatial and temporal content of a video, predicts the perceptual video quality, models the state of the client-side data buffer, and consequently predicts continuous-time quality scores that agree quite well with human opinion scores. The new QoE predictor also embeds the impact of relevant human cognitive factors, such as memory and recency, and their complex interactions with the video content being viewed. We evaluated the proposed model on three different video databases and attained standout QoE prediction performance.

57 citations


Journal ArticleDOI
TL;DR: A model that expresses the joint impact of spatial resolution and JPEG compression quality factor on immersive image quality is developed, achieving high Pearson and Spearman correlations against subjective quality judgments.
Abstract: We develop a model that expresses the joint impact of spatial resolution $s$ and JPEG compression quality factor $q^{f}$ on immersive image quality. The model is expressed as the product of optimized exponential functions of these factors. The model is tested on a subjective database of immersive image contents rendered on a head mounted display. High Pearson correlation and Spearman correlation (>0.95) and small relative root mean squared error (<5.6%) are achieved between the model predictions and the subjective quality judgements. The immersive ground-truth images along with the rest of the database are made available for future research and comparisons.
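The abstract specifies only that the model is a product of optimized exponential functions of $s$ and $q^{f}$; one plausible instantiation of such a form (purely illustrative, with hypothetical parameters $Q_{\max}$, $\alpha$, $\beta$, not the paper's fitted model) would be:

```latex
Q(s, q^{f}) \;=\; Q_{\max}\,\bigl(1 - e^{-\alpha\, s / s_{\max}}\bigr)\,
                  \bigl(1 - e^{-\beta\, q^{f} / q^{f}_{\max}}\bigr)
```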

51 citations


Journal ArticleDOI
TL;DR: A full reference VQA model that accounts for temporal visual masking of local flicker, called Flicker Sensitive-MOtion-based Video Integrity Evaluation (FS-MOVIE), augments the well-known MOVIE Index by combining motion tuned video integrity features with a new perceptual flicker visibility/masking index.
Abstract: An important element of the design of video quality assessment (VQA) models that remains poorly understood is the effect of temporal visual masking on the visibility of temporal distortions. The visibility of temporal distortions like local flicker can be strongly reduced by motion. Based on a recently discovered visual change silencing illusion, we have developed a full reference VQA model that accounts for temporal visual masking of local flicker. The proposed model, called Flicker Sensitive-MOtion-based Video Integrity Evaluation (FS-MOVIE), augments the well-known MOVIE Index by combining motion tuned video integrity features with a new perceptual flicker visibility/masking index. FS-MOVIE captures the separated spectral signatures caused by local flicker distortions, by using a model of the responses of neurons in primary visual cortex to video flicker, an energy model of motion perception, and a divisive normalization stage. FS-MOVIE predicts the perceptual suppression of local flicker by the presence of motion and evaluates local flicker as it affects video quality. Experimental results show that FS-MOVIE significantly improves VQA performance against its predecessor and is highly competitive with top performing VQA algorithms when tested on the LIVE, IVP, EPFL, and VQEGHD5 VQA databases.

34 citations


Posted Content
TL;DR: In this article, the authors propose two improvements to the VMAF framework: SpatioTemporal VMAF and Ensemble VMAF. Both algorithms exploit efficient temporal video features which are fed into a single or multiple regression models.
Abstract: Perceptual video quality assessment models are either frame-based or video-based, i.e., they apply spatiotemporal filtering or motion estimation to capture temporal video distortions. Despite their good performance on video quality databases, video-based approaches are time-consuming and harder to efficiently deploy. To balance between high performance and computational efficiency, Netflix developed the Video Multi-method Assessment Fusion (VMAF) framework, which integrates multiple quality-aware features to predict video quality. Nevertheless, this fusion framework does not fully exploit temporal video quality measurements which are relevant to temporal video distortions. To this end, we propose two improvements to the VMAF framework: SpatioTemporal VMAF and Ensemble VMAF. Both algorithms exploit efficient temporal video features which are fed into a single or multiple regression models. To train our models, we designed a large subjective database and evaluated the proposed models against state-of-the-art approaches. The compared algorithms will be made available as part of the open source package in this https URL.

31 citations


Journal ArticleDOI
TL;DR: A feature-based approach is developed that combines a number of QoE-related features, including perceptually-relevant quality features, stalling-aware features, and memory-driven features, to make QoE predictions; it provides improved performance over state-of-the-art video quality metrics while generalizing well on a different dataset.
Abstract: Mobile streaming video data accounts for a large and increasing percentage of wireless network traffic. The available bandwidths of modern wireless networks are often unstable, leading to difficulties in delivering smooth, high-quality video. Streaming service providers such as Netflix and YouTube attempt to adapt their systems to adjust in response to these bandwidth limitations by changing the video bitrate or, failing that, allowing playback interruptions (stalling). Being able to predict end users’ quality of experience (QoE) resulting from these adjustments could lead to perceptually-driven network resource allocation strategies that would deliver streaming content of higher quality to clients, while being cost effective for providers. To this end, a number of QoE predictors have been developed, but they do not always capture the interplay between video quality and stalling. Towards more effectively predicting user QoE, we have developed a QoE prediction model called Video Assessment of TemporaL Artifacts and Stalls (Video ATLAS), which is a feature-based approach that combines a number of QoE-related features, including perceptually-relevant quality features, stalling-aware features and memory-driven features to make QoE predictions. We evaluated Video ATLAS on the recently designed LIVE-Netflix Video QoE Database which consists of practical playout patterns, where the videos are afflicted by both quality changes and stalling events, and found that it provides improved performance over state-of-the-art video quality metrics while generalizing well on a different dataset. The proposed algorithm is made publicly available at http://live.ece.utexas.edu/research/VideoATLAS/vatlas_index.html .
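Illustrative sketch of the feature-based regression idea: per-video QoE-related features are fed to a trained regressor. The feature names and placeholder data below are assumptions, and an RBF support vector regressor is just one of the regressor choices such an approach can use.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Placeholder per-video features (illustrative names only): mean predicted
# quality, total stall duration, number of stalls, time since last stall,
# and a memory/recency term.
X = rng.random((100, 5))
y = rng.random(100) * 100.0  # placeholder overall QoE scores

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X, y)
print(model.predict(X[:3]))
```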

28 citations


Journal ArticleDOI
TL;DR: A deep-learning-based river network extraction model that learns the characteristics of rivers from synthetic data and generalizes them to natural data is created; it produces maps of river centerlines, which have the potential to be quite useful for analyzing the properties of river networks.
Abstract: We have created a deep-learning-based river network extraction model, called DeepRiver, that learns the characteristics of rivers from synthetic data and generalizes them to natural data. To train this model, we created a very large database of exemplary synthetic local channel segments, including channel intersections. Our model uses a special loss function that automatically shifts the focus to the hardest-to-learn parts of an input image. This adaptive loss function makes it possible to learn to detect river centerlines, including the centerlines at junctions and bifurcations. DeepRiver learns to separate between rivers and oceans, and therefore, it is able to reliably extract rivers in coastal regions. The model produces maps of river centerlines, which have the potential to be quite useful for analyzing the properties of river networks.
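The abstract does not give the form of the adaptive loss; a focal-style weighting is one standard way to shift training emphasis toward the hardest-to-learn pixels, shown here purely as an assumed illustration:

```latex
\mathcal{L} \;=\; -\frac{1}{N} \sum_{i=1}^{N} \bigl(1 - p_i\bigr)^{\gamma}\,\log p_i
```

Here $p_i$ denotes the predicted probability of pixel $i$'s true class (centerline vs. background), and $\gamma > 0$ increases the relative weight of misclassified pixels; the paper's exact loss may differ.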

Proceedings ArticleDOI
01 Oct 2018
TL;DR: This work carefully designed a framework in Amazon Mechanical Turk (AMT) to address the many technical issues faced, and verified that the framework provided results highly consistent with those obtained in a lab environment under controlled conditions.
Abstract: Most of today's video quality assessment (VQA) databases contain very limited content and distortion diversities and fail to adequately represent real world video impairments. This is in part because conducting subjective studies in the lab is a slow, inefficient, and expensive process. Crowdsourcing quality scores is a more scalable solution. However, given that viewers operate under innumerable viewing conditions (including display resolutions, viewing distances, internet connection speeds) and because they are not closely supervised, multiple technical challenges arise. We carefully designed a framework in Amazon Mechanical Turk (AMT) to address the many technical issues that are faced. We launched the largest available VQA study, collecting more than 205000 opinion scores provided by more than 4700 unique participants. We have verified that our framework provided us with results that are highly consistent with the ones obtained in a lab environment under controlled conditions.

Proceedings ArticleDOI
08 Apr 2018
TL;DR: The scale-invariant properties of divisively normalized bandpass responses of natural images in the DCT-filtered domain are investigated, and it is found that the variance of the normalized DCT-filtered responses of a pristine natural image is scale invariant.
Abstract: We investigate the scale-invariant properties of divisively normalized bandpass responses of natural images in the DCT-filtered domain. We found that the variance of the normalized DCT-filtered responses of a pristine natural image is scale invariant. This scale invariance property does not hold in the presence of noise, and thus it can be used to devise an efficient blind image noise estimator. The proposed noise estimation approach outperforms other statistics-based methods, especially for higher noise levels, and competes well with patch-based and filter-based approaches. Moreover, the new variance estimation approach is also effective in the case of non-Gaussian noise. The research code of the proposed algorithm can be found at https://github.com/guptapraful/NoiseEstimation.
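A minimal sketch of the scale-invariance property (assumptions: 8x8 block DCTs, a simple per-block divisive normalization, and crude two-fold downsampling; this is not the paper's full estimator). On a pristine natural photograph loaded into `img`, the two printed variances should be close, while added noise drives them apart.

```python
import numpy as np
from scipy.fft import dctn

def normalized_dct_variance(img, block=8, eps=1e-6):
    """Variance of divisively normalized block-DCT AC coefficients."""
    h = (img.shape[0] // block) * block
    w = (img.shape[1] // block) * block
    chunks = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            c = dctn(img[i:i + block, j:j + block], norm="ortho").ravel()[1:]  # drop DC
            rms = np.sqrt(np.mean(c * c))  # per-block divisive normalizer
            chunks.append(c / (rms + eps))
    return np.var(np.concatenate(chunks))

img = np.random.default_rng(0).random((256, 256))  # placeholder; use a real photo
print(normalized_dct_variance(img))            # scale 1
print(normalized_dct_variance(img[::2, ::2]))  # crude half-scale version
```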

Journal ArticleDOI
TL;DR: It is shown that the GGSM model can lead to improved performance in distortion-related applications, while providing a more principled approach to the statistical processing of distorted image signals.
Abstract: We develop a Generalized Gaussian scale mixture (GGSM) model of the wavelet coefficients of natural and distorted images. The GGSM model, which is more general than and which subsumes the Gaussian scale mixture (GSM) model, is shown to be a better representation of the statistics of the wavelet coefficients of both natural as well as distorted images. We demonstrate the utility of the model by applying it to various image processing applications, including blind distortion identification and no reference image quality assessment (NR-IQA). Similar to the GSM model, the GGSM model is useful for motivating the use of local divisive energy normalization, especially when the wavelet coefficients are computed on distorted pictures. We show that the GGSM model can lead to improved performance in distortion-related applications, while providing a more principled approach to the statistical processing of distorted image signals. The software release of a GGSM-based NR-IQA approach called DIIVINE-GGSM is available online at http://live.ece.utexas.edu/research/quality/diivine-ggsm.zip for further experimentation.
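For context, the standard univariate generalized Gaussian density and the GSM construction that the GGSM builds on take the forms

```latex
f(x;\alpha,\beta) \;=\; \frac{\beta}{2\alpha\,\Gamma(1/\beta)}
   \exp\!\Bigl(-\bigl(|x|/\alpha\bigr)^{\beta}\Bigr),
\qquad
\mathbf{x}_{\mathrm{GSM}} \;=\; \sqrt{z}\,\mathbf{u},\quad
\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}),\ z > 0
```

In the GGSM, the Gaussian component is generalized so that the GSM is recovered as a special case, consistent with the abstract's statement that the GGSM subsumes the GSM.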

Journal ArticleDOI
TL;DR: The DeepVDP uses a convolutional neural network to learn features that are highly predictive of experienced visual discomfort, and achieves state-of-the-art performance compared with previous VDP algorithms.
Abstract: Most prior approaches to the problem of stereoscopic 3D (S3D) visual discomfort prediction (VDP) have focused on the extraction of perceptually meaningful handcrafted features based on models of visual perception and of natural depth statistics. Toward advancing performance on this problem, we have developed a deep learning-based VDP model named deep visual discomfort predictor (DeepVDP). The DeepVDP uses a convolutional neural network (CNN) to learn features that are highly predictive of experienced visual discomfort. Since a large amount of reference data is needed to train a CNN, we develop a systematic way of dividing the S3D image into local regions defined as patches and model a patch-based CNN using two sequential training steps. Since it is very difficult to obtain human opinions on each patch, a proxy ground-truth label generated by an existing S3D visual discomfort prediction algorithm called 3D-VDP is instead assigned to each patch. These proxy ground-truth labels are used to conduct the first stage of training the CNN. In the second stage, the automatically learned local abstractions are aggregated into global features via a feature aggregation layer. The learned features are iteratively updated via supervised learning on subjective 3D discomfort scores, which serve as ground-truth labels on each S3D image. The patch-based CNN model that has been pretrained on proxy ground-truth labels is subsequently retrained on true global subjective scores. The global S3D visual discomfort scores predicted by the trained DeepVDP model achieve state-of-the-art performance compared with previous VDP algorithms.

Journal ArticleDOI
TL;DR: This work develops a closed form bivariate spatial correlation model of bandpass and normalized image samples that completes an existing 2D joint generalized Gaussian distribution model of adjacent bandpass pixels.
Abstract: Previous work on natural scene statistics (NSS)-based image models has focused primarily on characterizing the univariate bandpass statistics of single pixels. These models have proven to be powerful tools driving a variety of computer vision and image/video processing applications, including depth estimation, image quality assessment, and image denoising, among others. Multivariate NSS models descriptive of the joint distributions of spatially separated bandpass image samples have, however, received relatively little attention. Here, we develop a closed form bivariate spatial correlation model of bandpass and normalized image samples that completes an existing 2D joint generalized Gaussian distribution model of adjacent bandpass pixels. Our model is built using a set of diverse, high-quality naturalistic photographs, and as a control, we study the model properties on white noise. We also study the way the model fits are affected when the images are modified by common distortions.

Book ChapterDOI
01 Jan 2018
TL;DR: In this chapter, a systematic framework for optimization with respect to a perceptual quality assessment algorithm is presented, with the Structural SIMilarity (SSIM) index as the representative image quality assessment model studied.
Abstract: The fact that multimedia services have become the major driver for next generation wireless networks underscores their technological and economic impact. A vast majority of these multimedia services are consumer-centric and therefore must guarantee a certain level of perceptual quality. Given the massive volumes of image and video data in question, it is only natural to adopt automatic quality prediction and optimization tools. The past decade has seen the invention of several excellent automatic quality prediction tools for natural images and videos. While these tools predict perceptual quality scores accurately, they do not necessarily lend themselves to standard optimization techniques. In this chapter, a systematic framework for optimization with respect to a perceptual quality assessment algorithm is presented. The Structural SIMilarity (SSIM) index, which has found vast commercial acceptance owing to its high performance and low complexity, is the representative image quality assessment model that is studied. Specifically, a detailed exposition of the mathematical properties of the SSIM index is presented first, followed by a discussion on the design of linear and non-linear SSIM-optimal image restoration algorithms.
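For reference, the SSIM index studied in this chapter has the standard form

```latex
\mathrm{SSIM}(x, y) \;=\;
  \frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)}
```

where $\mu$, $\sigma^2$, and $\sigma_{xy}$ are local means, variances, and the cross-covariance of image patches $x$ and $y$, and $C_1$, $C_2$ are small stabilizing constants. It is the mathematical properties of this form that underlie the SSIM-optimal restoration designs the chapter discusses.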

Journal ArticleDOI
TL;DR: This article has studied and analyzed the statistics of both pristine and distorted bandpass X-ray images, and devised an application of NSS models to an image modality classification task, whereby VL, X-ray, infrared, and millimeter-wave images can be effectively and automatically distinguished.
Abstract: In this article, we have studied and analyzed the statistics of both pristine and distorted bandpass X-ray images. In the past, we have shown that the statistics of natural, bandpass-filtered visible light (VL) pictures, commonly expressed by natural scene statistic (NSS) models, can be used to create remarkably powerful, perceptually relevant predictors of perceptual picture quality. We find that similar models can be developed that apply quite well to X-ray image data. We have also studied the potential of applying these statistical X-ray NSS models to the design of algorithms for automatic image quality prediction of X-ray images, such as might occur in security, medicine, and material inspection applications. As a demonstration of the discrimination power of these models, we devised an application of NSS models to an image modality classification task, whereby VL, X-ray, infrared, and millimeter-wave images can be effectively and automatically distinguished. Our study is conducted on a dataset of X-ray images made available by the National Institute of Standards and Technology.

Journal ArticleDOI
TL;DR: It is found that IQA models based on scene statistics can successfully predict the perceptual quality of synthetic scenes afflicted by post-acquisition distortions, including those arising from compression and transmission.
Abstract: Measuring visual quality, as perceived by human observers, is becoming increasingly important in a large number of applications where humans are the ultimate consumers of visual information. Many natural image databases have been developed that contain human subjective ratings of the images. Subjective quality evaluation data is less available for synthetic images, such as those commonly encountered in graphics novels, online games or internet ads. A wide variety of powerful full-reference, reduced-reference and no-reference Image Quality Assessment (IQA) algorithms have been proposed for natural images, but their performance has not been evaluated on synthetic images. In this paper we (1) conduct a series of subjective tests on a new publicly available Embedded Signal Processing Laboratory (ESPL) Synthetic Image Database, which contains 500 distorted images (20 distorted images for each of the 25 original images) in 1920 × 1080 resolution, and (2) evaluate the performance of more than 50 publicly available IQA algorithms on the new database. The synthetic images in the database were processed by post-acquisition distortions, including those arising from compression and transmission. We collected 26,000 individual ratings from 64 human subjects which can be used to evaluate full-reference, reduced-reference, and no-reference IQA algorithm performance. We find that IQA models based on scene statistics models can successfully predict the perceptual quality of synthetic scenes. The database is available at: http://signal.ece.utexas.edu/%7Ebevans/synthetic/

Proceedings ArticleDOI
24 Jun 2018
TL;DR: Ensemble VMAF (E-VMAF) is a video quality predictor that combines two models: VMAF and predictions based on entropic differencing features calculated on video frames and frame differences; improved performance is demonstrated on various subjective video databases.
Abstract: When developing data-driven video quality assessment algorithms, the size of the available ground truth subjective data may hamper the generalization capabilities of the trained models. Nevertheless, if the application context is known a priori, leveraging data-driven approaches for video quality prediction can deliver promising results. Towards achieving high-performing video quality prediction for compression and scaling artifacts, Netflix developed the Video Multi-method Assessment Fusion (VMAF) Framework, a full-reference prediction system which uses a regression scheme to integrate multiple perception-motivated features to predict video quality. However, the current version of VMAF does not fully capture temporal video features relevant to temporal video distortions. To achieve this goal, we developed Ensemble VMAF (E-VMAF): a video quality predictor that combines two models: VMAF and predictions based on entropic differencing features calculated on video frames and frame differences. We demonstrate the improved performance of E-VMAF on various subjective video databases. The proposed model will become available as part of the open source package in https://github.com/Netflix/vmaf.

Proceedings ArticleDOI
17 Sep 2018
TL;DR: 2stepQA integrates no-reference (NR) and reference (R) measurements into the quality prediction process, and is shown to achieve standout performance compared with other IQA models.
Abstract: Full-reference and reduced-reference image quality assessment (IQA) models assume a high quality reference against which to measure perceptual quality. However, this assumption may be violated when the source image is upscaled, poorly exposed, or otherwise distorted before being compressed. Reference IQA models on a compressed but previously distorted “reference” may produce unpredictable results. Hence we propose 2stepQA, which integrates no-reference (NR) and reference (R) measurements into the quality prediction process. The NR module accounts for imperfect quality of the reference image, while the R component measures further quality from compression. A simple, efficient multiplication step fuses these into a single score. We deploy MS-SSIM as the R component and NIQE as the NR component and combine them using multiplication. We chose MS-SSIM, since it is efficient and correlates well with subjective scores. Likewise, NIQE is simple, efficient, and generic, and does not require training on subjective data. The 2stepQA approach can be generalized by combining other R and NR models. We also built a new data resource: LIVE Wild Compressed Picture Database, where authentically distorted reference images were JPEG compressed at four levels. 2stepQA is shown to achieve standout performance compared to other IQA models. The proposed approach is made publicly available at https://github.com/xiangxuyu/2stepQA.
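A minimal sketch of the two-step fusion described above, multiplying an NR measurement on the reference with an R measurement on the compressed image. The rescaling of NIQE to a [0,1] quality term is an assumption for illustration; the paper's exact normalization may differ, and the scores are placeholder inputs.

```python
import numpy as np

def two_step_quality(msssim_score, niqe_score, niqe_range=100.0):
    """Fuse an R score on the compressed image (MS-SSIM, higher = better)
    with an NR score on the reference (NIQE, lower = better) by a single
    multiplication, as in the two-step scheme."""
    # Assumed rescaling: map NIQE to a [0, 1] quality term (1 = pristine).
    reference_quality = 1.0 - np.clip(niqe_score / niqe_range, 0.0, 1.0)
    return reference_quality * msssim_score

print(two_step_quality(msssim_score=0.95, niqe_score=6.3))
```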

Posted Content
TL;DR: Experiments on a diverse set of 131 videos demonstrate that the proposed GAN-based compression engine is a promising alternative to traditional video codec approaches, achieving higher quality reconstructions at very low bitrates.
Abstract: We propose a video compression framework using conditional Generative Adversarial Networks (GANs). We rely on two encoders: one that deploys a standard video codec and another which generates low-level maps via a pipeline of down-sampling, a newly devised soft edge detector, and a novel lossless compression scheme. For decoding, we use a standard video decoder as well as a neural network based one, which is trained using a conditional GAN. Recent "deep" approaches to video compression require multiple videos to pre-train generative networks to conduct interpolation. In contrast to this prior work, our scheme trains a generative decoder on pairs of a very limited number of key frames taken from a single video and corresponding low-level maps. The trained decoder produces reconstructed frames relying on a guidance of low-level maps, without any interpolation. Experiments on a diverse set of 131 videos demonstrate that our proposed GAN-based compression engine achieves much higher quality reconstructions at very low bitrates than prevailing standard codecs such as H.264 or HEVC.

Proceedings ArticleDOI
01 Oct 2018
TL;DR: An enhanced model called SpatioTemporal VMAF (ST-VMAF) is proposed that incorporates temporal features that are easy to compute; its improved performance is demonstrated on many subjective video databases.
Abstract: Most successful perceptual video quality assessment models are either frame-based, or perform spatiotemporal filtering or motion estimation to model the temporal aspects of video distortions. While good results are obtained on video quality databases, their increased computational complexity often causes video quality engineers to instead rely on simpler image-based quality algorithms. Towards balancing demands between prediction accuracy and compute efficiency, Netflix developed the Video Multi-method Assessment Fusion (VMAF) Framework, an efficient feature-based system that combines multiple perception-based elementary image measurements to produce video quality predictions. However, the current version of VMAF only weakly captures temporal video features which are sensitive to perceptual temporal video distortions. To this end, we propose an enhanced model we call SpatioTemporal VMAF (ST-VMAF) that incorporates temporal features that are easy to compute. We demonstrate the improved performance of ST-VMAF on many subjective video databases. The proposed model will be made available as part of the open source package in https://github.com/Netflix/vmaf

Proceedings ArticleDOI
01 Apr 2018
TL;DR: This work makes the first attempt to use bivariate NSS features to build a model of no-reference image quality prediction, and shows that the bivariate model outperforms existing state of the art image quality predictors.
Abstract: The univariate statistics of bandpass-filtered images provide powerful features that drive many successful image quality assessment (IQA) algorithms. Bivariate Natural Scene Statistics (NSS), which model the joint statistics of multiple bandpass image samples also provide potentially powerful features to assess the perceptual quality of images, by capturing both image and distortion correlations. Here, we make the first attempt to use bivariate NSS features to build a model of no-reference image quality prediction. We show that our bivariate model outperforms existing state of the art image quality predictors.

Book ChapterDOI
03 Oct 2018
TL;DR: This chapter describes state-of-the-art objective quality metrics for assessing the quality of digital images, which are prolific owing to the ease with which they can be acquired, processed, stored, and transmitted.
Abstract: This chapter describes state-of-the-art objective quality metrics to assess the quality of digital images. It also describes approaches that have been shown to be competitive with bottom-up Human Visual System (HVS)-based approaches in predicting image quality. These methods additionally demonstrate advantages over bottom-up HVS-based measures in several aspects. Digital images and video are prolific in the world owing to the ease of acquisition, processing, storage and transmission. Many common image processing operations such as compression, dithering and printing affect the quality of the image. The HVS is very good at evaluating the quality of an image blindly, that is, without a reference “perfect” image to compare it against. It is, however, rather difficult to perform this task automatically using a computer. Traditional approaches to image quality assessment use a bottom-up approach, where models of the HVS are used to derive quality metrics. Bottom-up HVS-based approaches are those that combine models for different properties of the HVS in defining a quality metric.

Proceedings ArticleDOI
08 Apr 2018
TL;DR: This paper studies the univariate and bivariate NSS of luminance and other chromatic components and how they relate.
Abstract: The visual brain is optimally designed to process images from the natural environment that we perceive. Describing the natural environment statistically helps in understanding how the brain encodes those images efficiently. The Natural Scene Statistics (NSS) of the luminance component of images is the basis of several univariate and bivariate statistical models. The NSS of other colors or chromatic components have been less well-analyzed. In this paper, we study the univariate and bivariate NSS of luminance and other chromatic components and how they relate.

DOI
05 Nov 2018
TL;DR: The chapter discusses the principles of the human visual system that have been used in perceptual image enhancement algorithms, and then presents modern image enhancement models and applications grounded in the perceptual aspects of human vision.
Abstract: Enhancing image quality has become an important issue as the volume of digital images increases exponentially and the expectation of high-quality images grows insatiably. Digitized images commonly suffer from poor visual quality due to distortions, low contrast, the deficiency of lighting, defocusing, atmospheric influences such as fog, severe compression, and transmission errors. Hence, image enhancement is indispensable for better perception, interpretation, and subsequent analysis. Since humans are generally regarded as the final arbiter of the visual quality of the enhanced images, perceptual image enhancement has been of great interest. In this chapter, we visit actively evolving perceptual image enhancement research. The chapter discusses the principles of the human visual system that have been used in perceptual image enhancement algorithms and then presents modern image enhancement models and applications grounded in the perceptual aspects of human vision.

Journal ArticleDOI
TL;DR: A generalized Gaussian-based local contrast estimator is proposed as a way to implement non-linear local gain control that facilitates the accurate modeling of both pristine and distorted images.
Abstract: Many existing Natural Scene Statistics-based no reference image quality assessment (NR IQA) algorithms employ univariate parametric distributions to capture the statistical inconsistencies of bandpass distorted image coefficients. Here we propose a multivariate model of natural image coefficients expressed in the bandpass spatial domain that has the potential to capture higher-order correlations that may be induced by the presence of distortions. We analyze how the parameters of the multivariate model are affected by different distortion types, and we show their ability to capture distortion-sensitive image quality information. We also demonstrate the violation of Gaussianity assumptions that occur when locally estimating the energies of distorted image coefficients. Thus we propose a generalized Gaussian-based local contrast estimator as a way to implement non-linear local gain control that facilitates the accurate modeling of both pristine and distorted images. We integrate the novel approach of generalized contrast normalization with multivariate modeling of bandpass image coefficients into a holistic NR IQA model, which we refer to as multivariate generalized contrast normalization (MVGCN). We demonstrate the improved performance of MVGCN on quality relevant tasks on multiple imaging modalities, including visible light image quality prediction and task success prediction on distorted X-ray images.
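For context, a minimal sketch of the conventional divisive (local gain control) normalization that MVGCN generalizes: subtract a Gaussian-weighted local mean and divide by a Gaussian-weighted local standard deviation. The paper's generalized Gaussian-based contrast estimator is not shown here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn(image, sigma=7.0 / 6.0, C=1.0):
    """Classical divisive normalization of a grayscale image; MVGCN replaces
    the Gaussian-based local contrast estimate used here with a generalized
    Gaussian-based one."""
    image = np.asarray(image, dtype=np.float64)
    mu = gaussian_filter(image, sigma)                      # local mean
    var = gaussian_filter(image * image, sigma) - mu * mu   # local variance
    sd = np.sqrt(np.maximum(var, 0.0))                      # local contrast
    return (image - mu) / (sd + C)                          # divisive norm
```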

Posted Content
TL;DR: A new two-step image quality prediction approach is proposed which integrates both no-reference (NR) and full-reference perceptual quality measurements into the quality prediction process, achieving standout performance especially when the reference images were of low quality.
Abstract: Full-reference (FR) image quality assessment (IQA) models assume a high quality "pristine" image as a reference against which to measure perceptual image quality. In many applications, however, the assumption that the reference image is of high quality may be untrue, leading to incorrect perceptual quality predictions. To address this, we propose a new two-step image quality prediction approach which integrates both no-reference (NR) and full-reference perceptual quality measurements into the quality prediction process. The no-reference module accounts for the possibly imperfect quality of the source (reference) image, while the full-reference component measures the quality differences between the source image and its possibly further distorted version. A simple, yet very efficient, multiplication step fuses the two sources of information into a reliable objective prediction score. We evaluated our two-step approach on a recently designed subjective image database and achieved standout performance compared to full-reference approaches, especially when the reference images were of low quality. The proposed approach is made publicly available at this https URL

Book ChapterDOI
24 Sep 2018
TL;DR: A novel method for computing depth from fisheye stereo is introduced that uses an understanding of the underlying lens models and a convolutional network to predict correspondences; a synthetic database for developing and testing fisheye stereo and SV algorithms is also built.
Abstract: An integral part of driver assistance technology is surround-view (SV), a system which uses four fisheye (wide-angle) cameras on the front, right, rear, and left sides of a vehicle to completely capture the surroundings. Inherent in SV are four wide-baseline orthogonally-divergent fisheye stereo systems, from which, depth information may be extracted and used in 3D scene understanding. Traditional stereo approaches typically require fisheye distortion removal and stereo rectification for efficient correspondence matching. However, such approaches suffer from loss of data and cannot account for widely disparate appearances of objects in corresponding views. We introduce a novel method for computing depth from fisheye stereo that uses an understanding of the underlying lens models and a convolutional network to predict correspondences. We also built a synthetic database for developing and testing fisheye stereo and SV algorithms. We demonstrate the performance of our depth estimation method on this database.