
Showing papers on "Video quality published in 2020"


Journal ArticleDOI
TL;DR: This work conducts a comprehensive evaluation of leading no-reference/blind VQA (BVQA) features and models on a fixed evaluation architecture, yielding new empirical insights on both subjective video quality studies and objective VQA model design.
Abstract: Recent years have witnessed an explosion of user-generated content (UGC) videos shared and streamed over the Internet, thanks to the evolution of affordable and reliable consumer capture devices, and the tremendous popularity of social media platforms. Accordingly, there is a great need for accurate video quality assessment (VQA) models for UGC/consumer videos to monitor, control, and optimize this vast content. Blind quality prediction of in-the-wild videos is quite challenging, since the quality degradations of UGC content are unpredictable, complicated, and often commingled. Here we contribute to advancing the UGC-VQA problem by conducting a comprehensive evaluation of leading no-reference/blind VQA (BVQA) features and models on a fixed evaluation architecture, yielding new empirical insights on both subjective video quality studies and VQA model design. By employing a feature selection strategy on top of leading VQA model features, we are able to extract 60 of the 763 statistical features used by the leading models to create a new fusion-based BVQA model, which we dub the VIDeo quality EVALuator (VIDEVAL), that effectively balances the trade-off between VQA performance and efficiency. Our experimental results show that VIDEVAL achieves state-of-the-art performance at considerably lower computational cost than other leading models. Our study protocol also defines a reliable benchmark for the UGC-VQA problem, which we believe will facilitate further research on deep learning-based VQA modeling, as well as perceptually-optimized efficient UGC video processing, transcoding, and streaming. To promote reproducible research and public evaluation, an implementation of VIDEVAL has been made available online: this https URL.
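A minimal sketch of the feature-selection-plus-regression recipe this abstract describes, using scikit-learn's SelectKBest and SVR as stand-ins; the actual VIDEVAL feature set and selection procedure are in the paper and its code release, and the feature matrix and MOS labels below are placeholders.

```python
# Minimal sketch of a fusion-based BVQA pipeline in the spirit of VIDEVAL:
# select a compact subset of hand-crafted features, then regress to MOS.
# X (n_videos x n_features) and mos are placeholders for real data.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 763))      # 763 candidate statistical features (placeholder)
mos = rng.uniform(1, 5, size=200)    # subjective scores (placeholder)

model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_regression, k=60),  # keep 60 features, as in the paper
    SVR(kernel="rbf", C=10.0, gamma="scale"),
)
scores = cross_val_score(model, X, mos, cv=5, scoring="neg_mean_absolute_error")
print("CV MAE:", -scores.mean())
```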

113 citations


Journal ArticleDOI
TL;DR: The new LIVE-SJTU Audio and Video Quality Assessment (A/V-QA) Database includes 336 A/V sequences that were generated from 14 original source contents by applying 24 different A/V distortion combinations to them, and was used to validate and test all of the objective A/V quality prediction models.
Abstract: The topics of visual and audio quality assessment (QA) have been widely researched for decades, yet nearly all of this prior work has focused only on single-mode visual or audio signals. However, visual signals rarely are presented without accompanying audio, including heavy-bandwidth video streaming applications. Moreover, the distortions that may separately (or conjointly) afflict the visual and audio signals collectively shape user-perceived quality of experience (QoE). This motivated us to conduct a subjective study of audio and video (A/V) quality, which we then used to compare and develop A/V quality measurement models and algorithms. The new LIVE-SJTU Audio and Video Quality Assessment (A/V-QA) Database includes 336 A/V sequences that were generated from 14 original source contents by applying 24 different A/V distortion combinations on them. We then conducted a subjective A/V quality perception study on the database towards attaining a better understanding of how humans perceive the overall combined quality of A/V signals. We also designed four different families of objective A/V quality prediction models, using a multimodal fusion strategy. The different types of A/V quality models differ in both the unimodal audio and video quality prediction models comprising the direct signal measurements and in the way that the two perceptual signal modes are combined. The objective models are built using both existing state-of-the-art audio and video quality prediction models and some new prediction models, as well as quality-predictive features delivered by a deep neural network. The methods of fusing audio and video quality predictions that are considered include simple product combinations as well as learned mappings. Using the new subjective A/V database as a tool, we validated and tested all of the objective A/V quality prediction models. We will make the database publicly available to facilitate further research.
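A toy illustration of the two fusion families mentioned in the abstract, a simple product combination versus a learned mapping; the per-modality quality scores and labels below are placeholders, not outputs of the paper's models.

```python
# Toy illustration of fusing unimodal audio and video quality predictions:
# (a) a simple product combination, (b) a learned mapping (linear regression).
# q_audio, q_video, and mos are placeholders for real model outputs / labels.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
q_video = rng.uniform(0, 1, 300)     # video-quality predictions in [0, 1]
q_audio = rng.uniform(0, 1, 300)     # audio-quality predictions in [0, 1]
mos = 0.7 * q_video + 0.3 * q_audio + 0.05 * rng.normal(size=300)

# (a) product combination: overall quality as the product of the two modes
q_product = q_video * q_audio

# (b) learned mapping from the two unimodal scores to the subjective score
fusion = LinearRegression().fit(np.column_stack([q_video, q_audio]), mos)
q_learned = fusion.predict(np.column_stack([q_video, q_audio]))

print("product corr:", np.corrcoef(q_product, mos)[0, 1])
print("learned corr:", np.corrcoef(q_learned, mos)[0, 1])
```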

92 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: This paper proposes a fast yet effective method for compressed video quality enhancement by incorporating a novel Spatio-Temporal Deformable Fusion (STDF) scheme to aggregate temporal information and achieves the state-of-the-art performance of compressed video quality enhancement in terms of both accuracy and efficiency.
Abstract: Recent years have witnessed remarkable success of deep learning methods in quality enhancement for compressed video. To better explore temporal information, existing methods usually estimate optical flow for temporal motion compensation. However, since compressed video could be seriously distorted by various compression artifacts, the estimated optical flow tends to be inaccurate and unreliable, thereby resulting in ineffective quality enhancement. In addition, optical flow estimation for consecutive frames is generally conducted in a pairwise manner, which is computationally expensive and inefficient. In this paper, we propose a fast yet effective method for compressed video quality enhancement by incorporating a novel Spatio-Temporal Deformable Fusion (STDF) scheme to aggregate temporal information. Specifically, the proposed STDF takes a target frame along with its neighboring reference frames as input to jointly predict an offset field to deform the spatio-temporal sampling positions of convolution. As a result, complementary information from both target and reference frames can be fused within a single Spatio-Temporal Deformable Convolution (STDC) operation. Extensive experiments show that our method achieves the state-of-the-art performance of compressed video quality enhancement in terms of both accuracy and efficiency.
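A rough sketch of the core operation using torchvision's deform_conv2d: stack the target frame with its neighbors, predict a single offset field from the stack, and fuse all frames in one deformable convolution. The tensor shapes and the offset-prediction net are simplified placeholders, not the authors' architecture.

```python
# Rough sketch of spatio-temporal deformable fusion: a small CNN predicts offsets
# from the stacked target + reference frames, and a single deformable convolution
# fuses all frames. Shapes and layer sizes are illustrative, not the paper's.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

T, C, H, W, K = 5, 1, 64, 64, 3          # 5 grayscale frames, 3x3 kernel
frames = torch.randn(1, T * C, H, W)     # target frame stacked with 4 neighbors

offset_net = nn.Sequential(              # predicts 2 offsets per kernel sample
    nn.Conv2d(T * C, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2 * K * K, 3, padding=1),
)
fusion_weight = nn.Parameter(torch.randn(32, T * C, K, K) * 0.01)

offsets = offset_net(frames)                           # (1, 2*K*K, H, W)
fused = deform_conv2d(frames, offsets, fusion_weight, padding=1)
print(fused.shape)                                     # torch.Size([1, 32, 64, 64])
```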

90 citations


Journal ArticleDOI
TL;DR: It is proved that BOLA achieves a time-average utility that is within an additive term $O(1/V)$ of the optimal value, for a control parameter V related to the video buffer size, and that it attains utility significantly higher than current state-of-the-art algorithms.
Abstract: Modern video players employ complex algorithms to adapt the bitrate of the video that is shown to the user. Bitrate adaptation requires a tradeoff between reducing the probability that the video freezes (rebuffers) and enhancing the quality of the video. A bitrate that is too high leads to frequent rebuffering, while a bitrate that is too low leads to poor video quality. Video providers segment videos into short segments and encode each segment at multiple bitrates. The video player adaptively chooses the bitrate of each segment to download, possibly choosing different bitrates for successive segments. We formulate bitrate adaptation as a utility-maximization problem and devise an online control algorithm called BOLA that uses Lyapunov optimization to minimize rebuffering and maximize video quality. We prove that BOLA achieves a time-average utility that is within an additive term $O(1/V)$ of the optimal value, for a control parameter V related to the video buffer size. Further, unlike prior work, BOLA does not require prediction of available network bandwidth. We empirically validate BOLA in a simulated network environment using a collection of network traces. We show that BOLA achieves near-optimal utility and in many cases significantly higher utility than current state-of-the-art algorithms. Our work has immediate impact on real-world video players and for the evolving DASH standard for video transmission. We also implemented an updated version of BOLA that is now part of the standard reference player dash.js and is used in production by several video providers such as Akamai, BBC, CBS, and Orange.
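A compact sketch of the kind of per-segment, buffer-driven decision BOLA makes: maximize a ratio of scaled utility (plus a rebuffer-avoidance term) minus the current buffer level to segment size, or pause when no option scores positively. The bitrates, utilities, and parameters below are made up; consult the paper for the exact objective and thresholds.

```python
# Simplified sketch of a BOLA-style per-segment bitrate decision: pick the bitrate
# whose (scaled utility + rebuffer-avoidance term - buffer level) per unit segment
# size is largest and positive; otherwise wait. Values below are made up.
import math

bitrates_kbps = [300, 750, 1500, 3000, 6000]
seg_sec = 4.0
sizes = [b * seg_sec / 8 for b in bitrates_kbps]      # segment sizes (KB)
utilities = [math.log(s / sizes[0]) for s in sizes]   # log utility per segment

V = 0.93         # control parameter tied to buffer size (placeholder)
gamma_p = 5.0    # rebuffer-avoidance weight times segment duration (placeholder)

def bola_choice(buffer_level_sec: float):
    """Return index of the bitrate to fetch, or None to pause downloading."""
    best, best_score = None, 0.0
    for m, (u, s) in enumerate(zip(utilities, sizes)):
        score = (V * (u + gamma_p) - buffer_level_sec) / s
        if score > best_score:
            best, best_score = m, score
    return best

for buf in (2.0, 10.0, 25.0):
    print(f"buffer={buf:>4.1f}s -> choose", bola_choice(buf))
```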

75 citations


Journal ArticleDOI
TL;DR: This paper proposes an asymmetric generalized Gaussian distribution (AGGD) to model the statistics of MSCN coefficients of natural videos and their spatiotemporal Gabor bandpass filtered outputs and demonstrates that the AGGD model parameters serve as good representative features for distortion discrimination.
Abstract: Robust spatiotemporal representations of natural videos have several applications including quality assessment, action recognition, object tracking etc. In this paper, we propose a video representation that is based on a parameterized statistical model for the spatiotemporal statistics of mean subtracted and contrast normalized (MSCN) coefficients of natural videos. Specifically, we propose an asymmetric generalized Gaussian distribution (AGGD) to model the statistics of MSCN coefficients of natural videos and their spatiotemporal Gabor bandpass filtered outputs. We then demonstrate that the AGGD model parameters serve as good representative features for distortion discrimination. Based on this observation, we propose a supervised learning approach using support vector regression (SVR) to address the no-reference video quality assessment (NRVQA) problem. The performance of the proposed algorithm is evaluated on publicly available video quality assessment (VQA) datasets with both traditional and in-capture/authentic distortions. We show that the proposed algorithm delivers competitive performance on traditional (synthetic) distortions and acceptable performance on authentic distortions. The code for our algorithm will be released at https://www.iith.ac.in/lfovia/downloads.html .
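A sketch of the moment-matching AGGD fit commonly used for MSCN statistics (as in BRISQUE-style models), followed by an SVR regressor. The spatiotemporal Gabor filtering stage and the exact feature layout of the paper are omitted, and the data below are placeholders.

```python
# Sketch: fit an asymmetric generalized Gaussian (AGGD) to MSCN-like coefficients
# by moment matching, and use the fitted parameters as features for an SVR.
# The spatiotemporal Gabor filtering stage from the paper is omitted here.
import numpy as np
from scipy.special import gamma as G
from sklearn.svm import SVR

def aggd_params(x):
    """Return (alpha, sigma_left, sigma_right) of an AGGD fitted to x."""
    left, right = x[x < 0], x[x >= 0]
    sigma_l = np.sqrt(np.mean(left ** 2)) if left.size else 1e-6
    sigma_r = np.sqrt(np.mean(right ** 2)) if right.size else 1e-6
    gamma_hat = sigma_l / sigma_r
    r_hat = np.mean(np.abs(x)) ** 2 / np.mean(x ** 2)
    R_hat = r_hat * (gamma_hat ** 3 + 1) * (gamma_hat + 1) / (gamma_hat ** 2 + 1) ** 2
    # invert rho(alpha) = Gamma(2/a)^2 / (Gamma(1/a) Gamma(3/a)) on a grid
    alphas = np.arange(0.2, 10.0, 0.001)
    rho = G(2 / alphas) ** 2 / (G(1 / alphas) * G(3 / alphas))
    alpha = alphas[np.argmin((rho - R_hat) ** 2)]
    return alpha, sigma_l, sigma_r

# Placeholder "videos": feature vector = AGGD parameters of their MSCN coefficients
rng = np.random.default_rng(2)
feats = np.array([aggd_params(rng.standard_normal(10000) * s)
                  for s in rng.uniform(0.5, 2.0, 100)])
mos = rng.uniform(1, 5, 100)                   # placeholder subjective scores
svr = SVR(kernel="rbf").fit(feats, mos)        # supervised NR-VQA regressor
print(svr.predict(feats[:3]))
```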

75 citations


Proceedings ArticleDOI
06 Jul 2020
TL;DR: PARSEC significantly outperforms state-of-the-art 360° video streaming systems while reducing the bandwidth requirement, and combines traditional video encoding with super-resolution techniques to overcome the challenges.
Abstract: 360° videos provide an immersive experience to users, but require considerably more bandwidth to stream compared to regular videos. State-of-the-art 360° video streaming systems use viewport prediction to reduce the bandwidth requirement, which involves predicting which part of the video the user will view and only fetching that content. However, viewport prediction is error-prone, resulting in poor user Quality of Experience (QoE). We design PARSEC, a 360° video streaming system that reduces bandwidth requirement while improving video quality. PARSEC trades off bandwidth for additional client-side computation to achieve its goals. PARSEC uses an approach based on super-resolution, where the video is significantly compressed at the server and the client runs a deep learning model to enhance the video to a much higher quality. PARSEC addresses a set of challenges associated with using super-resolution for 360° video streaming: large deep learning models, slow inference rate, and variance in the quality of the enhanced videos. To this end, PARSEC trains small micro-models over shorter video segments, and then combines traditional video encoding with super-resolution techniques to overcome the challenges. We evaluate PARSEC on a real WiFi network, over a broadband network trace released by FCC, and over a 4G/LTE network trace. PARSEC significantly outperforms state-of-the-art 360° video streaming systems while reducing the bandwidth requirement.

74 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel mechanism to jointly consider buffer dynamics, video quality adaptation, edge caching, video transcoding and transmission in video streaming over software-defined mobile networks (SDMN) combined with MEC.
Abstract: Both mobile edge cloud (MEC) and software-defined networking (SDN) are technologies for next generation mobile networks. In this paper, we propose to simultaneously optimize energy consumption and quality of experience (QoE) metrics in video streaming over software-defined mobile networks (SDMN) combined with MEC. Specifically, we propose a novel mechanism to jointly consider buffer dynamics, video quality adaptation, edge caching, video transcoding and transmission. First, we assume that the time-varying channel is a discrete-time Markov chain (DTMC). Then, based on this assumption, we formulate two optimization problems which can be depicted as a constrained Markov decision process (CMDP) and a Markov decision process (MDP). Next, we transform the CMDP problem into a regular MDP by deploying the Lyapunov technique. We utilize the asynchronous advantage actor-critic (A3C) algorithm, one of the model-free deep reinforcement learning (DRL) methods, to solve the corresponding MDP problems. Simulation results are presented to show that the proposed scheme can achieve the goal of energy saving and QoE enhancement with the corresponding constraints satisfied.

72 citations


Journal ArticleDOI
TL;DR: A mixed datasets training strategy for training a single VQA model with multiple datasets is explored and the superior performance of the unified model in comparison with the state-of-the-art models is proved.
Abstract: Video quality assessment (VQA) is an important problem in computer vision. The videos in computer vision applications are usually captured in the wild. We focus on automatically assessing the quality of in-the-wild videos, which is a challenging problem due to the absence of reference videos, the complexity of distortions, and the diversity of video contents. Moreover, the video contents and distortions among existing datasets are quite different, which leads to poor performance of data-driven methods in the cross-dataset evaluation setting. To improve the performance of quality assessment models, we borrow intuitions from human perception, specifically, content dependency and temporal-memory effects of human visual system. To face the cross-dataset evaluation challenge, we explore a mixed datasets training strategy for training a single VQA model with multiple datasets. The proposed unified framework explicitly includes three stages: relative quality assessor, nonlinear mapping, and dataset-specific perceptual scale alignment, to jointly predict relative quality, perceptual quality, and subjective quality. Experiments are conducted on four publicly available datasets for VQA in the wild, i.e., LIVE-VQC, LIVE-Qualcomm, KoNViD-1k, and CVD2014. The experimental results verify the effectiveness of the mixed datasets training strategy and prove the superior performance of the unified model in comparison with the state-of-the-art models. For reproducible research, we make the PyTorch implementation of our method available at this https URL.
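The nonlinear mapping stage mentioned in the abstract can be illustrated with the 4-parameter logistic commonly used in VQA evaluations, fitted per dataset with scipy; the paper's exact parameterization and alignment procedure may differ, and the predictions and MOS values below are placeholders.

```python
# Illustration of a nonlinear (4-parameter logistic) mapping from relative quality
# predictions to subjective scores, fitted per dataset; the paper's exact
# parameterization and alignment procedure may differ.
import numpy as np
from scipy.optimize import curve_fit

def logistic4(q, beta1, beta2, beta3, beta4):
    return (beta1 - beta2) / (1.0 + np.exp(-(q - beta3) / np.abs(beta4))) + beta2

rng = np.random.default_rng(3)
pred = rng.uniform(-1, 1, 150)                                    # relative quality predictions
mos = logistic4(pred, 90, 10, 0.0, 0.3) + rng.normal(0, 3, 150)   # placeholder MOS

params, _ = curve_fit(logistic4, pred, mos,
                      p0=[mos.max(), mos.min(), 0.0, 0.5], maxfev=10000)
aligned = logistic4(pred, *params)                                # dataset-aligned predictions
print("PLCC after mapping:", np.corrcoef(aligned, mos)[0, 1])
```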

70 citations


Journal ArticleDOI
TL;DR: An efficient in-loop filtering algorithm based on the enhanced deep convolutional neural networks (EDCNN) for significantly improving the performance of in-loop filtering in HEVC is proposed.
Abstract: Raw video data can be greatly compressed by the latest video coding standard, high efficiency video coding (HEVC). However, the block-based hybrid coding used in HEVC introduces many artifacts into compressed videos, which severely degrades video quality. To address this problem, in-loop filtering is used in HEVC to eliminate artifacts. Inspired by the success of deep learning, we propose an efficient in-loop filtering algorithm based on enhanced deep convolutional neural networks (EDCNN) for significantly improving the performance of in-loop filtering in HEVC. Firstly, the problems of traditional convolutional neural network models, including the normalization method, network learning ability, and loss function, are analyzed. Then, based on the statistical analyses, the EDCNN is proposed for efficiently eliminating the artifacts, which adopts three solutions, including a weighted normalization method, a feature information fusion block, and a precise loss function. Finally, the PSNR enhancement, PSNR smoothness, RD performance, subjective test, and computational complexity/GPU memory consumption are employed as the evaluation criteria, and experimental results show that when compared with the filter in HM16.9, the proposed in-loop filtering algorithm achieves an average of 6.45% BDBR reduction and 0.238 dB BDPSNR gains.
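A bare-bones residual CNN standing in for the general in-loop filtering idea (predict the coding-artifact residual and add it back to the reconstructed frame); EDCNN's weighted normalization, feature-fusion block, and loss function are not reproduced here.

```python
# Bare-bones residual CNN standing in for a learned in-loop filter: the network
# predicts the coding-artifact residual, which is added back to the reconstructed
# frame. EDCNN's weighted normalization, fusion block, and loss are not reproduced.
import torch
import torch.nn as nn

class InLoopFilter(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, reconstructed):
        return reconstructed + self.body(reconstructed)   # residual correction

frame = torch.rand(1, 1, 64, 64)          # luma block of a decoded HEVC frame
filtered = InLoopFilter()(frame)
print(filtered.shape)
```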

66 citations


Proceedings ArticleDOI
01 Mar 2020
TL;DR: A novel conditional GAN architecture, namely ImaGINator, which given a single image, a condition (label of a facial expression or action) and noise, decomposes appearance and motion in both latent and high level feature spaces, generating realistic videos.
Abstract: Generating human videos based on single images entails the challenging simultaneous generation of realistic and visually appealing appearance and motion. In this context, we propose a novel conditional GAN architecture, namely ImaGINator, which given a single image, a condition (label of a facial expression or action) and noise, decomposes appearance and motion in both latent and high level feature spaces, generating realistic videos. This is achieved by (i) a novel spatio-temporal fusion scheme, which generates dynamic motion, while retaining appearance throughout the full video sequence by transmitting appearance (originating from the single image) through all layers of the network. In addition, we propose (ii) a novel transposed (1+2)D convolution, factorizing the transposed 3D convolutional filters into separate transposed temporal and spatial components, which yields significant gains in video quality and speed. We extensively evaluate our approach on the facial expression datasets MUG and UvA-NEMO, as well as on the action datasets NATOPS and Weizmann. We show that our approach achieves significantly better quantitative and qualitative results than the state-of-the-art. The source code and models are available under https://github.com/wyhsirius/ImaGINator.
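The factorized transposed (1+2)D convolution can be sketched as a temporal ConvTranspose3d with a (k, 1, 1) kernel followed by a spatial one with a (1, k, k) kernel; the channel counts, kernel sizes, and strides below are arbitrary and do not reflect ImaGINator's actual configuration.

```python
# Sketch of a transposed (1+2)D convolution: a transposed 3D convolution is
# factorized into a temporal step (k,1,1) followed by a spatial step (1,k,k).
# Channel counts and strides are arbitrary, not ImaGINator's actual settings.
import torch
import torch.nn as nn

class TransposedConv1Plus2D(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.temporal = nn.ConvTranspose3d(in_ch, mid_ch, kernel_size=(4, 1, 1),
                                           stride=(2, 1, 1), padding=(1, 0, 0))
        self.spatial = nn.ConvTranspose3d(mid_ch, out_ch, kernel_size=(1, 4, 4),
                                          stride=(1, 2, 2), padding=(0, 1, 1))

    def forward(self, x):                      # x: (N, C, T, H, W)
        return self.spatial(torch.relu(self.temporal(x)))

z = torch.randn(1, 64, 4, 8, 8)                     # coarse spatio-temporal features
print(TransposedConv1Plus2D(64, 64, 32)(z).shape)   # -> (1, 32, 8, 16, 16)
```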

65 citations


Posted Content
TL;DR: The largest (by far) subjective video quality dataset is created, containing 38,811 real-world distorted videos and 116,433 space-time localized video patches ('v-patches'), and 5.5M human perceptual quality annotations, which were used to create two unique NR-VQA models.
Abstract: No-reference (NR) perceptual video quality assessment (VQA) is a complex, unsolved, and important problem to social and streaming media applications. Efficient and accurate video quality predictors are needed to monitor and guide the processing of billions of shared, often imperfect, user-generated content (UGC). Unfortunately, current NR models are limited in their prediction capabilities on real-world, "in-the-wild" UGC video data. To advance progress on this problem, we created the largest (by far) subjective video quality dataset, containing 39,000 real-world distorted videos and 117,000 space-time localized video patches ('v-patches'), and 5.5M human perceptual quality annotations. Using this, we created two unique NR-VQA models: (a) a local-to-global region-based NR VQA architecture (called PVQ) that learns to predict global video quality and achieves state-of-the-art performance on 3 UGC datasets, and (b) a first-of-a-kind space-time video quality mapping engine (called PVQ Mapper) that helps localize and visualize perceptual distortions in space and time. We will make the new database and prediction models available immediately following the review process.

Proceedings ArticleDOI
12 Oct 2020
TL;DR: Experimental results show that the proposed model can predict subjective video quality more accurately than the publicly available video quality models representing the state-of-the-art.
Abstract: Due to the wide range of different natural temporal and spatial distortions appearing in user generated video content, blind assessment of natural video quality is a challenging research problem. In this study, we combine the hand-crafted statistical temporal features used in a state-of-the-art video quality model and spatial features obtained from a convolutional neural network trained for image quality assessment via transfer learning. Experimental results on two recently published natural video quality databases show that the proposed model can predict subjective video quality more accurately than the publicly available video quality models representing the state-of-the-art. The proposed model is also competitive in terms of computational complexity.

Journal ArticleDOI
TL;DR: This work demonstrates high fidelity and temporally stable results in real-time, even in the highly challenging 4 × 4 upsampling scenario, significantly outperforming existing superresolution and temporal antialiasing work.
Abstract: Due to higher resolutions and refresh rates, as well as more photorealistic effects, real-time rendering has become increasingly challenging for video games and emerging virtual reality headsets. To meet this demand, modern graphics hardware and game engines often reduce the computational cost by rendering at a lower resolution and then upsampling to the native resolution. Following the recent advances in image and video superresolution in computer vision, we propose a machine learning approach that is specifically tailored for high-quality upsampling of rendered content in real-time applications. The main insight of our work is that in rendered content, the image pixels are point-sampled, but precise temporal dynamics are available. Our method combines this specific information that is typically available in modern renderers (i.e., depth and dense motion vectors) with a novel temporal network design that takes into account such specifics and is aimed at maximizing video quality while delivering real-time performance. By training on a large synthetic dataset rendered from multiple 3D scenes with recorded camera motion, we demonstrate high fidelity and temporally stable results in real-time, even in the highly challenging 4 × 4 upsampling scenario, significantly outperforming existing superresolution and temporal antialiasing work.
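The "point-sampled pixels" observation can be illustrated with zero-upsampling: scatter each low-resolution sample to its exact position in the high-resolution grid and leave the other pixels empty for a network to fill in. The warping with dense motion vectors and the reconstruction network from the paper are omitted, and the frame below is random placeholder data.

```python
# Illustration of zero-upsampling for point-sampled rendered frames: each
# low-resolution sample is scattered to its position in the high-resolution grid,
# leaving the remaining pixels zero for a network to fill in. Warping with motion
# vectors and the reconstruction network from the paper are omitted.
import numpy as np

def zero_upsample(lr: np.ndarray, scale: int) -> np.ndarray:
    """lr: (H, W, C) low-res frame -> (H*scale, W*scale, C) sparse high-res frame."""
    h, w, c = lr.shape
    hr = np.zeros((h * scale, w * scale, c), dtype=lr.dtype)
    hr[::scale, ::scale, :] = lr          # keep samples at their exact positions
    return hr

lr_frame = np.random.rand(180, 320, 3).astype(np.float32)
hr_sparse = zero_upsample(lr_frame, 4)    # the 4 x 4 upsampling scenario
print(hr_sparse.shape, float(hr_sparse.mean()))
```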

Proceedings ArticleDOI
Hyunho Yeo, Chan Ju Chong, Youngmok Jung, Juncheol Ye, Dongsu Han
21 Sep 2020
TL;DR: NEMO leverages fine-grained dependencies using information from the video codec and strives to provide guarantees in the quality degradation compared to per-frame super-resolution, which leads to a 31.2% improvement in quality of experience for mobile users.
Abstract: The demand for mobile video streaming has experienced tremendous growth over the last decade. However, existing methods of video delivery fall short of delivering high-quality video. Recent advances in neural super-resolution have opened up the possibility of enhancing video quality by leveraging client-side computation. Unfortunately, mobile devices cannot benefit from this because it is too computationally expensive and power-hungry. To overcome the limitation, we present NEMO, a system that enables real-time video super-resolution on mobile devices. NEMO applies neural super-resolution to a few select frames and transfers the outputs to benefit the remaining frames. The frames to which super-resolution is applied are carefully chosen to maximize the overall quality gains. NEMO leverages fine-grained dependencies using information from the video codec and strives to provide guarantees in the quality degradation compared to per-frame super-resolution. Our evaluation using a full system implementation on Android shows NEMO improves the overall processing throughput by 11.5×, reduces energy consumption by 88.6%, and maintains device temperatures at acceptable levels compared to per-frame super-resolution, while ensuring high video quality. Overall, this leads to a 31.2% improvement in quality of experience for mobile users.

Proceedings ArticleDOI
12 Oct 2020
TL;DR: This work proposes a novel no-reference VQA framework named Recurrent-In-Recurrent Network (RIRNet), which integrates concepts from motion perception in the human visual system (HVS), manifested in the designed network structure composed of low- and high-level processing.
Abstract: Video quality assessment (VQA), which is capable of automatically predicting the perceptual quality of source videos especially when reference information is not available, has become a major concern for video service providers due to the growing demand for video quality of experience (QoE) by end users. While significant advances have been achieved from the recent deep learning techniques, they often lead to misleading results in VQA tasks given their limitations on describing 3D spatio-temporal regularities using only fixed temporal frequency. Partially inspired by psychophysical and vision science studies revealing the speed tuning property of neurons in visual cortex when performing motion perception (i.e., sensitive to different temporal frequencies), we propose a novel no-reference (NR) VQA framework named Recurrent-In-Recurrent Network (RIRNet) to incorporate this characteristic to prompt an accurate representation of motion perception in VQA task. By fusing motion information derived from different temporal frequencies in a more efficient way, the resulting temporal modeling scheme is formulated to quantify the temporal motion effect via a hierarchical distortion description. It is found that the proposed framework is in closer agreement with quality perception of the distorted videos since it integrates concepts from motion perception in human visual system (HVS), which is manifested in the designed network structure composed of low- and high-level processing. A holistic validation of our methods on four challenging video quality databases demonstrates the superior performances over the state-of-the-art methods.

Journal ArticleDOI
TL;DR: An end-edge-cloud coordination framework for low-latency and accurate live video analytics and an online video quality and computing resource configuration algorithm to gradually learn the optimal configuration strategy are introduced.
Abstract: To develop smart city and intelligent manufacturing, video cameras are being increasingly deployed. In order to achieve fast and accurate response to live video queries (e.g., license plate recording and object tracking), the real-time high-volume video streams should be delivered and analyzed efficiently. In this article, we introduce an end-edge-cloud coordination framework for low-latency and accurate live video analytics. Considering the locality of video queries, the edge platform is designated as the system coordinator. It accepts live video queries and configures the related end cameras to generate video frames that meet quality requirements. By taking into account the latency constraint, edge computing resources are subtly distributed to process the live video frames from different sources such that the analytic accuracy of the accepted video queries can be maximized. Since the amount of required edge computing resource and video quality to accurately address different video queries are unknown in advance, we propose an online video quality and computing resource configuration algorithm to gradually learn the optimal configuration strategy. Extensive simulation results show that as compared to other benchmarks, the proposed configuration algorithm can effectively improve the analytic accuracy, while providing low-latency response.

Proceedings ArticleDOI
21 Sep 2020
TL;DR: OnRL puts many individual RL agents directly into the video telephony system, where they make video bitrate decisions in real time and evolve their models over time, and it incorporates novel mechanisms to handle the adverse impacts of inherent video traffic dynamics.
Abstract: Machine learning models, particularly reinforcement learning (RL), have demonstrated great potential in optimizing video streaming applications. However, the state-of-the-art solutions are limited to an "offline learning" paradigm, i.e., the RL models are trained in simulators and then are operated in real networks. As a result, they inevitably suffer from the simulation-to-reality gap, showing far less satisfactory performance under real conditions compared with the simulated environment. In this work, we close the gap by proposing OnRL, an online RL framework for real-time mobile video telephony. OnRL puts many individual RL agents directly into the video telephony system, which make video bitrate decisions in real time and evolve their models over time. OnRL then aggregates these agents to form a high-level RL model that can help each individual to react to unseen network conditions. Moreover, OnRL incorporates novel mechanisms to handle the adverse impacts of inherent video traffic dynamics, and to eliminate risks of quality degradation caused by the RL model's exploration attempts. We implement OnRL on a mainstream operational video telephony system, Alibaba Taobao-live. In a month-long evaluation with 543 hours of video sessions from 151 real-world mobile users, OnRL outperforms the prior algorithms significantly, reducing video stalling rate by 14.22% while maintaining similar video quality.

Journal ArticleDOI
TL;DR: In this article, a recurrent neural network-based QoE prediction model using an LSTM network is proposed, which is a network of cascaded long short-term memory (LSTM) blocks to capture the nonlinearities and the complex temporal dependencies involved in the time-varying QoEs.
Abstract: Due to the rate adaptation in hypertext transfer protocol adaptive streaming, the video quality delivered to the client keeps varying with time depending on the end-to-end network conditions. Moreover, the varying network conditions could also lead to the video client running out of the playback content resulting in rebuffering events. These factors affect the user satisfaction and cause degradation of the user quality of experience (QoE). Hence, it is important to quantify the perceptual QoE of the streaming video users and to monitor the same in a continuous manner so that the QoE degradation can be minimized. However, the continuous evaluation of QoE is challenging as it is determined by complex dynamic interactions among the QoE influencing factors. Toward this end, we present long short-term memory (LSTM)-QoE, a recurrent neural network-based QoE prediction model using an LSTM network. The LSTM-QoE is a network of cascaded LSTM blocks to capture the nonlinearities and the complex temporal dependencies involved in the time-varying QoE. Based on an evaluation over several publicly available continuous QoE datasets, we demonstrate that the LSTM-QoE has the capability to model the QoE dynamics effectively. We compare the proposed model with the state-of-the-art QoE prediction models and show that it provides an excellent performance across these datasets. Furthermore, we discuss the state space perspective for the LSTM-QoE and show the efficacy of the state space modeling approaches for the QoE prediction.
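A minimal LSTM regressor over per-timestep inputs (e.g., instantaneous quality and a rebuffering indicator) producing a continuous QoE trace; the feature choice and network sizes below are placeholders, not the paper's exact configuration.

```python
# Minimal sketch of an LSTM-based continuous QoE predictor: per-timestep inputs
# (e.g., instantaneous quality and a rebuffering indicator) are mapped to a
# time-varying QoE value. Input features and network sizes are placeholders.
import torch
import torch.nn as nn

class LSTMQoE(nn.Module):
    def __init__(self, in_dim=2, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1) # continuous QoE per timestep

seq = torch.rand(8, 120, 2)               # 8 sessions, 120 one-second steps
qoe_trace = LSTMQoE()(seq)
print(qoe_trace.shape)                     # torch.Size([8, 120])
```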

Journal ArticleDOI
TL;DR: A novel and end-to-end framework to predict video Quality of Experience (QoE) that has the flexibility to fit different datasets, to learn QoE representation, and to perform both classification and regression tasks.
Abstract: Recently, many models have been developed to predict video Quality of Experience (QoE), yet the applicability of these models still faces significant challenges. Firstly, many models rely on features that are unique to a specific dataset and thus lack the capability to generalize. Due to the intricate interactions among these features, a unified representation that is independent of datasets with different modalities is needed. Secondly, existing models often lack the configurability to perform both classification and regression tasks. Thirdly, the sample size of the available datasets to develop these models is often very small, and the impact of limited data on the performance of QoE models has not been adequately addressed. To address these issues, in this work we develop a novel and end-to-end framework termed as DeepQoE. The proposed framework first uses a combination of deep learning techniques, such as word embedding and 3D convolutional neural network (C3D), to extract generalized features. Next, these features are combined and fed into a neural network for representation learning. A learned representation will then serve as input for classification or regression tasks. We evaluate the performance of DeepQoE with three datasets. The results show that for small datasets (e.g., WHU-MVQoE2016 and Live-Netflix Video Database), the performance of state-of-the-art machine learning algorithms is greatly improved by using the QoE representation from DeepQoE (e.g., 35.71% to 44.82%); while for the large dataset (e.g., VideoSet), our DeepQoE framework achieves significant performance improvement in comparison to the best baseline method (90.94% vs. 82.84%). In addition to the much improved performance, DeepQoE has the flexibility to fit different datasets, to learn QoE representation, and to perform both classification and regression problems. We also develop a DeepQoE based adaptive bitrate streaming (ABR) system to verify that our framework can be easily applied to multimedia communication service. The software package of the DeepQoE framework has been released to facilitate the current research on QoE.

Journal ArticleDOI
Yu Zhang, Xinbo Gao, Lihuo He, Wen Lu, Ran He
TL;DR: This paper proposes a full-reference (FR) VQA metric that integrates transfer learning with a convolutional neural network (CNN), transferring distorted images as a related domain to enrich the distorted samples, and introduces a preprocessing and a postprocessing step to reduce the impact of inaccurate labels predicted by the FR-VQA metric.
Abstract: Nowadays, video quality assessment (VQA) is essential to video compression technology applied to video transmission and storage. However, small-scale video quality databases with imbalanced samples and low-level feature representations for distorted videos impede the development of VQA methods. In this paper, we propose a full-reference (FR) VQA metric integrating transfer learning with a convolutional neural network (CNN). First, we imitate the feature-based transfer learning framework to transfer the distorted images as the related domain, which enriches the distorted samples. Second, to extract high-level spatiotemporal features of the distorted videos, a six-layer CNN with the acknowledged learning ability is pretrained and finetuned by the common features of the distorted image blocks (IBs) and video blocks (VBs), respectively. Notably, the labels of the distorted IBs and VBs are predicted by the classic FR metrics. Finally, based on saliency maps and the entropy function, we conduct a pooling stage to obtain the quality scores of the distorted videos by weighting the block-level scores predicted by the trained CNN. In particular, we introduce a preprocessing and a postprocessing to reduce the impact of inaccurate labels predicted by the FR-VQA metric. Due to feature learning in the proposed framework, two kinds of experimental schemes including train-test iterative procedures on one database and tests on one database with training other databases are carried out. The experimental results demonstrate that the proposed method has high expansibility and is on a par with some state-of-the-art VQA metrics on two widely used VQA databases with various compression distortions.
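The pooling stage (weighting block-level scores by saliency and entropy) can be sketched as a weighted average; the plain product weighting used here is illustrative and not necessarily the paper's formula, and the block scores and saliency values are placeholders.

```python
# Sketch of the pooling stage: block-level quality scores predicted by the CNN are
# combined into a frame/video score, weighted by saliency and entropy cues. The
# plain product weighting used here is illustrative, not necessarily the paper's.
import numpy as np

def entropy(block, bins=32):
    hist, _ = np.histogram(block, bins=bins, range=(0.0, 1.0), density=True)
    p = hist / (hist.sum() + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12))

rng = np.random.default_rng(4)
blocks = rng.random((100, 32, 32))              # 100 image/video blocks
block_scores = rng.uniform(20, 50, 100)         # per-block scores from the CNN
saliency = rng.random(100)                      # per-block saliency (placeholder)

weights = saliency * np.array([entropy(b) for b in blocks])
video_score = np.sum(weights * block_scores) / np.sum(weights)
print("pooled quality score:", round(float(video_score), 2))
```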

Journal ArticleDOI
TL;DR: A joint video transcoding and quality adaptation framework for ABR streaming by enabling RAN with computing capability is proposed and an automatic algorithm is developed to perform the computational resource assignment and video quality adaptation without any prior knowledge of channel statistics.
Abstract: Adaptive bitrate (ABR) streaming has been used in wireless networks to deal with the time-varying wireless channels. Traditionally, wireless video is fetched from a remote Internet server. However, wireless video streaming from an Internet server faces challenges such as congestion and long latency. ABR streaming and video transcoding at the radio access network (RAN) edge have shown the potential to overcome such problems and provide better video streaming service. In this paper, we consider joint computation and communication for ABR streaming based on mobile edge computing (MEC) under time-varying wireless channels. We propose a joint video transcoding and quality adaptation framework for ABR streaming by enabling the RAN with computing capability. By modeling the wireless channel as a finite state Markov channel, we formulate the optimization problem as a stochastic optimization problem of joint computational resource assignment and video quality adaptation for maximizing the average reward, which is defined as the tradeoff between user perceived quality of experience and the cost of performing transcoding at the edge server. By using a deep reinforcement learning (DRL) algorithm, we develop an automatic algorithm to perform the computational resource assignment and video quality adaptation without any prior knowledge of channel statistics. Simulation results using Tensorflow show the effectiveness of the designed MEC-enabled ABR streaming system and DRL algorithm.

Book ChapterDOI
Jianyi Wang, Xin Deng, Mai Xu, Congyong Chen, Yuhang Song
23 Aug 2020
TL;DR: This paper proposes a novel generative adversarial network (GAN) based on multi-level wavelet packet transform (WPT) to enhance the perceptual quality of compressed video, which is called multi- level wavelet-based GAN (MW-GAN).
Abstract: The past few years have witnessed fast development in video quality enhancement via deep learning. Existing methods mainly focus on enhancing the objective quality of compressed video while ignoring its perceptual quality. In this paper, we focus on enhancing the perceptual quality of compressed video. Our main observation is that enhancing the perceptual quality mostly relies on recovering high-frequency sub-bands in wavelet domain. Accordingly, we propose a novel generative adversarial network (GAN) based on multi-level wavelet packet transform (WPT) to enhance the perceptual quality of compressed video, which is called multi-level wavelet-based GAN (MW-GAN). In MW-GAN, we first apply motion compensation with a pyramid architecture to obtain temporal information. Then, we propose a wavelet reconstruction network with wavelet-dense residual blocks (WDRB) to recover the high-frequency details. In addition, the adversarial loss of MW-GAN is added via WPT to further encourage high-frequency details recovery for video frames. Experimental results demonstrate the superiority of our method.

Proceedings ArticleDOI
01 Jun 2020
TL;DR: This paper proposes a fast unsupervised anomaly detection system comprising of three modules: preprocessing module, candidate selection module and backtracking anomaly detection module that achieves an F1-score of 0.5926 along with 8.2386 root mean square error (RMSE) and is ranked second in the competition.
Abstract: Anomaly detection in traffic videos has been recently gaining attention due to its importance in intelligent transportation systems. Due to several factors such as weather, viewpoint, lighting conditions, etc. affecting the video quality of a real time traffic feed, it still remains a challenging problem. Even though the performance of state-of-the-art methods on the available benchmark dataset has been competitive, they demand a massive amount of external training data combined with significant computational resources. In this paper, we propose a fast unsupervised anomaly detection system comprising of three modules: preprocessing module, candidate selection module and backtracking anomaly detection module. The preprocessing module outputs stationary objects detected in a video. Then, the candidate selection module removes the misclassified stationary objects using a nearest neighbor approach and then uses K-means clustering to identify potential anomalous regions. Finally, the backtracking anomaly detection algorithm computes a similarity statistic and decides on the onset time of the anomaly. Experimental results on the Track 4 test set of the NVIDIA AI CITY 2020 challenge show the efficacy of the proposed framework as we achieve an F1-score of 0.5926 along with 8.2386 root mean square error (RMSE) and are ranked second in the competition.
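The candidate-selection step can be sketched with scikit-learn's KMeans over the (x, y) locations of detected stationary objects, flagging dense clusters as potential anomalous regions; the number of clusters, the density threshold, and the detections below are placeholders.

```python
# Sketch of the candidate-selection idea: cluster the (x, y) locations of detected
# stationary objects with K-means and flag dense clusters as potential anomalous
# regions. The number of clusters and the density threshold are placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Placeholder detections: scattered parked objects plus one dense stalled-vehicle spot
locations = np.vstack([rng.uniform(0, 1000, (80, 2)),
                       rng.normal([420, 310], 5, (40, 2))])

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(locations)
counts = np.bincount(kmeans.labels_, minlength=8)
candidates = kmeans.cluster_centers_[counts > 25]     # dense clusters only
print("candidate anomalous regions:\n", candidates.round(1))
```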

Journal ArticleDOI
TL;DR: This paper proposes Edge Combinatorial Clock Auction (ECCA) and Combinatorial Clock Auction in Stream (CCAS), two auction frameworks to improve the QoE of live video streaming services in the edge-enabled cellular system, and shows that the overall system utility can be significantly improved through the proposed system.
Abstract: Live video streaming services have suffered from the limited backhaul capacity of the cellular core network and occasional congestion due to the cloud-based architecture. Mobile Edge Computing (MEC) brings the services from the centralized cloud to the nearby network edge to improve the Quality of Experience (QoE) of cloud services, such as live video streaming services. Nevertheless, the resources at edge devices are still limited and should be allocated in an economically efficient way. In this paper, we propose Edge Combinatorial Clock Auction (ECCA) and Combinatorial Clock Auction in Stream (CCAS), two auction frameworks to improve the QoE of live video streaming services in the edge-enabled cellular system. The edge system is the auctioneer that decides the backhaul capacity and caching space allocation, and streamers are the bidders who request backhaul capacity and caching space to improve the video quality their audiences can watch. There are two key subproblems: the caching space value evaluations and allocations. We show that both problems can be solved by the proposed dynamic programming algorithms. The truth-telling property is guaranteed in both ECCA and CCAS. The simulation results show that the overall system utility can be significantly improved through the proposed system.
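The caching-space allocation subproblem has a knapsack flavor; below is a small dynamic-programming sketch that allocates discrete caching units across streamers to maximize total declared value. The value table is made up, and the auction and truthfulness machinery are not modeled.

```python
# Small dynamic-programming sketch for allocating discrete caching-space units
# across streamers to maximize total declared value (a knapsack-style subproblem).
# The value table is made up; the auction and truthfulness machinery are omitted.
def allocate_cache(values, capacity):
    """values[i][k] = value if streamer i gets k units; returns (best value, allocation)."""
    n = len(values)
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    choice = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for cap in range(capacity + 1):
            for k in range(min(cap, len(values[i - 1]) - 1) + 1):
                cand = best[i - 1][cap - k] + values[i - 1][k]
                if cand > best[i][cap]:
                    best[i][cap], choice[i][cap] = cand, k
    alloc, cap = [], capacity
    for i in range(n, 0, -1):
        k = choice[i][cap]
        alloc.append(k)
        cap -= k
    return best[n][capacity], list(reversed(alloc))

values = [[0, 3, 5, 6], [0, 4, 6, 7], [0, 2, 3, 3]]   # value of 0..3 cached units
print(allocate_cache(values, capacity=4))              # -> (11.0, [2, 2, 0])
```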

Proceedings ArticleDOI
04 May 2020
TL;DR: In this paper, a distortion-specific no-reference video quality model for predicting banding artifacts, called the Blind BANding Detector (BBAND index), was proposed.
Abstract: Banding artifact, or false contouring, is a common video compression impairment that tends to appear on large flat regions in encoded videos. These staircase-shaped color bands can be very noticeable in high-definition videos. Here we study this artifact, and propose a new distortion-specific no-reference video quality model for predicting banding artifacts, called the Blind BANding Detector (BBAND index). BBAND is inspired by human visual models. The proposed detector can generate a pixel-wise banding visibility map and output a banding severity score at both the frame and video levels. Experimental results show that our proposed method outperforms state-of-the-art banding detection algorithms and delivers better consistency with subjective evaluations.

Journal ArticleDOI
14 Feb 2020
TL;DR: A robust and efficient system for unconstrained video-based face recognition, which is composed of modules for face/fiducial detection, face association, and face recognition is proposed.
Abstract: Although deep learning approaches have achieved performance surpassing humans for still image-based face recognition, unconstrained video-based face recognition is still a challenging task due to large volume of data to be processed and intra/inter-video variations on pose, illumination, occlusion, scene, blur, video quality, etc. In this work, we consider challenging scenarios for unconstrained video-based face recognition from multiple-shot videos and surveillance videos with low-quality frames. To handle these problems, we propose a robust and efficient system for unconstrained video-based face recognition, which is composed of modules for face/fiducial detection, face association, and face recognition. First, we use multi-scale single-shot face detectors to efficiently localize faces in videos. The detected faces are then grouped through carefully designed face association methods, especially for multi-shot videos. Finally, the faces are recognized by the proposed face matcher based on an unsupervised subspace learning approach and a subspace-to-subspace similarity metric. Extensive experiments on challenging video datasets, such as Multiple Biometric Grand Challenge (MBGC), Face and Ocular Challenge Series (FOCS), IARPA Janus Surveillance Video Benchmark (IJB-S) for low-quality surveillance videos and IARPA JANUS Benchmark B (IJB-B) for multiple-shot videos, demonstrate that the proposed system can accurately detect and associate faces from unconstrained videos and effectively learn robust and discriminative features for recognition.

Journal ArticleDOI
TL;DR: A novel joint video quality selection and resource allocation technique is proposed for increasing the quality-of-experience (QoE) of vehicular devices and results show that the proposed algorithm ensures high video quality experience compared to the baseline.
Abstract: Vehicle-to-everything (V2X) communication is a key enabler that connects vehicles to neighboring vehicles, infrastructure and pedestrians. In the past few years, multimedia services have seen enormous growth, and this is expected to continue as more devices, e.g., vehicular devices, utilize infotainment services in the future. Therefore, it is important to focus on user-centric measures, i.e., quality of experience (QoE), such as video quality (resolution) and fluctuations therein. In this paper, a novel joint video quality selection and resource allocation technique is proposed for increasing the QoE of vehicular devices. The proposed approach exploits the queuing dynamics and channel states of vehicular devices to maximize the QoE while ensuring seamless video playback at the end users with high probability. The network-wide QoE maximization problem is decoupled into two subparts. First, a network slicing based clustering algorithm is applied to partition the vehicles into multiple logical networks. Second, vehicle scheduling and quality selection is formulated as a stochastic optimization problem which is solved using the Lyapunov drift plus penalty method. Numerical results show that the proposed algorithm ensures high video quality experience compared to the baseline. Simulation results also show that the proposed technique achieves low latency and high-reliability communication.
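A toy drift-plus-penalty selection: in each scheduling slot, pick the video quality that maximizes V times the quality reward minus the queue backlog times the added bits. The numbers are made up and the paper's full formulation, which also covers clustering, scheduling, and reliability constraints, is not reproduced.

```python
# Toy drift-plus-penalty selection: in each slot, pick the video quality level that
# maximizes V * (quality reward) - queue_backlog * added_bits, trading off QoE
# against queue stability. Numbers are made up; the paper's full formulation also
# covers clustering, scheduling, and reliability constraints.
quality_levels = {            # level -> (reward, bits per slot)
    "360p":  (1.0, 1.0e6),
    "720p":  (2.5, 3.0e6),
    "1080p": (4.0, 6.0e6),
}
V = 2.0e6                     # QoE-vs-backlog trade-off parameter (placeholder)

def pick_quality(queue_backlog_bits: float) -> str:
    return max(quality_levels,
               key=lambda q: V * quality_levels[q][0]
                             - queue_backlog_bits * quality_levels[q][1] / 1e6)

for backlog in (0.5e6, 2.0e6, 8.0e6):
    print(f"backlog={backlog/1e6:.1f}Mb -> {pick_quality(backlog)}")
```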

Journal ArticleDOI
TL;DR: In this article, the authors describe the process carried out to validate the application of one of the most robust and influential video quality metrics, Video Multimethod Assessment Fusion (VMAF), to 360VR contents.
Abstract: This paper describes the process carried out to validate the application of one of the most robust and influential video quality metrics, Video Multimethod Assessment Fusion (VMAF), to 360VR contents. VMAF is a full reference metric initially designed to work with traditional 2D contents. Hence, at first, it cannot be assumed to be compatible with the particularities of the scenario where omnidirectional content is visualized using commercial head-mounted displays (HMDs). In this article, we prove that this metric can be successfully used to measure the quality of 360VR sequences without any specific training or adjustments, which evidences its usefulness and flexibility, and entails significant time and resource savings. Thus, it can be straightforwardly included in consumer appliances, namely content generators, servers and clients, as part of the embedded software or hardware as a reliable means to monitor the quality of the 360VR content consumed by users.

Posted Content
Lele Chen, Guofeng Cui, Ziyi Kou, Haitian Zheng, Chenliang Xu
TL;DR: This work presents a carefully-designed benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies, and aims to uncover the merits and drawbacks of current methods and point out promising directions for future work.
Abstract: Over the years, performance evaluation has become essential in computer vision, enabling tangible progress in many sub-fields. While talking-head video generation has become an emerging research topic, existing evaluations on this topic present many limitations. For example, most approaches use human subjects (e.g., via Amazon MTurk) to evaluate their research claims directly. This subjective evaluation is cumbersome, unreproducible, and may impede the evolution of new research. In this work, we present a carefully-designed benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies. As for evaluation, we either propose new metrics or select the most appropriate ones to evaluate results in what we consider as desired properties for a good talking-head video, namely, identity preserving, lip synchronization, high video quality, and natural-spontaneous motion. By conducting a thoughtful analysis across several state-of-the-art talking-head generation approaches, we aim to uncover the merits and drawbacks of current methods and point out promising directions for future work. All the evaluation code is available at: this https URL.

Journal Article
TL;DR: In this paper, an adaptive multi-mode EC (AMMEC) algorithm at the decoder, based on utilizing the pre-processing flexible macro-block ordering error resilience (FMO-ER) technique at the encoder, is proposed to efficiently conceal the erroneous MBs of intra- and inter-coded frames of 3D video.
Abstract: 3D Multi-View Video (MVV) consists of multiple video streams shot by several cameras around the same scene simultaneously. In Multi-view Video Coding (MVC), the spatio-temporal and inter-view correlations between frames and views are often used for error concealment. 3D video transmission over error-prone networks remains a substantial problem due to limited resources and the presence of severe channel errors. Efficiently compressing 3D video at a low transmission rate, while maintaining high quality of the received 3D video, is extremely challenging. Since it is not feasible to re-transmit all corrupted Macro-Blocks (MBs) in real-time applications with limited resources, the lost MBs must be recovered at the decoder side using suitable post-processing schemes such as Error Concealment (EC). EC algorithms have the advantage of enhancing the received 3D video quality without any modification of the transmission rate or of the encoder hardware or software. In this presentation, I will explore various Adaptive Multi-Mode EC (AMMEC) algorithms at the decoder, based on utilizing adaptive pre-processing techniques, i.e., Flexible Macro-block Ordering Error Resilience (FMO-ER), at the encoder to efficiently conceal and recover the erroneous MBs of intra- and inter-coded frames of the transmitted 3D video. I will also present extensive experimental simulation results to show that the proposed novel schemes can significantly improve the objective and subjective 3D video quality.
In this paper, secure, timely, fast, and reliable transmission of Wireless Capsule Endoscopy (WCE) images containing abnormalities to physicians is considered. The proposed algorithm uses an image pre-processing technique followed by edge detection using the Fisher Transform (FT) and morphological operations in order to extract features. A binary classifier, a Linear Support Vector Machine (LSVM), is implemented to classify the WCE images, and, depending on the channel conditions, specific frames are transmitted to the physician. Since it is mandatory to retrieve the lost MBs at the decoder side using suitable post-processing schemes such as error concealment (EC), we propose an adaptive multi-mode EC (AMMEC) algorithm at the decoder based on utilizing the pre-processing flexible macro-block ordering error resilience (FMO-ER) technique at the encoder to efficiently conceal the erroneous MBs of intra- and inter-coded frames of 3D video. Experimental simulation results show that the proposed FMO-ER/AMMEC schemes can significantly improve the objective and subjective 3D video quality.
Text superimposed on video frames provides supplemental but important information for video indexing and retrieval. The detection and recognition of text from video is thus a crucial issue in automated content-based indexing of visual information in video archives. Text of interest is not limited to static text; it may be scrolling in a linear motion, with only part of the text information available in different frames of the video. The problem is further complicated if the video is corrupted with noise. An algorithm is proposed to detect, classify, and segment both static and simple linearly moving text against a complex, noisy background. The extracted texts are further processed using averaging to achieve a quality suitable for text recognition by commercial optical character recognition (OCR) software.
We have also developed a system with multiple pan-tilt cameras for capturing high-resolution videos of a moving person. This system controls the cameras so that each camera captures the best view of the person (i.e., one of the body parts such as the head, torso, and limbs) based on criteria for camera-work optimization. For achieving this optimization in real time, time-consuming pre-processes, which give useful clues for the optimization, are performed during a training stage. Specifically, a target performance (e.g., a dance) is captured to acquire the configuration of the body parts at each frame. During the real capture stage, the system compares an online-reconstructed shape with those in the training data for fast retrieval of the configuration of the body parts. The retrieved configuration is then used by an efficient scheme for optimizing the camera work. Experimental results show the camera work optimized in accordance with the given criteria. High-resolution 3D videos produced by the proposed system are also shown as a typical use of high-resolution videos.