Journal Article

Toward Better Statistical Validation of Machine Learning-Based Multimedia Quality Estimators

TL;DR: The main goal of this paper is to shed light on the limitations of the current approach to validating ML-based objective quality predictors, from both practical and theoretical perspectives wherever applicable, and in the process propose an alternate approach to overcome some of them.
Abstract: Objective assessment of multimedia quality using machine learning (ML) has been gaining popularity, especially in the context of both traditional (e.g., terrestrial and satellite broadcast) and advanced (such as over-the-top media services and IPTV) broadcast services. Being data-driven, these methods obviously rely on training to find the optimal model parameters. Therefore, to statistically compare and validate such ML-based quality predictors, the current approach randomly splits the given data into training and test sets and obtains a performance measure (for instance, mean squared error or a correlation coefficient). The process is repeated a large number of times, and parametric tests (e.g., the t-test) are then employed to statistically compare mean (or median) prediction accuracies. However, the current approach suffers from a few limitations (related to the qualitative aspects of training and testing data, the use of an improper sample size for statistical testing, possibly dependent sample observations, and a lack of focus on quantifying the learning ability of the ML-based objective quality predictor) which have not been addressed in the literature. Therefore, the main goal of this paper is to shed light on the said limitations from both practical and theoretical perspectives wherever applicable, and in the process propose an alternate approach to overcome some of them. As a major advantage, the proposed guidelines not only help in a theoretically more grounded statistical comparison but also provide useful insights into how well the ML-based objective quality predictors exploit the data structure for learning. We demonstrate the added value of the proposed set of guidelines on standard datasets by comparing the performance of a few existing ML-based quality estimators. A software implementation of the presented guidelines is also made publicly available to enable researchers and developers to test and compare different models in a repeatable manner.
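
To make the critiqued protocol concrete, here is a minimal sketch of the conventional split-and-test procedure described above, assuming hypothetical features X, subjective scores y, and two stand-in regressors; it illustrates the general practice, not the paper's published software.

```python
# Conventional validation protocol: repeated random 80/20 splits, a per-split
# performance measure, then a parametric test on the resulting samples.
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.linear_model import Ridge

def split_performance(model, X, y, n_splits=100, seed=0):
    """Pearson correlation between predicted and subjective scores, per split."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=rng.randint(1 << 30))
        y_hat = model.fit(X_tr, y_tr).predict(X_te)
        scores.append(stats.pearsonr(y_te, y_hat)[0])
    return np.asarray(scores)

# Placeholder data standing in for objective features and opinion scores.
rng = np.random.RandomState(0)
X, y = rng.rand(200, 10), rng.rand(200)
scores_a = split_performance(SVR(), X, y)
scores_b = split_performance(Ridge(), X, y)

# The t-test assumes independent observations; repeated splits over the same
# data violate this, which is one of the limitations the paper highlights.
t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```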
Citations
Journal Article
TL;DR: In this paper, some insights on why QoE assessment is so difficult are provided by presenting a few major issues as well as a general summary of quality/QoE formation and conception, including the human auditory and visual systems.
Abstract: Quality of experience (QoE) assessment occupies a key role in various multimedia networks and applications. Recently, large efforts have been devoted to devising objective QoE metrics that correlate with perceived subjective measurements. Despite recent progress, limited success has been attained. In this paper, we provide some insights on why QoE assessment is so difficult by presenting a few major issues as well as a general summary of quality/QoE formation and conception, including the human auditory and visual systems. Also, potential future research directions are described to discern the path forward. This is an academic, perspective article, which we hope will complement existing studies and prompt interdisciplinary research.

33 citations

Journal Article
TL;DR: A novel reduced-reference IQA metric, the multi-channel free-energy based reduced-reference quality metric, is proposed; it is highly competitive with representative reduced-reference and classical full-reference models.
Abstract: The visual quality of perceptions is highly correlated with the mechanisms of the human brain and visual system. Recently, the free-energy principle, which has been widely researched in brain theory and neuroscience, has been introduced to quantify perception, action, and learning in the human brain. In the field of image quality assessment (IQA), on one hand, the free-energy principle can resort to an internal generative model to simulate the visual stimuli perceived by human beings. On the other hand, abundant psychological and neurobiological studies reveal that different frequency and orientation components of a visual stimulus arouse different neurons in the striate cortex, which processes visual information in the cerebral cortex. Motivated by these two aspects, a novel reduced-reference IQA metric called the multi-channel free-energy based reduced-reference quality metric is proposed in this paper. First, a two-level discrete Haar wavelet transform is used to decompose the input reference and distorted images. Next, to simulate the generative model in the human brain, sparse representation is leveraged to extract the free-energy-based features in the subband images. Finally, the overall quality score is obtained through a support vector regressor. Extensive experimental comparisons on four benchmark image quality databases (LIVE, CSIQ, TID2008, and TID2013) demonstrate that the proposed method is highly competitive with representative reduced-reference and classical full-reference models.
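
The pipeline above (two-level Haar decomposition, per-subband feature extraction, support vector regression) can be outlined schematically as follows. This is a simplified sketch: the sparse-representation free-energy features are replaced by plain subband statistics, and pywt/scikit-learn stand in for the authors' implementation, which is not reproduced here.

```python
# Schematic RR-IQA pipeline: Haar DWT -> subband features -> SVR.
import numpy as np
import pywt
from sklearn.svm import SVR

def subband_features(image):
    """Two-level Haar DWT; summary statistics per detail subband."""
    coeffs = pywt.wavedec2(image, wavelet='haar', level=2)
    feats = []
    for band in coeffs[1:]:            # (cH, cV, cD) tuples per level
        for sub in band:
            feats += [np.mean(np.abs(sub)), np.std(sub)]
    return np.asarray(feats)

def rr_features(reference, distorted):
    # Reduced-reference: only subband statistics of the reference are needed,
    # not the full reference image.
    return subband_features(reference) - subband_features(distorted)

# Training on (feature, subjective score) pairs; data are placeholders.
rng = np.random.RandomState(0)
pairs = [(rng.rand(64, 64), rng.rand(64, 64)) for _ in range(50)]
X = np.stack([rr_features(r, d) for r, d in pairs])
mos = rng.rand(50)                     # placeholder opinion scores
quality_model = SVR().fit(X, mos)
print(quality_model.predict(X[:3]))
```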

20 citations


Cites methods from "Toward Better Statistical Validatio..."

  • ...In this work, we follow the traditional way to randomly split the distorted images into training and test sets according to the 80/20 rule, despite some limitations such as possibly dependent sample observations [43]....

    [...]

Posted Content
TL;DR: An elastic metric and multi-scale trajectory based video quality metric (EM-VQM) is proposed in this paper; it significantly outperforms state-of-the-art metrics designed for free-viewpoint videos, achieving gains of 12.86% and 16.75% in median Pearson linear correlation coefficient on the two datasets compared to the best of them.
Abstract: Virtual viewpoint synthesis is an essential process for many immersive applications, including free-viewpoint TV (FTV). A widely used technique for viewpoint synthesis is the Depth-Image-Based Rendering (DIBR) technique. However, such techniques may introduce challenging non-uniform spatial-temporal structure-related distortions. Most existing state-of-the-art quality metrics fail to handle these distortions, especially the temporal structure inconsistencies observed during the switch between different viewpoints. To tackle this problem, an elastic metric and multi-scale trajectory based video quality metric (EM-VQM) is proposed in this paper. Dense motion trajectories are first used as a proxy for selecting temporally sensitive regions, where local geometric distortions might significantly diminish the perceived quality. Afterwards, the amount of temporal structure inconsistency and unsmooth viewpoint transitions is quantified by calculating 1) the amount of motion trajectory deformation with an elastic metric and 2) the spatial-temporal structural dissimilarity. According to comprehensive experimental results on two FTV video datasets, the proposed metric significantly outperforms the state-of-the-art metrics designed for free-viewpoint videos, achieving gains of 12.86% and 16.75% in terms of median Pearson linear correlation coefficient values on the two datasets compared to the best of them, respectively.
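
The figure of merit quoted above, the median Pearson linear correlation coefficient (PLCC), is typically obtained by computing one PLCC per test condition or split and taking the median; a small sketch with placeholder arrays:

```python
# Median PLCC across test conditions; each condition contributes one
# correlation between subjective and predicted scores. Data are placeholders.
import numpy as np
from scipy import stats

def median_plcc(subjective, objective):
    """subjective/objective: sequences of per-condition score arrays."""
    plccs = [stats.pearsonr(s, o)[0] for s, o in zip(subjective, objective)]
    return np.median(plccs)

rng = np.random.RandomState(1)
subj = [rng.rand(20) for _ in range(10)]
obj = [s + 0.1 * rng.randn(20) for s in subj]
print(f"median PLCC = {median_plcc(subj, obj):.3f}")
```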

5 citations

Journal Article
10 Mar 2021-Sensors
TL;DR: In this article, an approach is proposed that maps quality of service (QoS) to quality of experience (QoE), using QoE metrics to determine user satisfaction limits and applying QoS tools to provide the minimum QoE expected by users.
Abstract: Video quality evaluation needs a combined approach that includes subjective and objective metrics, testing, and monitoring of the network. This paper deals with a novel approach of mapping quality of service (QoS) to quality of experience (QoE), using QoE metrics to determine user satisfaction limits and applying QoS tools to provide the minimum QoE expected by users. Our aim was to connect objective estimations of video quality with subjective estimations. A comprehensive tool for the estimation of the subjective evaluation is proposed. This new idea is based on the evaluation and marking of video sequences using a sentinel flag derived from spatial information (SI) and temporal information (TI) in individual video frames. The authors of this paper created a video database for quality evaluation and derived SI and TI from each video sequence to classify the scenes. Video scenes from the database were evaluated by objective and subjective assessment. Based on the results, a new model for the prediction of subjective quality is defined and presented in this paper. This quality is predicted using an artificial neural network based on the objective evaluation and the type of video sequence, defined by qualitative parameters such as resolution, compression standard, and bitstream. Furthermore, the authors created an optimum mapping function to define the threshold for the variable bitrate setting based on the flag in the video, determining the type of scene in the proposed model. This function allows one to allocate a bitrate dynamically for a particular segment of the scene while maintaining the desired quality. Our proposed model can help video service providers increase the comfort of end users. The variable bitstream ensures consistent video quality and customer satisfaction while network resources are used effectively. The proposed model can also predict the appropriate bitrate based on the required quality of video sequences, defined using either objective or subjective assessment.
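
The SI and TI quantities mentioned above are commonly computed as in ITU-T Rec. P.910: SI is the maximum over frames of the standard deviation of the Sobel-filtered luminance, and TI is the maximum standard deviation of consecutive-frame differences. A minimal sketch with placeholder frames; the paper's sentinel-flag derivation itself is not reproduced here.

```python
# SI/TI per ITU-T Rec. P.910, computed on luminance frames.
import numpy as np
from scipy import ndimage

def spatial_information(frames):
    """SI = max over frames of the std of the Sobel gradient magnitude."""
    si = []
    for f in frames:
        f = f.astype(float)
        gx = ndimage.sobel(f, axis=0)
        gy = ndimage.sobel(f, axis=1)
        si.append(np.std(np.hypot(gx, gy)))
    return max(si)

def temporal_information(frames):
    """TI = max over consecutive-frame differences of their std."""
    frames = [f.astype(float) for f in frames]
    return max(np.std(b - a) for a, b in zip(frames, frames[1:]))

# Placeholder "video": 30 random 72x128 luma frames.
rng = np.random.RandomState(2)
video = [rng.randint(0, 256, (72, 128)) for _ in range(30)]
print(spatial_information(video), temporal_information(video))
```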

2 citations

References
Journal Article
TL;DR: A representation and interpretation of the area under a receiver operating characteristic (ROC) curve obtained by the "rating" method, or by mathematical predictions based on patient characteristics, is presented, and it is shown that in such a setting the area represents the probability that a randomly chosen diseased subject is (correctly) rated or ranked with greater suspicion than a randomly chosen non-diseased subject.
Abstract: A representation and interpretation of the area under a receiver operating characteristic (ROC) curve obtained by the "rating" method, or by mathematical predictions based on patient characteristics, is presented. It is shown that in such a setting the area represents the probability that a randomly chosen diseased subject is (correctly) rated or ranked with greater suspicion than a randomly chosen non-diseased subject. Moreover, this probability of a correct ranking is the same quantity that is estimated by the already well-studied nonparametric Wilcoxon statistic. These two relationships are exploited to (a) provide rapid closed-form expressions for the approximate magnitude of the sampling variability, i.e., the standard error that one uses to accompany the area under a smoothed ROC curve, (b) guide in determining the size of the sample required to provide a sufficiently reliable estimate of this area, and (c) determine how large sample sizes should be to ensure that one can statistically detect differences...
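
The identity stated above is easy to verify numerically: the ROC area equals the fraction of (diseased, non-diseased) score pairs that are correctly ranked, which is the Mann-Whitney/Wilcoxon statistic divided by the number of pairs. A small check with synthetic placeholder scores:

```python
# Numerical check: AUC == P(diseased score > non-diseased score).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(3)
pos = rng.normal(1.0, 1.0, 200)   # scores of "diseased" subjects (placeholder)
neg = rng.normal(0.0, 1.0, 300)   # scores of "non-diseased" subjects

labels = np.r_[np.ones(pos.size), np.zeros(neg.size)]
auc = roc_auc_score(labels, np.r_[pos, neg])

# Probability of a correct ranking, estimated over all (pos, neg) pairs;
# this is the Mann-Whitney U statistic divided by n_pos * n_neg.
p_correct_ranking = np.mean(pos[:, None] > neg[None, :])
print(auc, p_correct_ranking)      # the two values coincide
```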

19,398 citations


"Toward Better Statistical Validatio..." refers background in this paper

  • ..., class label 0) scores (recall that in our context, we assign a label 1 if d(i, j) > th and 0 otherwise) [45], [46], while oth-...

    [...]

Journal Article
TL;DR: The purpose of this article is to serve as an introduction to ROC graphs and as a guide for using them in research.

17,017 citations


"Toward Better Statistical Validatio..." refers background or methods in this paper

  • ...The AUC values from the ROC analysis performed only on statistically different pairs are then used to compare objective quality predictors....

    [...]

  • ...This can be treated as a binary classification problem and analyzed based on ROC analysis [40]–[42]....

    [...]

  • ...The idea was further extended to include Receiver Operating Characteristics (ROC) [39] based comparison [40]–[42], and states that a better performance measure should be able to distinguish (classify) different quality levels by considering the dispersion (uncertainty) in the opinion scores....

    [...]


  • ...The Area Under Curve (AUC) is then used to evaluate discrimination abilities [39]....

    [...]

Journal Article
TL;DR: A set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers is recommended: the Wilcoxon signed-ranks test for comparison of two classifiers, and the Friedman test with the corresponding post-hoc tests for comparison of multiple classifiers over multiple data sets.
Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams.
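
Both recommended tests are available in scipy, so the recipe above is a few lines in practice. A sketch with placeholder per-dataset accuracies (one entry per data set):

```python
# Demsar-style comparison: Wilcoxon signed-rank for two classifiers paired
# over data sets, Friedman for three or more. Accuracies are placeholders.
import numpy as np
from scipy import stats

acc_a = np.array([0.81, 0.74, 0.90, 0.66, 0.78, 0.85, 0.72, 0.88])
acc_b = np.array([0.79, 0.70, 0.91, 0.60, 0.74, 0.83, 0.69, 0.86])
acc_c = np.array([0.75, 0.72, 0.85, 0.62, 0.70, 0.80, 0.68, 0.84])

w_stat, p_two = stats.wilcoxon(acc_a, acc_b)                   # two classifiers
f_stat, p_many = stats.friedmanchisquare(acc_a, acc_b, acc_c)  # three classifiers
print(f"Wilcoxon p = {p_two:.3f}, Friedman p = {p_many:.3f}")
```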

10,306 citations


"Toward Better Statistical Validatio..." refers background in this paper

  • ...While p-value or significance level adjustments have been proposed in the literature to control the family-wise error rate [18], these adjustment procedures are debatable due to practical reasons (for instance, refer to [30]–[32])....

    [...]

  • ..., refer to [2]–[12] for some existing efforts in ML-based quality estimation for video or [13]–[16] for standardized recommendations) reveals that these important issues have not been thoroughly examined (either from theoretical or practical viewpoints), although few works such as [4], [9], and [11] have considered the practical implications of the first issue regarding the qualitative aspects of training and testing data (also refer to some related works on statistical comparison of classifiers [18] or analysis of their learning ability [19])....

    [...]

Book
28 Oct 1997
TL;DR: In this book, a broad and up-to-date coverage of bootstrap methods is given, with numerous applied examples, developed in a coherent way with the necessary theoretical basis, along with a disk of purpose-written S-Plus programs for implementing the methods described in the text.
Abstract: This book gives a broad and up-to-date coverage of bootstrap methods, with numerous applied examples, developed in a coherent way with the necessary theoretical basis. Applications include stratified data; finite populations; censored and missing data; linear, nonlinear, and smooth regression models; classification; time series and spatial problems. Special features of the book include: extensive discussion of significance tests and confidence intervals; material on various diagnostic methods; and methods for efficient computation, including improved Monte Carlo simulation. Each chapter includes both practical and theoretical exercises. Included with the book is a disk of purpose-written S-Plus programs for implementing the methods described in the text. Computer algorithms are clearly described, and computer code is included on a 3.5-inch, 1.4M disk for use with IBM computers and compatible machines. Users must have the S-Plus computer application. Author resource page: http://statwww.epfl.ch/davison/BMA/
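
The core resampling idea the book covers can be illustrated in a few lines: resample the data with replacement, recompute the statistic, and read a percentile confidence interval off the bootstrap distribution. A minimal nonparametric sketch (placeholder data, not code from the book's S-Plus disk):

```python
# Nonparametric bootstrap with a percentile confidence interval.
import numpy as np

def bootstrap_ci(data, statistic=np.median, n_boot=10000, alpha=0.05, seed=0):
    rng = np.random.RandomState(seed)
    boot = [statistic(rng.choice(data, size=len(data), replace=True))
            for _ in range(n_boot)]
    return np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])

sample = np.random.RandomState(4).exponential(scale=2.0, size=50)
lo, hi = bootstrap_ci(sample)
print(f"95% percentile CI for the median: [{lo:.2f}, {hi:.2f}]")
```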

6,420 citations

Journal Article
18 Apr 1998-BMJ
TL;DR: This paper advances the view, widely held by epidemiologists, that Bonferroni adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference.
Abstract: When more than one statistical test is performed in analysing the data from a clinical study, some statisticians and journal editors demand that a more stringent criterion be used for "statistical significance" than the conventional P<0.05. Many well-meaning researchers, eager for methodological rigour, comply without fully grasping what is at stake. Recently, adjustments for multiple tests (or Bonferroni adjustments) have found their way into introductory texts on medical statistics, which has increased their apparent legitimacy. This paper advances the view, widely held by epidemiologists, that Bonferroni adjustments are, at best, unnecessary and, at worst, deleterious to sound statistical inference. Summary points: Adjusting statistical significance for the number of tests that have been performed on study data (the Bonferroni method) creates more problems than it solves. The Bonferroni method is concerned with the general null hypothesis (that all null hypotheses are true simultaneously), which is rarely of interest or use to researchers. The main weakness is that the interpretation of a finding depends on the number of other tests performed. The likelihood of type II errors is also increased, so that truly important differences are deemed non-significant. Simply describing what tests of significance have been performed, and why, is generally the best way of dealing with multiple comparisons. Bonferroni adjustments are based on the following reasoning [1-3]: if a null hypothesis is true (for instance, two treatment groups in a randomised trial do not differ in terms of cure rates), a significant difference (P<0.05) will be observed by chance once in 20 trials. This is the type I error, or α. When 20 independent tests are performed (for example, study groups are compared with regard to 20 unrelated variables) and the null hypothesis holds for all 20 comparisons, the chance of at least one test being significant is no longer 0.05, but 0.64. ...
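
The 0.64 figure in the abstract follows from elementary probability; the short computation below reproduces it and shows the effect of the Bonferroni correction (alpha and the number of tests are taken from the abstract's example):

```python
# Family-wise error rate for m independent tests at significance level alpha.
alpha, m = 0.05, 20
print(1 - (1 - alpha) ** m)        # ~0.64: "no longer 0.05, but 0.64"
print(1 - (1 - alpha / m) ** m)    # ~0.049: rate after Bonferroni correction
```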

5,471 citations