HDR IMAGE COMPRESSION: A NEW CHALLENGE FOR OBJECTIVE QUALITY METRICS

Philippe Hanhart¹, Marco V. Bernardo²,³, Pavel Korshunov¹, Manuela Pereira³, António M. G. Pinheiro², and Touradj Ebrahimi¹

¹Multimedia Signal Processing Group, EPFL, Lausanne, Switzerland
²Remote Sensing Unit/Optics Center, UBI, Covilhã, Portugal
³Instituto de Telecomunicações, UBI, Covilhã, Portugal
ABSTRACT

High Dynamic Range (HDR) imaging is able to capture a wide range of luminance values, closer to what the human visual system can perceive. It is believed by many that HDR is a technology that will revolutionize the TV and cinema industry, similar to how color television did. However, the complexity of HDR requires reinvention of the whole chain from capture to display. In this paper, HDR images compressed with the upcoming JPEG XT HDR image coding standard are used to investigate the correlation between thirteen well-known full-reference metrics and the perceived quality of HDR content. The metrics are benchmarked using ground truth subjective scores collected during quality evaluations performed on a Dolby Pulsar HDR monitor. Results demonstrate that objective quality assessment of HDR image compression is challenging. Most of the tested metrics, with the exceptions of HDR-VDP-2 and FSIM computed for the luma component, poorly predict human perception of visual quality.

Index Terms: Image quality assessment, objective metrics, High Dynamic Range, JPEG XT
1. INTRODUCTION

High Dynamic Range (HDR) imaging systems pursue the acquisition of images in which all the brightness information of the visible range of a scene is represented. Hence, they can capture the whole dynamic range and color gamut perceived by the human visual system (HVS). Thus, many applications can greatly benefit from the adoption of HDR imaging. For example, HDR imaging can be exploited to improve quality of experience in multimedia applications [1] and to enhance intelligibility in security applications where lighting conditions cannot be controlled [2].

Footnote: This work has been conducted in the framework of the Swiss National Foundation for Scientific Research (FN 200021-143696-1), the EC-funded Network of Excellence VideoSense, the Portuguese "FCT Fundação para a Ciência e a Tecnologia" (projects PTDC/EIA-EIA/119004/2010, PEst-OE/EEI/LA0008/2013, and PEst-OE-FIS/UI0524/2014), and the COST IC1003 European Network on Quality of Experience in Multimedia Systems and Services (QUALINET). The authors would like to thank Dolby Laboratories Inc. staff for providing the Dolby Research HDR RGB backlight dual modulation display (aka Pulsar).
There are different methods to obtain HDR images. Computer rendering and merging multiple low dynamic range (LDR) images taken at different exposure settings are the two methods initially used to generate HDR images. Nowadays, HDR images can also be acquired using specific image sensors. There are two ways to visualize HDR images. The first and best solution is to use a specific HDR display that can reproduce a wider luminance range and color gamut. The second solution is to map the HDR image to the luminance range and color gamut of an LDR display, using a tone mapping operator (TMO).
JPEG XT is an upcoming standard for JPEG backward-compatible compression of HDR images [3]. With this compression standard, HDR images are coded in two layers: a base layer, in which a tone-mapped version of the HDR image is encoded in the standard JPEG format, and a residual layer, in which the extra HDR information is encoded. The advantage of this method is that any conventional JPEG decoder can extract the tone-mapped image, keeping backward compatibility and allowing for display on a conventional LDR monitor. Furthermore, a specific JPEG XT decoder can use the residual layer to reconstruct a lossy version of the HDR image.
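The normative Profile A reconstruction is defined in the JPEG XT specification [3] and is not reproduced here. The following is only a conceptual sketch of the two-layer idea just described (an inverse-tone-mapped base layer refined by a residual); the function and the way the residual is modeled are illustrative assumptions, not the standard's algorithm.

```python
import numpy as np

def reconstruct_hdr(base_ldr, residual_scale, gamma=2.2):
    """Conceptual two-layer reconstruction (NOT the normative JPEG XT decoder).

    base_ldr       : decoded 8-bit base layer (tone-mapped image), values in [0, 255]
    residual_scale : decoded residual layer, modeled here as a per-pixel
                     log2 luminance scale factor (an assumption for illustration)
    """
    # A legacy JPEG decoder stops here: the base layer alone is a displayable
    # LDR image, which is what provides backward compatibility.
    base_linear = (base_ldr.astype(np.float64) / 255.0) ** gamma  # crude inverse gamma

    # A JPEG XT decoder additionally applies the residual information to
    # recover an approximation of the original HDR pixel values.
    return base_linear * np.exp2(residual_scale)
```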
In this paper, HDR images encoded with JPEG XT Profile A and the corresponding ground truth subjective scores are used. During the subjective quality assessment, HDR images compressed at four different bit rates were displayed side by side on a Dolby Research HDR RGB backlight dual modulation display (aka Pulsar). The black level was held constant, so the luminance dynamic range was solely determined by the maximum luminance. The paired comparison evaluation methodology was selected for its high accuracy and reliability in constructing a scale of perceptual preferences. The subjects participating in the evaluation experiment were naïve viewers.
This paper investigates the performance of state-of-the-art objective metrics in predicting the perceived quality of compressed HDR images. A good objective metric should take the psychophysical processes of human vision and perception into account. The main characteristics of the HVS include contrast and orientation sensitivity, frequency selectivity, spatial and temporal pattern masking, and color perception [4]. In total, 36 metrics developed for image quality assessment were benchmarked using the subjective scores as ground truth. Out of all these metrics, only HDR-VDP-2 was specifically developed for HDR images. Out of the 36 metrics, thirteen full-reference metrics were selected for detailed evaluation and analysis: Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), Signal-to-Noise Ratio (SNR), Weighted Signal-to-Noise Ratio (WSNR), Structural Similarity index (SSIM), Multiscale SSIM index (MS-SSIM), Visual Information Fidelity (VIF), Visual Information Fidelity pixel-based (VIFp), Universal Quality Index (UQI), Image Fidelity Criterion (IFC), Feature Similarity Index (FSIM), High Dynamic Range Visible Difference Predictor (HDR-VDP-2), and the CIEDE2000 color difference. For each metric, the objective scores were fitted to the subjective scores using a logistic function. Several performance indexes, such as the Pearson and Spearman correlation coefficients and the root-mean-square error, were computed to compare how well the metrics estimate the subjective scores. Hence, we expect this study to be a valid contribution to future objective quality studies on HDR imaging.

Fig. 1: HDR images used in the experiments: (a) BloomingGorse2, (b) CanadianFalls, (c) McKeesPub, (d) MtRushmore2, (e) WillyDesk.

Table 1: HDR image information.

Image            Resolution [pixels]   Dynamic range [dB] (cropped part)   Encoding parameters (q, Q); q: base layer, Q: residual layer
BloomingGorse2   4288 × 2848           42                                  (11, 12), (20, 13), (32, 15), (62, 15)
CanadianFalls    4312 × 2868           41                                  (16, 29), (30, 30), (65, 30), (80, 33)
McKeesPub        4258 × 2829           60                                  (5, 64), (15, 91), (48, 88), (83, 91)
MtRushmore2      4312 × 2868           50                                  (5, 20), (24, 82), (67, 80), (89, 78)
WillyDesk        4288 × 2848           70                                  (5, 63), (15, 79), (57, 90), (85, 91)
The remainder of the paper is organized as follows. The dataset and corresponding subjective scores used as ground truth are described in Section 2. The different metrics benchmarked in this study are defined in Section 3. In Section 4, the methodology used to evaluate the performance of the metrics is described. Section 5 provides a detailed analysis of the objective results and discusses the reliability of objective metrics. Finally, Section 6 concludes the paper.
2. DATASET AND SUBJECTIVE EVALUATIONS

2.1. Dataset

Five HDR images¹ of different dynamic ranges (computed using Banterle's HDR toolbox for MATLAB²), representing different typical scenes, were used in the experiments (see Figure 1 and Table 1 for details). Originally, these images were selected by JPEG for the verification tests of the JPEG XT standard. JPEG also provided LDR versions of these images that were manually tone-mapped from the original HDR using Adobe Photoshop.

¹http://www.cis.rit.edu/fairchild/HDR.html
²http://www.github.com/banterle/HDR_Toolbox
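The dynamic range values in Table 1 were obtained with Banterle's HDR Toolbox; its exact routine is not reproduced here. The sketch below assumes the common definition of dynamic range in dB as 20·log10(Lmax/Lmin) over the strictly positive luminance of the image.

```python
import numpy as np

def dynamic_range_db(luminance):
    """Dynamic range in dB of a luminance map (float array, absolute or relative units).

    Assumes the common definition 20*log10(Lmax / Lmin) over strictly positive
    pixels; the HDR Toolbox may use a more robust variant (e.g., percentiles).
    """
    L = np.asarray(luminance, dtype=np.float64)
    L = L[L > 0]                                   # ignore zero/negative pixels
    return 20.0 * np.log10(L.max() / L.min())
```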

To prepare the images for the subjective experiments, both the HDR and LDR versions were first downscaled by a factor of two using bicubic interpolation. The resulting images were then compressed using JPEG XT Profile A at four different bit rates, ranging from a minimum of 0.3 bpp to a maximum of 2.2 bpp across the different images. The bit rate values were selected for each content separately (see Table 1) in such a way that there is a noticeable visual difference between images with different bit rates when they are displayed on the HDR monitor.

The compressed images were then cropped to 950 × 1080 pixel regions for the side-by-side subjective experiments (see Section 2.2 for details). The regions to crop were selected by expert viewers in such a way that the cropped versions are representative of the quality and the dynamic range of the original images. The red rectangles in Figure 1 show the corresponding cropped regions. The combination of downscaling and cropping was selected as a compromise, so that a meaningful part of an image can be shown on the HDR monitor. The objective quality metrics were computed on the cropped versions of the images.
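A minimal sketch of this preparation step (factor-of-two bicubic downscaling followed by a 950 × 1080 crop), assuming OpenCV is used to read and resize the Radiance .hdr files; the tools actually used by the authors are not specified, and the crop coordinates are content-specific placeholders.

```python
import cv2  # OpenCV loads Radiance .hdr files as float32 arrays

def prepare_stimulus(hdr_path, crop_x, crop_y, crop_w=950, crop_h=1080):
    img = cv2.imread(hdr_path, cv2.IMREAD_UNCHANGED)             # float32 HDR data
    h, w = img.shape[:2]
    # Downscale by a factor of two with bicubic interpolation.
    small = cv2.resize(img, (w // 2, h // 2), interpolation=cv2.INTER_CUBIC)
    # Crop the region selected by expert viewers.
    return small[crop_y:crop_y + crop_h, crop_x:crop_x + crop_w]
```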
2.2. Subjective evaluations

The experiments were conducted at the MMSPG test laboratory, which fulfills the recommendations for subjective evaluation of visual data issued by ITU-R [5]. The test room is equipped with a controlled lighting system with a 6500 K color temperature, and the color of all the background walls and curtains in the test area was mid grey. The laboratory setup is intended to ensure the reproducibility of the subjective test results by avoiding unintended influence of external factors.

To display the test stimuli, a 42-inch full HD (1920 × 1080) Dolby Research HDR RGB backlight dual modulation display (aka Pulsar) was used. The monitor has the following specifications: full Rec. 709 color gamut, 4000 cd/m² peak luminance, low black level (0.005 cd/m²), and 12 bits/color input with accurate and reliable reproduction of color and luminance. In the experiments, the luminance of the background behind the monitor was about 20 cd/m². The ambient illumination did not reflect directly off the display.

In every session, three subjects assessed the displayed images simultaneously. They were seated in one row, aligned with the center of the monitor, at a distance of 3.2 times the picture height, as suggested in [6].
The paired comparison evaluation methodology was selected for its high accuracy and reliability in constructing a scale of perceptual preferences. The image pairs were presented side by side to minimize visual working memory limitations. Since only one full HD 1920 × 1080 HDR monitor was available, each image was cropped to 950 × 1080 pixels (for details see Section 2.1), with 20 pixels of black border separating the two images. Subjects were asked to judge which image in a pair ('left' or 'right') has the better overall quality. The option 'same' was also included to avoid random preference selections. For each of the 5 contents, all possible combinations of the 4 bit rates were considered, i.e., 6 pairs per content, leading to a total of 5 × 6 = 30 paired comparisons for all contents.
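The comparison count follows directly from this design; a small sketch enumerating the pairs (the bit-rate labels are placeholders):

```python
from itertools import combinations

contents = ["BloomingGorse2", "CanadianFalls", "McKeesPub", "MtRushmore2", "WillyDesk"]
bitrates = ["R1", "R2", "R3", "R4"]   # the four per-content bit rates

# All unordered bit-rate pairs for every content.
pairs = [(c, a, b) for c in contents for a, b in combinations(bitrates, 2)]
print(len(pairs))   # 5 contents x C(4,2) = 5 x 6 = 30 paired comparisons
```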
Before the experiment, a consent form was handed to the subjects for signature and oral instructions were provided to explain their tasks. All subjects were screened for correct visual acuity and color vision using Snellen and Ishihara charts, respectively. A training session was organized using additional contents to allow subjects to familiarize themselves with the assessment procedure.

To reduce contextual effects, the display order of the stimuli was randomized by applying different permutations for each group of subjects, and special care was taken to ensure that the same content was not shown consecutively.
A total of 20 naïve subjects (13 females and 7 males) took part in the evaluation. They were between 20 and 34 years old, with an average age of 25.3 years.

The Thurstone Case V model [7] was used to convert the ratings from the ternary scale to continuous quality score values, which are equivalent to mean opinion scores (MOS), with ties counted as being halfway between the two preference options. For each content, the quality score values were converted to the range [1, 5] by mapping the lowest and highest quality score values to 1 and 5, respectively, as the lowest and highest bit rates were selected to be representative of the worst and best quality, respectively (see Section 2.1). The intermediate values were scaled proportionally.
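A minimal sketch of this scaling step, assuming the standard Case V solution (preference proportions converted to z-scores and averaged), with ties counted as half a vote for each side and a final linear mapping to [1, 5]; the authors' exact implementation details (e.g., the handling of unanimous preferences) are not specified.

```python
import numpy as np
from scipy.stats import norm

def thurstone_case_v(wins, ties):
    """wins[i, j]: times stimulus i was preferred over j; ties[i, j]: 'same' votes.

    Returns one quality scale value per stimulus (Thurstone Case V).
    """
    n = wins + wins.T + ties                          # total comparisons per pair
    p = np.where(n > 0, (wins + 0.5 * ties) / np.maximum(n, 1), 0.5)
    p = np.clip(p, 0.01, 0.99)                        # avoid infinite z-scores
    z = norm.ppf(p)                                   # inverse standard normal CDF
    np.fill_diagonal(z, 0.0)
    return z.mean(axis=1)                             # scale value of each stimulus

def to_mos_range(scores, lo=1.0, hi=5.0):
    """Linearly map the scale values of one content to [1, 5]."""
    s = np.asarray(scores, dtype=float)
    return lo + (hi - lo) * (s - s.min()) / (s.max() - s.min())
```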
3. OBJECTIVE QUALITY METRICS

In this study, the performance of a set of 13 full-reference objective metrics in predicting HDR image quality was assessed:
1. MSE: Mean Squared Error,
2. PSNR: Peak Signal-to-Noise Ratio,
3. SNR: Signal-to-Noise Ratio,
4. WSNR: Weighted Signal-to-Noise Ratio [8, 9],
5. SSIM: Structural Similarity index [10],
6. MS-SSIM: Multiscale SSIM index [10],
7. VIF: Visual Information Fidelity [11],
8. VIFp: Visual Information Fidelity pixel-based [11],
9. UQI: Universal Quality Index [12],
10. IFC: Image Fidelity Criterion [13],
11. FSIM: Feature Similarity Index [14],
12. HDR-VDP-2: High Dynamic Range Visible Difference Predictor [15],
13. CIEDE2000 color difference [16].

Table 2: Accuracy and monotonicity indexes for the different metrics.

             Luma component only          All components
Metric       PCC     SROCC   RMSE         PCC     SROCC   RMSE
MSE          0.8794  0.6935  0.7866       0.8778  0.6655  0.7909
PSNR         0.6591  0.5167  1.2369       0.6164  0.5533  1.2950
SNR          0.8794  0.7375  0.7829       0.7355  0.6352  1.1143
WSNR         0.8099  0.7589  0.9647       0.8785  0.7672  0.7858
SSIM         0.7580  0.7375  1.1185       0.8091  0.8352  1.0448
MS-SSIM      0.8651  0.7131  0.8311       0.8157  0.7176  0.9657
VIF          0.6740  0.5588  1.2163       0.4820  0.1346  1.4468
VIFp         0.7533  0.6871  1.0817       0.3504  0.2611  1.5408
UQI          0.8068  0.8077  0.9725       0.7851  0.7864  1.0189
IFC          0.8833  0.8032  0.7709       0.8256  0.8337  0.9281
FSIM         0.9043  0.8245  0.7021       0.7692  0.7818  1.0513
HDR-VDP-2    0.9337  0.8657  0.5912       0.9241  0.7866  0.6284
CIEDE2000    -       -       -            0.5096  0.5191  1.4174
Almost all of the objective metrics that were analyzed, except for CIEDE2000, are typically computed on the luma component only. In this study, all HDR images were converted to the Y'CbCr color space [17] and these metrics were applied to the components Y', Cb, and Cr separately. In this paper, the results of the metrics were computed in two different ways: on the luma component only and on all components, considering the average of the values computed on Y', Cb, and Cr. For the PSNR metric, the maximum value of the image after conversion to Y'CbCr was used as the peak value. For the HDR-VDP-2 metric, the parameters were set according to the setup of the subjective evaluations (see Section 2.2) and only the quality value was used. To compute the CIEDE2000 color difference, all HDR images were converted to the CIELAB color space using Banterle's HDR toolbox for MATLAB².
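A minimal sketch of this per-component evaluation, assuming a BT.709-style Y'CbCr conversion (the conversion of [17] is not reproduced here) and using PSNR as the example metric, with the peak taken as the maximum value of the converted reference image as described above; the choice of a single peak over all components is an assumption.

```python
import numpy as np

def ycbcr(rgb):
    """BT.709-style Y'CbCr from an R'G'B' array of shape (..., 3); an assumption, see [17]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  = 0.2126 * r + 0.7152 * g + 0.0722 * b
    cb = (b - y) / 1.8556
    cr = (r - y) / 1.5748
    return np.stack([y, cb, cr], axis=-1)

def psnr(ref, test, peak):
    mse = np.mean((ref - test) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def evaluate(ref_rgb, test_rgb):
    ref, test = ycbcr(ref_rgb), ycbcr(test_rgb)
    peak = ref.max()                                   # peak value after conversion
    per_component = [psnr(ref[..., k], test[..., k], peak) for k in range(3)]
    return {"luma_only": per_component[0],             # Y' component
            "all_components": float(np.mean(per_component))}
```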
4. PERFORMANCE INDEXES

The results of the subjective tests can be used as ground truth to evaluate how well the objective metrics estimate perceived quality. The result of executing a particular objective metric is an image quality rating (IQR), which is expected to be an estimate of the MOS corresponding to the compressed HDR image. To be compliant with the standard procedure for evaluating the performance of objective metrics [18], the following properties of the IQR estimation of MOS should be considered: accuracy, monotonicity, and consistency. Consistency estimation is based on the confidence intervals, which are computed assuming a standard distribution of the subjective scores. In this study, the Thurstone Case V model was used to convert the paired comparison ratings to equivalent MOS values (see Section 2.2). Confidence intervals can be estimated from the paired comparison ratings, but their nature is different from that of confidence intervals computed directly on a discrete or continuous rating scale. Therefore, only accuracy and monotonicity were considered.
First, a regression was fitted to each [IQR, MOS] data set using logistic fitting:

\mathrm{MOS}_p(\mathrm{IQR}) = a + \frac{b}{1 + \exp\left[c\,(\mathrm{IQR} - d)\right]}

where a, b, c, and d are the parameters of the fitting function.
Then, the Pearson linear correlation coefficient (PCC) and the root-mean-square error (RMSE) were computed between MOS_p and MOS to estimate accuracy of the IQR. To estimate monotonicity, the Spearman rank order correlation coefficient (SROCC) was computed between MOS_p and MOS.
The RMSE is defined as follows:

\mathrm{RMSE} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left(\mathrm{MOS}_i - \mathrm{MOS}_{p,i}\right)^2}

where N is the total number of points.
To determine whether the difference between two performance index values corresponding to two different metrics is statistically significant, a statistical test was performed according to [19].
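The statistical test of [19] is not reproduced here. A minimal sketch of the fitting and index computation described in this section, assuming SciPy; the initial parameter guess is a common heuristic rather than the authors' choice.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic(iqr, a, b, c, d):
    return a + b / (1.0 + np.exp(c * (iqr - d)))

def benchmark(iqr, mos):
    iqr, mos = np.asarray(iqr, float), np.asarray(mos, float)
    # Heuristic initial guess: offset/span of MOS, unit slope, centered on the IQR values.
    p0 = [mos.min(), mos.max() - mos.min(), 1.0, np.median(iqr)]
    params, _ = curve_fit(logistic, iqr, mos, p0=p0, maxfev=10000)
    mos_p = logistic(iqr, *params)

    pcc, _ = pearsonr(mos_p, mos)                    # accuracy
    srocc, _ = spearmanr(mos_p, mos)                 # monotonicity
    rmse = np.sqrt(np.sum((mos - mos_p) ** 2) / (len(mos) - 1))
    return pcc, srocc, rmse
```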
5. RESULTS
Table 2 reports the accuracy and monotonicity indexes, as defined in Section 4, for the different metrics computed on the luma component only and on all components. The fitting was applied to all contents at once. Results show that HDR-VDP-2, FSIM (luma only), IFC (luma only), SNR (luma only), MSE (luma only), and WSNR (all components) are among the best metrics, with a PCC above 0.87 and an RMSE below 0.79. On the other hand, results indicate that VIF, VIFp, and CIEDE2000 computed on all components perform the worst, with a PCC and SROCC below 0.52 and an RMSE above 1.4.

Fig. 2: Subjective versus objective results for (a) HDR-VDP-2, luma only; (b) HDR-VDP-2, all components; (c) FSIM, luma only; (d) IFC, luma only; (e) SNR, luma only; (f) MSE, luma only; (g) WSNR, all components; (h) IFC, all components.

In many benchmarks performed on LDR content, VIF(p) is often among the best metrics and shows lower content dependency than other metrics [19]. However, in this study, VIF(p) showed quite strong content dependency, which explains its low performance when considering all contents at once. As can be observed, PSNR also shows quite poor performance, with a PCC between 0.6 and 0.66 and an RMSE around 1.25. The low performance of PSNR may be due to the maximum possible pixel value used for computing PSNR, which is not well defined for HDR content.
Even though SSIM and MS-SSIM often correlate well with perceived quality, they are criticized by many researchers because it is hard to interpret their output values when compared to PSNR values. In most cases, the SSIM and MS-SSIM values only cover a very limited range, typically [0.8, 1], compared to the theoretical [0, 1] range. In this study, the SSIM and MS-SSIM values lie in the ranges [0.99997, 1] and [0.999997, 1], respectively. Therefore, the relative change between the worst and best qualities is less than 0.003% for SSIM and 0.0003% for MS-SSIM, which is almost imperceptible, especially for MS-SSIM. These findings suggest that SSIM and MS-SSIM should be adapted to cope with HDR images.
As can be observed, the performance of VIF, and especially VIFp, drops drastically when considering all components. To further understand whether there is a statistically significant difference between the performance of each metric when computed on the luma component only and when computed on all components, a statistical analysis was performed on the different performance indexes. Results show that there is no significant difference in performance between the two approaches for any of the metrics. However, because of the relatively low number of conditions (20 stimuli), general conclusions should not be drawn from these results. As HDR is often considered in combination with wide color gamut, it is expected that the fidelity of color reproduction will play a more important role for HDR than for LDR.
Figure 2 depicts the scatter plots of subjective versus objective results for some of the metrics considered in this study. The metrics that perform best according to the performance indexes exhibit a very abrupt transition from low to high quality. Such binary behavior is not well suited for objective quality metrics, which are expected to discriminate between several granularities of distortion. This finding implies that these metrics do not correlate well with human perception of visual quality, as the response of the HVS is expected to be smooth rather than abrupt, and that the performance indexes alone are not sufficient to select a good metric. On the other hand, IFC computed on all components performs lower but has a smoother transition between low and high quality. HDR-VDP-2 is the only metric considered in this study that was originally designed for HDR content. However, its performance is not significantly better than that of state-of-the-art metrics designed for LDR content. Overall, the results show that there is great room for improvement in predicting the perceived quality of HDR content.
