What are the contributions mentioned in the paper "Performance evaluation of objective quality metrics for hdr image compression" ?

A simpler approach consists in computing arithmetic or structural fidelity metrics, such as PSNR and SSIM, on perceptually encoded luminance values but the performance of quality prediction in this case has not been clearly studied. In this paper, the authors aim at providing a better comprehension of the limits and the potentialities of this approach, by means of a subjective study. The authors compare the performance of HDR-VDP to that of PSNR and SSIM computed on perceptually encoded luminance values, when considering compressed HDR images.

What are the common terms used for quality assessment of HDR images?

Since PSNR and SSIM are widely used for quality assessment of LDR images, in the following, the authors will refer to them as LDR metrics.

How many pixels per degree were used to measure the quality of the stimulus?

Viewers participated individually in test sessions, sitting at a distance of approximately 1 meter, which corresponds to an angular resolution of about 40 pixels per degree.

What is the definition of spatial information for an LDR image?

For an LDR image, spatial information is defined as the standard deviation of the output of a Sobel operator applied to the image.

What are the two quality factors that control the base and enhancement layer quality?

The base and enhancement layer quality is controlled by two quality factors, which take values on [0, 100] and that the authors varied as follows: QFb ∈ [40, 70, 90, 100] and QFe ∈ [50, 75, 80, 90, 95], respectively.

What has motivated research towards HDR processing algorithms?

This has motivated research towards novel HDR processing algorithms, including acquisition/generation [1] and compression [2, 3] and, consequently, towards methods for assessing the quality of the processed results.

(Open Access) Performance evaluation of objective quality metrics for HDR image compression (2014) | Giuseppe Valenzise

Q: How many image quality factors are used in the coding of HDR?

The authors coded each content with a JPEG quality factor QF ranging from 20 to 100, with a step of 5, producing a total of 17 rate points × 5 contents = 85 images.

Q: How many images were retained for the test?

As a result of the screening phase, the authors retained a set of 50 images to use for the test (details about the exact coding parameters of the test dataset, as well as coded images, are available as supplementary material on the reference author’s website).

Performance evaluation of objective quality metrics for HDR image compression

Giuseppe Valenzise, Francesca De Simone, Paul Lauga, Frederic Dufaux

Institut Mines-Telecom, Telecom ParisTech, CNRS LTCI, Paris, France

ABSTRACT

Due to the much larger luminance and contrast characteristics of high dynamic range (HDR) images, well-known objective quality metrics, widely used for the assessment of low dynamic range (LDR) content, cannot be directly applied to HDR images in order to predict their perceptual fidelity. To overcome this limitation, advanced fidelity metrics, such as the HDR-VDP, have been proposed to accurately predict visually significant differences. However, their complex calibration may make them difficult to use in practice. A simpler approach consists in computing arithmetic or structural fidelity metrics, such as PSNR and SSIM, on perceptually encoded luminance values, but the performance of quality prediction in this case has not been clearly studied. In this paper, we aim at providing a better comprehension of the limits and the potentialities of this approach, by means of a subjective study. We compare the performance of HDR-VDP to that of PSNR and SSIM computed on perceptually encoded luminance values, when considering compressed HDR images. Our results show that these simpler metrics can be effectively employed to assess image fidelity for applications such as HDR image compression.

Keywords: High dynamic range, quality assessment, image coding

1. INTRODUCTION

High dynamic range (HDR) content has been recently gaining momentum thanks to its ability to reproduce a much wider gamut of luminance and contrast than traditional low dynamic range (LDR) formats. This has motivated research towards novel HDR processing algorithms, including acquisition/generation [1] and compression [2, 3] and, consequently, towards methods for assessing the quality of the processed results. In principle, the most accurate way to evaluate image quality is to carry out extensive subjective test campaigns. However, this is often impractical, especially when the number of parameters and testing conditions is large. In addition, the feasibility of subjective testing in the case of HDR content is further reduced by the limited diffusion and the high cost of HDR displays. This calls for the design of automatic and accurate objective quality metrics for HDR content.

In this work, we focus on full-reference quality assessment, where the goal is to assess the perceptual fidelity of a processed image with respect to its original (i.e., reference) version. This is the typical scenario, e.g., in image compression, where a picture coded at a certain bitrate is compared to the uncompressed original. In the LDR case, popular metrics, such as the Structural Similarity Index (SSIM), are known to provide good predictions of image quality, and even the criticized Peak Signal-to-Noise Ratio (PSNR) produces valid quality measures for a given content and codec type. A key advantage of these metrics is that they can be easily computed through simple pixel operations on LDR images. This is partially due to the fact that LDR pixel values are gamma-corrected in the sRGB color space, which not only compensates for the non-linear luminance response of legacy CRT displays, but also accounts to some extent for the lower contrast sensitivity of the human visual system (HVS) at dark luminance levels. In other words, the non-linearity of the sRGB color space provides a pixel encoding which is approximately linear with respect to perception.

In the case of HDR, this no longer holds, since pixel values are proportional to the physical luminance of the scene, while the HVS is sensitive to luminance ratios, as expressed by the Weber-Fechner law. In order to take into account luminance masking and other complex aspects of the HVS, some metrics, such as the HDR-VDP [7, 8], accurately model various stages of visual perception under a broad range of viewing conditions, so as to predict and precisely quantify significant visual differences between images. These metrics can provide very good approximations of human perception, but in general they require a delicate tuning of several parameters in order to be computed, which limits their use in many practical applications. A simpler and more convenient approach is to transform HDR values to perceptually uniform quantities and compute arithmetic or structural metrics, such as the PSNR or the SSIM, on them. Typical encodings from HDR to perceptually linear values include the simple logarithm, based on the Weber-Fechner law, or more sophisticated transfer functions such as the PU encoding. These metrics are often used to evaluate HDR image and video compression performance [3, 10]; however, it is not clear to what extent they can provide accurate estimates of the actual visual quality, and thus whether they are a valid alternative to more complex predictors based on HVS modeling.

Corresponding author: Giuseppe Valenzise. E-mail: giuseppe.valenzise@telecom-paristech.fr

Additional material available at http://perso.telecom-paristech.fr/gvalenzi/download.htm

In this paper, we evaluate the performance of PSNR and SSIM applied to log- or PU-encoded HDR pictures corrupted by one specific type of processing, i.e., image compression. Since PSNR and SSIM are widely used for quality assessment of LDR images, in the following, we will refer to them as LDR metrics. We also analyze the performance of the HDR-VDP algorithm (referred to as HDR-VDP-2 in the original paper of Mantiuk et al.). In terms of image compression, we consider three schemes, which are representative of the state of the art in still image HDR content compression, to build a dataset of compressed images with different levels of distortion. We use this dataset to conduct a subjective experiment and collect subjective mean opinion scores (MOS). Our analysis of the results shows that subjective ratings are well correlated with LDR metrics applied to perceptually linearized HDR values, and thus that these metrics can be consistently used to evaluate coding performance.

The rest of the paper is organized as follows. We review objective approaches to quality assessment of HDR content in Section 2. The subjective test setup, including the generation of the test material, the test environment and the test methodology, is described in Section 3. We present and discuss the results of our study in Section 4. Finally, Section 5 concludes the paper.

2. OBJECTIVE METRICS FOR HDR CONTENT

Automatic quality assessment of low dynamic range pictures has been widely investigated in the past decades and a number of full-reference metrics have been proposed for this purpose, including: metrics that model the HVS (e.g., Sarnoff JND, VDP, Perceptual Distortion Metric); feature-based algorithms; application-specific models (DCTune); structural (SSIM and its multiscale version) and information-theoretic (e.g., VIF) frameworks. For a comprehensive statistical evaluation of these algorithms on LDR content, the interested reader can refer to, e.g., the work of Sheikh et al. At a higher level of abstraction, fidelity metrics can be classified according to whether they include some modeling of the HVS (such as contrast and luminance masking, adaptation mechanisms, etc.), or assume perceptually linearized luminance values. The latter is the case of arithmetic measures such as the mean square error (MSE) and derived metrics, such as PSNR, as well as of structural metrics, such as SSIM, which are largely used in fields such as image/video coding as they offer a good trade-off between simplicity and accuracy.

Metrics based on HVS models are conceived to work in a limited luminance range, i.e., that of standard LCD or CRT displays, and need to be extended to work in the full luminance range of HDR content. In their HDR-VDP metric, Mantiuk et al. extended the Visual Difference Predictor of Daly in order to take into account a number of phenomena that occur in the early stages of the HVS and characterize the optical and retinal pathway, from intra-ocular light scatter to contrast sensitivity across the full range of visible luminance (scotopic and photopic) and intra/inter-channel contrast masking. The test and reference pictures are processed according to this path and the resulting images are decomposed through a multiband filter so as to obtain perceptually linearized per-band contrast differences. These quantities are then either mapped to per-pixel probability maps of visibility, or pooled to produce a single image quality correlate Q. The pooling function has been selected and parametrized among several candidates by maximizing Spearman rank-order correlation over a large LDR image dataset (details are found in Section 6.1 of the original HDR-VDP paper). The motivation of this choice is twofold: on one hand, it ensures the backward compatibility of the metric with LDR content; on the other hand, it is the only feasible way to optimize the pooling function in the absence of sufficiently large HDR datasets with subjective annotations. Recently, Narwaria et al. computed optimized pooling weights for HDR-VDP over a dataset of HDR compressed images. Their results show that tuning on HDR data may improve HDR-VDP performance, but the gain is not statistically significant. Thus, in this work, we resort to the default setting in the implementation of Mantiuk et al.∗, which we parametrize to account for the viewing conditions described in Section 3.2.
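As an illustration of the criterion mentioned above, the following is a minimal sketch of the Spearman rank-order correlation between objective scores and subjective ratings. It is not taken from the HDR-VDP implementation; the function names are hypothetical, and ties are handled by average ranks, which is a common convention assumed here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(scores, ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(scores), average_ranks(ratings))
```

Because only ranks matter, any monotonically increasing relation between metric values and ratings gives a correlation of 1, regardless of its shape.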

A main disadvantage of HDR-VDP is that it requires a complex calibration of its optical and retinal parameters. A known problem is, e.g., the setting of the peak sensitivity of the photoreceptors: higher values decrease overall sensitivity to contrast. In many practical applications, and especially in the case of coding, it is customary to compute simple arithmetic or structural metrics on perceptually linearized HDR values. Perceptual linearization consists in a monotonically increasing mapping of HDR luminance to encoded pixel values. Typical mapping functions include the logarithm, as it expresses the Weber-Fechner law on small luminance ranges, or a gamma correction to account for Stevens' power law. Aydin et al. observed that the Weber ratio can be assumed to be constant only for luminance values approximately greater than 500 cd/m², while for lower luminance levels the detection threshold rises significantly. Thus, they computed a perceptually uniform (PU) encoding in the form of a look-up table, which follows the Weber-Fechner law for luminance larger than 1000 cd/m², while at the same time maintaining backward compatibility with the sRGB encoding over the brightness range of typical LDR displays. Notice that this mapping requires a rough characterization of the response function of the HDR display in order to transform HDR pixel values into photometric quantities.
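The simplest form of this approach, PSNR on log-encoded luminance, can be sketched as follows. This is an illustrative sketch and not the evaluation code used in the paper: the function names, the offset `eps`, and in particular the choice of the PSNR peak (here the encoded range of the reference image) are assumptions. The PU look-up table is not reproduced.

```python
import math


def log_encode(luminance, eps=1e-6):
    """Map luminance values (cd/m^2) to log10 values; eps avoids log(0)."""
    return [math.log10(v + eps) for v in luminance]


def psnr(reference, test, peak):
    """PSNR in dB between two equal-length sequences of encoded values."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)


def log_psnr(ref_luminance, test_luminance):
    """PSNR computed on log-encoded luminance.

    Assumption: the peak is the range of the encoded reference image.
    """
    ref = log_encode(ref_luminance)
    test = log_encode(test_luminance)
    peak = max(ref) - min(ref)
    return psnr(ref, test, peak)
```

With this encoding, scaling every pixel by the same ratio produces the same error in dark and bright regions, which is the behaviour the Weber-Fechner law suggests.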

Quality assessment for high dynamic range is quite a recent topic, hence there is a lack of extensive statistical studies and image datasets to evaluate the performance of existing metrics. Perceptual linearization is supported by psycho-visual arguments, but its effectiveness for quality assessment has only been conjectured or just showcased through simple proofs of concept in the case of PU encoding. Additionally, to the authors' knowledge, the only study on the performance of HDR-VDP on HDR content is the recent work by Narwaria et al., which considers test material similar to that considered in this paper, i.e., compressed HDR images. The main difference with respect to that study is that, there, the authors compared HDR-VDP with LDR metrics computed over HDR pixel values without any perceptual linearization. Therefore, they arrive at the rather expected result that HDR-VDP clearly outperforms LDR metrics and that LDR metrics cannot be used to evaluate HDR content. In this work, we use instead perceptually linearized HDR values, obtained using either logarithm or PU encoding. Under this setting, our results reverse the conclusions found previously and show that, with an appropriate perceptual linearization, well-established metrics that work excellently for LDR image coding can be extended with similar performance to HDR.

3. SUBJECTIVE TEST SETUP

3.1 Test material

3.1.1 Selection of original content

We analyzed several HDR images from the HDR photographic survey dataset as potential test material to be included in our experiment. The pictures were downscaled to meet our display's resolution, equal to 1920 × 1080 pixels. We focused on high quality images where typical HDR acquisition artifacts such as ghosting are not present. In order to select material with sufficiently diverse characteristics, we computed the following three features for each image:

• The key k ∈ [0, 1] of the picture, which gives a measure of the overall brightness of the scene and is defined as:

\[ k = \frac{\log L_{\mathrm{avg}} - \log L_{\min}}{\log L_{\max} - \log L_{\min}}, \tag{1} \]

where the average luminance is computed as $\log L_{\mathrm{avg}} = \sum_{i,j} \log(L(i,j) + \delta) / N$, with $N$ being the number of pixels in the image, $L(i,j)$ the luminance of pixel $(i,j)$, and $\delta$ a small offset to avoid the singularity occurring for black pixels. $L_{\min}$ and $L_{\max}$ are the minimum and maximum relative luminance values of the image, computed after excluding 1% of the brightest and darkest pixels in order to make the method robust against outliers.

∗ Available at http://sourceforge.net/projects/hdrvdp/ (version 2.1.3).

(a) “AirBellowsGap” (b) “LasVegasStore” (c) “MasonLake(1)”

(d) “RedwoodSunset” (e) “UpheavalDome”

Figure 1. HDR images used for the test (tone mapped version).

[Figure 2 panels: (a) Key, (b) Dynamic Range, and a third, unlabeled panel.]

Figure 2. Characteristics of the selected HDR test images (contents are ordered for increasing value of each feature).

• The image dynamic range $DR = L_{\max} / L_{\min}$, with $L_{\min}$ and $L_{\max}$ computed as above.

• The spatial perceptual information SI, which describes image spatial complexity and is related to coding complexity. For an LDR image, spatial information is defined as the standard deviation of the output of a Sobel operator applied to the image. The LDR image in our case is obtained using Reinhard's photographic tone reproduction operator.
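The three features above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, luminance is passed as a flat list of positive values, the natural logarithm is used (the base cancels in the key), and the LDR input to the spatial information function is assumed to be already tone mapped (Reinhard's operator is not implemented here). The Sobel output is taken as the gradient magnitude over interior pixels.

```python
import math


def robust_min_max(luminance, trim=0.01):
    """Min and max after dropping the `trim` fraction of darkest and brightest pixels."""
    ordered = sorted(luminance)
    cut = int(len(ordered) * trim)
    kept = ordered[cut:len(ordered) - cut]
    return kept[0], kept[-1]


def image_key(luminance, delta=1e-6, trim=0.01):
    """Key k of Eq. (1); assumes L_min > 0 and L_max > L_min after trimming."""
    l_min, l_max = robust_min_max(luminance, trim)
    log_avg = sum(math.log(v + delta) for v in luminance) / len(luminance)
    return (log_avg - math.log(l_min)) / (math.log(l_max) - math.log(l_min))


def dynamic_range(luminance, trim=0.01):
    """Dynamic range DR = L_max / L_min, with the same robust extrema."""
    l_min, l_max = robust_min_max(luminance, trim)
    return l_max / l_min


def spatial_information(ldr):
    """Standard deviation of the Sobel gradient magnitude of a 2-D LDR image.

    `ldr` is a list of rows; only interior pixels are evaluated.
    """
    height, width = len(ldr), len(ldr[0])
    magnitudes = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (ldr[y - 1][x + 1] + 2 * ldr[y][x + 1] + ldr[y + 1][x + 1]
                  - ldr[y - 1][x - 1] - 2 * ldr[y][x - 1] - ldr[y + 1][x - 1])
            gy = (ldr[y + 1][x - 1] + 2 * ldr[y + 1][x] + ldr[y + 1][x + 1]
                  - ldr[y - 1][x - 1] - 2 * ldr[y - 1][x] - ldr[y - 1][x + 1])
            magnitudes.append(math.hypot(gx, gy))
    mean = sum(magnitudes) / len(magnitudes)
    return math.sqrt(sum((m - mean) ** 2 for m in magnitudes) / len(magnitudes))
```

For example, an image whose luminance values are spread evenly on a logarithmic scale between its extrema has a key close to 0.5, while a mostly dark scene with a few highlights has a key close to 0.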

Based on the semantic interest of each content and on the diversity of the considered characteristics, we selected the five images shown in Fig. 1. Fig. 2 reports the content characteristics for the selected material. Two additional images, shown in Fig. 3, were used for training the subjects.

3.1.2 Production of test material

We produced the test material by compressing the selected images using different codecs and coding conditions. Due to the large body of available LDR images, the most promising HDR image coding techniques are those that offer backward compatibility with legacy LDR pictures. These schemes are based on a scalable approach, where an LDR base layer is obtained by tone mapping the original HDR and is then coded using available LDR codecs such as JPEG or JPEG 2000. The tone mapping function is inverted at the decoder to reconstruct an approximation of the original HDR. Additionally, an enhancement layer that stores the differences (or ratios) between the original and the inverse tone mapped images can also be transmitted as header information. In addition to the usual settings to optimize in the LDR case (e.g., quantization parameters, transform size, etc.), the choice of the tone mapping operator (TMO) is critical and can lead to different coding performance. Instead of using a tone mapping designed for rendering on an LDR display, we implemented the minimum-MSE TMO proposed by Mai et al., which is the global TMO that minimizes the reconstruction error after tone mapping and inverse tone mapping.

(a) “DevilsBathtub” (b) “PaulBunyan”

Figure 3. Training images (tone mapped version).

Thus, we consider the following three coding schemes:

• JPEG with minimum-MSE TMO (applied to each color channel) and no enhancement layer. We coded each content with a JPEG quality factor QF ranging from 20 to 100, with a step of 5, producing a total of 17 rate points × 5 contents = 85 images.

• JPEG 2000 with minimum-MSE TMO (applied to each color channel) and no enhancement layer. We sampled 15 target bitrates in the range 0.06 bpp up to 1.75 bpp, giving a total of 75 images.

• JPEG XT, which is the new standardization initiative (ISO/IEC 18477) of JPEG for backward compatible encoding of HDR images. JPEG XT produces an LDR bitstream compatible with the JPEG standard. There are several proposals so far for coding the enhancement layer. In the reference implementation that we adopted†, the TMO is a content dependent linear map, followed by a gamma adaptation with exponent 2.2 to compensate for the sRGB gamma. Encoding of residuals is performed in a lossy manner in the spatial domain. The base and enhancement layer quality is controlled by two quality factors, which take values on [0, 100] and that we varied as follows: QFb ∈ [40, 70, 90, 100] and QFe ∈ [50, 75, 80, 90, 95], respectively. This yields 100 coded images.
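The image counts for the three schemes follow from enumerating the coding conditions, as the short check below shows. The variable names are illustrative; the individual JPEG 2000 bitrates are not listed in the text, so only their number is used.

```python
# Five test contents (Fig. 1).
contents = ["AirBellowsGap", "LasVegasStore", "MasonLake(1)",
            "RedwoodSunset", "UpheavalDome"]

# JPEG: quality factor from 20 to 100 in steps of 5 (17 rate points).
jpeg_quality_factors = list(range(20, 101, 5))

# JPEG 2000: 15 target bitrates between 0.06 and 1.75 bpp (values not listed).
num_jpeg2000_bitrates = 15

# JPEG XT: every pair of base and enhancement layer quality factors.
xt_base_qf = [40, 70, 90, 100]
xt_enhancement_qf = [50, 75, 80, 90, 95]

num_jpeg = len(jpeg_quality_factors) * len(contents)
num_jpeg2000 = num_jpeg2000_bitrates * len(contents)
num_jpeg_xt = len(xt_base_qf) * len(xt_enhancement_qf) * len(contents)
total_images = num_jpeg + num_jpeg2000 + num_jpeg_xt

print(num_jpeg, num_jpeg2000, num_jpeg_xt, total_images)
```

The JPEG XT count of 100 corresponds to 4 × 5 = 20 quality factor pairs for each of the 5 contents.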

We screened all 260 images produced with the coding conditions described above, and we selected a subset of them so as to satisfy the following requirements: i) all the levels of the MOS scale (described in Section 3.3) should be equally represented; ii) all codecs and contents should be equally present; and iii) the length of the actual test should be reasonable, i.e., it should not be longer than 20 minutes without pauses. Distortions with the JPEG and JPEG 2000 codecs, when seen on the HDR display, are similar to analogous distortions in LDR pictures. As for the JPEG XT codec, its distortion has characteristics similar to JPEG: specifically, the noise has the same typical blocking structure; however, as QF increases, JPEG XT images have fewer ringing artifacts than JPEG ones. Finally, we observed that, for some contents, even with the highest considered bitrates, none of the lossy coding schemes used was able to produce imperceptible distortions (i.e., the highest level of the considered MOS scale) on the HDR display. This confirmed the findings of Aydin et al. that distortions are much more perceptible on brighter screens. In those cases, we used the original (uncompressed) content as test image. These samples were excluded from the performance analysis of the objective metrics in order to avoid any bias due to the choice of an arbitrary maximum value for the PSNR. As a result of the screening phase, we retained a set of 50 images to use for the test (details about the exact coding parameters of the test dataset, as well as coded images, are available as supplementary material on the reference author's website).

† JPEG document wg1n6639 in the JPEG document repository, version 0.8 (February 2014).


References

Image quality assessment: from error visibility to structural similarity

Multiscale structural similarity for image quality assessment

On the psychophysical law.

Image information and visual quality

A Statistical Evaluation of Recent Full Reference Image Quality Assessment Algorithms
