Performance evaluation of objective quality metrics for HDR image compression

TLDR
The performance of HDR-VDP is compared to that of PSNR and SSIM computed on perceptually encoded luminance values, when considering compressed HDR images, to show that these simpler metrics can be effectively employed to assess image fidelity for applications such as HDR image compression.
Abstract
Due to the much larger luminance and contrast characteristics of high dynamic range (HDR) images, well-known objective quality metrics, widely used for the assessment of low dynamic range (LDR) content, cannot be directly applied to HDR images in order to predict their perceptual fidelity. To overcome this limitation, advanced fidelity metrics, such as the HDR-VDP, have been proposed to accurately predict visually significant differences. However, their complex calibration may make them difficult to use in practice. A simpler approach consists in computing arithmetic or structural fidelity metrics, such as PSNR and SSIM, on perceptually encoded luminance values but the performance of quality prediction in this case has not been clearly studied. In this paper, we aim at providing a better comprehension of the limits and the potentialities of this approach, by means of a subjective study. We compare the performance of HDR-VDP to that of PSNR and SSIM computed on perceptually encoded luminance values, when considering compressed HDR images. Our results show that these simpler metrics can be effectively employed to assess image fidelity for applications such as HDR image compression.


Giuseppe Valenzise, Francesca De Simone, Paul Lauga, Frederic Dufaux
Institut Mines-Telecom, Telecom ParisTech, CNRS LTCI, Paris, France
Keywords: High dynamic range, quality assessment, image coding
1. INTRODUCTION
High dynamic range (HDR) content has been recently gaining momentum thanks to its ability to reproduce
a much wider gamut of luminance and contrast than traditional low dynamic range (LDR) formats. This has
motivated research towards novel HDR processing algorithms, including acquisition/generation [1] and
compression [2, 3] and, consequently, towards methods for assessing the quality of the processed results. In principle, the
most accurate way to evaluate image quality is to carry out extensive subjective test campaigns. However, this
is often impractical, especially when the number of parameters and testing conditions is large. In addition, the
feasibility of subjective testing in the case of HDR content is further reduced by the limited diffusion and the
high cost of HDR displays. This calls for the design of automatic and accurate objective quality metrics for HDR
content.
In this work, we focus on full-reference quality assessment, where the goal is to assess the perceptual fidelity of
a processed image with respect to its original (i.e., reference) version. This is the typical scenario, e.g., in image
compression, where a picture coded at a certain bitrate is compared to the uncompressed original. In the LDR
case, popular metrics, such as the Structural Similarity Index (SSIM) [4], are known to provide good predictions
of image quality, and even the criticized Peak Signal-to-Noise Ratio (PSNR) produces valid quality measures for
a given content and codec type [5]. A key advantage of these metrics is that they can be easily computed through
simple pixel operations on LDR images. This is partially due to the fact that LDR pixel values are gamma-corrected
in the sRGB color space [6], which not only compensates for the non-linear luminance response of
legacy CRT displays, but also accounts to some extent for the lower contrast sensitivity of the human visual system
(HVS) at dark luminance levels. In other words, the non-linearity of the sRGB color space provides a pixel
encoding which is approximately linear with respect to perception.
For HDR content, this no longer holds, since pixel values are proportional to the physical luminance of
the scene, while the HVS is sensitive to luminance ratios, as expressed by the Weber-Fechner law. In order to take
into account luminance masking and other complex aspects of the HVS, some metrics, such as the HDR-VDP [7, 8],
accurately model various stages of visual perception under a broad range of viewing conditions, so as to
predict and quantify precisely significant visual differences between images. These metrics can provide very good
approximations of human perception but in general require a delicate tuning of several parameters in order to be
computed, which limits their use in many practical applications. A simpler and more convenient approach is to
transform HDR values to perceptually uniform quantities and compute arithmetic or structural metrics, such as
the PSNR or the SSIM, on them. Typical encodings from HDR to perceptually linear values include the simple
logarithm, based on the Weber-Fechner law, or more sophisticated transfer functions such as the PU encoding [9].
These metrics are often used to evaluate HDR image and video compression performance [3, 10]; however, it is not
clear to what extent they can provide accurate estimates of the actual visual quality and, thus, whether they are
a valid alternative to more complex predictors based on HVS modeling.
Corresponding author: Giuseppe Valenzise. E-mail: giuseppe.valenzise@telecom-paristech.fr
Additional material available at http://perso.telecom-paristech.fr/~gvalenzi/download.htm
In this paper, we evaluate the performance of PSNR and SSIM applied to log- or PU-encoded HDR pictures
corrupted by one specific type of processing, i.e., image compression. Since PSNR and SSIM are widely used for
quality assessment of LDR images, in the following, we will refer to them as LDR metrics. We also analyze the
performance of the HDR-VDP algorithm (referred to as HDR-VDP-2 in the original paper of Mantiuk et al. [8]).
In terms of image compression, we consider three schemes, which are representative of the state of the art in
still image HDR content compression, to build a dataset of compressed images with different levels of distortion.
We use this dataset to conduct a subjective experiment and collect subjective mean opinion scores (MOS). Our
analysis of the results shows that subjective ratings are well correlated with LDR metrics applied to perceptually
linearized HDR values, and thus, that they can be consistently used to evaluate coding performance.
The rest of the paper is organized as follows. We review objective approaches to quality assessment of HDR
content in Section 2. The subjective test setup, including the generation of the test material, the test environment
and the test methodology, is described in Section 3. We present and discuss the results of our study in Section 4.
Finally, Section 5 concludes the paper.
2. OBJECTIVE METRICS FOR HDR CONTENT
Automatic quality assessment of low dynamic range pictures has been widely investigated in the past decades and
a number of full-reference metrics have been proposed for this purpose, including: metrics that model the HVS
(e.g., Sarnoff JND [11], VDP [12], Perceptual Distortion Metric [13]); feature-based algorithms [14]; application-specific
models (DCTune [15]); structural (SSIM [4] and its multiscale version [16]) and information-theoretic (e.g., VIF [17])
frameworks. For a comprehensive statistical evaluation of these algorithms on LDR content, the interested
reader can refer to, e.g., the work of Sheikh et al. [18]. At a higher level of abstraction, fidelity metrics can
be classified according to whether they include some modeling of the HVS (such as contrast and luminance
masking, adaptation mechanisms, etc.), or assume perceptually linearized luminance values. The latter is the
case of arithmetic measures such as the mean square error (MSE) and derived metrics, such as PSNR, as well
as of structural metrics, such as SSIM, which are largely used in fields such as image/video coding as they offer
a good trade-off between simplicity and accuracy.
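To illustrate how little machinery the arithmetic and structural metrics need, the following minimal sketch computes the MSE, the PSNR, and a single-window variant of the SSIM formula on values that are assumed to be already perceptually linearized. It is not the implementation used in this study: the function names are hypothetical, images are represented as flat Python lists, and the SSIM statistics are computed once over the whole image rather than over the local sliding windows of the standard index.

```python
import math

def mse(ref, test):
    """Mean squared error between two equally sized sequences of pixel values."""
    if len(ref) != len(test):
        raise ValueError("images must have the same number of pixels")
    return sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)

def psnr(ref, test, peak):
    """PSNR in dB; `peak` is the maximum value of the (encoded) signal."""
    err = mse(ref, test)
    if err == 0:
        return math.inf  # identical images: the PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / err)

def ssim_global(ref, test, peak, k1=0.01, k2=0.03):
    """SSIM formula evaluated once over the whole image (simplification:
    no sliding window, no Gaussian weighting)."""
    n = len(ref)
    mu_r = sum(ref) / n
    mu_t = sum(test) / n
    var_r = sum((r - mu_r) ** 2 for r in ref) / n
    var_t = sum((t - mu_t) ** 2 for t in test) / n
    cov = sum((r - mu_r) * (t - mu_t) for r, t in zip(ref, test)) / n
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2  # stabilizing constants
    numerator = (2 * mu_r * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_r ** 2 + mu_t ** 2 + c1) * (var_r + var_t + c2)
    return numerator / denominator
```

For 8-bit LDR data `peak` would be 255; for encoded HDR values the choice of `peak` is left to the caller.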
Metrics based on HVS models are conceived to work in a limited luminance range, i.e., that of standard
LCD or CRT displays, but need to be somehow extended to work in the full luminance range of HDR content.
In their HDR-VDP metric [8], Mantiuk et al. extended the Visual Difference Predictor of Daly [12] in order to take
into account a number of phenomena that occur in the early stages of the HVS, from intra-ocular light scatter
to contrast sensitivity across the full range of visible luminance (scotopic and photopic) and intra/inter-channel
contrast masking, which characterize the optical and retinal pathway. The test and reference pictures are
processed according to this path and the resulting images are decomposed through a multiband filter so as to
obtain perceptually linearized per-band contrast differences. These quantities are then either mapped
to per-pixel maps of visibility probability, or pooled to produce a single image quality correlate Q.
The pooling function has been selected and parametrized among several candidates by maximizing the Spearman
rank-order correlation over a large LDR image dataset (details are found in Section 6.1 of the original HDR-VDP
paper [8]). The motivation for this choice is twofold: on the one hand, it ensures the backward compatibility of the
metric with LDR content; on the other hand, it is the only feasible way to optimize the pooling function in the absence
of sufficiently large HDR datasets with subjective annotations. Recently, Narwaria et al. [19] computed optimized
pooling weights for HDR-VDP over a dataset of HDR compressed images. Their results show that tuning on
HDR data may improve HDR-VDP performance, but the gain is not statistically significant. Thus, in this work,
we resort to the default setting in the implementation of Mantiuk et al., which we parametrize to account for
the viewing conditions described in Section 3.2.
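The selection of a pooling function by maximizing the Spearman rank-order correlation can be sketched as follows. This is only an illustration of the selection principle, not the procedure of the HDR-VDP authors: the helper names, the candidate pooling functions, and the use of the absolute correlation (since pooled errors decrease as quality increases) are assumptions made here.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson linear correlation (inputs must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def select_pooling(per_band_errors, subjective_scores, candidates):
    """Return (name, |rho|) of the candidate pooling function whose pooled
    values are most rank-correlated with the subjective scores."""
    best_name, best_rho = None, -1.0
    for name, pool in candidates.items():
        pooled = [pool(bands) for bands in per_band_errors]
        rho = abs(spearman(pooled, subjective_scores))
        if rho > best_rho:
            best_name, best_rho = name, rho
    return best_name, best_rho
```

Here `per_band_errors` holds one list of per-band differences per image and `candidates` maps a label to a function that reduces such a list to a single number, e.g. `{"mean": lambda b: sum(b) / len(b), "max": max}`.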
A main disadvantage of HDR-VDP is that it requires a complex calibration of its optical and retinal
parameters. A known problem is, e.g., the setting of the peak sensitivity of the photoreceptors: higher values
decrease overall sensitivity to contrast. In many practical applications, and especially in the case of coding, it
is customary to compute simple arithmetic or structural metrics on perceptually linearized HDR values.
Perceptual linearization consists of a monotonically increasing mapping of HDR luminance to encoded pixel values.
Typical mapping functions include the logarithm, as it expresses the Weber-Fechner law on small luminance ranges,
or a gamma correction to account for Stevens' power law [20]. Aydin et al. [9] observed that the Weber ratio can
be assumed to be constant only for luminance values approximately greater than 500 cd/m², while for lower
luminance levels the detection threshold rises significantly. Thus, they computed a perceptually uniform (PU)
encoding in the form of a look-up table, which follows the Weber-Fechner law for luminance larger than
1000 cd/m², while at the same time maintaining backward compatibility with the sRGB encoding on the brightness
range of typical LDR displays. Notice that this mapping requires a rough characterization of the response
function of the HDR display in order to transform HDR pixel values into photometric quantities.
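The effect of perceptual linearization on an arithmetic metric can be illustrated with a small sketch. It covers only the logarithmic and power-law mappings; the PU encoding is not reproduced because its look-up table is not given here. The function names, the `floor` guard, the exponent 1/2.2, and the choice of the encoded range of the reference as PSNR peak are all assumptions made for illustration.

```python
import math

def log_encode(lum, floor=1e-4):
    """Logarithmic encoding (Weber-Fechner); `floor` guards against log(0)."""
    return [math.log10(max(v, floor)) for v in lum]

def gamma_encode(lum, exponent=1.0 / 2.2):
    """Power-law encoding (Stevens); the exponent is an illustrative choice."""
    return [v ** exponent for v in lum]

def psnr(ref, test, peak):
    """PSNR in dB between two equally sized lists of values."""
    err = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)

def encoded_psnr(ref_lum, test_lum, encode):
    """PSNR computed after applying `encode` to both luminance lists.
    The peak is taken as the encoded range of the reference (assumption)."""
    ref_e, test_e = encode(ref_lum), encode(test_lum)
    peak = max(ref_e) - min(ref_e)
    return psnr(ref_e, test_e, peak)

# Toy example: the same absolute error (0.5 cd/m^2) placed on the darkest
# or on the brightest pixel of a four-pixel "image".
ref = [1.0, 10.0, 100.0, 1000.0]
dark_error = [1.5, 10.0, 100.0, 1000.0]
bright_error = [1.0, 10.0, 100.0, 1000.5]
identity = lambda values: list(values)

linear_dark = encoded_psnr(ref, dark_error, identity)
linear_bright = encoded_psnr(ref, bright_error, identity)
log_dark = encoded_psnr(ref, dark_error, log_encode)
log_bright = encoded_psnr(ref, bright_error, log_encode)
```

On linear luminance the two distortions receive the same PSNR, whereas after logarithmic encoding the error on the dark pixel, which corresponds to a much larger luminance ratio, is penalized more heavily.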
Quality assessment for high dynamic range is quite a recent topic, hence there is a lack of extensive statistical
studies and image datasets to evaluate the performance of existing metrics. Perceptual linearization is supported by
psycho-visual arguments, but its effectiveness for quality assessment has only been conjectured or just showcased
through simple proofs of concept in the case of PU encoding. Additionally, to the authors’ knowledge, the only
study on the performance of HDR-VDP on HDR content is the recent work by Narwaria et al. [19], which considers
test material similar to that considered in this paper, i.e., compressed HDR images. The main difference with
respect to that study is that, there, the authors compared HDR-VDP with LDR metrics computed over HDR
pixel values without any perceptual linearization. Therefore, they arrive at the rather expected result that HDR-VDP
clearly outperforms LDR metrics and that LDR metrics cannot be used to evaluate HDR content. In this
work, we use instead perceptually linearized HDR values, obtained using either logarithm or PU encoding. Under
this setting, our results reverse the conclusions found previously and show that, with an appropriate perceptual
linearization, well-established metrics that work excellently for LDR image coding can be extended with similar
performance to HDR.
3. SUBJECTIVE TEST SETUP
3.1 Test material
3.1.1 Selection of original content
We analyzed several HDR images from the HDR photographic survey dataset [21] as potential test material to
be included in our experiment. The pictures were downscaled to match our display’s resolution,
equal to 1920 × 1080 pixels. We focused on high quality images where typical HDR acquisition artifacts such
as ghosting are not present. In order to select material with sufficiently diverse characteristics, we computed the
following three features for each image:
• The key $k \in [0, 1]$ of the picture [22], which gives a measure of the overall brightness of the scene and is
defined as:
$$k = \frac{\log L_{avg} - \log L_{min}}{\log L_{max} - \log L_{min}}, \qquad (1)$$
where the average luminance is computed as $\log L_{avg} = \sum_{ij} \log(L(i, j) + \delta) / N$, with $N$ being the number
of pixels in the image, $L(i, j)$ the luminance of pixel $(i, j)$, and $\delta$ a small offset to avoid the singularity
occurring for black pixels. $L_{min}$ and $L_{max}$ are the minimum and maximum relative luminance values of the
image, computed after excluding 1% of brightest and darkest pixels in order to make the method robust
against outliers.
The HDR-VDP implementation used in this work (version 2.1.3) is available at http://sourceforge.net/projects/hdrvdp/.

(a) “AirBellowsGap” (b) “LasVegasStore” (c) “MasonLake(1)”
(d) “RedwoodSunset” (e) “UpheavalDome”
Figure 1. HDR images used for the test (tone mapped version).
Figure 2. Characteristics of the selected HDR test images (contents are ordered for increasing value of each feature).
(a) Key (axis ticks from 0.35 to 0.7); content labels along the axis: LasVegasStore, AirBellowsGap, RedwoodSunset, UpheavalDome, MasonLake(1).
(b) Dynamic Range (axis ticks from 1 to 6); content labels along the axis: UpheavalDome, MasonLake(1), RedwoodSunset, AirBellowsGap, LasVegasStore.
(c) Spatial Information (axis ticks from 0.016 to 0.03); content labels along the axis: AirBellowsGap, RedwoodSunset, UpheavalDome, LasVegasStore, MasonLake(1).
• The image dynamic range $DR = L_{max} / L_{min}$, with $L_{min}$ and $L_{max}$ computed as above.
• The spatial perceptual information SI [23], which describes image spatial complexity and is related to coding
complexity. For an LDR image, spatial information is defined as the standard deviation of the output of a
Sobel operator applied to the image. The LDR image in our case is obtained using Reinhard’s photographic
tone reproduction operator [24].
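The three features listed above can be sketched in a few lines. The code below is only an illustration under stated assumptions: luminance is passed as a flat list of positive values, the 1% trimming is applied at each end of the sorted luminance values, `delta` is an arbitrary small offset, the Sobel response is taken as the gradient magnitude on interior pixels only, and the LDR input of `spatial_information` is assumed to be already tone mapped (the tone mapping step is not reproduced). All function names are hypothetical.

```python
import math
import statistics

def trimmed_min_max(lum, trim=0.01):
    """L_min and L_max after discarding the darkest and the brightest
    `trim` fraction of pixels (assumption: trimming at each end)."""
    ordered = sorted(lum)
    cut = int(len(ordered) * trim)
    kept = ordered[cut:len(ordered) - cut]
    return kept[0], kept[-1]

def image_key(lum, delta=1e-6):
    """Key k of Eq. (1): position of the log-average luminance between
    log L_min and log L_max."""
    l_min, l_max = trimmed_min_max(lum)
    log_avg = sum(math.log(v + delta) for v in lum) / len(lum)
    log_min, log_max = math.log(l_min + delta), math.log(l_max + delta)
    return (log_avg - log_min) / (log_max - log_min)

def dynamic_range(lum):
    """DR = L_max / L_min on the trimmed luminance values."""
    l_min, l_max = trimmed_min_max(lum)
    return l_max / l_min

def spatial_information(ldr):
    """Standard deviation of the Sobel gradient magnitude of a 2-D LDR
    image given as a list of rows (interior pixels only)."""
    h, w = len(ldr), len(ldr[0])
    magnitudes = []
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = (ldr[i - 1][j + 1] + 2 * ldr[i][j + 1] + ldr[i + 1][j + 1]
                  - ldr[i - 1][j - 1] - 2 * ldr[i][j - 1] - ldr[i + 1][j - 1])
            gy = (ldr[i + 1][j - 1] + 2 * ldr[i + 1][j] + ldr[i + 1][j + 1]
                  - ldr[i - 1][j - 1] - 2 * ldr[i - 1][j] - ldr[i - 1][j + 1])
            magnitudes.append(math.hypot(gx, gy))
    return statistics.pstdev(magnitudes)
```

Because the log-average is computed over all pixels while the extrema are trimmed, the key of an image with strong outliers may fall slightly outside [0, 1] in this sketch.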
Based on the semantic interest of each content and on the diversity of the considered characteristics, we
selected the five images shown in Fig. 1. Fig. 2 reports the content characteristics for the selected material. Two
additional images, shown in Fig. 3, were used for training the subjects.
3.1.2 Production of test material
(a) “DevilsBathtub” (b) “PaulBunyan”
Figure 3. Training images (tone mapped version).
We produced the test material by compressing the selected images using different codecs and coding conditions.
Due to the large amount of available LDR images, the most promising HDR image coding techniques are those
that offer backward compatibility with legacy LDR pictures. These schemes are based on a scalable approach [25],
where an LDR base layer is obtained by tone mapping the original HDR image and is then coded using available LDR
codecs such as JPEG or JPEG 2000. The tone mapping function is inverted at the decoder to reconstruct an
approximation of the original HDR image. Additionally, an enhancement layer that stores the differences (or ratios)
between the original and the inverse tone mapped images can also be transmitted as header information. In
addition to the usual settings to optimize in the LDR case (e.g., quantization parameters, transform size, etc.),
the choice of the tone mapping operator (TMO) is critical and can lead to different coding performance [26]. Instead
of using a tone mapping designed for rendering on an LDR display, we implemented the minimum-MSE TMO
proposed by Mai et al. [3], which is the global TMO that minimizes the reconstruction error after tone mapping
and inverse tone mapping.
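The backward-compatible scheme described above can be sketched with a toy round trip. This is a simplified illustration, not one of the codecs used in this study: a global logarithmic tone curve stands in for the minimum-MSE TMO, rounding to 8-bit code values stands in for the lossy LDR codec, the enhancement layer stores ratios without any compression, and all function names are hypothetical.

```python
import math

def tone_map(lum, l_min, l_max, levels=255):
    """Global log tone curve mapping luminance to integer code values
    (stand-in for the TMO; values are clipped to [l_min, l_max])."""
    lo, hi = math.log(l_min), math.log(l_max)
    codes = []
    for v in lum:
        clipped = min(max(v, l_min), l_max)
        codes.append(round(levels * (math.log(clipped) - lo) / (hi - lo)))
    return codes

def inverse_tone_map(codes, l_min, l_max, levels=255):
    """Invert the log tone curve to recover approximate luminance."""
    lo, hi = math.log(l_min), math.log(l_max)
    return [math.exp(lo + c / levels * (hi - lo)) for c in codes]

def encode(lum, l_min, l_max):
    """Return the LDR base layer and a ratio enhancement layer."""
    base = tone_map(lum, l_min, l_max)  # would be fed to an LDR codec
    recon = inverse_tone_map(base, l_min, l_max)
    ratios = [v / r for v, r in zip(lum, recon)]
    return base, ratios

def decode(base, l_min, l_max, ratios=None):
    """Reconstruct HDR luminance from the base layer, optionally refined
    by the enhancement layer."""
    recon = inverse_tone_map(base, l_min, l_max)
    if ratios is not None:
        recon = [r * q for r, q in zip(recon, ratios)]
    return recon
```

A legacy decoder would use only `base`, while an HDR decoder applies the inverse tone curve and, when available, the ratios of the enhancement layer.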
Thus, we consider the following three coding schemes:
• JPEG with minimum-MSE TMO (applied to each color channel) and no enhancement layer. We coded
each content with a JPEG quality factor QF ranging from 20 to 100, with a step of 5, producing a total
of 17 rate points × 5 contents = 85 images.
• JPEG 2000 with minimum-MSE TMO (applied to each color channel) and no enhancement layer. We
sampled 15 target bitrates in the range 0.06 bpp up to 1.75 bpp, giving a total of 75 images.
• JPEG XT [2], which is the new standardization initiative (ISO/IEC 18477) of JPEG for backward compatible
encoding of HDR images. JPEG XT produces an LDR bitstream compatible with the JPEG standard. There
are several proposals so far for coding the enhancement layer. In the reference implementation that we
adopted, the TMO is a content dependent linear map, followed by a gamma adaptation with exponent 2.2
to compensate for the sRGB gamma. Encoding of residuals is performed in a lossy manner in the spatial
domain. The base and enhancement layer quality is controlled by two quality factors, which take values
in [0, 100] and which we varied as follows: QF_b ∈ [40, 70, 90, 100] and QF_e ∈ [50, 75, 80, 90, 95], respectively.
This yields 100 coded images.
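The number of coded images produced by the three schemes can be checked by enumerating the coding conditions. The snippet below is a simple bookkeeping sketch; the variable names are illustrative, and for JPEG 2000 only the number of rate points is used, since the individual target bitrates are not listed in the text.

```python
from itertools import product

contents = ["AirBellowsGap", "LasVegasStore", "MasonLake(1)",
            "RedwoodSunset", "UpheavalDome"]

jpeg_quality_factors = list(range(20, 101, 5))  # 20, 25, ..., 100
jpeg2000_rate_points = 15                       # only the count is given
jpegxt_base_qf = [40, 70, 90, 100]              # QF_b
jpegxt_enh_qf = [50, 75, 80, 90, 95]            # QF_e

# One coded image per (content, coding condition) pair.
jpeg_images = list(product(contents, jpeg_quality_factors))
jpegxt_images = list(product(contents, jpegxt_base_qf, jpegxt_enh_qf))
jpeg2000_count = len(contents) * jpeg2000_rate_points

total = len(jpeg_images) + jpeg2000_count + len(jpegxt_images)
```

The totals are 85, 75, and 100 images respectively, i.e., 260 coded images overall.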
We screened all the 260 images, produced with the coding conditions described above, and we selected a
subset of them so as to satisfy the following requirements: i) all the levels of the MOS scale (described
in Section 3.3) should be equally represented; ii) all codecs and contents should be equally present; and iii) the
length of the actual test should be reasonable, i.e., it should not be longer than 20 minutes without pauses.
Distortions with the JPEG and JPEG 2000 codecs, when seen on the HDR display, are similar to analogous
distortions in LDR pictures. As for the JPEG XT codec, its distortion has characteristics similar to JPEG:
specifically, the noise has the same typical blocking structure; however, as QF_e increases, JPEG XT images
have fewer ringing artifacts than JPEG ones. Finally, we observed that, for some contents, even with the highest
considered bitrates, none of the used lossy coding schemes was able to produce imperceptible distortions (i.e., the
highest level of the considered MOS scale) on the HDR display. This confirmed the findings of Aydin et al. [9] that
distortions are much more perceptible on brighter screens. In those cases, we used the original (uncompressed)
content as test image. These samples were excluded from the performance analysis of the objective metrics
in order to avoid any bias due to the choice of an arbitrary maximum value for the PSNR. As a result of the
screening phase, we retained a set of 50 images to use for the test (details about the exact coding parameters
of the test dataset, as well as coded images, are available as supplementary material on the reference author’s
website).
The adopted JPEG XT reference implementation is JPEG document wg1n6639 in the JPEG document repository, version 0.8 (February 2014).

References
[4] Image quality assessment: from error visibility to structural similarity.
[16] Multiscale structural similarity for image quality assessment.
[17] Image information and visual quality.
[18] A statistical evaluation of recent full reference image quality assessment algorithms.
[20] S. S. Stevens, On the psychophysical law, 1957.