MULTI-SCALE STRUCTURAL SIMILARITY FOR IMAGE QUALITY ASSESSMENT

Zhou Wang¹, Eero P. Simoncelli¹ and Alan C. Bovik²

(Invited Paper)

¹ Center for Neural Sci. and Courant Inst. of Math. Sci., New York Univ., New York, NY 10003
² Dept. of Electrical and Computer Engineering, Univ. of Texas at Austin, Austin, TX 78712
Email: zhouwang@ieee.org, eero.simoncelli@nyu.edu, bovik@ece.utexas.edu
ABSTRACT

The structural similarity image quality paradigm is based on the assumption that the human visual system is highly adapted for extracting structural information from the scene, and therefore a measure of structural similarity can provide a good approximation to perceived image quality. This paper proposes a multi-scale structural similarity method, which supplies more flexibility than previous single-scale methods in incorporating the variations of viewing conditions. We develop an image synthesis method to calibrate the parameters that define the relative importance of different scales. Experimental comparisons demonstrate the effectiveness of the proposed method.
1. INTRODUCTION
Objective image quality assessment research aims to design quality measures that can automatically predict perceived image quality. These quality measures play important roles in a broad range of applications such as image acquisition, compression, communication, restoration, enhancement, analysis, display, printing and watermarking. The most widely used full-reference image quality and distortion assessment algorithms are peak signal-to-noise ratio (PSNR) and mean squared error (MSE), which do not correlate well with perceived quality (e.g., [1]–[6]).

Traditional perceptual image quality assessment methods are based on a bottom-up approach which attempts to simulate the functionality of the relevant early human visual system (HVS) components. These methods usually involve 1) a preprocessing process that may include image alignment, point-wise nonlinear transform, low-pass filtering that simulates eye optics, and color space transformation; 2) a channel decomposition process that transforms the image signals into different spatial frequency and orientation selective subbands; 3) an error normalization process that weights the error signal in each subband by incorporating the variation of visual sensitivity in different subbands, and the variation of visual error sensitivity caused by intra- or inter-channel neighboring transform coefficients; and 4) an error pooling process that combines the error signals in different subbands into a single quality/distortion value. While these bottom-up approaches can conveniently make use of many known psychophysical features of the HVS, it is important to recognize their limitations. In particular, the HVS is a complex and highly non-linear system, and the complexity of natural images is also very significant, but most models of early vision are based on linear or quasi-linear operators that have been characterized using restricted and simplistic stimuli. Thus, these approaches must rely on a number of strong assumptions and generalizations [4], [5]. Furthermore, as the number of HVS features has increased, the resulting quality assessment systems have become too complicated to work with in real-world applications, especially for algorithm optimization purposes.

Structural similarity provides an alternative and complementary approach to the problem of image quality assessment [3]–[6]. It is based on a top-down assumption that the HVS is highly adapted for extracting structural information from the scene, and therefore a measure of structural similarity should be a good approximation of perceived image quality. It has been shown that a simple implementation of this methodology, namely the structural similarity (SSIM) index [5], can outperform state-of-the-art perceptual image quality metrics. However, the SSIM index algorithm introduced in [5] is a single-scale approach. We consider this a drawback of the method because the right scale depends on viewing conditions (e.g., display resolution and viewing distance). In this paper, we propose a multi-scale structural similarity method and introduce a novel image synthesis-based approach to calibrate the parameters that weight the relative importance between different scales.
2. SINGLE-SCALE STRUCTURAL SIMILARITY
Let $\mathbf{x} = \{x_i \,|\, i = 1, 2, \cdots, N\}$ and $\mathbf{y} = \{y_i \,|\, i = 1, 2, \cdots, N\}$ be two discrete non-negative signals that have been aligned with each other (e.g., two image patches extracted from the same spatial location of the two images being compared), and let $\mu_x$, $\sigma_x^2$ and $\sigma_{xy}$ be the mean of $\mathbf{x}$, the variance of $\mathbf{x}$, and the covariance of $\mathbf{x}$ and $\mathbf{y}$, respectively. Approximately, $\mu_x$ and $\sigma_x$ can be viewed as estimates of the luminance and contrast of $\mathbf{x}$, and $\sigma_{xy}$ measures the tendency of $\mathbf{x}$ and $\mathbf{y}$ to vary together, and is thus an indication of structural similarity. In [5], the luminance, contrast and structure comparison measures were given as follows:
$$l(\mathbf{x}, \mathbf{y}) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad (1)$$

$$c(\mathbf{x}, \mathbf{y}) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad (2)$$

$$s(\mathbf{x}, \mathbf{y}) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}, \qquad (3)$$

where $C_1$, $C_2$ and $C_3$ are small constants given by

$$C_1 = (K_1 L)^2, \quad C_2 = (K_2 L)^2 \quad \text{and} \quad C_3 = C_2/2, \qquad (4)$$

L 2
c
1
(x, y)
s
1
(x, y)
signal 1
similarity
measure
L 2
c
2
(x, y)
s
2
(x, y)
L
c
M
(x, y)
s
M
(x, y)
...
...
l
M
(x, y)
2
L 2signal 2 L 2 L
...
2
Fig. 1. Multi-scale structural similarity measurement system. L: low-pass filtering; 2 : downsampling by 2.
respectively. $L$ is the dynamic range of the pixel values ($L = 255$ for 8 bits/pixel gray scale images), and $K_1 \ll 1$ and $K_2 \ll 1$ are two scalar constants. The general form of the Structural SIMilarity (SSIM) index between signals $\mathbf{x}$ and $\mathbf{y}$ is defined as:

$$\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = [l(\mathbf{x}, \mathbf{y})]^{\alpha} \cdot [c(\mathbf{x}, \mathbf{y})]^{\beta} \cdot [s(\mathbf{x}, \mathbf{y})]^{\gamma}, \qquad (5)$$

where $\alpha$, $\beta$ and $\gamma$ are parameters that define the relative importance of the three components. Specifically, we set $\alpha = \beta = \gamma = 1$, and the resulting SSIM index is given by

$$\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \qquad (6)$$

which satisfies the following conditions:

1. symmetry: $\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = \mathrm{SSIM}(\mathbf{y}, \mathbf{x})$;
2. boundedness: $\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) \le 1$;
3. unique maximum: $\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = 1$ if and only if $\mathbf{x} = \mathbf{y}$.

The universal image quality index proposed in [3] corresponds to the case of $C_1 = C_2 = 0$, and is therefore a special case of (6). The drawback of such a parameter setting is that when the denominator of Eq. (6) is close to 0, the resulting measurement becomes unstable. This problem has been solved successfully in [5] by adding the two small constants $C_1$ and $C_2$ (calculated by setting $K_1 = 0.01$ and $K_2 = 0.03$, respectively, in Eq. (4)).
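To make Eq. (6) concrete, the following is a minimal NumPy sketch (illustrative only, not the authors' code) that evaluates the SSIM index for a pair of aligned patches, with the constants computed as in Eq. (4):

```python
import numpy as np

def ssim_patch(x, y, L=255, K1=0.01, K2=0.03):
    """Single-scale SSIM index of Eq. (6) for two aligned patches (sketch)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den
```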
We apply the SSIM indexing algorithm for image quality assessment using a sliding window approach. The window moves pixel-by-pixel across the whole image space. At each step, the SSIM index is calculated within the local window. If one of the images being compared is considered to have perfect quality, then the resulting SSIM index map can be viewed as the quality map of the other (distorted) image. Instead of using an 8 × 8 square window as in [3], a smooth windowing approach is used for local statistics to avoid "blocking artifacts" in the quality map [5]. Finally, the mean SSIM index of the quality map is used to evaluate the overall image quality, as sketched below.
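The following is an illustrative sketch of this sliding-window computation, assuming a Gaussian weighting window for the local statistics (the specific smooth window is described in [5]; an 11×11 Gaussian with σ = 1.5 is a common choice):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ssim_map(img1, img2, sigma=1.5, L=255, K1=0.01, K2=0.03):
    """Local SSIM map via Gaussian-weighted local statistics (sketch)."""
    x, y = img1.astype(np.float64), img2.astype(np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    mu_x, mu_y = gaussian_filter(x, sigma), gaussian_filter(y, sigma)
    # local second moments minus squared local means, under the same window
    var_x = gaussian_filter(x * x, sigma) - mu_x ** 2
    var_y = gaussian_filter(y * y, sigma) - mu_y ** 2
    cov_xy = gaussian_filter(x * y, sigma) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den

def mean_ssim(img1, img2, **kw):
    """Overall quality score: the mean of the SSIM index map."""
    return ssim_map(img1, img2, **kw).mean()
```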
3. MULTI-SCALE STRUCTURAL SIMILARITY
3.1. Multi-scale SSIM index
The perceivability of image details depends on the sampling density of the image signal, the distance from the image plane to the observer, and the perceptual capability of the observer's visual system. In practice, the subjective evaluation of a given image varies when these factors vary. A single-scale method as described in the previous section may be appropriate only for specific settings. A multi-scale method is a convenient way to incorporate image details at different resolutions.
We propose a multi-scale SSIM method for image quality assessment whose system diagram is illustrated in Fig. 1. Taking the reference and distorted image signals as the input, the system iteratively applies a low-pass filter and downsamples the filtered image by a factor of 2. We index the original image as Scale 1, and the highest scale as Scale $M$, which is obtained after $M - 1$ iterations. At the $j$-th scale, the contrast comparison (2) and the structure comparison (3) are calculated and denoted as $c_j(\mathbf{x}, \mathbf{y})$ and $s_j(\mathbf{x}, \mathbf{y})$, respectively. The luminance comparison (1) is computed only at Scale $M$ and is denoted as $l_M(\mathbf{x}, \mathbf{y})$. The overall SSIM evaluation is obtained by combining the measurements at different scales using

$$\mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = [l_M(\mathbf{x}, \mathbf{y})]^{\alpha_M} \cdot \prod_{j=1}^{M} [c_j(\mathbf{x}, \mathbf{y})]^{\beta_j} \, [s_j(\mathbf{x}, \mathbf{y})]^{\gamma_j}. \qquad (7)$$
Similar to (5), the exponents $\alpha_M$, $\beta_j$ and $\gamma_j$ are used to adjust the relative importance of different components. This multi-scale SSIM index definition satisfies the three conditions given in the last section. It also includes the single-scale method as a special case. In particular, a single-scale implementation for Scale $M$ applies the iterative filtering and downsampling procedure up to Scale $M$, and only the exponents $\alpha_M$, $\beta_M$ and $\gamma_M$ are given non-zero values. To simplify parameter selection, we let $\alpha_j = \beta_j = \gamma_j$ for all $j$. In addition, we normalize the cross-scale settings such that $\sum_{j=1}^{M} \gamma_j = 1$. This makes different parameter settings (including all single-scale and multi-scale settings) comparable. The remaining job is to determine the relative values across different scales. Conceptually, this should be related to the contrast sensitivity function (CSF) of the HVS [7], which states that human visual sensitivity peaks at middle frequencies (around 4 cycles per degree of visual angle) and decreases along both the high- and low-frequency directions. However, the CSF cannot be directly used to derive the parameters in our system because it is typically measured at the visibility threshold level using simplified stimuli (sinusoids), but our purpose is to compare the quality of complex structured images at visible distortion levels.
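The pipeline of Fig. 1 and Eq. (7) can be sketched as follows (illustrative only: it reuses Gaussian-weighted local statistics, pools each scale by the map mean, and assumes a simple 2×2 average as the low-pass filter, which the paper does not specify). The per-scale exponents are the $\gamma_j$ values calibrated in Sec. 3.2 below.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def _l_and_cs(x, y, sigma=1.5, L=255, K1=0.01, K2=0.03):
    """Pooled luminance term and contrast-structure product, Eqs. (1)-(3)."""
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    C3 = C2 / 2
    mu_x, mu_y = gaussian_filter(x, sigma), gaussian_filter(y, sigma)
    var_x = np.maximum(gaussian_filter(x * x, sigma) - mu_x ** 2, 0.0)
    var_y = np.maximum(gaussian_filter(y * y, sigma) - mu_y ** 2, 0.0)
    sd_x, sd_y = np.sqrt(var_x), np.sqrt(var_y)
    cov = gaussian_filter(x * y, sigma) - mu_x * mu_y
    l = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    c = (2 * sd_x * sd_y + C2) / (var_x + var_y + C2)
    s = (cov + C3) / (sd_x * sd_y + C3)
    return l.mean(), (c * s).mean()

def ms_ssim(img1, img2, weights):
    """Multi-scale SSIM of Eq. (7) with alpha_j = beta_j = gamma_j (sketch).

    `weights` holds the per-scale exponents gamma_1..gamma_M; the pooled
    terms are assumed positive, as is typical for natural images.
    """
    x, y = img1.astype(np.float64), img2.astype(np.float64)
    M = len(weights)
    score = 1.0
    for j, w in enumerate(weights, start=1):
        l, cs = _l_and_cs(x, y)
        score *= cs ** w           # contrast and structure at every scale
        if j == M:
            score *= l ** w        # luminance only at the coarsest scale
        else:
            # low-pass filter, then downsample by 2 (2x2 average assumed)
            x = uniform_filter(x, size=2)[::2, ::2]
            y = uniform_filter(y, size=2)[::2, ::2]
    return score
```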
3.2. Cross-scale calibration
We use an image synthesis approach to calibrate the relative importance of different scales. In previous work, the idea of synthesizing images for subjective testing has been employed by the "synthesis-by-analysis" methods of assessing statistical texture models, in which the model is used to generate a texture with statistics matching an original texture, and a human subject then judges the similarity of the two textures [8]–[11]. A similar approach has also been used qualitatively in demonstrating quality metrics in [5], [12], though quantitative subjective tests were not conducted. These synthesis methods provide a powerful and efficient means of testing a model, and have the added benefit that the resulting images suggest improvements that might be made to the model [11].
scale(M)
distortion
level
(MSE)
1 2 3 4 5
Fig. 2. Demonstration of image synthesis approach for cross-scale
calibration. Images in the same row have the same MSE. Images in
the same column have distortions only in one specific scale. Each
subject was asked to select a set of images (one from each scale),
having equal quality. As an example, one subject chose the marked
images.
For a given original 8 bits/pixel gray scale test image, we synthesize a table of distorted images (as exemplified by Fig. 2), where each entry in the table is an image that is associated with a specific distortion level (defined by MSE) and a specific scale. Each of the distorted images is created using an iterative procedure, where the initial image is generated by randomly adding white Gaussian noise to the original image, and the iterative process employs a constrained gradient descent algorithm to search for the worst images in terms of SSIM measure while constraining the MSE to be fixed and restricting the distortions to occur only in the specified scale. We use 5 scales and 12 distortion levels (ranging from $2^3$ to $2^{14}$) in our experiment, resulting in a total of 60 images, as demonstrated in Fig. 2. Although the images in each row have the same MSE with respect to the original image, their visual quality is significantly different. Thus the distortions at different scales are of very different importance in terms of perceived image quality. We employ 10 original 64×64 images with different types of content (human faces, natural scenes, plants, man-made objects, etc.) in our experiment to create 10 sets of distorted images (a total of 600 distorted images).
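The initialization and the fixed-MSE constraint of this search can be sketched as follows (illustrative only; the gradient of SSIM restricted to a single scale is abstracted behind a hypothetical `ssim_gradient` callable, since the paper does not give those details):

```python
import numpy as np

def project_to_mse(orig, distorted, target_mse):
    """Rescale the distortion so the image lies on the fixed-MSE sphere."""
    diff = distorted - orig
    return orig + diff * np.sqrt(target_mse / np.mean(diff ** 2))

def synthesize_worst_case(orig, target_mse, ssim_gradient,
                          n_iters=100, step=0.1):
    """Constrained gradient descent toward the worst image at fixed MSE.

    `ssim_gradient(orig, img)` is a hypothetical callable returning the
    gradient of SSIM with respect to the image, already band-limited to
    the scale under study; it stands in for unspecified details.
    """
    rng = np.random.default_rng(0)
    # initialize by adding white Gaussian noise at the target MSE
    img = project_to_mse(orig, orig + rng.normal(size=orig.shape), target_mse)
    for _ in range(n_iters):
        img = img - step * ssim_gradient(orig, img)   # decrease SSIM
        img = project_to_mse(orig, img, target_mse)   # re-impose fixed MSE
    return img
```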
We gathered data for 8 subjects, including one of the authors. The other subjects had general knowledge of human vision but did not know the detailed purpose of the study. Each subject was shown the 10 sets of test images, one set at a time. The viewing distance was fixed to 32 pixels per degree of visual angle. The subject was asked to compare the quality of the images across scales and to select one image from each of the five scales (shown as columns in Fig. 2) that the subject believed to have the same quality. For example, one subject chose the images marked in Fig. 2 as having equal quality. The positions of the selected images in each scale were recorded and averaged over all test images and all subjects. In general, the subjects agreed with each other on each image more than they agreed with themselves across different images. These test results were normalized (to sum to one) and used to calculate the exponents in Eq. (7). The resulting parameters we obtained are $\beta_1 = \gamma_1 = 0.0448$, $\beta_2 = \gamma_2 = 0.2856$, $\beta_3 = \gamma_3 = 0.3001$, $\beta_4 = \gamma_4 = 0.2363$, and $\alpha_5 = \beta_5 = \gamma_5 = 0.1333$, respectively.
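With the `ms_ssim` sketch from Sec. 3.1, these calibrated exponents would enter as, for example (the image arrays here are hypothetical placeholders):

```python
# calibrated per-scale exponents (gamma_1 .. gamma_5 from the text above)
WEIGHTS = (0.0448, 0.2856, 0.3001, 0.2363, 0.1333)

# reference_image and distorted_image are assumed 2-D grayscale arrays
score = ms_ssim(reference_image, distorted_image, weights=WEIGHTS)
```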
4. TEST RESULTS
We test a number of image quality assessment algorithms using the LIVE database (available at [13]), which includes 344 JPEG and JPEG2000 compressed images (typically 768×512 or similar size). The bit rate ranges from 0.028 to 3.150 bits/pixel, which allows the test images to cover a wide quality range, from indistinguishable from the original image to highly distorted. The mean opinion score (MOS) of each image is obtained by averaging 13–25 subjective scores given by a group of human observers. Eight image quality assessment models are compared, including PSNR, the Sarnoff model (JNDmetrix 8.0 [14]), the single-scale SSIM index with $M$ equal to 1 through 5, and the proposed multi-scale SSIM index approach.
The scatter plots of MOS versus model predictions are shown in Fig. 3, where each point represents one test image, with its vertical and horizontal axes representing its MOS and the given objective quality score, respectively. To provide quantitative performance evaluation, we use the logistic function adopted in the video quality experts group (VQEG) Phase I FR-TV test [15] to provide a non-linear mapping between the objective and subjective scores. After the non-linear mapping, the linear correlation coefficient (CC), the mean absolute error (MAE), and the root mean squared error (RMS) between the subjective and objective scores are calculated as measures of prediction accuracy. The prediction consistency is quantified using the outlier ratio (OR), which is defined as the percentage of predictions outside the range of ±2 standard deviations. Finally, the prediction monotonicity is measured using the Spearman rank-order correlation coefficient (ROCC). Readers can refer to [15] for a more detailed description of these measures. The evaluation results for all the models being compared are given in Table 1.

Table 1. Performance comparison of image quality assessment models on the LIVE JPEG/JPEG2000 database [13]. SS-SSIM: single-scale SSIM; MS-SSIM: multi-scale SSIM; CC: non-linear regression correlation coefficient; ROCC: Spearman rank-order correlation coefficient; MAE: mean absolute error; RMS: root mean squared error; OR: outlier ratio.

Model            CC     ROCC   MAE    RMS    OR(%)
PSNR             0.905  0.901  6.53   8.45   15.7
Sarnoff          0.956  0.947  4.66   5.81   3.20
SS-SSIM (M=1)    0.949  0.945  4.96   6.25   6.98
SS-SSIM (M=2)    0.963  0.959  4.21   5.38   2.62
SS-SSIM (M=3)    0.958  0.956  4.53   5.67   2.91
SS-SSIM (M=4)    0.948  0.946  4.99   6.31   5.81
SS-SSIM (M=5)    0.938  0.936  5.55   6.88   7.85
MS-SSIM          0.969  0.966  3.86   4.91   1.16
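This evaluation protocol can be sketched as follows (illustrative only: the VQEG report allows several logistic variants, so a generic 4-parameter form is assumed here, and `mos_std` stands for the per-image standard deviation of the subjective scores):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic(x, b1, b2, b3, b4):
    """A 4-parameter logistic for objective-to-MOS mapping (one of several
    variants consistent with the VQEG report [15])."""
    return b1 / (1.0 + np.exp(-b2 * (x - b3))) + b4

def evaluate(objective, mos, mos_std):
    """CC, ROCC, MAE, RMS and OR after non-linear mapping (sketch)."""
    p0 = [mos.max() - mos.min(), 1.0, objective.mean(), mos.min()]
    params, _ = curve_fit(logistic, objective, mos, p0=p0, maxfev=10000)
    pred = logistic(objective, *params)
    cc = pearsonr(pred, mos)[0]              # prediction accuracy
    rocc = spearmanr(objective, mos)[0]      # prediction monotonicity
    mae = np.abs(pred - mos).mean()
    rms = np.sqrt(((pred - mos) ** 2).mean())
    outliers = np.abs(pred - mos) > 2 * mos_std
    return cc, rocc, mae, rms, 100.0 * outliers.mean()
```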
From both the scatter plots and the quantitative evaluation results, we see that the performance of the single-scale SSIM model varies with scale, and the best performance is given by the case of M=2. It can also be observed that the single-scale model tends to supply higher scores as the scale increases. This is not surprising because image coding techniques such as JPEG and JPEG2000 usually compress fine-scale details to a much higher degree than coarse-scale structures, and thus the distorted image "looks" more similar to the original image if evaluated at larger scales. Finally, for every one of the objective evaluation criteria, the multi-scale SSIM model outperforms all the other models, including the best single-scale SSIM model, suggesting that it strikes a meaningful balance between scales.
5. DISCUSSIONS
We propose a multi-scale structural similarity approach for image quality assessment, which provides more flexibility than single-scale approaches in incorporating the variations of image resolution and viewing conditions. Experiments show that, with appropriate parameter settings, the multi-scale method outperforms the best single-scale SSIM model as well as state-of-the-art image quality metrics.

In the development of top-down image quality models (such as structural similarity based algorithms), one of the most challenging problems is to calibrate the model parameters, which are rather "abstract" and cannot be directly derived from simple-stimulus subjective experiments as in the bottom-up models. In this paper, we used an image synthesis approach to calibrate the parameters that define the relative importance between scales. The improvement from single-scale to multi-scale methods observed in our tests suggests the usefulness of this novel approach. However, this approach is still rather crude. We are working on developing it into a more systematic approach that can potentially be employed in a much broader range of applications.
6. REFERENCES
[1] A. M. Eskicioglu and P. S. Fisher, "Image quality measures and their performance," IEEE Trans. Communications, vol. 43, pp. 2959–2965, Dec. 1995.
[2] T. N. Pappas and R. J. Safranek, "Perceptual criteria for image quality evaluation," in Handbook of Image and Video Proc. (A. Bovik, ed.), Academic Press, 2000.
[3] Z. Wang and A. C. Bovik, "A universal image quality index," IEEE Signal Processing Letters, vol. 9, pp. 81–84, Mar. 2002.
[4] Z. Wang, H. R. Sheikh, and A. C. Bovik, "Objective video quality assessment," in The Handbook of Video Databases: Design and Applications (B. Furht and O. Marques, eds.), pp. 1041–1078, CRC Press, Sept. 2003.
[5] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error measurement to structural similarity," IEEE Trans. Image Processing, vol. 13, Jan. 2004.
[6] Z. Wang, L. Lu, and A. C. Bovik, "Video quality assessment based on structural distortion measurement," Signal Processing: Image Communication, special issue on objective video quality metrics, vol. 19, Jan. 2004.
[7] B. A. Wandell, Foundations of Vision. Sinauer Associates, Inc., 1995.
[8] O. D. Faugeras and W. K. Pratt, "Decorrelation methods of texture feature extraction," IEEE Pat. Anal. Mach. Intell., vol. 2, no. 4, pp. 323–332, 1980.
[9] A. Gagalowicz, "A new method for texture fields synthesis: Some applications to the study of human vision," IEEE Pat. Anal. Mach. Intell., vol. 3, no. 5, pp. 520–533, 1981.
[10] D. Heeger and J. Bergen, "Pyramid-based texture analysis/synthesis," in Proc. ACM SIGGRAPH, pp. 229–238, Association for Computing Machinery, Aug. 1995.
[11] J. Portilla and E. P. Simoncelli, "A parametric texture model based on joint statistics of complex wavelet coefficients," Int'l J. Computer Vision, vol. 40, pp. 49–71, Dec. 2000.
[12] P. C. Teo and D. J. Heeger, "Perceptual image distortion," in Proc. SPIE, vol. 2179, pp. 127–141, 1994.
[13] H. R. Sheikh, Z. Wang, A. C. Bovik, and L. K. Cormack, "Image and video quality assessment research at LIVE," http://live.ece.utexas.edu/research/quality/.
[14] Sarnoff Corporation, "JNDmetrix Technology," http://www.sarnoff.com/products_services/video_vision/jndmetrix/.
[15] VQEG, "Final report from the video quality experts group on the validation of objective models of video quality assessment," Mar. 2000. http://www.vqeg.org/.

Fig. 3. Scatter plots of MOS versus model predictions (plots omitted; each panel plots MOS against the model's score for the JPEG and JPEG2000 images, together with a logistic-function fit). Each sample point represents one test image in the LIVE JPEG/JPEG2000 image database [13]. (a) PSNR; (b) Sarnoff model; (c)-(g) single-scale SSIM method for M = 1, 2, 3, 4 and 5, respectively; (h) multi-scale SSIM method.