
Efficient Automatic Detection of 3D Video Artifacts
Mohan Liu¹, Ioannis Mademlis², Patrick Ndjiki-Nya¹, Jean-Charles Le Quintrec³, Nikos Nikolaidis², Ioannis Pitas²

¹ Interactive Media - Human Factors Department, Fraunhofer Institute for Telecommunication - HHI
Einsteinufer 37, 10629 Berlin, Germany
{mohan.liu, patrick.ndjiki-nya}@hhi.fraunhofer.de

² Department of Informatics, Aristotle University of Thessaloniki - AUTH
Box 451, 54124 Thessaloniki, Greece
{imademlis, nikolaid, pitas}@aiia.csd.auth.gr

³ ARTE G.E.I.E., Association Relative à la Télévision Européenne
4 Quai Du Chanoine Winterer CS 20035, 67080 Strasbourg Cedex, France
jean-charles.lequintrec@arte.tv
Abstract—This paper summarizes some common artifacts in stereo video content. These artifacts lead to a poor or even uncomfortable 3D viewing experience. Efficient approaches for detecting three typical artifacts, sharpness mismatch, synchronization mismatch and stereoscopic window violation, are presented in detail. Sharpness mismatch is estimated by measuring the width deviations of edge pairs in depth planes. Synchronization mismatch is detected based on the motion inconsistencies of feature points between the stereoscopic channels within a short time frame. Stereoscopic window violation is detected, using connected component analysis, when objects hit the vertical frame boundaries while being in front of the virtual screen. For the experiments, test sequences were created in a professional studio environment and state-of-the-art metrics were used to evaluate the proposed approaches. The experimental results show that our algorithms are considerably robust in detecting 3D defects.
I. INTRODUCTION
Three-dimensional (3D) videos are not only a big success in cinemas, but have also entered ordinary households. The 3D experience is strongly tied to the quality of the 3D content. Thus, quality assessment of 3D videos has become a growing research topic. In comparison to two-dimensional (2D) image quality assessment, 3D quality additionally depends on depth perception and visual comfort. Although techniques for automatic quality assessment of 2D images have been extensively developed for years, prior research has shown that 2D quality metrics cannot be directly used to estimate 3D quality features [1].

There are two major ways to create 3D sequences: filming with stereo cameras and converting from 2D videos using depth maps. Quality control is an important task in such workflows. Although 3D quality assessment has been widely studied for several years, there is still no formal definition of 3D defects. In the MSU project [2], eight measures for 3D artifacts are proposed. Similarly, fifteen 3D quality issues have been identified in the Certifi3D project [3].

This paper proposes offline automatic analysis methods for the detection and assessment of 3D artifacts. Three important artifacts, i.e. Sharpness Mismatch (SM), SYNchronization Mismatch (SYNM) and Stereoscopic Window Violation (SWV), are further detailed. The proposed SM detection framework is based on measuring the width deviations between edge pairs in valid depth planes. SYNM is estimated based on statistics of motion inconsistencies of object feature points across the stereoscopic channels. SWVs happen when objects appearing in front of the virtual screen hit the vertical frame boundaries; connected component analysis is used for detecting them. A robust disparity estimation algorithm [4], which computes Disparity Maps (DM) in the horizontal and vertical directions without rectification, is integrated into the proposed algorithms. The remainder of this paper is organized as follows: Section 2 describes the proposed approaches in detail; Section 3 discusses the experimental results; Section 4 concludes the paper and describes future work.
II. PROPOSED APPROACHES
For observers, stereo 3D artifacts are not only undesirable but sometimes also painful. Common defects in real stereo 3D videos include (similar definitions of some defects are also introduced in [2] and [3]):
• Vertical misalignment: For the depth illusion, vertical disparity is unwanted. Imperfect horizontal alignment of the stereo cameras can cause this defect.
• Sharpness mismatch: Sharpness mismatch can be caused, among others, by focus/aperture setup errors, inconsistent lighting, compression or denoising.
• Colorimetric mismatch: Common causes of colorimetric mismatches are different points of view, changing light conditions, a malfunctioning or non-calibrated acquisition system, or even bad color grading in post-processing.
• Synchronization mismatch: Non-genlocked cameras or bad post-processing can cause asynchronism artifacts between the stereo channels.
• Hyper divergence/convergence: Excessive positive or negative parallaxes on inappropriate viewing devices can lead to these artifacts.
• Cross-talk level: An artifact caused by imperfect view separation, such that one view can be partially seen in the other view.
• Stereoscopic window violation: Objects that appear in front of the virtual screen in theatre space and hit the left or right frame boundary cause retinal rivalry, which is erroneously interpreted by the viewer as occlusion.
• Bent window effect: Sometimes, objects appearing in front of the virtual screen in theatre space extend vertically across the entire frame and hit both the top and bottom frame boundaries. This is interpreted by the brain as an occlusion cue, causing the perception of the stereoscopic window as being bent towards the viewer.
• Depth jump cut: During editing, video cuts between two shots with very different average depth cause a temporary loss of the viewer's 3D perception.
There are also other, less common defects, e.g. view reversal and reflections, as well as 2D-to-3D conversion defects, e.g. depth mismatch and visual mismatch. In this paper, techniques developed for sharpness mismatch, synchronization mismatch and stereoscopic window violation detection are presented. There is no significant connection among these three artifacts.
A. Disparity map correction
The automatically estimated DMs are usually noisy and must be corrected before further use. A valid disparity mapping in the horizontal direction from the Left view to the Right view (L2R) can be defined as

|DM_{L2R}(i, j) − DM_{R2L}(i, j + DM_{L2R}(i, j))| ≤ δ,   (1)

where δ denotes the disparity estimation error tolerance and (i, j) denotes the pixel coordinates in the disparity map. The validation of disparity maps from the Right view to the Left view (R2L) is analogous to eq. 1.
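
As an illustration, a minimal NumPy sketch of this consistency check follows; the function name, the array representation of the maps and the border handling are our assumptions, not part of the paper.

```python
import numpy as np

def valid_disparity_mask(dm_l2r, dm_r2l, delta=1):
    """Left-right consistency check of eq. 1 (sketch).

    dm_l2r, dm_r2l: integer horizontal disparity maps, shape (H, W).
    Returns a boolean mask of pixels whose L2R disparity is confirmed
    by the R2L map within the tolerance delta.
    """
    h, w = dm_l2r.shape
    jj = np.arange(w)[None, :] + dm_l2r   # j + DM_L2R(i, j)
    inside = (jj >= 0) & (jj < w)         # mappings leaving the frame are invalid
    jj = np.clip(jj, 0, w - 1)
    ii = np.arange(h)[:, None]
    # NOTE: assumes both maps are stored with the same sign convention
    diff = np.abs(dm_l2r - dm_r2l[ii, jj])
    return inside & (diff <= delta)
```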
B. Sharpness mismatch
The proposed approach estimates the SM by analyzing the width deviations of corresponding edge pairs. According to epipolar geometry, the edge widths of edge pairs in a depth plane are consistent between the LR views when the focuses of the stereo cameras are well calibrated. SM usually leads to a cross-talk/ghosting effect and unexpected blurriness, which can impair the 3D experience for observers.

We use Sobel filters to extract the edge pixels E. They are further segmented into depth planes based on the estimated disparities as

Ẽ_d(i, j) = E(i, j) & (DM(i, j) = d),   (2)

where d denotes an estimated disparity value.
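
A compact realization of this segmentation, under the assumption of integer-valued disparity maps and a hypothetical gradient-magnitude threshold (the paper does not specify these details), could look as follows:

```python
import cv2
import numpy as np

def edges_per_depth_plane(gray, dm, edge_thresh=64.0):
    """Sobel edge pixels E segmented by disparity value d (eq. 2)."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    edges = np.hypot(gx, gy) > edge_thresh            # E(i, j)
    # one boolean mask E~_d per estimated disparity value d
    return {int(d): edges & (dm == d) for d in np.unique(dm)}
```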
Only the vertical edge widths are measured in this work, since the major disparities occur in the horizontal direction. For each edge pixel, a pixel sequence centered at the target edge pixel is selected. The width of a pixel sequence is set to 64 pixels, motivated by human visual acuity: the fovea region covers about 2% of the visual angle [5], while the central visual field covers approximately 30° [5]. On a high-definition (HD) image viewed at a comfortable distance, the width of such an activity region can be approximated as 64 pixels. The method proposed in [6] is used to measure the edge width by locating the pixel positions of the local minimum and the local maximum of luminance intensity centered on the target edge pixel within the allocated pixel sequence. The edge width is then computed as the Euclidean distance between the positions of the local minimum and the local maximum pixels.
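
The following sketch measures one such edge width; the exact extremum localization of [6] is simplified here (global extrema of the window rather than the nearest local ones), and the 64-pixel window size follows the description above.

```python
import numpy as np

def edge_width(row, j, half=32):
    """Approximate width of a vertical edge at column j of one image row.

    A 64-pixel luminance sequence centred on the edge pixel is taken,
    and the distance between the positions of its minimum and maximum
    is returned (simplified version of the method in [6]).
    """
    lo = max(0, j - half)
    hi = min(len(row), j + half)
    seg = np.asarray(row[lo:hi], dtype=np.float32)
    p_min = int(np.argmin(seg))
    p_max = int(np.argmax(seg))
    return abs(p_max - p_min)     # 1-D Euclidean distance in pixels
```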
The perceived blurriness is also related to the local contrast [7]. The edge width of just noticeable blur, w_JNB [7], is estimated in a 64 × 64 pixel block. Only if the deviation of the edge widths is larger than w_JNB can the SM artifact be noticed. The cumulative probability of noticeable SM is calculated as

P_sm = (1/N) ∑_{i=1}^{N} I_{dw > w_JNB},   (3)

where N denotes the number of edge pixels and dw denotes the width deviation of an edge pair between the stereoscopic views. I_{dw > w_JNB} is an indicator function that equals 1 if the condition dw > w_JNB is met and 0 otherwise. However, the cumulative probability must be corrected to account for edge pixels lost to disparity estimation errors. The Probability of Sharpness Mismatch (PSM) is therefore estimated and smoothed with a correction coefficient k, which is calculated from the number of valid disparity mappings, as

PSM = 1 − exp(−(P_sm² + k²) / (2 · σ̃)),   (4)

where σ̃ denotes the standard deviation between P_sm and k.
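
Eqs. 3 and 4 translate directly into a few lines of NumPy; in this sketch the inputs (the width deviations, the per-pair w_JNB values, the coefficient k and σ̃) are assumed to be precomputed as described above.

```python
import numpy as np

def psm_score(dw, w_jnb, k, sigma_tilde):
    """Probability of Sharpness Mismatch (eqs. 3 and 4, sketch).

    dw:          width deviations of matched edge pairs, shape (N,)
    w_jnb:       just-noticeable-blur widths per pair (or a scalar) [7]
    k:           correction coefficient from the valid-mapping count
    sigma_tilde: standard deviation between P_sm and k
    """
    p_sm = float(np.mean(dw > w_jnb))                              # eq. 3
    return 1.0 - np.exp(-(p_sm**2 + k**2) / (2.0 * sigma_tilde))   # eq. 4
```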
C. Synchronization mismatch
The annoyance of the synchronization distortion depends on the 3D scene. SYNM can be very annoying in a shot with strong object motion; if the scene is still, SYNM is almost imperceptible. Thus, the first step of our approach is to analyze the perceptibility of SYNM. Furthermore, shot detection is required for the framework.

Single-valued spatial and temporal perceptual information, as defined by the ITU [8], is computed to estimate the perceptibility of SYNM. The Spatial Information (SI) describes the level of spatial detail in textured images and is calculated as the maximum standard deviation of each Sobel-filtered frame within a time duration. The Temporal Information (TI) describes the strength of the motion in a sequence and is computed as the maximum standard deviation of the pixel luminance differences at the same location between two neighboring frames. The SI and TI values for a shot are calculated separately for the L and R channels. The reversed perceptibility score (RPS) is defined as

RPS = max(SI_c / TI_c),  c = {L, R}.   (5)

If RPS is smaller than a threshold, it is necessary to detect the synchronization mismatch.
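
A possible implementation of the SI/TI computation and of the RPS score of eq. 5 is sketched below; the grayscale input format and the eps guard against division by zero for still scenes are our assumptions.

```python
import cv2
import numpy as np

def si_ti(frames):
    """Spatial and temporal information of a shot, per the ITU definition [8].

    frames: iterable of grayscale frames (same size).
    """
    si, ti, prev = 0.0, 0.0, None
    for f in frames:
        gx = cv2.Sobel(f, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(f, cv2.CV_32F, 0, 1)
        si = max(si, float(np.hypot(gx, gy).std()))      # max std of Sobel frame
        cur = f.astype(np.float32)
        if prev is not None:
            ti = max(ti, float((cur - prev).std()))      # max std of frame diff
        prev = cur
    return si, ti

def rps(frames_left, frames_right, eps=1e-6):
    """Eq. 5: maximum SI_c / TI_c over both channels c in {L, R}."""
    scores = []
    for ch in (frames_left, frames_right):
        si, ti = si_ti(ch)
        scores.append(si / max(ti, eps))
    return max(scores)
```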

SYNM is detected at the frame and shot levels. If a frame is synchronous at time point t, the object motions in a depth plane between the LR views are consistent. Conversely, motion inconsistency can be observed between two asynchronous views. The computation of motion consistency is based on the relative displacements of feature point pairs, which are extracted and matched using SIFT [9] as well as RANSAC [10]. The matched feature points are segmented according to the validated depth planes. The relative displacement d̃ of a matched feature point fp(i) in a depth plane D_j, j ∈ [0, 255], is determined as

d̃(fp(i)) = P_L − DM_{L2R}(P_L) − P_R,  fp(i) ∈ D_j,   (6)

where P_L and P_R denote the coordinates of fp(i) in the L and R views, respectively. The variances of the relative displacements of all feature points are calculated to describe the motion consistency. Please note that if there is only slight motion in depth but no noticeable motion in the horizontal (H) and vertical (V) directions between neighboring frames, the corresponding SI is significantly larger than TI.
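
The following sketch shows how such matched feature points and their relative displacements (eq. 6) might be obtained with OpenCV; the ratio test, the homography model for RANSAC and the integer sampling of the disparity map are our assumptions, not specified in the paper.

```python
import cv2
import numpy as np

def matched_points(img_l, img_r, ratio=0.75):
    """SIFT [9] matches between the views, filtered with RANSAC [10].

    Returns two (N, 2) arrays of corresponding (x, y) coordinates.
    Assumes enough matches exist for homography fitting (>= 4).
    """
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img_l, None)
    k2, d2 = sift.detectAndCompute(img_r, None)
    matches = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < ratio * n.distance]
    p_l = np.float32([k1[m.queryIdx].pt for m in good])
    p_r = np.float32([k2[m.trainIdx].pt for m in good])
    _, inliers = cv2.findHomography(p_l, p_r, cv2.RANSAC, 3.0)
    inliers = inliers.ravel().astype(bool)
    return p_l[inliers], p_r[inliers]

def relative_displacement(p_l, p_r, dm_l2r):
    """Eq. 6: d~ = P_L - DM_L2R(P_L) - P_R for each matched pair."""
    x = p_l[:, 0].astype(int)
    y = p_l[:, 1].astype(int)
    pred = p_l.copy()
    pred[:, 0] -= dm_l2r[y, x]    # shift x by the horizontal disparity
    return pred - p_r
```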
To measure the synchronization of frame f(t) at time point t, the two neighboring frames f(t−1) and f(t+1) are required. The synchronism probability is measured from the left view (f_L(t)) to the right views (f_R(t−1), f_R(t), f_R(t+1)) and from the right view (f_R(t)) to the left views (f_L(t−1), f_L(t), f_L(t+1)), respectively. The motion-related displacements of the feature points are decomposed in the H and V directions. Then, the H and V synchronization probabilities are estimated separately, based on the variances of the motion displacements of the matched feature points. Hence, six histograms of the displacement variances in all valid depth planes are constructed to rank the synchronization probabilities between the frame f(t) of one view and the frames f(t−1), f(t), f(t+1) of the other view. Z_α, which is computed considering a confidence level α, is used as the threshold for the estimation of the synchronization probability. The synchronization probability in one direction, p_r, is calculated as

p_r = (∑_{i=1}^{n} h_i) / n,  h_i ≤ Z_α,  r = {H, V},   (7)

where n is the total number of variance histograms h. For statistical accuracy, the outliers of the variance histogram are first detected and removed. The overall synchronization probability p is computed as the geometric mean of the p_r, since the geometric mean is more robust than the arithmetic mean when outliers are present in the test samples [11]. A frame is judged as synchronous when the maximum p between the views occurs at time point t in both estimation directions:

max{p(f(t)_L, f(t′)_R)} = p(f(t)_L, f(t)_R),
max{p(f(t)_R, f(t′)_L)} = p(f(t)_R, f(t)_L),   (8)

where t′ ∈ {t−1, t, t+1}. If most of the frames within a shot are estimated to be asynchronous, the shot is judged as asynchronous.
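
One plausible reading of eqs. 7 and 8 in code is sketched below; treating each histogram as a flat array of variance values, interpreting p_r as the fraction of entries within the Z_α threshold, and the (t−1, t, t+1) ordering convention are all our assumptions.

```python
import numpy as np

def direction_probability(h, z_alpha):
    """Eq. 7 (one reading): fraction of displacement-variance entries
    not exceeding the confidence threshold Z_alpha, for r in {H, V}."""
    h = np.asarray(h, dtype=np.float32)
    return float(np.mean(h <= z_alpha))

def overall_probability(p_h, p_v):
    """Geometric mean of the H and V probabilities; more robust than
    the arithmetic mean in the presence of outliers [11]."""
    return float(np.sqrt(p_h * p_v))

def frame_synchronous(p_l2r, p_r2l):
    """Eq. 8: frame f(t) is synchronous if, in both estimation
    directions, the co-timed pair scores highest.

    p_l2r, p_r2l: probabilities for t' = (t-1, t, t+1), in that order,
    so index 1 is the co-timed pair.
    """
    return int(np.argmax(p_l2r)) == 1 and int(np.argmax(p_r2l)) == 1
```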
D. Stereoscopic window violation
In 3D cinematography, we observe the 3D world through the so-called Stereoscopic Window (SW) [12], namely the TV or cinema screen. In other words, the viewer watches objects floating in a space defined by the screen edges. If the left disparity of a 3D point is positive/zero/negative, the eyes converge to a point behind the screen, on the screen, or within the theatre space (in front of the screen), respectively. Retinal rivalry occurs at the left or right frame edges when object regions positioned close to the left image's left or right border have no correspondence (are not displayed) in the right frame, and vice versa. For objects with zero disparity, no retinal rivalry is observed. When an object region is cut off by the edge of the display, the result is the so-called Stereoscopic Window Violation (SWV), which is interpreted as occlusion by the viewer.

An SWV does not create any problems when it occurs behind the screen (i.e., for objects with positive left disparity), because both the disparity and occlusion cues dictate that the object lies behind the screen. However, when an SWV involves objects perceived as appearing in front of the screen (i.e., objects with negative left disparity), the occlusion cue conflicts with the disparity cue. Generally, as occlusion supersedes the disparity cue, the object is finally perceived as lying behind the screen plane [12]. The above holds for a mild SWV, where only a small region of the object that interferes with the left or right frame edge is missing from the other image. In a severe SWV, the missing object region is so extended that the human brain cannot fuse the images and eventually loses the 3D percept.

An SWV at negative disparities is not only undesirable, but may also prove painful. The rule regarding SWV states that a cinematographer has to avoid breaking the stereoscopic window while an object is being filmed with negative left disparity. There is one notable exception, related to object speed [13]. Objects entering or exiting the frame in no more than half a second cause no problem, since, by the time the brain localizes the object in front of the screen, the entire object is either fully visible in the frame or has disappeared, respectively.
A simple yet effective algorithm that detects the Stereoscopic Window Violation using disparity maps has been developed in this work. We assume the existence of left and right dense disparity maps for each stereoscopic video frame, i.e., DM_{L2R}(u, v) and DM_{R2L}(u, v), u = 0, …, W−1, v = 0, …, H−1, where W and H are the width and height of the video frame (in pixels). In the first step of the algorithm, pixels (u, v) are selected having left disparity DM_{L2R}(u, v) < −T_1 and right disparity DM_{R2L}(u, v) > T_1. In order to exclude objects that do not appear in front of the screen, we set the threshold T_1 to a suitable value and perform connected component analysis with an 8-point neighbourhood to extract objects (connected components) that are displayed significantly in front of the screen. A value of T_1 = 0.0025W worked well in our experiments. To reduce noise, objects with small width (less than T_w) or height (less than T_h) are rejected. Threshold values of T_w = 0.02W and T_h = 0.04H have been found to work well. The detected objects are then enclosed in rectangular bounding boxes (Regions of Interest, ROIs). Thus, two sets of ROIs, R^r = {R^r_1, R^r_2, …, R^r_n} and R^l = {R^l_1, R^l_2, …, R^l_k}, are created for the right and left channel, respectively. These ROIs are represented by their upper-left and lower-right coordinates [X^j_{i,min}, Y^j_{i,min}]^T and [X^j_{i,max}, Y^j_{i,max}]^T, where j = {r, l} and i is the ROI index.
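
A sketch of this extraction step with OpenCV's connected component analysis follows; the sign convention ("negative disparity = in front of the screen") and the returned (label, bounding box) representation are our assumptions.

```python
import cv2
import numpy as np

def foreground_rois(dm, t1, t_w, t_h):
    """ROIs of objects displayed significantly in front of the screen.

    dm:       dense (left) disparity map of one view
    t1:       disparity threshold (0.0025 * W in the paper)
    t_w, t_h: minimum ROI width/height (0.02 * W, 0.04 * H in the paper)
    """
    mask = (dm < -t1).astype(np.uint8)   # assumed sign convention
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    rois = []
    for lab in range(1, n):              # label 0 is the background
        x, y, w, h, _area = stats[lab]
        if w >= t_w and h >= t_h:        # reject small/noisy components
            rois.append((lab, (x, y, x + w - 1, y + h - 1)))
    return rois, labels
```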
Two types of disturbing SWVs can be defined. In the first type, namely the left SWV, the violation occurs at the left frame border, since there is a region in the left image which is missing from the right one. Its detection is performed as follows. If one or more object ROIs R^r_i, with disparity characteristics such as those previously described, lie on the left border of the right image, that is, if X^r_{i,min} = 0, an SWV is present. This is because X^l_{i,min} = X^r_{i,min} + DM_{R2L}(i, j) > 0 and, thus, the region [0, DM_{R2L}(i, j)] in the left image is not present in the right one. In order to reduce false alarms arising from inaccuracies in the disparity maps, another condition is introduced. The number of pixels that belong to the object in the two leftmost ROI columns must be greater than a threshold T_2, expressed as a percentage of the ROI height, to decide that this object signals an SWV. In our experiments, T_2 is set to 0.3 h_ROI, where h_ROI is the ROI height.
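
Combining the border test with the T_2 column-occupancy condition might look as follows, reusing the (label, box) output of the previous sketch; the exact counting of object pixels is our interpretation.

```python
import numpy as np

def left_swv(rois, labels, t2_ratio=0.3):
    """Left-SWV test on ROIs extracted from the right view.

    rois:   (label, (x_min, y_min, x_max, y_max)) tuples
    labels: label image from the connected component analysis
    A ROI on the left frame border signals a violation only if the
    object fills enough of its two leftmost columns (T2 = 0.3 h_ROI).
    """
    for lab, (x_min, y_min, y_maxless := None, _)[:2] in []:
        pass  # placeholder removed below

def left_swv(rois, labels, t2_ratio=0.3):
    for lab, (x_min, y_min, x_max, y_max) in rois:
        if x_min != 0:                     # must touch the left border
            continue
        h_roi = y_max - y_min + 1
        cols = labels[y_min:y_max + 1, :2]  # two leftmost ROI columns
        if np.count_nonzero(cols == lab) > t2_ratio * h_roi:
            return True
    return False
```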
A similar procedure is followed for the detection of a right SWV. In this case, a region appearing at the rightmost border of the right image is absent from the left image. Thus, if one or more object ROIs detected in the left disparity map, R^l_i, lie on the right border of the left image, i.e., if X^l_{i,max} = W − 1, an SWV is present. This is because X^r_{i,max} = X^l_{i,max} + DM_{L2R}(i, j) < W − 1. Therefore, the region [W + DM_{L2R}(i, j), W − 1] in the right image is not present in the left one. The false alarm reduction approach regarding small regions (noise) is applied to right SWV detection as well.

When a left or right SWV of duration d_SWV frames is detected, the condition d_SWV > fps/2 is checked, where fps is the video frame rate, to determine whether the violation is perceived as annoying or not. The satisfaction of this condition implies that the duration of the violation is more than half a second.
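
In code, this final temporal gate is a one-line check (a minimal sketch; the frame-count bookkeeping is assumed to happen elsewhere):

```python
def swv_is_annoying(d_swv, fps):
    """A detected violation is reported only if it lasts longer than
    half a second, i.e. d_SWV > fps / 2 (speed exception of [13])."""
    return d_swv > fps / 2.0
```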
III. EXPERIMENTAL RESULTS
In order to evaluate the performance of the proposed approaches, several HD 3D sequences (cf. TABLE I) were used in the experiments. The disparity maps of all test sequences were automatically estimated by the algorithm proposed in [4]. δ in eq. 1 was set to 1 in the experiments, to account for rounding between integer and floating-point disparity values.

S1 - S3 were produced in a professional studio using two Sony HDC 1400 cameras and broadcast by ARTE (Association Relative à la Télévision Européenne). The cameras were set in convergence mode. Focuses were manually calibrated with a sharpness chart. Luminance and color parameters were corrected with a Remote Control Panel (RCP) and the use of a waveform monitor and a video monitor. Adobe Premiere was used to remove the unusable parts at the beginning and the end of each sequence. The focus of the left camera for S2 and S3 was manually modified to generate global sharpness mismatches: it was set to +2 m for S2 and to +3 m for S3 in comparison to the right camera. The right camera setup is identical across S1, S2 and S3, and there is no camera motion in these sequences. S4 - S6 contain densely meshed textures and local motion types. S7 is a sequence containing SWVs.
TABLE I
ORIGINAL TEST SEQUENCES USED IN THE EXPERIMENTS

ID  Sequence name           ID  Sequence name
S1  ARTE studio setup       S2  ARTE left camera +2m
S3  ARTE left camera +3m    S4  Badminton
S5  BeergardenNoFlag        S6  BeergardenFlag
S7  The Magician
A. Results of sharpness mismatch estimation
The proposed SM framework was evaluated with two kinds of experiments. The first experiment used S1 - S3 to evaluate the performance on global SMs. Fig. 1 (a) - (c) show example views with focus mismatches. S3 contains some local object motion. The aim of the second experiment is to measure the performance on depth-of-field (DOF) mismatches. For this, DOF mismatches were generated in the right channels of S4 - S6 by applying Gaussian low-pass filters from a specific depth plane on (cf. Fig. 1 (d) - (f)), since defocus effects of lens aberrations can be modeled as a Gaussian blur [14]. The standard deviation σ of the Gaussian low-pass filter was varied from 0.0 to 6.0 with a step width of 0.4; the images are completely distorted for σ > 6.0. The radius of the Gaussian filters was set to 3σ. Deviations (CPBD_d) of the sharpness scores estimated by the sharpness metric CPBD [15] were used as a reference in the experiment. CPBD is a well-performing 2D objective no-reference sharpness metric for Gaussian blur distortions.
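
The DOF-mismatch distortion can be reproduced along these lines; the hard depth mask without feathering and the >= comparison for "from a depth plane on" are our assumptions about the test-data generation.

```python
import cv2
import numpy as np

def add_dof_mismatch(img, dm, depth_from, sigma):
    """Blur one view from a given depth plane on, as in the second
    experiment: Gaussian low-pass with radius 3*sigma models defocus [14].
    """
    if sigma <= 0:
        return img.copy()
    k = int(2 * round(3 * sigma) + 1)      # odd kernel, radius 3*sigma
    blurred = cv2.GaussianBlur(img, (k, k), sigma)
    mask = dm >= depth_from                # depth planes to defocus
    out = img.copy()
    out[mask] = blurred[mask]
    return out

# sigma swept from 0.0 to 6.0 in steps of 0.4, as in the experiments
sigmas = np.arange(0.0, 6.0 + 1e-9, 0.4)
```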
Mean SM scores for S1 - S3 are shown in TABLE II. It can be observed that both the proposed approach (PSM) and CPBD_d can detect slight SMs between the stereo cameras, as the SM scores of both metrics increase with growing focus deviations. Although S3 contains object motion, the variances of the SM scores of both metrics are very small. Thus, local motion does not noticeably affect the SM predictions of PSM and CPBD_d. However, PSM is more sensitive to SMs than CPBD_d, as can be seen in TABLE II. The range of both metrics is [0, 1].

The results of the second experiment are shown in Fig. 2. The SM scores of PSM increase monotonically with the standard deviation of the Gaussian low-pass filters. Significant changes can be observed for σ = [1.2, 4.4]. However, PSM is not very sensitive to slight blur distortions (e.g. σ < 1.2). Significant changes of the SM scores of S6 can be observed already from σ = 1.6, since S6 contains homogeneous textures and strong local motion blur. CPBD

REFERENCES
[7] R. Ferzli and L. J. Karam, "A no-reference objective image sharpness metric based on the notion of just noticeable blur (JNB)," IEEE Transactions on Image Processing, vol. 18, no. 4, pp. 717-728, 2009.
[9] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. IEEE International Conference on Computer Vision (ICCV), 1999.
[10] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.
[11] W. Feller, An Introduction to Probability Theory and Its Applications. New York: Wiley.