A depth estimation algorithm that treats occlusions explicitly; the method also enables identification of occlusion edges, which may be useful in other applications, and outperforms current state-of-the-art light-field depth estimation algorithms, especially near occlusion boundaries.
Abstract:
Consumer-level and high-end light-field cameras are now widely available. Recent work has demonstrated practical methods for passive depth estimation from light-field images. However, most previous approaches do not explicitly model occlusions, and therefore cannot capture sharp transitions around object boundaries. A common assumption is that a pixel exhibits photo-consistency when focused to its correct depth, i.e., all viewpoints converge to a single (Lambertian) point in the scene. This assumption does not hold in the presence of occlusions, making most current approaches unreliable precisely where accurate depth information is most important: at depth discontinuities. In this paper, we develop a depth estimation algorithm that treats occlusion explicitly; the method also enables identification of occlusion edges, which may be useful in other applications. We show that, although pixels at occlusions do not preserve photo-consistency in general, they are still consistent in approximately half the viewpoints. Moreover, the line separating the two view regions (correct depth vs. occluder) has the same orientation as the occlusion edge in the spatial domain. By treating these two regions separately, depth estimation can be improved. Occlusion predictions can also be computed and used for regularization. Experimental results show that our method outperforms current state-of-the-art light-field depth estimation algorithms, especially near occlusion boundaries.
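To make the core idea concrete, below is a minimal sketch (not the authors' implementation) that splits a refocused angular patch along a line with the occlusion edge's orientation and scores photo-consistency in each half; the array layout, the `edge_theta` parameter, and the variance-based score are illustrative assumptions.

```python
import numpy as np

def split_consistency(patch, edge_theta):
    """Split an angular patch by a line through its centre with the same
    orientation as the spatial occlusion edge, then score each half.

    patch      : (U, V, 3) array of colour samples, one per viewpoint, taken
                 from the light field refocused to the candidate depth
                 (illustrative layout, not the paper's data structure).
    edge_theta : orientation of the occlusion edge in radians (assumed known,
                 e.g. from an edge detector on the central view).
    """
    U, V, _ = patch.shape
    u, v = np.meshgrid(np.arange(U) - (U - 1) / 2.0,
                       np.arange(V) - (V - 1) / 2.0, indexing="ij")
    # Signed distance of each viewpoint from the dividing line.
    side = u * np.sin(edge_theta) - v * np.cos(edge_theta)
    scores = []
    for mask in (side >= 0, side < 0):
        half = patch[mask].reshape(-1, 3)
        # Lower colour variance = better photo-consistency (Lambertian).
        scores.append(half.var(axis=0).mean())
    return scores
```

At the correct depth of an occluded background pixel, one half should remain photo-consistent (low variance) while the other, coming from the occluder, should not; the gap between the two scores can also serve as an occlusion cue.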
TL;DR: In this paper, a learning-based approach is proposed to synthesize new views from a sparse set of input views, using two sequential convolutional neural networks to model the disparity and color estimation components; both networks are trained simultaneously by minimizing the error between the synthesized and ground-truth images.
TL;DR: This paper proposes a novel learning-based approach to synthesize new views from a sparse set of input views that could potentially decrease the required angular resolution of consumer light field cameras, which allows their spatial resolution to increase.
TL;DR: In computer vision communities such as stereo, optical flow, or visual tracking, commonly accepted and widely used benchmarks have enabled objective comparison and boosted scientific progress.
TL;DR: A comprehensive overview and discussion of research in light field image processing, including basic light field representation and theory, acquisition, super-resolution, depth estimation, compression, editing, processing algorithms for light field display, and computer vision applications of light field data are presented.
TL;DR: A novel algorithm for view synthesis that utilizes a soft 3D reconstruction to improve quality, continuity and robustness, and shows that this representation is beneficial throughout the view synthesis pipeline.
TL;DR: This work presents two algorithms based on graph cuts that efficiently find a local minimum with respect to two types of large moves, namely expansion moves and swap moves, which allow important cases of discontinuity-preserving energies.
TL;DR: This paper compares the running times of several standard algorithms, as well as a recently developed algorithm that works several times faster than any of the other methods, making near real-time performance possible.
TL;DR: This paper describes a sampled representation for light fields that allows efficient creation and display of both inward- and outward-looking views, and a compression system able to compress the generated light fields by a factor of more than 100:1 with very little loss of fidelity.
TL;DR: This paper presents a volumetric method for integrating range images that is able to integrate a large number of range images yielding seamless, high-detail models of up to 2.6 million triangles.
TL;DR: This paper proposes two algorithms that use graph cuts to compute a local minimum even when very large moves are allowed, generating a labeling such that there is no expansion move that decreases the energy.
Q1. What contributions have the authors mentioned in the paper "Occlusion-aware depth estimation using light-field cameras" ?
In this paper, the authors develop a depth estimation algorithm that treats occlusion explicitly; the method also enables identification of occlusion edges, which may be useful in other applications. The authors show that, although pixels at occlusions do not preserve photo-consistency in general, they are still consistent in approximately half the viewpoints.
Q2. How do the authors increase the robustness of the depth cues?
The authors divide the gradient by d_ini (the initial depth estimate) to increase robustness, since for the same surface normal the depth change across neighboring pixels becomes larger as the depth gets larger.
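As a hedged illustration of that normalization (the function and variable names below are mine, not the paper's), dividing the spatial depth gradient by the initial depth keeps distant slanted surfaces from being penalized more than nearby ones:

```python
import numpy as np

def normalized_depth_gradient(d_ini):
    """Illustrative sketch: spatial gradient of the initial depth map,
    divided by d_ini itself, so that for the same surface normal a distant
    slanted surface is not penalized more than a nearby one."""
    gy, gx = np.gradient(d_ini)                 # per-pixel depth changes
    safe = np.maximum(np.abs(d_ini), 1e-6)      # avoid division by zero
    return gx / safe, gy / safe
```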
Q3. What are the main problems of the two methods?
Both methods are vulnerable to heavy occlusion: the tensor field becomes too random to estimate, and 3D lines are partitioned into small, incoherent segments.
Q4. What is the effect of refocusing on the angular patch?
The authors show that if they refocus to the occluded plane, the angular patch still exhibits photo-consistency in a subset of the pixels (the unoccluded ones).
Q5. What is the effect of the bilateral metric on the angular patch?
The authors show that at occlusions, part of the angular patch remains photo-consistent, while the other part comes from occluders and exhibits no photo-consistency.
Q6. What is the main contribution of this paper?
In this paper, the authors explicitly model occlusions, by developing a modified version of the photo-consistency condition on angular pixels.
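A small, self-contained sketch of how such a modified condition could be used as a depth cost (an assumption-laden simplification, not the paper's exact energy): evaluate each half of the angular patch against the central view and keep the half that agrees better, since at the correct depth the unoccluded half should be photo-consistent.

```python
import numpy as np

def occlusion_aware_cost(patch, edge_theta, center_color):
    """patch: (U, V, 3) angular samples at a candidate depth; edge_theta:
    assumed occlusion-edge orientation; center_color: colour of the central
    view at this pixel. Returns the photo-consistency cost of the better half."""
    U, V, _ = patch.shape
    u, v = np.meshgrid(np.arange(U) - (U - 1) / 2.0,
                       np.arange(V) - (V - 1) / 2.0, indexing="ij")
    side = u * np.sin(edge_theta) - v * np.cos(edge_theta)
    costs = [np.abs(patch[m].mean(axis=0) - center_color).mean()
             for m in (side >= 0, side < 0)]
    return min(costs)  # the unoccluded half should match the central view
```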
Q7. How did Wanner and Goldluecke propose a globally consistent framework?
Wanner and Goldluecke [22] proposed a globally consistent framework by applying structure tensors to estimate the directions of feature pixels in the 2D EPI.
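For context, here is a hedged sketch of the structure-tensor idea on a 2D EPI (a slice of the light field with rows indexed by the angular coordinate u and columns by the spatial coordinate x); the smoothing scale and the sign convention for the recovered slope depend on the parameterization and are assumptions here, not the exact method of [22].

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def epi_line_orientation(epi, sigma=1.0):
    """epi: 2D array (u along rows, x along columns). Returns the local
    orientation of the dominant EPI line at each pixel; its slope dx/du
    encodes disparity (sign convention depends on the parameterization)."""
    gx = sobel(epi, axis=1, mode="nearest")     # derivative along x
    gu = sobel(epi, axis=0, mode="nearest")     # derivative along u
    # Smoothed structure tensor entries.
    Jxx = gaussian_filter(gx * gx, sigma)
    Jxu = gaussian_filter(gx * gu, sigma)
    Juu = gaussian_filter(gu * gu, sigma)
    # Angle of the dominant gradient direction; the EPI line is orthogonal.
    theta_grad = 0.5 * np.arctan2(2.0 * Jxu, Jxx - Juu)
    return theta_grad + np.pi / 2.0
```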
Q8. What are the angular coordinates of a pixel?
For each pixel, the authors refocus to various depths using a 4D shearing of the light-field data [17]:

L_α(x, y, u, v) = L(x + u(1 − 1/α), y + v(1 − 1/α), u, v),   (6)

where L is the input light-field image, α is the ratio of the refocused depth to the currently focused depth, L_α is the refocused light-field image, (x, y) are the spatial coordinates, and (u, v) are the angular coordinates.
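Eq. (6) maps directly to a resampling of each sub-aperture view. The sketch below (grayscale light field, array layout [u, v, x, y], angular coordinates taken relative to the central view, bilinear interpolation — all assumptions on my part) illustrates the shear.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def refocus_lightfield(L, alpha):
    """Shear-based refocusing per Eq. (6):
    L_alpha(x, y, u, v) = L(x + u(1 - 1/alpha), y + v(1 - 1/alpha), u, v).
    L is assumed to be a grayscale 4D array indexed as [u, v, x, y]."""
    U, V, X, Y = L.shape
    shift = 1.0 - 1.0 / alpha
    u0, v0 = (U - 1) / 2.0, (V - 1) / 2.0       # centre (reference) view
    x, y = np.meshgrid(np.arange(X), np.arange(Y), indexing="ij")
    L_alpha = np.empty(L.shape, dtype=float)
    for u in range(U):
        for v in range(V):
            coords = np.stack([x + (u - u0) * shift,
                               y + (v - v0) * shift])
            # Bilinear resampling of this sub-aperture view.
            L_alpha[u, v] = map_coordinates(L[u, v], coords,
                                            order=1, mode="nearest")
    return L_alpha
```

With this layout, the angular patch for a spatial pixel (x, y) at the refocused depth is L_alpha[:, :, x, y].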