Accurate Depth Map Estimation from a Lenslet Light Field Camera
Hae-Gon Jeon Jaesik Park Gyeongmin Choe Jinsun Park
Yunsu Bok Yu-Wing Tai In So Kweon
Korea Advanced Institute of Science and Technology (KAIST), Republic of Korea
[hgjeon,jspark,gmchoe,ysbok]@rcv.kaist.ac.kr
[zzangjinsun,yuwing]@gmail.com, iskweon@kaist.ac.kr
Abstract
This paper introduces an algorithm that accurately estimates depth maps using a lenslet light field camera. The proposed algorithm estimates the multi-view stereo correspondences with sub-pixel accuracy using the cost volume. The foundation for constructing accurate costs is threefold. First, the sub-aperture images are displaced using the phase shift theorem. Second, the gradient costs are adaptively aggregated using the angular coordinates of the light field. Third, the feature correspondences between the sub-aperture images are used as additional constraints. With the cost volume, the multi-label optimization propagates and corrects the depth map in the weak texture regions. Finally, the local depth map is iteratively refined by fitting a local quadratic function to estimate a non-discrete depth map. Because micro-lens images contain unexpected distortions, a method that corrects this error is also proposed. The effectiveness of the proposed algorithm is demonstrated through challenging real-world examples, including comparisons with advanced depth estimation algorithms.
1. Introduction
The problem of estimating an accurate depth map from a lenslet light field camera, e.g., Lytro™ [1] and Raytrix™ [19], is investigated. In contrast to conventional cameras, a light field camera captures not only a 2D image but also the directions of the incoming light rays. The additional light directions allow the image to be re-focused and the depth map of a scene to be estimated, as demonstrated in [12, 17, 19, 23, 26, 29].
Because the baseline between sub-aperture images from a lenslet light field camera is very narrow, directly applying existing stereo matching algorithms such as [20] cannot produce satisfying results, even if the applied algorithm is a top-ranked method in the Middlebury stereo matching benchmark. As reported in Yu et al. [29], the disparity range of adjacent sub-aperture images in Lytro is between −1 and 1 pixels. Consequently, it is very challenging to estimate an accurate depth map because a one-pixel disparity error is already significant in this problem.

Figure 1. Synthesized views of the two depth maps acquired from the Lytro software [1] and our approach.
In this paper, an algorithm for stereo matching between sub-aperture images with an extremely narrow baseline is presented. Central to the proposed algorithm is the use of the phase shift theorem in the Fourier domain to estimate the sub-pixel shifts of sub-aperture images. This enables the estimation of the stereo correspondences at sub-pixel accuracy, even with a very narrow baseline. The cost volume is computed to evaluate the matching cost of different disparity labels, which is defined using the similarity measurement between the sub-aperture images and the center-view sub-aperture image shifted at different sub-pixel locations. Here, the gradient matching costs are adaptively aggregated based on the angular coordinates of the light field camera.
In order to reduce the effects of image noise, a weighted median filter is adopted to remove the noise in the cost volume, followed by multi-label optimization to propagate reliable disparity labels to the weak texture regions. In the multi-label optimization, confident matching correspondences between the center view and the other views are used as additional constraints, which assist in preventing oversmoothing at the edges and texture regions. Finally, the estimated depth map is iteratively refined using quadratic polynomial interpolation to enhance the estimated depth map with sub-label precision.

In the experiments, it was found that the micro-lens images of lenslet light field cameras contain depth distortions. Therefore, a method for correcting this error is also presented. The effectiveness of the proposed algorithm is demonstrated using challenging real-world examples that were captured by a Lytro camera, a Raytrix camera, and a lab-made lenslet light field camera. A performance comparison with advanced methods is also presented. An example of the results of the proposed method is presented in Fig. 1.
2. Related Work
Previous work related to depth map (or disparity map¹) estimation from a light field image is reviewed. Compared with conventional approaches in stereo matching, lenslet light field images have very narrow baselines. Consequently, approaches based on correspondence matching do not typically work well, because the sub-pixel shift in the spatial domain usually involves interpolation with blurriness, and the matching costs of stereo correspondence are highly ambiguous. Therefore, instead of using correspondence matching, other cues and constraints have been used to estimate the depth maps from a lenslet light field image.
Georgiev and Lumsdaine [7] computed a normalized cross-correlation between micro-lens images in order to estimate the disparity map. Bishop and Favaro [4] introduced an iterative multi-view stereo method for a light field. Wanner and Goldluecke [26] used a structure tensor to compute the vertical and horizontal slopes in the epipolar plane of a light field image, and they formulated the depth map estimation problem as a global optimization subject to the epipolar constraint. Yu et al. [29] analyzed the 3D geometry of lines in a light field image and computed the disparity maps through line matching between the sub-aperture images. Tao et al. [23] introduced a fusion method that uses the correspondence and defocus cues of a light field image to estimate the disparity maps. After the initial estimation, a multi-label optimization is applied in order to refine the estimated disparity map. Heber and Pock [8] estimated disparity maps using low-rank structure regularization to align the sub-aperture images.
In addition to the aforementioned approaches, there have been recent studies that estimate depth maps from light field images. For example, Kim et al. [10] estimated depth maps from a moving DSLR camera, which simulated the multiple viewpoints of a light field image. Chen et al. [6] introduced a bilateral consistency metric on the surface camera in order to estimate the stereo correspondence in a light field image in the presence of occlusion. However, it should be noted that the baselines of the light field images presented in Kim et al. [10] and Chen et al. [6] are significantly larger than the baseline of the light field images captured using a lenslet light field camera.

¹We sometimes use disparity map to represent depth map.
Compared with previous studies, the proposed algorithm computes a cost volume that is based on sub-pixel multi-view stereo matching. Unique to the proposed algorithm is the use of the phase shift theorem when performing the sub-pixel shifts of the sub-aperture images. The phase shift theorem allows the reconstruction of the sub-pixel shifted sub-aperture images without introducing blurriness, in contrast to spatial domain interpolation. As is demonstrated in the experiments, the proposed algorithm is highly effective and outperforms the advanced algorithms in depth map estimation using a lenslet light field image.
3. Sub-aperture Image Analysis
First, the characteristics of sub-aperture images obtained from a lenslet-based light field camera are analyzed, and then the proposed distortion correction method is described.
3.1. Narrow Baseline Sub-aperture Image
Narrow baseline. According to the lenslet light field camera projection model proposed by Bok et al. [5], the viewpoint $(S, T)$ of a sub-aperture image with an angular direction $\mathbf{s} = (s, t)$² is as follows:

\[
\begin{bmatrix} S \\ T \end{bmatrix} = \frac{D}{d}(D + d) \begin{bmatrix} s/f_x \\ t/f_y \end{bmatrix}, \tag{1}
\]

where $D$ is the distance between the lenslet array and the center of the main lens, $d$ is the distance between the lenslet array and the imaging sensor, and $f$ is the focal length of the main lens. With the assumption of a uniform focal length (i.e., $f_x = f_y = f$), the baseline between two adjacent sub-aperture images is defined as $\text{baseline} := \frac{(D+d)D}{df}$.
Based on this, we need to shorten $f$, shorten $d$, or lengthen $D$ for a wider baseline. However, $f$ cannot be too short because it is proportional to the angular resolution of the micro-lenses in a lenslet array; therefore, the maximum baseline, which is the product of the baseline and the angular resolution of the sub-aperture images, remains unchanged even if the value of $f$ varies. If the physical size of the micro-lenses is too large, the spatial resolution of the sub-aperture images is reduced. Shortening $d$ enlarges the angular difference between the corresponding rays of adjacent pixels and might cause radial distortion of the micro-lenses. Finally, lengthening $D$ increases the baseline, but the field of view is reduced. Due to these challenges, the disparity range of sub-aperture images is quite narrow. For example, the disparity range between adjacent sub-aperture views of the Lytro camera is smaller than ±1 pixel [29].
²The 4D parameterization [7, 17, 26] is followed, where the pixel coordinates of a light field image I are defined using the 4D parameters $(s, t, x, y)$. Here, $\mathbf{s} = (s, t)$ denotes the discrete index of the angular directions and $\mathbf{x} = (x, y)$ denotes the Cartesian image coordinates of each sub-aperture image.
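To make this trade-off concrete, the short sketch below evaluates the baseline expression derived from Eq. (1). It is an illustration only; the numeric values are hypothetical assumptions, not values taken from the paper.

```python
# Hypothetical optics parameters (illustrative assumptions, not from the paper).
D = 45.0   # distance from the lenslet array to the main lens center (mm)
d = 0.025  # distance from the lenslet array to the imaging sensor (mm)
f = 50.0   # focal length of the main lens, with f_x = f_y = f (mm)

# Baseline between two adjacent sub-aperture images (Sec. 3.1):
# shortening f or d, or lengthening D, widens it, at the costs noted above.
baseline = (D + d) * D / (d * f)
print(f"baseline = {baseline:.2f}")
```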

Figure 2. (a) and (b) EPI before and after distortion correction. (c) shows our compensation process for a pixel, with the reference-view pixel as the pivot of the rotation. (d) shows the slope difference between the two EPIs.
Figure 3. Disparity map without and with distortion correction (Sec. 3.2). A real-world planar scene is captured, and the depth map is computed using our approach (Sec. 4).
Sub-aperture image distortion. From the analyses conducted in this study, it is observed that lenslet light field images contain optical distortions caused by both the main lens (thin lens model) and the micro-lenses (pinhole model). Although the radial distortion of the main lens can be calibrated using conventional methods, the calibration is imperfect, particularly for light rays that have large angular differences from the optical axis. The distortion caused by these rays is called astigmatism [22]. Moreover, because the conventional distortion model is based on a pinhole camera model, the rays that do not pass through the center of the main lens cannot fit the model well. The distortion caused by those rays is called field curvature [22]. Because they are the primary causes of the depth distortion, these two distortions are compensated in the following subsection.
3.2. Distortion Estimation and Correction
During the capture of a light field image of a planar object, spatially variant epipolar plane image (EPI) slopes (i.e., non-uniform depths) are observed, which result from the distortions mentioned in Sec. 3.1 (see Fig. 3). In addition, the degree of distortion also varies for each sub-aperture image. To solve this problem, an energy minimization problem is formulated under a constant depth assumption as follows:
\[
\hat{G} = \operatorname*{argmin}_{G} \sum_{\mathbf{x}} \left| \theta(I(\mathbf{x})) - \theta_o - G(\mathbf{x}) \right|, \tag{2}
\]

where $|\cdot|$ denotes the absolute value. $\theta_o$, $\theta(\cdot)$, and $G(\cdot)$ denote the slope without distortion, the slope of the EPI, and the amount of distortion at point $\mathbf{x}$, respectively.
Figure 4. An original sub-aperture image shifted with bilinear interpolation, bicubic interpolation, and the phase shift theorem.

The amount of field curvature distortion is estimated for each pixel. An image of a planar checkerboard is captured, and the observed EPI slopes are compared with $\theta_o$³. Points with strong gradients in the EPI are selected, and the difference $(\theta(\cdot) - \theta_o)$ in Eq. (2) is calculated. Then, the entire field curvature $G$ is fitted to a second-order polynomial surface model.

³A tilt error might exist if the sensor and the calibration plane are not parallel. In order to avoid this, an optical table is used.
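To illustrate the surface-fitting step, the sketch below fits sampled slope differences to a second-order polynomial surface with NumPy. It is a stand-in under stated assumptions: Eq. (2) minimizes an absolute (L1) cost, while this sketch uses an ordinary least-squares (L2) fit, and all names are hypothetical.

```python
import numpy as np

def fit_field_curvature(xs, ys, slope_diff):
    """Fit the field-curvature map G to a second-order polynomial surface
    from the slope differences theta(.) - theta_o sampled at strong-gradient
    EPI points (xs, ys). An L2 stand-in for the L1 objective of Eq. (2)."""
    A = np.stack([xs**2, ys**2, xs * ys, xs, ys, np.ones_like(xs)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, slope_diff, rcond=None)

    def G(x, y):
        # Evaluate the fitted quadratic surface at pixel position (x, y).
        return (coeffs[0] * x**2 + coeffs[1] * y**2 + coeffs[2] * x * y
                + coeffs[3] * x + coeffs[4] * y + coeffs[5])

    return G
```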
After solving Eq. (2), each point's EPI slope is rotated using $\hat{G}$. The pixel of the reference view (i.e., the center view) is set as the pivot of the rotation (see Fig. 2 (c)). However, due to the astigmatism, the field curvature varies according to the slice direction. In order to address this problem, Eq. (2) is solved twice: once each for the horizontal and vertical directions. The correction order does not affect the compensation result. In order to avoid chromatic aberrations, the distortion parameters are estimated for each color channel. Figure 2 and Fig. 3 present the EPI image and the estimated depth map before and after the proposed distortion correction, respectively⁴.

⁴It is observed that altering the focal length and zooming parameters affects the correction. This is a limitation of the proposed method. However, it is also observed that the distortion parameters are not scene dependent.
The proposed method is classified as a low-order approach that targets the astigmatism and field curvature. A more generalized technique for correcting the aberration has been proposed by Ng and Hanrahan [16], and it is currently used in real products [2].
4. Depth Map Estimation
Given the distortion-corrected sub-aperture images, the goal is to estimate accurate dense depth maps. The proposed depth map estimation algorithm is developed using cost-volume-based stereo [20]. In order to manage the narrow baseline between the sub-aperture images, the pipeline is tailored with three significant differences. First, instead of traversing local patches to compute the cost volume, the sub-aperture images are directly shifted using the phase shift theorem and the per-pixel cost volume is computed. Second, in order to effectively aggregate the gradient costs computed from dozens of sub-aperture image pairs, a weight term that considers the horizontal/vertical deviation in the st coordinates between the sub-aperture image pairs is defined. Third, because the small viewpoint changes between sub-aperture images make feature matching more reliable, guidance from confident matching correspondences is also included in the discrete label optimization [11]. The details are described in the following sub-sections.
4.1. Phase Shift based Sub-pixel Displacement
A key contribution of the proposed depth estimation algorithm is matching the narrow-baseline sub-aperture images using sub-pixel displacements. According to the phase shift theorem, if an image $I$ is shifted by $\Delta\mathbf{x} \in \mathbb{R}^2$, the corresponding phase shift in the 2D Fourier transform is:

\[
\mathcal{F}\{I(\mathbf{x} + \Delta\mathbf{x})\} = \mathcal{F}\{I(\mathbf{x})\} \exp^{2\pi i \Delta\mathbf{x}}, \tag{3}
\]

where $\mathcal{F}\{\cdot\}$ denotes the discrete 2D Fourier transform. In Eq. (3), multiplying by the exponential term in the frequency domain is the same as convolving with a Dirichlet kernel (or periodic sinc) in the spatial domain. According to the Nyquist-Shannon sampling theorem [21], a continuous band-limited signal can be perfectly reconstructed through convolving it with a sinc function. If the centroid of the sinc function deviates from the origin, precisely shifted signals can be obtained. In the same manner, Eq. (3) generates a precisely shifted image in the spatial domain if the sub-aperture image is band-limited. Therefore, the sub-pixel shifted image $I'(\mathbf{x})$ is obtained using:

\[
I'(\mathbf{x}) = I(\mathbf{x} + \Delta\mathbf{x}) = \mathcal{F}^{-1}\{\mathcal{F}\{I(\mathbf{x})\} \exp^{2\pi i \Delta\mathbf{x}}\}. \tag{4}
\]
In practice, the light field image is not always a band-limited signal. This results from the weak pre-filtering that fits the light field into the sub-aperture image resolution [13, 24]. However, the artifact is not obvious for regions where the texture is obtained from the source surface in the scene. For example, a sub-aperture image of a resolution chart captured by a Lytro camera is presented in Fig. 4. This image is shifted by $\Delta\mathbf{x} = [2.2345, 1.5938]$ pixels. Compared with the displacements that result from the bilinear and bicubic interpolations, the sub-pixel shifted image produced using the phase shift theorem is sharper and does not contain blurriness. Note that an accurate reconstruction of sub-pixel shifted images is significant for accurate depth map estimation, particularly when the baseline is narrow. The effect of the interpolation method on depth accuracy is analyzed in Sec. 5.
In this implementation, the fast Fourier transform with a circular boundary condition is used to manage the non-infinite signals. Because the proposed algorithm shifts the entire sub-aperture image instead of local patches, the artifacts that result from the periodicity problem only appear at the boundary of the image within a width of a few pixels (less than two pixels), which is negligible.
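The shift of Eq. (4) takes only a few lines of NumPy. The following is a minimal sketch under the circular boundary condition described above, not the authors' implementation; it assumes a single-channel image (apply it per channel for color).

```python
import numpy as np

def phase_shift(img, dx, dy):
    """Shift a 2D image by the sub-pixel offset (dx, dy) using the phase
    shift theorem (Eqs. (3) and (4)); the result approximates I(x + dx, y + dy)."""
    H, W = img.shape
    fy = np.fft.fftfreq(H)[:, None]  # vertical frequencies (cycles/pixel)
    fx = np.fft.fftfreq(W)[None, :]  # horizontal frequencies (cycles/pixel)
    F = np.fft.fft2(img)
    # Multiplying by exp(2*pi*i*(fx*dx + fy*dy)) is the phase shift of Eq. (3).
    shifted = np.fft.ifft2(F * np.exp(2j * np.pi * (fx * dx + fy * dy)))
    return np.real(shifted)  # real up to numerical error for real inputs
```

For instance, the Fig. 4 experiment corresponds to `phase_shift(img, 2.2345, 1.5938)`; unlike bilinear or bicubic interpolation, the shift introduces no low-pass blur away from the image boundary.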
4.2. Building the Cost Volume
In order to match the sub-aperture images, two complementary costs are used: the sum of absolute differences (SAD) and the sum of gradient differences (GRAD). The cost volume $C$ is defined as a function of $\mathbf{x}$ and cost label $l$:

\[
C(\mathbf{x}, l) = \alpha C_A(\mathbf{x}, l) + (1 - \alpha)\, C_G(\mathbf{x}, l), \tag{5}
\]

where $\alpha \in [0, 1]$ adjusts the relative importance between the SAD cost $C_A$ and the GRAD cost $C_G$. $C_A$ is defined as:

\[
C_A(\mathbf{x}, l) = \sum_{\mathbf{s} \in V} \sum_{\mathbf{x} \in R_{\mathbf{x}}} \min\bigl( \left| I(\mathbf{s}_c, \mathbf{x}) - I(\mathbf{s}, \mathbf{x} + \Delta\mathbf{x}(\mathbf{s}, l)) \right|, \tau_1 \bigr), \tag{6}
\]

where $R_{\mathbf{x}}$ is a small rectangular region centered at $\mathbf{x}$; $\tau_1$ is a truncation value of a robust function; and $V$ contains the st-coordinate pixels $\mathbf{s}$, except for the center view $\mathbf{s}_c$. Equation (3) is used for the precise sub-pixel shifting of the images.
Equation (6) builds a matching cost through comparing the center sub-aperture image $I(\mathbf{s}_c, \mathbf{x})$ with the other sub-aperture images $I(\mathbf{s}, \mathbf{x})$ to generate a disparity map from a canonical viewpoint. The 2D shift vector $\Delta\mathbf{x}$ in Eq. (6) is defined as follows:

\[
\Delta\mathbf{x}(\mathbf{s}, l) = l k (\mathbf{s} - \mathbf{s}_c), \tag{7}
\]

where $k$ is the unit of the label in pixels. $\Delta\mathbf{x}$ linearly increases as the angular deviation from the center viewpoint increases. The other cost volume $C_G$ is defined as follows:
\[
C_G(\mathbf{x}, l) = \sum_{\mathbf{s} \in V} \sum_{\mathbf{x} \in R_{\mathbf{x}}} \beta(\mathbf{s}) \min\bigl( \mathrm{Diff}_x(\mathbf{s}_c, \mathbf{s}, \mathbf{x}, l), \tau_2 \bigr) + \bigl(1 - \beta(\mathbf{s})\bigr) \min\bigl( \mathrm{Diff}_y(\mathbf{s}_c, \mathbf{s}, \mathbf{x}, l), \tau_2 \bigr), \tag{8}
\]

where $\mathrm{Diff}_x(\mathbf{s}_c, \mathbf{s}, \mathbf{x}, l) = |I_x(\mathbf{s}_c, \mathbf{x}) - I_x(\mathbf{s}, \mathbf{x} + \Delta\mathbf{x}(\mathbf{s}, l))|$ denotes the difference between the x-directional gradients of the sub-aperture images, and $\mathrm{Diff}_y$ is defined similarly on the y-directional gradients. $\tau_2$ is a truncation constant that suppresses outliers. $\beta(\mathbf{s})$ in Eq. (8) controls the relative importance of the two directional gradient differences based on the relative st coordinates. $\beta(\mathbf{s})$ is defined as follows:

\[
\beta(\mathbf{s}) = \frac{|s - s_c|}{|s - s_c| + |t - t_c|}. \tag{9}
\]

According to Eq. (9), $\beta$ increases if the target view $\mathbf{s}$ is located along the horizontal extent of the center view $\mathbf{s}_c$; in that case, only the gradient costs in the x direction are aggregated into $C_G$. Note that $\beta$ is independent of the scene because it is determined purely by the relative position between $\mathbf{s}$ and $\mathbf{s}_c$.
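Putting Eqs. (5) through (9) together, the sketch below assembles the cost volume for grayscale sub-aperture images, reusing `phase_shift` from Sec. 4.1. It is a simplified reading of the equations rather than the authors' code: the aggregation window $R_{\mathbf{x}}$ is reduced to a single pixel, the (S, T, H, W) light field layout follows the 4D parameterization of footnote 2, and `k`, `alpha`, `tau1`, and `tau2` are assumed constants.

```python
import numpy as np

def build_cost_volume(LF, sc, labels, k, alpha=0.5, tau1=0.5, tau2=0.5):
    """Sketch of Eqs. (5)-(9). LF is an (S, T, H, W) grayscale light field,
    sc = (s_c, t_c) indexes the center view, and labels are the disparity
    labels l. R_x is taken as a single pixel for brevity."""
    S, T, H, W = LF.shape
    center = LF[sc]
    cIy, cIx = np.gradient(center)  # y- then x-gradient of the center view
    C = np.zeros((len(labels), H, W))
    for li, l in enumerate(labels):
        for s in range(S):
            for t in range(T):
                if (s, t) == sc:
                    continue  # V excludes the center view s_c
                # Eq. (7): the shift grows linearly with the angular deviation;
                # s pairs with x and t pairs with y (footnote 2).
                dx, dy = l * k * (s - sc[0]), l * k * (t - sc[1])
                shifted = phase_shift(LF[s, t], dx, dy)  # Eq. (4)
                sIy, sIx = np.gradient(shifted)
                # Eq. (9): weight the x- vs. y-gradient costs by angular position.
                beta = abs(s - sc[0]) / (abs(s - sc[0]) + abs(t - sc[1]))
                C_A = np.minimum(np.abs(center - shifted), tau1)  # Eq. (6)
                C_G = (beta * np.minimum(np.abs(cIx - sIx), tau2)
                       + (1 - beta) * np.minimum(np.abs(cIy - sIy), tau2))  # Eq. (8)
                C[li] += alpha * C_A + (1 - alpha) * C_G  # Eq. (5)
    return C
```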
Figure 5. Estimated disparity maps at different steps of our algorithm. (a) The center-view sub-aperture image. (b)-(e) Disparity maps (b) based on the initial cost volume (winner-takes-all strategy), (c) after the weighted median filter refinement (the red pixels indicate detected outlier pixels), (d) after the multi-label optimization, and (e) after the iterative refinement. The processes in (b) and (c) are described in Sec. 4.2, and the processes in (d) and (e) are described in Sec. 4.3.

Figure 6. The effectiveness of the iterative refinement step described in Sec. 4.3: the central view; the graph-cuts depth map and a view synthesized using it; and the refined depth map and a view synthesized using it.

As a sequential step, every cost slice is refined using an edge-preserving filter [15] to alleviate the coarsely scattered unreliable matches. Here, the central sub-aperture image is used to determine the weights of the filter. They are determined using the Euclidean distances between the RGB values of two pixels in the filter, which preserves the discontinuities in the cost slices. From the refined cost volume $C'$, a disparity map $l_a$ is determined using the winner-takes-all strategy. As depicted in Figs. 5 (b) and (c), the noisy background disparities are substituted with the majority value (almost zero in this example) of the background disparity. For each pixel, if the variance over the cost slices is smaller than a threshold $\tau_{reject}$, the pixel is regarded as an outlier because it does not have a distinctive minimum. The red pixels in Fig. 5 (c) indicate these outlier pixels.
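The winner-takes-all selection and the variance-based outlier test described above reduce to two array operations; the sketch below assumes an (L, H, W) cost volume and treats `tau_reject` as a hypothetical threshold value.

```python
import numpy as np

def wta_disparity(C, tau_reject):
    """Winner-takes-all labeling l_a from the refined cost volume C', plus
    the outlier mask: a pixel whose cost profile has low variance lacks a
    distinctive minimum (the red pixels of Fig. 5 (c))."""
    l_a = np.argmin(C, axis=0)                # per-pixel best label
    outlier = np.var(C, axis=0) < tau_reject  # flat profile -> unreliable
    return l_a, outlier
```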
4.3. Disparity Optimization and Enhancement
The disparity map from the previous step is enhanced through discrete optimization and iterative quadratic fitting.
Confident matching correspondences. Besides the cost volume, correspondences are also matched at salient feature points as strong guides for the multi-label optimization. In particular, local feature matching is conducted between the center sub-aperture image and the other sub-aperture images. Here, the SIFT algorithm [14] is used for the feature extraction and matching. From a pair of matched feature positions, the positional deviation $\Delta\mathbf{f} \in \mathbb{R}^2$ in the xy coordinates is computed. If the amount of deviation $\|\Delta\mathbf{f}\|$ exceeds the maximum disparity range of the light field camera, the pair is rejected as an outlier. For each pair of matched positions, given $\mathbf{s}$, $\mathbf{s}_c$, $\Delta\mathbf{f}$, and $k$, the over-determined linear equation $\Delta\mathbf{f} = l k (\mathbf{s} - \mathbf{s}_c)$ is solved for $l$. This is based on the linear relationship depicted in Eq. (7). Because a feature point in the center view is matched with those of multiple images, it has several disparity candidates. Therefore, their median value is obtained and used to compute the sparse and confident disparities $l_c$.
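The per-feature disparity recovery can be sketched as follows. The data layout (a mapping from each center-view keypoint to its per-view positional deviations) is a hypothetical structure chosen for illustration; the least-squares solution of $\Delta\mathbf{f} = l k (\mathbf{s} - \mathbf{s}_c)$ and the median over views follow the text.

```python
import numpy as np

def confident_disparities(matches, sc, k):
    """Sketch of the sparse confident disparities l_c. `matches` maps a
    center-view keypoint to a list of (s, delta_f) pairs, where s is the
    matched view's angular index and delta_f its xy positional deviation."""
    l_c = {}
    for keypoint, observations in matches.items():
        candidates = []
        for s, delta_f in observations:
            ds = k * np.array([s[0] - sc[0], s[1] - sc[1]], dtype=float)
            # Least-squares solution of the over-determined delta_f = l*k*(s - s_c).
            candidates.append(ds @ np.asarray(delta_f, dtype=float) / (ds @ ds))
        l_c[keypoint] = np.median(candidates)  # median over the matched views
    return l_c
```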
Multi-label optimization. Multi-label optimization is performed using graph cuts [11] to propagate and correct the disparities using the neighboring estimates. The optimal disparity map is obtained through minimizing:

\[
l_r = \operatorname*{argmin}_{l} \sum_{\mathbf{x}} C'\bigl(\mathbf{x}, l(\mathbf{x})\bigr) + \lambda_1 \sum_{\mathbf{x} \in \mathcal{I}} \|l(\mathbf{x}) - l_a(\mathbf{x})\| + \lambda_2 \sum_{\mathbf{x} \in \mathcal{M}} \|l(\mathbf{x}) - l_c(\mathbf{x})\| + \lambda_3 \sum_{\mathbf{x}' \in N_{\mathbf{x}}} \|l(\mathbf{x}) - l(\mathbf{x}')\|, \tag{10}
\]

where $\mathcal{I}$ contains the inlier pixels determined in the previous step (Sec. 4.2), and $\mathcal{M}$ denotes the pixels that have confident matching correspondences. Equation (10) has four terms: matching cost reliability ($C'(\mathbf{x}, l(\mathbf{x}))$), data fidelity ($\|l(\mathbf{x}) - l_a(\mathbf{x})\|$), confident matching cost ($\|l(\mathbf{x}) - l_c(\mathbf{x})\|$), and local smoothness ($\|l(\mathbf{x}) - l(\mathbf{x}')\|$). Figure 5 (d) presents a corrected depth map after the discrete optimization. Note that even without the confident matching cost, the proposed approach estimates a reliable disparity map; the confident matching cost further enhances the estimated disparity at regions with salient matches.
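For reference, the objective of Eq. (10) can be scored for any candidate labeling as in the sketch below. The minimization itself is performed with graph cuts [11], which this sketch does not reproduce, and the 4-neighborhood form of $N_{\mathbf{x}}$ is an assumption.

```python
import numpy as np

def energy_eq10(l, C_ref, l_a, l_c, inlier, matched, lam1, lam2, lam3):
    """Evaluate Eq. (10) for an (H, W) integer labeling l. C_ref is the
    refined cost volume C' of shape (L, H, W); l_c is a dense map holding
    the sparse confident disparities; inlier and matched are boolean masks
    for the sets I and M."""
    iy, ix = np.indices(l.shape)
    data = C_ref[l, iy, ix].sum()                        # matching cost reliability
    fidelity = lam1 * np.abs(l - l_a)[inlier].sum()      # data fidelity to l_a
    confident = lam2 * np.abs(l - l_c)[matched].sum()    # confident matching cost
    smooth = lam3 * (np.abs(np.diff(l, axis=0)).sum()    # local smoothness over
                     + np.abs(np.diff(l, axis=1)).sum()) # 4-neighborhoods
    return data + fidelity + confident + smooth
```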
Iterative refinement. The last step refines the discrete disparity map produced by the multi-label optimization into a continuous disparity map with sharp gradients at depth discontinuities. The method presented by Yang et al. [28] is adopted. A new cost volume $\hat{C}$ filled with ones is computed. Then, for every $\mathbf{x}$, $\hat{C}(\mathbf{x}, l_r(\mathbf{x}))$ is set to 0, followed by weighted median filtering [15] of the cost slices.
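The parabola fit below is one common form of quadratic polynomial interpolation for sub-label precision, given as a sketch consistent with the description here; the full iterative scheme of Yang et al. [28], which rebuilds and filters $\hat{C}$ between iterations, is omitted.

```python
import numpy as np

def sublabel_refine(C_hat, l_r):
    """One quadratic-interpolation pass: fit a parabola through the filtered
    costs at labels l_r - 1, l_r, l_r + 1 and move each pixel to the
    parabola's vertex, yielding a continuous (sub-label) disparity map."""
    L = C_hat.shape[0]
    l = np.clip(l_r, 1, L - 2)  # keep a valid three-label neighborhood
    iy, ix = np.indices(l.shape)
    c0, c1, c2 = C_hat[l - 1, iy, ix], C_hat[l, iy, ix], C_hat[l + 1, iy, ix]
    denom = c0 - 2.0 * c1 + c2
    safe = np.where(np.abs(denom) > 1e-12, denom, 1.0)
    # Vertex of the parabola through (-1, c0), (0, c1), (+1, c2).
    offset = np.where(np.abs(denom) > 1e-12, 0.5 * (c0 - c2) / safe, 0.0)
    return l + np.clip(offset, -0.5, 0.5)
```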

References (partial)
[14] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
[21] C. E. Shannon. Communication in the Presence of Noise. Proceedings of the IRE, 1949.
R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan. Light Field Photography with a Hand-held Plenoptic Camera. Stanford Tech Report CTSR 2005-02, 2005.
[20] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz. Fast Cost-Volume Filtering for Visual Correspondence and Beyond. CVPR, 2011.
[11] V. Kolmogorov and R. Zabih. Multi-camera Scene Reconstruction via Graph Cuts. ECCV, 2002.