HAL Id: hal-02123544
https://hal.archives-ouvertes.fr/hal-02123544
Submitted on 8 May 2019
Sparse to Dense Scene Flow Estimation from Light Fields
Pierre David, Mikaël Le Pendu, Christine Guillemot

To cite this version:
Pierre David, Mikaël Le Pendu, Christine Guillemot. Sparse to Dense Scene Flow Estimation from Light Fields. ICIP 2019 - IEEE International Conference on Image Processing, Sep 2019, Taipei, Taiwan. pp. 1-5, doi: 10.1109/ICIP.2019.8803520. hal-02123544

SPARSE TO DENSE SCENE FLOW ESTIMATION FROM LIGHT FIELDS

Pierre David*, Mikaël Le Pendu†, Christine Guillemot*

* Inria, Campus Universitaire de Beaulieu, 35042 Rennes, France
† Trinity College Dublin, Ireland
ABSTRACT

The paper addresses the problem of scene flow estimation from sparsely sampled video light fields. The scene flow estimation method is based on an affine model in the 4D ray space that allows us to estimate a dense flow from sparse estimates in 4D clusters. A dataset of synthetic video light fields created for assessing scene flow estimation techniques is also described. Experiments show that the proposed method gives error rates on the optical flow components that are comparable to those obtained with state of the art optical flow estimation methods, while computing a more accurate disparity variation when compared with prior scene flow estimation techniques.

Index Terms: Scene flow, optical flow, light field, sparse to dense
1. INTRODUCTION
Light fields have proven to be very useful for scene analysis, and in
particular for scene depth estimation [1, 2, 3, 4, 5]. However, while
a lot of work has been dedicated to scene analysis from static light
fields, very little effort has been invested in the problem of dynamic scene analysis from video light fields. The concept of scene flow was first defined in [6] as describing the 3D geometry as well as the 3D motion of each scene point. Considering a multi-view set-up, the scene flow is estimated using an optical flow estimator for each view. The 3D scene flow is then computed by fitting its projection on each view to the estimated optical flows. However, in the recent literature (e.g. [7, 8, 9, 10]), the scene flow is instead defined as a direct extension of the optical flow, where the depth (or disparity) variation ∆d of objects along time is represented in addition to the apparent 2D motion (∆x, ∆y).
Considering this definition, the scene flow analysis problem has
first been addressed for stereo video sequences: in [7, 8, 9] a scene
flow is estimated assuming that the scene can be decomposed into
rigidly moving objects. Several methods based on RGB-D videos
have also been developed [10], [11], [12]. The first methods for
scene flow analysis from dense light fields have been proposed in
[13] and [14]. Both are based on variational models. The authors
in [15] proposed oriented light field windows to estimate the scene
flow. All these methods rely on epipolar images which only provide
sufficient information for scene flow estimation of densely sampled
light fields (such as those captured with plenoptic cameras). However,
they do not address the case of sparse light fields (i.e. with large
baselines).
In this paper, we focus on the problem of scene flow analysis from sparse video light fields. This problem is made difficult due to the large temporal and angular occlusions. To cope with this difficulty, we propose a method for interpolating, in the 4D ray space, a sparse scene flow into a dense one. The sparse-to-dense approach naturally handles occlusions: while the sparse estimation can only be obtained on non-occluded parts of the image, the dense interpolation extends these estimations to every pixel (occluded or not). The proposed method is based on an affine model in the 4D ray space that allows us to estimate the dense flow from sparse estimates in nearby clusters. Note that an advantage of our sparse-to-dense approach is the possibility to use it as a post-processing step in other scene flow estimation methods, in the same way as EpicFlow [16] has been used in other methods (e.g. [17]) to interpolate the optical flow of pixels detected as outliers.

This work was supported by the EU H2020 Research and Innovation Program under grant agreement N° 694122 (ERC advanced grant CLIM).
In order to validate the proposed algorithm, we have generated
a synthetic light field video dataset based on the Sintel movie (used
in the optical flow benchmark [18, 19]). This dataset is composed
of two synthetic light fields (‘Bamboo2’ and ‘Temple1’) of 3 × 3
views of 1024 × 536 pixels and 50 frames. The light field views are
provided with the corresponding ground truth scene flow (optical
flow and disparity variation).
The proposed method has been assessed using this synthetic
dataset in comparison with the oriented window method in [15], and
with two methods proposed for optical flow estimation: the sparse to
dense estimation method EpicFlow [16] and a state-of-the-art technique based on a deep learning architecture called PWC-Net [20].
Note that, since the depth estimation component of the method in [15] has been designed for dense light fields and is hardly applicable when the baseline is large, we have coupled the optical flow estimated by this method with the ground truth disparity. The two optical flow estimation techniques [20] and
[16] are used for separately estimating the optical flow in each view
as well as the disparity between views. Experimental results show
that the proposed method gives optical flow endpoint error rates that
are comparable to those obtained with state of the art optical flow
estimation methods, while computing at the same time the disparity
variation with lower mean absolute errors. Although the accuracy of
our optical flow is slightly lower than that of PWC-Net, the depth
variation maps estimated by our method have considerably lower errors.
2. NOTATIONS AND METHOD OVERVIEW
Let us consider the 4D representation of light fields proposed in [21] and [22] to describe the radiance along the different light rays. This 4D function, at each time instant, is denoted LF_t(x, y, u, v), where the pairs x = (x, y) and u = (u, v) respectively denote the spatial and angular coordinates of light rays. A view u of a light field at t is written L_u^t. We denote by C : (I, I') → (S, S', f) the function that determines two sets S and S' of matching points in I and I' respectively. The associated bijection f is such that f(S) = S'. The inverse bijection is denoted f^{-1}.
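To make the matching abstraction concrete, here is a minimal sketch of a function with the interface of C, built on OpenCV ORB features and cross-checked brute-force matching. This is only an illustrative stand-in: the paper uses the matcher of [23], and the feature type, parameters and return layout below are assumptions of the sketch.

```python
import cv2
import numpy as np

def match_views(I, Ip):
    """A stand-in for the matching function C(I, I') -> (S, S', f), using ORB
    features and cross-checked brute-force matching.

    I, Ip : grayscale uint8 images (two views and/or two time instants).
    Returns S and S' as (N, 2) arrays of (x, y) points; the bijection f is
    implicit: row k of S matches row k of S'.
    """
    orb = cv2.ORB_create(nfeatures=5000)
    kp1, des1 = orb.detectAndCompute(I, None)
    kp2, des2 = orb.detectAndCompute(Ip, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    S = np.array([kp1[m.queryIdx].pt for m in matches])
    Sp = np.array([kp2[m.trainIdx].pt for m in matches])
    return S, Sp
```

Storing S and S' as row-aligned arrays makes the bijection f and its inverse trivial lookups.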

[Figure 1: block diagram. LF_t and LF_{t+1} are matched with C to produce a sparse scene flow, the light field rays are grouped into 4D clusters, and a sparse to dense interpolation yields the dense scene flow.]
Fig. 1. Overview of the sparse to dense scene flow estimation.
[Figure 2: the matching functions f_0, ..., f_5 between a view u and the central view c at t and t+1, illustrated on epipolar images.]
Fig. 2. Matching between a view u and the central view c at t and t + 1, shown in epipolar images at t and t + 1.
The proposed method takes as inputs two consecutive frames of a video light field, LF_t(x, y, u, v) and LF_{t+1}(x, y, u, v), and a matching function C, and estimates a consistent dense scene flow on each view. It produces disparity maps at t and t+1 and a disparity variation map between t and t+1 (in addition to the optical flow components). The algorithm proceeds as follows (see Fig. 1). The first step of the method is the robust sparse scene flow estimation. All light field rays are then grouped in 4D clusters to guide the sparse to dense interpolation of the scene flow. An affine model is computed from the sparse flow estimates in neighboring clusters, and then used to compute a flow value in each light field pixel.
3. SPARSE SCENE FLOW ESTIMATION
The method pivots on the central view L_c^t. We first compute (S_0, S'_0, f_0) = C(L_c^t, L_c^{t+1}). Then, for every non-central view u, we perform three kinds of matching (see Fig. 2): angular matching to estimate the sets of points (S_1, S'_1, f_1) = C(L_c^{t+1}, L_u^{t+1}) and (S_3, S'_3, f_3) = C(L_u^t, L_c^t), temporal matching to estimate the sets of points (S_2, S'_2, f_2) = C(L_u^{t+1}, L_u^t), and temporo-angular matching to estimate (S_4, S'_4, f_4) = C(L_c^t, L_u^{t+1}) and (S_5, S'_5, f_5) = C(L_c^{t+1}, L_u^t). In the experiments, we used [23] as function C.
The next step is to link the different sets together based on the distance between the points, computed as in [24]. We define a distance D' based on color, angular and spatial proximity in the 7D space [l a b u v x y] as:

D'(P_i, P_j) = √( d_c² + (m_s² / S²) · d_s² + (m_a² / A²) · d_a² )    (1)

with:

S = √(h × w / |S_0|),    A = N_u × N_v    (2)

where d_c, d_s and d_a are the color, spatial and angular distances defined as Euclidean distances in the CIELAB colorspace, the [x y] space and the [u v] space respectively. The variables w and h are the width and height of a view, N_u and N_v are the angular size of the light field, and m_s is a parameter controlling the balance between color and spatial distances that we fix to 10 in the experiments. For this step, we have d_a = 0 because we are aiming at linking points in the same view, but this distance will be used later in the clustering step.
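As a concrete reading of Eq. (1), the following NumPy sketch evaluates D' on two 7D ray descriptors [L, a, b, u, v, x, y]; apart from m_s = 10, which the text fixes, the default parameter values are placeholders.

```python
import numpy as np

def ray_distance(p_i, p_j, m_s=10.0, m_a=1.0, S=20.0, A=9.0):
    """Distance D' of Eq. (1) between two rays described by 7D vectors
    [L, a, b, u, v, x, y] (CIELAB color, angular and spatial coordinates).

    S and A are the normalization constants of Eq. (2); the default values
    here are arbitrary placeholders.
    """
    p_i, p_j = np.asarray(p_i, float), np.asarray(p_j, float)
    d_c = np.linalg.norm(p_i[0:3] - p_j[0:3])   # color distance in CIELAB
    d_a = np.linalg.norm(p_i[3:5] - p_j[3:5])   # angular distance in [u, v]
    d_s = np.linalg.norm(p_i[5:7] - p_j[5:7])   # spatial distance in [x, y]
    return np.sqrt(d_c**2 + (m_s**2 / S**2) * d_s**2 + (m_a**2 / A**2) * d_a**2)
```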
Based on the above distances, for each point P_0 ∈ S_0, we build a chain of points (P_0, P'_0, P_1, P'_1, P_2, P'_2, P_3, P'_3) by searching for the nearest neighbors successively in S_1, S_2 and S_3 (see Eq. (3) and Fig. 3a). We call this step a forward nearest neighbor search and denote NN the simple nearest neighbor search function for a point P_i ∈ S_i in a set S_j, according to the distance D.

P_0 →^{f_0} P'_0 →^{NN} P_1 →^{f_1} P'_1 →^{NN} P_2 →^{f_2} P'_2 →^{NN} P_3 →^{f_3} P'_3    (3)
To ensure the robustness of the chain of points, we then perform a backward nearest neighbor check as in Eq. (4). Given a point P'_i ∈ S'_i whose nearest neighbor in S_j is P_j, we check if P'_i is reciprocally the nearest neighbor of P_j in S'_i.

P'_3 →^{f_3^{-1}} P_3 →^{NN?} P'_2 →^{f_2^{-1}} P_2 →^{NN?} P'_1 →^{f_1^{-1}} P_1 →^{NN?} P'_0 →^{f_0^{-1}} P_0    (4)
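A minimal NumPy sketch of one forward search (Eq. (3)) followed by the backward check (Eq. (4)), under two simplifying assumptions: each matching is stored as a pair of row-aligned arrays (S_i, S'_i), and plain Euclidean distance stands in for the distance D.

```python
import numpy as np

def nn_index(p, pts):
    """Index of the point in pts closest to p (Euclidean stand-in for the distance D)."""
    return int(np.argmin(np.linalg.norm(pts - p, axis=1)))

def build_chain(k0, Sp0, S1, Sp1, S2, Sp2, S3):
    """Forward NN search (Eq. (3)) and backward NN check (Eq. (4)) for one seed P_0.

    Each matching i is stored as row-aligned (N_i, 2) arrays S_i, Sp_i with
    f_i(S_i[k]) = Sp_i[k].  k0 is the row of the seed in (S_0, Sp_0).
    Returns the row indices (k0, k1, k2, k3) of the chain, or None if a
    backward check fails.
    """
    # Forward pass: P'_0 -> P_1 -> P'_1 -> P_2 -> P'_2 -> P_3
    k1 = nn_index(Sp0[k0], S1)
    k2 = nn_index(Sp1[k1], S2)
    k3 = nn_index(Sp2[k2], S3)
    # Backward pass: each P'_i must be the reciprocal nearest neighbor of P_{i+1}.
    if nn_index(S3[k3], Sp2) != k2:
        return None
    if nn_index(S2[k2], Sp1) != k1:
        return None
    if nn_index(S1[k1], Sp0) != k0:
        return None
    return k0, k1, k2, k3
```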
Let A_1 be the set of points P_0 ∈ S_0 that passed the backward nearest neighbor test. We have A_1 ⊂ S_0. We then perform a similar forward nearest neighbor search and backward nearest neighbor test for each point P_0 ∈ S_0, but this time with the two temporo-angular matching sets S_4 and S_5, building the chains of points (P_0, P'_0, P_4) and (P_0, P'_0, P_5) (see Fig. 3b).
Let A_2 be the set of points P_0 ∈ S_0 that passed the backward nearest neighbor test in S_4 and S_5. We have A_2 ⊂ S_0. For each point P_0 ∈ A = A_1 ∩ A_2, we build a complete chain of points g = (P_0, P_1, P_2, P_3, P_4, P_5).
Given a chain of points, we check the consistency of each matching point P'_i = f_i(P_i) with the point P_j in the same view and instant. More precisely, we discard the chain if either D(P_0, P'_3) > τ, D(P_2, P'_4) > τ, or D(P_3, P'_5) > τ, where τ is a distance threshold that was fixed to 10 in our experiments.
Any remaining chain g is used to estimate a sparse scene flow. From g, we can compute the scene flow on two points of the light field LF_t: on the central view L_c^t at P_0 and on the view L_u^t at P'_2:

P'_0 − P_0 = (∆x_c, ∆y_c)^T   and   P'_2 − P_2 = (∆x_u, ∆y_u)^T    (5)
Let d_t and d_{t+1} be the disparities at t and t + 1. We have:

P'_1 − P_1 = (u − c) · d_{t+1}   and   P'_3 − P_3 = (u − c) · d_t    (6)
With the temporo-angular matching, we also have:

P'_4 − P_4 = (P'_2 − P_2) − (P'_3 − P_3) = (P'_0 − P_0) + (P'_1 − P_1)    (7)
P'_5 − P_5 = (P'_0 − P_0) − (P'_3 − P_3) = (P'_1 − P_1) + (P'_2 − P_2)

We solve this linear overdetermined system with a least squares method in order to determine the values ∆x_c, ∆y_c, ∆x_u, ∆y_u, d_t and d_{t+1} for each chain g.
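The resulting system can be stacked and solved with a standard least squares routine. The sketch below does this for one chain; the sign conventions written in the comments (flows go from t to t+1, and a point at x in the central view appears at x + (u − c)·d in view u) are assumptions of the sketch and may not match the exact writing of Eqs. (5)-(7).

```python
import numpy as np

def solve_chain(P, Pp, u, c):
    """Least squares scene flow from one chain (sketch).

    P, Pp : dicts {i: (x, y)} with the chain points P_i and P'_i for i = 0..5.
    u, c  : 2D angular coordinates of the view u and of the central view c.
    Unknowns: theta = (dx_c, dy_c, dx_u, dy_u, d_t, d_t1).
    """
    uc = np.asarray(u, float) - np.asarray(c, float)   # (u - c), a 2-vector
    rows, rhs = [], []

    def add(coeffs, value):
        # One 2D observation contributes two scalar rows of the linear system.
        for k in range(2):
            row = np.zeros(6)
            for idx, w in coeffs[k]:
                row[idx] = w
            rows.append(row)
            rhs.append(value[k])

    d = {i: np.asarray(Pp[i], float) - np.asarray(P[i], float) for i in P}

    add([[(0, 1.0)], [(1, 1.0)]], d[0])                            # P'_0 - P_0 =  (dx_c, dy_c)
    add([[(2, -1.0)], [(3, -1.0)]], d[2])                          # P'_2 - P_2 = -(dx_u, dy_u)
    add([[(5, uc[0])], [(5, uc[1])]], d[1])                        # P'_1 - P_1 =  (u-c)*d_t1
    add([[(4, -uc[0])], [(4, -uc[1])]], d[3])                      # P'_3 - P_3 = -(u-c)*d_t
    add([[(0, 1.0), (5, uc[0])], [(1, 1.0), (5, uc[1])]], d[4])    # P'_4 - P_4 =  flow_c + (u-c)*d_t1
    add([[(0, -1.0), (4, uc[0])], [(1, -1.0), (4, uc[1])]], d[5])  # P'_5 - P_5 = -flow_c + (u-c)*d_t

    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return theta   # dx_c, dy_c, dx_u, dy_u, d_t, d_t1
```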

[Figure 3: matching diagrams between L_c^t, L_c^{t+1}, L_u^t and L_u^{t+1}. (a) Chain of angular and temporal correspondences through (S_0, S'_0, f_0), (S_1, S'_1, f_1), (S_2, S'_2, f_2), (S_3, S'_3, f_3). (b) Chain of temporo-angular correspondences through (S_0, S'_0, f_0), (S_4, S'_4, f_4), (S_5, S'_5, f_5).]
Fig. 3. Forward nearest neighbor search and backward nearest neighbor check. The green dots represent the points that passed the backward nearest neighbor check.
4. CLUSTERING THE LIGHT FIELD
In order to estimate a dense scene flow from sparse estimates, we need to make an assumption about the scene. We assume that rays which have similar colors and are located near each other in the light field should have similar motions. As a consequence, similarly to [25] but extended to 4D light fields, we propose to decompose the light field into clusters based on color, spatial and angular proximity. Then, the clustering will be used to estimate one scene flow model per cluster. We generalize in 4D the SLIC superpixels developed in [24] to cluster the light field LF_t. We use as distance D', already defined in Eq. (1). We only modify the definition of S:

S = √(h × w / K)    (8)

K denotes the number of desired clusters. We use the parameters m_s and m_a to control the balance between spatial, angular and color proximity. S (in Eq. (8)) and A (in Eq. (2)) are the maximum spatial and angular distances expected in a cluster.
The spatial positions of the centroids C_k of the clusters are initialized on a regular grid of step S, while their initial angular positions are randomly sampled. Then, each pixel i is assigned to the nearest cluster center C_k. We fix the search region to N_u × N_v × 2S × 2S because the expected approximate spatial size of each cluster is S × S and the approximate angular size is N_u × N_v. Indeed, each cluster is expected to contain pixels in all the N_u × N_v views.
Once the assignment step is over, we update the cluster centers by computing the new centroid of each cluster. We only need to iterate this second step N_iteration = 10 times in order to obtain stabilized cluster centroids.
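For illustration, the assignment and update steps can be sketched as a brute-force loop over all rays; unlike the procedure above, this version uses a random initialization and no N_u × N_v × 2S × 2S search window, so it is only a readable approximation of the clustering step, not the authors' implementation.

```python
import numpy as np

def cluster_light_field(feat, K, m_s=10.0, m_a=1.0, n_iter=10):
    """Brute-force 4D clustering sketch in the spirit of Section 4.

    feat : (N, 7) array with one row [L, a, b, u, v, x, y] per light field ray
           (all views of LF_t flattened together).
    Returns an array of N cluster labels.  Written for clarity, not speed.
    """
    h = feat[:, 6].max() + 1.0            # image height (max y + 1)
    w = feat[:, 5].max() + 1.0            # image width  (max x + 1)
    Nu = feat[:, 3].max() + 1.0           # angular sizes (max u + 1, max v + 1)
    Nv = feat[:, 4].max() + 1.0
    S = np.sqrt(h * w / K)                # Eq. (8)
    A = Nu * Nv                           # Eq. (2)

    # Initialization: K rays sampled at random (the paper instead puts the
    # spatial positions on a regular grid of step S and only samples the
    # angular positions randomly).
    rng = np.random.default_rng(0)
    centroids = feat[rng.choice(len(feat), size=K, replace=False)].astype(float)

    for _ in range(n_iter):
        # Assignment step: distance D' of Eq. (1) to every centroid
        # (no restricted search window in this sketch).
        d_c = np.linalg.norm(feat[None, :, 0:3] - centroids[:, None, 0:3], axis=2)
        d_a = np.linalg.norm(feat[None, :, 3:5] - centroids[:, None, 3:5], axis=2)
        d_s = np.linalg.norm(feat[None, :, 5:7] - centroids[:, None, 5:7], axis=2)
        D = np.sqrt(d_c**2 + (m_s**2 / S**2) * d_s**2 + (m_a**2 / A**2) * d_a**2)
        labels = np.argmin(D, axis=0)
        # Update step: recompute each centroid as the mean of its assigned rays.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = feat[labels == k].mean(axis=0)
    return labels
```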
5. SPARSE TO DENSE INTERPOLATION
Once we have divided the light field into clusters, we estimate a model for each one of them. For each cluster, we look for the points estimated in Section 3 that are inside the cluster, compute the mean value of the sparse estimates, and associate it with the centroid of the cluster. For clusters that have no seeds inside them, we do nothing at this stage. We end up with initialized clusters and uninitialized ones.
As in [16], we build a weighted graph where each node is a cluster (initialized or not) and where an edge between two nodes means that the two clusters are adjacent. The associated weight is defined as the distance D' (as in Eq. (1)) between the centroids of the two clusters. We then look for the N nearest initialized neighbors using Dijkstra's algorithm on the graph [26], discarding every empty cluster.
For each cluster, we estimate the parameters of the affine model in Eq. (9) in the 4D space by fitting the model on the initialized centroids using the RANSAC algorithm [27].

a_1 u + b_1 v + c_1 x + d_1 y + e_1 = x + ∆x
a_2 u + b_2 v + c_2 x + d_2 y + e_2 = y + ∆y    (9)
a_3 u + b_3 v + c_3 x + d_3 y + e_3 = ∆d
We then apply the model to every pixel belonging to the cluster to compute its (∆x, ∆y, ∆d) scene flow value.
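A possible sketch of one per-cluster fit, using a plain RANSAC loop around NumPy least squares; the number of trials and the inlier threshold are placeholders, and each of the three lines of Eq. (9) is fitted independently.

```python
import numpy as np

def fit_affine_line(uvxy, targets, n_trials=200, thresh=1.0, seed=0):
    """RANSAC fit of one line of the affine model of Eq. (9) (sketch).

    uvxy    : (N, 4) coordinates (u, v, x, y) of the initialized centroids, N >= 5.
    targets : (N,) values of the fitted quantity at these centroids, e.g.
              x + dx for the first line of Eq. (9), or dd for the third one.
    Returns the coefficients (a, b, c, d, e).
    """
    rng = np.random.default_rng(seed)
    A = np.hstack([uvxy, np.ones((len(uvxy), 1))])       # rows [u, v, x, y, 1]
    best_coef, best_count = None, -1
    for _ in range(n_trials):
        sample = rng.choice(len(A), size=5, replace=False)   # minimal sample
        coef, *_ = np.linalg.lstsq(A[sample], targets[sample], rcond=None)
        inliers = np.abs(A @ coef - targets) < thresh
        if inliers.sum() > best_count:
            # Refit on all inliers of the best minimal sample so far.
            best_coef, *_ = np.linalg.lstsq(A[inliers], targets[inliers], rcond=None)
            best_count = inliers.sum()
    return best_coef

def apply_affine_line(coef, uvxy):
    """Evaluate a fitted line of Eq. (9) at arbitrary ray coordinates (u, v, x, y)."""
    return np.hstack([uvxy, np.ones((len(uvxy), 1))]) @ coef
```

To recover ∆x at a ray, subtract the ray's x coordinate from the output of the first fitted line (similarly for ∆y with the second one); the third line gives ∆d directly.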
As a final post-processing step, we perform an energy minimization like in [16] for the optical flow of each subaperture image independently. To regularize the disparity variation map, we then perform joint bilateral filtering using the optical flow as a guide.
6. EVALUATION
6.1. Scene Flow Dataset
For our experiments, we have prepared a synthetic video light field dataset¹. For that purpose, we have used the production files of the open source movie Sintel [28] and have modified them in the Blender 3D software [29] in order to render an array of 3 × 3 views. Similarly to the MPI Sintel flow dataset [18, 19], we have modified the scenes to generate not only the 'final' render, but also a 'clean' render without lighting effects, motion blur, or semi-transparent objects. Ground truth optical flow and disparity maps were also generated for each view. Since disparity variation maps could not be rendered within Blender, we have computed them using the disparity and optical flow. However, this process requires projecting the disparity map of a frame to the next frame using the optical flow, which results in unavailable disparity variation information in areas of temporal occlusion.
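The projection described above can be sketched as follows, with a nearest-neighbor lookup and a simple validity mask standing in for the actual occlusion handling used when generating the dataset.

```python
import numpy as np

def disparity_variation(d_t, d_t1, flow):
    """Disparity variation from disparities and optical flow (sketch).

    d_t, d_t1 : (H, W) disparity maps at t and t+1.
    flow      : (H, W, 2) forward optical flow (dx, dy) from t to t+1.
    Returns (delta_d, valid): the variation map and a mask of pixels whose
    flow target stays inside the image (nearest-neighbor lookup, no occlusion
    reasoning, unlike the actual dataset generation).
    """
    H, W = d_t.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xt = np.rint(xs + flow[..., 0]).astype(int)
    yt = np.rint(ys + flow[..., 1]).astype(int)
    valid = (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)
    delta_d = np.zeros_like(d_t)
    delta_d[valid] = d_t1[yt[valid], xt[valid]] - d_t[valid]
    return delta_d, valid
```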
We have processed two scenes of 3 × 3 views of 1024 × 536 pixels and 50 frames corresponding to the scenes 'Bamboo2' and
'Temple1' in [18]. The disparities (in pixels) between neighboring views are in the range [−8, +52] for 'Bamboo2' and [−22, +9] for 'Temple1'.

¹ http://clim.inria.fr/Datasets/SyntheticVideoLF/index.html

Table 1. Results on our light field scene flow dataset for the optical flow endpoint error (EPE OF) and for the disparity variation mean absolute error (MAE ∆d). The latter is only computed on pixels that remain visible between two adjacent frames because of the lack of ground truth in occluded and disoccluded areas. The lowest errors are in red, the second lowest errors in orange.

                       Bamboo2                           Temple1
                clean            final            clean            final
                EPE OF  MAE ∆d   EPE OF  MAE ∆d   EPE OF  MAE ∆d   EPE OF  MAE ∆d
Central view only
  Ours          1.007   0.136    1.102   0.140    1.042   0.109    1.383   0.128
  OW [15]       1.421   0.356    1.462   0.345    2.061   0.152    2.374   0.162
  Epic [16]     1.078   0.685    1.280   0.663    5.405   0.664    2.472   1.009
  PWC [20]      0.945   0.576    1.017   0.596    1.042   0.291    1.322   0.325
Global
  Ours          1.090   0.140    1.169   0.142    1.109   0.111    1.453   0.131
  OW [15]       /       /        /       /        /       /        /       /
  Epic [16]     1.071   1.020    1.264   1.015    5.676   0.730    2.609   1.208
  PWC [20]      0.947   0.873    1.017   0.884    1.038   0.385    1.323   0.429

[Figure 4: qualitative comparison on one frame pair (t, t+1): central view, ground truth, Ours, OW [15], Epic [16], PWC [20], for the optical flow (∆x, ∆y) and the disparity variation ∆d.]
Fig. 4. Visual comparison of our method with [15, 16, 20]. The optical flows are visualized with the Middlebury color code, and the disparity variations are visualized using a gray-scale representation. The red pixels are the occlusion mask where there is no ground truth disparity variation available.
6.2. Results
The proposed method is first assessed in comparison with the method in [15], referred to here as OW (Oriented Window). The latter was designed for dense light fields captured with plenoptic cameras. However, the optical flow searched via the oriented window can be combined with disparity maps estimated by methods suitable for sparse light fields. In the tests reported here, we used ground truth disparity maps for this method, thus showing the best results it can give for the estimated scene flow. We also compare the method with a naive approach that consists in separately estimating the disparity maps at t and t+1 and the optical flows between t and t+1, and finally computing the disparity variation as the difference of disparities along the optical flows. We consider two methods proposed for optical flow estimation: the sparse to dense estimation method EpicFlow [16] and a state-of-the-art technique based on a deep learning architecture called PWC-Net [20]. Note that this separate disparity and optical flow estimation does not handle occlusions, so the disparity variation in occluded or disoccluded areas will never be consistent.
The results are summarized in Table 1. For each successive light field frame of the four sequences (Bamboo2 and Temple1, both rendered as clean and final), we compute the endpoint error for the optical flow (EPE OF) and separately the mean absolute error for the disparity variation (MAE ∆d). The latter is only computed on pixels that remain visible between the two frames, because there is no ground truth in occluded and disoccluded areas. We compute these two errors on every ray of the light field (Global) and, because the method proposed in [15] only gives the scene flow for the central view, we also compute the errors on the central view only. We can observe that although [20] yields the most accurate optical flows, our method has a lower optical flow endpoint error than the two other methods and gets close to [20] for some sequences like Temple1 clean and Bamboo2 clean. Regarding the disparity variation, our method provides much lower errors than any other tested method. In the end, we propose a method that is slightly less precise than state-of-the-art optical flow methods but more accurate in terms of disparity variation.
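For reference, the two reported metrics amount to the following computations (a sketch; the array shapes and the visibility mask are assumptions of the snippet).

```python
import numpy as np

def epe_of(flow_est, flow_gt):
    """Optical flow endpoint error: mean Euclidean distance between
    estimated and ground truth (dx, dy) vectors, shape (H, W, 2)."""
    return float(np.mean(np.linalg.norm(flow_est - flow_gt, axis=-1)))

def mae_dd(dd_est, dd_gt, visible):
    """Mean absolute error of the disparity variation, restricted to the
    boolean mask of pixels that remain visible between the two frames."""
    return float(np.mean(np.abs(dd_est[visible] - dd_gt[visible])))
```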
7. CONCLUSION
In this paper, we have presented a new method to estimate scene flows from sparsely sampled video light fields. The method is based on three steps: first a sparse scene flow estimation, then a 4D clustering of the light field, and finally a sparse to dense scene flow interpolation for each cluster. For the performance evaluation, we have generated a synthetic dataset from the open source movie Sintel in order to extend the popular MPI Sintel dataset to light fields and scene flow. Our method gives performance comparable to state-of-the-art approaches when considering only the horizontal and vertical displacements (i.e. the optical flow). However, significant improvements were obtained in the estimation of the disparity variation.

References

M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, 1981.
E. W. Dijkstra, "A note on two problems in connexion with graphs," Numerische Mathematik, 1959.
R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
M. Levoy and P. Hanrahan, "Light field rendering," in ACM SIGGRAPH, 1996.
S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, "The lumigraph," in ACM SIGGRAPH, 1996.
M. Jaimez, C. Kerl, J. Gonzalez-Jimenez, and D. Cremers, "Fast odometry and scene flow from RGB-D cameras based on geometric clustering," in IEEE Int. Conf. on Robotics and Automation (ICRA).