HAL Id: hal-02123544
https://hal.archives-ouvertes.fr/hal-02123544
Submitted on 8 May 2019
Sparse to Dense Scene Flow Estimation from Light Fields
Pierre David, Mikaël Le Pendu, Christine Guillemot

To cite this version:
Pierre David, Mikaël Le Pendu, Christine Guillemot. Sparse to Dense Scene Flow Estimation from Light Fields. ICIP 2019 - IEEE International Conference on Image Processing, Sep 2019, Taipei, Taiwan. pp. 1-5, doi: 10.1109/ICIP.2019.8803520. hal-02123544

SPARSE TO DENSE SCENE FLOW ESTIMATION FROM LIGHT FIELDS

Pierre David*, Mikaël Le Pendu†, Christine Guillemot*

* Inria, Campus Universitaire de Beaulieu, 35042 Rennes, France
† Trinity College Dublin, Ireland
ABSTRACT

The paper addresses the problem of scene flow estimation from sparsely sampled video light fields. The scene flow estimation method is based on an affine model in the 4D ray space that allows us to estimate a dense flow from sparse estimates in 4D clusters. A dataset of synthetic video light fields created for assessing scene flow estimation techniques is also described. Experiments show that the proposed method gives error rates on the optical flow components that are comparable to those obtained with state of the art optical flow estimation methods, while computing a more accurate disparity variation when compared with prior scene flow estimation techniques.

Index Terms: Scene flow, optical flow, light field, sparse to dense
1. INTRODUCTION
Light fields have proven to be very useful for scene analysis, and in
particular for scene depth estimation [1, 2, 3, 4, 5]. However, while
a lot of work has been dedicated to scene analysis from static light
fields, very little effort has been invested in the problem of dynamic scene analysis from video light fields. The concept of scene flow was first defined in [6] as describing the 3D geometry as well as the 3D motion of each scene point. Considering a multi-view set-up, the scene flow is estimated using an optical flow estimator for each view. The 3D scene flow is then computed by fitting its projection on each view to the estimated optical flows. However, in the recent literature (e.g. [7, 8, 9, 10]), the scene flow is instead defined as a direct extension of the optical flow, where the depth (or disparity) variation ∆d of objects along time is represented in addition to the apparent 2D motion (∆x, ∆y).
Considering this definition, the scene flow analysis problem has
first been addressed for stereo video sequences: in [7, 8, 9] a scene
flow is estimated assuming that the scene can be decomposed into
rigidly moving objects. Several methods based on RGB-D videos
have also been developed [10], [11], [12]. The first methods for
scene flow analysis from dense light fields have been proposed in
[13] and [14]. Both are based on variational models. The authors
in [15] proposed oriented light field windows to estimate the scene
flow. All these methods rely on epipolar images which only provide
sufficient information for scene flow estimation of densely sampled
light fields (such as those captured with plenoptic cameras). However,
they do not address the case of sparse light fields (i.e. with large
baselines).
In this paper, we focus on the problem of scene flow analysis from sparse video light fields. This problem is made difficult due to the large temporal and angular occlusions. To cope with this difficulty, we propose a method for interpolating, in the 4D ray space, a sparse scene flow into a dense one. The sparse-to-dense approach naturally handles occlusions: while the sparse estimation can only be obtained on non-occluded parts of the image, the dense interpolation extends these estimations to every pixel (occluded or not). The proposed method is based on an affine model in the 4D ray space that allows us to estimate the dense flow from sparse estimates in nearby clusters. Note that an advantage of our sparse-to-dense approach is the possibility to use it as a post-processing step in other scene flow estimation methods, in the same way as EpicFlow [16] has been used in other methods (e.g. [17]) to interpolate the optical flow of pixels detected as outliers.

This work was supported by the EU H2020 Research and Innovation Program under grant agreement N° 694122 (ERC advanced grant CLIM).
In order to validate the proposed algorithm, we have generated
a synthetic light field video dataset based on the Sintel movie (used
in the optical flow benchmark [18, 19]). This dataset is composed
of two synthetic light fields (‘Bamboo2’ and ‘Temple1’) of 3 × 3
views of 1024 × 536 pixels and 50 frames. The light field views are
provided with the corresponding ground truth scene flow (optical
flow and disparity variation).
The proposed method has been assessed using this synthetic
dataset in comparison with the oriented window method in [15], and
with two methods proposed for optical flow estimation: the sparse to
dense estimation method EpicFlow [16] and a state-of-the-art technique based on a deep learning architecture called PWC-Net [20].
Note that, since the depth estimation component of the method in [15] has been designed for dense light fields and is hardly applicable when the baseline is large, we have coupled the optical flow estimated by this method with the ground truth disparity. The two optical flow estimation techniques [20] and
[16] are used for separately estimating the optical flow in each view
as well as the disparity between views. Experimental results show
that the proposed method gives optical flow endpoint error rates that
are comparable to those obtained with state of the art optical flow
estimation methods, while computing at the same time the disparity
variation with lower mean absolute errors. Although the accuracy of
our optical flow is slightly lower than that of PWC-Net, the depth
variation maps estimated by our method have considerably lower errors.
2. NOTATIONS AND METHOD OVERVIEW
Let us consider the 4D representation of light fields proposed in [21] and [22] to describe the radiance along the different light rays. This 4D function, at each time instant, is denoted LF_t(x, y, u, v), where the pairs x = (x, y) and u = (u, v) respectively denote the spatial and angular coordinates of light rays. A view u of a light field at t is written L_u^t. We denote by C : (I, I') → (S, S', f) the function that determines two sets S and S' of matching points in I and I' respectively. The associated bijection f is such that f(S) = S'. The inverse bijection is denoted f^{-1}.
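To make the matching abstraction concrete, here is a minimal sketch of a function with the interface of C, built on OpenCV ORB features and cross-checked brute-force matching. This is only an illustrative stand-in: the paper uses the matcher of [23], and the feature type, parameters and return layout below are assumptions of the sketch.

```python
import cv2
import numpy as np

def match_views(I, Ip):
    """A stand-in for the matching function C(I, I') -> (S, S', f), using ORB
    features and cross-checked brute-force matching.

    I, Ip : grayscale uint8 images (two views and/or two time instants).
    Returns S and S' as (N, 2) arrays of (x, y) points; the bijection f is
    implicit: row k of S matches row k of S'.
    """
    orb = cv2.ORB_create(nfeatures=5000)
    kp1, des1 = orb.detectAndCompute(I, None)
    kp2, des2 = orb.detectAndCompute(Ip, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    S = np.array([kp1[m.queryIdx].pt for m in matches])
    Sp = np.array([kp2[m.trainIdx].pt for m in matches])
    return S, Sp
```

Storing S and S' as row-aligned arrays makes the bijection f and its inverse trivial lookups.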

[Figure 1: block diagram. LF_t and LF_{t+1} are matched with C to produce a sparse scene flow, the light field rays are grouped into 4D clusters, and a sparse to dense interpolation yields the dense scene flow.]
Fig. 1. Overview of the sparse to dense scene flow estimation.
[Figure 2: the matching functions f_0, ..., f_5 between a view u and the central view c at t and t+1, illustrated on epipolar images.]
Fig. 2. Matching between a view u and the central view c at t and t + 1, shown in epipolar images at t and t + 1.
The proposed method takes as inputs two consecutive frames of a video light field, LF_t(x, y, u, v) and LF_{t+1}(x, y, u, v), and a matching function C, and estimates a consistent dense scene flow on each view. It produces disparity maps at t and t+1 and a disparity variation map between t and t+1 (in addition to the optical flow components). The algorithm proceeds as follows (see Fig. 1). The first step of the method is the robust sparse scene flow estimation. All light field rays are then grouped in 4D clusters to guide the sparse to dense interpolation of the scene flow. An affine model is computed from the sparse flow estimates in neighboring clusters, and then used to compute a flow value in each light field pixel.
3. SPARSE SCENE FLOW ESTIMATION
The method pivots on the central view L_c^t. We first compute (S_0, S'_0, f_0) = C(L_c^t, L_c^{t+1}). Then, for every non-central view u, we perform three kinds of matching (see Fig. 2): angular matching to estimate the sets of points (S_1, S'_1, f_1) = C(L_c^{t+1}, L_u^{t+1}) and (S_3, S'_3, f_3) = C(L_u^t, L_c^t), temporal matching to estimate the sets of points (S_2, S'_2, f_2) = C(L_u^{t+1}, L_u^t), and temporo-angular matching to estimate (S_4, S'_4, f_4) = C(L_c^t, L_u^{t+1}) and (S_5, S'_5, f_5) = C(L_c^{t+1}, L_u^t). In the experiments, we used [23] as function C.
The next step is to link the different sets together based on the distance between the points, computed as in [24]. We define a distance D' based on color, angular and spatial proximity in the 7D space [l a b u v x y] as:

D'(P_i, P_j) = √( d_c² + (m_s² / S²) · d_s² + (m_a² / A²) · d_a² )    (1)

with:

S = √(h × w / |S_0|),    A = N_u × N_v    (2)

where d_c, d_s and d_a are the color, spatial and angular distances defined as Euclidean distances in the CIELAB colorspace, the [x y] space and the [u v] space respectively. The variables w and h are the width and height of a view, N_u and N_v are the angular size of the light field, and m_s is a parameter controlling the balance between color and spatial distances that we fix to 10 in the experiments. For this step, we have d_a = 0 because we are aiming at linking points in the same view, but this distance will be used later in the clustering step.
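As a concrete reading of Eq. (1), the following NumPy sketch evaluates D' on two 7D ray descriptors [L, a, b, u, v, x, y]; apart from m_s = 10, which the text fixes, the default parameter values are placeholders.

```python
import numpy as np

def ray_distance(p_i, p_j, m_s=10.0, m_a=1.0, S=20.0, A=9.0):
    """Distance D' of Eq. (1) between two rays described by 7D vectors
    [L, a, b, u, v, x, y] (CIELAB color, angular and spatial coordinates).

    S and A are the normalization constants of Eq. (2); the default values
    here are arbitrary placeholders.
    """
    p_i, p_j = np.asarray(p_i, float), np.asarray(p_j, float)
    d_c = np.linalg.norm(p_i[0:3] - p_j[0:3])   # color distance in CIELAB
    d_a = np.linalg.norm(p_i[3:5] - p_j[3:5])   # angular distance in [u, v]
    d_s = np.linalg.norm(p_i[5:7] - p_j[5:7])   # spatial distance in [x, y]
    return np.sqrt(d_c**2 + (m_s**2 / S**2) * d_s**2 + (m_a**2 / A**2) * d_a**2)
```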
Based on the above distances, for each point P_0 ∈ S_0, we build a chain of points (P_0, P'_0, P_1, P'_1, P_2, P'_2, P_3, P'_3) by searching for the nearest neighbors successively in S_1, S_2 and S_3 (see Eq. (3) and Fig. 3a). We call this step a forward nearest neighbor search and denote NN the simple nearest neighbor search function for a point P_i ∈ S_i in a set S_j, according to the distance D.

P_0 →^{f_0} P'_0 →^{NN} P_1 →^{f_1} P'_1 →^{NN} P_2 →^{f_2} P'_2 →^{NN} P_3 →^{f_3} P'_3    (3)
To ensure the robustness of the chain of points, we then perform a backward nearest neighbor check as in Eq. (4). Given a point P'_i ∈ S'_i whose nearest neighbor in S_j is P_j, we check if P'_i is reciprocally the nearest neighbor of P_j in S'_i.

P'_3 →^{f_3^{-1}} P_3 →^{NN?} P'_2 →^{f_2^{-1}} P_2 →^{NN?} P'_1 →^{f_1^{-1}} P_1 →^{NN?} P'_0 →^{f_0^{-1}} P_0    (4)
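A minimal NumPy sketch of one forward search (Eq. (3)) followed by the backward check (Eq. (4)), under two simplifying assumptions: each matching is stored as a pair of row-aligned arrays (S_i, S'_i), and plain Euclidean distance stands in for the distance D.

```python
import numpy as np

def nn_index(p, pts):
    """Index of the point in pts closest to p (Euclidean stand-in for the distance D)."""
    return int(np.argmin(np.linalg.norm(pts - p, axis=1)))

def build_chain(k0, Sp0, S1, Sp1, S2, Sp2, S3):
    """Forward NN search (Eq. (3)) and backward NN check (Eq. (4)) for one seed P_0.

    Each matching i is stored as row-aligned (N_i, 2) arrays S_i, Sp_i with
    f_i(S_i[k]) = Sp_i[k].  k0 is the row of the seed in (S_0, Sp_0).
    Returns the row indices (k0, k1, k2, k3) of the chain, or None if a
    backward check fails.
    """
    # Forward pass: P'_0 -> P_1 -> P'_1 -> P_2 -> P'_2 -> P_3
    k1 = nn_index(Sp0[k0], S1)
    k2 = nn_index(Sp1[k1], S2)
    k3 = nn_index(Sp2[k2], S3)
    # Backward pass: each P'_i must be the reciprocal nearest neighbor of P_{i+1}.
    if nn_index(S3[k3], Sp2) != k2:
        return None
    if nn_index(S2[k2], Sp1) != k1:
        return None
    if nn_index(S1[k1], Sp0) != k0:
        return None
    return k0, k1, k2, k3
```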
Let A_1 be the set of points P_0 ∈ S_0 that passed the backward nearest neighbor test. We have A_1 ⊂ S_0. We then perform a similar forward nearest neighbor search and backward nearest neighbor test for each point P_0 ∈ S_0, but this time with the two temporo-angular matching sets S_4 and S_5, building the chains of points (P_0, P'_0, P_4) and (P_0, P'_0, P_5) (see Fig. 3b).
Let A_2 be the set of points P_0 ∈ S_0 that passed the backward nearest neighbor test in S_4 and S_5. We have A_2 ⊂ S_0. For each point P_0 ∈ A = A_1 ∩ A_2, we build a complete chain of points g = (P_0, P_1, P_2, P_3, P_4, P_5).
Given a chain of points, we check the consistency of each matching point P'_i = f_i(P_i) with the point P_j in the same view and instant. More precisely, we discard the chain if either D(P_0, P'_3) > τ, D(P_2, P'_4) > τ, or D(P_3, P'_5) > τ, where τ is a distance threshold that was fixed to 10 in our experiments.
Any remaining chain g is used to estimate a sparse scene flow. From g, we can compute the scene flow on two points of the light field LF_t: on the central view L_c^t at P_0 and on the view L_u^t at P'_2:

P'_0 − P_0 = (∆x_c, ∆y_c)^T   and   P'_2 − P_2 = (∆x_u, ∆y_u)^T    (5)
Let d_t and d_{t+1} be the disparities at t and t + 1. We have:

P'_1 − P_1 = (u − c) · d_{t+1}   and   P'_3 − P_3 = (u − c) · d_t    (6)
With the temporo-angular matching, we also have:

P'_4 − P_4 = (P'_2 − P_2) − (P'_3 − P_3) = (P'_0 − P_0) + (P'_1 − P_1)    (7)
P'_5 − P_5 = (P'_0 − P_0) − (P'_3 − P_3) = (P'_1 − P_1) + (P'_2 − P_2)

We solve this linear overdetermined system with a least squares method in order to determine the values ∆x_c, ∆y_c, ∆x_u, ∆y_u, d_t and d_{t+1} for each chain g.
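The resulting system can be stacked and solved with a standard least squares routine. The sketch below does this for one chain; the sign conventions written in the comments (flows go from t to t+1, and a point at x in the central view appears at x + (u − c)·d in view u) are assumptions of the sketch and may not match the exact writing of Eqs. (5)-(7).

```python
import numpy as np

def solve_chain(P, Pp, u, c):
    """Least squares scene flow from one chain (sketch).

    P, Pp : dicts {i: (x, y)} with the chain points P_i and P'_i for i = 0..5.
    u, c  : 2D angular coordinates of the view u and of the central view c.
    Unknowns: theta = (dx_c, dy_c, dx_u, dy_u, d_t, d_t1).
    """
    uc = np.asarray(u, float) - np.asarray(c, float)   # (u - c), a 2-vector
    rows, rhs = [], []

    def add(coeffs, value):
        # One 2D observation contributes two scalar rows of the linear system.
        for k in range(2):
            row = np.zeros(6)
            for idx, w in coeffs[k]:
                row[idx] = w
            rows.append(row)
            rhs.append(value[k])

    d = {i: np.asarray(Pp[i], float) - np.asarray(P[i], float) for i in P}

    add([[(0, 1.0)], [(1, 1.0)]], d[0])                            # P'_0 - P_0 =  (dx_c, dy_c)
    add([[(2, -1.0)], [(3, -1.0)]], d[2])                          # P'_2 - P_2 = -(dx_u, dy_u)
    add([[(5, uc[0])], [(5, uc[1])]], d[1])                        # P'_1 - P_1 =  (u-c)*d_t1
    add([[(4, -uc[0])], [(4, -uc[1])]], d[3])                      # P'_3 - P_3 = -(u-c)*d_t
    add([[(0, 1.0), (5, uc[0])], [(1, 1.0), (5, uc[1])]], d[4])    # P'_4 - P_4 =  flow_c + (u-c)*d_t1
    add([[(0, -1.0), (4, uc[0])], [(1, -1.0), (4, uc[1])]], d[5])  # P'_5 - P_5 = -flow_c + (u-c)*d_t

    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return theta   # dx_c, dy_c, dx_u, dy_u, d_t, d_t1
```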

[Figure 3: matching diagrams between L_c^t, L_c^{t+1}, L_u^t and L_u^{t+1}. (a) Chain of angular and temporal correspondences through (S_0, S'_0, f_0), (S_1, S'_1, f_1), (S_2, S'_2, f_2), (S_3, S'_3, f_3). (b) Chain of temporo-angular correspondences through (S_0, S'_0, f_0), (S_4, S'_4, f_4), (S_5, S'_5, f_5).]
Fig. 3. Forward nearest neighbor search and backward nearest neighbor check. The green dots represent the points that passed the backward nearest neighbor check.
4. CLUSTERING THE LIGHT FIELD
In order to estimate a dense scene flow from sparse estimates, we need to make an assumption about the scene. We assume that rays which have similar colors and are located near each other in the light field should have similar motions. As a consequence, similarly to [25] but extended to 4D light fields, we propose to decompose the light field into clusters based on color, spatial and angular proximity. Then, the clustering will be used to estimate one scene flow model per cluster. We generalize in 4D the SLIC superpixels developed in [24] to cluster the light field LF_t. We use as distance D', already defined in Eq. (1). We only modify the definition of S:

S = √(h × w / K)    (8)

K denotes the number of desired clusters. We use the parameters m_s and m_a to control the balance between spatial, angular and color proximity. S (in Eq. (8)) and A (in Eq. (2)) are the maximum spatial and angular distances expected in a cluster.
The spatial positions of the centroids C_k of the clusters are initialized on a regular grid of step S, while their initial angular positions are randomly sampled. Then, each pixel i is assigned to the nearest cluster center C_k. We fix the search region to N_u × N_v × 2S × 2S because the expected approximate spatial size of each cluster is S × S and the approximate angular size is N_u × N_v. Indeed, each cluster is expected to contain pixels in all the N_u × N_v views.
Once the assignment step is over, we update the cluster centers by computing the new centroid of each cluster. We only need to iterate this second step N_iteration = 10 times in order to obtain stabilized cluster centroids.
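For illustration, the assignment and update steps can be sketched as a brute-force loop over all rays; unlike the procedure above, this version uses a random initialization and no N_u × N_v × 2S × 2S search window, so it is only a readable approximation of the clustering step, not the authors' implementation.

```python
import numpy as np

def cluster_light_field(feat, K, m_s=10.0, m_a=1.0, n_iter=10):
    """Brute-force 4D clustering sketch in the spirit of Section 4.

    feat : (N, 7) array with one row [L, a, b, u, v, x, y] per light field ray
           (all views of LF_t flattened together).
    Returns an array of N cluster labels.  Written for clarity, not speed.
    """
    h = feat[:, 6].max() + 1.0            # image height (max y + 1)
    w = feat[:, 5].max() + 1.0            # image width  (max x + 1)
    Nu = feat[:, 3].max() + 1.0           # angular sizes (max u + 1, max v + 1)
    Nv = feat[:, 4].max() + 1.0
    S = np.sqrt(h * w / K)                # Eq. (8)
    A = Nu * Nv                           # Eq. (2)

    # Initialization: K rays sampled at random (the paper instead puts the
    # spatial positions on a regular grid of step S and only samples the
    # angular positions randomly).
    rng = np.random.default_rng(0)
    centroids = feat[rng.choice(len(feat), size=K, replace=False)].astype(float)

    for _ in range(n_iter):
        # Assignment step: distance D' of Eq. (1) to every centroid
        # (no restricted search window in this sketch).
        d_c = np.linalg.norm(feat[None, :, 0:3] - centroids[:, None, 0:3], axis=2)
        d_a = np.linalg.norm(feat[None, :, 3:5] - centroids[:, None, 3:5], axis=2)
        d_s = np.linalg.norm(feat[None, :, 5:7] - centroids[:, None, 5:7], axis=2)
        D = np.sqrt(d_c**2 + (m_s**2 / S**2) * d_s**2 + (m_a**2 / A**2) * d_a**2)
        labels = np.argmin(D, axis=0)
        # Update step: recompute each centroid as the mean of its assigned rays.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = feat[labels == k].mean(axis=0)
    return labels
```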
5. SPARSE TO DENSE INTERPOLATION
Once we have divided the light field into clusters, we estimate a model for each one of them. For each cluster, we look for the points estimated in Section 3 that are inside the cluster, compute the mean value of the sparse estimates, and associate it with the centroid of the cluster. For clusters that have no seeds inside them, we do nothing at this stage. We end up with initialized clusters and uninitialized ones.
As in [16], we build a weighted graph where each node is a cluster (initialized or not) and where an edge between two nodes means that the two clusters are adjacent. The associated weight is defined as the distance D' (as in Eq. (1)) between the centroids of the two clusters. We then look for the N nearest initialized neighbors using Dijkstra's algorithm on the graph [26], discarding every empty cluster.
For each cluster, we estimate the parameters of the affine model in Eq. (9) in the 4D space by fitting the model on the initialized centroids using the RANSAC algorithm [27].

a_1 u + b_1 v + c_1 x + d_1 y + e_1 = x + ∆x
a_2 u + b_2 v + c_2 x + d_2 y + e_2 = y + ∆y    (9)
a_3 u + b_3 v + c_3 x + d_3 y + e_3 = ∆d
We then apply the model to every pixel belonging to the cluster to compute its (∆x, ∆y, ∆d) scene flow value.
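A possible sketch of one per-cluster fit, using a plain RANSAC loop around NumPy least squares; the number of trials and the inlier threshold are placeholders, and each of the three lines of Eq. (9) is fitted independently.

```python
import numpy as np

def fit_affine_line(uvxy, targets, n_trials=200, thresh=1.0, seed=0):
    """RANSAC fit of one line of the affine model of Eq. (9) (sketch).

    uvxy    : (N, 4) coordinates (u, v, x, y) of the initialized centroids, N >= 5.
    targets : (N,) values of the fitted quantity at these centroids, e.g.
              x + dx for the first line of Eq. (9), or dd for the third one.
    Returns the coefficients (a, b, c, d, e).
    """
    rng = np.random.default_rng(seed)
    A = np.hstack([uvxy, np.ones((len(uvxy), 1))])       # rows [u, v, x, y, 1]
    best_coef, best_count = None, -1
    for _ in range(n_trials):
        sample = rng.choice(len(A), size=5, replace=False)   # minimal sample
        coef, *_ = np.linalg.lstsq(A[sample], targets[sample], rcond=None)
        inliers = np.abs(A @ coef - targets) < thresh
        if inliers.sum() > best_count:
            # Refit on all inliers of the best minimal sample so far.
            best_coef, *_ = np.linalg.lstsq(A[inliers], targets[inliers], rcond=None)
            best_count = inliers.sum()
    return best_coef

def apply_affine_line(coef, uvxy):
    """Evaluate a fitted line of Eq. (9) at arbitrary ray coordinates (u, v, x, y)."""
    return np.hstack([uvxy, np.ones((len(uvxy), 1))]) @ coef
```

To recover ∆x at a ray, subtract the ray's x coordinate from the output of the first fitted line (similarly for ∆y with the second one); the third line gives ∆d directly.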
As a final post-processing step, we perform an energy minimization like in [16] for the optical flow of each subaperture image independently. To regularize the disparity variation map, we then perform joint bilateral filtering using the optical flow as a guide.
6. EVALUATION
6.1. Scene Flow Dataset
For our experiments, we have prepared a synthetic video light field dataset¹. For that purpose, we have used the production files of the open source movie Sintel [28] and have modified them in the Blender 3D software [29] in order to render an array of 3 × 3 views. Similarly to the MPI Sintel flow dataset [18, 19], we have modified the scenes to generate not only the 'final' render, but also a 'clean' render without lighting effects, motion blur, or semi-transparent objects. Ground truth optical flow and disparity maps were also generated for each view. Since disparity variation maps could not be rendered within Blender, we have computed them using the disparity and optical flow. However, this process requires projecting the disparity map of a frame to the next frame using the optical flow, which results in unavailable disparity variation information in areas of temporal occlusion.
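The projection described above can be sketched as follows, with a nearest-neighbor lookup and a simple validity mask standing in for the actual occlusion handling used when generating the dataset.

```python
import numpy as np

def disparity_variation(d_t, d_t1, flow):
    """Disparity variation from disparities and optical flow (sketch).

    d_t, d_t1 : (H, W) disparity maps at t and t+1.
    flow      : (H, W, 2) forward optical flow (dx, dy) from t to t+1.
    Returns (delta_d, valid): the variation map and a mask of pixels whose
    flow target stays inside the image (nearest-neighbor lookup, no occlusion
    reasoning, unlike the actual dataset generation).
    """
    H, W = d_t.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xt = np.rint(xs + flow[..., 0]).astype(int)
    yt = np.rint(ys + flow[..., 1]).astype(int)
    valid = (xt >= 0) & (xt < W) & (yt >= 0) & (yt < H)
    delta_d = np.zeros_like(d_t)
    delta_d[valid] = d_t1[yt[valid], xt[valid]] - d_t[valid]
    return delta_d, valid
```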
We have processed two scenes of 3 × 3 views of 1024 × 536 pixels and 50 frames corresponding to the scenes 'Bamboo2' and
'Temple1' in [18]. The disparities (in pixels) between neighboring views are in the range [−8, +52] for 'Bamboo2' and [−22, +9] for 'Temple1'.

¹ http://clim.inria.fr/Datasets/SyntheticVideoLF/index.html

Table 1. Results on our light field scene flow dataset for the optical flow endpoint error (EPE OF) and for the disparity variation mean absolute error (MAE ∆d). The latter is only computed on pixels that remain visible between two adjacent frames because of the lack of ground truth in occluded and disoccluded areas. The lowest errors are in red, the second lowest errors in orange.

                       Bamboo2                           Temple1
                clean            final            clean            final
                EPE OF  MAE ∆d   EPE OF  MAE ∆d   EPE OF  MAE ∆d   EPE OF  MAE ∆d
Central view only
  Ours          1.007   0.136    1.102   0.140    1.042   0.109    1.383   0.128
  OW [15]       1.421   0.356    1.462   0.345    2.061   0.152    2.374   0.162
  Epic [16]     1.078   0.685    1.280   0.663    5.405   0.664    2.472   1.009
  PWC [20]      0.945   0.576    1.017   0.596    1.042   0.291    1.322   0.325
Global
  Ours          1.090   0.140    1.169   0.142    1.109   0.111    1.453   0.131
  OW [15]       /       /        /       /        /       /        /       /
  Epic [16]     1.071   1.020    1.264   1.015    5.676   0.730    2.609   1.208
  PWC [20]      0.947   0.873    1.017   0.884    1.038   0.385    1.323   0.429

[Figure 4: qualitative comparison on one frame pair (t, t+1): central view, ground truth, Ours, OW [15], Epic [16], PWC [20], for the optical flow (∆x, ∆y) and the disparity variation ∆d.]
Fig. 4. Visual comparison of our method with [15, 16, 20]. The optical flows are visualized with the Middlebury color code, and the disparity variations are visualized using a gray-scale representation. The red pixels are the occlusion mask where there is no ground truth disparity variation available.
6.2. Results
The proposed method is first assessed in comparison with the method in [15], referred to here as OW (Oriented Window). The latter was designed for dense light fields captured with plenoptic cameras. However, the optical flow searched via the oriented window can be combined with disparity maps estimated by methods suitable for sparse light fields. In the tests reported here, we used ground truth disparity maps for this method, thus showing the best results it can give for the estimated scene flow. We also compare the method with a naive approach that consists in separately estimating the disparity maps at t and t+1 and the optical flows between t and t+1, and finally computing the disparity variation as the difference of disparities along the optical flows. We consider two methods proposed for optical flow estimation: the sparse to dense estimation method EpicFlow [16] and a state-of-the-art technique based on a deep learning architecture called PWC-Net [20]. Note that this separate disparity and optical flow estimation does not handle occlusions, so the disparity variation in occluded or disoccluded areas will never be consistent.
The results are summarized in Table 1. For each successive light field frame of the four sequences (Bamboo2 and Temple1, both rendered as clean and final), we compute the endpoint error for the optical flow (EPE OF) and separately the mean absolute error for the disparity variation (MAE ∆d). The latter is only computed on pixels that remain visible between the two frames, because there is no ground truth in occluded and disoccluded areas. We compute these two errors on every ray of the light field (Global) and, because the method proposed in [15] only gives the scene flow for the central view, we also compute the errors on the central view only. We can observe that although [20] yields the most accurate optical flows, our method has a lower optical flow endpoint error than the two other methods and gets close to [20] for some sequences like Temple1 clean and Bamboo2 clean. Regarding the disparity variation, our method provides much lower errors than any other tested method. In the end, we propose a method that is slightly less precise than state-of-the-art optical flow methods but more accurate in terms of disparity variation.
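For reference, the two reported metrics amount to the following computations (a sketch; the array shapes and the visibility mask are assumptions of the snippet).

```python
import numpy as np

def epe_of(flow_est, flow_gt):
    """Optical flow endpoint error: mean Euclidean distance between
    estimated and ground truth (dx, dy) vectors, shape (H, W, 2)."""
    return float(np.mean(np.linalg.norm(flow_est - flow_gt, axis=-1)))

def mae_dd(dd_est, dd_gt, visible):
    """Mean absolute error of the disparity variation, restricted to the
    boolean mask of pixels that remain visible between the two frames."""
    return float(np.mean(np.abs(dd_est[visible] - dd_gt[visible])))
```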
7. CONCLUSION
In this paper, we have presented a new method to estimate scene flows from sparsely sampled video light fields. The method is based on three steps: first a sparse scene flow estimation, then a 4D clustering of the light field, and finally a sparse to dense scene flow interpolation for each cluster. For the performance evaluation, we have generated a synthetic dataset from the open source movie Sintel in order to extend the popular MPI Sintel dataset to light fields and scene flow. Our method gives performance comparable to state-of-the-art approaches when considering only the horizontal and vertical displacements (i.e. the optical flow). However, significant improvements were obtained in the estimation of the disparity variation.

References

M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, 1981.
E. W. Dijkstra, "A note on two problems in connexion with graphs," Numerische Mathematik, 1959.
R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
M. Levoy and P. Hanrahan, "Light field rendering," in ACM SIGGRAPH, 1996.
S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, "The lumigraph," in ACM SIGGRAPH, 1996.
M. Jaimez, C. Kerl, J. Gonzalez-Jimenez, and D. Cremers, "Fast odometry and scene flow from RGB-D cameras based on geometric clustering," in IEEE Int. Conf. on Robotics and Automation (ICRA).