Book Chapter

2.1 Depth Estimation of Frames in Image Sequences Using Motion Occlusions

Guillem Palou¹, Philippe Salembier¹

07 Oct 2012, pp. 516-525

TL;DR: This paper proposes a system to depth order regions of a frame belonging to a monocular image sequence, where regions are ordered according to their relative depth using the previous and following frames.


Abstract: This paper proposes a system to depth order regions of a frame belonging to a monocular image sequence. For a given frame, regions are ordered according to their relative depth using the previous and following frames. The algorithm estimates occluded and disoccluded pixels belonging to the central frame. Afterwards, a Binary Partition Tree (BPT) is constructed to obtain a hierarchical, region based representation of the image. The final depth partition is obtained by means of energy minimization on the BPT. To achieve a global depth ordering from local occlusion cues, a depth order graph is constructed and used to eliminate contradictory local cues. Results of the system are evaluated and compared with state of the art figure/ground labeling systems on several datasets, showing promising results.


Summary (2 min read)


1 Introduction

Depth perception in human vision emerges from several depth cues.
Most of the published approaches make use of two (or more) points of view to compute the disparity as it offers a reliable cue for depth estimation [2].
In contrast, references [6, 7] attempt to retrieve a full depth map from a monocular image sequence, under assumptions about the scene structure that may not hold in typical sequences.
The work [8] assigns figure/ground (f/g) labels to detected occlusion boundaries.
First, the optical flow is used in Section 2 to introduce motion information for the BPT [9] construction and in Section 3 to estimate (dis)occluded points.

2 Optical Flow and Image Representation

To determine the depth order of frame $I_t$, the previous $I_{t-1}$ and following $I_{t+1}$ frames are used.
For two given temporal indices $a$, $b$, the optical flow vector $w_{a,b}$ maps each pixel of $I_a$ to one pixel in $I_b$.
Iteratively, the two most similar neighboring regions according to a predefined distance are merged and the process is repeated until only one region is left.
The BPT describes a set of regions organized in a tree structure and this hierarchical structure represents the inclusion relationship between regions.

3 Motion Occlusions from Optical Flow

When only one point of view is available, humans take advantage of monocular depth cues to retrieve the scene structure: motion parallax and motion occlusions.
Motion parallax assumes still scenes, and it is able to retrieve the absolute depth.
Since motion occlusions appear in more situations and do not make any assumptions, they are selected here.
The pixel $p_x$ with the largest patch distance $D(p_x, p_m)$ is taken to be the occluded pixel.
Occluded and disoccluded pixels may be useful to some extent (e.g. to improve optical flow estimation, [12]).

4 Depth Order Retrieval

Once the optical flow is estimated and the BPT is constructed, the last step of the system is to retrieve a suitable partition to depth order its regions.
There are many ways to obtain a partition from a hierarchical representation [14, 15, 9].
Since raw optical flows are not reliable at (dis)occluded points, a first step finds a partition $P_f$ in which an optical flow model is fitted to each region.
Once the occlusion relations are estimated, the second step finds a second partition $P_d$ that attempts to keep occluded-occluding pairs in different regions.
$P_f$ and $P_d$ are both obtained with the same energy minimization algorithm.

4.1 General Energy Minimization on BPTs

If the energy of each region depends only on the internal characteristics of that region, Algorithm 1 uses dynamic programming (Viterbi-like) to find the optimal $x^*$.
Small BPT with green nodes marked forming the pruning x3, also known as Center.
Keyframe with occluded (red) and occluding pixels overlaid.
The modeled flow $\tilde{w}^{R_i}_{t,q}$ is estimated by robust regression [17] for each region $R_i$.
Occlusion relations estimation. With $P_f$ and a flow model available for each region, occlusion relations can be reliably estimated.

4.2 Depth ordering

The vertices V represent the regions of D and the edges E represent occlusion relations between regions.
The weight pi = Nab/No where Nab is the number of occlusion relations between both regions.
It iteratively finds low-confidence occlusion relations and breaks cycles.
Once all cycles have been removed in G, a topological partial sort [18] is applied and each region is assigned a depth order.
Regions which have no depth relation are assigned the depth of their most similar adjacent region according to the distance in the BPT construction.
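
The cycle-breaking and sorting steps can be illustrated with a minimal sketch. It is not the authors' implementation: the text does not say how low-confidence relations are selected, so this version assumes that the weakest edge of each detected cycle is removed. The names `find_cycle` and `depth_order` are hypothetical, and the final step (propagating depth to regions without any relation) is omitted.

```python
def find_cycle(edges):
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state = {v: 0 for v in graph}  # 0 = unvisited, 1 = on current path, 2 = done
    path = []

    def visit(v):
        state[v] = 1
        path.append(v)
        for n in graph[v]:
            if state[n] == 1:  # back edge: a cycle closes at n
                cyc = path[path.index(n):] + [n]
                return list(zip(cyc, cyc[1:]))
            if state[n] == 0:
                found = visit(n)
                if found:
                    return found
        path.pop()
        state[v] = 2
        return None

    for v in graph:
        if state[v] == 0:
            found = visit(v)
            if found:
                return found
    return None


def depth_order(relations):
    """relations: {(front, back): confidence}. Returns {region: level}, 0 = closest."""
    edges = dict(relations)
    while True:
        cycle = find_cycle(edges)
        if cycle is None:
            break
        # Assumption: drop the least confident relation on the cycle.
        del edges[min(cycle, key=lambda e: edges[e])]

    nodes = {v for e in relations for v in e}
    preds = {v: [] for v in nodes}
    for front, back in edges:
        preds[back].append(front)

    level = {}

    def get(v):
        # Level = length of the longest chain of occluders in front of v.
        if v not in level:
            level[v] = 1 + max((get(u) for u in preds[v]), default=-1)
        return level[v]

    return {v: get(v) for v in nodes}


# Hypothetical confidences: C "occluding" A contradicts the other two relations.
order = depth_order({("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "A"): 0.1})
```

In this toy graph the weak relation from C to A is discarded, leaving A in front of B and B in front of C.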

5 Results

The evaluation of the system is performed at keyframes of several sequences, comparing the assigned f/g contours against the ground-truth assignments.
When two depth planes meet, the part of the contour belonging to the closest region is assigned figure, or ground otherwise, see Figure 6.
The datasets are the Carnegie Mellon Dataset (CMU) [19] and the Berkeley Dataset (BDS) [8].
It can be seen in Table 1 that the proposed system outperforms the one presented in [8], showing that motion occlusions are a reliable cue for depth ordering.
In spite of the simplicity of the optical flow estimation algorithm, occlusion points were reliably estimated.

6 Conclusions

A system inferring the relative depth order of the different regions of a frame relying only on motion occlusion has been described.
By combining a variational approach for optical flow estimation with a region based representation of the image, the authors have developed a reliable system to detect occlusion relations and to create depth ordered partitions using only these depth cues.
Comparison with the state of the art shows that motion occlusions are very reliable cues.
There are many possible extensions to the proposed system.
The authors also believe that occlusions caused by motion can be propagated throughout the sequence to infer a consistent depth ordering across multiple frames.


Figures (6)

Fig. 1. Scheme of the proposed system. From three consecutive frames of a sequence (green blocks), a 2.1D map is estimated (red block)

Fig. 5. Depth ordering example. From left to right, top to bottom. Final depth partition with region number. Estimated occluded points in red and occluding points in green. Initial graph. Final graph where cycles have been removed. Depth order image (the brighter the region, the closer).

Table 1. Our method vs. [8] on the percentage of correct f/g assignments.

Fig. 2. Left: color code used to represent optical flow values. Three consecutive frames are presented in the top row: $I_{t-1}$, $I_t$ in red and $I_{t+1}$. In the bottom row, from left to right, the $w_{t-1,t}$, $w_{t,t-1}$, $w_{t,t+1}$, $w_{t+1,t}$ flows are shown.

Fig. 6. Results on the CMU dataset. From left to right, for the two columns. 1) Keyframe image and 2) image with occlusion relations (green occluding, red occluded). 3) estimated depth partition, with white regions meaning closer and black meaning further. 4) Figure/ground assignment on contours with green and red overlaid marking figure and ground regions, respectively.

Fig. 7. Results on some of the sequences of the BDS dataset. For each column, the right image corresponds to the keyframe with figure/ground assignments on contours overlaid. The left image corresponds to the final depth ordered partition.


2.1 Depth Estimation of Frames in Image Sequences Using Motion Occlusions

Guillem Palou, Philippe Salembier

Technical University of Catalonia (UPC), Dept. of Signal Theory and Communications, Barcelona, Spain

{guillem.palou,philippe.salembier}@upc.edu

Abstract. This paper proposes a system to depth order regions of a frame belonging to a monocular image sequence. For a given frame, regions are ordered according to their relative depth using the previous and following frames. The algorithm estimates occluded and disoccluded pixels belonging to the central frame. Afterwards, a Binary Partition Tree (BPT) is constructed to obtain a hierarchical, region based representation of the image. The final depth partition is obtained by means of energy minimization on the BPT. To achieve a global depth ordering from local occlusion cues, a depth order graph is constructed and used to eliminate contradictory local cues. Results of the system are evaluated and compared with state of the art figure/ground labeling systems on several datasets, showing promising results.

1 Introduction

Depth perception in human vision emerges from several depth cues. Normally, humans estimate depth accurately making use of both eyes, inferring (subconsciously) disparity between two views. However, when only one point of view is available, it is also possible to estimate the scene structure to some extent. This is done by the so called monocular depth cues. In static images, T-junctions or convexity cues may be detected in specific image areas and provide depth order information. If a temporal dimension is introduced, motion information can also be used to get depth information. Occlusion of moving objects, size changes or motion parallax are used in the human brain to structure the scene [1].

Nowadays, a strong research activity is focusing on depth maps generation, mainly motivated by the film industry. However, most of the published approaches make use of two (or more) points of view to compute the disparity as it offers a reliable cue for depth estimation [2]. Disparity needs at least two images captured at the same time instant but, sometimes, this requirement cannot be fulfilled. For example, current handheld cameras have only one objective. Moreover, a large amount of material has already been acquired as monocular sequences and needs to be converted. In such cases, depth perception should be inferred only through monocular cues. Although monocular cues are less reliable than stereo cues, humans can do this task with ease.

Fig. 1. Scheme of the proposed system. From three consecutive frames of a sequence (green blocks), a 2.1D map is estimated (red block). Blocks shown: frames $I_{t-1}$, $I_t$, $I_{t+1}$; optical flow and (dis)occluded points estimation; BPT construction; occlusion relations estimation; pruning; depth ordering; 2.1D map.

The 2.1D model is an intermediate state between 2D images and full/absolute 3D maps, representing the image as a partition with its regions ordered by their relative depth. State of the art depth ordering systems on monocular sequences focus on the extraction of foreground regions from the background. Although this may be appropriate for some applications, more information can be extracted from an image sequence. The approach in [3] provides a pseudo-depth estimation to detect occlusion boundaries from optical flow. References [4, 5] estimate a layered image representation of the scene. In contrast, references [6, 7] attempt to retrieve a full depth map from a monocular image sequence, under some assumptions/restrictions about the scene structure which may not be fulfilled in typical sequences. The work [8] assigns figure/ground (f/g) labels to detected occlusion boundaries. f/g labeling provides a quantitative measure of depth ordering, as it assigns a local depth gradient at each occlusion boundary. Although f/g labeling is an interesting field of study, it does not offer a dense depth representation.

A good monocular cue to determine a 2.1D map of the scene is motion occlusion. When objects move, background regions (dis)appear, creating occlusions. Humans use these occlusions to detect the relative depth between scene regions. The proposed work assesses the performance of these cues in a fully automated system. To this end, the process is divided as shown in Figure 1 and presented as follows. First, the optical flow is used in Section 2 to introduce motion information for the BPT [9] construction and in Section 3 to estimate (dis)occluded points. Next, to find both occlusion relations and a partition of the current frame, the energy minimization technique described in Section 4 is used. Lastly, the regions of this partition are ordered, generating a 2.1D map. Results compared with [8] are presented in Section 5.

2 Optical Flow and Image Representation

As shown in Figure 2, to determine the depth order of frame $I_t$, the previous $I_{t-1}$ and following $I_{t+1}$ frames are used. Forward $w_{t-1,t}$, $w_{t,t+1}$ and backward flows $w_{t,t-1}$, $w_{t+1,t}$ can be estimated using [10]. For two given temporal indices $a$, $b$, the optical flow vector $w_{a,b}$ maps each pixel of $I_a$ to one pixel in $I_b$.

Fig. 2. Left: color code used to represent optical flow values. Three consecutive frames are presented in the top row: $I_{t-1}$, $I_t$ in red and $I_{t+1}$. In the bottom row, from left to right, the $w_{t-1,t}$, $w_{t,t-1}$, $w_{t,t+1}$, $w_{t+1,t}$ flows are shown.

Once the optical flows are computed, a BPT is built [11]. The BPT begins with an initial partition (here a partition where each pixel forms a region). Iteratively, the two most similar neighboring regions according to a predefined distance are merged and the process is repeated until only one region is left. The BPT describes a set of regions organized in a tree structure and this hierarchical structure represents the inclusion relationship between regions. Although the construction process is an active field of study, it is not the main purpose of this paper and we chose the distance defined in [11] to build the BPT: the region distance is defined using color, area, shape and motion information.
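
The merging procedure can be sketched as follows. This is only an illustration of the greedy construction: the distance of [11] combines color, area, shape and motion, whereas this sketch assumes a grayscale image and uses the difference of mean intensities as a stand-in distance. The function name `build_bpt` and its return format are hypothetical.

```python
import numpy as np


def build_bpt(image):
    """Greedy bottom-up merging. Returns ({parent: (child_a, child_b)}, n_leaves)."""
    h, w = image.shape
    n_leaves = h * w
    # Per-region statistics: sum of intensities and number of pixels.
    total = {i: float(v) for i, v in enumerate(image.ravel())}
    count = {i: 1 for i in range(n_leaves)}
    # 4-connected adjacency between the initial one-pixel regions.
    adj = {i: set() for i in range(n_leaves)}
    for y in range(h):
        for x in range(w):
            i = y * w + x
            if x + 1 < w:
                adj[i].add(i + 1)
                adj[i + 1].add(i)
            if y + 1 < h:
                adj[i].add(i + w)
                adj[i + w].add(i)

    def distance(pair):
        a, b = pair
        # Stand-in distance: difference of mean intensities (an assumption).
        return abs(total[a] / count[a] - total[b] / count[b])

    children = {}
    next_id = n_leaves
    while len(adj) > 1:
        # Merge the two most similar neighboring regions.
        a, b = min(((a, b) for a in adj for b in adj[a] if a < b), key=distance)
        new = next_id
        next_id += 1
        children[new] = (a, b)
        total[new] = total[a] + total[b]
        count[new] = count[a] + count[b]
        neighbors = (adj[a] | adj[b]) - {a, b}
        for r in (a, b):
            for n in adj[r]:
                adj[n].discard(r)
            del adj[r]
        adj[new] = neighbors
        for n in neighbors:
            adj[n].add(new)
    return children, n_leaves


# Four pixels in a row: two dark, two bright.
tree, n_leaves = build_bpt(np.array([[0.0, 1.0, 10.0, 12.0]]))
```

With N leaves the tree always has 2N - 1 nodes, and the last node created is the root covering the whole image. The exhaustive search for the best pair keeps the sketch short; a priority queue would be the usual choice for real images.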

3 Motion Occlusions from Optical Flow

When only one point of view is available, humans take advantage of monocular depth cues to retrieve the scene structure: motion parallax and motion occlusions. Motion parallax assumes still scenes, and it is able to retrieve the absolute depth. Occlusions may work in dynamic scenes but only offer insights about relative depth. Since motion occlusions appear in more situations and do not make any assumptions, they are selected here. Motion occlusions can be detected with several approaches [12, 13]. In this work, however, a different approach is followed as it gave better results in practice.

Using three frames $I_{t-1}$, $I_t$, $I_{t+1}$, it is possible to detect pixels becoming occluded from $I_t$ to $I_{t+1}$ and pixels becoming visible (disoccluded) from $I_{t-1}$ to $I_t$. To detect motion occlusions, the optical flow between an image pair $(I_t, I_q)$ is used with $q = t \pm 1$. To obtain occluded pixels $q = t + 1$, while disoccluded are obtained when $q = t - 1$.

Flow estimation attempts to find a matching for each pixel between two frames. If a pixel is visible in both frames, the flow estimation is likely to find the true matching. If, however, the pixel becomes (dis)occluded, the matching will not be against its true peer. In the case of occlusion, two pixels $p_a$ and $p_b$ in $I_t$ will be matched with the same pixel $p_m$ in frame $I_q$:

$$p_a + w_{t,q}(p_a) = p_b + w_{t,q}(p_b) = p_m \qquad (1)$$

Equation (1) implicitly tells that either $p_a$ or $p_b$ is occluded. It is likely that the non occluded pixel neighborhood is highly correlated in both frames. Therefore, to decide which one is the occluded pixel, a patch distance is computed:

$$D(p_x, p_m) = \sum_{d \in \Gamma} \left( I_q(p_m + d) - I_t(p_x + d) \right)^2 \qquad (2)$$

with $p_x = p_a$ or $p_b$. The pixel with maximum $D(p_x, p_m)$ value is decided to be the occluded pixel. The neighborhood $\Gamma$ is a 5 × 5 square window centered at $p_x$ but results are similar with windows of size 3 × 3 or 7 × 7.
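
The collision test of Equation (1) and the patch comparison of Equation (2) can be sketched as below. The sketch makes several assumptions that are not in the text: images are grayscale arrays, the flow is stored as an (H, W, 2) array of (dx, dy) displacements, target positions are rounded to the nearest pixel, borders are handled by edge replication, and when more than two pixels collide only the one with the largest distance is marked. The function name `occluded_pixels` is hypothetical.

```python
from collections import defaultdict

import numpy as np


def occluded_pixels(img_t, img_q, flow, radius=2):
    """Boolean mask of pixels of I_t judged occluded in I_q (radius=2 gives 5x5 patches)."""
    h, w = img_t.shape
    size = 2 * radius + 1
    # Edge padding so that patches near the border are well defined (assumption).
    pad_t = np.pad(img_t.astype(float), radius, mode="edge")
    pad_q = np.pad(img_q.astype(float), radius, mode="edge")

    def patch(padded, y, x):
        # Patch centered at (y, x) of the unpadded image.
        return padded[y:y + size, x:x + size]

    # Group the pixels of I_t by the pixel of I_q they are matched with (Eq. 1).
    targets = defaultdict(list)
    for y in range(h):
        for x in range(w):
            xm = int(round(x + flow[y, x, 0]))
            ym = int(round(y + flow[y, x, 1]))
            if 0 <= xm < w and 0 <= ym < h:
                targets[(ym, xm)].append((y, x))

    occluded = np.zeros((h, w), dtype=bool)
    for (ym, xm), sources in targets.items():
        if len(sources) < 2:
            continue  # no collision, nothing to decide
        # Patch distance of Eq. (2) for every colliding source pixel.
        dists = [np.sum((patch(pad_q, ym, xm) - patch(pad_t, y, x)) ** 2)
                 for y, x in sources]
        occluded[sources[int(np.argmax(dists))]] = True
    return occluded


# One-row toy scene: background (x = 0..2) moves right, foreground (x = 3..5) is static.
img_t = np.array([[1.0, 2.0, 3.0, 9.0, 9.0, 9.0]])
img_q = np.array([[0.0, 1.0, 2.0, 9.0, 9.0, 9.0]])
flow = np.zeros((1, 6, 2))
flow[0, :3, 0] = 1.0
mask = occluded_pixels(img_t, img_q, flow, radius=1)
```

In the toy scene the background pixel at x = 2 and the foreground pixel at x = 3 both land on x = 3. The foreground patch matches better, so the background pixel is the one flagged.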

Occluded and disoccluded pixels may be useful to some extent (e.g. to improve optical flow estimation, [12]). To retrieve a 2.1D map, a (dis)occluded-(dis)occluding relation is needed to create a depth order. (Dis)occluding pixels are pixels in $I_t$ that will be in front of their (dis)occluded peer in $I_q$. Therefore, using these relations it is possible to order different regions in the frame according to depth. In the proposed system, occlusion relations estimation is postponed until the BPT representation is available, see Section 4.1. The reason is that raw estimated optical flows are not reliable at occluded points. Nevertheless, with the knowledge of region information it is possible to fit optical flow models to regions and provide confident optical flow values even for (dis)occluded points.

4 Depth Order Retrieval

Once the optical flow is estimated and the BPT is constructed, the last step of the system is to retrieve a suitable partition to depth order its regions. There are many ways to obtain a partition from a hierarchical representation [14, 15, 9]. In this work an energy minimization strategy is proposed. The complete process comprises two energy minimization steps to find the final partition. Since raw optical flows are not reliable at (dis)occluded points, a first step allows us to find a partition $P_f$ where an optical flow model is fitted in each region. When the occlusion relations are estimated, the second step finds a second partition $P_d$ attempting to maintain occluded-occluding pairs in different regions. The final stage of the system relates regions in $P_d$ according to their relative depth.

Obtaining $P_f$ and $P_d$ is performed using the same energy minimization algorithm. For this reason, the general algorithm is presented first in Section 4.1 and then it is particularized for each step in the following subsections.

4.1 General Energy Minimization on BPTs

Algorithm 1 Optimal Partition Selection
  function OptimalSubTree(Region R_i)
    R_l, R_r ← (LeftChild(R_i), RightChild(R_i))
    (c_i, o_i) ← (E_r(R_i), R_i)
    (o_l, c_l) ← OptimalSubTree(R_l)
    (o_r, c_r) ← OptimalSubTree(R_r)
    if c_i < c_l + c_r then
      OptimalSubTree(R_i) ← (o_i, c_i)
    else
      OptimalSubTree(R_i) ← (o_l ∪ o_r, c_l + c_r)
    end if
  end function

A partition $P$ can be represented by a vector $x$ of binary variables $x_i \in \{0, 1\}$ with $i = 1..N$, one for each region $R_i$ forming the BPT. If $x_i = 1$, $R_i$ is in the partition, otherwise $x_i = 0$. Although there are a total of $2^N$ possible vectors, only a reduced subset may represent a partition, as shown in Figure 3. A given vector $x$ is a valid vector if one, and only one, region in every BPT branch has $x_i = 1$. A branch is the sequence of regions from a leaf to the root of the tree. Intuitively speaking, if a region $R_i$ is forming the partition $P$ ($x_i = 1$), no other region $R_j$ enclosed or enclosing $R_i$ may have $x_j = 1$. This can be expressed as a linear constraint $A$ on the vector $x$. $A$ is provided for the case in Figure 3:

$$A x = \mathbf{1}, \qquad A = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 \end{pmatrix} \qquad (3)$$

where $\mathbf{1}$ is a vector containing all ones. The proposed optimization scheme finds a partition that minimizes energy functions of the type:

$$x^* = \arg\min_x E(x) = \arg\min_x \sum_{R_i \in BPT} E_r(R_i)\, x_i \qquad (4)$$

$$\text{s.t.} \quad A x = \mathbf{1}, \quad x_i \in \{0, 1\} \qquad (5)$$

where $E_r(R_i)$ is a function that depends only on the internal characteristics of the region (mean color or shape, for example). If that is the case, Algorithm 1 uses dynamic programming (Viterbi like) to find the optimal $x^*$.
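
A minimal runnable sketch of this bottom-up selection is given below. It follows the recursion of Algorithm 1 and adds an explicit base case for leaves, which the pseudocode leaves implicit. The tree is the 7-region example of Equation (3) with 0-based indices, and the per-region energies are hypothetical values chosen for illustration.

```python
import numpy as np


def optimal_subtree(node, children, energy):
    """Return (regions, cost) of the minimum-energy pruning below `node`."""
    if node not in children:  # leaf: the only choice is the leaf itself
        return [node], energy[node]
    left, right = children[node]
    regions_l, cost_l = optimal_subtree(left, children, energy)
    regions_r, cost_r = optimal_subtree(right, children, energy)
    if energy[node] < cost_l + cost_r:
        return [node], energy[node]  # keeping the region whole is cheaper
    return regions_l + regions_r, cost_l + cost_r


def pruning_vector(regions, n_regions):
    """Binary vector x with x_i = 1 for the selected regions."""
    x = np.zeros(n_regions, dtype=int)
    x[regions] = 1
    return x


# Tree of Eq. (3), 0-based: leaves 0..3, node 4 = {0, 1}, node 5 = {2, 3}, root 6.
CHILDREN = {4: (0, 1), 5: (2, 3), 6: (4, 5)}
A = np.array([[1, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1, 1]])
ENERGY = [1.0, 1.0, 3.0, 3.0, 1.5, 7.0, 10.0]  # hypothetical values of E_r(R_i)

regions, cost = optimal_subtree(6, CHILDREN, ENERGY)
x = pruning_vector(regions, len(ENERGY))
```

With these energies, node 4 is kept whole (1.5 < 1 + 1) while node 5 is split (7 > 3 + 3), so the selected partition is regions {4, 2, 3} and the resulting vector satisfies $A x = \mathbf{1}$. Each region is visited once, so the cost is linear in the number of BPT nodes.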

Fitting the flows and finding occlusion relations. As stated in Section 3, the algorithm [10] does not provide reliable flow values at (dis)occluded points. Therefore, to be able to determine consistent occlusion relations, the flow in non-occluded areas is extrapolated to these points by finding a partition $P_f$ and estimating a parametric projective model [16] in each region. The set of regions that best fits to these models is computed using Algorithm 1 with

$$E_r(R_i) = \sum_{q = t \pm 1} \sum_{(x,y) \in R_i} \left\| w_{t,q}(x,y) - \tilde{w}^{R_i}_{t,q}(x,y) \right\| + \lambda \qquad (6)$$
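
To make the structure of Equation (6) concrete, the sketch below evaluates the energy of a single region. It deliberately simplifies the method described in the text: it swaps the projective model [16] for an affine one and the robust regression for ordinary least squares, and it assumes the norm is Euclidean. The function names and the value of `lam` are hypothetical.

```python
import numpy as np


def fit_affine_flow(coords, flow):
    """Least-squares affine flow model; returns the modeled flow at `coords`."""
    # Each flow component is modeled as a * x + b * y + c.
    design = np.column_stack([coords, np.ones(len(coords))])
    params, *_ = np.linalg.lstsq(design, flow, rcond=None)
    return design @ params


def region_energy(coords, flows, lam):
    """Eq. (6) for one region: model residual over both flows plus the constant lam."""
    total = lam
    for flow in flows:  # one flow field per q = t - 1 and q = t + 1
        modeled = fit_affine_flow(coords, flow)
        total += np.linalg.norm(flow - modeled, axis=1).sum()
    return float(total)


# Region made of a 4 x 2 grid of pixels, coordinates given as (x, y).
coords = np.array([(x, y) for y in range(2) for x in range(4)], dtype=float)

# A flow that is exactly affine is fully explained by the model.
smooth = coords @ np.array([[0.1, 0.0], [0.0, 0.2]]) + np.array([1.0, -1.0])
# A flow mixing two opposite motions cannot be explained by one affine model.
mixed = np.zeros_like(coords)
mixed[:, 0] = np.where((coords[:, 0] == 0) | (coords[:, 0] == 3), 1.0, -1.0)

e_smooth = region_energy(coords, [smooth, smooth], lam=0.5)
e_mixed = region_energy(coords, [mixed, mixed], lam=0.5)
```

Since every selected region adds the constant $\lambda$ to the total, the term appears to act as a penalty on the number of regions in the partition, while the residual term pushes toward splitting regions that contain more than one motion.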
