Pixelwise View Selection for Unstructured Multi-View Stereo

Johannes L. Schönberger¹, Enliang Zheng², Jan-Michael Frahm², and Marc Pollefeys¹,³

¹ ETH Zürich, Zürich, Switzerland
{jsch,pomarc}@inf.ethz.ch
² UNC Chapel Hill, Chapel Hill, USA
{ezheng,jmf}@cs.unc.edu
³ Microsoft, Redmond, USA
Abstract. This work presents a Multi-View Stereo system for robust
and efficient dense modeling from unstructured image collections. Our
core contributions are the joint estimation of depth and normal infor-
mation, pixelwise view selection using photometric and geometric pri-
ors, and a multi-view geometric consistency term for the simultaneous
refinement and image-based depth and normal fusion. Experiments on
benchmarks and large-scale Internet photo collections demonstrate state-
of-the-art performance in terms of accuracy, completeness, and efficiency.
1 Introduction
Large-scale 3D reconstruction from Internet photos has seen a tremendous evolution in sparse modeling using Structure-from-Motion (SfM) [1–8] and in dense modeling using Multi-View Stereo (MVS) [9–15]. Many applications benefit from a dense scene representation, e.g., classification [16], image-based rendering [17], localization [18], etc. Despite the widespread use of MVS, the efficient and robust estimation of accurate, complete, and aesthetically pleasing dense models in uncontrolled environments remains a challenging task. Dense pixelwise correspondence search is the core problem of stereo methods. Recovering correct correspondence is challenging even in controlled environments with known viewing geometry and illumination. In uncontrolled settings, e.g., where the input consists of crowd-sourced images, it is crucial to account for various factors, such as heterogeneous resolution and illumination, scene variability, unstructured viewing geometry, and mis-registered views.
Our proposed approach improves the state of the art in dense reconstruction for unstructured images. This work leverages the optimization framework by Zheng et al. [14] to propose the following core contributions:
Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-46487-9_31) contains supplementary material, which is available to authorized users.
© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part III, LNCS 9907, pp. 501–518, 2016.
DOI: 10.1007/978-3-319-46487-9_31

Fig. 1. Reconstructions for Louvre, Todai-ji, Paris Opera, and Astronomical Clock.
(1) Pixelwise normal estimation embedded into an improved PatchMatch sampling scheme. (2) Pixelwise view selection using triangulation angle, incident angle, and image resolution-based geometric priors. (3) Integration of a "temporal" view selection smoothness term. (4) Adaptive window support through bilateral photometric consistency for improved occlusion boundary behavior. (5) Introduction of a multi-view geometric consistency term for simultaneous depth/normal estimation and image-based fusion. (6) Reliable depth/normal filtering and fusion. Outlier-free and accurate depth/normal estimates further allow for direct meshing of the resulting point cloud. We achieve state-of-the-art results on benchmarks (Middlebury [19], Strecha [20]). To demonstrate the advantages of our method in a more challenging setting, we process SfM models of a world-scale Internet dataset [5]. The entire algorithm is released to the public as an open-source implementation as part of [8] at https://github.com/colmap/colmap.
2 Related Work
Stereo methods have advanced in terms of accuracy, completeness, scalability, and benchmarking from the minimal stereo setup with two views [21–24] to multi-view methods [9,10,14,15,25–28]. Further research has addressed the joint estimation of semantics [29], dynamic scene reconstruction [30–34], and benchmarking [12,19,20,23]. Our method performs MVS with pixelwise view selection for depth/normal estimation and fusion. Here, we only review the most related approaches within the large body of research in multi-view and two-view stereo.
MVS leverages multiple views to overcome the inherent occlusion problems of two-view approaches [35–37]. Accordingly, view selection plays a crucial role in the effectiveness of MVS. Kang et al. [38] heuristically select the best views with minimal cost (usually 50%) for computing the depth of each pixel. Strecha et al. [39,40] probabilistically model scene visibility combined with a local depth smoothness assumption [39] in a Markov Random Field for pixelwise view selection. Different from our approach, their method is prohibitive in memory usage and includes neither normal estimation nor photometric and geometric priors for view selection. Gallup et al. [41] select different views and resolutions on a per-pixel basis to achieve a constant depth error. In contrast, our method simultaneously considers a variety of photometric and geometric priors, improving upon the robustness and accuracy of the recently proposed depth estimation framework by Zheng et al. [14]. Their method is most closely related to our approach and is reviewed in more detail in Sect. 3.

MVS methods commonly use a fronto-parallel scene structure assumption. Gallup et al. [42] observed the distortion of the cost function caused by structure that deviates from this prior and combat it by using multiple sweeping directions deduced from the sparse reconstruction. Earlier approaches [43–45] similarly account for the surface normal in stereo matching. Recently, Bleyer et al. [46] use PatchMatch to estimate per-pixel normals to compensate for the distortion of the cost function. In contrast to these approaches, we propose to estimate normals not in isolation but also considering the photometric and geometric constraints guiding the matchability of surface texture and its accuracy. By probabilistically modeling the contribution of individual viewing rays towards reliable surface recovery, we achieve significantly improved depth and normal estimates.
Depth map fusion integrates multiple depth maps into a unified and augmented scene representation while mitigating any inconsistencies among individual estimates. Jancosek and Pajdla [28] fuse multiple depth estimates into a surface and, by evaluating visibility in 3D space, also attempt to reconstruct parts that are not directly supported by depth measurements. In contrast, our method aims at directly maximizing the estimated surface support in the depth maps and achieves higher completeness and accuracy (see Sect. 5). Goesele et al. [47] propose a method that explicitly targets the reconstruction from crowd-sourced images. They first select camera clusters for each surface and adjust their resolution to the smallest common resolution. For depth estimation, they then use the four most suitable images for each pixel. As already noted in Zheng et al. [14], this early pre-selection of reduced camera clusters may lead to less complete results and is sensitive to noise. Our method avoids this restrictive selection scheme by allowing dataset-wide, pixelwise sampling for view selection. Zach [48] proposed a variational depth map formulation that enabled parallelized computation on the GPU. However, their volumetric approach imposes substantial memory requirements and is prohibitive for the large-scale scenes targeted by our method. Beyond these methods, there are several large-scale dense reconstruction and fusion methods for crowd-sourced images, e.g., Furukawa et al. [10] and Gallup et al. [49,50], who all perform heuristic pre-selection of views, which leads to reduced completeness and accuracy as compared to our method.
3 Review of Joint View Selection and Depth Estimation
This section reviews the framework by Zheng et al. [14] to introduce notation and context for our contributions. Since their method processes each row/column independently, we limit the description to a single image row with $l$ as the column index. Their method estimates the depth $\theta_l$ for a pixel in the reference image $X^{\text{ref}}$ from a set of unstructured source images $X^{\text{src}} = \{X^m \mid m = 1 \dots M\}$. The estimate $\theta_l$ maximizes the color similarity between a patch $X_l^{\text{ref}}$ in the reference image and homography-warped patches $X_l^m$ in non-occluded source images. The binary indicator variable $Z_l^m \in \{0, 1\}$ defines the set of non-occluded source images as $\bar{X}_l^{\text{src}} = \{X^m \mid Z_l^m = 1\}$. To sample $\bar{X}_l^{\text{src}}$, they infer the probability that the reference patch $X_l^{\text{ref}}$ at depth $\theta_l$ is visible at the source patch $X_l^m$ using

$$P(X_l^m \mid Z_l^m, \theta_l) = \begin{cases} \frac{1}{NA} \exp\left(-\frac{(1 - \rho_l^m(\theta_l))^2}{2\sigma_\rho^2}\right) & \text{if } Z_l^m = 1 \\ \frac{1}{N}\,\mathcal{U} & \text{if } Z_l^m = 0, \end{cases} \qquad (1)$$

where $A = \int_{-1}^{1} \exp\left\{-\frac{(1-\rho)^2}{2\sigma_\rho^2}\right\} d\rho$ and $N$ is a constant canceling out in the inference. In the case of occlusion, the color distributions of the two patches are unrelated and follow the uniform distribution $\mathcal{U}$ in the range $[-1, 1]$ with probability density 0.5. Otherwise, $\rho_l^m$ describes the color similarity between the reference and source patch based on normalized cross-correlation (NCC) using fronto-parallel homography warping. The variable $\sigma_\rho$ determines a soft threshold for $\rho_l^m$ on the reference patch being visible in the source image. The state-transition matrix from the preceding pixel $l-1$ to the current pixel $l$ is

$$P(Z_l^m \mid Z_{l-1}^m) = \begin{pmatrix} \gamma & 1-\gamma \\ 1-\gamma & \gamma \end{pmatrix}$$

and encourages spatially smooth occlusion indicators, where a larger $\gamma$ enforces neighboring pixels to have more similar indicators.
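To make Eq. (1) concrete, the following Python sketch evaluates the occlusion-dependent likelihood for a given NCC score $\rho$. It is our illustration rather than the authors' implementation: the value of $\sigma_\rho$, all names, and the numerical approximation of $A$ are assumptions.

```python
import numpy as np

SIGMA_RHO = 0.45  # soft NCC threshold; a plausible value, not taken from the paper


def normalization_A(sigma_rho: float, num_steps: int = 10000) -> float:
    # A = \int_{-1}^{1} exp(-(1 - rho)^2 / (2 sigma_rho^2)) d rho,
    # approximated here with a midpoint Riemann sum.
    edges = np.linspace(-1.0, 1.0, num_steps + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    vals = np.exp(-(1.0 - mids) ** 2 / (2.0 * sigma_rho ** 2))
    return float(vals.sum() * (2.0 / num_steps))


def color_likelihood(rho: float, occluded: bool,
                     sigma_rho: float = SIGMA_RHO, N: float = 1.0) -> float:
    # Eq. (1): Gaussian-like similarity model if the source image sees the
    # patch; uniform density 0.5 on [-1, 1] if it is occluded. N cancels out
    # in the inference, so its value is arbitrary here.
    if occluded:
        return 0.5 / N
    A = normalization_A(sigma_rho)
    return float(np.exp(-(1.0 - rho) ** 2 / (2.0 * sigma_rho ** 2)) / (N * A))


# A high NCC score is more likely under the visible hypothesis than under the
# occluded one (density > 0.5 vs. exactly 0.5).
print(color_likelihood(0.9, occluded=False), color_likelihood(0.9, occluded=True))
```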
Given reference and source images $X = \{X^{\text{ref}}, X^{\text{src}}\}$, the inference problem then boils down to recovering, for all $L$ pixels in the reference image, the depths $\theta = \{\theta_l \mid l = 1 \dots L\}$ and the occlusion indicators $Z = \{Z_l^m \mid l = 1 \dots L,\ m = 1 \dots M\}$ from the posterior distribution $P(Z, \theta \mid X)$ with a uniform prior $P(\theta)$. To solve the computationally infeasible Bayesian approach of first computing the joint probability

$$P(X, Z, \theta) = \prod_{l=1}^{L} \prod_{m=1}^{M} \left[ P(Z_l^m \mid Z_{l-1}^m)\, P(X_l^m \mid Z_l^m, \theta_l) \right] \qquad (2)$$

and then normalizing over $P(X)$, Zheng et al. use variational inference theory to develop a framework that is a variant of the generalized expectation-maximization (GEM) algorithm [51]. For the inference of $Z$ in the hidden Markov chain, the forward-backward algorithm is used in the E step of GEM. PatchMatch-inspired [46] sampling serves as an efficient scheme for the inference of $\theta$ in the M step of GEM. Their method iteratively solves for $Z$ with fixed $\theta$ and vice versa using interleaved row-/columnwise propagation.
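As a hedged illustration of the E step, the sketch below runs the standard forward-backward recursions for the binary occlusion chain of a single source image along one row. The emission probabilities would come from Eq. (1); the value of $\gamma$ and all names are our assumptions.

```python
import numpy as np


def forward_backward(emission: np.ndarray, gamma: float = 0.999) -> np.ndarray:
    """emission: (L, 2) array holding P(X_l | Z_l = 0) and P(X_l | Z_l = 1)
    for every pixel l of the row. Returns the posteriors q(Z_l = 1)."""
    L = emission.shape[0]
    T = np.array([[gamma, 1.0 - gamma],   # state-transition matrix
                  [1.0 - gamma, gamma]])  # P(Z_l | Z_{l-1}) from Sect. 3
    alpha = np.zeros((L, 2))
    beta = np.zeros((L, 2))
    alpha[0] = 0.5 * emission[0]          # uniform prior on the first pixel
    alpha[0] /= alpha[0].sum()
    for l in range(1, L):                 # forward pass
        alpha[l] = emission[l] * (alpha[l - 1] @ T)
        alpha[l] /= alpha[l].sum()        # normalize to avoid underflow
    beta[-1] = 1.0
    for l in range(L - 2, -1, -1):        # backward pass
        beta[l] = T @ (emission[l + 1] * beta[l + 1])
        beta[l] /= beta[l].sum()
    post = alpha * beta
    return post[:, 1] / post.sum(axis=1)
```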
Full depth inference

$$\theta_l^{\text{opt}} = \operatorname*{argmin}_{\theta_l} \sum_{m=1}^{M} P_l(m)\left(1 - \rho_l^m(\theta_l)\right) \qquad (3)$$
has high computational cost if $M$ is large, as PatchMatch requires the NCC to be computed many times. The value

$$P_l(m) = \frac{q(Z_l^m = 1)}{\sum_{m=1}^{M} q(Z_l^m = 1)}$$

denotes the probability of the patch in source image $m$ being similar to the reference patch, while $q(Z)$ is an approximation of the real posterior $P(Z)$. Source images with small $P_l(m)$ are non-informative for the depth inference, hence Zheng et al. propose a Monte Carlo based approximation of $\theta_l^{\text{opt}}$ for view selection

$$\hat{\theta}_l^{\text{opt}} = \operatorname*{argmin}_{\theta_l} \frac{1}{|S|} \sum_{m \in S} \left(1 - \rho_l^m(\theta_l)\right) \qquad (4)$$

by sampling a subset of images $S \subset \{1 \dots M\}$ from the distribution $P_l(m)$ and hence only computing the NCC for the most similar source images.
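The sketch below illustrates this Monte Carlo approximation: source images are drawn in proportion to $P_l(m)$ and the matching cost of Eq. (4) is averaged over the sampled subset only, so the expensive NCC is evaluated for $|S| \ll M$ images. The `ncc` callback stands in for $\rho_l^m(\theta_l)$; the subset size and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_views(q_z1: np.ndarray, num_samples: int = 15) -> np.ndarray:
    # Draw source-image indices m ~ P_l(m) = q(Z_l^m = 1) / sum_m q(Z_l^m = 1).
    p = q_z1 / q_z1.sum()
    return rng.choice(len(q_z1), size=num_samples, p=p)


def mc_cost(theta: float, subset: np.ndarray, ncc) -> float:
    # Eq. (4): average (1 - rho_l^m(theta)) over the sampled subset S.
    return float(np.mean([1.0 - ncc(theta, m) for m in subset]))


def best_depth(candidates, q_z1, ncc) -> float:
    subset = sample_views(q_z1)
    return min(candidates, key=lambda theta: mc_cost(theta, subset, ncc))


# Toy usage: 100 source images and a synthetic NCC that peaks at depth 2.0.
q = rng.random(100)
fake_ncc = lambda theta, m: float(np.exp(-(theta - 2.0) ** 2))
print(best_depth(np.linspace(0.5, 4.0, 50), q, fake_ncc))  # close to 2.0
```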

4 Algorithm
In this section, we describe our novel algorithm that leverages the optimization framework reviewed in the previous section. We first present the individual terms of the proposed likelihood function, while Sect. 4.6 explains their integration into the overall optimization framework.
4.1 Normal Estimation
Zheng et al. [14] map between the reference and source images using fronto-parallel homographies, leading to artifacts for oblique structures [42]. In contrast, we estimate per-pixel depth $\theta_l$ and normals $n_l \in \mathbb{R}^3$, $\|n_l\| = 1$. A patch at $x_l \in \mathbb{P}^2$ in the reference image warps to a source patch at $x_l^m \in \mathbb{P}^2$ using

$$x_l^m = H_l x_l \quad \text{with} \quad H_l = K_m \left( R_m - d_l^{-1} t_m n_l^T \right) K^{-1}.$$

Here, $R_m \in SO(3)$ and $t_m \in \mathbb{R}^3$ define the relative transformation from the reference to the source camera frame. $K$ and $K_m$ denote the calibration of the reference and source images, respectively, and $d_l = n_l^T p_l$ is the orthogonal distance from the reference image to the plane at the point $p_l = \theta_l K^{-1} x_l$.
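The plane-induced warping is straightforward to compose from the relative pose and a depth/normal hypothesis. The following sketch builds $H_l$ exactly as defined above; the intrinsics, pose, and pixel values are toy numbers, and the function name is ours.

```python
import numpy as np


def plane_homography(K_ref, K_src, R, t, x_ref, theta, n):
    # H_l = K_m (R_m - t_m n_l^T / d_l) K^{-1}, with d_l = n_l^T p_l and
    # p_l = theta_l K^{-1} x_l; x_ref is homogeneous and n is unit-length.
    p = theta * (np.linalg.inv(K_ref) @ x_ref)  # 3D point on the local plane
    d = float(n @ p)                            # signed orthogonal distance
    return K_src @ (R - np.outer(t, n) / d) @ np.linalg.inv(K_ref)


K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])     # toy relative pose
x = np.array([300.0, 200.0, 1.0])               # reference pixel (homogeneous)
n = np.array([0.0, 0.0, -1.0])                  # normal facing the camera
H = plane_homography(K, K, R, t, x, theta=2.0, n=n)
x_src = H @ x
print(x_src / x_src[2])                         # warped source pixel
```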
Given no knowledge of the scene, we assume a uniform prior $P(N)$ in the inference of the normals $N = \{n_l \mid l = 1 \dots L\}$. Estimating $N$ requires changing the terms $P(X_l^m \mid Z_l^m, \theta_l)$ and $P_l(m)$ from Eqs. (1) and (4) to also depend on $N$, as the color similarity $\rho_l^m$ is now based on slanted rather than fronto-parallel homographies. Consequently, the optimal depth and normal are chosen as

$$(\hat{\theta}_l^{\text{opt}}, \hat{n}_l^{\text{opt}}) = \operatorname*{argmin}_{\theta_l, n_l} \frac{1}{|S|} \sum_{m \in S} \left(1 - \rho_l^m(\theta_l, n_l)\right). \qquad (5)$$
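For completeness, a minimal sketch of the photometric term: $\rho_l^m(\theta_l, n_l)$ is the NCC between the reference patch and the source patch warped through the slanted homography $H_l$. The window size, nearest-neighbor sampling, and names are our simplifications; the full method additionally uses bilaterally weighted photometric consistency (contribution 4), which we omit here.

```python
import numpy as np


def ncc(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    # Normalized cross-correlation of two equally sized patches, in [-1, 1].
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps))


def warp_patch(src: np.ndarray, H: np.ndarray, x: float, y: float,
               half: int = 2) -> np.ndarray:
    # Sample the source image over a (2*half+1)^2 window around reference
    # pixel (x, y), mapped through H; nearest-neighbor lookup for brevity.
    h, w = src.shape
    out = np.zeros((2 * half + 1, 2 * half + 1))
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            q = H @ np.array([x + dx, y + dy, 1.0])
            u, v = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
            if 0 <= v < h and 0 <= u < w:
                out[dy + half, dx + half] = src[v, u]
    return out

# For a hypothesis (theta, n) with homography H:
# rho = ncc(ref[y-2:y+3, x-2:x+3], warp_patch(src, H, x, y))
```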
To sample unbiased random normals in PatchMatch, we follow the approach by Galliani et al. [15]. With the additional two unknown normal parameters, the number of unknowns per pixel in the M step of GEM increases from one to three. While this in theory requires PatchMatch to generate many more samples, we propose an efficient propagation scheme that maintains the convergence rate of depth-only inference. Since depth $\theta_l$ and normal $n_l$ define a local planar surface in 3D, we propagate the depth $\theta_{l-1}^{\text{prp}}$ of the intersection of the ray of the current pixel $x_l$ with the local surface of the previous pixel $(\theta_{l-1}, n_{l-1})$. This exploits first-order smoothness of the surface (cf. [52]) and thereby drastically speeds up the optimization, since correct depths propagate more quickly along the surface.
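One possible reading of this propagation step in code: intersect the viewing ray of the current pixel with the plane defined by the previous pixel's depth and normal, which yields the propagated hypothesis $\theta_{l-1}^{\text{prp}}$. All names and numbers below are illustrative.

```python
import numpy as np


def propagate_depth(K_inv, x_prev, theta_prev, n_prev, x_cur):
    # Depth at which the ray through x_cur meets the local plane of the
    # previous pixel: solve n^T (theta * K^{-1} x_cur) = n^T p_prev.
    p_prev = theta_prev * (K_inv @ x_prev)  # 3D point of the previous pixel
    d = float(n_prev @ p_prev)              # plane offset: n^T p = d
    ray = K_inv @ x_cur                     # p(theta) = theta * ray
    denom = float(n_prev @ ray)
    if abs(denom) < 1e-12:
        return None                         # ray (nearly) parallel to plane
    return d / denom


K_inv = np.linalg.inv(np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]]))
x_prev = np.array([100.0, 120.0, 1.0])
x_cur = np.array([101.0, 120.0, 1.0])       # neighboring pixel in the row
n = np.array([0.0, 0.0, -1.0])              # fronto-parallel local plane
print(propagate_depth(K_inv, x_prev, 2.0, n, x_cur))  # 2.0 for this plane
```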
Moreover, different from the typical iterative refinement of normals using bisection as an intermediate step between full sweeps of propagations (cf. [15,46]), we generate a small set of additional plane hypotheses at each propagation step. We observe that the current best depth and normal parameters can have the following states: neither of them, one of them, or both of them have the optimal solution or are close to it. By combining random and perturbed depths with current best normals and vice versa, we increase the chance of sampling the correct solution. More formally, at each step in PatchMatch, we choose the current best estimate for pixel $l$ according to Eq. (4) from the set of hypotheses
