Pixelwise View Selection for Unstructured Multi-View Stereo

Johannes L. Schönberger¹, Enliang Zheng², Jan-Michael Frahm², and Marc Pollefeys¹,³

¹ ETH Zürich, Zürich, Switzerland
{jsch,pomarc}@inf.ethz.ch
² UNC Chapel Hill, Chapel Hill, USA
{ezheng,jmf}@cs.unc.edu
³ Microsoft, Redmond, USA
Abstract. This work presents a Multi-View Stereo system for robust
and efficient dense modeling from unstructured image collections. Our
core contributions are the joint estimation of depth and normal infor-
mation, pixelwise view selection using photometric and geometric pri-
ors, and a multi-view geometric consistency term for the simultaneous
refinement and image-based depth and normal fusion. Experiments on
benchmarks and large-scale Internet photo collections demonstrate state-
of-the-art performance in terms of accuracy, completeness, and efficiency.
1 Introduction
Large-scale 3D reconstruction from Internet photos has seen a tremendous evolution in sparse modeling using Structure-from-Motion (SfM) [1–8] and in dense modeling using Multi-View Stereo (MVS) [9–15]. Many applications benefit from a dense scene representation, e.g., classification [16], image-based rendering [17], localization [18], etc. Despite the widespread use of MVS, the efficient and robust estimation of accurate, complete, and aesthetically pleasing dense models in uncontrolled environments remains a challenging task. Dense pixelwise correspondence search is the core problem of stereo methods. Recovering correct correspondence is challenging even in controlled environments with known viewing geometry and illumination. In uncontrolled settings, e.g., where the input consists of crowd-sourced images, it is crucial to account for various factors, such as heterogeneous resolution and illumination, scene variability, unstructured viewing geometry, and mis-registered views.
Our proposed approach improves the state of the art in dense reconstruction for unstructured images. This work leverages the optimization framework by Zheng et al. [14] to propose the following core contributions:
Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-46487-9_31) contains supplementary material, which is available to authorized users.
© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part III, LNCS 9907, pp. 501–518, 2016.
DOI: 10.1007/978-3-319-46487-9_31

Fig. 1. Reconstructions for Louvre, Todai-ji, Paris Opera, and Astronomical Clock.
(1) Pixelwise normal estimation embedded into an improved PatchMatch sampling scheme. (2) Pixelwise view selection using triangulation angle, incident angle, and image resolution-based geometric priors. (3) Integration of a "temporal" view selection smoothness term. (4) Adaptive window support through bilateral photometric consistency for improved occlusion boundary behavior. (5) Introduction of a multi-view geometric consistency term for simultaneous depth/normal estimation and image-based fusion. (6) Reliable depth/normal filtering and fusion. Outlier-free and accurate depth/normal estimates further allow for direct meshing of the resulting point cloud. We achieve state-of-the-art results on benchmarks (Middlebury [19], Strecha [20]). To demonstrate the advantages of our method in a more challenging setting, we process SfM models of a world-scale Internet dataset [5]. The entire algorithm is released to the public as an open-source implementation as part of [8] at https://github.com/colmap/colmap.
2 Related Work
Stereo methods have advanced in terms of accuracy, completeness, scalability, and benchmarking from the minimal stereo setup with two views [21–24] to multi-view methods [9,10,14,15,25–28]. Further research has addressed the joint estimation of semantics [29], dynamic scene reconstruction [30–34], and benchmarking [12,19,20,23]. Our method performs MVS with pixelwise view selection for depth/normal estimation and fusion. Here, we only review the most related approaches within the large body of research in multi-view and two-view stereo.
MVS leverages multiple views to overcome the inherent occlusion problems of two-view approaches [35–37]. Accordingly, view selection plays a crucial role in the effectiveness of MVS. Kang et al. [38] heuristically select the best views with minimal cost (usually 50%) for computing the depth of each pixel. Strecha et al. [39,40] probabilistically model scene visibility combined with a local depth smoothness assumption [39] in a Markov Random Field for pixelwise view selection. Different from our approach, their method is prohibitive in memory usage and includes neither normal estimation nor photometric and geometric priors for view selection. Gallup et al. [41] select different views and resolutions on a per-pixel basis to achieve a constant depth error. In contrast, our method simultaneously considers a variety of photometric and geometric priors, improving upon the robustness and accuracy of the recently proposed depth estimation framework by Zheng et al. [14]. Their method is most closely related to our approach and is reviewed in more detail in Sect. 3.

MVS methods commonly use a fronto-parallel scene structure assumption. Gallup et al. [42] observed the distortion of the cost function caused by structure that deviates from this prior and combat it by using multiple sweeping directions deduced from the sparse reconstruction. Earlier approaches [43–45] similarly account for the surface normal in stereo matching. Recently, Bleyer et al. [46] use PatchMatch to estimate per-pixel normals to compensate for the distortion of the cost function. In contrast to these approaches, we propose to estimate normals not in isolation but also considering the photometric and geometric constraints guiding the matchability of surface texture and its accuracy. By probabilistically modeling the contribution of individual viewing rays towards reliable surface recovery, we achieve significantly improved depth and normal estimates.
Depth map fusion integrates multiple depth maps into a unified and augmented scene representation while mitigating any inconsistencies among individual estimates. Jancosek and Pajdla [28] fuse multiple depth estimates into a surface and, by evaluating visibility in 3D space, also attempt to reconstruct parts that are not directly supported by depth measurements. In contrast, our method aims at directly maximizing the estimated surface support in the depth maps and achieves higher completeness and accuracy (see Sect. 5). Goesele et al. [47] propose a method that explicitly targets the reconstruction from crowd-sourced images. They first select camera clusters for each surface and adjust their resolution to the smallest common resolution. For depth estimation, they then use the four most suitable images for each pixel. As already noted in Zheng et al. [14], this early pre-selection of reduced camera clusters may lead to less complete results and is sensitive to noise. Our method avoids this restrictive selection scheme by allowing dataset-wide, pixelwise sampling for view selection. Zach [48] proposed a variational depth map formulation that enabled parallelized computation on the GPU. However, their volumetric approach imposes substantial memory requirements and is prohibitive for the large-scale scenes targeted by our method. Beyond these methods, there are several large-scale dense reconstruction and fusion methods for crowd-sourced images, e.g., Furukawa et al. [10] and Gallup et al. [49,50], who all perform heuristic pre-selection of views, which leads to reduced completeness and accuracy as compared to our method.
3 Review of Joint View Selection and Depth Estimation
This section reviews the framework by Zheng et al. [14] to introduce notation and context for our contributions. Since their method processes each row/column independently, we limit the description to a single image row with $l$ as the column index. Their method estimates the depth $\theta_l$ for a pixel in the reference image $X^{\text{ref}}$ from a set of unstructured source images $X^{\text{src}} = \{X^m \mid m = 1 \dots M\}$. The estimate $\theta_l$ maximizes the color similarity between a patch $X_l^{\text{ref}}$ in the reference image and homography-warped patches $X_l^m$ in non-occluded source images. The binary indicator variable $Z_l^m \in \{0, 1\}$ defines the set of non-occluded source images as $\bar{X}_l^{\text{src}} = \{X^m \mid Z_l^m = 1\}$. To sample $\bar{X}_l^{\text{src}}$, they infer the probability that the reference patch $X_l^{\text{ref}}$ at depth $\theta_l$ is visible at the source patch $X_l^m$ using

$$P(X_l^m \mid Z_l^m, \theta_l) = \begin{cases} \frac{1}{NA} \exp\left(-\frac{(1 - \rho_l^m(\theta_l))^2}{2\sigma_\rho^2}\right) & \text{if } Z_l^m = 1 \\ \frac{1}{N}\,\mathcal{U} & \text{if } Z_l^m = 0, \end{cases} \qquad (1)$$

where $A = \int_{-1}^{1} \exp\left\{-\frac{(1-\rho)^2}{2\sigma_\rho^2}\right\} d\rho$ and $N$ is a constant canceling out in the inference. In the case of occlusion, the color distributions of the two patches are unrelated and follow the uniform distribution $\mathcal{U}$ in the range $[-1, 1]$ with probability density 0.5. Otherwise, $\rho_l^m$ describes the color similarity between the reference and source patch based on normalized cross-correlation (NCC) using fronto-parallel homography warping. The variable $\sigma_\rho$ determines a soft threshold for $\rho_l^m$ on the reference patch being visible in the source image. The state-transition matrix from the preceding pixel $l-1$ to the current pixel $l$ is

$$P(Z_l^m \mid Z_{l-1}^m) = \begin{pmatrix} \gamma & 1-\gamma \\ 1-\gamma & \gamma \end{pmatrix}$$

and encourages spatially smooth occlusion indicators, where a larger $\gamma$ enforces neighboring pixels to have more similar indicators.
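To make Eq. (1) concrete, the following Python sketch evaluates the occlusion-dependent likelihood for a given NCC score $\rho$. It is our illustration rather than the authors' implementation: the value of $\sigma_\rho$, all names, and the numerical approximation of $A$ are assumptions.

```python
import numpy as np

SIGMA_RHO = 0.45  # soft NCC threshold; a plausible value, not taken from the paper


def normalization_A(sigma_rho: float, num_steps: int = 10000) -> float:
    # A = \int_{-1}^{1} exp(-(1 - rho)^2 / (2 sigma_rho^2)) d rho,
    # approximated here with a midpoint Riemann sum.
    edges = np.linspace(-1.0, 1.0, num_steps + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    vals = np.exp(-(1.0 - mids) ** 2 / (2.0 * sigma_rho ** 2))
    return float(vals.sum() * (2.0 / num_steps))


def color_likelihood(rho: float, occluded: bool,
                     sigma_rho: float = SIGMA_RHO, N: float = 1.0) -> float:
    # Eq. (1): Gaussian-like similarity model if the source image sees the
    # patch; uniform density 0.5 on [-1, 1] if it is occluded. N cancels out
    # in the inference, so its value is arbitrary here.
    if occluded:
        return 0.5 / N
    A = normalization_A(sigma_rho)
    return float(np.exp(-(1.0 - rho) ** 2 / (2.0 * sigma_rho ** 2)) / (N * A))


# A high NCC score is more likely under the visible hypothesis than under the
# occluded one (density > 0.5 vs. exactly 0.5).
print(color_likelihood(0.9, occluded=False), color_likelihood(0.9, occluded=True))
```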
Given reference and source images $X = \{X^{\text{ref}}, X^{\text{src}}\}$, the inference problem then boils down to recovering, for all $L$ pixels in the reference image, the depths $\theta = \{\theta_l \mid l = 1 \dots L\}$ and the occlusion indicators $Z = \{Z_l^m \mid l = 1 \dots L,\ m = 1 \dots M\}$ from the posterior distribution $P(Z, \theta \mid X)$ with a uniform prior $P(\theta)$. To solve the computationally infeasible Bayesian approach of first computing the joint probability

$$P(X, Z, \theta) = \prod_{l=1}^{L} \prod_{m=1}^{M} \left[ P(Z_l^m \mid Z_{l-1}^m)\, P(X_l^m \mid Z_l^m, \theta_l) \right] \qquad (2)$$

and then normalizing over $P(X)$, Zheng et al. use variational inference theory to develop a framework that is a variant of the generalized expectation-maximization (GEM) algorithm [51]. For the inference of $Z$ in the hidden Markov chain, the forward-backward algorithm is used in the E step of GEM. PatchMatch-inspired [46] sampling serves as an efficient scheme for the inference of $\theta$ in the M step of GEM. Their method iteratively solves for $Z$ with fixed $\theta$ and vice versa using interleaved row-/columnwise propagation.
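As a hedged illustration of the E step, the sketch below runs the standard forward-backward recursions for the binary occlusion chain of a single source image along one row. The emission probabilities would come from Eq. (1); the value of $\gamma$ and all names are our assumptions.

```python
import numpy as np


def forward_backward(emission: np.ndarray, gamma: float = 0.999) -> np.ndarray:
    """emission: (L, 2) array holding P(X_l | Z_l = 0) and P(X_l | Z_l = 1)
    for every pixel l of the row. Returns the posteriors q(Z_l = 1)."""
    L = emission.shape[0]
    T = np.array([[gamma, 1.0 - gamma],   # state-transition matrix
                  [1.0 - gamma, gamma]])  # P(Z_l | Z_{l-1}) from Sect. 3
    alpha = np.zeros((L, 2))
    beta = np.zeros((L, 2))
    alpha[0] = 0.5 * emission[0]          # uniform prior on the first pixel
    alpha[0] /= alpha[0].sum()
    for l in range(1, L):                 # forward pass
        alpha[l] = emission[l] * (alpha[l - 1] @ T)
        alpha[l] /= alpha[l].sum()        # normalize to avoid underflow
    beta[-1] = 1.0
    for l in range(L - 2, -1, -1):        # backward pass
        beta[l] = T @ (emission[l + 1] * beta[l + 1])
        beta[l] /= beta[l].sum()
    post = alpha * beta
    return post[:, 1] / post.sum(axis=1)
```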
Full depth inference

$$\theta_l^{\text{opt}} = \operatorname*{argmin}_{\theta_l} \sum_{m=1}^{M} P_l(m)\left(1 - \rho_l^m(\theta_l)\right) \qquad (3)$$
has high computational cost if $M$ is large, as PatchMatch requires the NCC to be computed many times. The value

$$P_l(m) = \frac{q(Z_l^m = 1)}{\sum_{m=1}^{M} q(Z_l^m = 1)}$$

denotes the probability of the patch in source image $m$ being similar to the reference patch, while $q(Z)$ is an approximation of the real posterior $P(Z)$. Source images with small $P_l(m)$ are non-informative for the depth inference, hence Zheng et al. propose a Monte Carlo based approximation of $\theta_l^{\text{opt}}$ for view selection

$$\hat{\theta}_l^{\text{opt}} = \operatorname*{argmin}_{\theta_l} \frac{1}{|S|} \sum_{m \in S} \left(1 - \rho_l^m(\theta_l)\right) \qquad (4)$$

by sampling a subset of images $S \subset \{1 \dots M\}$ from the distribution $P_l(m)$ and hence only computing the NCC for the most similar source images.
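The sketch below illustrates this Monte Carlo approximation: source images are drawn in proportion to $P_l(m)$ and the matching cost of Eq. (4) is averaged over the sampled subset only, so the expensive NCC is evaluated for $|S| \ll M$ images. The `ncc` callback stands in for $\rho_l^m(\theta_l)$; the subset size and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_views(q_z1: np.ndarray, num_samples: int = 15) -> np.ndarray:
    # Draw source-image indices m ~ P_l(m) = q(Z_l^m = 1) / sum_m q(Z_l^m = 1).
    p = q_z1 / q_z1.sum()
    return rng.choice(len(q_z1), size=num_samples, p=p)


def mc_cost(theta: float, subset: np.ndarray, ncc) -> float:
    # Eq. (4): average (1 - rho_l^m(theta)) over the sampled subset S.
    return float(np.mean([1.0 - ncc(theta, m) for m in subset]))


def best_depth(candidates, q_z1, ncc) -> float:
    subset = sample_views(q_z1)
    return min(candidates, key=lambda theta: mc_cost(theta, subset, ncc))


# Toy usage: 100 source images and a synthetic NCC that peaks at depth 2.0.
q = rng.random(100)
fake_ncc = lambda theta, m: float(np.exp(-(theta - 2.0) ** 2))
print(best_depth(np.linspace(0.5, 4.0, 50), q, fake_ncc))  # close to 2.0
```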

4 Algorithm
In this section, we describe our novel algorithm that leverages the optimization framework reviewed in the previous section. We first present the individual terms of the proposed likelihood function, while Sect. 4.6 explains their integration into the overall optimization framework.
4.1 Normal Estimation
Zheng et al. [14] map between the reference and source images using fronto-parallel homographies, leading to artifacts for oblique structures [42]. In contrast, we estimate per-pixel depth $\theta_l$ and normals $n_l \in \mathbb{R}^3$, $\|n_l\| = 1$. A patch at $x_l \in \mathbb{P}^2$ in the reference image warps to a source patch at $x_l^m \in \mathbb{P}^2$ using

$$x_l^m = H_l x_l \quad \text{with} \quad H_l = K_m \left( R_m - d_l^{-1} t_m n_l^T \right) K^{-1}.$$

Here, $R_m \in SO(3)$ and $t_m \in \mathbb{R}^3$ define the relative transformation from the reference to the source camera frame. $K$ and $K_m$ denote the calibration of the reference and source images, respectively, and $d_l = n_l^T p_l$ is the orthogonal distance from the reference image to the plane at the point $p_l = \theta_l K^{-1} x_l$.
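The plane-induced warping is straightforward to compose from the relative pose and a depth/normal hypothesis. The following sketch builds $H_l$ exactly as defined above; the intrinsics, pose, and pixel values are toy numbers, and the function name is ours.

```python
import numpy as np


def plane_homography(K_ref, K_src, R, t, x_ref, theta, n):
    # H_l = K_m (R_m - t_m n_l^T / d_l) K^{-1}, with d_l = n_l^T p_l and
    # p_l = theta_l K^{-1} x_l; x_ref is homogeneous and n is unit-length.
    p = theta * (np.linalg.inv(K_ref) @ x_ref)  # 3D point on the local plane
    d = float(n @ p)                            # signed orthogonal distance
    return K_src @ (R - np.outer(t, n) / d) @ np.linalg.inv(K_ref)


K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])     # toy relative pose
x = np.array([300.0, 200.0, 1.0])               # reference pixel (homogeneous)
n = np.array([0.0, 0.0, -1.0])                  # normal facing the camera
H = plane_homography(K, K, R, t, x, theta=2.0, n=n)
x_src = H @ x
print(x_src / x_src[2])                         # warped source pixel
```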
Given no knowledge of the scene, we assume a uniform prior $P(N)$ in the inference of the normals $N = \{n_l \mid l = 1 \dots L\}$. Estimating $N$ requires changing the terms $P(X_l^m \mid Z_l^m, \theta_l)$ and $P_l(m)$ from Eqs. (1) and (4) to also depend on $N$, as the color similarity $\rho_l^m$ is now based on slanted rather than fronto-parallel homographies. Consequently, the optimal depth and normal are chosen as

$$(\hat{\theta}_l^{\text{opt}}, \hat{n}_l^{\text{opt}}) = \operatorname*{argmin}_{\theta_l, n_l} \frac{1}{|S|} \sum_{m \in S} \left(1 - \rho_l^m(\theta_l, n_l)\right). \qquad (5)$$
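For completeness, a minimal sketch of the photometric term: $\rho_l^m(\theta_l, n_l)$ is the NCC between the reference patch and the source patch warped through the slanted homography $H_l$. The window size, nearest-neighbor sampling, and names are our simplifications; the full method additionally uses bilaterally weighted photometric consistency (contribution 4), which we omit here.

```python
import numpy as np


def ncc(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    # Normalized cross-correlation of two equally sized patches, in [-1, 1].
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps))


def warp_patch(src: np.ndarray, H: np.ndarray, x: float, y: float,
               half: int = 2) -> np.ndarray:
    # Sample the source image over a (2*half+1)^2 window around reference
    # pixel (x, y), mapped through H; nearest-neighbor lookup for brevity.
    h, w = src.shape
    out = np.zeros((2 * half + 1, 2 * half + 1))
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            q = H @ np.array([x + dx, y + dy, 1.0])
            u, v = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
            if 0 <= v < h and 0 <= u < w:
                out[dy + half, dx + half] = src[v, u]
    return out

# For a hypothesis (theta, n) with homography H:
# rho = ncc(ref[y-2:y+3, x-2:x+3], warp_patch(src, H, x, y))
```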
To sample unbiased random normals in PatchMatch, we follow the approach by Galliani et al. [15]. With the additional two unknown normal parameters, the number of unknowns per pixel in the M step of GEM increases from one to three. While this in theory requires PatchMatch to generate many more samples, we propose an efficient propagation scheme that maintains the convergence rate of depth-only inference. Since depth $\theta_l$ and normal $n_l$ define a local planar surface in 3D, we propagate the depth $\theta_{l-1}^{\text{prp}}$ of the intersection of the ray of the current pixel $x_l$ with the local surface of the previous pixel $(\theta_{l-1}, n_{l-1})$. This exploits first-order smoothness of the surface (cf. [52]) and thereby drastically speeds up the optimization, since correct depths propagate more quickly along the surface.
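One possible reading of this propagation step in code: intersect the viewing ray of the current pixel with the plane defined by the previous pixel's depth and normal, which yields the propagated hypothesis $\theta_{l-1}^{\text{prp}}$. All names and numbers below are illustrative.

```python
import numpy as np


def propagate_depth(K_inv, x_prev, theta_prev, n_prev, x_cur):
    # Depth at which the ray through x_cur meets the local plane of the
    # previous pixel: solve n^T (theta * K^{-1} x_cur) = n^T p_prev.
    p_prev = theta_prev * (K_inv @ x_prev)  # 3D point of the previous pixel
    d = float(n_prev @ p_prev)              # plane offset: n^T p = d
    ray = K_inv @ x_cur                     # p(theta) = theta * ray
    denom = float(n_prev @ ray)
    if abs(denom) < 1e-12:
        return None                         # ray (nearly) parallel to plane
    return d / denom


K_inv = np.linalg.inv(np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]]))
x_prev = np.array([100.0, 120.0, 1.0])
x_cur = np.array([101.0, 120.0, 1.0])       # neighboring pixel in the row
n = np.array([0.0, 0.0, -1.0])              # fronto-parallel local plane
print(propagate_depth(K_inv, x_prev, 2.0, n, x_cur))  # 2.0 for this plane
```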
Moreover, different from the typical iterative refinement of normals using bisection as an intermediate step between full sweeps of propagations (cf. [15,46]), we generate a small set of additional plane hypotheses at each propagation step. We observe that the current best depth and normal parameters can have the following states: neither of them, one of them, or both of them have the optimal solution or are close to it. By combining random and perturbed depths with current best normals and vice versa, we increase the chance of sampling the correct solution. More formally, at each step in PatchMatch, we choose the current best estimate for pixel $l$ according to Eq. (4) from the set of hypotheses
