On Benchmarking Camera Calibration and Multi-View Stereo for High Resolution Imagery

C. Strecha, CVLab EPFL, Lausanne (CH)
W. von Hansen, FGAN-FOM, Ettlingen (D)
L. Van Gool, CVLab ETHZ, Zürich (CH)
P. Fua, CVLab EPFL, Lausanne (CH)
U. Thoennessen, FGAN-FOM, Ettlingen (D)
Abstract

In this paper we want to start the discussion on whether image based 3-D modelling techniques can possibly be used to replace LIDAR systems for outdoor 3-D data acquisition. Two main issues have to be addressed in this context: (i) camera calibration (internal and external) and (ii) dense multi-view stereo. To investigate both, we have acquired test data from outdoor scenes both with LIDAR and cameras. Using the LIDAR data as reference we estimated the ground truth for several scenes. Evaluation sets are prepared to evaluate different aspects of 3-D model building. These are: (i) pose estimation and multi-view stereo with known internal camera parameters; (ii) camera calibration and multi-view stereo with the raw images as the only input and (iii) multi-view stereo.
1. Introduction

Several techniques to measure the shape of objects in 3-D are available. The most common systems are based on active stereo, passive stereo, time-of-flight laser measurements (LIDAR) or NMR imaging. For measurements in laboratories, active stereo systems can determine 3-D coordinates accurately and in real time. However, active stereo is only available for controlled indoor environments.

A second technique, which is also applicable to outdoor environments, is LIDAR. In contrast to image based techniques, LIDAR systems are able to directly produce a 3-D point cloud based on distance measurements, with an accuracy better than 1 cm. The downsides are the high cost of the system and a time-consuming data acquisition.
Automatic reconstruction from multiple view imagery already is a low-cost alternative to laser systems, but could even become a replacement once the geometrical accuracy of the results can be proven. The aim of this paper is to investigate whether image based 3-D modelling techniques could possibly replace LIDAR systems. For this purpose we have acquired LIDAR data and images from outdoor scenes.

Figure 1. Diffuse rendering of the integrated LIDAR 3-D triangle mesh for the Herz-Jesu-P8 data-set.

The LIDAR data will serve as geometrical ground truth to evaluate the quality of the image based results.
Our evaluation sets cover camera calibration as well as dense multi-view stereo. Benchmark data-sets are available both for camera calibration (internal and external) [9] and for stereo and multi-view stereo [16, 15]. To generate ground truth, a measurement technique has to be used that is superior to the techniques under evaluation. Seitz et al. [16] and Scharstein et al. [15] use a laser scanner and an active stereo system, respectively, to gain this advantage over multi-view stereo. In the ISPRS calibration benchmark [9] the ground truth is estimated on high-resolution images, i.e. for a more accurate feature localisation, and only low-resolution images are provided as benchmark data. However, in these data-sets the ground truth is measured and assumed to be known exactly. Our approach is different. Similar to Seitz et al. [16] we use laser scans to obtain ground truth, but we also estimate the variance of these measurements. Image based acquisition techniques are evaluated relative to this variance. This allows different algorithms to be compared with each other. Moreover, based on this variance, we can specify at which point image based techniques become similar to LIDAR techniques, i.e. a benchmark result can be classified as correct if its relative error approaches the uncertainty range of the ground truth.

Figure 2. Diffuse rendering of the integrated LIDAR 3-D triangle mesh for the fountain-P11 data-set.

Our benchmark data contains realistic scenes that could also be of practical interest, i.e. outdoor scenes for which active stereo is not applicable. This is a major difference to existing data-sets. Furthermore, we use high resolution images to be competitive with LIDAR.
The paper is organised as follows: Sec. 2 deals with the LIDAR system used for our experiments; the preparation of the raw point cloud and the integration into a combined triangle mesh is discussed. Sec. 2.2 describes the generation of ground truth for the images from the LIDAR data. This includes the camera calibration and the generation of a per-pixel depth and variance for each image. Sec. 3 evaluates different aspects of image based 3-D acquisition, in particular camera calibration and multi-view stereo reconstruction.
2. Ground truth estimation from LIDAR data

2.1. LIDAR acquisition

The data source for ground truth in our project is laser scanning (LIDAR). A laser beam is scanned across the object surface, measuring the distance to the object for each position. We had a Zoller+Fröhlich IMAGER 5003 laser scanner at our disposal. Multiple scan positions are required for complex object surfaces to handle missing parts due to occlusions. Even though fully automatic registration methods exist, we have chosen a semi-automatic way, utilising software provided by the manufacturer, in order to get a clean reference.
2-D targets are put into the scene and marked interactively in the datasets. Then the centre coordinates are automatically computed and a registration module computes a least squares estimate of the parameters of a rigid transform between the datasets. Poorly defined targets can be detected and manually removed. The resulting standard deviation for a single target is 1.1 mm for the Herz-Jesu and 1.5 mm for the Ettlingen-castle data-set. The targets are visible in the camera images as well and are used to link the LIDAR and camera coordinate systems.
First, some filters from the software are applied to mask out bad points resulting from measurements into the sky, mixed pixels and other error sources. Then the LIDAR data is transformed into a set of oriented 3-D points. Finally, we integrated all data-sets into a single high-resolution triangle mesh by using a Poisson based reconstruction scheme; see Kazhdan [7] for more details. A rendering of the resulting mesh is shown in figs. 1 and 2.
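As an illustration, this point-cloud-to-mesh step can be reproduced with the Poisson surface reconstruction implementation available in Open3D. This is a minimal sketch, not the authors' pipeline: the file names and parameter values are assumptions, and the original work applied Kazhdan's method [7] to the already-oriented scan points.

```python
# Sketch: integrate registered, oriented LIDAR points into a single triangle
# mesh via Poisson surface reconstruction [7]. File names are hypothetical.
import open3d as o3d

# Registered scans, already transformed into the common coordinate system.
pcd = o3d.io.read_point_cloud("registered_scans.ply")

# Poisson reconstruction needs consistently oriented normals; re-estimate
# them here in case the file does not preserve the scan normals.
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))
pcd.orient_normals_consistent_tangent_plane(30)

# 'depth' controls the octree resolution and hence mesh detail.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=11)
o3d.io.write_triangle_mesh("integrated_mesh.ply", mesh)
```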
We are now provided with a huge triangle mesh in a local coordinate system defined by one of the LIDAR scan positions. The next section deals with the calibration of the digital cameras in this coordinate system.
2.2. Image acquisition

Together with the LIDAR data, the scenes have been captured with a Canon D60 digital camera at a resolution of 3072 × 2048 square pixels. In this section we describe the camera calibration and the preparation of the ground truth 3-D model from the LIDAR data. Our focus is thereby not only on the ground truth estimation itself but also on the accuracy of our ground truth data. The LIDAR 3-D estimates are themselves the result of a measurement process and are therefore given by 3-D points together with their covariance matrices. Our aim is to propagate this variance into our image based ground truth estimation. This is an important point for the preparation of ground truth data in general.

Errors for the multi-view stereo evaluation are introduced by (i) the 3-D accuracy of the LIDAR data itself and (ii) the calibration errors of the input cameras. The latter strongly influence the quality of multi-view stereo reconstructions. An evaluation that takes these calibration errors into account should therefore be based on per-image reference depth maps (more details are given in sec. 3.2), as opposed to Seitz et al. [16], who evaluate stereo reconstructions by the Euclidean 3-D distance between the estimated and the ground truth triangle mesh.
2.3. Ground truth camera calibration

LIDAR data and camera images are linked via targets that are visible in both datasets. Thus the laser scanner provides 3-D reference coordinates that can be used to compute the calibration parameters for each camera. For the camera calibration we assume a perspective camera model with radial distortion [6]. The images are taken without changing the focal length, such that the internal camera parameters $\theta_{int} = \{f, s, x_0, a, y_0, k_1, k_2\}$ (the K-matrix and the radial distortion parameters $k_{1,2}$) are assumed to be constant for all images.

Figure 3. Example of target measurements for the Herz-Jesu data.

Figure 4. Example of feature tracks and their covariance. A small patch around the feature position is shown for all images. Underneath, the covariance is shown as a gray-level image.
The external camera parameters are the position and orientation of the camera, described by the 6 parameters $\theta_{ext} = \{\alpha, \beta, \gamma, t_x, t_y, t_z\}$. The total number of parameters $\theta$ for $N$ images is thus $7 + 6N$. To calibrate the cameras we used $M$ targets which have been placed in the scene (shown in fig. 3). The 3-D positions $Y_j,\ j = 1 \ldots M$, and the covariance $\Sigma_Y$ for these are provided by the laser scan software. In addition we used matched feature points across all images. From the around 20000 feature tracks we kept 200 as tie points. These have been selected so as to have a long track length and a large spatial spread in the images. We checked the remaining tracks visually for their correctness. A sketch of the assumed projection model is given below.
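To make the parameterisation concrete, the following sketch implements a perspective projection with two radial distortion coefficients. The function name and the exact distortion convention (distortion applied in normalised image coordinates) are illustrative assumptions, not the authors' code.

```python
# Sketch of the assumed camera model: perspective projection with radial
# distortion [6]. Parameter names follow the text.
import numpy as np

def project(theta_int, R, t, Y):
    """Project a 3-D point Y into the image of a camera with pose (R, t)."""
    f, s, x0, a, y0, k1, k2 = theta_int
    Xc = R @ Y + t                        # world -> camera coordinates
    x, y = Xc[0] / Xc[2], Xc[1] / Xc[2]   # perspective division
    r2 = x * x + y * y                    # radial distortion in normalised coords
    d = 1.0 + k1 * r2 + k2 * r2 * r2
    xd, yd = d * x, d * y
    K = np.array([[f, s,     x0],         # K-matrix: focal length f, skew s,
                  [0, a * f, y0],         # aspect ratio a, principal point (x0, y0)
                  [0, 0,     1.0]])
    u = K @ np.array([xd, yd, 1.0])
    return u[:2]
```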
In each input image $i$ we estimated the 2-D positions $y_{ij}$ and the covariance matrices $\Sigma_{ij}$ of the targets and the feature points. Examples are given in fig. 4.
Let $y$ denote all measurements, i.e. the collection of 3-D points $Y_j$ and the 2-D image measurements $y_{ij}$. The expected value of all internal and external camera parameters $\theta = \{\theta_{int}, \theta_{ext}^{1}, \ldots, \theta_{ext}^{N}\}$ can be written as:

$$E[\theta] = \int p(\tilde{y})\, p(\theta \mid \tilde{y})\, \theta \; d\tilde{y}\, d\theta . \qquad (1)$$
Here $p(\tilde{y})$ is the likelihood of the data, i.e. among all 3-D points $\tilde{Y}_j$ and image measurements $\tilde{y}_{ij}$, only those that are close to the estimated values $y$ will have a large likelihood:

$$p(\tilde{y}_{ij}) \propto \exp\!\big( -0.5\, (\tilde{y}_{ij} - y_{ij})^T \Sigma_{ij}^{-1} (\tilde{y}_{ij} - y_{ij}) \big)$$
$$p(\tilde{Y}_{j}) \propto \exp\!\big( -0.5\, (\tilde{Y}_{j} - Y_{j})^T \Sigma_{Y}^{-1} (\tilde{Y}_{j} - Y_{j}) \big) \qquad (2)$$
The second term, $p(\theta \mid \tilde{y})$, is the likelihood of the calibration. This is a Gaussian distribution and reflects the accuracy of the calibration, given the data points $\tilde{y}$. This accuracy is given by the reprojection error:

$$e(\theta) = \sum_{i}^{N} \sum_{j}^{M} \big( P_i(\theta)\, \tilde{Y}_j - \tilde{y}_{ij} \big)^T \Sigma_{ij}^{-1} \big( P_i(\theta)\, \tilde{Y}_j - \tilde{y}_{ij} \big),$$

where $P_i(\theta)$ projects a 3-D point $\tilde{Y}_j$ to the image point $\tilde{y}_{ij}$, and the calibration likelihood becomes:

$$p(\theta \mid \tilde{y}) \propto \exp\big( -0.5\, e(\theta) \big). \qquad (3)$$
The covariance $\Sigma$ of the camera parameters is similarly given by:

$$\Sigma = \int p(\tilde{y})\, p(\theta \mid \tilde{y})\, (E[\theta] - \theta)(E[\theta] - \theta)^T \; d\tilde{y}\, d\theta . \qquad (4)$$

To compute the solution of eqs. (1) and (4) we apply a sampling strategy. The measurement distribution $p(\tilde{y})$ is sampled, and given a specific sample $\tilde{y}$ the parameters $\tilde{\theta}$ are computed as the ML estimate of eq. (3):

$$\tilde{\theta} = \arg\max_{\theta}\, \{ \log p(\theta \mid \tilde{y}) \} . \qquad (5)$$

Using eq. (5) and eq. (2) we can approximate the expected values and the covariance in eqs. (1) and (4) by a weighted sum over the sample estimates. As a result we obtain all camera parameters $\theta$ as $E[\theta]$, together with their covariance $\Sigma$. This is a standard procedure to estimate parameter distributions, i.e. their mean and covariance.
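The sampling procedure can be sketched as follows. This is an illustration, not the authors' implementation: `calibrate_ml` stands in for the bundle adjustment that maximises eq. (3) and is assumed to be given, and because the samples are drawn directly from the measurement distribution, the weighted sum mentioned above reduces to plain averaging here.

```python
# Sketch of the sampling strategy for eqs. (1) and (4): perturb the target and
# feature measurements according to their covariances, re-run the ML
# calibration of eq. (5) per sample, and accumulate mean and covariance of
# the resulting parameter vectors.
import numpy as np

def sample_calibration(Y, Sigma_Y, y, Sigma_y, calibrate_ml, n_samples=100):
    rng = np.random.default_rng(0)
    thetas = []
    for _ in range(n_samples):
        # Draw one sample of all 3-D target points and 2-D image measurements.
        Y_s = {j: rng.multivariate_normal(Yj, Sigma_Y) for j, Yj in Y.items()}
        y_s = {ij: rng.multivariate_normal(yij, Sigma_y[ij])
               for ij, yij in y.items()}
        # ML estimate of eq. (5) for this sample (assumed bundle adjustment).
        thetas.append(calibrate_ml(Y_s, y_s))
    thetas = np.asarray(thetas)
    E_theta = thetas.mean(axis=0)           # approximates eq. (1)
    Sigma = np.cov(thetas, rowvar=False)    # approximates eq. (4)
    return E_theta, Sigma
```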
2.4. Ground truth 3-D model

Given the mean and variance of the camera calibration, we are now in the position to estimate the expected value of the per-pixel depth and its variance. Again we sample the camera parameter distribution given by $E[\theta]$ and $\Sigma$ from eqs. (1) and (4):

$$p(\tilde{\theta}) = \frac{\exp\!\big( -\frac{1}{2} (E[\theta] - \tilde{\theta})^T \Sigma^{-1} (E[\theta] - \tilde{\theta}) \big)}{(2\pi)^{\frac{7+6N}{2}}\, |\Sigma|^{\frac{1}{2}}}\,, \qquad (6)$$

Figure 5. Four images (out of 25) of the fountain-R25 data-set.
and collect sufficient statistics for the per-pixel depth values. This proceeds in two stages. First, we find the first intersection of the laser-scan triangle mesh with the sampled camera rays. Secondly, we sample the error distribution of the laser scan data around this first triangle intersection. The result is the mean $D_l^{ij}$ and variance $D_\sigma^{ij}$ of the depth value for each pixel $j$ in image $i$. Note that this procedure allows multi-view stereo reconstructions to be evaluated independently of the accuracy of the camera calibration. If the performance of the stereo algorithm is evaluated in 3-D (e.g. by the Euclidean distance to the ground truth triangle mesh [16]), the accuracy of the camera calibration and the accuracy of the stereo algorithm are mixed. Here, the evaluation is relative to the calibration accuracy, i.e. pixels with a large depth variance, given the uncertainty of the calibration, will influence the evaluation criterion accordingly. Pixels with a large depth variance appear near depth boundaries and on strongly slanted surface parts; obviously, these depth values vary most under a varying camera position. The reference depth maps and their variance will only be used for the evaluation of multi-view stereo in sec. 3.2. When the goal is to evaluate a triangle mesh without a camera calibration, the evaluation is done in 3-D, equivalent to [16]. This applies to the first two categories of data-sets described in sec. 3.
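A minimal sketch of this two-stage accumulation, using the `trimesh` library for ray casting, is given below. The helper `draw_camera_rays`, which draws one camera from the Gaussian of eq. (6) and returns one ray per pixel in row-major order, is an assumption; for brevity the second stage (sampling the LIDAR noise around each intersection) is omitted.

```python
# Sketch: per-pixel depth mean and variance by casting rays from sampled
# cameras against the LIDAR triangle mesh.
import numpy as np
import trimesh

def depth_statistics(mesh, draw_camera_rays, width, height, n_samples=50):
    n  = np.zeros(height * width)    # hit count per pixel
    s1 = np.zeros(height * width)    # running sum of depths
    s2 = np.zeros(height * width)    # running sum of squared depths
    for _ in range(n_samples):
        # One camera drawn from eq. (6); ray index == pixel index (row-major).
        origins, dirs = draw_camera_rays()
        locs, ray_idx, _ = mesh.ray.intersects_location(
            origins, dirs, multiple_hits=False)   # first mesh intersection
        d = np.linalg.norm(locs - origins[ray_idx], axis=1)
        n[ray_idx]  += 1
        s1[ray_idx] += d
        s2[ray_idx] += d * d
    mean = np.where(n > 0, s1 / np.maximum(n, 1), np.nan)
    var  = np.where(n > 1, s2 / np.maximum(n, 1) - mean ** 2, np.nan)
    return mean.reshape(height, width), var.reshape(height, width)
```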
3. Evaluation of image based 3-D techniques

3-D modelling from high resolution images as the only input has made a huge step forward in being accurate and applicable to real scenes. Various authors propose a so-called structure and motion pipeline [2, 12, 13, 14, 17, 21, 22], which consists of mainly three steps. In the first step, the raw images undergo a sparse-feature based matching procedure. Matching is often based on invariant feature detectors [11] and descriptors [10], which are applied to pairs of input images. Secondly, the position and orientation as well as the internal camera parameters are obtained by camera calibration techniques [6]. The third step takes the input images, which have often been corrected for radial distortion, together with the camera parameters, and establishes dense correspondences or the complete 3-D model (see [16] for an overview). A minimal sketch of the first two steps is given below.
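The sketch below illustrates the first two steps for a single image pair using OpenCV. It is generic structure-and-motion machinery, not the specific pipelines of [2, 12, 13, 14, 17, 21, 22], and it assumes the intrinsic matrix K is already known, which corresponds to the calibrated setting discussed later.

```python
# Sketch: sparse matching + relative calibration for one image pair.
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    # Step 1: sparse feature detection and matching (SIFT features here).
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Step 2: robust relative pose from the essential matrix.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t   # step 3 (dense stereo) would consume these parameters
```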
We divided our data into three categories, which are used to evaluate several aspects of the 3-D acquisition pipeline:
3-D acquisition from uncalibrated raw images: The data-sets in this category are useful to evaluate techniques that take their images from the internet, e.g. flickr (see for instance Goesele et al. [5]), or for which the internal calibration of the cameras is not available. Here it is useful to evaluate the camera parameter estimation as well as the accuracy of the 3-D triangle mesh that has been computed from those cameras. Unfortunately, algorithms that produce a triangle mesh from uncalibrated images are still rare. Fully automatic software is, to our knowledge, not available, so we restrict the evaluation to the camera calibration, as will follow in the next section.
3-D acquisition with known internal cameras: Often it is possible to calibrate the internal camera parameters by using a calibration grid. These data-sets are the ideal candidates to study the possibility of replacing LIDAR scanning by image based acquisition. Results that integrate the camera pose estimation and the 3-D mesh generation could not be obtained.
Multi-View Stereo given all camera parameters: These data-sets are prepared to evaluate classical multi-view stereo algorithms, similar to the multi-view stereo evaluation by Seitz et al. [16].
The data-sets [1] are named by the convention sceneName-XN, where X = R, K, P corresponds to the three categories above ((R)aw images given, (K) matrix given, (P)rojection matrix given) and N is the number of images in the data-set; a small parsing helper is sketched below. In practice, image based reconstructions have to combine calibration and multi-view stereo. This is reflected by the first two categories. However, these two problems are often handled separately; we therefore included one category for pure multi-view stereo.
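For illustration, the naming convention can be parsed as follows. This is a hypothetical helper, not part of the benchmark distribution.

```python
# Sketch: split a sceneName-XN benchmark name into its parts.
import re

def parse_dataset_name(name):
    m = re.fullmatch(r"(?P<scene>.+)-(?P<cat>[RKP])(?P<n>\d+)", name)
    if m is None:
        raise ValueError(f"not a sceneName-XN dataset name: {name}")
    category = {"R": "raw images given", "K": "K matrix given",
                "P": "projection matrices given"}[m["cat"]]
    return m["scene"], category, int(m["n"])

print(parse_dataset_name("fountain-R25"))  # ('fountain', 'raw images given', 25)
print(parse_dataset_name("Herz-Jesu-P8"))  # ('Herz-Jesu', 'projection matrices given', 8)
```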
3.1. Camera calibration

To compare results of self-calibration techniques against our ground truth data, we first have to align our ground truth camera track with the evaluation track by a 3-D similarity transformation (scale, rotation and translation). This procedure transforms the coordinate system of the evaluation into our coordinate system. We used a non-linear optimisation of the 7 parameters to minimise

$$\epsilon = (E[\theta] - \theta_{eval})^T\, \Sigma^{-1}\, (E[\theta] - \theta_{eval}),$$

where $\theta$ and $\Sigma$ now include the subset of all camera position and orientation parameters.

Figure 6. Camera calibration for the fountain-R25 data-set in fig. 5: ARC-3D [3] (left) and Martinec et al. [8] (right).

Figure 7. Position error [m] of the camera calibration (Martinec et al. [8], blue; ARC-3D [3], red) for the fountain-R25 data (top) and the Herz-Jesu-R23 data (bottom). The green error bars indicate the 3σ value of the ground truth camera positions.
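As an illustration of this alignment step, the closed-form Umeyama solution below recovers scale, rotation and translation from corresponding camera centres. It is a sketch under that simplification: the paper's covariance-weighted non-linear minimisation would refine such an initial estimate.

```python
# Sketch: 7-parameter similarity alignment of two camera tracks.
import numpy as np

def umeyama_alignment(X, Y):
    """Return s, R, t minimising sum ||Y_i - (s R X_i + t)||^2 (X, Y: 3xN)."""
    n = X.shape[1]
    mx, my = X.mean(axis=1, keepdims=True), Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mx, Y - my
    U, D, Vt = np.linalg.svd(Yc @ Xc.T / n)   # SVD of the cross-covariance
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # guard against a reflection
    R = U @ S @ Vt
    var_x = (Xc ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_x
    t = my - s * (R @ mx)
    return s, R, t
```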
For the evaluation we used ARC-3D [3] and obtained results from Martinec et al. [8]. ARC-3D is a fully automatic web application; Martinec et al. [8] scored second in the ICCV 2005 challenge "Where am I?". Both methods successfully calibrated all cameras for the fountain-R25 data-set. For the Herz-Jesu-R23 data only Martinec et al. [8] was able to reconstruct all cameras; ARC-3D succeeded in calibrating four of the 21 cameras. The result of this automatic camera calibration is shown in fig. 6, which shows the position and orientation of both the ground truth and the estimated cameras. Fig. 7 shows the difference in camera position with respect to the ground truth.

References

R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press.

D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV.

K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE TPAMI.

N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. ACM SIGGRAPH.