On Benchmarking Camera Calibration and Multi-View Stereo for High Resolution Imagery

C. Strecha, CVLab EPFL, Lausanne (CH)
W. von Hansen, FGAN-FOM, Ettlingen (D)
L. Van Gool, CVLab ETHZ, Zürich (CH)
P. Fua, CVLab EPFL, Lausanne (CH)
U. Thoennessen, FGAN-FOM, Ettlingen (D)
Abstract

In this paper we want to start the discussion on whether image based 3-D modelling techniques can possibly be used to replace LIDAR systems for outdoor 3-D data acquisition. Two main issues have to be addressed in this context: (i) camera calibration (internal and external) and (ii) dense multi-view stereo. To investigate both, we have acquired test data from outdoor scenes both with LIDAR and cameras. Using the LIDAR data as reference we estimated the ground truth for several scenes. Evaluation sets are prepared to evaluate different aspects of 3-D model building. These are: (i) pose estimation and multi-view stereo with known internal camera parameters; (ii) camera calibration and multi-view stereo with the raw images as the only input and (iii) multi-view stereo.
1. Introduction

Several techniques to measure the shape of objects in 3-D are available. The most common systems are based on active stereo, passive stereo, time-of-flight laser measurements (LIDAR) or NMR imaging. For measurements in laboratories, active stereo systems can determine 3-D coordinates accurately and in real time. However, active stereo is only available for controlled indoor environments.

A second technique, which is also applicable to outdoor environments, is LIDAR. In contrast to image based techniques, LIDAR systems are able to directly produce a 3-D point cloud based on distance measurements, with an accuracy better than 1 cm. The downsides are the high cost of the system and a time-consuming data acquisition.
Automatic reconstruction from multiple view imagery already is a low-cost alternative to laser systems, but could even become a replacement once the geometrical accuracy of the results can be proven. The aim of this paper is to investigate whether image based 3-D modelling techniques could possibly replace LIDAR systems. For this purpose we have acquired LIDAR data and images from outdoor scenes.

Figure 1. Diffuse rendering of the integrated LIDAR 3-D triangle mesh for the Herz-Jesu-P8 data-set.

The LIDAR data will serve as geometrical ground truth to evaluate the quality of the image based results.
Our evaluation sets cover camera calibration as well as dense multi-view stereo. Benchmark data-sets are available both for camera calibration (internal and external) [9] and for stereo and multi-view stereo [16, 15]. To generate ground truth, a measurement technique has to be used that is superior to the techniques under evaluation. Seitz et al. [16] and Scharstein et al. [15] use a laser scanner and an active stereo system, respectively, to gain this advantage over multi-view stereo. In the ISPRS calibration benchmark [9] the ground truth is estimated on high-resolution images, i.e. for a more accurate feature localisation, and only low-resolution images are provided as benchmark data. However, in these data-sets the ground truth is measured and assumed to be known exactly. Our approach is different. Similar to Seitz et al. [16] we use laser scans to obtain ground truth, but we also estimate the variance of these measurements. Image based acquisition techniques are evaluated relative to this variance. This allows different algorithms to be compared with each other. Moreover, based on this variance, we can specify at which point image based techniques become similar to LIDAR techniques, i.e. a benchmark result can be classified as correct if its relative error approaches the uncertainty range of the ground truth.

Figure 2. Diffuse rendering of the integrated LIDAR 3-D triangle mesh for the fountain-P11 data-set.

Our benchmark data contains realistic scenes that could also be of practical interest, i.e. outdoor scenes for which active stereo is not applicable. This is a major difference to existing data-sets. Furthermore, we use high resolution images to be competitive with LIDAR.
The paper is organised as follows: Sec. 2 deals with the LIDAR system used for our experiments; the preparation of the raw point cloud and the integration into a combined triangle mesh is discussed. Sec. 2.2 describes the generation of ground truth for the images from the LIDAR data. This includes the camera calibration and the generation of a per-pixel depth and variance for each image. Sec. 3 evaluates different aspects of image based 3-D acquisition, in particular camera calibration and multi-view stereo reconstruction.
2. Ground truth estimation from LIDAR data

2.1. LIDAR acquisition

The data source for ground truth in our project is laser scanning (LIDAR). A laser beam is scanned across the object surface, measuring the distance to the object for each position. We had a Zoller+Fröhlich IMAGER 5003 laser scanner at our disposal. Multiple scan positions are required for complex object surfaces to handle missing parts due to occlusions. Even though fully automatic registration methods exist, we have chosen a semi-automatic way, utilising software provided by the manufacturer, in order to get a clean reference.
2-D targets are put into the scene and marked interactively in the datasets. Then the centre coordinates are automatically computed and a registration module computes a least squares estimate of the parameters of a rigid transform between the datasets. Poorly defined targets can be detected and manually removed. The resulting standard deviation for a single target is 1.1 mm for the Herz-Jesu and 1.5 mm for the Ettlingen-castle data-set. The targets are visible in the camera images as well and are used to link the LIDAR and camera coordinate systems.
First, some filters from the software are applied to mask out bad points resulting from measurements into the sky, mixed pixels and other error sources. Then the LIDAR data is transformed into a set of oriented 3-D points. Finally, we integrated all data-sets into a single high-resolution triangle mesh by using a Poisson based reconstruction scheme; see Kazhdan [7] for more details. A rendering of the resulting mesh is shown in figs. 1 and 2.
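As an illustration, this point-cloud-to-mesh step can be reproduced with the Poisson surface reconstruction implementation available in Open3D. This is a minimal sketch, not the authors' pipeline: the file names and parameter values are assumptions, and the original work applied Kazhdan's method [7] to the already-oriented scan points.

```python
# Sketch: integrate registered, oriented LIDAR points into a single triangle
# mesh via Poisson surface reconstruction [7]. File names are hypothetical.
import open3d as o3d

# Registered scans, already transformed into the common coordinate system.
pcd = o3d.io.read_point_cloud("registered_scans.ply")

# Poisson reconstruction needs consistently oriented normals; re-estimate
# them here in case the file does not preserve the scan normals.
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))
pcd.orient_normals_consistent_tangent_plane(30)

# 'depth' controls the octree resolution and hence mesh detail.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=11)
o3d.io.write_triangle_mesh("integrated_mesh.ply", mesh)
```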
We are now provided with a huge triangle mesh in a local coordinate system defined by one of the LIDAR scan positions. The next section deals with the calibration of the digital cameras in this coordinate system.
2.2. Image acquisition

Together with the LIDAR data, the scenes have been captured with a Canon D60 digital camera at a resolution of 3072 × 2048 square pixels. In this section we describe the camera calibration and the preparation of the ground truth 3-D model from the LIDAR data. Our focus is thereby not only on the ground truth estimation itself but also on the accuracy of our ground truth data. The LIDAR 3-D estimates are themselves the result of a measurement process and are therefore given by 3-D points together with their covariance matrices. Our aim is to propagate this variance into our image based ground truth estimation. This is an important point for the preparation of ground truth data in general.

Errors for the multi-view stereo evaluation are introduced by (i) the 3-D accuracy of the LIDAR data itself and (ii) the calibration errors of the input cameras. The latter strongly influence the quality of multi-view stereo reconstructions. An evaluation that takes these calibration errors into account should therefore be based on per-image reference depth maps (more details are given in sec. 3.2), as opposed to Seitz et al. [16], who evaluate stereo reconstructions by the Euclidean 3-D distance between the estimated and the ground truth triangle mesh.
2.3. Ground truth camera calibration

LIDAR data and camera images are linked via targets that are visible in both datasets. Thus the laser scanner provides 3-D reference coordinates that can be used to compute the calibration parameters for each camera. For the camera calibration we assume a perspective camera model with radial distortion [6]. The images are taken without changing the focal length, such that the internal camera parameters $\theta_{int} = \{f, s, x_0, a, y_0, k_1, k_2\}$ (the K-matrix and the radial distortion parameters $k_{1,2}$) are assumed to be constant for all images.

Figure 3. Example of target measurements for the Herz-Jesu data.

Figure 4. Example of feature tracks and their covariance. A small patch around the feature position is shown for all images. Underneath, the covariance is shown as a gray-level image.
The external camera parameters are the position and orientation of the camera, described by the 6 parameters $\theta_{ext} = \{\alpha, \beta, \gamma, t_x, t_y, t_z\}$. The total number of parameters $\theta$ for $N$ images is thus $7 + 6N$. To calibrate the cameras we used $M$ targets which have been placed in the scene (shown in fig. 3). The 3-D positions $Y_j,\ j = 1 \ldots M$, and the covariance $\Sigma_Y$ for these are provided by the laser scan software. In addition we used matched feature points across all images. From the around 20000 feature tracks we kept 200 as tie points. These have been selected so as to have a long track length and a large spatial spread in the images. We checked the remaining tracks visually for their correctness. A sketch of the assumed projection model is given below.
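To make the parameterisation concrete, the following sketch implements a perspective projection with two radial distortion coefficients. The function name and the exact distortion convention (distortion applied in normalised image coordinates) are illustrative assumptions, not the authors' code.

```python
# Sketch of the assumed camera model: perspective projection with radial
# distortion [6]. Parameter names follow the text.
import numpy as np

def project(theta_int, R, t, Y):
    """Project a 3-D point Y into the image of a camera with pose (R, t)."""
    f, s, x0, a, y0, k1, k2 = theta_int
    Xc = R @ Y + t                        # world -> camera coordinates
    x, y = Xc[0] / Xc[2], Xc[1] / Xc[2]   # perspective division
    r2 = x * x + y * y                    # radial distortion in normalised coords
    d = 1.0 + k1 * r2 + k2 * r2 * r2
    xd, yd = d * x, d * y
    K = np.array([[f, s,     x0],         # K-matrix: focal length f, skew s,
                  [0, a * f, y0],         # aspect ratio a, principal point (x0, y0)
                  [0, 0,     1.0]])
    u = K @ np.array([xd, yd, 1.0])
    return u[:2]
```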
In each input image $i$ we estimated the 2-D positions $y_{ij}$ and the covariance matrices $\Sigma_{ij}$ of the targets and the feature points. Examples are given in fig. 4.
Let $y$ denote all measurements, i.e. the collection of 3-D points $Y_j$ and the 2-D image measurements $y_{ij}$. The expected value of all internal and external camera parameters $\theta = \{\theta_{int}, \theta_{ext}^{1}, \ldots, \theta_{ext}^{N}\}$ can be written as:

$$E[\theta] = \int p(\tilde{y})\, p(\theta \mid \tilde{y})\, \theta \; d\tilde{y}\, d\theta . \qquad (1)$$
Here $p(\tilde{y})$ is the likelihood of the data, i.e. among all 3-D points $\tilde{Y}_j$ and image measurements $\tilde{y}_{ij}$, only those that are close to the estimated values $y$ will have a large likelihood:

$$p(\tilde{y}_{ij}) \propto \exp\!\big( -0.5\, (\tilde{y}_{ij} - y_{ij})^T \Sigma_{ij}^{-1} (\tilde{y}_{ij} - y_{ij}) \big)$$
$$p(\tilde{Y}_{j}) \propto \exp\!\big( -0.5\, (\tilde{Y}_{j} - Y_{j})^T \Sigma_{Y}^{-1} (\tilde{Y}_{j} - Y_{j}) \big) \qquad (2)$$
The second term, $p(\theta \mid \tilde{y})$, is the likelihood of the calibration. This is a Gaussian distribution and reflects the accuracy of the calibration, given the data points $\tilde{y}$. This accuracy is given by the reprojection error:

$$e(\theta) = \sum_{i}^{N} \sum_{j}^{M} \big( P_i(\theta)\, \tilde{Y}_j - \tilde{y}_{ij} \big)^T \Sigma_{ij}^{-1} \big( P_i(\theta)\, \tilde{Y}_j - \tilde{y}_{ij} \big),$$

where $P_i(\theta)$ projects a 3-D point $\tilde{Y}_j$ to the image point $\tilde{y}_{ij}$, and the calibration likelihood becomes:

$$p(\theta \mid \tilde{y}) \propto \exp\big( -0.5\, e(\theta) \big). \qquad (3)$$
The covariance $\Sigma$ of the camera parameters is similarly given by:

$$\Sigma = \int p(\tilde{y})\, p(\theta \mid \tilde{y})\, (E[\theta] - \theta)(E[\theta] - \theta)^T \; d\tilde{y}\, d\theta . \qquad (4)$$

To compute the solution of eqs. (1) and (4) we apply a sampling strategy. The measurement distribution $p(\tilde{y})$ is sampled, and given a specific sample $\tilde{y}$ the parameters $\tilde{\theta}$ are computed as the ML estimate of eq. (3):

$$\tilde{\theta} = \arg\max_{\theta}\, \{ \log p(\theta \mid \tilde{y}) \} . \qquad (5)$$

Using eq. (5) and eq. (2) we can approximate the expected values and the covariance in eqs. (1) and (4) by a weighted sum over the sample estimates. As a result we obtain all camera parameters $\theta$ as $E[\theta]$, together with their covariance $\Sigma$. This is a standard procedure to estimate parameter distributions, i.e. their mean and covariance.
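The sampling procedure can be sketched as follows. This is an illustration, not the authors' implementation: `calibrate_ml` stands in for the bundle adjustment that maximises eq. (3) and is assumed to be given, and because the samples are drawn directly from the measurement distribution, the weighted sum mentioned above reduces to plain averaging here.

```python
# Sketch of the sampling strategy for eqs. (1) and (4): perturb the target and
# feature measurements according to their covariances, re-run the ML
# calibration of eq. (5) per sample, and accumulate mean and covariance of
# the resulting parameter vectors.
import numpy as np

def sample_calibration(Y, Sigma_Y, y, Sigma_y, calibrate_ml, n_samples=100):
    rng = np.random.default_rng(0)
    thetas = []
    for _ in range(n_samples):
        # Draw one sample of all 3-D target points and 2-D image measurements.
        Y_s = {j: rng.multivariate_normal(Yj, Sigma_Y) for j, Yj in Y.items()}
        y_s = {ij: rng.multivariate_normal(yij, Sigma_y[ij])
               for ij, yij in y.items()}
        # ML estimate of eq. (5) for this sample (assumed bundle adjustment).
        thetas.append(calibrate_ml(Y_s, y_s))
    thetas = np.asarray(thetas)
    E_theta = thetas.mean(axis=0)           # approximates eq. (1)
    Sigma = np.cov(thetas, rowvar=False)    # approximates eq. (4)
    return E_theta, Sigma
```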
2.4. Ground truth 3-D model

Given the mean and variance of the camera calibration, we are now in the position to estimate the expected value of the per-pixel depth and its variance. Again we sample the camera parameter distribution given by $E[\theta]$ and $\Sigma$ from eqs. (1) and (4):

$$p(\tilde{\theta}) = \frac{\exp\!\big( -\frac{1}{2} (E[\theta] - \tilde{\theta})^T \Sigma^{-1} (E[\theta] - \tilde{\theta}) \big)}{(2\pi)^{\frac{7+6N}{2}}\, |\Sigma|^{\frac{1}{2}}}\,, \qquad (6)$$

Figure 5. Four images (out of 25) of the fountain-R25 data-set.
and collect sufficient statistics for the per-pixel depth values. This proceeds in two stages. First, we find the first intersection of the laser-scan triangle mesh with the sampled camera rays. Secondly, we sample the error distribution of the laser scan data around this first triangle intersection. The result is the mean $D_l^{ij}$ and variance $D_\sigma^{ij}$ of the depth value for each pixel $j$ in image $i$. Note that this procedure allows multi-view stereo reconstructions to be evaluated independently of the accuracy of the camera calibration. If the performance of the stereo algorithm is evaluated in 3-D (e.g. by the Euclidean distance to the ground truth triangle mesh [16]), the accuracy of the camera calibration and the accuracy of the stereo algorithm are mixed. Here, the evaluation is relative to the calibration accuracy, i.e. pixels with a large depth variance, given the uncertainty of the calibration, will influence the evaluation criterion accordingly. Pixels with a large depth variance appear near depth boundaries and on strongly slanted surface parts; obviously, these depth values vary most under a varying camera position. The reference depth maps and their variance will only be used for the evaluation of multi-view stereo in sec. 3.2. When the goal is to evaluate a triangle mesh without a camera calibration, the evaluation is done in 3-D, equivalent to [16]. This applies to the first two categories of data-sets described in sec. 3.
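A minimal sketch of this two-stage accumulation, using the `trimesh` library for ray casting, is given below. The helper `draw_camera_rays`, which draws one camera from the Gaussian of eq. (6) and returns one ray per pixel in row-major order, is an assumption; for brevity the second stage (sampling the LIDAR noise around each intersection) is omitted.

```python
# Sketch: per-pixel depth mean and variance by casting rays from sampled
# cameras against the LIDAR triangle mesh.
import numpy as np
import trimesh

def depth_statistics(mesh, draw_camera_rays, width, height, n_samples=50):
    n  = np.zeros(height * width)    # hit count per pixel
    s1 = np.zeros(height * width)    # running sum of depths
    s2 = np.zeros(height * width)    # running sum of squared depths
    for _ in range(n_samples):
        # One camera drawn from eq. (6); ray index == pixel index (row-major).
        origins, dirs = draw_camera_rays()
        locs, ray_idx, _ = mesh.ray.intersects_location(
            origins, dirs, multiple_hits=False)   # first mesh intersection
        d = np.linalg.norm(locs - origins[ray_idx], axis=1)
        n[ray_idx]  += 1
        s1[ray_idx] += d
        s2[ray_idx] += d * d
    mean = np.where(n > 0, s1 / np.maximum(n, 1), np.nan)
    var  = np.where(n > 1, s2 / np.maximum(n, 1) - mean ** 2, np.nan)
    return mean.reshape(height, width), var.reshape(height, width)
```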
3. Evaluation of image based 3-D techniques

3-D modelling from high resolution images as the only input has made a huge step forward in being accurate and applicable to real scenes. Various authors propose a so-called structure and motion pipeline [2, 12, 13, 14, 17, 21, 22], which consists of mainly three steps. In the first step, the raw images undergo a sparse-feature based matching procedure. Matching is often based on invariant feature detectors [11] and descriptors [10], which are applied to pairs of input images. Secondly, the position and orientation as well as the internal camera parameters are obtained by camera calibration techniques [6]. The third step takes the input images, which have often been corrected for radial distortion, together with the camera parameters, and establishes dense correspondences or the complete 3-D model (see [16] for an overview). A minimal sketch of the first two steps is given below.
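The sketch below illustrates the first two steps for a single image pair using OpenCV. It is generic structure-and-motion machinery, not the specific pipelines of [2, 12, 13, 14, 17, 21, 22], and it assumes the intrinsic matrix K is already known, which corresponds to the calibrated setting discussed later.

```python
# Sketch: sparse matching + relative calibration for one image pair.
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    # Step 1: sparse feature detection and matching (SIFT features here).
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Step 2: robust relative pose from the essential matrix.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t   # step 3 (dense stereo) would consume these parameters
```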
We divided our data into three categories, which are used to evaluate several aspects of the 3-D acquisition pipeline:
3-D acquisition from uncalibrated raw images: The data-sets in this category are useful to evaluate techniques that take their images from the internet, e.g. flickr (see for instance Goesele et al. [5]), or for which the internal calibration of the cameras is not available. Here it is useful to evaluate the camera parameter estimation as well as the accuracy of the 3-D triangle mesh that has been computed from those cameras. Unfortunately, algorithms that produce a triangle mesh from uncalibrated images are still rare. Fully automatic software is, to our knowledge, not available, so we restrict the evaluation to the camera calibration, as will follow in the next section.
3-D acquisition with known internal cameras: Often it is possible to calibrate the internal camera parameters by using a calibration grid. These data-sets are the ideal candidates to study the possibility of replacing LIDAR scanning by image based acquisition. Results that integrate the camera pose estimation and the 3-D mesh generation could not be obtained.
Multi-View Stereo given all camera parameters: These data-sets are prepared to evaluate classical multi-view stereo algorithms, similar to the multi-view stereo evaluation by Seitz et al. [16].
The data-sets [1] are named by the convention sceneName-XN, where X = R, K, P corresponds to the three categories above ((R)aw images given, (K) matrix given, (P)rojection matrix given) and N is the number of images in the data-set; a small parsing helper is sketched below. In practice, image based reconstructions have to combine calibration and multi-view stereo. This is reflected by the first two categories. However, these two problems are often handled separately; we therefore included one category for pure multi-view stereo.
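For illustration, the naming convention can be parsed as follows. This is a hypothetical helper, not part of the benchmark distribution.

```python
# Sketch: split a sceneName-XN benchmark name into its parts.
import re

def parse_dataset_name(name):
    m = re.fullmatch(r"(?P<scene>.+)-(?P<cat>[RKP])(?P<n>\d+)", name)
    if m is None:
        raise ValueError(f"not a sceneName-XN dataset name: {name}")
    category = {"R": "raw images given", "K": "K matrix given",
                "P": "projection matrices given"}[m["cat"]]
    return m["scene"], category, int(m["n"])

print(parse_dataset_name("fountain-R25"))  # ('fountain', 'raw images given', 25)
print(parse_dataset_name("Herz-Jesu-P8"))  # ('Herz-Jesu', 'projection matrices given', 8)
```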
3.1. Camera calibration

To compare results of self-calibration techniques against our ground truth data, we first have to align our ground truth camera track with the evaluation track by a 3-D similarity transformation (scale, rotation and translation). This procedure transforms the coordinate system of the evaluation into our coordinate system. We used a non-linear optimisation of the 7 parameters to minimise

$$\epsilon = (E[\theta] - \theta_{eval})^T\, \Sigma^{-1}\, (E[\theta] - \theta_{eval}),$$

where $\theta$ and $\Sigma$ now include the subset of all camera position and orientation parameters.

Figure 6. Camera calibration for the fountain-R25 data-set in fig. 5: ARC-3D [3] (left) and Martinec et al. [8] (right).

Figure 7. Position error [m] of the camera calibration (Martinec et al. [8], blue; ARC-3D [3], red) for the fountain-R25 data (top) and the Herz-Jesu-R23 data (bottom). The green error bars indicate the 3σ value of the ground truth camera positions.
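As an illustration of this alignment step, the closed-form Umeyama solution below recovers scale, rotation and translation from corresponding camera centres. It is a sketch under that simplification: the paper's covariance-weighted non-linear minimisation would refine such an initial estimate.

```python
# Sketch: 7-parameter similarity alignment of two camera tracks.
import numpy as np

def umeyama_alignment(X, Y):
    """Return s, R, t minimising sum ||Y_i - (s R X_i + t)||^2 (X, Y: 3xN)."""
    n = X.shape[1]
    mx, my = X.mean(axis=1, keepdims=True), Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mx, Y - my
    U, D, Vt = np.linalg.svd(Yc @ Xc.T / n)   # SVD of the cross-covariance
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # guard against a reflection
    R = U @ S @ Vt
    var_x = (Xc ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_x
    t = my - s * (R @ mx)
    return s, R, t
```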
For the evaluation we used ARC-3D [3] and obtained results from Martinec et al. [8]. ARC-3D is a fully automatic web application; Martinec et al. [8] scored second in the ICCV 2005 challenge "Where am I?". Both methods successfully calibrated all cameras for the fountain-R25 data-set. For the Herz-Jesu-R23 data only Martinec et al. [8] was able to reconstruct all cameras; ARC-3D succeeded in calibrating four of the 21 cameras. The result of this automatic camera calibration is shown in fig. 6, which shows the position and orientation of both the ground truth and the estimated cameras. Fig. 7 shows the difference in camera position with respect to the ground truth.

References

R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press.

D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV.

K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE TPAMI.

N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. ACM SIGGRAPH.