
Calibrating and optimizing poses of visual sensors in distributed platforms

Eva Hörster, +1 more
- 01 Dec 2006 - 
- Vol. 12, Iss: 3, pp 195-210
TLDR
A linear programming approach is derived that determines jointly for each camera the pan and tilt angle that maximizes the coverage of the space at a given sampling frequency, demonstrating the gain in visual coverage.
Abstract
Many novel multimedia, home entertainment, visual surveillance and health applications use multiple audio-visual sensors. We present a novel approach for position and pose calibration of visual sensors, i.e., cameras, in a distributed network of general purpose computing devices (GPCs). It complements our work on position calibration of audio sensors and actuators in a distributed computing platform (Raykar et al. in Proceedings of ACM Multimedia '03, pp. 572-581, 2003). The approach is suitable for a wide range of possible, even mobile, setups since (a) synchronization is not required, (b) it works automatically, (c) only weak restrictions are imposed on the positions of the cameras, and (d) no upper limit on the number of cameras under calibration is imposed. Corresponding points across different camera images are established automatically. Cameras do not have to share one common view. Only a reasonable overlap between camera subgroups is necessary. The method has been successfully tested in numerous multi-camera environments with a varying number of cameras and has proven to be extremely accurate. Once all distributed visual sensors are calibrated, we focus on post-optimizing their poses to increase coverage of the space observed. A linear programming approach is derived that determines jointly for each camera the pan and tilt angle that maximizes the coverage of the space at a given sampling frequency. Experimental results clearly demonstrate the gain in visual coverage.


Universität Augsburg
Calibrating and Optimizing Poses of
Visual Sensors in Distributed Platforms
E. Hörster, R. Lienhart
Report 2006-19, July 2006
Institut für Informatik
D-86135 Augsburg

Copyright © E. Hörster, R. Lienhart
Institut für Informatik
Universität Augsburg
D-86135 Augsburg, Germany
http://www.Informatik.Uni-Augsburg.DE
All rights reserved

Calibrating and Optimizing Poses of Visual Sensors in
Distributed Platforms
Eva Hörster, Rainer Lienhart
Multimedia Computing Lab
University of Augsburg
Augsburg, Germany
{hoerster,lienhart}@informatik.uni-augsburg.de
ABSTRACT
Many novel multimedia, home entertainment, visual surveil-
lance and health applications use multiple audio-visual sen-
sors. We present a novel approach for position and pose
calibration of visual sensors, i.e. cameras, in a distributed
network of general purpose computing devices (GPCs). It
complements our work on position calibration of audio sen-
sors and actuators in a distributed computing platform [22].
The approach is suitable for a wide range of possible - even
mobile - setups since (a) synchronization is not required,
(b) it works automatically, (c) only weak restrictions are im-
posed on the positions of the cameras, and (d) no upper limit
on the number of cameras and displays under calibration is
imposed. Corresponding points across different camera im-
ages are established automatically. Cameras do not have
to share one common view. Only a reasonable overlap be-
tween camera subgroups is necessary. The method has been
successfully tested in numerous multi-camera environments
with a varying number of cameras and has proven to be
extremely accurate. Once all distributed visual sensors
are calibrated, we focus on post-optimizing their poses to in-
crease coverage of the space observed. A linear programming
approach is derived that determines jointly for each camera
the pan and tilt angle that maximizes the coverage of the
space at a given sampling frequency. Experimental results
clearly demonstrate the gain in visual coverage.
1. INTRODUCTION
Today we can find microphones, cameras, loudspeakers
and displays nearly everywhere - in public, at home and at
work. These audio/video sensors and actuators are often a
component of computing and communication devices such
as laptops, PDAs and tablets, which we refer to as General
Purpose Computers (GPCs). Often GPCs are networked us-
ing high-speed wired or wireless connections. The resulting
array of audio/video sensors and actuators along with array
processing algorithms offers a set of new features for mul-
timedia applications such as video conferencing, smart con-
ference rooms, video surveillance, games, e-learning, home
entertainment and image based rendering.
Many of the above mentioned audio-visual array process-
ing algorithms require precise knowledge about the positions
and poses of the sensors and actuators as well as the cover-
age that is achieved by those sensors. This demands a sim-
ple and convenient calibration approach to put all sensors
and actuators into a common time and space. [14] proposes
a means to provide a common time reference to multiple
distributed GPCs. In [22] a method for automatically cal-
ibrating audio sensors and actuators is presented. In this
paper we focus on visual sensors where a room or area is
instrumented with N ≥ 3 static cameras connected to net-
worked GPCs. No precise synchronization of the different
devices is required.
In the first part of this paper we focus on providing a com-
mon space for multiple cameras by actively estimating their
3D positions and poses. We also address the problem of
effortlessly calibrating the intrinsic parameters of multiple
cameras.
In the second part of the paper another important issue in
designing visual sensor arrays is considered: orienting the
visual sensors such that they achieve optimal coverage of a
given space at a predefined ’sampling rate’ (see Section 3 for
a precise definition). We assume that the positions and initial
poses are given. This is reasonable because either cameras
have been already installed (e.g. at an airport), or they are
put up arbitrarily. Currently there exists only little theoretical
research on planning visual sensor positions and poses.
Positions and initial poses of the multiple cameras can be
determined automatically by our calibration approach (see
Section 2). Given the fixed positions, we develop a linear
programming model that determines the optimal poses (pan
and tilt angles) with respect to coverage while maintaining
the required resolution (i.e. minimal ’sampling frequency’).
Fig. 1 shows one ineffective setup that we desire to optimize.
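To make the optimization concrete, the following is a minimal sketch of how such a pose-selection problem can be phrased as an integer linear program. It assumes the space has been discretized into grid points, that each camera has a finite set of candidate pan/tilt poses, and that a precomputed predicate covers(c, p, g) says whether camera c in pose p observes grid point g at the required sampling frequency; the identifiers and the use of the PuLP modelling library are illustrative and not taken from the paper, whose actual model is developed in Section 3.

```python
# Illustrative sketch only: pose selection as an integer linear program.
# Assumptions (not from the paper): cameras, candidate poses and grid points
# are integer indices, and covers(c, p, g) encodes visibility at the required
# sampling frequency.
import pulp

def optimize_poses(cameras, poses, grid_points, covers):
    prob = pulp.LpProblem("camera_pose_coverage", pulp.LpMaximize)

    # x[c, p] = 1 if camera c is assigned candidate pose p (one pan/tilt pair)
    x = {(c, p): pulp.LpVariable(f"x_{c}_{p}", cat="Binary")
         for c in cameras for p in poses[c]}
    # y[g] = 1 if grid point g is covered by at least one selected pose
    y = {g: pulp.LpVariable(f"y_{g}", cat="Binary") for g in grid_points}

    # Objective: maximize the number of covered grid points
    prob += pulp.lpSum(y[g] for g in grid_points)

    # Every camera is assigned exactly one pose
    for c in cameras:
        prob += pulp.lpSum(x[(c, p)] for p in poses[c]) == 1

    # A grid point may only count as covered if some selected pose sees it
    for g in grid_points:
        prob += y[g] <= pulp.lpSum(x[(c, p)]
                                   for c in cameras for p in poses[c]
                                   if covers(c, p, g))

    prob.solve()
    return {c: next(p for p in poses[c] if x[(c, p)].value() > 0.5)
            for c in cameras}
```

Under these assumptions, solving the binary program (or its linear-programming relaxation) yields a joint pan/tilt assignment that maximizes coverage while respecting the resolution requirement encoded in covers.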
Related Work: Camera calibration is a well researched
topic in computer vision. Fundamentally there are two dif-
ferent methods of camera calibration: photogrammetric cal-
ibration and self-calibration [31]. The first method uses a
3D, a 2D (planar), or a virtual calibration object of pre-
cisely known geometry. Important approaches are described
in [31] [11] [28] [4] [27]. Planar methods are very popular
because it is easy to obtain a calibration target by just print-
ing the pattern and fixing the paper on a flat surface.

Figure 1: Example of an inefficient setup we desire to optimize

Although providing good results, the major drawback of these
calibration methods is that they require special equipment
or precise manual measurements. Virtual calibration ob-
jects are constructed over time by tracking an easily iden-
tifiable object through a 3D scene. The cameras usually
have to be synchronized and thus the setup requires spe-
cial equipment. Self-calibration techniques ([9] [26] [20]) do
not require any special calibration target. They simultane-
ously process several images from different perspectives of
a scene and are based on point correspondences across the
images. The accuracy of these methods depends on how ac-
curately those point correspondences can be extracted be-
tween images. Point correspondences are extracted auto-
matically from the images by identifying 2D features and
tracking those between the different perspective views. Dif-
ferent feature extraction algorithms exist (see [8] [24] [15]).
There also exist self-calibration approaches using silhouettes
or trajectories of moving objects [21] [25]. Multiple cam-
era calibration can be solved globally in one step, or multi-
ple subsets of cameras and displays are calibrated first and
then merged into a global coordinate system. Since the first
method is only suitable if all cameras share a common view,
we follow the second, more general approach.
Although a significant amount of research exists in designing
and calibrating video sensor arrays, automated visual sensor
placement and alignment in general has not been addressed
often. There is some work in the area of grid coverage prob-
lems with sensors sensing events that occur within a distance
r (the sensing range of the sensor) [23] [13] [29] [32]. Our
work is based on those approaches, but differs in the sensor
model (since cameras do not possess circular sensing ranges)
as well as the cost function and some constraints. In [5] a
camera placement algorithm based on a binary optimization
technique is proposed. The algorithm aims to find the place-
ment with minimum cost of a camera set such that a given
space is viewed with some minimal spatial resolution. Space
is represented as an occupancy grid and the authors focused
on planar regions. A similar task is considered in [12] and
also solved by linear programming techniques. In [19] the
authors analyze the visibility from static sensors probabilis-
tically and present a solution for maximizing visibility in a
given region of interest. They solve the problem by simu-
lated annealing.
Contributions: The main contributions of the paper are:

- A procedure to automatically calibrate the positions and poses of sensors without using calibration objects. Thus no special equipment is required. In addition the setup does not have to be synchronized. It only requires filtering out temporally unstable salient points and keeping only stationary features. Our method is simple and convenient to use and offers mobility of the entire setup. The camera views are assumed to overlap only partly, i.e. only some cameras share a common view.

- The usage of an active display as our calibration target for intrinsic calibration, giving us control over the calibration pattern to be displayed. As a result the extraction of feature points is easier and more reliable. The calibration pattern can be made adaptive to the distance between the camera and the pattern's image on the LCD screen.

- The automatic extraction of control points and point correspondences across images.

- A procedure to determine the optimal poses of the cameras such that coverage is maximized while maintaining a minimal resolution.
The rest of the paper is organized as follows. In Section
2 we formulate the calibration problem and present our so-
lution. We describe how point features are extracted and
tracked between images and outline the calibration of the
intrinsic parameters of each camera. The algorithm used to
determine the extrinsic parameters, i.e. the positions and
poses of all cameras in a common coordinate system, is pre-
sented. In Section 3 we formulate the optimization problem
of maximizing coverage with multiple cameras by pose vari-
ation. Our solution is presented and results are reported.
The paper concludes with a summary and an outlook in
Section 4.
2. MULTIPLE CAMERA CALIBRATION
2.1 Problem Formulation
Given M cameras, the goal is to determine the cameras’
internal parameters and the 3D positions and poses of the
cameras automatically. Therefore we only make the assump-
tion that we know the number of visual sensors in the net-
work.
In this work we use an enhanced perspective model to de-
scribe our cameras. The mapping performed by a perspec-
tive camera between a 3D point X and its 2D image point
x, both represented by their homogeneous coordinates, is
usually represented by a 3 × 4 matrix, the camera projec-
tive matrix P: x ≃ PX. The matrix P can be written as
P = K[R|t] where K is a 3 × 3 upper triangular matrix
containing the camera intrinsic parameters:
K = \begin{pmatrix} f_x & s & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{pmatrix}    (1)
The parameters f_x and f_y denote the focal length, and p_x and p_y
denote the coordinates of the principal point, each in
terms of pixel dimensions. s denotes the skew. For most
commercial cameras, and hence below, the skew is consid-
ered to be zero. The 3 × 3 rotation matrix R and the 3 × 1
translation vector t describe the 3D position and pose of
the camera. As some desktop cameras exhibit significant
distortions, this model has to be enriched by some distor-
tion components.

Figure 2: General calibration problem

The distortion model introduced in [11]
accounts for tangential and radial distortions using two co-
efficients. It describes the distortions occurring in practice suffi-
ciently precisely. In the following discussion we assume that
the distortion parameters of each camera are known and the
effects of those have been removed from all images.
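As a small, self-contained illustration of this camera model (not code from the paper), the sketch below builds K as in Eq. (1) with zero skew and projects a 3D point through P = K[R|t]; it assumes, as stated above, that lens distortion has already been removed from the images.

```python
# Minimal sketch of the perspective model x ~ P X with P = K [R | t]
# (skew s = 0; lens distortion is assumed to have been removed already).
import numpy as np

def intrinsic_matrix(fx, fy, px, py, s=0.0):
    """Build K as in Eq. (1)."""
    return np.array([[fx, s,  px],
                     [0., fy, py],
                     [0., 0., 1.]])

def project(K, R, t, X):
    """Project a 3D point X (3-vector) to inhomogeneous pixel coordinates."""
    P = K @ np.hstack([R, t.reshape(3, 1)])   # 3 x 4 projection matrix
    x_h = P @ np.append(X, 1.0)               # homogeneous image point
    return x_h[:2] / x_h[2]                   # divide out the projective scale

# Example: a camera at the origin looking along the z-axis
K = intrinsic_matrix(fx=800, fy=800, px=320, py=240)
print(project(K, np.eye(3), np.zeros(3), np.array([0.1, 0.2, 2.0])))
```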
Different views of the same scene are related to each other.
These relations can be used for our multiple camera calibra-
tion task. Therefore we need to determine a set of corre-
sponding points across the different images. Points are said
to correspond if they represent the same scene point in dif-
ferent views. This general calibration problem is illustrated
in Fig. 2.
A set of 3D points X_i is viewed by a set of cameras with
matrices P_j. Let x_i^j denote the coordinates of the i-th
point as detected in the j-th camera image. A 3D point
may not be visible in all cameras, thus its corresponding
projected point will not be available in all images. The cal-
ibration problem is then to find the set of camera matrices
P_j and points X_i such that for all image points x_i^j ≃ P_j X_i
holds. However, unless additional constraints are given, it is
in principle only possible to determine the camera matrices
up to a projective ambiguity. Additional constraints arising
from knowledge about the cameras’ parameters and/or the
scene can be used to restrict this ambiguity up to an affine,
metric or Euclidean transformation.
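Written out, the joint estimation described here is the usual reprojection-error minimization; the exact objective is not stated in this excerpt, but the final bundle adjustment mentioned later minimizes a criterion of this form:

```latex
\min_{\{P_j\},\,\{X_i\}} \; \sum_{i}\sum_{j} v_{ij}\,
    d\!\left(x_i^j,\; P_j X_i\right)^2,
\qquad
v_{ij} =
\begin{cases}
  1 & \text{if point } i \text{ is visible in camera } j,\\
  0 & \text{otherwise,}
\end{cases}
```

where d(·,·) denotes the Euclidean distance between the measured image point and the reprojected point in inhomogeneous coordinates.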
Solution: We solve the camera calibration problem in
two stages. In a first stage we determine the cameras’ intrin-
sic parameters. Intrinsic calibration is done independently
for each camera by using a flat-panel display as the pla-
nar calibration object. In a second stage camera positions
and poses are computed in a common coordinate system
(extrinsic calibration). Their positions and poses can be de-
termined relative to each other up to a global coordinate transformation. In
a typical distributed camera environment each camera can
only see a small volume of the total viewing space and differ-
ent intersecting subsets of cameras share different intersect-
ing views. Hence multiple camera calibrations are performed
by calibrating subsets of cameras and then building a global
coordinate system from individual overlapping views.
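As a rough sketch of the first stage (intrinsic calibration against a planar target), the code below assumes a checkerboard pattern shown on the flat-panel display and photographed from several viewpoints, and uses OpenCV's stock implementation of the planar method of [31] rather than the paper's own pipeline; board and square sizes are placeholder values.

```python
# Sketch of per-camera intrinsic calibration against a planar pattern shown on
# the flat-panel display (OpenCV's implementation of the planar method [31];
# board_size and square_size are placeholder values).
import cv2
import numpy as np

def calibrate_intrinsics(images, board_size=(9, 6), square_size=0.03):
    # 3D corner coordinates in the display plane (z = 0), square_size apart
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2)
    objp *= square_size

    obj_points, img_points, img_size = [], [], None
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        img_size = gray.shape[::-1]                      # (width, height)
        found, corners = cv2.findChessboardCorners(gray, board_size)
        if found:
            corners = cv2.cornerSubPix(
                gray, corners, (11, 11), (-1, -1),
                (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
            obj_points.append(objp)
            img_points.append(corners)

    # K holds f_x, f_y, p_x, p_y as in Eq. (1); dist holds the radial and
    # tangential distortion coefficients later removed from the images
    rms, K, dist, _, _ = cv2.calibrateCamera(
        obj_points, img_points, img_size, None, None)
    return K, dist
```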
2.2 Point Correspondences
2D point correspondences between projections of the same
3D point onto different camera planes can be generally used
to recover the calibration matrices of the cameras.

Figure 3: Matched points are visualized by a connecting line between images

Therefore establishing such correspondences is the first step in
determining the cameras' parameters. To establish point cor-
respondences, each image is at first represented by a set of
features. Each feature describes a specific image point, and
its neighborhood. Subsequently these features are input to
a matching procedure, which identifies features in different
images that correspond to the same point in the observed
scene. There are various approaches for extracting a set of
interest points and features from an image. Our approach
uses the so-called SIFT-features proposed in [15]. SIFT-
based feature descriptors were identified in [18] to deliver
the most suitable features in the context of matching points
of a scene under different viewing conditions such as differ-
ent lighting and changes in 3D viewpoint.
SIFT-Features Extraction: The SIFT-feature extrac-
tion method combines a scale invariant region detector and a
descriptor based on the gradient distribution in the detected
regions. In order to compute a set of characteristic image fea-
tures, first a set of interest points - also called keypoints -
is found by detecting scale-space extrema. Only keypoints
that are stable under a certain amount of additive noise are
preserved. An image location, scale and orientation is as-
signed to each keypoint. This enables the construction of
a repeatable local 2D coordinate system, in which the local
image (pixel and its surrounding region) is described invari-
antly with respect to these parameters. Finally a descriptor for each
keypoint is calculated based upon image gradients in the lo-
cal image. However this approach has its limitations. To
ensure a sufficient number of reliable matching points, the
displacement between the cameras should not exceed 15°.
The resulting correspondences are within pixel accuracy.
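As an illustration of this extraction step (OpenCV's implementation of [15], standing in for the paper's own code):

```python
# Sketch of SIFT keypoint and descriptor extraction using OpenCV's
# implementation of [15] (requires OpenCV >= 4.4).
import cv2

def extract_sift_features(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    # Each keypoint carries a location (kp.pt), scale (kp.size) and
    # orientation (kp.angle); each descriptor is a 128-dimensional vector.
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors
```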
SIFT-Feature Matching: The matching technique used
for the SIFT-features has been proposed in [15]. Point cor-
respond ences between two images are established by com-
paring their respective keypoint descriptors. Matching is
performed by first individually measuring the Euclidean dis-
tance of each feature vector (representing a certain keypoint)
of one image to each feature vector of the other image. The
best matching candidate for a specific keypoint is identified
by the keypoint belonging to the feature vector with the min-
imum distance. A match is found in the second image if the
distance ratio between the nearest and the second nearest
neighbor (closest/second closest) is below a threshold. An
example of matched points between two images is shown in
Fig. 3.
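The sketch below mirrors this matching procedure with a brute-force nearest-neighbour search and the closest/second-closest distance-ratio test; the 0.8 threshold is the value suggested in [15] and is an assumption here, not a parameter reported in this paper.

```python
# Sketch of SIFT descriptor matching with the distance-ratio test described
# above; the 0.8 threshold follows [15] and is not taken from this paper.
import cv2

def match_ratio_test(desc1, desc2, ratio=0.8):
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # For each descriptor of image 1, find its two nearest neighbours in image 2
    pairs = matcher.knnMatch(desc1, desc2, k=2)
    # Keep a match only if the nearest neighbour is clearly better than the
    # second nearest (closest/second closest below the ratio threshold)
    return [m for m, n in (p for p in pairs if len(p) == 2)
            if m.distance < ratio * n.distance]
```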
Subpixel Accuracy: The result of SIFT-feature match-
ing is only at pixel accuracy. For position estimation of
multiple cameras experiments have shown that it is essen-

Citations

Computer Vision: A Modern Approach

David Forsyth, +1 more
TL;DR: Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance and describes numerous important application areas such as image based rendering and digital libraries.
Journal ArticleDOI

A convenient multicamera self-calibration for virtual environments

TL;DR: It is shown that it is possible to calibrate an immersive virtual environment with 16 cameras in less than 60 minutes reaching about 1/5 pixel reprojection error.
Proceedings ArticleDOI

On the optimal placement of multiple visual sensors

TL;DR: This paper focuses on the placement of visual sensors with respect to maximizing coverage or achieving coverage at a certain resolution and proposes different algorithms which give a global optimum solution and heuristics which solve the problem within reasonable time and memory consumption at the cost of not necessarily determining the global optimum.
Journal ArticleDOI

The Coverage Problem in Video-Based Wireless Sensor Networks: A Survey

TL;DR: The coverage problem is a crucial issue of wireless sensor networks, requiring specific solutions when video-based sensors are employed, and the state of the art of this particular issue is surveyed regarding strategies, algorithms and general computational solutions.
Proceedings ArticleDOI

Optimal sensor placement for surveillance of large spaces

TL;DR: The practical problem of optimally placing the multiple PTZ cameras to ensure maximum coverage of user defined priority areas with optimum values of parameters like pan, tilt, zoom and the locations of the cameras is addressed.
References
Journal ArticleDOI

Distinctive Image Features from Scale-Invariant Keypoints

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Book

Multiple view geometry in computer vision

TL;DR: In this article, the authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly in a unified framework, including geometric principles and how to represent objects algebraically so they can be computed and applied.

Multiple View Geometry in Computer Vision.

TL;DR: This book is referred to read because it is an inspiring book to give you more chance to get experiences and also thoughts and it will show the best book collections and completed collections.
Proceedings ArticleDOI

A Combined Corner and Edge Detector

TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work.
Journal ArticleDOI

A flexible new technique for camera calibration

TL;DR: A flexible technique to easily calibrate a camera that only requires the camera to observe a planar pattern shown at a few (at least two) different orientations is proposed and advances 3D computer vision one more step from laboratory environments to real world use.
Frequently Asked Questions (11)
Q1. What contributions have the authors mentioned in the paper "Calibrating and optimizing poses of visual sensors in distributed platforms" ?

The authors present a novel approach for position and pose calibration of visual sensors, i. e. cameras, in a distributed network of general purpose computing devices ( GPCs ). 

As the change in viewpoint between the different cameras is restricted, future work is needed to improve the automatic extraction of point correspondences between images. Future work on this topic will include the investigation of how to handle large numbers of grid points. 

Registration of triplets and sub-groups is achieved by computing a homography of 3-space between the different metric structures. 

The use of SIFT-feature matching in combination with a flat screen displaying a known pattern enables us to easily and automatically detect the subset of image points.

Considering N cameras that are calibrated, i.e. their fields-of-view as well as positions in the space are known, the authors formulate their camera positioning problem in terms of maximizing the coverage. 

As the optimization problem of the final bundle adjustment is of very high dimension, a poor initial guess commonly results in the non-linear optimization to fail completely, i.e. to converge to a suboptimal solution or to not converge at all. 

The dimension of the minimization problem then adds up to a total of 6(N-1) parameters for the camera matrices, plus a set of 3L parameters for the coordinates of the L reconstructed 3D points.

The basic optimization problem solved by the feature tracker is:

\min_{d,\,D} \sum_{x=-\omega_x}^{\omega_x} \sum_{y=-\omega_y}^{\omega_y} \left( I(x + u) - J\big((D + I_{2\times 2})\,x + d + u\big) \right)^2 \qquad (2)

where I(u), J(u) represent the grey-scale values of the two images at location u, the vector d = [d_x\; d_y]^T is the optical flow at location u, and the matrix D denotes an affine deformation matrix characterized by the four coefficients d_{xx}, d_{xy}, d_{yx}, d_{yy}:

D = \begin{pmatrix} d_{xx} & d_{xy} \\ d_{yx} & d_{yy} \end{pmatrix} \qquad (3)

The objective of affine tracking is then to choose d and D in a way that minimizes the dissimilarity between feature windows of size 2\omega_x + 1 in x and size 2\omega_y + 1 in y direction around the points u and v in I and J, respectively.

Given the fixed positions, the authors develop a linear programming model that determines the optimal poses (pan and tilt angles) with respect to coverage while maintaining the required resolution (i.e. minimal ’sampling frequency’). 

Additional constraints arising from knowledge about the cameras’ parameters and/or the scene can be used to restrict this ambiguity up to an affine, metric or Euclidean transformation. 
