
Calibrating and optimizing poses of visual sensors in distributed platforms

Eva Hörster, +1 more
- 01 Dec 2006 - 
- Vol. 12, Iss: 3, pp 195-210
TLDR
A linear programming approach is derived that determines jointly for each camera the pan and tilt angle that maximizes the coverage of the space at a given sampling frequency, demonstrating the gain in visual coverage.
Abstract
Many novel multimedia, home entertainment, visual surveillance and health applications use multiple audio-visual sensors. We present a novel approach for position and pose calibration of visual sensors, i.e., cameras, in a distributed network of general purpose computing devices (GPCs). It complements our work on position calibration of audio sensors and actuators in a distributed computing platform (Raykar et al. in Proceedings of ACM Multimedia '03, pp. 572-581, 2003). The approach is suitable for a wide range of possible, even mobile, setups since (a) synchronization is not required, (b) it works automatically, (c) only weak restrictions are imposed on the positions of the cameras, and (d) no upper limit on the number of cameras under calibration is imposed. Corresponding points across different camera images are established automatically. Cameras do not have to share one common view. Only a reasonable overlap between camera subgroups is necessary. The method has been successfully tested in numerous multi-camera environments with a varying number of cameras and has proven to be extremely accurate. Once all distributed visual sensors are calibrated, we focus on post-optimizing their poses to increase coverage of the space observed. A linear programming approach is derived that determines jointly for each camera the pan and tilt angle that maximizes the coverage of the space at a given sampling frequency. Experimental results clearly demonstrate the gain in visual coverage.


Universität Augsburg
Calibrating and Optimizing Poses of
Visual Sensors in Distributed Platforms
E. Hörster, R. Lienhart
Report 2006-19, July 2006
Institut für Informatik
D-86135 Augsburg

Copyright © E. Hörster, R. Lienhart
Institut für Informatik
Universität Augsburg
D-86135 Augsburg, Germany
http://www.Informatik.Uni-Augsburg.DE
All rights reserved

Calibrating and Optimizing Poses of Visual Sensors in
Distributed Platforms
Eva Hörster, Rainer Lienhart
Multimedia Computing Lab
University of Augsburg
Augsburg, Germany
{hoerster,lienhart}@informatik.uni-augsburg.de
ABSTRACT
Many novel multimedia, home entertainment, visual surveil-
lance and health applications use multiple audio-visual sen-
sors. We present a novel approach for position and pose
calibration of visual sensors, i.e. cameras, in a distributed
network of general purpose computing devices (GPCs). It
complements our work on position calibration of audio sen-
sors and actuators in a distributed computing platform [22].
The approach is suitable for a wide range of possible - even
mobile - setups since (a) synchronization is not required,
(b) it works automatically, (c) only weak restrictions are im-
posed on the positions of the cameras, and (d) no upper limit
on the number of cameras and displays under calibration is
imposed. Corresponding points across different camera im-
ages are established automatically. Cameras do not have
to share one common view. Only a reasonable overlap be-
tween camera subgroups is necessary. The method has been
successfully tested in numerous multi-camera environments
with a varying number of cameras and has proven to be
extremely accurate. Once all distributed visual sensors
are calibrated, we focus on post-optimizing their poses to in-
crease coverage of the space observed. A linear programming
approach is derived that determines jointly for each camera
the pan and tilt angle that maximizes the coverage of the
space at a given sampling frequency. Experimental results
clearly demonstrate the gain in visual coverage.
1. INTRODUCTION
Today we can find microphones, cameras, loudspeakers
and displays nearly everywhere - in public, at home and at
work. These audio/video sensors and actuators are often a
component of computing and communication devices such
as laptops, PDAs and tablets, which we refer to as General
Purpose Computers (GPCs). Often GPCs are networked us-
ing high-speed wired or wireless connections. The resulting
array of audio/video sensors and actuators along with array
processing algorithms offers a set of new features for mul-
timedia applications such as video conferencing, smart con-
ference rooms, video surveillance, games, e-learning, home
entertainment and image based rendering.
Many of the above mentioned audio-visual array process-
ing algorithms require precise knowledge about the positions
and poses of the sensors and actuators as well as the cover-
age that is achieved by those sensors. This demands a sim-
ple and convenient calibration approach to put all sensors
and actuators into a common time and space. [14] proposes
a means to provide a common time reference to multiple
distributed GPCs. In [22] a method for automatically cal-
ibrating audio sensors and actuators is presented. In this
paper we focus on visual sensors where a room or area is
instrumented with N ≥ 3 static cameras connected to net-
worked GPCs. No precise synchronization of the different
devices is required.
In the first part of this paper we focus on providing a com-
mon space for multiple cameras by actively estimating their
3D positions and poses. We also address the problem of
effortlessly calibrating the intrinsic parameters of multiple
cameras.
In the second part of the paper another important issue in
designing visual sensor arrays is considered: orienting the
visual sensors such that they achieve optimal coverage of a
given space at a predefined ’sampling rate’ (see Section 3 for
a precise definition). We assume that the positions and initial
poses are given. This is reasonable because either cameras
have been already installed (e.g. at an airport), or they are
put up arbitrarily. Currently there exists only little theoretical
research on planning visual sensor positions and poses.
Positions and initial poses of the multiple cameras can be
determined automatically by our calibration approach (see
Section 2). Given the fixed positions, we develop a linear
programming model that determines the optimal poses (pan
and tilt angles) with respect to coverage while maintaining
the required resolution (i.e. minimal ’sampling frequency’).
Fig. 1 shows one ineffective setup that we desire to optimize.
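To make the optimization concrete, the following is a minimal sketch of how such a pose-selection problem can be phrased as an integer linear program. It assumes the space has been discretized into grid points, that each camera has a finite set of candidate pan/tilt poses, and that a precomputed predicate covers(c, p, g) says whether camera c in pose p observes grid point g at the required sampling frequency; the identifiers and the use of the PuLP modelling library are illustrative and not taken from the paper, whose actual model is developed in Section 3.

```python
# Illustrative sketch only: pose selection as an integer linear program.
# Assumptions (not from the paper): cameras, candidate poses and grid points
# are integer indices, and covers(c, p, g) encodes visibility at the required
# sampling frequency.
import pulp

def optimize_poses(cameras, poses, grid_points, covers):
    prob = pulp.LpProblem("camera_pose_coverage", pulp.LpMaximize)

    # x[c, p] = 1 if camera c is assigned candidate pose p (one pan/tilt pair)
    x = {(c, p): pulp.LpVariable(f"x_{c}_{p}", cat="Binary")
         for c in cameras for p in poses[c]}
    # y[g] = 1 if grid point g is covered by at least one selected pose
    y = {g: pulp.LpVariable(f"y_{g}", cat="Binary") for g in grid_points}

    # Objective: maximize the number of covered grid points
    prob += pulp.lpSum(y[g] for g in grid_points)

    # Every camera is assigned exactly one pose
    for c in cameras:
        prob += pulp.lpSum(x[(c, p)] for p in poses[c]) == 1

    # A grid point may only count as covered if some selected pose sees it
    for g in grid_points:
        prob += y[g] <= pulp.lpSum(x[(c, p)]
                                   for c in cameras for p in poses[c]
                                   if covers(c, p, g))

    prob.solve()
    return {c: next(p for p in poses[c] if x[(c, p)].value() > 0.5)
            for c in cameras}
```

Under these assumptions, solving the binary program (or its linear-programming relaxation) yields a joint pan/tilt assignment that maximizes coverage while respecting the resolution requirement encoded in covers.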
Related Work: Camera calibration is a well researched
topic in computer vision. Fundamentally there are two dif-
ferent methods of camera calibration: photogrammetric cal-
ibration and self-calibration [31]. The first method uses a
3D, a 2D (planar), or a virtual calibration object of pre-
cisely known geometry. Important approaches are described
in [31] [11] [28] [4] [27]. Planar methods are very popular
because it is easy to obtain a calibration target by just print-
ing the pattern and fixing the paper on a flat surface.

Figure 1: Example of an inefficient setup we desire to optimize

Although providing good results, the major drawback of these
calibration methods is that they require special equipment
or precise manual measurements. Virtual calibration ob-
jects are constructed over time by tracking an easily iden-
tifiable object through a 3D scene. The cameras usually
have to be synchronized and thus the setup requires spe-
cial equipment. Self-calibration techniques ([9] [26] [20]) do
not require any special calibration target. They simultane-
ously process several images from different perspectives of
a scene and are based on point correspondences across the
images. The accuracy of these methods depends on how ac-
curately those point correspondences can be extracted be-
tween images. Point correspondences are extracted auto-
matically from the images by identifying 2D features and
tracking those between the different perspective views. Dif-
ferent feature extraction algorithms exist (see [8] [24] [15]).
There also exist self-calibration approaches using silhouettes
or trajectories of moving objects [21] [25]. Multiple cam-
era calibration can be solved globally in one step, or multi-
ple subsets of cameras and displays are calibrated first and
then merged into a global coordinate system. Since the first
method is only suitable if all cameras share a common view,
we follow the second, more general approach.
Although a significant amount of research exists in designing
and calibrating video sensor arrays, automated visual sensor
placement and alignment in general has not been addressed
often. There is some work in the area of grid coverage prob-
lems with sensors sensing events that occur within a distance
r (the sensing range of the sensor) [23] [13] [29] [32]. Our
work is based on those approaches, but differs in the sensor
model (since cameras do not possess circular sensing ranges)
as well as the cost function and some constraints. In [5] a
camera placement algorithm based on a binary optimization
technique is proposed. The algorithm aims to find the place-
ment with minimum cost of a camera set such that a given
space is viewed with some minimal spatial resolution. Space
is represented as an occupancy grid and the authors focused
on planar regions. A similar task is considered in [12] and
also solved by linear programming techniques. In [19] the
authors analyze the visibility from static sensors probabilis-
tically and present a solution for maximizing visibility in a
given region of interest. They solve the problem by simu-
lated annealing.
Contributions: The main contributions of the paper are:

- A procedure to automatically calibrate the positions and poses of sensors without using calibration objects. Thus no special equipment is required. In addition the setup does not have to be synchronized. It only requires filtering out temporally unstable salient points and keeping only stationary features. Our method is simple and convenient to use and offers mobility of the entire setup. The camera views are assumed to overlap only partly, i.e. only some cameras share a common view.

- The usage of an active display as our calibration target for intrinsic calibration, giving us control over the calibration pattern to be displayed. As a result the extraction of feature points is easier and more reliable. The calibration pattern can be made adaptive to the distance between the camera and the pattern's image on the LCD screen.

- The automatic extraction of control points and point correspondences across images.

- A procedure to determine the optimal poses of the cameras such that coverage is maximized while maintaining a minimal resolution.
The rest of the paper is organized as follows. In Section
2 we formulate the calibration problem and present our so-
lution. We describe how point features are extracted and
tracked between images and outline the calibration of the
intrinsic parameters of each camera. The algorithm used to
determine the extrinsic parameters, i.e. the positions and
poses of all cameras in a common coordinate system, is pre-
sented. In Section 3 we formulate the optimization problem
of maximizing coverage with multiple cameras by pose vari-
ation. Our solution is presented and results are reported.
The paper concludes with a summary and an outlook in
Section 4.
2. MULTIPLE CAMERA CALIBRATION
2.1 Problem Formulation
Given M cameras, the goal is to determine the cameras’
internal parameters and the 3D positions and poses of the
cameras automatically. Therefore we only make the assump-
tion that we know the number of visual sensors in the net-
work.
In this work we use an enhanced perspective model to de-
scribe our cameras. The mapping performed by a perspec-
tive camera between a 3D point X and its 2D image point
x, both represented by their homogeneous coordinates, is
usually represented by a 3 × 4 matrix, the camera projec-
tive matrix P: x ≃ PX. The matrix P can be written as
P = K[R|t] where K is a 3 × 3 upper triangular matrix
containing the camera intrinsic parameters:
K = \begin{pmatrix} f_x & s & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{pmatrix}    (1)
The parameters f_x and f_y denote the focal length, and p_x and p_y
denote the coordinates of the principal point, each in
terms of pixel dimensions. s denotes the skew. For most
commercial cameras, and hence below, the skew is consid-
ered to be zero. The 3 × 3 rotation matrix R and the 3 × 1
translation vector t describe the 3D position and pose of
the camera. As some desktop cameras exhibit significant
distortions, this model has to be enriched by some distor-
tion components.

Figure 2: General calibration problem

The distortion model introduced in [11]
accounts for tangential and radial distortions using two co-
efficients. It describes the distortions occurring in practice suffi-
ciently precisely. In the following discussion we assume that
the distortion parameters of each camera are known and the
effects of those have been removed from all images.
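As a small, self-contained illustration of this camera model (not code from the paper), the sketch below builds K as in Eq. (1) with zero skew and projects a 3D point through P = K[R|t]; it assumes, as stated above, that lens distortion has already been removed from the images.

```python
# Minimal sketch of the perspective model x ~ P X with P = K [R | t]
# (skew s = 0; lens distortion is assumed to have been removed already).
import numpy as np

def intrinsic_matrix(fx, fy, px, py, s=0.0):
    """Build K as in Eq. (1)."""
    return np.array([[fx, s,  px],
                     [0., fy, py],
                     [0., 0., 1.]])

def project(K, R, t, X):
    """Project a 3D point X (3-vector) to inhomogeneous pixel coordinates."""
    P = K @ np.hstack([R, t.reshape(3, 1)])   # 3 x 4 projection matrix
    x_h = P @ np.append(X, 1.0)               # homogeneous image point
    return x_h[:2] / x_h[2]                   # divide out the projective scale

# Example: a camera at the origin looking along the z-axis
K = intrinsic_matrix(fx=800, fy=800, px=320, py=240)
print(project(K, np.eye(3), np.zeros(3), np.array([0.1, 0.2, 2.0])))
```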
Different views of the same scene are related to each other.
These relations can be used for our multiple camera calibra-
tion task. Therefore we need to determine a set of corre-
sponding points across the different images. Points are said
to correspond if they represent the same scene point in dif-
ferent views. This general calibration problem is illustrated
in Fig. 2.
A set of 3D points X_i is viewed by a set of cameras with
matrices P_j. Let x_i^j denote the coordinates of the i-th
point as detected in the j-th camera image. A 3D point
may not be visible in all cameras, thus its corresponding
projected point will not be available in all images. The cal-
ibration problem is then to find the set of camera matrices
P_j and points X_i such that for all image points x_i^j ≃ P_j X_i
holds. However, unless additional constraints are given, it is
in principle only possible to determine the camera matrices
up to a projective ambiguity. Additional constraints arising
from knowledge about the cameras’ parameters and/or the
scene can be used to restrict this ambiguity up to an affine,
metric or Euclidean transformation.
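Written out, the joint estimation described here is the usual reprojection-error minimization; the exact objective is not stated in this excerpt, but the final bundle adjustment mentioned later minimizes a criterion of this form:

```latex
\min_{\{P_j\},\,\{X_i\}} \; \sum_{i}\sum_{j} v_{ij}\,
    d\!\left(x_i^j,\; P_j X_i\right)^2,
\qquad
v_{ij} =
\begin{cases}
  1 & \text{if point } i \text{ is visible in camera } j,\\
  0 & \text{otherwise,}
\end{cases}
```

where d(·,·) denotes the Euclidean distance between the measured image point and the reprojected point in inhomogeneous coordinates.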
Solution: We solve the camera calibration problem in
two stages. In a first stage we determine the cameras’ intrin-
sic parameters. Intrinsic calibration is done independently
for each camera by using a flat-panel display as the pla-
nar calibration object. In a second stage camera positions
and poses are computed in a common coordinate system
(extrinsic calibration). Their positions and poses can be de-
termined relative to each other up to a global coordinate transformation. In
a typical distributed camera environment each camera can
only see a small volume of the total viewing space and differ-
ent intersecting subsets of cameras share different intersect-
ing views. Hence multiple camera calibrations are performed
by calibrating subsets of cameras and then building a global
coordinate system from individual overlapping views.
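As a rough sketch of the first stage (intrinsic calibration against a planar target), the code below assumes a checkerboard pattern shown on the flat-panel display and photographed from several viewpoints, and uses OpenCV's stock implementation of the planar method of [31] rather than the paper's own pipeline; board and square sizes are placeholder values.

```python
# Sketch of per-camera intrinsic calibration against a planar pattern shown on
# the flat-panel display (OpenCV's implementation of the planar method [31];
# board_size and square_size are placeholder values).
import cv2
import numpy as np

def calibrate_intrinsics(images, board_size=(9, 6), square_size=0.03):
    # 3D corner coordinates in the display plane (z = 0), square_size apart
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2)
    objp *= square_size

    obj_points, img_points, img_size = [], [], None
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        img_size = gray.shape[::-1]                      # (width, height)
        found, corners = cv2.findChessboardCorners(gray, board_size)
        if found:
            corners = cv2.cornerSubPix(
                gray, corners, (11, 11), (-1, -1),
                (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
            obj_points.append(objp)
            img_points.append(corners)

    # K holds f_x, f_y, p_x, p_y as in Eq. (1); dist holds the radial and
    # tangential distortion coefficients later removed from the images
    rms, K, dist, _, _ = cv2.calibrateCamera(
        obj_points, img_points, img_size, None, None)
    return K, dist
```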
2.2 Point Correspondences
2D point correspondences between projections of the same
3D point onto different camera planes can be generally used
to recover the calibration matrices of the cameras.

Figure 3: Matched points are visualized by a connecting line between images

Therefore establishing such correspondences is the first step in
determining the cameras' parameters. To establish point cor-
respondences, each image is at first represented by a set of
features. Each feature describes a specific image point, and
its neighborhood. Subsequently these features are input to
a matching procedure, which identifies features in different
images that correspond to the same point in the observed
scene. There are various approaches for extracting a set of
interest points and features from an image. Our approach
uses the so-called SIFT-features proposed in [15]. SIFT-
based feature descriptors were identified in [18] to deliver
the most suitable features in the context of matching points
of a scene under different viewing conditions such as differ-
ent lighting and changes in 3D viewpoint.
SIFT-Features Extraction: The SIFT-feature extrac-
tion method combines a scale invariant region detector and a
descriptor based on the gradient distribution in the detected
regions. In order to compute a set of characteristic image fea-
tures, first a set of interest points - also called keypoints -
is found by detecting scale-space extrema. Only keypoints
that are stable under a certain amount of additive noise are
preserved. An image location, scale and orientation is as-
signed to each keypoint. This enables the construction of
a repeatable local 2D coordinate system, in which the local
image (pixel and its surrounding region) is described invari-
antly with respect to these parameters. Finally a descriptor for each
keypoint is calculated based upon image gradients in the lo-
cal image. However this approach has its limitations. To
ensure a sufficient number of reliable matching points, the
displacement between the cameras should not exceed 15°.
The resulting correspondences are within pixel accuracy.
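As an illustration of this extraction step (OpenCV's implementation of [15], standing in for the paper's own code):

```python
# Sketch of SIFT keypoint and descriptor extraction using OpenCV's
# implementation of [15] (requires OpenCV >= 4.4).
import cv2

def extract_sift_features(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    # Each keypoint carries a location (kp.pt), scale (kp.size) and
    # orientation (kp.angle); each descriptor is a 128-dimensional vector.
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors
```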
SIFT-Feature Matching: The matching technique used
for the SIFT-features has been proposed in [15]. Point cor-
respond ences between two images are established by com-
paring their respective keypoint descriptors. Matching is
performed by first individually measuring the Euclidean dis-
tance of each feature vector (representing a certain keypoint)
of one image to each feature vector of the other image. The
best matching candidate for a specific keypoint is identified
by the keypoint belonging to the feature vector with the min-
imum distance. A match is found in the second image if the
distance ratio between the nearest and the second nearest
neighbor (closest/second closest) is below a threshold. An
example of matched points between two images is shown in
Fig. 3.
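The sketch below mirrors this matching procedure with a brute-force nearest-neighbour search and the closest/second-closest distance-ratio test; the 0.8 threshold is the value suggested in [15] and is an assumption here, not a parameter reported in this paper.

```python
# Sketch of SIFT descriptor matching with the distance-ratio test described
# above; the 0.8 threshold follows [15] and is not taken from this paper.
import cv2

def match_ratio_test(desc1, desc2, ratio=0.8):
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # For each descriptor of image 1, find its two nearest neighbours in image 2
    pairs = matcher.knnMatch(desc1, desc2, k=2)
    # Keep a match only if the nearest neighbour is clearly better than the
    # second nearest (closest/second closest below the ratio threshold)
    return [m for m, n in (p for p in pairs if len(p) == 2)
            if m.distance < ratio * n.distance]
```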
Subpixel Accuracy: The result of SIFT-feature match-
ing is only at pixel accuracy. For position estimation of
multiple cameras experiments have shown that it is essen-

Citations

Computer Vision: A Modern Approach

David Forsyth, +1 more
TL;DR: Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance and describes numerous important application areas such as image based rendering and digital libraries.
Journal ArticleDOI

A convenient multicamera self-calibration for virtual environments

TL;DR: It is shown that it is possible to calibrate an immersive virtual environment with 16 cameras in less than 60 minutes reaching about 1/5 pixel reprojection error.
Proceedings ArticleDOI

On the optimal placement of multiple visual sensors

TL;DR: This paper focuses on the placement of visual sensors with respect to maximizing coverage or achieving coverage at a certain resolution and proposes different algorithms which give a global optimum solution and heuristics which solve the problem within reasonable time and memory consumption at the cost of not necessarily determining the global optimum.
Journal ArticleDOI

The Coverage Problem in Video-Based Wireless Sensor Networks: A Survey

TL;DR: The coverage problem is a crucial issue of wireless sensor networks, requiring specific solutions when video-based sensors are employed, and the state of the art of this particular issue is surveyed regarding strategies, algorithms and general computational solutions.
Proceedings ArticleDOI

Optimal sensor placement for surveillance of large spaces

TL;DR: The practical problem of optimally placing the multiple PTZ cameras to ensure maximum coverage of user defined priority areas with optimum values of parameters like pan, tilt, zoom and the locations of the cameras is addressed.
References
Journal ArticleDOI

Distinctive Image Features from Scale-Invariant Keypoints

TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Book

Multiple view geometry in computer vision

TL;DR: In this article, the authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly in a unified framework, including geometric principles and how to represent objects algebraically so they can be computed and applied.

Multiple View Geometry in Computer Vision.

TL;DR: This book is referred to read because it is an inspiring book to give you more chance to get experiences and also thoughts and it will show the best book collections and completed collections.
Proceedings ArticleDOI

A Combined Corner and Edge Detector

TL;DR: The problem the authors are addressing in Alvey Project MMI149 is that of using computer vision to understand the unconstrained 3D world, in which the viewed scenes will in general contain too wide a diversity of objects for topdown recognition techniques to work.
Journal ArticleDOI

A flexible new technique for camera calibration

TL;DR: A flexible technique to easily calibrate a camera that only requires the camera to observe a planar pattern shown at a few (at least two) different orientations is proposed and advances 3D computer vision one more step from laboratory environments to real world use.
Frequently Asked Questions (11)
Q1. What contributions have the authors mentioned in the paper "Calibrating and optimizing poses of visual sensors in distributed platforms" ?

The authors present a novel approach for position and pose calibration of visual sensors, i. e. cameras, in a distributed network of general purpose computing devices ( GPCs ). 

As the change in viewpoint between the different cameras is restricted, future work is needed to improve the automatic extraction of point correspondences between images. Future work on this topic will include the investigation of how to handle large numbers of grid points. 

Registration of triplets and sub-groups is achieved by computing a homography of 3-space between the different metric structures. 

The use of SIFT-feature matching in combination with a flat screen displaying a known pattern enables us to easily and automatically detect the subset of image points.

Considering N cameras that are calibrated, i.e. their fields-of-view as well as positions in the space are known, the authors formulate their camera positioning problem in terms of maximizing the coverage. 

As the optimization problem of the final bundle adjustment is of very high dimension, a poor initial guess commonly results in the non-linear optimization to fail completely, i.e. to converge to a suboptimal solution or to not converge at all. 

The dimension of the minimization problem then adds up to a total of 6(N-1) parameters for the camera matrices, plus a set of 3L parameters for the coordinates of the L reconstructed 3D points.

The basic optimization problem solved by the feature tracker is:

\min_{d,\,D} \sum_{x=-\omega_x}^{\omega_x} \sum_{y=-\omega_y}^{\omega_y} \left( I(x + u) - J\big((D + I_{2\times 2})\,x + d + u\big) \right)^2 \qquad (2)

where I(u), J(u) represent the grey-scale values of the two images at location u, the vector d = [d_x\; d_y]^T is the optical flow at location u, and the matrix D denotes an affine deformation matrix characterized by the four coefficients d_{xx}, d_{xy}, d_{yx}, d_{yy}:

D = \begin{pmatrix} d_{xx} & d_{xy} \\ d_{yx} & d_{yy} \end{pmatrix} \qquad (3)

The objective of affine tracking is then to choose d and D in a way that minimizes the dissimilarity between feature windows of size 2\omega_x + 1 in x and size 2\omega_y + 1 in y direction around the points u and v in I and J, respectively.

Given the fixed positions, the authors develop a linear programming model that determines the optimal poses (pan and tilt angles) with respect to coverage while maintaining the required resolution (i.e. minimal ’sampling frequency’). 

Additional constraints arising from knowledge about the cameras’ parameters and/or the scene can be used to restrict this ambiguity up to an affine, metric or Euclidean transformation. 
