Proceedings ArticleDOI

COST: An Approach for Camera Selection and Multi-Object Inference Ordering in Dynamic Scenes

TL;DR: An optimization problem is presented that selects a set of cameras and inference dependencies for each person, attempting to minimize the computational cost under given performance constraints; results show the efficiency of COST in improving the performance of such systems and reducing the computational resources required.
Abstract: Development of multiple-camera vision systems for the analysis of dynamic objects such as humans is challenging due to occlusions and to similarity in the appearance of a person with the background and other people, which we term visual "confusion". Since occlusion and confusion depend on the presence of other people in the scene, they lead to a dependency structure in which there are often loops in the resulting Bayesian network. While approaches such as loopy belief propagation can be used for inference, they are computationally expensive and convergence is not guaranteed in many situations. We present a unified approach, COST, that reasons about such dependencies and yields an order for the inference of each person in a group of people, along with a set of cameras to be used for the inferences for each person. Using the probabilistic distribution of the positions and appearances of people, COST performs visibility and confusion analysis for each part of each person and computes the amount of information that can be extracted with and without more accurate estimation of the positions of other people. We present an optimization problem that selects a set of cameras and inference dependencies for each person and attempts to minimize the computational cost under given performance constraints. Results show the efficiency of COST in improving the performance of such systems and reducing the computational resources required.

Summary (3 min read)

1. Introduction

  • The analysis is difficult due to occlusions and appearance similarities of people with one another or the background against which they are viewed.
  • In multiple camera systems, information fusion needs to be sensitive to occlusions and confusions.
  • Additionally, the authors seek to identify the parts of the image where such occlusion and confusion occurs and use this information in the inference process.
  • A Bayesian network for such multi-object inference will generally have loops.
  • The authors present COST, a framework to reason about such dependencies that produces an inference order for multi-person, multi-perspective pose/position estimation.

2.1. Computing Visibility

  • One person's visibility depends on the poses of other people in the scene, which are generally known only probabilistically.
  • This leads the authors to compute visibility probabilistically.
  • To develop a generic formulation, let us consider an n-part model for a person where n is one for simple position estimation or ten for full body pose estimation.
  • By considering occlusion of a part (i, j) from itself, the authors implicitly select surface voxels instead of interior voxels.
  • There are a fixed and known number of locations, which the authors refer to as “portals”, from which a new person enters or an existing person leaves the scene.

2.2. Computing Confusion

  • Even if a person is visible in a view, the view might still not be helpful in estimating the pose because of "camouflage": his appearance being too similar to either the background or some other person(s) occluded by him.
  • Due to such “confusion” with the “background”, segmenting the person accurately would be problematic, and most pose inferences would degrade as the segmentation quality decreases.
  • Again, consider the differential element dV that a part (i, j) may contain.

3.1. Model for Information Content

  • In order to perform inference reliably for some part of a given person using some view, that part should, ideally, not be occluded in that view and should not be “confused” with the background or other parts.
  • The accuracy of the inference will depend upon both the degrees of occlusion and confusion, as discussed in the previous section.
  • It will also depend on the uncertainty of such occlusion and confusion.
  • The authors present a simple model for measuring the information available in a view regarding a part for the task of pose estimation.
  • The information available about a specific part in a given view is then taken as the expected number of visible and discriminable voxels in that view.

3.2. Information from Dependencies

  • Inference decisions can be improved if estimates of the pose/appearance characteristics of the occluders and confusers are used.
  • The segmentation in the occluded region is then based on position priors, which would yield a better estimate of the median line as shown in Figure 4(c).
  • Thus, accurate inference of a part’s position depends upon the inference of occluders and confusers.
  • Additionally, using information from dependencies might involve expensive computation.

4. The Optimization Problem

  • The authors would like to minimize the computational cost while guaranteeing that the expected error in the estimate of the pose of person i is below η_i (termed a "performance constraint").
  • The optimization problem stated above is NP-Hard and belongs to the class of subset selection problems [18].
  • While approaches such as simulated-annealing can be used for optimization, much faster heuristic approaches can be employed.

4.1. A Heuristic Based Optimization Approach

  • The authors present a heuristic-based, greedy algorithm for the optimization problem.
  • The authors build the dependency graph G by adding nodes one by one to G. Each node represents a person and the set of cameras selected for estimating the pose of that person.
  • In the paper's illustrative example, the dependency from B is not included at the final iteration, since the performance constraint of A is satisfied without it.
  • To compute the minimum cost of estimation for each remaining person at each iteration, one could exhaustively search the space of possible camera and dependency selections.
  • Such an approach requires time exponential in the number of cameras, and so becomes infeasible when the number of cameras is large.

5.1.1 Framework

  • The algorithm cycles between using segmentation to estimate people’s ground plane positions and using ground plane position estimates to obtain segmentations; the process is iterated until stable.
  • The number of occluded voxels that can be added due to dependencies depends on the selection of dependencies and the accuracy of the position estimate of the occluder.
  • For a given camera pair, the error in estimation of position would increase as the segmentation quality decreases in either of the cameras in the pair.
  • Additionally, M2Tracker fuses many camera pairs to obtain people’s ground plane position estimates by using a weighted average of the estimates from each camera pair.
  • The authors assume, for simplicity, that the computational cost of segmentation and wide-baseline stereo is some constant and independent of view and imaging conditions.

5.1.2 Results

  • The authors evaluated the performance of their implementation of M2Tracker with and without using COST on the publicly available dataset of M2Tracker.
  • It can be seen that M2Tracker has higher variance in position estimates using the eight-camera system than COST does when choosing only the "best" camera pair per person.
  • This is because in many views a person is either occluded or confused with the background and this leads to inaccurate segmentations and subsequent errors in stereo reconstruction.
  • The positional ground truth values were obtained manually.
  • Experimental results indicate that it is generally sufficient to analyse only a small number of judiciously chosen cameras to obtain accuracy and performance similar to a system uniformly employing a large number of cameras.

5.2. Using COST for Multiple People Pose Estimation

  • The authors also applied the COST algorithm for full body pose estimation of multiple people.
  • Prior pose-estimation papers have considered the problem of self-occlusion, but not occlusion of one person by another.
  • The authors used similar dependency and cost functions as for M2Tracker.
  • The error function was modified for the full-body pose problem.

6. Conclusion

  • The authors have presented a principled approach, COST, for camera and dependency selection that improves the performance and reduces the computational resource requirements of multi-camera systems.
  • COST produces a directed acyclic dependency graph which can then be used to obtain an inference order using topological sort (a minimal sketch of this step follows this list).
  • The selection criterion in COST is based on visibility and "confusion" analysis in each view and the resulting dependencies.
  • Experimental results indicate that COST outperforms a system which uses a large number of cameras for the estimation of each person.
  • Additionally, a COST-based system is faster than other possible approaches based on EM and belief propagation which use all the cameras and dependencies for analysis.
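As a minimal sketch of that last step, the snippet below derives an inference order from a dependency graph with Kahn's topological sort. The dictionary encoding of the graph is an illustrative assumption, not the authors' code; the example edges reproduce the outcome of the worked example in Section 4.1 (B is estimated first, then C using B, then A using C).

```python
from collections import deque

def inference_order(deps):
    """Kahn's algorithm over a dependency DAG.
    deps maps each person to the set of people whose pose estimates
    that person's inference uses (edges dependency -> person)."""
    people = set(deps) | {d for ds in deps.values() for d in ds}
    indegree = {p: len(deps.get(p, set())) for p in people}
    ready = deque(p for p in people if indegree[p] == 0)
    order = []
    while ready:
        p = ready.popleft()
        order.append(p)
        # p's estimate is now available; unblock anyone waiting on it
        for q, ds in deps.items():
            if p in ds:
                indegree[q] -= 1
                if indegree[q] == 0:
                    ready.append(q)
    if len(order) != len(people):
        raise ValueError("dependency graph has a loop")
    return order

# Section 4.1 example: B needs no one, C uses B, A uses C
print(inference_order({"A": {"C"}, "C": {"B"}, "B": set()}))  # ['B', 'C', 'A']
```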




COST: An Approach for Camera Selection and Multi-Object Inference Ordering in Dynamic Scenes

Abhinav Gupta, Dept. of Computer Science, University of Maryland, College Park, MD, USA (agupta@cs.umd.edu)
Anurag Mittal, Dept. of Computer Science and Engineering, IIT Madras, Chennai, India (amittal@cse.iitm.ernet.in)
Larry S. Davis, Dept. of Computer Science, University of Maryland, College Park, MD, USA (lsd@cs.umd.edu)
Abstract

Development of multiple-camera vision systems for the analysis of dynamic objects such as humans is challenging due to occlusions and to similarity in the appearance of a person with the background and other people, which we term visual "confusion". Since occlusion and confusion depend on the presence of other people in the scene, they lead to a dependency structure in which there are often loops in the resulting Bayesian network. While approaches such as loopy belief propagation can be used for inference, they are computationally expensive and convergence is not guaranteed in many situations.

We present a unified approach, COST, that reasons about such dependencies and yields an order for the inference of each person in a group of people, along with a set of cameras to be used for the inferences for each person. Using the probabilistic distribution of the positions and appearances of people, COST performs visibility and confusion analysis for each part of each person and computes the amount of information that can be extracted with and without more accurate estimation of the positions of other people. We present an optimization problem that selects a set of cameras and inference dependencies for each person and attempts to minimize the computational cost under given performance constraints. Results show the efficiency of COST in improving the performance of such systems and reducing the computational resources required.
1. Introduction

We consider the problem of multi-perspective analysis of moving people in crowded situations. Typical goals of such an analysis are to recover the position, orientation, or pose of each or some subset of the people in the scene. The analysis is difficult due to occlusions and appearance similarities of people with one another or the background against which they are viewed. We refer to errors arising from appearance similarities as "confusions". In multiple camera systems, information fusion needs to be sensitive to occlusions and confusions.

Our goal is to develop principled methods to "select" the camera(s) in which there is less occlusion and confusion for a particular person, in order to infer that person's position or pose (see Figure 1). Additionally, we seek to identify the parts of the image where such occlusion and confusion occur and to use this information in the inference process. However, determining those regions of occlusion and confusion depends on the positions and poses of other people in the scene. This leads to a dependency structure for inference of the position/pose of the people present in the scene, as illustrated graphically in Figure 2(b). A Bayesian network for such multi-object inference will generally have loops. Those loops can be eliminated by appropriate selection of cameras and by dropping inference dependencies which are not expected to yield significant information, as shown in the example in Figure 2(c).

We present COST (Confusion and Occlusion analysis for Selections based on Tasks), a framework to reason about such dependencies that produces an inference order for multi-person, multi-perspective pose/position estimation. We additionally identify a set of cameras and the parts of the acquired images to be analyzed for each person. We show that COST not only yields a reduction in computational time compared to approaches such as Expectation Maximization (EM) or Loopy Belief Propagation (LBP) [16], but also shows quantitative improvement in pose/position estimation due to camera selection.
1.1. Related Work

There are many multi-perspective vision algorithms that analyze crowded scenes for either person position estimation or pose estimation. Most of the position estimation algorithms constrain the motion to a ground plane and perform inference by first segmenting the people in each view and then using data fusion techniques to obtain an estimate of the 3D locations of each person [15, 12, 13, 6]. While occlusion has been considered to some extent (for weighted fusion) in some papers [15, 13], confusion due to appearance similarities has not been previously considered. Additionally, most earlier work either ignores the inference dependencies or uses all of them, which makes the computation costly.
Previous work on pose estimation has only considered self-occlusion of one body part by another of the same person [9, 8, 19, 20]; occlusion of one person by another, leading to inference dependencies between people and their parts, has not been addressed. A naive approach (considering all pairwise interactions of all parts of all people) would involve constructing a large Bayesian network with loops; however, this results in an intractable optimization problem. We show how many of the loops in the Bayesian network can be eliminated by selecting the best cameras and the most important inference dependencies.

Figure 1. Segmentation results and median-line determination of a person in three different views (panels 1a/1b, 2a/2b, 3a/3b). In view 1 there is no occlusion or confusion, while in views 2 and 3 there is occlusion and confusion, respectively. If the median lines are used for person position estimation, as in [13, 10], without occlusion and confusion reasoning, we might mistakenly use the median lines shown in (2b) and (3b).
[Figure 2 graphic: (a) a ground-plane layout with people A, B, C, a static object, and cameras 1, 2, 3; (b) the dependency graph when all cameras are used; (c) the dependency graph produced by COST, with the cameras selected for each person shown in its node.]
Figure 2. (a) A multiple-person scenario with 3 people and 3 cameras. (b) The dependency graph obtained if all cameras are used for estimation of all people. An edge A → B represents the information flow from A to B in the inference process; hence estimation of B depends on estimation of A. In this scenario, estimation of B depends on A due to occlusion in camera 1, and estimation of A and C depends on B and A, respectively, due to occlusions in cameras 3 and 2. (c) The dependency graph obtained if cameras are selected using COST. The cameras selected for estimation of each person are shown in the respective node. Since camera 1 is not used for estimation of B, the estimation of B becomes independent of A. Additionally, if the degree of occlusion of C due to A is small (that is, one cannot generate significant information for the estimation of C using the estimate of the location or pose of A), then one can also eliminate the dependency edge A → C without strongly affecting the accuracy of the result. Such elimination can be critical for loop removal when there are not enough cameras in which a person is isolated and discriminable.
A related problem of sensor selection and information fusion has been studied in the fields of sensor networks and distributed computing. The problem is to selectively choose sensors so that the information gain compensates for the costs associated with information gathering. An optimal solution using such an information-theoretic approach requires evaluating all possible combinations, making the problem NP-hard. Denzler et al. [4] proposed an information-theoretic approach in which the view that leads to the maximum reduction in entropy is chosen. Since the computation of mutual information requires exponential time, approximate [23] and heuristic-based [22] algorithms have also been proposed. Other approaches in this field include the use of look-up tables [17] or utility functions [2] in the selection of camera views.

These information-theoretic approaches only consider geometric analysis based on the fields of view of the cameras when computing mutual information. However, even though two cameras might have overlapping fields of view, they can still provide different information due to occlusion and confusion. While [5] presents an approach for camera selection in the presence of occlusions, COST involves visibility and discriminability analysis in conjunction with reasoning about dependencies for camera selection.
Bayesian belief networks are an important mechanism for representation and reasoning under uncertainty. For a given belief net, even finding an approximate solution is NP-hard [3]. Our approach is related to model simplification methods (see [7]), which simplify the model until exact methods become feasible. These approaches reduce the complexity by annihilating small probabilities [11] or removing weak dependencies [14] and arcs [21].

Our approach is complementary to these approaches. COST's loop removal procedure is primarily based on camera selection, which removes redundant and unreliable information in multi-perspective vision systems. Additionally, while previous approaches assume that the weights of the dependencies are given, our approach considers occlusion and confusion in different cameras and removes loops based on this information.
The paper is organized as follows. We first describe how visibility and confusion factors for an object are computed in Section 2. We then explain our optimization framework and a heuristic approach for fast approximate inference in Section 4. We finally present experimental results in Section 5.
2. Computing Occlusion and Confusion
2.1. Computing Visibility
To estimate a property of a given person or object from a given camera, that person or object must be (partially) visible from that camera. But one person's visibility depends on the poses of other people in the scene, which are generally known only probabilistically. This leads us to compute visibility probabilistically. Specifically, we compute the probability of visibility of each part of a person in each camera based on probabilistic estimates of the poses of all other people in the scene. To develop a generic formulation, let us consider an n-part model for a person, where n is one for simple position estimation or ten for full-body pose estimation.
Let $dV$ be a differential volume element (voxel) which might be included in part $j$ of person $i$. The occluder region $\Omega_k(dV)$ of a differential element $dV$ in camera $k$ is defined as the 3D region in which another person $l$ must be present so that $dV$ would not be visible in camera $k$ (see Figure 3). We also define the following events:

$E_{i,j}(dV)$ = event that part $j$ of person $i$ includes $dV$ (a part $j$ can include many such voxels);

$EO^k_{l,m}(dV)$ = event that part $m$ of person $l$ intersects $\Omega_k(dV)$;

$\overline{EO}^k(dV)$ = event that no person intersects $\Omega_k(dV)$.

The expected visibility of a part, that is, the expected number of visible voxels contained in that part, is then given by

$$E_v(i,j,k) = \int_V P(\overline{EO}^k(dV))\, P(E_{i,j}(dV))\, dV \qquad (1)$$

The probability that part $m$ of person $l$ does not occlude $dV$ is the probability that part $m$ does not contain any of the voxels belonging to the set $\Omega_k(dV)$. Therefore, that probability is given by

$$P(\overline{EO}^k_{l,m}(dV)) = \prod_{dV_1 \in \Omega_k(dV)} \big(1 - P(E_{l,m}(dV_1))\big) \qquad (2)$$

The probability that no part of any person is in the occluder region is then given by

$$P(\overline{EO}^k(dV)) = \prod_{(l,m)} P(\overline{EO}^k_{l,m}(dV)) \qquad (3)$$

(By considering occlusion of a part $(i,j)$ from itself, we implicitly select surface voxels instead of interior voxels; interior voxels would be occluded by the surface voxels and would not be considered.)

Furthermore, in a tracking scenario, new people can enter the scene. In this case, we also need to consider the occlusions they are likely to introduce and how the expected visibility changes to account for new people. We assume there are a fixed and known number of locations, which we refer to as "portals", from which a new person enters or an existing person leaves the scene. Let $E_{new}(dV)$ be the event that a new person is present in voxel $dV$. The likelihood of this event, $P(E_{new}(dV))$, is the product of the likelihood that a portal is nearby (represented as a prior probability $P_p(E_{new}(dV))$) and the image likelihood that a new person is seen in the region, $P_L(E_{new}(dV))$. Therefore, $P(\overline{EO}^k(dV))$ is given by:

$$P(\overline{EO}^k(dV)) = \prod_{dV_1 \in \Omega_k(dV)} \Big(\prod_{(l,m)} \big(1 - P(E_{l,m}(dV_1))\big)\Big)\big(1 - P(E_{new}(dV_1))\big) \qquad (4)$$

[Figure 3 graphic: a camera $k$ viewing a differential element $dA$, with the occluder region $\Omega_k(dA)$ and confuser region $C_k(dA)$ shown on the ground plane.]

Figure 3. Schematic diagram showing $\Omega_k(dV)$ and $C_k(dV)$ projected on the ground plane. Because of discretization, $\Omega_k(dV)$ and $C_k(dV)$ represent the sets of voxels where another object must be present for occlusion or confusion to occur.
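To make Eqs. (1) to (3) concrete, here is a toy discretization in Python. The per-part occupancy grids and the occluder_region helper are hypothetical stand-ins for a real scene model, the new-person term of Eq. (4) is omitted for brevity, and this is a sketch rather than the authors' implementation.

```python
import numpy as np

def p_unoccluded(v, k, occ_prob, occluder_region):
    """Eqs. (2)-(3): probability that no part of any person intersects
    the occluder region of voxel v in camera k. occ_prob[(l, m)] is a
    3-D array of occupancy probabilities P(E_{l,m}(dV))."""
    p = 1.0
    for (l, m), grid in occ_prob.items():
        # Eq. (2): part (l, m) misses every voxel of the occluder region.
        # Per the paper's footnote, the part under analysis is included
        # too, which implicitly selects surface voxels.
        p_miss = np.prod([1.0 - grid[u] for u in occluder_region(v, k)])
        p *= p_miss  # Eq. (3): independence across parts
    return p

def expected_visibility(i, j, k, occ_prob, occluder_region):
    """Eq. (1): expected number of visible voxels of part (i, j) in view k."""
    grid = occ_prob[(i, j)]
    total = 0.0
    for v in zip(*np.nonzero(grid)):  # only voxels the part may occupy
        total += p_unoccluded(v, k, occ_prob, occluder_region) * grid[v]
    return total
```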
2.2. Computing Confusion
Although a person (or some part) might be visible in view $k$, the view might still not be helpful in estimating the pose because of "camouflage": his appearance being too similar to either the background or some other person(s) occluded by him. Due to such "confusion" with the "background", segmenting the person accurately would be problematic, and most pose inferences would degrade as the segmentation quality decreases.

Again, consider the differential element $dV$ that a part $(i,j)$ may contain. To compute the discriminability of $dV$, we determine the parts which can cause confusion. The confuser space $C_k(dV)$ of an element $dV$ is defined as the region where the presence of a part $(l,m)$ would cause confusion in the classification of a pixel that can be formed due to the projection of part $(i,j)$ from $dV$ (see Figure 3). The amount of confusion is proportional to the similarity in appearance of the two parts. We define the discriminability of a part $(i,j)$ in a view $k$, $D_k(i,j)$, as:

$$D_k(i,j) = \sum_{(l,m)} c_{l,m}\, d(a^k_{i,j}, a^k_{l,m}) + c_0\, d(a^k_{i,j}, B_k) \qquad (5)$$

where $a^k_{i,j}$ defines the appearance of a part $(i,j)$, $B_k$ defines the appearance of the background, $d$ is a distance metric between the appearances, and $c$ is the corresponding weight. For example, if appearance is represented as a histogram, then $d$ could be the dot product of the two histograms or the earth mover's distance. The weight $c_{l,m}$ is proportional to the probability of the part $(l,m)$ lying in the confuser space and being visible:

$$c_{l,m} = \frac{1}{Z} \int_{C_k(dV)} P(\overline{EO}^k(dV_1))\, P(E_{l,m}(dV_1))\, dV_1 \qquad (6)$$

where $Z$ is a normalizing factor. Hence, the expected number of discriminable voxels in view $k$ contained in part $(i,j)$ is given by:

$$I_k(i,j) = \int_V P(\overline{EO}^k(dV))\, D_k(i,j)\, P(E_{i,j}(dV))\, dV \qquad (7)$$
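As a toy illustration of Eq. (5), the sketch below scores discriminability with appearances as normalized color histograms. The histogram representation, the intersection-based distance, and all numbers are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def hist_distance(h1, h2):
    """A simple histogram distance: 1 - intersection (0 means identical)."""
    return 1.0 - np.minimum(h1, h2).sum()

def discriminability(a_ij, confusers, background, c0=1.0):
    """Eq. (5). confusers: list of (c_lm, a_lm) pairs, where each weight
    c_lm is the normalized probability of that part lying in the confuser
    space and being visible (Eq. (6))."""
    score = sum(c_lm * hist_distance(a_ij, a_lm) for c_lm, a_lm in confusers)
    return score + c0 * hist_distance(a_ij, background)

# A red-shirted part is barely discriminable against a red background,
# but scores high against a blue one.
red = np.array([0.9, 0.05, 0.05])   # toy 3-bin color histogram
blue = np.array([0.05, 0.05, 0.9])
print(discriminability(red, confusers=[(0.5, red)], background=blue))  # ~0.85
print(discriminability(red, confusers=[(0.5, red)], background=red))   # 0.0
```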
3. Information in Views and Dependencies
3.1. Model for Information Content
In order to perform inference reliably for some part of a given person using some view, that part should, ideally, not be occluded in that view and should not be "confused" with the background or other parts. The accuracy of the inference will depend upon both the degrees of occlusion and confusion, as discussed in the previous section. It will also depend on the uncertainty of such occlusion and confusion. We present a simple model for measuring the information available in a view regarding a part for the task of pose estimation. We say that a specific voxel belonging to a person is informative in some view if and only if it is both visible and discriminable. The information available about a specific part in a given view is then taken as the expected number of visible and discriminable voxels in that view.
3.2. Information from Dependencies
Inference decisions can be improved if estimates of the pose/appearance characteristics of the occluders and confusers are used. Such information can be employed in a variety of ways; an example for the position estimation problem is shown in Figure 4. Here the inference of a person's position involves constructing a median line through the silhouette of the person and computing that line's intersection with the ground plane using calibration information. Figure 4(b) shows the segmentation of the person constructed from the visible and discriminable voxels. However, the estimate of the median line is inaccurate when only these voxels are used (see the magenta voxels on the ground plane and median line 1 based on these voxels). If we additionally use the position of the occluder, we can identify occluded regions (see the light blue region in Figure 4(d)). The segmentation in the occluded region is then based on position priors, which yields a better estimate of the median line, as shown in Figure 4(c).

Figure 4. Importance of using occlusion information before fusion: (a) the original image; (b) occlusion-unaware segmentation and object inference; (c) occlusion-aware segmentation and inference; (d) the ground-plane situation of the scenario. The black boundary shows the actual voxels contained in the person. In case (b), only the magenta voxels are used for median-line estimation (line 1). In case (c), a combination of magenta and blue voxels is used for estimation of the median line (line 2). However, the true median line is represented by line 3.

The inference of a part's position depends on the information about the occluders and confusers; the more accurate our information about the occluders and confusers, the more accurate our estimate will be. Thus, accurate inference of a part's position depends upon the inference of occluders and confusers. Such dependencies can be represented in a dependency graph (see Figure 2). Using the pose of other people in the inference process can, however, lead to loops in the Bayesian network. Additionally, using information from dependencies might involve expensive computation. Our goal is to avoid introducing edges into the dependency graph which either do not carry sufficient information or introduce loops in the Bayesian network. We do this as follows: for each possible occluder or confuser $l$, we associate a binary decision variable $\nu^k_{i,l}$ which represents whether the knowledge about the pose of person $l$ is to be used in the inference of the pose of person $i$ from view $k$. (In our model, dependencies are between people and not parts; we use the estimate of person $l$ to estimate the locations of all parts of person $i$.) If there is no edge from node $l$ (the node representing person $l$) to node $i$ in the dependency graph, then $\forall k,\ \nu^k_{i,l} = 0$. Given some selection of edges to include in the dependency graph, the total amount of information, $I^k_{i,j}$, that an algorithm can extract in view $k$ about a part $(i,j)$ using the estimates of its dependencies can be determined. This, however, also depends on the accuracy of the estimates of the dependencies.
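As a toy illustration (not the paper's model) of how the decision variables $\nu$ gate information flow, the sketch below treats the voxels recoverable from each dependency as a simple additive bonus. The additive form and all numbers are assumptions made for illustration; the paper only states that the extractable information $I^k_{i,j}$ can be determined given the selected edges.

```python
def info_with_deps(base_info, recoverable, nu, i, k):
    """base_info: expected informative voxels for person i in view k with
    no dependencies; recoverable: occluder/confuser l -> extra voxels that
    l's estimate would disambiguate in view k; nu: set of (i, l, k) triples
    whose decision variable nu^k_{i,l} is 1."""
    extra = sum(v for l, v in recoverable.items() if (i, l, k) in nu)
    return base_info + extra

# Person A in camera 2: 120 voxels informative on their own; using B's
# estimate recovers 40 more, while the potential gain from C is forgone.
print(info_with_deps(120, {"B": 40, "C": 15}, nu={("A", "B", 2)}, i="A", k=2))  # 160
```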
4. The Optimization Problem
Given the amount of information available (with and without dependencies) regarding each person in each camera, we estimate the binary decision variables $\mu^k_i$ and $\nu^k_{i,l}$, which represent whether or not camera $k$ will be used in the inference of person $i$ ($\mu^k_i$) and, if so, whether to use the estimate of the pose of person $l$ when estimating the pose of person $i$ (that is, whether or not to include the edge from node $l$ to node $i$ in the Bayesian network). For instance, in Figure 2 the decision variables $(\mu^1_C, \mu^2_C, \nu^2_{C,A})$ will be set to true for person $C$. We would like to minimize the computational cost while guaranteeing that the expected error in the estimate of the pose of person $i$ is below $\eta_i$ (termed a "performance constraint"). Thus, the optimization problem can be formulated as

$$\min_{\{\mu_i,\,\nu_i\}} \sum_i J_i(\mu_i, \nu_i) \quad \text{such that } e_i(\mu_i, \nu_i) \le \eta_i \ \ \forall i \qquad (8)$$

where $e_i$ represents the expected error in the estimate of the pose of person $i$ and $J_i$ represents the cost of computing the estimate of the pose of person $i$. This model also supports attention-based surveillance when $\exists i$ such that $\forall j \ne i$, $\eta_i \ll \eta_j$; in such a case, most of the computational resources would be devoted to estimating the pose of a distinguished person.

The optimization problem stated above is NP-hard and belongs to the class of subset selection problems [18]. While approaches such as simulated annealing can be used for optimization, much faster heuristic approaches can be employed.
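On a tiny instance, the subset-selection problem of Eq. (8) can be solved by brute-force enumeration, which also makes the exponential growth tangible. The option tables below are made up for illustration, acyclicity of the chosen dependencies is not checked, and the paper itself uses the greedy heuristic of Section 4.1 instead.

```python
from itertools import product

def solve(people, options, eta):
    """options[i]: list of (cameras, deps, cost, err) tuples for person i;
    eta[i] is the performance constraint. Returns the cheapest feasible
    assignment and its total cost (Eq. (8))."""
    best, best_cost = None, float("inf")
    for choice in product(*(options[i] for i in people)):
        if all(err <= eta[i] for i, (_, _, _, err) in zip(people, choice)):
            cost = sum(c for (_, _, c, _) in choice)
            if cost < best_cost:
                best, best_cost = dict(zip(people, choice)), cost
    return best, best_cost

people = ["A", "B"]
options = {
    "A": [({"1", "2"}, set(), 2, 0.9),          # cheap but too inaccurate
          ({"1", "2", "3"}, {"B"}, 5, 0.2)],    # pricier, meets the bound
    "B": [({"1", "2"}, set(), 2, 0.3)],
}
eta = {"A": 0.5, "B": 0.5}
print(solve(people, options, eta))  # A takes cameras 1,2,3 + dep on B; cost 7
```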
4.1. A Heuristic Based Optimization Approach
We present a heuristic-based, greedy algorithm for the optimization problem. We build the dependency graph $G$ by adding nodes to $G$ one by one. Each node represents a person and the set of cameras selected for estimating the pose of that person. The edges incident on a node represent the dependencies to be used in estimation (an edge $l \to i$ indicates that the pose of person $l$ is used to estimate the pose of person $i$).

At each iteration, we compute the minimum cost of estimation for each person $i$ by selecting the best possible settings of the decision variables ($\mu_i$ and $\nu_i$); if the performance constraint for a person cannot be satisfied, we take the cost of estimation to be $\infty$. However, to avoid loops in $G$, we require that dependencies be selected from the set of nodes already present in $G$, so that they cannot introduce loops in the Bayesian network. The person with the lowest cost of estimation is then added to $G$. In the next iteration, the cost of estimation is recomputed, since the newly introduced node can now be used as a dependency for the remaining people.

The algorithm is illustrated in Figure 5. At iteration 1, the minimum costs of computation are B = 2 (using cameras 1, 2 and no dependency), A = $\infty$ (A needs to use a dependency on either B or C for its performance constraint to be satisfied; since the dependency graph at t = 0 is empty, A cannot use any dependency), and C = $\infty$ (C also needs to use the estimate of B for its performance constraint to be satisfied); hence B is added first. At iteration 2, the computation costs become A = 8 (using cameras 1, 2, 3 and the dependency from B) and C = 3 (using cameras 2, 3 and the dependency from B). Hence C is added at iteration 2. At iteration 3, the new minimum computation cost for A is 4 (using cameras 2, 3 and the dependency from C; the dependency from B is not included since the performance constraint of A is satisfied without it).
[Figure 5 graphic: a ground-plane layout of people A, B, C and cameras 1, 2, 3, together with the dependency graph after each iteration: iteration 1, B(1,2); iteration 2, B(1,2) and C(2,3); iteration 3, B(1,2), C(2,3), and A(2,3).]

Figure 5. A sample scenario to illustrate the heuristic algorithm.
To compute the minimum cost of estimation for each remaining person at each iteration, one could exhaustively search the space of possible camera and dependency selections. However, such an approach requires time exponential in the number of cameras, and so becomes infeasible when the number of cameras is large. Instead, we use a greedy approach: we start by selecting a minimal set of cameras (two, for example, if pose is to be estimated by stereo; one if position is estimated by intersecting a median line with the ground plane) and add more cameras and dependencies one at a time, based on the increase in the cost of computation and the reduction in expected errors, until the performance constraints are satisfied.
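The following sketch implements the greedy construction just described, under the simplifying assumption that each person's candidate (dependencies, cost) options arrive as a precomputed table (with math.inf when a performance constraint cannot be met); the numbers reproduce the Figure 5 walkthrough, but the code itself is a sketch rather than the authors' implementation.

```python
import math

def greedy_build(people, options):
    """options[i]: list of (deps, cost) pairs; deps is the set of people
    whose estimates that option uses. A dependency may only be used once
    its person is already in the graph, which keeps the graph acyclic."""
    graph, order = {}, []
    remaining = set(people)
    while remaining:
        best = None  # (cost, person, deps)
        for p in remaining:
            for deps, cost in options[p]:
                if deps <= graph.keys() and (best is None or cost < best[0]):
                    best = (cost, p, deps)
        cost, p, deps = best
        if math.isinf(cost):
            raise RuntimeError("some performance constraint is unsatisfiable")
        graph[p] = deps       # add the cheapest person to G
        order.append(p)
        remaining.remove(p)
    return graph, order

# Figure 5: B is the only person feasible without help; then C via B; then A via C.
options = {
    "B": [(set(), 2)],
    "C": [(set(), math.inf), ({"B"}, 3)],
    "A": [(set(), math.inf), ({"B"}, 8), ({"C"}, 4)],
}
print(greedy_build(["A", "B", "C"], options))
# -> ({'B': set(), 'C': {'B'}, 'A': {'C'}}, ['B', 'C', 'A'])
```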
5. Experiments
We next demonstrate how COST can be applied to multiple-camera tracking algorithms.
5.1. Tracking People on a Ground Plane
5.1.1 Framework
We applied COST to a variant of M2Tracker. M2Tracker is a system that segments, detects, and tracks multiple people on a ground plane in a cluttered scene [15]. The algorithm cycles between using segmentation to estimate people's ground-plane positions and using ground-plane position estimates to obtain segmentations; the process is iterated until stable. In M2Tracker, all people are segmented in all cameras; then the segmentations are combined using a wide-baseline stereo reconstruction algorithm for position estimation. In COST, selected people are segmented in selected views, namely those for which $\mu^k_i = 1$. To use the estimate of the position of an occluder, we first segment the occluder and then classify the pixels in the occluded region based on the prior probabilities alone.
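A schematic of this alternation might look as follows. Here segment and estimate_positions are placeholders for M2Tracker's per-view segmentation and wide-baseline stereo fusion, and the drift-based convergence test is an assumption; the sketch only shows the control flow, not the authors' implementation.

```python
import math

def track_frame(views, init_positions, segment, estimate_positions,
                max_iters=10, tol=0.01):
    """Alternate segmentation and ground-plane estimation until stable.
    views: camera id -> image (only the views selected by COST, mu = 1);
    positions: list of (x, y) ground-plane estimates, one per person."""
    positions, segs = init_positions, {}
    for _ in range(max_iters):
        # segment each selected view, conditioned on current positions
        segs = {k: segment(view, positions) for k, view in views.items()}
        # fuse segmentations into new ground-plane position estimates
        new_positions = estimate_positions(segs)
        drift = max(math.hypot(ax - bx, ay - by)
                    for (ax, ay), (bx, by) in zip(new_positions, positions))
        positions = new_positions
        if drift < tol:   # stable: estimates stopped moving
            break
    return positions, segs
```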

Citations
Journal ArticleDOI
TL;DR: These experiments prove that incorporation of the long-term models enable us to hold tracks of objects over extended periods of time, including situations where there are large "blind" areas.

Abstract: We address the problem of tracking multiple people in a network of nonoverlapping cameras. This introduces certain challenges that are unique to this particular application scenario, in addition to existing challenges in tracking like pose and illumination variations, occlusion, clutter and sensor noise. For this purpose, we propose a novel multi-objective optimization framework by combining short term feature correspondences across the cameras with long-term feature dependency models. The overall solution strategy involves adapting the similarities between features observed at different cameras based on the long-term models and finding the stochastically optimal path for each person. For modeling the long-term interdependence of the features over space and time, we propose a novel method based on discriminant analysis models. The entire process allows us to adaptively evolve the feature correspondences by observing the system performance over a time window, and correct for errors in the similarity estimations. We show results on data collected by two large camera networks. These experiments prove that incorporation of the long-term models enable us to hold tracks of objects over extended periods of time, including situations where there are large "blind" areas. The proposed approach is implemented by distributing the processing over the entire network.

72 citations


Cites methods from "COST: An Approach for Camera Select..."

  • ...We conduct a detailed performance analysis with data captured on practical multi-camera systems with multiple people observed over the network....

    [...]

Journal ArticleDOI
01 Feb 2010
TL;DR: This paper proposes sensor-planning methods that improve existing algorithms by adding handoff rate analysis, and preserves necessary uniform overlapped FOVs between adjacent cameras for an optimal balance between coverage and handoff success rate.
Abstract: Most existing camera placement algorithms focus on coverage and/or visibility analysis, which ensures that the object of interest is visible in the camera's field of view (FOV). However, visibility, which is a fundamental requirement of object tracking, is insufficient for automated persistent surveillance. In such applications, a continuous consistently labeled trajectory of the same object should be maintained across different camera views. Therefore, a sufficient uniform overlap between the cameras' FOVs should be secured so that camera handoff can successfully and automatically be executed before the object of interest becomes untraceable or unidentifiable. In this paper, we propose sensor-planning methods that improve existing algorithms by adding handoff rate analysis. Observation measures are designed for various types of cameras so that the proposed sensor-planning algorithm is general and applicable to scenarios with different types of cameras. The proposed sensor-planning algorithm preserves necessary uniform overlapped FOVs between adjacent cameras for an optimal balance between coverage and handoff success rate. In addition, special considerations such as resolution and frontal-view requirements are addressed using two approaches: 1) direct constraint and 2) adaptive weights. The resulting camera placement is compared with a reference algorithm published by Erdem and Sclaroff. Significantly improved handoff success rates and frontal-view percentages are illustrated via experiments using indoor and outdoor floor plans of various scales.

67 citations


Additional excerpts

  • ...ing occlusions and visual confusion [18]....

    [...]

Journal ArticleDOI
TL;DR: In this article, the authors examine most, if not all, of the recent approaches (post 2000) addressing camera placement in a structured manner, and provide a complete study of relevant formulation strategies and brief introductions to most commonly used optimization techniques by researchers.
Abstract: With recent advances in consumer electronics and the increasingly urgent need for public security, camera networks have evolved from their early role of providing simple and static monitoring to current complex systems capable of obtaining extensive video information for intelligent processing, such as target localization, identification, and tracking. In all cases, it is of vital importance that the optimal camera configuration (i.e., optimal location, orientation, etc.) is determined before cameras are deployed as a suboptimal placement solution will adversely affect intelligent video surveillance and video analytic algorithms. The optimal configuration may also provide substantial savings on the total number of cameras required to achieve the same level of utility. In this article, we examine most, if not all, of the recent approaches (post 2000) addressing camera placement in a structured manner. We believe that our work can serve as a first point of entry for readers wishing to start researching into this area or engineers who need to design a camera system in practice. To this end, we attempt to provide a complete study of relevant formulation strategies and brief introductions to most commonly used optimization techniques by researchers in this field. We hope our work to be inspirational to spark new ideas in the field.

45 citations

Journal ArticleDOI
TL;DR: This work presents a content-aware multi-camera selection technique that uses object- and frame-level features and compares the proposed approach with a maximum score based camera selection criterion and demonstrates a significant decrease in camera flickering.
Abstract: We present a content-aware multi-camera selection technique that uses object- and frame-level features. First objects are detected using a color-based change detector. Next trajectory information for each object is generated using multi-frame graph matching. Finally, multiple features including size and location are used to generate an object score. At frame-level, we consider total activity, event score, number of objects and cumulative object score. These features are used to generate score information using a multivariate Gaussian distribution. The best view is selected using a Dynamic Bayesian Network (DBN), which utilizes camera network information. DBN employs previous view information to select the current view thus increasing resilience to frequent switching. The performance of the proposed approach is demonstrated on three multi-camera setups with semi-overlapping fields of view: a basketball game, an indoor airport surveillance scenario and a synthetic outdoor pedestrian dataset. We compare the proposed view selection approach with a maximum score based camera selection criterion and demonstrate a significant decrease in camera flickering. The performance of the proposed approach is also validated through subjective testing.

40 citations

Journal ArticleDOI
TL;DR: This work presents a novel framework for selecting cameras to track people in a distributed smart camera network that is based on generalized information-theory and dynamically assign a subset of all available cameras to each target and track it in difficult circumstances of occlusions and limited fields of view with the same accuracy as when using all cameras.
Abstract: Tracking persons with multiple cameras with overlapping fields of view instead of with one camera leads to more robust decisions. However, operating multiple cameras instead of one requires more processing power and communication bandwidth, which are limited resources in practical networks.When the fields of view of different cameras overlap, not all cameras are equally needed for localizing a tracking target. When only a selected set of cameras do processing and transmit data to track the target, a substantial saving of resources is achieved. The recent introduction of smart cameras with on-board image processing and communication hardware makes such a distributed implementation of tracking feasible.We present a novel framework for selecting cameras to track people in a distributed smart camera network that is based on generalized information-theory. By quantifying the contribution of one or more cameras to the tracking task, the limited network resources can be allocated appropriately, such that the best possible tracking performance is achieved.With the proposed method, we dynamically assign a subset of all available cameras to each target and track it in difficult circumstances of occlusions and limited fields of view with the same accuracy as when using all cameras.

31 citations

References
Journal ArticleDOI
TL;DR: An easily applicable algorithmic technique/tool for developing approximation schemes for certain types of combinatorial optimization problems and derives the existence of an FPTAS for the scheduling problem of minimizing the weighted number of late jobs under release dates and preemption on a single machine.

45 citations


"COST: An Approach for Camera Select..." refers background in this paper

  • ...The optimization problem stated above is NP-Hard and belongs to the class of subset selection problems [18]....

    [...]

Proceedings ArticleDOI
17 Oct 2005
TL;DR: In this paper, it is shown that in a multi-camera context occlusions can be handled effectively in real-time at each frame independently, even when the only available data comes from the binary output of a simple blob detector, and the number of present individuals is a priori unknown.
Abstract: In this paper, we show that in a multi-camera context, we can effectively handle occlusions in real-time at each frame independently, even when the only available data comes from the binary output of a simple blob detector, and the number of present individuals is a priori unknown. We start from occupancy probability estimates in a top view and rely on a generative model to yield probability images to be compared with the actual input images. We then refine the estimates so that the probability images match the binary input images as well as possible. We demonstrate the quality of our results on several sequences involving complex occlusions.

29 citations


"COST: An Approach for Camera Select..." refers background in this paper

  • ...Most of the position estimation algorithms constrain the motion to a ground plane and perform inference by first segmenting the people in each view and then using data fusion techniques to obtain an estimate of the 3D locations of each person [15, 12, 13, 6]....

    [...]

Proceedings Article
09 Jul 2005
TL;DR: This paper presents a methodology to actively select a sensor subset with the best tradeoff between information gain and sensor cost by exploiting the synergy among sensors.
Abstract: Active information fusion is to selectively choose the sensors so that the information gain can compensate the cost spent in information gathering. However, determining the most informative and cost-effective sensors requires an evaluation of all possible sensor combinations, which is computationally intractable, particularly, when information-theoretic criterion is used. This paper presents a methodology to actively select a sensor subset with the best tradeoff between information gain and sensor cost by exploiting the synergy among sensors. Our approach includes two aspects: a method for efficient mutual information computation and a graph-theoretic approach to reduce search space. The approach can reduce the time complexity significantly in searching for a near optimal sensor subset.

16 citations


"COST: An Approach for Camera Select..." refers background in this paper

  • ...Since the computation of mutual information requires exponential time, other approximate [23] and heuristic based algorithms [22] have also been proposed....

    [...]

Proceedings ArticleDOI
14 Jun 2006
TL;DR: This work presents a system that incorporates a variety of constraints in a unified multi- view framework to automatically detect humans in possibly crowded scenes that is optimized in a nonparametric belief propagation framework using prior based search.
Abstract: Detection of articulated objects such as humans is an important task in computer vision We present a system that incorporates a variety of constraints in a unified multi- view framework to automatically detect humans in possibly crowded scenes These constraints include the kinematic constraints, the occlusion of one part by another and the high correlation between the appearance of parts such as the two arms The graphical structure (non-tree) obtained is optimized in a nonparametric belief propagation framework using prior based search

10 citations


"COST: An Approach for Camera Select..." refers background or methods in this paper

  • ...son [9, 8, 19, 20]; occlusion of one person by another, leading to inference dependencies between people and their parts has not been addressed....

    [...]

  • ...We implemented a 3D pose estimation system using non-parametric belief propagation [19, 9]....

    [...]

Book ChapterDOI
05 Apr 2004
TL;DR: An easily applicable algorithmic technique/tool for developing approximation schemes for certain types of combinatorial optimization problems and derives the existence of an FPTAS for the scheduling problem of minimizing the weighted number of late jobs under release dates and preemption on a single machine.
Abstract: In paper we develop an easily applicable algorithmic technique/tool for developing approximation schemes for certain types of combinatorial optimization problems. Special cases that are covered by our result show up in many places in the literature. For every such special case, a particular rounding trick has been implemented in a slightly different way, with slightly different arguments, and with slightly different worst case estimations. Usually, the rounding procedure depended on certain upper or lower bounds on the optimal objective value that have to be justified in a separate argument. Our easily applied result unifies many of these results, and sometimes it even leads to a simpler proof. We demonstrate how our result can be easily applied to a broad family of combinatorial optimization problems. As a special case, we derive the existence of an FPTAS for the scheduling problem of minimizing the weighted number of late jobs under release dates and preemption on a single machine. The approximability status of this problem has been open for some time.

6 citations

Frequently Asked Questions (9)
Q1. What have the authors contributed in "COST∗: An Approach for Camera Selection and Multi-Object Inference Ordering in Dynamic Scenes"?

The authors present a unified approach, COST, that reasons about such dependencies and yields an order for the inference of each person in a group of people and a set of cameras to be used for inferences for a person. The authors present an optimization problem to select a set of cameras and inference dependencies for each person that attempts to minimize the computational cost under given performance constraints.

The algorithm cycles between using segmentation to estimate people’s ground plane positions and using ground plane position estimates to obtain segmentations; the process is iterated until stable. 

The number of occluded voxels that can be added due to dependencies depends on the selection of dependencies and the accuracy of the position estimate of the occluder. 

A naive approach (by considering all pairwise interactions of all parts of all people) would involve constructing a large Bayesian network with loops; however, this results in an intractable optimization problem. 

Let dV be a differential volume element (voxel) which might be included in part j of person i. The Occluder Region, $\Omega_k(dV)$, of a differential element dV in camera k is defined as the 3D region in which another person, l, must be present so that dV would not be visible in camera k (See Fig 3).

The weight $c_{l,m}$ is proportional to the probability of the part $(l,m)$ lying in the confuser space and being visible: $c_{l,m} = \frac{1}{Z}\int_{C_k(dV)} P(\overline{EO}^k(dV_1))\, P(E_{l,m}(dV_1))\, dV_1$ (6), where $Z$ is a normalizing factor.

These approaches reduce the complexity by annihilating small probabilities [11] or removing weak dependencies [14] and arcs [21]. 

Typical goals of such an analysis are to recover the position, orientation or the pose of each or some subset of the people in the scene. 

The error in estimating the position of person $i$ using the stereo pair $(k_1, k_2)$ is approximated by $E_i(k_1, k_2) = 1 - \tilde{f}(\theta_{k_1,k_2})\, S^{k_1}_i S^{k_2}_i$ (11), where $\theta_{k_1,k_2}$ is the angle between the viewing directions of cameras $k_1$ and $k_2$ on the ground plane. (In M2Tracker, visibility does not vary with height, and hence ground-plane analysis of visibility can be performed instead of 3D modeling.)