
COST: An Approach for Camera Selection and Multi-Object Inference Ordering in Dynamic Scenes
Abhinav Gupta
Dept. of Computer Science
University of Maryland
College Park, MD, USA
agupta@cs.umd.edu
Anurag Mittal
Dept. of Comp. Sc. and Engg.
IIT Madras
Chennai, India
amittal@cse.iitm.ernet.in
Larry S. Davis
Dept. of Computer Science
University of Maryland
College Park, MD, USA
lsd@cs.umd.edu
Abstract
Development of multiple-camera vision systems for the analysis of dynamic objects such as humans is challenging due to occlusions and to similarity in the appearance of a person with the background and other people, which we call visual "confusion". Since occlusion and confusion depend on the presence of other people in the scene, they lead to a dependency structure in which there are often loops in the resulting Bayesian network. While approaches such as loopy belief propagation can be used for inference, they are computationally expensive and convergence is not guaranteed in many situations.

We present a unified approach, COST, that reasons about such dependencies and yields an order for the inference of each person in a group of people, along with a set of cameras to be used for the inferences for each person. Using the probabilistic distribution of the positions and appearances of people, COST performs visibility and confusion analysis for each part of each person and computes the amount of information that can be extracted with and without more accurate estimation of the positions of other people. We formulate an optimization problem that selects a set of cameras and inference dependencies for each person so as to minimize the computational cost under given performance constraints. Results show the efficiency of COST in improving the performance of such systems and in reducing the computational resources required.
1. Introduction
We consider the problem of multi-perspective analysis of moving people in crowded situations. Typical goals of such an analysis are to recover the position, orientation or pose of each of the people in the scene, or of some subset of them. The analysis is difficult due to occlusions and to appearance similarities of people with one another or with the background against which they are viewed. We refer to errors arising from appearance similarities as "confusions". In multiple-camera systems, information fusion needs to be sensitive to occlusions and confusions. (COST stands for Confusion and Occlusion analysis for Selections based on Tasks.)
Our goal is to develop principled methods to "select" the camera(s) in which there is less occlusion and confusion for a particular person in order to infer that person's position or pose (see Figure 1). Additionally, we seek to identify the parts of the image where such occlusion and confusion occur and to use this information in the inference process. However, determining those regions of occlusion and confusion depends on the positions and poses of other people in the scene. This leads to a dependency structure for the inference of the position/pose of the people present in the scene, as illustrated graphically in Figure 2(b). A Bayesian network for such multi-object inference will generally have loops. Those loops can be eliminated by appropriate selection of cameras and by dropping inference dependencies that are not expected to yield significant information, as shown in the example in Figure 2(c).

We present COST, a framework to reason about such dependencies that produces an inference order for multi-person, multi-perspective pose/position estimation. We additionally identify a set of cameras and the parts of the acquired images to be analyzed for each person. We show that COST not only yields a reduction in computational time compared to approaches such as Expectation Maximization (EM) or Loopy Belief Propagation (LBP) [16], but also a quantitative improvement in the pose/position estimates due to camera selection.
1.1. Related Work
There are many multi-perspective vision algorithms that analyze crowded scenes for either person position estimation or pose estimation. Most position estimation algorithms constrain the motion to a ground plane and perform inference by first segmenting the people in each view and then using data fusion techniques to obtain an estimate of the 3D location of each person [15, 12, 13, 6]. While occlusion has been considered to some extent (for weighted fusion) in some papers [15, 13], confusion due to appearance similarities has not been previously considered. Additionally, most earlier work either ignores the inference dependencies or uses all of them, which makes the computation costly.

Previous work on pose estimation has only considered self-occlusion of one body part by another of the same person [9, 8, 19, 20]; occlusion of one person by another, leading to inference dependencies between people and their parts, has not been addressed.

Figure 1. Segmentation results and median-line determination for a person in three different views (panels 1a-3b). In view 1 there is neither occlusion nor confusion, while views 2 and 3 exhibit occlusion and confusion, respectively. If the median lines are used for person position estimation as in [13, 10], without occlusion and confusion reasoning, we might mistakenly use the median lines shown in (2b) and (3b).
Figure 2. (a) A multiple-person scenario with 3 people and 3 cameras. (b) The dependency graph obtained if all cameras are used for the estimation of all people. An edge A → B represents the information flow from A to B in the inference process; hence the estimation of B depends on the estimation of A. In this scenario, the estimation of B depends on A due to occlusion in camera 1, and the estimations of A and C depend on B and A, respectively, due to occlusions in cameras 3 and 2. (c) The dependency graph obtained if cameras are selected using COST; the cameras selected for estimating each person are shown in the respective node. Since camera 1 is not used for the estimation of B, the estimation of B becomes independent of A. Additionally, if the degree of occlusion of C due to A is small (that is, one cannot generate significant information for the estimation of C using the estimate of the location or pose of A), then one can also eliminate the dependency edge A → C without strongly affecting the accuracy of the result. Such elimination can be critical for loop removal when there are not enough cameras in which a person is isolated and discriminable.
A naive approach, considering all pairwise interactions of all parts of all people, would involve constructing a large Bayesian network with loops, resulting in an intractable optimization problem. We show how many of the loops in the Bayesian network can be eliminated by selecting the best cameras and the most important inference dependencies.
A related problem of sensor selection and information fusion has been studied in the fields of sensor networks and distributed computing. The problem is to selectively choose sensors so that the information gained compensates for the costs associated with gathering it. An optimal solution using such an information-theoretic approach requires evaluating all possible combinations, making the problem NP-hard. Denzler et al. [4] proposed an information-theoretic approach in which the view that leads to the maximum reduction in entropy is chosen. Since the computation of mutual information requires exponential time, approximate [23] and heuristic [22] algorithms have also been proposed. Other approaches in this field include the use of look-up tables [17] or utility functions [2] in the selection of camera views.
These information-theoretic approaches consider only geometric analysis based on the fields of view of the cameras when computing mutual information. However, even though two cameras might have overlapping fields of view, they can still provide different information due to occlusion and confusion. While [5] presents an approach for camera selection in the presence of occlusions, COST performs visibility and discriminability analysis in conjunction with reasoning about dependencies for camera selection.
Bayesian belief networks are an important mechanism for representation and reasoning under uncertainty. For a given belief net, even finding an approximate solution is NP-hard [3]. Our approach is related to model simplification methods (see [7]), which simplify the model until exact methods become feasible. These approaches reduce complexity by annihilating small probabilities [11] or by removing weak dependencies [14] and arcs [21].

Our approach is complementary to these. COST's loop removal procedure is primarily based on camera selection, which removes redundant and unreliable information in multi-perspective vision systems. Additionally, while previous approaches assume that the weights of the dependencies are given, our approach considers occlusion and confusion in the different cameras and removes loops based on this information.
The paper is organized as follows. Section 2 describes how visibility and confusion factors for an object are computed. Section 3 models the information available in views and from dependencies. Section 4 presents our optimization framework and a heuristic approach for fast approximate inference. Section 5 presents experimental results.
2. Computing Occlusion and Confusion
2.1. Computing Visibility
To estimate a property of a given person or object from a given camera, that person or object must be (at least partially) visible from that camera. But one person's visibility depends on the poses of the other people in the scene, which are generally known only probabilistically. This leads us to compute visibility probabilistically. Specifically, we compute the probability of visibility of each part of a person in each camera based on probabilistic estimates of the poses of all other people in the scene. To develop a generic formulation, consider an n-part model for a person, where n is one for simple position estimation or ten for full-body pose estimation.
Let $dV$ be a differential volume element (voxel) which might be included in part $j$ of person $i$. The occluder region, $\Omega_k(dV)$, of a differential element $dV$ in camera $k$ is defined as the 3D region in which another person, $l$, must be present for $dV$ not to be visible in camera $k$ (see Fig. 3). We also define the following events:

$E_{i,j}(dV)$ = the event that part $j$ of person $i$ includes $dV$;¹
$EO^k_{l,m}(dV)$ = the event that part $m$ of person $l$ intersects $\Omega_k(dV)$ (we write $\overline{EO}^k_{l,m}(dV)$ for its complement);
$\overline{EO}^k(dV)$ = the event that no person intersects $\Omega_k(dV)$.

The expected visibility of a part, that is, the expected number of visible voxels contained in that part, is then given by

$$E_v(i,j,k) = \int_V P(\overline{EO}^k(dV))\, P(E_{i,j}(dV))\, dV \qquad (1)$$

The probability that part $m$ of person $l$ does not occlude $dV$ is the probability that part $m$ does not contain any of the voxels belonging to the set $\Omega_k(dV)$. Therefore, that probability is given by

$$P(\overline{EO}^k_{l,m}(dV)) = \prod_{dV_1 \in \Omega_k(dV)} \left[ 1 - P(E_{l,m}(dV_1)) \right] \qquad (2)$$

The probability that no part of any person is in the occluder region is then given by²

$$P(\overline{EO}^k(dV)) = \prod_{(l,m)} P(\overline{EO}^k_{l,m}(dV)) \qquad (3)$$

¹A part $j$ can include many such voxels.
²By considering occlusion of a part $(i, j)$ by itself, we implicitly select surface voxels instead of interior voxels; interior voxels would be occluded by the surface voxels and would not be considered.
Furthermore, in a tracking scenario new people can enter the scene. In this case we must also consider the occlusions they are likely to introduce and how the expected visibility changes to account for them. We assume there is a fixed and known set of locations, which we refer to as "portals", through which a new person enters or an existing person leaves the scene. Let $E_{new}(dV)$ be the event that a new person is present in voxel $dV$. The likelihood of this event, $P(E_{new}(dV))$, is the product of the likelihood that a portal is nearby (represented as a prior probability $P_p(E_{new}(dV))$) and the image likelihood that a new person is seen in the region, $P_L(E_{new}(dV))$. Therefore, $P(\overline{EO}^k(dV))$ is given by:
Figure 3. Schematic diagram showing $\Omega_k(dV)$ and $C_k(dV)$ projected onto the ground plane. Because of discretization, $\Omega_k(dV)$ and $C_k(dV)$ represent the sets of voxels in which another object must be present for occlusion or confusion, respectively, to occur.
$$P(\overline{EO}^k(dV)) = \prod_{dV_1 \in \Omega_k(dV)} \Big( \prod_{(l,m)} \left[ 1 - P(E_{l,m}(dV_1)) \right] \Big) \left( 1 - P(E_{new}(dV_1)) \right) \qquad (4)$$
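To make the discrete computation concrete, the following sketch (our own illustrative code, not the authors' implementation; all function and variable names are assumptions) evaluates Eqs. (2)-(4) and the expected visibility of Eq. (1) on a discretized voxel grid. Here `p_part[(l, m)]` maps a voxel to $P(E_{l,m}(dV))$, `p_new` maps a voxel to $P(E_{new}(dV))$, and `occluder_region(v, k)` enumerates the voxels of $\Omega_k(dV)$.

```python
def p_unoccluded(v, k, p_part, p_new, occluder_region):
    """P(EObar_k(dV)): probability that no tracked person and no newly
    entered person intersects the occluder region Omega_k(dV) of voxel v
    in camera k (Eqs. 2-4, with the integral replaced by a product over
    a discrete voxel grid)."""
    prob = 1.0
    for v1 in occluder_region(v, k):
        # Eqs. (2)-(3): no part (l, m) of any tracked person occupies v1 ...
        for part_probs in p_part.values():
            prob *= 1.0 - part_probs.get(v1, 0.0)
        # Eq. (4): ... and no new person entering through a portal does either
        prob *= 1.0 - p_new.get(v1, 0.0)
    return prob


def expected_visibility(i, j, k, p_part, p_new, voxels, occluder_region):
    """Eq. (1): expected number of visible voxels of part j of person i
    in camera k, summed over the voxel grid."""
    return sum(p_unoccluded(v, k, p_part, p_new, occluder_region)
               * p_part[(i, j)].get(v, 0.0)
               for v in voxels)
```

In this discrete form, the product over $\Omega_k(dV)$ makes explicit that occlusion by any part of any person, including people newly entering through portals, reduces the expected visibility.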
2.2. Computing Confusion
Although a person (or some part of one) might be visible in view $k$, the view might still not be helpful for estimating the pose because of "camouflage": the person's appearance may be too similar to the background or to other person(s) occluded by him. Due to such "confusion" with the "background", segmenting the person accurately becomes problematic, and most pose inferences degrade as segmentation quality decreases.

Again, consider a differential element $dV$ that a part $(i, j)$ may contain. To compute the discriminability of $dV$, we determine the parts which can cause confusion. The confuser space $C_k(dV)$ of an element $dV$ is defined as the region where the presence of a part $(l, m)$ would cause confusion in the classification of a pixel formed by the projection of part $(i, j)$ from $dV$ (see Figure 3). The amount of confusion is proportional to the similarity in appearance of the two parts. We define the discriminability of a part $(i, j)$ in a view $k$, $D_k(i,j)$, as:
$$D_k(i,j) = \sum_{(l,m)} c_{l,m}\, d(a^k_{i,j}, a^k_{l,m}) + c_0\, d(a^k_{i,j}, B_k) \qquad (5)$$

where $a^k_{i,j}$ denotes the appearance of part $(i, j)$ in view $k$, $B_k$ the appearance of the background, $d$ a distance metric between appearances, and $c$ the corresponding weight. For example, if appearance is represented as a histogram, then $d$ could be the dot product of the two histograms or the earth mover's distance. The weight $c_{l,m}$ is proportional to the probability of the part $(l, m)$ lying in the confuser space and being visible:

$$c_{l,m} = \frac{1}{Z} \int_{C_k(dV)} P(\overline{EO}^k(dV_1))\, P(E_{l,m}(dV_1))\, dV_1 \qquad (6)$$

where $Z$ is a normalizing factor. Hence, the expected number of discriminable voxels in view $k$ contained in part $(i, j)$ is given by:

$$I_k(i,j) = \int_V P(\overline{EO}^k(dV))\, D_k(i,j)\, P(E_{i,j}(dV))\, dV \qquad (7)$$
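A corresponding sketch for Eqs. (5)-(7) follows (again our own hedged illustration; `appearance`, `background`, the distance function `d`, and the helper names are assumptions, and the appearance models could be, for example, color histograms):

```python
def discriminability(i, j, k, appearance, background, c, c0, d):
    """Eq. (5): D_k(i, j) as a weighted sum of appearance distances between
    part (i, j) and its potential confusers (l, m), plus the distance to
    the background B_k of view k."""
    a_ij = appearance[(i, j)][k]
    confuser_term = sum(c_lm * d(a_ij, appearance[(l, m)][k])
                        for (l, m), c_lm in c.items())
    return confuser_term + c0 * d(a_ij, background[k])


def confuser_weight(l, m, k, v, p_part, p_unocc, confuser_region, Z):
    """Eq. (6): weight c_{l,m}, proportional to the probability that part
    (l, m) lies in the confuser space C_k(dV) of voxel v and is visible."""
    return sum(p_unocc(v1, k) * p_part[(l, m)].get(v1, 0.0)
               for v1 in confuser_region(v, k)) / Z


def expected_information(i, j, k, p_part, p_unocc, D, voxels):
    """Eq. (7): expected number of visible *and* discriminable voxels of
    part (i, j) in view k; D(i, j) evaluates Eq. (5)."""
    return sum(p_unocc(v, k) * D(i, j) * p_part[(i, j)].get(v, 0.0)
               for v in voxels)
```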
3. Information in Views and Dependencies
3.1. Model for Information Content
In order to perform inference reliably for some part of a given person using some view, that part should, ideally, not be occluded in that view and should not be "confused" with the background or other parts. The accuracy of the inference will depend upon both the degrees of occlusion and confusion, as discussed in the previous section. It will also depend on the uncertainty of such occlusion and confusion. We present a simple model for measuring the information available in a view regarding a part for the task of pose estimation. We say that a specific voxel belonging to a person is informative in some view if and only if it is both visible and discriminable. The information available about a specific part in a given view is then taken as the expected number of visible and discriminable voxels in that view.
3.2. Information from Dependencies
Inference decisions can be improved if estimates of the pose/appearance characteristics of the occluders and confusers are used. Such information can be employed in a variety of ways; an example for the position estimation problem is shown in Figure 4. Here the inference of a person's position involves constructing a median line through the silhouette of the person and computing that line's intersection with the ground plane using calibration information. Figure 4(b) shows the segmentation of the person constructed from the visible and discriminable voxels; however, the estimate of the median line is inaccurate when only these voxels are used (see the magenta voxels on the ground plane and median line 1 based on them). If we additionally use the position of the occluder, we can identify the occluded regions (see the light blue region in Figure 4(d)). The segmentation in the occluded region is then based on position priors, which yields a better estimate of the median line, as shown in Figure 4(c).
Figure 4. Importance of using occlusion information before fusion: (a) the original image; (b) occlusion-unaware segmentation and object inference; (c) occlusion-aware segmentation and inference; (d) the ground-plane situation of the scenario. The black boundary shows the actual voxels contained in the person. In case (b), only the magenta voxels are used for median-line estimation (line 1). In case (c), a combination of the magenta and blue voxels is used to estimate the median line (line 2). The true median line is (3).
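The median-line computation itself is simple. The sketch below is our own minimal version under stated assumptions (binary masks as NumPy arrays, a per-pixel position prior, and a 0.5 threshold chosen purely for illustration); it shows the occlusion-aware variant of Figure 4(c), where pixels in the known occluded region are filled in from the position prior before the median line is taken.

```python
import numpy as np

def occlusion_aware_mask(informative, occluded, prior, thresh=0.5):
    """Keep pixels classified from image evidence (the magenta voxels of
    Fig. 4) and, inside the region occluded by an already-estimated
    occluder, add pixels whose position prior is high (the blue voxels)."""
    return informative | (occluded & (prior > thresh))

def median_column(mask):
    """A simple stand-in for the silhouette's vertical median line: the
    median image column of the silhouette pixels, to be intersected with
    the ground plane via calibration."""
    rows, cols = np.nonzero(mask)
    return float(np.median(cols))
```

A call like `median_column(occlusion_aware_mask(seg, occ, prior))` then corresponds to median line 2 of Figure 4, whereas `median_column(seg)` alone corresponds to line 1.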
The inference of a part's position depends on the information about its occluders and confusers; the more accurate our information about the occluders and confusers, the more accurate our estimate will be. Thus, accurate inference of a part's position depends upon the inference of its occluders and confusers. Such dependencies can be represented in a dependency graph (see Fig. 2). Using the poses of other people in the inference process can, however, lead to loops in the Bayesian network. Additionally, using information from dependencies may involve expensive computation. Our goal is to avoid introducing edges into the dependency graph which either do not carry sufficient information or introduce loops in the Bayesian network. We do this as follows: for each possible occluder or confuser $l$, we associate a binary decision variable $\nu^k_{i,l}$ which represents whether knowledge about the pose of person $l$ is to be used in the inference of the pose of person $i$ from view $k$.³ If there is no edge from node $l$ (the node representing person $l$) to node $i$ in the dependency graph, then $\forall k,\ \nu^k_{i,l} = 0$. Given some selection of edges to include in the dependency graph, the total amount of information, $I^k_{i,j}$, that an algorithm can extract in view $k$ about a part $(i, j)$ using the estimates of its dependencies can be determined. This, however, also depends on the accuracy of the estimates of the dependencies.

³In our model, dependencies are between people, not parts; we use the estimate of person $l$ to estimate the locations of all parts of person $i$.
4. The Optimization Problem
Given the amount of information available (with and without dependencies) regarding each person in each camera, we estimate binary decision variables $\mu^k_i$ and $\nu^k_{i,l}$, which represent, respectively, whether or not camera $k$ will be used in the inference of person $i$ ($\mu^k_i$) and, if so, whether to use the estimate of the pose of person $l$ when estimating the pose of person $i$ (that is, whether or not to include the edge from node $l$ to node $i$ in the Bayesian network). For instance, in Figure 2 the decision variables ($\mu^1_C$, $\mu^2_C$, $\nu^2_{C,A}$) will be set to true for person C. We would like to minimize the computational cost while guaranteeing that the expected error in the estimate of the pose of person $i$ is below $\eta_i$ (termed a "performance constraint"). Thus, the optimization problem can be formulated as
$$\min_{\{\mu_i, \nu_i\}} \sum_i J_i(\mu_i, \nu_i) \quad \text{such that} \quad e_i(\mu_i, \nu_i) \le \eta_i \;\; \forall i \qquad (8)$$
where $e_i$ represents the expected error in the estimate of the pose of person $i$ and $J_i$ represents the cost of computing that estimate. This model also supports attention-based surveillance, when $\exists i$ such that $\forall j \neq i$, $\eta_i \ll \eta_j$; in such a case, most of the computational resources are devoted to estimating the pose of a distinguished person.

The optimization problem stated above is NP-hard and belongs to the class of subset selection problems [18]. While approaches such as simulated annealing can be used for the optimization, much faster heuristic approaches can be employed.
4.1. A Heuristic Based Optimization Approach
We present a heuristic-based, greedy algorithm for the optimization problem. We build the dependency graph $G$ by adding nodes to $G$ one by one. Each node represents a person and the set of cameras selected for estimating the pose of that person. The edges incident on a node represent the dependencies to be used in the estimation (an edge $l \rightarrow i$ indicates that the pose of person $l$ is used to estimate the pose of person $i$).

At each iteration, we compute the minimum cost⁴ of estimation of each person $i$ by selecting the best possible settings of the decision variables ($\mu_i$ and $\nu_i$). However, to avoid loops in $G$, we require that dependencies be selected from the set of nodes already present in $G$ and that they introduce no loops in the Bayesian network. The person with the lowest cost of estimation is then added to $G$. In the next iteration, the costs of estimation are re-computed, since the newly introduced node can now be used as a dependency for the remaining people.

⁴If the performance constraint for a person cannot be satisfied, we take that person's cost of estimation to be ∞.
The algorithm is illustrated in Figure 5. At iteration 1, the minimum costs of computation are B = 2 (using cameras 1, 2 and no dependency), A = ∞ (A needs a dependency on either B or C for its performance constraint to be satisfied; since the dependency graph at t = 0 is empty, A cannot use any dependency), and C = ∞ (C also needs the estimate of B for its performance constraint to be satisfied), so B is added first. At iteration 2, the computation costs become A = 8 (using cameras 1, 2, 3 and the dependency from B) and C = 3 (using cameras 2, 3 and the dependency from B); hence C is added at iteration 2. At iteration 3, the new minimum computation cost for A is 4 (using cameras 2, 3 and the dependency from C; the dependency from B is not included since the performance constraint of A is satisfied without it).
Figure 5. A sample scenario to illustrate the heuristic algorithm.
To compute the minimum cost of estimation for each remaining person at each iteration, one could exhaustively search the space of possible camera and dependency selections. However, such an approach requires time exponential in the number of cameras, and so becomes infeasible when the number of cameras is large. Instead, we use a greedy approach: we start by selecting a minimal set of cameras (two, for example, if pose is to be estimated by stereo; one if position is estimated by intersecting a median line with the ground plane) and add cameras and dependencies one at a time, based on the increase in the cost of computation and the reduction in expected error, until the performance constraints are satisfied.
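The following sketch captures the outer loop of this greedy construction (a hedged illustration only; `min_cost` is a hypothetical helper that returns the cheapest constraint-satisfying camera/dependency selection for a person given the current graph, or an infinite cost when no selection satisfies the constraint, per footnote 4):

```python
import math

def build_dependency_graph(people, min_cost):
    """Greedy construction of the dependency graph G (Section 4.1): at each
    iteration, add the person whose performance constraint is met most
    cheaply using only dependencies on people already in G (so the
    resulting Bayesian network stays loop-free), then re-compute costs."""
    graph = {}        # person -> (selected cameras, selected dependencies)
    order = []        # inference order in which people were added
    remaining = set(people)
    while remaining:
        # min_cost(p, graph) returns a (cost, cameras, dependencies) tuple
        cost, cams, deps, person = min(
            (min_cost(p, graph) + (p,) for p in remaining),
            key=lambda t: t[0])
        if math.isinf(cost):  # no remaining person can meet its constraint
            break
        graph[person] = (cams, deps)
        order.append(person)
        remaining.remove(person)
    return order, graph
```

`order` then gives the resulting inference ordering, which would be (B, C, A) in the example of Figure 5.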
5. Experiments
We next demonstrate how COST can be applied to multiple-camera tracking algorithms.
5.1. Tracking People on a Ground Plane
5.1.1 Framework
We applied COST to a variant of M2Tracker, a system that segments, detects and tracks multiple people on a ground plane in a cluttered scene [15]. The algorithm cycles between using segmentation to estimate people's ground plane positions and using the ground plane position estimates to obtain segmentations; the process is iterated until stable. In M2Tracker, all people are segmented in all cameras, and the segmentations are then combined using a wide-baseline stereo reconstruction algorithm for position estimation. In COST, selected people are segmented in selected views, namely those for which $\mu^k_i = 1$. To use the estimate of the position of an occluder, we first segment the occluder and then classify the pixels in the occluded region based on the prior probabilities alone.
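As a sketch of this modified segmentation step (our own hedged pseudocode; the array names, the per-pixel likelihood representation, and the 0.5 decision threshold are all assumptions), a view is processed only when COST selected it, and the occluded region is classified from the position prior alone while the rest of the view uses image evidence:

```python
import numpy as np

def segment_person(i, k, mu, image_likelihood, occluded, prior):
    """Segment person i in view k only if camera k was selected by COST
    (mu[i][k] == 1). Pixels inside the region occluded by an
    already-estimated occluder are classified from the position prior
    alone; all other pixels are classified from the image likelihood."""
    if not mu[i][k]:
        return None  # this view is not used for person i
    return np.where(occluded, prior > 0.5, image_likelihood > 0.5)
```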

References (partial list preserved by the extraction):
Loopy belief propagation for approximate inference: an empirical study.
M2Tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene.
A multiview approach to tracking people in crowded scenes using a planar homography constraint.
Entropy-based sensor selection heuristic for target localization.
Frequently Asked Questions

Q1. What have the authors contributed in "COST: An Approach for Camera Selection and Multi-Object Inference Ordering in Dynamic Scenes"?

The authors present a unified approach, COST, that reasons about such dependencies and yields an order for the inference of each person in a group of people and a set of cameras to be used for the inferences for a person. They present an optimization problem to select a set of cameras and inference dependencies for each person which attempts to minimize the computational cost under given performance constraints.

Additional answers preserved without their questions:

The number of occluded voxels that can be added due to dependencies depends on the selection of dependencies and the accuracy of the position estimate of the occluder.

The error in estimating the position of person $i$ using the stereo pair $(k_1, k_2)$ is approximated by⁵

$$E_i(k_1, k_2) = 1 - \tilde{f}(\theta_{k_1,k_2})\, S^{k_1}_i S^{k_2}_i \qquad (11)$$

where $\theta_{k_1,k_2}$ is the angle between the viewing directions of cameras $k_1$ and $k_2$ on the ground plane.

⁵In M2Tracker, visibility does not vary with height, and hence ground-plane analysis of visibility can be performed instead of 3D modeling.