COST: An Approach for Camera Selection and Multi-Object Inference Ordering in Dynamic Scenes
Summary
1. Introduction
- The analysis is difficult due to occlusions and appearance similarities — between people, or between a person and the background against which they are viewed.
- In multiple camera systems, information fusion needs to be sensitive to occlusions and confusions.
- Additionally, the authors seek to identify the parts of the image where such occlusion and confusion occurs and use this information in the inference process.
- A Bayesian network for such multi-object inference will generally have loops.
- The authors present COST, a framework to reason about such dependencies, that produces an inference order for multiperson, multi-perspective pose/position estimation.
2.1. Computing Visibility
- But one person’s visibility depends on the pose of other people in the scene, whose poses are generally known only probabilistically.
- This leads the authors to compute visibility probabilistically.
- To develop a generic formulation, let us consider an n-part model for a person where n is one for simple position estimation or ten for full body pose estimation.
- By considering occlusion of a part (i, j) from itself, the authors implicitly select surface voxels instead of interior voxels.
- There are a fixed and known number of locations, which the authors refer to as “portals”, from which a new person enters or an existing person leaves the scene.
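The probabilistic visibility computation described above can be sketched as follows: a voxel is visible in a camera only if no other person occupies its occluder region. This is a minimal sketch, assuming independence between people; the function name and inputs are illustrative, not the paper's exact quantities.

```python
def visibility_probability(p_occupied):
    """Probability that a voxel dV is visible in camera k.

    `p_occupied` holds, per other person, the probability that the
    person lies inside the voxel's occluder region Omega_k(dV).
    Independence across people is a simplifying assumption here.
    """
    p_visible = 1.0
    for p in p_occupied:
        p_visible *= (1.0 - p)  # person stays outside the occluder region
    return p_visible
```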
2.2. Computing Confusion
- The view might still not be helpful in estimating the pose because of "camouflage": the person's appearance being too similar to either the background or some other person(s) occluded by him.
- Due to such “confusion” with the “background”, segmenting the person accurately would be problematic, and most pose inferences would degrade as the segmentation quality decreases.
- Again, consider the differential element dV that a part (i, j) may contain.
3.1. Model for Information Content
- In order to perform inference reliably for some part of a given person using some view, that part should, ideally, not be occluded in that view and should not be “confused” with the background or other parts.
- The accuracy of the inference will depend upon both the degrees of occlusion and confusion, as discussed in the previous section.
- It will also depend on the uncertainty of such occlusion and confusion.
- The authors present a simple model for measuring the information available in a view regarding a part for the task of pose estimation.
- The information available about a specific part in a given view is then taken as the expected number of visible and discriminable voxels in that view.
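This expected count can be sketched directly: for each voxel of the part, multiply the probability of being visible by the probability of being discriminable, and sum. The per-voxel probabilities below are illustrative placeholders, and independence of the two events is an assumption of this sketch.

```python
def expected_information(voxels):
    """Expected number of voxels of a part that are both visible and
    discriminable in a given view.

    `voxels` holds per-voxel (p_visible, p_discriminable) pairs;
    treating the two events as independent is a simplification.
    """
    return sum(p_vis * p_disc for p_vis, p_disc in voxels)
```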
3.2. Information from Dependencies
- Inference decisions can be improved if estimates of the pose/appearance characteristics of the occluders and confusers are used.
- The segmentation in the occluded region is then based on position priors, which would yield a better estimate of the median line as shown in Figure 4(c).
- Thus, accurate inference of a part’s position depends upon the inference of occluders and confusers.
- Additionally, using information from dependencies might involve expensive computation.
4. The Optimization Problem
- The authors would like to minimize the computational cost while guaranteeing that the expected error in the estimate of the pose of person i is below η_i (termed a "performance constraint").
- The optimization problem stated above is NP-Hard and belongs to the class of subset selection problems [18].
- While approaches such as simulated-annealing can be used for optimization, much faster heuristic approaches can be employed.
4.1. A Heuristic Based Optimization Approach
- The authors present a heuristic-based, greedy algorithm for the optimization problem.
- The authors build the dependency graph G by adding nodes one by one to G. Each node represents a person and the set of cameras selected for estimating the pose of that person.
- The dependency from B is not included since the performance constraint of A is satisfied without it.
- To compute the minimum cost of estimation for each remaining person at each iteration, one could exhaustively search the space of possible cameras and dependencies selection.
- Such an approach requires time exponential in the number of cameras, and so becomes infeasible when the number of cameras is large.
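The greedy construction described above can be sketched as follows. `min_cost_selection` is a hypothetical stand-in for the paper's per-person camera/dependency search: it returns the cheapest (cost, cameras, dependencies) selection that satisfies that person's performance constraint, given the nodes already placed in the graph.

```python
def greedy_build_graph(people, min_cost_selection):
    """Greedily build the dependency graph G, one node per iteration.

    Each node records a person together with the cameras and
    dependencies selected for estimating that person's pose.
    Returns the nodes in the order they were added.
    """
    graph = []                 # list of (person, cameras, dependencies)
    remaining = list(people)
    while remaining:
        # pick the remaining person whose constraint is cheapest to satisfy
        choice = min(
            ((min_cost_selection(p, graph), p) for p in remaining),
            key=lambda item: item[0][0],   # compare by cost only
        )
        (cost, cameras, dependencies), person = choice
        graph.append((person, cameras, dependencies))
        remaining.remove(person)
    return graph
```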
5.1.1 Framework
- The algorithm cycles between using segmentation to estimate people’s ground plane positions and using ground plane position estimates to obtain segmentations; the process is iterated until stable.
- The number of occluded voxels that can be added due to dependencies depends on the selection of dependencies and the accuracy of the position estimate of the occluder.
- For a given camera pair, the error in estimation of position would increase as the segmentation quality decreases in either of the cameras in the pair.
- Additionally, M2Tracker fuses many camera pairs to obtain people’s ground plane position estimates by using a weighted average of the estimates from each camera pair.
- The authors assume, for simplicity, that the computational cost of segmentation and wide-baseline stereo is some constant and independent of view and imaging conditions.
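The fusion step can be sketched as a weighted average over camera pairs. In this sketch the weights are simply supplied by the caller; in M2Tracker they would be derived from each pair's expected reliability (e.g. segmentation quality and baseline angle).

```python
def fuse_pair_estimates(pair_estimates):
    """Fuse ground-plane position estimates from several camera pairs.

    `pair_estimates` holds ((x, y), weight) tuples; returns the
    weighted-average position.
    """
    total = sum(w for _, w in pair_estimates)
    x = sum(pos[0] * w for pos, w in pair_estimates) / total
    y = sum(pos[1] * w for pos, w in pair_estimates) / total
    return (x, y)
```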
5.1.2 Results
- The authors evaluated the performance of their implementation of M2Tracker with and without using COST on the publicly available dataset of M2Tracker.
- It can be seen that M2Tracker's position estimates have higher variance with the eight-camera system than COST's do when choosing only the "best" camera pair per person.
- This is because in many views a person is either occluded or confused with the background and this leads to inaccurate segmentations and subsequent errors in stereo reconstruction.
- The positional ground truth values were obtained manually.
- Experimental results indicate that it is generally sufficient to analyse only a small number of judiciously chosen cameras to obtain accuracy and performance similar to a system uniformly employing a large number of cameras.
5.2. Using COST for Multiple People Pose
- The authors also applied the COST algorithm for full body pose estimation of multiple people.
- These papers have considered the problem of self-occlusion, but not occlusion of one person by another.
- The authors used similar dependency and cost functions as for M2Tracker.
- The error function was modified for the full-body pose problem.
6. Conclusion
- The authors have presented a principled approach, COST, for camera and dependency selection for improving the performance and computational resource requirements for multi-camera systems.
- COST produces a directed acyclic dependency graph which can then be used to obtain an inference order using topological sort.
- The selection criterion in COST is based on visibility and "confusion" analysis in each view and on the resulting dependencies.
- Experimental results indicate that COST outperforms a system which uses a large number of cameras for estimation of each person.
- Additionally, a COST-based system is faster than alternative approaches based on EM and belief propagation, which use all the cameras and dependencies for analysis.
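The final step — turning the directed acyclic dependency graph into an inference order — is a standard topological sort. A minimal sketch using Kahn's algorithm, with a hypothetical `deps` mapping from each person to the people whose estimates they depend on:

```python
from collections import deque

def inference_order(deps):
    """Topological sort (Kahn's algorithm) of the dependency DAG.

    `deps[p]` lists the people whose pose estimates p depends on;
    p is inferred only after all of its dependencies.
    """
    indeg = {p: len(ds) for p, ds in deps.items()}
    dependents = {p: [] for p in deps}
    for p, ds in deps.items():
        for d in ds:
            dependents[d].append(p)
    queue = deque(sorted(p for p, k in indeg.items() if k == 0))
    order = []
    while queue:
        p = queue.popleft()
        order.append(p)
        for q in dependents[p]:
            indeg[q] -= 1
            if indeg[q] == 0:
                queue.append(q)
    if len(order) != len(deps):
        raise ValueError("dependency graph has a cycle")
    return order
```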
Frequently Asked Questions (9)
Q2. How is the algorithm used to estimate people's positions?
The algorithm cycles between using segmentation to estimate people’s ground plane positions and using ground plane position estimates to obtain segmentations; the process is iterated until stable.
Q3. How many occluded voxels can be added to a camera?
The number of occluded voxels that can be added due to dependencies depends on the selection of dependencies and the accuracy of the position estimate of the occluder.
Q4. What is the problem of a naive approach?
A naive approach (by considering all pairwise interactions of all parts of all people) would involve constructing a large Bayesian network with loops; however, this results in an intractable optimization problem.
Q5. What is the probability of a person’s visibility in a camera?
Let dV be a differential volume element (voxel) which might be included in part j of person i. The Occluder Region, Ω_k(dV), of a differential element dV in camera k is defined as the 3D region in which another person, l, must be present so that dV would not be visible in camera k (see Fig. 3).
Q6. What is the probability of a person being seen in the confuser space?
The weight c_{l,m} is proportional to the probability of the part (l, m) lying in the confuser space and being visible:

c_{l,m} = \frac{1}{Z} \int_{C_k(dV)} P(E^O_k(dA)) \, P(E_{l,m}(dV_1)) \, dV_1    (6)

where Z is a normalizing factor.
Q7. How do you reduce the complexity of the information theoretic approaches?
These approaches reduce the complexity by annihilating small probabilities [11] or removing weak dependencies [14] and arcs [21].
Q8. What are the goals of a multi-perspective analysis of moving people?
Typical goals of such an analysis are to recover the position, orientation or the pose of each or some subset of the people in the scene.
Q9. What is the error in estimation of person i using the stereo pair?
the error in estimating the position of person i using the stereo pair (k1, k2) is approximated by5In M2Tracker, visibility does not vary with height and hence ground plane analysis of visibility can be performed instead of 3D modelingEi(k1, k2) = (1 − f̃(θk1,k2)Sk1i Sk2i ) (11)where θk1,k2 is the angle between the viewing directions of cameras k1 and k2 on the ground plane.