
Showing papers by "Luc Van Gool published in 2009"


Proceedings ArticleDOI
01 Jan 2009
TL;DR: A novel approach for multi-person tracking-by-detection in a particle filtering framework that uses the continuous confidence of pedestrian detectors and online trained, instance-specific classifiers as a graded observation model, which relies only on information from the past and is suitable for online applications.
Abstract: We propose a novel approach for multi-person tracking-by-detection in a particle filtering framework. In addition to final high-confidence detections, our algorithm uses the continuous confidence of pedestrian detectors and online trained, instance-specific classifiers as a graded observation model. Thus, generic object category knowledge is complemented by instance-specific information. A main contribution of this paper is the exploration of how these unreliable information sources can be used for multi-person tracking. The resulting algorithm robustly tracks a large number of dynamically moving persons in complex scenes with occlusions, does not rely on background modeling, and operates entirely in 2D (requiring no camera or ground plane calibration). Our Markovian approach relies only on information from the past and is suitable for online applications. We evaluate the performance on a variety of datasets and show that it improves upon state-of-the-art methods.
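The graded observation model described above can be illustrated with a minimal, hypothetical particle-filter step in one dimension; the function name, the Gaussian motion model, and the confidence function are illustrative assumptions, not the authors' implementation:

```python
import random

def particle_filter_step(particles, weights, detector_conf, motion_std=1.0):
    """One step of a 1-D particle filter whose observation model is a
    continuous detector confidence, not just thresholded detections."""
    # Predict: constant-position motion model with Gaussian noise.
    particles = [p + random.gauss(0.0, motion_std) for p in particles]
    # Update: weight each particle by the detector confidence at its state.
    weights = [detector_conf(p) for p in particles]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    # Resample to concentrate particles in high-confidence regions.
    particles = random.choices(particles, weights=weights, k=len(particles))
    return particles, [1.0 / len(particles)] * len(particles)
```

Tracking a target then amounts to iterating this step once per frame; the mean (or mode) of the particle set serves as the state estimate.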

633 citations


Proceedings ArticleDOI
01 Jan 2009
TL;DR: A multiple classifier system for model-free tracking that outperforms other on-line tracking methods especially in case of occlusions and presence of similar objects.
Abstract: We present a multiple classifier system for model-free tracking. The tasks of detection (finding the object of interest), recognition (distinguishing similar objects in a scene), and tracking (retrieving the object to be tracked) are split into separate classifiers in the spirit of simplifying each classification task. The supervised and semi-supervised classifiers are carefully trained on-line in order to increase adaptivity while limiting accumulation of errors, i.e. drifting. In the experiments, we demonstrate real-time tracking on several challenging sequences, including multi-object tracking of faces, humans, and other objects. We outperform other on-line tracking methods especially in case of occlusions and presence of similar objects.

246 citations


Proceedings ArticleDOI
01 Aug 2009
TL;DR: A complete integrated system for live facial puppetry is presented that enables high-resolution real-time facial expression tracking with transfer to another person's face; the actor becomes a puppeteer with complete and accurate control over a digital face.
Abstract: We present a complete integrated system for live facial puppetry that enables high-resolution real-time facial expression tracking with transfer to another person's face. The system utilizes a real-time structured light scanner that provides dense 3D data and texture. A generic template mesh, fitted to a rigid reconstruction of the actor's face, is tracked offline in a training stage through a set of expression sequences. These sequences are used to build a person-specific linear face model that is subsequently used for online face tracking and expression transfer. Even with just a single rigid pose of the target face, convincing real-time facial animations are achievable. The actor becomes a puppeteer with complete and accurate control over a digital face.

239 citations


Proceedings ArticleDOI
01 Sep 2009
TL;DR: To achieve robustness to partial occlusions, this work uses an individual local tracker for each segment of the articulated structure, which enforces the anatomical hand structure through soft constraints on the joints between adjacent segments.
Abstract: We present a method for tracking a hand while it is interacting with an object. This setting is arguably the one where hand-tracking has most practical relevance, but poses significant additional challenges: strong occlusions by the object as well as self-occlusions are the norm, and classical anatomical constraints need to be softened due to the external forces between hand and object. To achieve robustness to partial occlusions, we use an individual local tracker for each segment of the articulated structure. The segments are connected in a pairwise Markov random field, which enforces the anatomical hand structure through soft constraints on the joints between adjacent segments. The most likely hand configuration is found with belief propagation. Both range and color data are used as input. Experiments are presented for synthetic data with ground truth and for real data of people manipulating objects.
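The paper's pairwise Markov random field is a tree over hand segments; as an illustrative simplification, the max-product message passing it relies on can be sketched for a chain of segments (the Viterbi special case). Function name, toy unary scores, and pairwise potentials below are hypothetical:

```python
def chain_map_config(unary, pairwise):
    """Max-product belief propagation on a chain of segments.
    unary[i][s]        : local-tracker score of state s for segment i
    pairwise[i][s][t]  : soft joint constraint between segment i (state s)
                         and segment i+1 (state t)
    Returns the most likely joint configuration."""
    n = len(unary)
    msg = [list(unary[0])]   # msg[i][t]: best score of segments 0..i ending in t
    back = []                # backpointers for the traceback
    for i in range(1, n):
        scores, ptr = [], []
        for t in range(len(unary[i])):
            best_s = max(range(len(unary[i - 1])),
                         key=lambda s: msg[-1][s] * pairwise[i - 1][s][t])
            scores.append(msg[-1][best_s] * pairwise[i - 1][best_s][t] * unary[i][t])
            ptr.append(best_s)
        msg.append(scores)
        back.append(ptr)
    # Trace the best configuration backwards.
    state = max(range(len(msg[-1])), key=lambda t: msg[-1][t])
    config = [state]
    for ptr in reversed(back):
        state = ptr[state]
        config.append(state)
    return list(reversed(config))
```

On a tree, the same two-pass scheme runs from the leaves to a root and back; on graphs with loops, belief propagation becomes iterative and approximate.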

181 citations


Proceedings ArticleDOI
01 Jan 2009
TL;DR: The experiments show that while a state-of-the-art scene classifier can keep global classes such as road types similarly well apart, a manually crafted feature set based on a segmentation clearly outperforms it on object classes.
Abstract: Recognizing the traffic scene in front of a car is an important asset for autonomous driving, as well as for safety systems. While GPS-based maps abound and have reached an incredible level of accuracy, they can still profit from additional, image-based information. Especially in urban scenarios, GPS reception can be shaky, or the map might not contain the latest detours due to constructions, demonstrations, etc. Furthermore, such maps are static and cannot account for other dynamic traffic agents, such as cars or pedestrians. In this paper, we therefore propose an image-based system that is able to recognize both the road type (straight, left/right curve, crossing, ...) as well as a set of often encountered objects (car, pedestrian, pedestrian crossing). The obtained information could then be fused with existing maps and either assist the driver directly (e.g., a pedestrian crossing is ahead: slow down) or help in improving object tracking (e.g., where are possible entrance points for pedestrians or cars?). Starting from a video sequence obtained from a car driving through urban areas, we employ a two-stage architecture termed Segmentation-Based Urban Traffic Scene Understanding (SUTSU) that first builds an intermediate representation of the image based on a patch-wise image classification. The patch-wise segmentation is inspired by recent work [3, 4, 5] and assigns class probabilities to every 8×8 image patch. As a feature set, we use the coefficients of the Walsh-Hadamard transform (a decomposition of the image into square waves) and, if available, additional information from the depth map. These are then used in a one-versus-all training using AdaBoost for feature selection, where we choose 13 texture classes that we found to be representative of typical urban scenes. This yields a meta representation of the scene that is more suitable for further processing (Fig. 1 (b, c)).
In recent publications, such a segmentation was used for a variety of purposes, such as improvement of object detection [1, 5], analysis of occlusion boundaries, or 3D reconstruction. In this paper, we will investigate the use of a segmentation for urban scene analysis. We infer another set of features from the segmentation’s probability maps, analyzing repetitiveness, curvature, and rough structure. This set is then again used with a one-versus-all training to infer both the type of road segment ahead, as well as the presence of pedestrians, cars, or pedestrian crossings. A Hidden Markov Model is used for temporally smoothing the result. SUTSU is tested on two challenging sequences, spanning over 50 minutes of video of driving through Zurich. The experiments show that while a state-of-the-art scene classifier [2] can keep global classes such as road types similarly well apart, a manually crafted feature set based on a segmentation clearly outperforms it on object classes. Example images are shown in Fig. 2. The main contribution of this paper is the application of recent scene categorization research to do vision “in the wild”, driving through urban scenarios. We furthermore show the advantage of a segmentation-based approach over a global descriptor, as the intermediate representation can easily be adapted to other underlying image data (e.g. dusk, rain, ...), without having to change the high-level classifier.
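The Walsh-Hadamard feature set mentioned above decomposes each 8×8 patch into square waves. A minimal sketch of the 2-D transform, using Sylvester's construction (which yields natural Hadamard ordering rather than sequency ordering); this is illustrative only, not the authors' code:

```python
def hadamard(n):
    """Hadamard matrix of size n (a power of two) via Sylvester's
    construction; its rows are the square-wave basis functions."""
    H = [[1]]
    while len(H) < n:
        # [[H, H], [H, -H]] doubling step
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def wht_patch(patch):
    """2-D Walsh-Hadamard transform of a square patch: H * P * H
    (H is symmetric, so no transpose is needed)."""
    n = len(patch)
    H = hadamard(n)
    # Rows: H * P
    hp = [[sum(H[i][k] * patch[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    # Columns: (H * P) * H
    return [[sum(hp[i][k] * H[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

The resulting coefficients (low-order ones capture coarse texture, high-order ones fine detail) would then feed the one-versus-all AdaBoost training described above.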

179 citations


Proceedings ArticleDOI
01 Jan 2009
TL;DR: A pipeline for the efficient detection and recognition of traffic signs is proposed and 2D and 3D techniques are combined to improve results beyond the state-of-the-art, which is still very much preoccupied with single view analysis.
Abstract: Several applications require information about street furniture. Part of the task is to survey all traffic signs. This has to be done for millions of km of road, and the exercise needs to be repeated every so often. A van with 8 roof-mounted cameras drove through the streets and took images every meter. The paper proposes a pipeline for the efficient detection and recognition of traffic signs. The task is challenging, as illumination conditions change regularly, occlusions are frequent, 3D positions and orientations vary substantially, and the actual signs are far less similar among equal types than one might expect. We combine 2D and 3D techniques to improve results beyond the state-of-the-art, which is still very much preoccupied with single view analysis.

139 citations


Proceedings ArticleDOI
01 Jan 2009
TL;DR: A complete 3D in-hand scanning system that allows users to scan objects by simply turning them freely in front of a real-time 3D range scanner and the online model is of sufficiently high quality to serve as the final model.
Abstract: We present a complete 3D in-hand scanning system that allows users to scan objects by simply turning them freely in front of a real-time 3D range scanner. The 3D object model is reconstructed online as a point cloud by registering and integrating the incoming 3D patches with the online 3D model. The accumulation of registration errors leads to the well-known loop closure problem. We address this issue already during the scanning session by distorting the object as rigidly as possible. Scanning errors are removed by explicitly handling outliers. As a result of our proposed online modeling and error handling procedure, the online model is of sufficiently high quality to serve as the final model. Thus, no additional post-processing is required which might lead to artifacts in the model reconstruction. We demonstrate our approach on several difficult real-world objects and quantitatively evaluate the resulting modeling accuracy.

127 citations


Proceedings ArticleDOI
01 Jan 2009
TL;DR: The efficiency of the retrieval process is optimized by creating more compact and precise indices for visual vocabularies using background information obtained in the crawling stage of the system.
Abstract: The state-of-the-art in visual object retrieval from large databases makes it possible to search millions of images on the object level. Recently, complementary works have proposed systems to crawl large object databases from community photo collections on the Internet. We combine these two lines of work into a large-scale system for auto-annotation of holiday snaps. The resulting method allows for the fully automatic labeling of objects such as landmark buildings, scenes, and pieces of art at the object level. The labeling is multi-modal and consists of textual tags, geographic location, and related content on the Internet. Furthermore, the efficiency of the retrieval process is optimized by creating more compact and precise indices for visual vocabularies using background information obtained in the crawling stage of the system. We demonstrate the scalability and precision of the proposed method by conducting experiments on millions of images downloaded from community photo collections on the Internet.
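The visual-vocabulary indices mentioned above follow the standard inverted-index retrieval pattern; a toy sketch (identifiers are hypothetical, and real systems add tf-idf weighting, quantisation, and geometric verification):

```python
from collections import defaultdict

def build_index(db):
    """Inverted index over quantised visual words: each word maps to the
    set of images containing it, so a query only touches images that
    share at least one word with it."""
    index = defaultdict(set)
    for img_id, words in db.items():
        for w in words:
            index[w].add(img_id)
    return index

def query(index, words):
    """Rank database images by the number of distinct visual words they
    share with the query (a stand-in for tf-idf scoring)."""
    votes = defaultdict(int)
    for w in set(words):
        for img_id in index.get(w, ()):
            votes[img_id] += 1
    return sorted(votes, key=votes.get, reverse=True)
```

Making the index "more compact and precise", as the abstract puts it, would amount to pruning uninformative words before `build_index` using the statistics gathered while crawling.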

108 citations



Proceedings ArticleDOI
01 Jan 2009
TL;DR: This work presents a data-driven, unsupervised method for unusual scene detection from static webcams; based on simple image features, it detects plausible unusual scenes that have not been observed in the data-stream before.
Abstract: We present a data-driven, unsupervised method for unusual scene detection from static webcams. Such time-lapse data is usually captured with very low or varying framerate. This precludes the use of tools typically used in surveillance (e.g., object tracking). Hence, our algorithm is based on simple image features. We define usual scenes based on the concept of meaningful nearest neighbours instead of building explicit models. To effectively compare the observations, our algorithm adapts the data representation. Furthermore, we use incremental learning techniques to adapt to changes in the data-stream. Experiments on several months of webcam data show that our approach detects plausible unusual scenes, which have not been observed in the data-stream before.
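The "meaningful nearest neighbours" idea can be sketched as a k-th-nearest-neighbour novelty score over past feature vectors: usual scenes have close neighbours in the history, unusual ones do not. This is a toy version (the function name is invented; the paper additionally adapts the data representation and learns incrementally):

```python
def unusualness(sample, history, k=3):
    """Distance from a new observation to its k-th nearest neighbour in
    the history of past feature vectors; large values flag unusual scenes."""
    def dist(a, b):
        # Euclidean distance between two equal-length feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    dists = sorted(dist(sample, h) for h in history)
    return dists[min(k, len(dists)) - 1]
```

Thresholding this score (or ranking frames by it) yields candidate unusual scenes without ever building an explicit model of "usual".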

75 citations


Journal ArticleDOI
TL;DR: It is argued that procedural modeling technology based on shape grammars provides an interesting alternative to transparency- or decoloration-based uncertainty visualization, as such measures tend to spoil the experience for the observer.
Abstract: The rapid development of computer graphics and imaging provides the modern archeologist with several tools to realistically model and visualize archeological sites in 3D. This, however, creates a tension between veridical and realistic modeling. Visually compelling models may lead people to falsely believe that there exists very precise knowledge about the past appearance of a site. In order to make the underlying uncertainty visible, it has been proposed to encode this uncertainty with different levels of transparency in the rendering, or of decoloration of the textures. We argue that procedural modeling technology based on shape grammars provides an interesting alternative to such measures, as they tend to spoil the experience for the observer. Both its efficiency and compactness make procedural modeling a tool to produce multiple models, which together sample the space of possibilities. Variations between the different models express levels of uncertainty implicitly, while letting each individual model keep its realistic appearance. The underlying, structural description makes the uncertainty explicit. Additionally, procedural modeling also yields the flexibility to incorporate changes as knowledge of an archeological site gets refined. Annotations explaining modeling decisions can be included. We demonstrate our procedural modeling implementation with several recent examples.
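Sampling the space of possibilities with a shape grammar can be sketched as a stochastic derivation: each run randomly picks among production rules, so repeated runs yield model variants whose differences express uncertainty implicitly. The rules and symbols below are invented for illustration and bear no relation to the authors' grammar:

```python
import random

def derive(symbol, rules, rng, depth=3):
    """Expand a shape-grammar symbol by randomly choosing among its
    production rules; symbols without rules (or at max depth) are
    treated as terminals."""
    if depth == 0 or symbol not in rules:
        return [symbol]
    production = rng.choice(rules[symbol])  # sample one possible expansion
    out = []
    for s in production:
        out.extend(derive(s, rules, rng, depth - 1))
    return out
```

Running `derive` several times with different seeds produces the "multiple models" the abstract describes; in a real procedural system each terminal would carry geometry rather than a label.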

Proceedings ArticleDOI
01 Jan 2009
TL;DR: Based on a regularized face model, unsupervised face alignment is framed within the Lucas-Kanade image registration approach, and a robust optimization scheme to handle appearance variations is proposed.
Abstract: We propose a novel approach to unsupervised facial image alignment. Differently from previous approaches, that are confined to affine transformations on either the entire face or separate patches, we extract a nonrigid mapping between facial images. Based on a regularized face model, we frame unsupervised face alignment into the Lucas-Kanade image registration approach. We propose a robust optimization scheme to handle appearance variations. The method is fully automatic and can cope with pose variations and expressions, all in an unsupervised manner. Experiments on a large set of images showed that the approach is effective.

Proceedings ArticleDOI
01 Jan 2009
TL;DR: This work is the first to extend the exemplar-based approach using local features into the spatio-temporal domain, which makes it possible to avoid the problems that typically plague sliding window-based approaches, in particular the exhaustive search over spatial coordinates, time, and spatial as well as temporal scales.
Abstract: In this work, we present a method for action localization and recognition using an exemplar-based approach. It starts from local dense yet scale-invariant spatio-temporal features. The most discriminative visual words are selected and used to cast bounding box hypotheses, which are then verified and further grouped into the final detections. To the best of our knowledge, we are the first to extend the exemplar-based approach using local features into the spatio-temporal domain. This allows us to avoid the problems that typically plague sliding window-based approaches - in particular the exhaustive search over spatial coordinates, time, and spatial as well as temporal scales. We report state-of-the-art results on challenging datasets, extracted from real movies, for both classification and localization.

Proceedings ArticleDOI
01 Sep 2009
TL;DR: A GPU-oriented bitwise fast voting method is proposed to effectively improve the matching accuracy, which is enormously faster than the histogram-based approach, efficiently exploiting the computing resources of GPUs.
Abstract: This paper proposes a real-time design for accurate stereo matching on Compute Unified Device Architecture (CUDA). We adopt a leading local algorithm for its high data parallelism. A GPU-oriented bitwise fast voting method is proposed to effectively improve the matching accuracy, which is enormously faster than the histogram-based approach. The whole algorithm is parallelized on CUDA at a fine granularity, efficiently exploiting the computing resources of GPUs. On-chip shared memory is utilized to alleviate the latency of memory accesses. Compared to the CPU counterpart, our design attains a speedup factor of 52. With high matching accuracy, the proposed design is still among the fastest stereo methods on GPUs. These advantages in speed and accuracy make our design well suited for practical applications such as robotics systems and multiview teleconferencing.

Proceedings ArticleDOI
20 Oct 2009
TL;DR: An architecture for a multi-camera, multi-resolution surveillance system is described, supporting a set of distributed static and pan-tilt-zoom cameras and visual tracking algorithms, together with a central supervisor unit.
Abstract: We describe an architecture for a multi-camera, multi-resolution surveillance system. The aim is to support a set of distributed static and pan-tilt-zoom (PTZ) cameras and visual tracking algorithms, together with a central supervisor unit. Each camera (and possibly pan-tilt device) has a dedicated process and processor. Asynchronous interprocess communications and archiving of data are achieved in a simple and effective way via a central repository, implemented using an SQL database.

Journal ArticleDOI
TL;DR: In this paper, a generative model of the relationship of body pose and image appearance using a sparse kernel regressor is proposed to track through poorly segmented low-resolution image sequences where tracking otherwise fails.
Abstract: We present a method to simultaneously estimate 3D body pose and action categories from monocular video sequences. Our approach learns a generative model of the relationship of body pose and image appearance using a sparse kernel regressor. Body poses are modelled on a low-dimensional manifold obtained by Locally Linear Embedding dimensionality reduction. In addition, we learn a prior model of likely body poses and a dynamical model in this pose manifold. Sparse kernel regressors capture the nonlinearities of this mapping efficiently. Within a Recursive Bayesian Sampling framework, the potentially multimodal posterior probability distributions can then be inferred. An activity-switching mechanism based on learned transfer functions allows for inference of the performed activity class, along with the estimation of body pose and 2D image location of the subject. Using a rough foreground segmentation, we compare Binary PCA and distance transforms to encode the appearance. As a postprocessing step, the globally optimal trajectory through the entire sequence is estimated, yielding a single pose estimate per frame that is consistent throughout the sequence. We evaluate the algorithm on challenging sequences with subjects that are alternating between running and walking movements. Our experiments show how the dynamical model helps to track through poorly segmented low-resolution image sequences where tracking otherwise fails, while at the same time reliably classifying the activity type.

Proceedings ArticleDOI
01 Jan 2009
TL;DR: This paper makes the connection between sliding-window and Hough-based object detection explicit and shows that the feature-centric view of the latter also nicely fits with the branch and bound paradigm, while it avoids the ESS memory tradeoff.
Abstract: Many object detection systems rely on linear classifiers embedded in a sliding-window scheme. Such exhaustive search involves massive computation. Efficient Subwindow Search (ESS) [11] avoids this by means of branch and bound. However, ESS makes an unfavourable memory tradeoff. Memory usage scales with both image size and overall object model size. This risks becoming prohibitive in a multiclass system. In this paper, we make the connection between sliding-window and Hough-based object detection explicit. Then, we show that the feature-centric view of the latter also nicely fits with the branch and bound paradigm, while it avoids the ESS memory tradeoff. Moreover, on-line integral image calculations are not needed. Both theoretical and quantitative comparisons with the ESS bound are provided, showing that none of this comes at the expense of performance.
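The branch-and-bound idea behind ESS-style detection can be illustrated in one dimension: candidate sets of intervals are split and pruned with an optimistic bound instead of being enumerated exhaustively. This is a toy analogue for intuition only, not the paper's feature-centric algorithm:

```python
import heapq

def bb_best_interval(scores):
    """Best-first branch-and-bound for the contiguous interval [l, r] of
    a 1-D score array with maximal total score.  A candidate set is a
    range (a..b) for l and (c..d) for r; it is bounded by the sum of
    positive scores any of its intervals could contain."""
    n = len(scores)
    ppos, psum = [0.0], [0.0]
    for s in scores:
        ppos.append(ppos[-1] + max(s, 0.0))  # prefix sums of positive part
        psum.append(psum[-1] + s)            # plain prefix sums

    def key(a, b, c, d):
        if a == b and c == d:                # single interval: exact score
            return psum[d + 1] - psum[a]
        return ppos[d + 1] - ppos[a]         # optimistic upper bound

    heap = [(-key(0, n - 1, 0, n - 1), 0, n - 1, 0, n - 1)]
    while heap:
        negk, a, b, c, d = heapq.heappop(heap)
        if a == b and c == d:                # first singleton popped is optimal
            return a, c, -negk
        if b - a >= d - c:                   # split the larger parameter range
            m = (a + b) // 2
            parts = [(a, m, c, d), (m + 1, b, c, d)]
        else:
            m = (c + d) // 2
            parts = [(a, b, c, m), (a, b, m + 1, d)]
        for p in parts:
            if p[0] <= p[3]:                 # some valid l <= r remains
                heapq.heappush(heap, (-key(*p),) + p)
```

ESS applies the same scheme to 2-D rectangles (four interval parameters instead of two) with bounds derived from the classifier score.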

01 Jan 2009
TL;DR: This work proposes a method using different types of context in order to collect scene-specific samples from both the background and the object class over time, which can robustly adapt to the scene without drifting.
Abstract: Generic person detection is an ill-posed problem as context is widely ignored. Local context can be used to split the generic detection task into easier sub-problems, which was recently explored by classifier grids. The detection problem gets simplified spatially by training separate classifiers for each possible location in the image. So far, adaptive grid-based approaches only focused on exploring the specific background class. In contrast, we propose a method using different types of context in order to collect scene-specific samples from both the background and the object class over time. These samples are used to update the specific object detectors. By limiting label noise and avoiding direct feedback loops, our system can robustly adapt to the scene without drifting. Results on the PETS 2009 dataset show significantly improved person detections, especially during static and dynamic occlusions (e.g., lamp poles and crowded scenes).

Proceedings Article
01 Jan 2009
TL;DR: In this article, the authors present a complete 3D in-hand scanning system that allows users to scan objects by simply turning them freely in front of a real-time 3D range scanner.
Abstract: We present a complete 3D in-hand scanning system that allows users to scan objects by simply turning them freely in front of a real-time 3D range scanner. The 3D object model is reconstructed online as a point cloud by registering and integrating the incoming 3D patches with the online 3D model. The accumulation of registration errors leads to the well-known loop closure problem. We address this issue already during the scanning session by distorting the object as rigidly as possible. Scanning errors are removed by explicitly handling outliers. As a result of our proposed online modeling and error handling procedure, the online model is of sufficiently high quality to serve as the final model. Thus, no additional post-processing is required which might lead to artifacts in the model reconstruction. We demonstrate our approach on several difficult real-world objects and quantitatively evaluate the resulting modeling accuracy.

Proceedings ArticleDOI
01 Jan 2009
TL;DR: A framework is proposed which overcomes the probabilistic shortcomings of the Implicit Shape Model and gives a sound justification to the voting procedure; it is shown that it is sufficient to use soft-matching during learning only and to perform fast nearest neighbour matching at recognition time (where speed is of prime importance).
Abstract: This paper addresses the problem of object detection by means of the Generalised Hough transform paradigm. The Implicit Shape Model (ISM) is a well-known approach based on this idea. It made this paradigm popular and has been adopted many times. Although the algorithm exhibits robust detection performance, its description, i.e. its probabilistic model, involves arguments which are unsatisfactory from a probabilistic standpoint. We propose a framework which overcomes these problems and gives a sound justification to the voting procedure. Furthermore, our framework allows for a formal understanding of the heuristic of soft-matching commonly used in visual vocabulary systems. We show that it is sufficient to use soft-matching during learning only and to perform fast nearest neighbour matching at recognition time (where speed is of prime importance). Our implementation is based on Gaussian Mixture Models (instead of kernel density estimators as with ISM) which lead to a fast gradient-based object detector.
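The Generalised Hough voting at the core of ISM-style detection can be sketched as follows: each local feature matched to a visual word casts weighted votes for the object centre, and the accumulator maximum is taken as the hypothesis. This toy version uses hard matching and a discrete accumulator; the codebook and identifiers are hypothetical:

```python
from collections import defaultdict

def hough_detect(features, codebook, cell=1):
    """features : list of (x, y, word) local features
    codebook    : word -> list of (dx, dy, weight) centre offsets learned
                  from training examples
    Returns the accumulator cell with the most votes and its score."""
    acc = defaultdict(float)
    for x, y, word in features:
        for dx, dy, w in codebook.get(word, []):
            # Each stored offset votes for the object centre it implies.
            cx, cy = (x + dx) // cell, (y + dy) // cell
            acc[(cx, cy)] += w
    if not acc:
        return None
    centre = max(acc, key=acc.get)
    return centre, acc[centre]
```

The paper's contribution is, roughly, to place such votes on a sound probabilistic footing (via Gaussian Mixture Models rather than kernel density estimates), which this sketch does not attempt.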

Journal ArticleDOI
TL;DR: A novel approach to markerless real-time pose recognition in a multicamera setup is presented and Average Neighborhood Margin Maximization (ANMM) is introduced as a powerful new technique to train Haar-like features.
Abstract: This article presents a novel approach to markerless real-time pose recognition in a multicamera setup. Body pose is retrieved using example-based classification based on Haar wavelet-like features to allow for real-time pose recognition. Average Neighborhood Margin Maximization (ANMM) is introduced as a powerful new technique to train Haar-like features. The rotation invariant approach is implemented for both 2D classification based on silhouettes, and 3D classification based on visual hulls.


Proceedings ArticleDOI
01 Jan 2009
TL;DR: The proposed method computes the convolution for every vertex in a model, so it incorporates dense feature matching as opposed to sparse matching based on certain feature descriptors; the approach is analogous to window matching in 2D image registration.
Abstract: This paper describes a method for registering deformable 3D objects. When an object such as a hand deforms, the deformation of the local shape is small, whereas the global shape deforms to a greater extent in many cases. Therefore, the local shape can be used as a feature for matching corresponding points. Instead of using a descriptor of the local shape, we introduce the convolution of the error between corresponding points for each vertex of a 3D mesh model. This approach is analogous to window matching in 2D image registration. Since the proposed method computes the convolution for every vertex in a model, it incorporates dense feature matching as opposed to sparse matching based on certain feature descriptors. Through experiments, we show that the convolution is useful for finding corresponding points and evaluate the accuracy of the registration.

01 Jan 2009
TL;DR: An algorithm for multi-person tracking-by-detection in a particle filtering framework that tightly couples object detection, classification, and tracking components and robustly tracks a variable number of dynamically moving persons in complex scenes with occlusions is presented.
Abstract: We present an algorithm for multi-person tracking-by-detection in a particle filtering framework. To address the unreliability of current state-of-the-art object detectors, our algorithm tightly couples object detection, classification, and tracking components. Instead of relying only on the final, sparse output from a detector, we additionally employ its continuous intermediate output to impart our approach with more flexibility to handle difficult situations. The resulting algorithm robustly tracks a variable number of dynamically moving persons in complex scenes with occlusions. The approach does not rely on background modeling and is based only on 2D information from a single camera, not requiring any camera or ground plane calibration. We evaluate the algorithm on the PETS’09 tracking dataset and discuss the importance of the different algorithm components to robustly handle difficult situations.

Proceedings ArticleDOI
07 Nov 2009
TL;DR: An efficient stereo algorithm with NCC over shape-adaptive matching regions is proposed, producing depth-discontinuity preserving disparity maps while retaining the advantage of robustness to radiometric differences.
Abstract: Normalized Cross-Correlation (NCC) is a common matching technique to tolerate radiometric differences between stereo images. However, traditional rectangle-based NCC tends to blur the depth discontinuities. This paper proposes an efficient stereo algorithm with NCC over shape-adaptive matching regions, producing depth-discontinuity preserving disparity maps while retaining the advantage of robustness to radiometric differences. To alleviate the computational intensity, we propose an acceleration algorithm using an orthogonal integral image technique, achieving a speedup factor of 10 to 27. In addition, a voting scheme on reliable estimates is applied to refine the initial estimates. Experiments show that, besides the robustness, the proposed method obtains accurate disparity maps at fast speed. Our method ranks highly among the local approaches in the Middlebury stereo benchmark.
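NCC itself is gain- and offset-invariant, which is what makes it robust to radiometric differences. A minimal sketch over flattened pixel windows (illustrative only, without the paper's shape-adaptive regions or integral-image acceleration):

```python
def ncc(left, right):
    """Normalised cross-correlation of two equally-sized pixel windows,
    given as flat lists; invariant to affine intensity changes."""
    n = len(left)
    ml = sum(left) / n
    mr = sum(right) / n
    num = sum((a - ml) * (b - mr) for a, b in zip(left, right))
    dl = sum((a - ml) ** 2 for a in left) ** 0.5
    dr = sum((b - mr) ** 2 for b in right) ** 0.5
    return num / (dl * dr) if dl and dr else 0.0

def best_disparity(left_win, right_wins):
    """Winner-takes-all: pick the disparity whose right-image window
    maximises NCC against the left-image window."""
    return max(range(len(right_wins)), key=lambda d: ncc(left_win, right_wins[d]))
```

A full stereo matcher would run `best_disparity` per pixel over windows extracted at each candidate disparity; the paper's contribution is making those windows shape-adaptive while keeping the computation fast.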

Proceedings ArticleDOI
01 Jan 2009
TL;DR: A novel method for mouth localization is presented in the context of multimodal speech recognition, where audio and visual cues are fused to improve the speech recognition accuracy; its superior accuracy and quantitative improvements for audio-visual speech recognition over monomodal approaches are demonstrated.
Abstract: We present a novel method for mouth localization in the context of multimodal speech recognition where audio and visual cues are fused to improve the speech recognition accuracy. While facial feature points like mouth corners or lip contours are commonly used to estimate at least scale, position, and orientation of the mouth, we propose a Hough transform-based method. Instead of relying on a predefined sparse subset of mouth features, it casts probabilistic votes for the mouth center from several patches in the neighborhood and accumulates the votes in a Hough image. This makes the localization more robust as it does not rely on the detection of a single feature. In addition, we exploit the different shape properties of eyes and mouth in order to localize the mouth more efficiently. Using the rotation invariant representation of the iris, scale and orientation can be efficiently inferred from the localized eye positions. The superior accuracy of our method and quantitative improvements for audio-visual speech recognition over monomodal approaches are demonstrated on two datasets.

Journal ArticleDOI
TL;DR: This paper studies one particular form of cognitive feedback, where the ability to recognize objects of a given category is exploited to infer different kinds of meta-data annotations for images of previously unseen object instances, in particular information on 3D shape.

Journal ArticleDOI
TL;DR: This work presents a system that is able to recognize objects of a certain class in an image and to identify their parts for potential interactions and presents experimental results on wheelchairs, cars, and motorbikes.
Abstract: In the transition from industrial to service robotics, robots will have to deal with increasingly unpredictable and variable environments. We present a system that is able to recognize objects of a certain class in an image and to identify their parts for potential interactions. The method can recognize objects from arbitrary viewpoints and generalizes to instances that have never been observed during training, even if they are partially occluded and appear against cluttered backgrounds. Our approach builds on the implicit shape model of Leibe et al. We extend it to couple recognition to the provision of meta-data useful for a task and to the case of multiple viewpoints by integrating it with the dense multi-view correspondence finder of Ferrari et al. Meta-data can be part labels but also depth estimates, information on material types, or any other pixelwise annotation. We present experimental results on wheelchairs, cars, and motorbikes.

Proceedings ArticleDOI
01 Jan 2009
TL;DR: An approach for unusual event detection is presented, based on a tree of trackers: a better informed tracker normally performs more robustly, but when unusual events occur and the normal assumptions about the world no longer hold, a less informed tracker has a good chance of performing better, and this performance inversion signals the event.
Abstract: We present an approach for unusual event detection, based on a tree of trackers. At lower levels, the trackers are trained on broad classes of targets. At higher levels, they aim at more specific targets. For instance, at the root, a general blob tracker could operate which may track any object. The next level could already use information about human appearance to better track people. A further level could go after specific types of actions like walking, running, or sitting. Yet another level up, several walking trackers can be tuned to the gait of a particular person each. Thus, at each layer, one or more families of more specific trackers are available. As long as the target behaves according to expectations, a member of a higher up such family will be better tuned to the data than its parent tracker at a lower level. Typically, a better informed tracker performs more robustly. But in cases where unusual events occur and the normal assumptions about the world no longer hold, they lose their reliability. In such cases, a less informed tracker, not relying on what has now become false information, has a good chance of performing better. Such a performance inversion signals an unusual event. Inversions between levels higher up represent deviations that are semantically more subtle than inversions lower down: for instance, an unknown intruder entering a house rather than a non-human target.

Proceedings ArticleDOI
01 Jan 2009
TL;DR: This work addresses the problem of tracking humans with skeleton-based shape models, where video footage is acquired by multiple cameras and the shape deformations are parameterized by the skeleton, and provides guidance on algorithm design for different applications related to human motion capture.
Abstract: This work addresses the problem of tracking humans with skeleton-based shape models where video footage is acquired by multiple cameras. Since the shape deformations are parameterized by the skeleton, the position, orientation, and configuration of the human skeleton are estimated such that the deformed shape model is best explained by the image data. To solve this problem, several algorithms have been proposed over the last years. The approaches usually rely on filtering, local optimization, or global optimization. The global optimization algorithms can be further divided into single hypothesis (SHO) and multiple hypothesis optimization (MHO). We briefly compare the underlying mathematical models and evaluate the performance of one representative algorithm for each class. Furthermore, we compare several likelihoods and parameter settings with respect to accuracy and computation cost. A thorough evaluation is performed on two sequences with uncontrolled lighting conditions and non-static background. In addition, we demonstrate the impact of the likelihood on the HumanEva benchmark. Our results provide guidance on algorithm design for different applications related to human motion capture.