
Showing papers by "Luc Van Gool published in 2011"


Proceedings ArticleDOI
20 Jun 2011
TL;DR: This work addresses the problem of head pose estimation from depth data, which can be captured using the ever more affordable 3D sensing technologies available today, by proposing to use random regression forests for the task at hand.
Abstract: Fast and reliable algorithms for estimating the head pose are essential for many applications and higher-level face analysis tasks. We address the problem of head pose estimation from depth data, which can be captured using the ever more affordable 3D sensing technologies available today. To achieve robustness, we formulate pose estimation as a regression problem. While detecting specific face parts like the nose is sensitive to occlusions, learning the regression on rather generic surface patches requires an enormous amount of training data in order to achieve accurate estimates. We propose to use random regression forests for the task at hand, given their capability to handle large training datasets. Moreover, we synthesize a large amount of annotated training data using a statistical model of the human face. In our experiments, we show that our approach can handle real data presenting large pose changes, partial occlusions, and facial expressions, even though it is trained only on synthetic neutral face data. We have thoroughly evaluated our system on a publicly available database on which we achieve state-of-the-art performance without having to resort to the graphics card.
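
As an illustration of the voting stage described above: each depth patch is routed through every tree and the leaf statistics are pooled into a single pose estimate. This is a minimal sketch; the `forest`/`tree.route` interface, the leaf fields, and the variance threshold are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

MAX_LEAF_TRACE = 400.0  # hypothetical confidence threshold on the leaf covariance

def estimate_head_pose(depth_patches, forest):
    """Pool per-patch leaf votes into one (x, y, z, yaw, pitch, roll) estimate."""
    votes = []
    for patch in depth_patches:
        for tree in forest:
            leaf = tree.route(patch)  # binary tests on depth-difference features
            # each leaf stores the mean and covariance of the training targets
            # that reached it; high-variance leaves are dropped as uninformative
            if np.trace(leaf.cov) < MAX_LEAF_TRACE:
                votes.append(leaf.mean)
    # the paper clusters the votes for robustness; a plain mean is used here
    return np.mean(np.asarray(votes), axis=0)
```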

501 citations


Book ChapterDOI
31 Aug 2011
TL;DR: A system for estimating the location and orientation of a person's head from depth data acquired by a low-quality device, based on discriminative random regression forests: ensembles of random trees trained by splitting each node so as to simultaneously reduce the entropy of the class-label distribution and the variance of the head position and orientation.
Abstract: We present a system for estimating the location and orientation of a person's head from depth data acquired by a low-quality device. Our approach is based on discriminative random regression forests: ensembles of random trees trained by splitting each node so as to simultaneously reduce the entropy of the class-label distribution and the variance of the head position and orientation. We evaluate three different approaches to jointly take classification and regression performance into account during training. For evaluation, we acquired a new dataset and propose a method for its automatic annotation.
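
As a rough sketch of the joint split criterion just described, a candidate split can be scored by the weighted class-label entropy plus the weighted variance of the regression targets in the two children. The trade-off weight `alpha` is a hypothetical parameter standing in for the three weighting schemes the paper evaluates.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def split_score(left, right, alpha=1.0):
    """Score a candidate split; lower is better. Each side is (labels, targets),
    where targets holds the head position and orientation vectors."""
    total = len(left[0]) + len(right[0])
    score = 0.0
    for labels, targets in (left, right):
        w = len(labels) / total
        # class entropy (classification) + summed target variance (regression)
        score += w * (entropy(labels) + alpha * np.var(targets, axis=0).sum())
    return score
```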

336 citations


Proceedings ArticleDOI
Danfeng Qin1, Stephan Gammeter1, Lukas Bossard1, Till Quack1, Luc Van Gool1 
20 Jun 2011
TL;DR: This paper introduces a simple yet effective method to improve visual word based image retrieval based on an analysis of the k-reciprocal nearest neighbor structure in the image space and demonstrates a significant improvement over standard bag-of-words retrieval.
Abstract: This paper introduces a simple yet effective method to improve visual word based image retrieval. Our method is based on an analysis of the k-reciprocal nearest neighbor structure in the image space. At query time the information obtained from this process is used to treat different parts of the ranked retrieval list with different distance measures. This leads effectively to a re-ranking of retrieved images. As we will show, this has two benefits: first, using different similarity measures for different parts of the ranked list allows for compensation of the “curse of dimensionality”. Second, it allows for dealing with the uneven distribution of images in the data space. Dealing with both challenges has a very beneficial effect on retrieval accuracy. Furthermore, a major part of the process happens offline, so it does not affect speed at retrieval time. Finally, the method operates on the bag-of-words level only, so it could be combined with additional measures on, e.g., the descriptor level or feature geometry, leaving room for further improvement. We evaluate our approach on common object retrieval benchmarks and demonstrate a significant improvement over standard bag-of-words retrieval.
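
A minimal sketch of the k-reciprocal nearest-neighbour structure the re-ranking builds on: image j is a k-reciprocal neighbour of the query q only if each appears in the other's k nearest neighbours. A precomputed distance matrix over bag-of-words vectors is assumed here; in the paper, most of this analysis happens offline.

```python
import numpy as np

def knn(dist, i, k):
    """Indices of the k nearest neighbours of image i (skipping i itself)."""
    return set(np.argsort(dist[i])[1:k + 1])

def k_reciprocal_neighbours(dist, q, k):
    """Images that rank q among their own top-k, and vice versa."""
    return {j for j in knn(dist, q, k) if q in knn(dist, j, k)}
```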

333 citations


Proceedings ArticleDOI
05 Jan 2011
TL;DR: A real-time hand gesture interaction system is improved by augmenting it with a ToF camera, allowing for complex 3D gestures without being disturbed by objects or persons in the background.
Abstract: Time-of-Flight (ToF) and other IR-based cameras that register depth are becoming more and more affordable in consumer electronics. This paper aims to improve a real-time hand gesture interaction system by augmenting it with a ToF camera. First, the ToF camera and the RGB camera are calibrated, and a mapping is made from the depth data to the RGB image. Then, a novel hand detection algorithm is introduced based on depth and color. This not only improves detection rates, but also allows for the hand to overlap with the face, or with hands from other persons in the background. The hand detection algorithm is evaluated in these settings, and compared to previous algorithms. Furthermore, the depth information allows us to track the position of the hand in 3D, allowing for more interesting modes of interaction. Finally, the hand gesture recognition algorithm is applied to the depth data as well, and compared to the recognition based on the RGB images. The result is a real-time hand gesture interaction system that allows for complex 3D gestures and is not disturbed by objects or persons in the background.

311 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: This novel approach “imagines” an actor performing an action typical for the target object class, instead of relying purely on the visual object appearance, and handles function as a cue complementary to appearance, rather than being a consideration after appearance-based detection.
Abstract: Many object classes are primarily defined by their functions. However, this fact has been left largely unexploited by visual object categorization or detection systems. We propose a method to learn an affordance detector. It identifies locations in the 3d space which “support” the particular function. Our novel approach “imagines” an actor performing an action typical for the target object class, instead of relying purely on the visual object appearance. So, function is handled as a cue complementary to appearance, rather than being a consideration after appearance-based detection. Experimental results are given for the functional category “sitting”. Such affordance is tested on a 3d representation of the scene, as can be realistically obtained through SfM or depth cameras. In contrast to appearance-based object detectors, affordance detection requires only very few training examples and generalizes very well to other sittable objects like benches or sofas when trained on a few chairs.

283 citations


Proceedings ArticleDOI
30 Aug 2011
TL;DR: A Haarlet-based hand gesture recognition system is implemented to detect hand gestures in any orientation, in particular pointing gestures, while extracting the 3D pointing direction.
Abstract: This paper implements a real-time hand gesture recognition algorithm based on the inexpensive Kinect sensor. The use of a depth sensor allows for complex 3D gestures where the system is robust to disturbing objects or persons in the background. A Haarlet-based hand gesture recognition system is implemented to detect hand gestures in any orientation, in particular pointing gestures, while extracting the 3D pointing direction. The system is integrated on an interactive robot (based on ROS), allowing for real-time hand gesture interaction with the robot. Pointing gestures are translated into goals for the robot, telling it where to go. A demo scenario is presented where the robot looks for persons to interact with, asks for directions, and then detects a 3D pointing direction. The robot then explores its vicinity in the given direction and looks for a new person to interact with.
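
As a sketch of how a detected pointing gesture can become a navigation goal, the pointing ray through two tracked 3D joints can be intersected with the ground plane. The elbow/hand joints and the z-up floor at height 0 are assumptions made for this illustration; the paper extracts the 3D direction from its Haarlet-based recognition system.

```python
import numpy as np

def pointing_goal(elbow, hand, ground_z=0.0):
    """elbow, hand: 3D joint positions in robot coordinates (z up).
    Returns the point on the floor the user points at, or None."""
    direction = (hand - elbow) / np.linalg.norm(hand - elbow)
    if direction[2] >= 0:          # ray must point downwards to hit the floor
        return None
    t = (ground_z - hand[2]) / direction[2]
    return hand + t * direction    # goal sent to the robot's navigation stack
```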

192 citations


Proceedings ArticleDOI
01 Jan 2011
TL;DR: Comparing pose-based, appearance-based and combined pose and appearance features for action recognition in a home-monitoring scenario shows that pose-based features outperform low-level appearance features, even when heavily corrupted by noise, suggesting that pose estimation is beneficial for the action recognition task.
Abstract: Early works on human action recognition focused on tracking and classifying articulated body motions. Such methods required accurate localisation of body parts, which is a difficult task, particularly under realistic imaging conditions. As such, recent trends have shifted towards the use of more abstract, low-level appearance features such as spatio-temporal interest points. Motivated by the recent progress in pose estimation, we feel that pose-based action recognition systems warrant a second look. In this paper, we address the question of whether pose estimation is useful for action recognition or if it is better to train a classifier only on low-level appearance features drawn from video data. We compare pose-based, appearance-based and combined pose and appearance features for action recognition in a home-monitoring scenario. Our experiments show that pose-based features outperform low-level appearance features, even when heavily corrupted by noise, suggesting that pose estimation is beneficial for the action recognition task.

191 citations


Proceedings ArticleDOI
01 Jan 2011
TL;DR: A system for recognizing letters and finger-spelled words of the American sign language (ASL) in real-time that is based on average neighborhood margin maximization and relies on the segmented depth data of the hands.
Abstract: In this work, we present a system for recognizing letters and finger-spelled words of the American sign language (ASL) in real-time. To this end, the system segments the hand and estimates the hand orientation from captured depth data. The letter classification is based on average neighborhood margin maximization and relies on the segmented depth data of the hands. For word recognition, the letter confidences are aggregated. Furthermore, the word recognition is used to improve the letter recognition by updating the training examples of the letter classifiers on-line.
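
One way to picture the word-level aggregation: per-frame letter confidences are accumulated as log-scores against every lexicon entry and the best word wins. The uniform letter-to-frame alignment below is a simplifying assumption made only for this sketch, not the paper's aggregation scheme.

```python
import numpy as np

def recognize_word(frame_conf, lexicon, alphabet):
    """frame_conf: (num_frames, num_letters) per-frame letter confidences."""
    idx = {c: i for i, c in enumerate(alphabet)}
    best_word, best_score = None, -np.inf
    for word in lexicon:
        # naively split the frames evenly among the word's letters
        chunks = np.array_split(np.asarray(frame_conf), len(word))
        score = sum(np.log(chunk[:, idx[ch]] + 1e-9).sum()
                    for ch, chunk in zip(word, chunks))
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```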

103 citations


Journal ArticleDOI
TL;DR: This paper addresses the task of efficient object class detection by means of the Hough transform and demonstrates PRISM's flexibility with two complementary implementations: a generatively trained Gaussian mixture model and a discriminatively trained histogram approach.
Abstract: This paper addresses the task of efficient object class detection by means of the Hough transform. This approach has been made popular by the Implicit Shape Model (ISM) and has been adopted many times. Although ISM exhibits robust detection performance, its probabilistic formulation is unsatisfactory. The PRincipled Implicit Shape Model (PRISM) overcomes these problems by interpreting Hough voting as a dual implementation of linear sliding-window detection. It thereby gives a sound justification to the voting procedure and imposes minimal constraints. We demonstrate PRISM's flexibility by two complementary implementations: a generatively trained Gaussian Mixture Model as well as a discriminatively trained histogram approach. Both systems achieve state-of-the-art performance. Detections are found by gradient-based or branch and bound search, respectively. The latter greatly benefits from PRISM's feature-centric view. It thereby avoids the unfavourable memory trade-off and any on-line pre-processing of the original Efficient Subwindow Search (ESS). Moreover, our approach takes account of the features' scale value while ESS does not. Finally, we show how to avoid soft-matching and spatial pyramid descriptors during detection without losing their positive effect. This makes algorithms simpler and faster. Both are possible if the object model is properly regularised and we discuss a modification of SVMs which allows for doing so.
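
The feature-centric view that PRISM formalises can be pictured as follows: the sliding-window score at object position x decomposes into per-feature contributions w(f, x - x_f), so each feature simply adds its weight function into a voting space. `weight_fn` is a stand-in for the learned Gaussian mixture or histogram model, and the exhaustive loops are illustrative only, not the paper's gradient-based or branch-and-bound search.

```python
import numpy as np

def hough_score_map(features, weight_fn, grid_shape):
    """features: list of (descriptor, (y, x)) pairs; returns the voting space."""
    score = np.zeros(grid_shape)
    for desc, (fy, fx) in features:
        for y in range(grid_shape[0]):        # O(features x positions): a sketch,
            for x in range(grid_shape[1]):    # not the efficient search in the paper
                score[y, x] += weight_fn(desc, (y - fy, x - fx))
    return score  # local maxima correspond to detections
```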

79 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: A method for capturing human motion in real-time is developed, which is used to temporally segment the depth streams into actions and automatically localize the object that is manipulated within a video segment, and categorize it using the corresponding action.
Abstract: Unsupervised categorization of objects is a fundamental problem in computer vision. While appearance-based methods have become popular recently, other important cues like functionality are largely neglected. Motivated by psychological studies giving evidence that human demonstration has a facilitative effect on categorization in infancy, we propose an approach for object categorization from depth video streams. To this end, we have developed a method for capturing human motion in real-time. The captured data is then used to temporally segment the depth streams into actions. The set of segmented actions are then categorized in an unsupervised manner, through a novel descriptor for motion capture data that is robust to subject variations. Furthermore, we automatically localize the object that is manipulated within a video segment, and categorize it using the corresponding action. For evaluation, we have recorded a dataset that comprises depth data with registered video sequences for 6 subjects, 13 action classes, and 174 object manipulations.

72 citations


Proceedings ArticleDOI
01 Jan 2011
TL;DR: This work proposes a novel approach where a stixel world model is computed directly from the stereo images, without computing an intermediate depth map, and demonstrates that such an approach can considerably reduce the set of candidate detection windows at a fraction of the computation cost of previous approaches.
Abstract: Mobile robots require object detection and classification for safe and smooth navigation. Stereo vision improves such detection by doubling the views of the scene and by giving indirect access to depth information. This depth information can also be used to reduce the set of candidate detection windows. Up to now, most algorithms compute a depth map to discard unpromising detection windows. We propose a novel approach where a stixel world model is computed directly from the stereo images, without computing an intermediate depth map. We experimentally demonstrate that such an approach can considerably reduce the set of candidate detection windows at a fraction of the computation cost of previous approaches.

Proceedings Article
12 Dec 2011
TL;DR: An efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities and an incremental algorithm for the online setting which can update the latent space without extensive relearning are presented.
Abstract: A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from "simple data", i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state-of-the-art.

Journal ArticleDOI
TL;DR: Assessment of coding of locomotion in rhesus monkey (Macaca mulatta) temporal cortex using movies of stationary walkers suggests that actions are analyzed by temporal cortical neurons using distinct mechanisms.
Abstract: Temporal cortical neurons are known to respond to visual dynamic-action displays. Many human psychophysical and functional imaging studies examining biological motion perception have used treadmill walking, in contrast to previous macaque single-cell studies. We assessed the coding of locomotion in rhesus monkey (Macaca mulatta) temporal cortex using movies of stationary walkers, varying both form and motion (i.e., different facing directions) or varying only the frame sequence (i.e., forward vs backward walking). The majority of superior temporal sulcus and inferior temporal neurons were selective for facing direction, whereas a minority distinguished forward from backward walking. Support vector machines using the temporal cortical population responses as input classified facing direction well, but forward and backward walking less so. Classification performance for the latter improved markedly when the within-action response modulation was considered, reflecting differences in momentary body poses within the locomotion sequences. Responses to static pose presentations predicted the responses during the course of the action. Analyses of the responses to walking sequences wherein the start frame was varied across trials showed that some neurons also carried a snapshot sequence signal. Such sequence information was present in neurons that responded to static snapshot presentations and in neurons that required motion. Our data suggest that actions are analyzed by temporal cortical neurons using distinct mechanisms. Most neurons predominantly signal momentary pose. In addition, temporal cortical neurons, including those responding to static pose, are sensitive to pose sequence, which can contribute to the signaling of learned action sequences.

Proceedings ArticleDOI
20 Jun 2011
TL;DR: This work presents a scalable multi-class detection algorithm which scales sublinearly with the number of classes without compromising accuracy, and builds a taxonomy of object classes which is exploited to further reduce the cost of multi-class object detection.
Abstract: Scalability of object detectors with respect to the number of classes is a very important issue for applications where many object classes need to be detected. While combining single-class detectors yields a linear complexity for testing, multi-class detectors that localize all objects at once often come at the cost of a reduced detection accuracy. In this work, we present a scalable multi-class detection algorithm which scales sublinearly with the number of classes without compromising accuracy. To this end, a shared discriminative codebook of feature appearances is jointly trained for all classes and detection is also performed for all classes jointly. Based on the learned sharing distributions of features among classes, we build a taxonomy of object classes. The taxonomy is then exploited to further reduce the cost of multi-class object detection. Our method has linear training and sublinear detection complexity in the number of classes. We have evaluated our method on the challenging PASCAL VOC'06 and PASCAL VOC'07 datasets and show that scaling the system does not lead to a loss in accuracy.
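
A sketch of how a class taxonomy can make detection sublinear in the number of classes: a cheap shared test is evaluated per taxonomy node, and whole subtrees of classes are pruned when the shared evidence is weak. The node structure, `node_score`, and the threshold `tau` are illustrative stand-ins for the learned sharing distributions.

```python
def detect_classes(window, taxonomy, node_score, tau=0.5):
    """Return the class ids whose taxonomy path survives pruning."""
    hits, stack = [], [taxonomy]
    while stack:
        node = stack.pop()
        if node_score(window, node) < tau:
            continue                    # prune every class below this node
        if node.children:
            stack.extend(node.children)
        else:
            hits.append(node.class_id)  # leaf: a concrete object class
    return hits
```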

Journal ArticleDOI
01 May 2011
TL;DR: A real-time interactive 3D scanning system that allows users to scan complete object geometry by turning the object around in front of a real-time 3D range scanner and shows that the system has comparable accuracy to offline methods with the additional benefit of immediate feedback and results.
Abstract: We present a real-time interactive 3D scanning system that allows users to scan complete object geometry by turning the object around in front of a real-time 3D range scanner. The incoming 3D surface patches are registered and integrated into an online 3D point cloud. In contrast to previous systems, the online reconstructed 3D model also serves as the final result. Registration error accumulation, which leads to the well-known loop closure problem, is addressed already during the scanning session by distorting the object as rigidly as possible. Scanning errors are removed by explicitly handling outliers based on visibility constraints. Thus, no additional post-processing is required, which might otherwise lead to artifacts in the model reconstruction. Both geometry and texture are used for registration, which allows for a wide range of objects with different geometric and photometric properties to be scanned. We show the results of our modeling approach on several difficult real-world objects. Qualitative and quantitative results are given for both synthetic and real data, demonstrating the importance of online loop closure and outlier handling for model reconstruction. We show that our real-time scanning system has comparable accuracy to offline methods with the additional benefit of immediate feedback and results.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: The results show that preserving the sparse representation of the signals from the original space in the (lower) dimensional projected space is beneficial for several benchmarks and corroborate the usefulness of the proposed sparse representation based linear and non-linear projections.
Abstract: In dimensionality reduction most methods aim at preserving one or a few properties of the original space in the resulting embedding. As our results show, preserving the sparse representation of the signals from the original space in the (lower-dimensional) projected space is beneficial for several benchmarks (faces, traffic signs, and handwritten digits). The intuition behind this is that taking a sparse representation of the different samples as the point of departure highlights the important correlations among the samples, which one then wants to exploit to arrive at the final, effective low-dimensional embedding. We explicitly adapt the LPP and LLE techniques to work with the sparse representation criterion and compare them to the original methods on the referenced databases, for both the unsupervised and supervised cases. The improved results corroborate the usefulness of the proposed sparse representation based linear and non-linear projections.

Proceedings ArticleDOI
16 May 2011
TL;DR: This work reconstructs complete buildings as procedural models using template shape grammars and lets the grammar interpreter automatically decide on which step to take next in the reconstruction process.
Abstract: We propose a novel grammar-driven approach for the reconstruction of buildings and landmarks. Our approach complements Structure-from-Motion and image-based analysis with an 'inverse' procedural modeling strategy. So far, procedural modeling has mostly been used for the creation of virtual buildings, while the inverse approaches typically focus on the reconstruction of single facades. In our work, we reconstruct complete buildings as procedural models using template shape grammars. In the reconstruction process, we let the grammar interpreter automatically decide which step to take next. The process can be seen as instantiating the template by determining the correct grammar parameters. As an example, we have chosen the reconstruction of Greek Doric temples. This process significantly differs from single facade segmentation due to the immediate need for 3D reconstruction.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: This work proposes and investigates the use of different optimisation methods to search for the best patches and their respective transformations for producing consistent, improved completions in photo-editing.
Abstract: Image completion is an important photo-editing task which involves synthetically filling a hole in the image such that the image still appears natural. State-of-the-art image completion methods work by searching for patches in the image that fit well in the hole region. Our key insight is that image patches remain natural under a variety of transformations (such as scale, rotation and brightness change), and it is important to exploit this. We propose and investigate the use of different optimisation methods to search for the best patches and their respective transformations for producing consistent, improved completions. Experiments on a number of challenging problem instances demonstrate that our methods outperform state-of-the-art techniques.
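
To make the enlarged search space concrete, a candidate patch can be scored under a small grid of transformations before being pasted into the hole. The exhaustive grid below is purely illustrative; the paper's contribution is precisely the optimisation methods that avoid such brute force.

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def best_transformed_patch(target, candidates,
                           scales=(0.8, 1.0, 1.25),
                           angles=(-15, 0, 15),
                           gains=(0.9, 1.0, 1.1)):
    """Find the candidate patch and (scale, angle, gain) minimising SSD."""
    best, best_cost = None, np.inf
    for patch in candidates:
        for s in scales:
            for a in angles:
                warped = rotate(zoom(patch, s), a, reshape=False)
                warped = warped[:target.shape[0], :target.shape[1]]
                if warped.shape != target.shape:
                    continue          # transformed patch too small for the hole
                for g in gains:       # brightness change as a simple gain
                    cost = float(np.sum((g * warped - target) ** 2))
                    if cost < best_cost:
                        best, best_cost = (patch, s, a, g), cost
    return best
```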

Book ChapterDOI
26 Jun 2011
TL;DR: The general framework of random forests for multi-class object detection in images is described and an overview of recent developments and implementation details that are relevant for practitioners are given.
Abstract: Object detection in large-scale real-world scenes requires efficient multi-class detection approaches. Random forests have been shown to handle large training datasets and many classes for object detection efficiently. The most prominent example is the commercial application of random forests for gaming [37]. In this paper, we describe the general framework of random forests for multi-class object detection in images and give an overview of recent developments and implementation details that are relevant for practitioners.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: An on-line learning scheme for Hough forests, which allows their usage to be extended to further applications, such as the tracking of arbitrary target instances or large-scale learning of visual classifiers.
Abstract: Recently, Gall & Lempitsky [6] and Okada [9] introduced Hough Forests (HF), which emerged as a powerful tool in object detection, tracking and several other vision applications. HFs are based on the generalized Hough transform [2] and are ensembles of randomized decision trees, consisting of both classification and regression nodes, which are trained recursively. Densely sampled patches of the target object {Pi = (Ai,yi,di)} represent the training data, where Ai is the appearance, yi the label, and di a vector pointing to the center of the object. Each node tries to find an optimal splitting function by either optimizing the information gain for classification nodes or the variance of the offset vectors di for regression nodes. This yields quite clean leaf nodes with respect to both appearance and offset. However, HFs are typically trained in off-line mode, which means they assume access to the entire training set at once. This limits their application in situations where the data arrives sequentially, e.g., in object tracking, incremental learning, or large-scale learning. For all of these applications, on-line methods can inherently perform better. Thus, in this paper we propose an on-line learning scheme for Hough forests, which allows their usage to be extended to further applications, such as the tracking of arbitrary target instances or large-scale learning of visual classifiers. Growing such a tree in an on-line fashion is a difficult task, as errors in the hard splitting rules cannot easily be corrected further down the tree. While Godec et al. [8] circumvent the recursive on-line update of classification trees by randomly growing the trees to their full size and only updating the leaf node statistics, we integrate the ideas from [5, 10] that follow a tree-growing principle. The basic idea there is to start with a tree consisting of only one node, which is the root node and the only leaf at that time. Each node collects the data falling into it and decides on its own, based on a certain splitting criterion, whether to split this node or to further update its statistics. Although the splitting criteria in [5, 10] have strong theoretical support, we show in the experiments that it even suffices to simply count the number n of samples Pi that a node has already incorporated and split when n > γ, where γ is a predefined threshold. An overview of this procedure is given in Figure 1. This splitting criterion requires finding reasonable splitting functions with only a small subset of the data, which does not necessarily have to be a disadvantage when building random forests. As stated by Breiman [4], the upper bound on the generalization error of random forests can be optimized with a high strength of the individual trees but also a low correlation between them. To this end, we derive a new but simple splitting procedure for off-line HFs based on subsampling the input space at the node level, which can further decrease the correlation between the trees. That is, each node in a tree randomly samples a predefined number γ of data samples uniformly over all data available at the current node, which is then used for finding a good splitting function. In the first experiment, we demonstrate on three object detection data sets that both our on-line formulation and the subsample splitting scheme can reach performance similar to classical Hough forests and can even outperform them, see Figures 2(a)&(b).
Additionally, during training both proposed methods are orders of magnitude faster than the original approach (Figure 2(c)). In the second part of the experiments, we demonstrate the power of our method on visual object tracking. In particular, our focus lies on tracking objects of a priori unknown classes, as class-specific tracking with off-line forests has already been demonstrated before [7]. We present results on seven tracking data sets and show that our on-line HFs can outperform state-of-the-art tracking-by-detection methods. Figure 1: While labeled samples arrive on-line, each tree propagates each sample to the corresponding leaf node, which decides whether to split the current leaf or to update its statistics.
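
The counting-based splitting rule (split as soon as n > γ) is simple enough to sketch directly. The node interface, the random `make_test` generator, and the `score` function below are hypothetical stand-ins for the Hough-forest binary tests and their information-gain/offset-variance criteria.

```python
class InternalNode:
    def __init__(self, test, left, right):
        self.test, self.left, self.right = test, left, right

class OnlineLeaf:
    def __init__(self, gamma):
        self.gamma = gamma
        self.buffer = []                   # (appearance, label, offset) triples

    def update(self, sample, make_test, score):
        """Absorb one sample; split into an InternalNode once n > gamma."""
        self.buffer.append(sample)
        if len(self.buffer) <= self.gamma:
            return self                    # keep updating the leaf statistics
        tests = [make_test() for _ in range(10)]   # a few random binary tests
        best = max(tests, key=lambda t: score(t, self.buffer))
        left, right = OnlineLeaf(self.gamma), OnlineLeaf(self.gamma)
        for s in self.buffer:              # route the buffered samples down
            (left if best(s) else right).buffer.append(s)
        return InternalNode(best, left, right)
```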

Proceedings ArticleDOI
01 Jan 2011
TL;DR: This work proposes an efficient pipeline that starts from images captured by vans, which are then used to detect, recognise and localise the manholes, and is the first published work for manhole mapping based solely on computer vision techniques and GPS.
Abstract: Our work addresses the problem of accurately 3D localising specific types of road fixtures, such as manhole covers. The surveying task for manholes has to be done for millions of kilometres of road. We propose an efficient pipeline that starts from images captured by vans, which are then used to detect, recognise and localise the manholes. Challenges come from frequent occlusions, regular changes in illumination conditions, substantial viewpoint variance, and the fact that the actual manhole covers are far less similar than one might expect. We combine 2D and 3D techniques and demonstrate high performance for large amounts of image data. To our knowledge this is the first published work for manhole mapping based solely on computer vision techniques and GPS.

Book ChapterDOI
10 Oct 2011
TL;DR: This paper presents a service that creates complete and realistic 3D models out of a set of photographs taken with a consumer camera that automatically generates textured and dense models that require little or no post-processing.
Abstract: This paper presents a service that creates complete and realistic 3D models out of a set of photographs taken with a consumer camera. In contrast to other systems which produce sparse point clouds or individual depth maps, our system automatically generates textured and dense models that require little or no post-processing. Our reconstruction pipeline features automatic camera parameter retrieval from the web and intelligent view selection. This ARC3D system is available as a public, free-to-use web service (http://www.arc3d.be). Results are made available both as a full-resolution model and as a low-resolution version for web browser viewing using WebGL.

Proceedings Article
01 Nov 2011
TL;DR: This is the first work that covers such a system end-to-end, from offline crawling up to augmentation on the mobile device, and the complete system runs in real time on a state-of-the-art mobile phone.
Abstract: In this paper we present a fully automatic system for face augmentation on mobile devices. A user can point his mobile phone at a person and the system recognizes his or her face. A tracking algorithm overlays information about the identified person on the screen, thereby achieving an augmented reality effect. The tracker runs on the mobile client, while the recognition runs on a server. The database on the server is built by a fully autonomous crawling method, which taps social networks. For this work we collected 300 000 images from Facebook. The social context gained during this social network analysis is also used to improve the face recognition. The complete system runs in real time on a state-of-the-art mobile phone and is fully automatic, from offline crawling up to augmentation on the mobile device. It can be used to display more information about the identified persons or as a user interface for mixed reality applications. To the best of our knowledge, this is the first work that covers such a system end-to-end.

Proceedings ArticleDOI
20 Jun 2011
TL;DR: This paper presents an integrated solution for the problem of detecting, tracking and identifying vehicles in a tunnel surveillance application, taking into account practical constraints including real-time operation, poor imaging conditions, and a decentralized architecture.
Abstract: This paper presents an integrated solution for the problem of detecting, tracking and identifying vehicles in a tunnel surveillance application, taking into account practical constraints including real-time operation, poor imaging conditions, and a decentralized architecture. Vehicles are followed through the tunnel by a network of non-overlapping cameras. They are detected and tracked in each camera and then identified, i.e. matched to any of the vehicles detected in the previous camera(s). To limit the computational load, we propose to reuse the same set of Haar features for each of these steps. For the detection, we use an AdaBoost cascade. Here we introduce a composite confidence score, integrating information from all stages of the cascade. A subset of the features used for detection is then selected, optimizing for the identification problem. This results in a compact binary ‘vehicle fingerprint’, requiring very limited bandwidth. Finally, we show that the same set of features can also be used for tracking. This Haar-feature-based ‘tracking-by-identification’ yields surprisingly good results on standard datasets, without the need to update the model online.
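
A sketch of the compact binary fingerprint idea: the Haar-feature responses already computed for detection are thresholded into a bit string, and detections from consecutive cameras are matched by Hamming distance. The per-feature thresholds (e.g., training-set medians) are an assumption of this sketch.

```python
import numpy as np

def fingerprint(haar_responses, thresholds):
    """Binarise the detector's Haar responses into a compact bit vector."""
    return (np.asarray(haar_responses) > thresholds).astype(np.uint8)

def identify(new_fp, candidates):
    """Match a detection to the candidate from the previous camera(s)
    with the smallest Hamming distance."""
    return min(candidates, key=lambda c: int(np.count_nonzero(new_fp != c)))
```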

Proceedings ArticleDOI
01 Jan 2011
TL;DR: A new multi-class version of transfer learning which requires minimal human interaction but still provides semantic labels of the new classes is developed, which is based on human tracking with multiple activity trackers.
Abstract: One of the great open challenges in visual recognition is the ability to cope with unexpected stimuli. In this work, we present a technique to interpret detected anomalies and update the existing knowledge of normal situations. The addressed context is the analysis of human behavior in indoor surveillance scenarios, where new activities might need to be learned, once the system is already in operation. Our approach is based on human tracking with multiple activity trackers. The main contribution is to integrate a learning stage, where labeled and unlabeled information is collected and analyzed. To this end we develop a new multi-class version of transfer learning which requires minimal human interaction but still provides semantic labels of the new classes. The activity model is then updated with the new activities. Experiments show promising results.

Journal ArticleDOI
TL;DR: A system to simulate, analyze and visualize occupant behavior in urban environments by combining parametric modeling and agent-based simulation, which identifies empiric correlations between properties such as: functions of buildings and other urban elements, population density, utilization and capacity of the public transport network, and congestion effect on the street network.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: The proposed method has been validated on two public datasets for the problem of detecting cars under several views, and the results show that the proposed approach yields high detection rates while keeping efficiency.
Abstract: We propose an efficient method for object localization and 3D pose estimation. A two-step approach is used. In the first step, a pose estimator is evaluated in the input images in order to estimate potential object locations and poses. These candidates are then validated, in the second step, by the corresponding pose-specific classifier. The result is a detection approach that avoids the inherent and expensive cost of testing the complete set of specific classifiers over the entire image. A further speedup is achieved by feature sharing. Features are computed only once and are then used for evaluating the pose estimator and all specific classifiers. The proposed method has been validated on two public datasets for the problem of detecting cars under several views. The results show that the proposed approach yields high detection rates while keeping efficiency.

Proceedings ArticleDOI
16 May 2011
TL;DR: This paper presents a method to combine the detection and segmentation of object categories from 3D scenes by combining the top-down cues available from the object detection technique of Implicit Shape Models and the bottom-up power of Markov Random Fields for the purpose of segmentation.
Abstract: In this paper we present a method to combine the detection and segmentation of object categories from 3D scenes. In the process, we combine the top-down cues available from the object detection technique of Implicit Shape Models and the bottom-up power of Markov Random Fields for the purpose of segmentation. While such approaches have been tried for the 2D image problem domain before, this is the first application of such a method in 3D. 3D scene understanding is prone to many problems different from 2D, owing to noise, a lack of distinctive high-frequency feature information, mesh parametrization problems, etc. Our method enables us to localize objects of interest for more purposeful meshing and subsequent scene understanding.

Book ChapterDOI
01 Jan 2011
TL;DR: A real-time driver assistance system that integrates single-view detection with region-based 3D tracking of traffic signs and obtains 3D pose information that is used to establish the relevance of the traffic sign to the driver.
Abstract: We briefly review the advances in driver assistance systems and present a real-time version that integrates single-view detection with region-based 3D tracking of traffic signs. The system has a typical pipeline: detection and recognition of traffic signs in independent frames, followed by tracking for temporal integration. The detection process finds an optimal set of candidates and is accelerated using AdaBoost cascades. A hierarchy of SVMs handles the recognition of traffic sign types. The 2D detections are then employed in simultaneous 2D segmentation and 3D pose tracking, using the known 3D model of the recognized traffic sign. Thus, we achieve not only 2D tracking of the recognized traffic signs, but also obtain 3D pose information, which we use to establish the relevance of the traffic sign to the driver. The performance of the system is demonstrated by tracking multiple road signs in real-world scenarios.

Proceedings ArticleDOI
15 Jul 2011
TL;DR: A method for incorporating latent variables into object and action classification and an exploration of a way to learn a better classifier by iterative expansion of the latent parameter space are provided.
Abstract: In this paper we propose a generic framework to incorporate unobserved auxiliary information for classifying objects and actions. This framework allows us to explicitly account for the localisation and alignment of representations for generic object and action classes as latent variables. We approach this problem in the discriminative setting, as learning a max-margin classifier that infers the class label along with the latent variables. Through this paper we make the following contributions: (a) we provide a method for incorporating latent variables into object and action classification; (b) we specifically account for the presence of an explicit class-related subregion which can include foreground and/or background; (c) we explore a way to learn a better classifier by iterative expansion of the latent parameter space. We demonstrate the performance of our approach by rigorous experimental evaluation on a number of standard object and action recognition datasets.