
Showing papers by "Luc Van Gool published in 2008"


Journal ArticleDOI
TL;DR: A novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
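For readers who want to try SURF directly, here is a minimal detection-and-matching sketch. It assumes an OpenCV build that includes the non-free xfeatures2d contrib module (SURF is patented and missing from default wheels), and the image file names are placeholders.

```python
# Minimal SURF keypoint detection and matching sketch.
# Assumes opencv-contrib-python with the non-free xfeatures2d module enabled.
import cv2

img1 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)    # placeholder inputs
img2 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp1, des1 = surf.detectAndCompute(img1, None)
kp2, des2 = surf.detectAndCompute(img2, None)

# Match the 64-D descriptors with Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.7 * n.distance]
print(f"{len(kp1)}/{len(kp2)} keypoints, {len(good)} ratio-test matches")
```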

12,449 citations


Book ChapterDOI
12 Oct 2008
TL;DR: In this article, the Hessian scale-invariant saliency measure is used to detect spatio-temporal interest points that are at the same time scale invariant and densely cover the video content.
Abstract: Over the years, several spatio-temporal interest point detectors have been proposed. While some detectors can only extract a sparse set of scale-invariant features, others allow for the detection of a larger amount of features at user-defined scales. This paper presents for the first time spatio-temporal interest points that are at the same time scale-invariant (both spatially and temporally) and densely cover the video content. Moreover, as opposed to earlier work, the features can be computed efficiently. Applying scale-space theory, we show that this can be achieved by using the determinant of the Hessian as the saliency measure. Computations are speeded-up further through the use of approximative box-filter operations on an integral video structure. A quantitative evaluation and experimental results on action recognition show the strengths of the proposed detector in terms of repeatability, accuracy and speed, in comparison with previously proposed detectors.
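As a rough illustration of the saliency measure (not the paper's box-filter approximation), the sketch below computes the determinant of the spatio-temporal Hessian densely over a toy video volume using exact Gaussian derivatives from SciPy; scale selection and non-maximum suppression are omitted.

```python
# Determinant-of-the-Hessian saliency over a (T, H, W) video volume at one
# (spatial, temporal) scale pair. The paper speeds this up with box filters
# on an integral video; here we use exact Gaussian derivatives instead.
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_saliency(video, sigma_s=2.0, sigma_t=2.0):
    """video: (T, H, W) float array; returns per-voxel |det(Hessian)|."""
    sig = (sigma_t, sigma_s, sigma_s)
    d = {}
    for i in range(3):                       # second derivatives along (t, y, x)
        for j in range(i, 3):
            order = [0, 0, 0]
            order[i] += 1
            order[j] += 1
            d[(i, j)] = gaussian_filter(video, sig, order=order)
    det = (d[(0,0)] * (d[(1,1)] * d[(2,2)] - d[(1,2)]**2)
           - d[(0,1)] * (d[(0,1)] * d[(2,2)] - d[(1,2)] * d[(0,2)])
           + d[(0,2)] * (d[(0,1)] * d[(1,2)] - d[(1,1)] * d[(0,2)]))
    return np.abs(det)

saliency = hessian_saliency(np.random.rand(32, 64, 64))  # toy volume
print(saliency.shape)
```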

759 citations


Proceedings ArticleDOI
07 Jul 2008
TL;DR: This paper describes an approach for mining images of objects from community photo collections in an unsupervised fashion, and demonstrates this approach on several urban areas, densely covering an area of over 700 square kilometers and mining over 200,000 photos, making it probably the largest experiment of its kind to date.
Abstract: In this paper, we describe an approach for mining images of objects (such as touristic sights) from community photo collections in an unsupervised fashion. Our approach relies on retrieving geotagged photos from photo-sharing websites using a grid of geospatial tiles. The downloaded photos are clustered into potentially interesting entities through a processing pipeline of several modalities, including visual, textual and spatial proximity. The resulting clusters are analyzed and automatically classified into objects and events. Using mining techniques, we then find text labels for these clusters, which are used to assign each cluster to a corresponding Wikipedia article in a fully unsupervised manner. A final verification step uses the contents (including images) of the selected Wikipedia article to verify the cluster-article assignment. We demonstrate this approach on several urban areas, densely covering an area of over 700 square kilometers and mining over 200,000 photos, making it probably the largest experiment of its kind to date.
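A toy version of the geospatial tile grid used for retrieval might look as follows; the tile size and coordinates are invented, and the actual photo-API calls are omitted.

```python
# Split a bounding box into fixed-size tiles whose centres can be sent as
# location queries to a geotagged-photo API (the API calls are omitted here).
def make_tiles(lat_min, lat_max, lon_min, lon_max, step_deg=0.01):
    tiles = []
    lat = lat_min
    while lat < lat_max:
        lon = lon_min
        while lon < lon_max:
            tiles.append((lat + step_deg / 2, lon + step_deg / 2))  # tile centre
            lon += step_deg
        lat += step_deg
    return tiles

# Roughly central Zurich; 0.01 deg is about 1 km in latitude.
print(len(make_tiles(47.35, 47.40, 8.50, 8.57)))
```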

295 citations


Journal ArticleDOI
TL;DR: A novel city modeling framework that creates 3D content at high speed and integrates an object recognition module that automatically detects cars in the input video streams and localizes them in 3D.
Abstract: Supplying realistically textured 3D city models at ground level promises to be useful for pre-visualizing upcoming traffic situations in car navigation systems. Because this pre-visualization can be rendered from the expected future viewpoints of the driver, the required maneuver will be more easily understandable. 3D city models can be reconstructed from the imagery recorded by surveying vehicles. The vastness of image material gathered by these vehicles, however, puts extreme demands on vision algorithms to ensure their practical usability. Algorithms need to be as fast as possible and should result in compact, memory efficient 3D city models for future ease of distribution and visualization. For the considered application, these are not contradictory demands. Simplified geometry assumptions can speed up vision algorithms while automatically guaranteeing compact geometry models. In this paper, we present a novel city modeling framework which builds upon this philosophy to create 3D content at high speed. Objects in the environment, such as cars and pedestrians, may however disturb the reconstruction, as they violate the simplified geometry assumptions, leading to visually unpleasant artifacts and degrading the visual realism of the resulting 3D city model. Unfortunately, such objects are prevalent in urban scenes. We therefore extend the reconstruction framework by integrating it with an object recognition module that automatically detects cars in the input video streams and localizes them in 3D. The two components of our system are tightly integrated and benefit from each other's continuous input. 3D reconstruction delivers geometric scene context, which greatly helps improve detection precision. The detected car locations, on the other hand, are used to instantiate virtual placeholder models which augment the visual realism of the reconstructed city model.

256 citations


Journal ArticleDOI
TL;DR: This work constructs a biologically plausible hierarchy of neural detectors, which can discriminate seven basic emotional states from static views of associated body poses, and is evaluated against human test subjects on a recent set of stimuli manufactured for research on emotional body language.

164 citations


Journal Article
TL;DR: In this article, the authors address the problem of 3D articulated multi-person tracking in busy street scenes from a moving, human-level observer and propose a two-stage strategy.
Abstract: In this paper, we address the problem of 3D articulated multi-person tracking in busy street scenes from a moving, human-level observer. In order to handle the complexity of multi-person interactions, we propose to pursue a two-stage strategy. A multi-body detection-based tracker first analyzes the scene and recovers individual pedestrian trajectories, bridging sensor gaps and resolving temporary occlusions. A specialized articulated tracker is then applied to each recovered pedestrian trajectory in parallel to estimate the tracked person's precise body pose over time. This articulated tracker is implemented in a Gaussian Process framework and operates on global pedestrian silhouettes using a learned statistical representation of human body dynamics. We interface the two tracking levels through a guided segmentation stage, which combines traditional bottom-up cues with top-down information from a human detector and the articulated tracker's shape prediction. We show the proposed approach's viability and demonstrate its performance for articulated multi-person tracking on several challenging video sequences of a busy inner-city scenario.
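As a loose illustration of the articulated stage, the sketch below regresses pose parameters from silhouette descriptors with a Gaussian Process using scikit-learn; the data are random stand-ins, and the paper's learned model of human body dynamics over time is omitted.

```python
# Toy stand-in for the articulated stage: Gaussian Process regression from a
# silhouette descriptor to body-pose parameters. The paper's tracker also
# encodes learned body dynamics, which this static regression omits.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.random((200, 50))      # 200 silhouette descriptors (toy data)
Y_train = rng.random((200, 10))      # corresponding 10-D pose vectors (toy)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
gp.fit(X_train, Y_train)

pose_mean, pose_std = gp.predict(rng.random((1, 50)), return_std=True)
print(pose_mean.shape, pose_std.shape)   # predicted pose and its uncertainty
```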

54 citations


Book ChapterDOI
26 Mar 2008
TL;DR: A system which allows users to request information on physical objects by taking a picture of them, using a mobile phone with an integrated camera, and which identifies an object from a query image through multiple recognition stages, including local visual features, global geometry, and optionally also metadata such as GPS location.
Abstract: We present a system which allows users to request information on physical objects by taking a picture of them. This way, using a mobile phone with an integrated camera, users can interact with objects or "things" in a very simple manner. A further advantage is that the objects themselves don't have to be tagged with any kind of markers. At the core of our system lies an object recognition method, which identifies an object from a query image through multiple recognition stages, including local visual features, global geometry, and optionally also metadata such as GPS location. We present two applications for our system, namely a slide-tagging application for presentation screens in smart meeting rooms and a city guide on a mobile phone. Both systems are fully functional, including an application on the mobile phone which allows simple point-and-shoot interaction with objects. Experiments evaluate the performance of our approach in both application scenarios and show good recognition results under challenging conditions.
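A much-simplified sketch of such a multi-stage pipeline is shown below, using ORB features as a freely available stand-in and a RANSAC homography as the global-geometry check; an optional GPS filter would simply pre-select which database images to match against.

```python
# Sketch of a multi-stage matching pipeline in the spirit of the paper:
# local features first, then a global-geometry check via RANSAC homography.
import cv2
import numpy as np

def recognize(query_path, db_path, min_inliers=15):
    q = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    d = cv2.imread(db_path, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(1000)
    kq, dq = orb.detectAndCompute(q, None)
    kd, dd = orb.detectAndCompute(d, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(dq, dd)
    if len(matches) < 4:                     # homography needs 4 points
        return False
    src = np.float32([kq[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kd[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    # Accept only if enough matches survive the geometric verification.
    return mask is not None and int(mask.sum()) >= min_inliers
```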

49 citations


Book ChapterDOI
12 Oct 2008
TL;DR: This paper addresses the problem of 3D articulated multi-person tracking in busy street scenes from a moving, human-level observer and proposes to pursue a two-stage strategy, which combines traditional bottom-up cues with top-down information from a human detector and the articulated tracker's shape prediction.
Abstract: In this paper, we address the problem of 3D articulated multi-person tracking in busy street scenes from a moving, human-level observer. In order to handle the complexity of multi-person interactions, we propose to pursue a two-stage strategy. A multi-body detection-based tracker first analyzes the scene and recovers individual pedestrian trajectories, bridging sensor gaps and resolving temporary occlusions. A specialized articulated tracker is then applied to each recovered pedestrian trajectory in parallel to estimate the tracked person's precise body pose over time. This articulated tracker is implemented in a Gaussian Process framework and operates on global pedestrian silhouettes using a learned statistical representation of human body dynamics. We interface the two tracking levels through a guided segmentation stage, which combines traditional bottom-up cues with top-down information from a human detector and the articulated tracker's shape prediction. We show the proposed approach's viability and demonstrate its performance for articulated multi-person tracking on several challenging video sequences of a busy inner-city scenario.

43 citations


Proceedings ArticleDOI
30 Oct 2008
TL;DR: A new method for robust content-based video copy detection based on local spatio-temporal features which, as shown by experimental validation, brings additional robustness and discriminativity to the task of video footage reuse detection in news broadcasts.
Abstract: In this paper, we present a new method for robust content-based video copy detection based on local spatio-temporal features. As we show by experimental validation, the use of local spatio-temporal features instead of purely spatial ones brings additional robustness and discriminativity. Efficient operation is ensured by using the new spatio-temporal features proposed in [20]. To cope with the high dimensionality of the resulting descriptors, these features are incorporated in a disk-based index and query system based on p-stable locality sensitive hashing. The system is applied to the task of video footage reuse detection in news broadcasts. Results are reported on 88 hours of news broadcast data from the TRECVID 2006 dataset.
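The p-stable LSH idea can be sketched in a few lines: project a descriptor onto random Gaussian directions and quantize, so that nearby vectors share hash keys with high probability under the L2 norm. The dimensions and bucket width below are illustrative assumptions, not the paper's configuration.

```python
# Minimal p-stable LSH: each hash is floor((a.v + b) / w) with Gaussian a.
import numpy as np
from collections import defaultdict

class PStableLSH:
    def __init__(self, dim, n_hashes=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(n_hashes, dim))  # 2-stable (Gaussian) projections
        self.b = rng.uniform(0, w, size=n_hashes)
        self.w = w
        self.table = defaultdict(list)

    def _key(self, v):
        return tuple(np.floor((self.a @ v + self.b) / self.w).astype(int))

    def insert(self, v, label):
        self.table[self._key(v)].append(label)

    def query(self, v):
        return self.table[self._key(v)]

lsh = PStableLSH(dim=64)
rng = np.random.default_rng(1)
x = rng.normal(size=64)
lsh.insert(x, "clip_0042")
print(lsh.query(x + 0.01 * rng.normal(size=64)))   # likely ['clip_0042']
```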

39 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: This work presents an online learning approach for robustly combining unreliable observations from a pedestrian detector to estimate the rough 3D scene geometry from video sequences of a static camera, based on an entropy modelling framework.
Abstract: We present an online learning approach for robustly combining unreliable observations from a pedestrian detector to estimate the rough 3D scene geometry from video sequences of a static camera. Our approach is based on an entropy modelling framework, which allows us to simultaneously adapt the detector parameters such that the expected information gain about the scene structure is maximised. As a result, our approach automatically restricts the detector scale range for each image region as the estimation results become more confident, thus improving detector run-time and limiting false positives.

21 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: A novel multi-class object detector is proposed that optimizes detection costs while retaining a desired detection rate, using a cascade that unites the handling of similar object classes while separating off classes at appropriate levels of the cascade.
Abstract: We propose a novel multi-class object detector that optimizes the detection costs while retaining a desired detection rate. The detector uses a cascade that unites the handling of similar object classes while separating off classes at appropriate levels of the cascade. No prior knowledge about the relationship between classes is needed, as the classifier structure is automatically determined during the training phase. The detection nodes in the cascade use Haar wavelet features and Gentle AdaBoost; however, the approach is not dependent on the specific features used and can easily be extended to other cases. Experiments are presented for several numbers of object classes, and the approach is compared to other classification schemes. The results demonstrate a large efficiency gain that is particularly prominent for a greater number of classes. The complexity of training also scales well with the number of classes.
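For the weak-learner side, a minimal Gentle AdaBoost with regression stumps could look as follows; Haar feature extraction and the automatic cascade construction are omitted, and all parameters are illustrative.

```python
# Gentle AdaBoost with regression stumps on labels y in {-1, +1}.
import numpy as np

def fit_stump(X, y, w):
    """Weighted least-squares regression stump; returns (feat, thr, left, right)."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            m = X[:, f] <= thr
            if not m.any() or m.all():
                continue
            left = np.average(y[m], weights=w[m])
            right = np.average(y[~m], weights=w[~m])
            err = np.sum(w * (y - np.where(m, left, right)) ** 2)
            if best is None or err < best[0]:
                best = (err, f, thr, left, right)
    return best[1:]

def gentle_adaboost(X, y, rounds=20):
    w = np.full(len(y), 1.0 / len(y))
    stumps = []
    for _ in range(rounds):
        f, thr, left, right = fit_stump(X, y, w)
        pred = np.where(X[:, f] <= thr, left, right)
        w *= np.exp(-y * pred)              # Gentle AdaBoost weight update
        w /= w.sum()
        stumps.append((f, thr, left, right))
    return stumps

def predict(stumps, X):
    F = sum(np.where(X[:, f] <= t, l, r) for f, t, l, r in stumps)
    return np.sign(F)
```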

Book ChapterDOI
01 Jan 2008
TL;DR: This chapter describes a method to automatically build topological maps for robot navigation out of a sequence of visual observations taken from a camera mounted on the robot based on Dempster-Shafer probability theory.
Abstract: This chapter describes a method to automatically build topological maps for robot navigation out of a sequence of visual observations taken from a camera mounted on the robot. This direct, non-metrical approach relies completely on the detection of loop closings, i.e. repeated visitations of one particular place. In natural environments, visual loop closing can be very hard, for two reasons. Firstly, one place can look different at different time instances due to illumination changes and viewpoint differences. Secondly, there can be different places that look alike, i.e. the environment is self-similar. Here we propose a method that combines state-of-the-art visual comparison techniques with evidence collection based on Dempster-Shafer theory to tackle this problem.
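The evidence-collection step can be illustrated with Dempster's rule of combination on a two-hypothesis frame {same, diff}, where "same" means two observations come from the same place; the mass values below are invented.

```python
# Dempster's rule of combination for a two-hypothesis frame {same, diff}.
# Mass may also sit on the full frame Theta, expressing ignorance.
def combine(m1, m2):
    """m = {'same': .., 'diff': .., 'theta': ..}; returns the fused mass."""
    conflict = m1['same'] * m2['diff'] + m1['diff'] * m2['same']
    k = 1.0 - conflict                       # normalisation constant
    return {
        'same': (m1['same'] * m2['same'] + m1['same'] * m2['theta']
                 + m1['theta'] * m2['same']) / k,
        'diff': (m1['diff'] * m2['diff'] + m1['diff'] * m2['theta']
                 + m1['theta'] * m2['diff']) / k,
        'theta': m1['theta'] * m2['theta'] / k,
    }

# Two weak visual-similarity cues jointly yield a strong loop-closure belief.
m = combine({'same': 0.6, 'diff': 0.1, 'theta': 0.3},
            {'same': 0.5, 'diff': 0.2, 'theta': 0.3})
print(m)
```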

Journal ArticleDOI
TL;DR: This work presents a 3-D measurement technique capable of optically measuring microchip devices using a camera-projector system, and improves the dynamic range of the imaging system through the use of a set of gray-code and phase-shift measurements with different CCD integration times.
Abstract: The industry dealing with microchip inspection requires fast, flexible, repeatable, and stable 3-D measuring systems. The typical devices used for this purpose are coordinate measurement machines (CMMs). These systems have limitations such as high cost, low measurement speed, and a small quantity of measured 3-D points. Optical techniques are now beginning to replace the typical touch probes because of their noncontact nature, their full-field measurement capability, their high measurement density, as well as their low cost and high measurement speed. However, typical properties of microchip devices, including a strongly spatially varying reflectance, make the direct use of classical optical 3-D measurement techniques impossible. We present a 3-D measurement technique capable of optically measuring these devices using a camera-projector system. The proposed method improves the dynamic range of the imaging system through the use of a set of gray-code (GC) and phase-shift (PS) measurements with different CCD integration times. A set of extended-range GC and PS images is obtained and used to acquire a dense 3-D measurement of the object. We measured the 3-D shape of an integrated circuit and obtained satisfactory results.
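The two ingredients can be sketched as follows: four-step phase-shift decoding, and a per-pixel selection of the longest unsaturated CCD integration time. Gray-code unwrapping and calibration are omitted, and the saturation threshold is an assumption.

```python
# (i) 4-step phase-shift decoding; (ii) extended dynamic range by keeping,
# per pixel, the longest integration time that does not saturate.
import numpy as np

def phase_from_shifts(I0, I1, I2, I3):
    """Wrapped phase from four patterns shifted by 90 degrees each."""
    return np.arctan2(I3 - I1, I0 - I2)

def fuse_exposures(stacks, saturation=250):
    """stacks: list of (I0..I3) tuples, ordered short -> long integration time.
    Per pixel, use the longest exposure in which no pattern saturates."""
    phase = phase_from_shifts(*stacks[0])
    for imgs in stacks[1:]:
        ok = np.max(np.stack(imgs), axis=0) < saturation   # unsaturated pixels
        phase = np.where(ok, phase_from_shifts(*imgs), phase)
    return phase
```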

Proceedings ArticleDOI
25 Jun 2008
TL;DR: A system that is able to recognize objects of a certain class in an image and to identify their parts for potential interactions is presented, demonstrated for object instances that have never been observed during training, and under partial occlusion and against cluttered backgrounds.
Abstract: In the transition from industrial to service robotics, robots will have to deal with increasingly unpredictable and variable environments. We present a system that is able to recognize objects of a certain class in an image and to identify their parts for potential interactions. This is demonstrated for object instances that have never been observed during training, and under partial occlusion and against cluttered backgrounds. Our approach builds on the Implicit Shape Model of Leibe and Schiele, and extends it to couple recognition to the provision of meta-data useful for a task. Meta-data can for example consist of part labels or depth estimates. We present experimental results on wheelchairs and cars.
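A toy flavour of the meta-data transfer is sketched below: codebook matches vote for the object centre, and the votes that support the winning centre carry their part labels back into the image. The codebook entries are invented, and real ISM matching, scales and vote weighting are omitted.

```python
# Hough-style voting with meta-data transfer, heavily simplified.
import numpy as np
from collections import Counter

# Hypothetical activated codebook entries: each stores an offset to the
# object centre and the part label seen during training.
votes = [
    {"pos": (40, 30), "offset": (10, 20), "part": "armrest"},
    {"pos": (60, 35), "offset": (-10, 15), "part": "wheel"},
    {"pos": (52, 48), "offset": (-2, 2), "part": "seat"},
]

centre_votes = [tuple(np.add(v["pos"], v["offset"])) for v in votes]
centre = Counter(centre_votes).most_common(1)[0][0]      # crude vote mode
# Back-project: features whose vote landed on the winning centre transfer
# their part labels back to their image locations.
support = [v for v, c in zip(votes, centre_votes) if c == centre]
print(centre, [v["part"] for v in support])
```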

01 Jan 2008
TL;DR: Instead of a monolithic application, this work proposes a viewer architecture that builds upon a module concept and a scripting language, permitting the design, with reasonable effort, of non-trivial interaction components for the exploration and inspection of individual models as well as complex 3D scenes.
Abstract: The presentation of CH artefacts is technically demanding because it has to meet a variety of requirements: a plethora of file formats, compatibility with numerous application scenarios from powerwall to web browser, sustainability and long-term availability, extensibility with respect to digital model representations, and last but not least good usability. Instead of a monolithic application, we propose a viewer architecture that builds upon a module concept and a scripting language. This permits the design, with reasonable effort, of non-trivial interaction components for the exploration and inspection of individual models as well as of complex 3D scenes. Furthermore, some specific CH models are discussed in more detail.
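A minimal module-plus-scripting skeleton in this spirit might look as follows; the module names and the trivial "script" format are invented for illustration.

```python
# Modules register themselves under a name; a tiny script (here just a list
# of commands) wires them into an interaction component.
class Viewer:
    def __init__(self):
        self.modules = {}

    def register(self, name, fn):
        self.modules[name] = fn

    def run_script(self, script):
        for name, args in script:            # a "script" = sequence of calls
            self.modules[name](*args)

viewer = Viewer()
viewer.register("load", lambda path: print(f"loading model {path}"))
viewer.register("orbit", lambda deg: print(f"orbiting camera {deg} deg"))
viewer.run_script([("load", ("statue.ply",)), ("orbit", (45,))])
```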

Proceedings Article
01 Jan 2008
TL;DR: In this article, a two-stage procedure is proposed to estimate the scene geometry and an overcomplete set of object detections, and then address object-object interactions, tracking and prediction in a second step.
Abstract: In this paper, we address the problem of multi-person tracking in busy pedestrian zones, using a stereo rig mounted on a mobile platform. The complexity of the problem calls for an integrated solution, which extracts as much visual information as possible and combines it through cognitive feedback. We propose such an approach, which jointly estimates camera position, stereo depth, object detection, and tracking. We model the interplay between these components using a graphical model. Since the model has to incorporate object-object interactions, and temporal links to past frames, direct inference is intractable. We therefore propose a two-stage procedure: for each frame we first solve a simplified version of the model (disregarding interactions and temporal continuity) to estimate the scene geometry and an overcomplete set of object detections. Conditioned on these results, we then address object interactions, tracking, and prediction in a second step. The approach is experimentally evaluated on several long and difficult video sequences from busy inner-city locations. Our results show that the proposed integration makes it possible to deliver stable tracking performance in scenes of realistic complexity.


Journal ArticleDOI
TL;DR: A system prototype for self-determination and privacy enhancement in video surveilled areas by integrating computer vision and cryptographic techniques into networked building automation systems is presented.
Abstract: We present a system prototype for self-determination and privacy enhancement in video surveilled areas by integrating computer vision and cryptographic techniques into networked building automation systems. This paper describes a system prototype and research work that has been conducted by an interdisciplinary team of researchers. People appearing in a video stream control their visibility on a per-viewer basis and can choose to allow either the real view or an obscured image to be seen. The parts of the video stream containing a person's image are protected by an AES cipher and can be sent over untrusted networks. This paper presents experimental results with the example of a meeting room scenario. We conclude with remarks on the system's usability and on the problems encountered.
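The protection step can be sketched with PyCryptodome: the pixels inside a person's bounding box are encrypted with AES so that only key holders can restore the real view. The paper does not specify this exact cipher mode or library; both are assumptions here.

```python
# Encrypt a region of interest in place; AES-CTR preserves the byte length,
# so the ciphertext can be displayed as an obscured, reversible region.
import numpy as np
from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

key = get_random_bytes(16)                   # shared with authorised viewers

def protect_roi(frame, box, key):
    x0, y0, x1, y1 = box
    roi = frame[y0:y1, x0:x1]
    cipher = AES.new(key, AES.MODE_CTR)
    ct = cipher.encrypt(roi.tobytes())
    frame[y0:y1, x0:x1] = np.frombuffer(ct, np.uint8).reshape(roi.shape)
    return cipher.nonce                      # needed for later decryption

frame = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
nonce = protect_roi(frame, (40, 20, 100, 100), key)
```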

01 Jan 2008
TL;DR: A complete free-software pipeline for the 3D digital acquisition of Cultural Heritage assets based on standard photographic equipment, using Arc3D and MeshLab, is presented.
Abstract: The paper presents a complete free-software pipeline for the 3D digital acquisition of Cultural Heritage assets based on standard photographic equipment. The presented solution makes use of two main tools: Arc3D and MeshLab. Arc3D is a web-based reconstruction service that uses computer vision techniques, based on the automatic matching of image features, to compute a depth map for each photo. MeshLab is a tool that allows these range maps to be imported and processed in order to obtain a ready-to-use 3D model. Through the combined use of these two systems it is possible to digitally acquire CH artifacts and monuments in an affordable way.

Book ChapterDOI
10 Jun 2008
TL;DR: In an evaluation on two standard datasets, the method outperforms the state-of-the-art, confirming that the combination of form and motion improves recognition.
Abstract: We present a method for human action recognition from video, which exploits both form (local shape) and motion (local flow). Inspired by models of the human visual system, the two feature sets are processed independently in separate channels. The form channel extracts a dense local shape representation from every frame, while the motion channel extracts dense optic flow from the frame and its immediate predecessor. The same processing pipeline is applied in both channels: feature maps are pooled locally, down-sampled, and compared to a collection of learnt templates, yielding a vector of similarity scores. In a final step, the two score vectors are merged, and recognition is performed with a discriminative classifier. In an evaluation on two standard datasets our method outperforms the state-of-the-art, confirming that the combination of form and motion improves recognition.
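A rough two-channel skeleton is sketched below, with HOG as a stand-in for the shape features and Farneback optical flow for the motion channel; the paper's biologically inspired features and template learning are not reproduced.

```python
# Shape channel per frame, flow channel over frame pairs, each reduced to
# similarity scores against learnt templates, then merged for a classifier.
# Frames are expected as grayscale uint8 images.
import cv2
import numpy as np

hog = cv2.HOGDescriptor()                       # stand-in shape features

def form_features(frame):
    return hog.compute(cv2.resize(frame, (64, 128))).ravel()

def motion_features(prev, frame):
    flow = cv2.calcOpticalFlowFarneback(prev, frame, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return cv2.resize(flow, (16, 16)).ravel()   # pooled, down-sampled flow

def score_vector(feat, templates):
    """Similarity of one channel's features to each learnt template."""
    return -np.linalg.norm(np.stack(templates) - feat, axis=1)

# Per frame: concatenate the two channels' score vectors and feed them to a
# discriminative classifier (e.g. an SVM) for the final action label.
```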

Proceedings Article
31 Oct 2008
TL;DR: The 1st ACM Workshop on Analysis and Retrieval of Events, Actions and Workflows in Video Streams as discussed by the authors was held this year in Vancouver, Canada, with 16 papers that cover a variety of topics.
Abstract: It is our great pleasure to welcome you to the 1st ACM Workshop on Analysis and Retrieval of Events, Actions and Workflows in Video Streams -- ACM AREA 2008, which is held this year in Vancouver, Canada. The mission of this workshop is to present current research advances in the area of cognitive video supervision and the analysis of events, actions and workflows, a critical research task for many real-life multimedia applications. ACM AREA 2008 gives researchers a unique opportunity to share their perspectives with colleagues interested in the various aspects of video supervision and event analysis. The call for papers attracted submissions from Asia, Europe and the United States. The program committee accepted 16 papers that cover a variety of topics. These papers have been organized in four sessions. More specifically, the first session is dedicated to new algorithms and methods for object tracking in complex environments as well as to object labeling/matching techniques. The second session presents new research outcomes in the area of detecting events, actions and workflows in video sequences. Event-driven analysis of videos is presented in the third session, while the workshop ends with a special session that covers recent advances in ongoing research projects in this field. We hope that these proceedings will serve as a valuable reference for event detection and retrieval in video streams.

01 Jan 2008
TL;DR: This paper deals with the computation of a true orthographic image given a set of overlapping perspective images by using a Bayesian approach and defining a generative model of the input images.
Abstract: Orthographic images constitute an efficient and economic way to represent aerial images. This kind of information makes it possible to measure two-dimensional objects and relate them to Geographic Information Systems. This paper deals with the computation of a true orthographic image given a set of overlapping perspective images. These are, together with the internal and external calibration, the only input to our approach. These few requirements are a large advantage over systems where a digital surface model (DSM), e.g. provided by LIDAR data, is necessary. We use a Bayesian approach and define a generative model of the input images. In this model, the input images are regarded as noisy measurements of an underlying true and hence unknown orthoimage. These measurements are obtained by an image formation process (the generative model) that involves, apart from the true orthoimage, several additional parameters. Our goal is to invert the image formation process by estimating those parameters which make our input images most likely. We present results on aerial images of a complex urban environment.
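In a highly simplified form, the generative view reduces to per-pixel fusion: each photo warped into the ortho frame is a noisy measurement of the unknown orthoimage, and under i.i.d. Gaussian noise the maximum-likelihood estimate is the per-pixel mean (a median being a robust variant). The sketch below assumes the warping has already been done and ignores the additional formation parameters the paper estimates.

```python
# Per-pixel fusion of photos already warped into the ortho frame.
import numpy as np

def fuse_orthoimage(projected, robust=True):
    """projected: (N, H, W) stack of warped images, with NaN where a photo
    does not cover a pixel; returns a per-pixel median (or mean) estimate."""
    fuse = np.nanmedian if robust else np.nanmean
    return fuse(projected, axis=0)

stack = np.random.rand(5, 100, 100)
stack[0, :50] = np.nan                       # first photo covers only half
ortho = fuse_orthoimage(stack)
```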

Proceedings ArticleDOI
31 Oct 2008
TL;DR: A monocular object tracker, able to detect and track multiple object classes in non-controlled environments, using Bayesian per-pixel classification to segment an image into foreground and background objects, based on observations of object appearances and motions in real-time.
Abstract: This paper describes a monocular object tracker, able to detect and track multiple object classes in non-controlled environments. Our tracking framework uses Bayesian per-pixel classification to segment an image into foreground and background objects, based on observations of object appearances and motions in real time. Furthermore, semantically high-level events are automatically extracted from the tracking data for performance evaluation. The reliability of the event detection is demonstrated by applying it to state-of-the-art methods and comparing the results to human-annotated ground-truth data for multiple public datasets.
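The per-pixel decision can be illustrated with grey-level likelihood histograms and a foreground prior; the histograms and prior below are toy assumptions.

```python
# Per-pixel Bayesian foreground/background posterior from colour (here
# grey-level) likelihood histograms and a prior.
import numpy as np

def pixel_posterior(frame, fg_hist, bg_hist, prior_fg=0.3):
    """frame: (H, W) uint8; *_hist: 256-bin normalised likelihoods."""
    p_fg = fg_hist[frame] * prior_fg
    p_bg = bg_hist[frame] * (1.0 - prior_fg)
    return p_fg / (p_fg + p_bg + 1e-12)      # posterior foreground map

fg = np.ones(256) / 256                      # toy: flat foreground model
bg = np.zeros(256); bg[:128] = 1 / 128       # background favours dark pixels
frame = np.random.randint(0, 256, (120, 160), dtype=np.uint8)
mask = pixel_posterior(frame, fg, bg) > 0.5
```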


Proceedings ArticleDOI
26 Oct 2008
TL;DR: This workshop consists of 16 high-quality papers organized in four thematic sessions, dedicated to new object tracking algorithms for complex environments, object labeling techniques, and the detection of high-level semantics in video sequences.
Abstract: AREA 2008 is the first ACM international workshop on analysis and retrieval of events, actions and workflows in video streams. Such research is nowadays critical for many real-life applications, such as area supervision, semantic characterization and annotation of video streams, quality assurance, and security. This workshop consists of 16 high-quality papers organized in four thematic sessions. More specifically, the first session is dedicated to new object tracking algorithms for complex environments and to object labeling techniques. The second session deals with methods, tools and architectures for detecting high-level semantics (events, actions, and workflows) in video sequences. The third session presents new algorithms for analyzing video sequences, oriented towards detecting human actions or implicitly annotating multimedia content. Finally, the fourth is a special session on the recent advances of ongoing research projects in the fields of multimedia analysis, cognitive video supervision, personalized video annotation, fast retrieval of multimedia content in the compressed domain, and scheduling tools for interactive multimedia services. We hope that these proceedings will serve as a valuable reference for the analysis of events in video streams.

Book ChapterDOI
01 Jun 2008
TL;DR: In this paper, the authors present a system for autonomous mobile robot navigation that, with only an omnidirectional camera as sensor, is able to automatically and robustly build accurate, topologically organized environment maps of a complex, natural environment.
Abstract: In this work we present a novel system for autonomous mobile robot navigation. With only an omnidirectional camera as sensor, this system is able to automatically and robustly build accurate, topologically organised environment maps of a complex, natural environment. It can localise itself using that map at any moment, both at startup (the kidnapped-robot problem) and using knowledge of former localisations. The topological nature of the map is similar to the intuitive maps humans use, is memory-efficient, and enables fast and simple path planning towards a specified goal. We developed a real-time visual servoing technique to steer the system along the computed path. The key technology making this all possible is the novel fast wide-baseline feature matching, which yields an efficient description of the scene, with a focus on man-made environments.
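A toy topological map reduces to a graph whose nodes are places and whose edges come from traversals and detected loop closings; path planning is then plain shortest path, here with networkx. The place names are invented.

```python
# Topological map as a graph: loop closings add edges, planning is a
# shortest-path query.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("hall", "corridor"), ("corridor", "lab"),
                  ("corridor", "stairs"), ("stairs", "exit")])
# A detected loop closing between two visually matching places adds an edge.
G.add_edge("lab", "exit")

print(nx.shortest_path(G, "hall", "exit"))   # e.g. hall -> corridor -> lab -> exit
```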

Proceedings ArticleDOI
20 Oct 2008
TL;DR: An audio-visual platform has been constructed with the goal of helping users with disabilities or a high cognitive load to deal with unexpected events, and algorithmic approaches to the detection of such events have been developed.
Abstract: It is of prime importance in everyday human life to cope with and respond appropriately to events that are not foreseen by prior experience. Machines to a large extent lack the ability to respond appropriately to such inputs. An important class of unexpected events is defined by incongruent combinations of inputs from different modalities, and therefore multimodal information provides a crucial cue for the identification of such events, e.g., the sound of a voice is heard while the person in the field of view does not move her lips. In the project DIRAC ("Detection and Identification of Rare Audio-visual Cues") we have been developing algorithmic approaches to the detection of such events, as well as an experimental hardware platform to test them. An audio-visual platform ("AWEAR" - audio-visual wearable device) has been constructed with the goal of helping users with disabilities or a high cognitive load to deal with unexpected events. Key hardware components include stereo panoramic vision sensors and 6-channel worn-behind-the-ear (hearing aid) microphone arrays. Data have been recorded to study audio-visual tracking, a/v scene/object classification and a/v detection of incongruencies.
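The lip-motion example can be turned into a toy detector: flag time windows where speech energy is high while the visible lip region barely moves. Both signals and thresholds below are invented for illustration.

```python
# Toy audio-visual incongruence test over per-window scores in [0, 1].
import numpy as np

def incongruent_windows(audio_energy, lip_motion,
                        e_thresh=0.5, m_thresh=0.1):
    """Return indices where audio says 'speech' but the lips barely move."""
    audio_energy = np.asarray(audio_energy)
    lip_motion = np.asarray(lip_motion)
    return np.where((audio_energy > e_thresh) & (lip_motion < m_thresh))[0]

print(incongruent_windows([0.1, 0.8, 0.9, 0.2],
                          [0.2, 0.05, 0.3, 0.0]))   # -> [1]
```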

01 Jan 2008
TL;DR: In this paper, a unified approach for managing cultural heritage information is proposed to handle the storage and exchange of data, and an implementation demonstrates its use in two different cultural heritage applications in the context of large-scale projects.
Abstract: This paper explores the infrastructure needs of cultural heritage applications. Small dedicated applications as well as very large projects are considered. A unified approach for managing cultural heritage information is proposed to handle the storage and exchange of data. An implementation demonstrates its use in two different cultural heritage applications.

