
Showing papers by "Luc Van Gool published in 2004"


Journal ArticleDOI
TL;DR: A complete system to build visual models from camera images is presented. Besides traditional geometry- and image-based approaches, a combined approach with view-dependent geometry and texture is described, and the fusion of real and virtual scenes is shown as an application.
Abstract: In this paper, a complete system to build visual models from camera images is presented. The system can deal with uncalibrated image sequences acquired with a hand-held camera. Based on tracked or matched features, the relations between multiple views are computed. From this, both the structure of the scene and the motion of the camera are retrieved. The ambiguity of the reconstruction is restricted from projective to metric through self-calibration. A flexible multi-view stereo matching scheme is used to obtain a dense estimate of the surface geometry. From the computed data, different types of visual models are constructed. Besides the traditional geometry- and image-based approaches, a combined approach with view-dependent geometry and texture is presented. As an application, the fusion of real and virtual scenes is also shown.

1,029 citations
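A minimal sketch of the two-view stage described above (computing the relations between views from matched features), written with OpenCV and assuming known intrinsics K for brevity. The paper itself handles uncalibrated sequences and reaches metric structure via self-calibration rather than a given K.

```python
# Hedged sketch: recover relative camera motion from matched features.
# Assumes calibrated intrinsics K; the actual system self-calibrates.
import numpy as np
import cv2

def relative_pose(pts1, pts2, K):
    """pts1, pts2: Nx2 float arrays of matched image points."""
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    # Decompose E into rotation R and unit-norm translation t, keeping
    # the solution with positive depths (cheirality check).
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t, mask.ravel().astype(bool)
```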


Journal ArticleDOI
TL;DR: To increase the robustness of the system, two semi-local constraints on combinations of region correspondences are derived (one geometric, the other photometric); these make it possible to test the consistency of correspondences and hence to reject falsely matched regions.
Abstract: ‘Invariant regions’ are self-adaptive image patches that automatically deform with changing viewpoint so as to keep covering identical physical parts of a scene. Such regions can be extracted directly from a single image. They are then described by a set of invariant features, which makes it relatively easy to match them between views, even under wide baseline conditions. In this contribution, two methods to extract invariant regions are presented. The first starts from corners and uses the nearby edges, while the second is purely intensity-based. The goal is to build an opportunistic system that exploits several types of invariant regions as it sees fit. This yields more correspondences and a system that can deal with a wider range of images. To increase the robustness of the system, two semi-local constraints on combinations of region correspondences are derived (one geometric, the other photometric). They make it possible to test the consistency of correspondences and hence to reject falsely matched regions. Experiments on images of real-world scenes taken from substantially different viewpoints demonstrate the feasibility of the approach.

568 citations
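As an illustration of the semi-local constraint idea, the sketch below keeps a region correspondence only if it passes a geometric test (consistency of the region centres with a RANSAC-estimated epipolar geometry) and a photometric one (normalised cross-correlation of the region patches). These are simple stand-ins chosen for clarity, not the paper's exact constraints.

```python
# Hedged sketch of semi-local consistency filtering for region matches.
import numpy as np
import cv2

def filter_matches(centers1, centers2, patches1, patches2,
                   epi_thresh=2.0, ncc_thresh=0.7):
    """centers*: Nx2 arrays of region centres; patches*: lists of
    equally sized grayscale patches, one per correspondence."""
    F, mask = cv2.findFundamentalMat(centers1, centers2, cv2.FM_RANSAC,
                                     epi_thresh, 0.99)
    keep = []
    for i, geom_ok in enumerate(mask.ravel().astype(bool)):
        a = patches1[i].astype(float)
        b = patches2[i].astype(float)
        a = (a - a.mean()) / (a.std() + 1e-9)
        b = (b - b.mean()) / (b.std() + 1e-9)
        ncc = (a * b).mean()          # photometric similarity
        keep.append(bool(geom_ok) and ncc > ncc_thresh)
    return np.array(keep)
```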


Journal ArticleDOI
TL;DR: Although the generalised color moment invariants are extracted from planar surface patches, it is argued that invariant neighbourhoods offer a concept through which they can also be used to deal with 3D objects and scenes.

279 citations
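For reference, a generalised colour moment of order p + q and degree a + b + c over an image patch is M_pq^abc = Σ_x Σ_y x^p y^q R(x,y)^a G(x,y)^b B(x,y)^c; the invariants are built from combinations of such moments. A small numpy sketch:

```python
# Sketch: compute a single generalised colour moment over a patch.
import numpy as np

def generalized_color_moment(patch, p, q, a, b, c):
    """patch: HxWx3 float array, channels R, G, B scaled to [0, 1]."""
    h, w = patch.shape[:2]
    y, x = np.mgrid[0:h, 0:w].astype(float)
    R, G, B = patch[..., 0], patch[..., 1], patch[..., 2]
    return np.sum(x**p * y**q * R**a * G**b * B**c)
```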


Book ChapterDOI
11 May 2004
TL;DR: A novel object recognition approach is presented that overcomes limitations in dealing with extensive clutter, dominant occlusion, and large scale and viewpoint changes, and that can extend any viewpoint invariant feature extractor.
Abstract: Methods based on local, viewpoint invariant features have proven capable of recognizing objects in spite of viewpoint changes, occlusion and clutter. However, these approaches fail when these factors are too strong, due to the limited repeatability and discriminative power of the features. As additional shortcomings, the objects need to be rigid and only their approximate location is found. We present a novel object recognition approach which overcomes these limitations. An initial set of feature correspondences is first generated. The method anchors on it and then gradually explores the surrounding area, trying to construct more and more matching features, increasingly farther from the initial ones. The resulting process covers the object with matches, and simultaneously separates the correct matches from the wrong ones. Hence, recognition and segmentation are achieved at the same time. Only very few correct initial matches suffice for reliable recognition. The experimental results demonstrate the method's superior ability to deal with extensive clutter, dominant occlusion, and large scale and viewpoint changes. Moreover, non-rigid deformations are explicitly taken into account, and the approximate contours of the object are produced. The approach can extend any viewpoint invariant feature extractor.

209 citations
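The exploration step can be caricatured as follows: each seed correspondence carries a local affine map A from the first image to the second; grid points around the seed are transferred through A and accepted when the local appearance still agrees. This is a strong simplification of the paper's process, which also refines the local transformations as coverage grows.

```python
# Hedged sketch of growing matches outward from seed correspondences.
import numpy as np

def _patch(img, x, y, half):
    return img[y - half:y + half + 1, x - half:x + half + 1].astype(float)

def _ncc(a, b):
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    return (a * b).mean()

def expand(img1, img2, seeds, half=7, step=6, radius=24, thresh=0.8):
    """seeds: list of ((x1, y1), A) with A a 2x3 affine map img1 -> img2."""
    h1, w1 = img1.shape
    h2, w2 = img2.shape
    grown = []
    for (x1, y1), A in seeds:
        for dy in range(-radius, radius + 1, step):
            for dx in range(-radius, radius + 1, step):
                q1 = np.array([x1 + dx, y1 + dy], float)
                q2 = A[:, :2] @ q1 + A[:, 2]     # transfer to image 2
                xi, yi = int(round(q1[0])), int(round(q1[1]))
                xj, yj = int(round(q2[0])), int(round(q2[1]))
                if not (half <= xi < w1 - half and half <= yi < h1 - half
                        and half <= xj < w2 - half and half <= yj < h2 - half):
                    continue
                if _ncc(_patch(img1, xi, yi, half),
                        _patch(img2, xj, yj, half)) > thresh:
                    grown.append(((xi, yi), (xj, yj)))
    return grown
```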


Book ChapterDOI
16 May 2004
TL;DR: An EM-algorithm is described, which iterates between estimating values for all hidden quantities, and optimizing the current optical flow estimates by differential techniques, and an important new feature is the photometric detection of occluded pixels.
Abstract: This paper deals with the computation of optical flow and occlusion detection in the case of large displacements. We propose a Bayesian approach to the optical flow problem and solve it by means of differential techniques. The images are regarded as noisy measurements of an underlying 'true' image-function. Additionally, the image data is considered incomplete, in the sense that we do not know which pixels from a particular image are occluded in the other images. We describe an EM-algorithm, which iterates between estimating values for all hidden quantities and optimizing the current optical flow estimates by differential techniques. The Bayesian way of describing the problem leads to more insight into existing differential approaches, and offers some natural extensions to them. The resulting system involves fewer parameters and gives an interpretation to the remaining ones. An important new feature is the photometric detection of occluded pixels. We compare the algorithm with existing optical flow methods on ground truth data. The comparison shows that our algorithm generates the most accurate optical flow estimates. We further illustrate the approach with some challenging real-world examples.

65 citations
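The EM structure can be illustrated on a deliberately reduced problem: a single global displacement, a Gaussian noise model for visible pixels, and a uniform density for occluded ones. The E-step yields per-pixel visibility weights; the M-step re-estimates the mixture parameters and takes a weighted Lucas-Kanade step. This is a toy version of the paper's dense-flow formulation, not the method itself.

```python
# Toy EM for a global displacement with photometric occlusion weights.
import numpy as np

def warp_translate(I, d):
    # integer shift for simplicity; real systems warp with subpixel accuracy
    dx, dy = int(round(d[0])), int(round(d[1]))
    return np.roll(np.roll(I, -dy, axis=0), -dx, axis=1)

def em_translation_flow(I0, I1, iters=20):
    d = np.zeros(2)                        # global displacement (dx, dy)
    sigma, p_occ, u_occ = 10.0, 0.1, 1.0 / 256.0
    for _ in range(iters):
        W = warp_translate(I1.astype(float), d)
        r = I0.astype(float) - W
        # E-step: posterior visibility (Gaussian inlier vs uniform occluded)
        g = np.exp(-r**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
        w = (1 - p_occ) * g / ((1 - p_occ) * g + p_occ * u_occ)
        # M-step: mixture parameters
        sigma = np.sqrt((w * r**2).sum() / w.sum() + 1e-12)
        p_occ = 1.0 - w.mean()
        # M-step: weighted differential (Lucas-Kanade style) update of d
        gy, gx = np.gradient(W)
        A = np.array([[(w * gx * gx).sum(), (w * gx * gy).sum()],
                      [(w * gx * gy).sum(), (w * gy * gy).sum()]])
        b = np.array([(w * gx * r).sum(), (w * gy * r).sum()])
        d += np.linalg.solve(A + 1e-6 * np.eye(2), b)
    return d, w
```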


Journal ArticleDOI
01 Dec 2004
TL;DR: This paper proposes to exploit the increased linear coupling between camera and object translations that tends to appear at false scales; the destruction of special motion properties at false scales provides a second, 'non-accidentalness' criterion for selecting the correct motion among the one-parameter family.
Abstract: The 3D reconstruction of scenes containing independently moving objects from uncalibrated monocular sequences still poses serious challenges. Even if the background and the moving objects are rigid, each reconstruction is only known up to a certain scale, which results in a one-parameter family of possible, relative trajectories per moving object with respect to the background. In order to determine a realistic solution from this family of possible trajectories, this paper proposes to exploit the increased linear coupling between camera and object translations that tends to appear at false scales. An independence criterion is formulated in the sense of true object and camera motions being minimally correlated. The increased coupling at false scales can also lead to the destruction of special properties such as planarity, periodicity, etc. of the true object motion. This provides us with a second, 'non-accidentalness' criterion for the selection of the correct motion among the one-parameter family.

46 citations
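The independence criterion lends itself to a compact sketch: score each candidate scale by how correlated the implied object trajectory is with the camera translation, and pick the least-coupled one. Here `trajectory_at_scale` is a placeholder for the reconstruction-specific mapping from a scale value to the object's relative trajectory.

```python
# Hedged sketch of selecting the object scale by minimal motion coupling.
import numpy as np

def coupling_score(cam_T, obj_T):
    """cam_T, obj_T: NxD translation sequences; mean |correlation| per axis."""
    c = cam_T - cam_T.mean(axis=0)
    o = obj_T - obj_T.mean(axis=0)
    num = (c * o).sum(axis=0)
    den = np.sqrt((c**2).sum(axis=0) * (o**2).sum(axis=0)) + 1e-12
    return np.abs(num / den).mean()

def select_scale(cam_T, trajectory_at_scale, scales):
    # trajectory_at_scale(s) -> NxD object trajectory implied by scale s
    scores = [coupling_score(cam_T, trajectory_at_scale(s)) for s in scales]
    return scales[int(np.argmin(scores))]
```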


Journal ArticleDOI
01 Jul 2004
TL;DR: In this article, the authors presented a practical approach to detecting shot cuts and extracting keyframes from video sequences, which has two stages - global motion compensation, followed by an adaptive thresholding algorithm.
Abstract: This paper presents a practical approach to detecting shot cuts and extracting keyframes from video sequences. Shot cut detection has two stages - global motion compensation, followed by an adaptive thresholding algorithm. The motion information is further utilized to extract representative keyframes. Special consideration has been given to achieving real-time performance on a regular PC, which led to a motion estimation algorithm of linear complexity.

41 citations
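A compact sketch of the two-stage detector, with phase correlation standing in for the paper's linear-complexity motion estimator: compensate the global shift between consecutive frames, then declare a cut when the compensated difference jumps above an adaptive threshold computed from recent history.

```python
# Hedged sketch: motion-compensated frame differencing + adaptive threshold.
import numpy as np
import cv2

def shot_cuts(frames, win=30, k=3.0):
    """frames: list of grayscale images; returns indices of detected cuts."""
    cuts, history = [], []
    prev = frames[0].astype(np.float32)
    for i, f in enumerate(frames[1:], start=1):
        cur = f.astype(np.float32)
        (dx, dy), _ = cv2.phaseCorrelate(prev, cur)   # global shift estimate
        M = np.float32([[1, 0, -dx], [0, 1, -dy]])
        comp = cv2.warpAffine(cur, M, cur.shape[::-1])
        score = float(np.mean(np.abs(comp - prev)))
        recent = history[-win:]
        if len(recent) >= 5 and score > np.mean(recent) + k * np.std(recent):
            cuts.append(i)
        history.append(score)
        prev = cur
    return cuts
```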


01 Jan 2004
TL;DR: A view interpolation algorithm is proposed which makes it possible to create new intermediate views from the existing camera images, and the combination of all these techniques gathers information from a complete camera network and produces one attractive real-time video stream.
Abstract: We present a camera network system consisting of several modules of 2-3 low-end cameras attached to one computer. It is not possible for a human to observe all the information coming from such a network simultaneously. Our system is designed to select the best viewpoint for each part of the video sequence, thus automatically creating one real-time video stream that contains the most important data. It acts as a combination of a director and a cameraman. Cinematography has developed its own terminology, techniques and rules for making a good movie. We illustrate some of these techniques and how they can be applied to a camera network to solve the best viewpoint selection problem. Our system consists of only fixed cameras, but the output is not constrained to already existing views. A virtual zoom can be applied to select only a part of the view. We propose a view interpolation algorithm which makes it possible to create new intermediate views from the existing camera images. The combination of all these techniques gathers information from a complete camera network and produces one attractive real-time video stream. The resulting video can typically be used for telepresence applications or as a documentary or instruction video.

31 citations
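One of the cinematography rules alluded to, a minimum shot length so the output does not cut back and forth too quickly, can be sketched as a simple selection policy. The per-camera `scores` are an assumed saliency measure (e.g. the amount of activity visible in each view), not a quantity defined in the paper.

```python
# Toy best-view selection with a minimum shot length (anti-flicker rule).
def select_views(scores_per_frame, min_shot=25):
    """scores_per_frame: iterable of per-camera score lists, one per frame."""
    current, age, out = None, 0, []
    for scores in scores_per_frame:
        best = max(range(len(scores)), key=lambda c: scores[c])
        if current is None or (best != current and age >= min_shot):
            current, age = best, 0     # cut to the new best camera
        out.append(current)
        age += 1
    return out
```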


01 May 2004
TL;DR: This work was motivated by the goal of building a navigation system that could guide people or robots around in large complex urban environments, even in situations in which Global Positioning Systems cannot provide navigational information.
Abstract: This work was motivated by the goal of building a navigation system that could guide people or robots around in large, complex urban environments, even in situations in which Global Positioning Systems (GPS) cannot provide navigational information. Such environments include indoor and crowded city areas where there is no line of sight to the GPS satellites. Because installing active badges or beacon systems involves substantial effort and expense, we have developed a system which navigates solely based on naturally occurring landmarks. As sensory input, we only use a panoramic camera system which provides omnidirectional images of the environment. During the training stage, the system is led around in the environment while recording images at regular time intervals. Offline, these images are automatically archived in a world model. Unlike traditional approaches, we do not build a Euclidean metric map. The world model used is a graph reflecting the topological structure of the environment: for indoor environments, for example, rooms are nodes and corridors are edges of the graph. Image comparison is done using both global color measures and matching of specially developed local features. These measures are designed to be robust, respectively invariant, to image distortions caused by viewpoint changes, illumination changes and occlusions. This leads to a system that can recognize a certain place even if its location is not exactly the same as the location from where the reference image was taken, even if the illumination is substantially different, and even if there are large occluded parts. Using this world model, localization can be done by comparing a new query image, taken at the current position of the mobile system, with the images in the model. A Bayesian framework makes it possible to track the system's position in quasi real time. When the present location is known, a path to a target location can be planned easily using the topological map.

25 citations
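The Bayesian position tracking over the topological map amounts to a discrete Bayes filter: the belief over graph nodes is propagated through a transition matrix derived from the graph's connectivity and updated with an image-similarity likelihood. The `likelihood` vector stands in for the paper's combination of global colour measures and local-feature matching.

```python
# Minimal discrete Bayes filter over topological-map nodes.
import numpy as np

def track_step(belief, transition, likelihood):
    """belief: (N,) prior over nodes; transition: (N, N) row-stochastic
    matrix from graph connectivity; likelihood: (N,) image similarity."""
    predicted = belief @ transition      # prediction along graph edges
    posterior = predicted * likelihood   # measurement update
    return posterior / posterior.sum()
```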


Journal IssueDOI
TL;DR: Ways to improve the performance of vision-based mobile robot navigation for wheelchairs by incorporating inertial sensors are investigated.
Abstract: This paper describes ongoing research on vision-based mobile robot navigation for wheelchairs. After a guided tour through a natural environment while taking images at regular time intervals, natural landmarks are extracted to automatically build a topological map. Later on, this map can be used for place recognition and navigation. We use visual servoing on the landmarks to steer the robot. In this paper, we investigate ways to improve the performance by incorporating inertial sensors. © 2004 Wiley Periodicals, Inc.

20 citations
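One simple way to blend the two sensor modalities, offered here purely as an illustration and not as the paper's method, is a complementary filter: integrate the gyro yaw rate (smooth but drifting) and correct it with the drift-free heading recovered from visual landmarks.

```python
# Hedged sketch: complementary filter fusing gyro and visual heading.
def fuse_heading(theta, gyro_rate, visual_theta, dt, alpha=0.98):
    """All angles in radians; wrap-around handling omitted for brevity."""
    predicted = theta + gyro_rate * dt       # inertial prediction
    return alpha * predicted + (1 - alpha) * visual_theta
```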


01 Jan 2004
TL;DR: The aim has been to build a maximally realistic but also veridical model of the Antonine nymphaeum at the Sagalassos excavation site, using techniques for 3D acquisition, texture modelling and synthesis, data clean-up, and visualisation.
Abstract: Computer technologies make possible virtual reconstructions of ancient structures. In this paper we give a concise overview of the techniques we have used to build a detailed 3D model of the Antonine nymphaeum at the Sagalassos excavation site. These include techniques for 3D acquisition, texture modelling and synthesis, data clean-up, and visualisation. Our aim has been to build a maximally realistic but also veridical model. The paper is also meant as a plea to include such levels of detail in models where the data allow it. There is an ongoing debate about whether high levels of detail, and photo-realistic visualisation for that matter, are desirable in the first place. Indeed, detailed models combined with photo-realistic rendering may convey an impression of reality, whereas they can never represent the situation as it really was. Of course, we agree that filling in completely hypothetical structures may be more misleading than informative. On the other hand, good indications about these structures, or even actual fragments thereof, may often be available. Leaving out any structures one is not absolutely sure about, combining basic geometric primitives, or adopting copy-and-paste methods – all aspects regularly found in simple model building – also entail dangers. Such models may fail to generate interest with the public and, even if they do, may fail to illustrate ornamental sophistication or shape and pattern irregularities.

Proceedings Article
01 Jan 2004
TL;DR: This work presents a method to (semi-)automatically annotate video material by focusing on recognizing specific objects and scenes in keyframes and proposes to gather more evidence about the presence of the object by exploring the image around the initial matches.
Abstract: We present a method to (semi-)automatically annotate video material. More precisely, we focus on recognizing specific objects and scenes in keyframes. Objects are learnt simply by having the user delineate them in one (or a few) images. The basic building block to achieve this goal consists of affine invariant regions. These are local image patches that adapt their shape based on the image content so as to be invariant to viewpoint changes. Instead of simply matching the regions and counting the number of matches, we propose to gather more evidence about the presence of the object by exploring the image around the initial matches. This boosts the performance, especially under difficult, real-world imaging conditions. Experimental results on news broadcast data demonstrate the viability of the approach.

Book ChapterDOI
01 Jan 2004
TL;DR: This work attempts to improve on the current state of the art in face animation, especially for the creation of highly realistic lip and speech-related motions, by narrowing the gap between modelling and animation.
Abstract: The problem of realistic face animation is a difficult one. This is hampering a further breakthrough of some high-tech domains, such as special effects in the movies, the use of 3D face models in communications, the use of avatars and likenesses in virtual reality, and the production of games with more subtle scenarios. This work attempts to improve on the current state of the art in face animation, especially for the creation of highly realistic lip and speech-related motions. To that end, 3D models of faces are used and, based on the latest technology, speech-related 3D face motion is learned from examples. Thus, the chapter subscribes to the surging field of image-based modelling and widens its scope to include animation. The exploitation of detailed 3D motion sequences is quite unique, thereby narrowing the gap between modelling and animation. From measured 3D face deformations around the mouth area, typical motions are extracted for different “visemes.” Visemes are the basic motion patterns observed for speech and are comparable to the phonemes of auditory speech. The visemes are studied with sufficient detail to also cover natural variations and differences between individuals. Furthermore, the transition between visemes is analysed in terms of co-articulation effects, i.e., the visual blending of visemes as required for fluent, natural speech. The work presented in this chapter also encompasses the animation of faces for which no visemes have been observed and extracted. The “transplantation” of visemes to novel faces for which no viseme data have been recorded and for which only a static 3D model is available allows for the animation of faces without an extensive learning procedure for each individual.
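The viseme blending described for co-articulation can be illustrated as a convex combination of per-viseme displacement fields over the face mesh; the weight curves would come from the learned co-articulation model. A hypothetical sketch:

```python
# Sketch: blend viseme displacement fields into one frame's deformation.
import numpy as np

def blend_visemes(viseme_fields, weights):
    """viseme_fields: (K, V, 3) vertex displacements for K visemes over a
    V-vertex mesh; weights: (K,) nonnegative co-articulation weights."""
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), viseme_fields, axes=1)   # (V, 3)
```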

Book ChapterDOI
21 Jun 2004
TL;DR: In this paper, a wavelet transform of the upper body is used for the recognition, tracking and pose estimation of people in video sequences, with SVM classification and a particle filter handling tracking and the detection of specific, learned poses.
Abstract: This paper presents a new system for recognition, tracking and pose estimation of people in video sequences. It is based on the wavelet transform of the upper body and uses Support Vector Machines (SVM) for classification. Recognition is carried out hierarchically by first recognizing people and then individual characters. The characteristic features that best discriminate one person from another are learned automatically. Tracking is solved via a particle filter that utilizes the SVM output and a first-order kinematic model to obtain a robust scheme that successfully handles occlusion, different poses and camera zooms. For pose estimation, a collection of SVM classifiers is evaluated to detect specific, learned poses.
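The tracking loop can be sketched as a particle filter whose measurement likelihood is derived from the SVM output, with a first-order kinematic prediction. Here `svm_score` is an assumed callback returning the classifier margin for an image window, and the logistic squashing into weights is one plausible choice, not necessarily the paper's.

```python
# Hedged sketch: particle filter step with an SVM-based likelihood.
import numpy as np

def pf_step(particles, velocities, svm_score, frame, noise=3.0, rng=None):
    """particles, velocities: (N, d) state arrays (e.g. x, y, scale)."""
    rng = rng if rng is not None else np.random.default_rng()
    # predict with the first-order kinematic model plus diffusion
    particles = particles + velocities + rng.normal(0, noise, particles.shape)
    # weight particles by the squashed SVM margin at each window
    margins = np.array([svm_score(frame, p) for p in particles])
    w = 1.0 / (1.0 + np.exp(-margins))
    w /= w.sum()
    # resample according to the weights
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], velocities[idx]
```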



01 Jan 2004
TL;DR: Work is proposed to enable oppositely positioned scanning modules to acquire 3D data simultaneously and thereby to speed up the acquisition even further.
Abstract: Motion capturing systems that are based on 3D models require high-speed scanning methods. One-shot structured light techniques aim at a good balance between speed and accuracy. Due to pattern interference, currently available setups capture 3D from only one single viewpoint. We propose work to enable oppositely positioned scanning modules to acquire 3D data simultaneously and thereby to speed up the acquisition even further. Key is the application of dynamic projection masks that limit the structured light projection to the relevant part of the scene, i.e., the person. This requires tracking of the person's outline.
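A minimal sketch of the dynamic-mask step, assuming the silhouette tracker is given: the projected pattern is restricted to a dilated silhouette of the person, so that oppositely positioned projectors illuminate disjoint parts of the scene.

```python
# Sketch: build a projection mask from a tracked person silhouette.
import numpy as np
import cv2

def projection_mask(silhouette, margin_px=15):
    """silhouette: HxW uint8 binary image (255 where the person is)."""
    k = 2 * margin_px + 1
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
    return cv2.dilate(silhouette, kernel)   # mask with a safety margin
```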

Proceedings ArticleDOI
01 Jan 2004
TL;DR: A structured light approach that supports interactive modeling, allowing the user to check the quality of the result during scanning and to perform effective view planning; the projected patterns are automatically adapted to the scene.
Abstract: 3D acquisition technology has made big strides forward over the past years. Systems have become easier to use, cheaper, and faster. These developments are discussed in the plenary presentation that goes with this paper, both for the capture of shape and of surface textures. As the breadth of topics would only allow for a very superficial description of any particular method, the paper focuses on one such recent development as a good case in point: a structured light approach that supports interactive modeling. Textured 3D models are produced on-line, while the object is manipulated in front of the system. This allows the user to check the quality of the result during the scanning and to perform effective view planning. Moreover, the projected patterns are automatically adapted to the scene. The system only requires a regular camera, an LCD projector, and a PC.

Proceedings ArticleDOI
01 Jan 2004
TL;DR: An algorithm to generate an interpolated view between two camera viewpoints in a fast and automatic way is presented to develop more advanced tele-teaching and videoconferencing environments, and this without the need of many cameras.
Abstract: This paper presents an algorithm to generate an interpolated view between two camera viewpoints in a fast and automatic way (6-7 fps on a PentIV @ 2.6 GHz, Geforce FX AGP 4). Nothing more than a desktop PC and a set of low-end consumer grade cameras are needed to simulate the video stream of any intermediate camera. The GPU ('plane sweep' algorithm) and the CPU ('min-cut/max-flow' regularisation algorithm) are used in parallel to calculate the depth values. The final interpolations for any intermediate camera position are obtained by a projectively correct blended warp of the input images on a 3D mesh. Limited extrapolation is also feasible. The goal is to develop more advanced tele-teaching and videoconferencing environments, without the need for many cameras. Camera movements can be simulated and the best view can be selected, whether it is recorded by a real camera or not. Compared to putting a human editor in control, the cost decreases dramatically, without losing the added value of video stream editing.
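A CPU sketch of the plane-sweep core (the paper runs it on the GPU): for each candidate depth plane, warp both input views into the virtual view through the plane-induced homographies and keep, per pixel, the plane with the best photo-consistency. The per-plane homographies are assumed precomputed from the calibrated cameras; the min-cut/max-flow regularisation stage is omitted.

```python
# Hedged sketch of plane-sweep depth scoring for view interpolation.
import numpy as np
import cv2

def plane_sweep(img1, img2, homographies, shape):
    """homographies: list of (H1, H2) 3x3 maps, one pair per depth plane;
    shape: (h, w) of the virtual view. Grayscale inputs assumed."""
    h, w = shape
    best_cost = np.full((h, w), np.inf, np.float32)
    best_plane = np.zeros((h, w), np.int32)
    for d, (H1, H2) in enumerate(homographies):
        w1 = cv2.warpPerspective(img1, H1, (w, h)).astype(np.float32)
        w2 = cv2.warpPerspective(img2, H2, (w, h)).astype(np.float32)
        cost = np.abs(w1 - w2)              # photo-consistency per pixel
        better = cost < best_cost
        best_cost[better] = cost[better]
        best_plane[better] = d
    return best_plane, best_cost
```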

Book ChapterDOI
21 Jun 2004
TL;DR: The paper proposes a novel affine invariant region type that is built up from a combination of fitted superellipses and has the advantage of offering a much wider range of shapes through the addition of a very limited number of shape parameters.
Abstract: Affine invariant regions have proved a powerful feature for object recognition and categorization. These features heavily rely on object textures rather than shapes, however. Typically, their shapes have been fixed to ellipses or parallelograms. The paper proposes a novel affine invariant region type that is built up from a combination of fitted superellipses. These novel features have the advantage of offering a much wider range of shapes through the addition of a very limited number of shape parameters, with the traditional ellipses and parallelograms as subsets. The paper offers a solution for the robust fitting of superellipses to partial contours, which is a crucial step towards the implementation of the novel features.
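A superellipse satisfies |x/a|^n + |y/b|^n = 1. As a hedged illustration of the fitting step, the sketch below fits an axis-aligned, centred superellipse to contour points with a robust loss; the paper's method additionally estimates pose and is designed specifically for partial contours.

```python
# Sketch: robust superellipse fit (axis-aligned, centred simplification).
import numpy as np
from scipy.optimize import least_squares

def fit_superellipse(points, a0=1.0, b0=1.0, n0=2.0):
    """points: Mx2 contour points in the curve's canonical frame."""
    x, y = points[:, 0], points[:, 1]
    def residual(p):
        a, b, n = p
        return np.abs(x / a)**n + np.abs(y / b)**n - 1.0   # algebraic error
    sol = least_squares(residual, x0=[a0, b0, n0], loss="soft_l1",
                        bounds=([1e-3, 1e-3, 0.5], [np.inf, np.inf, 10.0]))
    return sol.x   # a, b, n
```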

Book ChapterDOI
12 Oct 2004
TL;DR: An EM-algorithm is described which iterates between estimating values for all hidden quantities and optimizing the optical flow by differential techniques; the resulting system involves fewer parameters and gives an interpretation to the remaining ones.
Abstract: This paper deals with the computation of dense image correspondences and the detection of occlusion. We propose a Bayesian approach to the image registration problem. The images are regarded as noisy measurements of an underlying 'true' image-function. Additionally, the image data is considered incomplete, in the sense that we do not know which pixels from a particular image are occluded in the other images. We describe an EM-algorithm, which iterates between estimating values for all hidden quantities and optimizing the optical flow by differential techniques. The Bayesian way of describing the problem leads to more insight into existing differential approaches, and offers some natural extensions to them. The resulting system involves fewer parameters and gives an interpretation to the remaining ones. An important feature is the photometric detection of occluded pixels.