
Showing papers in "International Journal of Computer Vision in 2007"


Journal ArticleDOI
TL;DR: This work formulates stitching as a multi-image matching problem and uses invariant local features to find matches between all of the images; the resulting method is insensitive to the ordering, orientation, scale and illumination of the input images.
Abstract: This paper concerns the problem of fully automated panoramic image stitching. Though the 1D problem (single axis of rotation) is well studied, 2D or multi-row stitching is more difficult. Previous approaches have used human input or restrictions on the image sequence in order to establish matching images. In this work, we formulate stitching as a multi-image matching problem, and use invariant local features to find matches between all of the images. Because of this, our method is insensitive to the ordering, orientation, scale and illumination of the input images. It is also insensitive to noise images that are not part of a panorama, and can recognise multiple panoramas in an unordered image dataset. In addition to providing more detail, this paper extends our previous work in the area (Brown and Lowe, 2003) by introducing gain compensation and automatic straightening steps.
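
The pairwise matching stage lends itself to a compact sketch. Below is a minimal, illustrative version using OpenCV's SIFT features and RANSAC homography estimation; the ratio-test value, inlier threshold and the name match_pair are choices made here, not the paper's exact parameters.

```python
# Sketch of the pairwise image-matching stage of automatic stitching,
# assuming OpenCV >= 4.4 (SIFT included). Thresholds are illustrative.
import cv2
import numpy as np

def match_pair(img1, img2, min_inliers=20):
    """Return (overlaps, H): whether two images plausibly overlap, and the
    estimated homography between them."""
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img1, None)
    k2, d2 = sift.detectAndCompute(img2, None)
    # Lowe's ratio test keeps only distinctive matches.
    good = [m for m, n in cv2.BFMatcher().knnMatch(d1, d2, k=2)
            if m.distance < 0.7 * n.distance]
    if len(good) < 4:
        return False, None
    src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    # Two images are declared part of the same panorama if enough of the
    # matches survive RANSAC as geometric inliers.
    return mask is not None and int(mask.sum()) >= min_inliers, H
```

Connected components of the resulting pairwise match graph then separate distinct panoramas in an unordered collection, which is the recognition behaviour the abstract describes.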

2,550 citations


Journal ArticleDOI
TL;DR: A survey of a specific class of region-based level set segmentation methods and how they can all be derived from a common statistical framework is presented.
Abstract: Since their introduction as a means of front propagation and their first application to edge-based segmentation in the early 1990s, level set methods have become increasingly popular as a general framework for image segmentation. In this paper, we present a survey of a specific class of region-based level set segmentation methods and clarify how they can all be derived from a common statistical framework. Region-based segmentation schemes aim at partitioning the image domain by progressively fitting statistical models to the intensity, color, texture or motion in each of a set of regions. In contrast to edge-based schemes such as the classical Snakes, region-based methods tend to be less sensitive to noise. For typical images, the respective cost functionals tend to have fewer local minima, which makes them particularly well-suited for local optimization methods such as the level set method. We detail a general statistical formulation for level set segmentation. Subsequently, we clarify how the integration of various low level criteria leads to a set of cost functionals. We point out relations between the different segmentation schemes. In experimental results, we demonstrate how the level set function is driven to partition the image plane into domains of coherent color, texture, dynamic texture or motion. Moreover, the Bayesian formulation allows us to introduce prior shape knowledge into the level set method. We briefly review a number of advances in this domain.
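
One concrete member of this region-based family is the two-region piecewise-constant model (Chan-Vese style). The sketch below, assuming a grayscale numpy image, pulls each pixel toward the region whose mean intensity explains it better; the curvature regularizer is crudely replaced by Gaussian smoothing of the level set function, and the step size and iteration count are illustrative.

```python
# Minimal sketch of a two-region, piecewise-constant level set update
# (Chan-Vese style), one instance of the region-based family surveyed here.
import numpy as np
from scipy.ndimage import gaussian_filter

def chan_vese_step(phi, image, dt=0.5):
    inside, outside = phi > 0, phi <= 0
    c1 = image[inside].mean() if inside.any() else 0.0
    c2 = image[outside].mean() if outside.any() else 0.0
    # Each pixel is pulled toward the region whose mean fits it better.
    force = -(image - c1) ** 2 + (image - c2) ** 2
    return gaussian_filter(phi + dt * force, sigma=1.0)  # crude regularizer

def segment(image, n_iter=200):
    h, w = image.shape
    yy, xx = np.mgrid[:h, :w]
    phi = -np.hypot(xx - w / 2, yy - h / 2) + min(h, w) / 4  # initial circle
    for _ in range(n_iter):
        phi = chan_vese_step(phi, image.astype(float))
    return phi > 0  # binary segmentation mask
```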

1,117 citations


Journal ArticleDOI
TL;DR: This work presents an approach to automatically detect and track multiple, possibly partially occluded humans in a walking or standing pose from a single camera, which may be stationary or moving.
Abstract: Detection and tracking of humans in video streams is important for many applications. We present an approach to automatically detect and track multiple, possibly partially occluded humans in a walking or standing pose from a single camera, which may be stationary or moving. A human body is represented as an assembly of body parts. Part detectors are learned by boosting a number of weak classifiers which are based on edgelet features. Responses of part detectors are combined to form a joint likelihood model that includes an analysis of possible occlusions. The combined detection responses and the part detection responses provide the observations used for tracking. Trajectory initialization and termination are both automatic and rely on the confidences computed from the detection responses. An object is tracked by data association and meanshift methods. Our system can track humans with both inter-object and scene occlusions with static or non-static backgrounds. Evaluation results on a number of images and videos and comparisons with some previous methods are given.
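
The combination of part responses into a joint likelihood can be schematized as below. The visibility-weighted log-likelihood sum and all names are a simplification invented here for illustration; the paper's occlusion analysis is considerably richer.

```python
# Schematic combination of part-detector responses into a joint score,
# with occlusion handling reduced to a per-part visibility estimate.
import numpy as np

def joint_score(part_scores, visibility, prior=0.5):
    """part_scores: dict part_name -> detector response in (0, 1];
    visibility: dict part_name -> estimated fraction visible in [0, 1]."""
    log_lik = np.log(prior)
    for part, s in part_scores.items():
        v = visibility.get(part, 1.0)
        if v < 0.3:   # heavily occluded parts contribute no evidence
            continue
        log_lik += v * np.log(max(s, 1e-6))
    return log_lik

# Toy usage: legs occluded, so the hypothesis rests on head and torso.
score = joint_score({"head": 0.9, "torso": 0.8, "legs": 0.2},
                    {"head": 1.0, "torso": 1.0, "legs": 0.1})
```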

836 citations


Journal ArticleDOI
TL;DR: This paper takes the first step towards constructing the surface layout, a labeling of the image into geometric classes, by learning appearance-based models of these geometric classes, which coarsely describe the 3D scene orientation of each image region.
Abstract: Humans have an amazing ability to instantly grasp the overall 3D structure of a scene--ground orientation, relative positions of major landmarks, etc.--even from a single image. This ability is completely missing in most popular recognition algorithms, which pretend that the world is flat and/or view it through a patch-sized peephole. Yet it seems very likely that having a grasp of this "surface layout" of a scene should be of great assistance for many tasks, including recognition, navigation, and novel view synthesis. In this paper, we take the first step towards constructing the surface layout, a labeling of the image into geometric classes. Our main insight is to learn appearance-based models of these geometric classes, which coarsely describe the 3D scene orientation of each image region. Our multiple segmentation framework provides robust spatial support, allowing a wide variety of cues (e.g., color, texture, and perspective) to contribute to the confidence in each geometric label. In experiments on a large set of outdoor images, we evaluate the impact of the individual cues and design choices in our algorithm. We further demonstrate the applicability of our method to indoor images, describe potential applications, and discuss extensions to a more complete notion of surface layout.

735 citations


Journal ArticleDOI
TL;DR: This paper presents a multi-cue vision system for the real-time detection and tracking of pedestrians from a moving vehicle; results from extensive field tests in difficult urban traffic conditions suggest that system performance is at the leading edge.
Abstract: This paper presents a multi-cue vision system for the real-time detection and tracking of pedestrians from a moving vehicle. The detection component involves a cascade of modules, each utilizing complementary visual criteria to successively narrow down the image search space, balancing robustness and efficiency considerations. A novel aspect is the tight integration of the consecutive modules: (sparse) stereo-based ROI generation, shape-based detection, texture-based classification and (dense) stereo-based verification. For example, shape-based detection activates a weighted combination of texture-based classifiers, each attuned to a particular body pose. Performance of individual modules and their interaction is analyzed by means of Receiver Operating Characteristics (ROCs). A sequential optimization technique allows the successive combination of individual ROCs, providing optimized system parameter settings in a systematic fashion and avoiding ad-hoc parameter tuning. Application-dependent processing constraints can be incorporated in the optimization procedure. Results from extensive field tests in difficult urban traffic conditions suggest that system performance is at the leading edge.
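
The cascade itself is a simple control structure: each module prunes the candidate set before a more expensive module runs, as this toy sketch shows (placeholder tests and thresholds; the real modules wrap stereo, shape matching and texture classification).

```python
# Illustrative skeleton of a detection cascade: cheap tests run first and
# reject candidates so that expensive stages see fewer regions of interest.
def cascade(candidates, modules):
    """modules: list of (name, test_fn); test_fn(candidate) -> keep?"""
    for name, test in modules:
        candidates = [c for c in candidates if test(c)]
        if not candidates:
            break  # everything rejected; stop early
    return candidates

# Toy usage with placeholder attributes and thresholds.
rois = [{"x": 10, "disparity": 4.0, "shape_score": 0.8, "texture_score": 0.9},
        {"x": 50, "disparity": 0.2, "shape_score": 0.3, "texture_score": 0.1}]
detections = cascade(rois, [
    ("sparse-stereo ROI generation", lambda r: r["disparity"] > 1.0),
    ("shape-based detection",        lambda r: r["shape_score"] > 0.5),
    ("texture classification",       lambda r: r["texture_score"] > 0.5),
])
```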

605 citations


Journal ArticleDOI
TL;DR: This work addresses the problem of detecting irregularities in visual data, e.g., detecting suspicious behaviors in video sequences, or identifying salient patterns in images, using a probabilistic graphical model.
Abstract: We address the problem of detecting irregularities in visual data, e.g., detecting suspicious behaviors in video sequences, or identifying salient patterns in images. The term "irregular" depends on the context in which the "regular" or "valid" are defined. Yet, it is not realistic to expect explicit definition of all possible valid configurations for a given context. We pose the problem of determining the validity of visual data as a process of constructing a puzzle: We try to compose a new observed image region or a new video segment ("the query") using chunks of data ("pieces of puzzle") extracted from previous visual examples ("the database"). Regions in the observed data which can be composed using large contiguous chunks of data from the database are considered very likely, whereas regions in the observed data which cannot be composed from the database (or can be composed, but only using small fragmented pieces) are regarded as unlikely/suspicious. The problem is posed as an inference process in a probabilistic graphical model. We show applications of this approach to identifying saliency in images and video, for detecting suspicious behaviors and for automatic visual inspection for quality assurance.
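
A heavily simplified stand-in for the composition idea: score each query patch by how well its best database patch explains it. The real method additionally rewards explanations by large contiguous chunks via inference in the graphical model, which this brute-force sketch omits.

```python
# Patch-level "can this be composed from the database?" score: high distance
# to the nearest database patch = hard to compose = suspicious. Brute force
# for clarity; memory grows with both patch counts.
import numpy as np

def patch_grid(img, p=8):
    h, w = img.shape
    return np.array([img[i:i+p, j:j+p].ravel()
                     for i in range(0, h - p + 1, p)
                     for j in range(0, w - p + 1, p)])

def suspiciousness(query_img, database_imgs, p=8):
    db = np.vstack([patch_grid(d, p) for d in database_imgs])
    q = patch_grid(query_img, p)
    # Squared distance from each query patch to every database patch.
    d2 = ((q[:, None, :] - db[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1)  # per-patch score: best available explanation
```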

523 citations


Journal ArticleDOI
TL;DR: A method based on intersecting epipolar constraints is designed to provide ground truth correspondences automatically; it relies purely on geometric information and does not depend on the choice of a specific feature appearance descriptor.
Abstract: We explore the performance of a number of popular feature detectors and descriptors in matching 3D object features across viewpoints and lighting conditions. To this end we design a method, based on intersecting epipolar constraints, for providing ground truth correspondence automatically. These correspondences are based purely on geometric information, and do not rely on the choice of a specific feature appearance descriptor. We test detector-descriptor combinations on a database of 100 objects viewed from 144 calibrated viewpoints under three different lighting conditions. We find that the combination of the Hessian-affine feature finder and SIFT features is most robust to viewpoint change. Harris-affine combined with SIFT and Hessian-affine combined with shape context descriptors were best respectively for lighting change and change in camera focal length. We also find that no detector-descriptor combination performs well with viewpoint changes of more than 25–30°.
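
The ground-truth mechanism reduces to checking epipolar consistency in several calibrated views. A sketch, assuming the fundamental matrices between view pairs are known from calibration and using an illustrative tolerance:

```python
# Epipolar-constraint verification: a candidate correspondence is kept only
# if it is consistent (x2^T F x1 ~ 0) across multiple view pairs, which is
# the "intersecting epipolar constraints" idea in simplified form.
import numpy as np

def epipolar_distance(F, x1, x2):
    """Symmetric distance of homogeneous points x1, x2 to each other's
    epipolar lines under fundamental matrix F (convention x2^T F x1 = 0)."""
    l2, l1 = F @ x1, F.T @ x2      # epipolar lines in images 2 and 1
    num = float(x2 @ F @ x1) ** 2
    return num / (l2[0]**2 + l2[1]**2) + num / (l1[0]**2 + l1[1]**2)

def consistent(Fs, pts, tol=1.0):
    """Fs: dict (i, j) -> fundamental matrix between views i and j;
    pts: candidate locations of one feature in the corresponding views."""
    return all(epipolar_distance(F, pts[i], pts[j]) < tol
               for (i, j), F in Fs.items())
```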

497 citations


Journal ArticleDOI
TL;DR: This paper studies face hallucination, or synthesizing a high-resolution face image from an input low-resolution image, with the help of a large collection of other high-resolution face images, to generate photorealistic face images.
Abstract: In this paper, we study face hallucination, or synthesizing a high-resolution face image from an input low-resolution image, with the help of a large collection of other high-resolution face images. Our theoretical contribution is a two-step statistical modeling approach that integrates both a global parametric model and a local nonparametric model. At the first step, we derive a global linear model to learn the relationship between the high-resolution face images and their smoothed and down-sampled lower resolution ones. At the second step, we model the residue between an original high-resolution image and the reconstructed high-resolution image after applying the learned linear model by a patch-based non-parametric Markov network to capture the high-frequency content. By integrating both global and local models, we can generate photorealistic face images. A practical contribution is a robust warping algorithm to align the low-resolution face images to obtain good hallucination results. The effectiveness of our approach is demonstrated by extensive experiments generating high-quality hallucinated face images from low-resolution input with no manual alignment.

450 citations


Journal ArticleDOI
TL;DR: This paper shows how to perform photometric stereo assuming that all lights in a scene are distant from the object but otherwise unconstrained, and demonstrates the method by reconstructing the shape of objects from images obtained under a variety of lightings.
Abstract: Work on photometric stereo has shown how to recover the shape and reflectance properties of an object using multiple images taken with a fixed viewpoint and variable lighting conditions. This work has primarily relied on known lighting conditions or the presence of a single point source of light in each image. In this paper we show how to perform photometric stereo assuming that all lights in a scene are distant from the object but otherwise unconstrained. Lighting in each image may be unknown and may include an arbitrary combination of diffuse, point and extended sources. Our work is based on recent results showing that for Lambertian objects, general lighting conditions can be represented using low order spherical harmonics. Using this representation we can recover shape by performing a simple optimization in a low-dimensional space. We also analyze the shape ambiguities that arise in such a representation. We demonstrate our method by reconstructing the shape of objects from images obtained under a variety of lightings. We further compare the reconstructed shapes against shapes obtained with a laser scanner.
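
For a Lambertian point under general distant lighting, first-order spherical harmonics give intensity as approximately a dot product between a 4-vector of lighting coefficients and an albedo-scaled normal. Assuming those coefficients were known (the paper recovers them, up to the ambiguities it analyzes), shape follows by per-pixel least squares, as this sketch shows:

```python
# First-order spherical-harmonic photometric stereo sketch: intensity at a
# Lambertian point ~ l . (rho, rho*nx, rho*ny, rho*nz). With known per-image
# lighting rows l_k, the per-pixel 4-vector B follows by least squares.
import numpy as np

def recover_normals(images, L):
    """images: (k, h, w) stack of images; L: (k, 4) lighting coefficients."""
    k, h, w = images.shape
    I = images.reshape(k, -1)                   # (k, pixels)
    B, *_ = np.linalg.lstsq(L, I, rcond=None)   # (4, pixels)
    albedo = np.linalg.norm(B[1:], axis=0)      # scale of the normal part
    normals = B[1:] / np.maximum(albedo, 1e-9)  # unit surface normals
    return normals.reshape(3, h, w), albedo.reshape(h, w)
```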

438 citations


Journal ArticleDOI
TL;DR: A photometric model that describes the intensities produced by individual rain streaks and a dynamic model that captures the spatio-temporal properties of rain are developed, which describe the complete visual appearance of rain.
Abstract: The visual effects of rain are complex. Rain produces sharp intensity changes in images and videos that can severely impair the performance of outdoor vision systems. In this paper, we provide a comprehensive analysis of the visual effects of rain and the various factors that affect them. Based on this analysis, we develop efficient algorithms for handling rain in computer vision as well as for photorealistic rendering of rain in computer graphics. We first develop a photometric model that describes the intensities produced by individual rain streaks and a dynamic model that captures the spatio-temporal properties of rain. Together, these models describe the complete visual appearance of rain. Using these models, we develop a simple and effective post-processing algorithm for detection and removal of rain from videos. We show that our algorithm can distinguish rain from complex motion of scene objects and other time-varying textures. We then extend our analysis by studying how various factors such as camera parameters, rain properties and scene brightness affect the appearance of rain. We show that the unique physical properties of rain--its small size, high velocity and spatial distribution--make its visibility depend strongly on camera parameters. This dependence is used to reduce the visibility of rain during image acquisition by judiciously selecting camera parameters. Conversely, camera parameters can also be chosen to enhance the visibility of rain. This ability can be used to develop an inexpensive and portable camera-based rain gauge that provides instantaneous rain-rate measurements. Finally, we develop a rain streak appearance model that accounts for the rapid shape distortions (i.e. oscillations) that a raindrop undergoes as it falls. We show that modeling these distortions allows us to faithfully render the complex intensity patterns that are visible in the case of raindrops that are close to the camera.

434 citations


Journal ArticleDOI
TL;DR: A novel image representation is presented that makes it possible to access natural scenes by local semantic description; a perceptually plausible distance measure is learned that leads to a high correlation between the human and the automatically obtained typicality rankings.
Abstract: In this paper, we present a novel image representation that renders it possible to access natural scenes by local semantic description. Our work is motivated by the continuing effort in content-based image retrieval to extract and to model the semantic content of images. The basic idea of the semantic modeling is to classify local image regions into semantic concept classes such as water, rocks, or foliage. Images are represented through the frequency of occurrence of these local concepts. Through extensive experiments, we demonstrate that the image representation is well suited for modeling the semantic content of heterogeneous scene categories, and thus for categorization and retrieval. The image representation also allows us to rank natural scenes according to their semantic similarity relative to certain scene categories. Based on human ranking data, we learn a perceptually plausible distance measure that leads to a high correlation between the human and the automatically obtained typicality ranking. This result is especially valuable for content-based image retrieval where the goal is to present retrieval results in descending semantic similarity from the query.
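
The representation itself reduces to a frequency-of-occurrence histogram of local concept labels over a grid of image regions. A sketch with a stub region classifier; the concept list repeats the classes named in the abstract plus illustrative additions:

```python
# Sketch of the image representation: classify local regions into semantic
# concept classes and describe the image by their frequency of occurrence.
# The classifier is a stub here; the paper trains one on labeled regions.
import numpy as np

CONCEPTS = ["water", "rocks", "foliage", "sky", "grass", "sand"]  # illustrative

def concept_histogram(image, classify_region, grid=10):
    """classify_region(region) -> index into CONCEPTS."""
    h, w = image.shape[:2]
    counts = np.zeros(len(CONCEPTS))
    for i in range(grid):
        for j in range(grid):
            region = image[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            counts[classify_region(region)] += 1
    return counts / counts.sum()  # frequency of occurrence of local concepts
```

Scene similarity, and hence typicality ranking, then becomes a distance between such histograms; the paper learns a perceptually plausible one from human ranking data.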

Journal ArticleDOI
TL;DR: A new real-time localization system for a mobile robot is presented, showing that autonomous navigation is possible in outdoor situations with the use of a single camera and natural landmarks, via a three-step approach.
Abstract: This paper presents a new real-time localization system for a mobile robot. We show that autonomous navigation is possible in outdoor situations with the use of a single camera and natural landmarks. To do that, we use a three-step approach. In a learning step, the robot is manually guided on a path and a video sequence is recorded with a front looking camera. Then a structure from motion algorithm is used to build a 3D map from this learning sequence. Finally, in the navigation step, the robot uses this map to compute its localization in real-time and it follows the learning path or a slightly different path if desired. The vision algorithms used for map building and localization are first detailed. Then a large part of the paper is dedicated to the experimental evaluation of the accuracy and robustness of our algorithms, based on experimental data collected during two years in various environments.

Journal ArticleDOI
TL;DR: A new variational method for multi-view stereovision and non-rigid three-dimensional motion estimation from multiple video sequences is presented; it minimizes the prediction error of the shape and motion estimates and results in a simpler, more flexible, and more efficient implementation than in existing methods.
Abstract: We present a new variational method for multi-view stereovision and non-rigid three-dimensional motion estimation from multiple video sequences. Our method minimizes the prediction error of the shape and motion estimates. Both problems then translate into a generic image registration task. The latter is entrusted to a global measure of image similarity, chosen depending on imaging conditions and scene properties. Rather than integrating a matching measure computed independently at each surface point, our approach computes a global image-based matching score between the input images and the predicted images. The matching process fully handles projective distortion and partial occlusions. Neighborhood as well as global intensity information can be exploited to improve the robustness to appearance changes due to non-Lambertian materials and illumination changes, without any approximation of shape, motion or visibility. Moreover, our approach results in a simpler, more flexible, and more efficient implementation than in existing methods. The computation time on large datasets does not exceed thirty minutes on a standard workstation. Finally, our method is compliant with a hardware implementation with graphics processor units. Our stereovision algorithm yields very good results on a variety of datasets including specularities and translucency. We have successfully tested our motion estimation algorithm on a very challenging multi-view video sequence of a non-rigid scene.

Journal ArticleDOI
TL;DR: Two approaches to the SLAM problem using vision are presented, one with stereovision and one with monocular images, both of which rely on a robust interest point matching algorithm that works in very diverse environments.
Abstract: Building a spatially consistent model is a key functionality to endow a mobile robot with autonomy. Without an initial map or an absolute localization means, it requires concurrently solving the localization and mapping problems. For this purpose, vision is a powerful sensor, because it provides data from which stable features can be extracted and matched as the robot moves. But it does not directly provide 3D information, which is a difficulty for estimating the geometry of the environment. This article presents two approaches to the SLAM problem using vision: one with stereovision, and one with monocular images. Both approaches rely on a robust interest point matching algorithm that works in very diverse environments. The stereovision based approach is a classic SLAM implementation, whereas the monocular approach introduces a new way to initialize landmarks. Both approaches are analyzed and compared with extensive experimental results, with a rover and a blimp.

Journal ArticleDOI
TL;DR: Experiments with natural and synthetic sequences illustrate how the learned optical flow prior quantitatively improves flow accuracy and how it captures the rich spatial structure found in natural scene motion.
Abstract: We present an analysis of the spatial and temporal statistics of "natural" optical flow fields and a novel flow algorithm that exploits their spatial statistics. Training flow fields are constructed using range images of natural scenes and 3D camera motions recovered from hand-held and car-mounted video sequences. A detailed analysis of optical flow statistics in natural scenes is presented and machine learning methods are developed to learn a Markov random field model of optical flow. The prior probability of a flow field is formulated as a Field-of-Experts model that captures the spatial statistics in overlapping patches and is trained using contrastive divergence. This new optical flow prior is compared with previous robust priors and is incorporated into a recent, accurate algorithm for dense optical flow computation. Experiments with natural and synthetic sequences illustrate how the learned optical flow prior quantitatively improves flow accuracy and how it captures the rich spatial structure found in natural scene motion.
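
The Field-of-Experts prior has a simple computational form: learned filters are applied over overlapping patches and their responses are scored by Student-t experts. The sketch below evaluates such a log-prior for one flow component; the filters and weights are random placeholders, whereas the paper learns them by contrastive divergence.

```python
# Evaluating a Field-of-Experts log-prior on one component of a flow field.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
J = [rng.standard_normal((3, 3)) for _ in range(8)]  # placeholder filters
alpha = np.abs(rng.standard_normal(8))               # placeholder weights

def foe_log_prior(flow_component):
    """Un-normalized log-prior of one flow component (u or v), a 2D array."""
    lp = 0.0
    for Ji, ai in zip(J, alpha):
        r = convolve2d(flow_component, Ji, mode="valid")  # filter responses
        lp -= ai * np.log1p(0.5 * r ** 2).sum()           # Student-t experts
    return lp
```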

Journal ArticleDOI
TL;DR: In this article, a stereo method for image-based rendering is proposed, which relies on over-segmenting the source images and computing match values over entire segments rather than single pixels.
Abstract: In this paper, we propose a stereo method specifically designed for image-based rendering. For effective image-based rendering, the interpolated views need only be visually plausible. The implication is that the extracted depths do not need to be correct, as long as the recovered views appear to be correct. Our stereo algorithm relies on over-segmenting the source images. Computing match values over entire segments rather than single pixels provides robustness to noise and intensity bias. Color-based segmentation also helps to more precisely delineate object boundaries, which is important for reducing boundary artifacts in synthesized views. The depths of the segments for each image are computed using loopy belief propagation within a Markov Random Field framework. Neighboring MRFs are used for occlusion reasoning and ensuring that neighboring depth maps are consistent. We tested our stereo algorithm on several stereo pairs from the Middlebury data set, and show rendering results based on two of these data sets. We also show results for video-based rendering.

Journal ArticleDOI
TL;DR: The loop closing technique is extended to a multi-robot mapping problem in which the outputs of several uncoordinated, SLAM-enabled robots are fused without requiring inter-vehicle observations or a-priori frame alignment.
Abstract: This paper is concerned with "loop closing" for mobile robots. Loop closing is the problem of correctly asserting that a robot has returned to a previously visited area. It is a particularly hard but important component of the Simultaneous Localization and Mapping (SLAM) problem. Here a mobile robot explores an a-priori unknown environment performing on-the-fly mapping while the map is used to localize the vehicle. Many SLAM implementations look to internal map and vehicle estimates (p.d.fs) to make decisions about whether a vehicle is revisiting a previously mapped area or is exploring a new region of workspace. We suggest that one of the reasons loop closing is hard in SLAM is precisely because these internal estimates can, despite best efforts, be in gross error. The "loop closer" we propose, analyze and demonstrate makes no recourse to the metric estimates of the SLAM system it supports and aids---it is entirely independent. At regular intervals the vehicle captures the appearance of the local scene (with camera and laser). We encode the similarity between all possible pairings of scenes in a "similarity matrix". We then pose the loop closing problem as the task of extracting statistically significant sequences of similar scenes from this matrix. We show how suitable analysis (introspection) and decomposition (remediation) of the similarity matrix allows for the reliable detection of loops despite the presence of repetitive and visually ambiguous scenes. We demonstrate the technique supporting a SLAM system driven by scan-matching laser data in a variety of settings. Some of the outdoor settings are beyond the capability of the SLAM system itself in which case GPS was used to provide a ground truth. We further show how the techniques can equally be applied to detect loop closure using spatial images taken with a scanning laser. We conclude with an extension of the loop closing technique to a multi-robot mapping problem in which the outputs of several, uncoordinated and SLAM-enabled robots are fused without requiring inter-vehicle observations or a-priori frame alignment.
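
The appearance-based core can be sketched in a few lines: build a similarity matrix over per-scene descriptors, then keep only entries that extend along a run of consecutively similar scenes. A plain threshold stands in here for the paper's introspection and remediation of the matrix.

```python
# Sketch of similarity-matrix loop closing: candidate loop closures are
# off-diagonal entries that continue along a diagonal run, i.e. sequences
# of similar scenes, not isolated one-off matches.
import numpy as np

def similarity_matrix(descriptors):
    D = np.vstack(descriptors)                       # one descriptor per scene
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    return D @ D.T                                   # cosine similarity

def loop_candidates(M, thresh=0.9, min_run=3, min_gap=10):
    n = len(M)
    hits = [(i, j) for i in range(n) for j in range(i + min_gap, n)
            if M[i, j] > thresh]
    hit_set = set(hits)
    # Keep hits whose diagonal neighbors are also hits (sequence evidence).
    return [(i, j) for i, j in hits
            if all((i + k, j + k) in hit_set for k in range(min_run))]
```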

Journal ArticleDOI
TL;DR: An algorithm is described which automatically learns two separate sets of facial components for the detection and identification tasks; experiments clearly show that the component-based approach is superior to global approaches.
Abstract: We present a component-based framework for face detection and identification. The face detection and identification modules share the same hierarchical architecture. They both consist of two layers of classifiers, a layer with a set of component classifiers and a layer with a single combination classifier. The component classifiers independently detect/identify facial parts in the image. Their outputs are passed to the combination classifier which performs the final detection/identification of the face. We describe an algorithm which automatically learns two separate sets of facial components for the detection and identification tasks. In experiments we compare the detection and identification systems to standard global approaches. The experimental results clearly show that our component-based approach is superior to global approaches.

Journal ArticleDOI
TL;DR: The flexible nature of the model is demonstrated by results over six diverse object categories including geometrically constrained categories (e.g. faces, cars) and flexible objects (such as animals).
Abstract: We investigate a method for learning object categories in a weakly supervised manner. Given a set of images known to contain the target category from a similar viewpoint, learning is translation and scale-invariant; does not require alignment or correspondence between the training images, and is robust to clutter and occlusion. Category models are probabilistic constellations of parts, and their parameters are estimated by maximizing the likelihood of the training data. The appearance of the parts, as well as their mutual position, relative scale and probability of detection are explicitly described in the model. Recognition takes place in two stages. First, a feature-finder identifies promising locations for the model's parts. Second, the category model is used to compare the likelihood that the observed features are generated by the category model, or are generated by background clutter. The flexible nature of the model is demonstrated by results over six diverse object categories including geometrically constrained categories (e.g. faces, cars) and flexible objects (such as animals).

Journal ArticleDOI
TL;DR: In this article, the elastic properties of the curves are encoded in Riemannian metrics on shape spaces, and geodesics in these spaces are used to quantify shape divergence and to develop morphing techniques.
Abstract: We study shapes of planar arcs and closed contours modeled on elastic curves obtained by bending, stretching or compressing line segments non-uniformly along their extensions. Shapes are represented as elements of a quotient space of curves obtained by identifying those that differ by shape-preserving transformations. The elastic properties of the curves are encoded in Riemannian metrics on these spaces. Geodesics in shape spaces are used to quantify shape divergence and to develop morphing techniques. The shape spaces and metrics constructed are novel and offer an environment for the study of shape statistics. Elasticity leads to shape correspondences and deformations that are more natural and intuitive than those obtained in several existing models. Applications of shape geodesics to the definition and calculation of mean shapes and to the development of shape clustering techniques are also investigated.

Journal ArticleDOI
TL;DR: In this paper, an ellipse fitting method is used to detect eyeglass regions, which are then replaced with eye template patterns to preserve the details useful for face recognition in the fused image.
Abstract: This paper describes a new software-based registration and fusion of visible and thermal infrared (IR) image data for face recognition in challenging operating environments that involve illumination variations. The combined use of visible and thermal IR imaging sensors offers a viable means for improving the performance of face recognition techniques based on a single imaging modality. Despite successes in indoor access control applications, imaging in the visible spectrum demonstrates difficulties in recognizing the faces in varying illumination conditions. Thermal IR sensors measure energy radiations from the object, which is less sensitive to illumination changes, and are even operable in darkness. However, thermal images do not provide high-resolution data. Data fusion of visible and thermal images can produce face images robust to illumination variations. However, thermal face images with eyeglasses may fail to provide useful information around the eyes since glass blocks a large portion of thermal energy. In this paper, eyeglass regions are detected using an ellipse fitting method, and replaced with eye template patterns to preserve the details useful for face recognition in the fused image. Software registration of images replaces a special-purpose imaging sensor assembly and produces co-registered image pairs at a reasonable cost for large-scale deployment. Face recognition techniques using visible, thermal IR, and data-fused visible-thermal images are compared using a commercial face recognition software (FaceIt®) and two visible-thermal face image databases (the NIST/Equinox and the UTK-IRIS databases). The proposed multiscale data-fusion technique improved the recognition accuracy under a wide range of illumination changes. Experimental results showed that the eyeglass replacement increased the number of correct first match subjects by 85% (NIST/Equinox) and 67% (UTK-IRIS).

Journal ArticleDOI
TL;DR: Six recent cost aggregation approaches are implemented and optimized for graphics hardware so that real-time speed can be achieved; their performance in terms of both processing speed and result quality is reported.
Abstract: Many vision applications require high-accuracy dense disparity maps in real-time and online. Due to time constraint, most real-time stereo applications rely on local winner-takes-all optimization in the disparity computation process. These local approaches are generally outperformed by offline global optimization based algorithms. However, recent research shows that, through carefully selecting and aggregating the matching costs of neighboring pixels, the disparity maps produced by a local approach can be more accurate than those generated by many global optimization techniques. We are therefore motivated to investigate whether these cost aggregation approaches can be adopted in real-time stereo applications and, if so, how well they perform under the real-time constraint. The evaluation is conducted on a real-time stereo platform, which utilizes the processing power of programmable graphics hardware. Six recent cost aggregation approaches are implemented and optimized for graphics hardware so that real-time speed can be achieved. The performances of these aggregation approaches in terms of both processing speed and result quality are reported.
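
The simplest point on the spectrum evaluated here is square-window aggregation of absolute differences followed by winner-takes-all. A CPU reference sketch (window size and disparity range are illustrative); the GPU versions benchmarked in the paper are optimized variants of this pattern:

```python
# Local stereo baseline: absolute-difference cost volume, box-filter cost
# aggregation over a square support window, winner-takes-all disparity.
import numpy as np
from scipy.ndimage import uniform_filter

def disparity_map(left, right, max_disp=32, window=9):
    left, right = left.astype(float), right.astype(float)
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        diff = np.abs(left[:, d:] - right[:, :w - d])
        # Aggregate the per-pixel matching cost over the support window.
        cost[d, :, d:] = uniform_filter(diff, size=window)
    return cost.argmin(axis=0)  # winner-takes-all over disparities
```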

Journal ArticleDOI
TL;DR: This paper reformulates the generic geometric active contour model by redefining the notion of gradient in accordance with Sobolev-type inner products, and calls the resulting flows Sobolev active contours.
Abstract: All previous geometric active contour models that have been formulated as gradient flows of various energies use the same L^2-type inner product to define the notion of gradient. Recent work has shown that this inner product induces a pathological Riemannian metric on the space of smooth curves. However, there are also undesirable features associated with the gradient flows that this inner product induces. In this paper, we reformulate the generic geometric active contour model by redefining the notion of gradient in accordance with Sobolev-type inner products. We call the resulting flows Sobolev active contours. Sobolev metrics induce favorable regularity properties in their gradient flows. In addition, Sobolev active contours favor global translations, but are not restricted to such motions; they are also less susceptible to certain types of local minima in contrast to traditional active contours. These properties are particularly useful in tracking applications. We demonstrate the general methodology by reformulating some standard edge-based and region-based active contour models as Sobolev active contours and show the substantial improvements gained in segmentation.
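
Computationally, the key step is replacing the L^2 gradient at the contour points with its Sobolev counterpart, i.e., smoothing it along the curve. On a closed, uniformly sampled contour this can be done by inverting (I - lambda * d^2/ds^2) in the Fourier domain, as in this sketch (the smoothing constant is illustrative):

```python
# Sobolev gradient of a closed contour: solve (I - lam * d^2/ds^2) g = g_L2
# along the curve via FFT, treating the vertices as uniformly spaced.
import numpy as np

def sobolev_gradient(g_l2, lam=100.0):
    """g_l2: (n, 2) array of L2 gradient vectors at n contour vertices."""
    n = g_l2.shape[0]
    k = np.fft.fftfreq(n) * n                       # integer frequencies
    mult = 1.0 / (1.0 + lam * (2 * np.pi * k / n) ** 2)
    G = np.fft.fft(g_l2, axis=0)
    return np.real(np.fft.ifft(G * mult[:, None], axis=0))
```

Low frequencies, in particular global translations, pass nearly unchanged while high-frequency perturbations are damped, which is why these flows favor global motions and are less prone to certain local minima.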

Journal ArticleDOI
TL;DR: In this paper, the authors present a system for autonomous mobile robot navigation that, with only an omnidirectional camera as sensor, automatically and robustly builds accurate, topologically organised maps of complex, natural environments.
Abstract: In this work we present a novel system for autonomous mobile robot navigation. With only an omnidirectional camera as sensor, this system is able to build automatically and robustly accurate topologically organised environment maps of a complex, natural environment. It can localise itself using such a map at each moment, either at startup (kidnapped robot) or using knowledge of former localisations. The topological nature of the map is similar to the intuitive maps humans use, is memory-efficient and enables fast and simple path planning towards a specified goal. We developed a real-time visual servoing technique to steer the system along the computed path. A key technology making this all possible is the novel fast wide baseline feature matching, which yields an efficient description of the scene, with a focus on man-made environments.

Journal ArticleDOI
TL;DR: This article summarizes the design and performance of computer vision algorithms used on Mars in the NASA/JPL Mars Exploration Rover (MER) mission, which was a major step forward in the use of computer vision in space.
Abstract: Increasing the level of spacecraft autonomy is essential for broadening the reach of solar system exploration. Computer vision has and will continue to play an important role in increasing autonomy of both spacecraft and Earth-based robotic vehicles. This article addresses progress on computer vision for planetary rovers and landers and has four main parts. First, we review major milestones in the development of computer vision for robotic vehicles over the last four decades. Since research on applications for Earth and space has often been closely intertwined, the review includes elements of both. Second, we summarize the design and performance of computer vision algorithms used on Mars in the NASA/JPL Mars Exploration Rover (MER) mission, which was a major step forward in the use of computer vision in space. These algorithms performed stereo vision and visual odometry for rover navigation, and feature tracking for horizontal velocity estimation for the landers. Third, we summarize ongoing research to improve vision systems for planetary rovers, which includes various aspects of noise reduction, FPGA implementation, and vision-based slip perception. Finally, we briefly survey other opportunities for computer vision to impact rovers, landers, and orbiters in future solar system exploration missions.

Journal ArticleDOI
TL;DR: A new methodology to compute the mean periorbital temperature signal is proposed; it is capable of coping with the challenges posed by the realistic setting and opens the way for automating lie detection in such settings.
Abstract: Previous work has demonstrated the correlation of increased blood perfusion in the orbital muscles and stress levels for human beings. It has also been suggested that this periorbital perfusion can be quantified through the processing of thermal video. The idea has been based on the fact that skin temperature is heavily modulated by superficial blood flow. Proof of this concept was established for two different types of stress inducing experiments: startle experiments and mock-crime polygraph interrogations. However, the polygraph interrogation scenarios were simplistic and highly constrained. In the present paper, we report results derived from a large and realistic mock-crime interrogation experiment. The interrogation is free flowing and no restrictions have been placed on the subjects. Additionally, we propose a new methodology to compute the mean periorbital temperature signal. The present approach addresses the deficiencies of the earlier methodology and is capable of coping with the challenges posed by the realistic setting. Specifically, it features a tandem CONDENSATION tracker to register the periorbital area in the context of a moving face. It operates on the raw temperature signal and tries to improve the information content by suppressing the noise level instead of amplifying the signal as a whole. Finally, a pattern recognition method distinguishes stressful (Deceptive) from non-stressful (Non-Deceptive) subjects based on a comparative measure between the entire interrogation signal (baseline) and a critical subsection of it (transient response). The successful classification rate is 87.2% for 39 subjects. This is on par with the success rate achieved by highly trained psycho-physiological experts and opens the way for automating lie detection in realistic settings.

Journal ArticleDOI
TL;DR: The proposed transform not only achieves important mathematical properties but also follows as closely as possible what is known about the receptive field properties of the simple cells of the Primary Visual Cortex (V1) and the statistics of natural images, making it a promising tool for processing natural images.
Abstract: Orthogonal and biorthogonal wavelets became very popular image processing tools but exhibit major drawbacks, namely a poor resolution in orientation and the lack of translation invariance due to aliasing between subbands. Alternative multiresolution transforms which specifically solve these drawbacks have been proposed. These transforms are generally overcomplete and consequently offer large degrees of freedom in their design. At the same time, their optimization becomes a challenging task. We propose here the construction of log-Gabor wavelet transforms which allow exact reconstruction and strengthen the excellent mathematical properties of the Gabor filters. Two major improvements on the previous Gabor wavelet schemes are proposed: first, the highest frequency bands are covered by narrowly localized oriented filters. Secondly, the set of filters covers the Fourier domain uniformly, including the highest and lowest frequencies, and thus exact reconstruction is achieved using the same filters in both the direct and the inverse transforms (which means that the transform is self-invertible). The present transform not only achieves important mathematical properties, it also follows as closely as possible what is known about the receptive field properties of the simple cells of the Primary Visual Cortex (V1) and about the statistics of natural images. Compared to the state of the art, the log-Gabor wavelets show excellent ability to segregate the image information (e.g. the contrast edges) from spatially incoherent Gaussian noise by hard thresholding, and then to represent image features through a reduced set of large magnitude coefficients. Such characteristics make the transform a promising tool for processing natural images.
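
A single log-Gabor filter is defined in the Fourier domain as a log-Gaussian radial profile (zero at DC by construction) multiplied by a Gaussian angular profile. A sketch with common bandwidth choices, not necessarily those of the paper's transform:

```python
# Construction of one log-Gabor filter in the Fourier domain.
import numpy as np

def log_gabor(shape, f0=0.1, theta0=0.0, sigma_on_f=0.65, sigma_theta=0.4):
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.hypot(fx, fy)
    r[0, 0] = 1.0                      # avoid log(0); DC is zeroed below
    radial = np.exp(-np.log(r / f0) ** 2 / (2 * np.log(sigma_on_f) ** 2))
    radial[0, 0] = 0.0                 # log-Gabor has no DC component
    dtheta = np.arctan2(fy, fx) - theta0
    dtheta = np.arctan2(np.sin(dtheta), np.cos(dtheta))  # wrap to [-pi, pi]
    angular = np.exp(-dtheta ** 2 / (2 * sigma_theta ** 2))
    return radial * angular            # multiply by FFT(image) to filter
```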

Journal ArticleDOI
TL;DR: A method is presented to recover a dense optical flow field from two images that explicitly takes into account the symmetry across the images as well as possible occlusions in the flow field; a classical approach is extended to handle both.
Abstract: Traditional techniques of dense optical flow estimation do not generally yield symmetrical solutions: the results will differ if they are applied between images I1 and I2 or between images I2 and I1. In this work, we present a method to recover a dense optical flow field map from two images, while explicitly taking into account the symmetry across the images as well as possible occlusions in the flow field. The idea is to consider both displacement vectors from I1 to I2 and from I2 to I1 and to minimise an energy functional that explicitly encodes all those properties. This variational problem is then solved using the gradient flow defined by the Euler-Lagrange equations associated to the energy. To prove the importance of the concepts of symmetry and occlusions for optical flow computation, we have extended a classical approach to handle those. Experiments clearly show the added value of these properties to improve the accuracy of the computed flows.

Journal ArticleDOI
TL;DR: This article presents the integration of 3-D shape knowledge into a variational model for level set based image segmentation and contour based 3-D pose tracking, where a 3-D free form model ensures that for each view the model can fit the data in the image very well.
Abstract: In this article we present the integration of 3-D shape knowledge into a variational model for level set based image segmentation and contour based 3-D pose tracking. Given the surface model of an object that is visible in the image of one or multiple cameras calibrated to the same world coordinate system, the object contour extracted by the segmentation method is applied to estimate the 3-D pose parameters of the object. Vice-versa, the surface model projected to the image plane helps in a top-down manner to improve the extraction of the contour. While common alternative segmentation approaches, which integrate 2-D shape knowledge, face the problem that an object can look very different from various viewpoints, a 3-D free form model ensures that for each view the model can fit the data in the image very well. Moreover, one additionally solves the problem of determining the object's pose in 3-D space. The performance is demonstrated by numerous experiments with a monocular and a stereo camera system.

Journal ArticleDOI
TL;DR: The notion of the support associated with an instantiation is defined, and the combination of a deformable model with an efficient estimation procedure yields competitive results in a variety of applications with very small training sets, without the need to train decision boundaries.
Abstract: We formulate a deformable template model for objects with an efficient mechanism for computation and parameter estimation. The data consists of binary oriented edge features, robust to photometric variation and small local deformations. The template is defined in terms of probability arrays for each edge type. A primary contribution of this paper is the definition of the instantiation of an object in terms of shifts of a moderate number of local submodels--parts--which are subsequently recombined using a patchwork operation to define a coherent statistical model of the data. Object classes are modeled as mixtures of patchwork of parts (POP) models that are discovered sequentially as more class data is observed. We define the notion of the support associated with an instantiation, and use this to formulate statistical models for multi-object configurations including possible occlusions. All decisions on the labeling of the objects in the image are based on comparing likelihoods. The combination of a deformable model with an efficient estimation procedure yields competitive results in a variety of applications with very small training sets, without the need to train decision boundaries--only data from the class being trained is used. Experiments are presented on the MNIST database, reading zipcodes, and face detection.