
Showing papers in "International Journal of Computer Vision in 2000"


Journal ArticleDOI
TL;DR: This paper investigates the properties of a metric between two distributions, the Earth Mover's Distance (EMD), for content-based image retrieval, and compares the retrieval performance of the EMD with that of other distances.
Abstract: We investigate the properties of a metric between two distributions, the Earth Mover's Distance (EMD), for content-based image retrieval. The EMD is based on the minimal cost that must be paid to transform one distribution into the other, in a precise sense, and was first proposed for certain vision problems by Peleg, Werman, and Rom. For image retrieval, we combine this idea with a representation scheme for distributions that is based on vector quantization. This combination leads to an image comparison framework that often accounts for perceptual similarity better than other previously proposed methods. The EMD is based on a solution to the transportation problem from linear optimization, for which efficient algorithms are available, and also allows naturally for partial matching. It is more robust than histogram matching techniques, in that it can operate on variable-length representations of the distributions that avoid quantization and other binning problems typical of histograms. When used to compare distributions with the same overall mass, the EMD is a true metric. In this paper we focus on applications to color and texture, and we compare the retrieval performance of the EMD with that of other distances.

4,593 citations
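
For readers who want to experiment with the distance itself, the transportation-problem formulation can be prototyped in a few lines. The sketch below is a minimal illustration, not the authors' implementation: signatures are (weight, point) pairs such as vector-quantized color clusters, the ground distance is Euclidean, and the LP is handed to scipy.optimize.linprog.

```python
# Minimal sketch of the Earth Mover's Distance as a transportation problem.
import numpy as np
from scipy.optimize import linprog

def emd(weights1, points1, weights2, points2):
    """EMD between two signatures with Euclidean ground distance."""
    w1, w2 = np.asarray(weights1, float), np.asarray(weights2, float)
    p1, p2 = np.asarray(points1, float), np.asarray(points2, float)
    m, n = len(w1), len(w2)
    # Ground distance matrix, flattened into the LP cost vector.
    cost = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=2).ravel()
    # Row sums <= w1, column sums <= w2 (this is what permits partial matching).
    A_ub, b_ub = [], []
    for i in range(m):
        row = np.zeros((m, n)); row[i, :] = 1
        A_ub.append(row.ravel()); b_ub.append(w1[i])
    for j in range(n):
        col = np.zeros((m, n)); col[:, j] = 1
        A_ub.append(col.ravel()); b_ub.append(w2[j])
    # Total flow equals the smaller of the two total masses.
    A_eq, b_eq = np.ones((1, m * n)), [min(w1.sum(), w2.sum())]
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun / b_eq[0]   # work normalized by total flow

# Example: two tiny 3-D color signatures.
d = emd([0.6, 0.4], [[255, 0, 0], [0, 0, 255]],
        [0.5, 0.5], [[250, 10, 10], [10, 10, 250]])
```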


Journal ArticleDOI
TL;DR: A universal statistical model for texture images in the context of an overcomplete complex wavelet transform is presented, demonstrating the necessity of subgroups of the parameter set by showing examples of texture synthesis that fail when those parameters are removed from the set.
Abstract: We present a universal statistical model for texture images in the context of an overcomplete complex wavelet transform. The model is parameterized by a set of statistics computed on pairs of coefficients corresponding to basis functions at adjacent spatial locations, orientations, and scales. We develop an efficient algorithm for synthesizing random images subject to these constraints, by iteratively projecting onto the set of images satisfying each constraint, and we use this to test the perceptual validity of the model. In particular, we demonstrate the necessity of subgroups of the parameter set by showing examples of texture synthesis that fail when those parameters are removed from the set. We also demonstrate the power of our model by successfully synthesizing examples drawn from a diverse collection of artificial and natural textures.

1,978 citations
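
The core synthesis loop, iteratively projecting an image onto sets satisfying texture statistics, can be illustrated with a drastically reduced constraint set. The toy sketch below matches only the mean and variance of each subband of an orthogonal real wavelet (via PyWavelets), whereas the paper constrains joint statistics of an overcomplete complex transform; it shows the projection mechanics only.

```python
# Toy synthesis-by-projection: from noise, repeatedly enforce target
# subband means/variances (a tiny subset of the paper's constraints).
import numpy as np
import pywt

def match_stats(coef, target):
    """Project one subband onto the set with the target mean/variance."""
    c = coef - coef.mean()
    c *= target.std() / (c.std() + 1e-12)
    return c + target.mean()

def synthesize(texture, n_iters=25, levels=3, seed=0):
    rng = np.random.default_rng(seed)
    target = pywt.wavedec2(texture, 'db4', level=levels)
    img = rng.normal(texture.mean(), texture.std(), texture.shape)
    for _ in range(n_iters):
        coeffs = pywt.wavedec2(img, 'db4', level=levels)
        new = [match_stats(coeffs[0], target[0])]
        for (h, v, d), (th, tv, td) in zip(coeffs[1:], target[1:]):
            new.append((match_stats(h, th), match_stats(v, tv),
                        match_stats(d, td)))
        img = pywt.waverec2(new, 'db4')
        img = img[:texture.shape[0], :texture.shape[1]]  # guard padding
    return img
```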


Journal ArticleDOI
TL;DR: Two evaluation criteria for interest points, repeatability rate and information content, are introduced, and different interest point detectors are compared using these two criteria.
Abstract: Many different low-level feature detectors exist and it is widely agreed that the evaluation of detectors is important. In this paper we introduce two evaluation criteria for interest points' repeatability rate and information content. Repeatability rate evaluates the geometric stability under different transformations. Information content measures the distinctiveness of features. Different interest point detectors are compared using these two criteria. We determine which detector gives the best results and show that it satisfies the criteria well.

1,690 citations
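
The repeatability criterion is easy to state operationally. A minimal sketch, assuming the ground-truth homography between the two views is known and ignoring the paper's restriction to the common visible region:

```python
# Fraction of image-1 interest points that reappear (within eps pixels)
# in image 2 after mapping through the known homography H.
import numpy as np

def repeatability(pts1, pts2, H, eps=1.5):
    """pts1: (N,2), pts2: (M,2) arrays of (x, y); H: 3x3 homography 1->2."""
    p = np.hstack([pts1, np.ones((len(pts1), 1))]) @ H.T
    proj = p[:, :2] / p[:, 2:3]                 # image-1 points seen in image 2
    d = np.linalg.norm(proj[:, None, :] - pts2[None, :, :], axis=2)
    repeated = (d.min(axis=1) <= eps).sum()
    return repeated / min(len(pts1), len(pts2))
```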


Journal ArticleDOI
TL;DR: A learning-based method for low-level vision problems—estimating scenes from images—using Bayesian belief propagation, applied to the “super-resolution” problem (estimating high frequency details from a low-resolution image) with good results.
Abstract: We describe a learning-based method for low-level vision problems—estimating scenes from images. We generate a synthetic world of scenes and their corresponding rendered images, modeling their relationships with a Markov network. Bayesian belief propagation allows us to efficiently find a local maximum of the posterior probability for the scene, given an image. We call this approach VISTA—Vision by Image/Scene TrAining. We apply VISTA to the “super-resolution” problem (estimating high frequency details from a low-resolution image), showing good results. To illustrate the potential breadth of the technique, we also apply it in two other problem domains, both simplified. We learn to distinguish shading from reflectance variations in a single image under particular lighting conditions. For the motion estimation problem in a “blobs world”, we show figure/ground discrimination, solution of the aperture problem, and filling-in arising from application of the same probabilistic machinery.

1,647 citations
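
The inference machinery can be illustrated on a simplification: a 1-D chain of patch nodes, where exact dynamic programming plays the role that loopy belief propagation plays on the paper's 2-D Markov network. Candidate high-resolution patches per node, unary data costs, and pairwise overlap-compatibility costs are assumed given:

```python
# MAP labeling on a chain: pick one of K candidate high-res patches per
# node, trading agreement with the low-res input (data_cost) against
# compatibility of neighbors in their overlap region (pair_cost).
import numpy as np

def infer_chain(data_cost, pair_cost):
    """data_cost: (N, K) unary costs; pair_cost: (N-1, K, K) neighbor costs.
    Returns the minimum-cost labeling (the MAP scene estimate)."""
    N, K = data_cost.shape
    best = data_cost[0].copy()
    back = np.zeros((N, K), dtype=int)
    for i in range(1, N):
        total = best[:, None] + pair_cost[i - 1]     # (K_prev, K_cur)
        back[i] = total.argmin(axis=0)
        best = total.min(axis=0) + data_cost[i]
    labels = np.zeros(N, dtype=int)
    labels[-1] = best.argmin()
    for i in range(N - 1, 0, -1):
        labels[i - 1] = back[i, labels[i]]
    return labels
```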


Journal ArticleDOI
TL;DR: A provably-correct algorithm, called Space Carving, is given for computing the 3D shape of an unknown, arbitrarily-shaped scene from multiple photographs taken at known but arbitrarily-distributed viewpoints; the approach is designed to capture photorealistic shapes that accurately model scene appearance from a wide range of viewpoints.
Abstract: In this paper we consider the problem of computing the 3D shape of an unknown, arbitrarily-shaped scene from multiple photographs taken at known but arbitrarily-distributed viewpoints. By studying the equivalence class of all 3D shapes that reproduce the input photographs, we prove the existence of a special member of this class, the photo hull, that (1) can be computed directly from photographs of the scene, and (2) subsumes all other members of this class. We then give a provably-correct algorithm, called Space Carving, for computing this shape and present experimental results on complex real-world scenes. The approach is designed to (1) capture photorealistic shapes that accurately model scene appearance from a wide range of viewpoints, and (2) account for the complex interactions between occlusion, parallax, shading, and their view-dependent effects on scene-appearance.

1,487 citations
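
The consistency test at the heart of carving can be sketched compactly. The toy version below assumes calibrated 3x4 projection matrices, uses per-channel color variance as the photo-consistency score, and sidesteps the visibility reasoning that the actual Space Carving sweep order provides:

```python
# Schematic voxel sweep: a voxel survives only if its projections into the
# photographs that see it have consistent color. Valid only when occlusion
# is negligible; real Space Carving orders the sweep to handle visibility.
import numpy as np

def photo_consistent(X, images, P_matrices, tau=30.0):
    """X: 3-vector world point; images: list of HxWx3 arrays."""
    colors = []
    for img, P in zip(images, P_matrices):
        u = P @ np.append(X, 1.0)
        x, y = int(round(u[0] / u[2])), int(round(u[1] / u[2]))
        if 0 <= y < img.shape[0] and 0 <= x < img.shape[1]:
            colors.append(img[y, x].astype(float))
    if len(colors) < 2:
        return True                        # unseen voxels are never carved
    return np.std(np.stack(colors), axis=0).max() < tau

def carve(voxels, images, P_matrices):
    """voxels: (N,3) candidate centers; returns the surviving subset."""
    keep = [photo_consistent(v, images, P_matrices) for v in voxels]
    return voxels[np.asarray(keep)]
```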


Journal ArticleDOI
TL;DR: A general, trainable system for object detection in unconstrained, cluttered scenes that derives much of its power from a representation that describes an object class in terms of an overcomplete dictionary of local, oriented, multiscale intensity differences between adjacent regions, efficiently computable as a Haar wavelet transform.
Abstract: This paper presents a general, trainable system for object detection in unconstrained, cluttered scenes. The system derives much of its power from a representation that describes an object class in terms of an overcomplete dictionary of local, oriented, multiscale intensity differences between adjacent regions, efficiently computable as a Haar wavelet transform. This example-based learning approach implicitly derives a model of an object class by training a support vector machine classifier using a large set of positive and negative examples. We present results on face, people, and car detection tasks using the same architecture. In addition, we quantify how the representation affects detection performance by considering several alternate representations including pixels and principal components. We also describe a real-time application of our person detection system as part of a driver assistance system.

1,436 citations
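
A hedged sketch of the representation/classifier pairing: dense Haar-like adjacent-region differences computed in constant time from an integral image, fed to a linear SVM (here sklearn's LinearSVC). Window size, feature layout, and classifier settings are illustrative choices, not the paper's:

```python
# Haar-like intensity differences between adjacent regions via an integral
# image, plus a linear SVM for example-based detection.
import numpy as np
from sklearn.svm import LinearSVC

def integral(img):
    return img.cumsum(0).cumsum(1)

def box(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from integral image ii (exclusive ends)."""
    s = ii[r1 - 1, c1 - 1]
    if r0 > 0: s -= ii[r0 - 1, c1 - 1]
    if c0 > 0: s -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0: s += ii[r0 - 1, c0 - 1]
    return s

def haar_features(window, size=4):
    """Horizontal and vertical adjacent-region differences at one scale."""
    ii, feats = integral(window.astype(float)), []
    H, W = window.shape
    for r in range(0, H - size + 1, size):
        for c in range(0, W - 2 * size + 1, size):
            feats.append(box(ii, r, c, r + size, c + size)
                         - box(ii, r, c + size, r + size, c + 2 * size))
    for r in range(0, H - 2 * size + 1, size):
        for c in range(0, W - size + 1, size):
            feats.append(box(ii, r, c, r + size, c + size)
                         - box(ii, r + size, c, r + 2 * size, c + size))
    return np.array(feats)

# Training on labeled, equal-size grayscale crops then reduces to:
#   clf = LinearSVC().fit([haar_features(w) for w in X_pos + X_neg],
#                         [1] * len(X_pos) + [0] * len(X_neg))
```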


Journal ArticleDOI
TL;DR: An algebraic representation is developed which unifies the three types of measurement and permits a first order error propagation analysis to be performed, associating an uncertainty with each measurement.
Abstract: We describe how 3D affine measurements may be computed from a single perspective view of a scene given only minimal geometric information determined from the image. This minimal information is typically the vanishing line of a reference plane, and a vanishing point for a direction not parallel to the plane. It is shown that affine scene structure may then be determined from the image, without knowledge of the camera's internal calibration (e.g. focal length), nor of the explicit relation between camera and world (pose). In particular, we show how to (i) compute the distance between planes parallel to the reference plane (up to a common scale factor); (ii) compute area and length ratios on any plane parallel to the reference plane; (iii) determine the camera's location. Simple geometric derivations are given for these results. We also develop an algebraic representation which unifies the three types of measurement and, amongst other advantages, permits a first order error propagation analysis to be performed, associating an uncertainty with each measurement. We demonstrate the technique for a variety of applications, including height measurements in forensic images and 3D graphical modelling from single images.

760 citations
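
As a concrete instance of measurement (i), the ratio of two heights above the reference plane can be computed from the vanishing line and vertical vanishing point alone. The formula below follows the usual single view metrology cross-ratio expression; treat it as a sketch to be checked against the paper rather than a verified implementation:

```python
# Ratio of two heights above the reference plane from the plane's vanishing
# line l, the vertical vanishing point v, and imaged (base, top) pairs, all
# as homogeneous 3-vectors. A known reference height makes results metric.
import numpy as np

def height_ratio(b1, t1, b2, t2, l, v):
    def rel(b, t):
        # Relative (up-to-scale) distance of t above b along the v direction.
        return np.linalg.norm(np.cross(b, t)) / (
            abs(np.dot(l, b)) * np.linalg.norm(np.cross(v, t)))
    return rel(b1, t1) / rel(b2, t2)

# E.g. with one person of known height 1.80 m as reference:
#   unknown = 1.80 * height_ratio(b_unknown, t_unknown, b_ref, t_ref, l, v)
```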


Journal ArticleDOI
TL;DR: This article presents a technique in which appearances of objects are represented by the joint statistics of local neighborhood operators such as Gaussian derivatives or Gabor filters, which constitutes a new class of appearance-based techniques for computer vision.
Abstract: The appearance of an object is composed of local structure. This local structure can be described and characterized by a vector of local features measured by local operators such as Gaussian derivatives or Gabor filters. This article presents a technique where appearances of objects are represented by the joint statistics of such local neighborhood operators. As such, this represents a new class of appearance based techniques for computer vision. Based on joint statistics, the paper develops techniques for the identification of multiple objects at arbitrary positions and orientations in a cluttered scene. Experiments show that these techniques can identify over 100 objects in the presence of major occlusions. Most remarkably, the techniques have low complexity and therefore run in real-time.

480 citations
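
A compact sketch of the representation: a joint histogram over local operator responses (here first Gaussian derivatives in x and y via scipy.ndimage), compared with a chi-square score. The operator choice, bin count, and comparison measure are illustrative:

```python
# Appearance as a joint histogram of local operator responses, with
# recognition by histogram comparison.
import numpy as np
from scipy.ndimage import gaussian_filter

def joint_histogram(img, sigma=2.0, bins=24):
    dx = gaussian_filter(img.astype(float), sigma, order=(0, 1))
    dy = gaussian_filter(img.astype(float), sigma, order=(1, 0))
    h, _, _ = np.histogram2d(dx.ravel(), dy.ravel(), bins=bins)
    return h / h.sum()

def chi_square(h1, h2):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-12))

# Identification: report the model object whose stored histogram minimizes
# chi_square against the test image's histogram.
```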


Journal ArticleDOI
TL;DR: This paper presents a complete system for constructing panoramic image mosaics from sequences of images, and introduces a rotational mosaic representation that associates a rotation matrix with each input image and a patch-based alignment algorithm to quickly align two images given motion models.
Abstract: This paper presents a complete system for constructing panoramic image mosaics from sequences of images. Our mosaic representation associates a transformation matrix with each input image, rather than explicitly projecting all of the images onto a common surface (e.g., a cylinder). In particular, to construct a full view panorama, we introduce a rotational mosaic representation that associates a rotation matrix (and optionally a focal length) with each input image. A patch-based alignment algorithm is developed to quickly align two images given motion models. Techniques for estimating and refining camera focal lengths are also presented. In order to reduce accumulated registration errors, we apply global alignment (block adjustment) to the whole sequence of images, which results in an optimally registered image mosaic. To compensate for small amounts of motion parallax introduced by translations of the camera and other unmodeled distortions, we use a local alignment (deghosting) technique which warps each image based on the results of pairwise local image registrations. By combining both global and local alignment, we significantly improve the quality of our image mosaics, thereby enabling the creation of full view panoramic mosaics with hand-held cameras. We also present an inverse texture mapping algorithm for efficiently extracting environment maps from our panoramic image mosaics. By mapping the mosaic onto an arbitrary texture-mapped polyhedron surrounding the origin, we can explore the virtual environment using standard 3D graphics viewers and hardware without requiring special-purpose players.

454 citations


Journal ArticleDOI
TL;DR: This work combines stereo, color, and face detection modules into a single robust system, shows an initial application in an interactive, face-responsive display, and discusses the failure modes of each individual module.
Abstract: We present an approach to real-time person tracking in crowded and/or unknown environments using integration of multiple visual modalities. We combine stereo, color, and face detection modules into a single robust system, and show an initial application in an interactive, face-responsive display. Dense, real-time stereo processing is used to isolate users from other objects and people in the background. Skin-hue classification identifies and tracks likely body parts within the silhouette of a user. Face pattern detection discriminates and localizes the face within the identified body parts. Faces and bodies of users are tracked over several temporal scales: short-term (user stays within the field of view), medium-term (user exits/reenters within minutes), and long term (user returns after hours or days). Short-term tracking is performed using simple region position and size correspondences, while medium and long-term tracking are based on statistics of user appearance. We discuss the failure modes of each individual module, describe our integration method, and report results with the complete system in trials with thousands of users.

435 citations
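
The skin-hue module's role can be illustrated with a generic HSV threshold classifier. The thresholds below are rough illustrative values, not the trained per-user color statistics the system itself maintains:

```python
# Classify pixels as likely skin by thresholding hue/saturation/value.
import numpy as np
import cv2

def skin_mask(bgr_image):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    # OpenCV hue range is [0, 180); skin tones cluster near red/orange hues.
    mask = ((h < 20) | (h > 170)) & (s > 40) & (v > 60)
    return mask.astype(np.uint8) * 255
```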


Journal ArticleDOI
TL;DR: An observation density for tracking is presented which solves the coalescence problem of identical-target trackers by exhibiting a probabilistic exclusion principle, and partitioned sampling, a new sampling method for multiple object tracking, is introduced.
Abstract: Tracking multiple targets is a challenging problem, especially when the targets are “identical”, in the sense that the same model is used to describe each target. In this case, simply instantiating several independent 1-body trackers is not an adequate solution, because the independent trackers tend to coalesce onto the best-fitting target. This paper presents an observation density for tracking which solves this problem by exhibiting a probabilistic exclusion principle. Exclusion arises naturally from a systematic derivation of the observation density, without relying on heuristics. Another important contribution of the paper is the presentation of partitioned sampling, a new sampling method for multiple object tracking. Partitioned sampling avoids the high computational load associated with fully coupled trackers, while retaining the desirable properties of coupling.

Journal ArticleDOI
TL;DR: A dynamic system incorporating flow as a hard constraint is derived and solved, producing a model-based least-squares optical flow solution that ensures the constraint remains satisfied when combined with edge information, which helps combat tracking error accumulation.
Abstract: Optical flow provides a constraint on the motion of a deformable model. We derive and solve a dynamic system incorporating flow as a hard constraint, producing a model-based least-squares optical flow solution. Our solution also ensures the constraint remains satisfied when combined with edge information, which helps combat tracking error accumulation. Constraint enforcement can be relaxed using a Kalman filter, which permits controlled constraint violations based on the noise present in the optical flow information, and enables optical flow and edge information to be combined more robustly and efficiently. We apply this framework to the estimation of face shape and motion using a 3D deformable face model. This model uses a small number of parameters to describe a rich variety of face shapes and facial expressions. We present experiments in extracting the shape and motion of a face from image sequences which validate the accuracy of the method. They also demonstrate that our treatment of optical flow as a hard constraint, as well as our use of a Kalman filter to reconcile these constraints with the uncertainty in the optical flow, are vital for improving the performance of our system.

Journal ArticleDOI
TL;DR: This paper shows that a classic optical flow technique by Nagel and Enkelmann can be regarded as an early anisotropic diffusion method with a diffusion tensor, and introduces three improvements into the model formulation that avoid inconsistencies caused by centering the brightness term and the smoothness term in different images.
Abstract: In this paper we show that a classic optical flow technique by Nagel and Enkelmann (1986, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 8, pp. 565–593) can be regarded as an early anisotropic diffusion method with a diffusion tensor. We introduce three improvements into the model formulation that (i) avoid inconsistencies caused by centering the brightness term and the smoothness term in different images, (ii) use a linear scale-space focusing strategy from coarse to fine scales for avoiding convergence to physically irrelevant local minima, and (iii) create an energy functional that is invariant under linear brightness changes. Applying a gradient descent method to the resulting energy functional leads to a system of diffusion–reaction equations. We prove that this system has a unique solution under realistic assumptions on the initial data, and we present an efficient linear implicit numerical scheme in detail. Our method creates flow fields with 100 % density over the entire image domain, it is robust under a large range of parameter variations, and it can recover displacement fields that are far beyond the typical one-pixel limits which are characteristic for many differential methods for determining optical flow. We show that it performs better than the optical flow methods with 100 % density that are evaluated by Barron et al. (1994, Int. J. Comput. Vision, Vol. 12, pp. 43–77). Our software is available from the Internet.

Journal ArticleDOI
TL;DR: The geometric framework and the general Beltrami flow are applied to feature-preserving denoising of images in various spaces to propose enhancement techniques that selectively smooth images while preserving either the multi-channel edges or the orientation-dependent texture features in them.
Abstract: We extend the geometric framework introduced in Sochen et al. (IEEE Trans. on Image Processing, 7(3):310–318, 1998) for image enhancement. We analyze and propose enhancement techniques that selectively smooth images while preserving either the multi-channel edges or the orientation-dependent texture features in them. Images are treated as manifolds in a feature-space. This geometrical interpretation leads to a general way for grey level, color, movies, volumetric medical data, and color-texture image enhancement. We first review our framework in which the Polyakov action from high-energy physics is used to develop a minimization procedure through a geometric flow for images. Here we show that the geometric flow, based on manifold volume minimization, yields a novel enhancement procedure for color images. We apply the geometric framework and the general Beltrami flow to feature-preserving denoising of images in various spaces. Next, we introduce a new method for color and texture enhancement. Motivated by Gabor's geometric image sharpening method (Gabor, Laboratory Investigation, 14(6):801–807, 1965), we present a geometric sharpening procedure for color images with texture. It is based on inverse diffusion across the multi-channel edge, and diffusion along the edge.

Journal ArticleDOI
TL;DR: A supervised classification model based on a variational approach is presented, devoted to finding an optimal partition composed of homogeneous classes with regular interfaces; the forces driving the level-set interfaces are defined through the minimization of a unique functional.
Abstract: We present a supervised classification model based on a variational approach. This model is devoted to finding an optimal partition composed of homogeneous classes with regular interfaces. The originality of the proposed approach concerns the definition of a partition by the use of level sets. Each set of regions and boundaries associated to a class is defined by a unique level set function. We use as many level sets as different classes, and all these level sets are moving together thanks to forces which interact in order to get an optimal partition. We show how these forces can be defined through the minimization of a unique functional. The coupled Partial Differential Equations (PDE) related to the minimization of the functional are considered through a dynamical scheme. Given an initial interface set (zero level set), the different terms of the PDEs govern the motion of interfaces such that, at convergence, we get an optimal partition as defined above. Each interface is guided by internal forces (regularity of the interface), and external ones (data term, no vacuum, no region overlapping). Several experiments were conducted on both synthetic and real images.

Journal ArticleDOI
TL;DR: This paper presents how the 2 1/2 D visual servoing scheme, recently developed, can be used with unknown objects characterized by a set of points, based on the estimation of the camera displacement from two views, given by the current and desired images.
Abstract: Classical visual servoing techniques need a strong a priori knowledge of the shape and the dimensions of the observed objects. In this paper, we present how the 2 1/2 D visual servoing scheme we have recently developed can be used with unknown objects characterized by a set of points. Our scheme is based on the estimation of the camera displacement from two views, given by the current and desired images. Since vision-based robotics tasks generally must be performed at video rate, we focus only on linear algorithms. Classical linear methods are based on the computation of the essential matrix. In this paper, we propose a different method, based on the estimation of the homography matrix related to a virtual plane attached to the object. We show that our method provides a more stable estimation when the epipolar geometry degenerates. This is particularly important in visual servoing to obtain a stable control law, especially near the convergence of the system. Finally, experimental results confirm the improvement in the stability, robustness, and behaviour of our scheme with respect to classical methods.
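
The linear estimation the scheme relies on can be sketched with the standard DLT solution for a homography from point correspondences between the current and desired images. Point normalization and the paper's virtual-plane construction are omitted; this shows only the linear algebra:

```python
# Direct Linear Transform: homography from >= 4 point correspondences.
import numpy as np

def homography_dlt(pts_cur, pts_des):
    """pts_*: (N,2) arrays, N >= 4. Returns H with pts_des ~ H @ pts_cur."""
    rows = []
    for (x, y), (u, v) in zip(pts_cur, pts_des):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(rows, float))
    H = Vt[-1].reshape(3, 3)       # null-space vector of the stacked system
    return H / H[2, 2]
```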

Journal ArticleDOI
TL;DR: In this article, Depth from Focus (DFF) and Depth from Defocus (DFD) are theoretically unified with the geometric triangulation principle; the effect of noise in different spatial frequencies is analyzed, the optimal changes of the focus settings in DFD are derived, and the methods are compared with shape from stereo (or motion) algorithms.
Abstract: Depth from Focus (DFF) and Depth from Defocus (DFD) methods are theoretically unified with the geometric triangulation principle. Fundamentally, the depth sensitivities of DFF and DFD are not different than those of stereo (or motion) based systems having the same physical dimensions. Contrary to common belief, DFD does not inherently avoid the matching (correspondence) problem. Basically, DFD and DFF do not avoid the occlusion problem any more than triangulation techniques, but they are more stable in the presence of such disruptions. The fundamental advantage of DFF and DFD methods is the two-dimensionality of the aperture, allowing more robust estimation. We analyze the effect of noise in different spatial frequencies, and derive the optimal changes of the focus settings in DFD. These results elucidate the limitations of methods based on depth of field and provide a foundation for fair performance comparison between DFF/DFD and shape from stereo (or motion) algorithms.
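
The basic DFF selection step, picking per pixel the focus setting that maximizes a local sharpness measure, is simple to write down. The measure below (locally averaged squared Laplacian) and window size are illustrative; the paper's contribution concerns how the focus settings themselves should be chosen:

```python
# Minimal depth from focus: index of the sharpest frame per pixel.
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def depth_from_focus(focus_stack):
    """focus_stack: list of images at known focus settings."""
    measures = []
    for img in focus_stack:
        lap = laplace(img.astype(float))
        # Local average of the squared Laplacian over a small window.
        measures.append(uniform_filter(lap ** 2, size=9))
    return np.argmax(np.stack(measures), axis=0)
```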

Journal ArticleDOI
TL;DR: A simple method of fast background subtraction based upon disparity verification that is invariant to arbitrarily rapid run-time changes in illumination that is easily implemented in real-time on conventional hardware.
Abstract: This paper describes a simple method of fast background subtraction based upon disparity verification that is invariant to arbitrarily rapid run-time changes in illumination. Using two or more cameras, the method requires the off-line construction of disparity fields mapping the primary background image to each of the additional auxiliary background images. At runtime, segmentation is performed by checking color intensity values at corresponding pixels. If more than two cameras are available, more robust segmentation can be achieved and, in particular, the occlusion shadows can generally be eliminated as well. Because the method only assumes fixed background geometry, the technique allows for illumination variation at runtime. Since no disparity search is performed, the algorithm is easily implemented in real-time on conventional hardware.
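
The runtime test reduces to a per-pixel comparison through the precomputed disparity field; because both cameras see the background surface under the same illumination changes, corresponding background pixels stay consistent while a foreground object breaks the correspondence. A minimal sketch with a rectified horizontal-disparity field and an illustrative threshold:

```python
# Disparity-verification segmentation with a precomputed disparity field.
import numpy as np

def foreground_mask(primary, auxiliary, disparity, tau=25.0):
    """primary/auxiliary: HxW grayscale; disparity: HxW horizontal offsets."""
    H, W = primary.shape
    xs = np.clip(np.arange(W)[None, :] + disparity.round().astype(int),
                 0, W - 1)
    warped_aux = auxiliary[np.arange(H)[:, None], xs]
    return np.abs(primary.astype(float) - warped_aux.astype(float)) > tau
```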

Journal ArticleDOI
TL;DR: An algorithm to estimate the parameters of a linear model in the presence of heteroscedastic noise, i.e., each data point having a different covariance matrix, achieves the accuracy of nonlinear optimization techniques at much less computational cost.
Abstract: We present an algorithm to estimate the parameters of a linear model in the presence of heteroscedastic noise, i.e., each data point having a different covariance matrix. The algorithm is motivated by the recovery of bilinear forms, one of the fundamental problems in computer vision which appears whenever the epipolar constraint is imposed, or a conic is fit to noisy data points. We employ the errors-in-variables (EIV) model and show why already at moderate noise levels most available methods fail to provide a satisfactory solution. The improved behavior of the new algorithm is due to two factors: taking into account the heteroscedastic nature of the errors arising from the linearization of the bilinear form, and the use of generalized singular value decomposition (GSVD) in the computations. The performance of the algorithm is compared with several methods proposed in the literature for ellipse fitting and estimation of the fundamental matrix. It is shown that the algorithm achieves the accuracy of nonlinear optimization techniques at much less computational cost.

Journal ArticleDOI
TL;DR: A novel framework for isotropic and anisotropic diffusion of directions is presented, which can be applied both to denoise directional data and to obtain multiscale representations of it, by applying and extending results from the theory of harmonic maps.
Abstract: In a number of disciplines, directional data provides a fundamental source of information. A novel framework for isotropic and anisotropic diffusion of directions is presented in this paper. The framework can be applied both to denoise directional data and to obtain multiscale representations of it. The basic idea is to apply and extend results from the theory of harmonic maps, and in particular, harmonic maps in liquid crystals. This theory deals with the regularization of vectorial data, while satisfying the intrinsic unit norm constraint of directional data. We show the corresponding variational and partial differential equations formulations for isotropic diffusion, obtained from an L_2 norm, and edge preserving diffusion, obtained from an L_p norm in general and an L_1 norm in particular. In contrast with previous approaches, the framework is valid for directions in any dimensions, supports non-smooth data, and gives both isotropic and anisotropic formulations. In addition, the framework of harmonic maps here described can be used to diffuse and analyze general image data defined on general non-flat manifolds, that is, functions between two general manifolds. We present a number of theoretical results, open questions, and examples for gradient vectors, optical flow, and color images.
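
The simplest discrete surrogate for isotropic direction diffusion is component-wise smoothing followed by re-projection onto the unit sphere, which restores the unit-norm constraint. This is only a crude stand-in for the intrinsic harmonic-map flow the paper formulates:

```python
# Toy isotropic direction diffusion: smooth components, then renormalize.
import numpy as np
from scipy.ndimage import gaussian_filter

def diffuse_directions(U, sigma=1.0):
    """U: (H, W, d) field of unit vectors. Returns a smoothed unit field."""
    V = np.stack([gaussian_filter(U[..., k], sigma)
                  for k in range(U.shape[-1])], axis=-1)
    norm = np.linalg.norm(V, axis=-1, keepdims=True)
    return V / np.maximum(norm, 1e-12)
```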

Journal ArticleDOI
TL;DR: Multi-view relationships are developed for lines, conics and non-algebraic curves; the plane of the curve is determined in a projective reconstruction, and the homography induced by this plane is used for transfer from one image to another.
Abstract: This paper describes the geometry of imaged curves in two and three views. Multi-view relationships are developed for lines, conics and non-algebraic curves. The new relationships focus on determining the plane of the curve in a projective reconstruction, and in particular using the homography induced by this plane for transfer from one image to another. It is shown that given the fundamental matrix between two views, and images of the curve in each view, then the plane of a conic may be determined up to a two fold ambiguity, but local curvature of a curve uniquely determines the plane. It is then shown that given the trifocal tensor between three views, this plane defines a homography map which may be used to transfer a conic or the curvature from two views to a third. Simple expressions are developed for the plane and homography in each case. A set of algorithms are then described for automatically matching individual line segments and curves between images. The algorithms use both photometric information and the multiple view geometric relationships. For image pairs the homography facilitates the computation of a neighbourhood cross-correlation based matching score for putative line/curve correspondences. For image triplets cross-correlation matching scores are used in conjunction with line/curve transfer based on the trifocal geometry to disambiguate matches. Algorithms are developed for both short and wide baselines. The algorithms are robust to deficiencies in the segment extraction and partial occlusion. Experimental results are given for image pairs and triplets, for varying motions between views, and for different scene types. The methods are applicable to line/curve matching in stereo and trinocular rigs, and as a starting point for line/curve matching through monocular image sequences.

Journal ArticleDOI
TL;DR: This work explores the use of parameterized motion models that represent much more varied and complex motions, and shows how the model coefficients can be used to detect and recognize specific motions such as occlusion boundaries and facial expressions.
Abstract: Linear parameterized models of optical flow, particularly affine models, have become widespread in image motion analysis. The linear model coefficients are straightforward to estimate, and they provide reliable estimates of the optical flow of smooth surfaces. Here we explore the use of parameterized motion models that represent much more varied and complex motions. Our goals are threefold: to construct linear bases for complex motion phenomena; to estimate the coefficients of these linear models; and to recognize or classify image motions from the estimated coefficients. We consider two broad classes of motions: i) generic “motion features” such as motion discontinuities and moving bars; and ii) non-rigid, object-specific, motions such as the motion of human mouths. For motion features we construct a basis of steerable flow fields that approximate the motion features. For object-specific motions we construct basis flow fields from example motions using principal component analysis. In both cases, the model coefficients can be estimated directly from spatiotemporal image derivatives with a robust, multi-resolution scheme. Finally, we show how these model coefficients can be used to detect and recognize specific motions such as occlusion boundaries and facial expressions.
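
For the object-specific case, the model-building and coefficient-fitting steps can be sketched directly: a PCA basis of example flow fields and a least-squares projection of a new flow onto it. The paper instead estimates coefficients directly from spatiotemporal image derivatives with a robust, multi-resolution scheme:

```python
# PCA basis of example flow fields, and coefficient fitting by projection.
import numpy as np

def learn_flow_basis(example_flows, k):
    """example_flows: (N, H, W, 2) training flows. Returns (k, H*W*2) basis."""
    X = example_flows.reshape(len(example_flows), -1)
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]

def fit_coefficients(flow, basis):
    """Least-squares coefficients of one (H, W, 2) flow in the basis.
    (Subtract the training mean flow first for exactness; omitted here.)"""
    return basis @ flow.ravel()   # rows of Vt are orthonormal

# Recognition then reduces to classifying the coefficient vector, e.g. by
# nearest neighbor among coefficients of labeled example motions.
```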

Journal ArticleDOI
TL;DR: MikeTalk is built using visemes, a small set of images spanning a large range of mouth shapes; by morphing between visemes, the visual speech stream is synchronized with the audio speech stream, giving the impression of a photorealistic talking face.
Abstract: We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic talking face.

Journal ArticleDOI
TL;DR: This article presents a technique that represents complex motion or action patterns by linear combinations of a small number of prototypical image sequences, and shows how knowledge about the topology of the pattern space can be exploited during pattern recognition.
Abstract: The linear combination of prototypical views provides a powerful approach for the recognition and the synthesis of images of stationary three-dimensional objects. In this article, we present initial results that demonstrate that similar ideas can be developed for the recognition and synthesis of complex motion patterns. We present a technique that makes it possible to represent complex motion or action patterns by linear combinations of a small number of prototypical image sequences. We demonstrate the applicability of this new approach for the synthesis and analysis of biological motion using simulated and real video data from different locomotion patterns. Our results show that complex motion patterns are embedded in pattern spaces with a defined topological structure, which can be uncovered with our methods. The underlying pattern space seems to have locally, but not globally, the properties of a linear vector space. We show how the knowledge about the topology of the pattern space can be exploited during pattern recognition. Our method may provide a new and interesting approach for the analysis and synthesis of video sequences and complex movements.

Journal ArticleDOI
TL;DR: A Bayesian framework for representing and recognizing local image motion in terms of translational motion and motion boundaries is proposed, which provides a general probabilistic framework for motion estimation with multiple, non-linear, models.
Abstract: We propose a Bayesian framework for representing and recognizing local image motion in terms of two basic models: translational motion and motion boundaries. Motion boundaries are represented using a non-linear generative model that explicitly encodes the orientation of the boundary, the velocities on either side, the motion of the occluding edge over time, and the appearance/disappearance of pixels at the boundary. We represent the posterior probability distribution over the model parameters given the image data using discrete samples. This distribution is propagated over time using a particle filtering algorithm. To efficiently represent such a high-dimensional space we initialize samples using the responses of a low-level motion discontinuity detector. The formulation and computational model provide a general probabilistic framework for motion estimation with multiple, non-linear, models.
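
The sample-based propagation the abstract describes is the standard bootstrap particle filter; the sketch below runs it for a purely translational patch motion with a placeholder brightness-constancy likelihood, omitting the paper's non-linear motion-boundary observation model:

```python
# One bootstrap particle filter step over a translational motion (u, v).
import numpy as np

def step(particles, weights, prev_img, img, patch_xy, rng, noise=1.0):
    """particles: (N, 2) candidate translations; returns updated state."""
    # Resample proportionally to the weights, then diffuse.
    idx = rng.choice(len(particles), len(particles), p=weights)
    particles = particles[idx] + rng.normal(0, noise, particles.shape)
    x, y = patch_xy
    ref = prev_img[y:y + 8, x:x + 8].astype(float)
    w = np.empty(len(particles))
    for i, (u, v) in enumerate(particles):
        xi, yi = int(round(x + u)), int(round(y + v))
        obs = img[yi:yi + 8, xi:xi + 8].astype(float)
        err = np.inf if obs.shape != ref.shape else np.mean((obs - ref) ** 2)
        w[i] = np.exp(-err / (2 * 10.0 ** 2))   # placeholder likelihood
    return particles, w / w.sum()
```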

Journal ArticleDOI
TL;DR: This work proposes a method for self calibration of the blur kernels, given the raw images; the kernels are sought to minimize the mutual information of the recovered layers.
Abstract: Consider situations where the depth at each point in the scene is multi-valued, due to the presence of a virtual image semi-reflected by a transparent surface. The semi-reflected image is linearly superimposed on the image of an object that is behind the transparent surface. A novel approach is proposed for the separation of the superimposed layers. Focusing on either of the layers yields initial separation, but crosstalk remains. The separation is enhanced by mutual blurring of the perturbing components in the images. However, this blurring requires the estimation of the defocus blur kernels. We thus propose a method for self calibration of the blur kernels, given the raw images. The kernels are sought to minimize the mutual information of the recovered layers. Autofocusing and depth estimation in the presence of semi-reflections are also considered. Experimental results are presented.
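
The calibration criterion itself is easy to compute: the mutual information between two candidate recovered layers via their joint histogram. Kernel self-calibration would then search blur-kernel parameters for the minimum of this score:

```python
# Mutual information between two images from their joint histogram.
import numpy as np

def mutual_information(a, b, bins=64):
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = joint / joint.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))
```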

Journal ArticleDOI
TL;DR: An approach based on regularized bundle-adjustment that takes advantage of the rough knowledge of the head's shape, in the form of a generic face model, to recover relative head-motion and epipolar geometry accurately and consistently enough to exploit a previously-developed stereo-based approach to head modeling.
Abstract: We address the structure-from-motion problem in the context of head modeling from video sequences for which calibration data is not available. This task is made challenging by the fact that correspondences are difficult to establish due to lack of texture and that a quasi-euclidean representation is required for realism. We have developed an approach based on regularized bundle-adjustment. It takes advantage of our rough knowledge of the head's shape, in the form of a generic face model. It allows us to recover relative head-motion and epipolar geometry accurately and consistently enough to exploit a previously-developed stereo-based approach to head modeling. In this way, complete and realistic head models can be acquired with a cheap and entirely passive sensor, such as an ordinary video camera, with minimal manual intervention. We chose to demonstrate and evaluate our technique mainly in the context of head-modeling. We do so because it is the application for which all the tools required to perform the complete reconstruction are available to us. We will, however, argue that the approach is generic and could be applied to other tasks, such as body modeling, for which generic facetized models exist.

Journal ArticleDOI
TL;DR: Numerically invariant expressions for the four differential invariants parameterizing the three dimensional version of the Euclidean signature curve, namely the curvature, the torsion and their derivatives with respect to arc length are given.
Abstract: Corrected versions of the numerically invariant expressions for the affine and Euclidean signature of a planar curve introduced by Calabi et al. in (Int. J. Comput. Vision, 26: 107–135, 1998) are presented. The new formulas are valid for fine but otherwise arbitrary partitions of the curve. We also give numerically invariant expressions for the four differential invariants parameterizing the three dimensional version of the Euclidean signature curve, namely the curvature, the torsion and their derivatives with respect to arc length.
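
For reference, the classical pointwise invariants that the numerically invariant expressions approximate, curvature and torsion of a sampled space curve, can be computed with naive finite differences (which, unlike the paper's formulas, are not invariant to the choice of partition):

```python
# Curvature kappa = |r' x r''| / |r'|^3 and torsion
# tau = (r' x r'') . r''' / |r' x r''|^2 via finite differences.
import numpy as np

def curvature_torsion(curve):
    """curve: (N, 3) samples. Returns per-sample curvature and torsion."""
    d1 = np.gradient(curve, axis=0)
    d2 = np.gradient(d1, axis=0)
    d3 = np.gradient(d2, axis=0)
    c = np.cross(d1, d2)
    cn = np.linalg.norm(c, axis=1)
    kappa = cn / np.linalg.norm(d1, axis=1) ** 3
    tau = np.einsum('ij,ij->i', c, d3) / np.maximum(cn ** 2, 1e-12)
    return kappa, tau
```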

Journal ArticleDOI
TL;DR: In this article, a vision system for locating, recognising and tracking multiple vehicles, using an image sequence taken by a single camera mounted on a moving vehicle, is presented, where the camera motion is estimated by matching features on the ground plane from one image to the next.
Abstract: An overview is given of a vision system for locating, recognising and tracking multiple vehicles, using an image sequence taken by a single camera mounted on a moving vehicle. The camera motion is estimated by matching features on the ground plane from one image to the next. Vehicle detection and hypothesis generation are performed using template correlation and a 3D wire frame model of the vehicle is fitted to the image. Once detected and identified, vehicles are tracked using dynamic filtering. A separate batch mode filter obtains the 3D trajectories of nearby vehicles over an extended time. Results are shown for a motorway image sequence.

Journal ArticleDOI
TL;DR: This work describes how to model the appearance of a 3-D object using multiple views, learn such a model from training images, and use the model for object recognition, and demonstrates that OLIVER is capable of learning to recognize complex objects in cluttered images, while acquiring models that represent those objects using relatively few views.
Abstract: We describe how to model the appearance of a 3-D object using multiple views, learn such a model from training images, and use the model for object recognition. The model uses probability distributions to describe the range of possible variation in the object's appearance. These distributions are organized on two levels. Large variations are handled by partitioning training images into clusters corresponding to distinctly different views of the object. Within each cluster, smaller variations are represented by distributions characterizing uncertainty in the presence, position, and measurements of various discrete features of appearance. Many types of features are used, ranging in abstraction from edge segments to perceptual groupings and regions. A matching procedure uses the feature uncertainty information to guide the search for a match between model and image. Hypothesized feature pairings are used to estimate a viewpoint transformation taking account of feature uncertainty. These methods have been implemented in an object recognition system, OLIVER. Experiments show that OLIVER is capable of learning to recognize complex objects in cluttered images, while acquiring models that represent those objects using relatively few views.