Showing papers in "International Journal of Computer Vision in 2004"
TL;DR: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene and can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
TL;DR: A face detection framework is described that processes images extremely rapidly while achieving high detection rates; implemented on a conventional desktop, detection proceeds at 15 frames per second.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.
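The integral image idea mentioned in the abstract can be illustrated with a short sketch. This is a minimal pure-Python version, not the authors' implementation: the function names, the zero-padded table layout, and the particular two-rectangle feature are assumptions made for illustration.

```python
def integral_image(img):
    """Return an (h+1) x (w+1) summed-area table with a zero first row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # running sum of the current row
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows [top, bottom) and columns [left, right) via four lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Hypothetical two-rectangle feature: left half minus right half."""
    half = width // 2
    left_sum = box_sum(ii, top, left, top + height, left + half)
    right_sum = box_sum(ii, top, left + half, top + height, left + 2 * half)
    return left_sum - right_sum


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    table = integral_image(image)
    print(box_sum(table, 1, 1, 3, 3))
    print(two_rect_feature(table, 0, 0, 3, 4))
```

Once the table is built, the cost of any rectangle sum is independent of the rectangle's size, which is what makes evaluating a large pool of rectangle features cheap.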
TL;DR: An efficient segmentation algorithm is developed based on a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image and it is shown that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties.
Abstract: This paper addresses the problem of segmenting an image into regions. We define a predicate for measuring the evidence for a boundary between two regions using a graph-based representation of the image. We then develop an efficient segmentation algorithm based on this predicate, and show that although this algorithm makes greedy decisions it produces segmentations that satisfy global properties. We apply the algorithm to image segmentation using two different kinds of local neighborhoods in constructing the graph, and illustrate the results with both real and synthetic images. The algorithm runs in time nearly linear in the number of graph edges and is also fast in practice. An important characteristic of the method is its ability to preserve detail in low-variability image regions while ignoring detail in high-variability regions.
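A greedy, edge-ordered merge of the kind the abstract describes can be sketched as follows. The abstract does not spell out the boundary predicate, so the comparison used here (merge when the connecting edge is no heavier than each component's internal variation plus `k / size`) is an assumed form, and `k`, `DisjointSet`, and `segment` are illustrative names rather than the authors' code.

```python
class DisjointSet:
    """Union-find that also tracks component size and internal variation."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n  # heaviest edge merged into each component so far

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def union(self, a, b, weight):
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        # Edges arrive in ascending order, so this edge is the heaviest one inside.
        self.internal[a] = weight


def segment(num_nodes, edges, k):
    """edges: iterable of (weight, u, v). Returns a component label per node."""
    ds = DisjointSet(num_nodes)
    for weight, u, v in sorted(edges):
        a, b = ds.find(u), ds.find(v)
        if a == b:
            continue
        # Assumed predicate: no evidence for a boundary if the edge is light
        # relative to the internal variation of both components.
        threshold = min(ds.internal[a] + k / ds.size[a],
                        ds.internal[b] + k / ds.size[b])
        if weight <= threshold:
            ds.union(a, b, weight)
    return [ds.find(i) for i in range(num_nodes)]


if __name__ == "__main__":
    signal = [10, 11, 10, 50, 51, 50]  # a 1D "image" with one clear boundary
    chain = [(abs(signal[i] - signal[i + 1]), i, i + 1) for i in range(len(signal) - 1)]
    print(segment(len(signal), chain, k=2))
```

Sorting dominates the running time and the union-find operations are nearly constant each, which is consistent with the nearly linear running time in the number of edges that the abstract reports.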
TL;DR: A comparative evaluation of different detectors is presented and it is shown that the proposed approach for detecting interest points invariant to scale and affine transformations provides better results than existing methods.
Abstract: In this paper we propose a novel approach for detecting interest points invariant to scale and affine transformations. Our scale and affine invariant detectors are based on the following recent results: (1) Interest points extracted with the Harris detector can be adapted to affine transformations and give repeatable results (geometrically stable). (2) The characteristic scale of a local structure is indicated by a local extremum over scale of normalized derivatives (the Laplacian). (3) The affine shape of a point neighborhood is estimated based on the second moment matrix. Our scale invariant detector computes a multi-scale representation for the Harris interest point detector and then selects points at which a local measure (the Laplacian) is maximal over scales. This provides a set of distinctive points which are invariant to scale, rotation and translation as well as robust to illumination changes and limited changes of viewpoint. The characteristic scale determines a scale invariant region for each point. We extend the scale invariant detector to affine invariance by estimating the affine shape of a point neighborhood. An iterative algorithm modifies location, scale and neighborhood of each point and converges to affine invariant points. This method can deal with significant affine transformations including large scale changes. The characteristic scale and the affine shape of neighborhood determine an affine invariant region for each point. We present a comparative evaluation of different detectors and show that our approach provides better results than existing methods. The performance of our detector is also confirmed by excellent matching results; the image is described by a set of scale/affine invariant descriptors computed on the regions associated with our points.
TL;DR: This paper presents an overview of image alignment in a consistent framework, concentrating on the inverse compositional algorithm, and examines which extensions to Lucas-Kanade can be used with it without any significant loss of efficiency.
Abstract: Since the Lucas-Kanade algorithm was proposed in 1981, image alignment has become one of the most widely used techniques in computer vision. Applications range from optical flow and tracking to layered motion, mosaic construction, and face coding. Numerous algorithms have been proposed and a wide variety of extensions have been made to the original formulation. We present an overview of image alignment, describing most of the algorithms and their extensions in a consistent framework. We concentrate on the inverse compositional algorithm, an efficient algorithm that we recently proposed. We examine which of the extensions to Lucas-Kanade can be used with the inverse compositional algorithm without any significant loss of efficiency, and which cannot. In this paper, Part 1 in a series of papers, we cover the quantity approximated, the warp update rule, and the gradient descent approximation. In future papers, we will cover the choice of the error function, how to allow linear appearance variation, and how to impose priors on the parameters.
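For orientation, the quantity minimized and the two warp update rules contrasted in this line of work can be written as below. The notation is an assumption, since the abstract does not fix one: $T$ is the template, $I$ the input image, $W(x; p)$ the warp with parameters $p$, and $\Delta p$ the increment solved for at each iteration.

```latex
% Lucas-Kanade (forward additive): solve for an increment to the parameters
\min_{\Delta p} \sum_{x} \bigl[ I(W(x; p + \Delta p)) - T(x) \bigr]^2,
\qquad p \leftarrow p + \Delta p

% Inverse compositional: solve for an incremental warp applied to the template
\min_{\Delta p} \sum_{x} \bigl[ T(W(x; \Delta p)) - I(W(x; p)) \bigr]^2,
\qquad W(x; p) \leftarrow W(x; p) \circ W(x; \Delta p)^{-1}
```

In the second form the linearization is taken around the template at the identity warp, so the template gradient and the Gauss-Newton Hessian do not depend on $p$ and can be computed once before iterating, which is the usual explanation for the algorithm's efficiency.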
TL;DR: This work proposes an efficient fitting algorithm for AAMs based on the inverse compositional image alignment algorithm and shows that the effects of appearance variation during fitting can be precomputed (“projected out”) using this algorithm and how it can be extended to include a global shape normalising warp.
Abstract: Active Appearance Models (AAMs) and the closely related concepts of Morphable Models and Active Blobs are generative models of a certain visual phenomenon. Although linear in both shape and appearance, overall, AAMs are nonlinear parametric models in terms of the pixel intensities. Fitting an AAM to an image consists of minimising the error between the input image and the closest model instance, i.e., solving a nonlinear optimisation problem. We propose an efficient fitting algorithm for AAMs based on the inverse compositional image alignment algorithm. We show that the effects of appearance variation during fitting can be precomputed (“projected out”) using this algorithm and how it can be extended to include a global shape normalising warp, typically a 2D similarity transformation. We evaluate our algorithm to determine which of its novel aspects improve AAM fitting performance.
TL;DR: A complete system to build visual models from camera images is presented, including a combined approach with view-dependent geometry and texture; the fusion of real and virtual scenes is shown as an application.
Abstract: In this paper a complete system to build visual models from camera images is presented. The system can deal with uncalibrated image sequences acquired with a hand-held camera. Based on tracked or matched features the relations between multiple views are computed. From this both the structure of the scene and the motion of the camera are retrieved. The ambiguity on the reconstruction is restricted from projective to metric through self-calibration. A flexible multi-view stereo matching scheme is used to obtain a dense estimation of the surface geometry. From the computed data different types of visual models are constructed. Besides the traditional geometry- and image-based approaches, a combined approach with view-dependent geometry and texture is presented. As an application, the fusion of real and virtual scenes is also shown.
TL;DR: Two methods to extract invariant regions are presented, and to increase the robustness of the system, two semi-local constraints on combinations of region correspondences (one geometric, the other photometric) are derived that test the consistency of correspondences and hence reject falsely matched regions.
Abstract: ‘Invariant regions’ are self-adaptive image patches that automatically deform with changing viewpoint so as to keep covering identical physical parts of a scene. Such regions can be extracted directly from a single image. They are then described by a set of invariant features, which makes it relatively easy to match them between views, even under wide baseline conditions. In this contribution, two methods to extract invariant regions are presented. The first one starts from corners and uses the nearby edges, while the second one is purely intensity-based. As a matter of fact, the goal is to build an opportunistic system that exploits several types of invariant regions as it sees fit. This yields more correspondences and a system that can deal with a wider range of images. To increase the robustness of the system, two semi-local constraints on combinations of region correspondences are derived (one geometric, the other photometric). They make it possible to test the consistency of correspondences and hence to reject falsely matched regions. Experiments on images of real-world scenes taken from substantially different viewpoints demonstrate the feasibility of the approach.
TL;DR: This work proposes a mechanism for computing a very large number of highly selective features which capture some aspects of this causal structure and shows results on a wide variety of image queries.
Abstract: We present an approach for image retrieval using a very large number of highly selective features and efficient learning of queries. Our approach is predicated on the assumption that each image is generated by a sparse set of visual “causes” and that images which are visually similar share causes. We propose a mechanism for computing a very large number of highly selective features which capture some aspects of this causal structure (in our implementation there are over 46,000 highly selective features). At query time a user selects a few example images, and the AdaBoost algorithm is used to learn a classification function which depends on a small number of the most appropriate features. This yields a highly efficient classification function. In addition we show that the AdaBoost framework provides a natural mechanism for the incorporation of relevance feedback. Finally we show results on a wide variety of image queries.
TL;DR: This paper presents a method for computing elastic registration and warping maps based on the Monge–Kantorovich theory of optimal mass transport, and shows how this approach leads to practical algorithms, and demonstrates the method with a number of examples, including those from the medical field.
Abstract: Image registration is the process of establishing a common geometric reference frame between two or more image data sets possibly taken at different times. In this paper we present a method for computing elastic registration and warping maps based on the Monge–Kantorovich theory of optimal mass transport. This mass transport method has a number of important characteristics. First, it is parameter free. Moreover, it utilizes all of the grayscale data in both images, places the two images on equal footing and is symmetrical: the optimal mapping from image A to image B being the inverse of the optimal mapping from B to A. The method does not require that landmarks be specified, and the minimizer of the distance functional involved is unique; there are no other local minimizers. Finally, optimal transport naturally takes into account changes in density that result from changes in area or volume. Although the optimal transport method is certainly not appropriate for all registration and warping problems, this mass preservation property makes the Monge–Kantorovich approach quite useful for an interesting class of warping problems, as we show in this paper. Our method for finding the registration mapping is based on a partial differential equation approach to the minimization of the L2 Kantorovich–Wasserstein or “Earth Mover's Distance” under a mass preservation constraint. We show how this approach leads to practical algorithms, and demonstrate our method with a number of examples, including those from the medical field. We also extend this method to take into account changes in intensity, and show that it is well suited for applications such as image morphing.
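The symmetry property stated in the abstract is easy to see in a one-dimensional toy case. The sketch below is not the paper's PDE-based method: it relies on the standard fact that, for equal-size sets of points on a line and squared-distance cost, pairing points by rank is optimal. The function names are hypothetical.

```python
def optimal_pairs_1d(source, target):
    """Pair equal-size 1D samples by rank (the monotone rearrangement)."""
    if len(source) != len(target):
        raise ValueError("this toy example assumes equal total mass (equal sample counts)")
    return list(zip(sorted(source), sorted(target)))


def squared_l2_transport_cost(source, target):
    """Mean squared displacement under the rank pairing."""
    pairs = optimal_pairs_1d(source, target)
    return sum((s - t) ** 2 for s, t in pairs) / len(pairs)


if __name__ == "__main__":
    a = [0.0, 1.0, 3.0]
    b = [5.0, 2.0, 4.0]
    forward = optimal_pairs_1d(a, b)
    backward = optimal_pairs_1d(b, a)
    # The mapping from B to A is the inverse of the mapping from A to B.
    print(forward, backward)
    print(squared_l2_transport_cost(a, b), squared_l2_transport_cost(b, a))
```

Note that the pairing needs no landmarks and no tuning parameters, which mirrors, in miniature, two of the characteristics the abstract lists.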
TL;DR: A trainable object detector achieves reliable and efficient detection of human faces and passenger cars with out-of-plane rotation.
Abstract: In this paper we describe a trainable object detector and its instantiations for detecting faces and cars at any size, location, and pose. To cope with variation in object orientation, the detector uses multiple classifiers, each spanning a different range of orientation. Each of these classifiers determines whether the object is present at a specified size within a fixed-size image window. To find the object at any location and size, these classifiers scan the image exhaustively. Each classifier is based on the statistics of localized parts. Each part is a transform from a subset of wavelet coefficients to a discrete set of values. Such parts are designed to capture various combinations of locality in space, frequency, and orientation. In building each classifier, we gathered the class-conditional statistics of these part values from representative samples of object and non-object images. We trained each classifier to minimize classification error on the training set by using Adaboost with Confidence-Weighted Predictions (Schapire and Singer, 1999). In detection, each classifier computes the part values within the image window and looks up their associated class-conditional probabilities. The classifier then makes a decision by applying a likelihood ratio test. For efficiency, the classifier evaluates this likelihood ratio in stages. At each stage, the classifier compares the partial likelihood ratio to a threshold and makes a decision about whether to cease evaluation (labeling the input as non-object) or to continue further evaluation. The detector orders these stages of evaluation from a low-resolution to a high-resolution search of the image. Our trainable object detector achieves reliable and efficient detection of human faces and passenger cars with out-of-plane rotation.
TL;DR: An automated method for fast, ground-based acquisition of large-scale 3D city models is described that uses an aerial photograph or a Digital Surface Model as a global map, to which the ground-based horizontal laser scans are matched.
Abstract: In this paper, we describe an automated method for fast, ground-based acquisition of large-scale 3D city models. Our experimental setup consists of a truck equipped with one camera and two fast, inexpensive 2D laser scanners, being driven on city streets under normal traffic conditions. One scanner is mounted vertically to capture building facades, and the other one is mounted horizontally. Successive horizontal scans are matched with each other in order to determine an estimate of the vehicle's motion, and relative motion estimates are concatenated to form an initial path. Assuming that features such as buildings are visible from both ground-based and airborne view, this initial path is globally corrected by Monte-Carlo Localization techniques. Specifically, the final global pose is obtained by utilizing an aerial photograph or a Digital Surface Model as a global map, to which the ground-based horizontal laser scans are matched. A fairly accurate, textured 3D model of the downtown Berkeley area has been acquired in a matter of minutes, limited only by traffic conditions during the data acquisition phase. Subsequent automated processing time to accurately localize the acquisition vehicle is 235 minutes for a 37-minute, 10.2 km drive, i.e. 23 minutes per kilometer.
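The step in which relative motion estimates are concatenated into an initial path can be sketched as a chain of planar rigid-body compositions. This is an illustration under assumed conventions, not the authors' code: poses are `(x, y, heading)` tuples, each relative motion is expressed in the vehicle's current frame, and the names `compose` and `concatenate` are hypothetical.

```python
import math


def compose(pose, delta):
    """Apply a relative motion (dx, dy, dtheta), given in the vehicle frame, to a global pose."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    return (
        x + dx * math.cos(theta) - dy * math.sin(theta),
        y + dx * math.sin(theta) + dy * math.cos(theta),
        theta + dtheta,
    )


def concatenate(deltas, start=(0.0, 0.0, 0.0)):
    """Chain scan-to-scan motion estimates into a path of global poses."""
    path = [start]
    for delta in deltas:
        path.append(compose(path[-1], delta))
    return path


if __name__ == "__main__":
    # Drive 1 m and turn left 90 degrees, twice, then drive 1 m straight.
    steps = [(1.0, 0.0, math.pi / 2), (1.0, 0.0, math.pi / 2), (1.0, 0.0, 0.0)]
    for pose in concatenate(steps):
        print(pose)
```

Because every pose is built on the previous one, small errors in each relative estimate accumulate along the path, which is why a global correction against an aerial photograph or Digital Surface Model is useful.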
TL;DR: A new algorithm, called SoftPOSIT, for determining the pose of a 3D object from a single 2D image when correspondences between object points and image points are not known, which has an asymptotic run-time complexity that is better than previous methods by a factor of the number of image points.
Abstract: The problem of pose estimation arises in many areas of computer vision, including object recognition, object tracking, site inspection and updating, and autonomous navigation when scene models are available. We present a new algorithm, called SoftPOSIT, for determining the pose of a 3D object from a single 2D image when correspondences between object points and image points are not known. The algorithm combines the iterative softassign algorithm (Gold and Rangarajan, 1996; Gold et al., 1998) for computing correspondences and the iterative POSIT algorithm (DeMenthon and Davis, 1995) for computing object pose under a full-perspective camera model. Our algorithm, unlike most previous algorithms for pose determination, does not have to hypothesize small sets of matches and then verify the remaining image points. Instead, all possible matches are treated identically throughout the search for an optimal pose. The performance of the algorithm is extensively evaluated in Monte Carlo simulations on synthetic data under a variety of levels of clutter, occlusion, and image noise. These tests show that the algorithm performs well in a variety of difficult scenarios, and empirical evidence suggests that the algorithm has an asymptotic run-time complexity that is better than previous methods by a factor of the number of image points. The algorithm is being applied to a number of practical autonomous vehicle navigation problems including the registration of 3D architectural models of a city to images, and the docking of small robots onto larger robots.
TL;DR: This paper demonstrates a new visual motion estimation technique that is able to recover high degree-of-freedom articulated human body configurations in complex video sequences, and is the first computer vision based system able to process such challenging footage.
Abstract: This paper demonstrates a new visual motion estimation technique that is able to recover high degree-of-freedom articulated human body configurations in complex video sequences. We introduce the use and integration of a mathematical technique, the product of exponential maps and twist motions, into a differential motion estimation. This results in solving simple linear systems, and enables us to recover robustly the kinematic degrees-of-freedom in noise and complex self occluded configurations. A new factorization technique lets us also recover the kinematic chain model itself. We are able to track several human walk cycles, several wallaby hop cycles, and two walk cycles of the famous movements of Eadweard Muybridge's motion studies from the last century. To the best of our knowledge, this is the first computer vision based system that is able to process such challenging footage.
TL;DR: This paper describes the automatic acquisition of three dimensional architectural models from short image sequences using Bayesian and model based methods and proves the validity of the prior by verifying that plausible buildings are generated under varying conditions.
Abstract: This paper describes the automatic acquisition of three dimensional architectural models from short image sequences. The approach is Bayesian and model based. Bayesian methods necessitate the formulation of a prior distribution; however, designing a generative model for buildings is a difficult task. In order to overcome this a building is described as a set of walls together with a ‘Lego’ kit of parameterised primitives, such as doors or windows. A prior on wall layout, and a prior on the parameters of each primitive can then be defined. Part of this prior is learnt from training data and part comes from expert architects. The validity of the prior is tested by generating example buildings using MCMC and verifying that plausible buildings are generated under varying conditions. The same MCMC machinery can also be used for optimising the structure recovery, this time generating a range of possible solutions from the posterior. The fact that a range of solutions can be presented allows the user to select the best when the structure recovery is ambiguous.
TL;DR: A new method for the extraction of roads from remotely sensed images is proposed: under the assumption that roads form a thin network in the image, the network is approximated by connected line segments, and the estimate is found by minimizing an energy function.
Abstract: In this paper we propose a new method for the extraction of roads from remotely sensed images. Under the assumption that roads form a thin network in the image, we approximate such a network by connected line segments. To perform this task, we construct a point process able to simulate and detect thin networks. The segments have to be connected, in order to form a line-network. Aligned segments are favored whereas superposition is penalized. These constraints are enforced by the interaction model (called the Candy model). The specific properties of the road network in the image are described by the data term. This term is based on statistical hypothesis tests. The proposed probabilistic model can be written within a Gibbs point process framework. The estimate for the network is found by minimizing an energy function. In order to avoid local minima, we use a simulated annealing algorithm, based on a Monte Carlo dynamics (RJMCMC) for finite point processes. Results are shown on SPOT, ERS and aerial images.
TL;DR: An optimized random sampling algorithm that is able to detect a rigid motion and estimate the fundamental matrix when the set of point matches contains up to 90% of outliers, which outperforms the best currently known methods like M-estimators, LMedS, classical RANSAC and Tensor Voting.
Abstract: The perspective projections of n physical points on two views (stereovision) are constrained as soon as n ≥ 8. However, to prove in practice the existence of a rigid motion between two images, more than 8 point matches are desirable in order to compensate for the limited accuracy of the matches. In this paper, we propose a computational definition of rigidity and a probabilistic criterion to rate the meaningfulness of a rigid set as a function of both the number of pairs of points (n) and the accuracy of the matches. This criterion yields an objective way to compare, say, precise matches of a few points and approximate matches of a lot of points. It gives a yes/no answer to the question: “could this rigid points correspondence have occurred by chance?”, since it guarantees that the expected number of meaningful rigid sets found by chance in a random distribution of points is as small as desired. It also yields absolute accuracy requirements for rigidity detection in the case of non-matched points, and optimal values of n, depending on the expected accuracy of the matches and on the proportion of outliers. We use it to build an optimized random sampling algorithm that is able to detect a rigid motion and estimate the fundamental matrix when the set of point matches contains up to 90% of outliers, which outperforms the best currently known methods like M-estimators, LMedS, classical RANSAC and Tensor Voting.
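To see why 90% outliers is a demanding regime for plain random sampling, the textbook trial-count estimate for classical RANSAC can be evaluated directly. This is the standard formula, not the paper's optimized algorithm or its meaningfulness criterion; the function name and the default 99% confidence are assumptions, and the sample size of 8 follows the point count mentioned in the abstract.

```python
import math


def ransac_trials(inlier_ratio, sample_size, confidence=0.99):
    """Number of random samples needed so that, with the given confidence,
    at least one sample consists only of inliers."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    p_all_inliers = inlier_ratio ** sample_size
    if p_all_inliers >= 1.0:
        return 1
    # log1p keeps precision when p_all_inliers is tiny.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_all_inliers))


if __name__ == "__main__":
    for outliers in (0.5, 0.7, 0.9):
        print(outliers, ransac_trials(1.0 - outliers, sample_size=8))
```

With 8-point samples, the required number of draws grows from roughly a thousand at 50% outliers to several hundred million at 90%, which gives a sense of why an optimized sampling strategy matters at that outlier rate.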
TL;DR: A 3D texture recognition method is designed which employs the BFH as the surface model, and classifies surfaces based on a single novel texture image of unknown imaging parameters, and a computational method for quantitatively evaluating the relative significance of texture images within the BTF is developed.
Abstract: Textured surfaces are an inherent constituent of the natural surroundings, therefore efficient real-world applications of computer vision algorithms require precise surface descriptors. Often textured surfaces present not only variations of color or reflectance, but also local height variations. This type of surface is referred to as a 3D texture. As the lighting and viewing conditions are varied, effects such as shadowing, foreshortening and occlusions, give rise to significant changes in texture appearance. Accounting for the variation of texture appearance due to changes in imaging parameters is a key issue in developing accurate 3D texture models. The bidirectional texture function (BTF) is observed image texture as a function of viewing and illumination directions. In this work, we construct a BTF-based surface model which captures the variation of the underlying statistical distribution of local structural image features, as the viewing and illumination conditions are changed. This 3D texture representation is called the bidirectional feature histogram (BFH). Based on the BFH, we design a 3D texture recognition method which employs the BFH as the surface model, and classifies surfaces based on a single novel texture image of unknown imaging parameters. Also, we develop a computational method for quantitatively evaluating the relative significance of texture images within the BTF. The performance of our methods is evaluated by employing over 6200 texture images corresponding to 40 real-world surface samples from the CUReT (Columbia-Utrecht reflectance and texture) database. Our experiments produce excellent classification results, which validate the strong descriptive properties of the BFH as a 3D texture representation.
TL;DR: This paper describes a camera design for simultaneously acquiring multiple images of a scene under different exposures, which are combined into a high dynamic range image, and presents results from a video-rate camera implemented from this design.
Abstract: Most imaging sensors have limited dynamic range and hence are sensitive to only a part of the illumination range present in a natural scene. The dynamic range can be improved by acquiring multiple images of the same scene under different exposure settings and then combining them. In this paper, we describe a camera design for simultaneously acquiring multiple images. The cross-section of the incoming beam from a scene point is partitioned into as many parts as the required number of images. This is done by splitting the aperture into multiple parts and directing the beam exiting from each in a different direction using an assembly of mirrors. A sensor is placed in the path of each beam and exposure of each sensor is controlled either by appropriately setting its exposure parameter, or by splitting the incoming beam unevenly. The resulting multiple exposure images are used to construct a high dynamic range image. We have implemented a video-rate camera based on this design and the results obtained are presented.
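Combining differently exposed measurements of one scene point into a single high dynamic range value can be sketched as a weighted average. The abstract does not describe the combination step, so everything here is an assumption for illustration: a linear sensor response, pixel values normalized to [0, 1], a hat-shaped weighting, the `low` and `high` cut-offs, and the function name.

```python
def merge_exposures(pixels, exposures, low=0.05, high=0.95):
    """Estimate relative scene radiance from one pixel seen at several exposures.

    pixels:    normalized sensor readings in [0, 1], one per exposure
    exposures: relative exposure of each reading (assumed linear response)
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for value, exposure in zip(pixels, exposures):
        if low <= value <= high:  # skip readings that are too dark or saturated
            weight = 1.0 - abs(2.0 * value - 1.0)  # trust mid-range readings most
            weighted_sum += weight * value / exposure
            weight_total += weight
    if weight_total == 0.0:
        # No usable reading: fall back to the one closest to mid-range.
        value, exposure = min(zip(pixels, exposures), key=lambda s: abs(s[0] - 0.5))
        return value / exposure
    return weighted_sum / weight_total


if __name__ == "__main__":
    exposures = [0.25, 1.0, 4.0]
    # A bright point saturates the two longer exposures; only the short one is usable.
    print(merge_exposures([0.5, 1.0, 1.0], exposures))
```

Each usable reading is divided by its exposure to bring it to a common radiance scale before averaging, so a point that saturates one sensor can still be recovered from another.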
TL;DR: This paper formulates the problem of multiple cue integration and tracking in a probabilistic framework based on a factorized graphical model and proposes a sequential Monte Carlo algorithm to provide an efficient simulation and approximation of the co-inferencing of multiple cues.
Abstract: Visual tracking can be treated as a parameter estimation problem that infers target states based on image observations from video sequences. A richer target representation may incur better chances of successful tracking in cluttered and dynamic environments, and thus enhance the robustness. Richer representations can be constructed by either specifying a detailed model of a single cue or combining a set of rough models of multiple cues. Both approaches increase the dimensionality of the state space, which results in a dramatic increase of computation. To investigate the integration of rough models from multiple cues and to explore computationally efficient algorithms, this paper formulates the problem of multiple cue integration and tracking in a probabilistic framework based on a factorized graphical model. Structured variational analysis of such a graphical model factorizes different modalities and suggests a co-inference process among these modalities. Based on the importance sampling technique, a sequential Monte Carlo algorithm is proposed to provide an efficient simulation and approximation of the co-inferencing of multiple cues. This algorithm runs in real-time at around 30 Hz. Our extensive experiments show that the proposed algorithm performs robustly in a large variety of tracking scenarios. The approach presented in this paper has the potential to solve other problems including sensor fusion problems.
TL;DR: A visibility approach that uses all possible color information from the photographs during reconstruction, photo-consistency measures that are more robust and/or require less manual intervention, and a volumetric warping method for application of these reconstruction methods to large-scale scenes are described.
Abstract: In this paper, we present methods for 3D volumetric reconstruction of visual scenes photographed by multiple calibrated cameras placed at arbitrary viewpoints. Our goal is to generate a 3D model that can be rendered to synthesize new photo-realistic views of the scene. We improve upon existing voxel coloring/space carving approaches by introducing new ways to compute visibility and photo-consistency, as well as model infinitely large scenes. In particular, we describe a visibility approach that uses all possible color information from the photographs during reconstruction, photo-consistency measures that are more robust and/or require less manual intervention, and a volumetric warping method for application of these reconstruction methods to large-scale scenes.
TL;DR: This paper describes a view-based method for recognizing 3D objects from 2D images using an aspect-graph structure, where the aspects are not based on the singularities of visual mapping but are instead formed using a notion of shape similarity between views.
Abstract: This paper describes a view-based method for recognizing 3D objects from 2D images. We employ an aspect-graph structure, where the aspects are not based on the singularities of visual mapping but are instead formed using a notion of shape similarity between views. Specifically, the viewing sphere is endowed with a metric of dis-similarity for each pair of views and the problem of aspect generation is viewed as a “segmentation” of the viewing sphere into homogeneous regions. The viewing sphere is sampled at regular (5 degree) intervals and the similarity metric is used in an iterative procedure to combine views into aspects with a prototype representing each aspect. This is done in a “region-growing” regime which stands in contrast to the usual “edge detection” styles of computing the aspect graph. The aspect growth is constrained such that two aspects of an object remain distinct under the given similarity metric. Once the database of 3D objects is organized as a set of aspects, and prototypes for these aspects for each object, unknown views of database objects are compared with the prototypes and the results are ordered by similarity. We use two similarity metrics for shape, one based on curve matching and the other based on matching shock graphs, which for a database of 64 objects and unknown views of objects from the database give a recall rate of (90.3%, 74.2%, 59.7%) and (95.2%, 69.0%, 57.5%), respectively, for the top three matches; the cumulative recall rate based on the top three matches is 98% and 100%, respectively. The results of indexing unknown views of objects not in the database also produce intuitive matches. We also develop a hierarchical indexing scheme to prune unlikely objects at an early stage to improve the efficiency of indexing, resulting in savings of 35% at the top level and of 55% at the next level, cumulatively.
TL;DR: A new representation is proposed that overcomes the appearance variation problem associated with an image sequence and simultaneously optimizes a set of self-consistent depth maps at multiple key-frames.
Abstract: Stereo correspondence algorithms typically produce a single depth map. In addition to the usual problems of occlusions and textureless regions, such algorithms cannot model the variation in scene or object appearance with respect to the viewing position. In this paper, we propose a new representation that overcomes the appearance variation problem associated with an image sequence. Rather than estimating a single depth map, we associate a depth map with each input image (or a subset of them). Our representation is motivated by applications such as view interpolation and depth-based segmentation for model-building or layer extraction. We describe two approaches to extract such a representation from a sequence of images. The first approach, which is more classical, computes the local depth map associated with each chosen reference frame independently. The novelty of this approach lies in its combination of shiftable windows, temporal selection, and graph cut optimization. The second approach simultaneously optimizes a set of self-consistent depth maps at multiple key-frames. Since multiple depth maps are estimated simultaneously, visibility can be modeled explicitly and disparity consistency imposed across the different depth maps. Results, which include a difficult specular scene example, show the effectiveness of our approach.
TL;DR: A user-centric system for visualization and layout for content-based image retrieval and the ability of this framework to model or “mimic” users, by automatically generating layouts according to their preferences is demonstrated.
Abstract: We present a user-centric system for visualization and layout for content-based image retrieval. Image features (visual and/or semantic) are used to display retrievals as thumbnails in a 2-D spatial layout or “configuration” which conveys all pair-wise mutual similarities. A graphical optimization technique is used to provide maximally uncluttered and informative layouts. Moreover, a novel subspace feature weighting technique can be used to modify 2-D layouts in a variety of context-dependent ways. An efficient computational technique for subspace weighting and re-estimation leads to a simple user-modeling framework whereby the system can learn to display query results based on layout examples (or relevance feedback) provided by the user. The resulting retrieval, browsing and visualization can adapt to the user's (time-varying) notions of content, context and preferences in style and interactive navigation. Monte Carlo simulations with machine-generated layouts as well as pilot user studies have demonstrated the ability of this framework to model or “mimic” users, by automatically generating layouts according to their preferences.
TL;DR: An easy-to-use and cost-effective system to construct textured 3D animated face models from videos with minimal user interaction, which makes full use of generic knowledge of faces in head motion determination, head tracking, model fitting, and multiple-view bundle adjustment.
Abstract: We have developed an easy-to-use and cost-effective system to construct textured 3D animated face models from videos with minimal user interaction. This is a particularly challenging task for faces due to a lack of prominent textures. We develop a robust system by following a model-based approach: we make full use of generic knowledge of faces in head motion determination, head tracking, model fitting, and multiple-view bundle adjustment. Our system first takes, with an ordinary video camera, images of a face of a person sitting in front of the camera turning their head from one side to the other. After five manual clicks on two images to indicate the position of the eye corners, nose tip and mouth corners, the system automatically generates a realistic looking 3D human head model that can be animated immediately (different poses, facial expressions and talking). A user, with a PC and a video camera, can use our system to generate his/her face model in a few minutes. The face model can then be imported in his/her favorite game, and the user sees themselves and their friends take part in the game they are playing. We have demonstrated the system on a laptop computer live at many events, and constructed face models for hundreds of people. It works robustly under various environment settings.
TL;DR: The 6 × 6 3D line motion matrix that acts on Plücker coordinates is introduced, its algebraic properties are characterized, and various methods for estimating 3D motion from line correspondences are proposed, based on cost functions defined in images or 3D space.
Abstract: We study the problem of aligning two 3D line reconstructions in projective, affine, metric or Euclidean space. We introduce the 6 × 6 3D line motion matrix that acts on Plücker coordinates. We characterize its algebraic properties and its relation to the usual 4 × 4 point motion matrix, and propose various methods for estimating 3D motion from line correspondences, based on cost functions defined in images or 3D space. We assess the quality of the different estimation methods using simulated data and real images.
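As a rough illustration of the relation between a point motion and the corresponding 6 × 6 line motion, the sketch below covers only the Euclidean special case x' = Rx + t. It assumes Plücker coordinates ordered as (direction, moment), which may differ from the paper's convention; all function names are hypothetical and chosen for this example.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def skew(t):
    """Skew-symmetric matrix [t]x such that [t]x v = t x v."""
    return [[0.0, -t[2], t[1]],
            [t[2], 0.0, -t[0]],
            [-t[1], t[0], 0.0]]

def plucker_from_points(p, q):
    """Plücker coordinates (direction d = q - p, moment m = p x q).
    The (d, m) ordering is an assumption made for this sketch."""
    d = [q[i] - p[i] for i in range(3)]
    return d + cross(p, q)

def line_motion_matrix(R, t):
    """6 x 6 matrix [[R, 0], [[t]x R, R]] for the rigid motion x' = R x + t.
    It follows from d' = R d and m' = R m + t x (R d)."""
    TR = matmul(skew(t), R)
    top = [R[i] + [0.0, 0.0, 0.0] for i in range(3)]
    bottom = [TR[i] + R[i] for i in range(3)]
    return top + bottom

def move_point(R, t, x):
    """Apply the rigid point motion x' = R x + t."""
    Rx = matvec(R, x)
    return [Rx[i] + t[i] for i in range(3)]

def rot_z(theta):
    """Rotation about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

Moving the two defining points and recomputing the line should give the same six numbers as multiplying the original Plücker vector by `line_motion_matrix(R, t)`. The projective and affine cases treated in the paper need a more general construction that this sketch does not attempt.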
TL;DR: A novel and highly robust estimator, called MDPE (Maximum Density Power Estimator), which applies nonparametric density estimation and density gradient estimation techniques in parametric estimation (“model fitting”).
Abstract: In this paper, we propose a novel and highly robust estimator, called MDPE (Maximum Density Power Estimator). This estimator applies nonparametric density estimation and density gradient estimation techniques in parametric estimation (“model fitting”). MDPE optimizes an objective function that measures more than just the size of the residuals. Both the density distribution of data points in residual space and the size of the residual corresponding to the local maximum of the density distribution, are considered as important characteristics in our objective function. MDPE can tolerate more than 85% outliers. Compared with several other recently proposed similar estimators, MDPE has a higher robustness to outliers and less error variance. We also present a new range image segmentation algorithm, based on a modified version of the MDPE (Quick-MDPE), and its performance is compared to several other segmentation methods. Segmentation requires more than a simple-minded application of an estimator, no matter how good that estimator is: our segmentation algorithm overcomes several difficulties faced with applying a statistical estimator to this task.
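The two ingredients named in the abstract, a nonparametric density estimate in residual space and a gradient-following search for its local maximum, can be sketched as follows. This is not the MDPE objective function. It is a minimal illustration that assumes a Gaussian kernel, a fixed bandwidth `h`, and a mean-shift iteration as the density-gradient step, with hypothetical function names.

```python
import math

def kernel_density(x, samples, h):
    """Gaussian kernel density estimate at x (bandwidth h is an assumed parameter)."""
    norm = len(samples) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / norm

def mean_shift_mode(samples, start, h, tol=1e-9, max_iter=500):
    """Climb the kernel density estimate from `start` to a nearby local maximum.
    Each step moves x to the kernel-weighted mean of the samples."""
    x = start
    for _ in range(max_iter):
        weights = [math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples]
        total = sum(weights)
        if total == 0.0:
            break  # no sample within numerical reach of x
        new_x = sum(w * s for w, s in zip(weights, samples)) / total
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Toy residuals: a tight cluster of inliers near zero among scattered outliers.
residuals = [-0.1, -0.05, 0.0, 0.05, 0.1, 5.0, 7.0, -6.0, 9.0, 12.0, -8.0, 15.0]
mode = mean_shift_mode(residuals, start=0.0, h=0.5)
peak_density = kernel_density(mode, residuals, h=0.5)
```

In a model-fitting loop, a candidate model with a dense residual peak close to zero would score well. How the abstract's estimator actually combines the peak density with the residual size is not reproduced here.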
TL;DR: The key observation is that anti-aliased light field rendering is equivalent to eliminating the “double image” artifacts caused by view interpolation, and a closed-form solution of the minimum sampling rate is presented.
Abstract: Recently, many image-based modeling and rendering techniques have been successfully designed to render photo-realistic images without the need for explicit 3D geometry. However, these techniques (e.g., light field rendering (Levoy, M. and Hanrahan, P., 1996. In SIGGRAPH 1996 Conference Proceedings, Annual Conference Series, Aug. 1996, pp. 31–42) and Lumigraph (Gortler, S.J., Grzeszczuk, R., Szeliski, R., and Cohen, M.F., 1996. In SIGGRAPH 1996 Conference Proceedings, Annual Conference Series, Aug. 1996, pp. 43–54)) may require a substantial number of images. In this paper, we adopt a geometric approach to investigate the minimum sampling problem for light field rendering, with and without geometry information of the scene. Our key observation is that anti-aliased light field rendering is equivalent to eliminating the “double image” artifacts caused by view interpolation. Specifically, we present a closed-form solution of the minimum sampling rate for light field rendering. The minimum sampling rate is determined by the resolution of the camera and the depth variation of the scene. This rate is ensured if the optimal constant depth for rendering is chosen as the harmonic mean of the maximum and minimum depths of the scene. Moreover, we construct the minimum sampling curve in the joint geometry and image space, with the consideration of depth discontinuity. The minimum sampling curve quantitatively indicates how reduced geometry information can be compensated by increasing the number of images, and vice versa. Experimental results demonstrate the effectiveness of our theoretical analysis.
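The harmonic-mean rule stated in the abstract can be written out directly. The sketch below computes only that constant depth, not the minimum sampling rate itself; the function name is hypothetical and the depths are assumed to be positive distances from the camera plane.

```python
def optimal_constant_depth(z_min, z_max):
    """Harmonic mean of the nearest and farthest scene depths.
    Equivalently, 1/z_opt is the midpoint of 1/z_min and 1/z_max."""
    if not 0.0 < z_min <= z_max:
        raise ValueError("expected 0 < z_min <= z_max")
    return 2.0 * z_min * z_max / (z_min + z_max)

# Example: a scene spanning depths 1 to 3 (arbitrary units).
z_opt = optimal_constant_depth(1.0, 3.0)
```

Because the reciprocal of the harmonic mean is the average of the reciprocals, this depth sits midway between the two extremes in inverse-depth (disparity-like) terms, which places it closer to the near plane than the arithmetic mean would.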
TL;DR: An image indexing scheme and a query language that allow the user to introduce a cognitive dimension to the search, leading to a “semantic-friendly” query language for browsing and searching diverse collections of images.
Abstract: Image semantics resists all forms of modeling, very much like any kind of intelligence does. However, in order to develop more satisfying image navigation systems, we need tools to construct a semantic bridge between the user and the database. In this paper we present an image indexing scheme and a query language, which allow the user to introduce a cognitive dimension to the search. At an abstract level, this approach consists of: (1) learning the “natural language” that humans speak to communicate their semantic experience of images, (2) understanding the relationships between this language and objective measurable image attributes, and then (3) developing corresponding feature extraction schemes. More precisely, we have conducted a number of subjective experiments in which we asked human subjects to group images, and then explain verbally why they did so. The results of this study indicated that a part of the abstraction involved in image interpretation is often driven by semantic categories, which can be broken into more tangible semantic entities, i.e. objective semantic indicators. By analyzing our experimental data, we have identified some candidate semantic categories (e.g. portraits, people, crowds, cityscapes, landscapes, etc.) and their underlying semantic indicators (e.g. skin, sky, water, object, etc.). These experiments also helped us derive important low-level image descriptors, accounting for our perception of these indicators. We have then used these findings to develop an image feature extraction and indexing scheme. In particular, our feature set has been carefully designed to match the way humans communicate image meaning. This led us to the development of a “semantic-friendly” query language for browsing and searching diverse collections of images. We have implemented our approach into an Internet search engine, and tested it on a large number of images. The results we obtained are very promising.
TL;DR: Since every symmetric structure admits a “canonical” coordinate frame with respect to which the group action can be naturally represented, the canonical pose between the viewer and this canonical frame can be recovered too, which explains why symmetric objects provide us overwhelming clues to their orientation and position.
Abstract: In this paper, we provide a principled explanation of how knowledge of global 3-D structural invariants, typically captured by a group action on a symmetric structure, can dramatically facilitate the task of reconstructing a 3-D scene from one or more images. More importantly, since every symmetric structure admits a “canonical” coordinate frame with respect to which the group action can be naturally represented, the canonical pose between the viewer and this canonical frame can be recovered too, which explains why symmetric objects (e.g., buildings) provide us overwhelming clues to their orientation and position. We give the necessary and sufficient conditions in terms of the symmetry (group) admitted by a structure under which this pose can be uniquely determined. We also characterize, when such conditions are not satisfied, to what extent this pose can be recovered. We show how algorithms from conventional multiple-view geometry, once properly modified and extended, can be directly applied to perform such recovery, from all “hidden images” of one image of the symmetric structure. We also apply our results to a wide range of applications in computer vision and image processing such as camera self-calibration, image segmentation and global orientation, large baseline feature matching, image rendering and photo editing, as well as visual illusions (caused by symmetry if incorrectly assumed).