
Showing papers by "Takeo Kanade published in 2004"


Proceedings ArticleDOI
27 Jun 2004
TL;DR: This work describes how a non-rigid structure-from-motion algorithm can be used to construct the corresponding 3D shape modes of a 2D AAM and proposes a real-time algorithm for fitting the AAM while enforcing the constraints, creating what it calls a "combined 2D+3D AAM".
Abstract: Active appearance models (AAMs) are generative models commonly used to model faces. Another closely related type of face model is the 3D morphable model (3DMM). Although AAMs are 2D, they can still be used to model 3D phenomena such as faces moving across pose. We first study the representational power of AAMs and show that they can model anything a 3DMM can, but possibly require more shape parameters. We quantify the number of additional parameters required and show that 2D AAMs can generate model instances that are not possible with the equivalent 3DMM. We proceed to describe how a non-rigid structure-from-motion algorithm can be used to construct the corresponding 3D shape modes of a 2D AAM. We then show how the 3D modes can be used to constrain the AAM so that it can only generate model instances that can also be generated with the 3D modes. Finally, we propose a real-time algorithm for fitting the AAM while enforcing the constraints, creating what we call a "combined 2D+3D AAM".

480 citations


Book ChapterDOI
11 May 2004
TL;DR: The point set registration problem is defined as finding the maximum kernel correlation configuration of the two point sets to be registered; the new registration method has intuitive interpretations, a simple-to-implement algorithm, and an easily proved convergence property.
Abstract: Correlation is a very effective way to align intensity images. We extend the correlation technique to point set registration using a method we call kernel correlation. Kernel correlation is an affinity measure, and it is also a function of the point set entropy. We define the point set registration problem as finding the maximum kernel correlation configuration of the two point sets to be registered. The new registration method has intuitive interpretations, a simple-to-implement algorithm, and an easily proved convergence property. Our method shows favorable performance when compared with the iterative closest point (ICP) and EM-ICP methods.

439 citations
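To make the kernel-correlation idea above concrete, here is a minimal sketch (helper names are hypothetical; a Gaussian kernel and plain gradient ascent over a translation stand in for the paper's optimizer, which also handles rotation):

```python
import numpy as np

def kernel_correlation(X, Y, sigma=1.0):
    """Sum of Gaussian affinities between all pairs of points drawn
    from the two point sets (up to a constant factor)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def register_translation(X, Y, sigma=1.0, steps=300, lr=0.05):
    """Find the translation t of Y that maximizes kernel correlation
    with X, by plain gradient ascent (a toy stand-in for the paper's
    optimizer; only translation is estimated here)."""
    t = np.zeros(X.shape[1])
    for _ in range(steps):
        diff = X[:, None, :] - (Y[None, :, :] + t)       # (n, m, d)
        w = np.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))
        grad = (w[..., None] * diff).sum((0, 1)) / sigma ** 2
        t += lr * grad                                   # ascend
    return t
```

With Y a translated copy of X, the recovered t approaches the true shift, and the kernel correlation of the registered pair exceeds that of the unregistered one.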


Journal ArticleDOI
TL;DR: A trainable object detector achieves reliable and efficient detection of human faces and passenger cars with out-of-plane rotation.
Abstract: In this paper we describe a trainable object detector and its instantiations for detecting faces and cars at any size, location, and pose. To cope with variation in object orientation, the detector uses multiple classifiers, each spanning a different range of orientation. Each of these classifiers determines whether the object is present at a specified size within a fixed-size image window. To find the object at any location and size, these classifiers scan the image exhaustively. Each classifier is based on the statistics of localized parts. Each part is a transform from a subset of wavelet coefficients to a discrete set of values. Such parts are designed to capture various combinations of locality in space, frequency, and orientation. In building each classifier, we gathered the class-conditional statistics of these part values from representative samples of object and non-object images. We trained each classifier to minimize classification error on the training set by using Adaboost with Confidence-Weighted Predictions (Schapire and Singer, 1999). In detection, each classifier computes the part values within the image window and looks up their associated class-conditional probabilities. The classifier then makes a decision by applying a likelihood ratio test. For efficiency, the classifier evaluates this likelihood ratio in stages. At each stage, the classifier compares the partial likelihood ratio to a threshold and makes a decision about whether to cease evaluation—labeling the input as non-object—or to continue further evaluation. The detector orders these stages of evaluation from a low-resolution to a high-resolution search of the image. Our trainable object detector achieves reliable and efficient detection of human faces and passenger cars with out-of-plane rotation.

399 citations
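The staged likelihood-ratio evaluation described above can be sketched as follows (the per-stage tables of part log-likelihood ratios and the thresholds are hypothetical stand-ins for the trained quantities):

```python
def cascaded_detector(part_values, stages, final_threshold=0.0):
    """Staged likelihood-ratio test: accumulate per-part log-likelihood
    ratios stage by stage and stop early, labeling the window
    non-object, as soon as the partial ratio drops below that stage's
    rejection threshold.  `stages` is a list of
    (log_ratio_table, reject_below) pairs -- a hypothetical structure
    standing in for the trained lookup tables."""
    log_lr = 0.0
    for (table, reject_below), value in zip(stages, part_values):
        log_lr += table[value]          # class-conditional lookup
        if log_lr < reject_below:
            return False                # cease evaluation: non-object
    return log_lr >= final_threshold    # full evaluation: final test
```

Most non-object windows fall below a rejection threshold in the first stages, so the exhaustive scan stays efficient.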


Book ChapterDOI
11 May 2004
TL;DR: This paper proves that, under the weak-perspective projection model, enforcing both the basis and the rotation constraints leads to a closed-form solution to the problem of non-rigid shape and motion recovery, and proposes a set of novel constraints, basis constraints, which uniquely determine the shape bases.
Abstract: Recovery of three dimensional (3D) shape and motion of non-static scenes from a monocular video sequence is important for applications like robot navigation and human computer interaction. If every point in the scene randomly moves, it is impossible to recover the non-rigid shapes. In practice, many non-rigid objects, e.g. the human face under various expressions, deform with certain structures. Their shapes can be regarded as a weighted combination of certain shape bases. Shape and motion recovery under such situations has attracted much interest. Previous work on this problem [6,4,13] utilized only orthonormality constraints on the camera rotations (rotation constraints). This paper proves that using only the rotation constraints results in ambiguous and invalid solutions. The ambiguity arises from the fact that the shape bases are not unique because their linear transformation is a new set of eligible bases. To eliminate the ambiguity, we propose a set of novel constraints, basis constraints, which uniquely determine the shape bases. We prove that, under the weak-perspective projection model, enforcing both the basis and the rotation constraints leads to a closed-form solution to the problem of non-rigid shape and motion recovery. The accuracy and robustness of our closed-form solution is evaluated quantitatively on synthetic data and qualitatively on real video sequences.

260 citations


Journal Article
TL;DR: In this article, a process is described for analysing the motion of a human target in a video stream; a "star" skeleton is extracted, and two motion cues are determined from this skeletonization: body posture and cyclic motion of skeleton segments.
Abstract: In this paper a process is described for analysing the motion of a human target in a video stream. Moving targets are detected and their boundaries extracted. From these, a "star" skeleton is produced. Two motion cues are determined from this skeletonization: body posture, and cyclic motion of skeleton segments. These cues are used to determine human activities such as walking or running, and even potentially, the target's gait. Unlike other methods, this does not require an a priori human model, or a large number of "pixels on target". Furthermore, it is computationally inexpensive, and thus ideal for real-world video applications such as outdoor video surveillance.

186 citations
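The "star" skeletonization step can be sketched as follows (a minimal version: the extremal boundary points are local maxima of the centroid-to-boundary distance signal; a simple circular moving average stands in for the smoothing used in the paper):

```python
import numpy as np

def star_skeleton(boundary, smooth=3):
    """'Star' skeleton sketch: the centroid of the target boundary plus
    the boundary points where the centroid-to-boundary distance has a
    local maximum.  `boundary` is an (n, 2) array of points in order
    around the contour."""
    boundary = np.asarray(boundary, dtype=float)
    centroid = boundary.mean(axis=0)
    d = np.linalg.norm(boundary - centroid, axis=1)
    # Circular moving-average smoothing of the distance signal
    k = np.ones(smooth) / smooth
    d = np.convolve(np.r_[d[-smooth:], d, d[:smooth]], k, 'same')[smooth:-smooth]
    # Local maxima of the (circular) distance signal -> extremal points
    peaks = [i for i in range(len(d))
             if d[i] > d[i - 1] and d[i] > d[(i + 1) % len(d)]]
    return centroid, boundary[peaks]
```

Segments from the centroid to the extremal points form the skeleton; their angles over time give the posture and cyclic-motion cues.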


Proceedings ArticleDOI
01 Dec 2004
TL;DR: This paper presents work on real-time 3D vision algorithms for recovering motion and structure from a video sequence, 3D terrain mapping from a laser range finder onboard a small autonomous helicopter, and sensor fusion of visual and GPS/INS sensors.
Abstract: Autonomous control of small and micro air vehicles (SMAV) requires precise estimation of both vehicle state and its surrounding environment. Small cameras, which are available today at very low cost, are attractive sensors for SMAV. 3D vision by video and laser scanning has distinct advantages in that they provide positional information relative to objects and environments, in which the vehicle operates, that is critical to obstacle avoidance and mapping of the environment. This paper presents work on real-time 3D vision algorithms for recovering motion and structure from a video sequence, 3D terrain mapping from a laser range finder onboard a small autonomous helicopter, and sensor fusion of visual and GPS/INS sensors.

128 citations


Book ChapterDOI
26 Sep 2004
TL;DR: Three users used each of three PFS prototype concepts to cut a faceted shape in wax and the results of this experiment were analyzed to identify the largest sources of error.
Abstract: The Precision Freehand Sculptor (PFS) is a compact, handheld, intelligent tool to assist the surgeon in accurately cutting bone. A retractable rotary blade on the PFS allows a computer to control what bone is removed. Accuracy is ensured even though the surgeon uses the tool freehand. The computer extends or retracts the blade based on data from an optical tracking camera. Three users used each of three PFS prototype concepts to cut a faceted shape in wax. The results of this experiment were analyzed to identify the largest sources of error.

83 citations


Journal ArticleDOI
TL;DR: An algorithm to recover the scene structure, the trajectories of the moving objects and the camera motion simultaneously given a monocular image sequence is described, and a unified geometrical representation of the static scene and the moving objects is proposed.
Abstract: In this paper we describe an algorithm to recover the scene structure, the trajectories of the moving objects and the camera motion simultaneously given a monocular image sequence. The number of the moving objects is automatically detected without prior motion segmentation. Assuming that the objects are moving linearly with constant speeds, we propose a unified geometrical representation of the static scene and the moving objects. This representation enables the embedding of the motion constraints into the scene structure, which leads to a factorization-based algorithm. We also discuss solutions to the degenerate cases which can be automatically detected by the algorithm. Extension of the algorithm to weak perspective projections is presented as well. Experimental results on synthetic and real images show that the algorithm is reliable under noise.

69 citations


Proceedings ArticleDOI
28 Sep 2004
TL;DR: CAMEO's fast people detection and tracking module makes use of a combination of frame differencing, face detection, and adaptive color blob tracking based on mean shift analysis to detect and track people in the panoramic image.
Abstract: We have designed a physical awareness system called CAMEO, the camera assisted meeting event observer, which consists of a multi-camera omnidirectional vision system designed to be used in meeting environments. CAMEO is designed to monitor the activities of people in meetings so that it can generate a semantically-indexed summary of what occurred in the meeting. In this paper, we describe CAMEO's fast people detection and tracking module. This module makes use of a combination of frame differencing, face detection, and adaptive color blob tracking based on mean shift analysis to detect and track people in the panoramic image. We describe this algorithm and present experimental results from captured meeting logs.

59 citations


Proceedings ArticleDOI
01 Jan 2004
TL;DR: This paper describes how a single AAM can be fit to multiple images, captured simultaneously by cameras with arbitrary geometry and response functions, and retains the major benefits of Coupled-View AAMs: the integration of information from multiple images into a single model, and improved fitting robustness.
Abstract: Active Appearance Models (AAMs) are a well studied 2D deformable model. One recently proposed extension of AAMs to multiple images is the Coupled-View AAM. Coupled-View AAMs model the 2D shape and appearance of a face in two or more views simultaneously. The major limitation of Coupled-View AAMs, however, is that they are specific to a particular set of cameras, both in geometry and the photometric responses. In this paper, we describe how a single AAM can be fit to multiple images, captured simultaneously by cameras with arbitrary geometry and response functions. Our algorithm retains the major benefits of Coupled-View AAMs: the integration of information from multiple images into a single model, and improved fitting robustness.

57 citations


Proceedings ArticleDOI
27 Jun 2004
TL;DR: The problem of super-resolving a human face video by a very high (×16) zoom factor is considered using a graphical model that encodes (1) spatio-temporal consistencies and (2) image formation and degradation processes.
Abstract: In this paper, we consider the problem of super-resolving a human face video by a very high (×16) zoom factor. Inspired by the literature on hallucination and example-based learning, we formulate this task using a graphical model that encodes (1) spatio-temporal consistencies and (2) image formation and degradation processes. A video database of facial expressions is used to learn a domain-specific prior for high-resolution videos. The problem is posed as one of probabilistic inference, in which we aim to find the high-resolution video that satisfies the constraints expressed through the graphical model. Traditional approaches to this problem using video data first estimate the relative motion between frames and then compensate for it, effectively resulting in multiple measurements of the scene. Our use of time is rather direct: we define data structures that span multiple consecutive frames, enriching our feature vectors with a temporal signature. We then exploit these signatures to find consistent solutions over time. In our experiments, an 8×6 pixel-wide face video, subject to translational jitter and additive noise, gets magnified to a 128×96 pixel video. Our results show that by exploiting both space and time, drastic improvements can be achieved in both video flicker artifacts and mean-squared error.

Proceedings ArticleDOI
19 Jul 2004
TL;DR: In this paper, the problem of 3D non-rigid shape and motion recovery from a monocular video sequence, under the degenerate deformations, was studied, where the shape of a deformable object was regarded as a linear combination of certain shape bases.
Abstract: This paper studies the problem of 3D non-rigid shape and motion recovery from a monocular video sequence, under the degenerate deformations. The shape of a deformable object is regarded as a linear combination of certain shape bases. When the bases are non-degenerate, i.e. of full rank-3, a closed-form solution exists by enforcing linear constraints on both the camera rotation and the shape bases. In practice, degenerate deformations occur often, i.e. some bases are of rank 1 or 2. For example, cars moving or pedestrians walking independently on a straight road refer to rank-1 deformations of the scene. This paper quantitatively shows that, when the shape is composed of only rank-3 and rank-1 bases, i.e. the 3D points either are static or independently move along straight lines, the linear rotation and basis constraints are sufficient to achieve a unique solution. When the shape bases contain rank-2 ones, imposing only the linear constraints results in an ambiguous solution space. In such cases, we propose an alternating linear approach that imposes the positive semi-definite constraint to determine the desired solution in the solution space. The performance of the approach is evaluated quantitatively on synthetic data and qualitatively on real videos.

Proceedings ArticleDOI
07 Jul 2004
TL;DR: In this article, a maximum entropy approach using a non-standard measure of entropy is proposed for collaborative filtering; the approach reduces to a set of linear equations that can be solved efficiently.
Abstract: Within the task of collaborative filtering, two challenges for computing conditional probabilities exist. First, the amount of training data available is typically sparse with respect to the size of the domain. Thus, support for higher-order interactions is generally not present. Second, the variables that we are conditioning upon vary for each query. That is, users label different variables during each query. For this reason, there is no consistent input to output mapping. To address these problems we propose a maximum entropy approach using a non-standard measure of entropy. This approach reduces to a set of linear equations that can be solved efficiently.

Journal Article
TL;DR: In this paper, a method for detecting multiple overlapping objects from a real-time video stream is described, which is based on two processes: pixel analysis and region analysis. Pixel analysis determines whether a pixel is stationary or transient by observing its intensity over time; region analysis detects regions consisting of stationary pixels corresponding to stopped objects.
Abstract: This paper describes a method for detecting multiple overlapping objects from a real-time video stream. Layered detection is based on two processes: pixel analysis and region analysis. Pixel analysis determines whether a pixel is stationary or transient by observing its intensity over time. Region analysis detects regions consisting of stationary pixels corresponding to stopped objects. These regions are registered as layers on the background image, and thus new moving objects passing through these layers can be detected. An important aspect of this work derives from the observation that legitimately moving objects in a scene tend to cause much faster intensity transitions than changes due to lighting, meteorological, and diurnal effects. The resulting system robustly detects objects at an outdoor surveillance site. For 8 hours of video evaluation, a detection rate of 92% was measured, which is higher than that of traditional background subtraction methods.
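The pixel-analysis idea can be sketched for a single pixel's intensity profile (a toy stationary/transient rule based on a sliding window, not the paper's exact test):

```python
def classify_pixel(intensities, stable_tol=5.0, min_stable_frames=10):
    """Label each frame of one pixel's intensity profile as
    'stationary' or 'transient'.  Toy rule: a pixel becomes stationary
    once its intensity has stayed within `stable_tol` of itself for
    `min_stable_frames` consecutive frames; a fast transition (e.g. a
    moving object passing over the pixel) makes it transient again."""
    labels = []
    window = []
    for v in intensities:
        window.append(v)
        window = window[-min_stable_frames:]     # sliding history
        if (len(window) == min_stable_frames
                and max(window) - min(window) <= stable_tol):
            labels.append('stationary')
        else:
            labels.append('transient')
    return labels
```

Runs of stationary pixels at a new intensity are what the region analysis then groups into "stopped object" layers.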

Journal ArticleDOI
TL;DR: Comparisons between human walking and H7 walking are made using the following characteristics: ZMP trajectories, torso movement, free-leg trajectories, joint angle usage, joint torque usage, and so on.

Proceedings ArticleDOI
01 Jan 2004
TL;DR: This paper presents the structure and performance of the newly developed fingerprint imaging system, the outline of the image processing for the quantification of incipient slip, and a force sensor for contact force measurement.
Abstract: This paper presents the structure and performance of the newly developed fingerprint imaging system, and outlines the image processing for the quantification of incipient slip. Incipient slip, which is considered to have a direct relation with slip perception, is visualized as distortion of a fingerprint pattern. A force sensor for contact force measurement is also newly developed. Thus this system enables highly accurate measurement of incipient slip and fingertip contact force.

01 Jan 2004
TL;DR: This paper presents a prediction and planning framework for analysing the safety and interaction of moving objects in complex road scenes that can be applied either as a driver warning system (open loop), as an action recommendation system (human in the loop), or as an intelligent cruise control system (closed loop).
Abstract: This paper presents a prediction and planning framework for analysing the safety and interaction of moving objects in complex road scenes. Rather than detecting specific, known, dangerous configurations, we simulate all the possible motions and interactions of objects. This simulation is used to detect dangerous situations and to select the best path. The best path can be chosen according to a number of different criteria, such as smoothest motion, largest avoiding distance, or quickest path. This framework can be applied either as a driver warning system (open loop), as an action recommendation system (human in the loop), or as an intelligent cruise control system (closed loop). The framework is evaluated using synthetic data on both simple and complex road scenes.

Proceedings ArticleDOI
06 Sep 2004
TL;DR: A computer vision-based system to transfer human motion from one subject to another using a network of eight calibrated and synchronized cameras and an image-based rendering algorithm to render the captured motion applied to the articulated model of another person.
Abstract: We develop a computer vision-based system to transfer human motion from one subject to another. Our system uses a network of eight calibrated and synchronized cameras. We first build detailed kinematic models of the subjects based on our algorithms for extracting shape from silhouette across time (G. Cheung et al., 2003). These models are then used to capture the motion (joint angles) of the subjects in new video sequences. Finally we describe an image-based rendering algorithm to render the captured motion applied to the articulated model of another person. Our rendering algorithm uses an ensemble of spatially and temporally distributed images to generate photo-realistic video of the transferred motion. We demonstrate the performance of the system by rendering throwing and kungfu motions on subjects who did not perform them.

Proceedings ArticleDOI
27 Jun 2004
TL;DR: The k-th nearest neighbor distance (kNND) metric is presented, which, without actually clustering the data, can exploit the intrinsic data cluster structure to detect and remove influential outliers as well as small data clusters.
Abstract: Subspace clustering has many applications in computer vision, such as image/video segmentation and pattern classification. The major issue in subspace clustering is to obtain the most appropriate subspace from the given noisy data. Typical methods (e.g., SVD, PCA, and eigen-decomposition) use least squares techniques, and are sensitive to outliers. In this paper, we present the k-th nearest neighbor distance (kNND) metric, which, without actually clustering the data, can exploit the intrinsic data cluster structure to detect and remove influential outliers as well as small data clusters. The remaining data provide a good initial inlier data set that resides in a linear subspace whose rank (dimension) is upper-bounded. Such linear subspace constraint can then be exploited by simple algorithms, such as iterative SVD algorithm, to (1) detect the remaining outliers that violate the correlation structure enforced by the low rank subspace, and (2) reliably compute the subspace. As an example, we apply our method to extracting layers from image sequences containing dynamically moving objects.
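The kNND outlier-removal step described above can be sketched as follows (a minimal version; the quantile threshold here is an illustrative stand-in for the paper's bound on the subspace dimension):

```python
import numpy as np

def knnd_filter(X, k=5, quantile=0.9):
    """Remove points whose k-th nearest neighbor distance (kNND) is
    large: influential outliers and small clusters have no k close
    neighbors, so their kNND stands out without any explicit
    clustering.  Returns the kept points and all kNND values."""
    # Pairwise distances; sorting each row puts the self-distance
    # (zero) in column 0, so column k is the k-th nearest neighbor.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    knnd = d[:, k]
    keep = knnd <= np.quantile(knnd, quantile)
    return X[keep], knnd
```

The surviving points form the initial inlier set handed to the iterative SVD stage.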

Proceedings ArticleDOI
14 Mar 2004
TL;DR: This paper proposes a system for quickly realizing a function for robustly detecting daily human activity events in handling objects in the real world and evaluates the robustness by comparing RANSAC with a least-squares optimization method.
Abstract: This paper proposes a system for quickly realizing a function for robustly detecting daily human activity events in handling objects in the real world. The system has four functions: 1) robustly measuring 3D positions of the objects; 2) quickly calibrating a system for measuring 3D positions of the objects; 3) quickly registering target activity events; and 4) robustly detecting the registered events in real time. As for 1), the system realizes robust measurement of 3D positions of the objects using an ultrasonic 3D tag system, which is a kind of location sensor, and the robust estimation algorithm known as random sample consensus (RANSAC). The paper evaluates the robustness by comparing RANSAC with a least-squares optimization method. As for 2), the system realizes quick calibration by a calibrating device having three or more ultrasonic transmitters. Quick calibration enables the system to be portable. As for 3), quick registration of target activity events is realized by a stereoscopic camera with ultrasonic 3D tags and interactive software for creating a 3D shape model, creating virtual sensors based on the 3D shape model, and associating the virtual sensors with the target events. The system makes it possible to quickly create object-shaped sensors to which a new function for detecting activity events is added while maintaining the original functions of the objects.
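The RANSAC-based 3D position measurement can be sketched with a linearized range-based solve (function names and the linearization are illustrative; the paper's exact formulation may differ):

```python
import numpy as np

def solve_position(P, r):
    """Linearized trilateration: 3D position from receiver positions P
    (n x 3) and ranges r, by subtracting the first range equation
    ||x - p_i||^2 = r_i^2 to obtain a linear system in x."""
    A = 2 * (P[1:] - P[0])
    b = (r[0] ** 2 - r[1:] ** 2) + (P[1:] ** 2).sum(1) - (P[0] ** 2).sum()
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

def ransac_position(P, r, trials=300, tol=0.05, rng=None):
    """RANSAC over range measurements: repeatedly solve from a random
    minimal set of 4 receivers and keep the solution that the most
    measurements agree with (a sketch of the robust estimation step,
    not the paper's exact implementation)."""
    rng = rng or np.random.default_rng(0)
    best, best_inliers = None, -1
    for _ in range(trials):
        idx = rng.choice(len(P), size=4, replace=False)
        x = solve_position(P[idx], r[idx])
        inliers = np.abs(np.linalg.norm(P - x, axis=1) - r) < tol
        if inliers.sum() > best_inliers:
            best, best_inliers = x, inliers.sum()
    return best, best_inliers
```

Corrupted ultrasonic ranges (multipath, occlusion) end up outside the inlier band, which is what makes this more robust than a plain least-squares fit over all measurements.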

Proceedings ArticleDOI
10 Oct 2004
TL;DR: A feature-tracking-based visual odometry system using stereo vision achieves simultaneous 6D localization and 3D mapping, and is more accurate than typical dead-reckoning systems such as gyro sensors.
Abstract: Localization and 3D mapping are important tasks for a humanoid robot to move in a complex human environment. In this work, we present a feature-tracking-based visual odometry system using stereo vision to achieve simultaneous 6D localization and 3D mapping. A camera position can be estimated from 3D environmental depth data and a previous 3D map, and used to incrementally update the 3D map. Visual odometry is employed to compute the view transformation relating a pair of views. Experimental results show that the camera trajectory and the 3D map are accurately estimated. The system is more accurate than typical dead-reckoning systems such as gyro sensors.
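One standard way to realize the "view transformation relating a pair of views" from tracked stereo features is the SVD-based (Kabsch) rigid alignment of the two 3D point sets; a sketch, not necessarily the paper's exact method:

```python
import numpy as np

def view_transform(A, B):
    """Rigid transform (R, t) with B ~ A @ R.T + t, aligning the 3D
    feature positions A seen in one view to their positions B in the
    next view, via the SVD-based (Kabsch) closed-form solution."""
    ca, cb = A.mean(0), B.mean(0)
    H = (A - ca).T @ (B - cb)            # cross-covariance of the sets
    U, _, Vt = np.linalg.svd(H)
    # Determinant correction rules out reflections
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])
    R = Vt.T @ D @ U.T
    return R, cb - R @ ca
```

Chaining these per-pair transforms gives the incremental 6D camera trajectory used to update the map.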

Book ChapterDOI
26 Sep 2004
TL;DR: This paper presents the principle, structure and performance of a newly developed MR-compatible force sensor that employs a new optical micrometry that enables highly accurate and highly sensitive displacement measurement.
Abstract: This paper presents the principle, structure and performance of a newly developed MR-compatible force sensor. It employs a new optical micrometry that enables highly accurate and highly sensitive displacement measurement. The sensor accuracy is better than 1.0 %, and the maximum displacement of the detector is about 10 μm for a range of the applied force from 0 to 6 N.

01 Jan 2004
TL;DR: The results indicate that the performance of content-free image retrieval (CFIR) improves with the number of accumulated feedback records, outperforming a basic but typical conventional CBIR system.
Abstract: Consider a stereotypical image-retrieval problem: a user submits a set of query images to a system and, through repeated interactions during which the system presents its current choices and the user gives his/her preferences, the choices are narrowed to the image(s) that satisfies the user. The problem obviously must deal with image content, i.e., interpretation and preference. For this purpose, the conventional so-called content-based image retrieval (CBIR) approach uses image-processing and computer-vision techniques and tries to understand the image content. Such attempts have produced good but limited success, mainly because image interpretation is a highly complicated perceptive process. We propose a new approach to this problem from a totally different angle. It attempts to exploit the human's perceptual capabilities and certain common, if not identical, tendencies that must exist among people's interpretations and preferences of images. Instead of processing images, the system simply accumulates records of user feedback and recycles them in the form of collaborative filtering, just like a purchase recommendation system such as Amazon.com. To emphasize the point that it does not deal with image pixel information, we dub the approach "content-free" image retrieval (CFIR). We discuss various issues of image retrieval, argue for the idea of CFIR, and present results of a preliminary experiment. The results indicate that the performance of CFIR improves with the number of accumulated feedback records, outperforming a basic but typical conventional CBIR system.

Book ChapterDOI
06 Jul 2004
TL;DR: This work describes two circular microphone arrays and a square microphone array which can be used for sound localization and sound capture; both developed systems are evaluated by using frequency components of the sound.
Abstract: This work describes two circular microphone arrays and a square microphone array which can be used for sound localization and sound capture. Sound capture by a microphone array is achieved by a sum and delay beam former (SDBF). A dedicated PCI 128-channel simultaneous-input analog-to-digital (AD) board is developed for a 128 ch microphone array, with a maximum sampling rate of one sample per 22.7 μs. Simulations of the sound pressure distribution of the 24 and 128 ch circular microphone arrays and the 128 ch square microphone array are shown. A 24 ch circular microphone array and a 128 ch square microphone array have then been developed. The 24 ch circular microphone array can capture sound from an arbitrary direction; the 128 ch square microphone array can capture sound from a specific point. Both systems are evaluated by using frequency components of the sound. The circular type can be used on a mobile robot, including a humanoid robot, and the square type can be extended toward room-coverage applications.
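The sum-and-delay beamformer (SDBF) named above can be sketched as follows (integer-sample steering delays only, for simplicity; real arrays also need fractional-delay interpolation):

```python
import numpy as np

def delay_and_sum(signals, delays, fs):
    """Sum-and-delay beamformer: advance each microphone's signal by
    its steering delay (converted to whole samples at rate fs) and
    average, so sound arriving from the steered direction adds
    coherently while sound from other directions partially cancels."""
    n = min(len(s) for s in signals)
    shifts = np.round(np.asarray(delays) * fs).astype(int)
    shifts -= shifts.min()               # keep all shifts non-negative
    L = n - shifts.max()                 # common overlap after shifting
    out = np.zeros(L)
    for s, k in zip(signals, shifts):
        out += np.asarray(s)[k:k + L]
    return out / len(signals)
```

Steering the delays to a source direction restores full amplitude; with zero delays, a source arriving half a period apart at two microphones cancels almost completely.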

Proceedings Article
01 Jan 2004
TL;DR: This work describes two circular microphone arrays and a square microphone array which can be used for sound localization and sound capture; both developed systems are evaluated by using frequency components of the sound.
Abstract: This work describes two circular microphone arrays and a square microphone array which can be used for sound localization and sound capture. Sound capture by a microphone array is achieved by a sum and delay beam former (SDBF). A dedicated PCI 128-channel simultaneous-input analog-to-digital (AD) board is developed for a 128 ch microphone array, with a maximum sampling rate of one sample per 22.7 μs. Simulations of the sound pressure distribution of the 24 and 128 ch circular microphone arrays and the 128 ch square microphone array are shown. A 24 ch circular microphone array and a 128 ch square microphone array have then been developed. The 24 ch circular microphone array can capture sound from an arbitrary direction; the 128 ch square microphone array can capture sound from a specific point. Both systems are evaluated by using frequency components of the sound. The circular type can be used on a mobile robot, including a humanoid robot, and the square type can be extended toward room-coverage applications.

Proceedings ArticleDOI
01 Jan 2004
TL;DR: This paper presents the principle, structure and performance of a newly developed MR-compatible force sensor that employs a new optical micrometry that enables highly accurate and highly sensitive displacement measurement.
Abstract: This paper presents the principle, structure and performance of a newly developed MR-compatible force sensor. It employs a new optical micrometry that enables highly accurate and highly sensitive displacement measurement. The sensor accuracy is better than 1.0%, and the maximum displacement of the detector is about 10 μm for a range of the applied force from 0 to 6 N.

Book
20 Oct 2004
TL;DR: Video Structure and Terminology, Multimodal Video Characterization, Video Summarization, Visualization Techniques, Evaluation, and Conclusions.
Abstract: Video Structure and Terminology.- Multimodal Video Characterization.- Video Summarization.- Visualization Techniques.- Evaluation.- Conclusions.

Proceedings ArticleDOI
08 Aug 2004
TL;DR: A computer vision-based system to transfer human motion from one subject to another and an image-based rendering algorithm to render the captured motion applied to the articulated model of another person is developed.
Abstract: In this paper we develop a computer vision-based system to transfer human motion from one subject to another. Our system uses a network of eight calibrated and synchronized cameras. We first build detailed kinematic models of the subjects based on our algorithms for extracting shape from silhouette across time (G. Cheung et al., 2003). These models are then used to capture the motion (joint angles) of the subjects in new video sequences. Finally we describe an image-based rendering algorithm to render the captured motion applied to the articulated model of another person. Our rendering algorithm uses an ensemble of spatially and temporally distributed images to generate photo-realistic video of the transferred motion. We demonstrate the performance of the system by rendering throwing and kungfu motions on subjects who did not perform them.

Proceedings ArticleDOI
19 Jul 2004
TL;DR: Maximizing kernel correlation is shown to be equal to distance minimization in the M-estimator sense, and as a model prior kernel correlation is demonstrated to have good properties that can result in a renderable, very smooth and accurate depth map.
Abstract: All non-trivial stereo problems need model priors to deal with ambiguities and noise perturbations. To meet requirements of increasingly demanding tasks such as modeling for rendering, a proper model prior should impose preference on the true scene structure, while avoiding artificial bias such as fronto-parallel. We introduce a geometric model prior based on a novel technique we call kernel correlation. Maximizing kernel correlation is shown to be equal to distance minimization in the M-estimator sense. As a model prior, kernel correlation is demonstrated to have good properties that can result in renderable, very smooth and accurate depth map. The results are evaluated both qualitatively by view synthesis and quantitatively by error analysis.