Showing papers by Takeo Kanade published in 2002


Journal ArticleDOI
TL;DR: This work derives a sequence of analytical results which show that the reconstruction constraints provide less and less useful information as the magnification factor increases, and proposes a super-resolution algorithm which attempts to recognize local features in the low-resolution images and then enhances their resolution in an appropriate manner.
Abstract: Nearly all super-resolution algorithms are based on the fundamental constraints that the super-resolution image should generate the low-resolution input images when appropriately warped and down-sampled to model the image formation process. (These reconstruction constraints are normally combined with some form of smoothness prior to regularize their solution.) We derive a sequence of analytical results which show that the reconstruction constraints provide less and less useful information as the magnification factor increases. We also validate these results empirically and show that, for large enough magnification factors, any smoothness prior leads to overly smooth results with very little high-frequency content. Next, we propose a super-resolution algorithm that uses a different kind of constraint in addition to the reconstruction constraints. The algorithm attempts to recognize local features in the low-resolution images and then enhances their resolution in an appropriate manner. We call such a super-resolution algorithm a hallucination or recogstruction algorithm. We tried our hallucination algorithm on two different data sets, frontal images of faces and printed Roman text. We obtained significantly better results than existing reconstruction-based algorithms, both qualitatively and in terms of RMS pixel error.

1,418 citations
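As an illustration of what the reconstruction constraints look like in code, here is a minimal sketch (not the authors' implementation): a toy forward model that box-blurs and decimates the high-resolution image, with the estimate recovered by gradient descent on the reconstruction error plus a quadratic smoothness prior. The function names, the alignment-free forward model, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def downsample(hr, factor):
    """Toy forward model: box-blur the high-resolution image and decimate
    (a stand-in for the warp/blur/down-sample chain of the reconstruction constraints)."""
    h, w = hr.shape
    return hr.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(lr, factor):
    """Adjoint of the box down-sampling operator (spread each LR residual evenly)."""
    return np.kron(lr, np.ones((factor, factor))) / factor**2

def super_resolve(lr_images, factor, n_iters=500, lam=0.01, step=1.0):
    """Least-squares super-resolution under the reconstruction constraints,
    regularized by a quadratic (Laplacian) smoothness prior.
    lr_images: pre-aligned low-resolution observations."""
    x = np.kron(lr_images[0], np.ones((factor, factor)))        # initial guess
    for _ in range(n_iters):
        grad = np.zeros_like(x)
        for y in lr_images:
            grad += upsample(downsample(x, factor) - y, factor)  # reconstruction term
        lap = (np.roll(x, 1, 0) + np.roll(x, -1, 0) +
               np.roll(x, 1, 1) + np.roll(x, -1, 1) - 4 * x)
        grad -= lam * lap                                        # smoothness prior term
        x -= step * grad
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = rng.random((32, 32))
    factor = 4
    lows = [downsample(truth, factor) + 0.01 * rng.standard_normal((8, 8))
            for _ in range(10)]
    estimate = super_resolve(lows, factor)
    print("RMS pixel error:", np.sqrt(np.mean((estimate - truth) ** 2)))
```

Increasing `factor` in this toy setup leaves more and more high-resolution pixels constrained only by the prior, which mirrors the paper's argument about diminishing information from the reconstruction constraints.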


Proceedings ArticleDOI
20 May 2002
TL;DR: This paper evaluates a Gabor-wavelet-based method to recognize AUs in image sequences of increasing complexity and finds that the best recognition is a rate of 92.7% obtained by combining Gabor wavelets and geometry features.
Abstract: Previous work suggests that Gabor-wavelet-based methods can achieve high sensitivity and specificity for emotion-specified expressions (e.g., happy, sad) and single action units (AUs) of the Facial Action Coding System (FACS). This paper evaluates a Gabor-wavelet-based method to recognize AUs in image sequences of increasing complexity. A recognition rate of 83% is obtained for three single AUs when image sequences contain homogeneous subjects and are without observable head motion. The accuracy of AU recognition decreases to 32% when the number of AUs increases to nine and the image sequences consist of AU combinations, head motion, and non-homogeneous subjects. For comparison, an average recognition rate of 87.6% is achieved for the geometry-feature-based method. The best recognition is a rate of 92.7% obtained by combining Gabor wavelets and geometry features.

256 citations
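As a sketch of the kind of Gabor-wavelet features such a method builds on, the following extracts magnitude responses of a small Gabor filter bank sampled at a few facial landmark points; the kernel parameters, frequencies, and landmark positions are illustrative assumptions, not the paper's settings, and the resulting feature vector could be fed to any classifier.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(frequency, theta, sigma=4.0, size=31):
    """Complex Gabor kernel: a Gaussian envelope times a complex sinusoid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.exp(2j * np.pi * frequency * rot)
    return envelope * carrier

def gabor_features(image, points, frequencies=(0.1, 0.2, 0.4), n_orient=4):
    """Magnitude of Gabor responses sampled at the given (row, col) points,
    e.g. facial landmarks around the eyes and brows."""
    feats = []
    for f in frequencies:
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            response = np.abs(fftconvolve(image, gabor_kernel(f, theta), mode="same"))
            feats.extend(response[r, c] for r, c in points)
    return np.array(feats)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    face = rng.random((128, 128))                # placeholder for a face image
    landmarks = [(40, 40), (40, 88), (60, 64)]   # hypothetical eye/brow points
    print(gabor_features(face, landmarks).shape)  # 3 freqs x 4 orientations x 3 points
```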



Proceedings ArticleDOI
20 May 2002
TL;DR: This paper presents a method to recover the full-motion (3 rotations and 3 translations) of the head using a cylindrical model and uses the iteratively re-weighted least squares (IRLS) technique in conjunction with the image gradient to deal with non-rigid motion and occlusion.
Abstract: This paper presents a method to recover the full motion (3 rotations and 3 translations) of the head using a cylindrical model. The robustness of the approach is achieved by a combination of three techniques. First, we use the iteratively re-weighted least squares (IRLS) technique in conjunction with the image gradient to deal with non-rigid motion and occlusion. Second, while tracking, the templates are dynamically updated to diminish the effects of self-occlusion and gradual lighting changes and to keep tracking the head when most of the face is not visible. Third, because the dynamic templates may cause error accumulation, we re-register images to a reference frame when the head pose is close to a reference pose. The performance of the real-time tracking program was evaluated in three separate experiments using image sequences (both synthetic and real) for which ground-truth head motion is known. The real sequences included pitch and yaw as large as 40° and 75°, respectively. The average recovery accuracy of the 3D rotations was found to be about 3°.

150 citations
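The IRLS technique mentioned above can be sketched generically on a linear system A p ≈ b, where p stands in for the six motion parameters and each row for one image-gradient constraint; this is a minimal sketch with Huber-style weights, not the paper's exact formulation.

```python
import numpy as np

def irls(A, b, n_iters=20, delta=1.0):
    """Iteratively re-weighted least squares with Huber-style weights.
    Rows with large residuals (e.g. non-rigid motion, occlusion) are down-weighted."""
    p = np.linalg.lstsq(A, b, rcond=None)[0]                     # ordinary LS start
    for _ in range(n_iters):
        r = A @ p - b
        w = np.where(np.abs(r) <= delta, 1.0, delta / np.abs(r))  # Huber weights
        sw = np.sqrt(w)
        p = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]
    return p

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.standard_normal((200, 6))        # 6 parameters: 3 rotations + 3 translations
    p_true = np.array([0.1, -0.2, 0.05, 1.0, -0.5, 0.3])
    b = A @ p_true + 0.01 * rng.standard_normal(200)
    b[:20] += 5.0                             # outliers, e.g. occluded pixels
    print("IRLS estimate:", np.round(irls(A, b), 3))
```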


Proceedings ArticleDOI
10 Dec 2002
TL;DR: A system is described for acquiring multi-view video of a person moving through the environment; a real-time tracking algorithm adjusts the pan, tilt, zoom, and focus parameters of multiple active cameras to keep the moving person centered in each view.
Abstract: A system is described for acquiring multi-view video of a person moving through the environment. A real-time tracking algorithm adjusts the pan, tilt, zoom and focus parameters of multiple active cameras to keep the moving person centered in each view. The output of the system is a set of synchronized, time-stamped video streams, showing the person simultaneously from several viewpoints.

108 citations
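Keeping the person centered can be sketched as converting the target's pixel offset from the image centre into pan/tilt corrections under a pinhole model; a minimal sketch assuming a known focal length in pixels, not the system's actual control loop.

```python
import math

def pan_tilt_correction(target_px, image_size, focal_px):
    """Angular pan/tilt corrections (radians) that would move the tracked person
    from pixel position target_px to the image centre, assuming a pinhole camera
    with focal length focal_px (in pixels)."""
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    dx, dy = target_px[0] - cx, target_px[1] - cy
    pan = math.atan2(dx, focal_px)     # positive: rotate the camera to the right
    tilt = -math.atan2(dy, focal_px)   # positive: rotate the camera upward
    return pan, tilt

if __name__ == "__main__":
    # Person detected at pixel (420, 180) in a 640x480 view, focal length ~800 px.
    pan, tilt = pan_tilt_correction((420, 180), (640, 480), 800.0)
    print(f"pan {math.degrees(pan):.1f} deg, tilt {math.degrees(tilt):.1f} deg")
```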


Proceedings ArticleDOI
26 Jul 2002
TL;DR: This work proposes a fully automatic algorithm for view interpolation of a completely non-rigid dynamic event across both space and time, and uses it to create re-timed slow-motion fly-by movies of dynamic real-world events.
Abstract: We propose a fully automatic algorithm for view interpolation of a completely non-rigid dynamic event across both space and time. The algorithm operates by combining images captured across space to compute voxel models of the scene shape at each time instant, and images captured across time to compute the "scene flow" between the voxel models. The scene-flow is the non-rigid 3D motion of every point in the scene. To interpolate in time, the voxel models are "flowed" using an appropriate multiple of the scene flow and a smooth surface fit to the result. The novel image is then computed by ray-casting to the surface at the intermediate time instant, following the scene flow to the neighboring time instants, projecting into the input images at those times, and finally blending the results. We use our algorithm to create re-timed slow-motion fly-by movies of dynamic real-world events.

96 citations
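The core "flowing" step can be sketched at the level of a point-based scene model: move each point by a fraction of its scene-flow vector and cross-fade the appearance sampled at the two neighbouring time instants. This is a simplified sketch; the paper additionally fits a smooth surface to the flowed voxels and renders novel views by ray-casting.

```python
import numpy as np

def interpolate_shape(points_t, flow_t, colors_t, colors_t1, alpha):
    """Flow a point-based scene model forward by a fraction alpha of the scene flow
    (the non-rigid 3D motion of every point) and cross-fade the appearance sampled
    at the two neighbouring time instants."""
    points_interp = points_t + alpha * flow_t                  # flowed geometry
    colors_interp = (1.0 - alpha) * colors_t + alpha * colors_t1
    return points_interp, colors_interp

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    pts = rng.random((1000, 3))                    # scene points at time t
    flow = 0.05 * rng.standard_normal((1000, 3))   # scene flow from t to t+1
    c0 = rng.random((1000, 3))                     # colours at t
    c1 = rng.random((1000, 3))                     # colours at t+1
    mid_pts, mid_cols = interpolate_shape(pts, flow, c0, c1, alpha=0.5)
    print(mid_pts.shape, mid_cols.shape)
```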


Proceedings ArticleDOI
11 Aug 2002
TL;DR: In this article, a system that detects discrete and important facial actions (e.g., eye blinking) in spontaneously occurring facial behavior with non-frontal pose, moderate out-of-plane head motion, and occlusion was developed.
Abstract: Previous research in automatic facial expression recognition has been limited to recognition of gross expression categories (e.g., joy or anger) in posed facial behavior under well-controlled conditions (e.g., frontal pose and minimal out-of-plane head motion). We developed a system that detects discrete and important facial actions (e.g., eye blinking) in spontaneously occurring facial behavior with non-frontal pose, moderate out-of-plane head motion, and occlusion. The system recovers 3D motion parameters, stabilizes facial regions, extracts motion and appearance information, and recognizes discrete facial actions in spontaneous facial behavior. We tested the system on video data from a 2-person interview. Subjects were ethnically diverse, action units occurred during speech, and out-of-plane motion and occlusion from head motion and glasses were common. The video data were originally collected to answer substantive questions in psychology, and represent a substantial challenge to automated AU recognition. In the analysis of 335 single and multiple blinks and non-blinks, the system achieved 98% accuracy.

86 citations


Patent
12 Feb 2002
TL;DR: In this paper, a plurality of camera systems is positioned relative to a scene such that the camera systems define a gross trajectory; images from the camera systems are transformed to superimpose a secondary induced motion on the gross trajectory and are displayed in sequence corresponding to the positions of the corresponding camera systems along the trajectory.
Abstract: A method and a system of generating a video image sequence. According to one embodiment, the method includes positioning a plurality of camera systems relative to a scene such that the camera systems define a gross trajectory. The method further includes transforming images from the camera systems to superimpose a secondary induced motion on the gross trajectory. Finally, the method includes displaying the transformed images in sequence corresponding to the position of the corresponding camera systems along the gross trajectory.

75 citations



Patent
12 Feb 2002
TL;DR: In this paper, a system of generating an image sequence of an object within a scene is presented, which includes capturing an image (images I1-N) of the object with a plurality of camera systems, wherein the camera systems are positioned around the scene.
Abstract: A method and a system of generating an image sequence of an object within a scene. According to one embodiment, the method includes capturing an image (images I1-N) of the object with a plurality of camera systems, wherein the camera systems are positioned around the scene. Next, the method includes 2D projective transforming certain of the images (I2-N) such that a point of interest in each of the images is at a same position as a point of interest in a first image (I1) from one of the camera systems. The method further includes outputting the transformed images (I2'-N') and the first image (I1) in a sequence corresponding to a positioning of the corresponding camera systems around the scene.

55 citations
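The per-view 2D projective transform can be sketched with an OpenCV homography that maps the point of interest (and its surroundings) onto its position in the reference view; a minimal sketch assuming four hand-picked correspondences per camera, with hypothetical point lists.

```python
import numpy as np
import cv2

def align_to_reference(image, pts, ref_pts, out_size):
    """2D projective transform (homography) that maps the point(s) of interest in
    `image` onto their positions in the reference view, so the object stays fixed
    in the frame as the output sequence cycles through the cameras."""
    H, _ = cv2.findHomography(np.float32(pts), np.float32(ref_pts))
    return cv2.warpPerspective(image, H, out_size)

if __name__ == "__main__":
    img = np.zeros((480, 640, 3), np.uint8)              # placeholder camera image
    # Four hypothetical correspondences around the object of interest.
    pts = [(300, 200), (400, 200), (400, 320), (300, 320)]
    ref_pts = [(280, 180), (380, 185), (385, 305), (275, 300)]
    aligned = align_to_reference(img, pts, ref_pts, (640, 480))
    print(aligned.shape)
```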


Patent
12 Feb 2002
TL;DR: In this article, a system and method for servoing on a moving target within a dynamic scene is described, which includes a master variable pointing camera system and a plurality of slave variable pointing cameras positioned around the scene.
Abstract: A system and method for servoing on a moving target within a dynamic scene. According to one embodiment, the system includes a master variable pointing camera system and a plurality of slave variable pointing camera systems positioned around the scene. The system also includes a master control unit in communication with the master variable pointing camera system. The master control unit is for determining, based on parameters of the master variable pointing camera system, parameters for each of the slave variable pointing camera systems such that, at a point in time, the master variable pointing camera system and the slave variable pointing camera systems are aimed at the target and a size of the target in an image from each of the master variable pointing camera system and the slave variable pointing camera systems is substantially the same. The system also includes a plurality of slave camera control units in communication with the master control unit. The slave camera control units are for controlling at least one of the slave variable pointing camera systems based on the parameters for each of the slave variable pointing camera systems. The system may also include a video image sequence generator in communication with the master control unit and the slave camera control units. The video image sequence generator may generate a video image sequence of the target by outputting an image from certain of the master variable pointing camera system and the slave variable pointing camera systems in sequence according to the position of the master variable pointing camera system and the slave variable pointing camera systems around the scene.
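One way such master-to-slave parameter determination could work is simple pinhole geometry, assuming the target's 3D position is already known: aim each slave's optical axis at the target and scale its focal length with distance so the target's image size matches the master's. This is an illustrative sketch, not the patent's control scheme, and all names and values are hypothetical.

```python
import numpy as np

def slave_parameters(target_xyz, slave_xyz, master_xyz, master_focal):
    """Pan/tilt (radians) aiming a slave camera at the target, plus a focal length
    chosen so the target's image size roughly matches the master's (image size
    scales as focal_length / distance for a pinhole camera)."""
    d = np.asarray(target_xyz) - np.asarray(slave_xyz)
    pan = np.arctan2(d[0], d[1])                       # rotation about the vertical axis
    tilt = np.arctan2(d[2], np.hypot(d[0], d[1]))      # elevation angle
    dist_slave = np.linalg.norm(d)
    dist_master = np.linalg.norm(np.asarray(target_xyz) - np.asarray(master_xyz))
    focal = master_focal * dist_slave / dist_master    # keep the apparent size equal
    return pan, tilt, focal

if __name__ == "__main__":
    pan, tilt, focal = slave_parameters(
        target_xyz=(10.0, 20.0, 1.5),    # e.g. a person on the field
        slave_xyz=(0.0, 0.0, 5.0),
        master_xyz=(40.0, 0.0, 5.0),
        master_focal=1200.0)             # master focal length in pixels
    print(np.degrees(pan), np.degrees(tilt), focal)
```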

Proceedings ArticleDOI
05 Dec 2002
TL;DR: A robust subspace approach to extracting layers from images reliably is presented by taking advantage of the fact that homographies induced by planar patches in the scene form a low dimensional linear subspace, which provides a constraint for detecting outliers in the local measurements, thus making the algorithm robust to outliers.
Abstract: Representing images with layers has many important applications, such as video compression, motion analysis, and 3D scene analysis. The paper presents a robust subspace approach to extracting layers from images reliably by taking advantage of the fact that homographies induced by planar patches in the scene form a low-dimensional linear subspace. Such a subspace provides not only a feature space where layers in the image domain are mapped onto denser and better-defined clusters, but also a constraint for detecting outliers in the local measurements, thus making the algorithm robust to outliers. By enforcing the subspace constraint, spatial and temporal redundancy from multiple frames is simultaneously utilized, and noise can be effectively reduced. Good layer descriptions are shown to be extracted in the experimental results.
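The subspace constraint can be sketched as follows: stack per-patch homography parameter vectors, estimate the dominant linear subspace with an SVD, treat large projection residuals as outliers, and cluster the remaining low-dimensional coordinates into layers. This is a minimal sketch on synthetic data; the subspace rank, the outlier threshold, and the use of k-means are assumptions, not the paper's exact algorithm.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def extract_layers(homographies, rank=3, n_layers=2, outlier_thresh=1.0):
    """homographies: (n_patches, 9) array of per-patch homography parameters.
    Project onto a rank-`rank` linear subspace, reject outliers by their projection
    residual, and cluster the inlier coordinates into layers."""
    H = homographies - homographies.mean(axis=0)
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    basis = Vt[:rank]                           # dominant subspace
    coords = H @ basis.T                        # low-dimensional coordinates
    residual = np.linalg.norm(H - coords @ basis, axis=1)
    inliers = residual < outlier_thresh         # the subspace constraint as an outlier test
    _, labels = kmeans2(coords[inliers], n_layers, minit="++", seed=0)
    return inliers, labels

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    true_basis = rng.standard_normal((3, 9))           # the low-dimensional subspace
    coords_a = rng.normal([3, 0, 0], 1.0, (30, 3))     # patches on plane/layer A
    coords_b = rng.normal([-3, 0, 0], 1.0, (30, 3))    # patches on plane/layer B
    clean = np.vstack([coords_a, coords_b]) @ true_basis
    clean += 0.05 * rng.standard_normal(clean.shape)   # measurement noise
    bad = 1.5 * rng.standard_normal((5, 9))            # grossly wrong local estimates
    inliers, labels = extract_layers(np.vstack([clean, bad]))
    print(inliers.sum(), "inliers; layer sizes:", np.bincount(labels))
```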

Patent
23 Oct 2002
TL;DR: In this article, a system and method for obtaining video of a moving fixation point within a scene is presented, which includes a control unit and a plurality of non-moving image capturing devices positioned around the scene, wherein the scene is within a field of view of each image capturing device.
Abstract: A system and method for obtaining video of a moving fixation point within a scene. According to one embodiment, the system includes a control unit and a plurality of non-moving image capturing devices positioned around the scene, wherein the scene is within a field of view of each image capturing device. The system also includes a plurality of image generators, wherein each image generator is in communication with one of the image capturing devices, and wherein a first of the image generators is responsive to a command from the control unit. The system also includes a surround-view image sequence generator in communication with each of the image generators and responsive to the command from the control unit for generating a surround-view video sequence of the fixation point within the scene based on output from certain of the image generators.

Journal ArticleDOI
Mei Han1, Takeo Kanade
TL;DR: A factorization-based method recovers Euclidean structure from multiple perspective views with uncalibrated cameras; three normalization algorithms enforce Euclidean constraints on camera calibration parameters to recover the scene structure and the camera calibration simultaneously, assuming zero-skew cameras.
Abstract: Structure from motion (SFM), which is recovering camera motion and scene structure from image sequences, has various applications, such as scene modelling, robot navigation, object recognition and virtual reality. Most previous research on SFM requires the use of intrinsically calibrated cameras. In this paper we describe a factorization-based method to recover Euclidean structure from multiple perspective views with uncalibrated cameras. The method first performs a projective reconstruction using a bilinear factorization algorithm, and then converts the projective solution to a Euclidean one by enforcing metric constraints. The process of updating a projective solution to a full metric one is referred to as normalization in most factorization-based SFM methods. We present three normalization algorithms which enforce Euclidean constraints on camera calibration parameters to recover the scene structure and the camera calibration simultaneously, assuming zero-skew cameras. The first two algorithms are linear, one for the case that only the focal lengths are unknown, and another for the case that the focal lengths and the constant principal point are unknown. The third algorithm is bilinear, dealing with the case that the focal lengths, the principal points and the aspect ratios are all unknown. The results of experiments are presented. Copyright © 2002 John Wiley & Sons, Ltd.
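The bilinear factorization step can be sketched as a rank-4 SVD of the scaled measurement matrix. In this minimal sketch the projective depths are simply initialised to one, whereas a full implementation re-estimates them and iterates, and the subsequent normalization to a metric solution is omitted.

```python
import numpy as np

def projective_factorization(points, depths=None):
    """One pass of bilinear (rank-4) factorization.
    points: (n_frames, n_points, 2) image measurements.
    depths: (n_frames, n_points) projective depths; initialised to 1 here, whereas a
    full implementation re-estimates them from the factorization and iterates."""
    F, N, _ = points.shape
    if depths is None:
        depths = np.ones((F, N))
    # Scaled measurement matrix W (3F x N): each column stacks depth * (x, y, 1).
    homog = np.concatenate([points, np.ones((F, N, 1))], axis=2)      # (F, N, 3)
    W = (depths[:, :, None] * homog).transpose(0, 2, 1).reshape(3 * F, N)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    P = U[:, :4] * s[:4]             # projective motion, (3F x 4)
    X = Vt[:4]                       # projective shape, (4 x N)
    return P.reshape(F, 3, 4), X

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    pts = rng.random((6, 50, 2))     # 6 frames, 50 tracked points (synthetic)
    P, X = projective_factorization(pts)
    print(P.shape, X.shape)          # (6, 3, 4), (4, 50)
```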

Proceedings ArticleDOI
01 Aug 2002
TL;DR: An important aspect of this work derives from the observation that legitimately moving objects in a scene tend to cause much faster intensity transitions than changes due to lighting, meteorological, and diurnal effects.
Abstract: This paper describes a method for detecting multiple overlapping objects from a real-time video stream. Layered detection is based on two processes: pixel analysis and region analysis. Pixel analysis determines whether a pixel is stationary or transient by observing its intensity over time. Region analysis detects regions consisting of stationary pixels corresponding to stopped objects. These regions are registered as layers on the background image, and thus new moving objects passing through these layers can be detected. An important aspect of this work derives from the observation that legitimately moving objects in a scene tend to cause much faster intensity transitions than changes due to lighting, meteorological, and diurnal effects. The resulting system robustly detects objects at an outdoor surveillance site. For 8 hours of video evaluation, a detection rate of 92% was measured, which is higher than that of traditional background subtraction methods.
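The pixel-analysis stage can be sketched by inspecting each pixel's intensity history: a fast recent transition marks a transient (moving) pixel, while a pixel that has changed since the start of the window and then settled marks a stationary (stopped-object) pixel. The window length and thresholds below are made-up values, not those of the deployed system.

```python
import numpy as np

def classify_pixels(frames, motion_thresh=15, stability_thresh=5):
    """frames: (T, H, W) grayscale history for each pixel.
    A pixel is 'transient' if it just jumped quickly (a legitimately moving object
    causes fast intensity transitions), 'stationary' if it changed relative to the
    start of the window and then settled (a stopped object, candidate for a layer)."""
    diff_fast = np.abs(frames[-1].astype(int) - frames[-2].astype(int))
    diff_slow = np.abs(frames[-1].astype(int) - frames[0].astype(int))
    settled = frames[len(frames) // 2:].std(axis=0) < stability_thresh
    transient = diff_fast > motion_thresh
    stationary = (~transient) & (diff_slow > motion_thresh) & settled
    return transient, stationary

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    video = np.full((30, 120, 160), 100, np.uint8)
    video += rng.integers(0, 3, video.shape, dtype=np.uint8)   # sensor noise
    video[15:, 40:60, 40:60] = 200       # object that arrived and stopped
    video[-1, 80:100, 100:120] = 30      # object moving through the last frame
    transient, stationary = classify_pixels(video)
    print("transient pixels:", transient.sum(), "stationary pixels:", stationary.sum())
```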

Book ChapterDOI
01 Jan 2002
TL;DR: This final chapter investigates how much extra information is actually added by having more than one image for super-resolution and proposes a super-resolution algorithm which uses a completely different source of information, in addition to the reconstruction constraints.
Abstract: A variety of super-resolution algorithms have been described in this book. Most of them, however, are based on the same source of information: that the super-resolution image should generate the lower-resolution input images when appropriately warped and down-sampled to model image formation. (This information is usually incorporated into super-resolution algorithms in the form of reconstruction constraints, which are frequently combined with a smoothness prior to regularize their solution.) In this final chapter, we first investigate how much extra information is actually added by having more than one image for super-resolution. In particular, we derive a sequence of analytical results which show that the reconstruction constraints provide far less useful information as the decimation ratio increases. We validate these results empirically and show that, for large enough decimation ratios, any smoothness prior leads to overly smooth results with very little high-frequency content, however many (noiseless) low-resolution input images are used. In the second half of this chapter, we propose a super-resolution algorithm which uses a completely different source of information, in addition to the reconstruction constraints. The algorithm recognizes local “features” in the low-resolution images and then enhances their resolution in an appropriate manner, based on a collection of high- and low-resolution training samples. We call such an algorithm a hallucination algorithm.

01 Jan 2002
TL;DR: This thesis presents a novel sensor, calibration methodology, and synchronization approach for a working terrain sensor prototype, which has proven effective in over 50 modeling flights that produced terrain models accurate to <20 cm in 3D.
Abstract: This thesis develops a novel aerial terrain modeling system. The system is unique since it flies onboard a small autonomous helicopter and senses the structure and color of its surroundings to build accurate 3D terrain models. The system is capable of modeling terrain where current approaches are too expensive, too dangerous, or too difficult. The prototype system is primarily composed of a mechanically aligned laser rangefinder and 1-pixel color camera, viewing the terrain through a common scan mechanism. The merit of this sensing approach is that range and color measurements are inherently collected from an identical terrain location. This thesis presents a novel sensor, calibration methodology, and synchronization approach for a working terrain sensor prototype. The prototype's performance was verified by carrying out a number of real-world mapping missions. These missions range from geological feature modeling in the Arctic for NASA scientists, to mapping an urban building complex for DARPA researchers. The system has proven to be effective in over 50 modeling flights, which produced terrain models accurate to <20cm in 3D.
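The fusion of range and colour from the common scan mechanism can be sketched as converting each range/angle sample to a 3D point in the sensor frame, transforming it by the vehicle pose, and attaching the colour sample; a minimal sketch with illustrative names and values, not the thesis's calibration or synchronization model.

```python
import numpy as np

def scan_to_world(ranges, azimuths, elevations, colors, sensor_pose):
    """Convert synchronized range/colour samples from a common scan mechanism into
    coloured 3D points in the world frame.
    sensor_pose: (R, t) rotation matrix and translation of the sensor, e.g. from the
    helicopter's navigation solution."""
    R, t = sensor_pose
    # Ray directions in the sensor frame from the scan angles.
    x = ranges * np.cos(elevations) * np.cos(azimuths)
    y = ranges * np.cos(elevations) * np.sin(azimuths)
    z = ranges * np.sin(elevations)
    pts_sensor = np.stack([x, y, z], axis=1)
    pts_world = pts_sensor @ R.T + t
    return np.hstack([pts_world, colors])          # (N, 6): XYZ + RGB per sample

if __name__ == "__main__":
    n = 5
    ranges = np.linspace(30.0, 35.0, n)                 # metres to the terrain
    az = np.radians(np.linspace(-20, 20, n))
    el = np.radians(np.full(n, -45.0))                  # looking down at the ground
    rgb = np.tile([0.4, 0.5, 0.2], (n, 1))              # colour from the 1-pixel camera
    pose = (np.eye(3), np.array([0.0, 0.0, 100.0]))     # hovering at 100 m altitude
    print(scan_to_world(ranges, az, el, rgb, pose))
```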

Book ChapterDOI
01 Jan 2002
TL;DR: This work identifies another source of error, called feature localization error, which captures how well a feature corresponds to the true 3D point, rather than how well features correspond over multiple images.
Abstract: Uncertainty modeling in 3D Computer Vision typically relies on propagating the uncertainty of measured feature positions through the modeling equations to obtain the uncertainty of the 3D shape being estimated. It is widely believed that this adequately captures the uncertainties of estimated geometric properties when there are no large errors due to mismatching. However, we identify another source of error which we call feature localization error. This captures how well a feature corresponds to the true 3D point, rather than how well features correspond over multiple images. We model this error as independent of the tracking error, and when combined as part of the total error, we show that it is significant and may even dominate the 3D reconstruction error.
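Because the localization error is modeled as independent of the tracking error, their covariances simply add before being propagated through the reconstruction; the following minimal numeric sketch shows how the extra term can dominate the propagated 3D uncertainty (the Jacobian and noise levels are hypothetical).

```python
import numpy as np

# Covariance of the feature-tracking (matching) error in pixels^2, and of the
# feature-localization error (how far the detected feature is from the true 3D
# point's projection). Modeled as independent, so the covariances simply add.
cov_tracking = np.diag([0.25, 0.25])       # e.g. 0.5-pixel matching noise
cov_localization = np.diag([1.0, 1.0])     # e.g. 1-pixel localization noise
cov_total = cov_tracking + cov_localization

# First-order propagation to the 3D estimate through the Jacobian J of the
# reconstruction with respect to the image measurement (hypothetical values).
J = np.array([[0.02, 0.0],
              [0.0, 0.02],
              [0.05, 0.05]])
print("3D covariance with localization error:\n", J @ cov_total @ J.T)
print("3D covariance, tracking error only:\n", J @ cov_tracking @ J.T)
```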

Journal Article
TL;DR: A vision-based monitoring system which classifies targets (vehicles and humans) based on shape appearance, estimates their colors, and detects special targets, from images of color video cameras set up toward a street.
Abstract: This paper describes a vision-based monitoring system which (1) classifies targets (vehicles and humans) based on shape appearance, (2) estimates their colors, and (3) detects special targets, from images of color video cameras set up toward a street. The categories of targets were classified into {human, sedan, van, truck, mule (golf cart for workers), and others}, and their colors were classified into the groups of {red-orange-yellow, green, blue-lightblue, white-silver-gray, darkblue-darkgray-black, and darkred-darkorange}. For the detection of special targets, the test was carried out setting {FedEx van, UPS van, Police Car} as targets and yielded desirable results. The system tracks the target, independently conducts category classification and color estimation, extracts the result with the largest probability throughout the tracking sequence from each result, and provides the data as the final decision. For classification and special target detection, we cooperatively used a stochastic linear discrimination method (linear discriminant analysis: LDA) and a nonlinear decision rule (K-Nearest Neighbor rule: K-NN).
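The cooperative use of LDA and a K-nearest-neighbour rule can be sketched with scikit-learn: project the features with LDA and classify in the discriminant space with K-NN. The synthetic data below merely stands in for the paper's shape-appearance features and six target categories.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for shape-appearance features of the six target categories
# {human, sedan, van, truck, mule, other}.
X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LDA as a class-aware linear projection, K-NN as the nonlinear decision rule.
clf = make_pipeline(LinearDiscriminantAnalysis(n_components=5),
                    KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```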

01 Jan 2002
TL;DR: To show how far machine vision has yet to go, a Benchmark 2000 problem is suggested using children's "what is wrong" puzzles, in which defective objects in a line drawing of a scene must be found.
Abstract: We discuss the need for a new series of benchmarks in the vision field, to provide a direct quantitative measure of progress understandable to sponsors of research as well as a guide to practitioners in the field. A first set of benchmarks in two categories is proposed: (1) static scenes containing man-made objects, and (2) static natural/outdoor scenes. The tests are "end-to-end" and involve determining how well a system can identify instances (an item or condition is present or absent) in selected regions of an image. The scoring would be set up so that the automatic setting of adjustable parameters is rewarded and manual tuning is penalized. To show how far machine vision has yet to go, a Benchmark 2000 problem is also suggested using children's "what is wrong" puzzles in which defective objects in a line drawing of a scene must be found.