
Showing papers by "Takeo Kanade published in 1999"


Proceedings Article•DOI•
20 Sep 1999
TL;DR: This work presents a framework for the computation of dense, non-rigid scene flow from optical flow and shows that multiple estimates of the normal flow cannot be used to estimate dense scene flow directly without some form of smoothing or regularization.
Abstract: Scene flow is the three-dimensional motion field of points in the world, just as optical flow is the two-dimensional motion field of points in an image. Any optical flow is simply the projection of the scene flow onto the image plane of a camera. We present a framework for the computation of dense, non-rigid scene flow from optical flow. Our approach leads to straightforward linear algorithms and a classification of the task into three major scenarios: complete instantaneous knowledge of the scene structure; knowledge only of correspondence information; and no knowledge of the scene structure. We also show that multiple estimates of the normal flow cannot be used to estimate dense scene flow directly without some form of smoothing or regularization.
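The abstract's key relation is that any optical flow is the projection of the scene flow onto the image plane. A minimal sketch of that relation (not the paper's algorithm), assuming a simple pinhole camera with a hypothetical 3x4 projection matrix P:

```python
import numpy as np

def optical_flow_from_scene_flow(P, X, dX):
    """Project a 3-D point X and its scene flow dX through a camera with
    3x4 projection matrix P, returning the image point and its optical flow
    (the derivative of the projected point along dX, via the quotient rule)."""
    Xh = np.append(X, 1.0)                 # homogeneous 3-D point
    u, v, w = P @ Xh
    x = np.array([u / w, v / w])           # image coordinates

    du, dv, dw = P[:, :3] @ dX             # derivative of (u, v, w) along dX
    flow = np.array([(du * w - u * dw) / w**2,
                     (dv * w - v * dw) / w**2])
    return x, flow

# Toy example with an identity-rotation camera at the origin (assumed values).
P = np.hstack([np.eye(3), np.zeros((3, 1))])
x, flow = optical_flow_from_scene_flow(P, X=np.array([0.2, 0.1, 2.0]),
                                        dX=np.array([0.0, 0.0, -0.5]))
print(x, flow)
```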

335 citations


Journal Article•DOI•
TL;DR: Name-It, a system that associates faces and names in news videos, takes a multimodal video analysis approach: face sequence extraction and similarity evaluation from videos, name extraction from transcripts, and video-caption recognition.
Abstract: We developed Name-It, a system that associates faces and names in news videos. It processes information from the videos and can infer possible name candidates for a given face or locate a face in news videos by name. To accomplish this task, the system takes a multimodal video analysis approach: face sequence extraction and similarity evaluation from videos, name extraction from transcripts, and video-caption recognition.
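One plausible ingredient of such a face/name association is temporal co-occurrence between face tracks and transcript name mentions. The sketch below is only illustrative; the intervals, names, window size, and scoring rule are assumptions, not Name-It's actual analysis:

```python
from collections import defaultdict

# Hypothetical face tracks (start, end in seconds) and name-mention times.
face_tracks = {"face_A": [(10, 14), (120, 125)],
               "face_B": [(300, 304)]}
name_mentions = {"CLINTON": [12, 122, 500],
                 "YELTSIN": [302]}

def cooccurrence_score(intervals, mention_times, window=5.0):
    """Count name mentions falling within `window` seconds of a face track."""
    return sum(1 for (s, e) in intervals for t in mention_times
               if s - window <= t <= e + window)

scores = defaultdict(dict)
for face, intervals in face_tracks.items():
    for name, times in name_mentions.items():
        scores[face][name] = cooccurrence_score(intervals, times)

for face, by_name in scores.items():
    best = max(by_name, key=by_name.get)
    print(face, "->", best, by_name)
```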

311 citations


Journal Article•DOI•
TL;DR: An automated method of facial display analysis by feature point tracking demonstrated high concurrent validity with manual FACS coding.
Abstract: The face is a rich source of information about human behavior. Available methods for coding facial displays, however, are human-observer dependent, labor intensive, and difficult to standardize. To enable rigorous and efficient quantitative measurement of facial displays, we have developed an automated method of facial display analysis. In this report, we compare the results of this automated system with those of manual FACS (Facial Action Coding System, Ekman & Friesen, 1978a) coding. One hundred university students were videotaped while performing a series of facial displays. The image sequences were coded from videotape by certified FACS coders. Fifteen action units and action unit combinations that occurred a minimum of 25 times were selected for automated analysis. Facial features were automatically tracked in digitized image sequences using a hierarchical algorithm for estimating optical flow. The measurements were normalized for variation in position, orientation, and scale. The image sequences were randomly divided into a training set and a cross-validation set, and discriminant function analyses were conducted on the feature point measurements. In the training set, average agreement with manual FACS coding was 92% or higher for action units in the brow, eye, and mouth regions. In the cross-validation set, average agreement was 91%, 88%, and 81% for action units in the brow, eye, and mouth regions, respectively. Automated face analysis by feature point tracking demonstrated high concurrent validity with manual FACS coding.
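The classification step described here is a discriminant function analysis on normalized feature-point measurements. A hedged sketch of that step using scikit-learn's LinearDiscriminantAnalysis on synthetic displacement features (the original study used its own statistics pipeline; the data and class labels below are made up):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for normalized feature-point displacements: each row
# holds (dx, dy) values for tracked brow/eye/mouth points, each label a
# hypothetical action-unit code.
n_per_class, n_features = 60, 12
X = np.vstack([rng.normal(loc=mu, scale=0.3, size=(n_per_class, n_features))
               for mu in (0.0, 0.8, -0.8)])
y = np.repeat([0, 1, 2], n_per_class)     # e.g. AU1, AU2, AU4 (illustrative)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print("agreement on held-out set:", lda.score(X_test, y_test))
```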

287 citations


30 Nov 1999
TL;DR: An overview is presented of the VSAM system, which uses multiple, cooperative video sensors to provide continuous coverage of people and vehicles in a cluttered environment, and of the technical accomplishments that have been achieved.
Abstract: Under the three-year Video Surveillance and Monitoring (VSAM) project, the Robotics Institute at Carnegie Mellon University (CMU) and the Sarnoff Corporation have developed a system for autonomous Video Surveillance and Monitoring. The technical approach uses multiple, cooperative video sensors to provide continuous coverage of people and vehicles in a cluttered environment. This final report presents an overview of the system, and of the technical accomplishments that have been achieved. Details can be found in a set of previously published papers that together comprise Appendix A.

279 citations


01 Jan 1999
TL;DR: In this paper, an accurate, high-bandwidth, linear state-space model was derived for the hover condition of a fully-instrumented model-scale unmanned helicopter (Yamaha R-50 with 10 ft. diameter rotor) for dynamic model identification.
Abstract: Flight testing of a fully-instrumented model-scale unmanned helicopter (Yamaha R-50 with 10 ft. diameter rotor) was conducted for the purpose of dynamic model identification. This paper describes the application of CIFER system identification techniques, which have been developed for full size helicopters, to this aircraft. An accurate, high-bandwidth, linear state-space model was derived for the hover condition. The model structure includes the explicit representation of regressive rotor-flap dynamics, rigid-body fuselage dynamics, and the yaw damper. The R-50 configuration and identified dynamics are compared with those of a dynamically scaled UH-1H. The identified model shows excellent predictive capability and is well suited for flight control design and simulation applications.
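CIFER works in the frequency domain; as a loose, time-domain illustration of what "deriving a linear state-space model from flight data" means, the sketch below fits x_{k+1} = A x_k + B u_k to logged states and inputs by least squares. The system, noise level, and data are made up, and this is not the paper's identification method:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate logged "hover" data from a made-up 2-state, 1-input system.
A_true = np.array([[0.98, 0.05], [-0.04, 0.95]])
B_true = np.array([[0.0], [0.1]])
x = np.zeros((200, 2)); u = rng.normal(size=(200, 1))
for k in range(199):
    x[k + 1] = A_true @ x[k] + (B_true @ u[k]).ravel() + 0.001 * rng.normal(size=2)

# Stack x_{k+1} = [A B] [x_k; u_k] and solve one least-squares problem.
Z = np.hstack([x[:-1], u[:-1]])            # regressors, shape (N-1, 3)
Theta, *_ = np.linalg.lstsq(Z, x[1:], rcond=None)
A_hat, B_hat = Theta[:2].T, Theta[2:].T
print("A estimate:\n", A_hat, "\nB estimate:\n", B_hat)
```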

218 citations


Journal Article•DOI•
TL;DR: To solve two problems of character recognition for videos, low-resolution characters and extremely complex backgrounds, an interpolation filter, multi-frame integration, and character extraction filters are applied; the overall recognition results are satisfactory for use in news indexing.
Abstract: The automatic extraction and recognition of news captions and annotations can be of great help in locating topics of interest in digital news video libraries. To achieve this goal, we present a technique, called Video OCR (Optical Character Reader), which detects, extracts, and reads text areas in digital video data. In this paper, we address these problems, describe the method by which Video OCR operates, and suggest applications for its use in digital news archives. To solve two problems of character recognition for videos, low-resolution characters and extremely complex backgrounds, we apply an interpolation filter, multi-frame integration, and character extraction filters. Character segmentation is performed by a recognition-based segmentation method, and intermediate character recognition results are used to improve the segmentation. We also include a method for locating text areas using text-like properties and a language-based postprocessing technique to increase word recognition rates. The overall recognition results are satisfactory for use in news indexing. Performing Video OCR on news video and combining its results with other video understanding techniques will improve the overall understanding of the news video content.
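A rough sketch of the two enhancement ideas named above, interpolation and multi-frame integration, assuming already-aligned caption crops, bright captions over a mostly darker changing background, and OpenCV's resize; the actual filters in the paper are more elaborate:

```python
import cv2
import numpy as np

def integrate_caption_frames(frames, scale=4):
    """Enhance a static caption from several video frames (illustrative).

    frames : list of uint8 grayscale crops of the same caption region.
    Upsamples each crop (interpolation) and takes a per-pixel minimum
    (multi-frame integration), assuming bright text whose background
    clutter varies from frame to frame.
    """
    ups = [cv2.resize(f, None, fx=scale, fy=scale,
                      interpolation=cv2.INTER_CUBIC) for f in frames]
    return np.minimum.reduce(ups)

# Usage with made-up frames (replace with real caption crops):
frames = [np.random.randint(0, 255, (24, 120), dtype=np.uint8) for _ in range(8)]
enhanced = integrate_caption_frames(frames)
print(enhanced.shape)  # (96, 480)
```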

215 citations


01 Jan 1999
TL;DR: The objective is to develop a cooperative, multi-sensor video surveillance system that provides continuous coverage over battlefield areas; achievements have been demonstrated during VSAM Demo I.
Abstract: Carnegie Mellon University (CMU) and the Sarnoff Corporation (Sarnoff) are performing an integrated feasibility demonstration of Video Surveillance and Monitoring (VSAM). The objective is to develop a cooperative, multi-sensor video surveillance system that provides continuous coverage over battlefield areas. Significant achievements have been demonstrated during VSAM Demo I in November 1997, and in the intervening year leading up to Demo II in October 1998.

206 citations


Journal Article•DOI•
31 Aug 1999
TL;DR: A visual odometer for autonomous helicopter flight that estimates helicopter position by visually locking on to and tracking ground objects and the philosophy behind the odometer as well as its tracking algorithm and implementation are described.
Abstract: This paper presents a visual odometer for autonomous helicopter flight. The odometer estimates helicopter position by visually locking on to and tracking ground objects. The paper describes the philosophy behind the odometer as well as its tracking algorithm and implementation. The paper concludes by presenting test flight data of the odometer's performance on-board indoor and outdoor prototype autonomous helicopters.
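One plausible building block of "visually locking on to and tracking ground objects" is template tracking by normalized cross-correlation, sketched below with OpenCV (OpenCV-style API assumed); the odometer itself uses dedicated tracking hardware and stereo scaling, so this is only illustrative:

```python
import cv2
import numpy as np

def track_patch(prev_frame, next_frame, top_left, size=64, search=32):
    """Track a square ground patch between frames with normalized
    cross-correlation; returns the patch's new top-left corner."""
    x, y = top_left
    template = prev_frame[y:y + size, x:x + size]
    y0, y1 = max(0, y - search), y + size + search
    x0, x1 = max(0, x - search), x + size + search
    window = next_frame[y0:y1, x0:x1]
    res = cv2.matchTemplate(window, template, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(res)
    return (x0 + max_loc[0], y0 + max_loc[1])

# Accumulating the per-frame shift gives an image-space displacement estimate,
# which an odometer would scale by range (e.g. from stereo) to obtain position.
```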

177 citations


Proceedings Article•DOI•
01 Jan 1999
TL;DR: The quality of the virtual view images re-synthesized from the projective shape demonstrates the effectiveness of the proposed scheme for projective reconstruction from a large number of images.
Abstract: This paper proposes a new scheme for multi-image projective reconstruction based on a projective grid space. The projective grid space is defined by two basis views and the fundamental matrix relating these views. Given fundamental matrices relating other views to each of the two basis views, this projective grid space can be related to any view. Because the projective grid space is a general space related to all images, a projective shape can be reconstructed from all the images of weakly calibrated cameras. Projective reconstruction is one way to reduce the calibration effort because it does not need Euclidean metric information, but only correspondences of several points between the images. To demonstrate the effectiveness of the proposed projective grid definition, we modify the voxel coloring algorithm for the projective voxel scheme. The quality of the virtual view images re-synthesized from the projective shape demonstrates the effectiveness of our proposed scheme for projective reconstruction from a large number of images.
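In the projective grid space, a grid point is tied to a pixel in basis view 1 and to a position on its epipolar line in basis view 2; given fundamental matrices to any other view, the point projects to the intersection of two epipolar lines. A minimal sketch of that projection (the fundamental matrices below are placeholders, and the convention F maps a point to its epipolar line in view i is assumed):

```python
import numpy as np

def project_grid_point(x1, x2, F1i, F2i):
    """Project a projective-grid point into view i.

    x1, x2 : homogeneous image points of the grid point in the two basis
             views (x2 lies on the epipolar line of x1 in basis view 2).
    F1i, F2i : fundamental matrices from basis views 1 and 2 to view i.
    The projection is the intersection of the two epipolar lines in view i.
    """
    l1 = F1i @ x1                     # epipolar line of x1 in view i
    l2 = F2i @ x2                     # epipolar line of x2 in view i
    xi = np.cross(l1, l2)             # homogeneous intersection point
    return xi / xi[2]

# Example with placeholder (uncalibrated, made-up) fundamental matrices:
x1 = np.array([100.0, 80.0, 1.0])
x2 = np.array([140.0, 82.0, 1.0])
F1i = np.array([[0, -1e-4, 0.02], [1e-4, 0, -0.03], [-0.02, 0.03, 1]])
F2i = np.array([[0, -2e-4, 0.01], [2e-4, 0, -0.05], [-0.01, 0.05, 1]])
print(project_grid_point(x1, x2, F1i, F2i))
```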

105 citations


Journal Article•DOI•
TL;DR: This paper proposes a method that can realize correct visual/haptic registration, namely WYSIWYF, by using a vision-based, object-tracking technique and a video-keying technique and provides realistic haptic sensations, such as free-to-touch and move-and-collide.
Abstract: To build a VR training system for visuomotor skills, an image displayed by a visual interface should be correctly registered to a haptic interface so that the visual sensation and the haptic sensation are both spatially and temporally consistent. In other words, it is desirable that what you see is what you feel (WYSIWYF). In this paper, we propose a method that can realize correct visual/haptic registration, namely WYSIWYF, by using a vision-based, object-tracking technique and a video-keying technique. Combining an encountered-type haptic device with a motion-command-type haptic rendering algorithm makes it possible to deal with two extreme cases (free motion and rigid constraint). This approach provides realistic haptic sensations, such as free-to-touch and move-and-collide. We describe a first prototype and illustrate its use with several demonstrations. The user encounters the haptic device exactly when his or her hand reaches a virtual object in the display. Although this prototype has some remaining technical problems to be solved, it serves well to show the validity of the proposed approach.

82 citations


Proceedings Article•DOI•
Hideo Saito1, S. Baba1, M. Kimura1, Sundar Vedula1, Takeo Kanade1 •
04 Oct 1999
TL;DR: An "appearance based" virtual view generation method for temporally-varying events taken by multiple cameras of the "3D Room", developed by the group and presented for demonstrating the performance of the virtual view image generation in the 3D Room.
Abstract: We present an "appearance based" virtual view generation method for temporally-varying events taken by multiple cameras of the "3D Room", developed by our group. With this method we can generate images from any virtual view point between two selected real views. The virtual appearance view generation method is based on simple interpolation between two selected views. The correspondence between the views are automatically generated from the multiple images by use of the volumetric model shape reconstruction framework. Since the correspondences are obtained by the recovered volumetric model, even occluded regions in the views can be correctly interpolated in the virtual view images. The virtual view image sequences are presented for demonstrating the performance of the virtual view image generation in the 3D Room.


Book Chapter•DOI•
21 Sep 1999
TL;DR: A Geometric Equivalence Relationship is derived with which covariances under different parametrizations and gauges can be compared, based on their true geometric uncertainty, and it is shown that the uncertainty of gauge invariants exactly captures the geometric uncertainty of the solution, and hence provides useful measures for evaluating the uncertaintyof the solution.
Abstract: The parameters estimated by Structure from Motion (SFM) contain inherent indeterminacies which we call gauge freedoms. Under a perspective camera, shape and motion parameters are only recovered up to an unknown similarity transformation. In this paper we investigate how covariance-based uncertainty can be represented under these gauge freedoms. Past work on uncertainty modeling has implicitly imposed gauge constraints on the solution before considering covariance estimation. Here we examine the effect of selecting a particular gauge on the uncertainty of parameters. We show potentially dramatic effects of gauge choice on parameter uncertainties. However the inherent geometric uncertainty remains the same irrespective of gauge choice. We derive a Geometric Equivalence Relationship with which covariances under different parametrizations and gauges can be compared, based on their true geometric uncertainty. We show that the uncertainty of gauge invariants exactly captures the geometric uncertainty of the solution, and hence provides useful measures for evaluating the uncertainty of the solution. Finally we propose a fast method for covariance estimation and show its correctness using the Geometric Equivalence Relationship.
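A small numerical illustration of the central observation, that parameter covariance depends on the gauge chosen while a gauge invariant's uncertainty does not. The 2-D point set, noise model, and gauge-fixing rules below are synthetic assumptions, not the paper's SFM estimator; the distance ratio is a similarity invariant, so its variance is the same whichever gauge is imposed:

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.3, 0.8], [0.9, 0.6]])  # "true" points

def fix_gauge(Q, i, j):
    """Fix translation and scale by pinning point i to the origin and
    point j to unit distance (rotation gauge left free for brevity)."""
    Q = Q - Q[i]
    return Q / np.linalg.norm(Q[j])

coords_g1, coords_g2, invariant = [], [], []
for _ in range(2000):
    Q = P + 0.01 * rng.normal(size=P.shape)       # noisy "reconstruction"
    coords_g1.append(fix_gauge(Q, 0, 1).ravel())  # gauge choice 1
    coords_g2.append(fix_gauge(Q, 2, 3).ravel())  # gauge choice 2
    # Ratio of two inter-point distances: unchanged by any similarity gauge.
    invariant.append(np.linalg.norm(Q[0] - Q[2]) / np.linalg.norm(Q[1] - Q[3]))

print("coord covariance trace, gauge 1:", np.cov(np.array(coords_g1).T).trace())
print("coord covariance trace, gauge 2:", np.cov(np.array(coords_g2).T).trace())
print("variance of distance ratio (gauge-free):", np.var(invariant))
```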

Journal Article•DOI•
01 Feb 1999
TL;DR: A sorting image computational sensor, a VLSI chip which senses an image and sorts all pixels by their intensities, is presented; the global cumulative histogram is used internally on-chip in a top-down fashion to adapt the values of individual pixels so as to reflect the index of the incoming light, thus computing an "image of indices".
Abstract: Presents a new intensity-to-time processing paradigm suitable for very large scale integration (VLSI) computational sensor implementation of global operations over sensed images. Global image quantities usually describe images with fewer data. When computed at the point of sensing, global quantities result in low-latency performance due to the reduced data transfer requirements between an image sensor and a processor. The global quantities also help global top-down adaptation: the quantities are continuously computed on-chip, and are readily available to sensing for adaptation. As an example, we have developed a sorting image computational sensor, a VLSI chip which senses an image and sorts all pixels by their intensities. The first sorting sensor prototype is a 21 × 26 array of cells. It receives an image optically, senses it, and computes the image's cumulative histogram, a global quantity which can be quickly routed off chip via one pin. In addition, the global cumulative histogram is used internally on-chip in a top-down fashion to adapt the values of individual pixels so as to reflect the index of the incoming light, thus computing an "image of indices". The image of indices never saturates and has a uniform histogram.
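The chip's two outputs, the cumulative histogram and the "image of indices", are easy to reproduce in software; the sketch below computes the equivalent quantities with NumPy (the chip itself does this with intensity-to-time conversion, so this is only the reference computation on made-up data):

```python
import numpy as np

rng = np.random.default_rng(3)
img = rng.integers(0, 256, size=(21, 26))          # stand-in for the 21x26 sensor

# Cumulative histogram: the global quantity the chip streams off one pin.
hist = np.bincount(img.ravel(), minlength=256)
cum_hist = np.cumsum(hist)

# "Image of indices": each pixel replaced by its rank among all pixels.
ranks = np.empty(img.size, dtype=int)
ranks[np.argsort(img, axis=None, kind="stable")] = np.arange(img.size)
indices = ranks.reshape(img.shape)

# Every rank occurs exactly once, i.e. the index image has a flat histogram
# and never saturates.
print(np.unique(indices).size == img.size)
```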

Book Chapter•DOI•
19 Sep 1999
TL;DR: This work characterizes such anatomical variations to achieve accurate registration between 3-D images of human anatomies and shows how innate differences in the appearance and location of anatomical structures between individuals make accurate registration difficult.
Abstract: Registration between 3-D images of human anatomies enables cross-subject diagnosis. However, innate differences in the appearance and location of anatomical structures between individuals make accurate registration difficult. We characterize such anatomical variations to achieve accurate registration.

01 Jan 1999
TL;DR: This thesis focuses on characterizing non-pathological variations in human brain anatomy and applying such knowledge to achieve accurate 3D deformable registration, reducing the overall error on 40 test cases by 34%.
Abstract: Registering medical images of different individuals is difficult due to inherent anatomical variabilities and possible pathologies. This thesis focuses on characterizing non-pathological variations in human brain anatomy, and applying such knowledge to achieve accurate 3D deformable registration. Inherent anatomical variations are automatically extracted by deformably registering training data with an expert-segmented 3-D image, a digital brain atlas. Statistical properties of the density and geometric variations in brain anatomy are measured and encoded into the atlas to build a statistical atlas. These statistics can function as prior knowledge to guide the automatic registration process. Compared to an algorithm with no knowledge guidance, registration using the statistical atlas reduces the overall error on 40 test cases by 34%. Automatic registration between the atlas and a subject’s data adapts the expert segmentation for the subject, thus reducing the months-long manual segmentation process to minutes. Accurate and efficient segmentation of medical images enables quantitative study of anatomical differences between populations, as well as detection of abnormal variations indicative of pathologies.

Journal Article•DOI•
TL;DR: A system that automatically segments and classifies features in brain MRI volumes using an atlas, a hand-segmented and classified MRI of a normal brain, which is warped in 3-D using a hierarchical deformable matching algorithm until it closely matches the subject.

01 Jan 1999
TL;DR: In this paper, a cooperative, multi-sensor video surveillance system that provides continuous coverage over large battlefield areas is presented; the authors have begun a joint, integrated feasibility demonstration in the area of Video Surveillance and Monitoring (VSAM).
Abstract: Carnegie Mellon University (CMU) and the David Sarnoff Research Center (Sarnoff) have begun a joint, integrated feasibility demonstration in the area of Video Surveillance and Monitoring (VSAM). The objective is to develop a cooperative, multi-sensor video surveillance system that provides continuous coverage over large battlefield areas. Image Understanding (IU) technologies will be developed to: 1) coordinate multiple sensors to seamlessly track moving targets over an extended area, 2) actively control sensor and platform parameters to track multiple moving targets, 3) integrate multisensor output with collateral data to maintain an evolving, scene-level representation of all targets and platforms, and 4) monitor the scene for unusual "trigger" events and activities. These technologies will be integrated into an experimental testbed to support evaluation, data collection, and demonstration of other VSAM technologies developed within the DARPA IU community.

Patent•
23 Apr 1999
TL;DR: In this paper, a CCD camera is used to produce video image data involving a license plate obtained by photographing a front and rear portion of a motor vehicle, and a literal recognition device recognizes letters from the literal image (571) of the literal positional region obtained from the literal region extracting device.
Abstract: In a license plate information reader device (A) for motor vehicles, a CCD camera (1) is provided to produce video image data (11) involving a license plate obtained by photographing a front and rear portion of a motor vehicle. An A/D converter (3) produces a digital multivalue image data (31) by A/D converting the video image data (11). A license plate extracting device (4) is provided to produce a digital multivalue image data (41) corresponding to an area in which the license plate occupies. A literal region extracting device (5) extracts a literal positional region of a letter sequence of the license plate based on the image obtained from the license plate extracting device (4). A literal recognition device (6) is provided to recognize a letter from a literal image (571) of the literal positional region obtained from the literal region extracting device (5). An image emphasis device is provided to emphasize the literal image (571) of the literal positional region by replacing a part of the literal region extracting device (5) with a filter net which serves as a neural network.
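The patent describes a pipeline of plate-region extraction, character-region extraction, and recognition. As a loose sketch of the middle stages only, the snippet below binarizes a cropped plate image and keeps character-sized connected regions with OpenCV; the size limits are arbitrary guesses and nothing here reproduces the patented circuitry or its neural-network filter:

```python
import cv2

def candidate_character_regions(bgr_plate):
    """Rough sketch: binarize a cropped plate image and return bounding boxes
    of character-sized connected regions (size limits are arbitrary guesses)."""
    gray = cv2.cvtColor(bgr_plate, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # OpenCV 4.x return signature assumed (contours, hierarchy).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if 10 < h < 80 and 5 < w < 60:        # plausible character sizes
            boxes.append((x, y, w, h))
    return sorted(boxes)                       # left-to-right reading order
```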

Proceedings Article•DOI•
01 Dec 1999
TL;DR: The concept of projective 3D voxel, which makes it possible to handle 3D geometric data without complete 3D geometry information, is described.
Abstract: In this paper, we propose an approach for constructing a projective 3D voxel space based on the epipolar geometry obtained with weak calibration. In this voxel space, the epipolar lines follow the coordinate axes. In the field of computer vision, it is common to reconstruct 3D geometry data based on camera calibration data and disparities of matching points in each image. When the goal of the system is to generate images from another point of view, complete 3D geometry reconstruction is not necessarily required. However, detecting consistent matching points in several pairs of images without complete 3D geometry information is difficult. This paper describes the concept of the projective 3D voxel, which makes it possible to handle 3D geometric data without complete 3D geometry information.

Journal Article•DOI•
TL;DR: Experiments using real image sequences taken by a hand-held camcorder show that the proposed automatic line tracking method is robust against line extraction problems, closely-spaced lines, and large motion.
Abstract: We propose an automatic line tracking method which can deal with broken or closely-spaced line segments more accurately than previous methods over an image sequence. The method uses both grey scale information of the original images and geometric attributes of line segments. By using our hierarchical optical flow technique, we can get a good prediction of line segments in a consecutive frame even with large motion. The line attribute of direction, not the orientation, discriminates closely-spaced line segments because when lines are crowded or closely-spaced, their directions are opposite in many cases, even though their orientations are the same. A proposed new matching cost function enables us to deal with multiple collinear line segment matching easily instead of using one-to-one matching. Experiments using real image sequences taken by a hand-held camcorder show that our method is robust against line extraction problems, closely-spaced lines, and large motion.
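The direction/orientation distinction drawn above is easy to make concrete: orientation is an undirected angle in [0, π), direction a directed angle in [0, 2π), so two nearly coincident segments traversed in opposite senses share an orientation but differ in direction by about π. A small sketch (the endpoints are made up):

```python
import numpy as np

def orientation(p0, p1):
    """Undirected angle of a segment, in [0, pi)."""
    d = np.subtract(p1, p0)
    return np.arctan2(d[1], d[0]) % np.pi

def direction(p0, p1):
    """Directed angle of a segment, in [0, 2*pi)."""
    d = np.subtract(p1, p0)
    return np.arctan2(d[1], d[0]) % (2 * np.pi)

# Two nearly coincident segments traversed in opposite senses
# (e.g. the two sides of a thin dark bar):
a0, a1 = (0.0, 0.0), (10.0, 0.1)
b0, b1 = (10.0, 1.1), (0.0, 1.0)
print(orientation(a0, a1), orientation(b0, b1))  # nearly equal
print(direction(a0, a1), direction(b0, b1))      # differ by about pi
```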

01 Jan 1999
TL;DR: A computer vision system that automatically recognizes facial action units (AUs) or AU combinations using Hidden Markov Models (HMMs) and uses principal component analysis (PCA) to compress the data.
Abstract: We developed a computer vision system that automatically recognizes facial action units (AUs) or AU combinations using Hidden Markov Models (HMMs). AUs are defined as visually discriminable muscle movements. The facial expressions are recognized in digitized image sequences of arbitrary length. In this paper, we use two approaches to extract the expression information: (1) facial feature point tracking, which is sensitive to subtle feature motion, in the mouth region, and (2) pixel-wise flow tracking, which includes more motion information, in the forehead and brow regions. In the latter approach, we use principal component analysis (PCA) to compress the data. We accurately recognize 93% of the lower face expressions and 91% of the upper face expressions.
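A hedged sketch of the classification scheme described above, one HMM per action unit with the highest-likelihood model winning, using PCA-compressed features and the hmmlearn package as a stand-in for whatever HMM implementation the authors used; the data, dimensions, and class separations below are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA
from hmmlearn import hmm

rng = np.random.default_rng(4)

def make_sequences(offset, n_seq=20, length=15, dim=40):
    """Synthetic stand-in for per-frame flow features of one action unit."""
    return [offset + rng.normal(scale=0.3, size=(length, dim))
            for _ in range(n_seq)]

train = {"AU1": make_sequences(0.0), "AU4": make_sequences(0.5)}

# PCA compresses the per-frame features (as in the flow-based branch above).
pca = PCA(n_components=5).fit(
    np.vstack([s for seqs in train.values() for s in seqs]))

models = {}
for au, seqs in train.items():
    X = np.vstack([pca.transform(s) for s in seqs])
    lengths = [len(s) for s in seqs]
    models[au] = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                                 n_iter=50).fit(X, lengths)

test = pca.transform(make_sequences(0.5, n_seq=1)[0])
print(max(models, key=lambda au: models[au].score(test)))  # should prefer "AU4"
```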

01 Jan 1999
TL;DR: This work presents progress toward a robust system to detect and track facial features, including both permanent and transient facial features, in a nearly frontal image sequence by combining color, shape, edge, and motion information.
Abstract: Accurate and robust tracking of facial features must cope with the large variation in appearance across subjects and the combination of rigid and non-rigid motion. We present work toward a robust system to detect and track facial features, including both permanent (e.g. mouth, eye, and brow) and transient (e.g. furrows and wrinkles) features, in a nearly frontal image sequence. Multi-state facial component models are proposed for tracking and modeling different facial features. Based on these multi-state models, and without any artificial enhancement, we detect and track the facial features, including mouth, eyes, brows, cheeks, and their related wrinkles and facial furrows, by combining color, shape, edge and motion information. Given the initial location of the facial features in the first frame, the facial features can be detected or tracked automatically in the remaining frames. Our system is tested on 500 image sequences from the Pittsburgh-Carnegie Mellon University (Pitt-CMU) Facial Expression Action Unit (AU) Coded Database, which includes image sequences from children and adults of European, African, and Asian ancestry. Accurate tracking results are obtained in 98% of image sequences.

01 Jan 1999
TL;DR: A robust homography algorithm is described which incorporates contrast/brightness adjustment and robust estimation into image registration, and the Levenberg-Marquardt method is applied to generate a dense projective depth map.
Abstract: We propose a framework to recover projective depth based on image homography and discuss its application to scene analysis of video sequences. We describe a robust homography algorithm which incorporates contrast/brightness adjustment and robust estimation into image registration. We present a camera motion solver to obtain the ego-motion and the real/virtual plane position from homography. We then apply the Levenberg-Marquardt method to generate a dense projective depth map. We also discuss temporal integration over video sequences. Finally we present the results of applying the homography-based video analysis to motion detection. 1 Introduction: Temporal information redundancy of video sequences allows us to use efficient, incremental methods which perform temporal integration of information for gradual refinement. Approaches handling 3D scene analysis of video sequences with camera motion can be classified into two categories: algorithms which use 2D transformation or model fitting, and algorithms which use 3D geometry analysis. Video sequences of our interest are taken from a moving airborne platform where the ego-motion is complex and the scene is relatively distant but not necessarily flat;
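A sketch of the first two stages on point matches, with OpenCV's RANSAC-based findHomography standing in for the paper's robust estimator (which additionally adjusts contrast/brightness during registration). The residual after warping is the parallax cue that the paper goes on to turn into projective depth:

```python
import cv2
import numpy as np

def dominant_plane_residuals(pts_src, pts_dst):
    """Fit a homography to point matches between two frames and return the
    per-point residual after warping; large residuals indicate parallax
    (off-plane structure) or independent motion.

    pts_src, pts_dst : (N, 2) float32 arrays of matched points.
    """
    H, inlier_mask = cv2.findHomography(pts_src, pts_dst, cv2.RANSAC, 3.0)
    warped = cv2.perspectiveTransform(pts_src.reshape(-1, 1, 2), H).reshape(-1, 2)
    residuals = np.linalg.norm(warped - pts_dst, axis=1)
    return H, residuals, inlier_mask.ravel().astype(bool)
```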

01 Jan 1999
TL;DR: This work represents anatomical variations in the form of statistical models and embeds these statistics into a 3-D digital brain atlas; the models are built by registering a training set of brain MRI volumes with the atlas.
Abstract: Registration between 3-D images of human anatomies enables cross-subject diagnosis. However, innate differences in the appearance and location of anatomical structures between individuals make accurate registration difficult. We characterize such anatomical variations to achieve accurate registration. We represent anatomical variations in the form of statistical models, and embed these statistics into a 3-D digital brain atlas which we use as a reference. These models are built by registering a training set of brain MRI volumes with the atlas. This associates each voxel in the atlas with multi-dimensional distributions of variations in intensity and geometry of the training set. We evaluate statistical properties of these distributions to build a statistical atlas. When we register the statistical atlas with a particular subject, the embedded statistics function as prior knowledge to guide the deformation process. This allows the deformation to tolerate variations between individuals while retaining discrimination between different structures. This method gives an overall voxel mis-classification rate of 2.9% on 40 test cases; this is a 34% error reduction over the performance of our previous algorithm without using anatomical knowledge. Besides achieving accurate registration, statistical models of anatomical variations also enable quantitative study of anatomical differences between populations.
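The statistical atlas associates each atlas voxel with distributions measured from registered training volumes. A minimal sketch of that idea, restricted to intensity statistics and tiny synthetic stand-in volumes (the actual work also models geometric variations and uses the statistics inside a deformable registration):

```python
import numpy as np

# Suppose `training` holds K brain volumes already warped into atlas space;
# here they are tiny random stand-ins rather than real MRI scans.
rng = np.random.default_rng(5)
K, shape = 10, (4, 4, 4)
training = rng.normal(loc=100.0, scale=8.0, size=(K,) + shape)

# Per-voxel statistics of the training intensities.
mean = training.mean(axis=0)
std = training.std(axis=0) + 1e-6

def intensity_prior_penalty(subject):
    """Squared Mahalanobis-style distance of a subject's intensities from the
    atlas statistics; usable as (part of) a prior term during registration."""
    return ((subject - mean) / std) ** 2

subject = rng.normal(loc=100.0, scale=8.0, size=shape)
print(intensity_prior_penalty(subject).mean())
```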

Proceedings Article•DOI•
01 Jul 1999

Proceedings Article•DOI•
04 Oct 1999
TL;DR: PALM, a portable sensor-augmented vision system for large-scene modeling, solves the problem of recovering large structures in arbitrary scenes from video streams taken by a sensor-augmented camera through the use of multiple constraints derived from GPS measurements, camera orientation sensor readings, and image features.
Abstract: We propose PALM, a portable sensor-augmented vision system for large-scene modeling. The system solves the problem of recovering large structures in arbitrary scenes from video streams taken by a sensor-augmented camera. Central to the solution method is the use of multiple constraints derived from GPS measurements, camera orientation sensor readings, and image features. The knowledge of camera orientation enhances computational efficiency by making a linear formulation of perspective ray constraints possible. The overall shape is constructed by merging smaller shape segments. Shape merging errors are minimized using the concept of shape hierarchy, which is realized through a "landmarking" technique. The features of the system include its use of a small number of images and feature points, its portability, and its low-cost interface for synchronizing sensor measurements with the video stream. Example reconstructions of a football stadium and two large buildings are presented and these results are compared with the ground truth.
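The point about linearity can be illustrated simply: with camera orientation known from sensors and position from GPS, each observation constrains a 3-D point to lie on a known ray, and intersecting rays is a linear least-squares problem. A small sketch under those assumptions (intrinsics, rotations, and pixels are placeholders; this is not the PALM pipeline itself):

```python
import numpy as np

def ray_direction(R, K, pixel):
    """World-frame viewing ray for a pixel, given world-to-camera rotation R
    and intrinsics K, both assumed known from the augmented sensors."""
    x = np.array([pixel[0], pixel[1], 1.0])
    return R.T @ np.linalg.inv(K) @ x

def triangulate_from_rays(centers, directions):
    """Least-squares 3-D point closest to a set of rays X = C_j + t * d_j.
    Each ray contributes the linear constraint (I - d d^T)(X - C) = 0;
    at least two non-parallel rays are needed."""
    A = np.zeros((3, 3)); b = np.zeros(3)
    for C, d in zip(centers, directions):
        d = d / np.linalg.norm(d)
        M = np.eye(3) - np.outer(d, d)
        A += M
        b += M @ C
    return np.linalg.solve(A, b)
```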

Proceedings Article•DOI•
28 May 1999
TL;DR: This work describes a method to extract quantitative information from optically-sectioned DIC microscope images and attempts to reconstruct the three-dimensional structure and refractive index distribution throughout the specimen.
Abstract: Summary form only given. Differential interference contrast (DIC) microscopy, a method pioneered by Georges Nomarski, is widely used to study live biological specimens. However, to date, biologists only qualitatively interpret DIC microscope images. In this work, we describe a method to extract quantitative information from optically-sectioned DIC microscope images. Specifically, given a set of images of a specimen, we attempt to reconstruct the three-dimensional structure and refractive index distribution throughout the specimen.

Proceedings Article•DOI•
13 Oct 1999
TL;DR: A computational model is developed and verified for the image formation process of differential interference contrast microscopy and it is planned to use this model to reconstruct the properties of unknown specimens.
Abstract: Biologists often use differential interference contrast (DIC) microscopy to study live cells. However, they are limited to qualitative observations due to the inherent nonlinear relation between the object properties and image intensity. As a first step towards quantitatively measuring optical properties of objects from DIC images, we develop and verify a computational model for the image formation process. Next, we plan to use this model to reconstruct the properties of unknown specimens.
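A common qualitative simplification of DIC contrast is a bias plus the directional derivative of the specimen's optical path length along the shear direction; the sketch below renders a crude "DIC-like" image from a synthetic phase bump under that assumption. The paper develops and validates a far more complete physical model, so this is only an assumed first-order picture:

```python
import numpy as np

def dic_like_image(opl, shear_dir=(1.0, 1.0), bias=0.3):
    """Render a crude DIC-style image from an optical path length map `opl`
    using the simplification: intensity ~ bias + directional derivative of
    OPL along the shear direction."""
    gy, gx = np.gradient(opl)
    sx, sy = np.asarray(shear_dir) / np.linalg.norm(shear_dir)
    return bias + sx * gx + sy * gy

# Synthetic "cell": a smooth bump in optical path length.
y, x = np.mgrid[-1:1:128j, -1:1:128j]
opl = np.exp(-(x**2 + y**2) / 0.2)
img = dic_like_image(opl)
print(img.min(), img.max())   # shadow-cast relief appearance around the bump
```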