
Showing papers by "Takeo Kanade published in 1997"


Journal ArticleDOI
TL;DR: In this paper, a new visual medium, Virtualized Reality, immerses viewers in a virtual reconstruction of real-world events, which consists of real images and depth information computed from these images.
Abstract: A new visual medium, Virtualized Reality, immerses viewers in a virtual reconstruction of real-world events. The Virtualized Reality world model consists of real images and depth information computed from these images. Stereoscopic reconstructions provide a sense of complete immersion, and users can select their own viewpoints at view time, independent of the actual camera positions used to capture the event.

677 citations


Journal ArticleDOI
TL;DR: This work has shown that the paraperspective factorization method can be applied to a much wider range of motion scenarios, including image sequences containing motion toward the camera and aerial image sequences of terrain taken from a low-altitude airplane.
Abstract: The factorization method, first developed by Tomasi and Kanade (1992), recovers both the shape of an object and its motion from a sequence of images, using many images and tracking many feature points to obtain highly redundant feature position information. The method robustly processes the feature trajectory information using singular value decomposition (SVD), taking advantage of the linear algebraic properties of orthographic projection. However, an orthographic formulation limits the range of motions the method can accommodate. Paraperspective projection, first introduced by Ohta et al. (1981), is a projection model that closely approximates perspective projection by modeling several effects not modeled under orthographic projection, while retaining linear algebraic properties. Our paraperspective factorization method can be applied to a much wider range of motion scenarios, including image sequences containing motion toward the camera and aerial image sequences of terrain taken from a low-altitude airplane.

511 citations
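The rank-3 principle behind the factorization family can be sketched numerically. The following is a minimal illustration under an orthographic camera, not the paraperspective formulation itself; all dimensions and variable names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scene: P feature points tracked over F frames under an
# orthographic camera (a toy setup, not real tracked data).
P, F = 20, 10
shape = rng.normal(size=(3, P))        # 3D points as columns
motion = rng.normal(size=(2 * F, 3))   # stacked 2-row camera matrices
W = motion @ shape                     # 2F x P measurement matrix

# Register each row about its mean (removes translation), then factorize.
W_reg = W - W.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)

# Noise-free orthographic measurements have rank 3 after registration.
rank = int(np.sum(s > 1e-8 * s[0]))
M_hat = U[:, :3] * s[:3]               # motion, up to a 3x3 ambiguity
S_hat = Vt[:3]                         # shape, up to the inverse ambiguity
print(rank)                            # -> 3
```

The SVD both reveals the rank-3 structure and splits the registered measurements into a motion factor and a shape factor, which is the "highly redundant feature position information" the method exploits.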


Proceedings ArticleDOI
17 Jun 1997
TL;DR: The goal of this work is to show the utility of integrating language and image understanding techniques for video skimming by extraction of significant information, such as specific objects, audio keywords and relevant video structure.
Abstract: Digital video is rapidly becoming important for education, entertainment, and a host of multimedia applications. With the size of the video collections growing to thousands of hours, technology is needed to effectively browse segments in a short time without losing the content of the video. We propose a method to extract the significant audio and video information and create a "skim" video which represents a very short synopsis of the original. The goal of this work is to show the utility of integrating language and image understanding techniques for video skimming by extraction of significant information, such as specific objects, audio keywords and relevant video structure. The resulting skim video is much shorter, where compaction is as high as 20:1, and yet retains the essential content of the original segment.

390 citations


Journal ArticleDOI
TL;DR: A CCD-based range-finding sensor which uses the time-of-flight method for range measurement and exploits two charge packets for light integration and detects the delay of the received light pulse relative to the transmitted light pulse.
Abstract: Integration-time-based, time-domain computation provides an area-efficient way to process image information by directly handling photo-created charge during photo-sensing. We have fabricated and tested a CCD-based range-finding sensor which uses the time-of-flight method for range measurement. The sensor exploits two charge packets for light integration and detects the delay of the received light pulse relative to the transmitted light pulse. It has detected a 10 cm distance difference at a range of 150 cm against a dark background.

314 citations
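The two-charge-packet principle can be sketched with a toy model: the returned pulse straddles two integration windows, and the ratio of the two integrated charges encodes the delay. The pulse width and charge values below are invented for illustration and do not come from the paper:

```python
C = 299_792_458.0  # speed of light, m/s

def tof_range(q1, q2, pulse_width_s):
    """Range from two integrated charge packets (toy model).

    q1 integrates the part of the returned pulse arriving in the first
    window, q2 the part arriving in the second; their ratio encodes the
    delay of the received pulse relative to the transmitted one.
    """
    delay = pulse_width_s * q2 / (q1 + q2)
    return C * delay / 2.0  # halve for the round trip

# A pulse delayed by one third of a 30 ns window: roughly 1.5 m of range.
print(tof_range(q1=2.0, q2=1.0, pulse_width_s=30e-9))
```

Because only a charge ratio is needed, the computation happens during photo-sensing itself, which is the area-efficient, time-domain idea the abstract describes.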


Journal ArticleDOI
TL;DR: A sequential factorization method for recovering the three-dimensional shape of an object and the motion of the camera from a sequence of images, using tracked features, by regarding the feature positions as a vector time series.
Abstract: We present a sequential factorization method for recovering the three-dimensional shape of an object and the motion of the camera from a sequence of images, using tracked features. The factorization method originally proposed by Tomasi and Kanade (1992) produces robust and accurate results incorporating the singular value decomposition. However, it is still difficult to apply the method to real-time applications, since it is based on a batch-type operation and the cost of the singular value decomposition is large. We develop the factorization method into a sequential method by regarding the feature positions as a vector time series. The new method produces estimates of shape and motion at each frame. The singular value decomposition is replaced with an updating computation of only three dominant eigenvectors, which can be performed in O(P^2) time, while the complete singular value decomposition requires O(FP^2) operations for an F x P matrix. Also, the method is able to handle infinite sequences, since it does not store any increasingly large matrices. Experiments using synthetic and real images illustrate that the method has nearly the same accuracy and robustness as the original method.

225 citations
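The sequential flavor, maintaining only three dominant eigenvectors rather than recomputing a full SVD, can be sketched with a generic subspace-tracking recursion. This is a simple Oja-style update with QR re-orthonormalization, an assumed stand-in rather than the paper's exact algorithm:

```python
import numpy as np

def update_basis(Q, z, step=0.1):
    """One streaming update of an orthonormal basis Q (n x 3) toward the
    dominant eigenvectors of the covariance of observations z (n,)."""
    Q = Q + step * np.outer(z, z @ Q)  # rank-1 push toward z's direction
    Q, _ = np.linalg.qr(Q)             # re-orthonormalize: cost O(n * 3^2)
    return Q

# Observations drawn from a fixed 3-dimensional subspace of R^10.
rng = np.random.default_rng(1)
B, _ = np.linalg.qr(rng.normal(size=(10, 3)))  # true subspace basis
Q, _ = np.linalg.qr(rng.normal(size=(10, 3)))  # initial guess
for _ in range(500):
    Q = update_basis(Q, B @ rng.normal(size=3))

residual = np.linalg.norm(B - Q @ (Q.T @ B))   # near 0 once the spans agree
```

Each update costs work proportional to the basis size rather than the whole history, which is what makes per-frame estimates and infinite sequences feasible.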


Patent
18 Sep 1997
TL;DR: In this article, a camera is used to generate digital image signals representing an image of one or more natural or artificial fiducials on a patient positioned on treatment or diagnosis equipment, and a processor applies multiple levels of filtering at multiple level of resolution to repetitively determine successive fiducial positions.
Abstract: A camera (35) generates digital image signals representing an image of one or more natural or artificial fiducials (39) on a patient positioned on treatment or diagnosis equipment. A processor applies multiple levels of filtering at multiple levels of resolution to repetitively determine successive fiducial positions. A warning signal is generated if movement exceeds certain limits but is still acceptable for treatment. Unacceptable displacement results in termination of the treatment beam (15). Tracking templates can be generated interactively from a display of the digital image signals or through automatic selection of an image having the median correlation to an initial template. A gating signal synchronized to patient breathing can be extracted from the digital image signals for controlling the radiation beam generator.

205 citations


Proceedings ArticleDOI
01 Oct 1997
TL;DR: A detailed comparison of existing view synthesis techniques with the authors' own approach is included, which has the added benefit of eliminating the need to hand-edit the range images to correct errors made in stereo, a drawback of previous techniques.
Abstract: Virtualized reality is a modeling technique that constructs full 3D virtual representations of dynamic events from multiple video streams. Image-based stereo is used to compute a range image corresponding to each intensity image in each video stream. Each range and intensity image pair encodes the scene structure and appearance of the scene visible to the camera at that moment, and is therefore called a visible surface model (VSM). A single time instant of the dynamic event can be modeled as a collection of VSMs from different viewpoints, and the full event can be modeled as a sequence of static scenes, the 3D equivalent of video. Alternatively, the collection of VSMs at a single time can be fused into a global 3D surface model, thus creating a traditional virtual representation out of real world events. Global modeling has the added benefit of eliminating the need to hand-edit the range images to correct errors made in stereo, a drawback of previous techniques. Like image-based rendering models, these virtual representations can be used to synthesize nearly any view of the virtualized event. For this reason, the paper includes a detailed comparison of existing view synthesis techniques with the authors' own approach. In the virtualized representations, however, scene structure is explicitly represented and therefore easily manipulated, for example by adding virtual objects to (or removing virtualized objects from) the model without interfering with the real event. Virtualized reality, then, is a platform not only for image-based rendering but also for 3D scene manipulation.

177 citations


Book ChapterDOI
19 Mar 1997
TL;DR: A system for simulating arthroscopic knee surgery that is based on volumetric object models derived from 3D Magnetic Resonance Imaging is presented and feedback is provided to the user via real-time volume rendering and force feedback for haptic exploration.
Abstract: A system for simulating arthroscopic knee surgery that is based on volumetric object models derived from 3D Magnetic Resonance Imaging is presented. Feedback is provided to the user via real-time volume rendering and force feedback for haptic exploration. The system is the result of a unique collaboration between an industrial research laboratory, two major universities, and a leading research hospital. In this paper, components of the system are detailed and the current state of the integrated system is presented. Issues related to future research and plans for expanding the current system are discussed.

175 citations


Proceedings ArticleDOI
17 Jun 1997
TL;DR: A system that associates faces and names in videos, called Name-It, is developed, which is given news videos as a knowledge source, then automatically extracts face and name association as content information.
Abstract: This paper proposes a novel approach to extract meaningful content information from video by collaborative integration of image understanding and natural language processing. As an actual example, we developed a system that associates faces and names in videos, called Name-It, which is given news videos as a knowledge source, then automatically extracts face and name association as content information. The system can infer the name of a given unknown face image, or guess faces which are likely to have the name given to the system. This paper explains the method with several successful matching results which reveal effectiveness in integrating heterogeneous techniques as well as the importance of real content information extraction from video, especially face-name association.

164 citations


Patent
29 Jul 1997
TL;DR: In this paper, a patient is automatically accurately positioned relative to a fixed reference of a treatment/diagnostic device by an optical system which operates a patient positioning assembly to bring fiducials or skin markers on the patient into coincidence with impigement points of laser beams projected in a fixed pattern relative to the device.
Abstract: A patient is automatically and accurately positioned relative to a fixed reference of a treatment/diagnostic device by an optical system which operates a patient positioning assembly to bring fiducials or skin markers on the patient into coincidence with impingement points of laser beams projected in a fixed pattern relative to the device. Cameras record images of the fiducials and laser impingement points from which alignment error and velocity error in pixel space are determined. The velocity error in pixel space is converted to a velocity error in room space by the inverse of an Image Jacobian. The Image Jacobian is initially derived using rough values for system parameters and is continuously updated and refined using the calculated errors in pixel space derived from the camera images and errors in room space derived from position encoders on the treatment/diagnostic device.

159 citations


Journal ArticleDOI
TL;DR: A linear algorithm for recovering 3D affine shape and motion from line correspondences with uncalibrated affine cameras with the introduction of a one-dimensional projective camera is presented.
Abstract: This paper presents a linear algorithm for recovering 3D affine shape and motion from line correspondences with uncalibrated affine cameras. The algorithm requires a minimum of seven line correspondences over three views. The key idea is the introduction of a one-dimensional projective camera. This converts 3D affine reconstruction of "line directions" into 2D projective reconstruction of "points". In addition, a line-based factorization method is also proposed to handle redundant views. Experimental results both on simulated and real image sequences validate the robustness and the accuracy of the algorithm.

Proceedings ArticleDOI
01 Nov 1997
TL;DR: A new method for making correspondences between image clues detected by image analysis and language clues detected by natural language analysis is proposed and applied to closed-captioned CNN Headline News.
Abstract: The Spotting by Association method for video analysis is a novel method to detect video segments with typical semantics. Video data contains various kinds of information through continuous images, natural language, and sound. For videos to be stored and retrieved in a Digital Library, it is essential to segment the video data into meaningful pieces. To detect meaningful segments, we need to identify the segment in each modality (video, language, and sound) that corresponds to the same story. For this purpose, we propose a new method for making correspondences between image clues detected by image analysis and language clues detected by natural language analysis. As a result, relevant video segments with sufficient information from every modality are obtained. We applied our method to closed-captioned CNN Headline News. Video segments with important events, such as a public speech, meeting, or visit, are detected fairly well.

Patent
28 Oct 1997
TL;DR: In this paper, the digitized images are first coarse aligned using a transform generated from seed points selected interactively from the two images or through detection and identification of x-ray opaque fiducials placed on the patient.
Abstract: X-ray images (53) such as radiotherapy portal images and simulation images (45) are matched by apparatus (27) which digitizes the images and automatically processes the digitized signals to generate matched digitized signals which can be displayed for comparison. The digitized images are first coarse aligned (33) using a transform generated from seed points selected interactively (110) from the two images or through detection and identification (120) of x-ray opaque fiducials placed on the patient. A fine alignment (35) is then performed by first selecting intersecting regions of the two images (150) and enhancing those regions (154). Secondly, an updated transform is generated (160) using robust motion flow in these regions at successive ascending levels of resolution. The updated transform is then used to align the images (167) which are displayed for comparison. The updated transform can also be used to control the radiotherapy equipment (140).

Journal ArticleDOI
TL;DR: A fast pattern matching algorithm with a large set of templates, based on typical template matching speeded up by a dual decomposition (the Fourier transform and the Karhunen-Loeve transform) and appropriate for searching for an object with unknown distortion within a short period.
Abstract: We present a fast pattern matching algorithm with a large set of templates. The algorithm is based on typical template matching speeded up by a dual decomposition: the Fourier transform and the Karhunen-Loeve transform. The proposed algorithm is appropriate for searching for an object with unknown distortion within a short period. Patterns with different distortions differ slightly from each other and are highly correlated. The image vector subspace required for effective representation can be defined by a small number of eigenvectors derived by the Karhunen-Loeve transform. A vector subspace spanned by the eigenvectors is generated, and any image vector in the subspace is considered a pattern to be recognized. The pattern matching of objects with unknown distortion is formulated as the process of extracting a portion of the input image, finding the pattern most similar to the extracted portion in the subspace, computing the normalized correlation between them at each location in the input image, and finding the location with the best score. Searching for objects with unknown distortion requires vast computation. The formulation above makes it possible to decompose highly correlated reference images into eigenvectors, as well as to decompose images in the frequency domain, and to speed up the process significantly.
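The subspace side of this idea can be sketched with a minimal Karhunen-Loeve (PCA) illustration. It omits the Fourier-domain speedup entirely, and the pattern, the "distortions", and all the names below are invented for the example:

```python
import numpy as np

def build_subspace(templates, k=3):
    """Karhunen-Loeve basis from a stack of distorted templates (N x d)."""
    mean = templates.mean(axis=0)
    _, _, Vt = np.linalg.svd(templates - mean, full_matrices=False)
    return mean, Vt[:k]                 # k leading eigenvectors

def match_score(window, mean, basis):
    """Normalized correlation between a window and the nearest pattern
    in the template subspace."""
    recon = mean + basis.T @ (basis @ (window - mean))
    w = window - window.mean()
    r = recon - recon.mean()
    return float(w @ r / (np.linalg.norm(w) * np.linalg.norm(r)))

# Templates: one pattern under different "distortions" (here, scalings).
pattern = np.arange(16.0)
templates = np.stack([0.8 * pattern, pattern, 1.2 * pattern])
mean, basis = build_subspace(templates, k=2)

print(match_score(pattern, mean, basis))   # close to 1 for an in-class window
```

Projecting a candidate window onto a few eigenvectors and correlating with the reconstruction is far cheaper than correlating against every distorted template individually, which is where the speedup comes from.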



Book ChapterDOI
19 Mar 1997
TL;DR: The goals of the current HipNav system are to reduce dislocations following total hip replacement surgery due to acetabular malposition, determine and potentially increase the “safe” range of motion, and track in real-time the position of the pelvis and acetabulum during surgery.
Abstract: During the past year our group has been developing HipNav, a system which helps surgeons determine optimal, patient-specific acetabular implant placement and accurately achieve the desired implant placement during surgery. HipNav includes three components: a pre-operative planner, a range of motion simulator, and an intra-operative tracking and guidance system. The goals of the current HipNav system are to: 1) reduce dislocations following total hip replacement surgery due to acetabular malposition; 2) determine and potentially increase the “safe” range of motion; 3) reduce wear debris resulting from impingement of the implant's femoral neck with the acetabular rim; and 4) track in real-time the position of the pelvis and acetabulum during surgery.

Proceedings Article
23 Aug 1997
TL;DR: The proposed Name-It system, a system that associates faces and names in news videos, takes full advantage of advanced image and natural language processing and effectively extracts names by using lexical/grammatical analysis and knowledge of the news video topics structure.
Abstract: We have been developing Name-It, a system that associates faces and names in news videos. First, as the only knowledge source, the system is given news videos which include image sequences and transcripts obtained from audio tracks or closed caption texts. The system can then either infer the name of a given face and output the name candidates, or can locate the faces in news videos by a name. To accomplish this task, the system extracts faces from image sequences and names from transcripts, both of which might correspond to key persons in news topics. The proposed system takes full advantage of advanced image and natural language processing. The image processing contributes to the extraction of face sequences which provide rich information for face-name association. The processing also helps to select the best frontal view of a face in a face sequence, enhancing the face identification required for association. On the other hand, the natural language processing effectively extracts names by using lexical/grammatical analysis and knowledge of the news video topic structure. The success of our experiments demonstrates the benefits of the advanced image and natural language processing methods and their incorporation.

Book ChapterDOI
19 Mar 1997
TL;DR: Constraint analysis, constraint synthesis, and online accuracy estimation are described, demonstrating that registration accuracy can be significantly improved via application of these methods.
Abstract: Shape-based registration is a process for estimating the transformation between two shape representations of an object. It is used in many image-guided surgical systems to establish a transformation between pre- and intra-operative coordinate systems. This paper describes several tools which are useful for improving the accuracy resulting from shape-based registration: constraint analysis, constraint synthesis, and online accuracy estimation. Constraint analysis provides a scalar measure of sensitivity which is well correlated with registration accuracy. This measure can be used as a criterion function by constraint synthesis, an optimization process which generates configurations of registration data which maximize expected accuracy. Online accuracy estimation uses a conventional root-mean-squared error measure coupled with constraint analysis to estimate an upper bound on true registration error. This paper demonstrates that registration accuracy can be significantly improved via application of these methods.


01 Jan 1997
TL;DR: A tracking computational sensor, a VLSI implementation of sensory attention, that reliably tracks features of interest while suppressing other irrelevant features that may interfere with the task at hand.
Abstract: The need for robust, self-contained, low-latency vision systems is growing in applications such as high-speed visual servoing and vision-based human-computer interfaces. Conventional vision systems can hardly meet this need because 1) latency is incurred by data-transfer and computational bottlenecks, and 2) there is no top-down feedback to adapt sensor performance for improved robustness. In this paper we present a tracking computational sensor, a VLSI implementation of sensory attention. The tracking sensor focuses attention on a salient feature in its receptive field and maintains this attention in world coordinates. Using both low-latency massively parallel processing and top-down sensory adaptation, the sensor reliably tracks features of interest while it suppresses other irrelevant features that may interfere with the task at hand.

Proceedings ArticleDOI
20 Jun 1997
TL;DR: It is concluded that the core idea of passive articulated link mechanism could be extended to a multiple link mechanism suitable for neurosurgery.
Abstract: Summary form only given. A new concept of passive articulated link mechanism is proposed to assist surgeons in the strong magnetic field of open configuration MRI, especially for neurosurgery. The core idea has been experimentally validated with the device which gives variable viscosity to cylinder rod motion by controlling the valve opening attached to the cylinder. Then a multiple degrees of freedom link mechanism is designed and simulated under several restrictions and constraints mainly concerned with space factors of the open configuration MRI and its field of view. Through this design and sequential simulation, it is concluded that the core idea could be extended to a multiple link mechanism suitable for neurosurgery.

01 Jan 1997
TL;DR: The Spotting by Association method for video analysis is introduced, which is a novel method to detect video segments with typical semantics by making correspondences between image clues detected by image analysis and language clues created by natural language analysis.
Abstract: This paper introduces the Spotting by Association method for video analysis, which is a novel method to detect video segments with typical semantics. Video data contains various kinds of information by means of continuous images, natural language, and sound. For use in a Digital Library, it is essential to segment the video data into meaningful pieces. To detect meaningful segments, we should associate data from each modality, including video, language, and sound. For this purpose, we propose a new method for segment spotting by making correspondences between image clues detected by image analysis and language clues created by natural language analysis. As a result, relevant video segments with sufficient information in every modality are obtained. We applied our method to closed-captioned CNN Headline News. Video segments with important situations, such as a speech, meeting, or visit, are detected fairly well.


01 Jan 1997
TL;DR: The Carnegie Mellon University MURI project sponsored by ONR performs multi-disciplinary research in integrating vision algorithms with sensing technology for low-power, low-latency, compact adaptive vision systems.
Abstract: The Carnegie Mellon University MURI project sponsored by ONR performs multi-disciplinary research in integrating vision algorithms with sensing technology for low-power, low-latency, compact adaptive vision systems. These are crucial features for augmenting the human sensory system and enabling sensory-driven information delivery. The project spans four subareas ranging from low- to high-level vision: (1) smart filters, based on Acousto-Optic Tunable Filter (AOTF) technology; (2) computational sensor methodology, which integrates raw sensing and computation by means of VLSI technology; (3) neural-network-based saliency identification techniques for identifying the most useful information for extraction and display; and (4) visual learning methods for automatic signal-to-symbol mapping. Automated vision and sensing research has made great strides in the last 30 years, yet vision systems still lack attributes shared by most successful mass-market technologies.


Patent
18 Sep 1997
TL;DR: In this article, the authors present a set of features for rayonnement therapeutique based on signaux image numeriques and validate them with the respiration of the patient.
Abstract: Une camera (35) genere des signaux image numeriques representant une image d'un ou plusieurs reperes naturels ou artificiels (39) sur un patient place sur un appareil de traitement ou de diagnostic. Un processeur applique de multiples niveaux de filtrage a de multiples niveaux de resolution, de facon a determiner iterativement les positions de reperage successives. Un signal d'alarme est genere si les mouvements depassent certaines limites tout en restant acceptables pour le traitement. Des mouvements inacceptables arretent le rayonnement therapeutique (15). Des modeles de poursuite peuvent etre generes interactivement a partir de l'affichage des signaux image numeriques ou par selection automatique d'une image ayant la correlation mediane par rapport a un modele initial. Un signal de validation synchronise avec la respiration du patient peut etre extrait des signaux image numeriques afin de commander le generateur de faisceaux de rayons.