
Showing papers presented at "Workshop on Applications of Computer Vision in 2005"


Proceedings ArticleDOI
05 Jan 2005
TL;DR: The key contributions of this empirical study are to demonstrate that a model trained in this manner can achieve results comparable to a model trained in the traditional manner using a much larger set of fully labeled data, and that a training data selection metric that is defined independently of the detector greatly outperforms a selection metric based on the detection confidence generated by the detector.
Abstract: The construction of appearance-based object detection systems is time-consuming and difficult because a large number of training examples must be collected and manually labeled in order to capture variations in object appearance. Semi-supervised training is a means for reducing the effort needed to prepare the training set by training the model with a small number of fully labeled examples and an additional set of unlabeled or weakly labeled examples. In this work we present a semi-supervised approach to training object detection systems based on self-training. We implement our approach as a wrapper around the training process of an existing object detector and present empirical results. The key contributions of this empirical study are to demonstrate that a model trained in this manner can achieve results comparable to a model trained in the traditional manner using a much larger set of fully labeled data, and that a training data selection metric that is defined independently of the detector greatly outperforms a selection metric based on the detection confidence generated by the detector.

767 citations
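To make the wrapper idea concrete, here is a minimal self-training sketch in Python. It is an illustration under assumptions, not the paper's implementation: the detector is a stand-in classifier, and independent_selection_score is a hypothetical placeholder for the detector-independent selection metric the study advocates.

```python
# Minimal self-training wrapper sketch (assumptions noted above).
import numpy as np
from sklearn.linear_model import LogisticRegression

def independent_selection_score(X):
    # Hypothetical stand-in for a selection metric defined independently
    # of the detector; here, a simple density-style score.
    return -np.linalg.norm(X - X.mean(axis=0), axis=1)

def self_train(X_lab, y_lab, X_unlab, rounds=5, per_round=20):
    model = LogisticRegression(max_iter=1000)
    pool = np.asarray(X_unlab)
    for _ in range(rounds):
        model.fit(X_lab, y_lab)
        if len(pool) == 0:
            break
        # Rank the unlabeled pool by the independent metric,
        # NOT by the detector's own confidence.
        order = np.argsort(independent_selection_score(pool))[::-1]
        picked, pool = pool[order[:per_round]], pool[order[per_round:]]
        X_lab = np.vstack([X_lab, picked])
        y_lab = np.concatenate([y_lab, model.predict(picked)])
    return model
```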


Proceedings ArticleDOI
05 Jan 2005
TL;DR: A two-stage template-based method to detect people in widely varying thermal imagery using a generalized template and an AdaBoosted ensemble classifier using automatically tuned filters to test the hypothesized person locations.
Abstract: We present a two-stage template-based method to detect people in widely varying thermal imagery. The approach initially performs a fast screening procedure using a generalized template to locate potential person locations. Next an AdaBoosted ensemble classifier using automatically tuned filters is employed to test the hypothesized person locations. We demonstrate and evaluate the approach using a challenging dataset of thermal imagery.

307 citations


Proceedings ArticleDOI
05 Jan 2005
TL;DR: This paper proposes a measure that addresses the above concerns and has desirable properties such as accommodation of labeling errors at segment boundaries, region sensitive refinement, and compensation for differences in segment ambiguity between images.
Abstract: Quantitative evaluation and comparison of image segmentation algorithms is now feasible owing to the recent availability of collections of hand-labeled images. However, little attention has been paid to the design of measures to compare one segmentation result to one or more manual segmentations of the same image. Existing measures in the statistics and computer vision literature suffer either from intolerance to labeling refinement, making them unsuitable for image segmentation, or from the existence of degenerate cases, making the process of training algorithms using these measures prone to failure. This paper surveys previous work on measures of similarity and illustrates scenarios where they are applicable for performance evaluation in computer vision. For the image segmentation problem, we propose a measure that addresses the above concerns and has desirable properties such as accommodation of labeling errors at segment boundaries, region sensitive refinement, and compensation for differences in segment ambiguity between images.

186 citations


Proceedings ArticleDOI
05 Jan 2005
TL;DR: A two-step ICP (Iterative Closest Point) algorithm for matching 3D ears is introduced and results on a dataset of 30 subjects with 3D ear images are presented to demonstrate the effectiveness of the approach.
Abstract: The ear is a relatively stable biometric that is invariant from childhood to early old age (8 to 70). It is not affected by facial expressions, cosmetics, or eyeglasses. In this paper, we introduce a two-step ICP (Iterative Closest Point) algorithm for matching 3D ears. In the first step, the helix of the ear in 3D images is detected. The ICP algorithm is run to find the initial rigid transformation to align a model ear helix with the test ear helix. In the second step, the initial transformation is applied to selected locations of model ears and the ICP algorithm iteratively refines the transformation to bring model ears and the test ear into best alignment. The root mean square (RMS) registration error is used as the matching error criterion. The model ear with the minimum RMS error is declared the recognized ear. Experimental results on a dataset of 30 subjects with 3D ear images are presented to demonstrate the effectiveness of the approach.

137 citations
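For illustration, a single-stage ICP sketch with the RMS matching criterion described above; the paper's actual method is two-step (helix-based initialization, then refinement), and the point-cloud format and iteration count here are assumptions.

```python
# Illustrative ICP matching with an RMS-error criterion (not the paper's
# two-step implementation); point clouds are (N, 3) numpy arrays.
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    # Least-squares rotation R and translation t with R @ src + t ~= dst.
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

def icp_rms(model_pts, test_pts, iters=30):
    tree = cKDTree(test_pts)
    src = model_pts.copy()
    for _ in range(iters):
        _, idx = tree.query(src)              # closest-point correspondences
        R, t = best_rigid_transform(src, test_pts[idx])
        src = src @ R.T + t                   # apply the refined transform
    dist, _ = tree.query(src)
    return np.sqrt(np.mean(dist ** 2))        # RMS registration error

def recognize(test_scan, gallery):
    # The gallery model with the minimum RMS error is declared the match.
    return min(gallery, key=lambda name: icp_rms(gallery[name], test_scan))
```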


Proceedings ArticleDOI
05 Jan 2005
TL;DR: The proposed method lengthens the period of time during which a human or vehicle can navigate in GPS-deprived environments by contributing stochastic epipolar constraints over a broad baseline in time and space.
Abstract: This paper describes a new method to improve inertial navigation using feature-based constraints from one or more video cameras. The proposed method lengthens the period of time during which a human or vehicle can navigate in GPS-deprived environments. Our approach integrates well with existing navigation systems, because we invoke general sensor models that represent a wide range of available hardware. The inertial model includes errors in bias, scale, and random walk. Any purely projective camera and tracking algorithm may be used, as long as the tracking output can be expressed as ray vectors extending from known locations on the sensor body. A modified linear Kalman filter performs the data fusion. Unlike traditional SLAM, our state vector contains only inertial sensor errors related to position. This choice allows uncertainty to be properly represented by a covariance matrix. We do not augment the state with feature coordinates. Instead, image data contributes stochastic epipolar constraints over a broad baseline in time and space, resulting in improved observability of the IMU error states. The constraints lead to a relative residual and associated relative covariance, defined partly by the state history. Navigation results are presented using high-quality synthetic data and real fisheye imagery.

127 citations
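As a worked illustration of the epipolar constraint the abstract relies on: two ray vectors observing the same feature from two poses should satisfy r2ᵀ E r1 = 0 with E = [t]× R. A minimal sketch, assuming unit rays and a known relative pose; the filter machinery around it is omitted.

```python
# Epipolar-constraint residual sketch; in the paper, this kind of residual
# (with an associated covariance) drives the Kalman filter update.
import numpy as np

def skew(t):
    # Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v).
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def epipolar_residual(ray1, ray2, R, t):
    # ray1, ray2: unit rays to the same feature from poses 1 and 2;
    # R, t: rotation and translation of pose 2 relative to pose 1.
    E = skew(t) @ R                 # essential matrix
    return float(ray2 @ E @ ray1)   # zero for a perfectly consistent pair
```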


Proceedings ArticleDOI
Yingli Tian1, Arun Hampapur1
05 Jan 2005
TL;DR: The effectiveness of the proposed algorithm in robustly detecting salient motion is demonstrated for a variety of real environments with distracting motions such as lighting changes, swaying branches, rippling water, waterfalls, and fountains.
Abstract: Moving object detection is very important for video surveillance. In many environments, motion may be either interesting (salient) motion (e.g., a person) or uninteresting motion (e.g., swaying branches). In this paper, we propose a new real-time algorithm to detect salient motion in complex environments by combining temporal difference imaging and a temporal filtered motion field. We assume that the object with salient motion moves in a consistent direction for a period of time. No prior knowledge about object size and shape is necessary. Compared to background subtraction methods, our method does not need to learn the background model from hundreds of images and can handle quick image variations, e.g., a light being turned on or off. The average speed of our method is about 50 fps on 160x120 images on a 1 GHz Pentium III machine. The effectiveness of the proposed algorithm in robustly detecting salient motion is demonstrated for a variety of real environments with distracting motions such as lighting changes, swaying branches, rippling water, waterfalls, and fountains.

120 citations
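A rough sketch of the direction-consistency idea from the abstract, using signed frame differences as a crude stand-in for the temporally filtered motion field; the window length and thresholds are assumptions.

```python
# Flag pixels that change strongly AND change in a consistent direction
# over the recent window (a proxy for consistent motion direction).
import numpy as np

def salient_motion_mask(frames, win=10, diff_thresh=15.0, consist=0.8):
    # frames: list of grayscale images as float (H, W) arrays.
    diffs = [frames[i + 1] - frames[i] for i in range(len(frames) - 1)]
    recent = np.stack(diffs[-win:])
    changed = (np.abs(recent) > diff_thresh).mean(axis=0) > 0.5
    # Consistency: the signed mean is close to the mean magnitude only
    # when the per-frame changes share one sign.
    signed = recent.mean(axis=0)
    mag = np.abs(recent).mean(axis=0) + 1e-6
    consistent = np.abs(signed) / mag > consist
    return changed & consistent
```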


Proceedings ArticleDOI
05 Jan 2005
TL;DR: A novel method is proposed to generate plausible video sequences after removing relatively large objects from the original videos, in which a motion layer segmentation method is applied and a set of synthesized layers is generated.
Abstract: This paper proposes a novel method to generate plausible video sequences after removing relatively large objects from the original videos. In order to maintain temporal coherence among the frames, a motion layer segmentation method is applied. Then, a set of synthesized layers is generated by applying motion compensation and a region completion algorithm. Finally, a new video, in which the selected object is removed, is plausibly rendered given the synthesized layers and the motion parameters. A number of example videos are shown in the results to demonstrate the effectiveness of our method.

105 citations


Proceedings ArticleDOI
05 Jan 2005
TL;DR: A face surface matching framework that takes into account both rigid and non-rigid variations to match a 2.5D face image to a 3D face model is proposed, reducing the number of matching errors.
Abstract: Current two-dimensional image based face recognition systems encounter difficulties with large facial appearance variations due to the pose, illumination and expression changes. Utilizing 3D information of human faces is promising to handle the pose and lighting variations. While the 3D shape of a face does not change due to head pose (rigid) and lighting changes, it is not invariant to the non-rigid facial movement and evolution, such as expressions and aging effect. We propose a face surface matching framework to take into account both rigid and non-rigid variations to match a 2.5D face image to a 3D face model. The rigid registration is achieved by a modified Iterative Closest Point (ICP) algorithm. The thin plate spline (TPS) model is applied to estimate the deformation displacement vector field, which is used to represent the non-rigid deformation. For the purpose of face matching, the non-rigid deformations from different sources are identified, which is formulated as a two-class classification problem: intra-subject deformation vs. inter-subject deformation. The deformation classification results are integrated with the matching distances to make the final decision. Experimental results on a database containing 100 3D face models and 98 2.5D scans with smiling expression show that the number of errors is reduced from 28 to 18.

102 citations


Proceedings ArticleDOI
05 Jan 2005
TL;DR: A face recognition system that utilizes three-dimensional shape information to make the system more robust to arbitrary view, lighting, and facial appearance is developed and the results show the feasibility of the proposed matching scheme.
Abstract: The performance of face recognition systems that use two-dimensional images depends on consistent conditions w.r.t. lighting, pose, and facial appearance. We are developing a face recognition system that utilizes three-dimensional shape information to make the system more robust to arbitrary view, lighting, and facial appearance. For each subject, a 3D face model is constructed by integrating several 2.5D face scans from different viewpoints. A 2.5D scan is composed of one range image along with a registered 2D color image. The recognition engine consists of two components, surface matching and appearance-based matching. The surface matching component is based on a modified Iterative Closest Point (ICP) algorithm. The candidate list used for appearance matching is dynamically generated based on the output of the surface matching component, which reduces the complexity of the appearance-based matching stage. The 3D model in the gallery is used to synthesize new appearance samples with pose and illumination variations that are used for discriminant subspace analysis. The weighted sum rule is applied to combine the two matching components. A hierarchical matching structure is designed to further improve the system performance in both accuracy and efficiency. Experimental results are given for matching a database of 100 3D face models with 598 2.5D independent test scans acquired in different pose and lighting conditions, and with some smiling expression. The results show the feasibility of the proposed matching scheme.

98 citations


Proceedings ArticleDOI
05 Jan 2005
TL;DR: This work proposes a novel characterization of dynamic textures for the recognition problem, and describes a simple matching algorithm based on multiresolution histograms that measures the difference between two sequences.
Abstract: Dynamic textures are sequences of images of moving scenes that exhibit certain stationarity properties in time, for example, sea waves, smoke, foliage, and whirlwinds. This work proposes a novel characterization of dynamic textures and poses the problem of recognizing them. A method based on spatio-temporal multiresolution histograms of velocity and acceleration fields is presented. The spatio-temporal multiresolution histogram has many desirable properties, including simple computation, spatial efficiency, robustness to noise, and the ability to encode spatio-temporal dynamic information, which can reliably capture and represent the motion properties of different image sequences. Velocity and acceleration fields of image sequences at different spatio-temporal resolutions are accurately estimated by the structure tensor method. We describe a simple matching algorithm based on multiresolution histograms, which measures the difference between two sequences.

86 citations
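A small sketch of the multiresolution-histogram descriptor and L1 matching, assuming 2D fields (e.g., one velocity component per frame); pyramid depth, bin count, and value range are assumptions.

```python
# Build per-level histograms of a field over a Gaussian pyramid, then
# compare two sequences by the L1 distance between averaged descriptors.
import numpy as np
from scipy.ndimage import gaussian_filter

def multires_histogram(field, levels=4, bins=16, rng=(-5.0, 5.0)):
    hists, cur = [], np.asarray(field, dtype=float)
    for _ in range(levels):
        h, _ = np.histogram(cur, bins=bins, range=rng, density=True)
        hists.append(h)
        cur = gaussian_filter(cur, sigma=1.0)[::2, ::2]  # blur + subsample
    return np.concatenate(hists)

def sequence_distance(fields_a, fields_b):
    ha = np.mean([multires_histogram(f) for f in fields_a], axis=0)
    hb = np.mean([multires_histogram(f) for f in fields_b], axis=0)
    return float(np.abs(ha - hb).sum())  # L1 difference between sequences
```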


Proceedings ArticleDOI
Andrew W. Senior1, Arun Hampapur1, Max Lu1
05 Jan 2005
TL;DR: A novel method to automatically calibrate between multiple cameras, estimating the homography between the cameras in a home position, together with the effects of pan and tilt controls and the expected height of a person in the image is described.
Abstract: This paper describes a system for automatically acquiring high-resolution images by steering a pan-tilt-zoom camera at targets detected in a fixed camera view. The system uses a novel method to automatically calibrate between multiple cameras, estimating the homography between the cameras in a home position, together with the effects of pan and tilt controls and the expected height of a person in the image. These calibrations are chained together to steer a slave camera. In addition we describe a simple manual calibration scheme.
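To illustrate the calibration chaining, a minimal steering sketch: a target detected in the fixed view is mapped through the inter-camera homography into the slave's home view, and the image offset is converted to pan/tilt angles. The pinhole-style angle conversion and all parameters are assumptions, not the paper's exact model.

```python
# Map a fixed-view detection into slave pan/tilt commands via a chained
# homography; H comes from calibration, focal_px from the slave camera.
import numpy as np

def steer(target_xy, H, center_xy, focal_px):
    p = H @ np.array([target_xy[0], target_xy[1], 1.0])
    u, v = p[0] / p[2], p[1] / p[2]          # target in slave home view
    pan = np.degrees(np.arctan2(u - center_xy[0], focal_px))
    tilt = np.degrees(np.arctan2(v - center_xy[1], focal_px))
    return pan, tilt
```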

Proceedings ArticleDOI
05 Jan 2005
TL;DR: It is shown that the MRF approach produces more accurate and visually appealing silhouettes that are less prone to noise and background camouflaging effects than traditional per-pixel based methods.
Abstract: Many video surveillance and identification applications need to find moving objects in the field of view of a stationary camera. A popular method for obtaining these silhouettes is through the process of background subtraction. We present a novel method for comparing image frames to the model of the stationary background that exploits the spatial and temporal dependencies that objects in motion impose on their images. We achieve this through the development and use of Markov random fields of binary segmentation variates. We show that the MRF approach produces more accurate and visually appealing silhouettes that are less prone to noise and background camouflaging effects than traditional per-pixel based methods. Results include visual examination of silhouettes, comparisons against hand-segmented data, and an analysis of the effects of various silhouette extraction techniques on gait recognition performance.
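A compact sketch of the MRF idea using iterated conditional modes (ICM) over binary labels; the paper's exact inference scheme is not restated here, so the Ising weight and sweep count are assumptions.

```python
# ICM for MRF-regularized background subtraction: each pixel's label is
# chosen to minimize a data term (negative log-likelihood) plus an Ising
# smoothness term that rewards agreement with its 4-neighbors.
import numpy as np

def icm_silhouette(fg_loglik, bg_loglik, beta=1.5, sweeps=5):
    labels = (fg_loglik > bg_loglik).astype(int)   # per-pixel initialization
    H, W = labels.shape
    for _ in range(sweeps):
        for y in range(H):
            for x in range(W):
                nbrs = [labels[yy, xx]
                        for yy, xx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1))
                        if 0 <= yy < H and 0 <= xx < W]
                energies = [
                    -bg_loglik[y, x] - beta * nbrs.count(0),  # background
                    -fg_loglik[y, x] - beta * nbrs.count(1),  # foreground
                ]
                labels[y, x] = int(np.argmin(energies))
    return labels
```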

Proceedings ArticleDOI
05 Jan 2005
TL;DR: A technique for automatic identification of plankton using a variety of features and classification methods including ensembles is presented, expecting that upon completion, the system will become a useful tool for marine biologists to assess the health of the world's oceans.
Abstract: Earth's oceans are a soup of living micro-organisms known as plankton. As the foundation of the food chain for marine life, plankton are also an integral component of the global carbon cycle which regulates the planet's temperature. In this paper, we present a technique for automatic identification of plankton using a variety of features and classification methods including ensembles. The images were obtained in situ by an instrument known as the flow cytometer and microscope (FlowCAM), which detects particles from a stream of water siphoned directly from the ocean. The images are necessarily of limited resolution, making their identification a rather difficult challenge. We expect that upon completion, our system will become a useful tool for marine biologists to assess the health of the world's oceans.

Proceedings ArticleDOI
05 Jan 2005
TL;DR: The proposed framework provides translation-invariant recognition of gestures, a desirable property for many HCI systems, and allows multiple candidate feature vectors to be extracted at each time step.
Abstract: A method for the simultaneous localization and recognition of dynamic hand gestures is proposed. At the core of this method is a dynamic space-time warping (DSTW) algorithm that aligns a pair of query and model gestures in both space and time. For every frame of the query sequence, feature detectors generate multiple hand region candidates. Dynamic programming is then used to compute both a global matching cost, which is used to recognize the query gesture, and a warping path, which aligns the query and model sequences in time, and also finds the best hand candidate region in every query frame. The proposed framework includes translation invariant recognition of gestures, a desirable property for many HCI systems. The performance of the approach is evaluated on a dataset of hand signed digits gestured by people wearing short sleeve shirts, in front of a background containing other non-hand skin-colored objects. The algorithm simultaneously localizes the gesturing hand and recognizes the hand-signed digit. Although DSTW is illustrated in a gesture recognition setting, the proposed algorithm is a general method for matching time series that allows for multiple candidate feature vectors to be extracted at each time step.
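A compact sketch of the DSTW recurrence under simplifying assumptions (Euclidean local cost, unconstrained transitions between candidates); the paper also recovers the warping path, which is omitted here.

```python
# DSTW: DTW extended so each query frame has K candidate feature vectors;
# the recurrence minimizes over warping moves AND candidate choices.
import numpy as np

def dstw_cost(model, query_cands):
    # model: (n, d) model sequence; query_cands: (m, K, d) candidates.
    n, m = model.shape[0], query_cands.shape[0]
    local = np.linalg.norm(query_cands[None] - model[:, None, None], axis=3)
    D = np.full(local.shape, np.inf)        # (n, m, K) cumulative costs
    D[0, 0] = local[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = np.inf
            for pi, pj in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
                if pi >= 0 and pj >= 0:
                    best = min(best, D[pi, pj].min())
            D[i, j] = local[i, j] + best
    return D[n - 1, m - 1].min()            # global matching cost
```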

Proceedings ArticleDOI
05 Jan 2005
TL;DR: The SVM approach with normalized image patches provides detection and localization performance closest to that of human labelers and is shown to be substantially superior to boundary-based approaches such as the Hough transform.
Abstract: Machine learning techniques have shown considerable promise for visual inspection tasks such as locating human faces in cluttered scenes. In this paper, we examine the utility of such techniques for the scientifically-important problem of detecting and cataloging impact craters in planetary images gathered by spacecraft. Various supervised learning algorithms, including ensemble methods (bagging and AdaBoost with feed-forward neural networks as base learners), support vector machines (SVM), and continuously-scalable template models (CSTM), are employed to derive crater detectors from ground-truthed images. The resulting detectors are evaluated on a challenging set of Viking Orbiter images of Mars containing roughly one thousand craters. The SVM approach with normalized image patches provides detection and localization performance closest to that of human labelers and is shown to be substantially superior to boundary-based approaches such as the Hough transform.

Proceedings ArticleDOI
05 Jan 2005
TL;DR: A system that detects independently moving objects from a mobile platform in real time using a calibrated stereo camera and an efficient three-point algorithm in a RANSAC framework for outlier detection is described.
Abstract: We describe a system that detects independently moving objects from a mobile platform in real time using a calibrated stereo camera. Interest points are first detected and tracked through the images. These tracks are used to obtain the motion of the platform by using an efficient three-point algorithm in a RANSAC framework for outlier detection. We use a formulation based on disparity space for our inlier computation. In the disparity space, two disparity images of a rigid object are related by a homography that depends on the object's Euclidean rigid motion. We use the homography obtained from the camera motion to detect the independently moving objects from the disparity maps obtained by an efficient stereo algorithm. Our system is able to reliably detect the independently moving objects at 16 Hz for a 320 x 240 stereo image sequence using a standard laptop computer.
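The RANSAC outer loop the system relies on is standard; below is a generic skeleton for motion-hypothesis estimation, with fit and residual as caller-supplied stand-ins (e.g., a three-point motion solver and a disparity-space error), not the paper's code.

```python
# Generic RANSAC: sample minimal sets, fit a hypothesis, score inliers,
# keep the best, then refit on its inlier set.
import numpy as np

def ransac(data, fit, residual, sample_size=3, iters=200, thresh=1.0,
           seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    best_inliers = np.zeros(len(data), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(data), size=sample_size, replace=False)
        model = fit(data[idx])
        inliers = residual(model, data) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if not best_inliers.any():
        return None, best_inliers
    return fit(data[best_inliers]), best_inliers   # refit on all inliers
```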

Proceedings ArticleDOI
05 Jan 2005
TL;DR: A temporal filtering framework for hand tracking is proposed that can initialize and reset itself without human intervention, and can automatically identify video trajectories of unambiguous hand motion, and detect frames where tracking becomes ambiguous because of occlusions or overlaps.
Abstract: In gesture and sign language video sequences, hand motion tends to be rapid, and hands frequently appear in front of each other or in front of the face. Thus, hand location is often ambiguous, and naive color-based hand tracking is insufficient. To improve tracking accuracy, some methods employ a prediction-update framework, but such methods require careful initialization of model parameters, and tend to drift and lose track in extended sequences. In this paper, a temporal filtering framework for hand tracking is proposed that can initialize and reset itself without human intervention. In each frame, simple features like color and motion residue are exploited to identify multiple candidate hand locations. The temporal filter then uses the Viterbi algorithm to select among the candidates from frame to frame. The resulting tracking system can automatically identify video trajectories of unambiguous hand motion, and detect frames where tracking becomes ambiguous because of occlusions or overlaps. Experiments on video sequences of several hundred frames in duration demonstrate the system's ability to track hands robustly, to detect and handle tracking ambiguities, and to extract the trajectories of unambiguous hand motion.
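A minimal sketch of the Viterbi selection step described above: per-frame candidate locations with detection costs, a motion-smoothness transition cost, and a backtracked optimal path. The cost definitions and weight are assumptions.

```python
# Viterbi over per-frame hand candidates: pick the globally cheapest
# temporally smooth sequence of candidate locations.
import numpy as np

def viterbi_track(candidates, det_costs, motion_weight=0.1):
    # candidates: list of (K_t, 2) arrays; det_costs: list of (K_t,) arrays.
    cost, back = det_costs[0].astype(float), []
    for t in range(1, len(candidates)):
        jump = np.linalg.norm(
            candidates[t][:, None] - candidates[t - 1][None], axis=2)
        total = cost[None, :] + motion_weight * jump   # (K_t, K_{t-1})
        back.append(total.argmin(axis=1))
        cost = det_costs[t] + total.min(axis=1)
    path = [int(cost.argmin())]
    for b in reversed(back):                  # backtrack best predecessors
        path.append(int(b[path[-1]]))
    path.reverse()
    return [candidates[t][k] for t, k in enumerate(path)]
```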

Proceedings ArticleDOI
05 Jan 2005
TL;DR: A novel method to temporally synchronize multiple stationary video cameras with overlapping views that suffices for all variants of the synchronization problem exposed by the theoretical dissertation and does not require the trajectory correspondence problem to be solved a priori.
Abstract: In this work, we present a formalization of the video synchronization problem that exposes new variants of the problem that have been left unexplored to date. We also present a novel method to temporally synchronize multiple stationary video cameras with overlapping views that: 1) does not rely on certain scene properties, 2) suffices for all variants of the synchronization problem exposed by the theoretical dissertation, and 3) does not require the trajectory correspondence problem to be solved a priori. The method uses a two-stage approach that first approximates the synchronization by tracking moving objects and identifying inflection points. The method then proceeds to refine the estimate using a consensus-based matching heuristic to find moving features that best agree with the pre-computed camera geometries from stationary image features. By using the fundamental matrix and the trifocal tensor in the second refinement step we are able to improve the estimation of the first step and handle a broader range of input scenarios and camera conditions.

Proceedings ArticleDOI
05 Jan 2005
TL;DR: This paper presents a method utilizing the registered 2D color and range image of a face to automatically identify the eyes, nose, and mouth, focusing on the 2D color information so the algorithm runs as fast as possible.
Abstract: As interest in 3D face recognition increases, the importance of the initial alignment problem does as well. In this paper we present a method utilizing the registered 2D color and range image of a face to automatically identify the eyes, nose, and mouth. These features are important for initially aligning faces in both standard 2D and 3D face recognition algorithms. For our algorithm to run as fast as possible, we focus on the 2D color information. This allows the algorithm to run in approximately 4 seconds on a 640×480 image with registered range data. On a database of 1,500 images the algorithm achieved a facial feature detection rate of 99.6%, with 0.4% of the images skipped due to hair obstructing the face.

Proceedings ArticleDOI
05 Jan 2005
TL;DR: This paper investigates an unsupervised hypothesis testing method for learning the characteristics of objects passing unobserved from one observed location to another that is robust to non-stationary traffic processes that result from traffic lights, vehicle grouping, and other non-linear vehicle-vehicle interactions.
Abstract: As tracking systems become more effective at reliably tracking multiple objects over extended periods of time within single camera views and across overlapping camera views, increasing attention is being focused on tracking objects through periods where they are not observed. This paper investigates an unsupervised hypothesis testing method for learning the characteristics of objects passing unobserved from one observed location to another. This method not only reliably determines whether objects predictably pass from one location to another without performing explicit correspondence, but it approximates the likelihood of those transitions. It is robust to non-stationary traffic processes that result from traffic lights, vehicle grouping, and other non-linear vehicle-vehicle interactions. Synthetic data allows us to test and verify our results for complex traffic situations over multiple city blocks and contrast it with previous approaches.

Proceedings ArticleDOI
Rui Li1, Stan Sclaroff1
05 Jan 2005
TL;DR: In this article, a multi-scale method along with a novel adaptive smoothing technique is used to gain a regularized solution, which preserves discontinuities and prevents over-regularization.
Abstract: Scene flow methods estimate the three-dimensional motion field for points in the world, using multi-camera video data. Such methods combine multi-view reconstruction with motion estimation approaches. This paper describes an alternative formulation for dense scene flow estimation that provides convincing results using only two cameras by fusing stereo and optical flow estimation into a single coherent framework. To handle the aperture problems inherent in the estimation task, a multi-scale method along with a novel adaptive smoothing technique is used to gain a regularized solution. This combined approach both preserves discontinuities and prevents over-regularization - two problems commonly associated with basic multi-scale approaches. Internally, the framework generates probability distributions for optical flow and disparity. Taking into account the uncertainty in the intermediate stages allows for more reliable estimation of the 3D scene flow than standard stereo and optical flow methods allow. Experiments with synthetic and real test data demonstrate the effectiveness of the approach.

Proceedings ArticleDOI
05 Jan 2005
TL;DR: All the distances between the given postmortem radiographs and the antemortem radiographs that provide candidate identities are combined to establish the identity of the subject associated with the postmortem radiographs.
Abstract: Dental biometrics utilizes the evidence revealed by dental radiographs for human identification. This evidence includes the tooth contours, the relative positions of neighboring teeth, and the shapes of the dental work (e.g., crowns, fillings and bridges). The proposed system has two main stages: feature extraction, and matching. The feature extraction stage uses anisotropic diffusion to enhance the images and a mixture of Gaussians model to segment the dental work. The matching stage has three sequential steps: shape registration, computation of image similarity, and subject identification. In shape registration, we align the tooth contours and obtain the distance between them. A second method based on overlapped areas is used to match the dental work. The distance between the shapes of the teeth and the distance between the shapes of the dental work are then combined using likelihood estimates to improve the retrieval accuracy. At the second step, the correspondence of teeth between two given images is established. A distance measure based on this correspondence is then used to represent the similarity between the two images. Finally, the distances are used to infer the subject's identity.

Proceedings ArticleDOI
05 Jan 2005
TL;DR: A video tracking system that tracks and analyzes the behavioral pattern of users in a public space and has obtained important statistical measurements about users' behavior, which can be used to evaluate architectural design in terms of human spatial behavior and model the behavior ofusers in public spaces.
Abstract: The paper describes a video tracking system that tracks and analyzes the behavioral pattern of users in a public space. We have obtained important statistical measurements about users' behavior, which can be used to evaluate architectural design in terms of human spatial behavior and model the behavior of users in public spaces. Previously, such measurements could only be obtained through costly manual processes, e.g., behavioral mapping and time-lapse filming with human examiners. Our system has automated the process of analyzing the behavior of users. The system consists of a head detector for detecting people in each single frame of the video and data association for tracking people through frames. We compared the results obtained using our system with those obtained by manual counting, for a small data set, and found the results to be fairly accurate. We then applied the system to a large-scale data set and obtained substantial statistical measurements of parameters such as the total number of users who entered the space, the total number of users who sat by a fountain, the time that each spent by the fountain, etc. These statistics allow fundamental rethinking of the way people use a public space. This research is a novel application of computer vision in evaluating architectural design in terms of human behavior.

Proceedings ArticleDOI
05 Jan 2005
TL;DR: This paper describes a stereo-based tree traversability algorithm implemented and tested on a robotic vehicle under the DARPA PerceptOR program, and results from the daytime for short baseline (9 cm) and wide baseline (30 cm) stereo are presented.
Abstract: Autonomous off-road navigation through forested areas is particularly challenging when there exists a mixture of densely distributed thin and thick trees. To make progress through a dense forest, the robot must decide which trees it can push over and which trees it must circumvent. This paper describes a stereo-based tree traversability algorithm implemented and tested on a robotic vehicle under the DARPA PerceptOR program. Edge detection is applied to the left view of the stereo pair to extract long and vertical edge contours. A search step matches anti-parallel line pairs that correspond to the boundaries of individual trees. Stereo ranging is performed and the range data within trunk fragments are averaged. The diameter of each tree is then estimated, based on the average range to the tree, the focal length of the camera, and the distance in pixels between matched contour lines. We use the estimated tree diameters to construct a tree traversability image used in generating a terrain map. In stationary experiments, the average error in estimating the diameter of thirty mature tree trunks (having diameters ranging from 10-65 cm and a distance from the cameras ranging from 2.5-30 meters) was less than 5 cm. Tree traversability results from the daytime for short baseline (9 cm) and wide baseline (30 cm) stereo are presented. Results from nighttime using wide baseline (33.5 cm) thermal infrared stereo are also presented.
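The diameter estimate in the abstract is simple pinhole-camera arithmetic, shown below with illustrative numbers (the function and values are assumptions, not the paper's code).

```python
# Similar triangles: trunk width in meters = range * pixel separation /
# focal length, with focal length expressed in pixels.
def trunk_diameter_m(range_m, pixel_separation, focal_length_px):
    return range_m * pixel_separation / focal_length_px

# Example: a trunk 10 m away whose contour lines are 30 px apart, with an
# 800 px focal length: 10 * 30 / 800 = 0.375 m (about 37 cm).
print(trunk_diameter_m(10.0, 30, 800))
```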

Proceedings ArticleDOI
05 Jan 2005
TL;DR: Rather than learning and storing feature representations separately for each object, this work creates a finite set of representative features and share these features within and between different object models to achieve fast recognition of a large number of different objects.
Abstract: We present a framework for learning object representations for fast recognition of a large number of different objects. Rather than learning and storing feature representations separately for each object, we create a finite set of representative features and share these features within and between different object models. In contrast to traditional recognition methods that scale linearly with the number of objects, the shared features can be exploited by bottom-up search algorithms which require a constant number of feature comparisons for any number of objects. We demonstrate the feasibility of this approach on a novel database of 50 everyday objects in cluttered real-world scenes. Using Gabor wavelet-response features extracted only at corner points, our system achieves good recognition results despite substantial occlusion and background clutter.

Proceedings ArticleDOI
05 Jan 2005
TL;DR: This work addresses the tracking problem by modeling the appearance and motion of moving regions, and defines a spatio-temporal Joint Probability Data Association Filter (JPDAF) for integrating multiple cues.
Abstract: We present an approach for persistent tracking of moving objects observed by non-overlapping and moving cameras. Our approach robustly recovers the geometry of non-overlapping views using a moving camera that pans across the scene. We address the tracking problem by modeling the appearance and motion of the moving regions. The appearance of the detected blobs is described by multiple spatial distributions models of blobs' colors and edges. This representation is invariant to 2D rigid and scale transformation. It provides a rich description of the detected regions, and produces an efficient blob similarity measure for tracking. The motion model is obtained using a Kalman Filter (KF) process, which predicts the position of the moving objects while taking into account the camera motion. Tracking is performed by the maximization of a joint probability model combining objects' appearance and motion. The novelty of our approach consists in defining a spatio-temporal Joint Probability Data Association Filter (JPDAF) for integrating multiple cues. The proposed method tracks a large number of moving people with partial and total occlusions and provides automatic handoff of tracked objects. We demonstrate the performance of the system on several real video surveillance sequences.

Proceedings ArticleDOI
05 Jan 2005
TL;DR: This work presents a Fourier-based approach that estimates large translations, scalings, and rotations using the pseudopolar (PP) Fourier transform to achieve substantially improved approximations of the polar and log-polar Fourier transforms of an image.
Abstract: One of the major challenges related to image registration is the estimation of large motions without prior knowledge. This paper presents a Fourier-based approach that estimates large translation, scale, and rotation motions. The algorithm uses the pseudo-polar transform to achieve substantially improved approximations of the polar and log-polar Fourier transforms of an image. Thus, rotation and scale changes are reduced to translations, which are estimated using phase correlation. By utilizing the pseudo-polar grid we increase the performance (accuracy, speed, robustness) of the registration algorithms. Scales up to 4 and arbitrary rotation angles can be robustly recovered, compared to a maximum scaling of 2 recovered by the current state-of-the-art algorithms. The algorithm utilizes only 1D FFT calculations whose overall complexity is significantly lower than prior works. Experimental results demonstrate the applicability of these algorithms.
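The final step of the method, translation estimation by phase correlation, is easy to sketch; the pseudo-polar reduction of rotation and scale to translation is omitted, and the integer-shift version below is an illustration only.

```python
# Phase correlation: whiten the cross-power spectrum to keep only phase;
# its inverse FFT peaks at the integer shift between the two images.
import numpy as np

def phase_correlation(a, b):
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = np.conj(Fa) * Fb
    cross /= np.abs(cross) + 1e-12           # normalize: phase only
    corr = np.abs(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    H, W = a.shape
    if dy > H // 2: dy -= H                  # unwrap negative shifts
    if dx > W // 2: dx -= W
    return dy, dx                            # shift of b relative to a
```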

Proceedings ArticleDOI
05 Jan 2005
TL;DR: Improvements to the popular scale invariant feature transform (SIFT) are suggested which incorporate local object boundary information and the resulting feature detection and descriptor creation processes are invariant to changes in background.
Abstract: Current feature-based object recognition methods use information derived from local image patches. For robustness, features are engineered for invariance to various transformations, such as rotation, scaling, or affine warping. When patches overlap object boundaries, however, errors in both detection and matching will almost certainly occur due to inclusion of unwanted background pixels. This is common in real images, which often contain significant background clutter, objects which are not heavily textured, or objects which occupy a relatively small portion of the image. We suggest improvements to the popular scale invariant feature transform (SIFT) which incorporate local object boundary information. The resulting feature detection and descriptor creation processes are invariant to changes in background. We call this method the background and scale invariant feature transform (BSIFT). We demonstrate BSIFT's superior performance in feature detection and matching on synthetic and natural images.

Proceedings ArticleDOI
05 Jan 2005
TL;DR: An accurate vision-based position tracking system which is significantly more robust and reliable over a wide range of environments than existing approaches and nonlinear optimization of the camera position during tracking gives accuracy comparable with full bundle adjustment but at significantly reduced cost.
Abstract: This paper describes an accurate vision-based position tracking system which is significantly more robust and reliable over a wide range of environments than existing approaches. Based on fiducial detection for robustness, we show how a machine-learning approach allows the development of significantly more reliable fiducial detection than has previously been demonstrated. We calibrate fiducial positions using a structure-from-motion solver. We then show how nonlinear optimization of the camera position during tracking gives accuracy comparable with full bundle adjustment but at significantly reduced cost.

Proceedings ArticleDOI
05 Jan 2005
TL;DR: A framework that combines visual human motion tracking with RFID based object tracking is proposed that enables the accurate estimation of high-level interactions between people and objects for application domains such as retail, home-care, workplace-safety, manufacturing and others.
Abstract: Computer vision-based articulated human motion tracking is attractive for many applications since it allows unobtrusive and passive estimation of people's activities. Although much progress has been made on human-only tracking, the visual tracking of people that interact with objects such as tools, products, packages, and devices is considerably more challenging. The wide variety of objects, their varying visual appearance, and their varying (and often small) size makes a vision-based understanding of person-object interactions very difficult. To alleviate this problem for at least some application domains, we propose a framework that combines visual human motion tracking with RFID based object tracking. We customized commonly available RFID technology to obtain orientation estimates of objects in the field of RFID emitter coils. The resulting fusion of visual human motion tracking and RFID-based object tracking enables the accurate estimation of high-level interactions between people and objects for application domains such as retail, home-care, workplace-safety, manufacturing and others.