Showing papers on "Object detection published in 2005"


Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations
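
The pipeline the abstract describes maps naturally to a short computation. Below is a minimal numpy sketch of the HOG chain (fine-scale gradients, orientation binning into cells, block-level contrast normalization). It is an illustrative reimplementation, not the authors' code; the 8x8 cells, 9 unsigned orientation bins, and overlapping 2x2 blocks are the commonly cited configuration, assumed here.

import numpy as np

def hog_features(img, cell=8, bins=9, eps=1e-6):
    img = img.astype(np.float64)
    # Fine-scale gradients via simple [-1, 0, 1] differences.
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation

    h, w = img.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            idx = (a / (180.0 / bins)).astype(int) % bins
            for b in range(bins):
                hist[i, j, b] = m[idx == b].sum()

    # L2 normalization over overlapping 2x2 blocks of cells.
    blocks = []
    for i in range(ch - 1):
        for j in range(cw - 1):
            v = hist[i:i+2, j:j+2].ravel()
            blocks.append(v / np.sqrt(np.sum(v**2) + eps**2))
    return np.concatenate(blocks)

feats = hog_features(np.random.rand(128, 64))  # typical 64x128 detection window
print(feats.shape)                              # (3780,), the standard HOG length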


Journal ArticleDOI
TL;DR: A real-time algorithm for foreground-background segmentation is presented that can handle scenes containing moving backgrounds or illumination variations and achieves robust detection for different types of videos.
Abstract: We present a real-time algorithm for foreground-background segmentation. Sample background values at each pixel are quantized into codebooks which represent a compressed form of background model for a long image sequence. This allows us to capture structural background variation due to periodic-like motion over a long period of time under limited memory. The codebook representation is efficient in memory and speed compared with other background modeling techniques. Our method can handle scenes containing moving backgrounds or illumination variations, and it achieves robust detection for different types of videos. We compared our method with other multimode modeling techniques. In addition to the basic algorithm, two features improving the algorithm are presented: layered modeling/detection and adaptive codebook updating. For performance evaluation, we have applied perturbation detection rate analysis to four background subtraction algorithms and two videos of different types of scenes.

1,552 citations
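
As a rough illustration of the codebook idea, the sketch below keeps a per-pixel list of grayscale codewords (a brightness range plus a last-access time) and flags samples that match no codeword as foreground. It is a simplified stand-in for the paper's color model; the matching tolerance is an assumed value.

import numpy as np

class PixelCodebook:
    def __init__(self, eps=10.0):
        self.eps = eps
        self.words = []            # each codeword: [lo, hi, last_seen]

    def train(self, value, t):
        for w in self.words:
            if w[0] - self.eps <= value <= w[1] + self.eps:
                w[0] = min(w[0], value); w[1] = max(w[1], value); w[2] = t
                return
        self.words.append([value, value, t])   # no match: new codeword

    def is_foreground(self, value):
        return not any(w[0] - self.eps <= value <= w[1] + self.eps
                       for w in self.words)

# Train on background samples for one pixel, then test two observations.
rng = np.random.default_rng(0)
cb = PixelCodebook()
for t, v in enumerate(100 + 5 * rng.standard_normal(500)):   # static background
    cb.train(v, t)
print(cb.is_foreground(103.0), cb.is_foreground(200.0))      # False, True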


Proceedings ArticleDOI
20 Jun 2005
TL;DR: The approach exhibits excellent recognition performance, outperforming several state-of-the-art systems on a variety of image datasets spanning many object categories, and its performance constitutes a suggestive plausibility proof for a class of feedforward models of object recognition in cortex.
Abstract: We introduce a novel set of features for robust object recognition. Each element of this set is a complex feature obtained by combining position- and scale-tolerant edge-detectors over neighboring positions and multiple orientations. Our system's architecture is motivated by a quantitative model of visual cortex. We show that our approach exhibits excellent recognition performance and outperforms several state-of-the-art systems on a variety of image datasets including many different object categories. We also demonstrate that our system is able to learn from very few examples. The performance of the approach constitutes a suggestive plausibility proof for a class of feedforward models of object recognition in cortex.

969 citations


Proceedings ArticleDOI
20 Jun 2005
TL;DR: The core of the method is the combination of local and global cues via probabilistic top-down segmentation, which allows object hypotheses to be examined and compared with high precision down to the pixel level; qualitative and quantitative results on a large data set confirm its effectiveness.
Abstract: In this paper, we address the problem of detecting pedestrians in crowded real-world scenes with severe overlaps. Our basic premise is that this problem is too difficult for any type of model or feature alone. Instead, we present an algorithm that integrates evidence in multiple iterations and from different sources. The core part of our method is the combination of local and global cues via probabilistic top-down segmentation. Altogether, this approach allows examining and comparing object hypotheses with high precision down to the pixel level. Qualitative and quantitative results on a large data set confirm that our method is able to reliably detect pedestrians in crowded scenes, even when they overlap and partially occlude each other. In addition, the flexible nature of our approach allows it to operate on very small training sets.

952 citations


Proceedings Article
05 Dec 2005
TL;DR: MILBoost, a new boosting variant combining Multiple Instance Learning cost functions with the AnyBoost framework, adapts the feature selection criterion to optimize the performance of the Viola-Jones cascade, showing the advantage of simultaneously learning the locations and scales of the objects in the training set along with the parameters of the classifier.
Abstract: A good image object detection algorithm is accurate, fast, and does not require exact locations of objects in a training set. We can create such an object detector by taking the architecture of the Viola-Jones detector cascade and training it with a new variant of boosting that we call MILBoost. MILBoost uses cost functions from the Multiple Instance Learning literature combined with the AnyBoost framework. We adapt the feature selection criterion of MILBoost to optimize the performance of the Viola-Jones cascade. Experiments show that the detection rate is up to 1.6 times better using MILBoost. This increased detection rate shows the advantage of simultaneously learning the locations and scales of the objects in the training set along with the parameters of the classifier.

808 citations
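
The noisy-OR bag model underlying MILBoost is compact enough to state directly. The sketch below computes the bag probability from per-window scores and the AnyBoost example weights (the gradient of the bag log-likelihood with respect to each instance score); it is a schematic of the cost function, not the authors' cascade implementation.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def bag_probability(scores):
    p = sigmoid(scores)
    return 1.0 - np.prod(1.0 - p)          # noisy-OR over the bag's instances

def instance_weights(scores, positive_bag):
    p = sigmoid(scores)
    p_bag = 1.0 - np.prod(1.0 - p)
    if positive_bag:                        # d log p_bag / d y_ij
        return p * (1.0 - p_bag) / p_bag
    return -p                               # d log (1 - p_bag) / d y_ij

scores = np.array([-2.0, 0.5, 3.0])         # weak-classifier sums, one per window
print(bag_probability(scores))               # high: one instance looks positive
print(instance_weights(scores, positive_bag=True))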


Proceedings ArticleDOI
17 Oct 2005
TL;DR: This work shows that the coarse geometric properties of a scene can be estimated by learning appearance-based models of geometric classes, even in cluttered natural scenes, and provides a multiple-hypothesis framework for robustly estimating scene structure from a single image and obtaining confidences for each geometric label.
Abstract: Many computer vision algorithms limit their performance by ignoring the underlying 3D geometric structure in the image. We show that we can estimate the coarse geometric properties of a scene by learning appearance-based models of geometric classes, even in cluttered natural scenes. Geometric classes describe the 3D orientation of an image region with respect to the camera. We provide a multiple-hypothesis framework for robustly estimating scene structure from a single image and obtaining confidences for each geometric label. These confidences can then be used to improve the performance of many other applications. We provide a thorough quantitative evaluation of our algorithm on a set of outdoor images and demonstrate its usefulness in two applications: object detection and automatic single-view reconstruction.

792 citations


Proceedings ArticleDOI
05 Jan 2005
TL;DR: The key contributions of this empirical study are to demonstrate that a model trained in this manner can achieve results comparable to a model trained in the traditional manner using a much larger set of fully labeled data, and that a training data selection metric that is defined independently of the detector greatly outperforms a selection metric based on the detection confidence generated by the detector.
Abstract: The construction of appearance-based object detection systems is time-consuming and difficult because a large number of training examples must be collected and manually labeled in order to capture variations in object appearance. Semi-supervised training is a means for reducing the effort needed to prepare the training set by training the model with a small number of fully labeled examples and an additional set of unlabeled or weakly labeled examples. In this work we present a semi-supervised approach to training object detection systems based on self-training. We implement our approach as a wrapper around the training process of an existing object detector and present empirical results. The key contributions of this empirical study are to demonstrate that a model trained in this manner can achieve results comparable to a model trained in the traditional manner using a much larger set of fully labeled data, and that a training data selection metric that is defined independently of the detector greatly outperforms a selection metric based on the detection confidence generated by the detector.

767 citations
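
A minimal sketch of the self-training wrapper follows: train on the labeled pool, label the unlabeled pool, promote the examples ranked highest by a selection metric, and retrain. The logistic-regression classifier and the metric below are hypothetical stand-ins; the paper's finding is that the metric should be defined independently of the detector's own confidence.

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, selection_metric, rounds=5, k=20):
    X_l, y_l = X_l.copy(), y_l.copy()
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)
        if len(X_u) == 0:
            break
        pred = clf.predict(X_u)                          # self-assigned labels
        ranked = np.argsort(-selection_metric(X_u))      # best candidates first
        take, keep = ranked[:k], ranked[k:]
        X_l = np.vstack([X_l, X_u[take]])                # promote top-k examples
        y_l = np.concatenate([y_l, pred[take]])
        X_u = X_u[keep]
    return clf

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)); y = (X[:, 0] > 0).astype(int)
metric = lambda X: -np.abs(X[:, 0])   # hypothetical detector-independent score
model = self_train(X[:40], y[:40], X[40:], metric)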


Journal ArticleDOI
TL;DR: An object detection scheme with three innovations over existing approaches: the background is modeled as a single probability density over a joint domain-range representation, temporal persistence is used as a detection criterion, and the posterior function is maximized efficiently by finding the minimum cut of a capacitated graph.
Abstract: Accurate detection of moving objects is an important precursor to stable tracking or recognition. In this paper, we present an object detection scheme that has three innovations over existing approaches. First, the model of the intensities of image pixels as independent random variables is challenged and it is asserted that useful correlation exists in intensities of spatially proximal pixels. This correlation is exploited to sustain high levels of detection accuracy in the presence of dynamic backgrounds. By using a nonparametric density estimation method over a joint domain-range representation of image pixels, multimodal spatial uncertainties and complex dependencies between the domain (location) and range (color) are directly modeled. We propose a model of the background as a single probability density. Second, temporal persistence is proposed as a detection criterion. Unlike previous approaches to object detection which detect objects by building adaptive models of the background, the foreground is modeled to augment the detection of objects (without explicit tracking), since objects detected in the preceding frame contain substantial evidence for detection in the current frame. Finally, the background and foreground models are used competitively in a MAP-MRF decision framework, stressing spatial context as a condition of detecting interesting objects, and the posterior function is maximized efficiently by finding the minimum cut of a capacitated graph. Experimental validation of the proposed method is performed and presented on a diverse set of dynamic scenes.

685 citations
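
The first innovation, a kernel density estimate over a joint domain-range representation, can be sketched in a few lines. Below, each background sample is a 5-D point (x, y, r, g, b) and the background likelihood of a new pixel is the average Gaussian kernel response; the bandwidths are assumed values, not the paper's.

import numpy as np

def kde_likelihood(sample, background, bw):
    # sample: (5,), background: (n, 5), bw: (5,) per-dimension bandwidths
    d = (background - sample) / bw
    k = np.exp(-0.5 * np.sum(d**2, axis=1)) / np.prod(bw * np.sqrt(2 * np.pi))
    return k.mean()

rng = np.random.default_rng(0)
bg = np.column_stack([rng.uniform(0, 10, 1000),              # x
                      rng.uniform(0, 10, 1000),              # y
                      150 + rng.normal(0, 5, (1000, 3))])    # color
bw = np.array([1.0, 1.0, 8.0, 8.0, 8.0])
print(kde_likelihood(np.array([5, 5, 150, 150, 150]), bg, bw))  # background-like
print(kde_likelihood(np.array([5, 5, 30, 30, 30]), bg, bw))     # foreground-like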


Journal ArticleDOI
TL;DR: The model's performance on search for single features and feature conjunctions is consistent with existing psychophysical data, and results suggest that the model may provide a reasonable approximation to many brain processes involved in complex task-driven visual behaviors.

677 citations


Proceedings ArticleDOI
17 Oct 2005
TL;DR: This paper constructs a realtime event detector for each action of interest by learning a cascade of filters based on volumetric features that efficiently scans video sequences in space and time, and confirms that the technique achieves performance comparable to a current interest point based human activity recognizer on a standard database of human activities.
Abstract: This paper studies the use of volumetric features as an alternative to popular local descriptor approaches for event detection in video sequences. Motivated by the recent success of similar ideas in object detection on static images, we generalize the notion of 2D box features to 3D spatio-temporal volumetric features. This general framework enables us to do real-time video analysis. We construct a realtime event detector for each action of interest by learning a cascade of filters based on volumetric features that efficiently scans video sequences in space and time. This event detector recognizes actions that are traditionally problematic for interest point methods, such as smooth motions where insufficient space-time interest points are available. Our experiments demonstrate that the technique accurately detects actions on real-world sequences and is robust to changes in viewpoint, scale and action speed. We also adapt our technique to the related task of human action classification and confirm that it achieves performance comparable to a current interest point based human activity recognizer on a standard database of human activities.

616 citations
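
The 3D box features rest on a direct generalization of the integral image. The sketch below builds an integral video with cumulative sums over x, y and t, then sums any spatio-temporal box in constant time by inclusion-exclusion over its eight corners; it illustrates the data structure only, not the learned filter cascade.

import numpy as np

def integral_video(v):
    # v: (T, H, W) video volume; zero-pad so lookups need no branching.
    s = v.cumsum(0).cumsum(1).cumsum(2)
    return np.pad(s, ((1, 0), (1, 0), (1, 0)))

def box_sum(iv, t0, t1, y0, y1, x0, x1):
    # Sum of v[t0:t1, y0:y1, x0:x1] via 3-D inclusion-exclusion.
    return (iv[t1, y1, x1] - iv[t0, y1, x1] - iv[t1, y0, x1] - iv[t1, y1, x0]
            + iv[t0, y0, x1] + iv[t0, y1, x0] + iv[t1, y0, x0] - iv[t0, y0, x0])

v = np.random.rand(16, 32, 32)
iv = integral_video(v)
print(np.isclose(box_sum(iv, 2, 10, 4, 20, 8, 24),
                 v[2:10, 4:20, 8:24].sum()))   # True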


Journal ArticleDOI
TL;DR: A novel red lesion detection method is presented based on a hybrid approach, combining prior works by Spencer et al. (1996) and Frame et al. (1998) with two important new contributions, including a new red lesion candidate detection system based on pixel classification.
Abstract: The robust detection of red lesions in digital color fundus photographs is a critical step in the development of automated screening systems for diabetic retinopathy. In this paper, a novel red lesion detection method is presented based on a hybrid approach, combining prior works by Spencer et al. (1996) and Frame et al. (1998) with two important new contributions. The first contribution is a new red lesion candidate detection system based on pixel classification. Using this technique, vasculature and red lesions are separated from the background of the image. After removal of the connected vasculature, the remaining objects are considered possible red lesions. Second, an extensive number of new features are added to those proposed by Spencer-Frame. The detected candidate objects are classified using all features and a k-nearest neighbor classifier. An extensive evaluation was performed on a test set composed of images representative of those normally found in a screening set. When determining whether an image contains red lesions, the system achieves a sensitivity of 100% at a specificity of 87%. The method is compared with several different automatic systems and is shown to outperform them all. Performance is close to that of a human expert examining the images for the presence of red lesions.
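
The final stage, classifying candidate objects with a k-nearest-neighbor rule over a feature vector, can be sketched with a stock classifier. The features below are random stand-ins for the paper's shape, color, and contrast features, and k = 11 is an assumed value.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.standard_normal((300, 12))                     # 12 hypothetical features
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)    # 1 = red lesion

knn = KNeighborsClassifier(n_neighbors=11).fit(X_train, y_train)
candidates = rng.standard_normal((5, 12))                    # detected candidates
print(knn.predict(candidates), knn.predict_proba(candidates)[:, 1])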

Journal ArticleDOI
TL;DR: By the time subjects knew an image contained an object at all, they already knew its category, and these findings place powerful constraints on theories of object recognition.
Abstract: What is the sequence of processing steps involved in visual object recognition? We varied the exposure duration of natural images and measured subjects' performance on three different tasks, each designed to tap a different candidate component process of object recognition. For each exposure duration, accuracy was lower and reaction time longer on a within-category identification task (e.g., distinguishing pigeons from other birds) than on a perceptual categorization task (e.g., birds vs. cars). However, strikingly, at each exposure duration, subjects performed just as quickly and accurately on the categorization task as they did on a task requiring only object detection: By the time subjects knew an image contained an object at all, they already knew its category. These findings place powerful constraints on theories of object recognition.

Book ChapterDOI
TL;DR: The gist of a scene is a spatial representation of the outside world that is rich enough to grasp the meaning of the scene, to recognize a few objects and other salient information in the image, and to facilitate object detection and the deployment of attention.
Abstract: Studies in scene perception have shown that observers recognize a real-world scene at a single glance. During this expeditious process of seeing, the visual system forms a spatial representation of the outside world that is rich enough to grasp the meaning of the scene, to recognize a few objects and other salient information in the image, and to facilitate object detection and the deployment of attention. This representation refers to the gist of a scene, which includes all levels of processing, from low-level features (e.g., color, spatial frequencies) to intermediate image properties (e.g., surface, volume) and high-level information (e.g., objects, activation of semantic knowledge). Therefore, gist can be studied at both perceptual and conceptual levels.

Journal ArticleDOI
TL;DR: In this paper, a Bayesian framework for parsing images into their constituent visual patterns is presented, which optimizes the posterior probability and outputs a scene representation as a "parsing graph", in a spirit similar to parsing sentences in speech and natural language.
Abstract: In this paper we present a Bayesian framework for parsing images into their constituent visual patterns. The parsing algorithm optimizes the posterior probability and outputs a scene representation as a "parsing graph", in a spirit similar to parsing sentences in speech and natural language. The algorithm constructs the parsing graph and re-configures it dynamically using a set of moves, which are mostly reversible Markov chain jumps. This computational framework integrates two popular inference approaches: generative (top-down) methods and discriminative (bottom-up) methods. The former formulates the posterior probability in terms of generative models for images defined by likelihood functions and priors. The latter computes discriminative probabilities based on a sequence (cascade) of bottom-up tests/filters. In our Markov chain algorithm design, the posterior probability, defined by the generative models, is the invariant (target) probability for the Markov chain, and the discriminative probabilities are used to construct proposal probabilities to drive the Markov chain. Intuitively, the bottom-up discriminative probabilities activate top-down generative models. In this paper, we focus on two types of visual patterns: generic visual patterns, such as texture and shading, and object patterns including human faces and text. These types of patterns compete and cooperate to explain the image, and so image parsing unifies image segmentation, object detection, and recognition; if we use generic visual patterns only, then image parsing corresponds to image segmentation (Tu and Zhu, 2002, IEEE Trans. PAMI, 24(5):657-673). We illustrate our algorithm on natural images of complex city scenes and show examples where image segmentation can be improved by allowing object specific knowledge to disambiguate low-level segmentation cues, and conversely where object detection can be improved by using generic visual patterns to explain away shadows and occlusions.

Proceedings ArticleDOI
Claus Bahlmann, Ying Zhu, Visvanathan Ramesh, M. Pellkofer, T. Koehler
06 Jun 2005
TL;DR: This paper describes a computer vision based system for real-time robust traffic sign detection, tracking, and recognition that offers a generic, joint modeling of color and shape information without the need of tuning free parameters.
Abstract: This paper describes a computer vision based system for real-time robust traffic sign detection, tracking, and recognition. Such a framework is of major interest for driver assistance in an intelligent automotive cockpit environment. The proposed approach consists of two components. First, signs are detected using a set of Haar wavelet features obtained from AdaBoost training. Compared to previously published approaches, our solution offers a generic, joint modeling of color and shape information without the need of tuning free parameters. Once detected, objects are efficiently tracked within a temporal information propagation framework. Second, classification is performed using Bayesian generative modeling. Making use of the tracking information, hypotheses are fused over multiple frames. Experiments show high detection and recognition accuracy and a frame rate of approximately 10 frames per second on a standard PC.

Patent
22 Dec 2005
TL;DR: An imaging array sensor captures an image of a scene occurring exteriorly of the vehicle, and a control algorithmically reduces the captured image data set and processes the reduced image data set to extract information from it.
Abstract: An imaging system for a vehicle includes an imaging array sensor and a control. The imaging array sensor comprises a plurality of photo-sensing pixels and is positioned at the vehicle with a field of view exteriorly of the vehicle. The imaging array sensor is operable to capture an image of a scene occurring exteriorly of the vehicle. The captured image comprises an image data set representative of the exterior scene. The control algorithmically reduces the image data set to a reduced image data set. The control processes the reduced image data set to extract information from it. The control selects the reduced image data set based on a steering angle of the vehicle.

Proceedings ArticleDOI
20 Jun 2005
TL;DR: A method for training object detectors using a generalization of the cascade architecture is described, which results in a detection rate and speed comparable to those of the best published detectors while allowing for easier training and a detector with fewer features.
Abstract: We describe a method for training object detectors using a generalization of the cascade architecture, which results in a detection rate and speed comparable to those of the best published detectors while allowing for easier training and a detector with fewer features. In addition, the method allows for quickly calibrating the detector for a target detection rate, false positive rate or speed. One important advantage of our method is that it enables systematic exploration of the ROC surface, which characterizes the trade-off between accuracy and speed for a given classifier.

Proceedings ArticleDOI
06 Nov 2005
TL;DR: It is argued that a camera sensor network containing heterogeneous elements provides numerous benefits over traditional homogeneous sensor networks and that a multi-tier sensor network can reconcile the traditionally conflicting systems goals of latency and energy-efficiency.
Abstract: This paper argues that a camera sensor network containing heterogeneous elements provides numerous benefits over traditional homogeneous sensor networks. We present the design and implementation of SensEye, a multi-tier network of heterogeneous wireless nodes and cameras. To demonstrate its benefits, we implement a surveillance application using SensEye comprising three tasks: object detection, recognition and tracking. We propose novel mechanisms for low-power low-latency detection, low-latency wakeups, efficient recognition and tracking. Our techniques show that a multi-tier sensor network can reconcile the traditionally conflicting systems goals of latency and energy-efficiency. An experimental evaluation of our prototype shows that, when compared to a single-tier prototype, our multi-tier SensEye can achieve an order of magnitude reduction in energy usage while providing comparable surveillance accuracy.

Proceedings ArticleDOI
17 Oct 2005
TL;DR: A very efficient and robust visual object tracking algorithm based on the particle filter that maintains multiple hypotheses and offers robustness against clutter or short period occlusions is presented.
Abstract: A very efficient and robust visual object tracking algorithm based on the particle filter is presented. The method characterizes the tracked objects using color and edge orientation histogram features. While the use of more features and samples can improve the robustness, the computational load required by the particle filter increases. To accelerate the algorithm while retaining robustness we adopt several enhancements in the algorithm. The first is the use of integral images for efficiently computing the color features and edge orientation histograms, which allows a large number of particles and a better description of the targets. Next, the observation likelihood based on multiple features is computed in a coarse-to-fine manner, which allows the computation to quickly focus on the more promising regions. Quasi-random sampling of the particles allows the filter to achieve a higher convergence rate. The resulting tracking algorithm maintains multiple hypotheses and offers robustness against clutter or short period occlusions. Experimental results demonstrate the efficiency and effectiveness of the algorithm for single and multiple object tracking.
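
The core predict-weight-resample loop of such a particle filter fits in a few lines. In the sketch below the observation likelihood is a simple stand-in for the paper's integral-image-based color and edge-orientation histogram similarity, and the random-walk motion model and noise level are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles, weights, likelihood, motion_std=3.0):
    # Predict: diffuse particle states (x, y) with a random-walk motion model.
    particles = particles + rng.normal(0, motion_std, particles.shape)
    # Update: reweight by the observation likelihood of each particle.
    weights = weights * likelihood(particles)
    weights /= weights.sum()
    # Resample: multinomial resampling back to uniform weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

target = np.array([50.0, 40.0])
like = lambda p: np.exp(-0.05 * np.sum((p - target) ** 2, axis=1))
particles = rng.uniform(0, 100, (500, 2))
weights = np.full(500, 1.0 / 500)
for _ in range(10):
    particles, weights = pf_step(particles, weights, like)
print(particles.mean(axis=0))    # converges near the target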

Proceedings ArticleDOI
20 Jun 2005
TL;DR: A principled Bayesian method for detecting and segmenting instances of a particular object category within an image, providing a coherent methodology for combining top-down and bottom-up cues, together with an efficient method, OBJ CUT, for obtaining segmentations using this model.
Abstract: In this paper, we present a principled Bayesian method for detecting and segmenting instances of a particular object category within an image, providing a coherent methodology for combining top-down and bottom-up cues. The work draws together two powerful formulations: pictorial structures (PS) and Markov random fields (MRFs), both of which have efficient algorithms for their solution. The resulting combination, which we call the object category specific MRF, suggests a solution to the problem that has long dogged MRFs, namely that they provide a poor prior for specific shapes. In contrast, our model provides a prior that is global across the image plane using the PS. We develop an efficient method, OBJ CUT, to obtain segmentations using this model. Novel aspects of this method include an efficient algorithm for sampling the PS model, and the observation that the expected log likelihood of the model can be increased by a single graph cut. Results are presented on two object categories, cows and horses. We compare our methods to the state of the art in object category specific image segmentation and demonstrate significant improvements.
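
The "single graph cut" step can be illustrated on a toy binary MRF: unary costs become terminal edges of an s-t graph, smoothness terms become edges between neighboring nodes, and a max-flow routine returns the minimum-energy labeling. The costs below are made-up numbers, and networkx is used purely for brevity.

import networkx as nx

unary = {0: (1.0, 4.0), 1: (3.0, 2.0), 2: (4.0, 1.0)}  # node: (cost_bg, cost_fg)
edges = [(0, 1, 1.5), (1, 2, 1.5)]                     # pairwise smoothness weights

G = nx.DiGraph()
for n, (c_bg, c_fg) in unary.items():
    G.add_edge('s', n, capacity=c_bg)   # paid if n ends up labeled background
    G.add_edge(n, 't', capacity=c_fg)   # paid if n ends up labeled foreground
for u, v, w in edges:
    G.add_edge(u, v, capacity=w)        # paid if u, v take different labels
    G.add_edge(v, u, capacity=w)

cut_value, (source_side, sink_side) = nx.minimum_cut(G, 's', 't')
print(cut_value, sorted(source_side - {'s'}))   # 5.5, nodes labeled foreground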

Proceedings ArticleDOI
Yingli Tian, M. Lu, A. Hampapur
20 Jun 2005
TL;DR: A new method is presented to robustly and efficiently analyze the foreground while detecting the background for a fixed camera view, using mixture-of-Gaussians models and multiple cues.
Abstract: We present a new method to robustly and efficiently analyze the foreground when we detect the background for a fixed camera view by using mixture-of-Gaussians models and multiple cues. The background is modeled by three Gaussian mixtures as in the work of Stauffer and Grimson (1999). Then the intensity and texture information are integrated to remove shadows and to enable the algorithm to work under quick lighting changes. For foreground analysis, the same Gaussian mixture model is employed to detect the static foreground regions without using any tracking or motion information. Then the whole static regions are pushed back to the background model to avoid a common problem in background subtraction: fragmentation (one object becoming multiple parts). The method was tested on our real-time video surveillance system. It is robust and runs at about 130 fps for color images and 150 fps for grayscale images at size 160×120 on a 2 GHz Pentium IV machine with MMX optimization.
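
The Stauffer-Grimson style per-pixel update the paper builds on can be sketched for a single grayscale pixel with K = 3 components; the learning rate, match threshold, and background-weight test below are illustrative simplifications, not the paper's tuned values.

import numpy as np

K, alpha = 3, 0.01
w   = np.array([0.6, 0.3, 0.1])        # component weights
mu  = np.array([100.0, 150.0, 30.0])   # component means
var = np.full(K, 20.0**2)              # component variances

def update(x):
    d2 = (x - mu)**2 / var
    match = np.argmin(d2) if d2.min() < 2.5**2 else None
    if match is None:                  # no match: replace the weakest component
        i = np.argmin(w / np.sqrt(var))
        mu[i], var[i], w[i] = x, 30.0**2, 0.05
    else:                              # matched: adapt that component
        mu[match] += alpha * (x - mu[match])
        var[match] += alpha * ((x - mu[match])**2 - var[match])
        w[:] = (1 - alpha) * w
        w[match] += alpha
    w[:] = w / w.sum()
    # Pixel is background if it matched a sufficiently high-weight component.
    return match is not None and w[match] > 0.25

for x in [101, 99, 102, 250, 100]:
    print(x, update(x))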

Proceedings ArticleDOI
17 Oct 2005
TL;DR: Applied to a database of images of isolated objects, the sharing of parts among objects improves detection accuracy when few training examples are available; the hierarchical probabilistic model is also extended to scenes containing multiple objects.
Abstract: We describe a hierarchical probabilistic model for the detection and recognition of objects in cluttered, natural scenes. The model is based on a set of parts which describe the expected appearance and position, in an object centered coordinate frame, of features detected by a low-level interest operator. Each object category then has its own distribution over these parts, which are shared between objects. We learn the parameters of this model via a Gibbs sampler which uses the graphical model's structure to analytically average over many parameters. Applied to a database of images of isolated objects, the sharing of parts among objects improves detection accuracy when few training examples are available. We also extend this hierarchical framework to scenes containing multiple objects.

Proceedings ArticleDOI
17 Oct 2005
TL;DR: The major contributions are the application of boosted local contour-based features for object detection in a partially supervised learning framework, and an efficient new boosting procedure for simultaneously selecting features and estimating per-feature parameters.
Abstract: We present a novel categorical object detection scheme that uses only local contour-based features. A two-stage, partially supervised learning architecture is proposed: a rudimentary detector is learned from a very small set of segmented images and applied to a larger training set of un-segmented images; the second stage bootstraps these detections to learn an improved classifier while explicitly training against clutter. The detectors are learned with a boosting algorithm which creates a location-sensitive classifier using a discriminative set of features from a randomly chosen dictionary of contour fragments. We present results that are very competitive with other state-of-the-art object detection schemes and show robustness to object articulations, clutter, and occlusion. Our major contributions are the application of boosted local contour-based features for object detection in a partially supervised learning framework, and an efficient new boosting procedure for simultaneously selecting features and estimating per-feature parameters.

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown that all brightness transfer functions from a given camera to another camera lie in a low dimensional subspace and it is demonstrated that this subspace can be used to compute appearance similarity.
Abstract: When viewed from a system of multiple cameras with non-overlapping fields of view, the appearance of an object in one camera view is usually very different from its appearance in another camera view due to the differences in illumination, pose and camera parameters. In order to handle the change in observed colors of an object as it moves from one camera to another, we show that all brightness transfer functions from a given camera to another camera lie in a low dimensional subspace and demonstrate that this subspace can be used to compute appearance similarity. In the proposed approach, the system learns the subspace of inter-camera brightness transfer functions in a training phase during which object correspondences are assumed to be known. Once the training is complete, correspondences are assigned using the maximum a posteriori (MAP) estimation framework using both location and appearance cues. We evaluate the proposed method under several real world scenarios obtaining encouraging results.
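
The central claim, that inter-camera brightness transfer functions (BTFs) lie near a low-dimensional subspace, is easy to illustrate: sample BTFs as 256-vectors, recover a basis by SVD, and use the off-subspace residual as a (dis)similarity score. The gamma-like synthetic BTFs below are hypothetical stand-ins for training correspondences.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 256)
# Hypothetical training BTFs: gamma-like curves with varying exposure and gamma.
btfs = np.stack([a * x**g for a, g in zip(rng.uniform(0.7, 1.3, 50),
                                          rng.uniform(0.5, 2.0, 50))])

mean = btfs.mean(0)
U, S, Vt = np.linalg.svd(btfs - mean, full_matrices=False)
basis = Vt[:4]                          # low-dimensional BTF subspace

def subspace_distance(f):
    r = f - mean
    return np.linalg.norm(r - basis.T @ (basis @ r))   # residual off the subspace

print(subspace_distance(0.9 * x**1.2))   # small: consistent with training BTFs
print(subspace_distance(np.sin(6 * x)))  # large: not a plausible BTF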

Proceedings ArticleDOI
01 Jan 2005
TL;DR: The algorithm presented in this paper, called multiRANSAC, on average performs better than traditional approaches based on the sequential application of a standard RANSAC algorithm followed by the removal of the detected set of inliers.
Abstract: A RANSAC based procedure is described for detecting inliers corresponding to multiple models in a given set of data points. The algorithm we present in this paper (called multiRANSAC) on average performs better than traditional approaches based on the sequential application of a standard RANSAC algorithm followed by the removal of the detected set of inliers. We illustrate the effectiveness of our approach on a synthetic example and apply it to the problem of identifying multiple world planes in pairs of images containing dominant planar structures.
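
For reference, the sequential baseline that multiRANSAC improves on starts from plain single-model RANSAC; the line-fitting sketch below shows one such pass (hypothesize from a minimal sample, count inliers within a tolerance, keep the best hypothesis). Thresholds and iteration counts are assumed values.

import numpy as np

rng = np.random.default_rng(0)

def ransac_line(pts, iters=200, tol=0.05):
    best_inliers = np.zeros(len(pts), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(pts), 2, replace=False)   # minimal sample: 2 points
        p, q = pts[i], pts[j]
        d = q - p
        n = np.array([-d[1], d[0]]) / np.linalg.norm(d)  # unit normal of the line
        dist = np.abs((pts - p) @ n)                     # point-to-line distances
        inliers = dist < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

line = np.column_stack([np.linspace(0, 1, 60), 0.5 * np.linspace(0, 1, 60)])
pts = np.vstack([line + rng.normal(0, 0.01, line.shape),
                 rng.uniform(0, 1, (40, 2))])            # gross outliers
print(ransac_line(pts).sum(), "inliers found")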

Journal ArticleDOI
01 Jun 2005
TL;DR: This paper considers the problem of automatically learning an activity-based semantic scene model from a stream of video data and proposes a scene model that labels regions according to an identifiable activity in each region, such as entry/exit zones, junctions, paths, and stop zones.
Abstract: This paper considers the problem of automatically learning an activity-based semantic scene model from a stream of video data. A scene model is proposed that labels regions according to an identifiable activity in each region, such as entry/exit zones, junctions, paths, and stop zones. We present several unsupervised methods that learn these scene elements and present results that show the efficiency of our approach. Finally, we describe how the models can be used to support the interpretation of moving objects in a visual surveillance environment.

Journal ArticleDOI
TL;DR: A foreground validation algorithm that first builds a foreground mask using a slow-adapting Kalman filter, and then validates individual foreground pixels by a simple moving object model built using both the foreground and background statistics as well as the frame difference is proposed.
Abstract: Identifying moving objects in a video sequence is a fundamental and critical task in many computer-vision applications. Background subtraction techniques are commonly used to separate foreground moving objects from the background. Most background subtraction techniques assume a single rate of adaptation, which is inadequate for complex scenes such as a traffic intersection where objects are moving at different and varying speeds. In this paper, we propose a foreground validation algorithm that first builds a foreground mask using a slow-adapting Kalman filter, and then validates individual foreground pixels by a simple moving object model built using both the foreground and background statistics as well as the frame difference. Ground-truth experiments with urban traffic sequences show that our proposed algorithm significantly improves upon results using only the Kalman filter or frame differencing, and outperforms other techniques based on the mixture of Gaussians, the median filter, and the approximated median filter.
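
The slow-adapting recursive background estimate can be sketched as a per-pixel steady-state Kalman/IIR update whose gain drops for pixels currently flagged as foreground, so moving objects are not absorbed into the background. The gains and threshold below are illustrative, and the paper's moving-object validation model is omitted.

import numpy as np

def update_background(bg, frame, fg_mask, g_bg=0.02, g_fg=0.001):
    gain = np.where(fg_mask, g_fg, g_bg)   # adapt slowly under foreground
    return bg + gain * (frame - bg)

def foreground_mask(bg, frame, thresh=25.0):
    return np.abs(frame - bg) > thresh

rng = np.random.default_rng(0)
bg = np.full((48, 64), 100.0)
for _ in range(50):
    frame = 100 + rng.normal(0, 3, bg.shape)
    frame[20:30, 20:30] = 220                  # a moving-object blob
    mask = foreground_mask(bg, frame)
    bg = update_background(bg, frame, mask)
print(mask[25, 25], mask[5, 5])                # True, False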

Proceedings ArticleDOI
17 Oct 2005
TL;DR: A robust parallax filtering scheme is proposed to accumulate the geometric constraint errors within a sliding window and estimate a likelihood map for pixel classification, which is integrated into the tracking framework based on the spatio-temporal joint probability data association filter (JPDAF).
Abstract: We present a novel approach to detect and track independently moving regions in a 3D scene observed by a moving camera in the presence of strong parallax. Detected moving pixels are classified into independently moving regions or parallax regions by analyzing two geometric constraints: the commonly used epipolar constraint, and the structure consistency constraint. The second constraint is implemented within a "plane+parallax" framework and represented by a bilinear relationship which relates the image points to their relative depths. This newly derived relationship is related to the trilinear tensor, but can be enforced over more than three frames. It does not assume a constant reference plane in the scene and therefore eliminates the need for manual selection of a reference plane. Then, a robust parallax filtering scheme is proposed to accumulate the geometric constraint errors within a sliding window and estimate a likelihood map for pixel classification. The likelihood map is integrated into our tracking framework based on the spatio-temporal joint probability data association filter (JPDAF). This tracking approach infers the trajectory and bounding box of the moving objects by searching the optimal path with maximum joint probability within a fixed-size buffer. We demonstrate the performance of the proposed approach on real video sequences where parallax effects are significant.

Journal ArticleDOI
TL;DR: A novel approach for clustering shots into scenes by transforming the task into a graph partitioning problem; the method also automates the selection of a representative key-frame per scene, which is useful for applications such as video-on-demand, digital libraries, and the Internet.
Abstract: This paper presents a method to perform a high-level segmentation of videos into scenes. A scene can be defined as a subdivision of a play in which either the setting is fixed, or when it presents continuous action in one place. We exploit this fact and propose a novel approach for clustering shots into scenes by transforming this task into a graph partitioning problem. This is achieved by constructing a weighted undirected graph called a shot similarity graph (SSG), where each node represents a shot and the edges between the shots are weighted by their similarity based on color and motion information. The SSG is then split into subgraphs by applying the normalized cuts for graph partitioning. The partitions so obtained represent individual scenes in the video. When clustering the shots, we consider the global similarities of shots rather than the individual shot pairs. We also propose a method to describe the content of each scene by selecting one representative image from the video as a scene key-frame. Recently, DVDs have become available with a chapter selection option where each chapter is represented by one image. Our algorithm automates this objective which is useful for applications such as video-on-demand, digital libraries, and the Internet. Experiments are presented with promising results on several Hollywood movies and one sitcom.
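
The shot-similarity-graph construction plus normalized-cut split can be illustrated with a synthetic similarity matrix: the spectral relaxation labels shots by the sign of the Fiedler vector of the normalized Laplacian. The paper computes the edge weights from color and motion features; random block-structured weights stand in here.

import numpy as np

rng = np.random.default_rng(0)
n = 10
W = rng.uniform(0.0, 0.2, (n, n))          # weak cross-scene similarity
W[:5, :5] += 0.8                            # shots 0-4: one scene
W[5:, 5:] += 0.8                            # shots 5-9: another scene
W = (W + W.T) / 2
np.fill_diagonal(W, 0)

d = W.sum(1)
L = np.diag(d) - W
L_sym = L / np.sqrt(np.outer(d, d))         # normalized Laplacian D^-1/2 L D^-1/2
vals, vecs = np.linalg.eigh(L_sym)
fiedler = vecs[:, 1]                        # second-smallest eigenvector
print((fiedler > 0).astype(int))            # two scene clusters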

Proceedings ArticleDOI
17 Oct 2005
TL;DR: A two-layer hierarchical formulation exploits different levels of contextual information in images for robust classification, and is general enough to be applied to different domains ranging from pixelwise image labeling to contextual object detection.
Abstract: We present a two-layer hierarchical formulation to exploit different levels of contextual information in images for robust classification. Each layer is modeled as a conditional field that allows one to capture arbitrary observation-dependent label interactions. The proposed framework has two main advantages. First, it encodes both the short-range interactions (e.g., pixelwise label smoothing) as well as the long-range interactions (e.g., relative configurations of objects or regions) in a tractable manner. Second, the formulation is general enough to be applied to different domains ranging from pixelwise image labeling to contextual object detection. The parameters of the model are learned using a sequential maximum-likelihood approximation. The benefits of the proposed framework are demonstrated on four different datasets, and comparison results are presented.