
Showing papers on "Object detection published in 2006"


Journal ArticleDOI
TL;DR: The goal of this article is to review state-of-the-art tracking methods, classify them into different categories, identify new trends, and discuss important issues related to tracking, including the use of appropriate image features, selection of motion models, and detection of objects.
Abstract: The goal of this article is to review the state-of-the-art tracking methods, classify them into different categories, and identify new trends. Object tracking, in general, is a challenging problem. Difficulties in tracking objects can arise due to abrupt object motion, changing appearance patterns of both the object and the scene, nonrigid object structures, object-to-object and object-to-scene occlusions, and camera motion. Tracking is usually performed in the context of higher-level applications that require the location and/or shape of the object in every frame. Typically, assumptions are made to constrain the tracking problem in the context of a particular application. In this survey, we categorize the tracking methods on the basis of the object and motion representations used, provide detailed descriptions of representative methods in each category, and examine their pros and cons. Moreover, we discuss the important issues related to tracking including the use of appropriate image features, selection of motion models, and detection of objects.

5,318 citations


Proceedings ArticleDOI
17 Jun 2006
TL;DR: This work integrates the cascade-of-rejectors approach with Histograms of Oriented Gradients features to achieve a fast and accurate human detection system that can process 5 to 30 frames per second, depending on the density at which the image is scanned, while maintaining an accuracy level similar to existing methods.
Abstract: We integrate the cascade-of-rejectors approach with the Histograms of Oriented Gradients (HoG) features to achieve a fast and accurate human detection system. The features used in our system are HoGs of variable-size blocks that capture salient features of humans automatically. Using AdaBoost for feature selection, we identify the appropriate set of blocks from a large set of possible blocks. In our system, we use the integral image representation and a rejection cascade, which significantly speed up the computation. For a 320 × 280 image, the system can process 5 to 30 frames per second depending on the density at which we scan the image, while maintaining an accuracy level similar to existing methods.
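
A minimal sketch of the integral-image trick that makes variable-size HoG blocks cheap to evaluate: one 2D prefix-sum image per orientation bin, so any rectangular block's histogram costs a handful of lookups. This is an illustration in NumPy under assumed parameters (9 unsigned orientation bins, L2 block normalisation), not the authors' implementation.

```python
import numpy as np

def integral_orientation_histograms(gray, n_bins=9):
    """Build one integral image per orientation bin from gradient magnitudes."""
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)              # unsigned orientation
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    integrals = []
    for b in range(n_bins):
        channel = np.where(bins == b, mag, 0.0)
        integrals.append(channel.cumsum(0).cumsum(1))     # 2D prefix sums
    return integrals

def block_histogram(integrals, top, left, height, width):
    """Histogram of one variable-size block via four lookups per bin."""
    def rect_sum(ii, t, l, h, w):
        total = ii[t + h - 1, l + w - 1]
        if t > 0: total -= ii[t - 1, l + w - 1]
        if l > 0: total -= ii[t + h - 1, l - 1]
        if t > 0 and l > 0: total += ii[t - 1, l - 1]
        return total
    hist = np.array([rect_sum(ii, top, left, height, width) for ii in integrals])
    return hist / (np.linalg.norm(hist) + 1e-6)            # L2 block normalisation

# Example: histogram of a 32x16 block in a random stand-in "image"
img = np.random.rand(128, 64)
ii = integral_orientation_histograms(img)
h = block_histogram(ii, 10, 8, 32, 16)
```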

1,626 citations


Journal ArticleDOI
TL;DR: A novel and efficient texture-based method for modeling the background and detecting moving objects from a video sequence that provides many advantages compared to the state-of-the-art.
Abstract: This paper presents a novel and efficient texture-based method for modeling the background and detecting moving objects from a video sequence. Each pixel is modeled as a group of adaptive local binary pattern histograms that are calculated over a circular region around the pixel. The approach provides us with many advantages compared to the state-of-the-art. Experimental results clearly justify our model.
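
To make the idea concrete, here is a toy sketch of texture-based background subtraction: a basic 3x3 LBP operator, block histograms, an exponentially updated background histogram, and a histogram-intersection test. The paper's model is richer (circular neighbourhoods and a group of adaptive histograms per pixel); the learning rate and threshold below are illustrative assumptions.

```python
import numpy as np

def lbp8(gray):
    """Basic 8-neighbour LBP codes for the interior pixels of a grayscale image."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(shifts):
        neigh = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (neigh >= c).astype(np.int32) << bit
    return code

def block_histogram(codes):
    """Normalised 256-bin histogram of LBP codes in a region."""
    h = np.bincount(codes.ravel(), minlength=256).astype(float)
    return h / h.sum()

def update_background(model_hist, frame_hist, alpha=0.05):
    """Adaptive background model: exponential moving average of histograms."""
    return (1.0 - alpha) * model_hist + alpha * frame_hist

def is_background(model_hist, frame_hist, threshold=0.7):
    """Histogram intersection: high overlap means the block matches the background."""
    return float(np.minimum(model_hist, frame_hist).sum()) >= threshold
```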

1,355 citations


Journal ArticleDOI
TL;DR: A review of recent vision-based on-road vehicle detection systems in which the camera is mounted on the vehicle, rather than being fixed as in traffic/driveway monitoring systems, is presented.
Abstract: Developing on-board automotive driver assistance systems that aim to alert drivers about the driving environment and possible collisions with other vehicles has attracted a lot of attention lately. In these systems, robust and reliable vehicle detection is a critical step. This paper presents a review of recent vision-based on-road vehicle detection systems. Our focus is on systems where the camera is mounted on the vehicle rather than being fixed, such as in traffic/driveway monitoring systems. First, we discuss the problem of on-road vehicle detection using optical sensors, followed by a brief review of intelligent vehicle research worldwide. Then, we discuss active and passive sensors to set the stage for vision-based vehicle detection. Methods aiming to quickly hypothesize the location of vehicles in an image, as well as to verify the hypothesized locations, are reviewed next. Integrating detection with tracking is also reviewed to illustrate the benefits of exploiting temporal continuity for vehicle detection. Finally, we present a critical overview of the methods discussed, assess their potential for future deployment, and present directions for future research.

1,181 citations


Proceedings ArticleDOI
17 Jun 2006
TL;DR: This paper proposes a novel on-line AdaBoost feature selection method and demonstrates the versatility of the method on such diverse tasks as learning complex background models, visual tracking and object detection.
Abstract: Boosting has become very popular in computer vision, showing impressive performance in detection and recognition tasks. Mainly off-line training methods have been used, which implies that all training data has to be given a priori; training and usage of the classifier are separate steps. Training the classifier on-line and incrementally as new data becomes available has several advantages and opens new areas of application for boosting in computer vision. In this paper we propose a novel on-line AdaBoost feature selection method. In conjunction with efficient feature extraction methods, the method is capable of real-time operation. We demonstrate the versatility of the method on such diverse tasks as learning complex background models, visual tracking and object detection. All approaches benefit significantly from the on-line training.
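
A much-simplified sketch of the online boosting idea: each weak learner keeps running class means for one feature and a weighted error estimate, and a selector exposes whichever weak learner in its pool currently has the lowest error, propagating an AdaBoost-style importance weight to the next selector. The paper's selector/feature-pool bookkeeping and feature types are more elaborate, so treat this as an assumed simplification.

```python
import numpy as np

class OnlineWeakLearner:
    """Threshold classifier on one feature with running class means,
    plus a weighted error estimate updated online."""
    def __init__(self):
        self.mean_pos = 0.0; self.n_pos = 0
        self.mean_neg = 0.0; self.n_neg = 0
        self.correct = 1.0; self.wrong = 1.0

    def predict(self, value):
        threshold = 0.5 * (self.mean_pos + self.mean_neg)
        sign = 1 if self.mean_pos >= self.mean_neg else -1
        return sign if value >= threshold else -sign

    def update(self, value, label, weight):
        if label > 0:
            self.n_pos += 1; self.mean_pos += (value - self.mean_pos) / self.n_pos
        else:
            self.n_neg += 1; self.mean_neg += (value - self.mean_neg) / self.n_neg
        if self.predict(value) == label: self.correct += weight
        else:                            self.wrong += weight

    def error(self):
        return self.wrong / (self.correct + self.wrong)

class Selector:
    """Keeps a pool of weak learners and exposes the currently best one."""
    def __init__(self, n_features):
        self.pool = [OnlineWeakLearner() for _ in range(n_features)]
        self.best = 0

    def update(self, features, label, weight):
        for value, wl in zip(features, self.pool):
            wl.update(value, label, weight)
        errors = [wl.error() for wl in self.pool]
        self.best = int(np.argmin(errors))
        err = max(min(errors[self.best], 1 - 1e-6), 1e-6)
        correct = self.pool[self.best].predict(features[self.best]) == label
        # importance weight passed on to the next selector, as in online AdaBoost
        next_weight = weight * (0.5 / (1 - err) if correct else 0.5 / err)
        alpha = 0.5 * np.log((1 - err) / err)     # voting weight of this selector
        return next_weight, alpha
```

A full strong classifier would chain several such selectors and combine their best weak learners with the alpha voting weights.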

1,159 citations


Proceedings ArticleDOI
23 Oct 2006
TL;DR: The proposed spatiotemporal video attention framework has been applied to over 20 test video sequences, and attended regions are detected that highlight the interesting objects and motions present in the sequences, with a very high user satisfaction rate.
Abstract: The human vision system actively seeks interesting regions in images to reduce the search effort in tasks such as object detection and recognition. Similarly, prominent actions in video sequences are more likely to attract our first sight than their surrounding neighbors. In this paper, we propose a spatiotemporal video attention detection technique for detecting the attended regions that correspond to both interesting objects and actions in video sequences. Both spatial and temporal saliency maps are constructed and further fused in a dynamic fashion to produce the overall spatiotemporal attention model. In the temporal attention model, motion contrast is computed based on the planar motions (homography) between images, estimated by applying RANSAC on point correspondences in the scene. To compensate for the non-uniform spatial distribution of interest points, the spanning areas of motion segments are incorporated in the motion contrast computation. In the spatial attention model, a fast method for computing pixel-level saliency maps has been developed using color histograms of images. A hierarchical spatial attention representation is established to reveal the interesting points in images as well as the interesting regions. Finally, a dynamic fusion technique is applied to combine both the temporal and spatial saliency maps, where temporal attention is dominant over the spatial model when large motion contrast exists, and vice versa. The proposed spatiotemporal attention framework has been applied to over 20 test video sequences, and attended regions are detected that highlight the interesting objects and motions present in the sequences, with a very high user satisfaction rate.
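
The temporal part can be sketched with standard tools: estimate the dominant planar motion (homography) with RANSAC over tracked points, score each point by its residual from that motion, and fuse the temporal and spatial saliency maps with a weight driven by the overall motion contrast. The OpenCV calls and the linear fusion rule below are illustrative assumptions, not the authors' exact pipeline.

```python
import cv2
import numpy as np

def motion_contrast(prev_gray, curr_gray, max_pts=400):
    """Score tracked points by how far they deviate from the dominant homography."""
    pts = cv2.goodFeaturesToTrack(prev_gray, max_pts, 0.01, 5)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    ok = status.ravel() == 1
    src, dst = pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)        # dominant planar motion
    proj = cv2.perspectiveTransform(src.reshape(-1, 1, 2), H).reshape(-1, 2)
    return src, np.linalg.norm(dst - proj, axis=1)               # residual = motion contrast

def fuse_saliency(spatial_map, temporal_map, motion_contrast_level, k=0.5):
    """Dynamic fusion: the temporal map dominates when motion contrast is high."""
    w = float(np.clip(k * motion_contrast_level, 0.0, 1.0))
    return w * temporal_map + (1.0 - w) * spatial_map
```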

983 citations


Journal ArticleDOI
17 Jun 2006
TL;DR: This paper provides a framework for placing local object detection in the context of the overall 3D scene by modeling the interdependence of objects, surface orientations, and camera viewpoint, allowing probabilistic object hypotheses to refine geometry and vice versa.
Abstract: Image understanding requires not only individually estimating elements of the visual world but also capturing the interplay among them. In this paper, we provide a framework for placing local object detection in the context of the overall 3D scene by modeling the interdependence of objects, surface orientations, and camera viewpoint. Most object detection methods consider all scales and locations in the image as equally likely. We show that with probabilistic estimates of 3D geometry, both in terms of surfaces and world coordinates, we can put objects into perspective and model the scale and location variance in the image. Our approach reflects the cyclical nature of the problem by allowing probabilistic object hypotheses to refine geometry and vice-versa. Our framework allows painless substitution of almost any object detector and is easily extended to include other aspects of image understanding. Our results confirm the benefits of our integrated approach.

929 citations


Journal ArticleDOI
TL;DR: A keypoint-based approach is developed that formulates wide-baseline matching of keypoints extracted from the input images to those found in the model images as a classification problem, shifting much of the computational burden to a training phase without sacrificing recognition performance.
Abstract: In many 3D object-detection and pose-estimation problems, runtime performance is of critical importance. However, there usually is time to train the system, which we show to be very useful. Assuming that several registered images of the target object are available, we develop a keypoint-based approach that is effective in this context by formulating wide-baseline matching of keypoints extracted from the input images to those found in the model images as a classification problem. This shifts much of the computational burden to a training phase, without sacrificing recognition performance. The resulting algorithm is robust, accurate, and fast enough for frame-rate performance. This reduction in runtime computational complexity is our first contribution. Our second contribution is to show that, in this context, a simple and fast keypoint detector suffices to support detection and tracking even under large perspective and scale variations. While earlier methods require a detector that produces very repeatable results, which usually is very time-consuming, we simply find the most repeatable object keypoints for the specific target object during the training phase. We have incorporated these ideas into a real-time system that detects planar, nonplanar, and deformable objects. It then estimates the pose of the rigid ones and the deformations of the others.

843 citations


Proceedings ArticleDOI
17 Jun 2006
TL;DR: The covariance tracking method does not make any assumption on the measurement noise or the motion of the tracked objects and provides the global optimal solution; it is shown to be capable of accurately detecting nonrigid, moving objects in non-stationary camera sequences.
Abstract: We propose a simple and elegant algorithm to track nonrigid objects using a covariance based object description and a Lie algebra based update mechanism. We represent an object window as the covariance matrix of features, therefore we manage to capture the spatial and statistical properties as well as their correlation within the same representation. The covariance matrix enables efficient fusion of different types of features and modalities, and its dimensionality is small. We incorporated a model update algorithm using the Lie group structure of the positive definite matrices. The update mechanism effectively adapts to the undergoing object deformations and appearance changes. The covariance tracking method does not make any assumption on the measurement noise and the motion of the tracked objects, and provides the global optimal solution. We show that it is capable of accurately detecting the nonrigid, moving objects in non-stationary camera sequences while achieving a promising detection rate of 97.4 percent.
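
A sketch of the core descriptor and a common way to compare covariance matrices (a dissimilarity based on generalized eigenvalues); the feature set used here (x, y, intensity, gradient magnitudes) and the omission of the Lie-algebra model update are simplifying assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def region_covariance(gray, top, left, h, w):
    """5x5 covariance of per-pixel features (x, y, intensity, |Ix|, |Iy|) in a window."""
    g = gray.astype(np.float64)
    gy, gx = np.gradient(g)
    ys, xs = np.mgrid[top:top + h, left:left + w]
    win = (slice(top, top + h), slice(left, left + w))
    feats = np.stack([xs.ravel(), ys.ravel(), g[win].ravel(),
                      np.abs(gx[win]).ravel(), np.abs(gy[win]).ravel()])
    return np.cov(feats)

def covariance_distance(c1, c2, eps=1e-6):
    """Dissimilarity from the generalized eigenvalues of the two covariance matrices."""
    d = c1.shape[0]
    lam = eigh(c1 + eps * np.eye(d), c2 + eps * np.eye(d), eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```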

627 citations


Journal ArticleDOI
TL;DR: Experimental results show the robustness of the tracking algorithm, the efficiency of the algorithm for learning motion patterns, and the encouraging performance of algorithms for anomaly detection and behavior prediction.
Abstract: Analysis of motion patterns is an effective approach for anomaly detection and behavior prediction. Current approaches for the analysis of motion patterns depend on known scenes, where objects move in predefined ways. It is highly desirable to automatically construct object motion patterns which reflect the knowledge of the scene. In this paper, we present a system for automatically learning motion patterns for anomaly detection and behavior prediction based on a proposed algorithm for robustly tracking multiple objects. In the tracking algorithm, foreground pixels are clustered using a fast, accurate fuzzy k-means algorithm. Growing and prediction of the cluster centroids of foreground pixels ensure that each cluster centroid is associated with a moving object in the scene. In the algorithm for learning motion patterns, trajectories are clustered hierarchically using spatial and temporal information and then each motion pattern is represented with a chain of Gaussian distributions. Based on the learned statistical motion patterns, statistical methods are used to detect anomalies and predict behaviors. Our system is tested using image sequences acquired, respectively, from a crowded real traffic scene and a model traffic scene. Experimental results show the robustness of the tracking algorithm, the efficiency of the algorithm for learning motion patterns, and the encouraging performance of algorithms for anomaly detection and behavior prediction.

575 citations


Journal ArticleDOI
TL;DR: This paper sets out a tracking framework, which is applied to the recovery of three-dimensional hand motion from an image sequence, and handles the issues of initialization, tracking, and recovery in a unified way.
Abstract: This paper sets out a tracking framework, which is applied to the recovery of three-dimensional hand motion from an image sequence. The method handles the issues of initialization, tracking, and recovery in a unified way. In a single input image with no prior information of the hand pose, the algorithm is equivalent to a hierarchical detection scheme, where unlikely pose candidates are rapidly discarded. In image sequences, a dynamic model is used to guide the search and approximate the optimal filtering equations. The dynamic model is given by transition probabilities between regions in parameter space and is learned from training data obtained by capturing articulated motion. The algorithm is evaluated on a number of image sequences, which include hand motion with self-occlusion in front of a cluttered background.

Journal ArticleDOI
TL;DR: In this paper, nonlinear pose estimation is formulated by means of a virtual visual servoing approach and has been validated on several complex image sequences including outdoor environments.
Abstract: Tracking is a very important research subject in a real-time augmented reality context. The main requirements for trackers are high accuracy and low latency at a reasonable cost. In order to address these issues, a real-time, robust, and efficient 3D model-based tracking algorithm is proposed for a "video see through" monocular vision system. The tracking of objects in the scene amounts to calculating the pose between the camera and the objects. Virtual objects can then be projected into the scene using the pose. In this paper, nonlinear pose estimation is formulated by means of a virtual visual servoing approach. In this context, the derivation of point-to-curve interaction matrices is given for different 3D geometrical primitives including straight lines, circles, cylinders, and spheres. A local moving edges tracker is used in order to provide real-time tracking of points normal to the object contours. Robustness is obtained by integrating an M-estimator into the visual control law via an iteratively reweighted least squares implementation. This approach is then extended to address the 3D model-free augmented reality problem. The method presented in this paper has been validated on several complex image sequences including outdoor environments. Results show the method to be robust to occlusion, changes in illumination, and mistracking.
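
The robustness ingredient can be illustrated generically: iteratively reweighted least squares with an M-estimator (Huber weights over a robust scale estimate). The sketch below solves a plain linear system rather than the visual-servoing control law with its interaction matrices, so it only shows the weighting mechanism.

```python
import numpy as np

def huber_weights(residuals, k=1.345):
    """Huber M-estimator weights over a robust (MAD-based) scale estimate."""
    a = np.abs(residuals)
    scale = 1.4826 * np.median(a) + 1e-12
    r = a / scale
    w = np.ones_like(r)
    w[r > k] = k / r[r > k]
    return w

def irls(A, b, n_iter=10):
    """Robustly solve A x ~= b: re-solve weighted least squares as weights change."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(n_iter):
        w = np.sqrt(huber_weights(b - A @ x))
        x = np.linalg.lstsq(w[:, None] * A, w * b, rcond=None)[0]
    return x
```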

Proceedings ArticleDOI
17 Jun 2006
TL;DR: This paper proposes a novel supervised learning algorithm for edge and object boundary detection which it refers to as Boosted Edge Learning or BEL for short and shows applications for edge detection in a number of specific image domains as well as on natural images.
Abstract: Edge detection is one of the most studied problems in computer vision, yet it remains a very challenging task. It is difficult because the decision for an edge often cannot be made purely on the basis of low-level cues such as gradient; instead, we need to engage all levels of information, low, middle, and high, in order to decide where to put edges. In this paper we propose a novel supervised learning algorithm for edge and object boundary detection which we refer to as Boosted Edge Learning, or BEL for short. A decision about an edge point is made independently at each location in the image; a very large aperture is used, providing significant context for each decision. In the learning stage, the algorithm selects and combines a large number of features across different scales in order to learn a discriminative model using an extended version of the Probabilistic Boosting Tree classification algorithm. The learning-based framework is highly adaptive and there are no parameters to tune. We show applications for edge detection in a number of specific image domains as well as on natural images. We test on various datasets including the Berkeley dataset and the results obtained are very good.

Proceedings ArticleDOI
17 Jun 2006
TL;DR: Testing on 750 artificial and natural scenes shows that the model’s predictions are consistent with a large body of available literature on human psychophysics of visual search, suggesting that it may provide a good approximation of how humans combine bottom-up and top-down cues.
Abstract: Integration of goal-driven, top-down attention and image-driven, bottom-up attention is crucial for visual search. Yet, previous research has mostly focused on models that are purely top-down or bottom-up. Here, we propose a new model that combines both. The bottom-up component computes the visual salience of scene locations in different feature maps extracted at multiple spatial scales. The top-down component uses accumulated statistical knowledge of the visual features of the desired search target and background clutter to optimally tune the bottom-up maps such that target detection speed is maximized. Testing on 750 artificial and natural scenes shows that the model’s predictions are consistent with a large body of available literature on human psychophysics of visual search. These results suggest that our model may provide a good approximation of how humans combine bottom-up and top-down cues so as to optimize target detection speed.
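
The gist of the top-down tuning can be sketched as weighting each bottom-up feature map by the target-to-background response ratio in that feature and summing the weighted maps; the closed-form SNR-style weights below are an assumed simplification of the paper's optimal-gain derivation.

```python
import numpy as np

def topdown_weights(target_responses, background_responses, eps=1e-6):
    """One weight per feature map: target-to-background response ratio (an SNR)."""
    snr = (np.asarray(target_responses, float) + eps) / (np.asarray(background_responses, float) + eps)
    return snr / snr.sum()

def tuned_saliency(feature_maps, weights):
    """Weighted sum of bottom-up maps, biased toward features diagnostic of the target."""
    return sum(w * m for w, m in zip(weights, feature_maps))
```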

Journal ArticleDOI
TL;DR: A generic model for unsupervised extraction of viewer's attention objects from color images by integrating computational visual attention mechanisms with attention object growing techniques and describes the MRF by a Gibbs random field with an energy function.
Abstract: This paper proposes a generic model for unsupervised extraction of viewer's attention objects from color images. Without the full semantic understanding of image content, the model formulates the attention objects as a Markov random field (MRF) by integrating computational visual attention mechanisms with attention object growing techniques. Furthermore, we describe the MRF by a Gibbs random field with an energy function. The minimization of the energy function provides a practical way to obtain attention objects. Experimental results on 880 real images and user subjective evaluations by 16 subjects demonstrate the effectiveness of the proposed approach.

Journal Article
TL;DR: A Bayesian framework for parsing images into their constituent visual patterns that optimizes the posterior probability and outputs a scene representation as a “parsing graph”, in a spirit similar to parsing sentences in speech and natural language is presented.
Abstract: In this chapter we present a Bayesian framework for parsing images into their constituent visual patterns. The parsing algorithm optimizes the posterior probability and outputs a scene representation as a parsing graph, in a spirit similar to parsing sentences in speech and natural language. The algorithm constructs the parsing graph and re-configures it dynamically using a set of moves, which are mostly reversible Markov chain jumps. This computational framework integrates two popular inference approaches: generative (top-down) methods and discriminative (bottom-up) methods. The former formulates the posterior probability in terms of generative models for images defined by likelihood functions and priors. The latter computes discriminative probabilities based on a sequence (cascade) of bottom-up tests/filters. In our Markov chain algorithm design, the posterior probability, defined by the generative models, is the invariant (target) probability for the Markov chain, and the discriminative probabilities are used to construct proposal probabilities to drive the Markov chain. Intuitively, the bottom-up discriminative probabilities activate top-down generative models. In this chapter, we focus on two types of visual patterns: generic visual patterns, such as texture and shading, and object patterns including human faces and text. These types of patterns compete and cooperate to explain the image, and so image parsing unifies image segmentation, object detection, and recognition (if we use only generic visual patterns, image parsing corresponds to image segmentation [48]). We illustrate our algorithm on natural images of complex city scenes and show examples where image segmentation can be improved by allowing object-specific knowledge to disambiguate low-level segmentation cues, and conversely where object detection can be improved by using generic visual patterns to explain away shadows and occlusions.

Book ChapterDOI
07 May 2006
TL;DR: The BFM detector is able to represent and detect object classes principally defined by their shape, rather than their appearance, and to achieve this with less supervision (e.g., fewer training images).
Abstract: The objective of this work is the detection of object classes, such as airplanes or horses. Instead of using a model based on salient image fragments, we show that object class detection is also possible using only the object's boundary. To this end, we develop a novel learning technique to extract class-discriminative boundary fragments. In addition to their shape, these “codebook” entries also determine the object's centroid (in the manner of Leibe et al. [19]). Boosting is used to select discriminative combinations of boundary fragments (weak detectors) to form a strong “Boundary-Fragment-Model” (BFM) detector. The generative aspect of the model is used to determine an approximate segmentation. We demonstrate the following results: (i) the BFM detector is able to represent and detect object classes principally defined by their shape, rather than their appearance; and (ii) in comparison with other published results on several object classes (airplanes, cars-rear, cows) the BFM detector is able to exceed previous performances, and to achieve this with less supervision (such as the number of training images).

Proceedings ArticleDOI
17 Jun 2006
TL;DR: It is shown that architectures such as convolutional networks are good at learning invariant features, but not always optimal for classification, while Support Vector Machines are good at producing decision surfaces from well-behaved feature vectors, but cannot learn complicated invariances.
Abstract: The detection and recognition of generic object categories with invariance to viewpoint, illumination, and clutter requires the combination of a feature extractor and a classifier. We show that architectures such as convolutional networks are good at learning invariant features, but not always optimal for classification, while Support Vector Machines are good at producing decision surfaces from well-behaved feature vectors, but cannot learn complicated invariances. We present a hybrid system where a convolutional network is trained to detect and recognize generic objects, and a Gaussian-kernel SVM is trained from the features learned by the convolutional network. Results are given on a large generic object recognition task with six categories (human figures, four-legged animals, airplanes, trucks, cars, and "none of the above"), with multiple instances of each object category under various poses, illuminations, and backgrounds. On the test set, which contains different object instances than the training set, an SVM alone yields a 43.3% error rate, a convolutional net alone yields 7.2%, and an SVM on top of features produced by the convolutional net yields 5.9%.
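
A hedged sketch of the hybrid architecture using modern libraries (not the original system): a small convolutional network, untrained here and standing in for the trained feature extractor, produces feature vectors, and an RBF-kernel SVM is fit on top of them.

```python
import torch
import torch.nn as nn
from sklearn.svm import SVC

# Stand-in convolutional feature extractor (untrained; the real system trains
# the network first and then reuses its learned features).
feature_net = nn.Sequential(
    nn.Conv2d(1, 8, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
)

def extract_features(images):
    """images: float tensor (N, 1, 32, 32) -> (N, D) NumPy feature matrix."""
    with torch.no_grad():
        return feature_net(images).numpy()

# Toy usage with random data standing in for the object recognition dataset.
images = torch.rand(20, 1, 32, 32)
labels = [0, 1] * 10
svm = SVC(kernel="rbf", gamma="scale")
svm.fit(extract_features(images), labels)
```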

Journal ArticleDOI
TL;DR: The performance of a detection algorithm is illustrated intuitively by performance graphs which present object level precision and recall depending on constraints on detection quality, and a representative single performance value is computed from the graphs.
Abstract: Evaluation of object detection algorithms is a non-trivial task: a detection result is usually evaluated by comparing the bounding box of the detected object with the bounding box of the ground truth object. The commonly used precision and recall measures are computed from the overlap area of these two rectangles. However, these measures have several drawbacks: they don't give intuitive information about the proportion of the correctly detected objects and the number of false alarms, and they cannot be accumulated across multiple images without creating ambiguity in their interpretation. Furthermore, quantitative and qualitative evaluation is often mixed resulting in ambiguous measures. In this paper we propose a new approach which tackles these problems. The performance of a detection algorithm is illustrated intuitively by performance graphs which present object level precision and recall depending on constraints on detection quality. In order to compare different detection algorithms, a representative single performance value is computed from the graphs. The influence of the test database on the detection performance is illustrated by performance/generality graphs. The evaluation method can be applied to different types of object detection algorithms. It has been tested on different text detection algorithms, among which are the participants of the ICDAR 2003 text detection competition.
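
For reference, the conventional bounding-box evaluation that the paper starts from can be written in a few lines: match each detection to the ground truth by intersection-over-union and count object-level true positives. The 0.5 overlap threshold and greedy matching are common conventions, not the paper's proposal.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def precision_recall(detections, ground_truth, thresh=0.5):
    """Object-level precision/recall with greedy one-to-one matching."""
    matched, tp = set(), 0
    for det in detections:
        overlaps = [(iou(det, gt), i) for i, gt in enumerate(ground_truth) if i not in matched]
        if overlaps:
            best_iou, best_i = max(overlaps)
            if best_iou >= thresh:
                matched.add(best_i)
                tp += 1
    precision = tp / max(len(detections), 1)
    recall = tp / max(len(ground_truth), 1)
    return precision, recall
```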

Dissertation
17 Jul 2006
TL;DR: This thesis introduces grids of locally normalised Histograms of Oriented Gradients (HOG) as descriptors for object detection in static images and proposes descriptors based on oriented histograms of differential optical flow to detect moving humans in videos.
Abstract: This thesis targets the detection of humans and other object classes in images and videos. Our focus is on developing robust feature extraction algorithms that encode image regions as high-dimensional feature vectors that support high accuracy object/non-object decisions. To test our feature sets we adopt a relatively simple learning framework that uses linear Support Vector Machines to classify each possible image region as an object or as a non-object. The approach is data-driven and purely bottom-up, using low-level appearance and motion vectors to detect objects. As a test case we focus on person detection, as people are one of the most challenging object classes with many applications, for example in film and video analysis, pedestrian detection for smart cars and video surveillance. Nevertheless we do not make any strong class-specific assumptions and the resulting object detection framework also gives state-of-the-art performance for many other classes including cars, motorbikes, cows and sheep. This thesis makes four main contributions. Firstly, we introduce grids of locally normalised Histograms of Oriented Gradients (HOG) as descriptors for object detection in static images. The HOG descriptors are computed over dense and overlapping grids of spatial blocks, with image gradient orientation features extracted at fixed resolution and gathered into a high-dimensional feature vector. They are designed to be robust to small changes in image contour locations and directions, and significant changes in image illumination and colour, while remaining highly discriminative for overall visual form. We show that unsmoothed gradients, fine orientation voting, moderately coarse spatial binning, strong normalisation and overlapping blocks are all needed for good performance. Secondly, to detect moving humans in videos, we propose descriptors based on oriented histograms of differential optical flow. These are similar to static HOG descriptors, but instead of image gradients, they are based on local differentials of dense optical flow. They encode the noisy optical flow estimates into robust feature vectors in a manner that is robust to the overall camera motion. Several variants are proposed, some capturing motion boundaries while others encode the relative motions of adjacent image regions. Thirdly, we propose a general method based on kernel density estimation for fusing multiple overlapping detections, which takes into account the number of detections, their confidence scores and the scales of the detections. Lastly, we present work in progress on a parts-based approach to person detection that first detects local body parts like heads, torsos, and legs and then fuses them to create a global overall person detector.
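
The fusion contribution can be approximated by mean shift over detection centres weighted by confidence; the thesis actually works in a 3D position-scale space with scale-adaptive bandwidths, so the 2D version and fixed bandwidth below are simplifying assumptions.

```python
import numpy as np

def fuse_detections(centres, scores, bandwidth=16.0, n_iter=20):
    """Mean shift over detection centres weighted by confidence; returns fused modes."""
    centres = np.asarray(centres, float)
    scores = np.asarray(scores, float)
    pts = centres.copy()
    for _ in range(n_iter):
        for i in range(len(pts)):
            d2 = np.sum((centres - pts[i]) ** 2, axis=1)
            w = scores * np.exp(-d2 / (2.0 * bandwidth ** 2))
            pts[i] = (w[:, None] * centres).sum(axis=0) / w.sum()
    modes = []
    for p in pts:                    # collapse points that converged to the same mode
        if not any(np.linalg.norm(p - m) < bandwidth / 2 for m in modes):
            modes.append(p)
    return np.array(modes)
```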

Proceedings ArticleDOI
17 Jun 2006
TL;DR: This paper addresses the problem of detecting and segmenting partially occluded objects of a known category by defining a part labelling which densely covers the object and imposing asymmetric local spatial constraints on these labels to ensure the consistent layout of parts whilst allowing for object deformation.
Abstract: This paper addresses the problem of detecting and segmenting partially occluded objects of a known category. We first define a part labelling which densely covers the object. Our Layout Consistent Random Field (LayoutCRF) model then imposes asymmetric local spatial constraints on these labels to ensure the consistent layout of parts whilst allowing for object deformation. Arbitrary occlusions of the object are handled by avoiding the assumption that the whole object is visible. The resulting system is both efficient to train and to apply to novel images, due to a novel annealed layout-consistent expansion move algorithm paired with a randomised decision tree classifier. We apply our technique to images of cars and faces and demonstrate state-of-the-art detection and segmentation performance even in the presence of partial occlusion.

Book ChapterDOI
07 May 2006
TL;DR: An extensive experimental evaluation on detecting five diverse object classes over hundreds of images demonstrates that the proposed method works in very cluttered images, allows for scale changes and considerable intra-class shape variation, is robust to interrupted contours, and is computationally efficient.
Abstract: We propose a method for object detection in cluttered real images, given a single hand-drawn example as model. The image edges are partitioned into contour segments and organized in an image representation which encodes their interconnections: the Contour Segment Network. The object detection problem is formulated as finding paths through the network resembling the model outlines, and a computationally efficient detection technique is presented. An extensive experimental evaluation on detecting five diverse object classes over hundreds of images demonstrates that our method works in very cluttered images, allows for scale changes and considerable intra-class shape variation, is robust to interrupted contours, and is computationally efficient.

Book
01 May 2006
TL;DR: In this paper, a biologically motivated computational attention system VOCUS (Visual Object detection with a Computational Attention System) is proposed to detect regions of interest in images, which are defined by strong contrasts (e.g., color or intensity contrasts) and by the uniqueness of a feature.
Abstract: Visual attention is a mechanism in human perception which selects relevant regions from a scene and provides these regions for higher-level processing such as object recognition. This enables humans to act effectively in their environment despite the complexity of perceivable sensor data. Computational vision systems face the same problem as humans: there is a large amount of information to be processed, and to achieve this efficiently, maybe even in real-time for robotic applications, the order in which a scene is investigated must be determined in an intelligent way. A promising approach is to use computational attention systems that simulate human visual attention. This monograph introduces the biologically motivated computational attention system VOCUS (Visual Object detection with a Computational attention System) that detects regions of interest in images. It operates in two modes, an exploration mode in which no task is provided, and a search mode with a specified target. In exploration mode, regions of interest are defined by strong contrasts (e.g., color or intensity contrasts) and by the uniqueness of a feature. For example, a black sheep is salient in a flock of white sheep. In search mode, the system uses previously learned information about a target object to bias the saliency computations with respect to the target. In various experiments, it is shown that the target is on average found with less than three fixations, that usually less than five training images suffice to learn the target information, and that the system is mostly robust with regard to viewpoint changes and illumination variations. Furthermore, we demonstrate how VOCUS profits from additional sensor data: we apply the system to depth and reflectance data from a 3D laser scanner and show the advantages that the laser modes provide. By fusing the data of both modes, we demonstrate how the system is able to consider distinct object properties and how the flexibility of the system increases by considering different data. Finally, the regions of interest provided by VOCUS serve as input to a classifier that recognizes the object in the detected region. We show how and in which cases the classification is sped up and how the detection quality is improved by the attentional front-end. This approach is especially useful if many object classes have to be considered, a frequently occurring situation in robotics. VOCUS provides a powerful approach to improve existing vision systems by concentrating computational resources on regions that are more likely to contain relevant information. The more the complexity and power of vision systems increase in the future, the more they will profit from an attentional front-end like VOCUS.
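
The "strong contrast" notion behind the exploration mode can be illustrated with a single centre-surround intensity channel (difference of box filters); VOCUS itself combines multi-scale colour, intensity, and orientation channels, so this is only a rough sketch with assumed window sizes.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def center_surround_saliency(gray, center=3, surround=21):
    """Centre-surround intensity contrast via a difference of box filters."""
    g = gray.astype(np.float64)
    return np.abs(uniform_filter(g, size=center) - uniform_filter(g, size=surround))
```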

Proceedings ArticleDOI
17 Jun 2006
TL;DR: This work presents dynamic video synopsis, where most of the activity in the video is condensed by simultaneously showing several actions, even when they originally occurred at different times.
Abstract: The power of video over still images is the ability to represent dynamic activities. But video browsing and retrieval are inconvenient due to inherent spatio-temporal redundancies, where some time intervals may have no activity, or have activities that occur in a small image region. Video synopsis aims to provide a compact video representation, while preserving the essential activities of the original video. We present dynamic video synopsis, where most of the activity in the video is condensed by simultaneously showing several actions, even when they originally occurred at different times. For example, we can create a "stroboscopic movie", where multiple dynamic instances of a moving object are played simultaneously. This is an extension of the still stroboscopic picture. Previous approaches for video abstraction addressed mostly the temporal redundancy by selecting representative key-frames or time intervals. In dynamic video synopsis the activity is shifted into a significantly shorter period, in which the activity is much denser. Video examples can be found online in http://www.vision.huji.ac.il/synopsis

Proceedings ArticleDOI
17 Jun 2006
TL;DR: The Implicit Shape Model for object class detection is combined with the multi-view specific object recognition system of Ferrari et al. to detect object instances from arbitrary viewpoints.
Abstract: We present a novel system for generic object class detection. In contrast to most existing systems, which focus on a single viewpoint or aspect, our approach can detect object instances from arbitrary viewpoints. This is achieved by combining the Implicit Shape Model for object class detection proposed by Leibe and Schiele with the multi-view specific object recognition system of Ferrari et al. After learning single-view codebooks, these are interconnected by so-called activation links, obtained through multi-view region tracks across different training views of individual object instances. During recognition, these integrated codebooks work together to determine the location and pose of the object. Experimental results demonstrate the viability of the approach and compare it to a bank of independent single-view detectors.

Proceedings ArticleDOI
17 Jun 2006
TL;DR: The performance of the proposed multi-object class detection approach is competitive to state of the art approaches dedicated to a single object class recognition problem.
Abstract: In this paper we propose an approach capable of simultaneous recognition and localization of multiple object classes using a generative model. A novel hierarchical representation allows to represent individual images as well as various objects classes in a single, scale and rotation invariant model. The recognition method is based on a codebook representation where appearance clusters built from edge based features are shared among several object classes. A probabilistic model allows for reliable detection of various objects in the same image. The approach is highly efficient due to fast clustering and matching methods capable of dealing with millions of high dimensional features. The system shows excellent performance on several object categories over a wide range of scales, in-plane rotations, background clutter, and partial occlusions. The performance of the proposed multi-object class detection approach is competitive to state of the art approaches dedicated to a single object class recognition problem.

Journal ArticleDOI
TL;DR: Novel methods are proposed to evaluate the performance of object detection algorithms in video sequences, and recently proposed segmentation algorithms are evaluated to assess how well they can detect moving regions in an outdoor scene in fixed-camera situations.
Abstract: In this paper, we propose novel methods to evaluate the performance of object detection algorithms in video sequences. This procedure allows us to highlight characteristics (e.g., region splitting or merging) which are specific to the method being used. The proposed framework compares the output of the algorithm with the ground truth and measures the differences according to objective metrics. In this way it is possible to perform a fair comparison among different methods, evaluating their strengths and weaknesses and allowing the user to make a reliable choice of the best method for a specific application. We apply this methodology to recently proposed segmentation algorithms and describe their performance. These methods were evaluated in order to assess how well they can detect moving regions in an outdoor scene in fixed-camera situations.

Proceedings ArticleDOI
17 Jun 2006
TL;DR: This work presents an approach based on representing humans as an assembly of four body parts and detecting the body parts in single frames, which makes the method insensitive to camera motion.
Abstract: Tracking of humans in videos is important for many applications. A major source of difficulty in performing this task is inter-human or scene occlusion. We present an approach based on representing humans as an assembly of four body parts and detecting the body parts in single frames, which makes the method insensitive to camera motion. The responses of the body part detectors and a combined human detector provide the "observations" used for tracking. Trajectory initialization and termination are both fully automatic and rely on the confidences computed from the detection responses. An object is tracked by data association if its corresponding detection response can be found; otherwise it is tracked by a meanshift-style tracker. Our method can track humans with both inter-object and scene occlusions. The system is evaluated on three sets of videos and compared with previous methods.

Proceedings ArticleDOI
01 Oct 2006
TL;DR: A novel approach to removing ghosting artifacts from high dynamic range images, without the need for explicit object detection and motion estimation, using a non-parametric model for the static part of the scene.
Abstract: High dynamic range images may be created by capturing multiple images of a scene with varying exposures. Images created in this manner are prone to ghosting artifacts, which appear if there is movement in the scene at the time of capture. This paper describes a novel approach to removing ghosting artifacts from high dynamic range images, without the need for explicit object detection and motion estimation. Weights are computed iteratively and then applied to pixels to determine their contribution to the final image. We use a non-parametric model for the static part of the scene, and a pixel's membership in this model determines its weight. In contrast to previous approaches, our technique does not rely on explicit object detection, tracking, or pixelwise motion estimates. Ghost-free images of different scenes demonstrate the effectiveness of our technique.