
Showing papers on "Object detection published in 2012"


Proceedings ArticleDOI
16 Jun 2012
TL;DR: The autonomous driving platform is used to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM and 3D object detection, revealing that methods ranking high on established datasets such as Middlebury perform below average when moved outside the laboratory into the real world.
Abstract: Today, visual recognition systems are still rarely employed in robotics applications. Perhaps one of the main reasons for this is the lack of demanding benchmarks that mimic such scenarios. In this paper, we take advantage of our autonomous driving platform to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM and 3D object detection. Our recording platform is equipped with four high resolution video cameras, a Velodyne laser scanner and a state-of-the-art localization system. Our benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences of 39.2 km length, and more than 200k 3D object annotations captured in cluttered scenarios (up to 15 cars and 30 pedestrians are visible per image). Results from state-of-the-art algorithms reveal that methods ranking high on established datasets such as Middlebury perform below average when moved outside the laboratory into the real world. Our goal is to reduce this bias by providing challenging benchmarks with novel difficulties to the computer vision community. Our benchmarks are available online at: www.cvlibs.net/datasets/kitti

11,283 citations


Journal ArticleDOI
TL;DR: An extensive evaluation of the state of the art in monocular pedestrian detection, carried out in a unified framework with sixteen pretrained state-of-the-art detectors across six data sets, together with a refined per-frame evaluation methodology.
Abstract: Pedestrian detection is a key problem in computer vision, with several applications that have the potential to positively impact quality of life. In recent years, the number of approaches to detecting pedestrians in monocular images has grown steadily. However, multiple data sets and widely varying evaluation protocols are used, making direct comparisons difficult. To address these shortcomings, we perform an extensive evaluation of the state of the art in a unified framework. We make three primary contributions: 1) We put together a large, well-annotated, and realistic monocular pedestrian detection data set and study the statistics of the size, position, and occlusion patterns of pedestrians in urban scenes, 2) we propose a refined per-frame evaluation methodology that allows us to carry out probing and informative comparisons, including measuring performance in relation to scale and occlusion, and 3) we evaluate the performance of sixteen pretrained state-of-the-art detectors across six data sets. Our study allows us to assess the state of the art and provides a framework for gauging future efforts. Our experiments show that despite significant progress, performance still has much room for improvement. In particular, detection is disappointing at low resolutions and for partially occluded pedestrians.
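The per-frame protocol this evaluation popularized summarizes each detector's miss rate versus false positives per image (FPPI) as a single log-average number. A minimal NumPy sketch of that summary statistic, assuming the curve arrays are already computed (function and variable names are illustrative, not official evaluation code):

```python
import numpy as np

def log_average_miss_rate(miss_rates, fppi):
    """Summarize a miss-rate vs. FPPI curve as one number: the geometric
    mean of the miss rate sampled at nine FPPI values evenly spaced in
    log space over [1e-2, 1e0]."""
    miss_rates, fppi = np.asarray(miss_rates), np.asarray(fppi)
    samples = []
    for ref in np.logspace(-2.0, 0.0, 9):
        reachable = miss_rates[fppi <= ref]   # operating points at or below ref
        samples.append(reachable.min() if reachable.size else 1.0)
    return float(np.exp(np.mean(np.log(np.maximum(samples, 1e-10)))))
```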

3,170 citations


Journal ArticleDOI
TL;DR: A novel tracking framework (TLD) that explicitly decomposes the long-term tracking task into tracking, learning, and detection, and develops a novel learning method (P-N learning) which estimates the errors by a pair of “experts”: P-expert estimates missed detections, and N-expert estimates false alarms.
Abstract: This paper investigates long-term tracking of unknown objects in a video stream. The object is defined by its location and extent in a single frame. In every frame that follows, the task is to determine the object's location and extent or indicate that the object is not present. We propose a novel tracking framework (TLD) that explicitly decomposes the long-term tracking task into tracking, learning, and detection. The tracker follows the object from frame to frame. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary. The learning estimates the detector's errors and updates it to avoid these errors in the future. We study how to identify the detector's errors and learn from them. We develop a novel learning method (P-N learning) which estimates the errors by a pair of “experts”: (1) P-expert estimates missed detections, and (2) N-expert estimates false alarms. The learning process is modeled as a discrete dynamical system and the conditions under which the learning guarantees improvement are found. We describe our real-time implementation of the TLD framework and the P-N learning. We carry out an extensive quantitative evaluation which shows a significant improvement over state-of-the-art approaches.
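A minimal sketch of the P-N learning loop the abstract describes. The tracker, detector, and model objects are hypothetical duck-typed stand-ins, and the IoU thresholds are illustrative rather than the paper's:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def pn_learning_step(frame, tracker, detector, model):
    """One frame of a TLD-style loop: the two 'experts' turn
    tracker/detector disagreements into new training examples."""
    track_box = tracker.update(frame)      # frame-to-frame tracking
    det_boxes = detector.detect(frame)     # all appearances observed so far

    if track_box is not None:
        # P-expert: the trajectory says the object is here, but the
        # detector missed it -> add as a positive example.
        if not any(iou(track_box, d) > 0.5 for d in det_boxes):
            model.add_positive(frame, track_box)
        # N-expert: the object is unique, so detections far from the
        # trajectory are false alarms -> add as negative examples.
        for d in det_boxes:
            if iou(track_box, d) < 0.2:
                model.add_negative(frame, d)

    detector.retrain(model)                # learn from the corrected errors
    return track_box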

3,137 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: It is shown that tree-structured models are surprisingly effective at capturing global elastic deformation, while being easy to optimize unlike dense graph structures, in real-world, cluttered images.
Abstract: We present a unified model for face detection, pose estimation, and landmark estimation in real-world, cluttered images. Our model is based on a mixture of trees with a shared pool of parts; we model every facial landmark as a part and use global mixtures to capture topological changes due to viewpoint. We show that tree-structured models are surprisingly effective at capturing global elastic deformation, while being easy to optimize unlike dense graph structures. We present extensive results on standard face benchmarks, as well as a new “in the wild” annotated dataset, which suggest that our system advances the state of the art, sometimes considerably, for all three tasks. Though our model is modestly trained with hundreds of faces, it compares favorably to commercial systems trained with billions of examples (such as Google Picasa and face.com).

2,340 citations


Book ChapterDOI
07 Oct 2012
TL;DR: Using the well-established theory of Circulant matrices, this work provides a link to Fourier analysis that opens up the possibility of extremely fast learning and detection with the Fast Fourier Transform, which can be done in the dual space of kernel machines as fast as with linear classifiers.
Abstract: Recent years have seen greater interest in the use of discriminative classifiers in tracking systems, owing to their success in object detection. They are trained online with samples collected during tracking. Unfortunately, the potentially large number of samples becomes a computational burden, which directly conflicts with real-time requirements. On the other hand, limiting the samples may sacrifice performance. Interestingly, we observed that, as we add more and more samples, the problem acquires circulant structure. Using the well-established theory of Circulant matrices, we provide a link to Fourier analysis that opens up the possibility of extremely fast learning and detection with the Fast Fourier Transform. This can be done in the dual space of kernel machines as fast as with linear classifiers. We derive closed-form solutions for training and detection with several types of kernels, including the popular Gaussian and polynomial kernels. The resulting tracker achieves performance competitive with the state-of-the-art, can be implemented with only a few lines of code and runs at hundreds of frames-per-second. MATLAB code is provided in the paper (see Algorithm 1).
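A minimal single-channel NumPy sketch of the closed-form learning and detection the abstract describes: kernel correlations over all cyclic shifts are computed with the FFT, and the dual coefficients have the closed form alpha_hat = y_hat / (k_hat + lambda). Parameter values are illustrative; x is a preprocessed (e.g. cosine-windowed) image patch and y a Gaussian-shaped target response centered on the object:

```python
import numpy as np

def gaussian_correlation(x, z, sigma):
    """Kernel values k(z, shift_i(x)) for all cyclic shifts i at once,
    evaluated in the Fourier domain (O(n log n) rather than O(n^2))."""
    c = np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)).real
    d = (x ** 2).sum() + (z ** 2).sum() - 2.0 * c
    return np.exp(-np.maximum(d, 0) / (sigma ** 2 * x.size))

def train(x, y, sigma=0.2, lam=1e-4):
    """Closed-form dual solution: alpha_hat = y_hat / (k_hat + lambda)."""
    k = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alpha_hat, x, z, sigma=0.2):
    """Response over all cyclic shifts of the new patch z; the argmax of
    the response map gives the translation of the target."""
    k = gaussian_correlation(x, z, sigma)
    return np.fft.ifft2(np.fft.fft2(k) * alpha_hat).real
```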

2,197 citations


Journal ArticleDOI
TL;DR: In this paper, a generic objectness measure is proposed to quantify how likely an image window is to contain an object of any class, distinguishing objects with a well-defined boundary, such as cows and telephones, from amorphous background elements such as grass and road.
Abstract: We present a generic objectness measure, quantifying how likely it is for an image window to contain an object of any class. We explicitly train it to distinguish objects with a well-defined boundary in space, such as cows and telephones, from amorphous background elements, such as grass and road. The measure combines in a Bayesian framework several image cues measuring characteristics of objects, such as appearing different from their surroundings and having a closed boundary. These include an innovative cue to measure the closed boundary characteristic. In experiments on the challenging PASCAL VOC 07 dataset, we show this new cue to outperform a state-of-the-art saliency measure, and the combined objectness measure to perform better than any cue alone. We also compare to interest point operators, a HOG detector, and three recent works aiming at automatic object segmentation. Finally, we present two applications of objectness. In the first, we sample a small number of windows according to their objectness probability and give an algorithm to employ them as location priors for modern class-specific object detectors. As we show experimentally, this greatly reduces the number of windows evaluated by the expensive class-specific model. In the second application, we use objectness as a complementary score in addition to the class-specific model, which leads to fewer false positives. As shown in several recent papers, objectness can act as a valuable focus of attention mechanism in many other applications operating on image windows, including weakly supervised learning of object categories, unsupervised pixelwise segmentation, and object tracking in video. Computing objectness is very efficient and takes only about 4 sec. per image.
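A sketch of the Bayesian cue combination in the spirit of the abstract, assuming independent cues; the `likelihoods` density tables are hypothetical stand-ins for what the paper learns from annotated training windows:

```python
def objectness(window_cues, likelihoods, prior=0.5):
    """Naive-Bayes posterior that a window contains an object,
    combining per-cue scores (sketch; cue names are illustrative)."""
    p_obj, p_bg = prior, 1.0 - prior
    for cue, value in window_cues.items():
        p_obj *= likelihoods[cue]['obj'](value)   # p(value | object window)
        p_bg *= likelihoods[cue]['bg'](value)     # p(value | background window)
    return p_obj / (p_obj + p_bg)

# e.g. window_cues = {'color_contrast': 0.8, 'closed_boundary': 0.6};
# sampling windows in proportion to this score yields the location priors
# for class-specific detectors mentioned in the abstract.
```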

1,223 citations


Book ChapterDOI
05 Nov 2012
TL;DR: A framework for automatic modeling, detection, and tracking of 3D objects with a Kinect and shows how to build the templates automatically from 3D models, and how to estimate the 6 degrees-of-freedom pose accurately and in real-time.
Abstract: We propose a framework for automatic modeling, detection, and tracking of 3D objects with a Kinect. The detection part is mainly based on the recent template-based LINEMOD approach [1] for object detection. We show how to build the templates automatically from 3D models, and how to estimate the 6 degrees-of-freedom pose accurately and in real-time. The pose estimation and the color information allow us to check the detection hypotheses, improving the correct detection rate by 13% with respect to the original LINEMOD. These improvements make our framework suitable for object manipulation in robotics applications. Moreover, we propose a new dataset of 15 registered video sequences (1100+ frames each) of 15 different objects for the evaluation of future competing methods.

1,114 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: A unique change detection benchmark dataset consisting of nearly 90,000 frames in 31 video sequences representing 6 categories selected to cover a wide range of challenges in 2 modalities (color and thermal IR).
Abstract: Change detection is one of the most commonly encountered low-level tasks in computer vision and video processing. A plethora of algorithms have been developed to date, yet no widely accepted, realistic, large-scale video dataset exists for benchmarking different methods. Presented here is a unique change detection benchmark dataset consisting of nearly 90,000 frames in 31 video sequences representing 6 categories selected to cover a wide range of challenges in 2 modalities (color and thermal IR). A distinguishing characteristic of this dataset is that each frame is meticulously annotated for ground-truth foreground, background, and shadow area boundaries — an effort that goes much beyond a simple binary label denoting the presence of change. This enables objective and precise quantitative comparison and ranking of change detection algorithms. This paper presents and discusses various aspects of the new dataset, quantitative performance metrics used, and comparative results for over a dozen previous and new change detection algorithms. The dataset, evaluation tools, and algorithm rankings are available to the public on a website and will be updated with feedback from academia and industry in the future.
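The benchmark ranks methods with metrics computable from the annotated masks. A sketch of the common ones (recall, precision, F-measure, and percentage of wrong classifications), ignoring the shadow and out-of-ROI labels the full evaluation also handles:

```python
import numpy as np

def change_detection_scores(pred, gt):
    """Standard benchmark metrics from binary masks (True = change).
    Sketch only; assumes both classes occur in the ground truth."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        'recall': recall,
        'precision': precision,
        'f_measure': 2 * precision * recall / (precision + recall),
        # percentage of wrong classifications
        'pwc': 100.0 * (fn + fp) / (tp + fn + fp + tn),
    }
```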

800 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: A unified model to incorporate traditional low-level features with higher-level guidance to detect salient objects and can be considered as a prototype framework not only for general salient object detection, but also for potential task-dependent saliency applications.
Abstract: Salient object detection is not a pure low-level, bottom-up process. Higher-level knowledge is important even for task-independent image saliency. We propose a unified model to incorporate traditional low-level features with higher-level guidance to detect salient objects. In our model, an image is represented as a low-rank matrix plus sparse noises in a certain feature space, where the non-salient regions (or background) can be explained by the low-rank matrix, and the salient regions are indicated by the sparse noises. To ensure the validity of this model, a linear transform for the feature space is introduced and needs to be learned. Given an image, its low-level saliency is then extracted by identifying those sparse noises when recovering the low-rank matrix. Furthermore, higher-level knowledge is fused to compose a prior map, and is treated as a prior term in the objective function to improve the performance. Extensive experiments show that our model can comfortably achieve comparable performance to the existing methods even without the help from high-level knowledge. The integration of top-down priors further improves the performance and achieves state-of-the-art results. Moreover, the proposed model can be considered as a prototype framework not only for general salient object detection, but also for potential task-dependent saliency applications.
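The decomposition at the model's core is principal component pursuit: D ≈ L (low-rank background) + S (sparse salient regions). A simplified inexact-ALM sketch, omitting the learned feature transform and high-level priors the paper adds on top:

```python
import numpy as np

def rpca(D, lam=None, tol=1e-7, max_iter=500):
    """Robust PCA via inexact ALM: split a float feature matrix D
    (features x segments) into low-rank L and sparse S."""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    norm2 = np.linalg.norm(D, 2)
    Y = D / max(norm2, np.abs(D).max() / lam)    # scaled dual variable
    L, S = np.zeros_like(D), np.zeros_like(D)
    mu, mu_bar, rho = 1.25 / norm2, 1.25 / norm2 * 1e7, 1.5
    for _ in range(max_iter):
        # low-rank update: singular value shrinkage
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0)) @ Vt
        # sparse update: entrywise soft threshold
        T = D - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0)
        Z = D - L - S
        Y = Y + mu * Z
        mu = min(mu * rho, mu_bar)
        if np.linalg.norm(Z, 'fro') <= tol * np.linalg.norm(D, 'fro'):
            break
    return L, S

# saliency of segment i = energy of its sparse component:
# saliency = np.linalg.norm(S, axis=0)
```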

719 citations


Journal ArticleDOI
TL;DR: A survey of the traffic sign detection literature, detailing detection systems for traffic sign recognition (TSR) for driver assistance and discussing future directions of TSR research, including the integration of context and localization.
Abstract: In this paper, we provide a survey of the traffic sign detection literature, detailing detection systems for traffic sign recognition (TSR) for driver assistance. We separately describe the contributions of recent works to the various stages inherent in traffic sign detection: segmentation, feature extraction, and final sign detection. While TSR is a well-established research area, we highlight open research issues in the literature, including the limited use of publicly available image databases and the over-representation of European traffic signs. Furthermore, we discuss future directions of TSR research, including the integration of context and localization. We also introduce a new public database containing U.S. traffic signs.

620 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: A new pedestrian detector that improves on the state of the art in both speed and quality, gaining detection speed by efficiently handling different scales and transferring computation from test time to training time.
Abstract: We present a new pedestrian detector that improves over the state of the art in both speed and quality. By efficiently handling different scales and transferring computation from test time to training time, detection speed is improved. When processing monocular images, our system provides high-quality detections at 50 fps. We also propose a new method for exploiting geometric context extracted from stereo images. On a single CPU+GPU desktop machine, we reach 135 fps when processing street scenes, from rectified input to detection output.

Journal ArticleDOI
TL;DR: A method for real-time 3D object instance detection that does not require a time-consuming training stage, and can handle untextured objects, and is much faster and more robust with respect to background clutter than current state-of-the-art methods is presented.
Abstract: We present a method for real-time 3D object instance detection that does not require a time-consuming training stage, and can handle untextured objects. At its core, our approach is a novel image representation for template matching designed to be robust to small image transformations. This robustness is based on spread image gradient orientations and allows us to test only a small subset of all possible pixel locations when parsing the image, and to represent a 3D object with a limited set of templates. In addition, we demonstrate that if a dense depth sensor is available we can extend our approach for an even better performance also taking 3D surface normal orientations into account. We show how to take advantage of the architecture of modern computers to build an efficient but very discriminant representation of the input images that can be used to consider thousands of templates in real time. We demonstrate in many experiments on real data that our method is much faster and more robust with respect to background clutter than current state-of-the-art methods.
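A much-simplified sketch of the spread-gradient-orientation idea: quantize each strong gradient orientation to one bit, OR the bits over a small neighborhood for robustness to small transformations, and score a template by counting features whose bit survives. The paper's actual similarity measure uses cosine lookup tables and a cache-friendly memory layout; the thresholds here are illustrative:

```python
import numpy as np

def quantized_orientations(img, n_bins=8, mag_thresh=10.0):
    """One bit per pixel encoding the quantized gradient orientation
    (direction ignored: orientations taken modulo 180 degrees)."""
    gy, gx = np.gradient(img.astype(np.float64))
    ang = np.mod(np.arctan2(gy, gx), np.pi)                 # [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    strong = np.hypot(gx, gy) > mag_thresh                  # keep strong gradients
    return np.where(strong, 1 << bins, 0).astype(np.uint8)

def spread(bitmap, T=3):
    """OR orientation bits over a T x T neighborhood, so a template
    feature matches anywhere within a small translation."""
    out = np.zeros_like(bitmap)
    r = T // 2
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= np.roll(np.roll(bitmap, dy, axis=0), dx, axis=1)
    return out

def score_at(spread_map, template, y0, x0):
    """template: list of (dy, dx, bit) features relative to the anchor;
    count how many orientation bits are still present."""
    h, w = spread_map.shape
    return sum(1 for dy, dx, bit in template
               if 0 <= y0 + dy < h and 0 <= x0 + dx < w
               and spread_map[y0 + dy, x0 + dx] & bit)
```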

Proceedings ArticleDOI
14 May 2012
TL;DR: This paper uses an RGBD sensor (Microsoft Kinect) as the input and computes a set of features based on human pose and motion, as well as on image and point-cloud information, feeding a hierarchical maximum entropy Markov model (MEMM).
Abstract: Being able to detect and recognize human activities is essential for several applications, including personal assistive robotics. In this paper, we perform detection and recognition of unstructured human activity in unstructured environments. We use an RGBD sensor (Microsoft Kinect) as the input sensor and compute a set of features based on human pose and motion, as well as on image and point-cloud information. Our algorithm is based on a hierarchical maximum entropy Markov model (MEMM), which considers a person's activity as composed of a set of sub-activities. We infer the two-layered graph structure using a dynamic programming approach. We test our algorithm on detecting and recognizing twelve different activities performed by four people in different environments, such as a kitchen, a living room, and an office, and achieve good performance even when the person was not seen before in the training set.

Proceedings ArticleDOI
16 Jun 2012
TL;DR: It is shown that training from a combination of weakly annotated videos and fully annotated still images using domain adaptation improves the performance of a detector trained from still images alone.
Abstract: Object detectors are typically trained on a large set of still images annotated by bounding-boxes. This paper introduces an approach for learning object detectors from real-world web videos known only to contain objects of a target class. We propose a fully automatic pipeline that localizes objects in a set of videos of the class and learns a detector for it. The approach extracts candidate spatio-temporal tubes based on motion segmentation and then selects one tube per video jointly over all videos. To compare to the state of the art, we test our detector on still images, i.e., Pascal VOC 2007. We observe that frames extracted from web videos can differ significantly in quality from still images taken by a good camera. Thus, we formulate the learning from videos as a domain adaptation task. We show that training from a combination of weakly annotated videos and fully annotated still images using domain adaptation improves the performance of a detector trained from still images alone.

Proceedings ArticleDOI
16 Jun 2012
TL;DR: An approach to holistic scene understanding that reasons jointly about regions, location, class and spatial extent of objects, presence of a class in the image, and the scene type, and that outperforms the state-of-the-art on the MSRC-21 benchmark while being much faster.
Abstract: In this paper we propose an approach to holistic scene understanding that reasons jointly about regions, location, class and spatial extent of objects, presence of a class in the image, as well as the scene type. Learning and inference in our model are efficient as we reason at the segment level, and introduce auxiliary variables that allow us to decompose the inherent high-order potentials into pairwise potentials between a few variables with a small number of states (at most the number of classes). Inference is done via a convergent message-passing algorithm, which, unlike graph-cuts inference, has no submodularity restrictions and does not require potential-specific moves. We believe this is very important, as it allows us to encode our ideas and prior knowledge about the problem without the need to change the inference engine every time we introduce a new potential. Our approach outperforms the state-of-the-art on the MSRC-21 benchmark, while being much faster. Importantly, our holistic model is able to improve performance in all tasks.

Proceedings ArticleDOI
16 Jun 2012
TL;DR: A conditional model trained in a max-margin framework that is able to automatically discover discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks is utilized.
Abstract: In this paper, we tackle the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. Towards this goal, we utilize a conditional model trained in a max-margin framework that is able to automatically discover discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks. We introduce latent variables over the frames of a video, and allow our algorithm to discover and assign sequences of states that are most discriminative for the event. Our model is based on the variable-duration hidden Markov model, and models durations of states in addition to the transitions between states. The simplicity of our model allows us to perform fast, exact inference using dynamic programming, which is extremely important when we set our sights on being able to process a very large number of videos quickly and efficiently. We show promising results on the Olympic Sports dataset [16] and the 2011 TRECVID Multimedia Event Detection task [18]. We also illustrate and visualize the semantic understanding capabilities of our model.

Book
09 Oct 2012
TL;DR: This book is an essential guide to the implementation of image processing and computer vision techniques, with tutorial introductions and sample code in Matlab, and contains extensive new material on Haar wavelets, Viola-Jones, bilateral filtering, SURF, PCA-SIFT, moving object detection and tracking.
Abstract: This book is an essential guide to the implementation of image processing and computer vision techniques, with tutorial introductions and sample code in Matlab. Algorithms are presented and fully explained to enable complete understanding of the methods and techniques demonstrated. As one reviewer noted, "The main strength of the proposed book is the exemplar code of the algorithms." Fully updated with the latest developments in feature extraction, including expanded tutorials and new techniques, this new edition contains extensive new material on Haar wavelets, Viola-Jones, bilateral filtering, SURF, PCA-SIFT, moving object detection and tracking, development of symmetry operators, LBP texture analysis, Adaboost, and a new appendix on color models. Coverage of distance measures, feature detectors, wavelets, level sets and texture tutorials has been extended.

* Named a 2012 Notable Computer Book for Computing Methodologies by Computing Reviews
* Essential reading for engineers and students working in this cutting-edge field
* Ideal module text and background reference for courses in image processing and computer vision
* The only currently available text to concentrate on feature extraction with working implementations and worked-through derivations

Journal ArticleDOI
TL;DR: The results achieved by different methods are compared and analysed to identify promising strategies for automatic urban object extraction from current airborne sensor data, but also common problems of state-of-the-art methods.
Abstract: For more than two decades, many efforts have been made to develop methods for extracting urban objects from data acquired by airborne sensors. In order to make the results of such algorithms more comparable, benchmarking data sets are of paramount importance. Such a data set, consisting of airborne image and laserscanner data, has been made available to the scientific community. Researchers were encouraged to submit results of urban object detection and 3D building reconstruction, which were evaluated based on reference data. This paper presents the outcomes of the evaluation for building detection, tree detection, and 3D building reconstruction. The results achieved by different methods are compared and analysed to identify promising strategies for automatic urban object extraction from current airborne sensor data, but also common problems of state-of-the-art methods.

Proceedings ArticleDOI
16 Jun 2012
TL;DR: This paper proposes a robust part-based tracking-by-detection framework that learns part-based person-specific SVM classifiers which capture the articulations of the human bodies in dynamically changing appearance and background.
Abstract: Single camera-based multiple-person tracking is often hindered by difficulties such as occlusion and changes in appearance. In this paper, we address such problems by proposing a robust part-based tracking-by-detection framework. Human detection using part models has become quite popular, yet its extension in tracking has not been fully explored. Our approach learns part-based person-specific SVM classifiers which capture the articulations of the human bodies in dynamically changing appearance and background. With the part-based model, our approach is able to handle partial occlusions in both the detection and the tracking stages. In the detection stage, we select the subset of parts which maximizes the probability of detection, which significantly improves the detection performance in crowded scenes. In the tracking stage, we dynamically handle occlusions by distributing the score of the learned person classifier among its corresponding parts, which allows us to detect and predict partial occlusions, and prevent the performance of the classifiers from being degraded. Extensive experiments using the proposed method on several challenging sequences demonstrate state-of-the-art performance in multiple-people tracking.

Journal ArticleDOI
TL;DR: A conditional random field is proposed that starts from generic knowledge and then progressively adapts to the new class, allowing any state-of-the-art object detector to be trained in a weakly supervised fashion even though it would normally require object location annotations.
Abstract: Learning a new object class from cluttered training images is very challenging when the location of object instances is unknown, i.e. in a weakly supervised setting. Many previous works require objects covering a large portion of the images. We present a novel approach that can cope with extensive clutter as well as large scale and appearance variations between object instances. To make this possible we exploit generic knowledge learned beforehand from images of other classes for which location annotation is available. Generic knowledge facilitates learning any new class from weakly supervised images, because it reduces the uncertainty in the location of its object instances. We propose a conditional random field that starts from generic knowledge and then progressively adapts to the new class. Our approach simultaneously localizes object instances while learning an appearance model specific for the class. We demonstrate this on several datasets, including the very challenging Pascal VOC 2007. Furthermore, our method allows training any state-of-the-art object detector in a weakly supervised fashion, although it would normally require object location annotations.

Journal ArticleDOI
TL;DR: The proposed system is accurate at high vehicle speeds, operates under a range of weather conditions, runs at an average speed of 20 frames per second, and recognizes all classes of ideogram-based (nontext) traffic symbols from an online road sign database.
Abstract: This paper proposes a novel system for the automatic detection and recognition of traffic signs. The proposed system detects candidate regions as maximally stable extremal regions (MSERs), which offers robustness to variations in lighting conditions. Recognition is based on a cascade of support vector machine (SVM) classifiers that were trained using histogram of oriented gradient (HOG) features. The training data are generated from synthetic template images that are freely available from an online database; thus, real footage road signs are not required as training data. The proposed system is accurate at high vehicle speeds, operates under a range of weather conditions, runs at an average speed of 20 frames per second, and recognizes all classes of ideogram-based (nontext) traffic symbols from an online road sign database. Comprehensive comparative results to illustrate the performance of the system are presented.
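A sketch of the pipeline's structure using standard OpenCV building blocks (MSER candidates, HOG features). The size/aspect filters and HOG geometry are guesses, and a single classifier would stand in for the paper's SVM cascade:

```python
import cv2

def sign_candidates(bgr):
    """Candidate regions as MSERs, which are robust to lighting changes
    (per the abstract). Filter thresholds are illustrative."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray)
    # keep roughly square regions of plausible sign size
    return [(x, y, w, h) for (x, y, w, h) in boxes
            if 16 <= w <= 128 and 0.7 <= w / float(h) <= 1.4]

# HOG geometry is a guess: 32x32 window, 16x16 blocks, 8x8 stride/cells, 9 bins
hog = cv2.HOGDescriptor((32, 32), (16, 16), (8, 8), (8, 8), 9)

def hog_feature(bgr, box):
    """Feature vector for one candidate; these would feed the cascade of
    SVM classifiers trained on synthetic template images."""
    x, y, w, h = box
    patch = cv2.resize(bgr[y:y + h, x:x + w], (32, 32))
    return hog.compute(cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)).ravel()
```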

Book ChapterDOI
07 Oct 2012
TL;DR: This work revisits a much older technique, viz. Linear Discriminant Analysis, and shows that LDA models can be trained almost trivially, with little or no loss in performance relative to linear SVMs.
Abstract: Object detection has over the past few years converged on using linear SVMs over HOG features. Training linear SVMs, however, is quite expensive, and can become intractable as the number of categories increases. In this work we revisit a much older technique, viz. Linear Discriminant Analysis, and show that LDA models can be trained almost trivially, and with little or no loss in performance. The covariance matrices we estimate capture properties of natural images. Whitening HOG features with these covariances thus removes naturally occurring correlations between the HOG features. We show that these whitened features (which we call WHO) are considerably better than the original HOG features for computing similarities, and prove their usefulness in clustering. Finally, we use our findings to produce an object detection system that is competitive on PASCAL VOC 2007 while being considerably easier to train and test.
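The closed-form training the abstract refers to is classical LDA: w = Σ⁻¹(μ₊ − μ₀), with the background mean and covariance estimated once from natural images and shared across all categories. A minimal sketch, including the feature whitening that gives WHO its name (variable names are illustrative):

```python
import numpy as np

def lda_detector(pos_feats, mu_bg, cov_bg, ridge=1e-2):
    """Closed-form detector weights: w = Sigma^{-1} (mu_pos - mu_bg).
    mu_bg/cov_bg are the shared 'natural image' statistics."""
    mu_pos = pos_feats.mean(axis=0)
    cov = cov_bg + ridge * np.eye(cov_bg.shape[0])   # regularize
    return np.linalg.solve(cov, mu_pos - mu_bg)

def whiten(feats, mu_bg, cov_bg, ridge=1e-2):
    """Whitened HOG ('WHO'): decorrelate features with the background
    covariance before computing similarities or clustering."""
    cov = cov_bg + ridge * np.eye(cov_bg.shape[0])
    vals, vecs = np.linalg.eigh(cov)                 # inverse square root
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return (feats - mu_bg) @ inv_sqrt
```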

Journal ArticleDOI
TL;DR: A survey and a comparative evaluation of recent techniques for moving cast shadow detection indicate that all shadow detection approaches make different contributions and all have individual strengths and weaknesses.

Proceedings ArticleDOI
16 Jun 2012
TL;DR: A probabilistic pedestrian detection framework in which a deformable part-based model provides the scores of part detectors, the visibilities of parts are modeled as hidden variables, and a discriminative deep model learns the visibility relationships among overlapping parts at multiple layers.
Abstract: Part-based models have demonstrated their merit in object detection. However, a key issue to be solved is how to integrate the inaccurate scores of part detectors when there are occlusions or large deformations. To handle the imperfection of part detectors, this paper presents a probabilistic pedestrian detection framework. In this framework, a deformable part-based model is used to obtain the scores of part detectors and the visibilities of parts are modeled as hidden variables. Unlike previous occlusion handling approaches that assume independence among visibility probabilities of parts or manually define rules for the visibility relationship, a discriminative deep model is used in this paper for learning the visibility relationship among overlapping parts at multiple layers. Experimental results on three public datasets (Caltech, ETH and Daimler) and a new CUHK occlusion dataset specially designed for the evaluation of occlusion handling approaches show the effectiveness of the proposed approach.

Proceedings ArticleDOI
16 Jun 2012
TL;DR: This paper proposes the use of color attributes as an explicit color representation for object detection and shows that this method improves over state-of-the-art techniques despite its simplicity.
Abstract: State-of-the-art object detectors typically use shape information as a low level feature representation to capture the local structure of an object. This paper shows that early fusion of shape and color, as is popular in image classification, leads to a significant drop in performance for object detection. Moreover, such approaches also yield suboptimal results for object categories with varying importance of color and shape. In this paper we propose the use of color attributes as an explicit color representation for object detection. Color attributes are compact, computationally efficient, and when combined with traditional shape features provide state-of-the-art results for object detection. Our method is tested on the PASCAL VOC 2007 and 2009 datasets and results clearly show that our method improves over state-of-the-art techniques despite its simplicity. We also introduce a new dataset consisting of cartoon character images in which color plays a pivotal role. On this dataset, our approach yields a significant gain of 14% in mean AP over conventional state-of-the-art methods.

Proceedings Article
01 Dec 2012
TL;DR: The key observation is that drawing a bounding box is significantly more difficult and time-consuming than giving answers to multiple-choice questions, so quality control through additional verification tasks is more cost-effective than consensus-based algorithms.
Abstract: A large number of images with ground truth object bounding boxes are critical for learning object detectors, which is a fundamental task in computer vision. In this paper, we study strategies to crowd-source bounding box annotations. The core challenge of building such a system is to effectively control the data quality with minimal cost. Our key observation is that drawing a bounding box is significantly more difficult and time-consuming than giving answers to multiple-choice questions. Thus, quality control through additional verification tasks is more cost-effective than consensus-based algorithms. In particular, we present a system that consists of three simple sub-tasks — a drawing task, a quality verification task and a coverage verification task. Experimental results demonstrate that our system is scalable, accurate, and cost-effective.

Book ChapterDOI
07 Oct 2012
TL;DR: This work proposes to exploit correlations in detector responses at nearby locations and scales by tightly coupling detector evaluation of nearby windows by introducing two opposing mechanisms: detector excitation of promising neighbors and inhibition of inferior neighbors.
Abstract: Cascades help make sliding window object detection fast, nevertheless, computational demands remain prohibitive for numerous applications. Currently, evaluation of adjacent windows proceeds independently; this is suboptimal as detector responses at nearby locations and scales are correlated. We propose to exploit these correlations by tightly coupling detector evaluation of nearby windows. We introduce two opposing mechanisms: detector excitation of promising neighbors and inhibition of inferior neighbors. By enabling neighboring detectors to communicate, crosstalk cascades achieve major gains (4-30× speedup) over cascades evaluated independently at each image location. Combined with recent advances in fast multi-scale feature computation, for which we provide an optimized implementation, our approach runs at 35-65 fps on 640×480 images while attaining state-of-the-art accuracy.
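A deliberately coarse sketch of the excitation/inhibition idea: evaluate a sparse grid of windows first, let high scores wake up their neighbors for full evaluation, and let low scores leave them unevaluated. The thresholds, grid step, and scoring callback are all illustrative, not the paper's:

```python
def crosstalk_scan(score_window, positions, step=4, tau_hi=1.0):
    """score_window(pos) is any per-window classifier; positions is an
    iterable of (y, x) candidate windows. Windows scoring above tau_hi
    'excite' their neighbors; everything else is implicitly inhibited
    (left unevaluated)."""
    positions = set(positions)
    scores, excited = {}, set()
    for p in positions:
        if p[0] % step == 0 and p[1] % step == 0:   # sparse first pass
            s = score_window(p)
            scores[p] = s
            if s > tau_hi:                          # excitation
                y, x = p
                excited |= {(y + dy, x + dx)
                            for dy in range(-step, step + 1)
                            for dx in range(-step, step + 1)}
    for p in (excited & positions) - set(scores):   # dense second pass
        scores[p] = score_window(p)
    return scores
```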

Journal ArticleDOI
TL;DR: This work proposes and evaluates techniques for searching a video dataset for people in a specific pose, developing three new pose descriptors and comparing their classification and retrieval performance to two baselines built on state-of-the-art object detection models.
Abstract: We present a technique for estimating the spatial layout of humans in still images--the position of the head, torso and arms. The theme we explore is that once a person is localized using an upper body detector, the search for their body parts can be considerably simplified using weak constraints on position and appearance arising from that detection. Our approach is capable of estimating upper body pose in highly challenging uncontrolled images, without prior knowledge of background, clothing, lighting, or the location and scale of the person in the image. People are only required to be upright and seen from the front or the back (not side). We evaluate the stages of our approach experimentally using ground truth layout annotation on a variety of challenging material, such as images from the PASCAL VOC 2008 challenge and video frames from TV shows and feature films. We also propose and evaluate techniques for searching a video dataset for people in a specific pose. To this end, we develop three new pose descriptors and compare their classification and retrieval performance to two baselines built on state-of-the-art object detection models.

Journal ArticleDOI
TL;DR: An extensive experimental evaluation on the sports action data set from [1], the PASCAL Action 2010 data set [2], and a new human-object interaction data set are presented.
Abstract: We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: We first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated only with the action label. Our approach relies on a human detector to initialize the model learning. For robustness to various degrees of visibility, we build a detector that learns to combine a set of existing part detectors. Starting from humans detected in a set of images depicting the action, our approach determines the action object and its spatial relation to the human. Its final output is a probabilistic model of the human-object interaction, i.e., the spatial relation between the human and the object. We present an extensive experimental evaluation on the sports action data set from [1], the PASCAL Action 2010 data set [2], and a new human-object interaction data set.

Proceedings ArticleDOI
16 Jun 2012
TL;DR: The introduction of spatial coherence into the background update procedure leads to the so-called SC-SOBS algorithm, which provides further robustness against false detections.
Abstract: The Self-Organizing Background Subtraction (SOBS) algorithm implements an approach to moving object detection based on the neural background model automatically generated by a self-organizing method, without prior knowledge about the involved patterns. Such an adaptive model can handle scenes containing moving backgrounds, gradual illumination variations and camouflage, can include shadows cast by moving objects into the background model, and achieves robust detection for different types of videos taken with stationary cameras. Moreover, the introduction of spatial coherence into the background update procedure leads to the so-called SC-SOBS algorithm, which provides further robustness against false detections. The paper includes extensive experimental results achieved by the SOBS and the SC-SOBS algorithms on the dataset made available for the Change Detection Challenge at the IEEE CVPR2012.
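A much-simplified sketch of a SOBS-style per-pixel neural background model: each pixel keeps K weight vectors, a pixel is foreground when none of them matches, and the winning vector is updated toward the observation. The real algorithm also updates spatially neighboring model cells (the coherence that defines SC-SOBS) and handles shadows; the constants here are illustrative:

```python
import numpy as np

def sobs_step(frame, model, eps=20.0, alpha=0.05):
    """One frame of a simplified SOBS-style update.
    frame: (H, W, C) color image; model: (H, W, K, C) float array of
    K weight vectors per pixel. Returns the foreground mask."""
    f = frame.astype(np.float64)                           # (H, W, C)
    d = np.linalg.norm(model - f[:, :, None, :], axis=3)   # (H, W, K)
    fg = d.min(axis=2) > eps               # no weight vector matches
    k = d.argmin(axis=2)                   # winning vector per pixel
    # update winners toward the observation where background was found
    h, w = fg.shape
    yy, xx = np.mgrid[0:h, 0:w]
    upd = ~fg
    model[yy[upd], xx[upd], k[upd]] = (
        (1 - alpha) * model[yy[upd], xx[upd], k[upd]] + alpha * f[upd])
    return fg

# the model can be initialized by stacking the first frame K times:
# model = np.repeat(first_frame[:, :, None, :].astype(float), K, axis=2)
```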