
Showing papers on "Object detection published in 2014"


Book ChapterDOI
06 Sep 2014
TL;DR: A new dataset that aims to advance the state of the art in object recognition by placing it within the broader context of scene understanding, built by gathering images of complex everyday scenes containing common objects in their natural context.
Abstract: We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

30,462 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: R-CNN as discussed by the authors combines CNNs with bottom-up region proposals to localize and segment objects, and when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
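
The pipeline described above reduces to a short recipe: generate class-agnostic region proposals, warp each one to the CNN input size, extract features, and score them with per-class linear SVMs. The sketch below illustrates that flow; the proposal generator, feature extractor, and SVM weights are placeholder stand-ins, not the authors' released implementation.

```python
import numpy as np

def warp(crop, size=227):
    """Anisotropically resize a crop to a fixed CNN input size (nearest neighbour)."""
    h, w = crop.shape[:2]
    rows = (np.arange(size) * h // size).clip(0, h - 1)
    cols = (np.arange(size) * w // size).clip(0, w - 1)
    return crop[rows][:, cols]

def rcnn_detect(image, propose, extract_features, svm_w, svm_b):
    """R-CNN recipe: proposals -> warp -> CNN features -> per-class linear SVM scores."""
    detections = []
    for (x0, y0, x1, y1) in propose(image):
        feat = extract_features(warp(image[y0:y1, x0:x1]))
        scores = svm_w @ feat + svm_b          # one score per class
        detections.append(((x0, y0, x1, y1), scores))
    return detections

# Toy usage with stand-ins for selective search and the fine-tuned CNN.
img = np.random.rand(480, 640, 3)
proposals = lambda im: [(10, 10, 200, 150), (300, 40, 620, 400)]
features = lambda crop: crop.mean(axis=(0, 1))        # placeholder 3-D "feature"
svm_w, svm_b = np.random.randn(20, 3), np.zeros(20)   # e.g. 20 PASCAL classes
dets = rcnn_detect(img, proposals, features, svm_w, svm_b)
```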

21,729 citations


Proceedings ArticleDOI
14 May 2014
TL;DR: It is shown that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, and result in an analogous performance boost, and it is identified that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance.
Abstract: The latest generation of Convolutional Neural Networks (CNNs) has achieved impressive results in challenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visual-Words and the Improved Fisher Vector. This paper conducts a rigorous evaluation of these new techniques, exploring different deep architectures and comparing them on a common ground, identifying and disclosing important implementation details. We identify several useful properties of CNN-based representations, including the fact that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance. We also identify aspects of deep and shallow methods that can be successfully shared. In particular, we show that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, and result in an analogous performance boost. Source code and models to reproduce the experiments in the paper are made publicly available.
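
The observation that the same augmentation used for CNNs also helps shallow encodings amounts to encoding several simple transformations of the image and pooling the resulting descriptors. The sketch below shows that idea with a horizontal flip and a centre crop; the encoder here is a toy placeholder, and averaging L2-normalised descriptors is just one reasonable way to pool the augmented views.

```python
import numpy as np

def augmented_descriptor(image, encode):
    """Pool a (CNN or shallow) descriptor over simple augmentations:
    the original image, its horizontal flip, and a centre crop."""
    h, w = image.shape[:2]
    ch, cw = int(h * 0.9), int(w * 0.9)
    views = [image,
             image[:, ::-1],
             image[(h - ch) // 2:(h + ch) // 2, (w - cw) // 2:(w + cw) // 2]]
    descs = [encode(v) for v in views]
    descs = [d / (np.linalg.norm(d) + 1e-12) for d in descs]   # L2-normalise each view
    return np.mean(descs, axis=0)                              # average-pool the views

toy_encode = lambda im: np.array([im.mean(), im.std(), im.max()])   # placeholder encoder
print(augmented_descriptor(np.random.rand(224, 224), toy_encode))
```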

3,533 citations


Posted Content
TL;DR: This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF).
Abstract: Deep Convolutional Neural Networks (DCNNs) have recently shown state-of-the-art performance in high level vision tasks, such as image classification and object detection. This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification (also called "semantic image segmentation"). We show that responses at the final layer of DCNNs are not sufficiently localized for accurate object segmentation. This is due to the very invariance properties that make DCNNs good for high level tasks. We overcome this poor localization property of deep networks by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF). Qualitatively, our "DeepLab" system is able to localize segment boundaries at a level of accuracy which is beyond previous methods. Quantitatively, our method sets a new state of the art on the PASCAL VOC-2012 semantic image segmentation task, reaching 71.6% IOU accuracy on the test set. We show how these results can be obtained efficiently: careful network re-purposing and a novel application of the 'hole' algorithm from the wavelet community allow dense computation of neural net responses at 8 frames per second on a modern GPU.
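
The 'hole' algorithm mentioned at the end is what later became widely known as atrous (dilated) convolution: the filter taps are spread apart by a rate factor so responses can be computed densely without shrinking the feature map. Below is a minimal 1-D illustration of the idea, not tied to DeepLab or to any deep-learning framework.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """1-D 'hole' (atrous) convolution: sample the input with rate-1 gaps between
    kernel taps, enlarging the receptive field without reducing resolution."""
    k = len(kernel)
    span = (k - 1) * rate + 1            # effective filter length
    pad = span // 2
    x = np.pad(signal, pad)              # zero-pad so output length equals input length
    out = np.zeros_like(signal, dtype=float)
    for i in range(len(signal)):
        taps = x[i:i + span:rate]        # pick inputs with "holes" between them
        out[i] = np.dot(taps, kernel)
    return out

x = np.sin(np.linspace(0.0, 6.28, 64))
print(atrous_conv1d(x, np.array([1.0, 2.0, 1.0]), rate=4)[:5])
```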

3,389 citations


Book ChapterDOI
06 Sep 2014
TL;DR: A novel method for generating object bounding box proposals using edges is proposed, showing results that are significantly more accurate than the current state-of-the-art while being faster to compute.
Abstract: The use of object proposals is an effective recent approach for increasing the computational efficiency of object detection. We propose a novel method for generating object bounding box proposals using edges. Edges provide a sparse yet informative representation of an image. Our main observation is that the number of contours that are wholly contained in a bounding box is indicative of the likelihood of the box containing an object. We propose a simple box objectness score that measures the number of edges that exist in the box minus those that are members of contours that overlap the box's boundary. Using efficient data structures, millions of candidate boxes can be evaluated in a fraction of a second, returning a ranked set of a few thousand top-scoring proposals. Using standard metrics, we show results that are significantly more accurate than the current state-of-the-art while being faster to compute. In particular, given just 1000 proposals we achieve over 96% object recall at an overlap threshold of 0.5 and over 75% recall at the more challenging overlap threshold of 0.7. Our approach runs in 0.25 seconds and we additionally demonstrate a near real-time variant with only minor loss in accuracy.
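
The scoring idea is simple enough to sketch directly: contours wholly inside a candidate box vote for it, while contours that straddle its boundary do not. The toy version below scores one box from pre-grouped edge fragments; the normalisation by box size is a crude stand-in for the perimeter-based normalisation used in the paper, and the data structures that make millions of boxes evaluable per second are omitted.

```python
import numpy as np

def edge_box_score(edge_groups, box):
    """Simplified Edge-Boxes-style objectness: sum the magnitudes of contour groups
    that lie wholly inside the box; groups crossing the box boundary contribute nothing.

    edge_groups: list of (points, magnitude), points being an (N, 2) array of (x, y)
    box:         (x0, y0, x1, y1)
    """
    x0, y0, x1, y1 = box
    score = 0.0
    for points, magnitude in edge_groups:
        inside = ((points[:, 0] >= x0) & (points[:, 0] <= x1) &
                  (points[:, 1] >= y0) & (points[:, 1] <= y1))
        if inside.all():                       # wholly contained contour: counts for the box
            score += magnitude
        # contours that straddle the boundary are discarded
    area = (x1 - x0) * (y1 - y0)
    return score / max(area ** 0.5, 1.0)       # crude size normalisation (illustrative)

groups = [(np.array([[20, 20], [25, 22], [30, 25]]), 3.0),
          (np.array([[5, 5], [60, 60]]), 2.0)]
print(edge_box_score(groups, (10, 10, 50, 50)))
```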

2,892 citations


Book ChapterDOI
TL;DR: SPP-Net as mentioned in this paper proposes a spatial pyramid pooling strategy, which can generate a fixed-length representation regardless of image size/scale, and achieves state-of-the-art performance in object detection.
Abstract: Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224x224) input image. This requirement is "artificial" and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102x faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.
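
The core mechanism is the pooling layer itself: partition the last convolutional feature map into a fixed grid at several pyramid levels, max-pool each cell, and concatenate. Because the grid adapts to the map size, the output length depends only on the number of channels and pyramid levels. A minimal NumPy sketch follows; the 1x1/2x2/4x4 levels are illustrative rather than the paper's exact configuration.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature map over n x n grids for each pyramid level and
    concatenate, giving a fixed-length vector for any H and W."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                      xs[j]:max(xs[j + 1], xs[j] + 1)]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)        # length = C * (1 + 4 + 16), independent of H, W

print(spatial_pyramid_pool(np.random.rand(256, 13, 13)).shape)   # (5376,)
print(spatial_pyramid_pool(np.random.rand(256, 9, 17)).shape)    # (5376,)
```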

2,304 citations


Journal ArticleDOI
TL;DR: For a broad family of features, this work finds that features computed at octave-spaced scale intervals are sufficient to approximate features on a finely-sampled pyramid, and this approximation yields considerable speedups with negligible loss in detection accuracy.
Abstract: Multi-resolution image features may be approximated via extrapolation from nearby scales, rather than being computed explicitly. This fundamental insight allows us to design object detection algorithms that are as accurate, and considerably faster, than the state-of-the-art. The computational bottleneck of many modern detectors is the computation of features at every scale of a finely-sampled image pyramid. Our key insight is that one may compute finely sampled feature pyramids at a fraction of the cost, without sacrificing performance: for a broad family of features we find that features computed at octave-spaced scale intervals are sufficient to approximate features on a finely-sampled pyramid. Extrapolation is inexpensive as compared to direct feature computation. As a result, our approximation yields considerable speedups with negligible loss in detection accuracy. We modify three diverse visual recognition systems to use fast feature pyramids and show results on both pedestrian detection (measured on the Caltech, INRIA, TUD-Brussels and ETH data sets) and general object detection (measured on the PASCAL VOC). The approach is general and is widely applicable to vision algorithms requiring fine-grained multi-scale analysis. Our approximation is valid for images with broad spectra (most natural images) and fails for images with narrow band-pass spectra (e.g., periodic textures).
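
The extrapolation the abstract refers to can be written in a few lines: compute the feature channel only at octave-spaced scales, then obtain intermediate scales by resampling the nearest computed octave and applying a power-law correction (s/s_oct)^(-lambda). The sketch below uses a toy gradient-magnitude channel, and the lambda value is illustrative; the paper estimates channel-specific exponents from data.

```python
import numpy as np

def resize(channel, scale):
    """Nearest-neighbour rescale of a 2-D feature channel."""
    h, w = channel.shape
    nh, nw = max(int(round(h * scale)), 1), max(int(round(w * scale)), 1)
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    return channel[rows][:, cols]

def fast_pyramid(image, compute_channel, scales, lam=0.1):
    """Compute a feature channel only at octave scales (1, 1/2, 1/4, ...) and
    approximate other scales by resampling the nearest computed octave and
    applying the power-law correction (s / s_oct) ** -lam."""
    octaves = {1.0: compute_channel(image)}
    s = 0.5
    while s >= min(scales):
        octaves[s] = compute_channel(resize(image, s))
        s /= 2
    pyramid = {}
    for s in scales:
        s_oct = min(o for o in octaves if o >= s)     # nearest computed octave at or above s
        ratio = s / s_oct
        pyramid[s] = resize(octaves[s_oct], ratio) * ratio ** (-lam)
    return pyramid

grad_mag = lambda im: np.hypot(*np.gradient(im))      # toy feature channel
pyr = fast_pyramid(np.random.rand(128, 128), grad_mag,
                   scales=[1.0, 0.84, 0.71, 0.5, 0.42])
print(sorted((s, ch.shape) for s, ch in pyr.items()))
```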

2,000 citations


Book ChapterDOI
06 Sep 2014
TL;DR: A new geocentric embedding for depth images is proposed that encodes, for each pixel, height above ground and angle with gravity in addition to the horizontal disparity.
Abstract: In this paper we study the problem of object detection for RGB-D images using semantically rich image and depth features. We propose a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. We demonstrate that this geocentric embedding works better than using raw depth images for learning feature representations with convolutional neural networks. Our final object detection system achieves an average precision of 37.3%, which is a 56% relative improvement over existing methods. We then focus on the task of instance segmentation where we label pixels belonging to object instances found by our detector. For this task, we propose a decision forest approach that classifies pixels in the detection window as foreground or background using a family of unary and binary tests that query shape and geocentric pose features. Finally, we use the output from our object detectors in an existing superpixel classification framework for semantic scene segmentation and achieve a 24% relative improvement over current state-of-the-art for the object categories that we study. We believe advances such as those represented in this paper will facilitate the use of perception in fields like robotics.
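
The geocentric embedding can be pictured as replacing the single depth channel with three channels per pixel: horizontal disparity, height above the ground plane, and the angle between the local surface normal and the gravity direction. The sketch below builds such an image from precomputed 3-D points and normals; the camera constants and the fixed gravity vector are illustrative assumptions (the paper estimates the gravity direction from the data).

```python
import numpy as np

def geocentric_encoding(depth_m, points_xyz, normals,
                        gravity=np.array([0.0, -1.0, 0.0]),
                        ground_height=0.0, baseline_m=0.075, focal_px=570.0):
    """Build a 3-channel geocentric image in the spirit of the embedding described above:
    horizontal disparity, height above an (assumed) ground plane, and the angle of the
    local surface normal with gravity. Camera constants here are illustrative."""
    disparity = baseline_m * focal_px / np.clip(depth_m, 1e-3, None)   # pixels
    height = points_xyz[..., 1] - ground_height                        # metres above ground
    cosang = np.clip((normals * gravity).sum(-1), -1.0, 1.0)
    angle = np.degrees(np.arccos(cosang))                               # angle with gravity
    return np.stack([disparity, height, angle], axis=-1)

h, w = 4, 5
depth = np.full((h, w), 2.0)
pts = np.random.rand(h, w, 3)
nrm = np.dstack([np.zeros((h, w)), -np.ones((h, w)), np.zeros((h, w))])
print(geocentric_encoding(depth, pts, nrm).shape)    # (4, 5, 3)
```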

1,414 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: A novel deformable part-based model is proposed, which exploits both local context around each candidate detection and global context at the level of the scene, significantly helping to detect objects at all scales.
Abstract: In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches. Towards this goal, we label every pixel of the PASCAL VOC 2010 detection challenge with a semantic category. We believe this data will provide plenty of challenges to the community, as it contains 520 additional classes for semantic segmentation and object detection. Our analysis shows that nearest neighbor based approaches perform poorly on semantic segmentation of contextual classes, showing the variability of PASCAL imagery. Furthermore, improvements of existing contextual models for detection are rather modest. In order to push forward the performance in this difficult scenario, we propose a novel deformable part-based model, which exploits both local context around each candidate detection as well as global context at the level of the scene. We show that this contextual reasoning significantly helps in detecting objects at all scales.

1,327 citations


Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work proposes a robust background measure, called boundary connectivity, which characterizes the spatial layout of image regions with respect to image boundaries and presents unique benefits that are absent in previous saliency measures.
Abstract: Recent progress in salient object detection has exploited the boundary prior, or background information, to assist other saliency cues such as contrast, achieving state-of-the-art results. However, these methods use the boundary prior in a simple and fragile way, and the integration with other cues is mostly heuristic. In this work, we present new methods to address these issues. First, we propose a robust background measure, called boundary connectivity. It characterizes the spatial layout of image regions with respect to image boundaries and is much more robust. It has an intuitive geometrical interpretation and presents unique benefits that are absent in previous saliency measures. Second, we propose a principled optimization framework to integrate multiple low level cues, including our background measure, to obtain clean and uniform saliency maps. Our formulation is intuitive, efficient and achieves state-of-the-art results on several benchmark datasets.
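
In its simplest form, boundary connectivity relates how much of a region touches the image border to how large the region is: background regions touch the border extensively, salient objects barely do. The toy version below works on a single binary region mask; the paper's actual measure is defined over superpixels with geodesic distances and is then fed into its optimization framework.

```python
import numpy as np

def boundary_connectivity(region_mask):
    """Boundary connectivity of a region: length of its overlap with the image border
    divided by the square root of its area. High values suggest background."""
    border = np.zeros_like(region_mask, dtype=bool)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    on_border = np.logical_and(region_mask, border).sum()
    area = region_mask.sum()
    return on_border / np.sqrt(max(area, 1))

sky = np.zeros((100, 100), dtype=bool)
sky[0:30, :] = True                       # a "sky-like" region touching the top border
print(boundary_connectivity(sky))         # large -> likely background

obj = np.zeros((100, 100), dtype=bool)
obj[40:60, 40:60] = True                  # an interior object region
print(boundary_connectivity(obj))         # 0 -> likely foreground
```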

1,321 citations


Book ChapterDOI
06 Sep 2014
TL;DR: This work builds on recent work that uses convolutional neural networks to classify category-independent region proposals (R-CNN), introducing a novel architecture tailored for SDS, and uses category-specific, top-down figure-ground predictions to refine the bottom-up proposals.
Abstract: We aim to detect all instances of a category in an image and, for each instance, mark the pixels that belong to it. We call this task Simultaneous Detection and Segmentation (SDS). Unlike classical bounding box detection, SDS requires a segmentation and not just a box. Unlike classical semantic segmentation, we require individual object instances. We build on recent work that uses convolutional neural networks to classify category-independent region proposals (R-CNN [16]), introducing a novel architecture tailored for SDS. We then use category-specific, top-down figure-ground predictions to refine our bottom-up proposals. We show a 7 point boost (16% relative) over our baselines on SDS, a 5 point boost (10% relative) over state-of-the-art on semantic segmentation, and state-of-the-art performance in object detection. Finally, we provide diagnostic tools that unpack performance and provide directions for future work.

Posted Content
TL;DR: Using a state-of-the-art CNN, this paper achieves 16-24 times compression of the network with only 1% loss of classification accuracy, and finds that for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods.
Abstract: Deep convolutional neural networks (CNNs) have become the most promising method for object recognition, repeatedly demonstrating record breaking results for image classification and object detection in recent years. However, a very deep CNN generally involves many layers with millions of parameters, making the storage of the network model extremely large. This prohibits the usage of deep CNNs on resource limited hardware, especially cell phones or other embedded devices. In this paper, we tackle this model storage issue by investigating information theoretical vector quantization methods for compressing the parameters of CNNs. In particular, we have found that for compressing the most storage-demanding densely connected layers, vector quantization methods have a clear gain over existing matrix factorization methods. Simply applying k-means clustering to the weights or conducting product quantization can lead to a very good balance between model size and recognition accuracy. For the 1000-category classification task in the ImageNet challenge, we are able to achieve 16-24 times compression of the network with only 1% loss of classification accuracy using the state-of-the-art CNN.
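
The simplest variant mentioned above, plain k-means on the weights of a dense layer, already shows where the compression comes from: each 32-bit weight is replaced by a small codebook index. The sketch below quantises a weight matrix to 256 clusters (one byte per weight, roughly 4x smaller before any further tricks); product quantization, which the paper finds even more effective, applies the same idea to sub-vectors of the weight matrix.

```python
import numpy as np

def kmeans_quantize(weights, k=256, iters=20):
    """Scalar k-means quantisation of a weight matrix: store a one-byte index per weight
    plus a k-entry codebook instead of a 32-bit float per weight."""
    flat = weights.ravel()
    centers = np.random.choice(flat, k, replace=False)        # initialise from the weights
    for _ in range(iters):
        assign = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)
        for c in range(k):
            members = flat[assign == c]
            if members.size:
                centers[c] = members.mean()                    # update cluster centres
    return assign.reshape(weights.shape).astype(np.uint8), centers

W = np.random.randn(100, 64).astype(np.float32)                # toy dense-layer weights
codes, codebook = kmeans_quantize(W, k=256)
W_hat = codebook[codes]                                        # reconstructed weights
print(np.abs(W - W_hat).mean())                                # quantisation error
```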

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work proposes a saliency-inspired neural network model for detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest.
Abstract: Deep convolutional neural networks have recently achieved state-of-the-art performance on a number of image recognition benchmarks, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on the localization sub-task was a network that predicts a single bounding box and a confidence score for each object category in the image. Such a model captures the whole-image context around the objects but cannot handle multiple instances of the same object in the image without naively replicating the number of outputs for each instance. In this work, we propose a saliency-inspired neural network model for detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest. The model naturally handles a variable number of instances for each class and allows for cross-class generalization at the highest levels of the network. We are able to obtain competitive recognition performance on VOC2007 and ILSVRC2012, while using only the top few predicted locations in each image and a small number of neural network evaluations.

Posted Content
TL;DR: A new geocentric embedding is proposed for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity to facilitate the use of perception in fields like robotics.
Abstract: In this paper we study the problem of object detection for RGB-D images using semantically rich image and depth features. We propose a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. We demonstrate that this geocentric embedding works better than using raw depth images for learning feature representations with convolutional neural networks. Our final object detection system achieves an average precision of 37.3%, which is a 56% relative improvement over existing methods. We then focus on the task of instance segmentation where we label pixels belonging to object instances found by our detector. For this task, we propose a decision forest approach that classifies pixels in the detection window as foreground or background using a family of unary and binary tests that query shape and geocentric pose features. Finally, we use the output from our object detectors in an existing superpixel classification framework for semantic scene segmentation and achieve a 24% relative improvement over current state-of-the-art for the object categories that we study. We believe advances such as those represented in this paper will facilitate the use of perception in fields like robotics.

Book ChapterDOI
06 Sep 2014
TL;DR: In this article, the authors propose a model for fine-grained categorization by leveraging deep convolutional features computed on bottom-up region proposals, which learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a finegrained category from a pose normalized representation.
Abstract: Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: It is observed that generic objects with a well-defined closed boundary can be discriminated by looking at the norm of gradients, with a suitable resizing of their corresponding image windows into a small fixed size, so as to train a generic objectness measure.
Abstract: Training a generic objectness measure to produce a small set of candidate object windows has been shown to speed up the classical sliding window object detection paradigm. We observe that generic objects with a well-defined closed boundary can be discriminated by looking at the norm of gradients, with a suitable resizing of their corresponding image windows into a small fixed size. Based on this observation and computational reasons, we propose to resize the window to 8 × 8 and use the norm of the gradients as a simple 64D feature to describe it, for explicitly training a generic objectness measure. We further show how the binarized version of this feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation, which requires only a few atomic operations (e.g. ADD, BITWISE SHIFT, etc.). Experiments on the challenging PASCAL VOC 2007 dataset show that our method efficiently (300fps on a single laptop CPU) generates a small set of category-independent, high quality object windows, yielding a 96.2% object detection rate (DR) with 1,000 proposals. By increasing the number of proposals and color spaces for computing BING features, our performance can be further improved to 99.5% DR.
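
The measure itself is almost trivially cheap: resize a window's gradient-magnitude map to 8x8, treat it as a 64-D feature, and score it with a learned linear template. The sketch below implements that non-binarised form with a simplified gradient norm; the BING step proper binarises both the feature and the template so the dot product reduces to a few bitwise operations, which is omitted here.

```python
import numpy as np

def normed_gradients(gray):
    """Gradient-magnitude map, a simplified stand-in for the 'normed gradients' feature."""
    gy, gx = np.gradient(gray.astype(float))
    return np.minimum(np.hypot(gx, gy), 255.0)

def ng_feature(gray, box, size=8):
    """64-D feature: gradient magnitudes of the window, resized to 8x8 and flattened."""
    x0, y0, x1, y1 = box
    ng = normed_gradients(gray[y0:y1, x0:x1])
    h, w = ng.shape
    rows = (np.arange(size) * h // size).clip(0, h - 1)
    cols = (np.arange(size) * w // size).clip(0, w - 1)
    return ng[rows][:, cols].ravel()

def objectness(gray, box, w_linear):
    """Objectness of a window as a dot product with a learned 64-D linear template."""
    return float(ng_feature(gray, box) @ w_linear)

img = np.random.rand(240, 320) * 255
template = np.random.randn(64)                 # stands in for the learned linear SVM
print(objectness(img, (40, 30, 140, 110), template))
```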

Proceedings ArticleDOI
24 Mar 2014
TL;DR: PASCAL3D+ dataset is contributed, which is a novel and challenging dataset for 3D object detection and pose estimation, and on average there are more than 3,000 object instances per category.
Abstract: 3D object detection and pose estimation methods have become popular in recent years since they can handle ambiguities in 2D images and also provide a richer description for objects compared to 2D object detectors. However, most of the datasets for 3D recognition are limited to a small amount of images per category or are captured in controlled environments. In this paper, we contribute the PASCAL3D+ dataset, which is a novel and challenging dataset for 3D object detection and pose estimation. PASCAL3D+ augments 12 rigid categories of the PASCAL VOC 2012 [4] with 3D annotations. Furthermore, more images are added for each category from ImageNet [3]. PASCAL3D+ images exhibit much more variability compared to the existing 3D datasets, and on average there are more than 3,000 object instances per category. We believe this dataset will provide a rich testbed to study 3D detection and pose estimation and will help to significantly push forward research in this area. We provide the results of variations of DPM [6] on our new dataset for object detection and viewpoint estimation in different scenarios, which can be used as baselines for the community. Our benchmark is available online at http://cvgl.stanford.edu/projects/pascal3d

Journal ArticleDOI
TL;DR: The detection and localization of anomalous behaviors in crowded scenes is considered, and a joint detector of temporal and spatial anomalies is proposed, based on a video representation that accounts for both appearance and dynamics, using a set of mixture of dynamic textures models.
Abstract: The detection and localization of anomalous behaviors in crowded scenes is considered, and a joint detector of temporal and spatial anomalies is proposed. The proposed detector is based on a video representation that accounts for both appearance and dynamics, using a set of mixture of dynamic textures models. These models are used to implement 1) a center-surround discriminant saliency detector that produces spatial saliency scores, and 2) a model of normal behavior that is learned from training data and produces temporal saliency scores. Spatial and temporal anomaly maps are then defined at multiple spatial scales, by considering the scores of these operators at progressively larger regions of support. The multiscale scores act as potentials of a conditional random field that guarantees global consistency of the anomaly judgments. A data set of densely crowded pedestrian walkways is introduced and used to evaluate the proposed anomaly detector. Experiments on this and other data sets show that the latter achieves state-of-the-art anomaly detection results.

Book ChapterDOI
06 Sep 2014
TL;DR: This work addresses the problem of estimating the 6D Pose of specific objects from a single RGB-D image by presenting a learned, intermediate representation in form of a dense 3D object coordinate labelling paired with a dense class labelling.
Abstract: This work addresses the problem of estimating the 6D Pose of specific objects from a single RGB-D image. We present a flexible approach that can deal with generic objects, both textured and texture-less. The key new concept is a learned, intermediate representation in form of a dense 3D object coordinate labelling paired with a dense class labelling. We are able to show that for a common dataset with texture-less objects, where template-based techniques are suitable and state of the art, our approach is slightly superior in terms of accuracy. We also demonstrate the benefits of our approach, compared to template-based techniques, in terms of robustness with respect to varying lighting conditions. Towards this end, we contribute a new ground truth dataset with 10k images of 20 objects captured each under three different lighting conditions. We demonstrate that our approach scales well with the number of objects and has capabilities to run fast.

Journal ArticleDOI
TL;DR: An effective small target detection algorithm inspired by the contrast mechanism of the human visual system and a derived kernel model is presented, which can improve the SNR of the image significantly.
Abstract: Robust small target detection at low signal-to-noise ratio (SNR) is very important in infrared search and track applications for self-defense or attacks. Consequently, an effective small target detection algorithm inspired by the contrast mechanism of the human visual system and a derived kernel model is presented in this paper. At the first stage, the local contrast map of the input image is obtained using the proposed local contrast measure, which measures the dissimilarity between the current location and its neighborhoods. In this way, target signal enhancement and background clutter suppression are achieved simultaneously. At the second stage, an adaptive threshold is adopted to segment the target. The experiments on two sequences have validated the detection capability of the proposed target detection method. Experimental evaluation results show that our method is simple and effective with respect to detection accuracy. In particular, the proposed method can improve the SNR of the image significantly.
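
The contrast mechanism can be sketched as a ratio test between a small central cell and its surrounding cells, followed by an adaptive threshold on the resulting map. The version below is a simplified rendering of that idea; the cell size, the exact ratio, and the threshold rule are illustrative rather than the paper's precise definitions.

```python
import numpy as np

def local_contrast_map(img, cell=3):
    """Simplified local-contrast map for small-target enhancement: for each location,
    compare the peak of the central cell with the mean of each of the 8 surrounding
    cells and keep the smallest (most conservative) ratio. Bright small targets give
    high values; flat background and large structures do not."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    r = cell
    for y in range(r, h - 2 * r):
        for x in range(r, w - 2 * r):
            peak = img[y:y + r, x:x + r].max()
            ratios = []
            for dy in (-r, 0, r):
                for dx in (-r, 0, r):
                    if dy == 0 and dx == 0:
                        continue
                    neigh = img[y + dy:y + dy + r, x + dx:x + dx + r]
                    ratios.append(peak * peak / max(neigh.mean(), 1e-6))
            out[y, x] = min(ratios)
    return out

frame = np.random.rand(64, 64) * 20
frame[30:32, 30:32] += 120                  # a dim "small target"
contrast = local_contrast_map(frame)
mask = contrast > contrast.mean() + 4 * contrast.std()   # simple adaptive threshold
print(mask.sum())
```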

Journal ArticleDOI
TL;DR: An accurate and robust method for detecting texts in natural scene images using a fast and effective pruning algorithm to extract Maximally Stable Extremal Regions (MSERs) as character candidates using the strategy of minimizing regularized variations is proposed.
Abstract: Text detection in natural scene images is an important prerequisite for many content-based image analysis tasks. In this paper, we propose an accurate and robust method for detecting texts in natural scene images. A fast and effective pruning algorithm is designed to extract Maximally Stable Extremal Regions (MSERs) as character candidates using the strategy of minimizing regularized variations. Character candidates are grouped into text candidates by the single-link clustering algorithm, where distance weights and clustering threshold are learned automatically by a novel self-training distance metric learning algorithm. The posterior probabilities of text candidates corresponding to non-text are estimated with a character classifier; text candidates with high non-text probabilities are eliminated and texts are identified with a text classifier. The proposed system is evaluated on the ICDAR 2011 Robust Reading Competition database; the f-measure is over 76%, much better than the state-of-the-art performance of 71%. Experiments on multilingual, street view, multi-orientation and even born-digital databases also demonstrate the effectiveness of the proposed method.

Posted Content
TL;DR: DenseNet is presented, an open source system that computes dense, multiscale features from the convolutional layers of a CNN based object classifier.
Abstract: Convolutional Neural Networks (CNNs) can provide accurate object classification. They can be extended to perform object detection by iterating over dense or selected proposed object regions. However, the runtime of such detectors scales as the total number and/or area of regions to examine per image, and training such detectors may be prohibitively slow. However, for some CNN classifier topologies, it is possible to share significant work among overlapping regions to be classified. This paper presents DenseNet, an open source system that computes dense, multiscale features from the convolutional layers of a CNN based object classifier. Future work will involve training efficient object detectors with DenseNet feature descriptors.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This paper shows that the most commonly-used measures for evaluating both non-binary maps and binary maps do not always provide a reliable evaluation, and proposes a new measure that amends these flaws.
Abstract: The output of many algorithms in computer-vision is either non-binary maps or binary maps (e.g., salient object detection and object segmentation). Several measures have been suggested to evaluate the accuracy of these foreground maps. In this paper, we show that the most commonly-used measures for evaluating both non-binary maps and binary maps do not always provide a reliable evaluation. This includes the Area-Under-the-Curve measure, the Average-Precision measure, the F-measure, and the evaluation measure of the PASCAL VOC segmentation challenge. We start by identifying three causes of inaccurate evaluation. We then propose a new measure that amends these flaws. An appealing property of our measure is being an intuitive generalization of the F-measure. Finally we propose four meta-measures to compare the adequacy of evaluation measures. We show via experiments that our novel measure is preferable.
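
For reference, the standard F-measure that the paper critiques is computed from the precision and recall of the binarised map, as in the sketch below; broadly speaking, the paper's proposed measure is a generalisation of this formula that weights the true/false-positive counts rather than counting them hard.

```python
import numpy as np

def f_measure(pred_mask, gt_mask, beta2=0.3):
    """Standard F-measure for a binarised foreground map (beta^2 = 0.3 is a common
    choice in the salient-object-detection literature)."""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    precision = tp / max(pred_mask.sum(), 1)
    recall = tp / max(gt_mask.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

gt = np.zeros((50, 50), dtype=bool)
gt[10:30, 10:30] = True
pred = np.zeros((50, 50), dtype=bool)
pred[12:32, 12:32] = True
print(f_measure(pred, gt))
```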

Book ChapterDOI
06 Sep 2014
TL;DR: It is shown that a properly trained vanilla DPM reaches top performance, improving over commercial and research systems, and a detector based on rigid templates - similar in structure to the Viola&Jones detector - can reach similar top performance on this task.
Abstract: Face detection is a mature problem in computer vision. While diverse high performing face detectors have been proposed in the past, we present two surprising new top performance results. First, we show that a properly trained vanilla DPM reaches top performance, improving over commercial and research systems. Second, we show that a detector based on rigid templates - similar in structure to the Viola&Jones detector - can reach similar top performance on this task. Importantly, we discuss issues with the existing evaluation benchmark and propose an improved procedure.

Journal ArticleDOI
TL;DR: Comparative experimental results indicate that the proposed HDNN significantly outperforms the traditional DNN on vehicle detection, by dividing the maps of the last convolutional layer and the max-pooling layer of DNN into multiple blocks of variable receptive field sizes or max- pooling field sizes to enable the HDNN to extract variable-scale features.
Abstract: Detecting small objects such as vehicles in satellite images is a difficult problem. Many features (such as histogram of oriented gradients, local binary patterns, scale-invariant feature transform, etc.) have been used to improve the performance of object detection, but mostly in simple environments such as those on roads. Kembhavi et al. reported that no satisfactory accuracy has been achieved in complex environments such as the City of San Francisco. Deep convolutional neural networks (DNNs) can learn rich features from the training data automatically and have achieved state-of-the-art performance in many image classification databases. Though the DNN has shown robustness to distortion, it only extracts features of the same scale, and hence is insufficient to tolerate large scale variations of objects. In this letter, we present a hybrid DNN (HDNN), by dividing the maps of the last convolutional layer and the max-pooling layer of the DNN into multiple blocks of variable receptive field sizes or max-pooling field sizes, to enable the HDNN to extract variable-scale features. Comparative experimental results indicate that our proposed HDNN significantly outperforms the traditional DNN on vehicle detection.

Journal ArticleDOI
TL;DR: Comprehensive evaluations on two remote sensing image databases and comparisons with some state-of-the-art approaches demonstrate the effectiveness and superiority of the developed framework.
Abstract: The rapid development of remote sensing technology has facilitated the acquisition of remote sensing images with higher and higher spatial resolution, but how to automatically understand the image contents is still a big challenge. In this paper, we develop a practical and rotation-invariant framework for multi-class geospatial object detection and geographic image classification based on a collection of part detectors (COPD). The COPD is composed of a set of representative and discriminative part detectors, where each part detector is a linear support vector machine (SVM) classifier used for the detection of objects or recurring spatial patterns within a certain range of orientation. Specifically, when performing multi-class geospatial object detection, we learn a set of seed-based part detectors where each part detector corresponds to a particular viewpoint of an object class, so the collection of them provides a solution for rotation-invariant detection of multi-class objects. When performing geographic image classification, we utilize a large number of pre-trained part detectors to discover distinctive visual parts from images and use them as attributes to represent the images. Comprehensive evaluations on two remote sensing image databases and comparisons with some state-of-the-art approaches demonstrate the effectiveness and superiority of the developed framework.

Journal ArticleDOI
TL;DR: This paper presents a comprehensive survey of existing local surface feature based 3D object recognition methods and enlists a number of popular and contemporary databases together with their relevant attributes.
Abstract: 3D object recognition in cluttered scenes is a rapidly growing research area. Based on the types of features used, 3D object recognition methods can broadly be divided into two categories: global or local feature based methods. Intensive research has been done on local surface feature based methods as they are more robust to the occlusion and clutter frequently present in real-world scenes. This paper presents a comprehensive survey of existing local surface feature based 3D object recognition methods. These methods generally comprise three phases: 3D keypoint detection, local surface feature description, and surface matching. This paper covers an extensive literature survey of each phase of the process. It also enlists a number of popular and contemporary databases together with their relevant attributes.

Posted Content
TL;DR: The creation of this benchmark dataset and the advances in object recognition that have been possible as a result are described, and the state-of-the-art computer vision accuracy with human accuracy is compared.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.

Journal ArticleDOI
TL;DR: A novel probabilistic generative model for multi-object traffic scene understanding from movable platforms which reasons jointly about the 3D scene layout as well as the location and orientation of objects in the scene is presented.
Abstract: In this paper, we present a novel probabilistic generative model for multi-object traffic scene understanding from movable platforms which reasons jointly about the 3D scene layout as well as the location and orientation of objects in the scene. In particular, the scene topology, geometry, and traffic activities are inferred from short video sequences. Inspired by the impressive driving capabilities of humans, our model does not rely on GPS, lidar, or map knowledge. Instead, it takes advantage of a diverse set of visual cues in the form of vehicle tracklets, vanishing points, semantic scene labels, scene flow, and occupancy grids. For each of these cues, we propose likelihood functions that are integrated into a probabilistic generative model. We learn all model parameters from training data using contrastive divergence. Experiments conducted on videos of 113 representative intersections show that our approach successfully infers the correct layout in a variety of very challenging scenarios. To evaluate the importance of each feature cue, experiments using different feature combinations are conducted. Furthermore, we show how by employing context derived from the proposed method we are able to improve over the state-of-the-art in terms of object detection and object orientation estimation in challenging and cluttered urban environments.

Book ChapterDOI
06 Sep 2014
TL;DR: This paper proposes to use depth maps for object detection and design a 3D detector to overcome the major difficulties for recognition, namely the variations of texture, illumination, shape, viewpoint, clutter, occlusion, self-occlusion and sensor noises.
Abstract: The depth information of RGB-D sensors has greatly simplified some common challenges in computer vision and enabled breakthroughs for several tasks. In this paper, we propose to use depth maps for object detection and design a 3D detector to overcome the major difficulties for recognition, namely the variations of texture, illumination, shape, viewpoint, clutter, occlusion, self-occlusion and sensor noise. We take a collection of 3D CAD models and render each CAD model from hundreds of viewpoints to obtain synthetic depth maps. For each depth rendering, we extract features from the 3D point cloud and train an Exemplar-SVM classifier. During testing and hard-negative mining, we slide a 3D detection window in 3D space. Experiment results show that our 3D detector significantly outperforms the state-of-the-art algorithms for both RGB and RGB-D images, and achieves about a 1.7× improvement in average precision compared to DPM and R-CNN. All source code and data are available online.