
Showing papers on "Object detection published in 2010"


Journal ArticleDOI
TL;DR: The state of the art in evaluated methods for both classification and detection is reviewed, analysing whether the methods are statistically different, what they are learning from the images, and what they find easy or confuse.
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset have become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three-year history of the challenge, and proposes directions for future improvement and extension.

15,935 citations


Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.
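The alternating scheme in the last sentence can be illustrated with a toy reduction: fix the latent choice for each positive (here simply which of several candidate feature vectors represents it), solve the now-convex hinge-loss problem, and repeat. This is a hedged sketch with a plain subgradient solver, not the authors' detector or their hard-negative mining machinery; all names are ours.

```python
import numpy as np

def train_latent_svm(pos_bags, negs, rounds=5, epochs=100, lr=0.01, C=1.0):
    """pos_bags: list of (n_i, dim) arrays, one row per possible latent choice
    for that positive example; negs: (m, dim) array of negative features."""
    dim = negs.shape[1]
    w = np.zeros(dim)
    for _ in range(rounds):
        # Step 1: fix latent values -- pick the highest-scoring latent
        # feature vector for each positive under the current model.
        X_pos = np.array([bag[np.argmax(bag @ w)] for bag in pos_bags])
        # Step 2: with latents fixed the problem is a convex SVM; a plain
        # hinge-loss subgradient solver stands in for the real optimizer.
        X = np.vstack([X_pos, negs])
        y = np.concatenate([np.ones(len(X_pos)), -np.ones(len(negs))])
        for _ in range(epochs):
            viol = y * (X @ w) < 1                     # margin violations
            grad = w - C * (y[viol, None] * X[viol]).sum(axis=0)
            w -= lr * grad
    return w
```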

10,501 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new type of correlation filter is presented, the Minimum Output Sum of Squared Error (MOSSE) filter, which produces stable correlation filters when initialized using a single frame; occlusion detection based on the peak-to-sidelobe ratio enables the tracker to pause and resume where it left off when the object reappears.
Abstract: Although not commonly used, correlation filters can track complex objects through rotations, occlusions and other distractions at over 20 times the rate of current state-of-the-art techniques. The oldest and simplest correlation filters use simple templates and generally fail when applied to tracking. More modern approaches such as ASEF and UMACE perform better, but their training needs are poorly suited to tracking. Visual tracking requires robust filters to be trained from a single frame and dynamically adapted as the appearance of the target object changes. This paper presents a new type of correlation filter, a Minimum Output Sum of Squared Error (MOSSE) filter, which produces stable correlation filters when initialized using a single frame. A tracker based upon MOSSE filters is robust to variations in lighting, scale, pose, and nonrigid deformations while operating at 669 frames per second. Occlusion is detected based upon the peak-to-sidelobe ratio, which enables the tracker to pause and resume where it left off when the object reappears.
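The MOSSE filter has a closed-form solution in the Fourier domain, which is what makes single-frame initialization and fast adaptation possible. Below is a minimal NumPy sketch of the filter and the peak-to-sidelobe ratio used for occlusion detection; the paper's preprocessing (log transform, cosine windowing, random affine perturbations of the first frame) and the online running-average update are omitted, and the helper names are ours.

```python
import numpy as np

def mosse_train(patches, target, eps=1e-5):
    """Closed-form MOSSE filter from training patches and a desired output
    (typically a Gaussian peak centred on the target)."""
    G = np.fft.fft2(target)
    A = np.zeros_like(G)            # numerator:   sum_i G * conj(F_i)
    B = np.zeros_like(G)            # denominator: sum_i F_i * conj(F_i)
    for p in patches:
        F = np.fft.fft2(p)
        A += G * np.conj(F)
        B += F * np.conj(F)
    return A / (B + eps)            # H*, the filter in the Fourier domain

def psr(response, exclude=5, eps=1e-5):
    """Peak-to-sidelobe ratio: a low value signals occlusion, telling the
    tracker to pause until the object reappears."""
    peak = response.max()
    py, px = np.unravel_index(response.argmax(), response.shape)
    mask = np.ones(response.shape, dtype=bool)
    mask[max(py - exclude, 0):py + exclude + 1,
         max(px - exclude, 0):px + exclude + 1] = False   # drop peak window
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + eps)
```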

2,948 citations


Journal ArticleDOI
TL;DR: This review shows that, despite their apparent simplicity, the development of a general eye detection technique involves addressing many challenges, requires further theoretical developments, and is consequently of interest to many other problem domains in computer vision and beyond.
Abstract: Despite active research and significant progress in the last 30 years, eye detection and tracking remains challenging due to the individuality of eyes, occlusion, variability in scale, location, and light conditions. Data on eye location and details of eye movements have numerous applications and are essential in face detection, biometric identification, and particular human-computer interaction tasks. This paper reviews current progress and state of the art in video-based eye detection and tracking in order to identify promising techniques as well as issues to be further addressed. We present a detailed review of recent eye models and techniques for eye detection and tracking. We also survey methods for gaze estimation and compare them based on their geometric properties and reported accuracies. This review shows that, despite their apparent simplicity, the development of a general eye detection technique involves addressing many challenges, requires further theoretical developments, and is consequently of interest to many other problem domains in computer vision and beyond.

1,514 citations


Journal ArticleDOI
TL;DR: An EM-based algorithm to compute dense depth and occlusion maps from wide-baseline image pairs using a local image descriptor, DAISY, which is very efficient to compute densely and robust against many photometric and geometric transformations.
Abstract: In this paper, we introduce a local image descriptor, DAISY, which is very efficient to compute densely. We also present an EM-based algorithm to compute dense depth and occlusion maps from wide-baseline image pairs using this descriptor. This yields much better results in wide-baseline situations than the pixel and correlation-based algorithms that are commonly used in narrow-baseline stereo. Also, using a descriptor makes our algorithm robust against many photometric and geometric transformations. Our descriptor is inspired by earlier ones such as SIFT and GLOH but can be computed much faster for our purposes. Unlike SURF, which can also be computed efficiently at every pixel, it does not introduce artifacts that degrade the matching performance when used densely. Notably, our approach is the first algorithm that attempts to estimate dense depth maps from wide-baseline image pairs; we validate it with many experiments on depth estimation accuracy and occlusion detection, comparing it against other descriptors on laser-scanned ground-truth scenes. We also tested our approach on a variety of indoor and outdoor scenes with different photometric and geometric transformations, and our experiments support our claim of robustness against these.

1,484 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new type of saliency is proposed – context-aware saliency – which aims at detecting the image regions that represent the scene, and a detection algorithm is presented that is based on four principles observed in the psychological literature.
Abstract: We propose a new type of saliency – context-aware saliency – which aims at detecting the image regions that represent the scene. This definition differs from previous definitions whose goal is to either identify fixation points or detect the dominant object. In accordance with our saliency definition, we present a detection algorithm which is based on four principles observed in the psychological literature. The benefits of the proposed approach are evaluated in two applications where the context of the dominant objects is just as essential as the objects themselves. In image retargeting we demonstrate that using our saliency prevents distortions in the important regions. In summarization we show that our saliency helps to produce compact, appealing, and informative summaries.

1,117 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: In analogy to probably approximately correct (PAC) learning, the notion of probably approximately admissible (PAA) thresholds is introduced; such thresholds provide theoretical guarantees on the performance of the cascade method and can be computed from a small sample of positive examples.
Abstract: We describe a general method for building cascade classifiers from part-based deformable models such as pictorial structures. We focus primarily on the case of star-structured models and show how a simple algorithm based on partial hypothesis pruning can speed up object detection by more than one order of magnitude without sacrificing detection accuracy. In our algorithm, partial hypotheses are pruned with a sequence of thresholds. In analogy to probably approximately correct (PAC) learning, we introduce the notion of probably approximately admissible (PAA) thresholds. Such thresholds provide theoretical guarantees on the performance of the cascade method and can be computed from a small sample of positive examples. Finally, we outline a cascade detection algorithm for a general class of models defined by a grammar formalism. This class includes not only tree-structured pictorial structures but also richer models that can represent each part recursively as a mixture of other parts.
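The pruning idea reduces to accumulating per-part score contributions in a fixed order and abandoning a hypothesis as soon as the partial sum drops below the threshold for that stage. A minimal sketch assuming the PAA thresholds are already computed; the actual cascade also prunes over deformation placements and evaluates simplified part filters first.

```python
def cascade_score(part_scores, thresholds):
    """part_scores: per-part score contributions for one root location, in
    evaluation order; thresholds: learned PAA thresholds t_1..t_n."""
    total = 0.0
    for score, t in zip(part_scores, thresholds):
        total += score
        if total < t:      # prune: this hypothesis can no longer be admissible
            return None    # rejected without evaluating the remaining parts
    return total           # survived every stage: full detection score
```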

975 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: A generic objectness measure, quantifying how likely it is for an image window to contain an object of any class, is presented, combining in a Bayesian framework several image cues measuring characteristics of objects, such as appearing different from their surroundings and having a closed boundary.
Abstract: We present a generic objectness measure, quantifying how likely it is for an image window to contain an object of any class. We explicitly train it to distinguish objects with a well-defined boundary in space, such as cows and telephones, from amorphous background elements, such as grass and road. The measure combines in a Bayesian framework several image cues measuring characteristics of objects, such as appearing different from their surroundings and having a closed boundary. This includes an innovative cue measuring the closed boundary characteristic. In experiments on the challenging PASCAL VOC 07 dataset, we show this new cue to outperform a state-of-the-art saliency measure [17], and the combined measure to perform better than any cue alone. Finally, we show how to sample windows from an image according to their objectness distribution and give an algorithm to employ them as location priors for modern class-specific object detectors. In experiments on PASCAL VOC 07 we show this greatly reduces the number of windows evaluated by class-specific object detectors.
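Combining cues in a Bayesian framework, under a naive independence assumption, amounts to multiplying per-cue likelihood ratios into a single posterior. An illustrative sketch, assuming the per-cue likelihoods p(cue | object) and p(cue | background) have been learned elsewhere; this is not the authors' exact formulation.

```python
def objectness_posterior(cue_values, lik_obj, lik_bg, prior=0.5):
    """cue_values: observed cue scores for one window; lik_obj / lik_bg:
    callables returning p(cue | object) and p(cue | background)."""
    p_obj, p_bg = prior, 1.0 - prior
    for v, lo, lb in zip(cue_values, lik_obj, lik_bg):
        p_obj *= lo(v)     # cues treated as independent given the class
        p_bg *= lb(v)
    return p_obj / (p_obj + p_bg)   # probability the window holds an object
```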

969 citations


Journal ArticleDOI
TL;DR: Extensive tests of videos, natural images, and psychological patterns show that the proposed PQFT model is more effective in saliency detection and can predict eye fixations better than other state-of-the-art models in previous literature.
Abstract: Salient areas in natural scenes are generally regarded as areas which the human eye will typically focus on, and finding these areas is the key step in object detection. In computer vision, many models have been proposed to simulate the behavior of eyes such as SaliencyToolBox (STB), Neuromorphic Vision Toolkit (NVT), and others, but they demand high computational cost and computing useful results mostly relies on their choice of parameters. Although some region-based approaches were proposed to reduce the computational complexity of feature maps, these approaches still were not able to work in real time. Recently, a simple and fast approach called spectral residual (SR) was proposed, which uses the SR of the amplitude spectrum to calculate the image's saliency map. However, in our previous work, we pointed out that it is the phase spectrum, not the amplitude spectrum, of an image's Fourier transform that is key to calculating the location of salient areas, and proposed the phase spectrum of Fourier transform (PFT) model. In this paper, we present a quaternion representation of an image which is composed of intensity, color, and motion features. Based on the principle of PFT, a novel multiresolution spatiotemporal saliency detection model called phase spectrum of quaternion Fourier transform (PQFT) is proposed in this paper to calculate the spatiotemporal saliency map of an image by its quaternion representation. Distinct from other models, the added motion dimension allows the phase spectrum to represent spatiotemporal saliency in order to perform attention selection not only for images but also for videos. In addition, the PQFT model can compute the saliency map of an image under various resolutions from coarse to fine. Therefore, the hierarchical selectivity (HS) framework based on the PQFT model is introduced here to construct the tree structure representation of an image. With the help of HS, a model called multiresolution wavelet domain foveation (MWDF) is proposed in this paper to improve coding efficiency in image and video compression. Extensive tests of videos, natural images, and psychological patterns show that the proposed PQFT model is more effective in saliency detection and can predict eye fixations better than other state-of-the-art models in previous literature. Moreover, our model requires low computational cost and, therefore, can work in real time. Additional experiments on image and video compression show that the HS-MWDF model can achieve higher compression rate than the traditional model.
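The PFT step at the heart of PQFT is very compact for a single grayscale image: discard the amplitude spectrum, invert the phase-only transform, square, and smooth. A minimal sketch; the full PQFT model instead applies a quaternion Fourier transform to intensity, colour, and motion channels and operates over multiple resolutions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pft_saliency(img, sigma=3.0):
    """Phase-spectrum saliency map for a 2-D grayscale array."""
    f = np.fft.fft2(img.astype(float))
    phase_only = np.exp(1j * np.angle(f))   # unit amplitude, original phase
    recon = np.fft.ifft2(phase_only)        # reconstruct from phase alone
    return gaussian_filter(np.abs(recon) ** 2, sigma=sigma)
```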

944 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: A novel method is proposed that creates a global model description based on oriented point pair features and matches that model locally using a fast voting scheme, which allows using much sparser object and scene point clouds, resulting in very fast performance.
Abstract: This paper addresses the problem of recognizing free-form 3D objects in point clouds. Compared to traditional approaches based on point descriptors, which depend on local information around points, we propose a novel method that creates a global model description based on oriented point pair features and matches that model locally using a fast voting scheme. The global model description consists of all model point pair features and represents a mapping from the point pair feature space to the model, where similar features on the model are grouped together. Such a representation allows using much sparser object and scene point clouds, resulting in very fast performance. Recognition is done locally using an efficient voting scheme on a reduced two-dimensional search space. We demonstrate the efficiency of our approach and show its high recognition performance in the case of noise, clutter and partial occlusions. Compared to state-of-the-art approaches we achieve better recognition rates, and demonstrate that, with little or no sacrifice in recognition performance, our method is much faster than the current state of the art.
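The oriented point pair feature itself is just a distance and three angles; the full method quantizes this four-tuple as a key into a hash table mapping features to model point pairs, which drives the voting. A small sketch of the feature computation (helper names are ours):

```python
import numpy as np

def angle(a, b):
    """Angle between two vectors, clipped for numerical safety."""
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def point_pair_feature(p1, n1, p2, n2):
    """F(m1, m2) for two oriented points (position p, normal n)."""
    d = p2 - p1
    return (float(np.linalg.norm(d)),  # distance between the two points
            angle(n1, d),              # angle(normal 1, difference vector)
            angle(n2, d),              # angle(normal 2, difference vector)
            angle(n1, n2))             # angle between the two normals
```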

808 citations


Journal ArticleDOI
TL;DR: This paper shows that formulating the problem in a naive Bayesian classification framework makes such preprocessing unnecessary and produces an algorithm that is simple, efficient, and robust, and it scales well as the number of classes grows.
Abstract: While feature point recognition is a key component of modern approaches to object detection, existing approaches require computationally expensive patch preprocessing to handle perspective distortion. In this paper, we show that formulating the problem in a naive Bayesian classification framework makes such preprocessing unnecessary and produces an algorithm that is simple, efficient, and robust. Furthermore, it scales well as the number of classes grows. To recognize the patches surrounding keypoints, our classifier uses hundreds of simple binary features and models class posterior probabilities. We make the problem computationally tractable by assuming independence between arbitrary sets of features. Even though this is not strictly true, we demonstrate that our classifier nevertheless performs remarkably well on image data sets containing very significant perspective changes.
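The grouping of binary features into independent blocks, multiplying per-block class likelihoods, is the semi-naive Bayes structure described above; such blocks are often called ferns. An illustrative sketch with Laplace-smoothed counts; in the paper the binary features are simple intensity comparisons in the patch around a keypoint, which are omitted here.

```python
import numpy as np

class Ferns:
    def __init__(self, n_ferns, fern_size, n_classes):
        # Laplace-smoothed counts: one table of size 2^fern_size per fern.
        self.counts = np.ones((n_ferns, 2 ** fern_size, n_classes))

    def _index(self, bits):
        """bits: (n_ferns, fern_size) binary array -> one integer per fern."""
        return (bits * (1 << np.arange(bits.shape[1]))).sum(axis=1)

    def train(self, bits, label):
        """Accumulate one training patch's features for the given class."""
        self.counts[np.arange(len(bits)), self._index(bits), label] += 1

    def classify(self, bits):
        """Sum per-fern log-likelihoods (independence across ferns)."""
        probs = self.counts / self.counts.sum(axis=2, keepdims=True)
        logp = np.log(probs[np.arange(len(bits)), self._index(bits)]).sum(axis=0)
        return int(np.argmax(logp))
```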

Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new random field model is proposed to encode the mutual context of objects and human poses in human-object interaction activities and it is shown that this mutual context model significantly outperforms state-of-the-art in detecting very difficult objects andhuman poses.
Abstract: Detecting objects in cluttered scenes and estimating articulated human body parts are two challenging problems in computer vision. The difficulty is particularly pronounced in activities involving human-object interactions (e.g. playing tennis), where the relevant object tends to be small or only partially visible, and the human body parts are often self-occluded. We observe, however, that objects and human poses can serve as mutual context to each other – recognizing one facilitates the recognition of the other. In this paper we propose a new random field model to encode the mutual context of objects and human poses in human-object interaction activities. We then cast the model learning task as a structure learning problem, of which the structural connectivity between the object, the overall human pose, and different body parts are estimated through a structure search approach, and the parameters of the model are estimated by a new max-margin algorithm. On a sports data set of six classes of human-object interactions [12], we show that our mutual context model significantly outperforms state-of-the-art in detecting very difficult objects and human poses.

Book ChapterDOI
05 Sep 2010
TL;DR: This paper generalizes PatchMatch in three ways: to find k nearest neighbors, as opposed to just one, to search across scales and rotations, in addition to just translations, and to match using arbitrary descriptors and distances, not just sum-of-squared-differences on patch colors.
Abstract: PatchMatch is a fast algorithm for computing dense approximate nearest neighbor correspondences between patches of two image regions [1]. This paper generalizes PatchMatch in three ways: (1) to find k nearest neighbors, as opposed to just one, (2) to search across scales and rotations, in addition to just translations, and (3) to match using arbitrary descriptors and distances, not just sum-of-squared-differences on patch colors. In addition, we offer new search and parallelization strategies that further accelerate the method, and we show performance improvements over standard kd-tree techniques across a variety of inputs. In contrast to many previous matching algorithms, which for efficiency reasons have restricted matching to sparse interest points, or spatially proximate matches, our algorithm can efficiently find global, dense matches, even while matching across all scales and rotations. This is especially useful for computer vision applications, where our algorithm can be used as an efficient general-purpose component. We explore a variety of vision applications: denoising, finding forgeries by detecting cloned regions, symmetry detection, and object detection.
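For reference, the core PatchMatch loop alternates propagation of good matches from already-scanned neighbours with a random search at exponentially shrinking radii. A compact, unoptimized sketch for the basic case (single nearest neighbour, translations only, SSD on grayscale patches); the paper's generalization adds k nearest neighbours, rotations and scales, and arbitrary descriptors.

```python
import numpy as np

def ssd(A, B, ay, ax, by, bx, P):
    d = A[ay:ay+P, ax:ax+P] - B[by:by+P, bx:bx+P]
    return float((d * d).sum())

def patchmatch(A, B, P=7, iters=5, seed=0):
    """Approximate NN field from every PxP patch of A into B."""
    rng = np.random.default_rng(seed)
    Ha, Wa = A.shape[0] - P + 1, A.shape[1] - P + 1
    Hb, Wb = B.shape[0] - P + 1, B.shape[1] - P + 1
    # Random initialization of the nearest-neighbour field.
    nnf = np.stack([rng.integers(0, Hb, (Ha, Wa)),
                    rng.integers(0, Wb, (Ha, Wa))], axis=-1)
    cost = np.array([[ssd(A, B, y, x, nnf[y, x, 0], nnf[y, x, 1], P)
                      for x in range(Wa)] for y in range(Ha)])

    def try_improve(y, x, by, bx):
        if 0 <= by < Hb and 0 <= bx < Wb:
            c = ssd(A, B, y, x, by, bx, P)
            if c < cost[y, x]:
                cost[y, x], nnf[y, x] = c, (by, bx)

    for it in range(iters):
        step = 1 if it % 2 == 0 else -1          # alternate scan direction
        ys = range(Ha) if step == 1 else range(Ha - 1, -1, -1)
        xs = list(range(Wa)) if step == 1 else list(range(Wa - 1, -1, -1))
        for y in ys:
            for x in xs:
                # Propagation: adopt the shifted match of the previous neighbour.
                for dy, dx in ((-step, 0), (0, -step)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < Ha and 0 <= nx < Wa:
                        try_improve(y, x, nnf[ny, nx, 0] - dy, nnf[ny, nx, 1] - dx)
                # Random search around the current best, shrinking radius.
                r = max(Hb, Wb)
                while r >= 1:
                    try_improve(y, x,
                                nnf[y, x, 0] + rng.integers(-r, r + 1),
                                nnf[y, x, 1] + rng.integers(-r, r + 1))
                    r //= 2
    return nnf, cost
```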

Proceedings ArticleDOI
12 Jun 2010
TL;DR: In this paper, a three-stage process is proposed to recover 3D human pose from monocular image sequences in real-world scenarios, such as crowded street scenes, based on tracking-by-detection.
Abstract: Automatic recovery of 3D human pose from monocular image sequences is a challenging and important research topic with numerous applications. Although current methods are able to recover 3D pose for a single person in controlled environments, they are severely challenged by real-world scenarios, such as crowded street scenes. To address this problem, we propose a three-stage process building on a number of recent advances. The first stage obtains an initial estimate of the 2D articulation and viewpoint of the person from single frames. The second stage allows early data association across frames based on tracking-by-detection. These two stages successfully accumulate the available 2D image evidence into robust estimates of 2D limb positions over short image sequences (= tracklets). The third and final stage uses those tracklet-based estimates as robust image observations to reliably recover 3D pose. We demonstrate state-of-the-art performance on the HumanEva II benchmark, and also show the applicability of our approach to articulated 3D tracking in realistic street conditions.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: Results show that the proposed novel method for crowd flow modeling and anomaly detection achieves higher accuracy in anomaly detection and can effectively localize anomalies.
Abstract: A novel method for crowd flow modeling and anomaly detection is proposed for both coherent and incoherent scenes. The novelty is revealed in three aspects. First, it is a unique utilization of particle trajectories for modeling crowded scenes, in which we propose new and efficient representative trajectories for modeling arbitrarily complicated crowd flows. Second, chaotic dynamics are introduced into the crowd context to characterize complicated crowd motions by regulating a set of chaotic invariant features, which are reliably computed and used for detecting anomalies. Third, a probabilistic framework for anomaly detection and localization is formulated. The overall work-flow begins with particle advection based on optical flow. Then particle trajectories are clustered to obtain representative trajectories for a crowd flow. Next, the chaotic dynamics of all representative trajectories are extracted and quantified using chaotic invariants known as the maximal Lyapunov exponent and correlation dimension. A probabilistic model is learned from this set of chaotic features, and finally, a maximum likelihood estimation criterion is adopted to identify a query video of a scene as normal or abnormal. Furthermore, an effective anomaly localization algorithm is designed to locate the position and size of an anomaly. Experiments are conducted on known crowd data sets, and results show that our method achieves higher accuracy in anomaly detection and can effectively localize anomalies.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This work shows that motion features derived from optic flow yield substantial improvements on image sequences, if implemented correctly — even in the case of low-quality video and consequently degraded flow fields, and introduces a new feature, self-similarity on color channels, which consistently improves detection performance across different datasets.
Abstract: Despite impressive progress in people detection the performance on challenging datasets like Caltech Pedestrians or TUD-Brussels is still unsatisfactory. In this work we show that motion features derived from optic flow yield substantial improvements on image sequences, if implemented correctly — even in the case of low-quality video and consequently degraded flow fields. Furthermore, we introduce a new feature, self-similarity on color channels, which consistently improves detection performance both for static images and for video sequences, across different datasets. In combination with HOG, these two features outperform the state-of-the-art by up to 20%. Finally, we report two insights concerning detector evaluations, which apply to classifier-based object detection in general. First, we show that a commonly under-estimated detail of training, the number of bootstrapping rounds, has a drastic influence on the relative (and absolute) performance of different feature/classifier combinations. Second, we discuss important intricacies of detector evaluation and show that current benchmarking protocols lack crucial details, which can distort evaluations.

Book ChapterDOI
05 Sep 2010
TL;DR: A new algorithm for detecting people using poselets is developed; it uses only 2D annotations, which are much easier for naive human annotators, and is the current best performer on the task of people detection and segmentation.
Abstract: Bourdev and Malik (ICCV 09) introduced a new notion of parts, poselets, constructed to be tightly clustered both in the configuration space of keypoints, as well as in the appearance space of image patches. In this paper we develop a new algorithm for detecting people using poselets. Unlike that work which used 3D annotations of keypoints, we use only 2D annotations which are much easier for naive human annotators. The main algorithmic contribution is in how we use the pattern of poselet activations. Individual poselet activations are noisy, but considering the spatial context of each can provide vital disambiguating information, just as object detection can be improved by considering the detection scores of nearby objects in the scene. This can be done by training a two-layer feed-forward network with weights set using a max margin technique. The refined poselet activations are then clustered into mutually consistent hypotheses where consistency is based on empirically determined spatial keypoint distributions. Finally, bounding boxes are predicted for each person hypothesis and shape masks are aligned to edges in the image to provide a segmentation. To the best of our knowledge, the resulting system is the current best performer on the task of people detection and segmentation with an average precision of 47.8% and 40.5% respectively on PASCAL VOC 2009.

Proceedings ArticleDOI
03 Dec 2010
TL;DR: This paper introduces a method for salient region detection that retains the advantages of full-resolution saliency maps while overcoming their shortcomings, and compares it to six state-of-the-art salient region detection methods using publicly available ground truth.
Abstract: Detection of visually salient image regions is useful for applications like object segmentation, adaptive compression, and object recognition. Recently, full-resolution saliency maps that retain well-defined boundaries have attracted attention. In these maps, boundaries are preserved by retaining substantially more frequency content from the original image than older techniques. However, if the salient regions comprise more than half the pixels of the image, or if the background is complex, the background gets highlighted instead of the salient object. In this paper, we introduce a method for salient region detection that retains the advantages of such saliency maps while overcoming their shortcomings. Our method exploits features of color and luminance, is simple to implement and is computationally efficient. We compare our algorithm to six state-of-the-art salient region detection methods using publicly available ground truth. Our method outperforms the six algorithms by achieving both higher precision and better recall. We also show application of our saliency maps in an automatic salient object segmentation scheme using graph-cuts.
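One way to keep large salient regions from flipping with the background is to compare each pixel against the mean of its largest symmetric surround rather than against a global image mean, so surrounds shrink near the borders. A grayscale sketch of that idea using an integral image for constant-time region means; the published method operates on Lab colour channels, and this simplification is ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def symmetric_surround_saliency(gray):
    """Symmetric-surround saliency for a 2-D grayscale array."""
    H, W = gray.shape
    blur = gaussian_filter(gray.astype(float), sigma=1.0)
    # Integral image with a zero top row/left column for O(1) region sums.
    ii = np.pad(gray.astype(float), ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    sal = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            oy, ox = min(y, H - 1 - y), min(x, W - 1 - x)  # symmetric offsets
            y0, y1, x0, x1 = y - oy, y + oy + 1, x - ox, x + ox + 1
            area = (y1 - y0) * (x1 - x0)
            mean = (ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]) / area
            sal[y, x] = (blur[y, x] - mean) ** 2   # contrast to local surround
    return sal
```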

Journal ArticleDOI
TL;DR: The proposed method automatically merges the regions that are initially segmented by mean shift segmentation, and then effectively extracts the object contour by labeling all the non-marker regions as either background or object.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper introduces a new dataset with images that contain many instances of different object categories and proposes an efficient model that captures the contextual information among more than a hundred object categories, showing that the context model can be applied to scene understanding tasks that local detectors alone cannot solve.
Abstract: There has been a growing interest in exploiting contextual information in addition to local features to detect and localize multiple object categories in an image. Context models can efficiently rule out some unlikely combinations or locations of objects and guide detectors to produce a semantically coherent interpretation of a scene. However, the performance benefit from using context models has been limited because most of these methods were tested on datasets with only a few object categories, in which most images contain only one or two object categories. In this paper, we introduce a new dataset with images that contain many instances of different object categories, and propose an efficient model that captures the contextual information among more than a hundred object categories. We show that our context model can be applied to scene understanding tasks that local detectors alone cannot solve.

Book ChapterDOI
05 Sep 2010
TL;DR: A probabilistic framework for reasoning about regions, objects, and their attributes such as object class, location, and spatial extent is presented, which combines results from sliding window detectors, and low-level pixel-based unary and pairwise relations.
Abstract: Computer vision algorithms for individual tasks such as object recognition, detection and segmentation have shown impressive results in the recent past. The next challenge is to integrate all these algorithms and address the problem of scene understanding. This paper is a step towards this goal. We present a probabilistic framework for reasoning about regions, objects, and their attributes such as object class, location, and spatial extent. Our model is a Conditional Random Field defined on pixels, segments and objects. We define a global energy function for the model, which combines results from sliding window detectors, and low-level pixel-based unary and pairwise relations. One of our primary contributions is to show that this energy function can be solved efficiently. Experimental results show that our model achieves significant improvement over the baseline methods on CamVid and PASCAL VOC datasets.

Journal ArticleDOI
TL;DR: A multi-object filter suitable for image observations with low signal-to-noise ratio (SNR) is developed and a particle implementation of the multi- object filter is proposed and demonstrated via simulations.
Abstract: The problem of jointly detecting multiple objects and estimating their states from image observations is formulated in a Bayesian framework by modeling the collection of states as a random finite set. Analytic characterizations of the posterior distribution of this random finite set are derived for various prior distributions under the assumption that the regions of the observation influenced by individual objects do not overlap. These results provide tractable means to jointly estimate the number of states and their values from image observations. As an application, we develop a multi-object filter suitable for image observations with low signal-to-noise ratio (SNR). A particle implementation of the multi-object filter is proposed and demonstrated via simulations.

Journal ArticleDOI
TL;DR: Frequency analysis allows for greater accuracy in the removal of dynamic weather and in the performance of feature extraction than previous pixel-based or patch-based methods, and is effective for videos with both scene and camera motions.
Abstract: Dynamic weather such as rain and snow causes complex spatio-temporal intensity fluctuations in videos. Such fluctuations can adversely impact vision systems that rely on small image features for tracking, object detection and recognition. While these effects appear to be chaotic in space and time, we show that dynamic weather has a predictable global effect in frequency space. For this, we first develop a model of the shape and appearance of a single rain or snow streak in image space. Detecting individual streaks is difficult even with an accurate appearance model, so we combine the streak model with the statistical characteristics of rain and snow to create a model of the overall effect of dynamic weather in frequency space. Our model is then fit to a video and is used to detect rain or snow streaks first in frequency space, and the detection result is then transferred to image space. Once detected, the amount of rain or snow can be reduced or increased. We demonstrate that our frequency analysis allows for greater accuracy in the removal of dynamic weather and in the performance of feature extraction than previous pixel-based or patch-based methods. We also show that unlike previous techniques, our approach is effective for videos with both scene and camera motions.

Journal ArticleDOI
TL;DR: An object class detection approach is presented which fully integrates the complementary strengths offered by shape matchers; it can localize object boundaries accurately and does not need segmented examples for training (only bounding-boxes).
Abstract: We present an object class detection approach which fully integrates the complementary strengths offered by shape matchers. Like an object detector, it can learn class models directly from images, and can localize novel instances in the presence of intra-class variations, clutter, and scale changes. Like a shape matcher, it finds the boundaries of objects, rather than just their bounding-boxes. This is achieved by a novel technique for learning a shape model of an object class given images of example instances. Furthermore, we also integrate Hough-style voting with a non-rigid point matching algorithm to localize the model in cluttered images. As demonstrated by an extensive evaluation, our method can localize object boundaries accurately and does not need segmented examples for training (only bounding-boxes).

Journal ArticleDOI
TL;DR: This paper presents a novel object recognition algorithm that performs automatic dataset collection and incremental model learning simultaneously, adapting a non-parametric latent topic model within an incremental learning framework.
Abstract: The explosion of the Internet provides us with a tremendous resource of images shared online. It also confronts vision researchers with the problem of finding effective methods to navigate the vast amount of visual information. Semantic image understanding plays a vital role towards solving this problem. One important task in image understanding is object recognition, in particular, generic object categorization. Critical to this problem are the issues of learning and datasets. Abundant data helps to train a robust recognition system, while a good object classifier can help to collect a large amount of images. This paper presents a novel object recognition algorithm that performs automatic dataset collecting and incremental model learning simultaneously. The goal of this work is to use the tremendous resources of the web to learn robust object category models for detecting and searching for objects in real-world cluttered scenes. Humans continuously update their knowledge of objects when new examples are observed. Our framework emulates this human learning process by iteratively accumulating model knowledge and image examples. We adapt a non-parametric latent topic model and propose an incremental learning framework. Our algorithm is capable of automatically collecting much larger object category datasets for 22 randomly selected classes from the Caltech 101 dataset. Furthermore, our system offers not only more images in each object category but also a robust object category model and meaningful image annotation. Our experiments show that OPTIMOL is capable of collecting image datasets that are superior to the well-known manually collected object datasets Caltech 101 and LabelMe.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper describes an incremental concave-convex procedure (iCCCP) which allows us to learn both two and three layer models efficiently and demonstrates the advantages of three layer hierarchies - outperforming Felzenszwalb et al.'s two layer models on all 20 classes.
Abstract: We present a latent hierarchical structural learning method for object detection. An object is represented by a mixture of hierarchical tree models where the nodes represent object parts. The nodes can move spatially to allow both local and global shape deformations. The models can be trained discriminatively using latent structural SVM learning, where the latent variables are the node positions and the mixture component. But current learning methods are slow, due to the large number of parameters and latent variables, and have been restricted to hierarchies with two layers. In this paper we describe an incremental concave-convex procedure (iCCCP) which allows us to learn both two and three layer models efficiently. We show that iCCCP leads to a simple training algorithm which avoids complex multi-stage layer-wise training, careful part selection, and achieves good performance without requiring elaborate initialization. We perform object detection using our learnt models and obtain performance comparable with state-of-the-art methods when evaluated on challenging public PASCAL datasets. We demonstrate the advantages of three layer hierarchies – outperforming Felzenszwalb et al.'s two layer models on all 20 classes.

Book ChapterDOI
05 Sep 2010
TL;DR: This work describes a multiresolution model that acts as a deformable part-based model when scoring large instances and as a rigid template when scoring small instances, examines the interplay of resolution and context, and demonstrates that context is most helpful for detecting low-resolution instances when local models are limited in discriminative power.
Abstract: Most current approaches to recognition aim to be scale-invariant. However, the cues available for recognizing a 300 pixel tall object are qualitatively different from those for recognizing a 3 pixel tall object. We argue that for sensors with finite resolution, one should instead use scale-variant, or multiresolution representations that adapt in complexity to the size of a putative detection window. We describe a multiresolution model that acts as a deformable part-based model when scoring large instances and as a rigid template when scoring small instances. We also examine the interplay of resolution and context, and demonstrate that context is most helpful for detecting low-resolution instances when local models are limited in discriminative power. We demonstrate impressive results on the Caltech Pedestrian benchmark, which contains object instances at a wide range of scales. Whereas recent state-of-the-art methods demonstrate missed detection rates of 86%-37% at 1 false-positive-per-image, our multiresolution model reduces the rate to 29%.

Journal ArticleDOI
TL;DR: Seven unsupervised and two supervised detection methods are evaluated; of the unsupervised methods, the detectors based on the so-called h-dome transform from mathematical morphology or the multiscale variance-stabilizing transform perform comparably, and have the advantage that they do not require a cumbersome learning stage.
Abstract: Quantitative analysis of biological image data generally involves the detection of many subresolution spots. Especially in live cell imaging, for which fluorescence microscopy is often used, the signal-to-noise ratio (SNR) can be extremely low, making automated spot detection a very challenging task. In the past, many methods have been proposed to perform this task, but a thorough quantitative evaluation and comparison of these methods is lacking in the literature. In this paper, we evaluate the performance of the most frequently used detection methods for this purpose. These include seven unsupervised and two supervised methods. We perform experiments on synthetic images of three different types, for which the ground truth was available, as well as on real image data sets acquired for two different biological studies, for which we obtained expert manual annotations to compare with. The results from both types of experiments suggest that for very low SNRs (≤ 2), the supervised (machine learning) methods perform best overall. Of the unsupervised methods, the detectors based on the so-called h-dome transform from mathematical morphology or the multiscale variance-stabilizing transform perform comparably, and have the advantage that they do not require a cumbersome learning stage. At high SNRs (> 5), the difference in performance of all considered detectors becomes negligible.
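The h-dome transform mentioned above extracts local intensity maxima of height at least h by subtracting a morphological reconstruction of the image from the image itself. A brief sketch using scikit-image; the choice of h and any pre-smoothing or noise stabilization are application-dependent.

```python
import numpy as np
from skimage.morphology import reconstruction

def h_dome(img, h):
    """Subtract the reconstruction-by-dilation of (img - h) under img,
    leaving only the local intensity maxima of height >= h (spot candidates)."""
    img = img.astype(float)
    return img - reconstruction(img - h, img, method='dilation')
```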

Journal ArticleDOI
TL;DR: A shape-based, hierarchical part-template matching approach to simultaneous human detection and segmentation combining local part-based and global shape-template-based schemes is proposed.
Abstract: We propose a shape-based, hierarchical part-template matching approach to simultaneous human detection and segmentation combining local part-based and global shape-template-based schemes. The approach relies on the key idea of matching a part-template tree to images hierarchically to detect humans and estimate their poses. For learning a generic human detector, a pose-adaptive feature computation scheme is developed based on a tree matching approach. Instead of traditional concatenation-style image location-based feature encoding, we extract features adaptively in the context of human poses and train a kernel-SVM classifier to separate human/nonhuman patterns. Specifically, the features are collected in the local context of poses by tracing around the estimated shape boundaries. We also introduce an approach to multiple occluded human detection and segmentation based on an iterative occlusion compensation scheme. The output of our learned generic human detector can be used as an initial set of human hypotheses for the iterative optimization. We evaluate our approaches on three public pedestrian data sets (INRIA, MIT-CBCL, and USC-B) and two crowded sequences from Caviar Benchmark and Munich Airport data sets.

Journal ArticleDOI
TL;DR: A comprehensive study is conducted on the representation choices of BoW, including vocabulary size, weighting scheme, stop word removal, feature selection, spatial information, and visual bi-gram, and a soft-weighting method is elaborated to assess the significance of a visual word to an image.
Abstract: Based on the local keypoints extracted as salient image patches, an image can be described as a "bag-of-visual-words" (BoW), and this representation has appeared promising for object and scene classification. The performance of BoW features in semantic concept detection for large-scale multimedia databases is subject to various representation choices. In this paper, we conduct a comprehensive study on the representation choices of BoW, including vocabulary size, weighting scheme, stop word removal, feature selection, spatial information, and visual bi-gram. We offer practical insights in how to optimize the performance of BoW by choosing appropriate representation choices. For the weighting scheme, we elaborate a soft-weighting method to assess the significance of a visual word to an image. We experimentally show that the soft-weighting outperforms other popular weighting schemes such as TF-IDF with a large margin. Our extensive experiments on TRECVID data sets also indicate that BoW feature alone, with appropriate representation choices, already produces highly competitive concept detection performance. Based on our empirical findings, we further apply our method to detect a large set of 374 semantic concepts. The detectors, as well as the features and detection scores on several recent benchmark data sets, are released to the multimedia community.
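Soft-weighting replaces the hard assignment of each keypoint to its single nearest visual word with rank-decayed contributions to its k nearest words. An illustrative sketch with a simple distance-based similarity proxy; the exact similarity measure and decay schedule in the paper may differ.

```python
import numpy as np

def soft_weight_histogram(descriptors, vocabulary, k=4):
    """descriptors: (n, d) local features; vocabulary: (V, d) visual words.
    Returns a soft-weighted BoW histogram of length V."""
    hist = np.zeros(len(vocabulary))
    for d in descriptors:
        dists = np.linalg.norm(vocabulary - d, axis=1)
        nearest = np.argsort(dists)[:k]          # k nearest visual words
        for rank, w in enumerate(nearest):
            sim = 1.0 / (1.0 + dists[w])         # a simple similarity proxy
            hist[w] += sim / (2 ** rank)         # rank-decayed contribution
    return hist
```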