Showing papers by "Cristian Sminchisescu" published in 2012


Journal ArticleDOI
TL;DR: A novel framework generates and ranks plausible hypotheses for the spatial extent of objects in images using bottom-up computational processes and mid-level selection cues, and the algorithm is shown to work successfully in a segmentation-based visual object category recognition pipeline.
Abstract: We present a novel framework to generate and rank plausible hypotheses for the spatial extent of objects in images using bottom-up computational processes and mid-level selection cues. The object hypotheses are represented as figure-ground segmentations, and are extracted automatically, without prior knowledge of the properties of individual object classes, by solving a sequence of Constrained Parametric Min-Cut problems (CPMC) on a regular image grid. In a subsequent step, we learn to rank the corresponding segments by training a continuous model to predict how likely they are to exhibit real-world regularities (expressed as putative overlap with ground truth) based on their mid-level region properties, then diversify the estimated overlap score using maximum marginal relevance measures. We show that this algorithm significantly outperforms the state of the art for low-level segmentation in the VOC 2009 and 2010 data sets. In our companion papers [1], [2], we show that the algorithm can be used, successfully, in a segmentation-based visual object category recognition pipeline. This architecture ranked first in the VOC2009 and VOC2010 image segmentation and labeling challenges.
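
For illustration, a minimal sketch of the maximum-marginal-relevance-style diversification step described above, assuming segment quality scores and pairwise overlaps have already been computed; the function name, the lam trade-off, and the greedy form are ours, not necessarily the paper's exact procedure:

```python
import numpy as np

def diversify_mmr(scores, overlaps, k, lam=0.7):
    """Greedy maximum-marginal-relevance selection of segment hypotheses.

    scores   : (n,) predicted quality per segment (e.g. estimated overlap
               with ground truth, as in the CPMC ranking model)
    overlaps : (n, n) pairwise segment overlap (intersection-over-union)
    k        : number of segments to keep
    lam      : trade-off between quality and diversity
    """
    selected = [int(np.argmax(scores))]
    candidates = set(range(len(scores))) - set(selected)
    while len(selected) < k and candidates:
        best, best_val = None, -np.inf
        for i in candidates:
            # marginal relevance: own score minus redundancy w.r.t. selection
            val = lam * scores[i] - (1 - lam) * max(overlaps[i, j] for j in selected)
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        candidates.remove(best)
    return selected
```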

671 citations


Book ChapterDOI
07 Oct 2012
TL;DR: This paper introduces multiplicative second-order analogues of average and max-pooling that, together with appropriate non-linearities, lead to state-of-the-art performance on free-form region recognition, without any type of feature coding.
Abstract: Feature extraction, coding and pooling are important components of many contemporary object recognition paradigms. In this paper we explore novel pooling techniques that encode the second-order statistics of local descriptors inside a region. To achieve this effect, we introduce multiplicative second-order analogues of average and max-pooling that, together with appropriate non-linearities, lead to state-of-the-art performance on free-form region recognition, without any type of feature coding. Instead of coding, we found that enriching local descriptors with additional image information leads to large performance gains, especially in conjunction with the proposed pooling methodology. We show that second-order pooling over free-form regions produces results superior to those of the winning systems in the Pascal VOC 2011 semantic segmentation challenge, with models that are 20,000 times faster.
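
A minimal sketch of second-order average pooling, assuming local descriptors are already extracted for a region; the log-Euclidean and power non-linearities follow the spirit of the abstract, but the exact constants and the function name are ours:

```python
import numpy as np
from scipy.linalg import logm

def second_order_avg_pool(D, eps=1e-3, power=0.75):
    """Second-order average pooling of local descriptors.

    D : (n, d) array of local descriptors falling inside a region.
    Returns the vectorized upper triangle of a non-linearly transformed
    average of descriptor outer products.
    """
    G = D.T @ D / len(D)                          # average of outer products, (d, d)
    G = logm(G + eps * np.eye(G.shape[0])).real   # log-Euclidean non-linearity
    G = np.sign(G) * np.abs(G) ** power           # power normalization
    iu = np.triu_indices(G.shape[0])
    return G[iu]                                  # compact symmetric representation
```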

547 citations


Book ChapterDOI
07 Oct 2012
TL;DR: This work complements existing state-of-the-art large-scale dynamic computer vision datasets like Hollywood-2 and UCF Sports with human eye movements collected under the ecological constraints of the visual action recognition task, and introduces novel dynamic consistency and alignment models, which underline the remarkable stability of patterns of visual search among subjects.
Abstract: Systems based on bag-of-words models operating on image features collected at maxima of sparse interest point operators have been extremely successful for both computer-based visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in "saccade and fixate" regimes, the knowledge, methodology, and emphasis in the human and the computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art large-scale dynamic computer vision datasets like Hollywood-2 [1] and UCF Sports [2] with human eye movements collected under the ecological constraints of the visual action recognition task. To our knowledge these are the first human eye tracking datasets of this magnitude to be collected for video (497,107 frames, each viewed by 16 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, and (c) task control, as opposed to free-viewing. Second, we introduce novel dynamic consistency and alignment models, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the massive amounts of collected data in order to pursue studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point image sampling strategies and human fixations, as well as their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted and, when used in an end-to-end automatic system leveraging some of the most advanced computer vision practice, can lead to state-of-the-art results.
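
The paper's consistency and alignment models are more elaborate, but a common leave-one-subject-out baseline conveys the idea of inter-subject fixation consistency; the sketch below, including its function name and smoothing bandwidth, is our illustration rather than the authors' method:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def inter_subject_consistency(fix_maps, sigma=25.0):
    """Leave-one-subject-out fixation consistency on a single frame.

    fix_maps : (s, h, w) array of binary fixation maps, one per subject.
    Returns the mean value of the held-out subject's fixations under a
    saliency map built from the remaining subjects (higher = more consistent).
    """
    s = len(fix_maps)
    scores = []
    for i in range(s):
        others = fix_maps[np.arange(s) != i].sum(axis=0).astype(float)
        sal = gaussian_filter(others, sigma)      # blur fixations into a map
        sal /= sal.max() + 1e-12                  # normalize to [0, 1]
        ys, xs = np.nonzero(fix_maps[i])
        if len(ys):
            scores.append(sal[ys, xs].mean())
    return float(np.mean(scores))
```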

172 citations


Journal ArticleDOI
TL;DR: This work presents an approach to visual object-class segmentation and recognition based on a pipeline that combines multiple figure-ground hypotheses with large object spatial support, generated by bottom-up computational processes that do not exploit knowledge of specific categories, and sequential categorization based on continuous estimates of the spatial overlap between the image segment hypotheses and each putative class.
Abstract: We present an approach to visual object-class segmentation and recognition based on a pipeline that combines multiple figure-ground hypotheses with large object spatial support, generated by bottom-up computational processes that do not exploit knowledge of specific categories, and sequential categorization based on continuous estimates of the spatial overlap between the image segment hypotheses and each putative class. We differ from existing approaches not only in our seemingly unreasonable assumption that good object-level segments can be obtained in a feed-forward fashion, but also in formulating recognition as a regression problem. Instead of focusing on a one-vs.-all winning margin that may not preserve the ordering of segment qualities inside the non-maximum (non-winning) set, our learning method produces a globally consistent ranking with close ties to segment quality, hence to the extent to which entire object or part hypotheses are likely to spatially overlap the ground truth. We demonstrate results beyond the current state of the art for image classification, object detection and semantic segmentation on a number of challenging datasets, including Caltech-101, ETHZ-Shape, and PASCAL VOC 2009 and 2010.
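
A minimal sketch of the "recognition as regression" idea, using synthetic features and a ridge regressor as a stand-in for the paper's learning method: segments are scored by predicted overlap with the ground truth and ranked globally, rather than classified one-vs.-all:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 32))   # mid-level region features per segment
y_train = rng.uniform(size=500)        # stand-in for overlap with ground truth
X_test = rng.normal(size=(50, 32))

reg = Ridge(alpha=1.0).fit(X_train, y_train)   # regress overlap, not class labels
ranking = np.argsort(-reg.predict(X_test))     # globally consistent segment ranking
```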

125 citations


Book ChapterDOI
07 Oct 2012
TL;DR: A unified formulation for boundary detection, with a closed-form solution, is proposed that is applicable to the localization of different types of boundaries, such as intensity edges and occlusion boundaries from video and RGB-D cameras.
Abstract: Boundary detection is essential for a variety of computer vision tasks such as segmentation and recognition. We propose a unified formulation for boundary detection, with closed-form solution, which is applicable to the localization of different types of boundaries, such as intensity edges and occlusion boundaries from video and RGB-D cameras. Our algorithm simultaneously combines low- and mid-level image representations in a single eigenvalue problem, and solves over an infinite set of putative boundary orientations. Moreover, our method achieves state-of-the-art results at a significantly lower computational cost than current methods. We also propose a novel method for soft-segmentation that can be used in conjunction with our boundary detection algorithm and improve its accuracy at a negligible extra computational cost.
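
The paper's closed-form formulation is its own; as a hedged stand-in, the classical multi-layer structure tensor below illustrates how boundary strength and orientation can be recovered from a small per-pixel eigenvalue problem over several image layers (function name, window size, and the tensor choice are ours, not the paper's model):

```python
import numpy as np
import scipy.ndimage as ndi

def boundary_from_layers(layers, window=5):
    """Boundary strength and normal orientation from a per-pixel 2x2
    eigenvalue problem (multi-layer structure tensor).

    layers : (k, h, w) stack of image layers, e.g. color channels, depth,
             or optical-flow magnitude.
    """
    gy, gx = np.gradient(layers, axis=(1, 2))          # per-layer gradients
    Jxx = ndi.uniform_filter((gx * gx).sum(axis=0), window)
    Jxy = ndi.uniform_filter((gx * gy).sum(axis=0), window)
    Jyy = ndi.uniform_filter((gy * gy).sum(axis=0), window)
    tr, det = Jxx + Jyy, Jxx * Jyy - Jxy ** 2
    lam_max = 0.5 * (tr + np.sqrt(np.maximum(tr ** 2 - 4.0 * det, 0.0)))
    theta = 0.5 * np.arctan2(2.0 * Jxy, Jxx - Jyy)     # dominant orientation
    return np.sqrt(lam_max), theta                     # strength, orientation
```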

89 citations


Book ChapterDOI
07 Oct 2012
TL;DR: It is shown that a scalable optimization process in the Fourier domain can be used to identify the different frequency bands that are useful for prediction on training data, and to recover efficient and scalable linear reformulations for both single and multiple kernel learning.
Abstract: Approximations based on random Fourier embeddings have recently emerged as an efficient and formally consistent methodology to design large-scale kernel machines [23]. By expressing the kernel as a Fourier expansion, features are generated based on a finite set of random basis projections, sampled from the Fourier transform of the kernel, with inner products that are Monte Carlo approximations of the original non-linear model. Based on the observation that different kernel-induced Fourier sampling distributions correspond to different kernel parameters, we show that a scalable optimization process in the Fourier domain can be used to identify the different frequency bands that are useful for prediction on training data. This approach allows us to design a family of linear prediction models where we can learn the hyper-parameters of the kernel together with the weights of the feature vectors jointly. Under this methodology, we recover efficient and scalable linear reformulations for both single and multiple kernel learning. Experiments show that our linear models produce fast and accurate predictors for complex datasets such as the Visual Object Challenge 2011 and ImageNet ILSVRC 2011.
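
A minimal sketch of the random Fourier feature construction the abstract builds on, shown for a Gaussian kernel; the paper's contribution is to optimize the sampling distribution's parameters, which this sketch only exposes as the sigma argument:

```python
import numpy as np

def random_fourier_features(X, n_features=512, sigma=1.0, rng=None):
    """Monte Carlo feature map for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), so <z(x), z(y)> ~= k(x, y).

    Changing sigma rescales the Fourier sampling distribution, which is the
    handle a Fourier-domain optimization can use to learn kernel
    hyper-parameters jointly with a linear predictor.
    """
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))  # Fourier samples
    b = rng.uniform(0, 2 * np.pi, size=n_features)           # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```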

57 citations


Proceedings Article
21 Mar 2012
TL;DR: This paper presents a novel method for the hypergraph clustering problem, in which second or higher order affinities between sets of data points are considered, and it is shown that this method could be successfully applied both to higher-order assignment problems and to image segmentation.
Abstract: Data clustering is an essential problem in data mining, machine learning and computer vision. In this paper we present a novel method for the hypergraph clustering problem, in which second or higher order affinities between sets of data points are considered. Our algorithm has important theoretical properties, such as convergence and satisfaction of first order necessary optimality conditions. It is based on an efficient iterative procedure which, by updating the cluster membership of all points in parallel, is able to achieve state-of-the-art results in very few steps. We outperform current hypergraph clustering methods especially in terms of computational speed, but also in terms of accuracy. Moreover, we show that our method could be successfully applied both to higher-order assignment problems and to image segmentation.
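
A hedged sketch of the general idea: for non-negative triple-wise affinities, a Baum-Eagon-style growth transform increases the clustering score while updating all memberships in parallel. This is a generic instance of parallel-update hypergraph clustering, not necessarily the authors' exact procedure:

```python
import numpy as np

def hypergraph_cluster(T, iters=50):
    """Parallel multiplicative update for third-order hypergraph clustering.

    T : (n, n, n) non-negative, symmetric tensor of triple-wise affinities.
    Maximizes S(x) = sum_{ijk} T[i,j,k] x_i x_j x_k over the probability
    simplex; for non-negative T the growth transform below increases S at
    every step while updating all cluster memberships in parallel.
    """
    n = T.shape[0]
    x = np.full(n, 1.0 / n)                    # uniform initialization
    for _ in range(iters):
        g = np.einsum('ijk,j,k->i', T, x, x)   # gradient dS/dx (up to a factor)
        x = x * g
        x /= x.sum()                           # renormalize onto the simplex
    return x                                   # soft cluster membership
```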

47 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: An asymptotically convergent analytic series for the χ2 measure, based on analogies with Chebyshev polynomials, is proposed, and out-of-core PCA methods are introduced to reduce the dimensionality of the approximation and achieve better performance at the expense of only an additional constant factor in the time complexity.
Abstract: The random Fourier embedding methodology can be used to approximate the performance of non-linear kernel classifiers in linear time in the number of training examples. However, there still exists a non-trivial performance gap between the approximation and the nonlinear models, especially for the exponential χ2 kernel, one of the most powerful models for histograms. Based on analogies with Chebyshev polynomials, we propose an asymptotically convergent analytic series of the χ2 measure. The new series removes the need to use periodic approximations to the χ2 function, as is typical of previous methods, and improves the classification accuracy when used in the random Fourier approximation of the exponential χ2 kernel. In addition, out-of-core principal component analysis (PCA) methods are introduced to reduce the dimensionality of the approximation and achieve better performance at the expense of only an additional constant factor in the time complexity. Moreover, when PCA is performed jointly on the training and unlabeled testing data, further performance improvements can be obtained. The proposed approaches are tested on the PASCAL VOC 2010 segmentation and the ImageNet ILSVRC 2010 datasets, and shown to give statistically significant improvements over alternative approximation methods.
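
A minimal sketch of the out-of-core PCA step, using scikit-learn's IncrementalPCA and a synthetic stand-in for streamed feature blocks; the chunking generator and the sizes are ours:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)

def feature_chunks(n_chunks=10, batch=1000, dim=4096):
    """Stand-in for streaming random Fourier feature blocks from disk."""
    for _ in range(n_chunks):
        yield rng.normal(size=(batch, dim)).astype(np.float32)

ipca = IncrementalPCA(n_components=256)
for chunk in feature_chunks():
    ipca.partial_fit(chunk)              # one pass over data that never fully
                                         # resides in memory ("out-of-core")
reduced = [ipca.transform(c) for c in feature_chunks()]
```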

33 citations


Journal ArticleDOI
TL;DR: The concept of spatiotemporal closure is introduced, and the resulting approach automatically recovers coherent components in images and videos, corresponding to objects, object parts, and objects with surrounding context, providing a good set of multiscale hypotheses for high-level scene analysis.
Abstract: Detecting independent objects in images and videos is an important perceptual grouping problem. One common perceptual grouping cue that can facilitate this objective is the cue of contour closure, reflecting the spatial coherence of objects in the world and their projections as closed boundaries separating figure from background. Detecting contour closure in images consists of finding a cycle of disconnected contour fragments that separates an object from its background. Searching the entire space of possible groupings is intractable, and previous approaches have adopted powerful perceptual grouping heuristics, such as proximity and co-curvilinearity, to constrain the search. We introduce a new formulation of the problem, by transforming the problem of finding cycles of contour fragments to finding subsets of superpixels whose collective boundary has strong edge support (few gaps) in the image. Our cost function, a ratio of a boundary gap measure to area, promotes spatially coherent sets of superpixels. Moreover, its properties support a global optimization procedure based on parametric maxflow. Extending closure detection to videos, we introduce the concept of spatiotemporal closure. Analogous to image closure, we formulate our spatiotemporal closure cost over a graph of spatiotemporal superpixels. Our cost function is a ratio of motion and appearance discontinuity measures on the boundary of the selection to an internal homogeneity measure of the selected spatiotemporal volume. The resulting approach automatically recovers coherent components in images and videos, corresponding to objects, object parts, and objects with surrounding context, providing a good set of multiscale hypotheses for high-level scene analysis. We evaluate both our image and video closure frameworks by comparing them to other closure detection approaches, and find that they yield improved performance.
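
For concreteness, a sketch that evaluates the image-closure ratio cost for one candidate superpixel selection; the paper optimizes this ratio globally with parametric maxflow, which is not reproduced here, and the matrix inputs below are our assumed representation:

```python
import numpy as np

def closure_cost(selected, area, boundary_len, edge_strength):
    """Ratio cost for a candidate superpixel selection (image closure).

    selected      : (n,) boolean mask over superpixels
    area          : (n,) superpixel areas
    boundary_len  : (n, n) shared boundary length between superpixels
    edge_strength : (n, n) mean image-edge support along each shared boundary

    Cost = (boundary gap) / (enclosed area): low cost means the selection's
    collective boundary is well supported by image edges relative to its size.
    """
    inside, outside = np.where(selected)[0], np.where(~selected)[0]
    gap = sum(boundary_len[i, j] * (1.0 - edge_strength[i, j])
              for i in inside for j in outside)
    return gap / max(area[selected].sum(), 1e-12)
```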

29 citations


01 Jan 2012
TL;DR: This paper introduces multiplicative second-order analogues of average and max-pooling that, together with appropriate non-linearities, lead to state-of-the-art performance on free-form region recognition, without any type of feature coding.
Abstract: Feature extraction, coding and pooling are important components of many contemporary object recognition paradigms. In this paper we explore novel pooling techniques that encode the second-order statistics of local descriptors inside a region. To achieve this effect, we introduce multiplicative second-order analogues of average and max-pooling that, together with appropriate non-linearities, lead to state-of-the-art performance on free-form region recognition, without any type of feature coding. Instead of coding, we found that enriching local descriptors with additional image information leads to large performance gains, especially in conjunction with the proposed pooling methodology. We show that second-order pooling over free-form regions produces results superior to those of the winning systems in the Pascal VOC 2011 semantic segmentation challenge, with models that are 20,000 times faster.

23 citations


Posted Content
TL;DR: This work analyzes the limitations of existing methods and casts the active-set selection as a zero-norm constrained optimization problem solved using greedy methods, obtaining not only efficient approximate solutions for GBCD but also a demonstration that the method is globally convergent.
Abstract: We propose a variable decomposition algorithm, greedy block coordinate descent (GBCD), in order to make dense Gaussian process regression practical for large-scale problems. GBCD breaks a large-scale optimization into a series of small sub-problems. The challenge in variable decomposition algorithms is the identification of a sub-problem (the active set of variables) that yields the largest improvement. We analyze the limitations of existing methods and cast the active set selection into a zero-norm constrained optimization problem that we solve using greedy methods. By directly estimating the decrease in the objective function, we obtain not only efficient approximate solutions for GBCD, but we are also able to demonstrate that the method is globally convergent. Empirical comparisons against competing dense methods like Conjugate Gradient or SMO show that GBCD is an order of magnitude faster. Comparisons against sparse GP methods show that GBCD is both accurate and capable of handling datasets of 100,000 samples or more.
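
A hedged sketch of the GBCD idea for the GP regression system (K + noise * I) alpha = y: greedily pick the block of variables with the largest gradient magnitude and solve that sub-problem exactly. The paper's zero-norm active-set criterion estimates the objective decrease more carefully than this residual heuristic:

```python
import numpy as np

def gbcd_gp(K, y, noise=1e-2, block=64, iters=100):
    """Greedy block coordinate descent for dense GP regression (a sketch of
    the idea, not the paper's exact active-set selection).

    Solves (K + noise * I) alpha = y by repeatedly selecting the block of
    variables with the largest gradient magnitude and solving that
    sub-problem exactly.
    """
    A = K + noise * np.eye(len(y))
    alpha = np.zeros_like(y, dtype=float)
    r = y.astype(float).copy()            # residual = negative gradient
    for _ in range(iters):
        B = np.argsort(-np.abs(r))[:block]            # greedy active set
        alpha[B] += np.linalg.solve(A[np.ix_(B, B)], r[B])
        r = y - A @ alpha                              # refresh residual
    return alpha                           # predictive mean = K_star @ alpha
```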

Posted Content
18 Jun 2012
TL;DR: Experiments show that the new approximation, derived with elementary methods and adapted to the input distribution for an optimal convergence rate, leads to improved performance in image classification and semantic segmentation tasks using a random Fourier feature approximation of the $\exp-\chi^2$ kernel.
Abstract: We propose a new analytical approximation to the $\chi^2$ kernel that converges geometrically. The analytical approximation is derived with elementary methods and adapts to the input distribution for an optimal convergence rate. Experiments show that the new approximation leads to improved performance in image classification and semantic segmentation tasks using a random Fourier feature approximation of the $\exp-\chi^2$ kernel. In addition, out-of-core principal component analysis (PCA) methods are introduced to reduce the dimensionality of the approximation and achieve better performance at the expense of only an additional constant factor in the time complexity. Moreover, when PCA is performed jointly on the training and unlabeled testing data, further performance improvements can be obtained. Experiments conducted on the PASCAL VOC 2010 segmentation and the ImageNet ILSVRC 2010 datasets show statistically significant improvements over alternative approximation methods.

Book ChapterDOI
27 Jun 2012
TL;DR: Two approaches to perceptual grouping based on superpixels are reviewed: the first uses symmetry to group superpixels into symmetric parts and then groups the parts to form structured objects; the second uses contour closure, yielding a figure-ground segmentation.
Abstract: Perceptual grouping plays a critical role in both human and computer vision. However, with the object categorization community's preoccupation with object detection, interest in perceptual grouping has waned. The reason for this is clear: the object-independent, mid-level shape priors that form the basis of perceptual grouping are subsumed by the object-dependent, high-level shape priors defined by a target object. As the recognition community moves from object detection back to object recognition, a linear search through a large database of target models is intractable, and perceptual grouping will be essential for sublinear scaling. We review two approaches to perceptual grouping based on grouping superpixels. In the first, we use symmetry to group superpixels into symmetric parts, and then group the parts to form structured objects. In the second, we use contour closure to group superpixels, yielding a figure-ground segmentation.

Posted Content
TL;DR: A soft-segmentation procedure is introduced that provides region input layers to the authors' boundary detection algorithm for a significant improvement in accuracy at negligible computational cost, and an efficient method for contour grouping and reasoning is presented which, when applied as a final post-processing stage, further increases boundary detection performance.
Abstract: Boundary detection is essential for a variety of computer vision tasks such as segmentation and recognition. In this paper we propose a unified formulation and a novel algorithm that are applicable to the detection of different types of boundaries, such as intensity edges, occlusion boundaries or object category specific boundaries. Our formulation leads to a simple method with state-of-the-art performance and significantly lower computational cost than existing methods. We evaluate our algorithm on different types of boundaries, from low-level boundaries extracted in natural images, to occlusion boundaries obtained using motion cues and RGB-D cameras, to boundaries from soft-segmentation. We also propose a novel method for figure/ground soft-segmentation that can be used in conjunction with our boundary detection method and improve its accuracy at almost no extra computational cost.

Posted Content
TL;DR: The linear Fourier approximation methodology is developed for both single and multiple gradient-based kernel learning, and it is shown to produce fast and accurate predictors on a complex dataset such as the Visual Object Challenge 2011 (VOC2011).
Abstract: Approximations based on random Fourier features have recently emerged as an efficient and formally consistent methodology to design large-scale kernel machines. By expressing the kernel as a Fourier expansion, features are generated based on a finite set of random basis projections, sampled from the Fourier transform of the kernel, with inner products that are Monte Carlo approximations of the original kernel. Based on the observation that different kernel-induced Fourier sampling distributions correspond to different kernel parameters, we show that an optimization process in the Fourier domain can be used to identify the different frequency bands that are useful for prediction on training data. Moreover, the application of group Lasso to random feature vectors corresponding to a linear combination of multiple kernels leads to efficient and scalable reformulations of the standard multiple kernel learning model [Varma09]. In this paper we develop the linear Fourier approximation methodology for both single and multiple gradient-based kernel learning and show that it produces fast and accurate predictors on a complex dataset such as the Visual Object Challenge 2011 (VOC2011).
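
A minimal sketch of the block soft-thresholding (group-Lasso proximal) step that zeroes out whole random-feature blocks, i.e. whole kernels, in the multiple kernel learning reformulation; the function name and the assumption of a unit step size folded into lam are ours:

```python
import numpy as np

def prox_group_lasso(w, groups, lam):
    """Block soft-thresholding: the proximal step of the group Lasso.

    w      : (d,) weight vector over concatenated random-feature blocks
    groups : list of index arrays, one per kernel's feature block
    lam    : regularization strength (step size assumed folded in)

    Blocks whose norm falls below lam are set exactly to zero, which
    deselects the corresponding kernel from the linear combination.
    """
    w = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        w[g] *= max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
    return w
```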