
Showing papers by "Rob Fergus published in 2011"


Proceedings ArticleDOI
06 Nov 2011
TL;DR: A hierarchical model that learns image decompositions via alternating layers of convolutional sparse coding and max pooling, built on a novel inference scheme in which each layer reconstructs the input rather than just the output of the layer directly beneath, as is common in existing hierarchical approaches.
Abstract: We present a hierarchical model that learns image decompositions via alternating layers of convolutional sparse coding and max pooling. When trained on natural images, the layers of our model capture image information in a variety of forms: low-level edges, mid-level edge junctions, high-level object parts and complete objects. To build our model we rely on a novel inference scheme that ensures each layer reconstructs the input, rather than just the output of the layer directly beneath, as is common with existing hierarchical approaches. This makes it possible to learn multiple layers of representation, and we show four-layer models trained on images from the Caltech-101 and Caltech-256 datasets. When combined with a standard classifier, features extracted from these models outperform SIFT, as well as representations from other feature learning methods.
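
As a rough illustration of the core operation described above, the sketch below runs ISTA-style inference for a single layer of convolutional sparse coding: feature maps z_k are fitted so that sum_k f_k * z_k reconstructs the input image under an L1 sparsity penalty. This is not the authors' implementation; the function names, step sizes, and the omission of filter learning, max pooling, and the multi-layer machinery are all simplifications.

```python
# Minimal sketch (assumptions throughout): one layer of convolutional sparse
# coding, inferring feature maps whose filtered sum reconstructs the input.
import numpy as np
from scipy.signal import fftconvolve

def infer_feature_maps(image, filters, lam=0.1, lr=0.05, n_steps=100):
    """ISTA-style inference: gradient step on the reconstruction error,
    then soft-thresholding for the L1 sparsity term."""
    K = len(filters)
    z = [np.zeros_like(image) for _ in range(K)]  # 'same'-size feature maps
    for _ in range(n_steps):
        recon = sum(fftconvolve(z[k], filters[k], mode="same") for k in range(K))
        residual = recon - image
        for k in range(K):
            # Gradient of 0.5*||recon - image||^2 w.r.t. z_k is the
            # correlation of the residual with filter f_k (flip = correlate).
            grad = fftconvolve(residual, filters[k][::-1, ::-1], mode="same")
            zk = z[k] - lr * grad
            z[k] = np.sign(zk) * np.maximum(np.abs(zk) - lr * lam, 0.0)
    return z

# Toy usage: random filters on a random "image".
rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
filts = [rng.standard_normal((7, 7)) * 0.1 for _ in range(4)]
maps = infer_feature_maps(img, filts)
recon = sum(fftconvolve(m, f, mode="same") for m, f in zip(maps, filts))
print("reconstruction error:", np.linalg.norm(recon - img))
```

Reconstructing the input at every layer, rather than the layer below, is what lets errors be measured against the image itself as layers are stacked.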

1,257 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: A new type of image regularization that assigns lowest cost to the true sharp image, allowing a very simple cost formulation for the blind deconvolution model and obviating the need for additional methods.
Abstract: Blind image deconvolution is an ill-posed problem that requires regularization to solve. However, many common forms of image prior used in this setting have a major drawback in that the minimum of the resulting cost function does not correspond to the true sharp solution. Accordingly, a range of additional methods are needed to yield good results (Bayesian methods, adaptive cost functions, alpha-matte extraction and edge localization). In this paper we introduce a new type of image regularization which gives lowest cost for the true sharp image. This allows a very simple cost formulation to be used for the blind deconvolution model, obviating the need for additional methods. Due to its simplicity the algorithm is fast and very robust. We demonstrate our method on real images with both spatially invariant and spatially varying blur.
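
To make the idea of a regularizer that favors the true sharp image concrete, the toy sketch below compares a scale-invariant sparsity cost, the l1/l2 ratio of image gradients, on a sharp image and a blurred copy: blurring spreads gradient mass over more pixels and raises the ratio. Treat the l1/l2 form as an assumption chosen for illustration, not necessarily the paper's exact regularizer.

```python
# Sketch: a ratio-type sparsity cost on gradients is lower for sharp images
# than for blurred ones, unlike a plain l1 prior (which blur can decrease).
import numpy as np
from scipy.ndimage import gaussian_filter

def l1_over_l2_gradients(img):
    gx = np.diff(img, axis=1)
    gy = np.diff(img, axis=0)
    g = np.concatenate([gx.ravel(), gy.ravel()])
    return np.abs(g).sum() / (np.linalg.norm(g) + 1e-12)

sharp = np.zeros((64, 64))
sharp[16:48, 16:48] = 1.0                      # piecewise-constant "scene"
blurred = gaussian_filter(sharp, sigma=2.0)    # simulated camera blur
print("sharp   l1/l2:", l1_over_l2_gradients(sharp))
print("blurred l1/l2:", l1_over_l2_gradients(blurred))  # higher => penalized
```

With such a cost, minimizing over the image and kernel jointly can be done with a simple formulation, since the cost itself already prefers the sharp solution.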

1,054 citations


Proceedings ArticleDOI
01 Nov 2011
TL;DR: A CRF-based model is used to evaluate a range of representations for depth information, together with a novel prior on 3D location; the combination of depth and intensity images gives dramatic performance gains over intensity images alone.
Abstract: In this paper we explore how a structured light depth sensor, in the form of the Microsoft Kinect, can assist with indoor scene segmentation. We use a CRF-based model to evaluate a range of different representations for depth information and propose a novel prior on 3D location. We introduce a new and challenging indoor scene dataset, complete with accurate depth maps and dense label coverage. Evaluating our model on this dataset reveals that the combination of depth and intensity images gives dramatic performance gains over intensity images alone. Our results clearly demonstrate the utility of structured light sensors for scene understanding.
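
The sketch below gives a minimal, hypothetical version of a grid CRF energy that mixes the two cues: per-pixel unary costs plus a Potts smoothness term whose weight decays across depth discontinuities, so label boundaries prefer to follow 3D structure. The weighting scheme and all names/parameters here are assumptions, not the paper's model.

```python
# Sketch of a pairwise CRF energy over an image grid, with depth-sensitive
# smoothness: neighboring pixels with similar depth are encouraged to share
# a label, while depth jumps (likely object boundaries) relax the penalty.
import numpy as np

def crf_energy(labels, unary, depth, w_pair=1.0, beta=5.0):
    """labels: (H,W) ints; unary: (H,W,C) per-pixel costs; depth: (H,W)."""
    H, W = labels.shape
    e = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    for dy, dx in [(0, 1), (1, 0)]:            # right and down neighbors
        a = labels[: H - dy, : W - dx]
        b = labels[dy:, dx:]
        dd = depth[: H - dy, : W - dx] - depth[dy:, dx:]
        w = w_pair * np.exp(-beta * dd ** 2)   # weaker across depth jumps
        e += (w * (a != b)).sum()
    return e

rng = np.random.default_rng(0)
H, W, C = 8, 8, 3
unary = rng.random((H, W, C))
depth = rng.random((H, W))
labels = unary.argmin(axis=2)                  # unary-only labeling baseline
print("energy:", crf_energy(labels, unary, depth))
```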

526 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: This paper crowd-sources similar images by soliciting human imitations, exploiting temporal coherence in video to generate additional pairwise graded similarities between the user-contributed imitations.
Abstract: Supervised methods for learning an embedding aim to map high-dimensional images to a space in which perceptually similar observations have high measurable similarity. Most approaches rely on binary similarity, typically defined by class membership where labels are expensive to obtain and/or difficult to define. In this paper we propose crowd-sourcing similar images by soliciting human imitations. We exploit temporal coherence in video to generate additional pairwise graded similarities between the user-contributed imitations. We introduce two methods for learning nonlinear, invariant mappings that exploit graded similarities. We learn a model that is highly effective at matching people in similar pose. It exhibits remarkable invariance to identity, clothing, background, lighting, shift and scale.
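
As one plausible way to exploit graded rather than binary similarity, the sketch below modifies a standard contrastive embedding loss so that the attract and repel terms are weighted by a grade s in [0, 1]. The exact objectives in the paper may differ; this particular form, and all names in it, are assumptions for illustration.

```python
# Sketch: a contrastive-style loss using graded similarity s instead of
# binary labels. High-s pairs are pulled together in proportion to s;
# low-s pairs are pushed apart until they exceed a margin.
import numpy as np

def graded_contrastive_loss(f1, f2, s, margin=1.0):
    """f1, f2: (N,D) embeddings of the two images; s: (N,) grades in [0,1]."""
    d = np.linalg.norm(f1 - f2, axis=1)
    pull = s * d ** 2                                    # attract by grade
    push = (1.0 - s) * np.maximum(margin - d, 0.0) ** 2  # repel below margin
    return (pull + push).mean()

rng = np.random.default_rng(0)
a, b = rng.standard_normal((16, 8)), rng.standard_normal((16, 8))
grades = rng.random(16)
print("loss:", graded_contrastive_loss(a, b, grades))
```

A binary loss is the special case s in {0, 1}; graded supervision lets the video-derived "how similar" signal shape the embedding directly.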

50 citations


Proceedings Article
12 Dec 2011
TL;DR: A type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence, sharing the desirable properties of RBMs: efficient exact inference, a latent state exponentially more expressive than an HMM's, and the ability to model nonlinear structure and dynamics.
Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data.
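
The sketch below shows, under strong simplifying assumptions, the kind of conditional RBM computation involved: the input frame shifts the hidden and (Gaussian) visible biases, and block Gibbs sampling infers an output frame. The dimensions, weight names, and factorization are hypothetical; the paper's temporal model is richer than this single-frame sketch.

```python
# Sketch of a conditional RBM step: given an input frame x, sample an
# output frame v. Hidden units are exactly inferable given v (the "efficient
# exact inference" property), and x conditions both layers via linear biases.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_out, n_hid = 20, 20, 32
W = rng.standard_normal((n_out, n_hid)) * 0.1  # visible-hidden weights
A = rng.standard_normal((n_in, n_out)) * 0.1   # input -> visible bias
B = rng.standard_normal((n_in, n_hid)) * 0.1   # input -> hidden bias

def gibbs_step(v_out, x_in):
    """One block-Gibbs sweep over (hidden, output) given the input frame."""
    h_prob = sigmoid(v_out @ W + x_in @ B)          # exact given v_out
    h = (rng.random(n_hid) < h_prob).astype(float)
    v_mean = h @ W.T + x_in @ A                     # Gaussian visible mean
    return v_mean, h_prob

x = rng.standard_normal(n_in)   # source expression features (input)
v = np.zeros(n_out)             # target face parameters (output)
for _ in range(10):
    v, _ = gibbs_step(v, x)
print("sampled output frame:", v[:5])
```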

39 citations