
Showing papers by "Rob Fergus" published in 2015


Proceedings Article•DOI•
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
07 Dec 2015
TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large-scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) a homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: they achieve 52.8% accuracy on the UCF101 dataset with only 10 dimensions, and they are efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually simple and easy to train and use.

7,091 citations
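The architectural claim above — that a homogeneous stack of small 3x3x3 kernels works best — is easy to make concrete. Below is a minimal PyTorch sketch of a C3D-style feature extractor; the layer widths, pooling schedule, and 4096-dimensional output are illustrative assumptions, not the released C3D configuration.

```python
# Minimal sketch of a C3D-style 3D ConvNet in PyTorch.
# Every conv kernel is 3x3x3, as the paper advocates; the layer widths
# here are illustrative, not the exact published configuration.
import torch
import torch.nn as nn

class C3DSketch(nn.Module):
    def __init__(self, num_features=4096):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only at first
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),           # then pool space and time
            nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),
        )
        self.fc = nn.LazyLinear(num_features)      # flatten -> fixed-size clip feature

    def forward(self, clip):                       # clip: (batch, 3, frames, H, W)
        x = self.features(clip)
        return self.fc(x.flatten(1))

model = C3DSketch()
clip = torch.randn(2, 3, 16, 112, 112)             # two 16-frame RGB clips
feats = model(clip)                                # (2, 4096)
```

As in the paper's evaluation protocol, a simple linear classifier would then be trained on top of the extracted clip features.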


Proceedings Article•DOI•
07 Dec 2015
TL;DR: This paper addresses three different computer vision tasks using a single basic architecture: depth prediction, surface normal estimation, and semantic labeling, handled by a multiscale convolutional network that adapts easily to each task with only small modifications.
Abstract: In this paper we address three different computer vision tasks using a single basic architecture: depth prediction, surface normal estimation, and semantic labeling. We use a multiscale convolutional network that is able to adapt easily to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.

2,046 citations
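The coarse-to-fine refinement the abstract describes can be sketched with two scales: a coarse network predicts a low-resolution map from the whole image, and a finer network refines it given the image plus the upsampled coarse prediction. The PyTorch sketch below is a schematic under assumed layer sizes, not the paper's exact multiscale architecture.

```python
# Sketch of the coarse-to-fine idea: a coarse scale predicts a low-resolution
# map; a finer scale refines it using the image plus the upsampled coarse
# prediction. Channel counts are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleNet(nn.Module):
    def __init__(self, out_channels=1):            # 1 channel, e.g. depth
        super().__init__()
        self.coarse = nn.Sequential(               # global, low-resolution prediction
            nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, padding=1),
        )
        self.fine = nn.Sequential(                 # refines using image + coarse map
            nn.Conv2d(3 + out_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, padding=1),
        )

    def forward(self, img):                        # img: (B, 3, H, W)
        coarse = self.coarse(img)                  # (B, C, H/4, W/4)
        up = F.interpolate(coarse, size=img.shape[-2:], mode="bilinear",
                           align_corners=False)
        return self.fine(torch.cat([img, up], dim=1))  # full-resolution refinement

pred = TwoScaleNet()(torch.randn(1, 3, 128, 160))  # (1, 1, 128, 160)
```

Swapping the single output channel for class scores (semantic labels) or a 3-vector per pixel (surface normals) is what lets one trunk serve all three tasks.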


Proceedings Article•
07 Dec 2015
TL;DR: A generative parametric model capable of producing high quality samples of natural images using a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion.
Abstract: In this paper we introduce a generative parametric model capable of producing high quality samples of natural images. Our approach uses a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion. At each level of the pyramid, a separate generative convnet model is trained using the Generative Adversarial Nets (GAN) approach [11]. Samples drawn from our model are of significantly higher quality than alternate approaches. In a quantitative assessment by human evaluators, our CIFAR10 samples were mistaken for real images around 40% of the time, compared to 10% for samples drawn from a GAN baseline model. We also show samples from models trained on the higher resolution images of the LSUN scene dataset.

1,898 citations
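Sampling from the model walks back up the Laplacian pyramid: draw a coarse sample, then repeatedly upsample and add a generated residual. The loop below sketches that procedure; `coarse_generator` and the per-level residual generators are hypothetical placeholders for the trained GANs.

```python
# Sketch of coarse-to-fine sampling in a Laplacian pyramid of generators.
# Only the sampling loop is shown; the per-level GAN training is separate.
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling (stand-in for the pyramid's upsampler)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def sample_lapgan(coarse_generator, residual_generators, rng):
    image = coarse_generator(rng)                  # e.g. an 8x8 sample at the top
    for gen in residual_generators:                # finest level last
        image = upsample2x(image)
        noise = rng.standard_normal(image.shape)
        image = image + gen(image, noise)          # generator predicts a residual
    return image

# Toy stand-ins so the loop runs end to end:
rng = np.random.default_rng(0)
g0 = lambda rng: rng.standard_normal((8, 8))
g_k = lambda cond, z: 0.1 * z                      # residual conditioned on `cond`
print(sample_lapgan(g0, [g_k, g_k], rng).shape)    # (32, 32)
```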


Proceedings Article•
07 Dec 2015
TL;DR: This paper proposes an end-to-end memory network with a recurrent attention model over a possibly large external memory, which can be seen as an extension of RNNsearch to the case where multiple computational steps (hops) are performed per output symbol.
Abstract: We introduce a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of Memory Network [23] but unlike the model in that work, it is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings. It can also be seen as an extension of RNNsearch [2] to the case where multiple computational steps (hops) are performed per output symbol. The flexibility of the model allows us to apply it to tasks as diverse as (synthetic) question answering [22] and to language modeling. For the former our approach is competitive with Memory Networks, but with less supervision. For the latter, on the Penn TreeBank and Text8 datasets our approach demonstrates comparable performance to RNNs and LSTMs. In both cases we show that the key concept of multiple computational hops yields improved results.

1,804 citations
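A single computational hop is just soft attention over embedded memories followed by a weighted read that updates the controller state; stacking several hops gives the multi-step reasoning the paper highlights. A minimal numpy sketch of one hop, with random stand-ins for the learned embedding matrices:

```python
# One "hop" of an end-to-end memory network: attend over memories with a
# softmax, read out a weighted sum, and update the controller state.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(state, memory_keys, memory_values):
    """state: (d,); memory_keys/values: (n_memories, d)."""
    attention = softmax(memory_keys @ state)       # match state against memories
    read = attention @ memory_values               # weighted sum over memories
    return state + read                            # next controller state

rng = np.random.default_rng(0)
d, n = 16, 10
state = rng.standard_normal(d)                     # embedded question/query
keys, values = rng.standard_normal((n, d)), rng.standard_normal((n, d))
for _ in range(3):                                 # three hops before answering
    state = memory_hop(state, keys, values)
```

In the trained model the keys and values come from learned embeddings of the input sentences, and the final state is decoded into an answer or next word.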


Posted Content•
TL;DR: A neural network with a recurrent attention model over a possibly large external memory that is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
Abstract: We introduce a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of Memory Network (Weston et al., 2015) but unlike the model in that work, it is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings. It can also be seen as an extension of RNNsearch to the case where multiple computational steps (hops) are performed per output symbol. The flexibility of the model allows us to apply it to tasks as diverse as (synthetic) question answering and to language modeling. For the former our approach is competitive with Memory Networks, but with less supervision. For the latter, on the Penn TreeBank and Text8 datasets our approach demonstrates comparable performance to RNNs and LSTMs. In both cases we show that the key concept of multiple computational hops yields improved results.

1,250 citations


Posted Content•
TL;DR: In this article, a Laplacian pyramid of GANs is used to generate images in a coarse-to-fine fashion, where a separate GAN model is trained at each level of the pyramid.
Abstract: In this paper we introduce a generative parametric model capable of producing high quality samples of natural images. Our approach uses a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion. At each level of the pyramid, a separate generative convnet model is trained using the Generative Adversarial Nets (GAN) approach (Goodfellow et al.). Samples drawn from our model are of significantly higher quality than alternate approaches. In a quantitative assessment by human evaluators, our CIFAR10 samples were mistaken for real images around 40% of the time, compared to 10% for samples drawn from a GAN baseline model. We also show samples from models trained on the higher resolution images of the LSUN scene dataset.

854 citations


Posted Content•
TL;DR: A very simple bag-of-words baseline for visual question answering that concatenates the word features from the question and CNN features from the image to predict the answer.
Abstract: We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows comparable performance to many recent approaches using recurrent neural networks. To explore the strengths and weaknesses of the trained model, we also provide an interactive web demo and open-source code.

316 citations
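The baseline really is this simple: a bag-of-words vector for the question is concatenated with a precomputed CNN image feature and scored by a single linear layer over candidate answers. The numpy sketch below uses a toy vocabulary and assumed dimensions:

```python
# Sketch of the bag-of-words VQA baseline: concatenate question BoW features
# with a (precomputed) CNN image feature and classify with a linear layer.
import numpy as np

vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "ball": 4}  # toy vocabulary
n_answers, img_dim = 100, 512                       # assumed sizes

def bow(question):
    v = np.zeros(len(vocab))
    for word in question.lower().split():
        if word in vocab:
            v[vocab[word]] += 1.0
    return v

rng = np.random.default_rng(0)
W = rng.standard_normal((n_answers, len(vocab) + img_dim)) * 0.01  # learned in practice
b = np.zeros(n_answers)

image_feature = rng.standard_normal(img_dim)        # stand-in for CNN features
x = np.concatenate([bow("What color is the ball"), image_feature])
scores = W @ x + b                                  # logits over candidate answers
predicted_answer = int(np.argmax(scores))
```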


Proceedings Article•DOI•
07 Jun 2015
TL;DR: In this paper, the Pose Invariant PErson Recognition (PIPER) method is proposed, which accumulates the cues of poselet-level person recognizers trained by deep convolutional networks to discount for the pose variations, combined with a face recognizer and a global recognizer.
Abstract: We explore the task of recognizing people's identities in photo albums in an unconstrained setting. To facilitate this, we introduce the new People In Photo Albums (PIPA) dataset, consisting of over 60,000 instances of ∼2000 individuals collected from public Flickr photo albums. With only about half of the person images containing a frontal face, the recognition task is very challenging due to the large variations in pose, clothing, camera viewpoint, image resolution and illumination. We propose the Pose Invariant PErson Recognition (PIPER) method, which accumulates the cues of poselet-level person recognizers trained by deep convolutional networks to discount for the pose variations, combined with a face recognizer and a global recognizer. Experiments on three different settings confirm that in our unconstrained setup PIPER significantly improves on the performance of DeepFace, which is one of the best face recognizers as measured on the LFW dataset.

132 citations
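The abstract leaves the cue combination at a high level; as a loose illustration only (the actual PIPER combination is more involved), a weighted fusion of identity scores from whichever recognizers fire on an instance might look like the sketch below. The weights and averaging scheme are assumptions.

```python
# Loose illustration of fusing identity scores from multiple recognizers
# (poselet-level, face, global). Weights and the averaging are assumptions;
# cues that are not visible in an instance (e.g. no frontal face) are skipped.
import numpy as np

def fuse_scores(recognizer_scores, weights):
    """recognizer_scores: dict name -> (n_identities,) scores, or None
    when the cue did not fire. Assumes at least one cue always fires."""
    total, norm = 0.0, 0.0
    for name, scores in recognizer_scores.items():
        if scores is not None:
            total = total + weights[name] * scores
            norm += weights[name]
    return total / norm

rng = np.random.default_rng(0)
scores = {"global": rng.standard_normal(2000),
          "face": None,                            # no frontal face in this image
          "poselet_3": rng.standard_normal(2000)}
weights = {"global": 1.0, "face": 2.0, "poselet_3": 0.5}
identity = int(np.argmax(fuse_scores(scores, weights)))
```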


Posted Content•
TL;DR: The Pose Invariant PErson Recognition (PIPER) method is proposed, which accumulates the cues of poselet-level person recognizers trained by deep convolutional networks to discount for the pose variations, combined with a face recognizer and a global recognizer.
Abstract: We explore the task of recognizing people's identities in photo albums in an unconstrained setting. To facilitate this, we introduce the new People In Photo Albums (PIPA) dataset, consisting of over 60,000 instances of ∼2000 individuals collected from public Flickr photo albums. With only about half of the person images containing a frontal face, the recognition task is very challenging due to the large variations in pose, clothing, camera viewpoint, image resolution and illumination. We propose the Pose Invariant PErson Recognition (PIPER) method, which accumulates the cues of poselet-level person recognizers trained by deep convolutional networks to discount for the pose variations, combined with a face recognizer and a global recognizer. Experiments on three different settings confirm that in our unconstrained setup PIPER significantly improves on the performance of DeepFace, which is one of the best face recognizers as measured on the LFW dataset.

117 citations


Posted Content•
31 Mar 2015
TL;DR: This paper introduces a variant of Memory Networks that needs significantly less supervision to perform question answering tasks and applies it to the synthetic bAbI tasks, showing that the approach is competitive with the supervised approach, particularly when trained on a sufficiently large amount of data.
Abstract: In this paper we introduce a variant of Memory Networks (Weston et al., 2015b) that needs significantly less supervision to perform question answering tasks. The original model requires that the sentences supporting the answer be explicitly indicated during training. In contrast, our approach only requires the answer to the question during training. We apply the model to the synthetic bAbI tasks, showing that our approach is competitive with the supervised approach, particularly when trained on a sufficiently large amount of data. Furthermore, it decisively beats other weakly supervised approaches based on LSTMs. The approach is quite general and can potentially be applied to many other tasks that require capturing long-term dependencies.

111 citations


Proceedings Article•DOI•
Kevin Tang, Manohar Paluri, Li Fei-Fei, Rob Fergus, Lubomir Bourdev
07 Dec 2015
TL;DR: This work tackles the problem of performing image classification with location context, and explores different ways of encoding and extracting features from the GPS coordinates, and shows how to naturally incorporate these features into a Convolutional Neural Network, the current state-of-the-art for most image classification and recognition problems.
Abstract: With the widespread availability of cellphones and cameras that have GPS capabilities, it is common for images being uploaded to the Internet today to have GPS coordinates associated with them. In addition to research that tries to predict GPS coordinates from visual features, this also opens up the door to problems that are conditioned on the availability of GPS coordinates. In this work, we tackle the problem of performing image classification with location context, in which we are given the GPS coordinates for images in both the train and test phases. We explore different ways of encoding and extracting features from the GPS coordinates, and show how to naturally incorporate these features into a Convolutional Neural Network (CNN), the current state-of-the-art for most image classification and recognition problems. We also show how it is possible to simultaneously learn the optimal pooling radii for a subset of our features within the CNN framework. To evaluate our model and to help promote research in this area, we identify a set of location-sensitive concepts and annotate a subset of the Yahoo Flickr Creative Commons 100M dataset that has GPS coordinates with these concepts, which we make publicly available. By leveraging location context, we are able to achieve almost a 7% gain in mean average precision.
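The central move is to encode the GPS coordinates as a feature vector and feed it into the network alongside the visual features. The sketch below shows the simplest version, concatenation before the final classifier; the toy location encoding is a stand-in for the paper's richer features and learned pooling radii.

```python
# Sketch of injecting location context into an image classifier: encode the
# GPS coordinates as a feature vector and concatenate it with CNN image
# features before the final classifier. The encoding is a simplified stand-in.
import numpy as np

def gps_features(lat, lon):
    """Toy location encoding: normalized coordinates plus periodic terms."""
    return np.array([lat / 90.0, lon / 180.0,
                     np.sin(np.radians(lon)), np.cos(np.radians(lon)),
                     np.sin(np.radians(lat)), np.cos(np.radians(lat))])

rng = np.random.default_rng(0)
n_concepts, img_dim = 25, 512
image_feature = rng.standard_normal(img_dim)       # from a CNN, precomputed
loc = gps_features(40.7128, -74.0060)              # e.g. New York City

x = np.concatenate([image_feature, loc])
W = rng.standard_normal((n_concepts, x.size)) * 0.01   # learned jointly with the CNN
concept_scores = W @ x                             # location-sensitive concept logits
```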

Proceedings Article•DOI•
Emily Denton, Jason Weston, Manohar Paluri, Lubomir Bourdev, Rob Fergus
10 Aug 2015
TL;DR: It is shown how user metadata (age, gender, etc.) combined with image features derived from a convolutional neural network can be used to perform hashtag prediction and it is demonstrated that modeling the user can significantly improve the tag prediction quality over current state-of-the-art methods.
Abstract: Understanding the content of users' image posts is a particularly interesting problem in social networks and web settings. Current machine learning techniques focus mostly on curated training sets of image-label pairs, and perform image classification given the pixels within the image. In this work we instead leverage the wealth of information available from users: firstly, we employ user hashtags to capture the description of image content; and secondly, we make use of valuable contextual information about the user. We show how user metadata (age, gender, etc.) combined with image features derived from a convolutional neural network can be used to perform hashtag prediction. We explore two ways of combining these heterogeneous features into a learning framework: (i) simple concatenation; and (ii) a 3-way multiplicative gating, where the image model is conditioned on the user metadata. We apply these models to a large dataset of de-identified Facebook posts and demonstrate that modeling the user can significantly improve the tag prediction quality over current state-of-the-art methods.
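Both combination schemes are small enough to sketch. In the concatenation variant the user features simply extend the input to the tag classifier; in the gating variant a projection of the user features rescales the image representation elementwise, conditioning the image model on the user. Dimensions and the exact gating form below are illustrative assumptions.

```python
# Sketch of the two ways to combine image features with user metadata:
# (i) simple concatenation, and (ii) multiplicative gating, where a projection
# of the user features rescales the image features elementwise.
import numpy as np

rng = np.random.default_rng(0)
img_dim, user_dim, n_tags = 512, 16, 1000
image_f = rng.standard_normal(img_dim)             # CNN image features
user_f = rng.standard_normal(user_dim)             # age, gender, ... encoded

# (i) concatenation
W_cat = rng.standard_normal((n_tags, img_dim + user_dim)) * 0.01
scores_cat = W_cat @ np.concatenate([image_f, user_f])

# (ii) multiplicative gating: condition the image model on the user
U = rng.standard_normal((img_dim, user_dim)) * 0.1 # learned projection (assumed form)
gate = 1.0 + np.tanh(U @ user_f)                   # per-dimension modulation
W_img = rng.standard_normal((n_tags, img_dim)) * 0.01
scores_gated = W_img @ (gate * image_f)            # hashtag logits
```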

Proceedings Article•DOI•
Li Wan, David Eigen, Rob Fergus
07 Jun 2015
TL;DR: In this article, a new model that combines the advantages of deformable parts models and convolutional networks is proposed, which considers all bounding boxes within an image, rather than isolated object instances.
Abstract: Deformable Parts Models and Convolutional Networks each have achieved notable performance in object detection. Yet these two approaches find their strengths in complementary areas: DPMs are well-versed in object composition, modeling fine-grained spatial relationships between parts; likewise, ConvNets are adept at producing powerful image features, having been discriminatively trained directly on the pixels. In this paper, we propose a new model that combines these two approaches, obtaining the advantages of each. We train this model using a new structured loss function that considers all bounding boxes within an image, rather than isolated object instances. This enables the non-maximal suppression (NMS) operation, previously treated as a separate post-processing stage, to be integrated into the model. This allows for discriminative training of our combined Convnet + DPM + NMS model in end-to-end fashion. We evaluate our system on PASCAL VOC 2007 and 2011 datasets, achieving competitive results on both benchmarks.

Posted Content•
TL;DR: MazeBase is introduced, an environment for simple 2D games, designed as a sandbox for machine learning approaches to reasoning and planning, and models trained on the MazeBase version can be directly applied to StarCraft, where they consistently beat the in-game AI.
Abstract: This paper introduces MazeBase: an environment for simple 2D games, designed as a sandbox for machine learning approaches to reasoning and planning. Within it, we create 10 simple games embodying a range of algorithmic tasks (e.g. if-then statements or set negation). A variety of neural models (fully connected, convolutional network, memory network) are deployed via reinforcement learning on these games, with and without a procedurally generated curriculum. Despite the tasks' simplicity, the performance of the models is far from optimal, suggesting directions for future development. We also demonstrate the versatility of MazeBase by using it to emulate small combat scenarios from StarCraft. Models trained on the MazeBase version can be directly applied to StarCraft, where they consistently beat the in-game AI.
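The sandbox character of MazeBase comes from a plain observe/act/reward loop over small procedurally generated grids. The toy environment below captures that contract; it is a sketch in that spirit, not the released MazeBase API.

```python
# Toy grid-world in the spirit of MazeBase's observe/act/reward loop
# (a sketch, not the released MazeBase API): reach the goal cell.
import numpy as np

class ToyMaze:
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=5, seed=0):
        rng = np.random.default_rng(seed)
        self.size = size
        self.agent = tuple(rng.integers(0, size, 2))
        self.goal = tuple(rng.integers(0, size, 2))

    def observe(self):
        grid = np.zeros((self.size, self.size), dtype=int)
        grid[self.goal] = 2
        grid[self.agent] = 1
        return grid

    def step(self, action):
        dr, dc = self.MOVES[action]
        r = min(max(self.agent[0] + dr, 0), self.size - 1)
        c = min(max(self.agent[1] + dc, 0), self.size - 1)
        self.agent = (r, c)
        done = self.agent == self.goal
        return self.observe(), (1.0 if done else -0.1), done  # small step penalty

env = ToyMaze()
obs, reward, done = env.step("right")
```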

Proceedings Article•DOI•
Yunchao Gong, Marcin Pawlowski, Fei Yang, Louis Brandy, Lubomir Bourdev, Rob Fergus
07 Jun 2015
TL;DR: This paper presents a fast binary k-means algorithm that works directly on the similarity-preserving hashes of images and clusters them into binary centers on which to build hash indexes to speed up computation.
Abstract: This paper addresses the problem of clustering a very large number of photos (i.e. hundreds of millions a day) in a stream into millions of clusters. This is particularly important given the popularity of photo-sharing websites such as Facebook, Google, and Instagram. Given the large number of photos available online, efficiently organizing them is an open problem. To address this problem, we propose to cluster the binary hash codes of a large number of photos into binary cluster centers. We present a fast binary k-means algorithm that works directly on the similarity-preserving hashes of images and clusters them into binary centers on which we can build hash indexes to speed up computation. The proposed method is capable of clustering millions of photos on a single machine in a few minutes. We show that this approach is usually several orders of magnitude faster than standard k-means and produces comparable clustering accuracy. In addition, we propose an online clustering method based on binary k-means that is capable of clustering a large photo stream on a single machine, and show applications to spam detection and trending photo discovery.
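The trick that makes this fast is keeping both points and centers binary: distance is then a Hamming distance (XOR plus popcount) and the center update is a per-bit majority vote. A small numpy sketch over 64-bit codes, without the hash indexes the paper adds for further speed:

```python
# Sketch of binary k-means: points and centers are binary codes, distance is
# Hamming distance, and the center update is a per-bit majority vote.
import numpy as np

def binary_kmeans(codes, k, iters=10, seed=0):
    """codes: (n, bits) array of 0/1. Returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    centers = codes[rng.choice(len(codes), size=k, replace=False)]
    for _ in range(iters):
        # Hamming distance via XOR; dists has shape (n, k)
        dists = (codes[:, None, :] ^ centers[None, :, :]).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):                         # majority vote per bit
            members = codes[assign == j]
            if len(members):
                centers[j] = (members.mean(axis=0) >= 0.5).astype(codes.dtype)
    return centers, assign

rng = np.random.default_rng(1)
codes = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)   # e.g. 64-bit hashes
centers, assign = binary_kmeans(codes, k=8)
```

The online variant in the abstract would assign each incoming batch to the nearest existing binary center in the same way, rather than re-clustering from scratch.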

Posted Content•
TL;DR: This work presents an approach for learning simple algorithms such as copying, multi-digit addition and single digit multiplication directly from examples, and shows that the bottleneck is in the capabilities of the controller rather than in the search incurred by the learning.
Abstract: We present an approach for learning simple algorithms such as copying, multi-digit addition and single digit multiplication directly from examples. Our framework consists of a set of interfaces, accessed by a controller. Typical interfaces are 1-D tapes or 2-D grids that hold the input and output data. For the controller, we explore a range of neural network-based models which vary in their ability to abstract the underlying algorithm from training instances and generalize to test examples with many thousands of digits. The controller is trained using $Q$-learning with several enhancements and we show that the bottleneck is in the capabilities of the controller rather than in the search incurred by $Q$-learning.
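The separation between controller and interfaces is the load-bearing idea: an interface such as a 1-D tape exposes only primitive actions, and the controller chooses among them. The paper's controller is a neural network trained with Q-learning; the sketch below substitutes a hand-coded policy for the copy task purely to show the interface contract.

```python
# Sketch of the interface/controller split: an input tape and an output tape
# expose only primitive actions (read, write, move); the controller decides
# which action to take. A hand-coded policy solves the copy task here; the
# paper trains a neural controller with Q-learning instead.
class Tape:
    def __init__(self, symbols=None):
        self.cells = list(symbols or [])
        self.head = 0

    def read(self):
        return self.cells[self.head] if self.head < len(self.cells) else None

    def write(self, symbol):
        self.cells.append(symbol)

    def move(self):
        self.head += 1

def copy_controller(inp, out):
    """Trivial policy for the copy task: read, write, advance until exhausted."""
    while (symbol := inp.read()) is not None:
        out.write(symbol)
        inp.move()

inp, out = Tape("31415"), Tape()
copy_controller(inp, out)
assert "".join(out.cells) == "31415"
```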

Journal Article•
TL;DR: It is found that deep convolutional networks trained for classification have substantial predictive power, unlike simpler features computed from the same massive dataset, showing how typicality might emerge as a byproduct of a complex model trained to maximize classification performance.

Patent•
29 Dec 2015
TL;DR: In this patent, a convolutional neural network with both two-dimensional and three-dimensional convolutional layers processes video frames obtained at a first resolution, and three-dimensional de-convolutional layers produce outputs corresponding to the frames.
Abstract: Systems, methods, and non-transitory computer-readable media can obtain a set of video frames at a first resolution; process the set of video frames using a convolutional neural network to output one or more signals, the convolutional neural network including (i) a set of two-dimensional convolutional layers and (ii) a set of three-dimensional convolutional layers, wherein the processing causes the set of video frames to be reduced to a second resolution; process the one or more signals using a set of three-dimensional de-convolutional layers of the convolutional neural network; and obtain one or more outputs corresponding to the set of video frames from the convolutional neural network.

Posted Content•
TL;DR: This paper presents a deep 3D convolutional architecture trained end to end to perform voxel-level prediction, and shows that the same exact architecture can be used to achieve competitive results on three widely different voxels-prediction tasks: video semantic segmentation, optical flow estimation, and video coloring.
Abstract: Over the last few years deep learning methods have emerged as one of the most prominent approaches for video analysis. However, so far their most successful applications have been in the area of video classification and detection, i.e., problems involving the prediction of a single class label or a handful of output variables per video. Furthermore, while deep networks are commonly recognized as the best models to use in these domains, there is a widespread perception that in order to yield successful results they often require time-consuming architecture search, manual tweaking of parameters and computationally intensive pre-processing or post-processing methods. In this paper we challenge these views by presenting a deep 3D convolutional architecture trained end to end to perform voxel-level prediction, i.e., to output a variable at every voxel of the video. Most importantly, we show that the same exact architecture can be used to achieve competitive results on three widely different voxel-prediction tasks: video semantic segmentation, optical flow estimation, and video coloring. The three networks learned on these problems are trained from raw video without any form of preprocessing and their outputs do not require post-processing to achieve outstanding performance. Thus, they offer an efficient alternative to traditional and much more computationally expensive methods in these video domains.
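Voxel-level prediction requires the network to return to full resolution, so the trunk is a 3D encoder-decoder: 3D convolutions downsample the video volume and 3D deconvolutions upsample the prediction back to one output per voxel. A minimal PyTorch sketch with assumed channel counts; the same trunk would be reused across the three tasks by changing the output channels.

```python
# Sketch of a 3D conv/deconv encoder-decoder for voxel-level prediction:
# every voxel of the input video gets an output value. Channel counts and
# the single-stage design are illustrative assumptions.
import torch
import torch.nn as nn

class VoxelNet(nn.Module):
    def __init__(self, out_channels):              # e.g. classes, flow dims, colors
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decode = nn.Sequential(                # deconvs restore full resolution
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, out_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, video):                       # (B, 3, T, H, W)
        return self.decode(self.encode(video))      # (B, out_channels, T, H, W)

net = VoxelNet(out_channels=2)                      # e.g. 2 channels for optical flow
pred = net(torch.randn(1, 3, 8, 64, 64))            # one prediction per voxel
```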

Journal Article•DOI•
TL;DR: This proof-of-concept case series suggests that using eye tracking to detect CN palsy while the patient watches television, or its equivalent, represents a new capability for this technology.
Abstract: OBJECT Automated eye movement tracking may provide clues to nervous system function at many levels. Spatial calibration of the eye tracking device requires the subject to have relatively intact ocular motility that implies function of cranial nerves (CNs) III (oculomotor), IV (trochlear), and VI (abducent) and their associated nuclei, along with the multiple regions of the brain imparting cognition and volition. The authors have developed a technique for eye tracking that uses temporal rather than spatial calibration, enabling detection of impaired ability to move the pupil relative to normal (neurologically healthy) control volunteers. This work was performed to demonstrate that this technique may detect CN palsies related to brain compression and to provide insight into how the technique may be of value for evaluating neuropathological conditions associated with CN palsy, such as hydrocephalus or acute mass effect. METHODS The authors recorded subjects' eye movements by using an Eyelink 1000 eye tracker...

Patent•
Yunchao Gong, Liu Liu, Lubomir Bourdev, Ming Yang, Rob Fergus
29 Dec 2015
TL;DR: In this patent, a compressed CNN is used to apply a media processing technique to a media content item to produce information about the item, based on which it can be determined whether to transmit at least a portion of the item to one or more remote servers for additional media processing.
Abstract: Systems, methods, and non-transitory computer-readable media can receive a compressed convolutional neural network (CNN). A media content item to be processed can be acquired. The compressed CNN can be utilized to apply a media processing technique to the media content item to produce information about the media content item. It can be determined, based on at least some of the information about the media content item, whether to transmit at least a portion of the media content item to one or more remote servers for additional media processing.
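The claimed flow amounts to a confidence gate: run the compressed CNN on the device and escalate the media item to remote servers only when the local result is uncertain. A loose sketch of that decision; the threshold and score semantics are assumptions, not the patent's specifics.

```python
# Loose sketch of the gating idea: run a compressed CNN locally and transmit
# the media item for server-side processing only when the on-device result
# is uncertain. The confidence threshold is an illustrative assumption.
import numpy as np

def process_locally(media_item, compressed_cnn, threshold=0.8):
    probs = compressed_cnn(media_item)             # on-device inference
    confidence = float(np.max(probs))
    label = int(np.argmax(probs))
    send_to_server = confidence < threshold        # escalate only uncertain items
    return label, confidence, send_to_server

# Stand-in "compressed CNN" returning class probabilities:
rng = np.random.default_rng(0)
fake_cnn = lambda item: rng.dirichlet(np.ones(10))
label, conf, escalate = process_locally(np.zeros((224, 224, 3)), fake_cnn)
```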

Posted Content•
Kevin Tang, Manohar Paluri, Li Fei-Fei, Rob Fergus, Lubomir Bourdev
TL;DR: In this article, the authors tackle the problem of performing image classification with location context, in which they are given the GPS coordinates for images in both the train and test phases, and they explore different ways of encoding and extracting features from GPS coordinates, and show how to naturally incorporate these features into a CNN, the current state-of-the-art for most image classification and recognition problems.
Abstract: With the widespread availability of cellphones and cameras that have GPS capabilities, it is common for images being uploaded to the Internet today to have GPS coordinates associated with them. In addition to research that tries to predict GPS coordinates from visual features, this also opens up the door to problems that are conditioned on the availability of GPS coordinates. In this work, we tackle the problem of performing image classification with location context, in which we are given the GPS coordinates for images in both the train and test phases. We explore different ways of encoding and extracting features from the GPS coordinates, and show how to naturally incorporate these features into a Convolutional Neural Network (CNN), the current state-of-the-art for most image classification and recognition problems. We also show how it is possible to simultaneously learn the optimal pooling radii for a subset of our features within the CNN framework. To evaluate our model and to help promote research in this area, we identify a set of location-sensitive concepts and annotate a subset of the Yahoo Flickr Creative Commons 100M dataset that has GPS coordinates with these concepts, which we make publicly available. By leveraging location context, we are able to achieve almost a 7% gain in mean average precision.

Patent•
Yunchao Gong, Marcin Pawlowski, Fei Yang, Lubomir Bourdev, Louis Brandy, Rob Fergus
28 Dec 2015
TL;DR: In this patent, a set of clusters is generated by clustering respective binary hash codes for each content item in a first batch of content items, wherein content items included in a cluster are visually similar to one another.
Abstract: Systems, methods, and non-transitory computer-readable media can obtain a first batch of content items to be clustered. A set of clusters can be generated by clustering respective binary hash codes for each content item in the first batch, wherein content items included in a cluster are visually similar to one another. A next batch of content items to be clustered can be obtained. One or more respective binary hash codes for the content items in the next batch can be assigned to a cluster in the set of clusters.