
Showing papers by "Rob Fergus" published in 2015


Proceedings Article•DOI•
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
07 Dec 2015
TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large-scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) a homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: they achieve 52.8% accuracy on the UCF101 dataset with only 10 dimensions, and they are efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually simple and easy to train and use.

7,091 citations
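The architectural claim above — that a homogeneous stack of small 3x3x3 kernels works best — is easy to make concrete. Below is a minimal PyTorch sketch of a C3D-style feature extractor; the layer widths, pooling schedule, and 4096-dimensional output are illustrative assumptions, not the released C3D configuration.

```python
# Minimal sketch of a C3D-style 3D ConvNet in PyTorch.
# Every conv kernel is 3x3x3, as the paper advocates; the layer widths
# here are illustrative, not the exact published configuration.
import torch
import torch.nn as nn

class C3DSketch(nn.Module):
    def __init__(self, num_features=4096):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only at first
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),           # then pool space and time
            nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),
        )
        self.fc = nn.LazyLinear(num_features)      # flatten -> fixed-size clip feature

    def forward(self, clip):                       # clip: (batch, 3, frames, H, W)
        x = self.features(clip)
        return self.fc(x.flatten(1))

model = C3DSketch()
clip = torch.randn(2, 3, 16, 112, 112)             # two 16-frame RGB clips
feats = model(clip)                                # (2, 4096)
```

As in the paper's evaluation protocol, a simple linear classifier would then be trained on top of the extracted clip features.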


Proceedings Article•DOI•
07 Dec 2015
TL;DR: This paper addresses three different computer vision tasks using a single basic architecture: depth prediction, surface normal estimation, and semantic labeling, handled by a multiscale convolutional network that adapts easily to each task with only small modifications.
Abstract: In this paper we address three different computer vision tasks using a single basic architecture: depth prediction, surface normal estimation, and semantic labeling. We use a multiscale convolutional network that is able to adapt easily to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.

2,046 citations
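The coarse-to-fine refinement the abstract describes can be sketched with two scales: a coarse network predicts a low-resolution map from the whole image, and a finer network refines it given the image plus the upsampled coarse prediction. The PyTorch sketch below is a schematic under assumed layer sizes, not the paper's exact multiscale architecture.

```python
# Sketch of the coarse-to-fine idea: a coarse scale predicts a low-resolution
# map; a finer scale refines it using the image plus the upsampled coarse
# prediction. Channel counts are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleNet(nn.Module):
    def __init__(self, out_channels=1):            # 1 channel, e.g. depth
        super().__init__()
        self.coarse = nn.Sequential(               # global, low-resolution prediction
            nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, padding=1),
        )
        self.fine = nn.Sequential(                 # refines using image + coarse map
            nn.Conv2d(3 + out_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, padding=1),
        )

    def forward(self, img):                        # img: (B, 3, H, W)
        coarse = self.coarse(img)                  # (B, C, H/4, W/4)
        up = F.interpolate(coarse, size=img.shape[-2:], mode="bilinear",
                           align_corners=False)
        return self.fine(torch.cat([img, up], dim=1))  # full-resolution refinement

pred = TwoScaleNet()(torch.randn(1, 3, 128, 160))  # (1, 1, 128, 160)
```

Swapping the single output channel for class scores (semantic labels) or a 3-vector per pixel (surface normals) is what lets one trunk serve all three tasks.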


Proceedings Article•
07 Dec 2015
TL;DR: A generative parametric model capable of producing high quality samples of natural images using a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion.
Abstract: In this paper we introduce a generative parametric model capable of producing high quality samples of natural images. Our approach uses a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion. At each level of the pyramid, a separate generative convnet model is trained using the Generative Adversarial Nets (GAN) approach [11]. Samples drawn from our model are of significantly higher quality than alternate approaches. In a quantitative assessment by human evaluators, our CIFAR10 samples were mistaken for real images around 40% of the time, compared to 10% for samples drawn from a GAN baseline model. We also show samples from models trained on the higher resolution images of the LSUN scene dataset.

1,898 citations
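Sampling from the model walks back up the Laplacian pyramid: draw a coarse sample, then repeatedly upsample and add a generated residual. The loop below sketches that procedure; `coarse_generator` and the per-level residual generators are hypothetical placeholders for the trained GANs.

```python
# Sketch of coarse-to-fine sampling in a Laplacian pyramid of generators.
# Only the sampling loop is shown; the per-level GAN training is separate.
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling (stand-in for the pyramid's upsampler)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def sample_lapgan(coarse_generator, residual_generators, rng):
    image = coarse_generator(rng)                  # e.g. an 8x8 sample at the top
    for gen in residual_generators:                # finest level last
        image = upsample2x(image)
        noise = rng.standard_normal(image.shape)
        image = image + gen(image, noise)          # generator predicts a residual
    return image

# Toy stand-ins so the loop runs end to end:
rng = np.random.default_rng(0)
g0 = lambda rng: rng.standard_normal((8, 8))
g_k = lambda cond, z: 0.1 * z                      # residual conditioned on `cond`
print(sample_lapgan(g0, [g_k, g_k], rng).shape)    # (32, 32)
```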


Proceedings Article•
07 Dec 2015
TL;DR: This paper proposes an end-to-end memory network with a recurrent attention model over a possibly large external memory, which can be seen as an extension of RNNsearch to the case where multiple computational steps (hops) are performed per output symbol.
Abstract: We introduce a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of Memory Network [23] but unlike the model in that work, it is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings. It can also be seen as an extension of RNNsearch [2] to the case where multiple computational steps (hops) are performed per output symbol. The flexibility of the model allows us to apply it to tasks as diverse as (synthetic) question answering [22] and to language modeling. For the former our approach is competitive with Memory Networks, but with less supervision. For the latter, on the Penn TreeBank and Text8 datasets our approach demonstrates comparable performance to RNNs and LSTMs. In both cases we show that the key concept of multiple computational hops yields improved results.

1,804 citations
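A single computational hop is just soft attention over embedded memories followed by a weighted read that updates the controller state; stacking several hops gives the multi-step reasoning the paper highlights. A minimal numpy sketch of one hop, with random stand-ins for the learned embedding matrices:

```python
# One "hop" of an end-to-end memory network: attend over memories with a
# softmax, read out a weighted sum, and update the controller state.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(state, memory_keys, memory_values):
    """state: (d,); memory_keys/values: (n_memories, d)."""
    attention = softmax(memory_keys @ state)       # match state against memories
    read = attention @ memory_values               # weighted sum over memories
    return state + read                            # next controller state

rng = np.random.default_rng(0)
d, n = 16, 10
state = rng.standard_normal(d)                     # embedded question/query
keys, values = rng.standard_normal((n, d)), rng.standard_normal((n, d))
for _ in range(3):                                 # three hops before answering
    state = memory_hop(state, keys, values)
```

In the trained model the keys and values come from learned embeddings of the input sentences, and the final state is decoded into an answer or next word.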


Posted Content•
TL;DR: A neural network with a recurrent attention model over a possibly large external memory that is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
Abstract: We introduce a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of Memory Network (Weston et al., 2015) but unlike the model in that work, it is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings. It can also be seen as an extension of RNNsearch to the case where multiple computational steps (hops) are performed per output symbol. The flexibility of the model allows us to apply it to tasks as diverse as (synthetic) question answering and to language modeling. For the former our approach is competitive with Memory Networks, but with less supervision. For the latter, on the Penn TreeBank and Text8 datasets our approach demonstrates comparable performance to RNNs and LSTMs. In both cases we show that the key concept of multiple computational hops yields improved results.

1,250 citations


Posted Content•
TL;DR: In this article, a Laplacian pyramid of GANs is used to generate images in a coarse-to-fine fashion, where a separate GAN model is trained at each level of the pyramid.
Abstract: In this paper we introduce a generative parametric model capable of producing high quality samples of natural images. Our approach uses a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion. At each level of the pyramid, a separate generative convnet model is trained using the Generative Adversarial Nets (GAN) approach (Goodfellow et al.). Samples drawn from our model are of significantly higher quality than alternate approaches. In a quantitative assessment by human evaluators, our CIFAR10 samples were mistaken for real images around 40% of the time, compared to 10% for samples drawn from a GAN baseline model. We also show samples from models trained on the higher resolution images of the LSUN scene dataset.

854 citations


Posted Content•
TL;DR: A very simple bag-of-words baseline for visual question answering that concatenates the word features from the question and CNN features from the image to predict the answer.
Abstract: We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows comparable performance to many recent approaches using recurrent neural networks. To explore the strengths and weaknesses of the trained model, we also provide an interactive web demo and open-source code.

316 citations
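The baseline really is this simple: a bag-of-words vector for the question is concatenated with a precomputed CNN image feature and scored by a single linear layer over candidate answers. The numpy sketch below uses a toy vocabulary and assumed dimensions:

```python
# Sketch of the bag-of-words VQA baseline: concatenate question BoW features
# with a (precomputed) CNN image feature and classify with a linear layer.
import numpy as np

vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "ball": 4}  # toy vocabulary
n_answers, img_dim = 100, 512                       # assumed sizes

def bow(question):
    v = np.zeros(len(vocab))
    for word in question.lower().split():
        if word in vocab:
            v[vocab[word]] += 1.0
    return v

rng = np.random.default_rng(0)
W = rng.standard_normal((n_answers, len(vocab) + img_dim)) * 0.01  # learned in practice
b = np.zeros(n_answers)

image_feature = rng.standard_normal(img_dim)        # stand-in for CNN features
x = np.concatenate([bow("What color is the ball"), image_feature])
scores = W @ x + b                                  # logits over candidate answers
predicted_answer = int(np.argmax(scores))
```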


Proceedings Article•DOI•
07 Jun 2015
TL;DR: In this paper, the Pose Invariant PErson Recognition (PIPER) method is proposed, which accumulates the cues of poselet-level person recognizers trained by deep convolutional networks to discount for the pose variations, combined with a face recognizer and a global recognizer.
Abstract: We explore the task of recognizing people's identities in photo albums in an unconstrained setting. To facilitate this, we introduce the new People In Photo Albums (PIPA) dataset, consisting of over 60,000 instances of ∼2000 individuals collected from public Flickr photo albums. With only about half of the person images containing a frontal face, the recognition task is very challenging due to the large variations in pose, clothing, camera viewpoint, image resolution and illumination. We propose the Pose Invariant PErson Recognition (PIPER) method, which accumulates the cues of poselet-level person recognizers trained by deep convolutional networks to discount for the pose variations, combined with a face recognizer and a global recognizer. Experiments on three different settings confirm that in our unconstrained setup PIPER significantly improves on the performance of DeepFace, which is one of the best face recognizers as measured on the LFW dataset.

132 citations
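The abstract leaves the cue combination at a high level; as a loose illustration only (the actual PIPER combination is more involved), a weighted fusion of identity scores from whichever recognizers fire on an instance might look like the sketch below. The weights and averaging scheme are assumptions.

```python
# Loose illustration of fusing identity scores from multiple recognizers
# (poselet-level, face, global). Weights and the averaging are assumptions;
# cues that are not visible in an instance (e.g. no frontal face) are skipped.
import numpy as np

def fuse_scores(recognizer_scores, weights):
    """recognizer_scores: dict name -> (n_identities,) scores, or None
    when the cue did not fire. Assumes at least one cue always fires."""
    total, norm = 0.0, 0.0
    for name, scores in recognizer_scores.items():
        if scores is not None:
            total = total + weights[name] * scores
            norm += weights[name]
    return total / norm

rng = np.random.default_rng(0)
scores = {"global": rng.standard_normal(2000),
          "face": None,                            # no frontal face in this image
          "poselet_3": rng.standard_normal(2000)}
weights = {"global": 1.0, "face": 2.0, "poselet_3": 0.5}
identity = int(np.argmax(fuse_scores(scores, weights)))
```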


Posted Content•
TL;DR: The Pose Invariant PErson Recognition (PIPER) method is proposed, which accumulates the cues of poselet-level person recognizers trained by deep convolutional networks to discount for the pose variations, combined with a face recognizer and a global recognizer.
Abstract: We explore the task of recognizing people's identities in photo albums in an unconstrained setting. To facilitate this, we introduce the new People In Photo Albums (PIPA) dataset, consisting of over 60,000 instances of ∼2000 individuals collected from public Flickr photo albums. With only about half of the person images containing a frontal face, the recognition task is very challenging due to the large variations in pose, clothing, camera viewpoint, image resolution and illumination. We propose the Pose Invariant PErson Recognition (PIPER) method, which accumulates the cues of poselet-level person recognizers trained by deep convolutional networks to discount for the pose variations, combined with a face recognizer and a global recognizer. Experiments on three different settings confirm that in our unconstrained setup PIPER significantly improves on the performance of DeepFace, which is one of the best face recognizers as measured on the LFW dataset.

117 citations


Posted Content•
31 Mar 2015
TL;DR: This paper introduces a variant of Memory Networks that needs significantly less supervision to perform question answering tasks and applies it to the synthetic bAbI tasks, showing that the approach is competitive with the supervised approach, particularly when trained on a sufficiently large amount of data.
Abstract: In this paper we introduce a variant of Memory Networks (Weston et al., 2015b) that needs significantly less supervision to perform question answering tasks. The original model requires that the sentences supporting the answer be explicitly indicated during training. In contrast, our approach only requires the answer to the question during training. We apply the model to the synthetic bAbI tasks, showing that our approach is competitive with the supervised approach, particularly when trained on a sufficiently large amount of data. Furthermore, it decisively beats other weakly supervised approaches based on LSTMs. The approach is quite general and can potentially be applied to many other tasks that require capturing long-term dependencies.

111 citations


Proceedings Article•DOI•
Kevin Tang, Manohar Paluri, Li Fei-Fei, Rob Fergus, Lubomir Bourdev
07 Dec 2015
TL;DR: This work tackles the problem of performing image classification with location context, and explores different ways of encoding and extracting features from the GPS coordinates, and shows how to naturally incorporate these features into a Convolutional Neural Network, the current state-of-the-art for most image classification and recognition problems.
Abstract: With the widespread availability of cellphones and cameras that have GPS capabilities, it is common for images being uploaded to the Internet today to have GPS coordinates associated with them. In addition to research that tries to predict GPS coordinates from visual features, this also opens up the door to problems that are conditioned on the availability of GPS coordinates. In this work, we tackle the problem of performing image classification with location context, in which we are given the GPS coordinates for images in both the train and test phases. We explore different ways of encoding and extracting features from the GPS coordinates, and show how to naturally incorporate these features into a Convolutional Neural Network (CNN), the current state-of-the-art for most image classification and recognition problems. We also show how it is possible to simultaneously learn the optimal pooling radii for a subset of our features within the CNN framework. To evaluate our model and to help promote research in this area, we identify a set of location-sensitive concepts and annotate a subset of the Yahoo Flickr Creative Commons 100M dataset that has GPS coordinates with these concepts, which we make publicly available. By leveraging location context, we are able to achieve almost a 7% gain in mean average precision.
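The central move is to encode the GPS coordinates as a feature vector and feed it into the network alongside the visual features. The sketch below shows the simplest version, concatenation before the final classifier; the toy location encoding is a stand-in for the paper's richer features and learned pooling radii.

```python
# Sketch of injecting location context into an image classifier: encode the
# GPS coordinates as a feature vector and concatenate it with CNN image
# features before the final classifier. The encoding is a simplified stand-in.
import numpy as np

def gps_features(lat, lon):
    """Toy location encoding: normalized coordinates plus periodic terms."""
    return np.array([lat / 90.0, lon / 180.0,
                     np.sin(np.radians(lon)), np.cos(np.radians(lon)),
                     np.sin(np.radians(lat)), np.cos(np.radians(lat))])

rng = np.random.default_rng(0)
n_concepts, img_dim = 25, 512
image_feature = rng.standard_normal(img_dim)       # from a CNN, precomputed
loc = gps_features(40.7128, -74.0060)              # e.g. New York City

x = np.concatenate([image_feature, loc])
W = rng.standard_normal((n_concepts, x.size)) * 0.01   # learned jointly with the CNN
concept_scores = W @ x                             # location-sensitive concept logits
```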

Proceedings Article•DOI•
Emily Denton, Jason Weston, Manohar Paluri, Lubomir Bourdev, Rob Fergus
10 Aug 2015
TL;DR: It is shown how user metadata (age, gender, etc.) combined with image features derived from a convolutional neural network can be used to perform hashtag prediction and it is demonstrated that modeling the user can significantly improve the tag prediction quality over current state-of-the-art methods.
Abstract: Understanding the content of users' image posts is a particularly interesting problem in social networks and web settings. Current machine learning techniques focus mostly on curated training sets of image-label pairs, and perform image classification given the pixels within the image. In this work we instead leverage the wealth of information available from users: firstly, we employ user hashtags to capture the description of image content; and secondly, we make use of valuable contextual information about the user. We show how user metadata (age, gender, etc.) combined with image features derived from a convolutional neural network can be used to perform hashtag prediction. We explore two ways of combining these heterogeneous features into a learning framework: (i) simple concatenation; and (ii) a 3-way multiplicative gating, where the image model is conditioned on the user metadata. We apply these models to a large dataset of de-identified Facebook posts and demonstrate that modeling the user can significantly improve the tag prediction quality over current state-of-the-art methods.
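Both combination schemes are small enough to sketch. In the concatenation variant the user features simply extend the input to the tag classifier; in the gating variant a projection of the user features rescales the image representation elementwise, conditioning the image model on the user. Dimensions and the exact gating form below are illustrative assumptions.

```python
# Sketch of the two ways to combine image features with user metadata:
# (i) simple concatenation, and (ii) multiplicative gating, where a projection
# of the user features rescales the image features elementwise.
import numpy as np

rng = np.random.default_rng(0)
img_dim, user_dim, n_tags = 512, 16, 1000
image_f = rng.standard_normal(img_dim)             # CNN image features
user_f = rng.standard_normal(user_dim)             # age, gender, ... encoded

# (i) concatenation
W_cat = rng.standard_normal((n_tags, img_dim + user_dim)) * 0.01
scores_cat = W_cat @ np.concatenate([image_f, user_f])

# (ii) multiplicative gating: condition the image model on the user
U = rng.standard_normal((img_dim, user_dim)) * 0.1 # learned projection (assumed form)
gate = 1.0 + np.tanh(U @ user_f)                   # per-dimension modulation
W_img = rng.standard_normal((n_tags, img_dim)) * 0.01
scores_gated = W_img @ (gate * image_f)            # hashtag logits
```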

Proceedings Article•DOI•
Li Wan, David Eigen, Rob Fergus
07 Jun 2015
TL;DR: In this article, a new model that combines the advantages of deformable parts models and convolutional networks is proposed, which considers all bounding boxes within an image, rather than isolated object instances.
Abstract: Deformable Parts Models and Convolutional Networks each have achieved notable performance in object detection. Yet these two approaches find their strengths in complementary areas: DPMs are well-versed in object composition, modeling fine-grained spatial relationships between parts; likewise, ConvNets are adept at producing powerful image features, having been discriminatively trained directly on the pixels. In this paper, we propose a new model that combines these two approaches, obtaining the advantages of each. We train this model using a new structured loss function that considers all bounding boxes within an image, rather than isolated object instances. This enables the non-maximal suppression (NMS) operation, previously treated as a separate post-processing stage, to be integrated into the model. This allows for discriminative training of our combined Convnet + DPM + NMS model in end-to-end fashion. We evaluate our system on PASCAL VOC 2007 and 2011 datasets, achieving competitive results on both benchmarks.

Posted Content•
TL;DR: MazeBase is introduced, an environment for simple 2D games, designed as a sandbox for machine learning approaches to reasoning and planning, and models trained on the MazeBase version can be directly applied to StarCraft, where they consistently beat the in-game AI.
Abstract: This paper introduces MazeBase: an environment for simple 2D games, designed as a sandbox for machine learning approaches to reasoning and planning. Within it, we create 10 simple games embodying a range of algorithmic tasks (e.g. if-then statements or set negation). A variety of neural models (fully connected, convolutional network, memory network) are deployed via reinforcement learning on these games, with and without a procedurally generated curriculum. Despite the tasks' simplicity, the performance of the models is far from optimal, suggesting directions for future development. We also demonstrate the versatility of MazeBase by using it to emulate small combat scenarios from StarCraft. Models trained on the MazeBase version can be directly applied to StarCraft, where they consistently beat the in-game AI.
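The sandbox character of MazeBase comes from a plain observe/act/reward loop over small procedurally generated grids. The toy environment below captures that contract; it is a sketch in that spirit, not the released MazeBase API.

```python
# Toy grid-world in the spirit of MazeBase's observe/act/reward loop
# (a sketch, not the released MazeBase API): reach the goal cell.
import numpy as np

class ToyMaze:
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=5, seed=0):
        rng = np.random.default_rng(seed)
        self.size = size
        self.agent = tuple(rng.integers(0, size, 2))
        self.goal = tuple(rng.integers(0, size, 2))

    def observe(self):
        grid = np.zeros((self.size, self.size), dtype=int)
        grid[self.goal] = 2
        grid[self.agent] = 1
        return grid

    def step(self, action):
        dr, dc = self.MOVES[action]
        r = min(max(self.agent[0] + dr, 0), self.size - 1)
        c = min(max(self.agent[1] + dc, 0), self.size - 1)
        self.agent = (r, c)
        done = self.agent == self.goal
        return self.observe(), (1.0 if done else -0.1), done  # small step penalty

env = ToyMaze()
obs, reward, done = env.step("right")
```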

Proceedings Article•DOI•
Yunchao Gong, Marcin Pawlowski, Fei Yang, Louis Brandy, Lubomir Bourdev, Rob Fergus
07 Jun 2015
TL;DR: This paper presents a fast binary k-means algorithm that works directly on the similarity-preserving hashes of images and clusters them into binary centers on which to build hash indexes to speed up computation.
Abstract: This paper addresses the problem of clustering a very large number of photos (i.e. hundreds of millions a day) in a stream into millions of clusters. This is particularly important given the popularity of photo-sharing websites such as Facebook, Google, and Instagram. Given the large number of photos available online, efficiently organizing them is an open problem. To address this problem, we propose to cluster the binary hash codes of a large number of photos into binary cluster centers. We present a fast binary k-means algorithm that works directly on the similarity-preserving hashes of images and clusters them into binary centers on which we can build hash indexes to speed up computation. The proposed method is capable of clustering millions of photos on a single machine in a few minutes. We show that this approach is usually several orders of magnitude faster than standard k-means and produces comparable clustering accuracy. In addition, we propose an online clustering method based on binary k-means that is capable of clustering a large photo stream on a single machine, and show applications to spam detection and trending photo discovery.
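The trick that makes this fast is keeping both points and centers binary: distance is then a Hamming distance (XOR plus popcount) and the center update is a per-bit majority vote. A small numpy sketch over 64-bit codes, without the hash indexes the paper adds for further speed:

```python
# Sketch of binary k-means: points and centers are binary codes, distance is
# Hamming distance, and the center update is a per-bit majority vote.
import numpy as np

def binary_kmeans(codes, k, iters=10, seed=0):
    """codes: (n, bits) array of 0/1. Returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    centers = codes[rng.choice(len(codes), size=k, replace=False)]
    for _ in range(iters):
        # Hamming distance via XOR; dists has shape (n, k)
        dists = (codes[:, None, :] ^ centers[None, :, :]).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):                         # majority vote per bit
            members = codes[assign == j]
            if len(members):
                centers[j] = (members.mean(axis=0) >= 0.5).astype(codes.dtype)
    return centers, assign

rng = np.random.default_rng(1)
codes = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)   # e.g. 64-bit hashes
centers, assign = binary_kmeans(codes, k=8)
```

The online variant in the abstract would assign each incoming batch to the nearest existing binary center in the same way, rather than re-clustering from scratch.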

Posted Content•
TL;DR: This work presents an approach for learning simple algorithms such as copying, multi-digit addition and single digit multiplication directly from examples, and shows that the bottleneck is in the capabilities of the controller rather than in the search incurred by the learning.
Abstract: We present an approach for learning simple algorithms such as copying, multi-digit addition and single digit multiplication directly from examples. Our framework consists of a set of interfaces, accessed by a controller. Typical interfaces are 1-D tapes or 2-D grids that hold the input and output data. For the controller, we explore a range of neural network-based models which vary in their ability to abstract the underlying algorithm from training instances and generalize to test examples with many thousands of digits. The controller is trained using $Q$-learning with several enhancements and we show that the bottleneck is in the capabilities of the controller rather than in the search incurred by $Q$-learning.
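The separation between controller and interfaces is the load-bearing idea: an interface such as a 1-D tape exposes only primitive actions, and the controller chooses among them. The paper's controller is a neural network trained with Q-learning; the sketch below substitutes a hand-coded policy for the copy task purely to show the interface contract.

```python
# Sketch of the interface/controller split: an input tape and an output tape
# expose only primitive actions (read, write, move); the controller decides
# which action to take. A hand-coded policy solves the copy task here; the
# paper trains a neural controller with Q-learning instead.
class Tape:
    def __init__(self, symbols=None):
        self.cells = list(symbols or [])
        self.head = 0

    def read(self):
        return self.cells[self.head] if self.head < len(self.cells) else None

    def write(self, symbol):
        self.cells.append(symbol)

    def move(self):
        self.head += 1

def copy_controller(inp, out):
    """Trivial policy for the copy task: read, write, advance until exhausted."""
    while (symbol := inp.read()) is not None:
        out.write(symbol)
        inp.move()

inp, out = Tape("31415"), Tape()
copy_controller(inp, out)
assert "".join(out.cells) == "31415"
```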

Journal Article•
TL;DR: It is found that deep convolutional networks trained for classification have substantial predictive power, unlike simpler features computed from the same massive dataset, showing how typicality might emerge as a byproduct of a complex model trained to maximize classification performance.

Patent•
29 Dec 2015
TL;DR: In this patent, a convolutional neural network with both two-dimensional and three-dimensional convolutional layers processes video frames obtained at a first resolution, and three-dimensional de-convolutional layers produce outputs corresponding to the frames.
Abstract: Systems, methods, and non-transitory computer-readable media can obtain a set of video frames at a first resolution; process the set of video frames using a convolutional neural network to output one or more signals, the convolutional neural network including (i) a set of two-dimensional convolutional layers and (ii) a set of three-dimensional convolutional layers, wherein the processing causes the set of video frames to be reduced to a second resolution; process the one or more signals using a set of three-dimensional de-convolutional layers of the convolutional neural network; and obtain one or more outputs corresponding to the set of video frames from the convolutional neural network.

Posted Content•
TL;DR: This paper presents a deep 3D convolutional architecture trained end to end to perform voxel-level prediction, and shows that the same exact architecture can be used to achieve competitive results on three widely different voxels-prediction tasks: video semantic segmentation, optical flow estimation, and video coloring.
Abstract: Over the last few years deep learning methods have emerged as one of the most prominent approaches for video analysis. However, so far their most successful applications have been in the area of video classification and detection, i.e., problems involving the prediction of a single class label or a handful of output variables per video. Furthermore, while deep networks are commonly recognized as the best models to use in these domains, there is a widespread perception that in order to yield successful results they often require time-consuming architecture search, manual tweaking of parameters and computationally intensive pre-processing or post-processing methods. In this paper we challenge these views by presenting a deep 3D convolutional architecture trained end to end to perform voxel-level prediction, i.e., to output a variable at every voxel of the video. Most importantly, we show that the same exact architecture can be used to achieve competitive results on three widely different voxel-prediction tasks: video semantic segmentation, optical flow estimation, and video coloring. The three networks learned on these problems are trained from raw video without any form of preprocessing and their outputs do not require post-processing to achieve outstanding performance. Thus, they offer an efficient alternative to traditional and much more computationally expensive methods in these video domains.
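Voxel-level prediction requires the network to return to full resolution, so the trunk is a 3D encoder-decoder: 3D convolutions downsample the video volume and 3D deconvolutions upsample the prediction back to one output per voxel. A minimal PyTorch sketch with assumed channel counts; the same trunk would be reused across the three tasks by changing the output channels.

```python
# Sketch of a 3D conv/deconv encoder-decoder for voxel-level prediction:
# every voxel of the input video gets an output value. Channel counts and
# the single-stage design are illustrative assumptions.
import torch
import torch.nn as nn

class VoxelNet(nn.Module):
    def __init__(self, out_channels):              # e.g. classes, flow dims, colors
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decode = nn.Sequential(                # deconvs restore full resolution
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, out_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, video):                       # (B, 3, T, H, W)
        return self.decode(self.encode(video))      # (B, out_channels, T, H, W)

net = VoxelNet(out_channels=2)                      # e.g. 2 channels for optical flow
pred = net(torch.randn(1, 3, 8, 64, 64))            # one prediction per voxel
```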

Journal Article•DOI•
TL;DR: This proof-of-concept case series suggests that using eye tracking to detect CN palsy while the patient watches television, or its equivalent, represents a new capability for this technology.
Abstract: OBJECT Automated eye movement tracking may provide clues to nervous system function at many levels. Spatial calibration of the eye tracking device requires the subject to have relatively intact ocular motility that implies function of cranial nerves (CNs) III (oculomotor), IV (trochlear), and VI (abducent) and their associated nuclei, along with the multiple regions of the brain imparting cognition and volition. The authors have developed a technique for eye tracking that uses temporal rather than spatial calibration, enabling detection of impaired ability to move the pupil relative to normal (neurologically healthy) control volunteers. This work was performed to demonstrate that this technique may detect CN palsies related to brain compression and to provide insight into how the technique may be of value for evaluating neuropathological conditions associated with CN palsy, such as hydrocephalus or acute mass effect. METHODS The authors recorded subjects' eye movements by using an Eyelink 1000 eye tracker...

Patent•
Yunchao Gong, Liu Liu, Lubomir Bourdev, Ming Yang, Rob Fergus
29 Dec 2015
TL;DR: In this patent, a compressed CNN is used to apply a media processing technique to a media content item to produce information about the item, based on which it can be determined whether to transmit at least a portion of the item to one or more remote servers for additional media processing.
Abstract: Systems, methods, and non-transitory computer-readable media can receive a compressed convolutional neural network (CNN). A media content item to be processed can be acquired. The compressed CNN can be utilized to apply a media processing technique to the media content item to produce information about the media content item. It can be determined, based on at least some of the information about the media content item, whether to transmit at least a portion of the media content item to one or more remote servers for additional media processing.
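The claimed flow amounts to a confidence gate: run the compressed CNN on the device and escalate the media item to remote servers only when the local result is uncertain. A loose sketch of that decision; the threshold and score semantics are assumptions, not the patent's specifics.

```python
# Loose sketch of the gating idea: run a compressed CNN locally and transmit
# the media item for server-side processing only when the on-device result
# is uncertain. The confidence threshold is an illustrative assumption.
import numpy as np

def process_locally(media_item, compressed_cnn, threshold=0.8):
    probs = compressed_cnn(media_item)             # on-device inference
    confidence = float(np.max(probs))
    label = int(np.argmax(probs))
    send_to_server = confidence < threshold        # escalate only uncertain items
    return label, confidence, send_to_server

# Stand-in "compressed CNN" returning class probabilities:
rng = np.random.default_rng(0)
fake_cnn = lambda item: rng.dirichlet(np.ones(10))
label, conf, escalate = process_locally(np.zeros((224, 224, 3)), fake_cnn)
```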

Posted Content•
Kevin Tang, Manohar Paluri, Li Fei-Fei, Rob Fergus, Lubomir Bourdev
TL;DR: In this article, the authors tackle the problem of performing image classification with location context, in which they are given the GPS coordinates for images in both the train and test phases, and they explore different ways of encoding and extracting features from GPS coordinates, and show how to naturally incorporate these features into a CNN, the current state-of-the-art for most image classification and recognition problems.
Abstract: With the widespread availability of cellphones and cameras that have GPS capabilities, it is common for images being uploaded to the Internet today to have GPS coordinates associated with them. In addition to research that tries to predict GPS coordinates from visual features, this also opens up the door to problems that are conditioned on the availability of GPS coordinates. In this work, we tackle the problem of performing image classification with location context, in which we are given the GPS coordinates for images in both the train and test phases. We explore different ways of encoding and extracting features from the GPS coordinates, and show how to naturally incorporate these features into a Convolutional Neural Network (CNN), the current state-of-the-art for most image classification and recognition problems. We also show how it is possible to simultaneously learn the optimal pooling radii for a subset of our features within the CNN framework. To evaluate our model and to help promote research in this area, we identify a set of location-sensitive concepts and annotate a subset of the Yahoo Flickr Creative Commons 100M dataset that has GPS coordinates with these concepts, which we make publicly available. By leveraging location context, we are able to achieve almost a 7% gain in mean average precision.

Patent•
Yunchao Gong, Marcin Pawlowski, Fei Yang, Lubomir Bourdev, Louis Brandy, Rob Fergus
28 Dec 2015
TL;DR: In this patent, a set of clusters is generated by clustering respective binary hash codes for each content item in a first batch of content items, wherein content items included in a cluster are visually similar to one another.
Abstract: Systems, methods, and non-transitory computer-readable media can obtain a first batch of content items to be clustered. A set of clusters can be generated by clustering respective binary hash codes for each content item in the first batch, wherein content items included in a cluster are visually similar to one another. A next batch of content items to be clustered can be obtained. One or more respective binary hash codes for the content items in the next batch can be assigned to a cluster in the set of clusters.