Showing papers by "Aditya Khosla" published in 2015

Journal Article•DOI•

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky¹, Jia Deng², Hao Su¹, Jonathan Krause¹, Sanjeev Satheesh¹, Sean Ma¹, Zhiheng Huang¹, Andrej Karpathy¹, Aditya Khosla³, Michael S. Bernstein¹, Alexander C. Berg⁴, Li Fei-Fei¹

Stanford University¹, University of Michigan², Massachusetts Institute of Technology³, University of North Carolina at Chapel Hill⁴

01 Dec 2015-International Journal of Computer Vision

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images. It has been run annually from 2010 to present, attracting participation from more than fifty institutions.

Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations

Journal Article•

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Michael S. Bernstein, Li Fei-Fei, Alexander C. Berg, Aditya Khosla

01 Apr 2015-Springer US

6,730 citations

Posted Content•

Learning Deep Features for Discriminative Localization

Bolei Zhou¹, Aditya Khosla¹, Agata Lapedriza¹, Aude Oliva¹, Antonio Torralba¹

Massachusetts Institute of Technology¹

14 Dec 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors revisited the global average pooling layer and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels.

Abstract: In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014, which is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach. We demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them.
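
The abstract does not spell out how global average pooling yields localization, so the following is only a minimal sketch of the class activation mapping idea. It assumes a network whose last convolutional layer is followed by global average pooling and a single linear classification layer. The function names, array shapes, and random inputs are illustrative assumptions, not the authors' code.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weight each feature map by the class's linear-layer weight and sum.

    feature_maps: array of shape (K, H, W) from the last conv layer.
    class_weights: array of shape (K,) for one class.
    Returns an (H, W) map; large values mark regions that raise the class score.
    """
    return np.tensordot(class_weights, feature_maps, axes=([0], [0]))

def gap_class_score(feature_maps, class_weights):
    """Class score (bias omitted): global average pooling, then a linear layer."""
    pooled = feature_maps.mean(axis=(1, 2))  # shape (K,)
    return float(class_weights @ pooled)

# Hypothetical inputs: 4 feature maps of size 7x7 and one class's weights.
rng = np.random.default_rng(0)
features = rng.random((4, 7, 7))
weights = rng.standard_normal(4)

cam = class_activation_map(features, weights)
score = gap_class_score(features, weights)

# Location of the strongest response in the coarse 7x7 map.
peak_row, peak_col = np.unravel_index(np.argmax(cam), cam.shape)
```

Under these assumptions the class score equals the spatial mean of the map, which is why a classifier trained only on image-level labels still carries a per-location signal.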

5,065 citations

Proceedings Article•DOI•

3D ShapeNets: A deep representation for volumetric shapes

Zhirong Wu¹, Shuran Song¹, Aditya Khosla², Fisher Yu¹, Linguang Zhang¹, Xiaoou Tang³, Jianxiong Xiao¹

Princeton University¹, Massachusetts Institute of Technology², The Chinese University of Hong Kong³

07 Jun 2015

TL;DR: This work proposes to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid, using a Convolutional Deep Belief Network, and shows that this 3D deep representation enables significant performance improvement over the state of the art in a variety of tasks.

Abstract: 3D shape is a crucial but heavily underutilized cue in today's computer vision systems, mostly due to the lack of a good generic shape representation. With the recent availability of inexpensive 2.5D depth sensors (e.g. Microsoft Kinect), it is becoming increasingly important to have a powerful 3D shape representation in the loop. Apart from category recognition, recovering full 3D shapes from view-based 2.5D depth maps is also a critical part of visual understanding. To this end, we propose to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid, using a Convolutional Deep Belief Network. Our model, 3D ShapeNets, learns the distribution of complex 3D shapes across different object categories and arbitrary poses from raw CAD data, and discovers hierarchical compositional part representation automatically. It naturally supports joint object recognition and shape completion from 2.5D depth maps, and it enables active object recognition through view planning. To train our 3D deep learning model, we construct ModelNet - a large-scale 3D CAD model dataset. Extensive experiments show that our 3D deep representation enables significant performance improvement over the state of the art in a variety of tasks.
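
To make the "binary variables on a 3D voxel grid" concrete, here is a small sketch that turns a point sample of a surface into a binary occupancy grid. It covers only the input representation, not the Convolutional Deep Belief Network itself. The grid resolution, the function name `voxelize`, and the synthetic sphere are assumptions chosen for illustration.

```python
import numpy as np

def voxelize(points, resolution=24):
    """Map an (N, 3) point set to a binary occupancy grid.

    The points are scaled uniformly so the longest side of their bounding
    box spans the grid; each voxel is 1 if any point falls in it, else 0.
    """
    points = np.asarray(points, dtype=float)
    lower = points.min(axis=0)
    extent = (points.max(axis=0) - lower).max()
    if extent == 0:
        extent = 1.0  # degenerate case: all points coincide
    idx = np.floor((points - lower) / extent * (resolution - 1)).astype(int)
    idx = np.clip(idx, 0, resolution - 1)
    grid = np.zeros((resolution, resolution, resolution), dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

# Hypothetical shape: 500 points sampled on a unit sphere.
rng = np.random.default_rng(0)
sphere = rng.standard_normal((500, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)

grid = voxelize(sphere)
occupied_fraction = grid.mean()  # share of voxels marked as occupied
```

A generative model over such grids would assign a probability to each voxel being occupied; this sketch stops at producing the grid.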

4,266 citations

Proceedings Article•

Object Detectors Emerge in Deep Scene CNNs

Bolei Zhou¹, Aditya Khosla¹, Agata Lapedriza¹, Aude Oliva¹, Antonio Torralba¹

Massachusetts Institute of Technology¹

01 May 2015

TL;DR: This work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having been explicitly taught the notion of objects.

Abstract: With the success of new computational architectures for visual processing, such as convolutional neural networks (CNN) and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. One important factor for continued progress is to understand the representations that are learned by the inner layers of these deep architectures. Here we show that object detectors emerge from training CNNs to perform scene classification. As scenes are composed of objects, the CNN for scene classification automatically discovers meaningful object detectors, representative of the learned scene categories. With object detectors emerging as a result of learning to recognize scenes, our work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having been explicitly taught the notion of objects.

649 citations

Proceedings Article•DOI•

Understanding and Predicting Image Memorability at a Large Scale

Aditya Khosla¹, Akhil S. Raju¹, Antonio Torralba¹, Aude Oliva¹

Massachusetts Institute of Technology¹

07 Dec 2015

TL;DR: This work builds LaMem, the largest annotated image memorability dataset to date, and uses Convolutional Neural Networks to demonstrate that one can now robustly estimate the memorability of images from many different classes, positioning memorability and deep memorability features as prime candidates to estimate the utility of information for cognitive systems.

Abstract: Progress in estimating visual memorability has been limited by the small scale and lack of variety of benchmark data. Here, we introduce a novel experimental procedure to objectively measure human memory, allowing us to build LaMem, the largest annotated image memorability dataset to date (containing 60,000 images from diverse sources). Using Convolutional Neural Networks (CNNs), we show that fine-tuned deep features outperform all other features by a large margin, reaching a rank correlation of 0.64, near human consistency (0.68). Analysis of the responses of the high-level CNN layers shows which objects and regions are positively, and negatively, correlated with memorability, allowing us to create memorability maps for each image and provide a concrete method to perform image memorability manipulation. This work demonstrates that one can now robustly estimate the memorability of images from many different classes, positioning memorability and deep memorability features as prime candidates to estimate the utility of information for cognitive systems. Our model and data are available at: http://memorability.csail.mit.edu.
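
The abstract reports results as a rank correlation between predicted and measured memorability. As a sketch of how such a number can be computed, the code below implements Spearman's rank correlation, which is assumed here to be the coefficient meant. The score values are made up for illustration and are not taken from LaMem.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float).ravel()
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks inside each group of ties by the group mean.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def spearman_rank_correlation(a, b):
    """Pearson correlation computed on the ranks of a and b."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

# Hypothetical memorability scores for five images.
measured = [0.91, 0.45, 0.78, 0.60, 0.83]
predicted = [0.88, 0.52, 0.81, 0.66, 0.79]

rho = spearman_rank_correlation(measured, predicted)
```

Because only the ordering matters, a model can score well on this measure even when its raw outputs sit on a different scale from the human data.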

285 citations

Proceedings Article•

Where are they looking?

Adrià Recasens¹, Aditya Khosla¹, Carl Vondrick¹, Antonio Torralba¹

Massachusetts Institute of Technology¹

07 Dec 2015

TL;DR: A deep neural network-based approach for gaze-following and a new benchmark dataset, GazeFollow, for thorough evaluation are proposed and it is shown that this approach produces reliable results, even when viewing only the back of the head.

Abstract: Humans have the remarkable ability to follow the gaze of other people to identify what they are looking at. Following eye gaze, or gaze-following, is an important ability that allows us to understand what other people are thinking, the actions they are performing, and even predict what they might do next. Despite the importance of this topic, this problem has only been studied in limited scenarios within the computer vision community. In this paper, we propose a deep neural network-based approach for gaze-following and a new benchmark dataset, GazeFollow, for thorough evaluation. Given an image and the location of a head, our approach follows the gaze of the person and identifies the object being looked at. Our deep network is able to discover how to extract head pose and gaze orientation, and to select objects in the scene that are in the predicted line of sight and likely to be looked at (such as televisions, balls and food). The quantitative evaluation shows that our approach produces reliable results, even when viewing only the back of the head. While our method outperforms several baseline approaches, we are still far from reaching human performance on this task. Overall, we believe that gaze-following is a challenging and important problem that deserves more attention from the community.

165 citations

Proceedings Article•DOI•

What Makes an Object Memorable?

Rachit Dubey¹, Joshua C. Peterson², Aditya Khosla³, Ming-Hsuan Yang², Bernard Ghanem¹

King Abdullah University of Science and Technology¹, University of California, Merced², Massachusetts Institute of Technology³

07 Dec 2015

TL;DR: This paper augments both the images and object segmentations from the PASCAL-S dataset with ground truth memorability scores and sheds light on the various factors and properties that make an object memorable (or forgettable) to humans.

Abstract: Recent studies on image memorability have shed light on what distinguishes the memorability of different images and the intrinsic and extrinsic properties that make those images memorable. However, a clear understanding of the memorability of specific objects inside an image remains elusive. In this paper, we provide the first attempt to answer the question: what exactly is remembered about an image? We augment both the images and object segmentations from the PASCAL-S dataset with ground truth memorability scores and shed light on the various factors and properties that make an object memorable (or forgettable) to humans. We analyze various visual factors that may influence object memorability (e.g. color, visual saliency, and object categories). We also study the correlation between object and image memorability and find that image memorability is greatly affected by the memorability of its most memorable object. Lastly, we explore the effectiveness of deep learning and other computational approaches in predicting object memorability in images. Our efforts offer a deeper understanding of memorability in general thereby opening up avenues for a wide variety of applications.

103 citations

Posted Content•DOI•

Dynamics of scene representations in the human brain revealed by magnetoencephalography and deep neural networks

Radoslaw Martin Cichy¹, Aditya Khosla¹, Dimitrios Pantazis¹, Aude Oliva¹

Massachusetts Institute of Technology¹

23 Nov 2015-bioRxiv

TL;DR: Together these data provide a first description of an electrophysiological signal for layout processing in humans, and a novel quantitative model of how spatial layout representations may emerge in the human brain.

Abstract: Human scene recognition is a rapid multistep process evolving over time from single scene image to spatial layout processing. We used multivariate pattern analyses on magnetoencephalography (MEG) data to unravel the time course of this cortical process. Following an early signal for lower-level visual analysis of single scenes at ~100ms, we found a marker of real-world scene size, i.e. spatial layout processing, at ~250ms indexing neural representations robust to changes in unrelated scene properties and viewing conditions. For a quantitative explanation that captures the complexity of scene recognition, we compared MEG data to a deep neural network model trained on scene classification. Representations of scene size emerged intrinsically in the model, and resolved emerging neural scene size representation. Together our data provide a first description of an electrophysiological signal for layout processing in humans, and a novel quantitative model of how spatial layout representations may emerge in the human brain.

25 citations

Journal Article•DOI•

Guest Editorial: Scene Understanding

Derek Hoiem¹, James Hays², Jianxiong Xiao³, Aditya Khosla

University of Illinois at Urbana–Champaign¹, Brown University², Princeton University³

01 Apr 2015-International Journal of Computer Vision

TL;DR: In this issue, several papers offer improvements to image segmentation and labeling through use of region classifiers, detectors, and object and scene context.

Abstract: Scene understanding is the ability to visually analyze a scene to answer questions such as: What is happening? Why is it happening? What will happen next? What should I do? For example, in the context of driving safety, the vision system would need to recognize nearby people and vehicles, anticipate their motions, infer traffic patterns, and detect road conditions. So far, research has focused on providing complete (e.g., every pixel labeled) or holistic (reasoning about several different scene elements) interpretations, often taking into account scene geometry or 3D spatial relationships. Accordingly, in this issue, several papers offer improvements to image segmentation and labeling through use of region classifiers, detectors, and object and scene context: “Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation” (doi:10.1007/s11263-014-0777-6) by Gupta et al. addresses problems of interpreting indoor scenes from a paired RGB and depth image. The method infers whether observed contours are due to depth, normal, or albedo

17 citations

Posted Content•

Visualizing Object Detection Features

Carl Vondrick¹, Aditya Khosla¹, Hamed Pirsiavash², Tomasz Malisiewicz, Antonio Torralba¹

Massachusetts Institute of Technology¹, University of Maryland, Baltimore County²

19 Feb 2015-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work introduces algorithms to visualize the feature spaces used by object detectors, giving a more intuitive understanding of recognition systems. The visualizations suggest that many false alarms are caused by the choice of feature space, so a better learning algorithm or bigger datasets are unlikely to correct these errors without improving the features.

Abstract: We introduce algorithms to visualize feature spaces used by object detectors. Our method works by inverting a visual feature back to multiple natural images. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector's failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and supports the view that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of recognition systems.

Journal Article•DOI•

Mapping human visual representations in space and time by neural networks

Radoslaw Martin Cichy¹, Aditya Khosla¹, Dimitrios Pantazis¹, Antonio Torralba¹, Aude Oliva¹

Massachusetts Institute of Technology¹

01 Sep 2015-Journal of Vision

TL;DR: CNNs are a promising formal model of human visual object recognition. Combined with fMRI and MEG, they provide an integrated spatiotemporal and algorithmically explicit view of the first few hundred milliseconds of object recognition.

Abstract: The neural machinery underlying visual object recognition comprises a hierarchy of cortical regions in the ventral visual stream. The spatiotemporal dynamics of information flow in this hierarchy of regions is largely unknown. Here we tested the hypothesis that there is a correspondence between the spatiotemporal neural processes in the human brain and the layer hierarchy of a deep convolutional neural network (CNN). We presented 118 images of real-world objects to human participants (N=15) while we measured their brain activity with functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG). We trained an 8 layer (5 convolutional layers, 3 fully connected layers) CNN to predict 683 object categories with 900K training images from the ImageNet dataset. We obtained layer-specific CNN responses to the same 118 images. To compare brain-imaging data with the CNN in a common framework, we used representational similarity analysis. The key idea is that if two conditions evoke similar patterns in brain imaging data, they should also evoke similar patterns in the computer model. We thus determined 'where' (fMRI) and 'when' (MEG) the CNNs predicted brain activity. We found a correspondence in hierarchy between cortical regions, processing time, and CNN layers. Low CNN layers predicted MEG activity early and high layers relatively later; low CNN layers predicted fMRI activity in early visual regions, and high layers in late visual regions. Surprisingly, the correspondence between CNN layer hierarchy and cortical regions held for the ventral and dorsal visual stream. Results were dependent on amount of training and type of training material. Our results show that CNNs are a promising formal model of human visual object recognition. Combined with fMRI and MEG, they provide an integrated spatiotemporal and algorithmically explicit view of the first few hundred milliseconds of object recognition. Meeting abstract presented at VSS 2015.
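
The representational similarity analysis described in this abstract can be sketched as follows: build one dissimilarity matrix per data source over the same set of conditions, then correlate the matrices. The abstract does not state the distance or the comparison statistic, so the use of 1 minus Pearson correlation and of Spearman's rank correlation are assumptions, as are the array sizes and the random data standing in for brain and CNN responses.

```python
import numpy as np

def dissimilarity_matrix(patterns):
    """(conditions x features) -> (conditions x conditions) matrix of 1 - Pearson r."""
    return 1.0 - np.corrcoef(patterns)

def _ranks(values):
    """Ranks starting at 1; ties share their average rank."""
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    return np.bincount(inverse, weights=ranks)[inverse] / counts[inverse]

def compare_matrices(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two matrices."""
    upper = np.triu_indices(rdm_a.shape[0], k=1)  # skip diagonal and duplicates
    return float(np.corrcoef(_ranks(rdm_a[upper]), _ranks(rdm_b[upper]))[0, 1])

# Hypothetical responses to the same 12 conditions (e.g. images).
rng = np.random.default_rng(0)
brain_patterns = rng.standard_normal((12, 200))  # e.g. voxels or sensors
layer_patterns = rng.standard_normal((12, 64))   # e.g. units in one CNN layer

brain_rdm = dissimilarity_matrix(brain_patterns)
layer_rdm = dissimilarity_matrix(layer_patterns)
similarity = compare_matrices(brain_rdm, layer_rdm)
```

The two sources can have different numbers of features because the comparison happens between condition-by-condition matrices, which is what lets fMRI, MEG, and CNN layers share a common framework.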

Journal Article•

Guest Editorial: Scene Understanding

Derek Hoiem¹, James Hays², Jianxiong Xiao³, Aditya Khosla

University of Illinois at Urbana–Champaign¹, Brown University², Princeton University³

01 Mar 2015-Springer US

TL;DR: In this issue, several papers improve image segmentation and labeling through use of region classifiers, detectors, and object and scene context. For example, Gupta et al. address problems of interpreting indoor scenes from a paired RGB and depth image.
