scispace - formally typeset
Search or ask a question

Showing papers by "Aditya Khosla published in 2015"


Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) as mentioned in this paper is a benchmark in object category classification and detection on hundreds of object categories and millions of images, which has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations



Posted Content
TL;DR: In this article, the authors revisited the global average pooling layer and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels.
Abstract: In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014, which is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach. We demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them

5,065 citations


Proceedings ArticleDOI
07 Jun 2015
TL;DR: This work proposes to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid, using a Convolutional Deep Belief Network, and shows that this 3D deep representation enables significant performance improvement over the-state-of-the-arts in a variety of tasks.
Abstract: 3D shape is a crucial but heavily underutilized cue in today's computer vision systems, mostly due to the lack of a good generic shape representation. With the recent availability of inexpensive 2.5D depth sensors (e.g. Microsoft Kinect), it is becoming increasingly important to have a powerful 3D shape representation in the loop. Apart from category recognition, recovering full 3D shapes from view-based 2.5D depth maps is also a critical part of visual understanding. To this end, we propose to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid, using a Convolutional Deep Belief Network. Our model, 3D ShapeNets, learns the distribution of complex 3D shapes across different object categories and arbitrary poses from raw CAD data, and discovers hierarchical compositional part representation automatically. It naturally supports joint object recognition and shape completion from 2.5D depth maps, and it enables active object recognition through view planning. To train our 3D deep learning model, we construct ModelNet - a large-scale 3D CAD model dataset. Extensive experiments show that our 3D deep representation enables significant performance improvement over the-state-of-the-arts in a variety of tasks.

4,266 citations


Proceedings Article
01 May 2015
TL;DR: This work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having been explicitly taught the notion of objects.
Abstract: With the success of new computational architectures for visual processing, such as convolutional neural networks (CNN) and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. One important factor for continued progress is to understand the representations that are learned by the inner layers of these deep architectures. Here we show that object detectors emerge from training CNNs to perform scene classification. As scenes are composed of objects, the CNN for scene classification automatically discovers meaningful objects detectors, representative of the learned scene categories. With object detectors emerging as a result of learning to recognize scenes, our work demonstrates that the same network can perform both scene recognition and object localization in a single forward-pass, without ever having been explicitly taught the notion of objects.

649 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: LaMem is built, the largest annotated image memorability dataset to date, using Convolutional Neural Networks, to demonstrate that one can now robustly estimate the memorability of images from many different classes, positioning memorability and deep memorability features as prime candidates to estimate the utility of information for cognitive systems.
Abstract: Progress in estimating visual memorability has been limited by the small scale and lack of variety of benchmark data. Here, we introduce a novel experimental procedure to objectively measure human memory, allowing us to build LaMem, the largest annotated image memorability dataset to date (containing 60,000 images from diverse sources). Using Convolutional Neural Networks (CNNs), we show that fine-tuned deep features outperform all other features by a large margin, reaching a rank correlation of 0.64, near human consistency (0.68). Analysis of the responses of the high-level CNN layers shows which objects and regions are positively, and negatively, correlated with memorability, allowing us to create memorability maps for each image and provide a concrete method to perform image memorability manipulation. This work demonstrates that one can now robustly estimate the memorability of images from many different classes, positioning memorability and deep memorability features as prime candidates to estimate the utility of information for cognitive systems. Our model and data are available at: http://memorability.csail.mit.edu.

285 citations


Proceedings Article
07 Dec 2015
TL;DR: A deep neural network-based approach for gaze-following and a new benchmark dataset, GazeFollow, for thorough evaluation are proposed and it is shown that this approach produces reliable results, even when viewing only the back of the head.
Abstract: Humans have the remarkable ability to follow the gaze of other people to identify what they are looking at. Following eye gaze, or gaze-following, is an important ability that allows us to understand what other people are thinking, the actions they are performing, and even predict what they might do next. Despite the importance of this topic, this problem has only been studied in limited scenarios within the computer vision community. In this paper, we propose a deep neural network-based approach for gaze-following and a new benchmark dataset, GazeFollow, for thorough evaluation. Given an image and the location of a head, our approach follows the gaze of the person and identifies the object being looked at. Our deep network is able to discover how to extract head pose and gaze orientation, and to select objects in the scene that are in the predicted line of sight and likely to be looked at (such as televisions, balls and food). The quantitative evaluation shows that our approach produces reliable results, even when viewing only the back of the head. While our method outperforms several baseline approaches, we are still far from reaching human performance on this task. Overall, we believe that gaze-following is a challenging and important problem that deserves more attention from the community.

165 citations


Proceedings ArticleDOI
07 Dec 2015
TL;DR: This paper augments both the images and object segmentations from the PASCAL-S dataset with ground truth memorability scores and shed light on the various factors and properties that make an object memorable (or forgettable) to humans.
Abstract: Recent studies on image memorability have shed light on what distinguishes the memorability of different images and the intrinsic and extrinsic properties that make those images memorable. However, a clear understanding of the memorability of specific objects inside an image remains elusive. In this paper, we provide the first attempt to answer the question: what exactly is remembered about an image? We augment both the images and object segmentations from the PASCAL-S dataset with ground truth memorability scores and shed light on the various factors and properties that make an object memorable (or forgettable) to humans. We analyze various visual factors that may influence object memorability (e.g. color, visual saliency, and object categories). We also study the correlation between object and image memorability and find that image memorability is greatly affected by the memorability of its most memorable object. Lastly, we explore the effectiveness of deep learning and other computational approaches in predicting object memorability in images. Our efforts offer a deeper understanding of memorability in general thereby opening up avenues for a wide variety of applications.

103 citations


Posted ContentDOI
23 Nov 2015-bioRxiv
TL;DR: Together these data provide a first description of an electrophysiological signal for layout processing in humans, and a novel quantitative model of how spatial layout representations may emerge in the human brain.
Abstract: Human scene recognition is a rapid multistep process evolving over time from single scene image to spatial layout processing. We used multivariate pattern analyses on magnetoencephalography (MEG) data to unravel the time course of this cortical process. Following an early signal for lower-level visual analysis of single scenes at ~100ms, we found a marker of real-world scene size, i.e. spatial layout processing, at ~250ms indexing neural representations robust to changes in unrelated scene properties and viewing conditions. For a quantitative explanation that captures the complexity of scene recognition, we compared MEG data to a deep neural network model trained on scene classification. Representations of scene size emerged intrinsically in the model, and resolved emerging neural scene size representation. Together our data provide a first description of an electrophysiological signal for layout processing in humans, and a novel quantitative model of how spatial layout representations may emerge in the human brain.

25 citations


Journal ArticleDOI
TL;DR: In this issue, several papers offer improvements to image segmentation and labeling through use of region classifiers, detectors, and object and scene context.
Abstract: Scene understanding is the ability to visually analyze a scene to answer questions such as: What is happening? Why is it happening? What will happen next? What should I do? For example, in the context of driving safety, the vision system would need to recognize nearby people and vehicles, anticipate their motions, infer traffic patterns, and detect road conditions. So far, research has focused on providing complete (e.g., every pixel labeled) or holistic (reasoning about several different scene elements) interpretations, often taking into account scene geometry or 3D spatial relationships. Accordingly, in this issue, several papers offer improvements to image segmentation and labeling through use of region classifiers, detectors, and object and scene context: “Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation” (doi:10.1007/s11263-014-0777-6) by Gupta et al. addresses problems of interpreting indoor scenes from a paired RGB and depth image. The method infers whether observed contours are due to depth, normal, or albedo

17 citations


Posted Content
TL;DR: Algorithms to visualize feature spaces used by object detectors are introduced to allow for a more intuitive understanding of recognition systems and suggest that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors without improving the features.
Abstract: We introduce algorithms to visualize feature spaces used by object detectors. Our method works by inverting a visual feature back to multiple natural images. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector's failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and supports that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of recognition systems.

Journal ArticleDOI
TL;DR: CNNs are a promising formal model of human visual object recognition Combined with fMRI and MEG, they provide an integrated spatiotemporal and algorithmically explicit view of the first few hundred milliseconds of object recognition.
Abstract: The neural machinery underlying visual object recognition comprises a hierarchy of cortical regions in the ventral visual stream. The spatiotemporal dynamics of information flow in this hierarchy of regions is largely unknown. Here we tested the hypothesis that there is a correspondence between the spatiotemporal neural processes in the human brain and the layer hierarchy of a deep convolutional neural network (CNN). We presented 118 images of real-world objects to human participants (N=15) while we measured their brain activity with functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG). We trained an 8 layer (5 convolutional layers, 3 fully connected layers) CNN to predict 683 object categories with 900K training images from the ImageNet dataset. We obtained layer-specific CNN responses to the same 118 images. To compare brain-imaging data with the CNN in a common framework, we used representational similarity analysis. The key idea is that if two conditions evoke similar patterns in brain imaging data, they should also evoke similar patterns in the computer model. We thus determined 'where' (fMRI) and 'when' (MEG) the CNNs predicted brain activity. We found a correspondence in hierarchy between cortical regions, processing time, and CNN layers. Low CNN layers predicted MEG activity early and high layers relatively later; low CNN layers predicted fMRI activity in early visual regions, and high layers in late visual regions. Surprisingly, the correspondence between CNN layer hierarchy and cortical regions held for the ventral and dorsal visual stream. Results were dependent on amount of training and type of training material. Our results show that CNNs are a promising formal model of human visual object recognition. Combined with fMRI and MEG, they provide an integrated spatiotemporal and algorithmically explicit view of the first few hundred milliseconds of object recognition. Meeting abstract presented at VSS 2015.

Journal Article
TL;DR: Gupta et al. as discussed by the authors used region classifiers, detectors, and object and scene context to improve indoor scene understanding with RGB-D images from a paired RGB and depth image.
Abstract: Scene understanding is the ability to visually analyze a scene to answer questions such as: What is happening? Why is it happening? What will happen next? What should I do? For example, in the context of driving safety, the vision system would need to recognize nearby people and vehicles, anticipate their motions, infer traffic patterns, and detect road conditions. So far, research has focused on providing complete (e.g., every pixel labeled) or holistic (reasoning about several different scene elements) interpretations, often taking into account scene geometry or 3D spatial relationships. Accordingly, in this issue, several papers offer improvements to image segmentation and labeling through use of region classifiers, detectors, and object and scene context: “Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation” (doi:10.1007/s11263-014-0777-6) by Gupta et al. addresses problems of interpreting indoor scenes from a paired RGB and depth image. The method infers whether observed contours are due to depth, normal, or albedo