Author

Antonio Torralba

Other affiliations: Vassar College, Nvidia, Carleton College
Bio: Antonio Torralba is an academic researcher from Massachusetts Institute of Technology. The author has contributed to research in topics: Computer science & Object detection. The author has an h-index of 119 and has co-authored 388 publications receiving 84,607 citations. Previous affiliations of Antonio Torralba include Vassar College & Nvidia.


Papers
Proceedings Article
01 Jul 2020
TL;DR: In this paper, the authors consider the task of causal discovery from videos in an end-to-end fashion without supervision on the ground-truth graph structure and propose a model consisting of a perception module that extracts a semantically meaningful and temporally consistent keypoint representation from images, an inference module for determining the graph distribution induced by the detected keypoints, and a dynamics module that can predict the future by conditioning on the inferred graph.
Abstract: Causal discovery is at the core of human cognition. It enables us to reason about the environment and make counterfactual predictions about unseen scenarios that can vastly differ from our previous experiences. We consider the task of causal discovery from videos in an end-to-end fashion without supervision on the ground-truth graph structure. In particular, our goal is to discover the structural dependencies among environmental and object variables: inferring the type and strength of interactions that have a causal effect on the behavior of the dynamical system. Our model consists of (a) a perception module that extracts a semantically meaningful and temporally consistent keypoint representation from images, (b) an inference module for determining the graph distribution induced by the detected keypoints, and (c) a dynamics module that can predict the future by conditioning on the inferred graph. We assume access to different configurations and environmental conditions, i.e., data from unknown interventions on the underlying system; thus, we can hope to discover the correct underlying causal graph without explicit interventions. We evaluate our method in a planar multi-body interaction environment and scenarios involving fabrics of different shapes like shirts and pants. Experiments demonstrate that our model can correctly identify the interactions from a short sequence of images and make long-term future predictions. The causal structure assumed by the model also allows it to make counterfactual predictions and extrapolate to systems of unseen interaction graphs or graphs of various sizes.
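The abstract above outlines a three-part design. A minimal PyTorch sketch of such a pipeline is given below; the layer sizes, the spatial-softmax keypoint extractor, and the Gumbel-softmax edge sampling are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code) of the three-module pipeline:
# perception -> keypoints, inference -> graph distribution, dynamics -> rollout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Perception(nn.Module):
    """Extract K 2-D keypoints from an image via a spatial softmax (assumed design)."""
    def __init__(self, num_keypoints=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, num_keypoints, 5, stride=2, padding=2),
        )

    def forward(self, img):                     # img: (B, 3, H, W)
        heat = self.conv(img)                   # (B, K, h, w)
        B, K, h, w = heat.shape
        probs = F.softmax(heat.view(B, K, -1), dim=-1).view(B, K, h, w)
        ys = torch.linspace(-1, 1, h, device=img.device)
        xs = torch.linspace(-1, 1, w, device=img.device)
        y = (probs.sum(3) * ys).sum(-1)         # expected y-coordinate per keypoint
        x = (probs.sum(2) * xs).sum(-1)         # expected x-coordinate per keypoint
        return torch.stack([x, y], dim=-1)      # (B, K, 2)

class GraphInference(nn.Module):
    """Predict a distribution over pairwise edge types from detected keypoints."""
    def __init__(self, edge_types=2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, edge_types))

    def forward(self, kp):                      # kp: (B, K, 2)
        send = kp.unsqueeze(2).expand(-1, -1, kp.size(1), -1)
        recv = kp.unsqueeze(1).expand(-1, kp.size(1), -1, -1)
        logits = self.mlp(torch.cat([send, recv], dim=-1))    # (B, K, K, E)
        return F.gumbel_softmax(logits, tau=0.5, hard=True)   # sampled graph

class Dynamics(nn.Module):
    """Predict the next keypoint positions conditioned on the sampled graph."""
    def __init__(self, edge_types=2):
        super().__init__()
        self.edge_mlp = nn.Linear(4 + edge_types, 32)
        self.node_mlp = nn.Sequential(nn.Linear(2 + 32, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, kp, graph):               # kp: (B, K, 2), graph: (B, K, K, E)
        send = kp.unsqueeze(2).expand(-1, -1, kp.size(1), -1)
        recv = kp.unsqueeze(1).expand(-1, kp.size(1), -1, -1)
        msg = torch.relu(self.edge_mlp(torch.cat([send, recv, graph], dim=-1)))
        agg = msg.sum(dim=2)                    # aggregate incoming messages per node
        return kp + self.node_mlp(torch.cat([kp, agg], dim=-1))   # residual update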

61 citations

Posted Content
TL;DR: This paper proposes an approach for measuring the training value of an example, and uses it for ranking and greedily sorting examples, and shows that the performance of current state-of-the-art detectors and classifiers can be improved when training on a subset, rather than the whole training set.
Abstract: When learning a new concept, not all training examples may prove equally useful for training: some may have higher or lower training value than others. The goal of this paper is to bring to the attention of the vision community the following considerations: (1) some examples are better than others for training detectors or classifiers, and (2) in the presence of better examples, some examples may negatively impact performance and removing them may be beneficial. In this paper, we propose an approach for measuring the training value of an example, and use it for ranking and greedily sorting examples. We test our methods on different vision tasks, models, datasets and classifiers. Our experiments show that the performance of current state-of-the-art detectors and classifiers can be improved when training on a subset, rather than the whole training set.
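A toy sketch of the idea is below: score each example by how much held-out accuracy changes when that example is removed, then greedily keep the highest-value fraction. The leave-one-out scoring rule and the logistic-regression classifier are illustrative assumptions, not the paper's exact procedure.

# Toy sketch (assumed scoring rule, not the paper's method): rank training
# examples by a leave-one-out validation score and greedily keep the best subset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def training_value(X_train, y_train, X_val, y_val):
    """Score of example i = drop in validation accuracy when example i is removed."""
    base = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    base_acc = accuracy_score(y_val, base.predict(X_val))
    values = np.zeros(len(X_train))
    for i in range(len(X_train)):
        mask = np.arange(len(X_train)) != i
        clf = LogisticRegression(max_iter=1000).fit(X_train[mask], y_train[mask])
        values[i] = base_acc - accuracy_score(y_val, clf.predict(X_val))
    return values            # positive = example helps; negative = example hurts

def greedy_subset(X_train, y_train, X_val, y_val, keep_fraction=0.8):
    """Keep only the highest-value fraction of the training set (numpy arrays)."""
    values = training_value(X_train, y_train, X_val, y_val)
    order = np.argsort(-values)                     # best examples first
    keep = order[: int(keep_fraction * len(order))]
    return X_train[keep], y_train[keep]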

60 citations

Journal ArticleDOI
TL;DR: In this paper, the authors demonstrate that the presence of natural occluders in the hidden scene can obviate the need for collecting time-resolved measurements, and develop an accompanying analysis for such systems and their generalizations.
Abstract: Active non-line-of-sight imaging systems are of growing interest for diverse applications. The most commonly proposed approaches to date rely on exploiting time-resolved measurements, i.e., measuring the time it takes for short-duration light pulses to transit the scene. This typically requires expensive, specialized, ultrafast lasers, and detectors that must be carefully calibrated. We develop an alternative approach that exploits the valuable role that natural occluders in a scene play in enabling accurate and practical image formation in such settings without such hardware complexity. In particular, we demonstrate that the presence of occluders in the hidden scene can obviate the need for collecting time-resolved measurements, and develop an accompanying analysis for such systems and their generalizations. Ultimately, the results suggest the potential to develop increasingly sophisticated future systems that are able to identify and exploit diverse structural features of the environment to reconstruct scenes hidden from view.
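One way to see why occluders help is to pose the task as a linear inverse problem: the photographed wall intensities y relate to the hidden scene x through a light-transport matrix A, and occluders zero out entries of A in a structured way that makes the system invertible from steady-state measurements. The toy numerical sketch below illustrates this with a random transport matrix and Tikhonov-regularized least squares; the geometry and noise model are assumptions, not the paper's formulation.

# Toy sketch: occluder-aware NLOS imaging as a regularized linear inverse problem
# on steady-state (non-time-resolved) intensity measurements. All quantities are
# synthetic; a real transport matrix would come from scene/occluder geometry.
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_pixels = 64, 256              # hidden-scene patches, observed wall pixels

# Occluders block some scene-to-pixel light paths; the resulting structure in A
# is what makes its columns distinguishable and the inversion well-posed.
A = rng.uniform(0.5, 1.0, size=(n_pixels, n_hidden))
A *= rng.random((n_pixels, n_hidden)) > 0.4        # zero out blocked paths

x_true = rng.random(n_hidden)                              # hidden scene radiance
y = A @ x_true + 0.01 * rng.standard_normal(n_pixels)      # noisy wall photograph

# Tikhonov-regularized least squares: x_hat = argmin ||A x - y||^2 + lam ||x||^2
lam = 1e-2
x_hat = np.linalg.solve(A.T @ A + lam * np.eye(n_hidden), A.T @ y)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))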

59 citations

Proceedings ArticleDOI
01 Jun 2019
TL;DR: In this article, a new conditional adversarial model was proposed to synthesize plausible tactile signals from visual inputs as well as imagine how humans interact with objects given tactile data as input.
Abstract: Humans perceive the world using multi-modal sensory inputs such as vision, audition, and touch. In this work, we investigate the cross-modal connection between vision and touch. The main challenge in this cross-domain modeling task lies in the significant scale discrepancy between the two: while our eyes perceive an entire visual scene at once, humans can only feel a small region of an object at any given moment. To connect vision and touch, we introduce new tasks of synthesizing plausible tactile signals from visual inputs as well as imagining how we interact with objects given tactile data as input. To accomplish our goals, we first equip robots with both visual and tactile sensors and collect a large-scale dataset of corresponding vision and tactile image sequences. To close the scale gap, we present a new conditional adversarial model that incorporates the scale and location information of the touch. Human perceptual studies demonstrate that our model can produce realistic visual images from tactile data and vice versa. Finally, we present both qualitative and quantitative experimental results regarding different system designs, as well as visualizing the learned representations of our model.
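A minimal sketch of what a conditional adversarial pair of this kind might look like is given below; the network shapes and the single-channel mask encoding the touch scale and location are assumptions for illustration, not the paper's architecture.

# Sketch (assumed shapes, not the paper's model): the generator maps an RGB image
# plus a touch scale/location mask to a tactile image; the discriminator judges
# (condition, tactile) pairs, PatchGAN-style.
import torch
import torch.nn as nn

class TouchGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(            # input: 3 RGB channels + 1 touch-mask channel
            nn.Conv2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, rgb, touch_mask):      # (B, 3, H, W), (B, 1, H, W)
        return self.net(torch.cat([rgb, touch_mask], dim=1))

class TouchDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(            # judges whether tactile matches the condition
            nn.Conv2d(4 + 3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb, touch_mask, tactile):
        return self.net(torch.cat([rgb, touch_mask, tactile], dim=1))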

58 citations

Proceedings ArticleDOI
03 Apr 2017
TL;DR: This work uses data for 1.9 million images from Instagram from the US to look at systematic differences in how a machine would objectively label an image compared to how a human subjectively does, and shows that this difference, which it calls the "perception gap", relates to a number of health outcomes observed at the county level.
Abstract: Food is an integral part of our life and what and how much we eat crucially affects our health. Our food choices largely depend on how we perceive certain characteristics of food, such as whether it is healthy, delicious or if it qualifies as a salad. But these perceptions differ from person to person and one person's "single lettuce leaf" might be another person's "side salad". Studying how food is perceived in relation to what it actually is typically involves a laboratory setup. Here we propose to use recent advances in image recognition to tackle this problem. Concretely, we use data for 1.9 million images from Instagram from the US to look at systematic differences in how a machine would objectively label an image compared to how a human subjectively does. We show that this difference, which we call the "perception gap", relates to a number of health outcomes observed at the county level. To the best of our knowledge, this is the first time that image recognition is being used to study the "misalignment" of how people describe food images vs. what they actually depict.
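A toy sketch of the "perception gap" computation is below: per county, the human-label rate minus the machine-label rate for a concept such as "healthy", correlated with a county-level outcome. The field names and tiny data frame are hypothetical; the study used 1.9 million real Instagram posts.

# Toy sketch with hypothetical fields: per-county perception gap and its
# correlation with a county-level health outcome.
import pandas as pd

# One row per image: county, whether the user tagged it #healthy, and whether
# the image classifier labels it as healthy.
posts = pd.DataFrame({
    "county":          ["A", "A", "B", "B", "C", "C"],
    "human_healthy":   [1,    1,   0,   1,   0,   0],
    "machine_healthy": [1,    0,   1,   1,   0,   1],
})
outcomes = pd.Series({"A": 0.21, "B": 0.30, "C": 0.34}, name="obesity_rate")

by_county = posts.groupby("county")[["human_healthy", "machine_healthy"]].mean()
by_county["perception_gap"] = by_county["human_healthy"] - by_county["machine_healthy"]

print(by_county)
print("correlation with outcome:", by_county["perception_gap"].corr(outcomes))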

56 citations


Cited by
Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification, as discussed by the authors.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
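For reference, a compact PyTorch rendering of the architecture described in the abstract (five convolutional layers, max-pooling after some of them, three fully-connected layers with dropout, and a 1000-way classifier) is sketched below; the kernel sizes and channel counts follow common AlexNet implementations and are illustrative rather than an exact reproduction.

# Sketch of an AlexNet-style network for 227x227 RGB inputs. ReLU plays the role
# of the "non-saturating neurons" mentioned in the abstract; dropout is applied
# in the fully-connected layers.
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),          # final 1000-way classifier (softmax applied in the loss)
)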

73,978 citations

Proceedings ArticleDOI
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
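The organizing idea, images attached to WordNet-style synsets arranged in an is-a hierarchy, can be illustrated with a toy data structure; the synsets and file names below are made up, not ImageNet data.

# Toy sketch: synsets hold images and more specific child synsets, so every
# subtree aggregates the images of its descendants.
from dataclasses import dataclass, field

@dataclass
class Synset:
    name: str
    images: list = field(default_factory=list)     # image files for this synset
    children: list = field(default_factory=list)   # more specific synsets

    def subtree_image_count(self) -> int:
        return len(self.images) + sum(c.subtree_image_count() for c in self.children)

husky  = Synset("husky",  images=["husky_001.jpg", "husky_002.jpg"])
beagle = Synset("beagle", images=["beagle_001.jpg"])
dog    = Synset("dog",    children=[husky, beagle])
animal = Synset("animal", children=[dog])

print(animal.subtree_image_count())   # 3: the "animal" subtree aggregates all dog images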

49,639 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
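The multi-scale idea behind the architecture can be sketched as a single Inception-style block: parallel 1x1, 3x3 and 5x5 convolutions plus a pooling branch, concatenated along the channel dimension, with 1x1 bottleneck convolutions keeping the computational budget in check. The channel counts below are illustrative, not GoogLeNet's exact configuration.

# Sketch of one Inception-style block (illustrative channel counts).
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(),
                                nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU())
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(),
                                nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU())
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        # All branches preserve spatial size, so outputs concatenate along channels.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

block = InceptionBlock(192, c1=64, c3_red=96, c3=128, c5_red=16, c5=32, pool_proj=32)
out = block(torch.randn(1, 192, 28, 28))     # -> (1, 256, 28, 28)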

40,257 citations

Book
18 Nov 2016
TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI
TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully connected layers we employed a recently developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
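The "dropout" regularization highlighted in this abstract can be sketched in a few lines: during training each unit is zeroed with probability p and the survivors are rescaled (the inverted-dropout form), which discourages co-adaptation of units. This is a generic illustration, not the paper's code.

# Sketch of inverted dropout on a batch of activations.
import torch

def dropout(x: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    if not training or p == 0.0:
        return x
    keep = (torch.rand_like(x) > p).float()   # Bernoulli mask of surviving units
    return x * keep / (1.0 - p)               # rescale so expectations match test time

h = torch.randn(4, 8)
print(dropout(h, p=0.5))     # roughly half the activations zeroed, the rest scaled by 2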

33,301 citations