Showing papers by "Antonio Torralba published in 2009"


Proceedings ArticleDOI
01 Sep 2009
TL;DR: This paper collects eye tracking data from 15 viewers on 1003 images and uses this database as training and testing data to learn a model of saliency based on low-, middle-, and high-level image features.
Abstract: For many applications in graphics, design, and human computer interaction, it is essential to understand where humans look in a scene. Where eye tracking devices are not a viable option, models of saliency can be used to predict fixation locations. Most saliency approaches are based on bottom-up computation that does not consider top-down image semantics and often does not match actual eye movements. To address this problem, we collected eye tracking data from 15 viewers on 1003 images and use this database as training and testing examples to learn a model of saliency based on low-, middle-, and high-level image features. This large database of eye tracking data is publicly available with this paper.
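The learning setup lends itself to a compact illustration. Below is a minimal sketch of training a per-pixel fixation predictor from feature stacks, using randomly generated stand-in features and a logistic-regression stand-in for the paper's linear SVM; the feature count and all data here are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: each pixel gets a feature vector of low-level (color,
# orientation), mid-level (horizon), and high-level (face/person detector)
# responses. Real features would come from the paper's extraction pipeline.
rng = np.random.default_rng(0)
n_pixels, n_features = 10_000, 33   # feature count here is illustrative
X = rng.normal(size=(n_pixels, n_features))
y = rng.integers(0, 2, size=n_pixels)     # 1 = fixated by viewers, 0 = not

# The paper trains a linear SVM on samples of fixated vs. non-fixated
# pixels; logistic regression is used here purely for brevity.
model = LogisticRegression(max_iter=1000).fit(X, y)

# The learned weights turn any image's per-pixel feature stack into a
# saliency map of predicted fixation probabilities.
saliency = model.predict_proba(X)[:, 1]
```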

2,093 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: A prototype-based model that successfully combines local and global discriminative information is proposed; it significantly outperforms a state-of-the-art classifier on the indoor scene recognition task.
Abstract: Indoor scene recognition is a challenging open problem in high level vision. Most scene recognition models that work well for outdoor scenes perform poorly in the indoor domain. The main difficulty is that while some indoor scenes (e.g. corridors) can be well characterized by global spatial properties, others (e.g. bookstores) are better characterized by the objects they contain. More generally, to address the indoor scene recognition problem we need a model that can exploit local and global discriminative information. In this paper we propose a prototype-based model that can successfully combine both sources of information. To test our approach we created a dataset of 67 indoor scene categories (the largest available) covering a wide range of domains. The results show that our approach can significantly outperform a state-of-the-art classifier for the task.
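As a rough illustration of combining the two information sources, here is a toy scoring rule in numpy; the descriptors, prototypes, and the mixing weight `alpha` are all hypothetical and stand in for the paper's learned prototypes and ROI detectors.

```python
import numpy as np

def scene_score(global_desc, local_descs, class_prototypes, alpha=0.5):
    """Toy combination of global (gist-like) and local (ROI-to-prototype)
    evidence for one scene class. class_prototypes = (global prototype
    vector, list of prototype ROI descriptors); alpha is a made-up weight."""
    g_proto, roi_protos = class_prototypes
    global_sim = -np.linalg.norm(global_desc - g_proto)
    # score each image region by its closest prototype ROI, then pool
    local_sim = sum(
        max(-np.linalg.norm(l - r) for r in roi_protos) for l in local_descs
    )
    return alpha * global_sim + (1 - alpha) * local_sim
```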

1,517 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: Compared to existing object recognition approaches that require training for each object category, the proposed nonparametric scene parsing system is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.
Abstract: In this paper we propose a novel nonparametric approach for object recognition and scene parsing using dense scene alignment. Given an input image, we retrieve its best matches from a large database with annotated images using our modified, coarse-to-fine SIFT flow algorithm that aligns the structures within two images. Based on the dense scene correspondence obtained from the SIFT flow, our system warps the existing annotations, and integrates multiple cues in a Markov random field framework to segment and recognize the query image. Promising experimental results have been achieved by our nonparametric scene parsing system on a challenging database. Compared to existing object recognition approaches that require training for each object category, our system is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.
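The label-transfer step can be sketched compactly. The code below warps each retrieved neighbor's annotation map through its dense correspondence and takes a per-pixel majority vote; the real system integrates the warped labels in a Markov random field rather than voting, and all images are assumed to share the query's size for simplicity.

```python
import numpy as np

def transfer_labels(query_shape, neighbor_labels, flows):
    """neighbor_labels: per-neighbor (H, W) integer annotation maps;
    flows: per-neighbor (H, W, 2) correspondence fields from SIFT flow,
    with flow[..., 0] = dx and flow[..., 1] = dy for each query pixel."""
    H, W = query_shape
    ys, xs = np.mgrid[0:H, 0:W]
    votes = []
    for labels, flow in zip(neighbor_labels, flows):
        # where each query pixel lands in the neighbor under the flow
        x2 = np.clip((xs + flow[..., 0]).astype(int), 0, W - 1)
        y2 = np.clip((ys + flow[..., 1]).astype(int), 0, H - 1)
        votes.append(labels[y2, x2])
    votes = np.stack(votes)                  # (n_neighbors, H, W)
    # per-pixel majority vote over the warped annotations
    out = np.zeros((H, W), dtype=votes.dtype)
    best = np.zeros((H, W), dtype=int)
    for c in np.unique(votes):
        count = (votes == c).sum(axis=0)
        mask = count > best
        out[mask], best[mask] = c, count[mask]
    return out
```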

396 citations


Journal Article
TL;DR: This work puts forth a benchmark for computational models of search in real-world scenes by recording observers' eye movements as they performed a search task (person detection) in 912 outdoor scenes, finding that observers were highly consistent in the regions fixated during search.
Abstract: How predictable are human eye movements during search in real-world scenes? We recorded 14 observers' eye movements as they performed a search task (person detection) in 912 outdoor scenes. Observers were highly consistent in the regions fixated during search, even when the target was absent from the scene. These eye movements were used to evaluate computational models of search guidance from three sources: saliency, target features, and scene context. Each of these models independently outperformed a cross-image control in predicting human fixations. Models that combined sources of guidance ultimately predicted 94% of human agreement, with the scene context component providing the most explanatory power. None of the models, however, could reach the precision and fidelity of an attentional map defined by human fixations. This work puts forth a benchmark for computational models of search in real-world scenes. Further improvements in modeling should capture mechanisms underlying the selectivity of observers' fixations during search.
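One simple way to combine the three guidance sources into a single fixation map is a pointwise product of normalized maps, sketched below; the exact combination rule and weighting evaluated in the paper may differ.

```python
import numpy as np

def combine_guidance(saliency, target_features, scene_context, eps=1e-8):
    """Fuse three guidance maps (same-shape 2D arrays) into one
    predicted-fixation map via a product of min-max normalized maps."""
    maps = (saliency, target_features, scene_context)
    norm = [(m - m.min()) / (m.max() - m.min() + eps) for m in maps]
    combined = norm[0] * norm[1] * norm[2]
    return combined / (combined.sum() + eps)   # normalize to a probability map
```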

288 citations


Proceedings Article
07 Dec 2009
TL;DR: This paper uses the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images.
Abstract: With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. "Clean labels" can be manually obtained on a small fraction, "noisy labels" may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet.
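The key computational step is that, once the label function is constrained to a small eigenbasis, the semi-supervised problem collapses to a k × k linear system, which is what makes the cost linear in the number of images. Below is a minimal numpy sketch of that reduced solve, assuming the approximate eigenvectors/eigenfunctions and their eigenvalues are already computed; the label weight `gamma` is an assumed hyperparameter.

```python
import numpy as np

def ssl_in_eigenbasis(U, lam, y, labeled_mask, gamma=100.0):
    """Minimize f'Lf + gamma * sum_labeled (f_i - y_i)^2 with f = U @ a,
    where U (n, k) holds approximate Laplacian eigenvectors and lam (k,)
    their eigenvalues. Only a k x k system is solved, so the cost stays
    linear in n. y: labels (+1/-1, 0 where unlabeled)."""
    w = gamma * labeled_mask.astype(float)         # per-point label weight
    A = np.diag(lam) + U.T @ (w[:, None] * U)      # k x k system matrix
    b = U.T @ (w * y)
    a = np.linalg.solve(A, b)
    return U @ a                                   # label scores for all n points
```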

279 citations


Journal ArticleDOI
TL;DR: In this article, the authors evaluated computational models of search guidance from three sources (saliency, target features, and scene context) and found that the scene context component provided the most explanatory power.
Abstract: How predictable are human eye movements during search in real-world scenes? We recorded 14 observers' eye movements as they performed a search task (person detection) in 912 outdoor scenes. Observers were highly consistent in the regions fixated during search, even when the target was absent from the scene. These eye movements were used to evaluate computational models of search guidance from three sources: saliency, target features, and scene context. Each of these models independently outperformed a cross-image control in predicting human fixations. Models that combined sources of guidance ultimately predicted 94% of human agreement, with the scene context component providing the most explanatory power. None of the models, however, could reach the precision and fidelity of an attentional map defined by human fixations. This work puts forth a benchmark for computational models of search in real-world scenes. Further improvements in modeling should capture mechanisms underlying the selectivity of observers' fixations during search.

264 citations


Proceedings ArticleDOI
01 Sep 2009
TL;DR: An online and openly accessible video annotation system that allows anyone with a browser and internet access to efficiently annotate object category, shape, motion, and activity information in real-world videos is designed.
Abstract: Currently, video analysis algorithms suffer from a lack of information regarding the objects present and their interactions, as well as from the absence of comprehensive annotated video databases for benchmarking. We designed an online and openly accessible video annotation system that allows anyone with a browser and internet access to efficiently annotate object category, shape, motion, and activity information in real-world videos. The annotations are also complemented with knowledge from static image databases to infer occlusion and depth information. Using this system, we have built a scalable video database composed of diverse video samples and paired with human-guided annotations. We demonstrate potential uses of this database by studying motion statistics as well as cause-effect motion relationships between objects.
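One concrete piece of such a system is propagating an object's outline between annotated keyframes. The sketch below uses plain linear interpolation of polygon vertices, which is an assumption for illustration; the paper's system may use a more sophisticated propagation scheme.

```python
import numpy as np

def interpolate_polygon(poly_a, poly_b, t_a, t_b, t):
    """Linearly interpolate an object polygon between two annotated
    keyframes at times t_a < t < t_b. poly_a, poly_b: (n_points, 2)
    vertex arrays with matching point order across the two keyframes."""
    w = (t - t_a) / (t_b - t_a)
    return (1 - w) * poly_a + w * poly_b
```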

225 citations


Journal ArticleDOI
TL;DR: It is shown that very small thumbnail images at the spatial resolution of 32 × 32 color pixels provide enough information to identify the semantic category of real-world scenes and permit observers to report four to five of the objects that the scene contains, despite the fact that some of these objects are unrecognizable in isolation.
Abstract: The human visual system is remarkably tolerant to degradation in image resolution: human performance in scene categorization remains high no matter whether low-resolution images or multimegapixel images are used. This observation raises the question of how many pixels are required to form a meaningful representation of an image and identify the objects it contains. In this article, we show that very small thumbnail images at the spatial resolution of 32 × 32 color pixels provide enough information to identify the semantic category of real-world scenes. Most strikingly, this low resolution permits observers to report, with 80% accuracy, four to five of the objects that the scene contains, despite the fact that some of these objects are unrecognizable in isolation. The robustness of the information available at very low resolution for describing the semantic content of natural images could be an important asset to explain the speed and efficiency with which the human brain comprehends the gist of visual scenes.
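Constructing the 32 × 32 representation the article studies is straightforward; the sketch below reduces an image to a tiny color thumbnail and flattens it into a normalized vector. The normalization and matching step are assumptions in the spirit of tiny-image comparison, not a procedure taken from this article.

```python
import numpy as np
from PIL import Image

def tiny_descriptor(path):
    """Downsample an image to the 32 x 32 color resolution studied in the
    article and flatten it into a zero-mean, unit-norm vector."""
    img = Image.open(path).convert("RGB").resize((32, 32), Image.BILINEAR)
    v = np.asarray(img, dtype=float).ravel()
    v -= v.mean()
    return v / (np.linalg.norm(v) + 1e-8)

# Hypothetical usage: cosine similarity between two tiny images.
# sim = tiny_descriptor("a.jpg") @ tiny_descriptor("b.jpg")
```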

131 citations


Proceedings Article
07 Dec 2009
TL;DR: A fast and scalable alternating optimization technique is proposed to detect regions of interest (ROIs) in cluttered Web images without labels; its unsupervised localization performance surpasses that of a state-of-the-art technique and is comparable to supervised methods.
Abstract: This paper proposes a fast and scalable alternating optimization technique to detect regions of interest (ROIs) in cluttered Web images without labels. The proposed approach discovers highly probable regions of object instances by iteratively repeating the following two functions: (1) choose the exemplar set (i.e. a small number of highly ranked reference ROIs) across the dataset and (2) refine the ROIs of each image with respect to the exemplar set. These two subproblems are formulated as ranking in two different similarity networks of ROI hypotheses by link analysis. The experiments with the PASCAL 06 dataset show that our unsupervised localization performance is better than that of a state-of-the-art technique and comparable to supervised methods. We also test the scalability of our approach on five object classes in a Flickr dataset consisting of more than 200K images.
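The ranking step of the alternation can be illustrated with a PageRank-style power iteration over the ROI similarity network; the damping factor and iteration count below are conventional defaults, not values from the paper.

```python
import numpy as np

def rank_rois(sim, d=0.85, iters=50):
    """Stationary centrality of ROI hypotheses in a similarity network
    (rows/cols index ROIs; sim must be nonnegative with nonzero rows).
    High-scoring ROIs act as the exemplar set; each image's ROIs are then
    refined against the exemplars, and the two steps alternate."""
    n = sim.shape[0]
    P = sim / sim.sum(axis=1, keepdims=True)   # row-stochastic transitions
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P.T @ r)
    return r                                    # high score = likely object ROI
```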

115 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: A model is described that integrates cues extracted from the object labels to infer the implicit geometric information and it is shown how it can find better scene matches for an unlabeled image by expanding the database through viewpoint interpolation to unseen views.
Abstract: In this paper, we wish to build a high quality database of images depicting scenes, along with their real-world three-dimensional (3D) coordinates. Such a database is useful for a variety of applications, including training systems for object detection and validation of 3D output. We build such a database from images that have been annotated with only the identity of objects and their spatial extent in images. Important for this task is the recovery of geometric information that is implicit in the object labels, such as qualitative relationships between objects (attachment, support, occlusion) and quantitative ones (inferring camera parameters). We describe a model that integrates cues extracted from the object labels to infer the implicit geometric information. We show that we are able to obtain high quality 3D information by evaluating the proposed approach on a database obtained with a laser range scanner. Finally, given the database of 3D scenes, we show how it can find better scene matches for an unlabeled image by expanding the database through viewpoint interpolation to unseen views.
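A representative piece of the implicit geometry is single-view depth for objects resting on the ground plane: the closer an object's ground-contact point is to the horizon line, the farther away it is. The sketch below assumes the focal length and camera height are given, whereas the paper infers such quantities jointly from the object labels.

```python
def depth_from_contact(y_contact, y_horizon, focal_px, camera_height_m):
    """Depth of a ground-supported object from single-view geometry:
    z = f * h_camera / (pixels below the horizon). Image y grows downward,
    so y_contact must lie below y_horizon (dy > 0) for a valid depth."""
    dy = y_contact - y_horizon
    if dy <= 0:
        raise ValueError("contact point must lie below the horizon")
    return focal_px * camera_height_m / dy
```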

92 citations


Journal ArticleDOI
TL;DR: The ten papers in this special section focus on applications of probabilistic graphical models in all areas of computer vision, including image classification and image segmentation.
Abstract: The ten papers in this special section focus on applications of probabilistic graphical models in all areas of computer vision.

Proceedings Article
07 Dec 2009
TL;DR: The preliminary results suggest that the HDP-2DHMM is generally useful for further applications in low-level vision problems; it yields a compact representation of textures which allows fast texture synthesis with rendering quality comparable to state-of-the-art patch-based rendering methods.
Abstract: We present a nonparametric Bayesian method for texture learning and synthesis. A texture image is represented by a 2D Hidden Markov Model (2DHMM) where the hidden states correspond to the cluster labeling of textons and the transition matrix encodes their spatial layout (the compatibility between adjacent textons). The 2DHMM is coupled with the Hierarchical Dirichlet Process (HDP), which allows the number of textons and the complexity of the transition matrix to grow as the input texture becomes irregular. The HDP makes use of a Dirichlet process prior, which favors regular textures by penalizing model complexity. This framework (HDP-2DHMM) learns the texton vocabulary and their spatial layout jointly and automatically. The HDP-2DHMM results in a compact representation of textures which allows fast texture synthesis with rendering quality comparable to state-of-the-art patch-based rendering methods. We also show that the HDP-2DHMM can be applied to perform image segmentation and synthesis. The preliminary results suggest that the HDP-2DHMM is generally useful for further applications in low-level vision problems.
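A stripped-down, fixed-size analogue of the generative model makes the sampling procedure concrete: given horizontal and vertical texton transition matrices, label a grid cell by cell. The HDP machinery that grows the number of textons is omitted, and the transition matrices here are assumed inputs rather than quantities learned by the paper's inference.

```python
import numpy as np

def synthesize_labels(H, W, T_right, T_down, rng=None):
    """Sample an (H, W) grid of texton labels from a simplified 2D Markov
    model. T_right[i, j] = P(right neighbor = j | current = i); T_down is
    the analogous vertical matrix. Rendering would paste each texton's
    image patch at its grid cell."""
    rng = rng or np.random.default_rng()
    K = T_right.shape[0]
    grid = np.zeros((H, W), dtype=int)
    grid[0, 0] = rng.integers(K)
    for y in range(H):
        for x in range(W):
            if y == 0 and x == 0:
                continue
            p = np.ones(K)
            if x > 0:
                p *= T_right[grid[y, x - 1]]   # compatibility with left texton
            if y > 0:
                p *= T_down[grid[y - 1, x]]    # compatibility with top texton
            grid[y, x] = rng.choice(K, p=p / p.sum())
    return grid
```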