Showing papers by "Antonio Torralba published in 2009"


Proceedings ArticleDOI
01 Sep 2009
TL;DR: This paper collects eye tracking data from 15 viewers on 1003 images and uses this database as training and testing data to learn a model of saliency based on low-, middle-, and high-level image features.
Abstract: For many applications in graphics, design, and human computer interaction, it is essential to understand where humans look in a scene. Where eye tracking devices are not a viable option, models of saliency can be used to predict fixation locations. Most saliency approaches are based on bottom-up computation that does not consider top-down image semantics and often does not match actual eye movements. To address this problem, we collected eye tracking data from 15 viewers on 1003 images and use this database as training and testing examples to learn a model of saliency based on low-, middle-, and high-level image features. This large database of eye tracking data is publicly available with this paper.
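The learning setup lends itself to a compact illustration. Below is a minimal sketch of training a per-pixel fixation predictor from feature stacks, using randomly generated stand-in features and a logistic-regression stand-in for the paper's linear SVM; the feature count and all data here are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: each pixel gets a feature vector of low-level (color,
# orientation), mid-level (horizon), and high-level (face/person detector)
# responses. Real features would come from the paper's extraction pipeline.
rng = np.random.default_rng(0)
n_pixels, n_features = 10_000, 33   # feature count here is illustrative
X = rng.normal(size=(n_pixels, n_features))
y = rng.integers(0, 2, size=n_pixels)     # 1 = fixated by viewers, 0 = not

# The paper trains a linear SVM on samples of fixated vs. non-fixated
# pixels; logistic regression is used here purely for brevity.
model = LogisticRegression(max_iter=1000).fit(X, y)

# The learned weights turn any image's per-pixel feature stack into a
# saliency map of predicted fixation probabilities.
saliency = model.predict_proba(X)[:, 1]
```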

2,093 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: A prototype-based model that successfully combines local and global discriminative information is proposed; it significantly outperforms a state-of-the-art classifier on the indoor scene recognition task.
Abstract: Indoor scene recognition is a challenging open problem in high level vision. Most scene recognition models that work well for outdoor scenes perform poorly in the indoor domain. The main difficulty is that while some indoor scenes (e.g. corridors) can be well characterized by global spatial properties, others (e.g. bookstores) are better characterized by the objects they contain. More generally, to address the indoor scene recognition problem we need a model that can exploit local and global discriminative information. In this paper we propose a prototype-based model that can successfully combine both sources of information. To test our approach we created a dataset of 67 indoor scene categories (the largest available) covering a wide range of domains. The results show that our approach can significantly outperform a state-of-the-art classifier for the task.
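As a rough illustration of combining the two information sources, here is a toy scoring rule in numpy; the descriptors, prototypes, and the mixing weight `alpha` are all hypothetical and stand in for the paper's learned prototypes and ROI detectors.

```python
import numpy as np

def scene_score(global_desc, local_descs, class_prototypes, alpha=0.5):
    """Toy combination of global (gist-like) and local (ROI-to-prototype)
    evidence for one scene class. class_prototypes = (global prototype
    vector, list of prototype ROI descriptors); alpha is a made-up weight."""
    g_proto, roi_protos = class_prototypes
    global_sim = -np.linalg.norm(global_desc - g_proto)
    # score each image region by its closest prototype ROI, then pool
    local_sim = sum(
        max(-np.linalg.norm(l - r) for r in roi_protos) for l in local_descs
    )
    return alpha * global_sim + (1 - alpha) * local_sim
```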

1,517 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: Compared to existing object recognition approaches that require training for each object category, the proposed nonparametric scene parsing system is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.
Abstract: In this paper we propose a novel nonparametric approach for object recognition and scene parsing using dense scene alignment. Given an input image, we retrieve its best matches from a large database with annotated images using our modified, coarse-to-fine SIFT flow algorithm that aligns the structures within two images. Based on the dense scene correspondence obtained from the SIFT flow, our system warps the existing annotations, and integrates multiple cues in a Markov random field framework to segment and recognize the query image. Promising experimental results have been achieved by our nonparametric scene parsing system on a challenging database. Compared to existing object recognition approaches that require training for each object category, our system is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.
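The label-transfer step can be sketched compactly. The code below warps each retrieved neighbor's annotation map through its dense correspondence and takes a per-pixel majority vote; the real system integrates the warped labels in a Markov random field rather than voting, and all images are assumed to share the query's size for simplicity.

```python
import numpy as np

def transfer_labels(query_shape, neighbor_labels, flows):
    """neighbor_labels: per-neighbor (H, W) integer annotation maps;
    flows: per-neighbor (H, W, 2) correspondence fields from SIFT flow,
    with flow[..., 0] = dx and flow[..., 1] = dy for each query pixel."""
    H, W = query_shape
    ys, xs = np.mgrid[0:H, 0:W]
    votes = []
    for labels, flow in zip(neighbor_labels, flows):
        # where each query pixel lands in the neighbor under the flow
        x2 = np.clip((xs + flow[..., 0]).astype(int), 0, W - 1)
        y2 = np.clip((ys + flow[..., 1]).astype(int), 0, H - 1)
        votes.append(labels[y2, x2])
    votes = np.stack(votes)                  # (n_neighbors, H, W)
    # per-pixel majority vote over the warped annotations
    out = np.zeros((H, W), dtype=votes.dtype)
    best = np.zeros((H, W), dtype=int)
    for c in np.unique(votes):
        count = (votes == c).sum(axis=0)
        mask = count > best
        out[mask], best[mask] = c, count[mask]
    return out
```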

396 citations


Journal Article
TL;DR: This work puts forth a benchmark for computational models of search in real-world scenes by recording observers' eye movements as they performed a search task (person detection) in 912 outdoor scenes, finding that observers were highly consistent in the regions fixated during search.
Abstract: How predictable are human eye movements during search in real-world scenes? We recorded 14 observers' eye movements as they performed a search task (person detection) in 912 outdoor scenes. Observers were highly consistent in the regions fixated during search, even when the target was absent from the scene. These eye movements were used to evaluate computational models of search guidance from three sources: saliency, target features, and scene context. Each of these models independently outperformed a cross-image control in predicting human fixations. Models that combined sources of guidance ultimately predicted 94% of human agreement, with the scene context component providing the most explanatory power. None of the models, however, could reach the precision and fidelity of an attentional map defined by human fixations. This work puts forth a benchmark for computational models of search in real-world scenes. Further improvements in modeling should capture mechanisms underlying the selectivity of observers' fixations during search.
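One simple way to combine the three guidance sources into a single fixation map is a pointwise product of normalized maps, sketched below; the exact combination rule and weighting evaluated in the paper may differ.

```python
import numpy as np

def combine_guidance(saliency, target_features, scene_context, eps=1e-8):
    """Fuse three guidance maps (same-shape 2D arrays) into one
    predicted-fixation map via a product of min-max normalized maps."""
    maps = (saliency, target_features, scene_context)
    norm = [(m - m.min()) / (m.max() - m.min() + eps) for m in maps]
    combined = norm[0] * norm[1] * norm[2]
    return combined / (combined.sum() + eps)   # normalize to a probability map
```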

288 citations


Proceedings Article
07 Dec 2009
TL;DR: This paper uses the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images.
Abstract: With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. "Clean labels" can be manually obtained on a small fraction, "noisy labels" may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet.
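The key computational step is that, once the label function is constrained to a small eigenbasis, the semi-supervised problem collapses to a k × k linear system, which is what makes the cost linear in the number of images. Below is a minimal numpy sketch of that reduced solve, assuming the approximate eigenvectors/eigenfunctions and their eigenvalues are already computed; the label weight `gamma` is an assumed hyperparameter.

```python
import numpy as np

def ssl_in_eigenbasis(U, lam, y, labeled_mask, gamma=100.0):
    """Minimize f'Lf + gamma * sum_labeled (f_i - y_i)^2 with f = U @ a,
    where U (n, k) holds approximate Laplacian eigenvectors and lam (k,)
    their eigenvalues. Only a k x k system is solved, so the cost stays
    linear in n. y: labels (+1/-1, 0 where unlabeled)."""
    w = gamma * labeled_mask.astype(float)         # per-point label weight
    A = np.diag(lam) + U.T @ (w[:, None] * U)      # k x k system matrix
    b = U.T @ (w * y)
    a = np.linalg.solve(A, b)
    return U @ a                                   # label scores for all n points
```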

279 citations


Journal ArticleDOI
TL;DR: In this article, the authors evaluated computational models of search guidance from three sources (saliency, target features, and scene context) and found that the scene context component provided the most explanatory power.
Abstract: How predictable are human eye movements during search in real-world scenes? We recorded 14 observers' eye movements as they performed a search task (person detection) in 912 outdoor scenes. Observers were highly consistent in the regions fixated during search, even when the target was absent from the scene. These eye movements were used to evaluate computational models of search guidance from three sources: saliency, target features, and scene context. Each of these models independently outperformed a cross-image control in predicting human fixations. Models that combined sources of guidance ultimately predicted 94% of human agreement, with the scene context component providing the most explanatory power. None of the models, however, could reach the precision and fidelity of an attentional map defined by human fixations. This work puts forth a benchmark for computational models of search in real-world scenes. Further improvements in modeling should capture mechanisms underlying the selectivity of observers' fixations during search.

264 citations


Proceedings ArticleDOI
01 Sep 2009
TL;DR: An online and openly accessible video annotation system that allows anyone with a browser and internet access to efficiently annotate object category, shape, motion, and activity information in real-world videos is designed.
Abstract: Currently, video analysis algorithms suffer from a lack of information regarding the objects present and their interactions, as well as from the absence of comprehensive annotated video databases for benchmarking. We designed an online and openly accessible video annotation system that allows anyone with a browser and internet access to efficiently annotate object category, shape, motion, and activity information in real-world videos. The annotations are also complemented with knowledge from static image databases to infer occlusion and depth information. Using this system, we have built a scalable video database composed of diverse video samples and paired with human-guided annotations. We demonstrate potential uses of this database by studying motion statistics as well as cause-effect motion relationships between objects.
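One concrete piece of such a system is propagating an object's outline between annotated keyframes. The sketch below uses plain linear interpolation of polygon vertices, which is an assumption for illustration; the paper's system may use a more sophisticated propagation scheme.

```python
import numpy as np

def interpolate_polygon(poly_a, poly_b, t_a, t_b, t):
    """Linearly interpolate an object polygon between two annotated
    keyframes at times t_a < t < t_b. poly_a, poly_b: (n_points, 2)
    vertex arrays with matching point order across the two keyframes."""
    w = (t - t_a) / (t_b - t_a)
    return (1 - w) * poly_a + w * poly_b
```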

225 citations


Journal ArticleDOI
TL;DR: It is shown that very small thumbnail images at the spatial resolution of 32 × 32 color pixels provide enough information to identify the semantic category of real-world scenes and permit observers to report four to five of the objects that the scene contains, despite the fact that some of these objects are unrecognizable in isolation.
Abstract: The human visual system is remarkably tolerant to degradation in image resolution: human performance in scene categorization remains high no matter whether low-resolution images or multimegapixel images are used. This observation raises the question of how many pixels are required to form a meaningful representation of an image and identify the objects it contains. In this article, we show that very small thumbnail images at the spatial resolution of 32 × 32 color pixels provide enough information to identify the semantic category of real-world scenes. Most strikingly, this low resolution permits observers to report, with 80% accuracy, four to five of the objects that the scene contains, despite the fact that some of these objects are unrecognizable in isolation. The robustness of the information available at very low resolution for describing the semantic content of natural images could be an important asset to explain the speed and efficiency with which the human brain comprehends the gist of visual scenes.
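Constructing the 32 × 32 representation the article studies is straightforward; the sketch below reduces an image to a tiny color thumbnail and flattens it into a normalized vector. The normalization and matching step are assumptions in the spirit of tiny-image comparison, not a procedure taken from this article.

```python
import numpy as np
from PIL import Image

def tiny_descriptor(path):
    """Downsample an image to the 32 x 32 color resolution studied in the
    article and flatten it into a zero-mean, unit-norm vector."""
    img = Image.open(path).convert("RGB").resize((32, 32), Image.BILINEAR)
    v = np.asarray(img, dtype=float).ravel()
    v -= v.mean()
    return v / (np.linalg.norm(v) + 1e-8)

# Hypothetical usage: cosine similarity between two tiny images.
# sim = tiny_descriptor("a.jpg") @ tiny_descriptor("b.jpg")
```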

131 citations


Proceedings Article
07 Dec 2009
TL;DR: A fast and scalable alternating optimization technique is proposed to detect regions of interest (ROIs) in cluttered Web images without labels; its unsupervised localization performance surpasses that of a state-of-the-art technique and is comparable to supervised methods.
Abstract: This paper proposes a fast and scalable alternating optimization technique to detect regions of interest (ROIs) in cluttered Web images without labels. The proposed approach discovers highly probable regions of object instances by iteratively repeating the following two functions: (1) choose the exemplar set (i.e. a small number of highly ranked reference ROIs) across the dataset and (2) refine the ROIs of each image with respect to the exemplar set. These two subproblems are formulated as ranking in two different similarity networks of ROI hypotheses by link analysis. The experiments with the PASCAL 06 dataset show that our unsupervised localization performance is better than that of a state-of-the-art technique and comparable to supervised methods. We also test the scalability of our approach on five object classes in a Flickr dataset consisting of more than 200K images.
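The ranking step of the alternation can be illustrated with a PageRank-style power iteration over the ROI similarity network; the damping factor and iteration count below are conventional defaults, not values from the paper.

```python
import numpy as np

def rank_rois(sim, d=0.85, iters=50):
    """Stationary centrality of ROI hypotheses in a similarity network
    (rows/cols index ROIs; sim must be nonnegative with nonzero rows).
    High-scoring ROIs act as the exemplar set; each image's ROIs are then
    refined against the exemplars, and the two steps alternate."""
    n = sim.shape[0]
    P = sim / sim.sum(axis=1, keepdims=True)   # row-stochastic transitions
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P.T @ r)
    return r                                    # high score = likely object ROI
```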

115 citations


Proceedings ArticleDOI
20 Jun 2009
TL;DR: A model is described that integrates cues extracted from the object labels to infer the implicit geometric information and it is shown how it can find better scene matches for an unlabeled image by expanding the database through viewpoint interpolation to unseen views.
Abstract: In this paper, we wish to build a high quality database of images depicting scenes, along with their real-world three-dimensional (3D) coordinates. Such a database is useful for a variety of applications, including training systems for object detection and validation of 3D output. We build such a database from images that have been annotated with only the identity of objects and their spatial extent in images. Important for this task is the recovery of geometric information that is implicit in the object labels, such as qualitative relationships between objects (attachment, support, occlusion) and quantitative ones (inferring camera parameters). We describe a model that integrates cues extracted from the object labels to infer the implicit geometric information. We show that we are able to obtain high quality 3D information by evaluating the proposed approach on a database obtained with a laser range scanner. Finally, given the database of 3D scenes, we show how it can find better scene matches for an unlabeled image by expanding the database through viewpoint interpolation to unseen views.
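A representative piece of the implicit geometry is single-view depth for objects resting on the ground plane: the closer an object's ground-contact point is to the horizon line, the farther away it is. The sketch below assumes the focal length and camera height are given, whereas the paper infers such quantities jointly from the object labels.

```python
def depth_from_contact(y_contact, y_horizon, focal_px, camera_height_m):
    """Depth of a ground-supported object from single-view geometry:
    z = f * h_camera / (pixels below the horizon). Image y grows downward,
    so y_contact must lie below y_horizon (dy > 0) for a valid depth."""
    dy = y_contact - y_horizon
    if dy <= 0:
        raise ValueError("contact point must lie below the horizon")
    return focal_px * camera_height_m / dy
```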

92 citations


Journal ArticleDOI
TL;DR: The ten papers in this special section focus on applications of probabilistic graphical models in all areas of computer vision, including image classification and image segmentation.
Abstract: The ten papers in this special section focus on applications of probabilistic graphical models in all areas of computer vision.

Proceedings Article
07 Dec 2009
TL;DR: The preliminary results suggest that the HDP-2DHMM is generally useful for further applications in low-level vision problems; it yields a compact representation of textures which allows fast texture synthesis with rendering quality comparable to state-of-the-art patch-based rendering methods.
Abstract: We present a nonparametric Bayesian method for texture learning and synthesis. A texture image is represented by a 2D Hidden Markov Model (2DHMM) where the hidden states correspond to the cluster labeling of textons and the transition matrix encodes their spatial layout (the compatibility between adjacent textons). The 2DHMM is coupled with the Hierarchical Dirichlet Process (HDP), which allows the number of textons and the complexity of the transition matrix to grow as the input texture becomes irregular. The HDP makes use of a Dirichlet process prior, which favors regular textures by penalizing model complexity. This framework (HDP-2DHMM) learns the texton vocabulary and their spatial layout jointly and automatically. The HDP-2DHMM results in a compact representation of textures which allows fast texture synthesis with rendering quality comparable to state-of-the-art patch-based rendering methods. We also show that the HDP-2DHMM can be applied to perform image segmentation and synthesis. The preliminary results suggest that the HDP-2DHMM is generally useful for further applications in low-level vision problems.
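A stripped-down, fixed-size analogue of the generative model makes the sampling procedure concrete: given horizontal and vertical texton transition matrices, label a grid cell by cell. The HDP machinery that grows the number of textons is omitted, and the transition matrices here are assumed inputs rather than quantities learned by the paper's inference.

```python
import numpy as np

def synthesize_labels(H, W, T_right, T_down, rng=None):
    """Sample an (H, W) grid of texton labels from a simplified 2D Markov
    model. T_right[i, j] = P(right neighbor = j | current = i); T_down is
    the analogous vertical matrix. Rendering would paste each texton's
    image patch at its grid cell."""
    rng = rng or np.random.default_rng()
    K = T_right.shape[0]
    grid = np.zeros((H, W), dtype=int)
    grid[0, 0] = rng.integers(K)
    for y in range(H):
        for x in range(W):
            if y == 0 and x == 0:
                continue
            p = np.ones(K)
            if x > 0:
                p *= T_right[grid[y, x - 1]]   # compatibility with left texton
            if y > 0:
                p *= T_down[grid[y - 1, x]]    # compatibility with top texton
            grid[y, x] = rng.choice(K, p=p / p.sum())
    return grid
```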