
Showing papers by "Antonio Torralba published in 2011"


Proceedings ArticleDOI
20 Jun 2011
TL;DR: A comparison study of a set of popular datasets is presented, evaluating them on a number of criteria including relative data bias, cross-dataset generalization, effects of the closed-world assumption, and sample value.
Abstract: Datasets are an integral part of contemporary object recognition research. They have been the chief reason for the considerable progress in the field, not just as a source of large amounts of training data, but also as a means of measuring and comparing the performance of competing algorithms. At the same time, datasets have often been blamed for narrowing the focus of object recognition research, reducing it to a single benchmark performance number. Indeed, some datasets that started out as data capture efforts aimed at representing the visual world have become closed worlds unto themselves (e.g. the Corel world, the Caltech-101 world, the PASCAL VOC world). With the focus on beating the latest benchmark numbers on the latest dataset, have we perhaps lost sight of the original purpose? The goal of this paper is to take stock of the current state of recognition datasets. We present a comparison study using a set of popular datasets, evaluated based on a number of criteria including: relative data bias, cross-dataset generalization, effects of closed-world assumption, and sample value. The experimental results, some rather surprising, suggest directions that can improve dataset collection as well as algorithm evaluation protocols. But more broadly, the hope is to stimulate discussion in the community regarding this very important, but largely neglected, issue.
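As an illustration of the cross-dataset generalization criterion mentioned above, a minimal sketch (not the paper's exact protocol; the classifier, features, and dataset loaders are placeholder assumptions) would train a detector on each dataset and test it on every other, producing a train/test matrix whose off-diagonal entries measure how well models trained on one dataset transfer to another:

```python
# Sketch of a cross-dataset generalization matrix for one object class.
# Dataset contents and the LinearSVC/average-precision choices are illustrative.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

def cross_dataset_matrix(datasets):
    """datasets: dict name -> (X_train, y_train, X_test, y_test) for one class."""
    names = list(datasets)
    perf = np.zeros((len(names), len(names)))
    for i, train_name in enumerate(names):
        Xtr, ytr, _, _ = datasets[train_name]
        clf = LinearSVC(C=1.0).fit(Xtr, ytr)
        for j, test_name in enumerate(names):
            _, _, Xte, yte = datasets[test_name]
            perf[i, j] = average_precision_score(yte, clf.decision_function(Xte))
    # Diagonal: "self" performance; off-diagonal: cross-dataset generalization.
    return names, perf
```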

2,428 citations


Journal ArticleDOI
TL;DR: SIFT flow is proposed, a method to align an image to its nearest neighbors in a large image corpus containing a variety of scenes, where image information is transferred from the nearest neighbors to a query image according to the dense scene correspondence.
Abstract: While image alignment has been studied in different areas of computer vision for decades, aligning images depicting different scenes remains a challenging problem. Analogous to optical flow, where an image is aligned to its temporally adjacent frame, we propose SIFT flow, a method to align an image to its nearest neighbors in a large image corpus containing a variety of scenes. The SIFT flow algorithm consists of matching densely sampled, pixelwise SIFT features between two images while preserving spatial discontinuities. The SIFT features allow robust matching across different scene/object appearances, whereas the discontinuity-preserving spatial model allows matching of objects located at different parts of the scene. Experiments show that the proposed approach robustly aligns complex scene pairs containing significant spatial differences. Based on SIFT flow, we propose an alignment-based large database framework for image analysis and synthesis, where image information is transferred from the nearest neighbors to a query image according to the dense scene correspondence. This framework is demonstrated through concrete applications such as motion field prediction from a single image, motion synthesis via object transfer, satellite image registration, and face recognition.
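For reference, the SIFT flow objective combines a truncated L1 data term on dense SIFT descriptors, a small-displacement penalty, and a truncated L1 smoothness term on the flow field w(p) = (u(p), v(p)); it is roughly of the following form (constants and notation recalled from the published formulation and may differ slightly):

```latex
E(\mathbf{w}) = \sum_{\mathbf{p}} \min\big( \lVert s_1(\mathbf{p}) - s_2(\mathbf{p}+\mathbf{w}(\mathbf{p})) \rVert_1,\; t \big)
              + \sum_{\mathbf{p}} \eta \big( |u(\mathbf{p})| + |v(\mathbf{p})| \big)
              + \sum_{(\mathbf{p},\mathbf{q}) \in \varepsilon} \Big[ \min\big( \alpha\,|u(\mathbf{p})-u(\mathbf{q})|,\; d \big) + \min\big( \alpha\,|v(\mathbf{p})-v(\mathbf{q})|,\; d \big) \Big]
```

where s_1 and s_2 are the per-pixel SIFT descriptor images and ε is the spatial neighborhood system; the energy is minimized with a coarse-to-fine belief propagation scheme.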

1,726 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: A new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage is introduced.
Abstract: We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage. Previous datasets for action recognition are unrealistic for real-world surveillance because they consist of short clips showing one action by one individual [15, 8]. Datasets have been developed for movies [11] and sports [12], but these actions and scene conditions do not apply effectively to surveillance videos. Our dataset consists of many outdoor scenes with actions occurring naturally by non-actors in continuously captured videos of the real world. The dataset includes large numbers of instances for 23 event types distributed throughout 29 hours of video. This data is accompanied by detailed annotations which include both moving object tracks and event examples, which provide a solid basis for large-scale evaluation. Additionally, we propose different types of evaluation modes for visual recognition tasks and evaluation metrics along with our preliminary experimental results. We believe that this dataset will stimulate diverse aspects of computer vision research and help us to advance the CVER tasks in the years ahead.

664 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel, nonparametric approach for object recognition and scene parsing using a new technology the authors name label transfer, which is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.
Abstract: While there has been a lot of recent work on object recognition and image understanding, the focus has been on carefully establishing mathematical models for images, scenes, and objects. In this paper, we propose a novel, nonparametric approach for object recognition and scene parsing using a new technology we name label transfer. For an input image, our system first retrieves its nearest neighbors from a large database containing fully annotated images. Then, the system establishes dense correspondences between the input image and each of the nearest neighbors using the dense SIFT flow algorithm [28], which aligns two images based on local image structures. Finally, based on the dense scene correspondences obtained from SIFT flow, our system warps the existing annotations and integrates multiple cues in a Markov random field framework to segment and recognize the query image. Promising experimental results have been achieved by our nonparametric scene parsing system on challenging databases. Compared to existing object recognition approaches that require training classifiers or appearance models for each object category, our system is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.
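As a deliberately simplified illustration of the final step above, the sketch below fuses annotation maps that have already been warped into the query frame by SIFT flow, using per-pixel weighted voting; the paper itself integrates multiple cues in a Markov random field rather than plain voting:

```python
# Fuse neighbor annotations warped into the query frame (simplified voting,
# not the paper's MRF formulation).
import numpy as np

def fuse_warped_annotations(warped_labels, weights=None, num_classes=None):
    """warped_labels: list of HxW integer label maps warped from the neighbors.
    weights: optional per-neighbor reliability (e.g. from a matching score)."""
    warped = np.stack(warped_labels)              # (K, H, W)
    K, H, W = warped.shape
    if weights is None:
        weights = np.ones(K)
    if num_classes is None:
        num_classes = int(warped.max()) + 1
    votes = np.zeros((num_classes, H, W))
    for k in range(K):
        for c in range(num_classes):
            votes[c] += weights[k] * (warped[k] == c)
    return votes.argmax(axis=0)                   # per-pixel label for the query
```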

431 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: A hierarchical classification model that allows rare objects to borrow statistical strength from related objects that have many training examples and learns both a hierarchy for sharing visual appearance across 200 object categories and hierarchical parameters is presented.
Abstract: We present a hierarchical classification model that allows rare objects to borrow statistical strength from related objects that have many training examples. Unlike many of the existing object detection and recognition systems that treat different classes as unrelated entities, our model learns both a hierarchy for sharing visual appearance across 200 object categories and hierarchical parameters. Our experimental results on the challenging object localization and detection task demonstrate that the proposed model substantially improves the accuracy of the standard single object detectors that ignore hierarchical structure altogether.

385 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: It is shown that memorability is a stable property of an image that is shared across different viewers, and a database is introduced in which the probability that each picture will be remembered after a single view has been measured.
Abstract: When glancing at a magazine, or browsing the Internet, we are continuously being exposed to photographs. Despite this overflow of visual information, humans are extremely good at remembering thousands of pictures along with some of their visual details. But not all images are equal in memory. Some stick in our minds, and others are forgotten. In this paper we focus on the problem of predicting how memorable an image will be. We show that memorability is a stable property of an image that is shared across different viewers. We introduce a database for which we have measured the probability that each picture will be remembered after a single view. We analyze image features and labels that contribute to making an image memorable, and we train a predictor based on global image descriptors. We find that predicting image memorability is a task that can be addressed with current computer vision techniques. Whereas making memorable images is a challenging task in visualization and photography, this work is a first attempt to quantify this useful quality of images.
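A minimal sketch of the kind of predictor described above: regress each image's measured memorability from a global descriptor and report rank correlation under cross-validation. The use of scikit-learn's SVR and a generic descriptor matrix are illustrative assumptions, not the paper's exact features or learner:

```python
# Train a memorability regressor on global image descriptors (illustrative setup).
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict
from scipy.stats import spearmanr

def train_memorability_predictor(X, y):
    """X: (N, D) global descriptors; y: (N,) measured memorability (hit rates)."""
    model = SVR(kernel="rbf", C=1.0, epsilon=0.05)
    preds = cross_val_predict(model, X, y, cv=5)   # held-out predictions
    rho, _ = spearmanr(preds, y)                   # rank correlation is the usual report
    return model.fit(X, y), rho
```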

358 citations


ReportDOI
12 Dec 2011
TL;DR: In this article, the authors used the publicly available memorability dataset of Isola et al. and augmented object and scene annotations with interpretable spatial, content, and aesthetic image properties to determine a compact set of attributes that characterizes the memorability of any individual image.
Abstract: Artists, advertisers, and photographers are routinely presented with the task of creating an image that a viewer will remember. While it may seem like image memorability is purely subjective, recent work shows that it is not an inexplicable phenomenon: variation in memorability of images is consistent across subjects, suggesting that some images are intrinsically more memorable than others, independent of subjects' contexts and biases. In this paper, we used the publicly available memorability dataset of Isola et al. [13], and augmented the object and scene annotations with interpretable spatial, content, and aesthetic image properties. We used a feature-selection scheme with desirable explaining-away properties to determine a compact set of attributes that characterizes the memorability of any individual image. We find that images of enclosed spaces containing people with visible faces are memorable, while images of vistas and peaceful scenes are not. Contrary to popular belief, unusual or aesthetically pleasing scenes do not tend to be highly memorable. This work represents one of the first attempts at understanding intrinsic image memorability, and opens a new domain of investigation at the interface between human cognition and computer vision.
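The paper's feature-selection scheme is described above only as having explaining-away properties; as one generic stand-in (an assumption, not the authors' method), an L1-regularized regression keeps a compact subset of correlated attributes and zeroes out the rest:

```python
# Generic sparse attribute selection for memorability (stand-in, not the paper's scheme).
import numpy as np
from sklearn.linear_model import LassoCV

def select_attributes(A, y, attribute_names):
    """A: (N, D) attribute matrix; y: (N,) memorability scores."""
    model = LassoCV(cv=5).fit(A, y)
    keep = np.flatnonzero(np.abs(model.coef_) > 1e-6)   # attributes with nonzero weight
    return [(attribute_names[i], model.coef_[i]) for i in keep]
```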

167 citations


Journal ArticleDOI
TL;DR: This article collected about 500 trials per cell per observer for both target-present and target-absent displays in each of three classic search tasks: feature search, with the target defined by color; conjunction search, with the target defined by both color and orientation; and spatial configuration search for a 2 among distractor 5s.
Abstract: Many visual search experiments measure response time (RT) as their primary dependent variable. Analyses typically focus on mean (or median) RT. However, given enough data, the RT distribution can be a rich source of information. For this paper, we collected about 500 trials per cell per observer for both target-present and target-absent displays in each of three classic search tasks: feature search, with the target defined by color; conjunction search, with the target defined by both color and orientation; and spatial configuration search for a 2 among distractor 5s. This large data set allows us to characterize the RT distributions in detail. We present the raw RT distributions and fit several psychologically motivated functions (ex-Gaussian, ex-Wald, Gamma, and Weibull) to the data. We analyze and interpret parameter trends from these four functions within the context of theories of visual search.
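A minimal sketch of fitting one of the distributions mentioned above (the ex-Gaussian) to a vector of response times with SciPy; scipy.stats.exponnorm is the exponentially modified normal, whose (K, loc, scale) parameters map to the conventional (mu, sigma, tau) as mu = loc, sigma = scale, tau = K * scale:

```python
# Fit an ex-Gaussian to response times and return its conventional parameters.
import numpy as np
from scipy import stats

def fit_ex_gaussian(rts):
    """rts: 1-D array of response times (e.g. in ms) for one condition."""
    K, loc, scale = stats.exponnorm.fit(rts)
    mu, sigma, tau = loc, scale, K * scale
    # Log-likelihood allows comparison against other candidates (Gamma, Weibull, ...).
    loglik = np.sum(stats.exponnorm.logpdf(rts, K, loc, scale))
    return {"mu": mu, "sigma": sigma, "tau": tau, "loglik": loglik}

# Example with synthetic positively skewed data:
# params = fit_ex_gaussian(np.random.default_rng(0).gamma(5.0, 100.0, 500))
```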

156 citations


Proceedings Article
12 Dec 2011
TL;DR: This work proposes a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes, and demonstrates that the new object detector improves upon the current state-of-the-art detector on the challenging SUN09 object detection dataset.
Abstract: Despite the recent trend of increasingly large datasets for object detection, there still exist many classes with few training examples. To overcome this lack of training data for certain classes, we propose a novel way of augmenting the training data for each class by borrowing and transforming examples from other classes. Our model learns which training instances from other classes to borrow and how to transform the borrowed examples so that they become more similar to instances from the target class. Our experimental results demonstrate that our new object detector, with borrowed and transformed examples, improves upon the current state-of-the-art detector on the challenging SUN09 object detection dataset.

143 citations


Journal ArticleDOI
TL;DR: In this article, the authors focus on the problem of predicting how memorable an image will be, and find that memorability is a stable property of an image that is shared across different viewers, and introduce a database for which they have measured the probability that each picture will be remembered after a single view.
Abstract: When glancing at a magazine, or browsing the Internet, we are continuously being exposed to photographs. Despite this overflow of visual information, humans are extremely good at remembering thousands of pictures along with some of their visual details. But not all images are equal in memory. Some stick in our minds, and others are forgotten. In this paper we focus on the problem of predicting how memorable an image will be. We show that memorability is a stable property of an image that is shared across different viewers. We introduce a database for which we have measured the probability that each picture will be remembered after a single view. We analyze image features and labels that contribute to making an image memorable, and we train a predictor based on global image descriptors. We find that predicting image memorability is a task that can be addressed with current computer vision techniques. Whereas making memorable images is a challenging task in visualization and photography, this work is a first attempt to quantify this useful quality of images.

130 citations


Proceedings ArticleDOI
30 Aug 2011
TL;DR: A concept for automatic construction site monitoring is presented that takes into account 4D information (3D over time) acquired from highly overlapping digital aerial images, largely supporting automated methods toward full scene understanding.
Abstract: We present a concept for automatic construction site monitoring that takes into account 4D information (3D over time) acquired from highly overlapping digital aerial images. On the one hand, today's maturity of flying micro aerial vehicles (MAVs) enables low-cost and efficient image acquisition of high-quality data that maps construction sites entirely from many varying viewpoints. On the other hand, due to low-noise sensors and high redundancy in the image data, recent developments in 3D reconstruction workflows have benefited the automatic computation of accurate and dense 3D scene information. Having both an inexpensive high-quality image acquisition and an efficient 3D analysis workflow enables monitoring, documentation and visualization of observed sites over time at short intervals. Relating the acquired 4D site observations, composed of color, texture, and geometry over time, largely supports automated methods toward full scene understanding and the acquisition of both the change and the construction site's progress.

02 Jul 2011
TL;DR: In this article, a hierarchical Bayesian model is proposed to transfer knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances, which can discover how to group categories into meaningful super-categories that express different priors for new classes.
Abstract: We develop a hierarchical Bayesian model that learns categories from single training examples. The model transfers acquired knowledge from previously learned categories to a novel category, in the form of a prior over category means and variances. The model discovers how to group categories into meaningful super-categories that express different priors for new classes. Given a single example of a novel category, we can efficiently infer which super-category the novel category belongs to, and thereby estimate not only the new category's mean but also an appropriate similarity metric based on parameters inherited from the super-category. On MNIST and MSR Cambridge image datasets the model learns useful representations of novel categories based on just a single training example, and performs significantly better than simpler hierarchical Bayesian approaches. It can also discover new categories in a completely unsupervised fashion, given just one or a few examples.
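A toy illustration (not the paper's model, which learns the super-category structure itself) of how a single example can borrow a prior from related classes: the novel class mean is a precision-weighted blend of the one observation and the super-category mean, and the super-category's shared within-class variance is inherited as the similarity metric:

```python
# One-shot posterior for a novel class mean under a Gaussian prior borrowed
# from a super-category (toy sketch with known variances).
import numpy as np

def one_shot_posterior(x_new, related_class_means, within_class_var):
    """x_new: (D,) single example of the novel class.
    related_class_means: (M, D) means of classes in the inferred super-category.
    within_class_var: (D,) shared within-class variance (the inherited metric)."""
    prior_mean = related_class_means.mean(axis=0)        # super-category mean
    prior_var = related_class_means.var(axis=0) + 1e-6    # spread of class means
    # Conjugate Gaussian update with a single observation:
    post_var = 1.0 / (1.0 / prior_var + 1.0 / within_class_var)
    post_mean = post_var * (prior_mean / prior_var + x_new / within_class_var)
    return post_mean, within_class_var   # mean estimate + inherited metric
```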

Proceedings ArticleDOI
06 Nov 2011
TL;DR: This work proposes using a photorealistic virtual world to gain complete and repeatable control of the environment in order to evaluate image features, and calibrates the virtual world evaluations by comparing against feature rankings made from photographic data of the same subject matter.
Abstract: Image features are widely used in computer vision applications. They need to be robust to scene changes and image transformations. Designing and comparing feature descriptors requires the ability to evaluate their performance with respect to those transformations. We want to know how robust the descriptors are to changes in the lighting, scene, or viewing conditions. For this, we need ground truth data of different scenes viewed under different camera or lighting conditions in a controlled way. Such data is very difficult to gather in a real-world setting. We propose using a photorealistic virtual world to gain complete and repeatable control of the environment in order to evaluate image features. We calibrate our virtual world evaluations by comparing against feature rankings made from photographic data of the same subject matter (the Statue of Liberty). We find very similar feature rankings between the two datasets. We then use our virtual world to study the effects on descriptor performance of controlled changes in viewpoint and illumination. We also study the effect of augmenting the descriptors with depth information to improve performance.
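A minimal sketch of the controlled-transformation style of evaluation described above, here with a synthetic illumination change applied in image space rather than a rendered virtual world; OpenCV's SIFT and the placeholder file name are assumptions, not the paper's descriptor set:

```python
# Measure descriptor robustness to a controlled transformation via ratio-test matching.
import cv2
import numpy as np

def match_rate(img, transformed):
    sift = cv2.SIFT_create()
    kp1, d1 = sift.detectAndCompute(img, None)
    kp2, d2 = sift.detectAndCompute(transformed, None)
    matches = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe ratio test
    return len(good) / max(len(kp1), 1)

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)            # placeholder path
darker = np.clip(img.astype(np.float32) * 0.5, 0, 255).astype(np.uint8)
print("match rate under illumination change:", match_rate(img, darker))
```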

Journal Article
TL;DR: The goal of the current study is to determine the prototypical exemplars that best represent each visual scene category, and to evaluate the performance of state-of-the-art global feature algorithms at classifying different types of exemplars.

Proceedings Article
12 Dec 2011
TL;DR: Efficient learning and inference algorithms for the HDP-DBM model are presented and it is shown that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.
Abstract: We introduce HD (or "Hierarchical-Deep") models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.

Dissertation
01 Jan 2011
TL;DR: A new way to model where people look, learned from ground-truth eye tracking data with machine learning techniques, is described that outperforms all existing models, and a benchmark data set is provided to quantitatively compare existing and future models.
Abstract: For many applications in graphics, design, and human computer interaction, it is essential to understand where humans look in a scene. This is a challenging task given that no one fully understands how the human visual system works. This thesis explores the way people look at different types of images and provides methods of predicting where they look in new scenes. We describe a new way to model where people look from ground-truth eye tracking data using machine learning techniques; it outperforms all existing models, and we provide a benchmark data set to quantitatively compare existing and future models. In addition we explore how image resolution affects where people look. Our experiments, models, and large eye tracking data sets should help future researchers better understand and predict where people look in order to create more powerful computational vision systems.
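A minimal sketch of learning a fixation-prediction model from eye-tracking data in the spirit of the thesis: sample per-pixel feature vectors at fixated and non-fixated locations and train a linear classifier. The feature stack and the logistic-regression learner are illustrative assumptions, not the thesis's exact pipeline:

```python
# Learn a per-pixel saliency model from fixation maps (illustrative setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_saliency_model(feature_maps, fixation_maps, n_per_image=500, seed=0):
    """feature_maps: list of (H, W, D) per-pixel feature stacks.
    fixation_maps: list of (H, W) binary maps of human fixations."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for F, fix in zip(feature_maps, fixation_maps):
        pos = np.argwhere(fix > 0)
        neg = np.argwhere(fix == 0)
        pos = pos[rng.choice(len(pos), min(n_per_image, len(pos)), replace=False)]
        neg = neg[rng.choice(len(neg), min(n_per_image, len(neg)), replace=False)]
        for (r, c), label in [(p, 1) for p in pos] + [(n, 0) for n in neg]:
            X.append(F[r, c]); y.append(label)
    model = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))
    return model   # model.predict_proba over all pixels yields a saliency map
```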

Proceedings ArticleDOI
29 Dec 2011
TL;DR: This paper proposes a general framework to simultaneously perform object detection and segmentation on objects of different nature, based on a boosting procedure which automatically decides - according to the object properties - whether it is better to give more weight to the detection or to the segmentation process to improve both results.
Abstract: Numerous approaches to object detection and segmentation have been proposed so far. However, these methods are prone to fail in some general situations due to the nature of the objects themselves. For instance, classical approaches to object detection and segmentation obtain good results for some specific object classes (e.g. detection of pedestrians or segmentation of cars). However, these methods have trouble when detecting or segmenting object classes with different distinctive characteristics (e.g. cars and horses versus sky and road). In this paper, we propose a general framework to simultaneously perform object detection and segmentation on objects of different nature. Our approach is based on a boosting procedure which automatically decides - according to the object properties - whether it is better to give more weight to the detection or to the segmentation process to improve both results. We validate our approach using different object classes from the LabelMe, TUD and Weizmann databases, obtaining competitive detection and segmentation results.

Journal Article
TL;DR: This paper presents the raw RT distributions and fit several psychologically motivated functions (ex-Gaussian, ex-Wald, Gamma, and Weibull) to the data, and analyzes and interpret parameter trends from these four functions within the context of theories of visual search.
Abstract: Many visual search experiments measure response time (RT) as their primary dependent variable. Analyses typically focus on mean (or median) RT. However, given enough data, the RT distribution can be a rich source of information. For this paper, we collected about 500 trials per cell per observer for both target-present and target-absent displays in each of three classic search tasks: feature search, with the target defined by color; conjunction search, with the target defined by both color and orientation; and spatial configuration search for a 2 among distractor 5s. This large data set allows us to characterize the RT distributions in detail. We present the raw RT distributions and fit several psychologically motivated functions (ex-Gaussian, ex-Wald, Gamma, and Weibull) to the data. We analyze and interpret parameter trends from these four functions within the context of theories of visual search.

Journal ArticleDOI
TL;DR: This study identifies the 'medium-blur' condition, images approximately 32 pixels on a side, to be the limit for accurate 3-D shape perception and finds that degradation affects the perceived slant of point-estimates, making images look flatter as degradation increases.
Abstract: How little do we need to perceive 3-D shape in monocular natural images? The shape-from-texture and shape-from-shading perspectives would suggest that 3-D perception vanishes once low-level cues are disrupted. Is this the case in human vision? Or can top-down influences salvage the percept? In this study we probe this question by employing a gauge-figure paradigm similar to that used by Koenderink et al (1992, Perception & Psychophysics 52 487-496). Subjects were presented with degraded natural images and instructed to make local assessments of slant and tilt at various locations, thereby quantifying their internal 3-D percept. Analysis of subjects' responses reveals recognition to be a significant influence, thereby allowing subjects to perceive 3-D shape at high levels of degradation. Specifically, we identify the 'medium-blur' condition, images approximately 32 pixels on a side, to be the limit for accurate 3-D shape perception. In addition, we find that degradation affects the perceived slant of point-estimates, making images look flatter as degradation increases.

01 Jan 2011
TL;DR: This thesis exploits the advantages of tree-structured graphical models and considers modifications to overcome their limitations, and presents methods for learning such models given data at the finest scale by formulating a convex optimization problem.
Abstract: Probabilistic models commonly assume that variables are independent of each other conditioned on a subset of other variables. Graphical models provide a powerful framework for encoding such conditional independence structure of a large collection of random variables. A special class of graphical models with significant theoretical and practical importance is the class of tree-structured graphical models. Tree models have several advantages: they can be easily learned given data, their structures are often intuitive, and inference in tree models is highly efficient. However, tree models make strong conditional independence assumptions, which limit their modeling power substantially. This thesis exploits the advantages of tree-structured graphical models and considers modifications to overcome their limitations. To improve the modeling accuracy of tree models, we consider latent trees in which variables at some nodes represent the original (observed) variables of interest while others represent the latent variables added during the learning procedure. The appeal of such models is clear: the additional latent variables significantly increase the modeling power, and inference on trees is scalable with or without latent variables. We propose two computationally efficient and statistically consistent algorithms for learning latent trees, and compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree models. We exploit the advantages of tree models in the application of modeling contextual information of an image. Object co-occurrences and spatial relationships can be important cues in recognizing and localizing object instances. We develop tree-based context models and demonstrate that their simplicity enables us to integrate many sources of contextual information efficiently. In addition to object recognition, we are interested in using context models to detect objects that are out of their normal context. This task requires precise and careful modeling of object relationships, so we use a latent tree for object co-occurrences. Many of the latent variables can be interpreted as scene categories, capturing higher-order dependencies among object categories. Tree-structured graphical models have been widely used in multi-resolution (MR) modeling. In the last part of the thesis, we move beyond trees and propose a new modeling framework that allows additional dependency structure at each scale of an MR tree model. We mainly focus on MR models with jointly Gaussian variables, and assume that variables at each scale have sparse covariance structure (as opposed to fully-uncorrelated structure in MR trees) conditioned on variables at other scales. We develop efficient inference algorithms that are partly based on inference on the embedded MR tree and partly based on local filtering at each scale. In addition, we present methods for learning such models given data at the finest scale by formulating a convex optimization problem.
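The thesis's latent-tree learning algorithms are more involved; as background on why fully observed tree models "can be easily learned given data", here is a sketch of the classical Chow-Liu procedure (standard material, not the thesis's specific algorithm): estimate pairwise mutual information from samples and take a maximum-weight spanning tree.

```python
# Chow-Liu tree structure learning from samples of D variables.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_information(x, y, bins=8):
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def chow_liu_edges(X, bins=8):
    """X: (N, D) samples. Returns the edge list of the learned tree."""
    D = X.shape[1]
    mi = np.zeros((D, D))
    for i in range(D):
        for j in range(i + 1, D):
            mi[i, j] = mutual_information(X[:, i], X[:, j], bins) + 1e-12
    # Maximum-weight spanning tree == minimum spanning tree on negated weights.
    tree = minimum_spanning_tree(-mi).toarray()
    return [(i, j) for i in range(D) for j in range(D) if tree[i, j] != 0]
```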

01 Aug 2011
TL;DR: It is demonstrated that this new object-level image prior not only scales well to include arbitrary high-order object relationships, but also seamlessly integrates multiple sources of image information such as scene categorization, scene parsing and object detection.
Abstract: Although context is a key component to the success of building an object recognition system, it is difficult to scale and integrate existing formulations of contextual rules to take into account multiple sources of information. In this paper, we propose a generic, object-level image prior to represent rich, complicated contextual relationships. A maximum entropy distribution is learned to model the possible layouts of objects and scenes by placing constraints on the prior distribution. We demonstrate that this new object-level image prior not only scales well to include arbitrary high-order object relationships, but also seamlessly integrates multiple sources of image information such as scene categorization, scene parsing and object detection. The result is a more comprehensive understanding of the image.
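A toy sketch of the maximum-entropy fitting described above: over a handful of binary object-presence variables, choose weights on singleton and pairwise features by gradient ascent so that model expectations match the empirical ones (exact enumeration, so this only scales to a few objects; the paper's feature set and learning procedure are richer):

```python
# Fit a small maximum-entropy (log-linear) prior over binary object-presence vectors.
import itertools
import numpy as np

def fit_maxent(samples, n_iter=500, lr=0.5):
    """samples: (N, D) binary object-presence vectors; features are singletons + pairs."""
    N, D = samples.shape
    pairs = list(itertools.combinations(range(D), 2))
    def feats(x):
        return np.concatenate([x, [x[i] * x[j] for i, j in pairs]])
    target = np.mean([feats(x) for x in samples], axis=0)          # empirical expectations
    states = np.array(list(itertools.product([0, 1], repeat=D)), dtype=float)
    F = np.array([feats(s) for s in states])                        # (2^D, n_features)
    w = np.zeros(F.shape[1])
    for _ in range(n_iter):
        logp = F @ w
        p = np.exp(logp - logp.max()); p /= p.sum()                 # model distribution
        w += lr * (target - p @ F)                                  # match expectations
    return w, states, p   # weights define the learned object-layout prior
```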