
Showing papers by "Li Fei-Fei" published in 2007


Journal ArticleDOI
TL;DR: The incremental algorithm is compared experimentally to an earlier batch Bayesian algorithm, as well as to one based on maximum likelihood; all have comparable classification performance on small training sets, but incremental learning is significantly faster, making real-time learning feasible.

2,597 citations
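The TL;DR above concerns incremental Bayesian learning of object categories from few examples. As a rough intuition for why incremental and batch Bayesian estimation can yield comparable models while the incremental variant is far cheaper per new example, here is a minimal sketch with a conjugate Gaussian model; the prior, data, and update rule are illustrative stand-ins, not the paper's constellation model:

```python
# Minimal sketch: with a conjugate prior, folding observations into the
# posterior one at a time yields the same result as one batch update.
# Toy 1-D Gaussian with known variance, purely for illustration.
import numpy as np

def update(mu0, var0, x, noise_var=1.0):
    """One conjugate Normal-Normal update of the prior (mu0, var0)."""
    var_post = 1.0 / (1.0 / var0 + 1.0 / noise_var)
    mu_post = var_post * (mu0 / var0 + x / noise_var)
    return mu_post, var_post

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)

# Incremental: fold in one observation at a time.
mu, var = 0.0, 10.0
for x in data:
    mu, var = update(mu, var, x)

# Batch: process all observations in a single conjugate update.
n = len(data)
var_batch = 1.0 / (1.0 / 10.0 + n / 1.0)
mu_batch = var_batch * (0.0 / 10.0 + data.sum() / 1.0)

print(mu, var)              # incremental posterior
print(mu_batch, var_batch)  # matches the batch posterior (up to rounding)
```

This equivalence under conjugacy is the basic property that lets an incremental learner keep pace with a batch one while touching each new example only once.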


Proceedings ArticleDOI
26 Dec 2007
TL;DR: This paper uses a number of sports such as snowboarding, rock climbing or badminton to demonstrate event classification, and proposes a first attempt to classify events in static images by integrating scene and object categorizations.
Abstract: We propose a first attempt to classify events in static images by integrating scene and object categorizations. We define an event in a static image as a human activity taking place in a specific environment. In this paper, we use a number of sports such as snowboarding, rock climbing or badminton to demonstrate event classification. Our goal is to classify the event in the image as well as to provide a number of semantic labels for the objects and the scene environment within the image. For example, given a rowing scene, our algorithm recognizes the event as rowing by classifying the environment as a lake and recognizing the critical objects in the image as athletes, a rowing boat, water, etc. We achieve this integrative and holistic recognition through a generative graphical model. We have assembled a highly challenging database of 8 widely varied sport events. We show that our system is capable of classifying these event classes at 73.4% accuracy. While each component of the model contributes to the final recognition, using the scene or the objects alone cannot achieve this performance.

858 citations
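The abstract above describes an integrative recognition strategy: the event label is supported jointly by the scene environment and the detected objects. Below is a naive-Bayes caricature of that idea; the label sets and probability tables are invented for illustration and stand in for the paper's full generative graphical model:

```python
# Toy sketch of integrative event recognition:
#   p(event | image) is proportional to
#   p(event) * p(scene | event) * product_i p(object_i | event).
# All probabilities below are made up for illustration.
import math

p_event = {"rowing": 0.5, "badminton": 0.5}
p_scene_given_event = {
    "rowing":    {"lake": 0.8, "indoor court": 0.05},
    "badminton": {"lake": 0.05, "indoor court": 0.8},
}
p_object_given_event = {
    "rowing":    {"athlete": 0.6, "rowing boat": 0.7, "water": 0.8, "net": 0.01},
    "badminton": {"athlete": 0.6, "rowing boat": 0.01, "water": 0.05, "net": 0.7},
}

def classify(scene, objects):
    scores = {}
    for e in p_event:
        logp = math.log(p_event[e]) + math.log(p_scene_given_event[e][scene])
        for o in objects:
            logp += math.log(p_object_given_event[e][o])
        scores[e] = logp
    return max(scores, key=scores.get)

print(classify("lake", ["athlete", "rowing boat", "water"]))  # -> rowing
```

The point the abstract makes, that neither the scene nor the objects alone reach full performance, corresponds here to the fact that the scene term and the object terms each shift the posterior, and the decision is strongest when both agree.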


Proceedings ArticleDOI
17 Jun 2007
TL;DR: A hierarchical model that can be characterized as a constellation of bags-of-features and that is able to combine both spatial and spatial-temporal features is proposed, and is shown to improve classification performance over bag-of-features models.
Abstract: We present a novel model for human action categorization. A video sequence is represented as a collection of spatial and spatial-temporal features obtained by extracting static and dynamic interest points. We propose a hierarchical model that can be characterized as a constellation of bags-of-features and that is able to combine both spatial and spatial-temporal features. Given a novel video sequence, the model is able to categorize human actions on a frame-by-frame basis. We test the model on a publicly available human action dataset [2] and show that our new method performs well on the classification task. We also conducted control experiments showing that the use of the proposed mixture of hierarchical models improves classification performance over bag-of-features models. An additional experiment shows that using both dynamic and static features provides a richer representation of human actions than a single feature type, as demonstrated by our evaluation on the classification task.

486 citations
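The model above builds on bag-of-features representations of video frames. The sketch below shows the basic bag-of-features pipeline that the hierarchical model generalizes: quantize local descriptors against a codebook, histogram the resulting codewords per frame, and compare histograms. The codebook size, descriptor dimension, random data, and nearest-histogram classifier are all invented for illustration:

```python
# Minimal bag-of-features sketch with synthetic descriptors.
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.normal(size=(20, 32))  # 20 codewords, 32-D descriptors

def bof_histogram(descriptors):
    # assign each descriptor to its nearest codeword, then histogram
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# pretend per-class histograms learned from training frames
class_hists = {"walk": bof_histogram(rng.normal(size=(200, 32))),
               "wave": bof_histogram(rng.normal(loc=0.5, size=(200, 32)))}

frame_desc = rng.normal(loc=0.5, size=(150, 32))  # descriptors of one frame
h = bof_histogram(frame_desc)
label = min(class_hists, key=lambda c: np.abs(class_hists[c] - h).sum())
print(label)
```

The paper's contribution sits on top of this: rather than one flat histogram, it arranges bags-of-features into a constellation so that spatial structure and both feature types (static and space-time) inform the per-frame decision.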


Proceedings ArticleDOI
26 Dec 2007
TL;DR: This work proposes a novel and robust model to represent and learn generic 3D object categories, together with a framework in which learning requires only minimal supervision compared to previous work.
Abstract: We propose a novel and robust model to represent and learn generic 3D object categories. We aim to solve the problem of true 3D object categorization, handling arbitrary rotations and scale changes. Our approach is to capture a compact model of an object category by linking together diagnostic parts of the objects from different viewing points. We emphasize that our "parts" are large and discriminative regions of the objects that are composed of many local invariant features. Instead of recovering a full 3D geometry, we connect these parts through their mutual homographic transformation. The resulting model is a compact summarization of both the appearance and geometry information of the object class. We propose a framework in which learning requires only minimal supervision compared to previous work. Our categorization results show superior performance to state-of-the-art algorithms such as that of Thomas et al. (2006). Furthermore, we have compiled a new 3D object dataset that consists of 10 different object categories. We have tested our algorithm on this dataset and have obtained highly promising results.

476 citations
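A key mechanism in the abstract above is linking diagnostic object parts across views through their mutual homographic transformation. The sketch below shows the standard direct linear transform (DLT) for recovering a homography from point matches, on a synthetic example; it illustrates the geometric relation the model exploits, not the paper's learning procedure:

```python
# Two views of the same planar object part are related by a 3x3
# homography H. Recover H from point matches with the DLT algorithm.
import numpy as np

def estimate_homography(src, dst):
    """Solve for H (up to scale) from >= 4 point correspondences."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows)
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Synthetic check: transform points by a known H and recover it.
H_true = np.array([[1.1, 0.02, 5.0], [0.01, 0.95, -3.0], [1e-4, 2e-4, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 100], [0, 100], [37, 62]], float)
pts_h = np.c_[src, np.ones(len(src))] @ H_true.T
dst = pts_h[:, :2] / pts_h[:, 2:]

print(np.round(estimate_homography(src, dst), 4))  # approximately H_true
```

Because each part is planar enough to obey such a relation, storing pairwise homographies between parts summarizes the category's geometry without reconstructing full 3D shape.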


Journal ArticleDOI
TL;DR: It is shown that within a single glance, much object- and scene-level information is perceived by human subjects, and that subjects have a propensity toward perceiving natural scenes as outdoor rather than indoor.
Abstract: What do we see when we glance at a natural scene and how does it change as the glance becomes longer? We asked naive subjects to report in a free-form format what they saw when looking at briefly presented real-life photographs. Our subjects received no specific information as to the content of each stimulus. Thus, our paradigm differs from previous studies where subjects were cued before a picture was presented and/or were probed with multiple-choice questions. In the first stage, 90 novel grayscale photographs were foveally shown to a group of 22 native-English-speaking subjects. The presentation time was chosen at random from a set of seven possible times (from 27 to 500 ms). A perceptual mask followed each photograph immediately. After each presentation, subjects reported what they had just seen as completely and truthfully as possible. In the second stage, another group of naive individuals was instructed to score each of the descriptions produced by the subjects in the first stage. Individual scores were assigned to more than a hundred different attributes. We show that within a single glance, much object- and scene-level information is perceived by human subjects. The richness of our perception, though, seems asymmetrical. Subjects have a propensity toward perceiving natural scenes as outdoor rather than indoor. The reporting of sensory- or feature-level information of a scene (such as shading and shape) consistently precedes the reporting of semantic-level information. But once subjects recognize more semantic-level components of a scene, there is little evidence suggesting any bias toward either scene-level or object-level recognition.

432 citations


Proceedings ArticleDOI
26 Dec 2007
TL;DR: Spatial-LTM represents an image containing objects in a hierarchical way, via over-segmented image regions of homogeneous appearance and the salient image patches within those regions, enforcing the spatial coherency of the model.
Abstract: We present a novel generative model for simultaneously recognizing and segmenting object and scene classes. Our model is inspired by the traditional bag of words representation of texts and images as well as a number of related generative models, including probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA). A major drawback of the pLSA and LDA models is the assumption that each patch in the image is independently generated given its corresponding latent topic. While such a representation provides an efficient computational method, it lacks the power to describe visually coherent images and scenes. Instead, we propose a spatially coherent latent topic model (spatial-LTM). Spatial-LTM represents an image containing objects in a hierarchical way, via over-segmented image regions of homogeneous appearance and the salient image patches within those regions. A single latent topic is assigned to the image patches within each region, enforcing the spatial coherency of the model. This idea gives rise to the following merits of spatial-LTM: (1) spatial-LTM provides a unified representation for spatially coherent bag of words topic models; (2) spatial-LTM can simultaneously segment and classify objects, even in the case of occlusion and multiple instances; and (3) spatial-LTM can be trained unsupervised or supervised, as well as when only partial object labels are provided. We verify the success of our model in a number of segmentation and classification experiments.

392 citations
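The distinctive constraint in spatial-LTM is that all patches inside one over-segmented region must share a single latent topic, unlike pLSA/LDA where each patch picks its own. The toy sketch below makes that constraint concrete by choosing, for each region, the topic that maximizes the summed patch log-likelihoods; the likelihoods and region map are random placeholders rather than learned quantities:

```python
# Sketch of region-level topic coherence: one topic per region, shared
# by all patches in that region. Random toy likelihoods for illustration.
import numpy as np

rng = np.random.default_rng(2)
n_patches, n_topics = 12, 3
log_lik = np.log(rng.dirichlet(np.ones(n_topics), size=n_patches))  # patch x topic
region_of_patch = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

topic_of_region = {}
for r in np.unique(region_of_patch):
    member = region_of_patch == r
    topic_of_region[r] = log_lik[member].sum(axis=0).argmax()

# every patch inherits its region's topic -> spatially coherent labeling
patch_topics = np.array([topic_of_region[r] for r in region_of_patch])
print(patch_topics)
```

Tying patches together this way is what lets the model produce contiguous segmentations rather than the salt-and-pepper labelings typical of patch-independent topic models.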


Proceedings ArticleDOI
17 Jun 2007
TL;DR: This work adapts a non-parametric graphical model and proposes an incremental learning framework that mimics the human learning process of iteratively accumulating model knowledge and image examples, and that is capable of collecting image datasets superior to Caltech 101 and LabelMe.
Abstract: A well-built dataset is a necessary starting point for advanced computer vision research. It plays a crucial role in evaluation and provides a continuous challenge to state-of-the-art algorithms. Dataset collection is, however, a tedious and time-consuming task. This paper presents a novel automatic dataset collection and model learning approach that uses object recognition techniques in an incremental manner. The goal of this work is to use the tremendous resources of the web to learn robust object category models in order to detect and search for objects in real-world cluttered scenes. The approach mimics the human learning process of iteratively accumulating model knowledge and image examples. We adapt a non-parametric graphical model and propose an incremental learning framework. Our algorithm is capable of automatically collecting much larger object category datasets for 22 randomly selected classes from the Caltech 101 dataset. Furthermore, we offer not only more images in each object category dataset, but also a robust object model and meaningful image annotation. Our experiments show that OPTIMOL is capable of collecting image datasets that are superior to Caltech 101 and LabelMe.

243 citations
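The incremental loop described above (score candidate web images with the current model, keep the confident ones, retrain, repeat) can be caricatured in a few lines. In this sketch a one-dimensional Gaussian stands in for the paper's non-parametric graphical model, and the confidence threshold and synthetic "web" data are invented for illustration:

```python
# Caricature of an OPTIMOL-style collect-and-learn loop.
import numpy as np

rng = np.random.default_rng(3)
seeds = rng.normal(0.0, 1.0, size=10)              # few labeled seed images
web = np.concatenate([rng.normal(0.0, 1.0, 800),   # in-class web images
                      rng.normal(6.0, 1.0, 800)])  # clutter / wrong class
rng.shuffle(web)

dataset = list(seeds)
for _ in range(5):                                  # incremental iterations
    mu, sigma = np.mean(dataset), np.std(dataset) + 1e-6
    conf = np.exp(-0.5 * ((web - mu) / sigma) ** 2)  # model confidence
    accepted = web[conf > 0.6]                       # keep confident images
    web = web[conf <= 0.6]                           # leave the rest for later
    dataset.extend(accepted)

print(len(dataset), round(np.mean(dataset), 2))  # grows, stays near class mean
```

The design tension the paper manages is visible even here: a threshold that is too loose pollutes the dataset with clutter, while one that is too strict stalls the collection, so confidence estimation does the heavy lifting.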


Journal ArticleDOI
TL;DR: It is concluded that deploying top-down attention to a different attribute incurs a significant cost in reaction time, but that biasing to a different feature value within the same stimulus attribute is effortless.
Abstract: In many everyday situations, we bias our perception from the top down, based on a task or an agenda. Frequently, this entails shifting attention to a specific attribute of a particular object or scene. To explore the cost of shifting top-down attention to a different stimulus attribute, we adopt the task-set switching paradigm, in which switch trials are contrasted with repeat trials in mixed-task blocks, as well as with single-task blocks. Using two tasks that relate to the content of a natural scene in a gray-level photograph and two tasks that relate to the color of the frame around the image, we were able to distinguish switch costs with and without shifts of attention. We found a significant cost in reaction time of 23–31 ms for switches that required shifting attention to another stimulus attribute, but no significant switch cost for switching the task set within an attribute. We conclude that deploying top-down attention to a different attribute incurs a significant cost in reaction time, but that biasing to a different feature value within the same stimulus attribute is effortless.

19 citations
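The central quantity in the abstract above is the switch cost: mean reaction time on switch trials minus mean reaction time on repeat trials within mixed-task blocks. A tiny worked example with fabricated trial data is shown below; only the 23–31 ms range comes from the paper, everything else is invented:

```python
# Computing a task-switching cost from (fabricated) trial data.
import numpy as np

rt_ms = np.array([612, 588, 640, 597, 605, 571, 633, 580])   # reaction times
is_switch = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)   # trial type

switch_cost = rt_ms[is_switch].mean() - rt_ms[~is_switch].mean()
print(f"switch cost: {switch_cost:.1f} ms")
```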


Proceedings Article
22 Jul 2007
TL;DR: OPTIMOL (a framework for Online Picture collection via Incremental MOdel Learning) is a novel automatic dataset collection and model learning system for object categorization that mimics the human learning process: the more confident data are incorporated into the training set, the more reliable the models that can be learned.
Abstract: OPTIMOL (a framework for Online Picture collection via Incremental MOdel Learning) is a novel automatic dataset collection and model learning system for object categorization. Our algorithm mimics the human learning process: starting from a few training examples, the more confident data are incorporated into the training set, the more reliable the models that can be learned. Our system uses the Internet as a (nearly) unlimited resource for images. The learning and image collection processes proceed via an iterative and incremental scheme. The goal of this work is to use this tremendous web resource to learn robust object category models in order to detect and search for objects in real-world scenes.

1 citation