
Showing papers by "Andrew Zisserman published in 2007"


Proceedings ArticleDOI
17 Jun 2007
TL;DR: To improve query performance, this work adds an efficient spatial verification stage to re-rank the results returned from the bag-of-words model and shows that this consistently improves search quality, though by less of a margin when the visual vocabulary is large.
Abstract: In this paper, we present a large-scale object retrieval system. The user supplies a query object by selecting a region of a query image, and the system returns a ranked list of images that contain the same object, retrieved from a large corpus. We demonstrate the scalability and performance of our system on a dataset of over 1 million images crawled from the photo-sharing site, Flickr [3], using Oxford landmarks as queries. Building an image-feature vocabulary is a major time and performance bottleneck, due to the size of our dataset. To address this problem, we compare different scalable methods for building a vocabulary and introduce a novel quantization method based on randomized trees which we show outperforms the current state-of-the-art on an extensive ground-truth. Our experiments show that the quantization has a major effect on retrieval quality. To further improve query performance, we add an efficient spatial verification stage to re-rank the results returned from our bag-of-words model and show that this consistently improves search quality, though by less of a margin when the visual vocabulary is large. We view this work as a promising step towards much larger, "web-scale" image corpora.

3,242 citations
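
The retrieval machinery described above is a text-retrieval-style inverted index over quantized local features. Below is a minimal Python sketch of tf-idf scoring over visual words, assuming features have already been detected and quantized to word ids; the function names and the simple scoring scheme are illustrative rather than taken from the paper, and the spatial verification re-ranking stage is only indicated by a comment.

```python
import numpy as np
from collections import defaultdict

def build_inverted_index(doc_words, vocab_size):
    """doc_words: list of 1-D arrays of visual-word ids, one array per database image."""
    df = np.zeros(vocab_size)                      # document frequency of each visual word
    index = defaultdict(list)                      # word id -> [(image id, term count), ...]
    for doc_id, words in enumerate(doc_words):
        ids, counts = np.unique(words, return_counts=True)
        df[ids] += 1
        for w, c in zip(ids, counts):
            index[int(w)].append((doc_id, int(c)))
    idf = np.log(len(doc_words) / np.maximum(df, 1.0))
    return index, idf

def query(index, idf, n_docs, query_words, top_k=100):
    """Score database images by tf-idf weighted visual-word overlap with the query region."""
    scores = np.zeros(n_docs)
    ids, counts = np.unique(query_words, return_counts=True)
    for w, qc in zip(ids, counts):
        for doc_id, dc in index.get(int(w), []):
            scores[doc_id] += qc * dc * idf[w] ** 2
    # the paper then re-ranks the top results with a spatial (RANSAC-style) verification step
    return np.argsort(-scores)[:top_k]
```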


Proceedings ArticleDOI
09 Jul 2007
TL;DR: This work introduces a descriptor that represents local image shape and its spatial layout, together with a spatial pyramid kernel that is designed so that the shape correspondence between two images can be measured by the distance between their descriptors using the kernel.
Abstract: The objective of this paper is classifying images by the object categories they contain, for example motorbikes or dolphins. There are three areas of novelty. First, we introduce a descriptor that represents local image shape and its spatial layout, together with a spatial pyramid kernel. These are designed so that the shape correspondence between two images can be measured by the distance between their descriptors using the kernel. Second, we generalize the spatial pyramid kernel, and learn its level weighting parameters (on a validation set). This significantly improves classification performance. Third, we show that shape and appearance kernels may be combined (again by learning parameters on a validation set). Results are reported for classification on Caltech-101 and retrieval on the TRECVID 2006 data sets. For Caltech-101 it is shown that the class-specific optimization that we introduce exceeds the state-of-the-art performance by more than 10%.

1,496 citations
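
The spatial pyramid kernel in the abstract sums per-level similarities between spatially gridded histograms, with per-level weights that the paper learns on a validation set. The sketch below uses histogram intersection as the per-level similarity, which is one common choice; the exact descriptor and weighting scheme of the paper may differ.

```python
import numpy as np

def histogram_intersection(h1, h2):
    return np.minimum(h1, h2).sum()

def spatial_pyramid_kernel(grids_a, grids_b, level_weights):
    """grids_*: one entry per pyramid level l, an array of shape (2**l, 2**l, n_bins)
    holding a descriptor histogram (e.g. of edge orientations or visual words) per cell.
    level_weights: one weight per level; the paper learns these on a validation set."""
    return sum(w * histogram_intersection(ga, gb)
               for w, ga, gb in zip(level_weights, grids_a, grids_b))
```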


Proceedings ArticleDOI
26 Dec 2007
TL;DR: It is shown that selecting the ROI adds about 5% to the performance and, together with the other improvements, the result is about a 10% improvement over the state of the art for Caltech-256.
Abstract: We explore the problem of classifying images by the object categories they contain in the case of a large number of object categories. To this end we combine three ingredients: (i) shape and appearance representations that support spatial pyramid matching over a region of interest. This generalizes the representation of Lazebnik et al. (2006) from an image to a region of interest (ROI), and from appearance (visual words) alone to appearance and local shape (edge distributions); (ii) automatic selection of the regions of interest in training. This provides a method of inhibiting background clutter and adding invariance to the object instance's position; and (iii) the use of random forests (and random ferns) as a multi-way classifier. The advantage of such classifiers (over multi-way SVM for example) is the ease of training and testing. Results are reported for classification of the Caltech-101 and Caltech-256 data sets. We compare the performance of the random forest/ferns classifier with a benchmark multi-way SVM classifier. It is shown that selecting the ROI adds about 5% to the performance and, together with the other improvements, the result is about a 10% improvement over the state of the art for Caltech-256.

1,401 citations
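
Random ferns are the lighter-weight alternative to random forests mentioned above: each fern applies a small fixed set of binary tests and keeps class-conditional frequencies of the joint outcome. The following is a generic sketch over feature vectors, not the paper's exact formulation (which operates on the ROI-based shape and appearance representations).

```python
import numpy as np

class RandomFerns:
    """Minimal random-ferns classifier (a generic sketch, not the paper's exact model).
    Each fern picks `depth` feature indices; the joint outcome of the binary tests
    x[index] > 0 selects one of 2**depth bins, and Laplace-smoothed per-class bin
    frequencies act as likelihoods."""

    def __init__(self, n_ferns=50, depth=8, n_classes=2, seed=0):
        self.n_ferns, self.depth, self.n_classes = n_ferns, depth, n_classes
        self.rng = np.random.default_rng(seed)

    def _codes(self, X):
        bits = (X[:, self.ferns] > 0).astype(np.int64)   # (n_samples, n_ferns, depth)
        return bits @ (2 ** np.arange(self.depth))       # one integer code per fern

    def fit(self, X, y):
        self.ferns = self.rng.integers(0, X.shape[1], size=(self.n_ferns, self.depth))
        self.counts = np.ones((self.n_ferns, self.n_classes, 2 ** self.depth))
        codes = self._codes(X)
        for f in range(self.n_ferns):
            np.add.at(self.counts[f], (y, codes[:, f]), 1)
        self.counts /= self.counts.sum(axis=2, keepdims=True)
        return self

    def predict(self, X):
        codes = self._codes(X)
        log_post = np.zeros((X.shape[0], self.n_classes))
        for f in range(self.n_ferns):                    # semi-naive Bayes combination
            log_post += np.log(self.counts[f][:, codes[:, f]]).T
        return log_post.argmax(axis=1)
```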


Proceedings ArticleDOI
26 Dec 2007
TL;DR: This paper brings query expansion into the visual domain via two novel contributions: strong spatial constraints between the query image and each result allow each return to be accurately verified, suppressing the false positives which typically ruin text-based query expansion; and the verified images are used to learn a latent feature model for the controlled construction of expanded queries.
Abstract: Given a query image of an object, our objective is to retrieve all instances of that object in a large (1M+) image database. We adopt the bag-of-visual-words architecture which has proven successful in achieving high precision at low recall. Unfortunately, feature detection and quantization are noisy processes and this can result in variation in the particular visual words that appear in different images of the same object, leading to missed results. In the text retrieval literature a standard method for improving performance is query expansion. A number of the highly ranked documents from the original query are reissued as a new query. In this way, additional relevant terms can be added to the query. This is a form of blind relevance feedback and it can fail if 'outlier' (false positive) documents are included in the reissued query. In this paper we bring query expansion into the visual domain via two novel contributions. Firstly, strong spatial constraints between the query image and each result allow us to accurately verify each return, suppressing the false positives which typically ruin text-based query expansion. Secondly, the verified images can be used to learn a latent feature model to enable the controlled construction of expanded queries. We illustrate these ideas on the 5000 annotated image Oxford building database together with more than 1M Flickr images. We show that the precision is substantially boosted, achieving total recall in many cases.

966 citations
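
A minimal way to realise visual query expansion is "average query expansion": spatially verify the top returns and average their bag-of-words vectors with the original query before re-querying. The sketch below assumes a `verify` callback (e.g. counting RANSAC inliers of an affine fit); the paper's latent-feature-model variant is more elaborate.

```python
import numpy as np

def average_query_expansion(query_bow, ranked_ids, db_bow, verify, top_n=50, max_verified=20):
    """Build an expanded query from spatially verified top results.
    query_bow, db_bow[i]: tf-idf weighted bag-of-visual-words vectors.
    verify(i) -> bool: spatial consistency check (e.g. enough RANSAC inliers)."""
    verified = [i for i in ranked_ids[:top_n] if verify(i)][:max_verified]
    if not verified:
        return query_bow                            # fall back to the original query
    expanded = query_bow + sum(db_bow[i] for i in verified)
    return expanded / (1 + len(verified))           # re-issue this as the new query
```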


Proceedings Article
03 Dec 2007
TL;DR: It is shown that attributes can be learnt starting from a text query to Google image search, and can then be used to recognize the attribute and determine its spatial extent in novel real-world images.
Abstract: We present a probabilistic generative model of visual attributes, together with an efficient learning algorithm. Attributes are visual qualities of objects, such as 'red', 'striped', or 'spotted'. The model sees attributes as patterns of image segments, repeatedly sharing some characteristic properties. These can be any combination of appearance, shape, or the layout of segments within the pattern. Moreover, attributes with general appearance are taken into account, such as the pattern of alternation of any two colors which is characteristic for stripes. To enable learning from unsegmented training images, the model is learnt discriminatively, by optimizing a likelihood ratio. As demonstrated in the experimental evaluation, our model can learn in a weakly supervised setting and encompasses a broad range of attributes. We show that attributes can be learnt starting from a text query to Google image search, and can then be used to recognize the attribute and determine its spatial extent in novel real-world images.

470 citations


Proceedings ArticleDOI
17 Jun 2007
TL;DR: An exemplar model is introduced that can learn and generate a region of interest around class instances in a training set, given only a set of images containing the visual class; this enables the detection of multiple instances of the object class in test images.
Abstract: We introduce an exemplar model that can learn and generate a region of interest around class instances in a training set, given only a set of images containing the visual class. The model is scale and translation invariant. In the training phase, image regions that optimize an objective function are automatically located in the training images, without requiring any user annotation such as bounding boxes. The objective function measures visual similarity between training image pairs, using the spatial distribution of both appearance patches and edges. The optimization is initialized using discriminative features. The model enables the detection (localization) of multiple instances of the object class in test images, and can be used as a precursor to training other visual models that require bounding box annotation. The detection performance of the model is assessed on the PASCAL Visual Object Classes Challenge 2006 test set. For a number of object classes the performance far exceeds the current state of the art of fully supervised methods.

311 citations


Proceedings ArticleDOI
09 Jul 2007
TL;DR: Two novel schemes for near-duplicate image and video-shot detection are proposed and compared: one based on global hierarchical colour histograms, using Locality Sensitive Hashing for fast retrieval, and one based on local feature descriptors retrieved with a min-Hash algorithm.
Abstract: This paper proposes and compares two novel schemes for near-duplicate image and video-shot detection. The first approach is based on global hierarchical colour histograms, using Locality Sensitive Hashing for fast retrieval. The second approach uses local feature descriptors (SIFT) and for retrieval exploits techniques used in the information retrieval community to compute approximate set intersections between documents using a min-Hash algorithm. The requirements for near-duplicate images vary according to the application, and we address two types of near-duplicate definition: (i) being perceptually identical (e.g. up to noise, discretization effects, small photometric distortions etc); and (ii) being images of the same 3D scene (so allowing for viewpoint changes and partial occlusion). We define two shots to be near-duplicates if they share a large percentage of near-duplicate frames. We focus primarily on scalability to very large image and video databases, where fast query processing is necessary. Both methods are designed so that only a small amount of data need be stored for each image. In the case of near-duplicate shot detection it is shown that a weak approximation to histogram matching, consuming substantially less storage, is sufficient for good results. We demonstrate our methods on the TRECVID 2006 data set which contains approximately 165 hours of video (about 17.8M frames with 146K key frames), and also on feature films and pop videos.

237 citations
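
The second scheme's min-Hash idea can be sketched compactly: each image's set of visual words is reduced to a short signature whose element-wise collision rate estimates the set (Jaccard) overlap. The hashing scheme below (random affine hashes modulo a prime) is a standard choice, assumed rather than taken from the paper.

```python
import numpy as np

def minhash_sketch(word_ids, hash_seeds, prime=2_000_003):
    """Min-hash signature of a set of visual-word ids: one minimum per hash function.
    The collision rate of two signatures approximates their Jaccard set overlap."""
    word_ids = np.asarray(sorted(set(word_ids)))
    return np.array([((a * word_ids + b) % prime).min() for a, b in hash_seeds])

def estimated_overlap(sketch1, sketch2):
    return float(np.mean(sketch1 == sketch2))

# usage: draw, say, 64 (a, b) pairs once and reuse them for every image
rng = np.random.default_rng(0)
seeds = rng.integers(1, 2_000_003, size=(64, 2))
```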


Journal ArticleDOI
TL;DR: The flexible nature of the model is demonstrated by results over six diverse object categories including geometrically constrained categories (e.g. faces, cars) and flexible objects (such as animals).
Abstract: We investigate a method for learning object categories in a weakly supervised manner. Given a set of images known to contain the target category from a similar viewpoint, learning is translation and scale invariant, does not require alignment or correspondence between the training images, and is robust to clutter and occlusion. Category models are probabilistic constellations of parts, and their parameters are estimated by maximizing the likelihood of the training data. The appearance of the parts, as well as their mutual position, relative scale and probability of detection are explicitly described in the model. Recognition takes place in two stages. First, a feature-finder identifies promising locations for the model's parts. Second, the category model is used to compare the likelihood that the observed features are generated by the category model, or are generated by background clutter. The flexible nature of the model is demonstrated by results over six diverse object categories including geometrically constrained categories (e.g. faces, cars) and flexible objects (such as animals).

234 citations


Proceedings ArticleDOI
26 Dec 2007
TL;DR: A multi-modal approach employing text, metadata and visual features is used to gather many high-quality images from the Web in order to automatically generate a large number of images for a specified object class.
Abstract: The objective of this work is to automatically generate a large number of images for a specified object class (for example, penguin). A multi-modal approach employing text, metadata and visual features is used to gather many high-quality images from the Web. Candidate images are obtained by a text based Web search querying on the object identifier (the word penguin). The Web pages and the images they contain are downloaded. The task is then to remove irrelevant images and re-rank the remainder. First, the images are re-ranked using a Bayes posterior estimator trained on the text surrounding the image and metadata features (such as the image alternative tag, image title tag, and image filename). No visual information is used at this stage. Second, the top-ranked images are used as (noisy) training data and an SVM visual classifier is learnt to improve the ranking further. The principal novelty is in combining text/metadata and visual features in order to achieve a completely automatic ranking of the images. Examples are given for a selection of animals (e.g. camels, sharks, penguins), vehicles (cars, airplanes, bikes) and other classes (guitar, wristwatch), totalling 18 classes. The results are assessed by precision/recall curves on ground truth annotated data and by comparison to previous approaches including those of Berg et al. (on an additional six classes) and Fergus et al.

201 citations
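
The two-stage ranking described above can be sketched as follows: a Bayes classifier over binary text/metadata features produces an initial ranking, and the top of that ranking is used as noisy training data for a visual SVM. The sketch uses scikit-learn and assumes the feature matrices and a small labelled set for training the text ranker; it illustrates the pipeline shape, not the paper's implementation.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

def rerank_web_images(text_feats, visual_feats, labelled_idx, labelled_y, n_noisy=150):
    """Two-stage re-ranking sketch.
    text_feats:   (n_images, n_text_feats) binary text/metadata features
                  (e.g. query word in filename, alt tag, surrounding text).
    visual_feats: (n_images, n_visual_feats) visual descriptors (e.g. BoW histograms).
    labelled_idx, labelled_y: a small labelled subset used to train the text ranker.
    Returns image indices sorted from most to least likely in-class."""
    # Stage 1: rank everything with a Bayes classifier on text/metadata features only.
    text_clf = BernoulliNB().fit(text_feats[labelled_idx], labelled_y)
    order = np.argsort(-text_clf.predict_proba(text_feats)[:, 1])

    # Stage 2: treat top/bottom of that ranking as noisy positives/negatives
    # and train a visual classifier to re-rank all images.
    pos, neg = order[:n_noisy], order[-n_noisy:]
    X = np.vstack([visual_feats[pos], visual_feats[neg]])
    y = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
    visual_clf = LinearSVC().fit(X, y)
    return np.argsort(-visual_clf.decision_function(visual_feats))
```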


Journal ArticleDOI
TL;DR: This work develops a completely automatic system that works in two stages; it first builds a model of appearance of each person in a video and then it tracks by detecting those models in each frame ("tracking by model-building and detection").
Abstract: An open vision problem is to automatically track the articulations of people from a video sequence. This problem is difficult because one needs to determine the number of people in each frame and estimate their configurations. But finding people and localizing their limbs is hard because people can move fast and unpredictably, can appear in a variety of poses and clothes, and are often surrounded by limb-like clutter. We develop a completely automatic system that works in two stages; it first builds a model of appearance of each person in a video and then it tracks by detecting those models in each frame ("tracking by model-building and detection"). We develop two algorithms that build models; one bottom-up approach groups together candidate body parts found throughout a sequence. We also describe a top-down approach that automatically builds people-models by detecting convenient key poses within a sequence. We finally show that building a discriminative model of appearance is quite helpful since it exploits structure in a background (without background-subtraction). We demonstrate the resulting tracker on hundreds of thousands of frames of unscripted indoor and outdoor activity, a feature-length film ("Run Lola Run"), and legacy sports footage (from the 2002 World Series and 1998 Winter Olympics). Experiments suggest that our system 1) can count distinct individuals, 2) can identify and track them, 3) can recover when it loses track, for example, if individuals are occluded or briefly leave the view, 4) can identify body configuration accurately, and 5) is not dependent on particular models of human motion.

191 citations


Proceedings ArticleDOI
01 Jan 2007
TL;DR: An algorithm for automatically segmenting flowers in colour photographs, using an MRF cost function optimized with graph cuts and a generic shape model that is tolerant to viewpoint changes and petal deformations and applicable across many different flower classes.
Abstract: We describe an algorithm for automatically segmenting flowers in colour photographs. This is a challenging problem because of the sheer variety of flower classes, the intra-class variability, the variation within a particular flower, and the variability of imaging conditions (lighting, pose, foreshortening, etc.). The method couples two models: a colour model for foreground and background, and a generic shape model for the petal structure. This shape model is tolerant to viewpoint changes and petal deformations, and applicable across many different flower classes. The segmentations are produced using an MRF cost function optimized using graph cuts. The algorithm is tested on 13 flower classes and more than 750 examples. Performance is assessed against ground truth segmentations.
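
The energy being minimized here is an MRF over pixel labels: unary terms from the foreground/background colour models plus a contrast-sensitive smoothness term. The sketch below only writes the energy out (the paper minimizes a similar energy exactly with graph cuts); the weighting constants are placeholders, not the paper's values.

```python
import numpy as np

def segmentation_energy(labels, fg_nll, bg_nll, image, lam=10.0, beta=1.0):
    """MRF energy for a binary foreground/background labelling (sketch only).
    labels: (H, W) array in {0, 1}; fg_nll, bg_nll: per-pixel negative log-likelihoods
    under the foreground and background colour models; image: (H, W, 3) floats."""
    unary = np.where(labels == 1, fg_nll, bg_nll).sum()

    pairwise = 0.0
    for axis in (0, 1):                             # horizontal and vertical neighbours
        diff_lab = np.diff(labels, axis=axis) != 0
        diff_col = np.sum(np.diff(image, axis=axis) ** 2, axis=-1)
        # contrast-sensitive Potts term: label discontinuities are cheaper across strong edges
        pairwise += np.sum(diff_lab * np.exp(-beta * diff_col))
    return unary + lam * pairwise
```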

Proceedings ArticleDOI
26 Dec 2007
TL;DR: This work extends the LMNN framework to incorporate knowledge about invariances of the data; the resulting formulation, the invariant LMNN (ILMNN) classifier, is compared with state-of-the-art classifiers and demonstrates improvements.
Abstract: The k-nearest neighbour (kNN) rule is a simple and effective method for multi-way classification that is much used in Computer Vision. However, its performance depends heavily on the distance metric being employed. The recently proposed large margin nearest neighbour (LMNN) classifier [21] learns a distance metric for kNN classification and thereby improves its accuracy. Learning involves optimizing a convex problem using semidefinite programming (SDP). We extend the LMNN framework to incorporate knowledge about invariance of the data. The main contributions of our work are threefold: (i) invariances to multivariate polynomial transformations are incorporated without explicitly adding more training data during learning - these can approximate common transformations such as rotations and affinities; (ii) the incorporation of different regularizers on the parameters being learnt; and (iii) for all these variations, we show that the distance metric can still be obtained by solving a convex SDP problem. We call the resulting formulation the invariant LMNN (ILMNN) classifier. We test our approach to learn a metric for matching (i) feature vectors from the standard Iris dataset; and (ii) faces obtained from TV video (an episode of 'Buffy the Vampire Slayer'). We compare our method with state-of-the-art classifiers and demonstrate improvements.
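
For context, the standard LMNN objective that the paper builds on pulls each point towards its target (same-class) neighbours while pushing differently labelled "impostors" outside a unit margin. The sketch below evaluates that loss for a linear map L with metric M = L^T L; the paper's invariant extension and its SDP solver are not reproduced here.

```python
import numpy as np

def lmnn_loss(L, X, y, target_neighbors, mu=0.5):
    """Large-margin nearest neighbour objective for a linear map L (a sketch of the
    standard LMNN loss only, not the paper's invariant SDP formulation).
    target_neighbors[i]: indices of the same-class neighbours that should stay close."""
    Z = X @ L.T
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances

    pull, push = 0.0, 0.0
    for i in range(len(X)):
        impostors = np.where(y != y[i])[0]
        for j in target_neighbors[i]:
            pull += d[i, j]                               # keep target neighbours close
            # hinge: impostors must be further away than d(i, j) by a unit margin
            push += np.maximum(0.0, 1.0 + d[i, j] - d[i, impostors]).sum()
    return (1 - mu) * pull + mu * push
```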


Proceedings ArticleDOI
01 Jan 2007
TL;DR: This paper presents a system for person identification that uses concise statistical models of facial features in a real-time realisation of the cast identification system of Everingham et al.
Abstract: This paper presents a system for person identification that uses concise statistical models of facial features in a real-time realisation of the cast identification system of Everingham et al. [7]. Our system integrates the cascaded face detector of Viola and Jones with a kernel-based regressor for face tracking, which is trained on-line when new people are detected in the video stream. A pictorial model is used to compute the locations of facial features, which form a descriptor of the person’s face. When sufficient samples are collected, identification is performed using a random-ferns classifier by marginalising over the facial features. This confers robustness to localisation errors and occlusions, while enabling a real-time search of the database. These four different processes communicate within a real-time framework capable of tracking and identifying up to 5 people in real-time on a standard dual-core 1.86GHz machine.

Journal ArticleDOI
TL;DR: A scheme to learn the parameters of the image prior as part of the super-resolution algorithm is introduced, and examples on a number of real sequences including multiple stills, digital video, and DVDs of movies are shown.
Abstract: In multiple-image super-resolution, a high-resolution image is estimated from a number of lower-resolution images. This usually involves computing the parameters of a generative imaging model (such as geometric and photometric registration, and blur) and obtaining a MAP estimate by minimizing a cost function including an appropriate prior. Two alternative approaches are examined. First, both registrations and the super-resolution image are found simultaneously using a joint MAP optimization. Second, we perform Bayesian integration over the unknown image registration parameters, deriving a cost function whose only variables of interest are the pixel values of the super-resolution image. We also introduce a scheme to learn the parameters of the image prior as part of the super-resolution algorithm. We show examples on a number of real sequences including multiple stills, digital video, and DVDs of movies.
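
The MAP formulation amounts to minimizing a cost with one data term per low-resolution frame plus an image prior. The sketch below writes such a cost with a Huber prior on gradients; `warp_blur_decimate` is a stand-in for the generative model (registration, blur, down-sampling), and the prior form and weights are illustrative, not the learned prior of the paper.

```python
import numpy as np

def huber(t, alpha):
    """Huber penalty: quadratic near zero, linear in the tails (robust image prior)."""
    a = np.abs(t)
    return np.where(a <= alpha, t ** 2, 2 * alpha * a - alpha ** 2)

def sr_map_cost(x, lowres, warp_blur_decimate, lam=0.01, alpha=0.05):
    """MAP super-resolution cost (sketch): data term over each low-resolution frame plus
    a Huber prior on the gradients of the high-resolution estimate x, an (H, W) array.
    warp_blur_decimate(x, k) is an assumed callback applying the generative model
    (geometric/photometric registration, blur, down-sampling) for frame k."""
    data = sum(np.sum((yk - warp_blur_decimate(x, k)) ** 2)
               for k, yk in enumerate(lowres))
    prior = huber(np.diff(x, axis=0), alpha).sum() + huber(np.diff(x, axis=1), alpha).sum()
    return data + lam * prior
```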

01 Jan 2007
TL;DR: The Oxford team participated in the high-level feature extraction and interactive search tasks, using two generic methods run for all topics together with more specialized methods for particular topics.
Abstract: The Oxford team participated in the high-level feature extraction and interactive search tasks. A vision only approach was used for both tasks, with no use of the text or audio information. For the high-level feature extraction task, we used two different approaches, both based on sparse visual features. One used a standard bag-of-words representation, while the other additionally used a lower-dimensional “topic”-based representation generated by Latent Dirichlet Allocation (LDA). For both methods, we trained χ²-based SVM classifiers for all high-level features using publicly available annotations [3]. In addition, for certain features, we took a more targeted approach. Features based on human actions, such as “Walking/Running” and “People Marching”, were answered by using a robust pedestrian detector on every frame, coupled with an action classifier targeted to each feature to give high-precision results. For “Face” and “Person”, we used a real-time face detector and pedestrian detector, and for “Car” and “Truck”, we used a classifier which localized the vehicle in each image, trained on an external set of images of side and front views. We submitted 6 different runs. OXVGG_1 (0.073 mAP) was our best run, which used a fusion of our LDA and bag-of-words results for most features, but favored our specific methods for features where these were available. OXVGG_2 (0.062 mAP) and OXVGG_3 (0.060 mAP) were variations on this first run, using different parameter settings. OXVGG_4 (0.060 mAP) used LDA for all features and OXVGG_5 (0.059 mAP) used bag-of-words for all features. OXVGG_6 (0.066 mAP) was a variation of our first run. We came first in “Mountain” and were in the top five for “Studio”, “Car”, “Truck” and “Explosion/Fire”. Our main observation this year is that we can boost retrieval performance by using tailored approaches for specific concepts. For the interactive search task, we coupled the results generated during the high-level task with methods to facilitate efficient and productive interactive search. Our system allowed for several “expansion” methods based on different image representations. The main differences between this year’s system and last year’s were the availability of many more expansion methods and a “temporal zoom” facility which proved invaluable in answering the many action queries in this year’s task. We submitted just one run, I_C_2_VGG_I_1_1, which came second overall with an mAP of 0.328, and came first in 5 queries.

1 High-level Feature Extraction

For the high-level feature task, we used two generic methods which were run for all topics, and more specialized methods for particular topics. These results were then fused to create the final submission.

1.1 Generic Approaches

For the following approaches, we used a reduced subset of MPEG i-frames from each shot, found by clustering i-frames within a shot. Our approach here was to train an SVM for the concept in question, then score all frames in the test set using their distance from the discriminating hyper-plane. We then ranked the test shots by the maximum score over the reduced i-frames. We have developed two different methods for this task, differing only in their representations. The first uses a standard bag-of-words representation and the second concatenates this bag-of-words representation with a topic-based LDA representation.

1.1.1 Bag of visual word representation

The first method uses a bag of (visual) words [29] representation for the frames, where positional relationships between features are ignored. This representation has proved successful for classifying images according to whether they contain visual categories (such as cars, horses, etc.) by training an SVM [10]. Here we use the kernel formulation proposed by [33].

[Figure 1: An example of Hessian-Laplace regions used in the bag of words method. Left: original image; right: sparse detected regions overlaid as ellipses.]

Features and bag of words representation. We used Hessian-Laplace (HL) [21] interest points coupled with a SIFT [20] descriptor. This combination of detection and description generates features which are approximately invariant to an affine transformation of the image; see figure 1. These features are computed for all reduced i-frames. The “visual vocabulary” is then constructed by running unsupervised K-means clustering over both the training and test data. The K-means cluster centres define the visual words. We used a vocabulary size of K = 10,000 visual words. The SIFT features in each reduced i-frame are then assigned to the nearest cluster centre, to give the visual word representation, and the number of occurrences of each visual word is recorded in a histogram. This histogram of visual words is the bag of visual words model for that frame.

Topic-based representation. We use the Latent Dirichlet Allocation [5, 16] model to obtain a low dimensional representation of the bag-of-visual-words feature vectors. Similar low dimensional representations have been found useful in the context of unsupervised [26, 28] and supervised [6, 25] object and scene category recognition, and image retrieval [17, 27]. We pool together both TRECVid training and test data in the form of 10,000-dimensional bag-of-visual-words vectors and learn 20, 50, 100, 500 and 1,000 topic models. The models are fitted using the Gibbs sampler described in [16]. These representations are concatenated into a single feature vector, each one independently normalized, such that the bag-of-words and the individual topic representations are each given equal weight. This approach was found to work best using a validation set taken from the training data.

SVM classification. To predict whether a keyframe from the test set belongs to a concept, an SVM classifier is trained for each concept. Specifically, we use a kernel SVM with the χ² kernel K(p, q) = exp(−α χ²(p, q)).
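
The classifier in the last paragraph is a kernel SVM with a χ² kernel over histogram features. A small sketch using scikit-learn's precomputed-kernel interface is given below; the value of α and the train/test plumbing are assumptions, not taken from the notebook paper.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_kernel(P, Q, alpha=1.0, eps=1e-10):
    """K(p, q) = exp(-alpha * chi2(p, q)), with chi2(p, q) = sum_i (p_i - q_i)^2 / (p_i + q_i)."""
    diff2 = (P[:, None, :] - Q[None, :, :]) ** 2
    chi2 = np.sum(diff2 / (P[:, None, :] + Q[None, :, :] + eps), axis=-1)
    return np.exp(-alpha * chi2)

def train_concept_classifier(train_hists, labels, alpha=1.0, C=1.0):
    """Train one classifier per concept on bag-of-words (or BoW + LDA topic) histograms."""
    K = chi2_kernel(train_hists, train_hists, alpha)
    return SVC(kernel="precomputed", C=C).fit(K, labels)

def score_frames(clf, train_hists, test_hists, alpha=1.0):
    """Signed distance from the hyperplane; shots are then ranked by the maximum
    score over their reduced i-frames, as described above."""
    return clf.decision_function(chi2_kernel(test_hists, train_hists, alpha))
```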