
Showing papers by "Andrew Zisserman published in 2005"


Journal ArticleDOI
TL;DR: A snapshot of the state of the art in affine covariant region detectors that compares their performance on a set of test images under varying imaging conditions, and establishes a reference test set of images and performance software so that future detectors can be evaluated in the same framework.
Abstract: The paper gives a snapshot of the state of the art in affine covariant region detectors, and compares their performance on a set of test images under varying imaging conditions. Six types of detectors are included: detectors based on affine normalization around Harris (Mikolajczyk and Schmid, 2002; Schaffalitzky and Zisserman, 2002) and Hessian points (Mikolajczyk and Schmid, 2002); a detector of 'maximally stable extremal regions', proposed by Matas et al. (2002); an edge-based region detector (Tuytelaars and Van Gool, 1999); a detector based on intensity extrema (Tuytelaars and Van Gool, 2000); and a detector of 'salient regions', proposed by Kadir, Zisserman and Brady (2004). The performance is measured against changes in viewpoint, scale, illumination, defocus and image compression. The objective of this paper is also to establish a reference test set of images and performance software, so that future detectors can be evaluated in the same framework.

3,359 citations
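
The 'maximally stable extremal regions' detector compared here is now available in standard libraries. A minimal sketch using OpenCV (an assumed stand-in, not the paper's evaluation code; the image path is hypothetical):

```python
# Minimal sketch: detect 'maximally stable extremal regions' (MSER),
# one of the six affine covariant detector types compared in the paper.
# OpenCV is an assumed stand-in; the image path is hypothetical.
import cv2

img = cv2.imread("test_image.png", cv2.IMREAD_GRAYSCALE)
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(img)

# Each region is a set of pixel coordinates; fitting an ellipse gives an
# affine covariant region of the kind compared across detectors.
ellipses = [cv2.fitEllipse(r) for r in regions if len(r) >= 5]
print(f"{len(ellipses)} affine regions detected")
```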


Journal ArticleDOI
TL;DR: A method of reliably measuring relative orientation co-occurrence statistics in a rotationally invariant manner is presented, and whether incorporating such information can enhance the classifier’s performance is discussed.
Abstract: We investigate texture classification from single images obtained under unknown viewpoint and illumination. A statistical approach is developed where textures are modelled by the joint probability distribution of filter responses. This distribution is represented by the frequency histogram of filter response cluster centres (textons). Recognition proceeds from single, uncalibrated images and the novelty here is that rotationally invariant filters are used and the filter response space is low dimensional.

1,145 citations
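
The texton pipeline is compact enough to sketch end to end. The filter bank, cluster count and distance below are illustrative assumptions, not the paper's rotationally invariant bank:

```python
# Sketch of texton-based texture classification: filter responses are
# clustered into textons, and each image is represented by its texton
# frequency histogram. The filter bank here is a small stand-in, not the
# paper's rotationally invariant bank.
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def filter_responses(img):
    # img: 2-D grayscale array. Stack per-pixel responses of Gaussians
    # and Laplacians at several scales (rotation-insensitive choices).
    feats = [ndimage.gaussian_filter(img, s) for s in (1, 2, 4)]
    feats += [ndimage.gaussian_laplace(img, s) for s in (1, 2, 4)]
    return np.stack(feats, axis=-1).reshape(-1, len(feats))

def texton_dictionary(train_imgs, k=40):
    # Cluster pooled filter responses; cluster centres are the textons.
    X = np.vstack([filter_responses(im) for im in train_imgs])
    return KMeans(n_clusters=k, n_init=4).fit(X)

def texton_histogram(img, kmeans):
    labels = kmeans.predict(filter_responses(img))
    h = np.bincount(labels, minlength=kmeans.n_clusters).astype(float)
    return h / h.sum()

def chi2(h1, h2, eps=1e-10):
    # Classification: nearest training histogram under chi-squared distance.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```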


Proceedings ArticleDOI
17 Oct 2005
TL;DR: This work treats object categories as topics, so that an image containing instances of several categories is modeled as a mixture of topics, and applies a model developed in the statistical text literature: probabilistic latent semantic analysis (pLSA).
Abstract: We seek to discover the object categories depicted in a set of unlabelled images. We achieve this using a model developed in the statistical text literature: probabilistic latent semantic analysis (pLSA). In text analysis, this is used to discover topics in a corpus using the bag-of-words document representation. Here we treat object categories as topics, so that an image containing instances of several categories is modeled as a mixture of topics. The model is applied to images by using a visual analogue of a word, formed by vector quantizing SIFT-like region descriptors. The topic discovery approach successfully translates to the visual domain: for a small set of objects, we show that both the object categories and their approximate spatial layout are found without supervision. Performance of this unsupervised method is compared to the supervised approach of Fergus et al. (2003) on a set of unseen images containing only one object per image. We also extend the bag-of-words vocabulary to include 'doublets' which encode spatially local co-occurring regions. It is demonstrated that this extended vocabulary gives a cleaner image segmentation. Finally, the classification and segmentation methods are applied to a set of images containing multiple objects per image. These results demonstrate that we can successfully build object class models from an unsupervised analysis of images.

1,129 citations
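
pLSA itself reduces to a short EM loop over a word-document count matrix, where documents are images and words are quantized descriptors. A minimal numpy sketch (initialization and iteration counts are arbitrary choices):

```python
# Minimal pLSA via EM on an (n_words x n_docs) count matrix n[w, d].
# Topics play the role of object categories; documents are images and
# words are vector-quantized region descriptors.
import numpy as np

def plsa(n, n_topics, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    W, D = n.shape
    p_w_z = rng.random((W, n_topics)); p_w_z /= p_w_z.sum(0)
    p_z_d = rng.random((n_topics, D)); p_z_d /= p_z_d.sum(0)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|w,d) for every (word, doc) pair.
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]       # (W, Z, D)
        joint /= joint.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts.
        nz = n[:, None, :] * joint                          # (W, Z, D)
        p_w_z = nz.sum(2); p_w_z /= p_w_z.sum(0, keepdims=True)
        p_z_d = nz.sum(0); p_z_d /= p_z_d.sum(0, keepdims=True)
    return p_w_z, p_z_d
```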


Proceedings ArticleDOI
17 Oct 2005
TL;DR: A new model, TSI-pLSA, is developed, which extends pLSA (as applied to visual words) to include spatial information in a translation and scale invariant manner, and can handle the high intra-class variability and large proportion of unrelated images returned by search engines.
Abstract: Current approaches to object category recognition require datasets of training images to be manually prepared, with varying degrees of supervision. We present an approach that can learn an object category from just its name, by utilizing the raw output of image search engines available on the Internet. We develop a new model, TSI-pLSA, which extends pLSA (as applied to visual words) to include spatial information in a translation and scale invariant manner. Our approach can handle the high intra-class variability and large proportion of unrelated images returned by search engines. We evaluate the models on standard test sets, showing performance competitive with existing methods trained on hand-prepared datasets.

807 citations
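
One way to picture the spatial extension is to fold a coarse location bin, measured relative to a candidate object window, into each visual word, so that the topic model captures appearance and position jointly. The sketch below shows only this word-augmentation idea and is a loose illustration, not the paper's TSI-pLSA formulation, which marginalizes over candidate windows:

```python
# Illustrative only: fold a coarse location bin (relative to a candidate
# object bounding box) into each visual word, so a topic model can learn
# where on the object an appearance tends to occur. The actual TSI-pLSA
# model treats location as a latent variable over candidate windows;
# this shows the word augmentation step alone.
import numpy as np

def spatial_words(word_ids, xy, bbox, n_bins=4):
    # word_ids: (N,) appearance word indices; xy: (N, 2) region centres.
    x0, y0, x1, y1 = bbox                     # candidate object window
    u = np.clip((xy[:, 0] - x0) / max(x1 - x0, 1e-9), 0, 0.999)
    v = np.clip((xy[:, 1] - y0) / max(y1 - y0, 1e-9), 0, 0.999)
    loc = (u * n_bins).astype(int) * n_bins + (v * n_bins).astype(int)
    return word_ids * (n_bins * n_bins) + loc  # joint appearance-location word
```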


25 Feb 2005
TL;DR: Given a set of images containing multiple object categories, this work seeks to discover those categories and their image locations without supervision using generative models from the statistical text literature: probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA).
Abstract: Given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. We achieve this using generative models from the statistical text literature: probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA). In text analysis these are used to discover topics in a corpus using the bag-of-words document representation. Here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. The models are applied to images by using a visual analogue of a word, formed by vector quantizing SIFT-like region descriptors. We investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. The object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. We also demonstrate classification of unseen images and images containing multiple objects. Performance of the proposed unsupervised method is compared to the semi-supervised approach of [7]. (This work was sponsored in part by the EU Project CogViSys, the University of Oxford, Shell Oil, and the National Geospatial-Intelligence Agency.)

524 citations
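
Both topic models are now available off the shelf. A hedged sketch fitting LDA to a bag-of-visual-words count matrix with scikit-learn (a modern stand-in for the paper's implementation; the input file name is hypothetical):

```python
# Discover K topics (object categories) from an images-by-visual-words
# count matrix using scikit-learn's LDA. sklearn is a stand-in here,
# not the implementation used in the paper; the file name is hypothetical.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

X = np.load("bow_counts.npy")          # (n_images, n_words) visual-word counts
lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topics = lda.fit_transform(X)      # per-image topic distribution
top_words = lda.components_.argsort(axis=1)[:, ::-1][:, :10]
print(doc_topics.argmax(axis=1)[:20])  # most probable category per image
```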


Proceedings ArticleDOI
20 Jun 2005
TL;DR: A principled Bayesian method for detecting and segmenting instances of a particular object category within an image, providing a coherent methodology for combining top-down and bottom-up cues, together with an efficient method, OBJ CUT, for obtaining segmentations under this model.
Abstract: In this paper, we present a principled Bayesian method for detecting and segmenting instances of a particular object category within an image, providing a coherent methodology for combining top-down and bottom-up cues. The work draws together two powerful formulations: pictorial structures (PS) and Markov random fields (MRFs), both of which have efficient algorithms for their solution. The resulting combination, which we call the object category specific MRF, suggests a solution to the problem that has long dogged MRFs, namely that they provide a poor prior for specific shapes. In contrast, our model provides a prior that is global across the image plane using the PS. We develop an efficient method, OBJ CUT, to obtain segmentations using this model. Novel aspects of this method include an efficient algorithm for sampling the PS model, and the observation that the expected log likelihood of the model can be increased by a single graph cut. Results are presented on two object categories, cows and horses. We compare our methods to the state of the art in object category specific image segmentation and demonstrate significant improvements.

386 citations
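
The 'single graph cut' at the heart of OBJ CUT can be sketched with a generic max-flow library. Here PyMaxflow is an assumed stand-in, and the pictorial-structure shape prior is taken as already folded into the per-pixel unary costs:

```python
# Sketch of the MRF segmentation step: unary terms combine an appearance
# likelihood with a shape prior (assumed precomputed per pixel, standing
# in for the sampled pictorial structure), and a single graph cut yields
# the segmentation. PyMaxflow is an assumed stand-in library.
import numpy as np
import maxflow

def segment(fg_cost, bg_cost, smoothness=1.0):
    # fg_cost/bg_cost: (H, W) per-pixel negative log-likelihoods.
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(fg_cost.shape)
    g.add_grid_edges(nodes, smoothness)          # Potts pairwise terms
    g.add_grid_tedges(nodes, fg_cost, bg_cost)   # data + shape-prior unaries
    g.maxflow()
    return g.get_grid_segments(nodes)            # boolean segmentation mask
```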


Book ChapterDOI
11 Apr 2005
TL;DR: The PASCAL Visual Object Classes (VOC) Challenge ran from February to March 2005, with the goal of recognizing objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects).
Abstract: The PASCAL Visual Object Classes Challenge ran from February to March 2005. The goal of the challenge was to recognize objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). Four object classes were selected: motorbikes, bicycles, cars and people. Twelve teams entered the challenge. In this chapter we provide details of the datasets, algorithms used by the teams, evaluation criteria, and results achieved.

381 citations
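
Evaluation for detection and classification challenges of this kind typically reduces to a precision-recall summary. A generic average-precision sketch follows; it is illustrative only, since the exact criteria used in the 2005 challenge are specified in the chapter:

```python
# Generic average precision over a ranked list of detections.
# Illustrative only; the precise evaluation criteria used in the 2005
# challenge are those given in the chapter.
import numpy as np

def average_precision(scores, labels):
    # scores: confidence per prediction; labels: 1 if correct, else 0.
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    cum_tp = np.cumsum(labels)
    precision = cum_tp / (np.arange(len(labels)) + 1)
    return float((precision * labels).sum() / max(labels.sum(), 1))
```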


Proceedings ArticleDOI
20 Jun 2005
TL;DR: A person detector that quite accurately detects and localizes limbs of people in lateral walking poses is built, and an algorithm for finding and kinematically tracking multiple people in long sequences is developed.
Abstract: We develop an algorithm for finding and kinematically tracking multiple people in long sequences. Our basic assumption is that people tend to take on certain canonical poses, even when performing unusual activities like throwing a baseball or figure skating. We build a person detector that quite accurately detects and localizes limbs of people in lateral walking poses. We use the estimated limbs from a detection to build a discriminative appearance model; we assume the features that discriminate a figure in one frame will discriminate the figure in other frames. We then use the models as limb detectors in a pictorial structure framework, detecting figures in unrestricted poses in both previous and successive frames. We have run our tracker on hundreds of thousands of frames, and present and apply a methodology for evaluating tracking on such a large scale. We test our tracker on real sequences including a feature-length film, an hour of footage from a public park, and various sports sequences. We find that we can quite accurately automatically find and track multiple people interacting with each other while performing fast and unusual motions.

364 citations
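
The discriminative appearance model can be pictured as learning per-limb color statistics from the confident lateral-pose detections and scoring pixels in other frames. The histogram likelihood ratio below is a simple assumed stand-in for the paper's actual classifier:

```python
# Illustrative appearance model: build foreground/background color
# histograms from pixels inside/outside detected limb boxes, then score
# new frames by the log likelihood ratio. A simple stand-in for the
# paper's discriminative model.
import numpy as np

def quantize(img, bins=8):
    # img: (H, W, 3) uint8; map each pixel to one of bins**3 color cells.
    return (img // (256 // bins)).astype(int) @ np.array([bins * bins, bins, 1])

def color_model(frame, limb_mask, bins=8, eps=1.0):
    q = quantize(frame, bins)
    fg = np.bincount(q[limb_mask], minlength=bins ** 3) + eps
    bg = np.bincount(q[~limb_mask], minlength=bins ** 3) + eps
    return np.log(fg / fg.sum()) - np.log(bg / bg.sum())

def score_frame(frame, model, bins=8):
    return model[quantize(frame, bins)]   # per-pixel limb log-odds
```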


Proceedings ArticleDOI
20 Jun 2005
TL;DR: A "parts and structure" model for object category recognition that can be learnt efficiently and in a semi-supervised manner is presented, learnt from example images containing category instances, without requiring segmentation from background clutter.
Abstract: We present a "parts and structure" model for object category recognition that can be learnt efficiently and in a semi-supervised manner: the model is learnt from example images containing category instances, without requiring segmentation from background clutter. The model is a sparse representation of the object, and consists of a star topology configuration of parts modeling the output of a variety of feature detectors. The optimal choice of feature types (whose repertoire includes interest points, curves and regions) is made automatically. In recognition, the model may be applied efficiently in an exhaustive manner, bypassing the need for feature detectors, to give the globally optimal match within a query image. The approach is demonstrated on a wide variety of categories, and delivers both successful classification and localization of the object within the image.

333 citations
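
For a star topology the globally optimal match factorizes: each part's cost map is softened by the deformation cost and accumulated at the root. The sketch below uses a box deformation window via a minimum filter, a simplification of the quadratic-cost generalized distance transform:

```python
# Simplified star-model matching: each part may displace within a window
# around its ideal offset from the root (a box deformation cost via a
# minimum filter, standing in for the generalized distance transform);
# softened part costs are summed into the root score map.
import numpy as np
from scipy.ndimage import minimum_filter, shift

def match_star(root_cost, part_costs, offsets, window=9):
    # root_cost, part_costs[i]: (H, W) appearance costs (lower is better);
    # offsets[i]: ideal (dy, dx) of part i relative to the root.
    total = root_cost.astype(float).copy()
    for cost, (dy, dx) in zip(part_costs, offsets):
        softened = minimum_filter(cost, size=window)  # cheapest nearby placement
        # Align the part map with the root by its ideal offset
        # (sign convention assumed).
        total += shift(softened, (-dy, -dx), order=0, mode="nearest")
    return np.unravel_index(np.argmin(total), total.shape)  # best root location
```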


Book ChapterDOI
20 Jul 2005
TL;DR: Progress is described in harnessing multiple exemplars of each person, in a form that can easily be associated automatically using straightforward visual tracking, in order to retrieve humans in videos given a query face in a shot.
Abstract: Matching people based on their imaged face is hard because of the well known problems of illumination, pose, size and expression variation. Indeed these variations can exceed those due to identity. Fortunately, videos of people have the happy benefit of containing multiple exemplars of each person in a form that can easily be associated automatically using straightforward visual tracking. We describe progress in harnessing these multiple exemplars in order to retrieve humans automatically in videos, given a query face in a shot. There are three areas of interest: (i) the matching of sets of exemplars provided by “tubes” of the spatial-temporal volume; (ii) the description of the face using a spatial orientation field; and, (iii) the structuring of the problem so that retrieval is immediate at run time. The result is a person retrieval system, able to retrieve a ranked list of shots containing a particular person in the manner of Google. The method has been implemented and tested on two feature length movies.

243 citations
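
Matching 'tubes' reduces to comparing sets of exemplar descriptors. The min-min set distance below is one simple choice, assumed for illustration rather than taken from the paper:

```python
# Rank shots by comparing the query face track against stored tracks,
# using the minimum pairwise distance between exemplar descriptor sets
# (a simple set-to-set choice; the paper's matching may differ).
import numpy as np

def track_distance(track_a, track_b):
    # track_*: (n_exemplars, d) arrays of face descriptors.
    d2 = ((track_a[:, None, :] - track_b[None, :, :]) ** 2).sum(-1)
    return float(np.sqrt(d2).min())

def rank_shots(query_track, shot_tracks):
    # shot_tracks: one list of face tracks per shot.
    dists = [min(track_distance(query_track, t) for t in tracks)
             for tracks in shot_tracks]
    return np.argsort(dists)              # best matching shots first
```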


Proceedings ArticleDOI
20 Jun 2005
TL;DR: A recognition method based on a cascade of processing steps that normalize for the effects of the changing imaging environment is developed, and it is demonstrated that high recall rates (over 92%) can be achieved whilst maintaining good precision (over 93%).
Abstract: The objective of this work is to recognize all the frontal faces of a character in the closed world of a movie or situation comedy, given a small number of query faces. This is challenging because faces in a feature-length film are relatively uncontrolled with a wide variability of scale, pose, illumination, and expressions, and also may be partially occluded. We develop a recognition method based on a cascade of processing steps that normalize for the effects of the changing imaging environment. In particular there are three areas of novelty: (i) we suppress the background surrounding the face, enabling the maximum area of the face to be retained for recognition rather than a subset; (ii) we include a pose refinement step to optimize the registration between the test image and face exemplar; and (iii) we use robust distance to a sub-space to allow for partial occlusion and expression change. The method is applied and evaluated on several feature length films. It is demonstrated that high recall rates (over 92%) can be achieved whilst maintaining good precision (over 93%).
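
The 'robust distance to a sub-space' can be read as a PCA reconstruction residual with large per-pixel errors capped, so that occluded pixels do not dominate. The sketch below is an illustrative reading of that idea, not the paper's exact formulation:

```python
# Illustrative robust subspace distance: project a registered face onto
# a PCA subspace of exemplars and truncate large per-pixel residuals so
# that partial occlusion and expression change are tolerated.
import numpy as np

def robust_subspace_distance(x, mean, basis, clip=3.0):
    # basis: (d, k) orthonormal columns from PCA of registered faces.
    r = (x - mean) - basis @ (basis.T @ (x - mean))   # residual off the subspace
    s = np.std(r) + 1e-12
    r = np.clip(r / s, -clip, clip)                   # cap occluded-pixel errors
    return float(np.sqrt((r ** 2).mean()))
```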

Proceedings ArticleDOI
17 Oct 2005
TL;DR: An unsupervised approach for learning a generative layered representation of a scene from a video for motion segmentation, using efficient loopy belief propagation to obtain an initial estimate of the model and αβ-swap and α-expansion algorithms to refine it.
Abstract: We present an unsupervised approach for learning a generative layered representation of a scene from a video for motion segmentation. The learnt model is a composition of layers, which consist of one or more segments. Included in the model are the effects of image projection, lighting, and motion blur. The two main contributions of our method are: (i) a novel algorithm for obtaining the initial estimate of the model using efficient loopy belief propagation; and (ii) using αβ-swap and α-expansion algorithms, which guarantee a strong local minimum, for refining the initial estimate. Results are presented on several classes of objects with different types of camera motion. We compare our method with the state of the art and demonstrate significant improvements.
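
The refinement step minimizes a multi-label Potts-style energy; αβ-swap and α-expansion are graph-cut move algorithms with strong guarantees. To keep the sketch dependency-free, the code below minimizes the same kind of energy with plain ICM (iterated conditional modes), a much weaker optimizer used here only to make the energy concrete:

```python
# The paper refines layer assignments with graph-cut moves (alpha-beta
# swap, alpha expansion). This sketch minimizes the same Potts-style
# energy with simple ICM local moves instead: a far weaker optimizer,
# shown only to make the objective concrete.
import numpy as np

def potts_icm(layer_costs, smoothness=1.0, n_sweeps=5):
    # layer_costs: (L, H, W) per-pixel negative log-likelihood per layer.
    L, H, W = layer_costs.shape
    labels = layer_costs.argmin(axis=0)            # independent initialization
    for _ in range(n_sweeps):
        for y in range(H):
            for x in range(W):
                nb = [labels[yy, xx] for yy, xx in
                      ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                      if 0 <= yy < H and 0 <= xx < W]
                cost = layer_costs[:, y, x].copy()
                for l in range(L):                 # add Potts disagreement penalty
                    cost[l] += smoothness * sum(l != n for n in nb)
                labels[y, x] = cost.argmin()
    return labels
```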

Proceedings ArticleDOI
17 Oct 2005
TL;DR: Two areas of innovation are described: the first is to capture the 3-D appearance of the entire head, rather than just the face region, so that visual features such as the hairline can be exploited, and the second is to combine discriminative and 'generative' approaches for detection and recognition.
Abstract: The objective of this work is automatic detection and identification of individuals in unconstrained consumer video, given a minimal number of labelled faces as training data. Whilst much work has been done on (mainly frontal) face detection and recognition, current methods are not sufficiently robust to deal with the wide variations in pose and appearance found in such video. These include variations in scale, illumination, expression, partial occlusion, motion blur, etc. We describe two areas of innovation: the first is to capture the 3-D appearance of the entire head, rather than just the face region, so that visual features such as the hairline can be exploited. The second is to combine discriminative and 'generative' approaches for detection and recognition. Images rendered using the head model are used to train a discriminative tree-structured classifier giving efficient detection and pose estimates over a very wide pose range with three degrees of freedom. Subsequent verification of the identity is obtained using the head model in a 'generative' framework. We demonstrate excellent performance in detecting and identifying three characters and their poses in a TV situation comedy.

Proceedings ArticleDOI
01 Jan 2005
TL;DR: It is shown how a 3D model of a complex curved object can be easily extracted from a single 2D image, and that finding the smoothest 3D surface which projects exactly to a user-defined silhouette can be expressed as a quadratic optimization, a result which has not previously appeared in the large literature on the shape-from-silhouette problem.
Abstract: We show how a 3D model of a complex curved object can be easily extracted from a single 2D image. A user-defined silhouette is the key input; and we show that finding the smoothest 3D surface which projects exactly to this silhouette can be expressed as a quadratic optimization, a result which has not previously appeared in the large literature on the shape-from-silhouette problem. For simple models, this process can immediately yield a usable 3D model; but for more complex geometries the user will wish to further shape the surface. We show that a variety of editing operations—which can be defined either in the image or in 3D—can also be expressed as linear constraints on the 3D shape parameters. We extend the system to fit higher genus surfaces. Our method has several advantages over the system of Zhang et al. [ZDPSS01] and over systems such as SKETCH and Teddy.
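
The paper's central observation, that the smoothest surface consistent with a silhouette is a quadratic optimization, can be illustrated in one dimension: minimize a discrete smoothness energy subject to linear interpolation constraints via the KKT system. A toy analogue, not the paper's surface parameterization:

```python
# Toy analogue of 'smoothest shape subject to silhouette constraints':
# minimize ||L z||^2 (a discrete smoothness energy) subject to A z = b
# (linear constraints pinning some values), solved via the KKT system.
# The paper's actual surface parameterization and constraints differ.
import numpy as np

def smoothest(n, constraints):
    # Second-difference operator as the smoothness energy.
    L = np.diff(np.eye(n), n=2, axis=0)
    Q = 2.0 * L.T @ L
    idx, vals = zip(*constraints)          # e.g. [(0, 0.0), (25, 1.0), ...]
    A = np.zeros((len(idx), n))
    A[np.arange(len(idx)), list(idx)] = 1.0
    b = np.array(vals, dtype=float)
    # KKT system: [[Q, A^T], [A, 0]] [z; lam] = [0; b]
    K = np.block([[Q, A.T], [A, np.zeros((len(idx), len(idx)))]])
    rhs = np.concatenate([np.zeros(n), b])
    return np.linalg.solve(K, rhs)[:n]

curve = smoothest(50, [(0, 0.0), (25, 1.0), (49, 0.0)])  # smooth bump
```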

Proceedings ArticleDOI
20 Jun 2005
TL;DR: A system for automatic people tracking and activity recognition that builds a model of limb appearance from sparse stylized detections and reprocesses the video, using the learned appearance models to find people in unrestricted configurations.
Abstract: We present a system for automatic people tracking and activity recognition. Our basic approach to people-tracking is to build an appearance model for the person in the video. The video illustrates our method of using a stylized-pose detector. Our system builds a model of limb appearance from those sparse stylized detections. Our algorithm then reprocesses the video, using the learned appearance models to find people in unrestricted configurations. We can use our tracker to recover 3D configurations and activity labels. We assume we have a motion capture library where the 3D poses have been labeled offline with activity descriptions.

Journal ArticleDOI
24 Oct 2005
TL;DR: The identity of a target face can be determined by first proposing faces with similar pose, and then classifying the target face as one of the proposed faces or not, and the texture maps of the model can be automatically updated as new poses and expressions are detected.
Abstract: Progress in the automatic detection and identification of humans in video, given a minimal number of labelled faces as training data, is described. This is an extremely challenging problem owing to the many sources of variation in a person's imaged appearance: pose variation, scale, facial expression, illumination, partial occlusion, motion blur, etc. The method developed in this work combines approaches from computer vision, for detection and pose estimation, with those from machine learning for classification. A ‘generative’ model of a person's head is defined consisting of a coarse 3-D model and multiple texture maps. This allows faces to be rendered with a variety of facial expressions and at poses differing from those of the training data. It is shown that the identity of a target face can then be determined by first proposing faces with similar pose, and then classifying the target face as one of the proposed faces or not. Furthermore, the texture maps of the model can be automatically updated as new poses and expressions are detected. Results of detecting three characters in a TV situation comedy are demonstrated.

01 Jan 2005
TL;DR: In this paper, the authors explore the use of computer graphics and computer vision techniques in the history of art, focusing on analyzing the geometry of perspective paintings to learn about the perspectival skills of artists and explore the evolution of linear perspective in history.
Abstract: This paper explores the use of computer graphics and computer vision techniques in the history of art. The focus is on analysing the geometry of perspective paintings to learn about the perspectival skills of artists and explore the evolution of linear perspective in history. Algorithms for a systematic analysis of the two- and three-dimensional geometry of paintings are drawn from the work on “single-view reconstruction” and applied to interpreting works of art from the Italian Renaissance and later periods. Since a perspectival painting is not a photograph of an actual subject but an artificial construction subject to imaginative manipulation and inadvertent inaccuracies, the internal consistency of its geometry must be assessed before carrying out any geometric analysis. Some simple techniques to analyse the consistency and perspectival accuracy of the geometry of a painting are discussed. Moreover, this work presents new algorithms for generating new views of a painted scene or portions of it, analysing shapes and proportions of objects, filling in occluded areas, performing a complete three-dimensional reconstruction of a painting and a rigorous analysis of possible reconstruction ambiguities. The validity of the techniques described here is demonstrated on a number of historical paintings and frescoes. Whenever possible, the computer-generated results are compared to those obtained by art historians through careful manual analysis. This research represents a further attempt to build a constructive dialogue between two very different disciplines: computer science and history of art. Despite their fundamental differences, science and art can learn and be enriched by each other’s procedures. A longer and more detailed version of this paper may be found in [5].
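
A core primitive in this kind of perspective analysis is estimating a vanishing point from painted parallel edges. A minimal homogeneous-coordinates sketch (an illustrative primitive only; the paper's single-view reconstruction pipeline goes much further):

```python
# Estimate a vanishing point as the least-squares intersection of a set
# of (roughly) parallel painted edges, in homogeneous coordinates.
# Illustrative primitive only; the paper's pipeline goes much further.
import numpy as np

def vanishing_point(segments):
    # segments: list of ((x1, y1), (x2, y2)) image line segments.
    lines = [np.cross([x1, y1, 1.0], [x2, y2, 1.0])
             for (x1, y1), (x2, y2) in segments]
    A = np.array(lines)
    # The vanishing point v minimizes ||A v|| with ||v|| = 1: the right
    # singular vector of A with the smallest singular value.
    v = np.linalg.svd(A)[2][-1]
    return v[:2] / v[2]   # inhomogeneous coordinates (undefined at infinity)
```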