
Showing papers by Andrew Zisserman published in 2011


Proceedings ArticleDOI
01 Jan 2011
TL;DR: A rigorous evaluation of novel encodings for bag of visual words models that identifies which aspects of each method are particularly important to achieving good performance and which are less critical, enabling a consistent comparative analysis of these encoding methods.
Abstract: A large number of novel encodings for bag of visual words models have been proposed in the past two years to improve on the standard histogram of quantized local features. Examples include locality-constrained linear encoding [23], improved Fisher encoding [17], super vector encoding [27], and kernel codebook encoding [20]. While several authors have reported very good results on the challenging PASCAL VOC classification data by means of these new techniques, differences in the feature computation and learning algorithms, missing details in the description of the methods, and different tuning of the various components make it impossible to compare these methods directly and hard to reproduce the reported results. This paper addresses these shortcomings by carrying out a rigorous evaluation of these new techniques: (1) fixing the other elements of the pipeline (features, learning, tuning); (2) disclosing all the implementation details; and (3) identifying both those aspects of each method which are particularly important to achieve good performance, and those which are less critical. This allows a consistent comparative analysis of these encoding methods. Several conclusions drawn from our analysis cannot be inferred from the original publications.

980 citations
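
To make the comparison concrete, here is a small illustrative sketch (ours, not the paper's code) of two of the encodings being compared: the standard hard-assignment histogram and kernel-codebook soft assignment. Descriptor dimensionality, vocabulary size, and the kernel bandwidth are placeholder values.

```python
import numpy as np

def hard_histogram(descriptors, vocabulary):
    """Standard BoW: count the nearest visual word of each descriptor."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(axis=1), minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

def kernel_codebook(descriptors, vocabulary, sigma=3.0):
    """Kernel codebook: soft-assign each descriptor to every visual word."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)            # normalise per descriptor
    enc = w.sum(axis=0)
    return enc / enc.sum()

desc = np.random.rand(500, 128)                  # stand-in local descriptors
vocab = np.random.rand(64, 128)                  # stand-in visual vocabulary
print(hard_histogram(desc, vocab).shape, kernel_codebook(desc, vocab).shape)
```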


Proceedings ArticleDOI
06 Nov 2011
TL;DR: This work proposes three transfer learning formulations where a template learnt previously for other categories is used to regularize the training of a new category, all of which result in convex optimization problems.
Abstract: Our objective is transfer training of a discriminatively trained object category detector, in order to reduce the number of training images required. To this end we propose three transfer learning formulations where a template learnt previously for other categories is used to regularize the training of a new category. All the formulations result in convex optimization problems. Experiments (on PASCAL VOC) demonstrate significant performance gains by transfer learning from one class to another (e.g. motorbike to bicycle), including one-shot learning, specialization from class to a subordinate class (e.g. from quadruped to horse) and transfer using multiple components. In the case of multiple training samples it is shown that a detection performance approaching that of the state of the art can be achieved with substantially fewer training samples.

402 citations
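
A minimal sketch of the regularization idea: assuming one formulation amounts to replacing the usual ||w||^2 penalty of a linear SVM with ||w - w_src||^2, the new detector is pulled towards the previously learnt template. The subgradient solver and all variable names are ours, not the paper's.

```python
import numpy as np

def train_transfer_svm(X, y, w_src, lam=1.0, lr=1e-2, epochs=200):
    """Minimise  lam/2 * ||w - w_src||^2 + (1/n) * sum_i hinge(y_i * w.x_i)."""
    n = len(y)
    w = w_src.copy()
    for _ in range(epochs):
        active = y * (X @ w) < 1                 # samples violating the margin
        hinge_grad = -(y[active, None] * X[active]).sum(axis=0) / n
        w -= lr * (lam * (w - w_src) + hinge_grad)
    return w

# one-shot flavour: a single positive plus a few negatives, with a
# hypothetical source template (e.g. motorbike) as the prior
rng = np.random.default_rng(0)
w_motorbike = rng.standard_normal(100)
X, y = rng.standard_normal((6, 100)), np.array([1, -1, -1, -1, -1, -1])
w_bicycle = train_transfer_svm(X, y, w_motorbike)
```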


Journal ArticleDOI
TL;DR: A multimodal approach employing text, metadata, and visual features is used to gather many high-quality images from the Web, automatically generating a large number of images for a specified object class.
Abstract: The objective of this work is to automatically generate a large number of images for a specified object class. A multimodal approach employing text, metadata, and visual features is used to gather many high-quality images from the Web. Candidate images are obtained by a text-based Web search querying on the object identifier (e.g., the word penguin). The Web pages and the images they contain are downloaded. The task is then to remove irrelevant images and rerank the remainder. First, the images are reranked based on the text surrounding the image and metadata features. A number of methods are compared for this reranking. Second, the top-ranked images are used as (noisy) training data and an SVM visual classifier is learned to improve the ranking further. We investigate the sensitivity of the cross-validation procedure to this noisy training data. The principal novelty of the overall method is in combining text/metadata and visual features in order to achieve a completely automatic ranking of the images. Examples are given for a selection of animals, vehicles, and other classes, totaling 18 classes. The results are assessed by precision/recall curves on ground-truth annotated data and by comparison to previous approaches, including those of Berg and Forsyth [5] and Fergus et al. [12].

369 citations
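
A rough sketch of the two-stage reranking (our paraphrase: feature extraction is stubbed out, negatives are passed in rather than drawn from other classes, and scikit-learn is our choice of solver, not the paper's).

```python
import numpy as np
from sklearn.svm import LinearSVC

def rerank(text_scores, visual_feats, neg_feats, n_noisy_pos=100):
    """Stage 1: rank by text/metadata score. Stage 2: visual SVM rerank."""
    order = np.argsort(-text_scores)             # text-based ranking
    pos = visual_feats[order[:n_noisy_pos]]      # top ranked = noisy positives
    X = np.vstack([pos, neg_feats])
    y = np.r_[np.ones(len(pos)), -np.ones(len(neg_feats))]
    clf = LinearSVC().fit(X, y)                  # visual classifier
    return np.argsort(-clf.decision_function(visual_feats))
```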


Proceedings ArticleDOI
01 Jan 2011
TL;DR: A two-stage hand detector that exceeds the state of the art on two public datasets, including the PASCAL VOC 2010 human layout challenge, together with a fully annotated hand dataset introduced for training and testing.
Abstract: We describe a two-stage method for detecting hands and their orientation in unconstrained images. The first stage uses three complementary detectors to propose hand bounding boxes. Each bounding box is then scored by the three detectors independently, and a second-stage classifier is learnt to compute a final confidence score for the proposals using these features. We make the following contributions: (i) we add context-based and skin-based proposals to a sliding-window shape-based detector to increase recall; (ii) we develop a new method of non-maximum suppression based on super-pixels; and (iii) we introduce a fully annotated hand dataset for training and testing. We show that the hand detector exceeds the state of the art on two public datasets, including the PASCAL VOC 2010 human layout challenge.

280 citations
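
The second stage can be pictured as a small fusion classifier over the three per-proposal detector scores. Logistic regression is our stand-in here; the paper's actual second-stage classifier and features may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_second_stage(shape_s, context_s, skin_s, labels):
    """One score per detector per proposal; learn to fuse them."""
    feats = np.column_stack([shape_s, context_s, skin_s])
    return LogisticRegression().fit(feats, labels)

def final_confidence(model, shape_s, context_s, skin_s):
    feats = np.column_stack([shape_s, context_s, skin_s])
    return model.predict_proba(feats)[:, 1]      # fused confidence per proposal
```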


Journal ArticleDOI
01 Nov 2011
TL;DR: A new deblurring algorithm is proposed that locates error-prone bright pixels in the latent sharp image and, by decoupling them from the remainder of the latent image, greatly reduces ringing.
Abstract: We address the problem of deblurring images degraded by camera shake blur and saturated or over-exposed pixels. Saturated pixels are a problem for existing non-blind deblurring algorithms because they violate the assumption that the image formation process is linear, and often cause significant artifacts in deblurred outputs. We propose a forward model that includes sensor saturation, and use it to derive a deblurring algorithm that properly treats saturated pixels. By using this forward model and reasoning about the causes of artifacts in the deblurred results, we obtain significantly better results than existing deblurring algorithms. Further, we propose an efficient approximation of the forward model leading to a significant speed-up.

186 citations
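
A toy version of the kind of forward model the abstract describes: blur by the shake kernel, add noise, then clip at the sensor's saturation level. The kernel, noise model, and clip level are placeholders. The clipping is exactly the non-linearity that breaks the linear image-formation assumption of standard non-blind deblurring.

```python
import numpy as np
from scipy.signal import convolve2d

def saturated_forward_model(latent, kernel, sat_level=1.0, noise_std=0.01):
    """Camera-shake blur followed by sensor saturation (a toy sketch)."""
    blurred = convolve2d(latent, kernel, mode="same", boundary="symm")
    noisy = blurred + np.random.normal(0.0, noise_std, blurred.shape)
    return np.clip(noisy, 0.0, sat_level)        # saturation = non-linearity
```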


Proceedings ArticleDOI
06 Nov 2011
TL;DR: A new scalable, alternation-based algorithm for co-segmentation, BiCoS, is introduced, which is simpler than many of its predecessors, and yet has superior performance on standard benchmark image datasets.
Abstract: The objective of this paper is the unsupervised segmentation of image training sets into foreground and background in order to improve image classification performance. To this end we introduce a new scalable, alternation-based algorithm for co-segmentation, BiCoS, which is simpler than many of its predecessors, and yet has superior performance on standard benchmark image datasets.

183 citations
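
The alternation at the heart of the method can be sketched as follows. BiCoS itself alternates between per-image segmentation and a shared classifier over superpixel descriptors; here a nearest-mean-colour model stands in for both stages, purely to show the structure of the loop.

```python
import numpy as np

def cosegment(images, init_masks, n_iters=5):
    """images: list of HxWx3 float arrays; init_masks: list of HxW bools.
    Assumes every mask keeps some foreground and background pixels."""
    masks = [m.copy() for m in init_masks]
    for _ in range(n_iters):
        # (a) fit a shared appearance model from all current masks
        fg = np.concatenate([im[m] for im, m in zip(images, masks)])
        bg = np.concatenate([im[~m] for im, m in zip(images, masks)])
        mu_fg, mu_bg = fg.mean(axis=0), bg.mean(axis=0)
        # (b) re-segment each image against the shared model
        masks = [((im - mu_fg) ** 2).sum(-1) < ((im - mu_bg) ** 2).sum(-1)
                 for im in images]
    return masks
```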


Proceedings Article
12 Dec 2011
TL;DR: This paper shows that if, instead of a flat partitioning, the image is represented by a hierarchical segmentation tree, then the resulting energy combining unary and boundary terms can still be optimized using graph cut (with all the corresponding benefits of global optimality and efficiency).
Abstract: Graph cut optimization is one of the standard workhorses of image segmentation since for binary random field representations of the image, it gives globally optimal results and there are efficient polynomial time implementations. Often, the random field is applied over a flat partitioning of the image into non-intersecting elements, such as pixels or super-pixels. In this paper we show that if, instead of a flat partitioning, the image is represented by a hierarchical segmentation tree, then the resulting energy combining unary and boundary terms can still be optimized using graph cut (with all the corresponding benefits of global optimality and efficiency). As a result of such inference, the image gets partitioned into a set of segments that may come from different layers of the tree. We apply this formulation, which we call the pylon model, to the task of semantic segmentation where the goal is to separate an image into areas belonging to different semantic classes. The experiments highlight the advantage of inference on a segmentation tree (over a flat partitioning) and demonstrate that the optimization in the pylon model is able to flexibly choose the level of segmentation across the image. Overall, the proposed system has superior segmentation accuracy on several datasets (Graz-02, Stanford background) compared to previously suggested approaches.

170 citations
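
A toy illustration of the pylon model's search space (not its graph-cut solver): a valid labelling selects, for every leaf of the segmentation tree, exactly one labelled ancestor (possibly the leaf itself), so the chosen segments may come from different layers.

```python
def is_valid_selection(parent, selected, leaves):
    """parent: node -> parent id (root maps to -1); selected: set of
    node ids chosen to carry a class label."""
    for leaf in leaves:
        node, hits = leaf, 0
        while node != -1:
            hits += node in selected
            node = parent[node]
        if hits != 1:                    # each pixel explained exactly once
            return False
    return True

parent = {0: -1, 1: 0, 2: 0}             # root 0 with leaves 1 and 2
print(is_valid_selection(parent, {0}, [1, 2]))     # True: root covers both leaves
print(is_valid_selection(parent, {0, 1}, [1, 2]))  # False: leaf 1 covered twice
```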


Proceedings ArticleDOI
06 Nov 2011
TL;DR: The template-based model is used to detect a distinctive part for the class, and the rest of the object is then detected via segmentation on image-specific information learnt from that part, achieving accuracy comparable to the state of the art on the PASCAL VOC competition, which includes other models such as bag-of-words.
Abstract: Template-based object detectors such as the deformable parts model of Felzenszwalb et al. [11] achieve state-of-the-art performance for a variety of object categories, but are still outperformed by simpler bag-of-words models for highly flexible objects such as cats and dogs. In these cases we propose to use the template-based model to detect a distinctive part for the class, followed by detecting the rest of the object via segmentation on image-specific information learnt from that part. This approach is motivated by two observations: (i) many object classes contain distinctive parts that can be detected very reliably by template-based detectors, whilst the entire object cannot; (ii) many classes (e.g. animals) have fairly homogeneous coloring and texture that can be used to segment the object once a sample is provided in an image. We show quantitatively that our method substantially outperforms whole-body template-based detectors for these highly deformable object categories, and indeed achieves accuracy comparable to the state of the art on the PASCAL VOC competition, which includes other models such as bag-of-words.

148 citations
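
The "detect a part, then segment the rest" idea can be caricatured with a colour histogram learnt inside the detected part's bounding box. The part detector is stubbed as an input box, and the paper's actual segmentation stage is considerably more sophisticated than this.

```python
import numpy as np

def segment_from_part(image, part_box, n_bins=8):
    """image: HxWx3 floats in [0,1]; part_box: (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = part_box
    q = np.minimum((image * n_bins).astype(int), n_bins - 1)
    idx = q[..., 0] * n_bins**2 + q[..., 1] * n_bins + q[..., 2]
    fg_hist = np.bincount(idx[y0:y1, x0:x1].ravel(), minlength=n_bins**3)
    bg_hist = np.bincount(idx.ravel(), minlength=n_bins**3)
    fg = fg_hist / max(fg_hist.sum(), 1)     # colour model from the part
    bg = bg_hist / max(bg_hist.sum(), 1)     # global colour model
    return fg[idx] > bg[idx]                 # crude whole-object mask
```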


Proceedings ArticleDOI
06 Nov 2011
TL;DR: A scalable approach to 3D smooth object retrieval which searches for and localizes all the occurrences of a user outlined object in a dataset of images in real time is described.
Abstract: We describe a scalable approach to 3D smooth object retrieval which searches for and localizes all the occurrences of a user outlined object in a dataset of images in real time. The approach is illustrated on sculptures.

112 citations


Journal ArticleDOI
TL;DR: The Geometric Latent Dirichlet Allocation model for unsupervised discovery of particular objects in unordered image collections is introduced, and it is shown how "hub images" (images representative of an object or landmark) can easily be extracted from the matching graph representation.
Abstract: Given a large-scale collection of images our aim is to efficiently associate images which contain the same entity, for example a building or object, and to discover the significant entities. To achieve this, we introduce the Geometric Latent Dirichlet Allocation (gLDA) model for unsupervised discovery of particular objects in unordered image collections. This explicitly represents images as mixtures of particular objects or facades, and builds rich latent topic models which incorporate the identity and locations of visual words specific to the topic in a geometrically consistent way. Applying standard inference techniques to this model enables images likely to contain the same object to be probabilistically grouped and ranked. Additionally, to reduce the computational cost of applying the gLDA model to large datasets, we propose a scalable method that first computes a matching graph over all the images in a dataset. This matching graph connects images that contain the same object, and rough image groups can be mined from this graph using standard clustering techniques. The gLDA model can then be applied to generate a more nuanced representation of the data. We also discuss how "hub images" (images representative of an object or landmark) can easily be extracted from our matching graph representation. We evaluate our techniques on the publicly available Oxford buildings dataset (5K images) and show examples of automatically mined objects. The methods are evaluated quantitatively on this dataset using a ground truth labeling for a number of Oxford landmarks. To demonstrate the scalability of the matching graph method, we show qualitative results on two larger datasets of images taken of the Statue of Liberty (37K images) and Rome (1M+ images).

85 citations
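
The matching-graph step (the gLDA model itself is beyond a snippet) can be sketched with a union-find over spatially verified image pairs; rough object groups are then the connected components. A hub image for each group could subsequently be picked as its most-connected node.

```python
def mine_groups(n_images, verified_pairs):
    """verified_pairs: iterable of (i, j) image pairs that pass spatial
    verification; returns rough groups as connected components."""
    parent = list(range(n_images))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x
    for i, j in verified_pairs:
        parent[find(i)] = find(j)            # union the two components
    groups = {}
    for i in range(n_images):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# images 0-1-2 match each other, 3-4 match, 5 is isolated
print(mine_groups(6, [(0, 1), (1, 2), (3, 4)]))
```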


Journal ArticleDOI
TL;DR: This work detects and tracks the articulated pose of a human in signing videos of more than one hour in length, and proposes a complete generative model which accounts for self-occlusion of the arms.
Abstract: The goal of this work is to detect and track the articulated pose of a human in signing videos of more than one hour in length. In particular we wish to accurately localise hands and arms, despite fast motion and a cluttered and changing background. We cast the problem as inference in a generative model of the image, and propose a complete model which accounts for self-occlusion of the arms. Under this model, limb detection is expensive due to the very large number of possible configurations each part can assume. We make the following contributions to reduce this cost: (i) efficient sampling from a pictorial structure proposal distribution to obtain reasonable configurations; (ii) identifying a large number of frames where configurations can be correctly inferred, and exploiting temporal tracking elsewhere. Results are reported for signing footage with challenging image conditions and for different signers. We show that the method is able to identify the true arm and hand locations with high reliability. The results exceed the state-of-the-art for the length and stability of continuous limb tracking.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: Standard cinematography practice is to first establish which characters are looking at each other using a medium or wide shot, and then edit subsequent close-up shots so that the eyelines match the point of view of the characters.
Abstract: If you read any book on film editing or listen to a director’s commentary on a DVD, then what emerges again and again is the importance of eyelines. Standard cinematography practice is to first establish which characters are looking at each other using a medium or wide shot, and then edit subsequent close-up shots so that the eyelines match the point of view of the characters. This is the basis of the well-known 180° rule in editing.

Proceedings ArticleDOI
01 Nov 2011
TL;DR: It is shown that a large dataset of roughly 3400 humans can be automatically acquired very cheaply using the Kinect, and that the method can be completely automated, segmenting humans given only the images without requiring a bounding box; the approach is also compared with a previous state-of-the-art method.
Abstract: The Kinect provides an opportunity to collect large quantities of training data for visual learning algorithms relatively effortlessly. To this end we investigate learning to automatically segment humans from cluttered images (without depth information) given a bounding box. For this algorithm, obtaining a large dataset of images with segmented humans is crucial as it enables the possible variations in human appearances and backgrounds to be learnt. We show that a large dataset of roughly 3400 humans can be automatically acquired very cheaply using the Kinect. Segmenting humans is then cast as a learning problem with linear classifiers trained to predict segmentation masks from sparsely coded local HOG descriptors. These classifiers introduce top-down knowledge to obtain a crude segmentation of the human which is then refined using bottom-up information from local color models in a Snap-Cut [2]-like fashion. The method is quantitatively evaluated on images of humans in cluttered scenes, and a high performance obtained (88.5% overlap score). We also show that the method can be completely automated - segmenting humans given only the images, without requiring a bounding box - and compare with a previous state of the art method.
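
A rough sketch of the top-down stage, with plain ridge regression on random stand-in features in place of the paper's per-pixel linear classifiers over sparse-coded HOG descriptors.

```python
import numpy as np

def fit_mask_predictor(F, M, lam=1.0):
    """F: (n, d) window features; M: (n, p) flattened binary masks.
    Returns a (d, p) linear map from features to a coarse mask."""
    d = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(d), F.T @ M)

F = np.random.rand(200, 64)                   # stand-in for coded HOG
M = (np.random.rand(200, 32 * 32) > 0.5).astype(float)
W = fit_mask_predictor(F, M)
coarse_mask = (F[:1] @ W).reshape(32, 32) > 0.5   # crude top-down mask
```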

Proceedings ArticleDOI
06 Nov 2011
TL;DR: A generalization of structured output SVM regressors is proposed that can incorporate equivariance and invariance into a convex training procedure, enabling the incorporation of large families of transformations while maintaining optimality and tractability.
Abstract: Equivariance and invariance are often desired properties of a computer vision system. However, currently available strategies generally rely on virtual sampling, leaving open the question of how many samples are necessary, on the use of invariant feature representations, which can mistakenly discard information relevant to the vision task, or on the use of latent variable models, which result in non-convex training and expensive inference at test time. We propose here a generalization of structured output SVM regressors that can incorporate equivariance and invariance into a convex training procedure, enabling the incorporation of large families of transformations, while maintaining optimality and tractability. Importantly, test time inference does not require the estimation of latent variables, resulting in highly efficient objective functions. This results in a natural formulation for treating equivariance and invariance that is easily implemented as an adaptation of off-the-shelf optimization software, obviating the need for ad hoc sampling strategies. Theoretical results relating to vicinal risk, and experiments on challenging aerial car and pedestrian detection tasks show the effectiveness of the proposed solution.

Book ChapterDOI
18 Sep 2011
TL;DR: This work enables the user to select a query Region Of Interest (ROI) and automatically detects the corresponding regions within all returned images, allowing the returned images to be ranked on the content of the ROI rather than the entire image.
Abstract: The objective of this work is a scalable, real-time visual search engine for medical images. In contrast to existing systems that retrieve images that are globally similar to a query image, we enable the user to select a query Region Of Interest (ROI) and automatically detect the corresponding regions within all returned images. This allows the returned images to be ranked on the content of the ROI, rather than the entire image. Our contribution is two-fold: (i) immediate retrieval - the data is appropriately pre-processed so that the search engine returns results in real-time for any query image and ROI; (ii) structured output - returning ROIs with a choice of ranking functions. The retrieval performance is assessed on a number of annotated queries for images from the IRMA X-ray dataset and compared to a baseline.
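
A minimal sketch of ROI-restricted retrieval as we read the abstract: only the visual words whose keypoints fall inside the user's ROI query a precomputed inverted index. The tf-style count below is a placeholder for the paper's choice of ranking functions.

```python
from collections import defaultdict

def build_index(db_words):
    """db_words: per-image lists of visual word ids."""
    index = defaultdict(set)
    for img_id, words in enumerate(db_words):
        for w in words:
            index[w].add(img_id)
    return index

def query_roi(index, query_words, query_xy, roi):
    x0, y0, x1, y1 = roi
    scores = defaultdict(int)
    for w, (x, y) in zip(query_words, query_xy):
        if x0 <= x < x1 and y0 <= y < y1:     # keep only words inside the ROI
            for img_id in index[w]:
                scores[img_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

index = build_index([[1, 2, 3], [2, 4], [1, 5]])
print(query_roi(index, [1, 2], [(5, 5), (50, 50)], roi=(0, 0, 10, 10)))
```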

01 Jan 2011
TL;DR: A learning algorithm is developed that counts the number of cells in a large field-of-view image automatically and can be used to investigate colony growth in time-lapse sequences, with images acquired using a novel, small, and cost-effective diffraction device.
Abstract: We have developed a learning algorithm that counts the number of cells in a large field-of-view image automatically, and can be used to investigate colony growth in time-lapse sequences. The images are acquired using a novel, small, and cost-effective diffraction device that can be placed in an incubator during acquisition. This device, termed a CyMap, contains a resonant cavity LED and CMOS camera with no additional optical components or lenses. The counting method is based on structured output learning, and involves segmentation and computation using a random forest. We show that the algorithm can accurately count thousands of cells in a time suitable for immediate analysis of time-lapse sequences. Performance is measured using ground truth annotation from registered images acquired under a different modality.
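
One way to picture the counting pipeline, with per-pixel random-forest classification followed by connected-component counting. This is our simplification: the paper couples the random forest with structured output learning rather than counting components directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.ndimage import label

def count_cells(pixel_feats, train_feats, train_labels, image_shape):
    """pixel_feats: (H*W, d) per-pixel features of the test image;
    train_feats/train_labels: annotated cell vs. background pixels."""
    rf = RandomForestClassifier(n_estimators=50).fit(train_feats, train_labels)
    mask = rf.predict(pixel_feats).reshape(image_shape)
    _, n_cells = label(mask)                  # each connected blob = one cell
    return n_cells
```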