Showing papers by "Andrew Zisserman published in 2012"


Proceedings ArticleDOI
16 Jun 2012
TL;DR: A new method to compare SIFT descriptors (RootSIFT) which yields superior performance without increasing processing or storage requirements, and a novel method for query expansion where a richer model for the query is learnt discriminatively in a form suited to immediate retrieval through efficient use of the inverted index.
Abstract: The objective of this work is object retrieval in large scale image datasets, where the object is specified by an image query and retrieval should be immediate at run time in the manner of Video Google [28]. We make the following three contributions: (i) a new method to compare SIFT descriptors (RootSIFT) which yields superior performance without increasing processing or storage requirements; (ii) a novel method for query expansion where a richer model for the query is learnt discriminatively in a form suited to immediate retrieval through efficient use of the inverted index; (iii) an improvement of the image augmentation method proposed by Turcot and Lowe [29], where only the augmenting features which are spatially consistent with the augmented image are kept. We evaluate these three methods over a number of standard benchmark datasets (Oxford Buildings 5k and 105k, and Paris 6k) and demonstrate substantial improvements in retrieval performance whilst maintaining immediate retrieval speeds. Combining these complementary methods achieves a new state-of-the-art performance on these datasets.

1,463 citations
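
The RootSIFT mapping of contribution (i) is simple enough to sketch. Below is a minimal NumPy illustration, not the authors' code: each SIFT descriptor is L1-normalised and square-rooted element-wise, so that Euclidean comparisons on the result correspond to the Hellinger kernel on the original descriptors.

import numpy as np

def root_sift(descriptors, eps=1e-7):
    # L1-normalise each 128-D SIFT descriptor, then take the element-wise
    # square root; Euclidean distance between the mapped vectors behaves
    # like the Hellinger kernel on the original histograms.
    d = np.asarray(descriptors, dtype=np.float64)
    d = d / (d.sum(axis=1, keepdims=True) + eps)
    return np.sqrt(d)

# e.g. a drop-in replacement wherever raw SIFT descriptors were used
sift = np.random.rand(10, 128)          # stand-in for real SIFT output
rsift = root_sift(sift)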


Proceedings ArticleDOI
16 Jun 2012
TL;DR: A model combining shape and appearance is introduced for fine-grained pet breed classification; it beats all previously published results on the challenging ASIRRA test (cat vs dog discrimination) and, on the task of discriminating the 37 different breeds of pets, obtains an average accuracy of about 59%, a very encouraging result considering the difficulty of the problem.
Abstract: We investigate the fine grained object categorization problem of determining the breed of animal from an image. To this end we introduce a new annotated dataset of pets covering 37 different breeds of cats and dogs. The visual problem is very challenging as these animals, particularly cats, are very deformable and there can be quite subtle differences between the breeds. We make a number of contributions: first, we introduce a model to classify a pet breed automatically from an image. The model combines shape, captured by a deformable part model detecting the pet face, and appearance, captured by a bag-of-words model that describes the pet fur. Fitting the model involves automatically segmenting the animal in the image. Second, we compare two classification approaches: a hierarchical one, in which a pet is first assigned to the cat or dog family and then to a breed, and a flat one, in which the breed is obtained directly. We also investigate a number of animal and image orientated spatial layouts. These models are very good: they beat all previously published results on the challenging ASIRRA test (cat vs dog discrimination). When applied to the task of discriminating the 37 different breeds of pets, the models obtain an average accuracy of about 59%, a very encouraging result considering the difficulty of the problem.

1,076 citations


Journal ArticleDOI
TL;DR: This work introduces explicit feature maps for the additive class of kernels, such as the intersection, Hellinger's, and χ2 kernels, commonly used in computer vision, and enables their use in large scale problems.
Abstract: Large scale nonlinear support vector machines (SVMs) can be approximated by linear ones using a suitable feature map. The linear SVMs are in general much faster to learn and evaluate (test) than the original nonlinear SVMs. This work introduces explicit feature maps for the additive class of kernels, such as the intersection, Hellinger's, and χ2 kernels, commonly used in computer vision, and enables their use in large scale problems. In particular, we: 1) provide explicit feature maps for all additive homogeneous kernels along with closed form expression for all common kernels; 2) derive corresponding approximate finite-dimensional feature maps based on a spectral analysis; and 3) quantify the error of the approximation, showing that the error is independent of the data dimension and decays exponentially fast with the approximation order for selected kernels such as χ2. We demonstrate that the approximations have indistinguishable performance from the full kernels yet greatly reduce the train/test times of SVMs. We also compare with two other approximation methods: Nystrom's approximation of Perronnin et al. [1], which is data dependent, and the explicit map of Maji and Berg [2] for the intersection kernel, which, as in the case of our approximations, is data independent. The approximations are evaluated on a number of standard data sets, including Caltech-101 [3], Daimler-Chrysler pedestrians [4], and INRIA pedestrians [5].

804 citations
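
As a usage illustration, scikit-learn's AdditiveChi2Sampler implements an explicit feature map of this kind for the χ2 kernel. A minimal sketch (toy data, default parameters) of replacing a non-linear SVM with the feature map followed by a linear SVM:

import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for L1-normalised bag-of-visual-words histograms and labels
X = np.random.rand(200, 64)
X = X / X.sum(axis=1, keepdims=True)
y = np.random.randint(0, 2, size=200)

# Approximate chi2-kernel SVM: explicit feature map then a linear SVM.
# sample_steps plays the role of the approximation order; the mapped
# dimension is (2 * sample_steps - 1) times the input dimension.
clf = make_pipeline(AdditiveChi2Sampler(sample_steps=2), LinearSVC(C=1.0))
clf.fit(X, y)
print(clf.score(X, y))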


Journal ArticleDOI
TL;DR: A new parametrized geometric model of the blurring process in terms of the rotational motion of the camera during exposure is proposed, able to capture non-uniform blur in an image due to camera shake using a single global descriptor, and can be substituted into existing deblurring algorithms with only small modifications.
Abstract: Photographs taken in low-light conditions are often blurry as a result of camera shake, i.e. a motion of the camera while its shutter is open. Most existing deblurring methods model the observed blurry image as the convolution of a sharp image with a uniform blur kernel. However, we show that blur from camera shake is in general mostly due to the 3D rotation of the camera, resulting in a blur that can be significantly non-uniform across the image. We propose a new parametrized geometric model of the blurring process in terms of the rotational motion of the camera during exposure. This model is able to capture non-uniform blur in an image due to camera shake using a single global descriptor, and can be substituted into existing deblurring algorithms with only small modifications. To demonstrate its effectiveness, we apply this model to two deblurring problems; first, the case where a single blurry image is available, for which we examine both an approximate marginalization approach and a maximum a posteriori approach, and second, the case where a sharp but noisy image of the scene is available in addition to the blurry image. We show that our approach makes it possible to model and remove a wider class of blurs than previous approaches, including uniform blur as a special case, and demonstrate its effectiveness with experiments on synthetic and real images.

656 citations
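
A toy sketch of the geometric model (not the authors' implementation, with a hypothetical intrinsic matrix and image file): the blurry image is synthesised as a weighted sum of copies of the sharp image warped by the homographies K R K^{-1} induced by camera rotations, with the weights over the rotation grid playing the role of the single global blur descriptor.

import numpy as np
import cv2

def rotation_homography(K, rx, ry, rz):
    # Homography induced by a pure camera rotation (angles in radians)
    R, _ = cv2.Rodrigues(np.array([rx, ry, rz], dtype=np.float64))
    return K @ R @ np.linalg.inv(K)

def synthesise_blur(sharp, K, rotations, weights):
    # Weighted sum of rotation-warped copies of the sharp image
    h, w = sharp.shape[:2]
    acc = np.zeros_like(sharp, dtype=np.float32)
    for (rx, ry, rz), wt in zip(rotations, weights):
        H = rotation_homography(K, rx, ry, rz)
        acc += wt * cv2.warpPerspective(sharp, H, (w, h))
    return acc / np.sum(weights)

# Hypothetical usage: a small in-plane shake about the optical axis
sharp = cv2.imread("sharp.png").astype(np.float32)       # hypothetical input image
K = np.array([[800., 0., sharp.shape[1] / 2],
              [0., 800., sharp.shape[0] / 2],
              [0., 0., 1.]])
rotations = [(0., 0., a) for a in np.linspace(-0.01, 0.01, 11)]
blurry = synthesise_blur(sharp, K, rotations, np.ones(len(rotations)))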


Journal ArticleDOI
TL;DR: This work proposes and evaluates techniques for searching a video dataset for people in a specific pose, developing three new pose descriptors and comparing their classification and retrieval performance to two baselines built on state-of-the-art object detection models.
Abstract: We present a technique for estimating the spatial layout of humans in still images--the position of the head, torso and arms. The theme we explore is that once a person is localized using an upper body detector, the search for their body parts can be considerably simplified using weak constraints on position and appearance arising from that detection. Our approach is capable of estimating upper body pose in highly challenging uncontrolled images, without prior knowledge of background, clothing, lighting, or the location and scale of the person in the image. People are only required to be upright and seen from the front or the back (not side). We evaluate the stages of our approach experimentally using ground truth layout annotation on a variety of challenging material, such as images from the PASCAL VOC 2008 challenge and video frames from TV shows and feature films. We also propose and evaluate techniques for searching a video dataset for people in a specific pose. To this end, we develop three new pose descriptors and compare their classification and retrieval performance to two baselines built on state-of-the-art object detection models.

261 citations


Book ChapterDOI
01 Oct 2012
TL;DR: A machine learning-based cell detection method applicable to different modalities is proposed; state-of-the-art cell detection accuracy is achieved for H&E-stained histology, fluorescence, and phase-contrast images.
Abstract: Cell detection in microscopy images is an important step in the automation of cell-based experiments. We propose a machine learning-based cell detection method applicable to different modalities. The method consists of three steps: first, a set of candidate cell-like regions is identified. Then, each candidate region is evaluated using a statistical model of the cell appearance. Finally, dynamic programming picks a set of non-overlapping regions that match the model. The cell model requires few images with simple dot annotation for training and can be learned within a structured SVM framework. In the reported experiments, state-of-the-art cell detection accuracy is achieved for H&E-stained histology, fluorescence, and phase-contrast images.

195 citations
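
A schematic sketch of the three-step pipeline, not the authors' code: MSER is used here as a stand-in candidate proposer, the appearance model is left as a user-supplied scoring function, and a greedy overlap test stands in for the dynamic-programming selection over the candidate set.

import numpy as np
import cv2

def iou(a, b):
    # Intersection-over-union of two (x, y, w, h) boxes
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / float(aw * ah + bw * bh - inter)

def detect_cells(gray, score_fn, max_overlap=0.2):
    # Step 1: candidate cell-like regions (MSER as the proposer here)
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)
    # Step 2: score each candidate (score_fn is a stand-in for the learnt cell model)
    scored = sorted(zip(bboxes, (score_fn(gray, b) for b in bboxes)),
                    key=lambda t: -t[1])
    # Step 3: keep non-overlapping candidates, best first (greedy stand-in for DP)
    kept = []
    for box, s in scored:
        if all(iou(box, k) < max_overlap for k, _ in kept):
            kept.append((box, s))
    return kept

# Dummy usage: score candidates by how dark they are (hypothetical image file)
img = cv2.imread("plate.png", cv2.IMREAD_GRAYSCALE)
cells = detect_cells(img, lambda im, b: -im[b[1]:b[1]+b[3], b[0]:b[0]+b[2]].mean())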


Journal ArticleDOI
TL;DR: It is shown that inference can be carried out with polynomial complexity in the number of people, and an efficient algorithm for this is described; the method is evaluated on a new dataset comprising 300 video clips acquired from 23 different TV shows and on the benchmark UT-Interaction dataset.
Abstract: The objective of this work is recognition and spatiotemporal localization of two-person interactions in video. Our approach is person-centric. As a first stage we track all upper bodies and heads in a video using a tracking-by-detection approach that combines detections with KLT tracking and clique partitioning, together with occlusion detection, to yield robust person tracks. We develop local descriptors of activity based on the head orientation (estimated using a set of pose-specific classifiers) and the local spatiotemporal region around them, together with global descriptors that encode the relative positions of people as a function of interaction type. Learning and inference on the model uses a structured output SVM which combines the local and global descriptors in a principled manner. Inference using the model yields information about which pairs of people are interacting, their interaction class, and their head orientation (which is also treated as a variable, enabling mistakes in the classifier to be corrected using global context). We show that inference can be carried out with polynomial complexity in the number of people, and describe an efficient algorithm for this. The method is evaluated on a new dataset comprising 300 video clips acquired from 23 different TV shows and on the benchmark UT-Interaction dataset.

181 citations


Proceedings ArticleDOI
01 Jan 2012
TL;DR: This work shows that issuing multiple queries significantly improves recall and enables the system to find quite challenging occurrences of the queried object, and is evaluated quantitatively on the standard Oxford Buildings benchmark dataset where it achieves very high retrieval performance.
Abstract: The aim of large scale specific-object image retrieval systems is to instantaneously find images that contain the query object in the image database. Current systems, for example Google Goggles, concentrate on querying using a single view of an object, e.g. a photo a user takes with his mobile phone, in order to answer the question “what is this?”. Here we consider the somewhat converse problem of finding all images of an object given that the user knows what he is looking for; so the input modality is text, not an image. This problem is useful in a number of settings, for example media production teams are interested in searching internal databases for images or video footage to accompany news reports and newspaper articles. Given a textual query (e.g. “coca cola bottle”), our approach is to first obtain multiple images of the queried object using textual Google image search. These images are then used to visually query the target database to discover images containing the object of interest. We compare a number of different methods for combining the multiple query images, including discriminative learning. We show that issuing multiple queries significantly improves recall and enables the system to find quite challenging occurrences of the queried object. The system is evaluated quantitatively on the standard Oxford Buildings benchmark dataset where it achieves very high retrieval performance, and also qualitatively on the TrecVid 2011 known-item search dataset.

123 citations


Book ChapterDOI
07 Oct 2012
TL;DR: TriCoS is introduced, a new co-segmentation algorithm that looks at all training images jointly and automatically segments out the most class-discriminative foregrounds for each image to improve classification performance on weakly annotated datasets.
Abstract: The aim of this paper is to leverage foreground segmentation to improve classification performance on weakly annotated datasets – those with no additional annotation other than class labels. We introduce TriCoS, a new co-segmentation algorithm that looks at all training images jointly and automatically segments out the most class-discriminative foregrounds for each image. Ultimately, those foreground segmentations are used to train a classification system. TriCoS solves the co-segmentation problem by minimizing losses at three different levels: the category level for foreground/background consistency across images belonging to the same category, the image level for spatial continuity within each image, and the dataset level for discrimination between classes. In an extensive set of experiments, we evaluate the algorithm on three benchmark datasets: the UCSD-Caltech Birds-200-2010, the Stanford Dogs, and the Oxford Flowers 102. With the help of a modern image classifier, we show superior performance compared to previously published classification methods and other co-segmentation methods.

105 citations


Proceedings ArticleDOI
16 Jun 2012
TL;DR: This work gives a method to compute sparse features for arbitrary kernels, re-deriving as a special case a popular map for the intersection kernel and extending it to arbitrary additive kernels, and shows that bundle optimisation methods can handle efficiently these sparse features in learning.
Abstract: Efficient learning with non-linear kernels is often based on extracting features from the data that “linearise” the kernel. While most constructions aim at obtaining low-dimensional and dense features, in this work we explore high-dimensional and sparse ones. We give a method to compute sparse features for arbitrary kernels, re-deriving as a special case a popular map for the intersection kernel and extending it to arbitrary additive kernels. We show that bundle optimisation methods can handle efficiently these sparse features in learning. As an application, we show that product quantisation can be interpreted as a sparse feature encoding, and use this to significantly accelerate learning with this technique. We demonstrate these ideas on image classification with Fisher kernels and object detection with deformable part models on the challenging PASCAL VOC data, obtaining five to ten-fold speed-ups as well as reducing memory use by an order of magnitude.

99 citations
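
To make the sparse-feature idea concrete, here is a rough sketch (under a uniform quantisation assumption, not the paper's general construction) of the unary encoding that linearises the intersection kernel, the special case mentioned above: each scalar feature expands into a high-dimensional but sparse vector whose dot products approximate min(x, y).

import numpy as np
from scipy.sparse import csr_matrix

def unary_map(X, L=32, x_max=1.0):
    # Quantise each value into L levels of width delta and set the first
    # floor(x / delta) components of its block to sqrt(delta); dot products
    # of two such encodings sum delta once per shared level, i.e. they
    # approximate min(x, y) up to quantisation error.
    n, d = X.shape
    delta = x_max / L
    levels = np.minimum((X / delta).astype(int), L)
    rows, cols, vals = [], [], []
    for i in range(n):
        for j in range(d):
            for k in range(levels[i, j]):
                rows.append(i)
                cols.append(j * L + k)
                vals.append(np.sqrt(delta))
    return csr_matrix((vals, (rows, cols)), shape=(n, d * L))

# Sanity check on scalars: dot product vs the intersection kernel
phi_x, phi_y = unary_map(np.array([[0.7]])), unary_map(np.array([[0.4]]))
print((phi_x @ phi_y.T).toarray()[0, 0], min(0.7, 0.4))   # 0.375 vs 0.4, error < delta

The paper's construction is more general (arbitrary additive kernels) and more careful about the quantisation; this only illustrates the sparsity structure that the bundle optimisation exploits.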


Book ChapterDOI
07 Oct 2012
TL;DR: The objective of this work is to learn descriptors suitable for the sparse feature detectors used in viewpoint invariant matching, and it is shown that learning the pooling regions for the descriptor can be formulated as a convex optimisation problem selecting the regions using sparsity.
Abstract: The objective of this work is to learn descriptors suitable for the sparse feature detectors used in viewpoint invariant matching. We make a number of novel contributions towards this goal: first, it is shown that learning the pooling regions for the descriptor can be formulated as a convex optimisation problem selecting the regions using sparsity; second, it is shown that dimensionality reduction can also be formulated as a convex optimisation problem, using the nuclear norm to reduce dimensionality. Both of these problems use large margin discriminative learning methods. The third contribution is a new method of obtaining the positive and negative training data in a weakly supervised manner. And, finally, we employ a state-of-the-art stochastic optimizer that is efficient and well matched to the non-smooth cost functions proposed here. It is demonstrated that the new learning methods improve over the state of the art in descriptor learning for large scale matching, Brown et al. [2], and large scale object retrieval, Philbin et al. [10].

Book ChapterDOI
05 Nov 2012
TL;DR: This paper compares state-of-the-art encoding methods and introduces a novel cascade retrieval architecture, showing that new visual concepts can be learnt on-the-fly, given a text description, so that images of that category can be retrieved from the dataset in realtime.
Abstract: This paper addresses the problem of object category retrieval in large unannotated image datasets. Our aim is to enable both fast learning of an object category model, and fast retrieval over the dataset. With these elements we show that new visual concepts can be learnt on-the-fly, given a text description, and so images of that category can then be retrieved from the dataset in realtime. To this end we compare state of the art encoding methods and introduce a novel cascade retrieval architecture, with a focus on achieving the best trade-off between three important performance measures for a realtime system of this kind, namely: (i) class accuracy, (ii) memory footprint, and (iii) speed. We show that an on-the-fly system is possible and compare its performance (using noisy training images) to that of using carefully curated images. For this evaluation we use the VOC 2007 dataset together with 100k images from ImageNet to act as distractors.

Book ChapterDOI
07 Oct 2012
TL;DR: This paper proposes evaluator algorithms that predict if a vision algorithm has succeeded, and illustrates this idea for the case of Human Pose Estimation with four recently developed HPE algorithms.
Abstract: Most current vision algorithms deliver their output 'as is', without indicating whether it is correct or not. In this paper we propose evaluator algorithms that predict if a vision algorithm has succeeded. We illustrate this idea for the case of Human Pose Estimation (HPE). We describe the stages required to learn and test an evaluator, including the use of an annotated ground truth dataset for training and testing the evaluator (and we provide a new dataset for the HPE case), and the development of auxiliary features that have not been used by the (HPE) algorithm, but can be learnt by the evaluator to predict if the output is correct or not. Then an evaluator is built for each of four recently developed HPE algorithms using their publicly available implementations: Eichner and Ferrari [5], Sapp et al. [16], Andriluka et al. [2] and Yang and Ramanan [22]. We demonstrate that in each case the evaluator is able to predict if the algorithm has correctly estimated the pose or not.
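
The evaluator itself is a binary classifier over auxiliary features. A hypothetical sketch, with random stand-ins for the auxiliary features and correctness labels, and logistic regression standing in for whatever classifier the authors train:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical auxiliary features per image (e.g. detector confidences,
# limb-colour consistency, foreground overlap) and a 0/1 label saying
# whether the HPE output met the ground-truth pose criterion.
aux_features = np.random.rand(500, 12)
is_correct = np.random.randint(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(aux_features, is_correct,
                                          test_size=0.3, random_state=0)
evaluator = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# At test time the evaluator flags which pose estimates can be trusted
trusted = evaluator.predict_proba(X_te)[:, 1] > 0.5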

Proceedings ArticleDOI
01 Jan 2012
TL;DR: This work addresses content based image retrieval (CBIR) with a focus on retrieving subwindows of images which are similar to a given query image, i.e. the goal is detection rather than image level classification.
Abstract: Content based image retrieval (CBIR), the problem of searching digital images in large databases according to their visual content, is a well established research area in computer vision. In this work we are particularly interested in retrieving subwindows of images which are similar to the given query image, i.e. the goal is detection rather than image level classification. The notion of similarity is defined as being the same object class but also having similar viewpoint (e.g. frontal, left-facing, rear etc.). A query image can be a part of an object (e.g. head of a side facing horse), a complete object (e.g. frontal car image), or a composition of objects (visual phrases, e.g. person riding a horse). For instance, given a query of a horse facing left, the aim is to retrieve any left facing horse (intra-class variation) which might be walking or running with different feet formations (exemplar deformation).

Proceedings ArticleDOI
23 May 2012
TL;DR: A method of visual search for finding people in large video datasets that can be specified at run time by a text query, and a discriminative classifier for that person is then learnt on-the-fly using images downloaded from Google Image search.
Abstract: We describe a method of visual search for finding people in large video datasets. The novelty is that the person of interest can be specified at run time by a text query, and a discriminative classifier for that person is then learnt on-the-fly using images downloaded from Google Image search. The performance of the method is evaluated on a ground truth dataset of episodes of Scrubs, and results are also shown for retrieval on the TRECVid 2011 IACC.1.B dataset of over 8k videos. The entire process from specifying the query to receiving the ranked results takes only a matter of seconds.
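
Schematically (this is not the authors' pipeline; feature extraction and the descriptor dimensions are made up), the on-the-fly step treats features of the images returned for the text query as positives, a fixed generic pool as negatives, learns a linear classifier at query time, and ranks the video dataset by its score:

import numpy as np
from sklearn.svm import LinearSVC

def rank_on_the_fly(query_feats, negative_pool, dataset_feats):
    X = np.vstack([query_feats, negative_pool])
    y = np.concatenate([np.ones(len(query_feats)), np.zeros(len(negative_pool))])
    clf = LinearSVC(C=1.0).fit(X, y)          # learnt in seconds at query time
    return np.argsort(-clf.decision_function(dataset_feats))

# Hypothetical sizes: 50 Google-Image positives, 5,000 fixed negatives,
# 100,000 face tracks from the video corpus, all as 512-D descriptors
ranking = rank_on_the_fly(np.random.rand(50, 512),
                          np.random.rand(5000, 512),
                          np.random.rand(100000, 512))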

01 Sep 2012
TL;DR: The AXES project participated in the interactive instance search (INS), known-item search (KIS), and multimedia event detection (MED) tasks for TRECVid 2012; this paper describes the systems used and the results and findings of the experiments.
Abstract: The AXES project participated in the interactive instance search task (INS), the known-item search task (KIS), and the multimedia event detection task (MED) for TRECVid 2012. As in our TRECVid 2011 system, we used nearly identical search systems and user interfaces for both INS and KIS. Our interactive INS and KIS systems focused this year on using classifiers trained at query time with positive examples collected from external search engines. Participants in our KIS experiments were media professionals from the BBC; our INS experiments were carried out by students and researchers at Dublin City University. We performed comparatively well in both experiments. Our best KIS run found 13 of the 25 topics, and our best INS runs outperformed all other submitted runs in terms of P@100. For MED, the system presented was based on a minimal number of low-level descriptors, which we chose to be as large as computationally feasible. These descriptors are aggregated to produce high-dimensional video-level signatures, which are used to train a set of linear classifiers. Our MED system achieved the second-best score of all submitted runs in the main track, and best score in the ad-hoc track, suggesting that a simple system based on state-of-the-art low-level descriptors can give relatively high performance. This paper describes in detail our KIS, INS, and MED systems and the results and findings of our experiments.

Proceedings ArticleDOI
01 Jan 2012
TL;DR: A fully automatic arm and hand tracker that detects joint positions over continuous sign language video sequences of more than an hour in length and achieves superior joint localisation results to those obtained using the method of Buehler et al. (IJCV 2011).
Abstract: We present a fully automatic arm and hand tracker that detects joint positions over continuous sign language video sequences of more than an hour in length. Our framework replicates the state-of-the-art long term tracker by Buehler et al. (IJCV 2011), but does not require the manual annotation and, after automatic initialisation, performs tracking in real-time. We cast the problem as a generic frame-by-frame random forest regressor without a strong spatial model. Our contributions are (i) a co-segmentation algorithm that automatically separates the signer from any signed TV broadcast using a generative layered model; (ii) a method of predicting joint positions given only the segmentation and a colour model using a random forest regressor; and (iii) demonstrating that the random forest can be trained from an existing semi-automatic, but computationally expensive, tracker. The method is applied to signing footage with changing background, challenging imaging conditions, and for different signers. We achieve superior joint localisation results to those obtained using the method of Buehler et al.

Proceedings ArticleDOI
05 Jun 2012
TL;DR: A method for real time video retrieval where the task is to match the 2D human pose of a query; real time performance is achieved by approximate nearest neighbour search with a random forest of K-D trees, and it is shown that pose retrieval can proceed using a low dimensional representation.
Abstract: We describe a method for real time video retrieval where the task is to match the 2D human pose of a query. A user can form a query by (i) interactively controlling a stickman on a web based GUI, (ii) uploading an image of the desired pose, or (iii) using the Kinect and acting out the query himself. The method is scalable and is applied to a dataset of 18 films totaling more than three million frames. The real time performance is achieved by searching for approximate nearest neighbors to the query using a random forest of K-D trees. Apart from the query modalities, we introduce two other areas of novelty. First, we show that pose retrieval can proceed using a low dimensional representation. Second, we show that the precision of the results can be improved substantially by combining the outputs of independent human pose estimation algorithms. The performance of the system is assessed quantitatively over a range of pose queries.
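
The real-time matching rests on approximate nearest-neighbour search with a forest of randomised K-D trees; a small sketch using OpenCV's FLANN bindings (not the authors' code, with made-up descriptor sizes):

import numpy as np
import cv2

FLANN_INDEX_KDTREE = 1
index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=4)   # randomised K-D tree forest
search_params = dict(checks=64)                              # speed / accuracy trade-off
matcher = cv2.FlannBasedMatcher(index_params, search_params)

# Hypothetical low-dimensional pose descriptors: a database of frames and one query
database = np.random.rand(300000, 20).astype(np.float32)     # subsample for illustration
query = np.random.rand(1, 20).astype(np.float32)

matches = matcher.knnMatch(query, database, k=10)            # 10 approximate nearest frames
nearest_frames = [m.trainIdx for m in matches[0]]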

Book ChapterDOI
07 Oct 2012
TL;DR: A new vanishing point estimation algorithm based on recently introduced techniques for the continuous-discrete optimisation of energies arising from model selection priors is proposed, obtaining state-of-the-art results.
Abstract: We introduce the self-similar sketch, a new method for the extraction of intermediate image features that combines three principles: detection of self-similarity structures, nonaccidental alignment, and instance-specific modelling. The method searches for self-similar image structures that form nonaccidental patterns, for example collinear arrangements. We demonstrate a simple implementation of this idea where self-similar structures are found by looking for SIFT descriptors that map to the same visual words in image-specific vocabularies. This results in a visual word map which is searched for elongated connected components. Finally, segments are fitted to these connected components, extracting linear image structures beyond the ones that can be captured by conventional edge detectors, as the latter implicitly assume a specific appearance for the edges (steps). The resulting collection of segments constitutes a "sketch" of the image. This is applied to the task of estimating vanishing points, horizon, and zenith in standard benchmark data, obtaining state-of-the-art results. We also propose a new vanishing point estimation algorithm based on recently introduced techniques for the continuous-discrete optimisation of energies arising from model selection priors.

Proceedings ArticleDOI
05 Jun 2012
TL;DR: Using two complementary visual retrieval methods improves both retrieval and precision performance and it is shown that Google image search can be used to query expand the name sub-set, and thereby correctly determine the full name of the sculpture.
Abstract: We describe a retrieval based method for automatically determining the title and sculptor of an imaged sculpture. This is a useful problem to solve, but also quite challenging given the variety in both form and material that sculptures can take, and the similarity in both appearance and names that can occur. Our approach is to first visually match the sculpture and then to name it by harnessing the meta-data provided by Flickr users. To this end we make the following three contributions: (i) we show that using two complementary visual retrieval methods (one based on visual words, the other on boundaries) improves both retrieval and precision performance; (ii) we show that a simple voting scheme on the tf-idf weighted meta-data can correctly hypothesize a subset of the sculpture name (provided that the meta-data has first been suitably cleaned up and normalized); and (iii) we show that Google image search can be used to query expand the name sub-set, and thereby correctly determine the full name of the sculpture. The method is demonstrated on over 500 sculptors covering more than 2000 sculptures. We also quantitatively evaluate the system and demonstrate correct identification of the sculpture on over 60% of the queries.
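
Contribution (ii) can be sketched roughly as follows (hypothetical meta-data strings, assuming the clean-up and normalisation step has already been done): each visually matched image votes for the terms in its Flickr text with their tf-idf weights, and the top-weighted terms form the hypothesised name sub-set.

from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical (already cleaned) Flickr titles/tags of the top visual matches
matched_texts = [
    "cloud gate millennium park chicago anish kapoor",
    "the bean cloud gate chicago",
    "anish kapoor cloud gate sculpture",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(matched_texts)
vocab = vectorizer.get_feature_names_out()

votes = Counter()
for row in tfidf:                          # each matched image casts weighted votes
    for idx, weight in zip(row.indices, row.data):
        votes[vocab[idx]] += weight

print(votes.most_common(5))                # top terms form the candidate name sub-set

The name sub-set would then be expanded into a full title via Google image search, as in contribution (iii).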

Proceedings ArticleDOI
02 May 2012
TL;DR: This work takes the novel approach of using temporal features evaluated over the whole of the mitotic phases rather than over single frames, thereby capturing the distinctive behaviour over the phases.
Abstract: With the widespread use of time-lapse data to understand cellular function, there is a need for tools which facilitate high-throughput analysis of data. We present a system for automated segmentation and mitotic phase labelling based on a wide margin discriminative Semi-Markov Model. This work takes the novel approach of using temporal features evaluated over the whole of the mitotic phases rather than over single frames, thereby capturing the distinctive behaviour over the phases. This approach extends and substantially improves on our previous approach of using dynamic time warping to align temporal feature signals to a reference.

Book ChapterDOI
07 Oct 2012
TL;DR: An algorithm for structured output ranking that can be trained in a time linear in the number of samples under a mild assumption common to many computer vision problems: that the loss function can be discretized into a small number of values is proposed.
Abstract: In computer vision efficient multi-class classification is becoming a key problem as the field develops and the number of object classes to be identified increases. Often objects might have some sort of structure such as a taxonomy in which the mis-classification score for object classes close by, using tree distance within the taxonomy, should be less than for those far apart. This is an example of multi-class classification in which the loss function has a special structure. Another example in vision is for the ubiquitous pictorial structure or parts based model. In this case we would like the mis-classification score to be proportional to the number of parts misclassified. It transpires both of these are examples of structured output ranking problems. However, so far no efficient large scale algorithm for this problem has been demonstrated. In this work we propose an algorithm for structured output ranking that can be trained in a time linear in the number of samples under a mild assumption common to many computer vision problems: that the loss function can be discretized into a small number of values. We show the feasibility of structured ranking on these two core computer vision problems and demonstrate a consistent and substantial improvement over competing techniques. Aside from this, we also achieve state-of-the art results for the PASCAL VOC human layout problem.

Proceedings ArticleDOI
01 Jan 2012
TL;DR: A latent deformable template model with a locally affine deformation field is proposed, which allows for more general and more natural deformations of the template while not over-fitting the data; and a novel inference method is provided for this kind of problem.
Abstract: Methods for human detection and localization typically use histograms of gradients (HOG) and work well for aligned data with low variance. For HOG-based methods, although higher resolution templates capture more details, their use does not lead to better performance, because even a small variance in the data can cause the discriminative edges to fall into different neighbouring cells. To overcome these problems, Felzenszwalb et al. proposed a star-graph part based deformable model with a fixed number of rigid parts, which could capture these variations in the data, leading to state-of-the-art results. Motivated by this work, we propose a latent deformable template model with a locally affine deformation field, which allows for more general and more natural deformations of the template while not over-fitting the data; and we also provide a novel inference method for this kind of problem. This deformation model gives us a way to measure the distances between training samples, and we show how this can be used to cluster the problem into several modes, corresponding to different types of objects, viewpoints or poses. Our method leads to a significant improvement over the state-of-the-art with small computational overhead.

Book ChapterDOI
01 Oct 2012
TL;DR: A scalable, real-time, visual search engine for 3-D medical images, where a user is able to select a query Region Of Interest (ROI) and automatically detect the corresponding regions within all returned images.
Abstract: The objective of this work is a scalable, real-time, visual search engine for 3-D medical images, where a user is able to select a query Region Of Interest (ROI) and automatically detect the corresponding regions within all returned images. We make three contributions: (i) we show that with appropriate off-line processing, images can be retrieved and ROIs registered in real time; (ii) we propose and evaluate a number of scalable exemplar-based image registration schemes; (iii) we propose a discriminative method for learning to rank the returned images based on the content of the ROI. The retrieval system is demonstrated on MRI data from the ADNI dataset, and it is shown that the learnt ranking function outperforms the baseline.

Journal ArticleDOI
TL;DR: Everything he did: research, experimentation, software, paper writing, talks, was of the highest standard and a testament to his intellectual stamina, and he was kind and demonstrated a gentle, dry wit that made time spent with Mark both stimulating and enjoyable.
Abstract: MARK EVERINGHAM was a brilliant colleague. You may have been aware of him at conferences where he asked penetrating questions that could crystallize a key aspect of a paper. In conversation with him you might have been stunned by a new connection he made between areas of research or to a crucial related work. These questions and observations were a reflection of his very broad knowledge and deep understanding of computer vision and machine learning. To us, they demonstrated his intellect and insight, but to him they were just a way of being helpful, a way of ensuring that the field made progress. Mark was incredibly generous with his time. Those of us that worked with him are aware of how much he contributed behind the scenes, without expecting any recognition. Nowhere is this more apparent than in the organization of the PASCAL Visual Object Classes (VOC) challenge to which he devoted colossal amounts of time and effort. There are also the more visible contributions to the community in area chair duties at both CVPR and ECCV, as a program cochair for BMVC, and as a member of the TPAMI editorial board. Everything he did: research, experimentation, software, paper writing, talks, was of the highest standard and a testament to his intellectual stamina. He was kind and demonstrated a gentle, dry wit that made time spent with Mark both stimulating and enjoyable. Mark was born in Bristol in 1973, winning a scholarship to Clifton College, and completing his A levels at Filton College in 1991. Directly after school he worked on a research project for the Bristol Eye Hospital, developing software for remote electrodiagnosis. He continued his involvement with the project after heading to the University of Manchester to study computer science, winning prizes for top achievement every year, and was duly awarded the BSc with 1st class honors and the Williams-Kilburn medal for exceptional achievement in 1995. Returning to Bristol, he completed work on the electrodiagnosis project, leading to his first publication, in the journal Electroencephalography and Clinical Neurophysiology in 1996. In 1997 he began his doctoral studies at the University of Bristol, supervised by Barry Thomas and Tom Troscianko, on mobile augmented reality aids for people with severe visual impairments. By presenting an enhanced image to the wearer’s visual system, users with low vision would be freed from the need for external assistance in many tasks. The approach Mark took was, as usual, based on a deep rethinking of the problem. Rather than attempting to enhance the image by emphasizing edges, he proposed to identify the semantic content of an image such that images could be enhanced in a content-driven manner, and to enhance regions rather than edges. This work led him to look at region segmentation algorithms, and there he discovered the difficulties of evaluating computer vision algorithms, leading to some of the most significant papers from his PhD. In particular, “Evaluating Image Segmentation Algorithms Using the Pareto Front,” presented at ECCV 2002, showed the importance of the choice of evaluation metric in a compelling way, and is notable for its inclusion of the “embarrassingly simple” baseline method of dividing the image into blocks, which sometimes appears on the Pareto front. 
In presentations of this work, Mark would draw out the humor in this fact, but also use it as a point of reference to illustrate the behavior of metrics, to give insight into the criteria, and ultimately to convince you that you had learned something. Graduating with his PhD in 2002, Mark moved to Andrew Zisserman’s group at Oxford University’s Department of Engineering Science, where he worked on three projects which explored the level of supervision required for visual classification and detection tasks. The first aimed to detect and identify actors in relatively low resolution video footage, such as TV material from the 1970s. It was demonstrated on the situation comedy Fawlty Towers. The method involved quite strong supervision where a 3D head and face model were built for each character (from images). These 3D models were then used to render images to train a discriminative tree-structured classifier which was then used as a sliding window detector. This person-specific