Showing papers by "Andrew Zisserman published in 2006"


Book ChapterDOI
07 May 2006
TL;DR: The classification performance under changes in the visual vocabulary and the number of latent topics learnt is investigated, and a novel vocabulary using colour SIFT descriptors is developed; object discovery is performed with probabilistic Latent Semantic Analysis (pLSA).
Abstract: Given a set of images of scenes containing multiple object categories (e.g. grass, roads, buildings) our objective is to discover these objects in each image in an unsupervised manner, and to use this object distribution to perform scene classification. We achieve this discovery using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature, here applied to a bag of visual words representation for each image. The scene classification on the object distribution is carried out by a k-nearest neighbour classifier. We investigate the classification performance under changes in the visual vocabulary and number of latent topics learnt, and develop a novel vocabulary using colour SIFT descriptors. Classification performance is compared to the supervised approaches of Vogel & Schiele [19] and Oliva & Torralba [11], and the semi-supervised approach of Fei-Fei & Perona [3] using their own datasets and testing protocols. In all cases the combination of (unsupervised) pLSA followed by (supervised) nearest neighbour classification achieves superior results. We show applications of this method to image retrieval with relevance feedback and to scene classification in videos.

846 citations
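
The pLSA fitting the abstract relies on reduces to a pair of closed-form EM updates over a word-document count matrix. A minimal numpy sketch, assuming count matrices small enough to hold the full responsibility tensor in memory (all names are illustrative, not from the paper):

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """EM for pLSA on a (documents x words) count matrix.

    Returns p_w_z (words x topics) and p_z_d (topics x documents).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_words, n_topics)); p_w_z /= p_w_z.sum(0)
    p_z_d = rng.random((n_topics, n_docs)); p_z_d /= p_z_d.sum(0)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | d, w)
        joint = np.einsum('wk,kd->dwk', p_w_z, p_z_d)  # (docs, words, topics)
        joint /= joint.sum(-1, keepdims=True) + 1e-12
        # M-step: reweight by counts n(d, w) and renormalise
        weighted = counts[:, :, None] * joint
        p_w_z = weighted.sum(0)                        # (words, topics)
        p_w_z /= p_w_z.sum(0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(1).T                      # (topics, docs)
        p_z_d /= p_z_d.sum(0, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```

The per-image topic proportions (columns of p_z_d) would then be the features fed to the k-nearest-neighbour scene classifier the abstract describes.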


Proceedings ArticleDOI
17 Jun 2006
TL;DR: It is demonstrated that by developing a visual vocabulary that explicitly represents the various aspects that distinguish one flower from another, it can overcome the ambiguities that exist between flower categories.
Abstract: We investigate to what extent ‘bag of visual words’ models can be used to distinguish categories which have significant visual similarity. To this end we develop and optimize a nearest neighbour classifier architecture, which is evaluated on a very challenging database of flower images. The flower categories are chosen to be indistinguishable on colour alone (for example), and have considerable variation in shape, scale, and viewpoint. We demonstrate that by developing a visual vocabulary that explicitly represents the various aspects (colour, shape, and texture) that distinguish one flower from another, we can overcome the ambiguities that exist between flower categories. The novelty lies in the vocabulary used for each aspect, and how these vocabularies are combined into a final classifier. The various stages of the classifier (vocabulary selection and combination) are each optimized on a validation set. Results are presented on a dataset of 1360 images consisting of 17 flower species. It is shown that excellent performance can be achieved, far surpassing standard baseline algorithms using (for example) colour cues alone.

834 citations
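
The combination step the abstract describes (per-aspect vocabularies merged into one classifier) can be sketched as a weighted sum of per-aspect histogram distances inside a nearest-neighbour rule. The chi-squared distance and the aspect names are assumptions here, and the weights stand in for the validation-set optimisation mentioned above:

```python
import numpy as np

def chi2(h1, h2, eps=1e-10):
    """Chi-squared distance between L1-normalised histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def combined_distance(query, exemplar, weights):
    """Weighted sum of per-aspect distances; `query` and `exemplar`
    map aspect name (e.g. 'colour', 'shape', 'texture') -> histogram."""
    return sum(w * chi2(query[a], exemplar[a]) for a, w in weights.items())

def classify(query, exemplars, labels, weights, k=1):
    """k-nearest-neighbour vote over the training exemplars."""
    d = np.array([combined_distance(query, e, weights) for e in exemplars])
    nearest = np.argsort(d)[:k]
    vals, counts = np.unique([labels[i] for i in nearest], return_counts=True)
    return vals[np.argmax(counts)]
```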


Proceedings ArticleDOI
17 Jun 2006
TL;DR: This work computes multiple segmentations of each image, then learns the object classes and chooses the correct segmentations, demonstrating that such an algorithm succeeds in automatically discovering many familiar objects in a variety of image datasets, including those from Caltech, MSRC and LabelMe.
Abstract: Given a large dataset of images, we seek to automatically determine the visually similar object and scene classes together with their image segmentation. To achieve this we combine two ideas: (i) that a set of segmented objects can be partitioned into visual object classes using topic discovery models from statistical text analysis; and (ii) that visual object classes can be used to assess the accuracy of a segmentation. To tie these ideas together we compute multiple segmentations of each image and then: (i) learn the object classes; and (ii) choose the correct segmentations. We demonstrate that such an algorithm succeeds in automatically discovering many familiar objects in a variety of image datasets, including those from Caltech, MSRC and LabelMe.

737 citations
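
One way to read the segmentation-selection idea (among multiple candidate segmentations, keep the segments the topic model explains well) is the score below. It is a simplified proxy, not the paper's exact criterion, and reuses a learnt p(word|topic) matrix such as the p_w_z from the pLSA sketch above:

```python
import numpy as np

def objectness_score(counts, p_w_z, eps=1e-12):
    """Per-word log-likelihood of a segment's visual words under its
    best single topic; segments well explained by one discovered
    topic are favoured as object-like.

    counts: (W,) visual-word histogram of one candidate segment
    p_w_z:  (W, K) word distributions of K discovered topics
    """
    ll = counts @ np.log(p_w_z + eps)      # (K,) log-likelihood per topic
    return ll.max() / max(counts.sum(), 1)
```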


Proceedings ArticleDOI
01 Jan 2006
TL;DR: It is demonstrated that high precision can be achieved by combining multiple sources of information, both visual and textual; a key novelty is the automatic generation of time-stamped character annotation by aligning subtitles and transcripts.
Abstract: We investigate the problem of automatically labelling appearances of characters in TV or film material. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying when characters are speaking; (iii) using complementary cues of face matching and clothing matching to propose common annotations for face tracks. Results are presented on episodes of the TV series “Buffy the Vampire Slayer”.

683 citations
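
Novelty (i), aligning timestamped subtitles with a speaker-named transcript, can be sketched as a word-level sequence alignment; difflib stands in for whatever alignment the paper actually uses, and the data layouts are assumptions:

```python
import difflib

def align_names_to_times(subtitles, transcript):
    """Attach speaker names (from a transcript) to timestamped subtitle
    lines by matching their text.

    subtitles:  list of (start_sec, end_sec, text), no speaker names
    transcript: list of (speaker, text), no timestamps
    Returns list of (start_sec, end_sec, speaker, text).
    """
    sub_words = [w for _, _, t in subtitles for w in t.lower().split()]
    tr_words, tr_speaker = [], []
    for speaker, text in transcript:
        for w in text.lower().split():
            tr_words.append(w)
            tr_speaker.append(speaker)

    # word-level alignment between the two streams
    matcher = difflib.SequenceMatcher(None, sub_words, tr_words, autojunk=False)
    word_to_speaker = {}
    for i0, j0, size in matcher.get_matching_blocks():
        for k in range(size):
            word_to_speaker[i0 + k] = tr_speaker[j0 + k]

    # label each subtitle line by the majority speaker of its words
    out, idx = [], 0
    for start, end, text in subtitles:
        n = len(text.lower().split())
        named = [word_to_speaker[idx + j] for j in range(n)
                 if (idx + j) in word_to_speaker]
        idx += n
        speaker = max(set(named), key=named.count) if named else None
        out.append((start, end, speaker, text))
    return out
```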


Book ChapterDOI
07 May 2006
TL;DR: The BFM detector is able to represent and detect object classes principally defined by their shape, rather than their appearance, and to achieve this with less supervision (such as the number of training images).
Abstract: The objective of this work is the detection of object classes, such as airplanes or horses. Instead of using a model based on salient image fragments, we show that object class detection is also possible using only the object's boundary. To this end, we develop a novel learning technique to extract class-discriminative boundary fragments. In addition to their shape, these “codebook” entries also determine the object's centroid (in the manner of Leibe et al. [19]). Boosting is used to select discriminative combinations of boundary fragments (weak detectors) to form a strong “Boundary-Fragment-Model” (BFM) detector. The generative aspect of the model is used to determine an approximate segmentation. We demonstrate the following results: (i) the BFM detector is able to represent and detect object classes principally defined by their shape, rather than their appearance; and (ii) in comparison with other published results on several object classes (airplanes, cars-rear, cows) the BFM detector is able to exceed previous performances, and to achieve this with less supervision (such as the number of training images).

376 citations
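
The detection side of the BFM can be sketched as Hough-style voting: each matched boundary fragment casts a weighted vote for the object centroid through its stored offset. A minimal sketch under assumed data layouts (fragment matching, boosting, and the segmentation step are omitted):

```python
import numpy as np

def vote_for_centroids(matches, image_shape, cell=10):
    """Accumulate centroid votes from matched boundary fragments.

    matches: list of (x, y, dx, dy, weight) where (x, y) is where a
             fragment matched and (dx, dy) its offset to the object
             centroid learnt during training.
    Returns the accumulator and the strongest centroid hypothesis.
    """
    h, w = image_shape
    acc = np.zeros((h // cell + 1, w // cell + 1))
    for x, y, dx, dy, weight in matches:
        cx, cy = x + dx, y + dy
        if 0 <= cx < w and 0 <= cy < h:
            acc[int(cy) // cell, int(cx) // cell] += weight
    iy, ix = np.unravel_index(np.argmax(acc), acc.shape)
    return acc, (ix * cell + cell // 2, iy * cell + cell // 2)
```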



BookDOI
01 Jan 2006
TL;DR: This book is the outcome of two workshops that brought together about 40 prominent vision and machine learning researchers interested in the fundamental and applicative aspects of object recognition, as well as representatives of industry; a main goal was to promote the creation of an international object recognition community.
Abstract: This book is the outcome of two workshops that brought together about 40 prominent vision and machine learning researchers interested in the fundamental and applicative aspects of object recognition, as well as representatives of industry. The main goals of these two workshops were (1) to promote the creation of an international object recognition community, with common datasets and evaluation procedures, (2) to map the state of the art and identify the main open problems and opportunities for synergistic research, and (3) to articulate the industrial and societal needs and opportunities for object recognition research worldwide. These goals are reflected in a relatively small number of papers that illustrate the breadth of today's object recognition research and the arsenal of techniques at its disposal, and discuss current achievements and outstanding challenges. Most of the chapters are descriptions of technical approaches, intended to capture the current state of the art. Some of the chapters are of a tutorial nature. They cover fundamental building blocks for object recognition techniques.

260 citations


Book ChapterDOI
TL;DR: An approach to object retrieval which searches for and localizes all the occurrences of an object in a video, given a query image of the object, and returns a ranked list of shots in the manner of Google.
Abstract: We describe an approach to object retrieval which searches for and localizes all the occurrences of an object in a video, given a query image of the object. The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject those that are unstable. Efficient retrieval is achieved by employing methods from statistical text retrieval, including inverted file systems, and text and document frequency weightings. This requires a visual analogy of a word which is provided here by vector quantizing the region descriptors. The final ranking also depends on the spatial layout of the regions. The result is that retrieval is immediate, returning a ranked list of shots in the manner of Google. We report results for object retrieval on the full length feature films 'Groundhog Day' and 'Casablanca'.

255 citations
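
The text-retrieval machinery the abstract names (vector quantisation into visual words, tf-idf weighting, cosine ranking) is straightforward to sketch; the inverted-file optimisation and the spatial re-ranking are omitted here, and plain k-means stands in for the quantiser:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words=1000, seed=0):
    """Vector-quantise local region descriptors into 'visual words'."""
    return KMeans(n_clusters=n_words, n_init=4, random_state=seed).fit(descriptors)

def tfidf_index(word_lists, n_words):
    """Build L2-normalised tf-idf vectors, one per keyframe/shot."""
    n_docs = len(word_lists)
    tf = np.zeros((n_docs, n_words))
    for i, words in enumerate(word_lists):
        for w in words:
            tf[i, w] += 1
    df = (tf > 0).sum(0)
    idf = np.log(n_docs / np.maximum(df, 1))
    vecs = tf * idf
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12), idf

def rank_shots(query_words, vecs, idf):
    """Rank shots by cosine similarity to the query's tf-idf vector."""
    q = np.zeros(vecs.shape[1])
    for w in query_words:
        q[w] += 1
    q *= idf
    q /= max(np.linalg.norm(q), 1e-12)
    return np.argsort(-(vecs @ q))
```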


Book ChapterDOI
TL;DR: Current datasets are lacking in several respects, and this paper discusses some of the lessons learned from existing efforts, as well as innovative ways to obtain very large and diverse annotated datasets.
Abstract: Appropriate datasets are required at all stages of object recognition research, including learning visual models of object and scene categories, detecting and localizing instances of these models in images, and evaluating the performance of recognition algorithms. Current datasets are lacking in several respects, and this paper discusses some of the lessons learned from existing efforts, as well as innovative ways to obtain very large and diverse annotated datasets. It also suggests a few criteria for gathering future datasets.

250 citations


Proceedings ArticleDOI
17 Jun 2006
TL;DR: A visual alphabet representation is introduced which can be learnt incrementally and explicitly shares boundary fragments and spatial configurations across object categories; it is also shown that category similarities can be predicted from the alphabet.
Abstract: We address the problem of multiclass object detection. Our aims are to enable models for new categories to benefit from the detectors built previously for other categories, and for the complexity of the multiclass system to grow sublinearly with the number of categories. To this end we introduce a visual alphabet representation which can be learnt incrementally, and explicitly shares boundary fragments (contours) and spatial configurations (relation to centroid) across object categories. We develop a learning algorithm with the following novel contributions: (i) AdaBoost is adapted to learn jointly, based on shape features; (ii) a new learning schedule enables incremental additions of new categories; and (iii) the algorithm learns to detect objects (instead of categorizing images). Furthermore, we show that category similarities can be predicted from the alphabet. We obtain excellent experimental results on a variety of complex categories over several visual aspects. We show that the sharing of shape features not only reduces the number of features required per category, but also often improves recognition performance, as compared to individual detectors which are trained on a per-class basis.

214 citations


Journal ArticleDOI
TL;DR: A method for automatically obtaining object representations suitable for retrieval from generic video shots that includes associating regions within a single shot to represent a deforming object and an affine factorization method that copes with motion degeneracy.
Abstract: We describe a method for automatically obtaining object representations suitable for retrieval from generic video shots. The object representation consists of an association of frame regions. These regions provide exemplars of the object's possible visual appearances. Two ideas are developed: (i) associating regions within a single shot to represent a deforming object; (ii) associating regions from the multiple visual aspects of a 3D object, thereby implicitly representing 3D structure. For the association we exploit temporal continuity (tracking) and wide baseline matching of affine covariant regions. In the implementation there are three areas of novelty: First, we describe a method to repair short gaps in tracks. Second, we show how to join tracks across occlusions (where many tracks terminate simultaneously). Third, we develop an affine factorization method that copes with motion degeneracy. We obtain tracks that last throughout the shot, without requiring a 3D reconstruction. The factorization method is used to associate tracks into object-level groups, with common motion. The outcome is that separate parts of an object that are not simultaneously visible (such as the front and back of a car, or the front and side of a face) are associated together. In turn this enables object-level matching and recognition throughout a video. We illustrate the method on the feature film "Groundhog Day." Examples are given for the retrieval of deforming objects (heads, walking people) and rigid objects (vehicles, locations).
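
The factorization-based grouping rests on the classical rank constraint: under affine imaging, the centred measurement matrix of tracks that share one rigid motion has rank at most three. A Tomasi-Kanade-style residual test (the paper's degeneracy-aware variant is not reproduced here) might look like:

```python
import numpy as np

def affine_reprojection_residual(tracks):
    """RMS residual of a rank-3 (affine) factorisation of a track set.

    tracks: array (2*F, P) stacking x and y coordinates of P points
            over F frames, with no missing entries.
    Low residuals indicate a common rigid motion, so tracks can be
    grouped into object-level sets.
    """
    centred = tracks - tracks.mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    approx = (u[:, :3] * s[:3]) @ vt[:3]
    return np.sqrt(np.mean((centred - approx) ** 2))
```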


Proceedings ArticleDOI
10 Apr 2006
TL;DR: In this paper, a regression approach aiming to directly minimize errors in the predicted eye positions, a simple Bayesian model of eye and non-eye appearance, and a discriminative eye detector trained using AdaBoost are investigated.
Abstract: We address the task of accurately localizing the eyes in face images extracted by a face detector, an important problem to be solved because of the negative effect of poor localization on face recognition accuracy. We investigate three approaches to the task: a regression approach aiming to directly minimize errors in the predicted eye positions, a simple Bayesian model of eye and non-eye appearance, and a discriminative eye detector trained using AdaBoost. By using identical training and test data for each method we are able to perform an unbiased comparison. We show that, perhaps surprisingly, the simple Bayesian approach performs best on databases including challenging images, and performance is comparable to more complex state-of-the-art methods.
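
A minimal sketch of the 'simple Bayesian' approach the abstract favours: fit eye and non-eye appearance models to training patches, then scan for the location with the highest likelihood ratio. The diagonal-Gaussian patch model and the exhaustive scan are assumptions of this sketch, not the paper's exact model:

```python
import numpy as np

def fit_gaussian(patches):
    """Fit a diagonal Gaussian to flattened training patches."""
    x = patches.reshape(len(patches), -1).astype(float)
    return x.mean(0), x.var(0) + 1e-6

def log_likelihood(patch, mean, var):
    x = patch.ravel().astype(float)
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

def localize_eye(image, eye_model, noneye_model, size=15):
    """Return the centre of the patch with the highest eye vs non-eye
    log likelihood ratio, scanning a greyscale face image."""
    best, best_xy = -np.inf, None
    h, w = image.shape
    for y in range(h - size):
        for x in range(w - size):
            p = image[y:y + size, x:x + size]
            score = (log_likelihood(p, *eye_model)
                     - log_likelihood(p, *noneye_model))
            if score > best:
                best, best_xy = score, (x + size // 2, y + size // 2)
    return best_xy
```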

Proceedings Article
04 Dec 2006
TL;DR: This paper develops a multi-frame image super-resolution approach from a Bayesian viewpoint by marginalizing over the unknown registration parameters relating the set of input low-resolution views, allowing for more realistic prior distributions and considerably reducing the dimension of the integral, thereby removing the main computational bottleneck of the earlier algorithm.
Abstract: This paper develops a multi-frame image super-resolution approach from a Bayesian viewpoint by marginalizing over the unknown registration parameters relating the set of input low-resolution views. In Tipping and Bishop's Bayesian image super-resolution approach [16], the marginalization was over the super-resolution image, necessitating the use of an unfavorable image prior. By integrating over the registration parameters rather than the high-resolution image, our method allows for more realistic prior distributions, and also reduces the dimension of the integral considerably, removing the main computational bottleneck of the other algorithm. In addition to the motion model used by Tipping and Bishop, illumination components are introduced into the generative model, allowing us to handle changes in lighting as well as motion. We show results on real and synthetic datasets to illustrate the efficacy of this approach.
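
The key modelling move, restated in the abstract's terms (the notation below is ours, not the paper's): Tipping and Bishop marginalise the high-resolution image, which ties the image prior to a tractable Gaussian form, whereas this paper marginalises the registrations:

```latex
% Tipping & Bishop: integrate out the super-resolution image x,
% which forces a Gaussian-friendly prior p(x):
p(\{\mathbf{y}_k\} \mid \boldsymbol{\theta})
  = \int p(\{\mathbf{y}_k\} \mid \mathbf{x}, \boldsymbol{\theta})\,
         p(\mathbf{x})\, d\mathbf{x}

% This paper: integrate out the registration (and illumination)
% parameters \theta instead, so p(x) can be realistic and the
% integral is over a much lower-dimensional space:
p(\{\mathbf{y}_k\} \mid \mathbf{x})
  = \int p(\{\mathbf{y}_k\} \mid \mathbf{x}, \boldsymbol{\theta})\,
         p(\boldsymbol{\theta})\, d\boldsymbol{\theta}
```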


Book ChapterDOI
13 Dec 2006
TL;DR: In this paper, an object class is represented by a set of bag-of-visual-words histograms, each corresponding to a training exemplar; classification is then achieved by k-nearest neighbour search over the exemplars.
Abstract: Histograms of visual words (or textons) have proved effective in tasks such as image classification and object class recognition. A common approach is to represent an object class by a set of histograms, each one corresponding to a training exemplar. Classification is then achieved by k-nearest neighbour search over the exemplars. In this paper we introduce two novelties on this approach: (i) we show that new compact single histogram models estimated optimally from the entire training set achieve an equal or superior classification accuracy. The benefit of the single histograms is that they are much more efficient both in terms of memory and computational resources; and (ii) we show that bag of visual words histograms can provide an accurate pixel-wise segmentation of an image into object class regions. In this manner the compact models of visual object classes give simultaneous segmentation and recognition of image regions. The approach is evaluated on the MSRC database [5] and it is shown that performance equals or is superior to previous publications on this database.
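
Novelty (i) can be sketched directly: collapse each class's exemplar histograms into one model and classify regions against it. Note the mean is the single histogram minimising the average KL divergence from the exemplars; whether this matches the paper's 'optimal' estimator is an assumption of this sketch:

```python
import numpy as np

def class_histogram(exemplar_hists):
    """Collapse a class's exemplar histograms into one compact model
    (the normalised mean minimises the average KL(h_i || m))."""
    m = np.asarray(exemplar_hists, dtype=float).mean(axis=0)
    return m / m.sum()

def kl(p, q, eps=1e-10):
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q))

def label_region(region_hist, class_models):
    """Assign a bag-of-words histogram to the nearest class model,
    enabling pixel-/region-wise segmentation by class."""
    return min(class_models, key=lambda c: kl(region_hist, class_models[c]))
```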


Book ChapterDOI
13 Dec 2006
TL;DR: It is shown how, for certain object classes, small regions around edges can be used to classify the edge into object or non-object, and performance of both algorithms (matching and segmentation) is considerably improved by the class-specific edge labelling.
Abstract: Recent research into recognizing object classes (such as humans, cows and hands) has made use of edge features to hypothesize and localize class instances. However, for the most part, these edge-based methods operate solely on the geometric shape of edges, treating them equally and ignoring the fact that for certain object classes, the appearance of the object on the “inside” of the edge may provide valuable recognition cues. We show how, for such object classes, small regions around edges can be used to classify the edge into object or non-object. This classifier may then be used to prune edges which are not relevant to the object class, and thereby improve the performance of subsequent processing. We demonstrate learning class specific edges for a number of object classes — oranges, bananas and bottles — under challenging scale and illumination variation. Because class-specific edge classification provides a low-level analysis of the image it may be integrated into any edge-based recognition strategy without significant change in the high-level algorithms. We illustrate its application to two algorithms: (i) chamfer matching for object detection, and (ii) modulating contrast terms in MRF based object-specific segmentation. We show that performance of both algorithms (matching and segmentation) is considerably improved by the class-specific edge labelling.
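
A sketch of the edge-pruning idea: train a local classifier on small patches sampled around edge pixels, then keep only edges whose neighbourhood looks like the object class. The random forest, patch size, and 0/1 labelling are placeholders, not the paper's choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_edge_classifier(patches, labels):
    """Object (1) vs non-object (0) classifier on flattened patches
    sampled around edge pixels."""
    x = patches.reshape(len(patches), -1)
    return RandomForestClassifier(n_estimators=100).fit(x, labels)

def prune_edges(edge_points, image, clf, half=4):
    """Keep only edge pixels whose surrounding patch is classified as
    object, so later edge-based stages see fewer distractors."""
    kept = []
    for y, x in edge_points:
        patch = image[y - half:y + half + 1, x - half:x + half + 1]
        if patch.shape != (2 * half + 1, 2 * half + 1):
            continue  # too close to the image border
        if clf.predict(patch.reshape(1, -1))[0] == 1:
            kept.append((y, x))
    return kept
```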

Proceedings ArticleDOI
01 Jan 2006
TL;DR: It is demonstrated that superior estimates are obtained by optimizing over both the registration and the image, and that the parameters of the edge-preserving prior can be learnt automatically from the data, rather than being set by trial and error.
Abstract: In multiple-image super-resolution, a high resolution image is estimated from a number of lower-resolution images. This involves computing the parameters of a generative imaging model (such as geometric and photometric registration, and blur) and obtaining a MAP estimate by minimizing a cost function including an appropriate prior. We consider the quite general geometric registration situation modelled by a plane projective transformation, and make two novel contributions: (i) in previous approaches the MAP estimate has been obtained by first computing and fixing the registration, and then computing the super-resolution image with this registration. We demonstrate that superior estimates are obtained by optimizing over both the registration and image; (ii) the parameters of the edge preserving prior are learnt automatically from the data, rather than being set by trial and error. We show examples on a number of real sequences including multiple stills, digital video, and DVDs of movies.
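
Contribution (i) amounts to minimising one MAP cost jointly over the registrations and the high-resolution image rather than fixing the registrations first. A sketch of such a cost, assuming a caller-supplied forward-imaging function downsample(x, reg) (warp + blur + decimate) and hand-set prior parameters where the paper learns them from data:

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta):
    """Edge-preserving (Huber) penalty."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def sr_cost(params, lows, hr_shape, downsample, n_reg, lam, delta):
    """Data term over all low-res frames plus a Huber prior on
    high-res gradients. `params` stacks one registration vector per
    frame (n_reg = 8 for a plane projective transform) followed by
    the flattened high-resolution image."""
    regs = params[:n_reg * len(lows)].reshape(len(lows), n_reg)
    x = params[n_reg * len(lows):].reshape(hr_shape)
    data = sum(np.sum((downsample(x, r) - y) ** 2) for r, y in zip(regs, lows))
    gx, gy = np.diff(x, axis=1), np.diff(x, axis=0)
    prior = np.sum(huber(gx, delta)) + np.sum(huber(gy, delta))
    return data + lam * prior

# Joint optimisation over registrations AND image, as the abstract
# advocates (x0 stacks initial registrations and an initial image):
# res = minimize(sr_cost, x0,
#                args=(lows, hr_shape, downsample, 8, lam, delta),
#                method='L-BFGS-B')
```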

Proceedings ArticleDOI
17 Jun 2006
TL;DR: A generic method for solving Markov random fields (MRF) by formulating the problem of MAP estimation as 0-1 quadratic programming (QP) and proposing a second order cone programming relaxation scheme which solves a closely related (convex) approximation.
Abstract: This paper presents a generic method for solving Markov random fields (MRF) by formulating the problem of MAP estimation as 0-1 quadratic programming (QP). Though in general solving MRFs is NP-hard, we propose a second order cone programming relaxation scheme which solves a closely related (convex) approximation. In terms of computational efficiency, our method significantly outperforms the semidefinite relaxations previously used whilst providing equally (or even more) accurate results. Unlike popular inference schemes such as Belief Propagation and Graph Cuts, convergence is guaranteed within a small number of iterations. Furthermore, we also present a method for greatly reducing the runtime and increasing the accuracy of our approach for a large and useful class of MRFs. We compare our approach with the state-of-the-art methods for subgraph matching and object recognition and demonstrate significant improvements.
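
The 0-1 QP being relaxed can be written with one binary indicator per (node, label) pair; this is a standard encoding and assumed, not taken from the paper:

```latex
% x_{i;a} = 1 iff node i takes label a; \theta are the unary and
% pairwise MRF potentials:
\min_{\mathbf{x}}\;
    \sum_{i,a} \theta_{i;a}\, x_{i;a}
  + \sum_{(i,j)} \sum_{a,b} \theta_{ij;ab}\, x_{i;a}\, x_{j;b}
\quad \text{s.t.}\quad
    \sum_a x_{i;a} = 1 \;\;\forall i,
    \qquad x_{i;a} \in \{0,1\}
```

The second order cone programming relaxation then replaces the binary constraints with convex cone constraints, yielding the tractable approximation the abstract describes.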


Proceedings ArticleDOI
01 Jan 2006
TL;DR: The objective is to detect object instances in an image, as opposed to the easier task of image categorization, and two algorithms for learning and detecting object categories which both benefit from combining features are investigated.
Abstract: We present methods for recognizing object categories which are able to combine various feature types (e.g. image patches and edge boundaries). Our objective is to detect object instances in an image, as opposed to the easier task of image categorization. To this end, we investigate two algorithms for learning and detecting object categories which both benefit from combining features. The first uses a naive combination method for detectors each employing only one type of feature, the second learns the best features (from a pool of patches and boundaries). In experiments we achieve comparable results to the state of the art over a number of datasets, and for some categories we even achieve the lowest errors that have been reported so far. The results also show that certain object categories prefer certain feature types (e.g. boundary fragments for airplanes).

Journal Article
TL;DR: A parts and structure model for object category recognition that can be learnt efficiently and in a weakly-supervised manner, bypassing the need for feature detectors, to give the globally optimal match within a query image.
Abstract: We present a parts and structure model for object category recognition that can be learnt efficiently and in a weakly-supervised manner: the model is learnt from example images containing category instances, without requiring segmentation from background clutter. The model is a sparse representation of the object, and consists of a star topology configuration of parts modeling the output of a variety of feature detectors. The optimal choice of feature types (whose repertoire includes interest points, curves and regions) is made automatically. In recognition, the model may be applied efficiently in a complete manner, bypassing the need for feature detectors, to give the globally optimal match within a query image. The approach is demonstrated on a wide variety of categories, and delivers both successful classification and localization of the object within the image.
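
The efficiency of the star topology is easy to see in code: given the landmark (root) location, every other part optimises independently, so the globally optimal match is a single pass over root locations. A naive sketch over precomputed cost tables (the data layout is an assumption):

```python
import numpy as np

def star_model_match(appearance, pairwise):
    """Globally optimal match for a star-topology parts model.

    appearance: dict part -> (L,) array of appearance costs over L
                candidate locations; must include a 'root' entry
    pairwise:   dict part -> (L_root, L) array of configuration costs
                of each part location given each root location
    Returns (best_cost, best_root_index).
    """
    root_cost = appearance['root'].astype(float).copy()
    for part, app in appearance.items():
        if part == 'root':
            continue
        # each non-root part optimises independently given the root
        root_cost += np.min(app[None, :] + pairwise[part], axis=1)
    best = int(np.argmin(root_cost))
    return root_cost[best], best
```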

Book ChapterDOI
01 Jan 2006
TL;DR: This work considers the content-based multimedia retrieval setup: the aim is to retrieve, and rank by confidence, film shots based on the presence of specific actors, matching detected faces against a database of known faces with associated identities.
Abstract: The problem of automatic face recognition (AFR) concerns matching a detected (roughly localized) face against a database of known faces with associated identities. This task, although very intuitive to humans and despite the vast amounts of research behind it, still poses a significant challenge to computer-based methods. For reviews of the literature and commercial state-of-the-art see [21, 372] and [252, 253]. Much AFR research has concentrated on the user authentication paradigm (e.g. [10, 30, 183]). In contrast, we consider the content-based multimedia retrieval setup: our aim is to retrieve, and rank by confidence, film shots based on the presence of specific actors. A query to the system consists of the user choosing the person of interest in one or more keyframes.

01 Jan 2006
TL;DR: The Oxford team participated in the high-level feature extraction and interactive search tasks and developed a novel on the fly face classification system, which coupled a Google Images search with rapid Support Vector Machine (SVM) training and testing to return results containing a particular person within a few minutes.
Abstract: The Oxford team participated in the high-level feature extraction and interactive search tasks. A vision only approach was used for both tasks, with no use of the text or audio information. For the high-level feature extraction task, we used two different approaches, one using sparse and one using dense visual features to learn classifiers for all 39 required concepts, using the training data supplied by MediaMill [29] for the 2005 data. In addition, we also used a face specific classifier, with features computed for specific facial parts, to facilitate answering people-dependent queries such as “government leader”. We submitted 3 different runs for this task. OXVGG_A was the result of using the dense visual features only. OXVGG_OJ was the result of using the sparse visual features for all the concepts, except for “government leader”, “face” and “person”, where we prepended the results from the face classifier. OXVGG_AOJ was a run where we applied rank fusion to merge the outputs from the sparse and dense methods with weightings tuned to the training data, and also prepended the face results for “face”, “person” and “government leader”. In general, the sparse features tended to perform best on the more object based concepts, such as “US flag”, while the dense features performed slightly better on more scene based concepts, such as “military”. Overall, the fused run did the best with a Mean Average (inferred) Precision (MAP) of 0.093, the sparse run came second with a MAP of 0.080, followed by the dense run with a MAP of 0.053. For the interactive search task, we coupled the results generated during the high-level task with methods to facilitate efficient and productive interactive search. Our system allowed for several “expansion” methods based on the sparse and dense features, as well as a novel on the fly face classification system, which coupled a Google Images search with rapid Support Vector Machine (SVM) training and testing to return results containing a particular person within a few minutes. We submitted just one run, OXVGG_TVI, which performed well, winning two categories and coming above the median in 18 out of 24 queries.
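
A sketch of the per-concept pipeline: train an SVM for a concept, score every keyframe by its distance from the discriminating hyperplane, and rank shots by their best-scoring keyframe. LinearSVC and the feature layout are assumptions of this sketch:

```python
import numpy as np
from sklearn.svm import LinearSVC

def rank_shots_for_concept(train_x, train_y, keyframe_x, shot_of_keyframe):
    """Train a binary concept classifier, then rank shots by the
    maximum SVM decision value over their keyframes.

    train_x / train_y:  training features and 0/1 concept labels
    keyframe_x:         features of the test-set keyframes
    shot_of_keyframe:   shot id for each test keyframe
    """
    clf = LinearSVC(C=1.0).fit(train_x, train_y)
    scores = clf.decision_function(keyframe_x)
    shot_scores = {}
    for shot, score in zip(shot_of_keyframe, scores):
        shot_scores[shot] = max(score, shot_scores.get(shot, -np.inf))
    return sorted(shot_scores, key=shot_scores.get, reverse=True)
```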

Book ChapterDOI
TL;DR: This chapter presents a principled Bayesian method for detecting and segmenting instances of a particular object category within an image, providing a coherent methodology for combining top-down and bottom-up cues.
Abstract: In this chapter we present a principled Bayesian method for detecting and segmenting instances of a particular object category within an image, providing a coherent methodology for combining top-down and bottom-up cues. The work draws together two powerful formulations: pictorial structures (PS) and Markov random fields (MRFs), both of which have efficient algorithms for their solution. The resulting combination, which we call the object category specific MRF, suggests a solution to the problem that has long dogged MRFs, namely that they provide a poor prior for specific shapes. In contrast, our model provides a prior that is global across the image plane using the PS. We develop an efficient method, ObjCut, to obtain segmentations using this model. Novel aspects of this method include an efficient algorithm for sampling the PS model, and the observation that the expected log likelihood of the model can be increased by a single graph cut. Results are presented on two object categories, cows and horses. We compare our methods to the state of the art in object category specific image segmentation and demonstrate significant improvements.